Monday, January 7, 2008

Database Skills: Third Normal Form and Calculated Values

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

Update: There is a new essay on Business Logic that also goes into how calculated values affect your code paradigm.

Third Normal Form Defined

Third normal form is pretty simple. In plain terms it means that no column can depend on a non-key column. In technical terms we say that there are no "transitive dependencies."

The example below shows a table that is not in third normal form. The problem is that the letter grade depends on the numeric grade:

SUBJECT  |  YEAR | TEACHER |  STUDENT    | GRADE |  LETTER
---------+-------+---------+-------------+-------+-----------
HIST-101 |  2008 | RUSSELL |  NIRGALAI   | 80    |   B
HIST-101 |  2008 | RUSSELL |  JBOONE     | 90    |   A
HIST-101 |  2008 | RUSSELL |  PCLAYBORNE | 95    |   A

To repeat: The key is SUBJECT + YEAR + TEACHER + STUDENT, there are no duplicate values of this combination of columns. In plain terms we mean that no student can be listed in the same class twice, which only makes sense after all. We note next that there is only one actual property of the table, the GRADE column. This is the student's final grade in a particular class.

The violation is the last column, LETTER, which is not a property of the key but is functionally dependent upon the GRADE value. The term "functionally dependent" means that column LETTER is a function of GRADE.

As always, normalization is meant to help us. The problem with the table as-is is that you cannot guarantee that when the GRADE column is updated that the LETTER column will be correctly updated. The non-normalized table can lead to what we call UPDATE ANOMALIES.

Problem: Calculated Values Are a Fact of Life

We have seen in prior essays that enforcing first normal form and second normal form leads to more tables and simpler code. We saw in both cases that a programmer's natural tendency to use a "simple" solution of fewer non-normalized tables leads to more complicated and error-prone code. But when we come to third normal form the benefits do not seem so clear-cut.

Third normal form tells us to leave out the LETTER column, but the plain fact is that the LETTER column is very useful! The situation will only get worse if there are a few dozen (or a few hundred!) tables. If we leave out all calculated values then we must provide for them outside of the database, but if we put them into the tables we risk all kinds of errors in live data.

This is where things get sticky. In this week's essay I am going to explain the classic solutions to this problem, and include my own as the last one. It is my personal opinion that the problem of calculated values is at the heart of many of the flame wars, arguments and battles that are fought in the arena of database application development. You can't live with 'em but you can't live without 'em. Some kind of logical approach is required that gives us the benefits of normalization and the benefits of calculated values.

Preserving 3NF With Another Table

In many cases something that looks like a calculation can actually be removed out to another table and turned into a foreign key. As a matter of fact, the example I gave above can be moved out to another table:

GRADE  | LETTER
-------+---------
 100   |  A
  99   |  A
  98   |  A
  97   |  A
  96   |  A
  95   |  A
  94   |  A
  93   |  A
  92   |  A
  91   |  A
  90   |  A
  89   |  B
  88   |  B
  87   |  B
  86   |  B
  ....
  .... etc ....
  ....
  1    |  F
  0    |  F

Most programmers would not think of a solution like this because it has too many rows! Your basic programmer finds it ugly, so he will not use it. This underscores a recurring theme of this series: skills that lead to good coding will lead to bad table design. Table design should be done according to the principles of table design, not the principles of coding. The customer will never know you have an "ugly" table, they will just know your system works well.

There are certain benefits that you always get when you normalize and when you put everything into tables. The big advantage is that you can change your system without modifying a program file. In the United States we would leave my "ugly" table as it is for public schools, but for many private schools we would change it to 93-100 for an "A", 85-92 for a "B", and so forth. We could make 70-100 "P" for pass and 0-69 "F" for fail. We can do all of this without changing any program code, which is always a good thing.

Preserving 3NF With a View

A "database view" is a SELECT statement that is given a name and stored permanently in the database. The nifty thing about views is that they can be queried with a SELECT statement just like a table. A view is a great tool for preserving normalization because you can keep the table itself fully normalized and put the calculated values in the view. In most databases a view is created with syntax like this:

CREATE OR REPLACE VIEW grades_extended AS
SELECT subject, year, teacher, student , grade
      ,CASE WHEN grade >= 90 THEN 'A'
            WHEN grade >= 80 THEN 'B'
            WHEN grade >= 70 THEN 'C'
            WHEN grade >= 60 THEN 'D' 
            ELSE 'F' END as LETTER
 FROM grades;
 
-- this command pulls only from the normalized table
SELECT * FROM grades

-- this command gives letter grades also
SELECT * FROM grades_extended

Because a view is simply a SELECT statement, a view can JOIN information from multiple tables, and it can also do aggregate functions like SUM, AVG and so forth, so views are very powerful indeed.

Views are very popular and can really help out in simple situations. I have found that they become very difficult to manage as the table count increases, and so I personally do not use them for this purpose anymore, but I would not discourage anybody who wants to try them out. They will probably benefit you a lot.

Preserving 3NF by Calculating As Needed

Another approach is to keep the calculated columns out of the database and calculate them as needed.

Some years ago I was made Lead Systems Architect on a package that already had bout 750 tables when I took the position. The tables were in strict third normal form: no calculated values, no exceptions.

All calculations were made as needed, and they were all performed in program code. This particular program was an ERP package, so there were a great many monetary calculations. Absolutely nothing about it was simple, every single calculation had many possible formulas that took into account discounts, locations, customer types, time of year, and as far as I could tell the sunspot cycle and the phase of the moon.

The sales team was very successful, and so this program was under heavy modification at all times by programmers on two continents. The end result was this:

  • Many calculations involved calls to subroutines in 2, 3 or 10 different programs, making them nearly impossible to debug.
  • The program was completely impossible to document, we didn't even try.
  • The program was impossible to test, because of course nobody really knew what it was supposed to be doing.
  • Performance was often terrible. The need to calculate every value on every access put a huge strain on the database as information was pulled over and over and cycles were spent doing calculations.
  • Only a few people "knew where the bodies were buried", which contradicts standard advice for keeping staff interchangeable.
  • Sometimes a customer would print an invoice, then print it a few months later after an upgrade and get different numbers, which we could not explain or correct except at great time and expense.

When somebody tells me to keep 3rd normal form religiously and keep all calculations in program code I say "been there, seen that, no thanks."

Discarding 3NF by Using Program Code

The mostly popular approach by far is to calculate and store derived values using program code.

Most programmers use some kind of object-oriented language, and are used to the idea that they will have a GRADES class that handles the calculations for the GRADES table. Whenever anybody inserts or updates the table, some method looks at the GRADE column (a value from 0-100) and re-calculates the LETTER code, then saves the whole thing in the table. This violates 3NF but makes the database more fully populated and therefore more useful.

The major advantage to this approach is that most programmers understand how to do it with little prompting or explanation. Moreover, it does make the database indeed much more useful.

The largest problem with this approach is that subvertible. Unless the programmer is absolutely certain that nobody can access the database except through her code, the possibility always remains that a different program will mess up the values. In fact, you must be able to guarantee that even within your program no rogue programmer can accidentally (or maliciously) issue commands that subvert the values. If you get even one prima donna on your team that does not believe in following rules, serious damage can be done.

Some programmers say, "that's ok, my team is small and tight and only my program will access the data," but this just means they never plan to sell a program to a company that has their own IT department. That company will demand access to the database, possibly even write access, and then you have a problem.

However, when all is said and done, this is a perfectly valid way to get working programs written. When it is reasonable to expect that nobody else can access the database, and you trust your team, then this method may be the Right Thing.

Discarding 3NF With Triggers

The last approach I will explain is the one that I use myself. In this approach you put a "trigger" on the table that performs the calculation for you. A trigger is a small program that resides in the database server itself and is written is some kind of SQL-like extended language (or Perl or C or anything supported by your server). The trigger executes whenever certain events occur in the database. Specifically, every trigger fires for exactly one table, and might fire before or after an event, and might fire for insert, update, or delete. Triggers are ideal for performing calculations.

Trigger code tends to be highly specific to the platform you are using. The code below works for PostgreSQL and will definitely not work for MS SQL Server or MySQL, you will have to look up the syntax for those systems and do a little translation. But here it is:

CREATE OR REPLACE FUNCTION grades_ins_before_f RETURNS TRIGGER AS
$$
BEGIN
    new.letter = CASE WHEN new.grade >= 90 THEN 'A'
                      WHEN new.grade >= 80 THEN 'B'
                      WHEN new.grade >= 70 THEN 'C'
                      WHEN new.grade >= 60 THEN 'D'
                      ELSE 'F' END;
END
$$ LANGUAGE PLPGSQL SECURITY DEFINER;

CREATE TRIGGER grades_ins_before_t BEFORE INSERT OR UPDATE 
    ON GRADES  
    FOR EACH ROW EXECUTE PROCEDURE grades_ins_before_f();

I should note that PostgreSQL's triggers are very powerful, but their two-stage command to make a trigger is very annoying. Other servers such as MS SQL Server, DB/2 and MySQL have simpler syntax. I put up with PostgreSQL because it is so powerful and it is free software.

One very big advantage of this system is that the calculated value is non-subvertible, no rogue commands can corrupt the calculated values. I personally love this because it gives me all of the advantages of normalized data while also getting useful derived data in the tables for easy access.

Conclusion: Denormalization and Sorting It Out

Third normal form itself is easy enough to understand, but it brings us our first conflict between good database principles and the real world. The principles of normalization keep out calculated values, but calculated values are part of every useful applicaiton.

This is why so many people say that it is necessary to denormalize. But keep in mind always that there is a difference between non-normalized and de-normalized. We have no interest in a non-normalized database, where the designer has not even bothered to identify the correct tables, primary keys, and foreign keys.

What we have seen instead is that it is very useful first to normalize a database by identifying the individual things we want to keep track of, making an individual table for each specific kind of thing, and then identifying the primary properties of each of these things. The process of de-normalization begins when we define calculated values and begin to put those into the tables. Once we do this, we must choose a method to ensure that these values are correct, using views, triggers, program code or a combination of the three.

Other Posts

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

12 comments:

dportas said...

A very important clarification: ALL normal forms are concerned with CANDIDATE keys, not just PRIMARY KEYs. Considering PRIMARY keys only is not sufficient for normalization if the relation has more than one key.

Your definition of 3NF is actually BCNF not 3NF (which is a very common error that occurs in some textbooks), except where only a single candidate key is concerned.

KenDowns said...

dportas:

You may have noticed that Candidate Keys are not mentioned at all. From last week's entry:

"So far in this series I have not mentioned Candidate Keys. To make matters worse, I have defined normalization in terms of Primary Keys, while really it is defined in terms of Candidate Keys. I did this for the simple reason that I find people can learn normalization better before learning about candidate keys, and that teaching candidate keys later tends to work better."

Anonymous said...

Why is program code considered subvertible by folks with direct DB access - or even "rogue" programmers - but DB code (in the form of triggers) isn't? It is just as easy to disable a trigger as it is to write code.

KenDowns said...

Anon: w/respect to subverting code.

Many database apps are written so that all programs make a connection to the database using a universal connection. Nearly all web apps are written this way, and most web programmers don't know there is any other way to do it. This arrangement is completely open to subversion by a programmer, because every programmer's code is running at the highest privilege and you are depending on the programmer's good behavior to maintain security and biz rules.

However, when you use triggers, you make sure the universal connection does not have ownership rights over triggers, so a programmer cannot write application code that disables a trigger. If the programmer issues "update customers set balance=0" the trigger code says, "sorry no, I handle that column, I'm throwing an error."

BTW, that universal connection that everybody uses is the root source of SQL injection vulnerabilities, which I'll be talking about when we get to security.

Stefano said...

Talking about PostgreSQL annoyances: In 8.3, which I'm using, I had to modify the definition of the trigger function as follows:

create or replace function grades_ins_before_f () returns trigger as
$$
begin
...
return new;
end;
$$
language plpgsql security definer;

Bruno Martinez said...

The approach of having a row with the letter for each grade doesn't work if grades can have a non whole part. Having the intervals in triggers doesn't allow for easy updating of the values. Are there DBs with 'interval' keys, for these cases?

KenDowns said...

Bruno: I would hope it would go without saying that this is an example, if you need partial letter grades then use an appropriately defined column.

As for intervals, I know of know database server that implements them natively. They are really an extension of a primary or unique key. They are so useful I built them into my framework. This forces me to enforce the PK in a (generated) trigger which is somewhat distateful, but it gets the job done and it works well.

Mário Marinato said...

The ugly table from the first example is the best example of the 'less code, more data' idea you wrote a few months later.

If fact, this is what made me understand that other post.

Anonymous said...

This blog, and especially the guide to normal forms, is awesome. Thanks for your great work Ken!

Anonymous said...

Great Post! About preserving 3NF by using another table to map a grade to a letter... a problem with this is time-related. What if the range of grades changes later due to business rule changes. We want that prior grades not be affected and still show their original letter. So, I guess saving the letter is a must. I don't know what design pattern this is or if it is one. But this is very useful especially in business applications where we want to keep the original values and not be affected by master data changes. It is also very complicated to design another table containing different values depending on date/time or range of dates. Don't you agree? Maybe you could add this info in that section.

Stew Ashton said...

Oracle 11G appears to address this issue with "virtual columns". A virtual column contains a calculated value, like a mini-view, but it can be used in indexes and constraints. The nice thing about it is that you have to put the calculated value in the same table with all the values used in the calculation, which is the only way of guaranteeing data coherence.

KenDowns said...

Stew:

MS SQL Server also allows what they call computed columns.

I find these unsatisfactory because, at least in the case of SQL Server, you are limited to calculations within a row. A complete solution must allow for data that "travels" along foreign keys: fetched from parent to child and aggregated from child to parent.