The Database Programmer

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

Update: There is a new essay on Business Logic that also goes into how calculated values affect your code paradigm.

Third Normal Form Defined

Third normal form is pretty simple. In plain terms it means that no column can depend on a non-key column. In technical terms we say that there are no "transitive dependencies."

The example below shows a table that is not in third normal form. The problem is that the letter grade depends on the numeric grade:

SUBJECT  |  YEAR | TEACHER |  STUDENT    | GRADE |  LETTER
---------+-------+---------+-------------+-------+-----------
HIST-101 |  2008 | RUSSELL |  NIRGALAI   | 80    |   B
HIST-101 |  2008 | RUSSELL |  JBOONE     | 90    |   A
HIST-101 |  2008 | RUSSELL |  PCLAYBORNE | 95    |   A

To repeat: The key is SUBJECT + YEAR + TEACHER + STUDENT, there are no duplicate values of this combination of columns. In plain terms we mean that no student can be listed in the same class twice, which only makes sense after all. We note next that there is only one actual property of the table, the GRADE column. This is the student's final grade in a particular class.

The violation is the last column, LETTER, which is not a property of the key but is functionally dependent upon the GRADE value. The term "functionally dependent" means that column LETTER is a function of GRADE.

As always, normalization is meant to help us. The problem with the table as-is is that you cannot guarantee that when the GRADE column is updated that the LETTER column will be correctly updated. The non-normalized table can lead to what we call UPDATE ANOMALIES.

Problem: Calculated Values Are a Fact of Life

We have seen in prior essays that enforcing first normal form and second normal form leads to more tables and simpler code. We saw in both cases that a programmer's natural tendency to use a "simple" solution of fewer non-normalized tables leads to more complicated and error-prone code. But when we come to third normal form the benefits do not seem so clear-cut.

Third normal form tells us to leave out the LETTER column, but the plain fact is that the LETTER column is very useful! The situation will only get worse if there are a few dozen (or a few hundred!) tables. If we leave out all calculated values then we must provide for them outside of the database, but if we put them into the tables we risk all kinds of errors in live data.

This is where things get sticky. In this week's essay I am going to explain the classic solutions to this problem, and include my own as the last one. It is my personal opinion that the problem of calculated values is at the heart of many of the flame wars, arguments and battles that are fought in the arena of database application development. You can't live with 'em but you can't live without 'em. Some kind of logical approach is required that gives us the benefits of normalization and the benefits of calculated values.

Preserving 3NF With Another Table

In many cases something that looks like a calculation can actually be removed out to another table and turned into a foreign key. As a matter of fact, the example I gave above can be moved out to another table:

GRADE  | LETTER
-------+---------
 100   |  A
  99   |  A
  98   |  A
  97   |  A
  96   |  A
  95   |  A
  94   |  A
  93   |  A
  92   |  A
  91   |  A
  90   |  A
  89   |  B
  88   |  B
  87   |  B
  86   |  B
  ....
  .... etc ....
  ....
  1    |  F
  0    |  F

Most programmers would not think of a solution like this because it has too many rows! Your basic programmer finds it ugly, so he will not use it. This underscores a recurring theme of this series: skills that lead to good coding will lead to bad table design. Table design should be done according to the principles of table design, not the principles of coding. The customer will never know you have an "ugly" table, they will just know your system works well.

There are certain benefits that you always get when you normalize and when you put everything into tables. The big advantage is that you can change your system without modifying a program file. In the United States we would leave my "ugly" table as it is for public schools, but for many private schools we would change it to 93-100 for an "A", 85-92 for a "B", and so forth. We could make 70-100 "P" for pass and 0-69 "F" for fail. We can do all of this without changing any program code, which is always a good thing.

Preserving 3NF With a View

A "database view" is a SELECT statement that is given a name and stored permanently in the database. The nifty thing about views is that they can be queried with a SELECT statement just like a table. A view is a great tool for preserving normalization because you can keep the table itself fully normalized and put the calculated values in the view. In most databases a view is created with syntax like this:

CREATE OR REPLACE VIEW grades_extended AS
SELECT subject, year, teacher, student , grade
      ,CASE WHEN grade >= 90 THEN 'A'
            WHEN grade >= 80 THEN 'B'
            WHEN grade >= 70 THEN 'C'
            WHEN grade >= 60 THEN 'D' 
            ELSE 'F' END as LETTER
 FROM grades;
 
-- this command pulls only from the normalized table
SELECT * FROM grades

-- this command gives letter grades also
SELECT * FROM grades_extended

Because a view is simply a SELECT statement, a view can JOIN information from multiple tables, and it can also do aggregate functions like SUM, AVG and so forth, so views are very powerful indeed.

Views are very popular and can really help out in simple situations. I have found that they become very difficult to manage as the table count increases, and so I personally do not use them for this purpose anymore, but I would not discourage anybody who wants to try them out. They will probably benefit you a lot.

Preserving 3NF by Calculating As Needed

Another approach is to keep the calculated columns out of the database and calculate them as needed.

Some years ago I was made Lead Systems Architect on a package that already had bout 750 tables when I took the position. The tables were in strict third normal form: no calculated values, no exceptions.

All calculations were made as needed, and they were all performed in program code. This particular program was an ERP package, so there were a great many monetary calculations. Absolutely nothing about it was simple, every single calculation had many possible formulas that took into account discounts, locations, customer types, time of year, and as far as I could tell the sunspot cycle and the phase of the moon.

The sales team was very successful, and so this program was under heavy modification at all times by programmers on two continents. The end result was this:

Many calculations involved calls to subroutines in 2, 3 or 10 different programs, making them nearly impossible to debug.
The program was completely impossible to document, we didn't even try.
The program was impossible to test, because of course nobody really knew what it was supposed to be doing.
Performance was often terrible. The need to calculate every value on every access put a huge strain on the database as information was pulled over and over and cycles were spent doing calculations.
Only a few people "knew where the bodies were buried", which contradicts standard advice for keeping staff interchangeable.
Sometimes a customer would print an invoice, then print it a few months later after an upgrade and get different numbers, which we could not explain or correct except at great time and expense.

When somebody tells me to keep 3rd normal form religiously and keep all calculations in program code I say "been there, seen that, no thanks."

Discarding 3NF by Using Program Code

The mostly popular approach by far is to calculate and store derived values using program code.

Most programmers use some kind of object-oriented language, and are used to the idea that they will have a GRADES class that handles the calculations for the GRADES table. Whenever anybody inserts or updates the table, some method looks at the GRADE column (a value from 0-100) and re-calculates the LETTER code, then saves the whole thing in the table. This violates 3NF but makes the database more fully populated and therefore more useful.

The major advantage to this approach is that most programmers understand how to do it with little prompting or explanation. Moreover, it does make the database indeed much more useful.

The largest problem with this approach is that subvertible. Unless the programmer is absolutely certain that nobody can access the database except through her code, the possibility always remains that a different program will mess up the values. In fact, you must be able to guarantee that even within your program no rogue programmer can accidentally (or maliciously) issue commands that subvert the values. If you get even one prima donna on your team that does not believe in following rules, serious damage can be done.

Some programmers say, "that's ok, my team is small and tight and only my program will access the data," but this just means they never plan to sell a program to a company that has their own IT department. That company will demand access to the database, possibly even write access, and then you have a problem.

However, when all is said and done, this is a perfectly valid way to get working programs written. When it is reasonable to expect that nobody else can access the database, and you trust your team, then this method may be the Right Thing.

Discarding 3NF With Triggers

The last approach I will explain is the one that I use myself. In this approach you put a "trigger" on the table that performs the calculation for you. A trigger is a small program that resides in the database server itself and is written is some kind of SQL-like extended language (or Perl or C or anything supported by your server). The trigger executes whenever certain events occur in the database. Specifically, every trigger fires for exactly one table, and might fire before or after an event, and might fire for insert, update, or delete. Triggers are ideal for performing calculations.

Trigger code tends to be highly specific to the platform you are using. The code below works for PostgreSQL and will definitely not work for MS SQL Server or MySQL, you will have to look up the syntax for those systems and do a little translation. But here it is:

CREATE OR REPLACE FUNCTION grades_ins_before_f RETURNS TRIGGER AS
$$
BEGIN
    new.letter = CASE WHEN new.grade >= 90 THEN 'A'
                      WHEN new.grade >= 80 THEN 'B'
                      WHEN new.grade >= 70 THEN 'C'
                      WHEN new.grade >= 60 THEN 'D'
                      ELSE 'F' END;
END
$$ LANGUAGE PLPGSQL SECURITY DEFINER;

CREATE TRIGGER grades_ins_before_t BEFORE INSERT OR UPDATE 
    ON GRADES  
    FOR EACH ROW EXECUTE PROCEDURE grades_ins_before_f();

I should note that PostgreSQL's triggers are very powerful, but their two-stage command to make a trigger is very annoying. Other servers such as MS SQL Server, DB/2 and MySQL have simpler syntax. I put up with PostgreSQL because it is so powerful and it is free software.

One very big advantage of this system is that the calculated value is non-subvertible, no rogue commands can corrupt the calculated values. I personally love this because it gives me all of the advantages of normalized data while also getting useful derived data in the tables for easy access.

Conclusion: Denormalization and Sorting It Out

Third normal form itself is easy enough to understand, but it brings us our first conflict between good database principles and the real world. The principles of normalization keep out calculated values, but calculated values are part of every useful applicaiton.

This is why so many people say that it is necessary to denormalize. But keep in mind always that there is a difference between non-normalized and de-normalized. We have no interest in a non-normalized database, where the designer has not even bothered to identify the correct tables, primary keys, and foreign keys.

What we have seen instead is that it is very useful first to normalize a database by identifying the individual things we want to keep track of, making an individual table for each specific kind of thing, and then identifying the primary properties of each of these things. The process of de-normalization begins when we define calculated values and begin to put those into the tables. Once we do this, we must choose a method to ensure that these values are correct, using views, triggers, program code or a combination of the three.

The Database Programmer

Monday, January 7, 2008

Database Skills: Third Normal Form and Calculated Values

Third Normal Form Defined

Problem: Calculated Values Are a Fact of Life

Preserving 3NF With Another Table

Preserving 3NF With a View

Preserving 3NF by Calculating As Needed

Discarding 3NF by Using Program Code

Discarding 3NF With Triggers

Conclusion: Denormalization and Sorting It Out

Other Posts

20 comments:

Pages

In the Top 200 Tech Blogs!

Labels

Database Programmer Links

External Links

Blog Archive

Statcounter