Monday, December 17, 2007

Database Skills: First Normal Form

Since all non-trivial websites require a database, and since database skills are different from coding skills, there is very good reason for any programmer to master the principles of database design and use.

The most basic skill that you need for successful database projects is table design. A strong table design can be the foundation of a wonderful application, while a weak design brings nothing but delays, expense, headaches and heartaches.

One of the cornerstones of good table design is the process of normalization. Normalization proceeds through different stages. The result of each stage is called a "Normal form." So if we have completed the first stage we say the database is in "First Normal Form", and after we have completed the second stage we say the database is in "Second Normal Form" and so on.

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

The Example

Our example is a school that has different kinds of schedules on different days. A normal day has 4 periods in the morning, lunch, and then 3 afternoon periods. Sometimes the periods are 35 minutes and there is an assembly at the end of the day that goes to the normal end of day. On short assembly days the periods are 40 minutes each. There is a also a variation where the short assembly might be after lunch instead of at the end of the day.

A First Draft Table

Taking a stab at some table design, we may say something like, "Well I can see that there are never more than nine periods, and there are four kinds of schedules. I can make a table that lists the four kinds of schedules and tells me what happens in each period, something like this:"

DAY_TYPE   | MINUTES | P0   P1   P2   P3   P4   P5   P6   P7  P8   
-----------+---------+----+----+----+----+----+----+----+----+----
NORMAL     | 45      | C  | C  |  C | C  | L  | C  | C  | C  | X 
ALONG      | 35      | C  | C  |  C | C  | L  | C  | C  | C  | A 
ASHORT     | 40      | C  | C  |  C | C  | L  | C  | C  | C  | A 
ALUNCH     | 40      | C  | C  |  C | C  | L  | A  | C  | C  | C 

....where "C" means a class period, "L" means lunch, "A" means assembly and "X" means nothing happens.

It so happens that this table is not in first normal form. We will look at what this means first in theory, and then we will see the practical problems that come from it.

First Normal Form as Theory

A table is in first normal form if it has no repeating groups. The example above repeats the period column nine times with columns P0, P1, P2. This is a repeating group, and so it violates first normal form. Theory tells us that we must redesign the table, splitting the repeating group out into a child table. This will leave us with two tables, one that is for day types and the other that tells us about periods. The table of day types is now quite simpler:

DAY_TYPE   | MINUTES | DESCRIPTION
-----------+---------+-----------------------------                  
NORMAL     | 45      | Normal Day                            
ALONG      | 35      | Long Assembly
ASHORT     | 40      | Short Assembly                           
ALUNCH     | 40      | Short Assembly After Lunch                        

The original table had ordering information buried in the names of the columns. We need to have that same information in our child table so we will have the PERIOD column be numeric and it will determine the order of the periods. Here is the table with some of its entries:

DAY_TYPE   |  PERIOD   |   ACTIVITY 
-----------+-----------+---------------
NORMAL     |  1        | Class
NORMAL     |  2        | Class
NORMAL     |  3        | Class
NORMAL     |  4        | Class
NORMAL     |  5        | Lunch 
NORMAL     |  6        | Class
NORMAL     |  7        | Class
NORMAL     |  8        | Class
ALONG      |  1        | Class
ALONG      |  2        | Class
ALONG      |  3        | Class
ALONG      |  4        | Class
ALONG      |  5        | Lunch 
ALONG      |  6        | Class
ALONG      |  7        | Class
ALONG      |  8        | Class
ALONG      |  9        | Assembly

We have now satisfied the theory, and put our fledgling database into First Normal Form. But the question arises, wasn't the single-table version a lot simpler? This new "fixed" version seems more complicated. A programmer might say something like, "I have to look all over the place to find what I need." To answer these objections we will now examine the practical side of the example.

Problem 1: Structure Changes

The non-normalized example is probably fine if you have just one school that never changes their schedule. But if you have multiple schools you will need to change your structure for them, adding or removing periods.

To this we might reply, "but sure, structures change all of the time." Yes, it is true that a structure change may be required to improve a program or add functionality, but in the example above you must change the structure in cases where you are not improving the program or adding functionality. This means the more people use your program, the more work for you! By contrast, in the normalized situation a different schedule just means an end-user task to make entries in the tables.

You can argue that this particular example is not so bad, because, after all, how many periods can any school really have? Let's make a table with 15 periods and be done with it, no? The problem is that if you win that argument and make a habit out of repeating groups then you saddle your development efforts with the ongoing need to do structure changes where simple data editing would have done the job.

Problem 2: Secondary Columns

I worked on a system one time that had a repeating group of 24 columns in several tables. The columns were for the size of an item of clothing. In one table there were 24 additional columns for price. Then 24 more for cost. Then 24 more for gross profit. There were more, should I continue?

The problem would appear in our school schedule if we wanted to put the start and end times of the periods into the tables so we did not have to figure them out on every page load. We would now have START0, START1, END0, END1, and so on, eighteen more columns. Such fat tables like this are very clumsy to work with both in code and when doing ad-hoc queries.

Problem 3: Constants Required

Imagine you have a simple routine called SQL_ALLROWS() that gives you query results in an array (if you don' have one, code it today!). If your database is well normalized your code will display this pattern a lot:

$rows = SQL_AllRows(
    "SELECT activity FROM period_info 
      WHERE day_type = 'NORMAL'
      ORDER BY sequence"
);
foreach($rows as $row) {
   // ...do something interesting...
}

This simple code is not possible when you violate first normal form. You must put a constant into your code somewhere so that your code knows how many columns there are. This is an extra bit of maitenance to worry about here:

$row = SQL_OneRow(
    "SELECT * FROM period_info 
      WHERE day_type = 'NORMAL'
);
for($x = 0; $x < NUM_PERIODS; $x++ ) {
   $period = $row['P'.$x];
   // ... do some display...
}

Problem 4: Query Construction

For problem 3 above, I put a "Select *" into the query. I really should not have done that. I should have listed all 9 "P" columnns (or all 27 columns if we put start and end times in also). Something like this:

SELECT P0,P1,P2,....

But this violates good programming practice. If the structure changes then I have to go to all of those queries and modify them. A better programming practice would be to use that NUM_PERIODs constant to generate column lists on the fly:

$cols = array();
for ($x = 0; $x < NUM_PERIODS; $x++ ) {
  $cols[] = 'P'.$x;
}
$query = "SELECT ".implode(',',$cols)." from...";

Most programmers will avoid a bit of code like this because it seems at first glance to be really stretching a point. Of course once the program is successful and the additions and changes come in that leaves you going through and changing every query every time you modify the structure. None of this is necessary with the normalized version.

It is starting to seem like a stretch to claim that we have take the simpler path here. Perhaps the "complicated" normalized tables really are simpler.

Problem 5: Hard-Coded Tools

As I write this in December of 2007 the web is rediscovering a UI widget that has been in use for decades in terminals and desktops: the grid. The grid is the natural method of editing child tables in admin interfaces.

Before 1995, when there were no web programmers, a programmer who violated first normal form paid the price by needing to code up things like grids by hand because he could not use grid-like tools available in his framework. The modern web programmer will pay the price in not being able to use the grids that are rapidly coming online.

Problem 6: Cannot Aggregate

Finally we will consider a simple question, tell me how many periods there are in any particular DAY_TYPE. In a normalized database the SQL is easy:

SELECT COUNT(*) FROM periods WHERE day_type='NORMAL'
-- or, to get them all:
SELECT COUNT(*),Day_type
  FROM periods
 GROUP BY day_type

Getting the same answer in the non-normalized database requires code. It does not matter how simple that code is, the fact is that if the tables were normalized the code would not be required at all. Multiply that by 10 tables, 100 tables, and you just have complication after complication.

Conclusion

Violating first normal form puts us in a situation where we need programmers to make changes, while normalized databases can be modified by end users. Violating first normal form also forces us to embed and maintaint constants we would not otherwise need and to change queries that could otherwise be left alone. These problems can be mitigated with libraries of code that build queries for us, but still those libraries have to be written.

In this light, we see can see first normal form as something that creates efficient code and easier-to-maintain databases.

Other Posts

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

4 comments:

Anonymous said...

hi this is very nice. Keep it up. Be very happy to see more intermediate to advance database skills :)

KenDowns said...

Anonymous: thanks for the encouragement. The normalization entries will only go so far as third normal form, and then I'll be going to key selection which i hope is the beginning of intermediate stuff. If it doesn't get tough enough just complain :)

Alex said...

Thank you very much for your blog. Really a gem in the blogworld and the WWW ;-)

One remark: the concept of First Normal Form is not as easy as it seems (atomicity, repeating groups, etc), cf. http://phlonx.com/resources/nf3/#nf1 or http://en.wikipedia.org/wiki/First_normal_form

david mathers said...

It's impossible to violate first normal form using SQL.

In your example, at the relation level, your relation has an attrubute of type array named "periods".

If your RDMBS doesn't offer an array type then you need to create your own array representation. Whether you use P0, P1, etc. or use a table to hold the array is up to you. Both may be fine options, depending on your situation, but the choice is unrelated to relational theory and normalization.