Monday, December 24, 2007

Database Skills: Second Normal Form

The most basic skill that you need for successful database projects is table design. Strong table design leads directly to efficient and clean code, while weak table design leads in the opposite direction, code that is hard to write and hard to maintain.

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

The Example

Our example is a school that has different kinds of schedules on different days. Periods may be 35, 40, or 45 minutes, and there may or may not be an assembly. Our assignment is to create a page that displays the periods for a day.

Nobody Argues About Second Normal Form

This week's essay is going to be rather short for two reasons. The first reasons is that I have never really heard anybody argue against second normal form, and so there is no real debate for me to review. In fact, from my own personal experience I would say that Second Normal Form is possibly the least controversial concept in computer science.

The second reason this essay will be short is that I have no real-world examples of violations that appear to make sense, so I have to make up the example. Many people will put in repeating groups (violating first normal form) and give you a determined argument claiming they are doing the Right Thing, but nobody has done that to me with second normal form. So I had to make something up to demonstrate what 2nd Normal Form is.

Prelude: Multi-Column Primary Key

Before we get to second normal form we have to talk very briefly about multi-column primary keys. A multi-column primary key is exactly what it sounds like, a primary key that has more than one column.

Our example from prior essays is a table of periods with a primary key on DAY_TYPE and PERIOD. No combination of these columns can appear twice. So in the table below the combinations in the first two columns must be unique:

NORMAL    |  HOME   |   0       | 8:00 am   |  8:10 am
NORMAL    |  1      |   1       | 8:15 am   |  9:00 am
NORMAL    |  2      |   2       | 9:05 am   |  9:50 am   

Second Normal Form

Second normal form is violated if you have a column that depends on only a part of the primary key. In this case it would mean a column that depended only on DAY_TYPE or PERIOD. This example has two violations of second normal form:

NORMAL    |  Normal       | HOME     | Homeroom    |  0        | 8:00 am   |  8:10 am
NORMAL    |  Normal       | 1        | Period 1    |  1        | 8:15 am   |  9:00 am
NORMAL    |  Normal       | 2        | Period 2    |  2        | 9:05 am   |  9:50 am   
...more rows for type NORMAL
ALONG     | Long Assembly | HOME     | Homeroom    |  0        | 8:00 am   |  8:10 am
ALONG     | Long Assembly | 1        | Period 1    |  1        | 8:15 am   |  9:00 am
ALONG     | Long Assembly | 2        | Period 2    |  2        | 9:05 am   |  9:50 am   

Getting to Second Normal Form

In the example above the column DAY_DESC always has the same value for a given value of DAY_TYPE. DAY_DESC depends on only part of the key (which is DAY_TYPE + PERIOD), and so it is a violation of second normal form.

The exact same error occurs with DESCRIPTION, which is always the same for a given value of PERIOD. This value depends on only part of the key, and so it is a violation of second normal form.

The fix in both cases is to create parent tables to hold these values. The primary key of the parent table is one of the columns from the original table. Here are the two tables we create when we normalize the table from above:


NORMAL            |   Normal Schedule 
ALONG             |   Long Assembly
ASHORT            |   Short Assembly
ALUNCH            |   Short Assembly After Lunch

HR           |   Homeroom
1            |   Period 1
2            |   Period 2
3            |   Period 3
4            |   Period 4
LUNCH        |   Lunch
6            |   Period 6
7            |   Period 7
8            |   Period 8
ASSM         |   Assembly

Why Normalize?

Because this is a short essay this week, now would be a good time to review the basic reasons why we normalize.

In a normalized database, every distinct fact is stored in exactly one location, and each location stores exactly one fact (where a "location" means the value of one column in one particular row). This goal itself serves the goal of avoiding something that the database theorists call "anomalies".

In database terms, an "UPDATE anomaly" occurs when you issue a reasonable update statement to change something, but in fact the same information is stored somewhere else as well, and your two locations are now out of sync, your database has become corrupted. An "INSERT anomaly" and a "DELETE anomaly" are similar.

What this means to us programmers is that when a database is not normalized we end up with a significant burden trying to keep the data correct. We waste a lot of time fixing data on live systems (admit it, you've done it!) or adding code trying to prevent the problem. Many of those problems would be avoided completely if the data were normalized.

Now, as always, programmer who do not know databases well will argue that a system of fewer tables is "simpler", because he only needs one table instead of three, and he does not need to use JOINs when querying, or his ORM system is easier to write. This may be true, but he now has the burden of making sure that when an end-user changes a description in one row that it gets changed in all of the others. I personally would rather spend my time adding a new feature then fixing one I coded last year.

Normalization cannot solve all of your problems, but it can very definitely solve the ones it was meant to solve. If you can get each fact stored in exactly one place then you never have to waste time coding special fix-it routines or crawling through inconsistent data on a live system.

Conclusion: Easier Coding

Today we have seen another example of how a normalized set of tables is actually easier to work with than a so-called "simpler" single table, because it is easier to make sure that the data is correct.

Other Posts

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.


John Flack said...

First of all, thank you for this series. I love the way you are not just explaining HOW to normalize, but WHY you do it, and what can go wrong if you don't.

I'm afraid that violations of second normal form are not are rare as you think. I am myself guilty - I have a name and address system for contacts in various US State governments, and added some columns the relate to the State as a whole. This should have been in a separate table.

KenDowns said...

John, thanks for the compliment, I'm glad you find it helpful. Stay tuned, there is more to come each Monday!

phuson said...

Great post Kenneth, I'm enjoying all these articles so far. Keep them coming. :)

dr.bob said...

the correct spelling is ANOMALY..
otherwise, great stuff. Keep it coming:)

KenDowns said...

Dr. Bob, thanks for the correction and the kind words, I've updated the entry.

astembridge said...

Hi, thanks for the series. I feel I have a good understanding of first/second normal forms, but where I get lost is writing complex queries with first normal forms. Any chance you'd put together an article dealing with aggregate functions, joins, as relating to first normal design -- for someone new to the concept?

KenDowns said...

@astembridge: check out the table of contents, there are two entries on JOINs, as well as other entries on queries. If you still have questions, try posting something on one of those.

Stu said...

Thank you for this. I've found your clear explanations very helpful whilst completing an assignment on database normalization.