Sunday, October 12, 2008

The Argument For Normalization

This week we will review the practical arguments in favor of normalization. The major concern as always on this blog is to examine database decisions in light of how they affect the overall application. The major argument for normalization is very simple: you end up coding less, coding easier, and coding stronger, and you end up with fewer data errors.

There are links to related essays on normalization and denormalization at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Informal Description of Normalization

When I find a programmer who is stuck trying to grasp the concepts of normalization, the problem usually comes down to not being able to see the big picture. The programmer may read any number of rigorous papers on the subject (the Wikipedia article is a good place to start) but still be unable to get the basic point. This leaves the programmer stumbling through table design, second-guessing himself, and then running through a frustrating sequence of redesigns. In the worst case it leads him to conclude normalization may not be worth the effort, at which point he starts writing really crappy applications.

The goal of normalization in simple terms is just this: to store each fact in exactly one place. When you put each fact in only one place, you always know where to go to read it or write it. When facts are repeated in the database, the application programmer has an increased burden to make sure they are all consistent. If he fails to shoulder this burden completely, the database will have inconsistent values for the same facts, leading to emergency phone calls and emails requesting help. These request for help always come at 4:30pm as you are getting ready for a date or an extended vacation.

Once the programmer understands this very simple goal, he usually has an "aha!" moment reading the various articles on normalization. Each rule for normalizing suddenly makes more sense, as it appears as just another way to make sure that there is a place for everything and everything is in its place.

The Programmer's Burden

When a programmer is dealing with a non-normalized database, he is going to run into 4 problems over and over again. Three of these are called "anomalies" and the fourth is "inconsistency" (a fancy way of saying the database holds bad data). Most authors who write on normalization take it for granted that the reader can readily see why the anomalies are bad, but I would like to spell it out here to make it crystal clear. First we will look at the three kinds of anomalies, then we will go to the inconsistency problem, and see exactly how they affect the programmer.

Consider a fledgling programmer who has ready too many Web 2.0 blogs saying that relational databases are bad, and so has not bothered to learn anything about them. In the name of "simplicity", he creates a single table that lists employees, their email addresses, the customers they are assigned to, and the primary email address of each customer. This will lead to three kinds of anomaly, each of which leads to inconsistency.

An Update Anomaly occurs when a fact is stored in multiple locations and a user is able to change one without changing them all. If a user goes to this employee-customer table and changes an employee's email on only one row, and no provision is made to change the others, then the database now has inconsistent values for the employee's email.

  • An Insert Anomaly occurs when it is not actually possible to record a fact. If an employee is hired but not yet assigned to any customers, it is not possible to store his email address!

  • A Delete Anomaly occurs when the user deletes one fact and clobbers some other fact along the way. If an employee goes on leave, so that we must remove (delete) their assignments, then we have lost their email address!

    This non-normalized database requires the programmer to write additional application code to try to intercept and correct these issues. This is the Programmer's Burden in a non-normalized situation, and it gets worse and worse as the program expands.

    The Programmer's Burden also emerges as a continuing stream of complaints from users that "the program is wrong." For every case where the programmer fails to provide exception-handling code, a user will stumble across inconsistent data. The customer says, "it says 'X' on this screen but it says 'Y' on that screen, what's going on?" As far as they are concerned it is a bug (which of course it is) and must be fixed. You can't make money coding new features when you are fixing garbage like that.

    The Basic Argument

    So the basic argument for normalization is: we wish to avoid the Programmer's Burden as completely as possible. We want to spend our time on cool features, not going back over and over to fix features we thought were finished already.

    Special Comment on First Normal Form

    First normal form is different from the others. When a database designer violates the higher normal forms, the result is that a fact is recorded in more than one place. However, when you violate first normal form it results in more than one fact in the same place.

    A basic example would be the same table of employees and customers, where we "solve" the problems listed above by storing only one row for each employee, with a comma-separated list of accounts, like so:

    EMPLOYEE   EMAIL                     CUSTOMERS
    ------------------------------------------------------------
    ARANDOLPH  art@praxis.com            100, 523, 638, 724
    SRUSSELL   sax@overlook.edu          516, 123, 158
    PBOYLE     phyllis@sp-elevataor.com  713, 928, 212
    

    The above scheme increases the Programmer's Burden because now he must decompose the data that comes from the server. In technical terms we say that the value CUSTOMERS is non-atomic, it is not a single fact. Every piece of code that touches that table must break down the list of customers and sometimes reassemble it.

    To see this, consider the basic task of adding a customer for employee Art Randolph. If the tables were set up properly, you would insert into a cross-reference of employees and customers, and duplicates would be trapped by a primary key. But here you must retrieve the list of existing customers, split it up in application code, and check that the value is not repeated. Then you have collapse the list back down and send it up to the server.

    All I can say is, no thanks.

    By The Way, What Is The Right Way?

    Now that we have beat up our fledgeling programmer's lousy employee-customer table, it would be worthwhile to spell out how to do it correctly.

    First off, we always need one table for each kind of thing we are keeping track of. That means we will have a table of employees and a table customers. This solves all of the anomalies and inconsistencies listed above because we put facts about employees in the employees table (like email address) and facts about customers in the customers table.

    This leaves the issue of linking employees to customers. There are three ways to do it:

    1. If each customer gets a team of employees assigned to them, but an employee only ever works for one customer, then put a Foreign Key on the employees table that links to the customers table.
    2. If each employee works on more than one customer, but each customer gets only one employee, then put a foreign key on the customers table that links back to employees.
    3. If an employee can work for more than one customer and vice-versa, make a Cross-reference between customers and employees.

    Conclusion

    This week we have seen a fairly simple argument for normalization, and one that regular readers of this blog have seen before: normalization eliminates unnecessary coding burdens. It is hard enough to get software projects done on time and on budget without imposing additional labor that could be avoided entirely by normalizing.

    I do not mean to imply that normalizing takes no time or is instantly easier than a fear-based retreat into coding your way out of things. It does take time to learn to normalize and it does take time to learn to code an application around normalized tables. In my own experience I passed through the various erroneous mindsets that I make fun of in this blog, and each time I put effort into learning the "right way" then every effort I made after that was forever easier, had fewer bugs, and made my customers more happy. So I am not saying it is free, but I am saying it is one of the best bargains in town.

    Related Essays

    This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

    The normalization essays on this blog are:

  • 1 comment:

    Anonymous said...

    You wrote ...

    "Consider a fledgling programmer who has ready too many Web 2.0 blogs saying that relational databases are bad, and so has not bothered to learn anything about them. In the name of "simplicity", he creates a single table that lists employees, their email addresses, the customers they are assigned to, and the primary email address of each customer."

    There is a very basic weakness with this argument. If the programmer thinks relational databases are bad, why would she "create a table", i.e., use a relational database, as her first choice? It's more likely that she'd use a hierarchical or network database, e.g., an XML or OO database, which would not store the email addresses twice, etc.