Welcome to the Database Programmer! The main theme of these essays is that your applications will be leaner, faster, and easier to write and maintain if you understand how databases work.
The Non-normalized Database
A denormalized database is one that has been meticulously normalized to eliminate redundancies, only to have redundancies deliberately put back in to meet other needs. That is the kind of database we are going to talk about today.
By contrast, a database is "non-normalized" if nobody ever bothered to normalize it. This is what you get when the programmer says, "I'm not concerned with the table structure yet, I'm working on the code first." In my experience this is something like saying, "I'm not worried about the plan for this work bench I'm building, I'm just going to throw the wood on the table saw and start cutting." We are not interested in non-normalized databases.
The Fully Normalized Database
A fully normalized database has few if any redundancies: each fact is stored in exactly one place. This design has one big advantage, which is that write actions (INSERT, UPDATE, and DELETE) are very easy to code up correctly. The application code for a fully normalized database will be smooth, simple, easy to write, and easy to maintain.
But decades of experience have shown that fully normalized databases have a few drawbacks, which is why practical-minded programmers always end up denormalizing their designs. The problems in particular are:
- No calculated values. If you have a shopping cart with the columns "price" and "qty", it may seem natural to put in a column "extended_price" that holds price * qty. This is actually forbidden by third normal form, though almost everybody puts it in anyway.
- Non-reproducible Calculations. Continuing from the point above, most shopping carts or ordering systems have many calculations that are far more complicated than simply doing price * qty. Often a calculation may depend on a dozen or more preceding calculations. In complex situations like this, it is all too common that changes in the program mean an invoice printed after an upgrade produces different numbers than the same invoice printed before the upgrade. You don't want that phone call!
- JOIN Jungles. A fully normalized database will have information scattered in many different tables. Though this makes it easy to get it right going in, it can make it terribly difficult to get seemingly simple combinations of data back out.
The Denormalized Database
As I said above, the denormalized database is one that was first normalized by having its redundancies removed, only to have some redundancies deliberately put back in. This can solve all three of the problems listed above. There is a cost of course. When we denormalize we have to keep two things in mind:
- We must dream up a way to keep the redundant values correct. If we can pull this off we get all of the advantages of denormalization with no drawbacks.
- Following up on the first point, we have a better chance of getting it right if we can identify a set of denormalization patterns. Once they are identified, we can code up something in the framework that supports them, and now we can see the gold at the end of the rainbow.
The Foreign Key Is Our Friend (Again)
I have said many times in these essays that the foreign key is the only way to establish relationships between facts stored in different tables. We will now see how this relates to denormalizing our database.
Denormalization means introducing redundancies. In other words, a fact that was stored in only one place is now stored in two or more places. If we are going to copy a value from one table to another table, it stands to reason that there must be some logical relationship between those two tables. Since the only kind of relationship we can have between two tables is a foreign key, our denormalization patterns must in some way work with foreign keys.
The First Pattern: FETCH
Consider a shopping cart that has a column "sku" and another column "price." Most programmers lay out these tables and write some code that copies the price from the ITEMS table to the ORDER_LINES table. I call this pattern the "FETCH" because the price is FETCHed from the ITEMS table and written into the ORDER_LINES table.
Most programmers code up FETCH operations all over the place and do not ever realize they are denormalizing. I think this pattern is just so natural that most of us never think about it. If you examine your database applications you will likely see that you are doing this all over the place.
In order to get this pattern to operate correctly, your framework must make sure, at the very least, that the SKU is not null when an INSERT is made to ORDER_LINES, and that the price is copied during the INSERT. You can maintain correctness by not allowing users to change the SKU on this table; if they change their minds, they must delete a line and enter a new one. Or you can make your framework a little more flexible and execute the FETCH again if the SKU changes on an UPDATE.
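The FETCH can be sketched in a few lines of code. The following is a minimal illustration, not a framework, using Python's built-in sqlite3 module; the table and column names (ITEMS, ORDER_LINES, sku, price) come from the text, but the function name and schema details are hypothetical:

```python
import sqlite3

# The price lives canonically in ITEMS and is copied (FETCHed)
# into ORDER_LINES at INSERT time.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE items (sku TEXT PRIMARY KEY, price REAL NOT NULL);
    CREATE TABLE order_lines (
        order_id INTEGER NOT NULL,
        sku      TEXT NOT NULL REFERENCES items(sku),
        price    REAL NOT NULL,           -- the FETCHed copy
        qty      INTEGER NOT NULL
    );
""")

def insert_order_line(order_id, sku, qty):
    """INSERT an order line, FETCHing the price from ITEMS."""
    row = db.execute("SELECT price FROM items WHERE sku = ?", (sku,)).fetchone()
    if row is None:
        raise ValueError(f"unknown sku {sku!r}")   # enforces a valid, non-null SKU
    db.execute(
        "INSERT INTO order_lines (order_id, sku, price, qty) VALUES (?, ?, ?, ?)",
        (order_id, sku, row[0], qty),
    )

db.execute("INSERT INTO items VALUES ('WIDGET', 9.95)")
insert_order_line(1, "WIDGET", 3)
```

The stricter policy from the text, forbidding changes to the SKU, would simply mean providing no UPDATE path for that column at all.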
Sidebar: Is FETCH Really Denormalizing?
Die-hard relational theorists will tell you not to copy price from the items table. You are supposed to leave it where it belongs and use JOINs to pick up the price when it is needed. There are three arguments against this sort of purity.
The first practical argument is that it is horribly difficult to deal with complex calculations this way. It is far easier to copy the price when the line goes in, so you never have to "go looking" for it again.
The second practical argument is that performance tanks if you follow the die-hard relational approach. If you have to look in 6 tables every time somebody refreshes their cart you will have a much slower program than one that only has to look in one table.
But the third argument is more theoretical, and it is this: the FETCH is not really denormalizing. The idea is that when the customer places an order, your store has entered into an agreement to sell something at a particular price. That price is stored on the order and is now a fact about that order. It is not redundant, because you are not storing the SKU's generic price; you are only storing the price that this customer is going to pay on this order. If the price changes 5 minutes after the customer places the order, they will still expect to get the price as it was when they put the item in the cart, and so you are actually doing the right thing by writing it to the order.
The Second Pattern: Aggregations
The FETCH that was described above is all about copying a value from a parent table to a child table. The opposite pattern occurs when you roll values up from children to parents. These are usually done as totals (SUMs), counts, averages, minimums, and maximums.
Looking at the ORDER_LINES again, if a customer has 3 items in their cart, it is perfectly natural for most programmers to put a column "PRODUCT_TOTAL" onto their ORDERS table that holds the sum of all of the lines. This is called an aggregation, because the result in the parent table is always some operation performed on the aggregate of all of the child rows.
Aggregations are always denormalizing because they are values that could be derived from other values. To be specific, an aggregation violates third normal form because it introduces a non-key dependency: a value that depends not on the key but on values from a completely different table!
In order to make sure this value is always correct, the framework must always update the total on the parent table when any line in the child table changes. If your framework can do that successfully, your aggregations will always be correct.
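One way a framework can keep an aggregation correct is with database triggers, so the total is maintained no matter which code path touches the child table. Here is a minimal sketch using sqlite3; the schema is a hypothetical simplification (only the columns the aggregation needs), and a full version would also need an UPDATE trigger:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (
        order_id      INTEGER PRIMARY KEY,
        product_total REAL NOT NULL DEFAULT 0   -- the denormalized aggregate
    );
    CREATE TABLE order_lines (
        order_id       INTEGER NOT NULL REFERENCES orders(order_id),
        extended_price REAL NOT NULL
    );

    -- Keep the parent total correct whenever a child row comes or goes.
    CREATE TRIGGER lines_ins AFTER INSERT ON order_lines BEGIN
        UPDATE orders
           SET product_total = product_total + NEW.extended_price
         WHERE order_id = NEW.order_id;
    END;
    CREATE TRIGGER lines_del AFTER DELETE ON order_lines BEGIN
        UPDATE orders
           SET product_total = product_total - OLD.extended_price
         WHERE order_id = OLD.order_id;
    END;
""")

db.execute("INSERT INTO orders (order_id) VALUES (1)")
db.execute("INSERT INTO order_lines VALUES (1, 29.85)")
db.execute("INSERT INTO order_lines VALUES (1, 10.00)")
```

Because the maintenance lives in the database rather than the application, ad-hoc scripts and bulk loads cannot silently leave the total stale.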
The Third Pattern: EXTEND
The first two patterns we saw dealt with foreign keys. The first pattern, the FETCH, involves values travelling "downward" on a foreign key from parent to child. The second pattern involves values travelling "upward" on a foreign key from child to parent. The third and final denormalizing pattern involves calculated values within a row.
The example at the beginning of this essay was the column EXTENDED_PRICE, which holds the value of PRICE * QTY. This is an EXTEND operation, because it extends a row by adding a new redundant value. It is denormalizing because it violates third normal form: it introduces a value that depends not on a candidate key but on other non-key columns in the same row.
If you want to make sure your EXTENDs are always correct, then you need a framework that will always update the calculation when either of its dependent values changes.
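Triggers can enforce this too, firing on INSERT and on any UPDATE of the dependent columns. The sketch below uses sqlite3 with a hypothetical one-table schema; the trigger names are invented:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE order_lines (
        line_id        INTEGER PRIMARY KEY,
        price          REAL NOT NULL,
        qty            INTEGER NOT NULL,
        extended_price REAL                   -- the redundant EXTEND column
    );

    -- Recompute the EXTEND whenever either dependency is written.
    CREATE TRIGGER ext_ins AFTER INSERT ON order_lines BEGIN
        UPDATE order_lines SET extended_price = price * qty
         WHERE line_id = NEW.line_id;
    END;
    CREATE TRIGGER ext_upd AFTER UPDATE OF price, qty ON order_lines BEGIN
        UPDATE order_lines SET extended_price = price * qty
         WHERE line_id = NEW.line_id;
    END;
""")

db.execute("INSERT INTO order_lines (price, qty) VALUES (9.95, 3)")
db.execute("UPDATE order_lines SET qty = 4 WHERE line_id = 1")
```

Note that the second trigger lists only price and qty in its OF clause, so writing extended_price itself does not re-fire it.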
In describing the three denormalizing patterns above, I have explained what you need to make sure each one is performed successfully. There is a final requirement to keeping all of this correct, which is that the operations must be performed in the proper order.
Considering the shopping cart again, in particular the ORDER_LINES table, these three operations must occur in this order:
- The PRICE is FETCHed
- The EXTENDED_PRICE is calculated as an EXTEND
- The ORDERS table's PRODUCT_TOTAL value is adjusted
Your framework must have a reasonable way to make sure that the operations are performed in the correct order, or they will not give the correct result. As a rule of thumb, in most systems the FETCHes come first, followed by the EXTENDs, and then the aggregations.
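Putting the three patterns together, a single write operation can apply them in the required order. This is an illustrative sketch only, with a hypothetical `add_line` function and a schema reduced to the columns the essay discusses:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE items (sku TEXT PRIMARY KEY, price REAL NOT NULL);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         product_total REAL NOT NULL DEFAULT 0);
    CREATE TABLE order_lines (order_id INTEGER, sku TEXT, qty INTEGER,
                              price REAL, extended_price REAL);
""")

def add_line(order_id, sku, qty):
    """Apply the three denormalizing steps in the required order."""
    # 1. FETCH: copy the price down from the parent ITEMS row.
    (price,) = db.execute("SELECT price FROM items WHERE sku = ?", (sku,)).fetchone()
    # 2. EXTEND: calculate the redundant in-row value.
    extended = price * qty
    db.execute("INSERT INTO order_lines VALUES (?, ?, ?, ?, ?)",
               (order_id, sku, qty, price, extended))
    # 3. AGGREGATION: roll the new value up to the parent ORDERS row.
    db.execute("UPDATE orders SET product_total = product_total + ? WHERE order_id = ?",
               (extended, order_id))

db.execute("INSERT INTO items VALUES ('WIDGET', 2.50)")
db.execute("INSERT INTO orders (order_id) VALUES (1)")
add_line(1, "WIDGET", 4)
```

Reversing any two of the numbered steps would write a stale or missing value, which is exactly the failure the ordering rule prevents.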
Meta-data can be a big help here. When I first contemplated these patterns about four years ago, it occurred to me that they could all be stored as formulas in the basic description of the database, and that a code generator would sequence them for me and generate the code, so that the operations would always occur in the correct order. I wrote the basic system in the fall of 2004 and have found it to work extremely well ever since. In my personal opinion, this is the only way to reliably handle these patterns.
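The essay does not show the generator itself, but the sequencing idea can be sketched independently: store each derived column as a formula plus its dependencies, then let a topological sort produce the execution order. Everything here (the FORMULAS table, the column names, the single-row simplification of the aggregate) is a hypothetical illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical meta-data: each derived column is a formula plus the
# columns it depends on.  A topological sort then yields a correct
# order (FETCHes first, EXTENDs next, aggregations last) automatically.
FORMULAS = {
    "price":          {"deps": set(),               # FETCHed from ITEMS
                       "calc": lambda r: r["item_price"]},
    "extended_price": {"deps": {"price"},           # EXTEND
                       "calc": lambda r: r["price"] * r["qty"]},
    "product_total":  {"deps": {"extended_price"},  # aggregation, simplified
                       "calc": lambda r: r["extended_price"]},  # to one line
}

def run_formulas(row):
    """Evaluate every derived column in dependency order."""
    order = TopologicalSorter({c: f["deps"] for c, f in FORMULAS.items()})
    for col in order.static_order():
        row[col] = FORMULAS[col]["calc"](row)
    return row

row = run_formulas({"item_price": 9.95, "qty": 2})
```

Because the ordering is derived from the declared dependencies rather than hand-coded, adding a new calculated column cannot silently break the sequence.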
Conclusion: Denormalization Also Follows Patterns
A fully normalized database makes it easy to get data in correctly, but makes it difficult to get it out. Denormalizing is the process of taking a normalized database and deliberately introducing redundancies to improve query writing, performance and correctness. Not surprisingly, denormalization has its own patterns. Two of these follow the foreign key, and the third one works inside of a single row. If you follow these patterns and fashion your framework to keep them correct, you get all of the benefits of denormalization without the concern for bad data.