Sunday, April 27, 2008

Denormalization Patterns

Welcome to the Database Programmer! The main theme of these essays is that your applications will be leaner, faster, and easier to write and maintain if you understand how databases work.

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

The Non-normalized Database

A denormalized database is one that has been meticulously normalized to eliminate redundancies, only to have redundancies deliberately put back in to meet other needs. That is the kind of database we are going to talk about today.

By contrast, a database is "non-normalized" if nobody ever bothered to normalize it. This is what you get when the programmer says, "I'm not concerned with the table structure yet, I'm working on the code first." In my experience this is something like saying, "I'm not worried about the plan for this work bench I'm building, I'm just going to throw the wood on the table saw and start cutting." We are not interested in non-normalized databases.

The Fully Normalized Database

The fully normalized database will have few if any redundancies; each fact will be stored in exactly one place. This database has one big advantage: write actions (INSERT, UPDATE, and DELETE) will be very easy to code up correctly. The application code for a fully normalized database will be smooth, simple, easy to write, and easy to maintain.

But decades of experience have shown that fully normalized databases have a few drawbacks, which is why practical-minded programmers always end up denormalizing their designs. The particular problems are:

  • No calculated values. If you have a shopping cart with the columns "price" and "qty", it may seem natural to put in a column "extended_price" that holds price * qty. This is actually forbidden by third normal form, though almost everybody puts it in anyway.
  • Non-reproducible Calculations. Continuing from the point above, most shopping carts or ordering systems have many calculations that are far more complicated than simply doing price * qty. Often a calculation may depend on a dozen or more preceding calculations. In complex situations like this, it is all too common that changes in the program mean an invoice printed after an upgrade produces different numbers than the same invoice printed before the upgrade. You don't want that phone call!
  • JOIN Jungles. A fully normalized database will have information scattered in many different tables. Though this makes it easy to get it right going in, it can make it terribly difficult to get seemingly simple combinations of data back out.

The Denormalized Database

As I said above, the denormalized database is one that was first normalized by having its redundancies removed, only to have some redundancies deliberately put back in. This can solve all three of the problems listed above. There is a cost of course. When we denormalize we have to keep two things in mind:

  1. We must dream up a way to keep the redundant values correct. If we can pull this off we get all of the advantages of denormalization with no drawbacks.
  2. Following up on the first point, we have a better chance of getting it right if we can identify a set of denormalization patterns. Once they are identified, we can code up something in the framework that supports them, and now we can see the gold at the end of the rainbow.

The Foreign Key Is Our Friend (Again)

I have said many times in these essays that the foreign key is the only way to establish relationships between facts stored in different tables. We will now see how this relates to denormalizing our database.

Denormalization means introducing redundancies. In other words, a fact that was stored in only one place is now stored in two or more places. If we are going to copy a value from one table to another table, it stands to reason that there must be some logical relationship between those two tables. Since the only kind of relationship we can have between two tables is a foreign key, our denormalization patterns must in some way work with foreign keys.

The First Pattern: FETCH

Consider a shopping cart whose lines each have a column "sku" and a column "price." Most programmers lay out an ITEMS table and an ORDER_LINES table and write some code that copies the price from the ITEMS table to the ORDER_LINES table. I call this pattern the "FETCH" because the price is FETCHed from the ITEMS table and written into the ORDER_LINES table.

Most programmers code up FETCH operations all over the place and do not ever realize they are denormalizing. I think this pattern is just so natural that most of us never think about it. If you examine your database applications you will likely see that you are doing this all over the place.

In order to get this pattern to operate correctly, your framework must make sure at the very least that the SKU is not null when an INSERT is made to ORDER_LINES, and that the price is copied during the INSERT. You can maintain correctness by not allowing users to change the SKU on this table; if they change their minds they must delete a line and enter a new one. Or, you can make your framework a little more flexible and execute the FETCH again if the SKU changes on an UPDATE.
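As a concrete illustration, here is a minimal sketch of the FETCH pattern using SQLite triggers driven from Python. The ITEMS/ORDER_LINES layout follows the essay, but the exact column names and trigger style are my own assumptions, not a prescription:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (sku TEXT PRIMARY KEY, price NUMERIC NOT NULL);
CREATE TABLE order_lines (
    order_id INTEGER NOT NULL,
    sku      TEXT NOT NULL REFERENCES items (sku),
    qty      INTEGER NOT NULL,
    price    NUMERIC            -- denormalized copy, maintained by triggers
);

-- FETCH on INSERT: copy the current item price onto the new line
CREATE TRIGGER ol_fetch_ins AFTER INSERT ON order_lines
BEGIN
    UPDATE order_lines
       SET price = (SELECT i.price FROM items i WHERE i.sku = NEW.sku)
     WHERE rowid = NEW.rowid;
END;

-- FETCH again if the SKU changes on an UPDATE
CREATE TRIGGER ol_fetch_upd AFTER UPDATE OF sku ON order_lines
BEGIN
    UPDATE order_lines
       SET price = (SELECT i.price FROM items i WHERE i.sku = NEW.sku)
     WHERE rowid = NEW.rowid;
END;
""")

conn.execute("INSERT INTO items VALUES ('A100', 9.95)")
conn.execute("INSERT INTO order_lines (order_id, sku, qty) VALUES (1, 'A100', 2)")
print(conn.execute("SELECT price FROM order_lines").fetchone()[0])  # prints 9.95
```

The same idea can live in application code or a framework layer instead of triggers; the point is only that the copy happens automatically on INSERT and is repeated if the SKU changes.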

Sidebar: Is FETCH Really Denormalizing?

Die-hard relational theorists will tell you not to copy price from the items table. You are supposed to leave it where it belongs and use JOINs to pick up the price when it is needed. There are three arguments against this sort of purity.

The first practical argument is that it is horribly difficult to deal with complex calculations this way. It is far easier to copy the price when the line goes in, so you never have to "go looking" for it again.

The second practical argument is that performance tanks if you follow the die-hard relational approach. If you have to look in 6 tables every time somebody refreshes their cart you will have a much slower program than one that only has to look in one table.

But the third argument is more theoretical, and it is this: the FETCH is not really denormalizing. The idea is that when the customer places an order your store has entered into an agreement to sell something at a particular price. That price is stored on the order and is now a fact about that order. It is not redundant because you are not storing the SKU's generic price, you are only storing the price that this customer is going to pay on this order. If the price changes 5 minutes after the customer places the order, they will expect to get the price as it was when they put the item in the cart, and so you are actually doing the right thing by writing it to the order.

The Second Pattern: Aggregations

The FETCH that was described above is all about copying a value from a parent table to a child table. The opposite pattern occurs when you roll up values from children to parents. These are usually done as totals (SUMS), counts, averages, minimums, and maximums.

Looking at ORDER_LINES again, if a customer has 3 items in their cart, it is perfectly natural for most programmers to put a column "PRODUCT_TOTAL" onto the ORDERS table that holds the sum of all of the lines. This is called an aggregation, because the result in the parent table is always some operation performed on the aggregation of all of the child rows.

Aggregations are always denormalizing because they are values that could be derived from other values. To be specific, an aggregation violates third normal form because it introduces a non-key dependency: a value that is dependent not on the key but on values from a completely different table!

In order to make sure this value is always correct, the framework must always update the total on the parent table when any line in the child table changes. If your framework can do that successfully, your aggregations will always be correct.
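A sketch of that rule, again as SQLite triggers driven from Python (table names are illustrative): any INSERT, UPDATE, or DELETE on the child re-aggregates the parent's total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id      INTEGER PRIMARY KEY,
    product_total NUMERIC NOT NULL DEFAULT 0   -- denormalized aggregation
);
CREATE TABLE order_lines (
    order_id       INTEGER NOT NULL REFERENCES orders (order_id),
    extended_price NUMERIC NOT NULL
);

-- Any change to a child line re-aggregates the parent's total
CREATE TRIGGER ol_agg_ins AFTER INSERT ON order_lines
BEGIN
    UPDATE orders SET product_total =
        (SELECT COALESCE(SUM(extended_price), 0)
           FROM order_lines WHERE order_id = NEW.order_id)
     WHERE order_id = NEW.order_id;
END;
CREATE TRIGGER ol_agg_del AFTER DELETE ON order_lines
BEGIN
    UPDATE orders SET product_total =
        (SELECT COALESCE(SUM(extended_price), 0)
           FROM order_lines WHERE order_id = OLD.order_id)
     WHERE order_id = OLD.order_id;
END;
CREATE TRIGGER ol_agg_upd AFTER UPDATE ON order_lines
BEGIN
    UPDATE orders SET product_total =
        (SELECT COALESCE(SUM(extended_price), 0)
           FROM order_lines WHERE order_id = NEW.order_id)
     WHERE order_id = NEW.order_id;
END;
""")

conn.execute("INSERT INTO orders (order_id) VALUES (1)")
conn.executemany("INSERT INTO order_lines VALUES (1, ?)",
                 [(19.5,), (5.25,), (3.25,)])
print(conn.execute(
    "SELECT product_total FROM orders WHERE order_id = 1").fetchone()[0])  # prints 28.0
```

A production version would also handle an UPDATE that moves a line to a different order (re-aggregating the OLD parent as well); this sketch keeps only the common cases.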

The Third Pattern: EXTEND

The first two patterns we saw dealt with foreign keys. The first pattern, the FETCH, involves values travelling "downward" on a foreign key from parent to child. The second pattern involves values travelling "upward" on a foreign key from child to parent. The third and final denormalizing pattern involves calculated values within a row.

The example at the beginning of this essay was the column EXTENDED_PRICE, which holds the value of PRICE * QTY. This is an EXTEND operation, because it extends a row by adding a new redundant value. This is denormalizing because it violates third normal form: it introduces a value that is not dependent on any candidate key.

If you want to make sure your EXTENDs are always correct, then you need a framework that will always update the calculation when either of its dependent values changes.
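Sketched the same way (names are illustrative), the EXTEND can be maintained by recomputing the product whenever either dependency is written:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_lines (
    sku   TEXT NOT NULL,
    price NUMERIC NOT NULL,
    qty   INTEGER NOT NULL,
    extended_price NUMERIC   -- EXTEND: always price * qty
);

-- Recompute the EXTEND whenever either dependency is written
CREATE TRIGGER ol_ext_ins AFTER INSERT ON order_lines
BEGIN
    UPDATE order_lines SET extended_price = NEW.price * NEW.qty
     WHERE rowid = NEW.rowid;
END;
CREATE TRIGGER ol_ext_upd AFTER UPDATE OF price, qty ON order_lines
BEGIN
    UPDATE order_lines SET extended_price = NEW.price * NEW.qty
     WHERE rowid = NEW.rowid;
END;
""")

conn.execute("INSERT INTO order_lines (sku, price, qty) VALUES ('A100', 4.25, 3)")
print(conn.execute("SELECT extended_price FROM order_lines").fetchone()[0])  # prints 12.75
```

Many current engines can also express this declaratively with a computed or generated column, e.g. `extended_price GENERATED ALWAYS AS (price * qty)`.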

Dependency Tracking

In describing the three denormalizing patterns above, I have explained what you need to make sure each one is performed successfully. There is a final requirement to keeping all of this correct, which is that the operations must be performed in the proper order.

Considering the shopping cart again, in particular the ORDER_LINES table, these three operations must occur in this order:

  1. The PRICE is FETCHed
  2. The EXTENDED_PRICE is calculated as an EXTEND
  3. The ORDERS table's PRODUCT_TOTAL value is adjusted.

Your framework must have a reasonable way to make sure that the operations are performed in the correct order, or they will not give the correct result. As a rule of thumb, in most systems the FETCHes come first, followed by the EXTENDs, and then the aggregations.

Meta-data can be a big help here. When I first contemplated these patterns about four years ago, it occurred to me that they could all be stored as formulas in the basic description of the database, and that a code generator would sequence them for me and generate the code, so that the operations would always occur in the correct order. I wrote the basic system in the fall of 2004 and have found it to work extremely well ever since. In my personal opinion, this is the only way to reliably handle these patterns.
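The sequencing itself is an ordinary topological sort over the declared dependencies. As a sketch (not the author's actual generator), Python's standard library can do it directly; the column names here are the essay's examples:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical metadata: each derived column names the columns it reads.
# A raw or FETCHed column has no derived dependencies of its own.
dependencies = {
    "order_lines.price":          [],                              # FETCH from items
    "order_lines.extended_price": ["order_lines.price"],           # EXTEND
    "orders.product_total":       ["order_lines.extended_price"],  # aggregation
}

# static_order() yields each column only after everything it depends on
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # price, then extended_price, then product_total
```

Given this chain, the FETCHed price sorts first, the EXTEND second, and the aggregation last, which matches the rule of thumb above.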

Conclusion: Denormalization Also Follows Patterns

A fully normalized database makes it easy to get data in correctly, but makes it difficult to get it out. Denormalizing is the process of taking a normalized database and deliberately introducing redundancies to improve query writing, performance and correctness. Not surprisingly, denormalization has its own patterns. Two of these follow the foreign key, and the third one works inside of a single row. If you follow these patterns and fashion your framework to keep them correct, you get all of the benefits of denormalization without the concern for bad data.

Other Posts

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

19 comments:

Mark Wilden said...

Good stuff!

When I was reading about the first "denormalization," I was (silently) screaming "but this isn't a denormalization!" Then you addressed that. :)

However, even though the price in the order line and the price in the product are different entities, they are related, and, depending on business rules, still represent one piece of data being stored redundantly in more than one place - i.e., in more than one order line for a given date.

Removing this redundancy would require keeping a history of prices for a product, to which each order line would point (implicitly by date or explicitly by foreign key).

So simply copying the price into the order line is indeed a "real" denormalization, and thus fits in well with the rest of the article. :)

///ark

Greg Jorgensen said...

Great explanation, thanks!

I often hear programmers calling for denormalizing before the database is even designed, claiming vague performance benefits. What they are really doing is choosing not to normalize at all.

I don't think there's ever a good time to not normalize. There are times where it makes sense to denormalize, though, and you sum those up nicely with good examples.

bewhite said...

I'm wondering about the reasons for such denormalisations.

Let's talk about calculated values and the complex logic you mentioned as the first and second reasons for denormalisation. The database is not living in a vacuum. It is used by other application tiers (layers). Otherwise it stores data that the end user can't use, so this data is useless. Thus you can move calculated values and complex business logic to another layer (for example, the server-side layer in web applications). You can show these values as object attributes. In this case you don't need denormalisation.

Let's talk about JOINs. I agree with you that you can boost your performance by reducing the JOIN count from 10 to 5 in some databases. For example, a query in MySQL will work much faster. Meanwhile, let's talk about the reasons for such a big number of joins. Maybe this number is a consequence of a bad DB structure? If so, then the problem is not even at the table level, but at the overall DB level. If you solve that problem of the DB then you will not need denormalisation, because the number of JOINs will be small.

Your thoughts?

Victor

KenDowns said...

bewhite: When you require calculations at read time, you require the use of your own application. If you put in the values, your customers' IT departments can generate their own reports.

Moreover, your own code tends to be much simpler because you can separate somewhat the biz logic (done at write time) from the reporting and presentation (done at read time).

As for multiple JOINs being bad table structure, the opposite is usually true, good table design leads to many tables. The first few essays in this series demonstrate how that happens.

Anonymous said...

If using Oracle, one idea is to use materialized views for everything, even with crazy joins of 10s of tables...

You could set it to refresh any way you like (on commit? every X minutes? daily)...

Anonymous said...

If multiple applications make use of the same data, storing aggregate values in the DB can solve the problem that different applications may calculate the values differently, particularly when rounding. It's easier to keep and modify the "business rule" located in one central place.

KenDowns said...

Anonymous: I could not agree more. I use a data dictionary to define the calculated values, and a "builder" to build triggers right onto the tables that perform the 3 basic calculation forms. This leaves the application free to concentrate on presentation.

Anonymous said...

Ken:

Would you be willing to discuss the concept of denormalization as it relates to SAP Dimension tables?

Regards,

Robert

KenDowns said...

Anonymous: I have very little experience in dimension tables. I have used them only in passing and never brought anything live, and I don't feel I have any particular insights to bring to bear. Sorry 'bout that.

Kathryn said...

My boyfriend and I are beginners to database construction/programming, and have some of the concepts in our heads.

We are looking to create individual websites that will employ the use of "Web 2.0" tag clouds, (or category/topic labels, if you will), but because we're both SQL newbies, we're a bit befuddled as to how to employ tags and record them in a database table while having as much normalization as possible.

Is creating a "tag"/label table in a database even possible, given the obvious redundant nature of tag/label use? E.G.: One might have individual articles written in different points of view but have a single common subject, which results in a repeated tag.

If creating a "tag" table in a database is possible, how would one go about normalizing it without creating a lot of extra data in each row (such as 'which articles each unique tag is associated with and how many times the tag is used.')

Any ideas you have are appreciated.

Thanks,
Kat ^.^

KenDowns said...

Kathryn:

1) Table of articles
2) Table of tags
3) Cross-reference tags to articles.

Build the cloud out of #3.
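A minimal sketch of that three-table layout (SQLite from Python; the table and column names are purely illustrative, not from the post):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (article_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE tags     (tag        TEXT PRIMARY KEY);
CREATE TABLE article_tags (           -- cross-reference: many-to-many
    article_id INTEGER NOT NULL REFERENCES articles (article_id),
    tag        TEXT    NOT NULL REFERENCES tags (tag),
    PRIMARY KEY (article_id, tag)
);
""")

conn.executemany("INSERT INTO articles VALUES (?, ?)",
                 [(1, "Normalization"), (2, "Keys")])
conn.executemany("INSERT INTO tags VALUES (?)", [("sql",), ("design",)])
conn.executemany("INSERT INTO article_tags VALUES (?, ?)",
                 [(1, "sql"), (1, "design"), (2, "sql")])

# The cloud is built from the cross-reference: each tag plus its usage count
cloud = conn.execute("""
    SELECT tag, COUNT(*) AS uses
      FROM article_tags
     GROUP BY tag
     ORDER BY uses DESC, tag
""").fetchall()
print(cloud)  # [('sql', 2), ('design', 1)]
```

Each tag appears once in the tags table, so there is no redundancy; the cross-reference rows are facts ("article X carries tag Y"), not repetition.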

Kathryn said...

Hi, Ken.

Thanks for the input. :-) I think I understand the idea that you're putting forth.

I'm assuming, however, based on your suggestions, that the tag words in the "tag table" have to have one unique entry each, like a dictionary, in order for there to be proper normalization (1st and 2nd normal forms, at the very least).

If this is indeed the case, then I'm still puzzled as to how to handle repetitive usage of each individual tag/label/category that is created by the (hopefully) thousands of users that would come to our websites. Because naturally, the more users there are, the more repetition there'll be. Plus, from what precious little I truly understand so far about database normalization, repetition of data is not usually considered a good thing, whether or not one has the server space to contain a lot of extra database tables that result from normalization.

Am I thinking too strictly, considering your article that I was replying to in the first place was about DE-normalization? Or am I on the right track?

I am mostly asking because both my bf and I tend to overthink things when the answers are often staring us right in the face. :-P

Thanks again for your input.

Kat ^.^

KenDowns said...

Kathryn: I would encourage you to start making up the tables and playing with some code, it is the surest defense against over planning.

Start with the simplest possible tables that have only the columns you need for precisely this issue, then play with some code. As you actually try to write code that works, the advice in this and other blogs will start to make much more sense and you will gain some traction.

Good luck!

Chandu said...

I have recently started looking into implementing tag clouds for a web site I am working on.

I had initially decided to go with the structure suggested previously (tag table, article table and cross-reference table with many-to-many relationships).

However, I am having second thoughts about this design, as it might result in a cross-reference table with very many rows, depending on the magnitude of the tags and articles.

Is there a better design for the same?
-
Chandu

KenDowns said...

@chandu: The first thing I would say is that your design is the most accurate to store the data, and the most accurate usually leads to the best code and best performance.

Database veterans refer to such designs as "thin and tall"; a well-designed database often ends up with lots of tables with few columns and many rows.

For performance, the best thing is to have a summary table that counts articles per tag. Easiest is a view, but you have to make sure your server is not querying the entire source table every time you read the view. If your server supports materialized views, all the better. If no materialized views, consider an insert trigger on the M:M table that ups the counts in the summary table. Or you can do that in application code. Or you can also have a timed job that re-runs the counts every hour or so.

You have a good design. The best performance will come from having ready counts available, and you have plenty of ways to do that. Good luck, come back and tell us how it worked out!
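The trigger-maintained summary table described above can be sketched like this (SQLite via Python; names are illustrative, and a materialized view would replace the triggers on engines that support it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article_tags (article_id INTEGER NOT NULL, tag TEXT NOT NULL,
                           PRIMARY KEY (article_id, tag));
CREATE TABLE tag_counts   (tag TEXT PRIMARY KEY, uses INTEGER NOT NULL);

-- Keep the summary current as rows enter and leave the M:M table
CREATE TRIGGER at_count_ins AFTER INSERT ON article_tags
BEGIN
    INSERT OR IGNORE INTO tag_counts (tag, uses) VALUES (NEW.tag, 0);
    UPDATE tag_counts SET uses = uses + 1 WHERE tag = NEW.tag;
END;
CREATE TRIGGER at_count_del AFTER DELETE ON article_tags
BEGIN
    UPDATE tag_counts SET uses = uses - 1 WHERE tag = OLD.tag;
END;
""")

conn.executemany("INSERT INTO article_tags VALUES (?, ?)",
                 [(1, "sql"), (2, "sql"), (1, "design")])
print(conn.execute(
    "SELECT tag, uses FROM tag_counts ORDER BY tag").fetchall())  # [('design', 1), ('sql', 2)]
```

Reading the cloud then touches only the small summary table, never the full cross-reference.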

Seun Osewa said...

"an aggregration violates third normal form because it introduces a non-key dependency - a value that is dependent not on the key but on values from a completely different table!"

I disagree. Third normal form doesn't care about dependencies outside a given relation (table).

sk8ter_boy said...

Thanks Ken,

I loved your article. I like to add this scenario too.

Take the case of a PurchaseOrder -> LineItems.

If we store the exchange rate and total value with the Purchase Order, we denormalize it.

1. The application layer needs to make sure the total of the PO is updated every time a line item is updated.

2. Concurrency. If two (or more) people commit changes to the db (update the price on existing items or add a new line item), only one person's changes must go through. This requires serialized transactions with locking. The best way forward is to consider the PO and its line items as one set and not allow users to commit just the line-item changes. Rather, let them commit the changes to the PO as one set.

Anonymous said...

Very well discussed!

For all my years of experience in using databases, I had really never concerned myself with the denormalization process, mostly because the books I read never covered it. I just read on a page or two where the author said that in some situations the use of "non-normalized" or denormalized tables might be more practical, but he never discussed it. I didn't know what he meant back then, so I thought of just learning the normalization part, and tried to follow it faithfully.
However, during the course of developing such systems, I realized that following the normalization process (especially 3NF) on all the tables tends to make retrieving data from some tables a bit more complex than it should be, and I knew that if I followed the normalization principle faithfully (3NF), it would affect the overall performance, even when retrieving just one group of related information. So even with hesitation, I tried to break the normalization process on some tables just to make things easier and faster.
At first, the thought really bothered me that some of my tables were not normalized. But throughout the use of the system, I realized that it benefited the system more when those tables were denormalized than the opposite.
It is only when I read this article that I realized that what I was really doing was denormalization all along, and that I'm on the right track. This article really cleared things up for me. :)