Sunday, October 19, 2008

The Argument For Denormalization

There are links to related essays on normalization and denormalization at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Non-normalized, Normalized and Denormalized

A non-normalized database is a disorganized one, where nobody has bothered to work out where the facts should be stored. It is like a stack of paper files that has been tossed down the stairs. We are not interested in non-normalized databases.

A normalized database has been organized so that each fact is stored in exactly one place (2NF and higher) and no more than one fact is stored in each place (1NF). In a normalized database there is a place for everything and everything is in its place.

A denormalized database is a normalized database that has had redundancies deliberately re-introduced for some practical gain.

Most denormalization means adding columns to tables that provide values you would otherwise have to calculate as needed. Values are copied from table to table, calculations are made within a row, and totals, averages and other aggregations are made between child and parent tables.
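
To make this concrete, here is a minimal sketch of those three patterns against a hypothetical orders schema (the tables and columns, such as items, orders, order_lines, unit_price, extended_price and order_total, are all invented for illustration):

    -- 1) Copied value: the item's price is copied onto the order line.
    UPDATE order_lines
       SET unit_price = (SELECT price
                           FROM items
                          WHERE items.sku = order_lines.sku)
     WHERE order_id = 1234;

    -- 2) Within-row calculation: a value derived from columns in the same row.
    UPDATE order_lines
       SET extended_price = quantity * unit_price
     WHERE order_id = 1234;

    -- 3) Aggregation: child rows rolled up into the parent.
    UPDATE orders
       SET order_total = (SELECT SUM(extended_price)
                            FROM order_lines
                           WHERE order_lines.order_id = orders.order_id)
     WHERE order_id = 1234;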

Related Essays

If you are a first-time reader of this blog, I recommend taking a look at Third Normal Form and Calculated Values and Denormalization Patterns, which cover issues related to today's post.

The Practical Problems Of Normalization

There are four practical problems with a fully normalized database, three of which I have listed before. I will list them all here for completeness:

  1. No calculated values. Calculated values are a fact of life for all applications, but a normalized database lacks them. The burden of providing calculated values must be taken up by somebody somehow. Denormalization is one approach to this, though there are others.
  2. Non-reproducible Calculations. If you do not store calculated values in your database, your application must generate them on the fly as needed. If your application changes over time, you risk not being able to reproduce prior results when the business rules drift far enough from the original.
  3. Join Jungles. When each fact is stored in exactly one place, you may find it daunting to pull together everything needed for a certain query. A query joining 4, 5, 7 or even 12 tables may be required for something the end-user considers trivial and easy (there is a sketch of such a query after this list). Such queries are hard to code, hard to debug, and dangerous to alter.
  4. Performance. When you face a JOIN jungle you almost always face performance problems. A JOIN is a very expensive operation compared to a single-table read, and the more JOINs you have, the worse it gets.
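
To see how quickly problem 3 can arise, here is a hedged sketch of what a request the user considers simple can look like against a fully normalized schema; every table and column name here is hypothetical:

    -- "Show each customer's order lines with item and tax detail"
    -- already takes a five-table JOIN when every fact lives in one place.
    SELECT c.customer_name,
           o.order_date,
           i.description,
           ol.quantity,
           ol.quantity * i.price              AS extended_price,
           ol.quantity * i.price * t.tax_rate AS tax
      FROM customers   c
      JOIN orders      o  ON o.customer_id = c.customer_id
      JOIN order_lines ol ON ol.order_id   = o.order_id
      JOIN items       i  ON i.sku         = ol.sku
      JOIN tax_rates   t  ON t.tax_code    = c.tax_code;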

The Convenience Argument

The convenience argument addresses the first problem listed above, no calculated values. When calculated values are generated and added to tables, it is far easier for downstream programmers (including members of the customer's IT department) to generate their own reports and ad-hoc queries. It is also much easier for members of the original team to generate display pages and reports.

This convenience is not a result of the simple presence of the calculated values. The convenience stems from the fact that the downstream programmers do not have to get involved in code that generates or calculates the values. They do not have to know anything about the API, the language the app was written in, or anything else; they just have to pull the data they need.

This convenience goes beyond the programmers to semi-technical users who may want to use their favorite third-party reporting tool (like Crystal Reports) to query the database. If your application API will not work with their favorite tool (or if you don't have an API), then you have a disappointed customer. But if the data is right there in tables, they can use pretty much anything.

At this point you may be saying, sure, that's fine, but views get all of this done without denormalizing. That is true, but as we go through the next three arguments we will see why denormalizing often wins out over views.
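
For comparison, a view can deliver the same convenience without storing anything; a minimal sketch, reusing the hypothetical tables above:

    -- The view hides the JOIN and the calculation from downstream users,
    -- but the work is still performed on every read.
    CREATE VIEW order_lines_ext AS
    SELECT ol.order_id,
           ol.sku,
           ol.quantity,
           i.price               AS unit_price,
           ol.quantity * i.price AS extended_price
      FROM order_lines ol
      JOIN items       i ON i.sku = ol.sku;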

The Stability Argument

Every healthy computer program changes and grows as new users and customers make use of it. During this process it is inevitable that later customers will request significant changes to very basic functions that were coded early on and are considered stable. When this happens the programmers have the daunting task of providing the original functionality unchanged for established customers, while providing the new functionality for the newer customers.

Denormalizing can help here. When derived values are calculated during write operations and put directly into the database, they can basically stay there forever unchanged. When a significant new version brings newer code to older users, there is no need to fear that an invoice printed last week will suddenly come out with different numbers.
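
One way to freeze values at write time is a trigger that fills in the derived columns as each row is saved. Here is a rough PostgreSQL sketch, with all table, column and function names hypothetical:

    -- The derived columns are computed once, when the row is inserted,
    -- and never recalculated afterward, so old invoices keep their old numbers.
    CREATE OR REPLACE FUNCTION order_lines_calc() RETURNS trigger AS $$
    BEGIN
        NEW.unit_price     := (SELECT price FROM items WHERE sku = NEW.sku);
        NEW.extended_price := NEW.quantity * NEW.unit_price;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER order_lines_calc_bi
        BEFORE INSERT ON order_lines
        FOR EACH ROW EXECUTE PROCEDURE order_lines_calc();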

There remains, of course, the fact that a bug in this whole effort means future calculations are wrong, and the worst case is when a bug gets out to production and generates bad calculated values. When this happens you face the prospect of fixing bad data on a live system. This is definitely my least favorite thing to do.

The Simple Queries Argument

The third problem listed above is JOIN jungles: queries that involve so many JOINs that they become impractical to write, difficult to debug, and dangerous to change.

When you denormalize a database by copying values around between parent and child tables, you reduce the number of JOINs that are required. Very obvious examples include things like copying an item's price onto an order_lines table when a customer puts an item in their cart. Each time you copy a fact from one table to another, you eliminate the need for a JOIN between those two tables. Each eliminated JOIN means a simpler query that is easier to get right the first time, easier to debug, and easier to keep correct when changed.
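
With the price copied onto order_lines, the five-table example sketched earlier loses a JOIN; again, all names are hypothetical:

    -- unit_price and extended_price now live on order_lines,
    -- so the items table no longer has to be joined in at all.
    SELECT c.customer_name,
           o.order_date,
           ol.sku,
           ol.quantity,
           ol.extended_price
      FROM customers   c
      JOIN orders      o  ON o.customer_id = c.customer_id
      JOIN order_lines ol ON ol.order_id   = o.order_id;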

This argument also goes directly back to the convenience argument. If that huge customer you just landed is happy to hear that they can use Crystal Reports to generate reports, you may still face disappointment when they find that reports which "ought to be simple" involve, from their perspective, "too many tables".

The Performance Argument

The final argument proceeds from our fourth problem listed above. Normalized databases require a lot more JOINs than denormalized databases, and JOINs are very expensive. This means that, overall, any operation that reads and presents data will be more expensive in a normalized database than in a denormalized one.

Once we reduce the JOINs by copying data between tables, we end up improving performance because we need fewer JOINs to retrieve the same number of facts.

Denormalization is not the only way to get the convenience of copied values and calculated values. Views and materialized views are the most often mentioned alternatives. The choice between denormalizing and using views has a lot to do with the Pay Me Now or Pay Me Later decision.
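
As a point of comparison, a materialized view trades storage for read speed in much the same way, but pushes the cost to refresh time instead of write time; a hedged sketch in Oracle-style syntax, with hypothetical names:

    -- The JOIN and the aggregate are computed when the materialized view
    -- is refreshed, not on every read.
    CREATE MATERIALIZED VIEW order_totals AS
    SELECT o.order_id,
           o.order_date,
           SUM(ol.extended_price) AS order_total
      FROM orders      o
      JOIN order_lines ol ON ol.order_id = o.order_id
     GROUP BY o.order_id, o.order_date;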

Conclusion

Denormalization is not an absolute: it is not one of those things that all wise experienced programmers always do, and it is not something that only fools ignore. The four arguments listed here have guided me well in deciding when to denormalize (and when not to), and I hope that they are of some benefit to you when you face the same decisions.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.


14 comments:

Martand Joshi said...

Hi Ken,

Nice post and good food for thought while designing the database and programming.

Regards,
Martand Joshi

Noons said...

"JOINs are very expensive"


erm........

NO! Fundamentally wrong!

KenDowns said...

Noons: Facts? Examples?

Anonymous said...

I would add a fifth argument which is..."I'm a developer why the hell do you have me working in the database...hire a DBA damn it!!!"

Chase Saunders said...

Someone asked for backup on noons' comment. Based on my readings, I think he is correct.

Date and Darwen, The Third Manifesto guys (http://www.thethirdmanifesto.com), have shown not only that joins are not expensive, but that any RDBMS in which a denormalized operation is faster than a join operation is a flawed RDBMS. I was initially skeptical but they are convincing if you are willing to get deep enough in to follow their arguments. You do have to make an effort to compare apples to apples though... obviously reading from a tiny single table is going to beat selecting more total information from a join.

With all respect to the author (especially for writing the normalization article FIRST), claiming joins are inefficient is something you should do your homework on. I'd encourage you to at least read up on the critics of this position, starting with Date and Darwen articles accessible from the above-mentioned site. Many consider them among the foremost experts on RDBMS systems in academia.

A few additional comments. First, an RDBMS's failure to support calculated columns has nothing to do with normalization. For example, the newer versions of SQL Server have it, and you can use it in a flat table or in related tables. I have seen others make similar comments, which seem to relate to confusing denormalization with caching. Even if your RDBMS lacks calculated fields, caching the value of a calculation is NOT denormalization. On both counts, this criticism does not belong in the list.

Second, I am happy NOT to see an old saw of the "joins are bad" crowd in this post... the assertion that storing, say, the then-current sales tax rate with each transaction is a form of denormalization. It isn't, as the sales tax rate is functionally dependent on the transaction (Fifth Normal Form requires it to be there, as I see it).

KenDowns said...

Chase:
Thanks for the detailed comment.

W/respect to Date and Darwen, I would have to put them in the theoretical category, and compare what they have shown *ought* to be possible with what I have seen. This is not to disparage their work, only to say that they are not coding up db servers.

Not sure what you mean by "homework". Again there is theory and there are the products we have today with their strengths and weaknesses. My sphere and the sphere of this blog is application development using available db products, informed by careful long-term study of the theory behind the products, and tempered by my actual need to make a living and deliver products customers will accept. I have found my systems are more robust and perform better when JOINs are minimized.

SQL Server has had limited calculations since at least 2004 (the last time I used it). At that time it supported only calculations within a row whose formulas could not even refer to other calculated columns. I have since moved over to PostgreSQL and cannot comment on later versions of SQL Server.

With respect to the "joins are bad" crowd, did not know I was part of a crowd :) Just relating my experience that for *many* reasons the overall application is improved if JOINs are minimized.

Finally, w/respect to your example of sales tax code, I would add things like price of an item, and many others. It seems we likely agree on this point, as I stated in my article on denormalization patterns written some weeks back: http://database-programmer.blogspot.com/2008/04/denormalization-patterns.html

Anonymous said...

@Ken - In that case, if you haven't used SQL Server since 2004 and have only been using PostgreSQL since then, you would be very well served by making it clear in your post that your arguments only apply to PostgreSQL and are based solely on your personal experience with it, lest readers get the impression that you are communicating axioms that hold true for all RDBMS systems and, almost more importantly, all users of them.

As for my personal experiences with (de)normalization, in every client environment in which there have been strong arguments being made for denormalization there have always been serious issues already in play with regards to the fundamental database schema (a common one is attempting to use one schema for all three of online transaction processing, order management, and reporting, for example) and the way in which the application or applications use (and usually abuse) it. Why treat the symptom (hey, let's denormalize!) when you can cure the disease (the schema and its usage are crap!)?

Chase Saunders said...

Ken

I don't think you can write Date and Darwen off as theoreticians, because they have done quite extensive benchmarking on this. From what I understand, for the small family of top-performing databases like Oracle and SQL Server (see http://www.dbms2.com/2008/01/22/mid-range-database-management/ for what I mean by this) it's not possible to produce a benchmark showing denormalization that improves performance.

I'm a bit surprised to hear that PostgreSQL might be in this camp, as it's pretty well optimized, and as Date and Darwen have shown, it's always poorly optimized RDBMS systems that lead to this result. I haven't seen a benchmark on this, I'm wondering if you've set up anything formal or if your report is more anecdotal? I know these sorts of performance evaluations are notoriously tricky to do.

I have seen several benchmarks showing denormalization can improve performance in MySQL, which doesn't surprise me at all. But given the update anomaly, delete anomaly, etc., reviewed in your previous post, I doubt I'd ever want to pay that price if I didn't have some major cash on the line.

Despite the flak I'm giving you over this, I think you have a great blog and I've subscribed. Keep up the good work.

Noons said...

Ken, instead of regurgitating your supposed "extensive" experience as proof of any of your wild claims, why don't you just get informed?

To dismiss the seminal work and the foundation of relational databases as "theories" only shows you as another of the plentiful idiotic and ignorant gits who populate the current development landscape, who consider themselves "experts" because they once created a "web site".

Sorry dude, but you don't have a clue.

Good bye.

KenDowns said...

Noons: Wonderful comment. I could not have provided a better example of how utterly unhelpful the die-hard relational theorists can be.

Anonymous said...

@noons and saunders:
I must second the author. In my experience JOINs among several huge tables (many thousands or millions of rows each) are extremely slow, even with optimized and correctly indexed databases. On the other hand, denormalized tables containing the same amount of data are dramatically faster.
If this were not true, there would be no need of materialized views (or snapshots).
For further info, you could try asking this question to a data warehouse architect with experience on live, real databases. BTW, my experience is based on large Oracle and SQL Server databases, not MySQL.

Anonymous said...

fine but vague...examples?

Ed said...

Thanks for this write-up. I agree with the simplicity and stability arguments. I'm working in Oracle.

Anonymous said...

With all due respect, the only valid argument you gave for denormalization is the ability to see previous calculations. Even that one is flawed, because a change in data that affects a calculation offers no real value when compared to the current calculation. 1+2=3. If we change the added values we might end with a sum of 10. What can you derive by comparing the prior sum of 3 to the current sum of 10? (only that one or both of the added values have changed such that there is a difference of 7 between the two sums - not helpful) Perhaps, you would add an audit table where you could recalculate previous sums and also have access to the values that produced them.

There are database objects called views that can store these complex joins or calculations so they don't need to be rewritten over and over again - the logic is centralized, supporting minimal change when needed.

As per your point on speed - prove it! Unless you're dealing with massive amounts of data, this is negligible. The loss of data integrity and the time required to debug and fix the issues caused by such changes far outweigh any gains you'll make.