There are links to related essays on normalization and denormalization at the bottom of this post.
Non-normalized, Normalized and Denormalized
A nonnormalized database is a disorganized one, where nobody has bothered to work out where the facts should be stored. It is like a stack of paper files that has been tossed down the stairs. We are not interested in non-normalized databases.
A normalized database has been organized so that each fact is stored in exactly one place (2nf and greater) and no more than one fact is stored in each place (1nf). In a normalized database there is a place for everything and everything is in its place.
A denormalized database is a normalized database that has had redundancies deliberately re-introduced for some practical gain.
Most denormalizing means adding columns to tables that provide values you would otherwise have to calculate as needed. Values are copied from table to table, calculations are made within a row, and totals, averages and other aggregrations are made between child and parent tables.
The Practical Problems Of Normalization
There are four practical problems with a fully normalized database, three of which I have listed before. I will list them all here for completeness:
- No calculated values. Calculated values are a fact of life for all applications, but a normalized database lacks them. The burden of providing calculated values must be taken up by somebody somehow. Denormalization is one approach to this, though there are others.
- Non-reproducible Calculations. If you do not store calculated values in your database, your application must generate them on the fly as needed. If your application changes over time, you risk not being able to reproduce prior results when the business rules drift far enough from the original.
- Join Jungles. When each fact is stored in exactly one place, you may find it daunting to pull together everything needed for a certain query. A query joining 4,5, 7 or even 12 tables may be required for something the end-user considers trivial and easy. Such queries are hard to code, hard to debug, and dangerous to alter.
- Performance. When you face a JOIN jungle you almost always face performance problems. A JOIN is a very expensive operation compared to a single-table read, and the more JOINs you have the worse it gets.
The Convenience Argument
The convenience argument addresses the first problem listed above, no calculated values. When calculated values are generated and added to tables, it is far easier for downstream programmers (including members of the customer's IT department) to generate their own reports and ad-hoc queries. It is also much easier for members of the original team to generate display pages and reports.
This convenience is not a result of the simple presence of the calculated values. The convenience stems from the fact that the downstream programmers do not have to get involved in code that generates or calculates the values. They do not have to know anything about the API, the language the app was written in, or anything else, they just have to pull the data they need.
This convenience goes beyond the programmers to semi-technical users who may want to use their favorite 3rd party reporting tool (like Crystal Reports) to query the database. If your application API will not work with their favorite tool (or if you don't have an API), then you have a dissappointed customer. But if the data is right there in tables they can pretty much use anything.
At this point you may be saying, sure, that's fine, but views get all of this done without denormalizing. That is true, but when we go on to the next 3 arguments we will see something of why denormalizing often wins out over views.
The Stability Argument
Every healthy computer program changes and grows as new users and customers make use of it. During this process it is inevitable that later customers will request significant changes to very basic functions that were coded early on and are considered stable. When this happens the programmers have the daunting task of providing the original functionality unchanged for established customers, while providing the new functionality for the newer customers.
Denormalizing can help here. When derived values are calculated during write operations and put directly into the database, they can basically stay there forever unchanged. When a significant new version brings newer code to older users, there is no need to fear that that an invoice printed last week will suddenly come out with different numbers.
There still remains of course the fact that a bug in this whole effort means future calculations are wrong, and the worst case is when a bug gets out to production and generates bad calculated values. When this happens you face the prospect of fixing bad data on a live system. This is definitely my least favorite thing to do.
The Simple Queries Argument
The third problem listed above is JOIN jungles: queries that involve so many JOINs that they become impractical to write, difficult to debug, and dangerous to change.
When you denormalize a database by copying values around between parent and child tables, you reduce the number of JOINs that are required. Very obvious examples include things like copying an items price onto an order_lines table when a customer puts an item in their cart. Each time you copy a fact from one table to another, you eliminate the need for a JOIN between those two tables. Each eliminated JOIN is a simpler query that is easier to get right the first time, easier to debug, and easier to keep correct when changed.
This argument also goes directly back to the convenience argument. If that huge customer you just landed is happy to hear that they can use Crystal Reports to generate reports, you may still face disappointment when they find the reports involve "too many tables" from their perspective for reports that "ought to be simple".
The Performance Argument
The final argument proceeds from our fourth problem listed above. Normalized databases require a lot more JOINs than denormalized databases, and JOINs are very expensive. This means that, overall, any operation that reads and presents data will be more expensive in a normalized database than a denormalized one.
Once we reduce the JOINs by copying data between tables, we end up improving performance because we need fewer JOINs to retrieve the same number of facts.
Denormalization is not the only way to get the convenience of copied values and calculated values. Views and materialized views are the most often mentioned alternatives. The choice between denormalizing and using views has a lot to do with the Pay Me Now or Pay Me Later decision.
Denormalization is not an absolute: it is not one of those things that all wise experienced programmers always do, and it is not something that only fools ignore. The four arguments listed here have guided me well in deciding when to denormalize (and when not to), and I hope that they are of some benefit to you when you face the same decisions.
The normalization essays on this blog are:
- Revisiting Normalization and Denormalization.
- Pay Me Now Or Pay Me Later
- The Argument for Normalization
- First Normal Form
- Second Normal Form
- Third Normal Form and Calculated Values
- The Argument for Denormalization (this essay)
- Denormalization Patterns
- Keeping Denormalized Values Correct
- Triggers, Encapsulation and Composition
- The Data Dictionary and Calculations, Part 1
- The Data Dictionary and Calculations, Part 2