Sunday, November 2, 2008

Keeping Denormalized Values Correct

A normalized database stores each fact in exactly one place. This makes for very robust write operations, it is much easier to get things right on the way in. But it becomes much harder to get things out efficiently or easily, so very often we denormalize, that is, we store facts in more than one place for easier retrieval. This requires a very well thought out strategy to make sure these repeated values are always correct.

There are links to related essays on normalization and denormalization at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Review of Methods

For our example this week we will consider a shopping cart. The orders are kept in the ORDERS table and the items purchased are in the ORDER_LINES table. We have denormalized the database by keeping the value ORDER_TOTAL in the ORDERS table. Every time a line item is changed, the ORDER_TOTAL must be updated.

There are a handful of methods that are popular in the wild for doing this. Some practices emphasize programmer discipline, others seek to prevent actions that will lead to inccorrect values. Strategies also fall into application level or database level, while still others operate at the architecture level.

Personally I chose triggers about four years ago, which I will explain at the end of this essay.

The Weakest Approach: Discipline

The simplest approach is to require that programmers be made aware of all denormalized values and that they must remember if they modify a line item to update the ORDER_TOTAL. This works well enough actually for small programming teams, where there is only one or maybe two programmers, preferably sitting right next to other. Mind-reading helps here as well.

Of course this approach falls apart like a rotten burlap bag as soon as the team or the program exceeds the ability of the team to keep it all straight in their heads.

But I did not bring up this example just to ridicule it. I have found that even seasoned veteran programmers (including your humble author) will fall into the trap of trying to enforce conventions at many levels in their programming. To see why this is always a bad idea and should always be avoided, consider this hypothetical case.

Imagine a new programming language is introduced known as Super-G, which is fashionable and wonderful and everybody loves it. It has a well-thought out typing system except for one odd behavior: If you code a line that concatenates a string with a date, program execution ends with no error. The compiler does not trap for this and run-time does not tell you why it quits. You can Google for it and find out that everybody knows about it, and you just have to remember not to do that! The language's authors have no plans to fix it because nothing is wrong. The fully expec you to always remember never to concatenate strings and dates.

The example is meant to be absurd, but to reinforce that any strategy where you just have to remember is out of the running from the start. Since we would not accept this in any tool we use, we should certainly never build our own practices upon such sand, and certainly we would not count on it to keep denormalized values correct.

Limiting Access To The Database

The next simplest strategy is to prune down what agents (programs or users) can get at the database. The idea is simple: just let one program get at the database, make sure that program is correct, and force everybody to go through the application.

This will work if you can get your programs right and there is no chance that any of the check-signers will demand access except through your application. Many programmers believe this is true for them. Some of them are right, but many are not: their users would love to get access to the database but the programmer has created a situation where it is impossible.

Personally I try to avoid this approach completely, and my reasons are both philosophical and technical.

On the technical side, successful programs always expand in scope, and the demand for flexible database access always increases. Limiting access to the database means that eventually you have to recode the entire database interface. This means work for you, cost to the customer, and work for the customer in plugging into whatever interface you create. This may be doable, but the overriding fact is that databases already have an interface, and any time spent re-inventing it could better be spent on just about anything.

On the philosophical side I simply do not like any architecture where limitations are built in from the start. Call it a personal prejudice, but I much prefer to find the flexible solution where there is one (and personally I love to find it where it appears it does not exist). Overall the flexible solution always leads to more possibilities for work, more features, and just plain more fun.

Application Framework Strategies

If you are committed mainting the ORDER_TOTAL in application code, and you wish to avoid the "please remember to always...." blunder, then it must not be possible for new programmers or prima donna programmers to violate the requirement. This means your framework cannot allow random SQL commands, and must somehow force all write access to particular tables to route through particular objects or functions. A good ORM system should at very least not only provide a mechanism for updating related tables, but also prevent any access except through that mechanism.

Personally I have no use for these kinds of systems, for reasons explained in the previous section, and so I cannot really comment on them beyond describing these basic minimum requirements.

Server-Side Strategy: Stored Procedure

A few years ago I was working in Manhattan and a fellow programmer explained that at his wife's job all database write access had to go through stored procedures. The idea was to ensure that business rules were always enforced and to prevent any programmer from wittingly or unwittingly violating the rules. In the interest of full disclosure, I'll point out that I have never worked on such a system, and all of my knowledge of such systems is second-hand or third-hand.

With that being said, the obvious up-side to this method is that you avoid forcing database access through your application, making things much more robust and extendable. Further, you make sure, by coding up routines that handle UPDATES and INSERTS to ORDERS and ORDER_LINES that the useful but troubesome ORDER_TOTAL value is always updated when it needs to be. Further still, you can tie security to the stored procedures and control who can modify orders, which is a prime feature mentioned by everybody who has ever explained such a system to me.

There is a significant down-side if you intend to code the stored procedures manually. My own experience is that server-side code is the most difficult to debug (please feel free to post a comment trumpeting your favorite debugger for stored procedures, I'm all ears).

I have never been tempted to use a system like this because I believe it is still exactly one level more complicated than it needs to be. What I really want is to be able to directly code an INESRT to the ORDER_LINES table from any source and know the ORDER_TOTAL field will always be correct. If that were possible, then all parties are liberated from inventing and then using any API except SQL. Now of course many of us prefer to build some layer on top of SQL (myself inclued), but if the architecture supports direct SQL while enforcing business rules then all parties are free to use abstraction layers of their choosing, and nobody is forced to invent or accomodate anything they do not wish to.

Server-Side Stragey: Triggers

It is a simple technical fact that the tightest possible encapsulation of code and data occurs when you attach triggers to tables. In our example of the ORDER_TOTAL value, any INSERT, UPDATE, or DELETE to the ORDER_LINES table would update the ORDER_TOTAL in the ORDERS table. This approach gives maximum flexibility: you can directly access the database without violating rules, and any player can use an abstraction layer of their choice, or none at all.

Since many programmers find it very tedious and error-prone to code and debug server-side routines, this approach still faces a large obstacle if you intend to code the triggers by hand. But this should not be necessary when taking this approach, because all denormalization will follow patterns. This is a theme that I tend to repeat over and over in these essays: your tables will all follow predictable patterns and your denormalizations will likewise follow patterns. Whenever you have patterns you can have automation, and in this case that means generating the triggers instead of coding them by hand.

Another concern with this approach is security. I have been stressing the inevitable need for expanded database access as your application matures, but if you let somebody in with full priveleges, they could accidentally or maliciously cause huge damage if they can run willy-nilly wherever they want in the database. The trigger-based approach is the tightest possible way to enforce business rules, but it does nothing to address security. And if you end up granting database access based on confidence in triggers, then you are forced into enforcing security as well inside of the database -- but that is an essay for another day.

Conclusion

Once we decide to denormalize then we are required to dream up a strategy to keep things correct going in. The weakest strategies depend upon voluntary adherence to some set of conventions, and many strategies accept limitations in overall flexibilty to reduce the threat from unknown elements. The trigger option, not very popular these days, provides the tightest encapsulation of code and data, and lends itself well to code generation.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

The normalization essays on this blog are:

8 comments:

John "Z-Bo" Zabroski said...

@My own experience is that server-side code is the most difficult to debug (please feel free to post a comment trumpeting your favorite debugger for stored procedures, I'm all ears).

SQLServer Profiler, for performance analysis and debugging.

Tonetheman said...

Stored procs are not easy but for me at least I think they make the most stable system

Greg said...

Oracle SQL Developer has a debugger for stored procedures as well, and they have the advantage that they can be run/debugged in isolation to the rest of the application

Jochen said...

Kevin, I think that stored procedures and triggers go hand in hand. Triggers are good for keeping data in sync and for tables that are updated from more than one place.

Procedures are fine for defineing and controlling access to your data. In cases where more than one application works on the same database schema, you only have to code certain statements once and can re-use the code in all applications without the need of duplicating it. This also makes you independent of a framework or programming language, at the cost that you need to write the code on the database-side and also database specific.

Last but not least, if you have applications where the SQL is hidden in a compiled (and maybe widely distributed) application, you can change certain behaviour by simpy altering the procedure, without having to recompile the app and roll out a new version all over the place.

patrick said...

Ken, I've really enjoyed reading through your blog. Even in cases where I do not agree with you, reading your blog helps crystallize the reasons for not agreeing which in turn makes it easier to explain my reasoning to others.

I think the most interesting thing is the difference in perspective between a DB person and a programmer. You've mentioned that data is king, and the data in the DB will out live the program using it. In one entry you even mention that putting more business logic in triggers/constraints makes it easier to port an application to a different language.

In contrast programmers consider the program to be king. To make it easier to use the program on every single DB ever made or to be made, we are taught to keep as many rules in the program since each DB may not have the same feature set.

I think it is important to recognize what type of project (is it about the data, or is it about the program) you are working on. My guess is that data based (pun intended) projects will have a much larger amount of tables then a program based one.

KenDowns said...

@patrick, I appreciate your comments.

Where I would differ would be at the distinction between programmer and "DB person". There are few reasons.

First, I am a programmer. I happen to program biz database apps, which requires me to be fully competent in how databases work. This is not an either-or, programmer-vs-db, but a "both", I must know programming and databases.

Second, all programs exist solely to process data, but this sometimes hard to recognize. A simple image browser like Windows Explorer reads its data from disk and displays it to users. A chat program accepts data from keyboard and delivers it to a remote client, the entire chain exists to process data. Firewall systems have databases of rules that they implement on traffic. And so on.

So there is no such thing as a project that is not about data. It might not be *SQL* *data*, but it is always about data.

The conclusion for anybody who might need a relational/Sql database is that they must learn how that database works just as surely as they need to learn how JPEG's work if they are working with image data. Otherwise how could they possibly get it right?

In conclusion, those programmers who consider programs to be king are not bad people, they are simply mistaken about the fundamental nature of their work. That mistake then leads to faulty practices (because the practices are not based in fact) and they are simply not as effective as they could be.

moleboy said...

Ken,
I agree with you completely regarding the necessity for application programmers to learn at least SOMETHING about database theory.
I've recently started at a position where I'm working on enhancing and extending an application that was written by people who did not understand anything about databases.
All the SQL is hidden in the C#. There are no primary keys. There are no foreign keys. There are no constraints at all.
The data isn't slightly denormalized. One table has, literally, a dozen columns from other tables. These columns are often not even in synch. Of course, I can't find out why they aren't, because all the SQL is hidden. In short, they've written a database application and ignored everything that databases are designed to do.
;)

I think that application level/front end developers look at SQL and see it is reasonably simple. Therefor, there's no need to learn anything about how databases work, and why they work that way.
Which leads to where I am now, which is only half a step away from just one big table, and only a full step away from a flat file.

lveronese said...

This post seems to support the trigger based solution to the denormalization problem.
I think triggers have some merits and many flaws.
Some of them are:
- totally incompatible implementations among database vendors;
- if stored procedures are hard to debug, trigger based code is even harder to debug;
- triggers destroy the set oriented nature of sql, at least the way they have been standardized. The approach FOR EACH ROW kills set oriented processing and can cause hard to fix performance problems.
In my opinion we need a higher level approach to relational database programming where SQL and stored procedures are treated as a compilation target and the programming language is based on relational principles. I'm currently exploring this possibility and if someone is interested I may start a blog to share my ideas about this approach