A history table allows you to use one table to track changes in another table. While the basic idea is simple, a naive implementation will lead to bloat and will be difficult to query. A more sophisticated approach allows easier queries and can produce not just information about single rows, but can also support aggregrate company-wide queries.
This week in the Database Programmer Blog we return to table design patterns with an essay on history tables. The basic premise of this blog is that good coding skills do not lead magically to good database skills -- you can only make optimal use of a database by understanding it on its own terms. There is a new essay each Monday, and there is a Complete Table of Contents and a List of Table Design Patterns.
What to Put Into A History Table
Naive approaches to history tables usually involve making a complete copy of the original (or new) row when something changes in the source table. This turns out to be of little use, for reasons I will explain below. A much more useful approach is to track only a few columns and to store any combination of old values, new values, and differences. A history table designed this way can be tremendously useful.
We will start with the example of a sales order table, called ORDERS. The columns we are interested in might look like this:
ORDER | CUSTOMER | DATE | LINES | TAX | TOTAL | PAID | BALANCE ------+----------+----------+--------+-------+--------+--------+--------- 1234 | 9876 | 5/1/08 | 48.00 | 5.00 | 53.00 | 0 | 53.00 2345 | 9876 | 5/3/08 | 150.00 | 0 | 150.00 | 150.00 | 0 3456 | 5544 | 6/8/08 | 25.00 | 2.60 | 27.60 | 15.00 | 12.60 4567 | 3377 | 7/3/08 | 125.00 | 7.00 | 132.00 | 50.00 | 82.00
We first have to ask which columns must be copied into history so that we can link the history table back to the ORDERS table. The only column we need for tracking is ORDER (the order number), so the history table will always have an ORDER column.
We should also assume that the history table will contain at least a timestamp and a column to track the user who made the change, which brings us to a minimum of three columns.
Finally, it tends to be very useful to track what action caused the history entry, be it an INSERT, UPDATE, or DELETE. This brings us up to four minimum columns.
Next we ask which columns we will definitely not need. There are two groups of columns we will not need, which are 1) the columns that never change and 2) the columns we do not care about. Columns that do not change are likely to be the CUSTOMER and the DATE column. There is no need to bloat the history table with these valus because we can just get them out of the ORDERS table. The second group, columns we do not care about, are are usually things like ship-to address, maybe an email, and other information. Naturally there is no hard-and-fast rule here, it depends entirely upon the needs of the application.
So now we know what we definitely need and what we definitely do not need, and we are ready to begin work considering the columns that will change. Not surprisingly, these are usually all about the numbers. Next we will see how to track the numbers.
Tracking Changes to Numbers
While it is certainly useful to store one or both of the old and new values for a number, it far more useful to store the change in the value, or the delta. Having this number in the history table makes for some really nifty abilities. If you store all three of the old, new, and delta, then you can more or less find out anything about the ORDER's history with very simple queries.
So we are now ready to consider what the history table might look like. We will take the case of an order that was entered by user 'sax', updated twice by two other users, and in the end it was deleted by user 'anne'. Our first stab at the history table might look like this:
ORDER | USER_ID | ACTION | DATE | LINES_OLD | LINES_NEW | LINES_DELTA ------+----------+--------+---------+-----------+-----------+------------- 1234 | sax | UPD | 5/1/08 | 0.00 | 48.00 | 48.00 1234 | arkady | UPD | 5/7/08 | 48.00 | 58.00 | 10.00 1234 | ralph | UPD | 6/1/08 | 58.00 | 25.00 | -33.00 1234 | anne | DEL | 6/4/08 | 25.00 | 0.00 | -25.00
I should note that if you keep LINES_OLD and LINES_NEW, then strictly speaking you do not need the LINES_DELTA columns. Whether or not you put it in depends on your approach to table design. If you framework allows you to guarantee that it will be correct, then your queries will be that much simpler with the LINES_DELTA column present.
You may wonder why there is no entry for the original INSERT. This is because you must enter an order before you can enter the lines, so the original value will always be zero. Only when lines start going in does the ORDER get any numbers. This is true for header tables, but it would not be true for detail tables like ORDER_LINES_HISTORY.
Some of the Obvious Queries
There are few obvious queries that we can pull from the history table right away. These include the following:
-- Find the value of of the line items of an -- order as of June 1st SELECT LINES_NEW FROM ORDERS_HISTORY WHERE ORDER = 1234 AND DATE <= '2008-06-01' ORDER BY DATE DESC LIMIT 1; -- Find the original value of the line items, -- and the user who entered it. SELECT LINES_NEW, USER_ID FROM ORDERS_HISTORY WHERE ORDER = 1234 ORDER BY date LIMIT 1; -- Find the users who have worked on an order SELECT DISTINCT USER_ID FROM ORDERS_HISTORY WHERE ORDER = 1234;
Most of queries should be pretty obvious, and there are plenty more that will suggest themselves once you start working with the history tables.
Queries Involving the Delta
The real power of the DELTA column comes into play when you are trying to compute back-dated values such as the company's total open balance on June 1, 2008. If you have a naive history table that stores only the old value or only the new value, this is truly a tortuous query to write, but if you have both then it is really quite easy.
-- Query to calculate the total open balance of all -- orders as of a given date SELECT SUM(BALANCE_DELTA) FROM ORDERS_HISTORY WHERE DATE <= '2008-06-01';
This magical little query works because paid orders will "wash out" of the total. Consider an order that is entered on May 20 for $200.00, and is then paid on May 23rd. It will have +200 entry in the BALANCE_DELTA column, and then it will have a -200.00 entry 3 days later. It will contribute the grand sum of zero to the total.
But an order entered on May 25th that has not been paid by June 1st will have only a +200 entry in the BALANCE_DELTA column, so it will contribute the correct amount of $200.00 to the balance as of June 1st.
If the company owner wants a report of his total open balances on each of the past 30 days, you can retrieve two queries and build his report on the client:
-- Get begin balance at the beginning of the period SELECT SUM(BALANCE_DELTA) as BEGIN_BALANCE FROM ORDERS_HISTORY WHERE DATE < '2008-06-01'; -- Get the total changes for each day. When you -- build the report on the client, add each day's -- change amount to the prior day's balance SELECT SUM(BALANCE_DELTA) AS BALANCE_DELTA FROM ORDERS_HISTORY WHERE DATE BETWEEN '2008-06-01' AND '2008-06-30' GROUP BY DATE;
Keeping History Tables Clean
A clean history table is one that contains no unnecessary information. You normally do not want entries going into the history table if nothing relevant changed. So your history table mechanism should examine the columns it is tracking, and only make an entry to the history table if one of the columns of interest actually changed.
Problems With The Naive History Table
A very basic history table will usually copy the entire original row from the source table into the history table whenever an INSERT, UPDATE or DELETE occurs. One simple problem is that you end up with bloated history tables. Because they are cluttered with unnecessary repititions, they are difficult to work with by inspection.
A much more serious technical problem with the naive approach is that it is horribly difficult to produce the queries demonstrated above. You must reproduce the concept of a delta by either running through all of the rows on the client, or you must make a difficult (and often impossible) JOIN of the history table to itself in which you connect each row to the row that came just before it. All I can say is, no thanks, I'll go with the delta.
History Table Security
History tables always involve some concept of auditing, that is, keeping track of user actions. This means we need to protect against deliberate falsification of the history tables, which leads to two rules. First, a user must have no ability to directly DELETE rows from the history table, or they could erase the record of changes. Second, the user must have no ability to directly INSERT or UPDATE existing rows, because if they could they can falsify the history. These rules apply to both regular users and system administrators, the administrator must have no privelege to subvert or manipulate the history.
Since history tables have a tendency to become seriously bloated, there must be some priveleged group that can DELETE from the history tables, which they would do as a periodic purge operation. This group should have no ability to UPDATE the tables, because such priveleges would open a potential hole for subverting the history. Regular system administrators should not be in this group, this should be a special group whose only purpose is to DELETE out of the history tables.
If you are making use of DELTA columns, then stricly speaking you do not want to purge, but compress history tables. If you want to purge out all entries in 2005, you must replace them with a single entry that contains a SUM of the DELTA columns for all of 2005.
So to sum up, we have the following security rules for a history table:
- No system user should be able to DELETE from the history table.
- No system user should be able to UPDATE the history table.
- No system user should be able to directly control the INSERT into the history table.
- A special group must be defined whose only ability is to DELETE from the history table, so that the tables can be purged (or compressed) from time to time.
Implementation
As always, you have your choice of implementing the history mechanism in the client code or in the database itself.
The best performing and most secure method is to implement history tables with triggers on the source table. This is the best way to implement both security and the actual business rules in one encapsulated object (the table). However, if you have no current practices for coding server-side routines, or you do not have a data dictionary system that will generate the code for you, then it may not be practical to go server-side for a single feature.
Implementing history tables in code has the usual benefit of keeping you in the language and habits you are most familiar with, but it means that you cannot allow access to your database except through your application. I cannot of course make a general rule here, this decision is best made by the design team based on the situation at hand and anticipated future needs.
Conclusion
History tables have many uses. Beyond the obvious first use of finding indidivual values at some point in the past, well crafted tables can produce company-wide aggregations like total open balances on a given day, changes in booked orders on a day or in a range of days, and many other queries along those lines. Security is very important to prevent history tables from being subverted.
29 comments:
The idea of the delta column is quite neat. I will probably be using that one somewhere in the near future.
But I think you are too easily dismissing the use of a complete copy of a row for auditing purposes. Auditing is an area where you can never totally be sure that you will not need some value of an attribute in the future. Be it to make a nice historical report or in a case of fraud detection.
We all know that the attributes which you think will never change, suddenly must be changed. You can bet on that.
Moreover, once you have dismissed a column from the historical table you will never be able to retrieve the history for that column.
I myself would use the 'better safe than sorry' approach and include all columns. What is the cost of storage anyway nowadays :-)
It seems that the custom history table for each original table and the need for associated triggers might result in high total code ownership thus causing maintenance overhead. Consider a common case of adding an extra column to history table as requirements change or a system with large number of tables. Are there any tips or tricks to create a metadata driven history system that minimizes this overhead?
Nice. I have also been learning from http://en.wikipedia.org/wiki/Slowly_changing_dimension
Hey, thanks for the post. We are planning to work on table history function for our fsm based workflow system and we came out with the following table definition:-
table_history (history_id, field_name, from_value, to_value, uid *, transition_id *, record_id *)
the differences between our definition and the one mentioned in your post are we track for all the modified fields (not sure whether it is necessary) instead of just the key fields. And because we are having a workflow system, therefore we can specify what kind of transaction (we call them transition) actually triggers the change in the record (hence transition_id is a foreign key field).
We are still finding out what is the best way to populate the history table though. Fortunately we are using postgresql that offers PL/SQL trigger scripting but I don't know if it is reliable enough.
Not all programs that write to the history tables are flawless, i.e. a bug could cause the tables to become corrupt/bogus, which means that someone, somewhere (probably the dba) will need insert/update/delete access...
Anonymous: no application code will ever be flawless, that's why I stated the only way to be completely reliable is to use triggers.
jeffrey04: You are implementing what is often called E-A-V. You may want to reconsider that, you will find it superficially easy to get started and more and more complicated to go forward. The basic problem is you are not really storing complete rows anymore, so you will have to reinvent much of SQL to get useful queries out of your history.
kiran: Well I can start with what I do myself. I define the entire database in a data dictionary, and use a "builder" program that reads the dictionary to build the tables and the triggers that do things like maintain history tables. I have found this produces the lowest code maintenance overhead of any other approach I've used in the last 15 years.
So a "tip" or "trick" would be to depend heavily on your data dictionary and put everything into it.
As far as I understand if we want to track web app. user activity with history tables on back end site -- we obliged to implement our history tables in code (not in triggers), since the web app. users are (usually) not database users. I wonder if there some real implementation examples..
EJS: Most web programmers use a single login. In fact, as far as I can tell, *all* of them do. This leaves you with no choice but to implement in code.
I take this as just another reason not to use a single login.
As for examples, I have a (badly outdated) demo at http://www.andromeda-project.org/demo that is a browser-based biz app that uses "real" database accounts and implements all biz logic in triggers which are themselves generated out of a data dictionary.
I've recently changed from a bloated version to something a bit easier to manage. It is similar to what jeffrey04 mentioned with a field name column and a field value column. I set up triggers, example at http://codespatter.com/2008/05/06/how-to-use-triggers-to-track-changes-in-mysql/ that will detect when there are changes. Tracking users that make the changes is done in the code.
What is it that you found more complicated with the E-A-V route? So far it has been much easier to use.
Greg: E-A-V always works well in simple situations and then most people find it falls apart later on. My original blog gave only simple examples, but it never ends there, requests for new queries become more complex. The EAV approach, which divides related information that should be in columns into rows instead, makes these queries very difficult to write.
But in the most basic sense, no database programmer is ever going to be comfortable with a system that takes related information, which belongs together in one row of a table, and spreads it out into multiple rows. It creates somewhat simpler code at first in the situation where you used it, and everything else after that gets worse because you cannot use SQL effectively, not for updates, not for deletes, not for select, and not for inserts. This is because you don't have rows of related information in tables anymore, and every action you take has to reconstruct the rows-in-tables structure from the split-up EAV structure.
Great article, which came at a great time for me. I am about to add change tracking to a rails based door access application.
I disagree, however, with your choice to place business logic (which the decision to track or not track a particular field or operation must be) in a trigger. If multiple applications must interact with the database, then they should implement their own history logic as well. I assert that this logic belongs in the model, under the MVC paradigm.
Tim e.: Thanks, I hope your history tables go well.
We will not likely agree on the issue of triggers vs application code, unless you use a data dictionary to generate the code. Without a dictionary, you are violating DRY, and practically speaking it will be impossible to get them all right (or at least I've never seen it successfully done, even by smart and motivated people). And if you're going to use a dictionary to generate the code, why not just generate it once instead of multiple times?
Programmers who disagree with the data dictionary approach need to hear the story of the data dictionary told from the stand-point of model-driven software development and real-time object-oriented modeling (the object-oriented modeling pattern all OO systems gravitate towards).
I think you almost got your criticism of EAV right when you mention in the comments how people dynamically rebuild the EAV table and transaform it into a more comprehensible materialized view. The reason they do this is related to the consequences of their actions. When you build an EAV data model, your column headers don't mean anything. It turns out people like looking at tables with meaningful column headers, and so even if they go the EAV route they will invariably materialize views that give them useful column headers.
Z-Bo: I think you are absolutely right about casting the dd in terms of MDA. Looks like we have a future essay topic...
what DELTA will you use for STRING columns?
Anonymous: I don't think there is a meaningful definition of DELTA for strings. I don't use them myself, just before and after where applicable.
Nice article!
But why not simply use SVN _if_ history is just needed to backup old data in case of errors etc. So if you just need access to _current_ data, SVN (of MySQL dump) is a good idea, isn't it?
andi: thanks for the compliment. If I understand your question, examining a dump is an offline activity, which makes it more difficult. Also, you may miss intermediate states of the database between dumps.
I imagine calling SVN commands from server side, eg via PHP. If this is possible, I can commit a dump of the last state of a row, which is going to change now. The benefit is that I use all SVN stuff (delta etc.) for my database data. So I can easily get the last status with one command - to reset it I just execute the dump;) The downside could be that it is not very performant to commit a new file for every row - but you could swap this process to another thread - so MySQL will not decelerate.
andi,
I think you are thinking like an application developer instead of a database administrator.
A systems developer (like a kernel developer), on the other hand, would tell you that you are using the wrong tool for the job. For instance, a kernel developer would never write a version control system specifically to highlight a newly created function between two commits. Instead, they'd just store the difference in the storage bits.
By the way, SQLServer2008 provides nifty support for monitoring changes to tables, including changes made by users with SA and/or DBO privileges. SQL DBMSes are just getting better and better these days.
Z-Bo: "I think you are thinking like an application developer..." That's why it's called The Database "programmer".
Also, one point Ken did not cover: You should definitely "unit test" your stored procedure that writes to the history table. The consequences of you going even a day where you didn't audit something is unthinkable, but possible, especially when you upgrade databases and your database vendor re-implements how its Grand/Deny/Revoke security model is defined between database objects.
Sure, I was just looking for a tool that does the right thing for me ;) I did not find it in MySQL so I just was "scanning" the tools I know and found (the diff implementation of) SVN. I don't want to rewrite the diff code for myself - and now I have found xdiff and its PHP implementation which can do the job for me.
Take a look @ http://php.net/manual/en/ref.xdiff.php
Hi,
How about a row based approach where you track changes to each column in a separate row in the audit/history table?
So the audit table can look like
Id | Column name | old value | action | date
To avoid the table from becoming too large, we can delete history older than X days.
Ony thing to add.
In Oracle 8i and above you have analytical functions which you can use to avoid the redundant delta_column in your history table.
With analytical functions you don't need to self-join the history table.
I hope that these function will be implemented into several other database products, because they rock.
Keep up the good work.
Oliver
In another post your wrote about putting historical and current data together into the same table. Isn't it equivalent to actually duplicate an entire row, which you qualify as naive? In my project I would like to track the changes of address, there are no computation involved (i.e. no delta), what would be your approach?
@erides: Let me clarify. In transaction tables, like orders, I will keep the "old" rows in the table, instead of creating tables like ORDERS_HISTORY and moving the rows when they are closed. This is what I mean by keeping old and new in the same table.
The history tables discussed here are for tracking changes. When you are tracking something like changes to address fields, you just decide to save either the old or the new, you don't need both.
Post a Comment