Sunday, July 20, 2008

History Tables

A history table allows you to use one table to track changes in another table. While the basic idea is simple, a naive implementation leads to bloat and is difficult to query. A more sophisticated approach allows easier queries and can produce not just information about single rows, but also aggregate company-wide queries.

This week in the Database Programmer Blog we return to table design patterns with an essay on history tables. The basic premise of this blog is that good coding skills do not lead magically to good database skills -- you can only make optimal use of a database by understanding it on its own terms. There is a new essay each Monday, and there is a Complete Table of Contents and a List of Table Design Patterns.

What to Put Into A History Table

Naive approaches to history tables usually involve making a complete copy of the original (or new) row when something changes in the source table. This turns out to be of little use, for reasons I will explain below. A much more useful approach is to track only a few columns and to store any combination of old values, new values, and differences. A history table designed this way can be tremendously useful.

We will start with the example of a sales order table, called ORDERS. The columns we are interested in might look like this:

ORDER | CUSTOMER | DATE     | LINES  |   TAX |  TOTAL |   PAID | BALANCE
------+----------+----------+--------+-------+--------+--------+---------
1234  | 9876     | 5/1/08   |  48.00 |  5.00 |  53.00 |      0 |   53.00
2345  | 9876     | 5/3/08   | 150.00 |     0 | 150.00 | 150.00 |       0
3456  | 5544     | 6/8/08   |  25.00 |  2.60 |  27.60 |  15.00 |   12.60
4567  | 3377     | 7/3/08   | 125.00 |  7.00 | 132.00 |  50.00 |   82.00

We first have to ask which columns must be copied into history so that we can link the history table back to the ORDERS table. The only column we need for tracking is ORDER (the order number), so the history table will always have an ORDER column.

We should also assume that the history table will contain at least a timestamp and a column to track the user who made the change, which brings us to a minimum of three columns.

Finally, it tends to be very useful to track what action caused the history entry, be it an INSERT, UPDATE, or DELETE. This brings us up to four minimum columns.

Next we ask which columns we will definitely not need. There are two groups of columns we will not need: 1) the columns that never change and 2) the columns we do not care about. The columns that do not change are likely to be the CUSTOMER and DATE columns. There is no need to bloat the history table with these values because we can just get them out of the ORDERS table. The second group, columns we do not care about, usually includes things like the ship-to address, maybe an email, and other incidental information. Naturally there is no hard-and-fast rule here; it depends entirely upon the needs of the application.

So now we know what we definitely need and what we definitely do not need, and we are ready to consider the columns that will change. Not surprisingly, these are usually all about the numbers. Next we will see how to track the numbers.

Tracking Changes to Numbers

While it is certainly useful to store one or both of the old and new values for a number, it is far more useful to store the change in the value, or the delta. Having this number in the history table makes for some really nifty abilities. If you store all three of the old, new, and delta, then you can more or less find out anything about the ORDER's history with very simple queries.

So we are now ready to consider what the history table might look like. We will take the case of an order that was entered by user 'sax', updated twice by two other users, and finally deleted by user 'anne'. Our first stab at the history table might look like this:

ORDER | USER_ID  | ACTION | DATE    | LINES_OLD | LINES_NEW | LINES_DELTA 
------+----------+--------+---------+-----------+-----------+-------------
1234  | sax      | UPD    | 5/1/08  |      0.00 |     48.00 |       48.00
1234  | arkady   | UPD    | 5/7/08  |     48.00 |     58.00 |       10.00
1234  | ralph    | UPD    | 6/1/08  |     58.00 |     25.00 |      -33.00
1234  | anne     | DEL    | 6/4/08  |     25.00 |      0.00 |      -25.00

I should note that if you keep LINES_OLD and LINES_NEW, then strictly speaking you do not need the LINES_DELTA column. Whether or not you put it in depends on your approach to table design. If your framework allows you to guarantee that it will be correct, then your queries will be that much simpler with the LINES_DELTA column present.
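To make this concrete, here is a minimal sketch of the history table definition. The column names follow the examples in this essay; note that ORDER and DATE are reserved words in SQL, so the sketch quotes them (in a real system you might prefer names like ORDER_ID and CHANGE_DATE):

-- A minimal ORDERS_HISTORY table (illustrative sketch)
CREATE TABLE ORDERS_HISTORY (
    "ORDER"       INTEGER      NOT NULL,  -- links back to ORDERS
    USER_ID       VARCHAR(20)  NOT NULL,  -- who made the change
    ACTION        CHAR(3)      NOT NULL,  -- 'INS', 'UPD', or 'DEL'
    "DATE"        TIMESTAMP    NOT NULL,  -- when the change happened
    LINES_OLD     NUMERIC(10,2),
    LINES_NEW     NUMERIC(10,2),
    LINES_DELTA   NUMERIC(10,2),
    BALANCE_OLD   NUMERIC(10,2),
    BALANCE_NEW   NUMERIC(10,2),
    BALANCE_DELTA NUMERIC(10,2)
);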

You may wonder why there is no entry for the original INSERT. This is because you must enter an order before you can enter the lines, so the original value will always be zero. Only when lines start going in does the ORDER get any numbers. This is true for header tables, but it would not be true for detail tables like ORDER_LINES_HISTORY.

Some of the Obvious Queries

There are a few obvious queries that we can pull from the history table right away. These include the following:

-- Find the value of the line items of an
-- order as of June 1st
SELECT LINES_NEW 
  FROM ORDERS_HISTORY
 WHERE ORDER = 1234
   AND DATE <= '2008-06-01'
 ORDER BY DATE DESC LIMIT 1;
 
-- Find the original value of the line items,
-- and the user who entered it.  
SELECT LINES_NEW, USER_ID
  FROM ORDERS_HISTORY
 WHERE ORDER = 1234
 ORDER BY DATE LIMIT 1;
   
-- Find the users who have worked on an order
SELECT DISTINCT USER_ID
  FROM ORDERS_HISTORY
 WHERE ORDER = 1234;

Most of these queries should be pretty obvious, and there are plenty more that will suggest themselves once you start working with history tables.

Queries Involving the Delta

The real power of the DELTA column comes into play when you are trying to compute back-dated values such as the company's total open balance on June 1, 2008. If you have a naive history table that stores only the old value or only the new value, this is truly a tortuous query to write, but if you have the delta it is really quite easy.

-- Query to calculate the total open balance of all
-- orders as of a given date
SELECT SUM(BALANCE_DELTA) 
  FROM ORDERS_HISTORY
 WHERE DATE <= '2008-06-01';

This magical little query works because paid orders will "wash out" of the total. Consider an order that is entered on May 20 for $200.00, and is then paid on May 23rd. It will have a +200.00 entry in the BALANCE_DELTA column, and then a -200.00 entry 3 days later. It will contribute the grand sum of zero to the total.

But an order entered on May 25th that has not been paid by June 1st will have only a +200.00 entry in the BALANCE_DELTA column, so it will contribute the correct amount of $200.00 to the balance as of June 1st.
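To see the wash-out concretely, here are the history rows those two orders would generate (order numbers 5678 and 6789 are invented for the illustration):

ORDER | DATE    | ACTION | BALANCE_DELTA
------+---------+--------+---------------
5678  | 5/20/08 | UPD    |        200.00
5678  | 5/23/08 | UPD    |       -200.00
6789  | 5/25/08 | UPD    |        200.00

Summing BALANCE_DELTA through June 1st gives 200.00 - 200.00 + 200.00 = 200.00: order 5678 washes out, and order 6789 remains as the open balance.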

If the company owner wants a report of his total open balances on each of the past 30 days, you can run two queries and build the report on the client:

-- Get begin balance at the beginning of the period
SELECT SUM(BALANCE_DELTA) as BEGIN_BALANCE
  FROM ORDERS_HISTORY
 WHERE DATE < '2008-06-01';

-- Get the total changes for each day.  When you
-- build the report on the client, add each day's
-- change amount to the prior day's balance
SELECT SUM(BALANCE_DELTA) AS BALANCE_DELTA
  FROM ORDERS_HISTORY
 WHERE DATE BETWEEN '2008-06-01' AND '2008-06-30'
 GROUP BY DATE;
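
If your database supports window (analytic) functions, the begin balance and the daily changes can even be combined into a single running-balance query. This is a sketch only; syntax and availability vary by product, and the reserved words are quoted as before:

-- Running open balance per day of activity; an outer query
-- or the client can discard the rows before June 1st
SELECT "DATE",
       SUM(SUM(BALANCE_DELTA)) OVER (ORDER BY "DATE") AS BALANCE
  FROM ORDERS_HISTORY
 WHERE "DATE" <= '2008-06-30'
 GROUP BY "DATE";

Rows from before June 1st are deliberately included so that the running sum starts from the correct beginning balance.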

Keeping History Tables Clean

A clean history table is one that contains no unnecessary information. You normally do not want entries going into the history table if nothing relevant changed. So your history table mechanism should examine the columns it is tracking, and only make an entry in the history table if one of the columns of interest actually changed.
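
In trigger terms, that check can live in the trigger definition itself. Here is a PostgreSQL-flavored sketch; the orders_history_log() function is hypothetical at this point, and a sketch of it appears in the Implementation section below:

-- Fire the history trigger only when a tracked column
-- actually changed (only LINES and BALANCE are shown)
CREATE TRIGGER orders_history_upd
  AFTER UPDATE ON ORDERS
  FOR EACH ROW
  WHEN (OLD.LINES   IS DISTINCT FROM NEW.LINES
     OR OLD.BALANCE IS DISTINCT FROM NEW.BALANCE)
  EXECUTE PROCEDURE orders_history_log();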

Problems With The Naive History Table

A very basic history table will usually copy the entire original row from the source table into the history table whenever an INSERT, UPDATE or DELETE occurs. One simple problem is that you end up with bloated history tables. Because they are cluttered with unnecessary repetitions, they are difficult to work with by inspection.

A much more serious technical problem with the naive approach is that it is horribly difficult to produce the queries demonstrated above. You must reproduce the concept of a delta either by running through all of the rows on the client, or by making a difficult (and often impossible) JOIN of the history table to itself in which you connect each row to the row that came just before it. All I can say is, no thanks, I'll go with the delta.

History Table Security

History tables always involve some concept of auditing, that is, keeping track of user actions. This means we need to protect against deliberate falsification of the history tables, which leads to two rules. First, a user must have no ability to directly DELETE rows from the history table, or they could erase the record of changes. Second, the user must have no ability to directly INSERT or UPDATE existing rows, because then they could falsify the history. These rules apply to both regular users and system administrators: the administrator must have no privilege to subvert or manipulate the history.

Since history tables have a tendency to become seriously bloated, there must be some privileged group that can DELETE from the history tables, which they would do as a periodic purge operation. This group should have no ability to UPDATE the tables, because such privileges would open a potential hole for subverting the history. Regular system administrators should not be in this group; it should be a special group whose only purpose is to DELETE out of the history tables.

If you are making use of DELTA columns, then strictly speaking you do not want to purge your history tables but to compress them. If you want to purge out all entries from 2005, you must replace them with a single entry that contains a SUM of the DELTA columns for all of 2005.
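
A compression pass might look like the following sketch (PostgreSQL-flavored). It summarizes into a temporary table first so that the DELETE cannot remove the new summary rows, and it labels the summary rows with an invented 'purge' user; note that compressing means the purge group needs INSERT rights as well as DELETE:

BEGIN;

-- Summarize everything before 2006, one row per order
CREATE TEMP TABLE SUMMARY_2005 AS
SELECT "ORDER",
       SUM(LINES_DELTA)   AS LINES_DELTA,
       SUM(BALANCE_DELTA) AS BALANCE_DELTA
  FROM ORDERS_HISTORY
 WHERE "DATE" < '2006-01-01'
 GROUP BY "ORDER";

-- Remove the detail rows...
DELETE FROM ORDERS_HISTORY WHERE "DATE" < '2006-01-01';

-- ...and put back one summary row per order
INSERT INTO ORDERS_HISTORY
       ("ORDER", USER_ID, ACTION, "DATE", LINES_DELTA, BALANCE_DELTA)
SELECT "ORDER", 'purge', 'UPD', TIMESTAMP '2005-12-31',
       LINES_DELTA, BALANCE_DELTA
  FROM SUMMARY_2005;

COMMIT;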

So to sum up, we have the following security rules for a history table (a permissions sketch follows the list):

  • No system user should be able to DELETE from the history table.
  • No system user should be able to UPDATE the history table.
  • No system user should be able to directly control the INSERT into the history table.
  • A special group must be defined whose only ability is to DELETE from the history table, so that the tables can be purged (or compressed) from time to time.
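
In SQL permission terms, the rules might be enforced along these lines (a sketch; the history_purge role name is invented, and the INSERT grant is only needed if you compress rather than purge, as sketched above):

-- Nobody gets direct write access to the history table
REVOKE ALL ON ORDERS_HISTORY FROM PUBLIC;
GRANT SELECT ON ORDERS_HISTORY TO PUBLIC;

-- The special purge/compress group: DELETE (and INSERT
-- for the summary rows), but never UPDATE
CREATE ROLE history_purge;
GRANT DELETE, INSERT ON ORDERS_HISTORY TO history_purge;

The INSERTs made on behalf of ordinary users are performed by the trigger itself (see the Implementation section below), so no regular user ever needs direct INSERT privileges.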

Implementation

As always, you have your choice of implementing the history mechanism in the client code or in the database itself.

The best performing and most secure method is to implement history tables with triggers on the source table. This is the best way to implement both security and the actual business rules in one encapsulated object (the table). However, if you have no current practices for coding server-side routines, or you do not have a data dictionary system that will generate the code for you, then it may not be practical to go server-side for a single feature.
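
For illustration, here is roughly what the trigger function behind the earlier CREATE TRIGGER sketch might look like in PostgreSQL's PL/pgSQL. It is a sketch of the approach, not a drop-in implementation, and it tracks only the LINES and BALANCE columns:

-- Writes one history row per UPDATE or DELETE on ORDERS
CREATE OR REPLACE FUNCTION orders_history_log() RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        -- a delete zeroes the order out of the history
        INSERT INTO ORDERS_HISTORY
               ("ORDER", USER_ID, ACTION, "DATE",
                LINES_OLD, LINES_NEW, LINES_DELTA,
                BALANCE_OLD, BALANCE_NEW, BALANCE_DELTA)
        VALUES (OLD."ORDER", session_user, 'DEL', now(),
                OLD.LINES, 0, -OLD.LINES,
                OLD.BALANCE, 0, -OLD.BALANCE);
        RETURN OLD;
    ELSE
        -- an update logs old value, new value, and the delta
        INSERT INTO ORDERS_HISTORY
               ("ORDER", USER_ID, ACTION, "DATE",
                LINES_OLD, LINES_NEW, LINES_DELTA,
                BALANCE_OLD, BALANCE_NEW, BALANCE_DELTA)
        VALUES (NEW."ORDER", session_user, 'UPD', now(),
                OLD.LINES, NEW.LINES, NEW.LINES - OLD.LINES,
                OLD.BALANCE, NEW.BALANCE, NEW.BALANCE - OLD.BALANCE);
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

-- Deletes are always logged, so no WHEN clause is needed
CREATE TRIGGER orders_history_del
  AFTER DELETE ON ORDERS
  FOR EACH ROW
  EXECUTE PROCEDURE orders_history_log();

SECURITY DEFINER lets the function insert into ORDERS_HISTORY even though ordinary users have no INSERT privilege on it, and session_user (rather than current_user) records the real login despite that.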

Implementing history tables in code has the usual benefit of keeping you in the language and habits you are most familiar with, but it means that you cannot allow access to your database except through your application. I cannot of course make a general rule here; this decision is best made by the design team based on the situation at hand and anticipated future needs.

Conclusion

History tables have many uses. Beyond the obvious first use of finding individual values at some point in the past, well-crafted tables can produce company-wide aggregations like total open balances on a given day, changes in booked orders on a day or in a range of days, and many other queries along those lines. Security is very important to prevent history tables from being subverted.

NEXT ESSAY: Different Foreign Keys For Different Tables

Comments:

Anonymous said...

Not all programs that write to the history tables are flawless, i.e. a bug could cause the tables to become corrupt/bogus, which means that someone, somewhere (probably the dba) will need insert/update/delete access...

Eric said...

The idea of the delta column is quite neat. I will probably be using that one somewhere in the near future.
But I think you are too easily dismissing the use of a complete copy of a row for auditing purposes. Auditing is an area where you can never totally be sure that you will not need some value of an attribute in the future. Be it to make a nice historical report or in a case of fraud detection.
We all know that the attributes which you think will never change, suddenly must be changed. You can bet on that.
Moreover, once you have dismissed a column from the historical table you will never be able to retrieve the history for that column.
I myself would use the 'better safe than sorry' approach and include all columns. What is the cost of storage anyway nowadays :-)

Jeffrey04 said...

Hey, thanks for the post. We are planning to work on table history function for our fsm based workflow system and we came out with the following table definition:-

table_history (history_id, field_name, from_value, to_value, uid *, transition_id *, record_id *)

the differences between our definition and the one mentioned in your post are that we track all the modified fields (not sure whether it is necessary) instead of just the key fields. And because we are having a workflow system, therefore we can specify what kind of transaction (we call them transition) actually triggers the change in the record (hence transition_id is a foreign key field).

We are still finding out what is the best way to populate the history table though. Fortunately we are using postgresql that offers PL/SQL trigger scripting but I don't know if it is reliable enough.

Eoin said...

Nice. I have also been learning from http://en.wikipedia.org/wiki/Slowly_changing_dimension

Kiran said...

It seems that the custom history table for each original table and the need for associated triggers might result in high total code ownership thus causing maintenance overhead. Consider a common case of adding an extra column to history table as requirements change or a system with large number of tables. Are there any tips or tricks to create a metadata driven history system that minimizes this overhead?

KenDowns said...

Anonymous: no application code will ever be flawless, that's why I stated the only way to be completely reliable is to use triggers.

KenDowns said...

jeffrey04: You are implementing what is often called E-A-V. You may want to reconsider that, you will find it superficially easy to get started and more and more complicated to go forward. The basic problem is you are not really storing complete rows anymore, so you will have to reinvent much of SQL to get useful queries out of your history.

KenDowns said...

kiran: Well I can start with what I do myself. I define the entire database in a data dictionary, and use a "builder" program that reads the dictionary to build the tables and the triggers that do things like maintain history tables. I have found this produces the lowest code maintenance overhead of any approach I've used in the last 15 years.

So a "tip" or "trick" would be to depend heavily on your data dictionary and put everything into it.

ejs said...

As far as I understand, if we want to track web app. user activity with history tables on the back end site -- we are obliged to implement our history tables in code (not in triggers), since the web app. users are (usually) not database users. I wonder if there are some real implementation examples..

KenDowns said...

EJS: Most web programmers use a single login. In fact, as far as I can tell, *all* of them do. This leaves you with no choice but to implement in code.

I take this as just another reason not to use a single login.

As for examples, I have a (badly outdated) demo at http://www.andromeda-project.org/demo that is a browser-based biz app that uses "real" database accounts and implements all biz logic in triggers which are themselves generated out of a data dictionary.

Greg said...

I've recently changed from a bloated version to something a bit easier to manage. It is similar to what jeffrey04 mentioned with a field name column and a field value column. I set up triggers, example at http://codespatter.com/2008/05/06/how-to-use-triggers-to-track-changes-in-mysql/ that will detect when there are changes. Tracking users that make the changes is done in the code.

What is it that you found more complicated with the E-A-V route? So far it has been much easier to use.

KenDowns said...

Greg: E-A-V always works well in simple situations and then most people find it falls apart later on. My original blog gave only simple examples, but it never ends there, requests for new queries become more complex. The EAV approach, which divides related information that should be in columns into rows instead, makes these queries very difficult to write.

But in the most basic sense, no database programmer is ever going to be comfortable with a system that takes related information, which belongs together in one row of a table, and spreads it out into multiple rows. It creates somewhat simpler code at first in the situation where you used it, and everything else after that gets worse because you cannot use SQL effectively, not for updates, not for deletes, not for select, and not for inserts. This is because you don't have rows of related information in tables anymore, and every action you take has to reconstruct the rows-in-tables structure from the split-up EAV structure.

Tim E. said...

Great article, which came at a great time for me. I am about to add change tracking to a rails based door access application.

I disagree, however, with your choice to place business logic (which the decision to track or not track a particular field or operation must be) in a trigger. If multiple applications must interact with the database, then they should implement their own history logic as well. I assert that this logic belongs in the model, under the MVC paradigm.

KenDowns said...

Tim e.: Thanks, I hope your history tables go well.

We will not likely agree on the issue of triggers vs application code, unless you use a data dictionary to generate the code. Without a dictionary, you are violating DRY, and practically speaking it will be impossible to get them all right (or at least I've never seen it successfully done, even by smart and motivated people). And if you're going to use a dictionary to generate the code, why not just generate it once instead of multiple times?

John "Z-Bo" Zabroski said...

Programmers who disagree with the data dictionary approach need to hear the story of the data dictionary told from the stand-point of model-driven software development and real-time object-oriented modeling (the object-oriented modeling pattern all OO systems gravitate towards).

I think you almost got your criticism of EAV right when you mention in the comments how people dynamically rebuild the EAV table and transform it into a more comprehensible materialized view. The reason they do this is related to the consequences of their actions. When you build an EAV data model, your column headers don't mean anything. It turns out people like looking at tables with meaningful column headers, and so even if they go the EAV route they will invariably materialize views that give them useful column headers.

KenDowns said...

Z-Bo: I think you are absolutely right about casting the dd in terms of MDA. Looks like we have a future essay topic...

Anonymous said...

what DELTA will you use for STRING columns?

KenDowns said...

Anonymous: I don't think there is a meaningful definition of DELTA for strings. I don't use them myself, just before and after where applicable.

Andi said...

Nice article!

But why not simply use SVN _if_ history is just needed to backup old data in case of errors etc. So if you just need access to _current_ data, SVN (or a MySQL dump) is a good idea, isn't it?

KenDowns said...

andi: thanks for the compliment. If I understand your question, examining a dump is an offline activity, which makes it more difficult. Also, you may miss intermediate states of the database between dumps.

Andi said...

I imagine calling SVN commands from server side, eg via PHP. If this is possible, I can commit a dump of the last state of a row, which is going to change now. The benefit is that I use all SVN stuff (delta etc.) for my database data. So I can easily get the last status with one command - to reset it I just execute the dump;) The downside could be that it is not very performant to commit a new file for every row - but you could swap this process to another thread - so MySQL will not decelerate.

John "Z-Bo" Zabroski said...

andi,

I think you are thinking like an application developer instead of a database administrator.

A systems developer (like a kernel developer), on the other hand, would tell you that you are using the wrong tool for the job. For instance, a kernel developer would never write a version control system specifically to highlight a newly created function between two commits. Instead, they'd just store the difference in the storage bits.

By the way, SQLServer2008 provides nifty support for monitoring changes to tables, including changes made by users with SA and/or DBO privileges. SQL DBMSes are just getting better and better these days.

John "Z-Bo" Zabroski said...

Also, one point Ken did not cover: You should definitely "unit test" your stored procedure that writes to the history table. The consequences of going even a day where you didn't audit something are unthinkable, but possible, especially when you upgrade databases and your database vendor re-implements how its Grant/Deny/Revoke security model is defined between database objects.

KenDowns said...

Z-Bo: "I think you are thinking like an application developer..." That's why it's called The Database "programmer".

Andi said...

Sure, I was just looking for a tool that does the right thing for me ;) I did not find it in MySQL so I just was "scanning" the tools I know and found (the diff implementation of) SVN. I don't want to rewrite the diff code for myself - and now I have found xdiff and its PHP implementation which can do the job for me.

Take a look @ http://php.net/manual/en/ref.xdiff.php

Rahul said...

Hi,

How about a row based approach where you track changes to each column in a separate row in the audit/history table?

So the audit table can look like

Id | Column name | old value | action | date

To avoid the table from becoming too large, we can delete history older than X days.

Anonymous said...

One thing to add.

In Oracle 8i and above you have analytical functions which you can use to avoid the redundant delta_column in your history table.
With analytical functions you don't need to self-join the history table.

I hope that these function will be implemented into several other database products, because they rock.

Keep up the good work.

Oliver

ericdes said...

In another post you wrote about putting historical and current data together into the same table. Isn't that equivalent to actually duplicating an entire row, which you qualify as naive? In my project I would like to track the changes of address; there is no computation involved (i.e. no delta), so what would be your approach?

KenDowns said...

@ericdes: Let me clarify. In transaction tables, like orders, I will keep the "old" rows in the table, instead of creating tables like ORDERS_HISTORY and moving the rows when they are closed. This is what I mean by keeping old and new in the same table.

The history tables discussed here are for tracking changes. When you are tracking something like changes to address fields, you just decide to save either the old or the new, you don't need both.

Craig S. said...

There is a different kind of history table that is not mentioned here. I am in the process of building an e-voting application (nothing "presidential" in nature...). The members of the organization, at one point, may no longer be members. It would be nice to delete these members after a period of time, but the by-laws of the organization state that committee nominations and elections history must be kept for 5 years. Unfortunately, at this point, we are using the standard ASP.NET membership provider schema. It would seem my "history" transaction tables would need to store all pertinent information regarding committee election history and committee assignments in order to "reconstruct" the past. Would you agree?

KenDowns said...

Craig S: I would always agree that if the data is required it must be saved :)

In terms of patterns, it seems you do not need so much an audit trail of changes as a permanent record of transactions (votes), which can survive deletion of parent records (the voters).

The simplest approach is to put some type of membership flag or perhaps termination date on the members table so that you keep membership information perpetually, but only accept votes from active members.

Another approach, which is technically a Data Warehouse approach, but which can be merged into a smaller app w/o difficulty is to capture the relevant member info and redundantly store it as part of the vote itself, so that the votes table need no longer be a child of the members table.

Brent said...

I'm trying to puzzle through tracking changes to fields with different datatypes. For example, if your table contained three fields -- integer, datetime, and mediumtext -- and you wanted to track changes to these, would you use the largest "common denominator" datatype for the new and old value fields? In this case, I think the only datatype that could store all three is the mediumtext.

KenDowns said...

Brent: for dates/times, use an interval. In MS SQL Server make it an int (or bigint if you anticipate large differences) and calculate as datediff(). For char/text the delta concept is a "diff", but that would usually be done by some outside tool.

Anonymous said...

I find the delta approach to go bananas quite quickly. How do you save a delta of average, min, max, clear etc?
Text also suffers from no obvious delta.

In my experience it's easier to just store the previous value and the new value. This also doesn't require you to iterate through all rows to find the latest version of a value.

Anonymous said...

Hello,

While your approach for 'mostly numeric data' is of some interest, I believe you are missing a lot on what a 'complete copy' history table can bring you.

For instance, in my software, I can easily give you the state of a group of items at any given time, whereas if I were using your method, I would need to compute that state first, and while that looks easy on the numeric side, it would be quite hectic on the other side, scanning through every 'older' row to find what was the last update (and thus the actual value at that time) on each and every column.

You speak about hard / impossible join of the history table to itself, let me tell you it's time you either change DBMS or learn more SQL because I can't quite see how you get that idea.

Among the other comments, implementing history in code is good for cross-dbms portability, but it cannot in any way be compared with triggers on the reliability/speed side - so why do it ?

Once again, to all those who think E A V is better, trust me just try to rebuild the given state of a set of items and relations (let's say 10 joins to remain in reasonable proportions) in a flash and you'll quickly reconsider the approach ;)

So yes, E A V is good for some (very very simple) cases but it does NOT replace (even though this means a lot of data duplication) a full copy history. (As said above, just give me your SQL query for the 10 joins, mostly non numeric fields set of items at any given time and you'll see how impractical it can be)


L.

Anonymous said...

Forgot a small detail:
Who cares about data duplication when the world is (eventually) turning to all dedupe ;)

L.

Anonymous said...

I definitely have a bit of a problem with the whole "delta" thing. Not sure where the term came from, but the majority of us would just call it what it is: a CALCULATED FIELD in a table - something that truly should not be done.

It is somewhat absurd to have a column that shows A=15, another that shows B=20, and then feel you must have another one that calculates C=(20-15)=5.

Store data in fields, do calculations in queries. The principle has outlasted all of us and continues.

Giving it a new and chic name like Delta doesn't change anything. Don't use calculated fields in tables = DB development 101

Highwaybird said...

I see that you put a computed field for balance in the table even though we can calculate it
