Sunday, July 20, 2008

History Tables

A history table allows you to use one table to track changes in another table. While the basic idea is simple, a naive implementation will lead to bloat and will be difficult to query. A more sophisticated approach allows easier queries and can produce not just information about single rows, but can also support aggregrate company-wide queries.

This week in the Database Programmer Blog we return to table design patterns with an essay on history tables. The basic premise of this blog is that good coding skills do not lead magically to good database skills -- you can only make optimal use of a database by understanding it on its own terms. There is a new essay each Monday, and there is a Complete Table of Contents and a List of Table Design Patterns.

What to Put Into A History Table

Naive approaches to history tables usually involve making a complete copy of the original (or new) row when something changes in the source table. This turns out to be of little use, for reasons I will explain below. A much more useful approach is to track only a few columns and to store any combination of old values, new values, and differences. A history table designed this way can be tremendously useful.

We will start with the example of a sales order table, called ORDERS. The columns we are interested in might look like this:

ORDER | CUSTOMER | DATE     | LINES  |   TAX |  TOTAL |   PAID | BALANCE
------+----------+----------+--------+-------+--------+--------+---------
1234  | 9876     | 5/1/08   |  48.00 |  5.00 |  53.00 |      0 |   53.00
2345  | 9876     | 5/3/08   | 150.00 |     0 | 150.00 | 150.00 |       0
3456  | 5544     | 6/8/08   |  25.00 |  2.60 |  27.60 |  15.00 |   12.60
4567  | 3377     | 7/3/08   | 125.00 |  7.00 | 132.00 |  50.00 |   82.00

We first have to ask which columns must be copied into history so that we can link the history table back to the ORDERS table. The only column we need for tracking is ORDER (the order number), so the history table will always have an ORDER column.

We should also assume that the history table will contain at least a timestamp and a column to track the user who made the change, which brings us to a minimum of three columns.

Finally, it tends to be very useful to track what action caused the history entry, be it an INSERT, UPDATE, or DELETE. This brings us up to four minimum columns.

Next we ask which columns we will definitely not need. There are two groups of columns we will not need, which are 1) the columns that never change and 2) the columns we do not care about. Columns that do not change are likely to be the CUSTOMER and the DATE column. There is no need to bloat the history table with these valus because we can just get them out of the ORDERS table. The second group, columns we do not care about, are are usually things like ship-to address, maybe an email, and other information. Naturally there is no hard-and-fast rule here, it depends entirely upon the needs of the application.

So now we know what we definitely need and what we definitely do not need, and we are ready to begin work considering the columns that will change. Not surprisingly, these are usually all about the numbers. Next we will see how to track the numbers.

Tracking Changes to Numbers

While it is certainly useful to store one or both of the old and new values for a number, it far more useful to store the change in the value, or the delta. Having this number in the history table makes for some really nifty abilities. If you store all three of the old, new, and delta, then you can more or less find out anything about the ORDER's history with very simple queries.

So we are now ready to consider what the history table might look like. We will take the case of an order that was entered by user 'sax', updated twice by two other users, and in the end it was deleted by user 'anne'. Our first stab at the history table might look like this:

ORDER | USER_ID  | ACTION | DATE    | LINES_OLD | LINES_NEW | LINES_DELTA 
------+----------+--------+---------+-----------+-----------+-------------
1234  | sax      | UPD    | 5/1/08  |      0.00 |     48.00 |       48.00
1234  | arkady   | UPD    | 5/7/08  |     48.00 |     58.00 |       10.00
1234  | ralph    | UPD    | 6/1/08  |     58.00 |     25.00 |      -33.00
1234  | anne     | DEL    | 6/4/08  |     25.00 |      0.00 |      -25.00

I should note that if you keep LINES_OLD and LINES_NEW, then strictly speaking you do not need the LINES_DELTA columns. Whether or not you put it in depends on your approach to table design. If you framework allows you to guarantee that it will be correct, then your queries will be that much simpler with the LINES_DELTA column present.

You may wonder why there is no entry for the original INSERT. This is because you must enter an order before you can enter the lines, so the original value will always be zero. Only when lines start going in does the ORDER get any numbers. This is true for header tables, but it would not be true for detail tables like ORDER_LINES_HISTORY.

Some of the Obvious Queries

There are few obvious queries that we can pull from the history table right away. These include the following:

-- Find the value of of the line items of an
-- order as of June 1st
SELECT LINES_NEW 
  FROM ORDERS_HISTORY
 WHERE ORDER = 1234
   AND DATE <= '2008-06-01'
 ORDER BY DATE DESC LIMIT 1;
 
-- Find the original value of the line items,
-- and the user who entered it.  
SELECT LINES_NEW, USER_ID
  FROM ORDERS_HISTORY
 WHERE ORDER = 1234
 ORDER BY date LIMIT 1;
   
-- Find the users who have worked on an order
SELECT DISTINCT USER_ID
  FROM ORDERS_HISTORY
 WHERE ORDER = 1234;

Most of queries should be pretty obvious, and there are plenty more that will suggest themselves once you start working with the history tables.

Queries Involving the Delta

The real power of the DELTA column comes into play when you are trying to compute back-dated values such as the company's total open balance on June 1, 2008. If you have a naive history table that stores only the old value or only the new value, this is truly a tortuous query to write, but if you have both then it is really quite easy.

-- Query to calculate the total open balance of all
-- orders as of a given date
SELECT SUM(BALANCE_DELTA) 
  FROM ORDERS_HISTORY
 WHERE DATE <= '2008-06-01';

This magical little query works because paid orders will "wash out" of the total. Consider an order that is entered on May 20 for $200.00, and is then paid on May 23rd. It will have +200 entry in the BALANCE_DELTA column, and then it will have a -200.00 entry 3 days later. It will contribute the grand sum of zero to the total.

But an order entered on May 25th that has not been paid by June 1st will have only a +200 entry in the BALANCE_DELTA column, so it will contribute the correct amount of $200.00 to the balance as of June 1st.

If the company owner wants a report of his total open balances on each of the past 30 days, you can retrieve two queries and build his report on the client:

-- Get begin balance at the beginning of the period
SELECT SUM(BALANCE_DELTA) as BEGIN_BALANCE
  FROM ORDERS_HISTORY
 WHERE DATE < '2008-06-01';

-- Get the total changes for each day.  When you
-- build the report on the client, add each day's
-- change amount to the prior day's balance
SELECT SUM(BALANCE_DELTA) AS BALANCE_DELTA
  FROM ORDERS_HISTORY
 WHERE DATE BETWEEN '2008-06-01' AND '2008-06-30'
 GROUP BY DATE;

Keeping History Tables Clean

A clean history table is one that contains no unnecessary information. You normally do not want entries going into the history table if nothing relevant changed. So your history table mechanism should examine the columns it is tracking, and only make an entry to the history table if one of the columns of interest actually changed.

Problems With The Naive History Table

A very basic history table will usually copy the entire original row from the source table into the history table whenever an INSERT, UPDATE or DELETE occurs. One simple problem is that you end up with bloated history tables. Because they are cluttered with unnecessary repititions, they are difficult to work with by inspection.

A much more serious technical problem with the naive approach is that it is horribly difficult to produce the queries demonstrated above. You must reproduce the concept of a delta by either running through all of the rows on the client, or you must make a difficult (and often impossible) JOIN of the history table to itself in which you connect each row to the row that came just before it. All I can say is, no thanks, I'll go with the delta.

History Table Security

History tables always involve some concept of auditing, that is, keeping track of user actions. This means we need to protect against deliberate falsification of the history tables, which leads to two rules. First, a user must have no ability to directly DELETE rows from the history table, or they could erase the record of changes. Second, the user must have no ability to directly INSERT or UPDATE existing rows, because if they could they can falsify the history. These rules apply to both regular users and system administrators, the administrator must have no privelege to subvert or manipulate the history.

Since history tables have a tendency to become seriously bloated, there must be some priveleged group that can DELETE from the history tables, which they would do as a periodic purge operation. This group should have no ability to UPDATE the tables, because such priveleges would open a potential hole for subverting the history. Regular system administrators should not be in this group, this should be a special group whose only purpose is to DELETE out of the history tables.

If you are making use of DELTA columns, then stricly speaking you do not want to purge, but compress history tables. If you want to purge out all entries in 2005, you must replace them with a single entry that contains a SUM of the DELTA columns for all of 2005.

So to sum up, we have the following security rules for a history table:

  • No system user should be able to DELETE from the history table.
  • No system user should be able to UPDATE the history table.
  • No system user should be able to directly control the INSERT into the history table.
  • A special group must be defined whose only ability is to DELETE from the history table, so that the tables can be purged (or compressed) from time to time.

Implementation

As always, you have your choice of implementing the history mechanism in the client code or in the database itself.

The best performing and most secure method is to implement history tables with triggers on the source table. This is the best way to implement both security and the actual business rules in one encapsulated object (the table). However, if you have no current practices for coding server-side routines, or you do not have a data dictionary system that will generate the code for you, then it may not be practical to go server-side for a single feature.

Implementing history tables in code has the usual benefit of keeping you in the language and habits you are most familiar with, but it means that you cannot allow access to your database except through your application. I cannot of course make a general rule here, this decision is best made by the design team based on the situation at hand and anticipated future needs.

Conclusion

History tables have many uses. Beyond the obvious first use of finding indidivual values at some point in the past, well crafted tables can produce company-wide aggregations like total open balances on a given day, changes in booked orders on a day or in a range of days, and many other queries along those lines. Security is very important to prevent history tables from being subverted.

NEXT ESSAY: Different Foreign Keys For Different Tables

267 comments:

«Oldest   ‹Older   401 – 267 of 267
«Oldest ‹Older   401 – 267 of 267   Newer› Newest»