Monday, August 25, 2008

Advanced Algorithm: Sequencing Dependencies

Some database applications require you to perform a series of actions where you know only that some actions must be performed before others. Before you can perform the actions, you must work out a safe sequence that takes into account all of the dependencies. This week in The Database Programmer we will see an algorithm for doing this.

Examples

There are many examples where a programmer must work out dependencies before doing something.

A manufacturing package may track many steps in the manufacture of an item. Some steps cannot be performed until others are complete. A simple system would require the end-user to work out the entire process, but a better system would let the user enter only the dependencies: which processes require others to be complete. In this kind of system the computer can be used to schedule manufacturing tasks.

All popular Linux distributions have a package installation system in which each package lists its required dependencies. If you want to install a large number of packages in one shot, producing a tangled bunch of related dependencies, today's algorithm can be used to work them all out.

If you are using a data dictionary to build tables, every foreign key represents a dependency, where the child table requires the parent table to exist before it can be built. Today's algorithm can be used to sequence the tables and build them in order.

Another database example is generating code to perform calculations. Some calculations will depend on previous calculations, so your code generator must be able to sequence them all so that the calculations are performed in the proper order.

Big Words: Directed Acyclic Graph

The examples abvoe are all cases of what mathematicians call a Directed Acyclic Graph. If you do not want to read the entire Wikipedia article, the main points are these:

  • We have a set of items. These can be anything you are keeping track of in your database.
  • Any item may be connected to zero or more other items.
  • The connection is one-way only. So if we say A requires B, we are not saying that B also requires A (in fact it is forbidden).
  • There can be no loops (cycles). If A requires B, B may not require A. Further, if A requires B, and B requires C, C may not require A.

Whenever I can, I like to point out that it is very useful to read up on the mathematical foundations of certain programming techniques. We can often pick up very useful insights from those who think of these things at the most abstract level. It is also much easier to get advice from the more abstract-minded database people if you are at least marginally familiar with the mathematical terms.

The Tables

So now let us proceed to the tables and the code. The tables below show a data dictionary that will be used to generate DDL to build a database:

Table: TABLES

TABLE       | DESCRIPTION            | SEQUENCE
------------+------------------------+---------
ORDERS      | Sales Orders Headers   |  ?
ORDER_LINES | Sales order lines      |  ?
CUSTOMERS   | Customers              |  ?
ITEMS       | Items                  |  ?


Table: DEPENDENCIES

CHILD_TABLE  | PARENT_TABLE
-------------+---------------
ORDERS       | CUSTOMERS
ORDER_LINES  | ORDERS
ORDER_LINES  | ITEMS

The problem here is knowing the safe order in which to build the tables. If I try to build ORDER_LINES before I have built ITEMS, then I cannot put a foreign key onto ORDER_LINES, because ITEMS is not there. In short, I need to know the value of the SEQUENCE column in the example above.

The Expected Answer

The example above is simple enough that we can work it out by hand. This is actually a good idea, because we want to get an idea of what the answer will look like:

TABLE       | DESCRIPTION            | SEQUENCE
------------+------------------------+---------
ORDERS      | Sales Orders Headers   |  1
ORDER_LINES | Sales order lines      |  2
CUSTOMERS   | Customers              |  0
ITEMS       | Items                  |  0

This answer should be self-explanatory, except maybe for the fact that both CUSTOMERS and ITEMS have the same value. We need to look at that before we can see the code that produces it. Is it OK that two entries have the same value, and how would our program handle that?

The short answer is that it is perfectly OK and natural for two or more entries to have the same value. All this means is that they can be done in any order relative to each other, so long as they are done before the other entries.

In terms of the example, where we want to build these tables in a database, it means that:

  • We would query the list of tables and sort by SEQUENCE
  • We would loop through and build each table
  • We don't care about ITEMS and CUSTOMERS having the same value, they get built in whatever which-way the server gives us the list.

The same concept applies to the other potential examples: manufacturing, software packages, and generating calculations. So long as you follow the sequence, we don't care about items that have the same value.

Stating the Solution in Plain English

We are now ready to work out a program that will generate the SEQUENCE column. The basic steps the program must perform are:

  1. Initialize the column to -1. A value of -1 means "Not sequenced."
  2. Update the column to zero for all items that have no dependencies.
  3. Repeat the following action until the affected rows are zero: Update the SEQUENCE column to 1 (then 2, then 3) for all rows that have all of their dependencies sequenced already.
  4. Once the command in step 3 is no longer affecting any rows, check for any rows that have -1, these are involved in circular dependencies and we cannot proceed until the user straightens them out.

Stating the Solution in Code

The first step is very easy, we initialize the table with this command:

UPDATE TABLES SET SEQUENCE = -1;

The next step is also very easy, we mark with a '0' all of the tables that have no dependencies. The basic idea is to find all of the entries that have no entries in DEPENDENCIES.

UPDATE TABLES SET SEQUENCE = 0
 WHERE NOT EXISTS (SELECT child FROM DEPENDENCIES
                    WHERE child = TABLES.TABLE)

Now for the hard part. We now have to execute a loop. On each pass of the loop we are looking for all items whose dependencies have all been sequenced. We will do this over and over until the command is not affecting any rows. It is important that we cannot exit the loop by testing if all rows are sequenced, because a circular dependency will prevent this from happening and we will have an infinite loop.

You can control this loop from client code, but I wrote mine as a Postgres stored procedure. This algorithm turns out to be surprisingly complicated. The UPDATE command below may not be all that self-explanatory. What it works out is:

  • Get a list of child tables from the DEPENDENCIES table
  • JOIN through to TABLES to look at the SEQUENCE value of their parents.
  • Group and check that the minimum value is greater than zero, if it is it means all parents are sequenced and the table can be sequenced.
  • Update the SEQUENCE value for the tables we found
CREATE OR REPLACE FUNCTION zdd.Table_Sequencer() RETURNS void AS
$BODY$
DECLARE
    -- Note that rowcount is initialized to be > 0, this makes
    -- the loop work properly
    rowcount integer := 1;
    
    -- This tracks the value we are assigning to SEQUENCE.  We
    -- initialize it to 1 because we already took care of the
    -- the rows that have value 0
    lnSeq integer := 1;
BEGIN
    while rowcount > 0 LOOP
        UPDATE tables set SEQUENCE = lnSeq
          FROM (SELECT t1.CHILD 
                  FROM DEPENDENCIES t1 
                  JOIN TABLES       t2 ON t1.PARENT = t2.TABLE
                 GROUP BY t1.CHILD
                HAVING MIN(t2.SEQUENCE) >= 0
                ) fins
          WHERE TABLES.TABLE = fins.CHILD
            AND TABLES.SEQUENCE = -1;

  lnSeq := lnSeq + 1;
  GET DIAGNOSTICS rowcount = ROW_COUNT;
 END LOOP;
 
 RETURN;
END;
$BODY$
LANGUAGE plpgsql;

The stored procedure above will stop executing once the UPDATE command is no longer having any effect. Once that happens, your final step is to make sure that all rows have a valid SEQUENCE value, which is to say that no entry has SEQUENCE of -1. If any of the rows have that value then you have a circular dependency. You must report those rows to the user, and you can also report the dependencies that are causing the loop.

Conclusion

Sequencing dependencies is a fundamental algorithm that has a lot of use cases in database applications. It is easy enough to accomplish, but the innermost UPDATE command can be a little puzzling when you first look at it. Once you have mastered this algorithm you are on the way to the "big leagues" of database applications such as ERP, MRP and others.

Next Essay: Secure Password Resets

3 comments:

Robert Young said...

The Postgres folk have been promising support for Common Table Expressions for a while now. When (if) CTEs are supported, this exercise becomes very much easier.

KenDowns said...

Robert, it looks like they are also saying "With Recursive" will be in 8.4, which is also very exciting.

Troutinator said...

Very interesting post. I've been reading them through from the beginning.

I know this is an old post, but I think plenty of people are still reading it. That said, I have small suggestion. In step 4 of "Stating the Solution in Plain English", the -1 is breaking on the dash which confused me on my first read. Might I suggest wrapping that in the <nobr> tag like so, <nobr>-1</nobr>, this will prevent the break.