Sunday, January 18, 2009

The Data Dictionary and Calculations, Part 1

The stunning power of a data dictionary comes into play once the dictionary contains formulas for calculated values. The dictionary can then be used to generate code, and also to generate documentation. This double-win is not available without the calculations because the resulting docs and database would be incomplete, requiring tedious and error-prone manual completion.

There are links to related essays on normalization and denormalization at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Calculations and Normalization

Before I begin, I will point out that all calculated values stored in a database are denormalizing, they all introduce redundancies. This does not mean they are bad, it just means you need a way to make sure they stay correct (see Keeping Denormalized Values Correct, also see Normalization and Automation). If you cannot keep them correct, they will get very bad very fast. This essay will show you one approach to ensuring calculated values are always correct.

However, before I start, I have to point out how important it is to begin by normalizing your database (to at least 3NF) and adding calculations only upon the strong foundation of a normalized database. If you do not normalize first, you will discover that it is impossible to work up formulas that make any sense -- values will always seem to be not quite where you need them, and it will always seem you need one more kind of calculation to support, and it will be very difficult to write the code generator that gives strong results. But if you build on a normalized database, it turns out you only need a few features in your dictionary and your code generator.

Use Denormalization Patterns

Once you have normalized your database, you will find that your calculations all fall into three basic categories (detailed in April 2008 in Denormalization Patterns). These three patterns are:

  • FETCH operations, like copying an item's price from the ITEMS table to the ORDERLINES table.
  • EXTEND operations, which are calculations within a row, such as assigning EXTENDED_PRICE the value of QUANTITY * PRICE.
  • AGGREGATE operations, like a SUM of the lines of an order to the order header.

This week we will look at the first type of operations, the FETCH operations.

Putting the FETCH Into Your Data Dictionary

So we have an ORDERLINES table, and it contains a PRICE column, and the value of that column should be copied from the ITEMS table. This is an extremely common operation in most database applications, so we decide it would be really cool if we could specify that in the data dictionary and have the code generator take care of it. This would chop a lot of labor off the development process.

Here is how a column like this would appear in my own dictionary format:

table orderlines:
    description: Order Lines
    module: sales
    
    column price:
        automation_id: fetch
        auto_formula: items.price
    ...more columns...

This looks nice, I have put the formula for the PRICE column into the data dictionary. Now of course I need that formula to get out into the application somehow so that it will always be executed and will never be violated. We will now see how to do that.

The Trigger Approach

When it comes to code generators, if there are ten programmers in a room, there are going to be at least 10 opinions on how to write and use a code generator (the non-programmer boss will also have an opinion, so that makes 11). I have no interest in bashing anybody's approach or trying to list all of the possibilities, so I will stick with the approach I use myself, which is to generate database trigger code. If you want to know why that approach works for me, check out Triggers, Encapsulation and Composition.

When I work on code generators, I begin by manually coding an example of what I'm getting at, so I know it works. The trigger snippet we are looking for must do two things. It must make sure the price is always copied, and it must make sure that no user can subvert the value. This snippet (which is in the PostgreSQL flavor of server-side SQL) does this on an insert:

-- top of trigger....

    -- PART 1: Prevent users from subverting the
    --         the formula by throwing error if they
    --         try to supply a value:
    IF new.price IS NOT NULL THEN 
        ErrorCount = ErrorCount + 1; 
        ErrorList = ErrorList || 'price,5001,may not be explicitly assigned;';
    END IF;
    
    -- PART 2: If the value of SKU exists, use it to look
    --         up the price and copy it into the new row
    IF new.sku IS NOT NULL THEN 
      SELECT INTO new.price par.price
        FROM items par 
       WHERE new.sku = par.sku ;
    END IF;

-- more trigger stuff

NOTE! You may notice my trigger code somehow seems to "know" to use the SKU column when searching the ITEMS table, yet my formula did not specify that. I am assuming your data dictionary contains definitions of primary keys and foreign keys, otherwise it is of no real use. I am further assuming that when I see the formula to "FETCH" from the ITEMS table, I can look up the foreign key that matches ORDERLINES to ITEMS and find out what column(s) to use.

The example above works on INSERT operations only. You need a slightly different version for updates, which throws an error if the user attempts to change the price, and which does a new FETCH if the user has changed the SKU value.

    IF new.price <> old.price THEN 
        ErrorCount = ErrorCount + 1; 
        ErrorList = ErrorList || 'price,5001,may not be explicitly assigned;';
    END IF;
    IF  coalesce(new.sku,'') <> coalesce(old.sku,'')  THEN 
       SELECT INTO new.price par.price
         FROM items par WHERE new.sku = par.sku ;
    END IF;

Sidebar: A Complete Trigger

If you want a teaser on how many amazing things the trigger can do once you've loaded up your dictionary and builder with features, here is a bit of code from a demo application. Most everything in it will get treated in this series on the data dictionary.

CREATE OR REPLACE FUNCTION orderlines_upd_bef_r_f()
  RETURNS trigger AS
$BODY$
DECLARE
    NotifyList text = '';
    ErrorList text = '';
    ErrorCount int = 0;
    AnyInt int;
    AnyInt2 int;
    AnyRow RECORD;
    AnyChar varchar;
    AnyChar2 varchar;
    AnyChar3 varchar;
    AnyChar4 varchar;
BEGIN
    SET search_path TO public;


    -- 1010 sequence validation
    IF (new.recnum_ol <> old.recnum_ol)  THEN 
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'recnum_ol,3002, may not be re-assigned;';
    END IF;

    -- 1010 sequence validation
    IF (new.skey <> old.skey)  THEN 
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'skey,3002, may not be re-assigned;';
    END IF;

    -- 3100 PK Change Validation
    IF new.recnum_ol <> old.recnum_ol THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'recnum_ol,1003,Cannot change value;';
    END IF;
    -- 3100 END


    IF new.flag_taxable <> old.flag_taxable THEN 
        ErrorCount = ErrorCount + 1; 
        ErrorList = ErrorList || 'flag_taxable,5001,may not be explicitly assigned;';
    END IF;

    IF new.price <> old.price THEN 
        ErrorCount = ErrorCount + 1; 
        ErrorList = ErrorList || 'price,5001,may not be explicitly assigned;';
    END IF;
   IF  coalesce(new.sku,'') <> coalesce(old.sku,'')  THEN 
       SELECT INTO new.flag_taxable
                   ,new.price
                   par.flag_taxable
                   ,par.price
         FROM items par WHERE new.sku = par.sku ;
   END IF;

    -- 5000 Extended Columns
    IF new.amt_retail <> old.amt_retail THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'amt_retail,5002,Cannot assign value directly to column amt_retail ;';
    ELSE 
        new.amt_retail =  CASE WHEN  1 = 1  THEN new.price*new.qty        ELSE 0 END ;
    END IF;


    IF new.pct99_discount <> old.pct99_discount THEN 
       IF new.pct99_discount <> (SELECT par.pct99_discount FROM orders par WHERE new.recnum_ord = par.recnum_ord ) THEN 
            ErrorCount = ErrorCount + 1; 
            ErrorList = ErrorList || 'pct99_discount,5001,may not be explicitly assigned;';
       END IF;
    END IF;

    IF new.taxauth <> old.taxauth THEN 
       IF new.taxauth <> (SELECT par.taxauth FROM orders par WHERE new.recnum_ord = par.recnum_ord ) THEN 
            ErrorCount = ErrorCount + 1; 
            ErrorList = ErrorList || 'taxauth,5001,may not be explicitly assigned;';
       END IF;
    END IF;

    IF new.taxpct <> old.taxpct THEN 
       IF new.taxpct <> (SELECT par.taxpct FROM orders par WHERE new.recnum_ord = par.recnum_ord ) THEN 
            ErrorCount = ErrorCount + 1; 
            ErrorList = ErrorList || 'taxpct,5001,may not be explicitly assigned;';
       END IF;
    END IF;
   IF  coalesce(new.recnum_ord,0) <> coalesce(old.recnum_ord,0)  THEN 
       SELECT INTO new.pct99_discount
                   ,new.taxauth
                   ,new.taxpct
                   par.pct99_discount
                   ,par.taxauth
                   ,par.taxpct
         FROM orders par WHERE new.recnum_ord = par.recnum_ord ;
   END IF;

    -- 5000 Extended Columns
    IF new.amt_discount <> old.amt_discount THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'amt_discount,5002,Cannot assign value directly to column amt_discount ;';
    ELSE 
        new.amt_discount =  CASE WHEN  1 = 1  THEN new.amt_retail*new.pct99_discount*.01        ELSE 0 END ;
    END IF;

    -- 5000 Extended Columns
    IF new.amt_net <> old.amt_net THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'amt_net,5002,Cannot assign value directly to column amt_net ;';
    ELSE 
        new.amt_net =  CASE WHEN  1 = 1  THEN new.amt_retail-new.amt_discount        ELSE 0 END ;
    END IF;

    -- 5000 Extended Columns
    IF new.amt_tax <> old.amt_tax THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'amt_tax,5002,Cannot assign value directly to column amt_tax ;';
    ELSE 
        new.amt_tax =  CASE WHEN new.flag_taxable = 'Y' THEN new.amt_net*new.taxpct*.01        ELSE 0 END ;
    END IF;

    -- 5000 Extended Columns
    IF new.amt_due <> old.amt_due THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'amt_due,5002,Cannot assign value directly to column amt_due ;';
    ELSE 
        new.amt_due =  CASE WHEN  1 = 1  THEN new.amt_net+new.amt_tax        ELSE 0 END ;
    END IF;

    -- 7010 Column Constraint
    new.flag_taxable = UPPER(new.flag_taxable);
    IF NOT (new.flag_taxable  IN ('Y','N')) THEN 
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'new.flag_taxable,6001,Column -Taxable- can be either Y or N;';
    END IF;

    -- 8001 Insert/Update Child Validation: NOT NULL
    IF new.sku IS NULL THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'sku,1005,Required Value;';
    END IF;
    -- 8001 FK Insert/Update Child Validation
    IF new.sku IS NULL THEN
        --Error was reported above, not reported again
        --ErrorCount = ErrorCount + 1;
        --ErrorList = ErrorList || '*,1005,Foreign key columns may not be null: sku;';
    ELSE
        -- LOCK TABLE items IN EXCLUSIVE MODE;
        SELECT INTO AnyInt COUNT(*) FROM items par 
            WHERE par.sku = new.sku;
        IF AnyInt= 0 THEN

            ErrorCount = ErrorCount + 1;
            ErrorList = ErrorList || 'sku,1006,Please Select Valid Value: ' || new.sku::varchar || ';';
        END IF;
    END IF;

    -- 8001 Insert/Update Child Validation: NOT NULL
    IF new.recnum_ord IS NULL THEN
        ErrorCount = ErrorCount + 1;
        ErrorList = ErrorList || 'recnum_ord,1005,Required Value;';
    END IF;
    -- 8001 FK Insert/Update Child Validation
    IF new.recnum_ord IS NULL THEN
        --Error was reported above, not reported again
        --ErrorCount = ErrorCount + 1;
        --ErrorList = ErrorList || '*,1005,Foreign key columns may not be null: recnum_ord;';
    ELSE
        -- LOCK TABLE orders IN EXCLUSIVE MODE;
        SELECT INTO AnyInt COUNT(*) FROM orders par 
            WHERE par.recnum_ord = new.recnum_ord;
        IF AnyInt= 0 THEN

            ErrorCount = ErrorCount + 1;
            ErrorList = ErrorList || 'recnum_ord,1006,Please Select Valid Value: ' || new.recnum_ord::varchar || ';';
        END IF;
    END IF;

    IF ErrorCount > 0 THEN
        RAISE EXCEPTION '%',ErrorList;
        RETURN null;
    ELSE
        IF NotifyList <> '' THEN 
             RAISE NOTICE '%',NotifyList;
        END IF; 
        RETURN new;
    END IF;
END; $BODY$
  LANGUAGE 'plpgsql' VOLATILE SECURITY DEFINER
  COST 100;
ALTER FUNCTION orderlines_upd_bef_r_f() OWNER TO postgresql;

Variatons on FETCH

I have found two variations on FETCH that have proven very useful in real world applications.

The first I call DISTRIBUTE. It is dangerous because it can be a real performance killer, and turns out you very rarely need it. However, that being said, sometimes you want to copy a value from a parent table down to every row in a child table when the value changes in the parent. The first time I did this was to copy the final score from a GAMES table into a WAGERS table on a fake sports betting site.

The other variation I have found useful is FETCHDEF, my shorthand for "fetch by default." In this variation the user is free to supply a value of their own, but if they do not supply a value then it will be fetched for them.

The Code Generator Itself

As for writing the code generator itself, that is of course far more than I can cover in one blog entry or even 10. Morever, since anybody who decides to do so will do so in their own language and in their own style, there is little to be gained by showing code examples here.

Conclusion: Expand Your Dictionary!

If you make up a data dictionary that only contains structure information like columns and keys, and you write a builder program to build your database, you can get a big win on upgrades and installs. However, you can take that win much farther by adding calculated values to your database and expanding your builder to write trigger code. This week we have seen what it looks like to implement a FETCH calculation in your dictionary and what the resulting trigger code might look like.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

The normalization essays on this blog are:

3 comments:

Anonymous said...

It would be helpful if the author provides a working example of the logic above for a popular database such as Northwind (MS SQL Server) in the corresponding flavour of the SQL language. Otherwise it sounds a bit academic and ... not very clear.

But thanks for the effort anyway.

Regards,
Sergei.

Jorge Monasterio said...

I miss this blog! More, please! :)

Donald J Organ IV said...

Sorry Jorge, Unfortunately the blogger is no longer updating this blog.