Sunday, January 4, 2009

Dictionary Based Database Upgrades

The number one search term that brings people to this blog is "data dictionary." So this week I will begin a series on how to use the data dictionary to improve your own productivity and reduce errors.

Building And Upgrading

This week we are going to see how to use a data dictionary to eliminate upgrade scripts (mostly) and make for more efficient upgrades. The approach described here also works for installing a system from scratch, so an install and an upgrade become the same process.

The major problems with upgrade scripts are these:

  • They are the least-tested code in any system, and are the most likely to break. If a script breaks and anybody but the original programmer is running the upgrade, this leads to aborted upgrades and upset customers.
  • They are horribly inefficient when a customer upgrades after a long time: the same table may be rebuilt many times as script after script adds a column or two.

By contrast, a dictionary-based upgrade can take any customer from any version of your software and in the fewest steps possible bring them completely current, with no possibility of broken scripts.

But first, a quick review of an important idea...

Review: The Text File

Back in June of 2008 this blog featured an overview of the data dictionary. There were many ideas in that essay, but I wish to focus on one in particular: the question of where to put the data dictionary and in what format.

In terms of where to put it, the data dictionary should be just another application file, in source control, and delivered with your code. When the dictionary is in a plaintext file (or files) and treated like other application code, you do not have to invent any new methods for handling it, just use the same methods you use for the rest of your code.

In simple practical terms, it is best if a data dictionary can be easily read and written by both people and computers. This leads to a plaintext file or files in some format such as JSON or YAML. I personally prefer YAML because it is a superset of JSON, so using YAML gives me the option to sneak in JSON syntax later if I choose, but starting with JSON does not let me go the other way.

Requiring easy handling by people tends to rule out XML, which is bletcherous to work with manually (in my humble opinion). Requiring readability by the computer rules out UML unless your UML drawing tool can produce a usable data file (comments always welcome, tell us your favorite tool for doing this!). When considering UML, it is the class diagrams that are most likely to be translatable into a data dictionary.

Finally, encoding dictionary information in program class files technically meets the practical requirements listed above, but it has the disadvantage of trapping data in code, which unnecessarily couples your dictionary to whatever language you are using at the moment. It is much better if the dictionary sits outside of the code as pure data. Not to mention that spreading the dictionary out in a collection of one-class-per-table files makes it much harder to do upgrades in the way I am about to describe.

Review of Steps

When using a dictionary-based approach, you write some type of "builder" program that reads your dictionary file, examines the current structure of the database, and then generates SQL commands to alter and create tables to make them all current.

There are plenty of ways to do this. My own approach is to load the dictionary itself into tables, pull the current state into similar tables, and then do queries to find new and altered tables and columns. If you want to see a full-blown program of this type, check out androBuild.php, the Andromeda implementation of this idea. The routines that apply to today's topic include "RealityGet()", "Differences()", "Analyze()" and "PlanMake()".

Step 1: Load the Dictionary to RAM

To use the approach in this essay, you begin by parsing your plaintext file and loading it to tables. Here is a simple example of what a dictionary file might look like in YAML format:

table states:
    description: US States
    
    columns:
        state:
            type: char
            colprec: 2
            caption: State Code
            primary_key: "Y"
        description:
            type: varchar
            colprec: 25
            caption: State Name

If you are using PHP, you can parse this file using the spyc library, which converts the file into associative arrays. All or nearly all modern languages have a YAML parser; check out the YAML site to find yours.
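If you are not using PHP, the same idea carries over directly. Here is a hedged sketch in Python showing the kind of nested structure a YAML parser (such as PyYAML's safe_load) would produce for the file above, written as a literal so the sketch stands alone, and how a builder might walk it:

```python
# Hypothetical result of parsing the YAML dictionary above (e.g. with
# PyYAML's yaml.safe_load); shown as a literal so this sketch is
# self-contained.
dictionary = {
    "table states": {
        "description": "US States",
        "columns": {
            "state": {"type": "char", "colprec": 2,
                      "caption": "State Code", "primary_key": "Y"},
            "description": {"type": "varchar", "colprec": 25,
                            "caption": "State Name"},
        },
    },
}

# Walk the structure: strip the "table " prefix to get the table name,
# then iterate over its columns.
for key, spec in dictionary.items():
    table = key.split(" ", 1)[1]          # "table states" -> "states"
    for col, attrs in spec["columns"].items():
        print(table, col, attrs["type"], attrs.get("colprec"))
```

The key names here ("columns", "colprec", and so on) are just the ones used in the YAML example; your own dictionary format may differ.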

Step 2: Load the Dictionary To Tables

The database you are building should have some tables that you can use as a scratch area during the upgrade. You may say, "The builder gives me tables, but I need tables to run the builder, how can I do this?" The simplest way is to hardcode the creation of these tables. A more mature solution would use a separate dictionary file that just defines the dictionary tables.

The structure of the tables should match the data file, of course. Here is what the YAML above would look like after being loaded to tables:

TABLE   | DESCRIPTION
--------+--------------------------------
states  | US States

TABLE   |COLUMN      |CAPTION    |TYPE    |PRECISION  
--------+------------+-----------+--------+-----------
states  |state       |State Code |char    |2          
states  |description |State Name |varchar |25
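Loading the parsed structure into scratch tables is mechanical. Here is a minimal sketch using SQLite; the table and column names (tables_spec, columns_spec) are illustrative, not Andromeda's actual names:

```python
import sqlite3

# Scratch tables that mirror the dictionary file's structure.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tables_spec (table_id TEXT, description TEXT)")
con.execute("""CREATE TABLE columns_spec
               (table_id TEXT, column_id TEXT, caption TEXT,
                type TEXT, colprec INTEGER)""")

# The parsed dictionary, keyed by table name.
spec = {
    "states": {
        "description": "US States",
        "columns": {
            "state": {"type": "char", "colprec": 2, "caption": "State Code"},
            "description": {"type": "varchar", "colprec": 25,
                            "caption": "State Name"},
        },
    },
}

# Flatten the nested structure into rows.
for table, tdef in spec.items():
    con.execute("INSERT INTO tables_spec VALUES (?, ?)",
                (table, tdef["description"]))
    for col, c in tdef["columns"].items():
        con.execute("INSERT INTO columns_spec VALUES (?, ?, ?, ?, ?)",
                    (table, col, c["caption"], c["type"], c["colprec"]))

rows = con.execute("SELECT table_id, column_id FROM columns_spec").fetchall()
```

Once the rows are in tables, everything that follows is plain SQL.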

Step 3: Fetch The Current State

All modern databases support "information_schema", a schema inside each database that contains tables describing the structure of the database. While you can make queries directly against the information_schema tables, I prefer to fetch the information out of them into my own tables so that all column names are consistent with my own. A simple query to do this might look like:

-- Postgres-specific example of pulling info out of the
-- information_schema table:

insert into TABLES_NOW (table_id)  -- my dictionary table
SELECT table_name as table_id 
  FROM information_schema.tables  
 WHERE table_schema = 'public'
   AND table_type = 'BASE TABLE'

Pulling column information out can be much more complicated, owing to differences in how vendors implement information_schema, and owing to the complex way data is stored in it. Here is my own code to pull the definitions of columns from the Postgres information_schema; it also simplifies the definitions dramatically, to make my downstream coding easier:

insert into zdd.tabflat_r
       (table_id, column_id, formshort, colprec, colscale)
SELECT c.table_name, c.column_name,
       CASE WHEN POSITION('timestamp' IN data_type) > 0 THEN 'timestamp'
            WHEN POSITION('character varying' IN data_type) > 0 THEN 'varchar'
            WHEN POSITION('character' IN data_type) > 0 THEN 'char'
            WHEN POSITION('integer' IN data_type) > 0 THEN 'int'
            ELSE data_type END,
       CASE WHEN POSITION('character' IN data_type) > 0 THEN character_maximum_length
            WHEN POSITION('numeric'   IN data_type) > 0 THEN numeric_precision
            ELSE 0 END,
       CASE WHEN POSITION('numeric'   IN data_type) > 0 THEN numeric_scale
            ELSE 0 END
  FROM information_schema.columns c
  JOIN information_schema.tables t ON t.table_name = c.table_name
 WHERE t.table_schema = 'public'
   AND t.table_type   = 'BASE TABLE';
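The same step can be sketched outside of Postgres. SQLite, for instance, has no information_schema, but a rough analogue (hedged, with illustrative scratch-table names) uses sqlite_master and PRAGMA table_info to pull the current structure into TABLES_NOW / COLUMNS_NOW style tables:

```python
import sqlite3

# A database with one "real" application table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE states (state CHAR(2), description VARCHAR(25))")

# Scratch tables holding the database's *current* state.
con.execute("CREATE TABLE tables_now (table_id TEXT)")
con.execute("CREATE TABLE columns_now (table_id TEXT, column_id TEXT, type TEXT)")

# sqlite_master plays the role of information_schema.tables here;
# PRAGMA table_info plays the role of information_schema.columns.
for (tbl,) in con.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "AND name NOT IN ('tables_now', 'columns_now')").fetchall():
    con.execute("INSERT INTO tables_now VALUES (?)", (tbl,))
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt, pk).
    for _, col, coltype, *_ in con.execute(f"PRAGMA table_info({tbl})"):
        con.execute("INSERT INTO columns_now VALUES (?, ?, ?)",
                    (tbl, col, coltype))

now = con.execute("SELECT table_id, column_id FROM columns_now").fetchall()
```

Whatever the vendor mechanism, the point is the same: reality ends up in tables shaped like your spec tables, so the diff is a query.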

Step 4: The Magic Diff

Now we can see how the magic happens. Imagine you have 20 tables in your application, and in the past week you have modified 5 of them and added two more. You want to upgrade your demo site, so what is the next step for the builder?

The builder must now do a "diff" between your dictionary and the actual state of the database, looking for:

  • Completely new tables.
  • New columns in existing tables.

Let's say you have two tables: "TABLES_SPEC", which lists the tables in your application, and "TABLES_NOW", which lists the tables in your database. The following query will give you a list of new tables:

SELECT spec.table_id
  FROM TABLES_SPEC spec 
 WHERE NOT EXISTS (
       SELECT table_id from TABLES_NOW now
        WHERE now.table_id = spec.table_id
       )

It is now a simple thing to pull the column definitions for each new table and generate some DDL to create the tables.

But we also have tables that have new columns. We can pull those out like so:

SELECT * from COLUMNS_SPEC spec
 -- this first subquery finds new columns
 WHERE not exists (
       SELECT table_id FROM COLUMNS_NOW now
        WHERE spec.table_id = now.table_id
          AND spec.column_id= now.column_id
       )
   -- this second subquery makes sure we are
   -- getting only existing tables
   AND EXISTS (
       SELECT table_id from TABLES_NOW now
        WHERE now.table_id = spec.table_id
       )

Now again it is a simple matter to generate DDL commands that add all of the new columns to each table. Some databases allow multiple columns to be added in one statement, while others require one ALTER TABLE per new column (really horrible when you have to do that).
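The whole "magic diff" can be put together in a few lines. Here is a self-contained sketch in Python on SQLite, using the same two queries shown above; the scratch-table names are the illustrative ones from this essay, not a real builder's schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tables_spec  (table_id TEXT);
CREATE TABLE tables_now   (table_id TEXT);
CREATE TABLE columns_spec (table_id TEXT, column_id TEXT, type TEXT, colprec INT);
CREATE TABLE columns_now  (table_id TEXT, column_id TEXT);

-- The spec knows two tables; the database only has 'states',
-- and 'states' is missing the new 'capital' column.
INSERT INTO tables_spec VALUES ('states'), ('countries');
INSERT INTO tables_now  VALUES ('states');
INSERT INTO columns_spec VALUES
  ('states',    'state',       'char',    2),
  ('states',    'description', 'varchar', 25),
  ('states',    'capital',     'varchar', 25),
  ('countries', 'country',     'char',    2);
INSERT INTO columns_now VALUES
  ('states', 'state'), ('states', 'description');
""")

# Completely new tables: in spec, not in reality.
new_tables = [r[0] for r in con.execute("""
    SELECT table_id FROM tables_spec spec
     WHERE NOT EXISTS (SELECT 1 FROM tables_now n
                        WHERE n.table_id = spec.table_id)""")]

# New columns in existing tables.
new_columns = con.execute("""
    SELECT table_id, column_id, type, colprec FROM columns_spec spec
     WHERE NOT EXISTS (SELECT 1 FROM columns_now n
                        WHERE n.table_id  = spec.table_id
                          AND n.column_id = spec.column_id)
       AND EXISTS (SELECT 1 FROM tables_now n
                    WHERE n.table_id = spec.table_id)""").fetchall()

ddl = [f"ALTER TABLE {t} ADD COLUMN {c} {ty}({p})"
       for t, c, ty, p in new_columns]
```

Note how the 'countries' columns do not appear in new_columns: that table is brand new, so it falls under table creation rather than ALTER TABLE.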

Please note this is sample code only, just to give you ideas, and it will not cover every case.

Sidebar: Avoid Destructive Actions

Do not rush into writing code that drops columns or tables that are not in your spec. The results of a misstep can be disastrous (as in lose your job or your customer). My own builder code is now 4 1/2 years old and I have never yet bothered to write a "destructive" upgrade that will clean out unused tables and columns. Maybe someday...

What I Left out: Validation

There was no space in this essay to discuss a very important topic: validating the spec changes. It may be that a programmer has done something nasty like change a column type from character to integer. Most databases will fail attempting to alter the column because they don't know how to convert the data. Your builder program can trap these events by validating the upgrade before any changes are made. This will be treated fully in a future essay.
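To give a flavor of what that validation might look like, here is a tiny hedged sketch: before generating any ALTER statements, compare each column's spec type to its current type and flag changes the database probably cannot perform in place. The "allowed widening" rules are illustrative only, not a complete compatibility matrix:

```python
# Type changes assumed safe (old -> new); everything else is flagged.
# This whitelist is an assumption for the sketch, not a real rule set.
ALLOWED = {("char", "varchar"), ("int", "numeric")}

def validate(spec_cols, now_cols):
    """Return a list of error strings for risky type changes.

    Both arguments map (table, column) -> type name.
    """
    errors = []
    for (table, col), new_type in spec_cols.items():
        old_type = now_cols.get((table, col))
        if old_type is None or old_type == new_type:
            continue                      # new column, or unchanged: fine
        if (old_type, new_type) not in ALLOWED:
            errors.append(f"{table}.{col}: cannot alter {old_type} -> {new_type}")
    return errors

# A programmer changed states.state from char to int: flag it.
errs = validate({("states", "state"): "int"},
                {("states", "state"): "char"})
```

If the error list is non-empty, the builder stops before touching the database, which is the whole point: fail in validation, not halfway through an upgrade.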

More that I Left Out: Indexes, Keys...

There are many, many other things you can, and really must, create during the build, beginning with primary keys and foreign keys, not to mention indexes as well. These will be covered in a future essay.

More That I Left Out: When You Still Need Scripts

There are plenty of reasons why you may still need a few upgrade scripts; these will be discussed in a future essay. They all come down to moving data around when table structures change significantly.

Conclusion: One Upgrade To Rule Them All

The approach described this week for upgrading databases has many advantages. It is first and foremost the most efficient way to upgrade customers from any version directly to the latest version. It is also the simplest way to handle both installations and upgrades: they are both the same process. Putting the dictionary file into plaintext gives you complete source control just like any other application file, and overall you have a tight, efficient and error-free upgrade process.

11 comments:

Seun Osewa said...

Welcome back and thanks for the post.

amihay gonen said...

nice note .
please also speak on data and code upgrade

Jeffrey Kemp said...

Very interesting post, made me think a bit about a topic I haven't given much thought to previously. Your idea means that a user can upgrade from any version to any later version. The fact that your method is nondestructive means it should never cause a broken install if the installation fails halfway through; in fact, restarting the installation should be able to automatically take over where the failed one stopped.
It does not, however, allow for an upgrade (successful or failed) to be rolled back, because it does not drop columns that were added (e.g. new mandatory columns without default values will cause earlier versions to fail when they try to insert new rows).
In addition, because it does not drop unused columns, a major upgrade could leave the customer with a lot of wasted space.
For these reasons I prefer the scheme (used, e.g., by Oracle Application Express) wherein upgrades do not modify the old schema at all - a new schema is created, all the DDL is run first; all that remains is to run conversion scripts that convert the data from the old schema to the new schema.
Admittedly, these conversion scripts will necessarily be more complex, and perhaps more expensive, especially if they're driven from the data dictionary (to cater for upgrades from various versions); however the benefit is that restoring a failed install (or rolling back a successful install) is a piece of cake - just drop the new schema.
Thanks!

zippy1981 said...

Ken,

Another good article, but I'd like to comment on a few points.

First of all, XML is very easy to work with if you have the correct tools. A text editor is not that tool, an XML editor is.

The best XML editor out there is XML Spy (http://www.altova.com/). It's expensive, but worth it. However, if you only have one xml format to work with, any decent xml editor with auto-complete will work. On the open source front, Xml Copy Editor (http://xml-copy-editor.sourceforge.net/) is a great standalone editor for windows and linux. I once submitted a patch for a very small bug in it.

Also, if you're using a modern IDE of some sort, it probably has a built in xml editor. Visual Studio, Eclipse and SharpDevelop (http://www.icsharpcode.net/OpenSource/SD/) all have good xml editors. As an added bonus, XmlSpy has plugins for Eclipse and Visual Studio.

The data dictionary is a good idea. However, a similar idea is a database diff tool like SqlCompare for sql server (http://www.red-gate.com/products/SQL_Compare/index.htm?gclid=CI3Mhsi095cCFQq3sgodw0vZCg). There are similar tools for most database flavors, although some are command line only. I'm pretty sure that internally SqlCompare is using a data dictionary.

The difference between the andro build script and these compare tools is andromeda compares a real database to a theoretical spec in a yaml file and the database compare tools compare to real databases. This means you used real sql commands to build the dev database. Both have about the same limitations where a manual upgrade script will be required for complex schema changes that require initial data population.

KenDowns said...

Suen: thank you very much, it's good to be back!

KenDowns said...

@Jeffrey: In terms of rollback, wouldn't it be simpler to back up the database and restore if necessary? This saves the need to copy all data from old to new schema during each upgrade.

amihay gonen said...

zippy1981 - here are some links to comparison tools which I'm aware of (I haven't tried them):

perl - http://sqlfairy.sourceforge.net/
python - http://halfcooked.com/code/gerald/

ken - regarding the backup. This is a good point, but don't forget that a backup also means copying the data (so in terms of performance it may be the same).
Sometimes the database hosts several applications, so restoring the db can affect other applications.

KenDowns said...

@amihay: I believe a good backup would be more efficient than copying data between schemas -- unless the copy can be done with bulk operations. I would need to know more about the tools to really know.

Anonymous said...

A tool created specifically to handle database upgrades is Liquibase (http://www.liquibase.org). How does the Liquibase approach compare to your thoughts?

Mastodont said...

Very interesting post, thanks.

But I do not understand why I should convert the DD from a YAML file into some tables and run complicated SELECTs - you can read the YAML file into an array and serialize this array into a file. Then, if you need to upgrade the schema, you can compare this array with the new YAML's content.

KenDowns said...

Mastodont: I think it's a matter of choice. I don't think you "should" load them to tables, that is just how I do it and so that is how I fleshed out the example. But if your own style gets the job done, you won't hear any complaints from me :)

To begin, it is a personal inclination of mine to always perform analysis by loading to tables and putting results in tables. The real concrete advantage to this, and the reason I do it, is that it is much easier to debug the build program by examining those tables than it is to crawl through memory dumps, especially as the feature set gets larger and larger. My own builder has something like 30 DD tables and does triggers, indexes, views, and many complex automations, which I can only manage by getting them into tables. If I can sit comfortably and examine the tables and know the dictionary is sound, the code is a lot easier. This is of course simply the "minimize code, maximize data" rule all over again.