The number one search term that brings people to this blog is "data dictionary." So this week I will begin a series on how to use the data dictionary to improve your own productivity and reduce errors.
Building And Upgrading
This week we are going to see how to use a data dictionary to eliminate upgrade scripts (mostly) and make for more efficient upgrades. The approach described here also works for installing a system from scratch, so an install and an upgrade become the same process.
The major problems with upgrade scripts are these:
- They are the least-tested code in any system, and are the most likely to break. If a script breaks and anybody but the original programmer is running the upgrade, this leads to aborted upgrades and upset customers.
- They are horribly inefficient when a customer upgrades after a long time: the same table may be rebuilt many times as script after script adds a column or two.
By contrast, a dictionary-based upgrade can take any customer from any version of your software and in the fewest steps possible bring them completely current, with no possibility of broken scripts.
But first, a quick review of an important idea...
Review: The Text File
Back in June of 2008 this blog featured an overview of the data dictionary. There were many ideas in that essay, but I wish to focus on one in particular: the question of where to put the data dictionary and in what format.
In terms of where to put it, the data dictionary should be just another application file, in source control, and delivered with your code. When the dictionary is in a plaintext file (or files) and treated like other application code, you do not have to invent any new methods for handling it, just use the same methods you use for the rest of your code.
In simple practical terms, it is best if a data dictionary can be easily read and written by both people and computers. This leads to a plaintext file or files in some format such as JSON or YAML. I personally prefer YAML because it is a superset of JSON, so using YAML gives me the option to sneak in JSON syntax later if I choose, but starting with JSON does not let me go the other way.
Requiring easy handling by people tends to rule out XML, which is bletcherous to work with manually (in my humble opinion). Requiring readability by the computer rules out UML unless your UML drawing tool can produce a usable data file (comments always welcome, tell us your favorite tool for doing this!). When considering UML, it is the class diagrams that are most likely to be translatable into a data dictionary.
Finally, encoding dictionary information in program class files technically meets the practical requirements listed above, but it has the disadvantage of trapping data in code, which unnecessarily couples your dictionary to whatever language you are using at the moment. It is much better if the dictionary sits outside of the code as pure data. Not to mention that spreading the dictionary out in a collection of one-class-per-table files makes it much harder to do upgrades in the way I am about to describe.
Review of Steps
When using a dictionary-based approach, you write some type of "builder" program that reads your dictionary file, examines the current structure of the database, and then generates SQL commands to alter and create tables to make them all current.
There are plenty of ways to do this. My own approach is to load the dictionary itself into tables, pull the current state into similar tables, and then do queries to find new and altered tables and columns. If you want to see a full-blown program of this type, check out androBuild.php, the Andromeda implementation of this idea. The routines that apply to today's topic include "RealityGet()", "Differences()", "Analyze()" and "PlanMake()".
Step 1: Load the Dictionary to RAM
To use the approach in this essay, you begin by parsing your plaintext file and loading it to tables. Here is a simple example of what a dictionary file might look like in YAML format:
table states:
    description: US States
    columns:
        state:
            type: char
            colprec: 2
            caption: State Code
            primary_key: "Y"
        description:
            type: varchar
            colprec: 25
            caption: State Name
If you are using PHP, you can parse this file with the spyc library, which converts the file into associative arrays. All or nearly all modern languages have a YAML parser; check out the YAML site to find yours.
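For readers working in Python rather than PHP, here is a minimal sketch of the same step using the PyYAML library (the rough counterpart of spyc). The dictionary text mirrors the sample layout above; the key names are from this essay's example, not a fixed standard.

```python
import yaml  # PyYAML: pip install pyyaml

# The sample dictionary from this essay, inlined for the sketch.
DICTIONARY = """
table states:
    description: US States
    columns:
        state:
            type: char
            colprec: 2
            caption: State Code
            primary_key: "Y"
        description:
            type: varchar
            colprec: 25
            caption: State Name
"""

# safe_load turns the YAML into nested dicts, just as spyc
# turns it into associative arrays.
spec = yaml.safe_load(DICTIONARY)

for table_name, table_def in spec.items():
    print(table_name, "->", sorted(table_def["columns"]))
```

From here the nested dicts can be walked and inserted into the dictionary tables described in the next step.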
Step 2: Load the Dictionary To Tables
The database you are building should have some tables that you can use as a scratch area during the upgrade. You may say, "The builder gives me tables, but I need tables to run the builder, how can I do this?" The simplest way is to hardcode the creation of these tables. A more mature solution would use a separate dictionary file that just defines the dictionary tables.
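To make the bootstrapping concrete, here is a minimal sketch of the hardcoded approach, using SQLite for portability. The table and column names mirror the examples in this essay ("TABLES_SPEC", "TABLES_NOW", and so on); your builder's names and columns will differ.

```python
import sqlite3

# Hardcoded bootstrap: create the builder's own scratch tables.
# The *_SPEC tables hold the dictionary's desired state; the
# *_NOW tables hold the database's current state.
BOOTSTRAP = """
CREATE TABLE IF NOT EXISTS TABLES_SPEC  (table_id TEXT PRIMARY KEY, description TEXT);
CREATE TABLE IF NOT EXISTS TABLES_NOW   (table_id TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS COLUMNS_SPEC (table_id TEXT, column_id TEXT, caption TEXT,
                                         type TEXT, colprec INTEGER,
                                         PRIMARY KEY (table_id, column_id));
CREATE TABLE IF NOT EXISTS COLUMNS_NOW  (table_id TEXT, column_id TEXT,
                                         PRIMARY KEY (table_id, column_id));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(BOOTSTRAP)

made = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(made)
```

The IF NOT EXISTS clauses make the bootstrap safe to run at the start of every upgrade, which is exactly the property you want.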
The structure of the tables should match the data file, of course. Here is what the YAML above would look like after being loaded to tables:
TABLE  | DESCRIPTION
-------+------------
states | US States

TABLE  | COLUMN      | CAPTION    | TYPE    | PRECISION
-------+-------------+------------+---------+----------
states | state       | State Code | char    | 2
states | description | State Name | varchar | 25
Step 3: Fetch The Current State
Most modern databases support "information_schema", a schema inside each database that contains tables describing the structure of the database itself. While you can make queries directly against the information_schema tables, I prefer to fetch the information out of them into my own tables so that all column names are consistent with my own. A simple query to do this might look like this:
-- Postgres-specific example of pulling info out of the
-- information_schema:
insert into TABLES_NOW (table_id)   -- my dictionary table
SELECT table_name as table_id
  FROM information_schema.tables
 WHERE table_schema = 'public'
   AND table_type   = 'BASE TABLE'
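The same step can be exercised end-to-end with SQLite, which has no information_schema; its sqlite_master catalog plays the same role. This is only a sketch under that substitution, with the table names from this essay:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TABLES_NOW (table_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE states (state TEXT)")  # a pretend application table

# sqlite_master is SQLite's stand-in for information_schema.tables:
# copy the current table list into the builder's own scratch table.
conn.execute("""
    INSERT INTO TABLES_NOW (table_id)
    SELECT name FROM sqlite_master
     WHERE type = 'table' AND name <> 'TABLES_NOW'
""")

current = [r[0] for r in conn.execute("SELECT table_id FROM TABLES_NOW")]
print(current)  # -> ['states']
```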
Pulling column information out can be much more complicated, owing to differences in how vendors implement information_schema, and owing to the complex way data is stored in it. Here is my own code to pull out the definitions of columns from the Postgres information_schema, which also simplifies the definitions dramatically, to make my downstream coding easier:
insert into zdd.tabflat_r
  (table_id, column_id, formshort, colprec, colscale)
SELECT c.table_name,
       c.column_name,
       CASE WHEN POSITION('timestamp' IN data_type) > 0 THEN 'timestamp'
            WHEN POSITION('character varying' IN data_type) > 0 THEN 'varchar'
            WHEN POSITION('character' IN data_type) > 0 THEN 'char'
            WHEN POSITION('integer' IN data_type) > 0 THEN 'int'
            ELSE data_type
       END,
       CASE WHEN POSITION('character' IN data_type) > 0 THEN character_maximum_length
            WHEN POSITION('numeric' IN data_type) > 0 THEN numeric_precision
            ELSE 0
       END,
       CASE WHEN POSITION('numeric' IN data_type) > 0 THEN numeric_scale
            ELSE 0
       END
  FROM information_schema.columns c
  JOIN information_schema.tables t ON t.table_name = c.table_name
 WHERE t.table_schema = 'public'
   AND t.table_type   = 'BASE TABLE'
Step 4: The Magic Diff
Now we can see how the magic happens. Imagine you have 20 tables in your application, and in the past week you have modified 5 of them and added two more. You want to upgrade your demo site, so what is the next step for the builder?
The builder must now do a "diff" between your dictionary and the actual state of the database, looking for:
- Completely new tables.
- New columns in existing tables.
Let's say you have two tables: "TABLES_SPEC", which lists the tables in your application, and "TABLES_NOW", which lists the tables in your database. The following query will give you a list of new tables:
SELECT spec.table_id
  FROM TABLES_SPEC spec
 WHERE NOT EXISTS (
        SELECT table_id
          FROM TABLES_NOW now
         WHERE now.table_id = spec.table_id
       )
It is now a simple thing to pull the column definitions for each new table and generate the DDL to create them.
But we also have tables that have new columns. We can pull those out like so:
SELECT * from COLUMNS_SPEC spec
 -- the first where clause gets new columns
 WHERE not exists (
        SELECT table_id
          FROM COLUMNS_NOW now
         WHERE spec.table_id  = now.table_id
           AND spec.column_id = now.column_id
       )
 -- this second subquery makes sure we are
 -- getting only existing tables
   AND EXISTS (
        SELECT table_id
          FROM TABLES_NOW now
         WHERE now.table_id = spec.table_id
       )
Now again it is a simple matter to generate DDL commands that add all of the new columns into each table. Some databases will allow multiple columns to be added in one statement, while others will require one ALTER TABLE per new column (really horrible when you have to do that).
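A sketch of that generation step, again using SQLite and the sample names from this essay, and emitting one ALTER TABLE per column as the lowest common denominator:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLES_NOW   (table_id TEXT PRIMARY KEY);
CREATE TABLE COLUMNS_SPEC (table_id TEXT, column_id TEXT, type TEXT, colprec INTEGER);
CREATE TABLE COLUMNS_NOW  (table_id TEXT, column_id TEXT);

INSERT INTO TABLES_NOW   VALUES ('states');
INSERT INTO COLUMNS_NOW  VALUES ('states', 'state');
INSERT INTO COLUMNS_SPEC VALUES
    ('states', 'state',       'char',    2),   -- already present
    ('states', 'description', 'varchar', 25);  -- new column
""")

# New columns on existing tables, per the diff query in the essay.
rows = conn.execute("""
    SELECT spec.table_id, spec.column_id, spec.type, spec.colprec
      FROM COLUMNS_SPEC spec
     WHERE NOT EXISTS (SELECT 1 FROM COLUMNS_NOW n
                        WHERE n.table_id  = spec.table_id
                          AND n.column_id = spec.column_id)
       AND EXISTS (SELECT 1 FROM TABLES_NOW n
                    WHERE n.table_id = spec.table_id)
""").fetchall()

# One ALTER TABLE per new column.
alters = ["ALTER TABLE %s ADD COLUMN %s %s(%d)" % row for row in rows]
for statement in alters:
    print(statement)
```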
Please note this is sample code only, just to give you ideas, and it will not cover every case.
Sidebar: Avoid Destructive Actions
Do not rush into writing code that drops columns or tables that are not in your spec. The results of a misstep can be disastrous (as in lose your job or your customer). My own builder code is now 4 1/2 years old and I have never yet bothered to write a "destructive" upgrade that will clean out unused tables and columns. Maybe someday...
What I Left out: Validation
There was no space in this essay to discuss a very important topic: validating the spec changes. It may be that a programmer has done something nasty like change a column type from character to integer. Most databases will fail attempting to alter the column because they don't know how to convert the data. Your builder program can trap these events by validating the upgrade before any changes are made. This will be treated fully in a future essay.
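To give a taste of what that validation looks like, here is a minimal sketch: before generating any DDL, diff the type columns between spec and current state and refuse to proceed on a mismatch. The table layout is hypothetical, simplified from the essay's examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE COLUMNS_SPEC (table_id TEXT, column_id TEXT, type TEXT);
CREATE TABLE COLUMNS_NOW  (table_id TEXT, column_id TEXT, type TEXT);

INSERT INTO COLUMNS_SPEC VALUES ('states', 'state', 'int');   -- programmer changed the type
INSERT INTO COLUMNS_NOW  VALUES ('states', 'state', 'char');
""")

# Validate before altering anything: flag columns whose type
# differs between the database and the spec.
problems = conn.execute("""
    SELECT spec.table_id, spec.column_id, n.type, spec.type
      FROM COLUMNS_SPEC spec
      JOIN COLUMNS_NOW n
        ON n.table_id  = spec.table_id
       AND n.column_id = spec.column_id
     WHERE n.type <> spec.type
""").fetchall()

for table_id, column_id, old_type, new_type in problems:
    print("REFUSING upgrade: %s.%s would change from %s to %s"
          % (table_id, column_id, old_type, new_type))
```

If `problems` is non-empty, the builder aborts before a single ALTER is issued, leaving the database untouched.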
More that I Left Out: Indexes, Keys...
There are many, many other things you can, and really must, create during the build, beginning with primary keys and foreign keys, not to mention indexes. These will be covered in a future essay.
More That I Left Out: When You Still Need Scripts
There are plenty of reasons why you may still need a few upgrade scripts; these will be discussed in a future essay. They all come down to moving data around when table structures change significantly.
Conclusion: One Upgrade To Rule Them All
The approach described this week for upgrading databases has many advantages. It is first and foremost the most efficient way to upgrade customers from any version directly to the latest version. It is also the simplest way to handle both installations and upgrades: they are both the same process. Putting the dictionary file into plaintext gives you complete source control just like any other application file, and overall you have a tight, efficient and error-free upgrade process.