Monday, February 18, 2008

Database Development: Table Structure Changes

Welcome to the Database Programmer, the blog for anybody who wants to learn the practical realities of working with databases.

There is a new entry every Monday morning, and the complete table of contents is here. This week we are talking about tools again, specifically upgrades.

Structure Change Tools Are Crucial

There are two big things that your toolkit must contain if you are going to be a successful database programmer. The first, which we talked about last week, is a framework that respects relational concepts. The second, which we will talk about now, is a strong upgrade tool, something that can modify your structures for you.

Iterative Development

Requirements change in one way or another over the life of a project. Two philosophies exist to deal with this fact, which I tend to think of as the figure-it-all-out-before-we-code-damnit! school and the let's-make-sure-we-can-adapt-as-we-go school.

If you are a member of the second school, you know that a table design that seemed correct yesterday will reveal mistakes and weaknesses today, and will have to be changed.

In this situation the irredeemable code grinder will attempt to code his way out, adding edge-case handlers and special-condition routines to make up for the fact that the tables do not match the requirements. After a few years of this, most programmers, save a handful who were present from day one, will be afraid to touch the code, and inevitably the team will declare they have some huge wonderful new ideas that require them to start from scratch.

The database programmer by contrast will first revise the table design and update the structures before even considering what changes to make to the code. While the code grinder is always looking for the ultimate class structure first, the database programmer wants first to make sure that the tables match the requirements, always ensuring that any special-purpose code that follows is built on a strong foundation.

Therefore I state the iterative argument: the development process itself will require structure changes. The choice of framework will heavily influence your ability to perform this task.

The Operational Argument

Consider these four actions. Are they more alike than different, or more different than alike?

  1. A programmer spends four hours creating a new feature start to finish on his own workstation.
  2. A testing server is upgraded, merging 3 days' work from 4 programmers.
  3. A live server is upgraded to the latest published version of a package.
  4. A server has a package installed on it for the first time.

From a manager's perspective these are all very different operations, involving different kinds of people, different approval processes and different kinds of follow-up. A manager would say they are different.

However, from a technical standpoint they are all exactly the same operation. All four operations require the execution of these two steps:

  1. Modify table structures.
  2. Deploy new code.

A programmer sitting at a workstation working on a new feature will frequently need to add a column to a table or make one or two new tables. Then she adds supporting user interface code where the framework default screens won't do the job. The operation is change tables + make new code.

When a QA server is updated with the latest efforts of more than one programmer, the operation is exactly the same: all table structure changes are made and all new code is deployed. The operation is change tables + use new code.

When a production server is upgraded, the operation is exactly the same as for the QA server: new table structures and new code.

When a package is installed for the first time and there is no database, the database must be created. This is no different than the previous two examples. Instead of modifying tables it adds tables. Instead of adding one or two it adds them all. Instead of overwriting code it puts the code down for the first time. The operation is still the same: change tables + use new code.

Approach For Trivial Cases

If your database structure is modest (fewer than 10 tables), if you have one or perhaps two programmers, and if you do not change the application much, then you can probably forget about a tool. Most tools and conventions are meant to coordinate multiple programmers making lots of changes. If you don't need that coordination then you don't need the tool, unless it really gives you something you like.

The Pros and Cons of Build Scripts

A build script is a program that does one or both of these two tasks:

  • Contains commands that modify table structures
  • Contains commands that manipulate data, such as populating rows in a newly created table or fixing some bug from a prior release.
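To make the idea concrete, here is a minimal sketch of what one release's build script might look like. The source doesn't specify a language or database engine, so this illustration uses Python with SQLite; the table and data are hypothetical.

```python
import sqlite3

def run_build_script(conn):
    """A hypothetical build script for one release: structure
    changes first, then data manipulation."""
    # Structure change: a new table introduced in this release
    conn.execute("""
        CREATE TABLE invoice_notes (
            invoice_id INTEGER NOT NULL,
            note       TEXT
        )""")
    # Data manipulation: populate a row, as you might when
    # fixing up data from a prior release
    conn.execute(
        "INSERT INTO invoice_notes (invoice_id, note) VALUES (1, 'migrated')")
    conn.commit()

conn = sqlite3.connect(":memory:")
run_build_script(conn)
```

The script is just ordinary code, which is exactly why it can go into source control alongside the rest of the application.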

Build scripts are a natural first step when your team and your customer count begins to grow. One of the biggest first wins is that you can put them into source control. This alone can be worth going to scripts even if you are just one programmer.

It is also fairly trivial to make a small library of commands like "AddColumn" that will check to see if the column is already there and only attempt to add it if necessary. This makes the scripts safer to rerun, which is always a good thing.
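A rerunnable "AddColumn" helper of the kind described above might be sketched like this. Again the engine choice is an assumption: SQLite exposes column lists through PRAGMA table_info, while most other engines would query information_schema instead.

```python
import sqlite3

def add_column(conn, table, column, coltype):
    """Add a column only if it is not already there,
    so the build script is safe to rerun."""
    existing = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {coltype}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
add_column(conn, "customers", "email", "TEXT")
add_column(conn, "customers", "email", "TEXT")  # rerun is a harmless no-op
```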

You can also have programmers create a build script for each distinct feature they work on, also making it easier to identify which structure changes are tied to which unit of work.

A few years ago I became programmer #5 on what had been a four man team. This team used build scripts exactly as I have just described them. When I left a little over three years later we were about 20 programmers on two continents and over 50 customers. This growth revealed a few weaknesses in the build script method.

  1. Separate build scripts are slow. If you keep separate scripts for each unit of work, you risk that a major upgrade will rewrite a large table more than once. When the table is 10 Gigs this is no laughing matter.
  2. Quality Control is Impossible. Scripts are usually the least tested part of the system. The programmer runs a script at most a handful of times, and usually in unrealistically ideal circumstances. Then an IT person runs the upgrade during off hours (Friday night at 8:00pm) and the script crashes, and either you get a phone call and go back to work or the customer is told the upgrade failed. Yuck.
  3. Dependencies Become Horrible. There will always be a need to run the scripts in the order they were written. Even if clever programming gives you a certain leeway here, you will keep running into cases where a programmer must write his script on the assumption that another script has already run. If even three features tie each other up like this, suddenly you have three or four programmers stuck in meetings trying to coordinate, and we all know how fun it is trying to get programmers to coordinate in a meeting!
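One common way teams soften the ordering problem is to keep every script in a single fixed global order and record each applied script in a tracking table, so a rerun skips work already done. This is a sketch of that convention, not anything prescribed by the source; the table name and (name, sql) pair format are hypothetical.

```python
import sqlite3

def apply_pending(conn, scripts):
    """Apply build scripts in one fixed global order, recording
    each one so reruns skip scripts already applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_scripts (name TEXT PRIMARY KEY)")
    done = {row[0] for row in conn.execute("SELECT name FROM applied_scripts")}
    for name, sql in scripts:
        if name not in done:
            conn.executescript(sql)
            conn.execute("INSERT INTO applied_scripts (name) VALUES (?)", (name,))
    conn.commit()

scripts = [
    ("001_customers", "CREATE TABLE customers (id INTEGER PRIMARY KEY);"),
    ("002_orders",
     "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);"),
]
conn = sqlite3.connect(":memory:")
apply_pending(conn, scripts)   # fresh install: runs both scripts
apply_pending(conn, scripts)   # upgrade rerun: runs neither
```

Note that this only guarantees ordering; it does nothing about the performance and QA hazards above, which is why the text goes on to argue for a differences engine.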

I do not mean to say that you should not use scripts. All of the hazards mentioned above can be acceptable if the scripts give you enough productivity to operate profitably. But if you want maximum upgrade performance and quality, you will sooner or later have to go to meta-data and a differences engine.

End of the Line: Meta Data

The only way to solve all three of the problems mentioned above is to have a single file of some form, whether it be YAML, XML or even plaintext, which describes the database in its entirety. This file is placed under source control and is delivered with other files in the upgrade. We call this file a "meta-data" file because it contains data about data.

When an upgrade runs, the build tool takes that file of yours and creates a picture of what the database should be, and then examines the current structure. Once this is done the tool can work out which changes need to be made, and execute them. Tools such as these work for development, upgrades, and new installs, without needing a separate tool for each case.
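The compare-and-build cycle described above can be sketched in miniature. This is an illustration only, assuming a metadata file already parsed into a dictionary of tables and columns; real tools of this kind also handle indexes, keys, type changes, and much more.

```python
import sqlite3

# A hypothetical meta-data description of the entire database,
# as it might look after parsing a YAML or XML file:
DESIRED = {
    "customers": {"id": "INTEGER", "email": "TEXT"},
    "orders":    {"id": "INTEGER", "customer_id": "INTEGER"},
}

def sync(conn, desired):
    """Minimal difference engine: compare the desired picture to
    the current structure, then create missing tables and add
    missing columns."""
    tables = {r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")}
    for table, cols in desired.items():
        if table not in tables:
            coldefs = ", ".join(f"{c} {t}" for c, t in cols.items())
            conn.execute(f"CREATE TABLE {table} ({coldefs})")
        else:
            existing = {r[1] for r in conn.execute(f"PRAGMA table_info({table})")}
            for c, t in cols.items():
                if c not in existing:
                    conn.execute(f"ALTER TABLE {table} ADD COLUMN {c} {t}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER)")  # an out-of-date structure
sync(conn, DESIRED)   # adds customers.email, creates orders
sync(conn, DESIRED)   # on a fresh or current database, finds little or nothing to do
```

Because the engine only acts on differences, the same call serves development, upgrades, and new installs alike.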

A difference engine performs best because it makes only the changes required in each situation. It has the highest quality because programmers are just supplying inputs to the builder instead of writing new code every time. Finally, it reduces dependencies by the simplest of all approaches: it just builds what it knows about, and is immune to any trap where one feature requires another.

Conclusion: Pick a Tool for Your Situation

This week I have reviewed three basic approaches to handling structure changes. The simplest situation may require no special tool, a larger but still modest situation requires a bit more, and the final stage requires an intelligent meta-data based agent. I have also shown that many operations in the development cycle are in fact the same operation over and over: table changes + new code. This means that a good tool for handling these two operations can be useful for development, QA, the new install, and the upgrade.

Next week we will return to Table Design Patterns with "The Primary Key That Wasn't."


Anonymous said...

Do you have any suggestions for a meta-data database upgrade tool? This certainly sounds like the best solution, but I am not aware of any software that would help with this.

KenDowns said...

Anonymous: In July of 2004 I began work on my own tool, Andromeda, which is GPL'd and free to use; a description of it is here: cms/Short+Description+of+Andromeda.html

Since that time I have lost touch with other similar tools because I have been focusing on the unique aspects of Andromeda, particularly its handling of calculated values.