Sunday, January 11, 2009

Upgrading Indexes With a Data Dictionary

When you set out to use a Data Dictionary to control your database upgrades, you must consider not just columns and tables, but also indexes and keys. The process for adding indexes and keys is almost the same as that for columns and tables, but there are a few wrinkles you have to be aware of.

Review of Basic Ideas

In my First Essay On Data Dictionaries, one of the major points was that the dictionary is easiest to use if it is in some type of plaintext format that lives in source control along with the rest of your application files, and is then processed with a "builder" program.

Last week we saw the basic compare operation used by the builder to build and update tables. You read in your data dictionary, query the information_schema to determine the current structure of the database, and then generate commands to add new tables and add new columns to existing tables.

The Importance of Keys and Indexes

If a builder program is to be useful, it must be complete: if it leaves you with manual tasks after a build, the entire concept of automation is lost. The builder must be able to build the entire structure of the database at the very least, and this means it must be able to work out keys and indexes.

The Basic Steps

The basic steps of building indexes and keys are these:

  • Load your dictionary to some format you can work with easily. I prefer to load it to tables.
  • Query the database's INFORMATION_SCHEMA to determine which indexes and keys already exist.
  • Execute some type of diff to determine which indexes need to be built.
  • Build the indexes that are not there.
  • If you like, drop the indexes that are not in the spec.
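Stitched together, those steps amount to a short driver routine. Here is a sketch, in which load_spec, query_existing, and execute are hypothetical stand-ins for code you would supply yourself:

```python
def upgrade_indexes(load_spec, query_existing, execute, drop_strays=False):
    """One pass of the index builder.  load_spec and query_existing are
    assumed to return dicts mapping (table, columns) -> index name."""
    wanted   = load_spec()        # step 1: the dictionary's indexes
    existing = query_existing()   # step 2: what the database has now
    for (table, columns), name in wanted.items():   # steps 3 and 4
        if (table, columns) not in existing:
            cols = columns.replace(":", ", ")
            execute(f"CREATE INDEX {name} ON {table} ({cols})")
    if drop_strays:                                 # step 5: optional cleanup
        for key, name in existing.items():
            if key not in wanted:
                execute(f"DROP INDEX {name}")
```

Notice that the dictionaries are keyed by (table, columns) rather than by index name; the reason for that choice is the subject of Step 3 below.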

Sidebar: Dropping Indexes

A builder program can add things, and it can also drop things. When it comes to destructive operations, I prefer not to have my builder drop tables or columns, because the consequences of a mistake can be unthinkable.

However, when it comes to indexes, it is much more likely to be ok to drop a stray index. Dropping indexes does not destroy user data. Also, extraneous indexes will slow down inserts and updates, so getting rid of them is usually the Right Thing.

Step 1: Your Specification

Your data dictionary format must have some way of letting you specify an index. It is also a good idea to allow you to specify an ascending or descending property for each column, and to specify if the index is to be unique (effectively making it a unique constraint).

Here is an example of a very simple set of indexes:

table example:
    description: My example table
    
    index first_name:
        column first_name:
    index last_name:
        column last_name:
    index socialsec:
        unique: "Y"
        column socialsec:
        
    # ... column definitions follow...

I am currently working on a program that requires frequent access by three columns, where the last two are in descending order but not the first. An index spec for this might look like:

table shipments:
    description: Incoming Magazines
    
    index history:
        column bipad:
        column year:
            flag_asc: "N"
        column issue:
            flag_asc: "N"

    # ... column definitions follow...

As far as loading this into memory, I covered that in some detail last week and will not dwell on it here. I will simply assume you have code to parse and load the spec to a format that works for you.
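For illustration only, a minimal loader for the index portion of the format above might look like the following. This is a sketch that assumes the simple two-level layout shown, not a full parser:

```python
def parse_index_specs(text):
    """Sketch of a loader for the index portion of the spec format above.
    It only understands the 'table', 'index', and 'column' lines shown;
    a real parser would track indentation and handle the
    column-definition section as well."""
    indexes = {}                 # (table, "col1:col2:...") -> index name
    table, index, cols = None, None, []

    def flush():
        # Record the index collected so far, if any.
        if table and index and cols:
            indexes[(table, ":".join(cols))] = index

    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("table "):
            flush()
            table, index, cols = line.split()[1].rstrip(":"), None, []
        elif line.startswith("index "):
            flush()
            index, cols = line.split()[1].rstrip(":"), []
        elif line.startswith("column ") and index is not None:
            cols.append(line.split()[1].rstrip(":"))
    flush()
    return indexes
```

The (table, columns) keys this produces are exactly the shape the diff in Step 3 wants to work with.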

Step 2: The Information Schema or Server Tables

When I set out to write my builder, I found that the information_schema was a bit more complicated than I needed. The server I was using, Postgres, had a simpler way to get what I wanted. I also found I would get all kinds of extraneous definitions of indexes on system tables or tables that were not in my spec. The query below was the easiest way to get index definitions that were limited to the tables in my spec on the Postgres platform:

SELECT tablename, indexname, indexdef
  FROM pg_indexes
  JOIN zdd.tables ON pg_indexes.tablename = zdd.tables.table_id
 WHERE schemaname = 'public'
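Note that indexdef comes back as raw CREATE INDEX text, so it still has to be reduced to a plain column list before it can be compared to the spec. Here is a regex sketch, assuming the single-line btree definitions Postgres normally emits:

```python
import re

def indexdef_to_columns(indexdef):
    """Reduce a Postgres indexdef string such as
       CREATE INDEX history ON shipments USING btree (bipad, year DESC)
    to a colon-joined column list like 'bipad:year'.  A sketch only: it
    ignores expression indexes, operator classes, and other decorations."""
    cols = re.search(r"\(([^)]*)\)", indexdef).group(1)
    names = [c.split()[0] for c in cols.split(",")]
    return ":".join(names)
```

The colon-joined form matches the COLUMNS notation used in the diff tables below.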

As far as primary keys and foreign keys go, the story is basically the same: your server may provide them in a convenient way, as Postgres does with index definitions, or you may have to dig a little deeper to get precisely what you want.

Step 3: The Diff

So now we have a picture of the indexes we need to exist, and the indexes that already exist. It is time to look at how to diff them effectively. This step does not work the same way as it does with columns and tables.

Before we go into how to do the diff, let's review how we did it with tables and columns. We can basically diff tables and columns by name. If our spec lists table CUSTOMERS and it does not appear to exist in the database, we can build the table CUSTOMERS, simple as that. But with indexes the name really does not mean anything; what really matters is what columns are being indexed.

This is why we diff indexes on the column definitions, not on their names. If you want a complete trail, you would begin with this table that describes your own indexes:

SPEC_NAME   | TABLE     | COLUMNS
------------+-----------+-----------------
CUST1       | CUSTOMERS | zipcode:state
HIST1       | HISTORY   | bipad:year:issue
ORDERS1     | ORDERS    | bipad:year:issue

Then you pull the list of indexes from the server, and let's say you get something like this:

DB_NAME   | TABLE     | COLUMNS
----------+-----------+-----------------
CUST1     | CUSTOMERS | zipcode:state
ABCD      | HISTORY   | bipad:year:issue
ORDER1    | ORDERS    | year:issue

When you join these two together, you are matching on TABLE and COLUMNS; we do not care about the index names. A query to join them might look like this:

SELECT spec.spec_name,spec.table,spec.columns
      ,db.db_name,db.columns as db_cols
  FROM spec
  FULL OUTER JOIN db  ON spec.table = db.table
                     AND spec.columns = db.columns

This query would give us the following output:

SPEC_NAME   | TABLE     | COLUMNS          | DB_NAME | DB_COLS 
------------+-----------+------------------+---------+---------------------
CUST1       | CUSTOMERS | zipcode:state    | CUST1   | zipcode:state
HIST1       | HISTORY   | bipad:year:issue | ABCD    | bipad:year:issue
ORDERS1     | ORDERS    | bipad:year:issue |         |
            |           |                  | ORDER1  | year:issue

Now let us examine the results row by row.

  • The first row shows that the index on zipcode+state on the CUSTOMERS table is in the spec and in the database; we take no action on that index.
  • The second row shows that the index on bipad+year+issue is also in both the database and the spec. This particular index has a different name in the database, but we don't care. (Maybe the programmer changed the name in the spec.) We take no action on this index.
  • The third row shows an index on the ORDERS table that is not in the database; we must build that index.
  • The fourth row shows an index in the database that is not in the spec; you can drop that if you want to.
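If you would rather diff in the builder program than in SQL, the same full outer join can be sketched with plain dictionaries. The shapes here are my assumptions, mirroring the tables above:

```python
def diff_indexes(spec, db):
    """spec and db both map (table, columns) -> index name.
    Matches are ignored; the leftovers on each side are the
    indexes to build and the strays that could be dropped."""
    to_build = {key: name for key, name in spec.items() if key not in db}
    strays   = {key: name for key, name in db.items() if key not in spec}
    return to_build, strays

spec = {
    ("CUSTOMERS", "zipcode:state"):    "CUST1",
    ("HISTORY",   "bipad:year:issue"): "HIST1",
    ("ORDERS",    "bipad:year:issue"): "ORDERS1",
}
db = {
    ("CUSTOMERS", "zipcode:state"):    "CUST1",
    ("HISTORY",   "bipad:year:issue"): "ABCD",    # same definition, different name
    ("ORDERS",    "year:issue"):       "ORDER1",
}

to_build, strays = diff_indexes(spec, db)
```

Because the keys are (table, columns) pairs, the renamed HISTORY index matches automatically, just as in the four-row result above.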

The Rest of It

From here it is a simple matter to generate some commands to create the indexes we need.

Keys work the same way, with a few obvious differences in how they might be named.

We can add features from here to track whether the columns are being indexed in ascending or descending order.
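For instance, carrying the unique and flag_asc properties from the spec through to the generated command might look like this. The function and parameter shapes are my assumptions, not part of the spec format itself:

```python
def create_index_sql(table, name, columns, unique=False):
    """Build a CREATE INDEX command.  columns is a list of
    (column_name, ascending) pairs, mirroring the flag_asc
    property in the spec examples above."""
    parts = [col if asc else f"{col} DESC" for col, asc in columns]
    kind = "UNIQUE INDEX" if unique else "INDEX"
    return f"CREATE {kind} {name} ON {table} ({', '.join(parts)})"
```

Fed the shipments spec from Step 1, this would emit an index on bipad ascending with year and issue descending.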

Conclusion: Indexes Go By Definition

When writing a database upgrade "builder" program, the key thing to understand about indexes and keys is that you are looking to identify and build indexes according to their definition, and that names do not matter at all.
