Sunday, February 24, 2008

The Primary Key That Wasn't

Welcome to the Database Programmer, the blog for anybody who wants to learn the practical realities of working with databases.

There is a new entry every Monday morning, and the complete table of contents is here. This week we return to Table Design Patterns with a pattern I call the Impermanent Primary Key.

The Impermanent Primary Key

Sometimes you get a situation where you have a great possibility for a natural key, except that the key will change from time to time. I call this the "Impermanent Primary Key" and we are going to look at this pattern today.

Magazines and BIPADS

I am currently writing a system for a magazine distributor, while also maintaining his old system for him. When I first started with him he explained that a magazine is identified by a "BIPAD", a five-digit number unique to the magazine. It seemed to me that if you take a BIPAD, an issue number, and a volume year, you have a pretty good choice for a natural primary key.

Then one day he walked up to me and said, "I have fourteen magazines with new bipads this month." This was a surprise, as I had been fairly sure that a BIPAD and a magazine were uniquely matched for all eternity.

Sidebar: Did the Requirement Change?

Before we get into the design pattern for this, it is worth asking: is this a case where the customer changed the requirement? The answer is a most emphatic no! The customer told me, if I had listened precisely, that a magazine "has a BIPAD". He never said that a magazine has only one unique BIPAD; I inferred that myself. Such inferences are mistakes, and we cannot blame them on the customer, because it was my job to ask all sides of the question, such as "Can a magazine have more than one BIPAD?" or "Is a BIPAD permanent?" or "Can a BIPAD ever be used again on another magazine?"

Spelling Out The Requirements

So the reality is that a BIPAD is assigned to only one magazine, but a magazine may get more than one BIPAD, as the BIPAD will change from time to time. My customer's operational requirements are:

  1. The history of individual BIPADs must be preserved in the transaction tables.
  2. The histories of multiple BIPADs for a single magazine may need to be combined from time to time.
  3. During a BIPAD change, transactions will continue for both BIPADs.

I should also note that transaction information comes from outside sources and always contains the BIPAD. It is our responsibility to make sure the BIPAD can be mapped into our database correctly. We have no power to ask outside parties to use numbers created or assigned by us.

Some Candidate Solutions

There are a few ways to handle this. Here are four choices that are most likely to come up in a brainstorming session:

  1. Use BIPAD as a primary key in the BIPADS table, and change the BIPAD as required.
  2. Create a From-To table that maps old BIPAD values to new values.
  3. Use an integer primary key in the BIPADS table and make BIPAD a regular column, so you can change it as required.
  4. Make the table of BIPADs a child table to a table of MAGAZINEs. When a BIPAD changes make a new entry in the BIPADs table.

We can eliminate option one immediately, because foreign keys do not allow it. If you have a table of BIPADS, with BIPAD as a primary key, then your transaction tables will have foreign keys to this table. If you could change the value of a BIPAD, the transactions that reference the old value would suddenly be invalid; in everyday language, you would orphan the transactions. A foreign key exists precisely to prevent such things, so option one is out.
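
To see this concretely, here is a minimal sketch of option one. The tables here are a simplified stand-in for the real ones, and I am assuming the default foreign key behavior, with no cascades:

CREATE TABLE BIPADS (
   bipad char(5)
   ,primary key (bipad)
);

CREATE TABLE TRANSACTIONS (
   bipad char(5)
   ,qty integer
   ,foreign key (bipad) references bipads (bipad)
);

-- Once any row in TRANSACTIONS references BIPAD 12345, this
-- UPDATE fails, because it would orphan those rows:
UPDATE BIPADS SET bipad = '54321' WHERE bipad = '12345';

Even an ON UPDATE CASCADE clause would not rescue option one: the old value would be rewritten everywhere, destroying the history that requirement one says we must preserve.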

Option two looks pretty reasonable, but it is no good at all. I included it as an example of what happens when you use a coding mentality to design tables. Option two accurately records an action, the changing of a BIPAD, by recording the old and new values. While this may seem harmless, it makes a wreck out of the idea of using primary keys and foreign keys. What is the primary key of such a table? What table and column do the transaction tables reference? When an idea causes problems for the basic building blocks of a database, we know that nothing but trouble will follow.

This leaves options three and four, which we will now consider in more detail.

The First Two Requirements in Detail

The first two requirements are that we must be able to examine the individual histories and the combined histories as requested. Options three and four both allow this, as both of them have the BIPAD values in the transaction tables and both options also have a table that can be used to combine histories.

Option four uses the BIPAD values natively in all tables, so there is no need for any special planning or actions when saving data or querying it. If you want to query for a single BIPAD, then specify that. If you want to query for a magazine, then JOIN to the MAGAZINES table and filter on your magazine and there you go. Option four satisfies the requirements with no fanfare.
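
As a sketch, using the MAGAZINES and BIPADS tables shown at the end of this essay, plus the hypothetical SALES transaction table sketched there, the two queries might look like this:

-- History of a single BIPAD
SELECT * FROM SALES WHERE bipad = '12345';

-- Combined history of a magazine, across all of its BIPADs
SELECT sales.*
  FROM sales
  JOIN bipads ON bipads.bipad = sales.bipad
 WHERE bipads.magazine = 'TVGUIDE';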

Option three requires some additional storage. The transaction tables need to keep track of the BIPAD for accuracy, but they also need that meaningless integer foreign key to the table of BIPADS. This always strikes me as funny, because the integer key is promoted as a performance measure, yet the basic reality of a disk drive is that performance scales with the amount of data you have to read and write. Where option four writes only the value of BIPAD, option three must write more data, and therefore will always be slower.

Option three also requires more code and more disk activity. Because we receive transaction data as BIPAD values, but we are using a meaningless number as the foreign key to the BIPADS table, we have to do a lookup into the BIPADS table to find out the meaningless integer key. This means we need a read operation that was not necessary with Option four. Again this is rather ironic since the integer key is promoted as a performance tool.

Finally, the problem of having to look up the value of the BIPAD's integer key introduces application code, so now with Option three I have to write code where for Option four I do not.

But when all is said and done, if your tools lock you into option three and you have no choice, you can in fact satisfy requirements one and two. You will be able to examine both the individual histories and the combined histories. So far, then, we probably have to say it is a matter of taste, tools, and habits.

Disqualifying Option Three

The third requirement is that transactions will occur for both BIPADs during a transition. For option four this is not really an issue because there are two rows in the BIPADS table, one for each BIPAD. Any transaction table that is a child table of BIPADS is automatically covered by the foreign key.

Option three, however, has a fatal problem. When a transaction comes in with a BIPAD on it, we have to validate it by looking up the BIPAD in the BIPADS table, so we can get the meaningless integer primary key. But after a new value is entered, the old value will not be there. We cannot satisfy the third requirement, and option three is now disqualified.

Those who are comfortable using integer primary keys for everything may suggest a way to rescue the situation, but because Option four does not require any application code and does not require rescuing, I will leave it to others to rescue option three.

More Details On The Impermanent Primary Key

When we have an impermanent primary key, one that changes from time to time, we can create a pair of tables. The master entities are tracked in the top table, and the child table tracks the impermanent values. This second table is the parent of all of the transactions.

In our magazine example, the master table I built uses character primary keys, and the child table uses the BIPAD values, so it looks something like this:

MAGAZINE (PK) | DESCRIPTION
--------------+--------------------------
TVGUIDE       | TV GUIDE
COSMO         | COSMOPOLITAN
ASTORIES      | AMAZING STORIES
CROSSW        | CROSSWORDS
  |
  |
  |
  |
 /|\
MAGAZINE | BIPAD (PK) |  OTHERS...
---------+------------+-------------------
TVGUIDE  | 12345      | XXXXX
TVGUIDE  | 34345      | XXXXX
COSMO    | 29830      | XXXXX
COSMO    | 23813      | XXXXX 
             |
             |
             |
            /|\
     Transaction Tables have foreign keys
     to this column

The SQL that would create these two tables might look like this:

CREATE TABLE MAGAZINES (
  magazine char(10)
  ,description varchar(35)
  ,primary key (magazine)
);

CREATE TABLE BIPADS (
   magazine char(10)
   ,bipad char(5)
   ,primary key (bipad)
   ,foreign key (magazine) references magazines (magazine)
);
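
A transaction table that is a child of BIPADS might then look something like this. SALES and its columns are hypothetical, standing in for the several real transaction tables:

CREATE TABLE SALES (
   bipad char(5)
   ,issue char(5)
   ,qty integer
   ,foreign key (bipad) references bipads (bipad)
);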

Conclusion

The "Impermanent Primary key" pattern occurs when some value is permanent as far as individual transactions are concerned, but may change over the lifetime of the master entity. In these cases, we create a parent-child table pair. The actual master table sits at the very top, with a child table below it that holds the semi-permanent values. All transactions are children of the child table.

Next Essay: The Requirements are Always Wrong, Or, Iterative Database Development.

Upcoming Topics

This is an extra entry that I will update from time to time that lists planned upcoming essay topics.

Last Revised: May 4, 2008. Removed some topics that have since been covered. No additional topics were listed in this update.

ACID and Transactions

A huge part of effective database use is knowing what the ACID concept is and how to effectively use transactions. There will be at least one essay on these topics sometime in the future.

Additional Design Patterns

Some of the patterns coming up in the future are:

  • The ranged primary key, usually thought of in terms of dates but also useful for quantity price breaks, among other things.
  • How a foreign key into a ranged primary key can work.
  • The DELETE CASCADE pattern
  • Various ways to treat summary and detail table sets, like an inventory table and all of the various child tables that track things in and out of it.
  • Audit tables

Server-Side Code

Databases become very powerful when you tap into the abilities of triggers and stored procedures. We could easily have 4 or 5 essays on the various amazingly nifty things you can do with these technologies.

A Vocabulary Essay

Reasonable discussion of database concepts is often frustrated by the fact that there are three different groups of people using different terms for some of the same ideas: the relational crowd, the SQL crowd, and the coding crowd. What makes things worse for the new database programmer is that these three groups have different and often conflicting assumptions and goals, and these conflicts are not always stated outright when they speak. At some point I would like to do an essay explaining the three groups, the terminology they use, and what their assumptions and goals are.

Physical Reality and Performance

I have always found that great performance begins with table design and query design. After that, there are some basics that you need to know about indexes and database tuning, but after that performance tends to come down to knowing your application tools. One programming language may work best by fetching down the entire results of a query before processing it, while another works better by fetching and processing row-by-row.

At some point there will be at least one essay on the details of performance.

Handling Data Imports

Many database applications must accept imported data from other systems, often in large quantities. The source and destination tables often have different structures, and major performance issues can really slow down a project if you do not know the tricks for handling large quantities of data.

I have two or three essays in mind on the issue of importing large blocks of data into databases.

Security

It is probably safe to say that in 2008 most programmers using databases have no idea that a database is even capable of enforcing security. It is probably a complete surprise to most young programmers that proper use of server-side security renders your application immune to SQL injection!

In future articles I will present the basic and advanced concepts of server-side security.

Philosophy of Life

I have a few essays in mind which are more about the ideas that guide development. One of them has to do with where you put application logic, and another has to do with the analysis and table design process. There will be others as well over the coming months.

Monday, February 18, 2008

Database Development: Table Structure Changes

Welcome to the Database Programmer, the blog for anybody who wants to learn the practical realities of working with databases.

There is a new entry every Monday morning, and the complete table of contents is here. This week we are talking about tools again, specifically upgrades.

Structure Change Tools Are Crucial

There are two big things that your toolkit must contain if you are going to be a successful database programmer. The first, which we talked about last week, is a framework that respects relational concepts. The second, which we will talk about now, is a strong upgrade tool, something that can modify your structures for you.

Iterative Development

Requirements change in one way or another in the life of a project. Two philosophies exist to deal with this fact, which I tend to think of as the figure-it-all-out-before-we-code-damnit! school and the lets-make-sure-we-can-adapt-as-we-go school.

If you are a member of the second school, you know that a table design that seemed correct yesterday will reveal mistakes and weaknesses today, and will have to be changed.

In this situation the irredeemable code grinder will attempt to code his way out of the situation, adding edge-case handlers and special condition routines to make up for the fact that the tables do not match the requirements. After a few years of this, most programmers, except a handful who have been present since day one, will be afraid to touch the code, and inevitably the team will declare that they have some huge wonderful great new ideas that require them to start from scratch.

The database programmer by contrast will first revise the table design and update the structures before even considering what changes to make to the code. While the code grinder is always looking for the ultimate class structure first, the database programmer wants first to make sure that the tables match the requirements, always ensuring that any special-purpose code that follows is built on a strong foundation.

Therefore I state the iterative argument: the development process itself will require structure changes. The choice of framework will heavily influence your ability to perform this task.

The Operational Argument

Consider these four actions: are they more alike to each other, or more different from each other?

  1. A programmer spends four hours creating a new feature start to finish on his own workstation.
  2. A testing server is upgraded, merging 3 days' work from 4 programmers.
  3. A live server is upgraded to the latest published version of a package.
  4. A server has a package installed on it for the first time.

From a manager's perspective these are all very different operations, involving different kinds of people, different approval processes and different kinds of follow-up. A manager would say they are different.

However, from a technical standpoint they are all exactly the same operation. All four operations require the execution of these two steps:

  1. Modify table structures.
  2. Deploy new code.

A programmer sitting at a workstation on a new feature will frequently need to add a column to a table or make one or two new tables. Then she adds supporting user interface code where the framework default screens won't do the job. The operation is change tables + make new code.

When a QA server is updated with the latest efforts of more than one programmer, the operation is exactly the same: all table structure changes are made and all new code is deployed. The operation is change tables + use new code.

When a production server is upgraded it is exactly the same as the QA server: new table structures and new code.

When a package is installed for the first time and there is no database, it must be created. This is no different than the previous two examples. Instead of modifying tables it adds tables. Instead of adding one or two it adds them all. Instead of overwriting code it puts the code down for the first time. The operation then is still the same: change tables + use new code.

Approach For Trivial Cases

If your database structure is modest (less than 10 tables), if you have one or perhaps two programmers, and if you do not change the application much, then you can probably forget about a tool. Most tools and conventions are meant to coordinate more than one programmer making lots of changes. If you don't need that coordination then you don't need the tool, unless it really gives you something you like.

The Pros and Cons of Build Scripts

A build script is a program that does one or both of these two tasks:

  • Contains commands that modify table structures
  • Contains commands that manipulate data, such as populating rows in a newly created table or fixing some bug from a prior release.

Build scripts are a natural first step when your team and your customer count begins to grow. One of the biggest first wins is that you can put them into source control. This alone can be worth going to scripts even if you are just one programmer.

It is also fairly trivial to make a small library of commands like "AddColumn" that will check to see if the column is already there and only attempt to add it if necessary. This makes the scripts safer to rerun, which is always a good thing.
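
A minimal sketch of such a command, assuming PostgreSQL and its information_schema views; the function name and parameters are my own invention:

CREATE OR REPLACE FUNCTION add_column(p_table text, p_column text, p_type text)
RETURNS void AS $$
BEGIN
   -- Only attempt the ALTER if the column is missing, so the
   -- build script is safe to rerun.
   IF NOT EXISTS (SELECT 1
                    FROM information_schema.columns
                   WHERE table_name = p_table
                     AND column_name = p_column) THEN
      EXECUTE 'ALTER TABLE ' || quote_ident(p_table)
           || ' ADD COLUMN ' || quote_ident(p_column)
           || ' ' || p_type;
   END IF;
END;
$$ LANGUAGE plpgsql;

-- A build script then simply calls:
SELECT add_column('bipads', 'discontinued_date', 'date');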

You can also have programmers create a build script for each distinct feature they work on, also making it easier to identify which structure changes are tied to which unit of work.

A few years ago I became programmer #5 on what had been a four man team. This team used build scripts exactly as I have just described them. When I left a little over three years later we were about 20 programmers on two continents and over 50 customers. This growth revealed a few weaknesses in the build script method.

  1. Separate build scripts are slow. If you keep separate scripts for each unit of work, you risk that a major upgrade will rewrite a large table more than once. When the table is 10 Gigs this is no laughing matter.
  2. Quality Control is Impossible. Scripts are usually the least tested part of the system. The programmer runs a script at most a handful of times, and usually in unrealistically ideal circumstances. Then an IT person runs the upgrade during off hours (Friday night at 8:00pm) and the script crashes, and either you get a phone call and go back to work or the customer is told the upgrade failed. Yuck.
  3. Dependencies Become Horrible. There will always be a need to run the scripts in the order they were written. Even if clever programming gives you a certain leeway here, you will keep running into cases where a programmer must write his script on the assumption that another script has already run. If you have even 3 features tying each other up like this then suddenly we have 3 or 4 programmers stuck in meetings trying to coordinate, and we all know how fun it is trying to get programmers to coordinate in a meeting!

I do not mean to say that you should not use scripts. All of the hazards mentioned above can be acceptable if the scripts give you enough productivity to operate profitably. But if you want maximum upgrade performance and quality, you will sooner or later have to go to meta-data and a differences engine.

End of the Line: Meta Data

The only way to solve all three of the problems mentioned above is to have a single file of some form, whether it be YAML, XML or even plaintext, which describes the database in its entirety. This file is placed under source control and is delivered with other files in the upgrade. We call this file a "meta-data" file because it contains data about data.
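
As a sketch, a YAML meta-data file describing a pair of tables might look something like this; the exact layout, key names, and nesting are illustrative, since each build tool defines its own format:

tables:
  magazines:
    columns:
      magazine:    {type: char, size: 10, primary_key: true}
      description: {type: varchar, size: 35}
  bipads:
    columns:
      magazine:    {type: char, size: 10}
      bipad:       {type: char, size: 5, primary_key: true}
    foreign_keys:
      - {columns: [magazine], references: magazines}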

When an upgrade runs, the build tool takes that file of yours and creates a picture of what the database should be, and then examines the current structure. Once this is done the tool can work out which changes need to be made, and execute them. Tools such as these work for development, upgrades, and new installs, without needing a separate tool for each case.

A difference engine performs best because it makes only the changes required in each situation. It yields the highest quality because programmers are supplying inputs to the builder instead of writing new code themselves every time. Finally, it reduces dependencies by the simplest of all approaches: it just builds what it knows about and is immune to any trap where one feature requires another.

Conclusion: Pick a Tool for Your Situation

This week I have reviewed three basic approaches to handling structure changes. The simplest situation may require no special tool, while a larger but modest situation requires a bit more, and the final stage requires an intelligent meta-data based agent. I have also shown that many operations in the development cycle are in fact the same operation over and over: table changes + new code. This means that a good tool for handling these two operations can be useful for development, QA, the new install, and the upgrade.

Next week we will return to Table Design Patterns with "The Primary Key That Wasn't."

Monday, February 11, 2008

The Framework And The Database

Welcome to the Database Programmer. This is a regular blog for anybody who wants practical information about how databases work and how your database and application code work together.

There is a new entry every Monday morning, and the complete table of contents is here. This week we are going to depart from our discussion of Table Design Patterns to get into the area of frameworks and tools.

The Trouble With Deep Nesting

Today's example deals with a situation of "deep nesting", where one table has not just a parent, but a grandparent and perhaps a great-grandparent. This case is going to bring into clear light one of the basic realities of writing database applications, which is that the tools and framework you use matter a lot. Complex and deeply nested schemas really lay bare how a framework supports the use of a database.

The trouble becomes plain to see when you have dived deeply into a complete application and are starting to spell out a large number of relationships between entities. Orthodox database theory will lead you first to simple primary keys and foreign keys, but from there it will lead you to multi-column primary and foreign keys and overlapping unique keys and overlapping foreign keys. Your choice of framework and other tools can have a dramatic effect on the choices you have in table design.

The Example: A School Class Schedule

In the past three weeks we have looked at the enrollment of students in courses and the assignments of teachers to teach those courses. In these three weeks we have been working through five rules:

  1. A teacher may not teach a class he/she is not qualified to teach.
  2. No two classes can be given in the same classroom in the same period.
  3. No teacher can be in two places at once, a teacher can only teach one class in one period.
  4. No student can be in two places at once, so no student can be in more than one class in the same period.
  5. A student cannot take the same class again after passing the class once.

In each of the past three weeks we have looked at just one or two rules. Now we are going to look at all of them together, including Rule 4 which so far we have not looked at.

For the Sake Of Argument: Completely Normalized

We will start by looking at a schema that has no integer primary keys and uses only natural keys. This is a very useful learning exercise even if you do not do things this way yourself.

  • Complex multi-table schemas stretch your table design skills so they improve your overall abilities.
  • Normalized tables that enforce all rules with primary keys and foreign keys require zero application code to maintain correctness, so if our framework supports the normalized schema then that is the way we want to go.
  • Following up with the previous point, if our framework does not make it easy for us to use the normalized tables, the exercise allows us to see what we are giving up (in terms of having to write more application code) for whatever benefits the framework offers.
  • The completed normalized schema is a solid reference point for whatever design we actually choose or are forced to choose. Having it available may be useful in resolving difficulties that come up if we are forced to abandon it by inadequate tools.

With that being said, here is the completely normalized schema where five rules are enforced solely through primary keys and foreign keys:

COURSES          TEACHERS
---------        ---------
 XXXXX            XXXXXX
 XXXXX            XXXXXX
  |                  |
  |         +--------+
  |         |
 /|\       /|\
COURSE | TEACHER         Teacher qualifications table
-------+---------        Primary key: course + teacher
XXXXX  | XXXXX
   |      |
   +--+---+ 
      |
      |Rule 1: Teachers must be qualified for each course
      |
     /|\             +--------+-------+---------> Rule 2: Only one course
  ===========        |        |       |           per classroom
COURSE | TEACHER | PERIOD | YEAR | CLASSROOM  
-------+---------+--------+------+-----------   
 XXXX  | XXXXX   | XXXXX  | XXXX | XXXXX    
 XXXX  | XXXXX   | XXXXX  | XXXX | XXXXX
 XXXX  | XXXXX   | XXXXX  | XXXX | XXXXX
 XXXX  | XXXXX   | XXXXX  | XXXX | XXXXX   
           |         |        |
           +---------+--------+-----------> Rule 3: Teacher cannot be
                                            in two places at the 
   |                  |       |      |      same time.
   |                  |       |      |
   +----------------+-+-------+------+
                    |            Primary key describes 
                    |            the actual unique entry,
                    |            4 column primary key!!
                    |
                    |
                    |  Unstated Rule: We never said this anywhere 
                    |  but a student can only enroll in 
                    |  classes that are actually offered
                   /|\          
   ===========================
COURSE | PERIOD | YEAR | CLASSROOM | STUDENT
-------+--------+------+-----------+---------
 XXXX  | XXXXX  | XXX  | XXXX      | XXXXX    
 XXXX  | XXXXX  | XXX  | XXXX      | XXXXX
 XXXX  | XXXXX  | XXX  | XXXX      | XXXXX
 XXXX  | XXXXX  | XXX  | XXXX      | XXXXX
   |       |       |                   |
   |       +-------+-------------------+---> PRIMARY KEY  
   |                                         Rule 4: Student cannot be
   |                                   |     in two places at once
   +-----------------------------------+     
                                       |
                                       |
                      Rule 5: Student may not take         
                      a course more than once.

Wow! That is quite a bit more complicated than anything we have seen before in these essays. Yet for all of its complexity it is composed entirely of unique constraints, primary keys and foreign keys. Because a primary key is functionally the same as a unique constraint, we have satisfied all five rules with only two building blocks.
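
For reference, here is a sketch in SQL of the bottom two tables from the diagram. All table and column names are illustrative, and I assume a TEACHER_QUALIFICATIONS table whose primary key is course + teacher, as shown at the top of the diagram:

CREATE TABLE COURSE_OFFERINGS (
   course char(10)
   ,teacher char(10)
   ,period char(5)
   ,year char(4)
   ,classroom char(10)
   ,primary key (course, period, year, classroom)   -- the 4-column key
   ,unique (classroom, period, year)                -- Rule 2
   ,unique (teacher, period, year)                  -- Rule 3
   ,foreign key (course, teacher)
      references teacher_qualifications (course, teacher)  -- Rule 1
);

CREATE TABLE ENROLLMENTS (
   course char(10)
   ,period char(5)
   ,year char(4)
   ,classroom char(10)
   ,student char(10)
   ,primary key (student, period, year)             -- Rule 4
   ,unique (student, course)                        -- Rule 5
   ,foreign key (course, period, year, classroom)   -- only offered classes
      references course_offerings (course, period, year, classroom)
);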

So we are now ready to look at what this might look like if we go with integer primary keys. Then we will be ready to compare the two.

Looking at it With Integer Keys

I am going to take just the bottom two tables for this example, because the changes for those tables will be the same as the others. I am also concentrating only on rules 4 and 5, since the pattern repeats for the others. These two tables might look like this:

ID | COURSE | TEACHER | PERIOD | YEAR | CLASSROOM  
---+--------+---------+--------+------+-----------   
 1 | XXXX   | XXXXX   | XXXXX  | XXXX | XXXXX    
 2 | XXXX   | XXXXX   | XXXXX  | XXXX | XXXXX
 3 | XXXX   | XXXXX   | XXXXX  | XXXX | XXXXX
 4 | XXXX   | XXXXX   | XXXXX  | XXXX | XXXXX   
 |     |                  |       |
 |     |                  |       |   
 |     |        +---------+       |  These three columns
 |     |        |        +--------+  must be copied??
 |     |        |        |
 |    \|/      \|/      \|/
ID | COURSE | PERIOD | YEAR | STUDENT
---+--------+--------+------+---------
 2 | XXXX   | XXXXX  | XXX  | XXXXX    
 2 | XXXX   | XXXXX  | XXX  | XXXXX
 3 | XXXX   | XXXXX  | XXX  | XXXXX
 3 | XXXX   | XXXXX  | XXX  | XXXXX
      |        |        |       | |
      |        +--------+---------+---> Rule 4: Student cannot
      |                         |       be in two places at once
      |                         |                                  
      +-------------------------+            
                                |       
                                |       
                      Rule 5: Student may not take         
                      a course more than once.

In this case the child entries need only an ID value from the parent and a value for STUDENT. We can still use unique constraints to enforce rules 4 and 5 if we copy values from parent to child.

Only One Difference Between Them

There is only one difference between these two approaches. The normalized approach requires no application code to maintain correctness, but the non-normalized integer key approach requires application code to enforce rules four and five. This application code can take one of two approaches:

  1. Copy Values from the parent table to the child and use unique constraints enforced by the server, which is how I've depicted it (a sketch follows this list), or
  2. Check Rules 4 and 5 in Code before an insert to the child table. There will be other considerations, like possibly disallowing changes to PERIOD or CLASSROOM once the schedule is populated; otherwise you have to recheck every student when these change.
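
Here is a sketch of the first approach, assuming PostgreSQL and integer-key variants of the tables: an OFFERINGS parent with an id column, and an ENROLLMENTS child carrying an offering_id plus the copied columns. All names are illustrative:

CREATE OR REPLACE FUNCTION enrollments_copy_parent() RETURNS trigger AS $$
BEGIN
   -- Copy COURSE, PERIOD and YEAR down from the parent row so
   -- that the child's unique constraints (rules 4 and 5) can
   -- do their work.
   SELECT course, period, year
     INTO NEW.course, NEW.period, NEW.year
     FROM offerings
    WHERE id = NEW.offering_id;
   RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER enrollments_before_insert
BEFORE INSERT ON enrollments FOR EACH ROW
EXECUTE PROCEDURE enrollments_copy_parent();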

Because the normalized version requires less application code, that would be my choice, barring any downstream difficulties that would make it impractical. Before today I have never in these essays raised the issue of possible problems with "downstream code." Most of us use libraries of code called "frameworks", some of them taken from the internet and some of them homegrown. I raise the issue of frameworks now because we have seen quite a few examples and a very strong pattern is emerging. The pattern is that many business requirements can be cast as unique keys and foreign keys in normalized tables. Normalized tables require less code, and keys require no code. If this pattern continues as we see more Table Design Patterns, it will become more and more important to make sure the framework can support this approach.

Why The Framework Matters

Let us look at a very common user interface issue, filling in values from lookup tables. Consider the case where the complete schedule has been generated by a custom program, but now a school administrator must make a manual change for a particular student. The administrator will need some editing screen where she brings up the student's schedule, drops some course, and adds another.

This operation requires that the administrator be shown a list of available courses. A few years ago we would have done this with an HTML SELECT, but nowadays we probably have some fancy Ajax-ified DIV-based list doing it. But the result is the same: when the user has made her selection, the values are filled in.

Now at this point, if the framework does not know how to handle multi-column foreign keys, the whole normalized idea is out the window. Because it is true that a well-designed database is no good without a user interface, and because we usually depend on the framework to provide the user interface, the choice of framework suddenly seems very serious indeed.

Conclusion: The Importance of Tools

From the beginning of this series in November 2007 until this essay, February 11, 2008, I have shown examples of how normalized tables lead to less code and easier code. Further, I have shown examples of how different table design patterns can be used to enforce business rules with the two basic tools of a unique constraint and a foreign key.

Now however we have seen enough examples and learned enough basics that we have to begin to ask about the layers above the database. Do they support our decisions in the database layer and can they implement those decisions? Or will they get in the way and cause us to compromise our table designs and end up with more application code that is complicated and error prone?

So now it is time to expand the original statement that 'good tables lead to good applications' and say as well that 'a good framework enhances (or wipes out) the strengths of the table design'.

Next week we are going to continue to examine how much tools matter by looking at the process of modifying table structures. Many frameworks and philosophies that are out there today are based consciously or subconsciously on a simple desire to try to avoid changing table structures. However, since table structure changes are a fact of life, we will look at how best to handle them.

Sunday, February 3, 2008

False Patterns Such as The Reverse Foreign Key

Welcome to the Database Programmer. This is a regular blog for anybody who wants practical information about how databases work and how your database and application code work together.

There is a new entry every Monday morning, and the complete table of contents is here. We are currently looking at Table Design Patterns and how they lead to tight and efficient code.

The Example: A School Class Schedule

In the past two weeks we have looked at the table of courses that teachers are teaching and students are enrolled in for a year. The five rules are:

  1. A teacher may not teach a class he/she is not qualified to teach.
  2. No two classes can be given in the same classroom in the same period.
  3. No teacher can be in two places at once, a teacher can only teach one class in one period.
  4. No student can be in two places at once, so no student can be in more than one class in the same period.
  5. A student cannot take the same class again after passing the class once.

Two weeks ago we looked at rule 1 and recognized a Cross Reference Validation Pattern, and last week we looked at rules 2 and 3 and saw a Limited Transactions pattern. This week we will look at rule 5 and see the Primary Key In Disguise, and next week we will wrap up this example by looking at how to do rule 4.

First Attempt: Restating Rule 5 In Terms of Tables

There are plenty of times when a customer will give you a requirement that sounds something like, "If X happens, Y may not happen." Rule 5 sounds like this kind of rule:

"If X Happens..."    | "...then Y may not happen"
                     |
If a Student passes  | the student may not
a course             | take the course again

The first thing we want to do is translate this into rules about entries in tables. Our first attempt might sound something like "An entry in the completed courses table prohibits an entry in this year's schedule." It might look like this:

HISTORY TABLE                                    THIS YEAR'S CLASSES

Student | Course                                 Student  |  Course        
--------+---------                              ----------+----------
Nirgal  | History       <-----------             Nirgal   |  Calculus
Jackie  | History     A "reverse" foreign        Nirgal   |  Geometry  
Nirgal  | Physics     key says entries in
Jackie  | Physics     history prevent entries
                      in current enrollment

The problem with this statement is that there is no way to enforce such an idea. It is not a foreign key; it is more like a reverse foreign key. A foreign key says a parent row must exist, but here we are saying a parent row must not exist. If there were such a thing as a reverse foreign key we would be ready to move on to the next task. But since there is no such thing, we must keep looking.

The Customer Does Not Design Tables

So far we have a formulation "An entry in the completed courses table prohibits an entry in this year's schedule." But here I have deliberately made a common mistake to demonstrate how programmers sometimes deal with the requirements given to us by end users. I have taken the rule and turned it into a statement about tables without first doing any real thinking. This is like translating a sentence word-for-word from one language to another. When you transliterate the user's requirements this way instead of translating, the result is exactly the same as for human languages: you get nonsense.

So we must now remember that the customer is not in the business of table design. The customer will explain their needs as best they can, but we should not expect the customers' statements to translate directly into table definitions.

So if we look at this rule again we realize that when one thing excludes another thing we may be looking at a primary key or unique constraint. If the student's completed courses and the student's current courses are stored in the same table then the problem is solved. We make a primary key or unique constraint on student + course and that is the end of that.

Relational Does Not Meet My Needs!

Nowadays (February 2008) it is pretty common to hear programmers who are relatively new to the database world say: "Relational just cannot do what I need, my customers have complex needs that don't fit into relational concepts." If that programmer is handling text, like books, or media files then they may be right. But if not, that programmer probably:

  • ...has been taking the customer's requests at face value instead of translating them into solid concepts.
  • ...is not aware of dead-end patterns like the "Reverse Foreign Key" and how to recognize the valid patterns that are hiding behind them.

Now that we know we probably have a primary key, and not some kind of weird reverse foreign key, it is time to design the table.

A Primary Key In Disguise

Putting them into the same table is not that big a deal. It is nothing more than a list of courses the student has taken or is currently taking. The table would look something like this:

The students_x_courses Table

  Primary Key (or perhaps an int primary
       |       key and this is a 
       |       unique constraint)
       |
  +---------+
  |         |
Student | Course   | Year  | Flag_hist |  Grade
--------+----------+-------+-----------+--------
 23     | HIST-302 | 2005  |  N        | 
 23     | PHYS-101 | 2003  |  Y        |  92     
 23     | CHEM-211 | 2004  |  Y        |  96       
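
In SQL, the table might be declared something like this; column types and sizes are illustrative:

CREATE TABLE STUDENTS_X_COURSES (
   student char(10)
   ,course char(10)
   ,year char(4)
   ,flag_hist char(1)
   ,grade integer
   ,primary key (student, course)
);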

Deletions Required

We should note that if a student fails a course, the approach we are taking requires that the row be deleted outright; otherwise they cannot take it again. All database programmers share a deep uneasiness about deleting data; it makes us nervous. There is good reason for this. As soon as you delete the failed classes, somebody will ask for some statistics on how often students fail courses, which you cannot provide, because you deleted the data! Nobody wants to be having that conversation!

Therefore we should assume that we will have a separate table just to store a record of course failures. We will copy failure records to that table before we delete them.

Putting Current Data and Historical Data Together

There are plenty of strange ideas out there, and sometimes an idea from ages past will persist long after anybody remembers where the idea even came from. One such idea is that you should not mix together historical and current data.

This idea came from long ago, when hard drives were extremely expensive and many data files were stored on tapes or cartridges and never got near a hard drive. The idea was to separate your live and historical data because the bulk of your operations required one or the other but rarely both, and you could not afford to keep them both available at the same time. In really old-fashioned batch operations, something we take for granted like a simple query would be done by loading tape after tape onto a refrigerator-sized tape machine that executed some query program. If you kept history and live data together you might have to load 30 tapes, but if you kept them separate you would only have to load 3 tapes to pass the live data. That basic practice has stayed with us ever since in the form of prejudices and vague advice about "not mixing current and history."

But nowadays we do not need to worry too much about hard drives, there is no validity to most of the reasons for keeping this data separate.

Conclusion: Translation, Not Transliteration

This week we saw what happens when we take user requirements at face value: they lead us into dead-end design patterns, that is, patterns that are not valid and cannot be implemented. This is the sad result of transliteration, the direct conversion of user statements into table designs. We saw how important it is to take the users' statements and seek out the "disguised" primary keys, foreign keys and so forth.

The correct process is translation, looking at the user statements with a critical eye and seeking out the primary keys, foreign keys and table definitions that are lurking there. Those programmers who do not learn this skill will be led into the false belief that they are somehow dealing with problems that no human being has ever seen before, and that they need some post-relational or extra-relational or razmataz-relational system that will provide a unique, strange and clever solution to their troubles. The truth is usually much less interesting and not nearly so ego-inflating. The truth is that most data does fit into tables pretty neatly, and that most rules (but not all) can be expressed as unique constraints and foreign keys.

Next week we will wrap up this example by looking at rule 4, that no student can be in two places at once.

Next Week: The Framework and The Database