Sunday, February 24, 2008

The Primary Key That Wasn't

Welcome to the Database Programmer, the blog for anybody who wants to learn the practical realities of working with databases.

There is a new entry every Monday morning, and the complete table of contents is here. This week we are talking about tools again, specifically upgrades.

The Impermanent Primary Key

Sometimes you get a situation where you have a great possibility for a natural key, except that the key will change from time to time. I call this the "Impermanent Primary Key" and we are going to look at this pattern today.

Magazines and BIPADS

I am currently writing a system for a magazine distributor, while also maintaining his old system for him. When I first started with him he explained that a magazine is identified by a "BIPAD", a 5 digit number unique to the magazine. It seemed to me that if you take a BIPAD, and issue number, and a volume year, you have a pretty good choice for a natural primary key.

Then one day he walked up to me and said, "I have fourteen magazines with new bipads this month." This was a surprise, as I had been fairly sure that a BIPAD and a magazine were uniquely matched for all eternity.

Sidebar: Did the Requirement Change?

Before we get into the design pattern for this, it is worth asking, is this a case where the customer changed the requirement? The answer is a most emphatic no! The customer told me, if I had listened precisely, that a magazine "has a BIPAD". He never said that a magazine has only one unique BIPAD, I inferred that myself. Such inferrences are mistakes, and we cannot blame them on the customer, because it was my job to ask all sides of the question, such as "can a magazine have more than one BIPAD?" or "Is a BIPAD permanent?" or "Can a BIPAD ever be used again on another magazine?"

Spelling Out The Requirements

So the reality is that a BIPAD is assigned to only one magazine, but a magazine may get more than one BIPAD, as the bipad will change from time to time. My customer's operational requirements are:

  1. The history of individual BIPADs must be preserved in the transaction tables.
  2. The histories of multiple BIPADs for a single magazine may need to be combined from time to time.
  3. During a BIPAD change, transactions will continue for both BIPADs.

I should also note that transaction information comes from outside sources and always contains the BIPAD. It is our responsibility to make sure the BIPAD can be mapped into our database correctly. We have no power to ask outside parties to use numbers created or assigned by us.

Some Candidate Solutions

There are a few ways to handle this. Here are four choices that are most likely to come up in a brainstorming session:

  1. Use BIPAD as a primary key in the BIPADS table, and change the BIPAD as required.
  2. Create a From-To table that maps old BIPAD values to new values.
  3. Use an integer primary key in the BIPADS table and make BIPAD just a regular column, so you can just change it as required.
  4. Make the table of BIPADs a child table to a table of MAGAZINEs. When a BIPAD changes make a new entry in the BIPADs table.

We can eliminate option one immediately, because foreign keys do not allow this possibility. If you have a table of BIPADS, with BIPAD as a primary key, then your transaction tables will have foreign keys to this table. If you could change the value of a BIPAD then suddenly those transactions that reference the old value would not be valid. We say in every day language that you would orphan the transactions. A foreign key exists to prevent such things, so option one is out.

Option two looks pretty reasonable, but it is no good at all. I included option two as an example of what happens when you use a coding mentality to design tables. Option two records accurately an action, the changing of a BIPAD, by recording the old and new values. While this may seem harmless, it makes a wreck out the idea of using primary keys and foreign keys. What is the primary key of such a table? What table and column do the transaction tables reference? When an idea gives problems for the basic building blocks of a database, we know that nothing but trouble is going to follow.

This leaves options three and four, which we will now consider in more detail.

The First Two Requirements in Detail

The first two requirements are that we must be able to examine the individual histories and the combined histories as requested. Options three and four both allow this, as both of them have the BIPAD values in the transaction tables and both options also have a table that can be used to combine histories.

Option four uses the BIPAD values natively in all tables, so there is no need for any special planning or actions when saving data or querying it. If you want to query for a single BIPAD, then specify that. If you want to query for a magazine, then JOIN to the MAGAZINES table and filter on your magazine and there you go. Option four satisfies the requirements with no fanfare.

Option three requires some additional storage. The transaction tables need to keep track of the BIPAD for accuracy, but they also need that meaningless integer foreign key to the table of BIPADS. This always strikes me as funny because the integer key is promoted as a performance method, but the basic reality of a disk drive is that performance goes as the amount of data you have to read and write. Where option four writes only the value of BIPAD, option three must write more data, and therefore will always be slower.

Option three also requires more code and more disk activity. Because we receive transaction data as BIPAD values, but we are using a meaningless number as the foreign key to the BIPADS table, we have to do a lookup into the BIPADS table to find out the meaningless integer key. This means we need a read operation that was not necessary with Option four. Again this is rather ironic since the integer key is promoted as a performance tool.

Finally, the problem of having to look up the value of the BIPAD's integer key introduces application code, so now with Option three I have to write code where for Option four I do not.

But when all is said and done, if your tools lock you into option three and you have no choice, you can in fact satisfy requirements one and two. You wll be able to examine both the individual histories and the combined histories. So we probably have to say so far that it is a matter of taste, tools, and habits.

Disqualifying Option Three

The third requirement is that transactions will occur for both BIPADs during a transition. For option four this is not really an issue because there are two rows in the BIPADS table, one for each BIPAD. Any transaction table that is a child table of BIPADS is automatically covered by the foreign key.

Option three however has a fatal problem. When a transaction comes in with a BIPAD on it, we have to validate it by looking up the BIPAD in the BIPADs table, so we can get the meaningless integer primary key. Except that after a new value is entered the old value will not be there. We cannot satisfy the third requirement and Option three is now disqualified.

Those who are comfortable using integer primary keys for everything may suggest a way to rescue the situation, but because Option four does not require any application code and does not require rescuing, I will leave it to others to rescue option three.

More Details On The Impermanent Primary Key

When we have an impermanent primary key, one that changes from time to time, we can create a pair of tables. The master entities are tracked in the top table, and the child table tracks the impermanent values. This second table is the parent of all of the transactions:

In the example above, which has to do with magazines and BIPADs, the master table I built uses character primary keys, and the child table uses the BIPAD values, so it looks something like this:

TVGUIDE  | 12345      | XXXXX
TVGUIDE  | 34345      | XXXXX
COSMO    | 29830      | XXXXX
COSMO    | 23813      | XXXXX 
     Transaction Tables have foreign keys
     to this column

The SQL that would create these two tables might look like this:

  magazine char(10)
  ,description varchar(35)
  ,primary key (magazine)

   magazine char(10)
   ,bipad char(5)
   ,primary key (biapd)
   ,foreign key (magazine) references magazines (magazine)


The "Impermanent Primary key" pattern occurs when some value is permanent as far as individual transactions are concerned, but may change over the lifetime of the master entity. In these cases, we create a parent-child table pair. The actual master table sits at the very top, with a child table below it that holds the semi-permanent values. All transactions are children of the child table.

Next Essay: The Requirements are Always Wrong, Or, Iterative Database Development.


--DD said...

Sounds like your solution assumes the answer to "Can a BIPAD ever be used again on another magazine?" is no, right?

Unless you use BIPAD+Magazine_ID as PK and therefore FK in the transactions table(s).

Just trying to follow ;-)

KenDowns said...

DD: you are correct, I did not state it in the essay but our design does not expect BIPADs to be used again.

Bruno Martinez said...

Option number one got shafted because it was incompatible with FKs. Can't you change the values all over your database, in child tables too? I thought cascade update did exactly this.

KenDowns said...

Bruno: Option 1 actually violates the customer's first requirement, that histories for both BIPADS be preserved.

I think the OP could have been clearer here. While it is possible to specify an UPDATE CASCADE on the primary key, such an action would modify the history of transactions with old bipads, which is not allowed. If the UPDATE CASCADE were not used, then older transactions would become orphaned. So either way it is out.

Mike said...

Seems that your experience shouldn't have happened: Looking into BIPAD Inc, they emphatically guarantee "Your BIPAD number NEVER CHANGES"

Seems even the best laid plans can always change. :)

ericdes said...

I didn't quite understand why you disqualified option 3, or maybe I'm assuming the wrong thing, isn't option 3 about a table like this? ID (PK), BIPAD, OTHERS...

KenDowns said...

@erides: option 3 is disqualified mostly because it requires a lot of application code.

But really it comes down to option four being a complete and sane record of what is going on, so it will be the easiest to code against. All other options will require more code or will corrupt history.

Malcolm said...

Firstly let me say I'm really enjoying your series and am finding it very useful.

re the present essay: think you're being a bit disingenuous with Option 3 vs 4.

Retro-converting to Option 4 introduces an abbreviated magazine name field that must be populated either manually, or by executing specially developed (and customer dependent) coding to find an algorithm that creates unique abbreviations (or is it just my ignorance of SQL?)

In any case the resulting field is longer than the (automatically generatable) 4(?) byte int that would do the job under Option 3 which you decry for introducing extra fields.

KenDowns said...

@malcom: The key difference between 3 and 4 is not the kind of key, but getting the records to be accurate. Option 4 is a completely accurate record of what is going on, so it will require the least code to maintain and be the easiest to query.

As for the magazine codes, i did in fact write a generator for abbreviations when I uploaded the data, new magazines are left to the user to pick an abbreviation -- they just don't happen that often. See the essay "A sane approach to primary keys."

Anonymous said...

So what's the solution if you want to change the magazine name and keep track of the sales of the magazine under both name ( combined and individually)?

KenDowns said...

@anonymous: happens all the time. The magazine name is not the same as the code. While the example does not show it, the name can be set in the lower transaction tables. When summarizing you do it as you always do, by joining up to the top level tables.

Robert said...

I've been enjoying reading through your posts, it's been very informative.

On this particular post, I'm getting a broken image: