Monday, December 3, 2007

Database Skills: Primary Keys

This is the second in a series of articles that will help you develop your database skills, Since all non-trivial websites require a database, and since database skills are different from coding skills, there is very good reason for any programmer to master the principles of database design and use.

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.

The Example

This week the example continues from last week. A school has 8 periods in a day that are varying lengths, sometimes an assembly, and of course homeroom. Our assignment is to code a screen that shows the begin and end times of each period depending on what schedule was being followed for the day.

Last week we introduced the idea that the database programmer will tend towards trying to spell it out. That is to say, the database programmer will cast the solution in terms of saving information in a database that can be retrieved later on. He will generally avoid coding an algorithm that "figures it out" if he has the chance to simply store data that can be retrieved as needed. We will now see how a primary key works by looking at two tables that make up part of the solution.

The Primary Key Defined

The primary key is pretty easy to understand. It is simply the column or columns in a table that must be unique. In our example of the school schedule the database programmer is going to create two reference tables that have single-column primary keys and just a few rows each. Here they are, the table of schedule types and the table of periods:

NORMAL            |   Normal Schedule 
ALONG             |   Long Assembly
ASHORT            |   Short Assembly
ALUNCH            |   Short Assembly After Lunch
HR           |   Homeroom
1            |   Period 1
2            |   Period 2
3            |   Period 3
4            |   Period 4
LUNCH        |   Lunch
6            |   Period 6
7            |   Period 7
8            |   Period 8
ASSM         |   Assembly

The first table has a primary key of SCHEDULE_TYPE. No two rows in that table can have the same value of SCHEDULE_TYPE. The second table has a primry key of PERIOD, so that no two rows in that table can have the same value of PERIOD.

A primary key is in fact very simple and does not really require a lot of discussion except for one further point: a primary key identifies something. For example, the value 'NORMAL' identifies a certain kind of schedule, and if I am given this value I can look up things about the schedule in the SCHEDULE_TYPES table.

Detour: Code and Data Decisions

These examples are so far very simple, so it may not be obvious that the tables SCHEDULE_TYPES and PERIODS are really necessary. After all, we can just put them into an associative array in some type of include file and not bother making tables, right?

Technically, you can of course put anything you want into an include file and save yourself some time at the front of a project. The drawbacks emerge over time and are the consequence of "trapping" data inside of code.

The first and most obvious drawback is that if the customer wants to change a description then a programmer must change a program file, instead of just having a user go to an admin screen. But worse, what if our program becomes a smash hit and schools all over the country are using it? What if they all want different descriptions, or what if they have different numbers of periods and kinds of schedules? If we start off by putting such things into tables then we will be well prepared for future demands, but if we "trap" the data in code then future success will produce choking demands on the programmers to change the files.

If the tables were more complicated we would have another issue. Databases were designed for efficient querying, but arrays were not. Querying the data that is trapped in code is much harder than querying tables in the database.

So we will take it as a given that data belongs in a database, even for simple two-column tables.

The Primary Key And a Classic Blunder

Programmers are by nature visual thinkers and we tend to be good at pattern recognition. This is why it is so common for programmers to look at the example above and say, "Hey! I just noticed that both of those tables have the same structure, a code and a description! I've got a great idea, I'm going to combine them into one table, which will be more efficient!"

This is a classic blunder. It is a result of applying wisdom from one situation into a different situation where that wisdom does not apply. Here is what happens. The programmer takes his idea of efficiency, which he learned from writing code, and tries to apply it to table design, which operates on completely different principles. He thinks that by replacing two tables with one he is being more efficient, but the real result is less efficiency.

First off, the new combined table now requires a new column, ROW_TYPE, which was not necessary before. More complicated, less efficient.

Secondly, a simple query to pull all types of schedules now requires a filter on ROW_TYPE that was not required before. This raises major alarms for seasoned programmers because any time you make a simple action more complicated with no apparent gain then you are on a bad road!

But most importantly for today's essay, the "efficient" single-table approach has muddled the primary key. The value 'ASHORT' by itself no longer uniquely identifies a schedule type. To solve this we end up in strange discussions. Should the column ROW_TYPE be part of the primary key? Or should we keep the single column key and have the odd situation that a schedule type and a period cannot have the same code? These conversations should raise alarms with any seasoned programmer because we have barely gotten started with two simple tables and already we are struggling with the very basics of picking primary keys! Even a programmer with no database experience should be suspicious of an approach that requires wrangling with the very basics. What good can come of that? The big irony is that these complications arose from the "simplification" of combining two tables into one.

Finally, there is a big technical problem with foreign keys and the one-table approach, but that can wait until next week.

All of these problems can be avoided by remembering that each table in a database contains information about one kind of thing and only one kind of thing. The primary key is that column or columns that uniquely identifies each of these individual things.

Conclusion: Look for the Keys

The primary key is simple to understand. It is a column or columns that must be unique in a table. But beyond that, the primary key is not just unique values, but unique values for like items. Database design skills begin with identifying the kinds of things that must be tracked, putting each into a table, and assigning the primary keys to those tables.

Other Posts

This essay is part of the Keys, Normalization and Denormalization series. There is also a Complete Table of Contents and a Skills-oriented Table Of Contents.


Anonymous said...

Good articles, thanks. Small typo in this line:

"columns that uniquely idenfifies each of these individual things"


KenDowns said...

Louis, thanks.