Welcome to the Database Programmer. This blog is for
anyone who wants to learn about databases, both
simple and advanced. Since all
non-trivial websites require a database, and since
database skills are different from coding skills, there is very good
reason for any programmer to master the principles
of database design and use.
Every Monday morning there is a new entry. The complete
table of contents for the Monday morning series
is here.
This week's entry is rather long. I strongly considered splitting
it into two weeks, but decided to keep it as one so that
it would be an easier source of reference in the future.
There is no One True Primary Key
There are several competing theories out there on how
to choose primary keys. Most of them tell you to use
a single kind of key for all tables, usually an
integer. In contrast to
those theories I have found that a robust application
uses different kinds of keys for different kinds of
tables. In the last 15 years I have worked on projects
large and small, simple and complex. Sometimes I had
total technical control, and sometimes I had to work
with what others gave me, and sometimes it was a little
of both. Today's essay reflects what I have worked
out in those years, and how I build my tables today.
My goal is to report what actually works,
not to promote a particular theory about how everybody
should do something.
This week we will see "rules of thumb". A rule of thumb
is a guiding idea that will tend to hold true most of
the time, but which you may decide to change in certain
circumstances.
Rule of Thumb 1: Use Character Keys For Reference Tables
A reference table is one that tends to be strongly constant over
time and has relatively few columns. Sometimes a reference table
may come already populated by the programmer.
Examples include tables of country codes (perhaps with
international telephone prefixes), a table of provinces or states within
a country, or a table of timezones. In this series I have been using
the example of a school management program, for that program we might
give the user a reference table of school subjects like history, math,
physics and so forth.
For these tables it is best to make a character primary key, which we
often call a "code", as in "timezone code" or "country code" or
"subject code." The strategy is to make a code which can be used
on its own as a meaningful value that people can understand.
This gives us tables that are easier to use for both programmer
and end-user.
Let's consider our school management program. We have a table of
teachers (populated by the school staff), and a table of subjects which
we have provided as a reference table. When a teacher joins the faculty,
somebody must enter the subjects that that teacher is qualified to each.
The tables below show two examples of what this table might look like,
which is easier to read?
TEACHER - SUBJECT CROSS REFERENCE
EXAMPLE 1: INTEGER KEYS EXAMPLE 2: CHARACTER KEYS
Teacher | Subject Teacher | Subject
--------+---------- ------------+-----------
72 | 28 SRUSSEL | PHYSICS
72 | 32 SRUSSEL | CALCULUS
72 | 72 SRUSSEL | HISTORY
45 | 28 ACLAYBORNE | PHYSICS
45 | 29 ACLAYBORNE | CELLBIOLOGY
45 | 45 ACLAYBORNE | RUSSIAN
The table of character keys is much easier to work with, for the simple
reason that many times you can just use the codes themselves, so
you can avoid a lot of JOINs to the main tables. With integers you
must always JOIN to the master table so you can get a meaningful
value to show the user. But not only is the table itself easier
to read when you are debugging, it is easier to work with when
writing queries:
-- The character key example is pretty simple:
SELECT teacher,subject FROM teachers_x_subjects
-- The integer key absolutely requires joins
SELECT x.teacher_id,x.subject_id
t.name,s.description
FROM teachers_x_subjects x
JOIN teachers t ON x.teacher_id = t.teacher_id
JOIN subjects s ON x.subject_id = s.subject_id
I often hear people say they do not like SQL because it is so complicated
and they hate doing so many JOINs. It makes me wonder if the
person is lost in a JOIN jungle caused by very bad
advice about always using integer primary keys.
If you are using some kind of ORM system that tries to protect you from
coding any SQL, that basic problem of over-complicated tables will still
appear in your code. One way or another you must enter details that tell
the ORM system how to get the descriptions, which would not be necessary
if the keys were meaningful character values.
We can now see the surprising fact that the integer keys will slow us down
in many situations. Not only do they have no performance advantage, but
they actually hurt performance. The reason is because they require
joins on almost every query. A 3-table query with two joins will
always be much slower than a 1-table query with no joins. If you are using
an ORM system that does not do JOIN's, but instead does separate fetches,
then you have 3 round trips to the server instead of 1, and heaven
forbid you have queries in a nested loop, the performance will simply
crash and burn. All of this is kind of
ironic since you so often hear people blindly repeat the dogmatic
phrase "We will use integer keys for performance reasons..."
Rule of Thumb 2: Use Character Keys for Small Master Tables
Many database programmers use the term "master table" to mean
any table that
lists the properties of things that have some permanence, like customers,
teachers, students, school subjects, countries, timezones, items (skus),
and anything else that can be listed once and used many times in other
places. Generally a master table has more columns than a simple
reference table.
Some master tables are small and do not change often. In our ongoing
example of a school management application, the list of teachers is a good
example of a small master table. Compared to the list of students, which
is much larger and changes every year, the table of teachers at most schools
(except for huge state universities) will
have only a few changes each year.
For tables like this it is good to allow the user to enter character keys
if they want to. Some schools will insist on being allowed to choose
their own codes like
'SRUSSEL' for "Saxifrage Russel", while others will say, "Why should I have
to make up a code, can't the computer do that?"
For these tables I have found it useful to always define the primary key
as a character column, and then to allow some flexibility in how it is
generated. Common ways of generating codes include:
- Letting the user make up their own code
- Generating a code out of some other column or columns, like
first letter of first name, plus 5 letters of last name, plus
three numeric digits. (This used to be very popular in decades past).
- Generate a number.
The key idea here is to follow the needs of your users. Option #2 above
is one of the most useful because it gives you the best of both worlds.
Rule of Thumb 3: Use Integers For Large Master Tables
Some master tables are large or they change often, or both.
In our ongoing
example of a school management application, the list of
students will change every year, with many students coming and going.
Another example is a doctor's office that has patients coming and going
all of the time. I have found it best to use plain integer keys here
because:
- Unlike small master tables (like teachers) or
reference tables (like school subjects), a code is not likely to have any
meaning for the end-user, so the biggest argument for using it does
not hold.
- Unlike reference tables, the master table is likely to have many
more columns and you will probably end up JOINing to the table
many times. This means our other big reason for using codes, which
is to avoid JOINs, does not hold either.
- It is not realistic to expect end-users to be making up codes for
large tables, and since the codes will have no value, why should
the end-user be troubled with the job?
- Writing algorithms to generate unique codes will run into more
difficulties, and since the code has no value why bother?
Rule of Thumb 4: Use Integers For Transaction Tables
Many database programmers use the term "transaction table" to mean
any kind of table that records some kind of interaction
or event between master tables. In an eCommerce program
the shopping cart tables are all transaction tables, they record
the purchase of items by customers. In our school management program
the actual classes taken by students are transactions, because they
record specific interactions between students and teachers.
For these tables the auto-generated integer key tends to be the
most useful. I am not going to present any arguments for this
because most programmers find it self-evident. It should be
enough to say that any attempt to use a compound key (like
customer + date ) always ends up causing a problem by
limiting what can be entered, so the meaningless integer key
is the way to go.
Rule of Thumb 5: Use Multi-Column Keys In Cross References
A useful database will end up with a lot of cross reference tables
in it. A cross-reference table is any table that lists various
facts about how master tables relate to each other. These tables
are extremely useful for validating transactions. In fact, next
week's entry will be all about these tables and how to use them.
For now the important point is that the primary key of a
cross-reference is a combination of the foreign keys. We do not
make up an extra column, either integer or character.
TEACHER-SUBJECT CROSS REFERENCE
Teacher | Subject
------ -----+-------------
SRUSSEL | PHYSICS
SRUSSEL | CALCULUS
SRUSSEL | HISTORY
ACLAYBORNE | PHYSICS
ACLAYBORNE | CELLBIOLOGY
ACLAYBORNE | RUSSIAN
The SQL for this table would resemble something like this:
CREATE TABLE teachers_x_subjects (
teacher char(10)
,subject char(10)
,primary key (teacher,subject)
,foreign key (teacher) references teachers(teacher)
,foreign key (subject) references subjects(subject)
)
The reasons for this are rather complex, and next week
the entire entry will be devoted to this and similar
ideas. For now we will note that
this approach lets us validate teacher-class assignments
so that no
teacher is assigned to teach a class she is not qualified
for. Using a new column as a primary key does not
allow that, and therefore leads to more complicated
and error-prone code.
Rule of Thumb 6: Use Given Keys For Non-Insert Imports
Many systems today that we create will interact with systems
that already exist. A typical eCommerce program will get a
list of items and maybe even customers from the company's
main computer system.
For some of these tables, your own system will absolutely
never make new rows. A very common example is a table of
items on an eCommerce site that is loaded up from some other
computer system.
For these tables, the simplest route is to use whatever key
exists on the table as it is given to you. Any other route
involves more work with no clear motivation for putting
out the effort.
Rule of Thumb 7: Use Integer Keys for Import/Export Tables
Sometimes you may have a table whose original values come
from another system, but unlike the previous case your own
system is generating new rows for the table, and you may have
to send these rows back to the original system.
One classic example of this is a list of customers. I created
a website a few years ago where the list of customers is
updated from a different system from time-to-time. However,
new customers can also sign up online. Both systems
are handing the customer list back and forth from time to
time to keep them reconciled.
In these cases I have an integer primary key for the
table because it follows Rule of Thumb 3, it is a large
master table. The most important concept here is that
you must not try to combine your key and the key from the
original table. Keep the key from the original table in its
own column, index on it, and use it for updates, but do not
try to enforce it as a unique column. The other system must
take care of its own key, and your system must take care
of yours.
Rule of Thumb 8: Use An Object Id On All Tables
Back when people were getting excited about the concept
of "Object-Relational Databases", they came up with the
term "object id" to denote a column that contains some
unique value but otherwise has no meaning. The same idea
exists with different names, but Object ID is now
the term that most people understand so that is the term
I will use.
Your programs can be made simpler in many cases if you
add an object id to every single table in addition to the
primary key. An object id is useful specifically for
user interface code. If you use an object id, then it
is easier to write UPDATE and DELETE statements, and it is
easier to write framework or ORM code that does these
things for you.
If you are following these rules of thumb closely in your
project then it is important not to use the object id
as a primary key, and therefore you may never use it
as a foreign key either. If you use an object id as the
primary key then you lose a lot
of the benefits of the character keys listed above.
Also if you follow these rules in your projects it
means that your transaction tables have both an
auto-generated primary key like CART_ID and an
auto-generated object
id. Some programmers are bothered by
this because we don't like the idea that two columns
appear to be doing the same thing, and we try to
save a column. But personally this does not
bother me because it helps me write
robust applications, and this is not 1985 where a
10MB hard drive cost hundreds of dollars.
Absolute Rule 1: Only Atomic Values
This is not merely a "rule of thumb" but a rule that I follow
absolutely. It is actually part of
First Normal Form, which is that column values must be
atomic, or indivisible. Another way to say it is that the
column must not have "subvalues" buried in it.
I have included this rule here instead of with First Normal
Form because when most programmers violate this rule they
are making primary keys by combining different values
together. In our example of a school program, if we
have a list of the actual students taking classes in a
given school year, you might have a squashed-up primary
key column like this:
CLASS_CODE | STUDENT
--------------------------+---------------
SRUSSEL-2007-PHYSICS | NAI
SRUSSEL-2007-MATH | PCLAYBORNE
ACLAYBORNE-2007-RUSSIAN | JBOONE
ACLAYBORNE-2007-MATH | NAI
There are two practical problems with doing this:
- You cannot use a foreign key to validate the
sub-values, so you must code validation manually.
- Retrieving the sub-values requires extra code,
either in the SELECT or in your client code. If the
values were in separate columns this would not be
necessary.
Absolute Rule 2: No Magic Values
Another rule that I follow is to absolutely never have
magic values. A magic value is a value in a column that
causes some non-obvious result. I have included this is in
this essay because most programmers who break this rule
do so by hard-coding special actions to occur based on values
of keys in reference tables and master tables.
An example might be a table
of teachers, where one of the teacher values is something
like "SUBSTITUTE", and the program is hardcoded to do a lot
of different things when it sees this value. Magic values
are bad because the code is harder to debug.
It may not be obvious to a programmer that some special value
of the TEACHER column would cause special actions to occur.
But if you have a column called FLAG_SUBSTITUTE then any
programmer who must maintain code written by somebody else
will have a much easier time of it.
Magic numbers also confuse end-users. It may seem obvious
to us that the value "SUBSTITUTE" in the teacher column
means substitute, but if this value causes other things to
occur, and we are in the regular habit of having these
values in lots of tables, then the compound effect can be
lots and lots of phone calls from confused users, and big
trouble for the software developer's bottom line.
Finally, magic numbers limit you. If you use the value
"SUBSTITUTE" as a single teacher in the teachers file,
then how do you keep track of the dozen-odd substitutes
the school may hire in a year? The end-user is stuck
here, they must use pen and paper. It is much better to
allow them to enter the substitute as a regular faculty
member with a FLAG_SUBSTITUE column to check off.
Magic numbers have plagued programming since long before
databases came around. Here is a link to
The Jargon File,
which talks about magic numbers in other contexts.
Conclusion: Many Kinds of Tables, Many Kinds of Keys
This week we have seen that there can be many practical
benefits to using different kinds of keys for different
kinds of tables. Using the right kind of key for the right
kind of table leads to simpler code and better performance,
whether you code
SQL directly or use an ORM system.
Remember always that your application will always follow
the same structure as your tables. If the tables are
designed well, the code will be lean, tight, efficient, and
robust. Because table design is so important, it is best
to know well the different kinds of tables there are:
reference, master, cross-reference, and transaction, and
to build the keys wisely.
Next week we will zoom into cross references and find
out how important they are in good table design and
how they will improve your applications.
Next Essay: Cross Reference Validation Pattern