Sunday, January 20, 2008

Table Design Patterns: Cross-Reference Validation

Welcome to the Database Programmer. This is a blog for anybody who wants to learn about databases. I try to make it simple and easy to read without dumbing it down. Most programmers these days work on web sites of one sort or another, and since all non-trivial websites require a database, there is very good reason to learn how they work.

There is a new entry every Monday morning, and the complete table of contents is here. The overall theme right now is that good table design leads to tight and efficient code. Last week's entry gave a collection of guiding rules for crafting your primary keys, and now we can turn to the concept of Design Patterns in Table Design. Table Design Patterns are recurring patterns in tables and the relationships between them. If you learn what they are, you can learn to recognize these patterns in the requirements that users give you.

Most database programmers are comfortable with the idea of Table Design Patterns, and have their own rules they have worked out through experience. The patterns I present here are well known amongst database programmers, my only contribution is to explain them and to try over time to put them all in one place.

The Example: A School Class Schedule

In this series we are using the example of an application that manages a school of a few dozen faculty and few hundred or perhaps a couple thousand students. Each year the school administration must make up the actual class assignments for each teacher, including what classroom and period each class will be taught in. Then students must be assigned into the classes.

There are five rules that must be followed:

  1. A teacher must be qualified in advance for each course they teach.
  2. No two classes can be given in the same classroom in the same period.
  3. No teacher can be in two places at once, a teacher can only teach one class in one period.
  4. No student can be in two places at once, so no student can be in more than one class in the same period.
  5. A student cannot take the same class again after passing the class once.

All five rules can be handled with some smart table design and we will not need any application code.

Next week we will look at rules 2 - 4, and the week after that we will tackle rule 5. This week we look at Rule 1.

Rule One and The Cross Reference Validation Pattern

When a user gives us their requirements, they will not express them as as computer programs or table designs. The user will express his needs in his own terms, expecting us to translate those terms into tables and programs. So our job is to translate Rule 1 into one or more tables with proper primary and foreign keys.

When we look at rule one we see that it splits into two requirements:

  • Subrule A: There must exist of a list of what course a teacher is qualified to teach.
  • Subrule B: All actual course assignments must match to the list of allowed (or qualified) classes.

When I see these two requirements together I automatically think "validate against a cross reference." I call this the "Cross Reference Validation" pattern. It is different from a simple foreign key validation. Let's go through it and see why.

We can start start at subrule B, because it is easy to recognize. Any time some entry in a table must match an entry in another table, that means a foreign key.

Now we want to work out what the parent table looks like. Is it a simple master table with a one-column character key? Is it a transaction table? We answer this buy drilling deeper into Subrule A:

  • Assumption 1: There is a list of teachers
  • Assumption 2: There is a list of courses
  • Assumption 3: There will be a list of teachers and courses together.

So the parent table must be a cross reference because each entry will list a teacher and a course. It should go without saying that we will have a table of tgeachers and another table of courses, so the table of qualifications must be a cross reference between these two.

Here is a picture of the Cross Reference Validation pattern:

  
CLASSROOM |  PERIOD    | COURSE   | TEACHER  | STUDENT  | ASSIGN_ID
----------+------------+----------+----------+----------+-----------   
  XXX     |   XXXX     |   XX     |  XXX     |   XXX    |   1
  XXX     |   XXXX     |   XX     |  XXX     |   XXX    |   2  
  XXX     |   XXXX     |   XX     |  XXX     |   XXX    |   3 
  XXX     |   XXXX     |   XX     |  XXX     |   XXX    |   4  
----------+------------+----------+----------+----------+-----------
                            |          |
                            |          |
       Use A foreign        |          |
       key to make sure     |          |
       teacher is qualified |          |
       for each course      |          |
                         COURSE     | TEACHER
                         -----------+--------
                           XXX      | XXXX
                           XXX      | XXXX

Here is the SQL to create these tables. I have left out anything not directly related to our Cross Reference Validation pattern:

-- Create the bottom table: allowed courses by teacher
CREATE TABLE courses_x_teachers (
   teacher char(10) 
  ,course  char(10)
  ,foreign key (teacher) references teachers(teacher)
  ,foreign key (courses) references courses(course)
  ,primary key (teacher,course)
);

-- Create the main table (assumes we have already created
-- teh TEACHERS table, the CLASSROOMS table and so forth
CREATE TABLE enrollment (
   classroom char(5)
  ,period    char(5)
  ,course    char(10)
  ,teacher   char(10)
  ,student   int 
  -- this is a trx table, so use an integer ID for pk
  ,assign_id int IDENTITY
  ,primary key (assign_id)
  -- The cross reference validation pattern needs a foreign key
  ,foreign key                    (teacher,course) 
   references courses_x_teachers  (teacher,course)

Another Comment On Integer Keys

In last week's essay we saw that the rule of thumb for cross-references is to use a multi-column primary key and not to use an integer key. It should be obvious now why we do that. If we used an integer key, then it could not help us!

The tables as described above completely satisfy the business requirement: no entry is allowed if the teacher is not approved for that course. But an integer key provides no such value, it is useless. If you use an integer key for the cross-reference table, then you are required to write extra program code to "chase" the key out to the cross-reference and make sure the values for teacher and course match what you have in the child table.

Foreign Keys and Performance

One time I was giving a presentation and somebody in the audience said they had been told not to use foreign keys because they reduced performance. To me this was like saying don't put gas in your car because it weighs down the car and reduces mileage. The question was such a shock and so unexpected that I had no decent answer.

Since that time I have discovered that this kind of nonsense is actually prevelant these days. Such advice may make sense in cases where the data structure is trivial (at most 7-10 tables) and the presentation is paramount, but in any serious application such advice is ridiculous. Here is the real scoop.

Somehow, some way, your code must ensure that every class assignment made in the above example is valid. This means that the check for teacher-course validity must be made. If you do not perform this check then your software is structurally unsound and you will have bad data, with no way to stop it and you will always be patching live systems in crisis mode.

So because you must make the checks, we can ask, which is better, to make the checks in program code or using the declarative constraints in the table definition? This is easy to answer. Without the foreign key, our program must make one round trip to the server to make the check, and then a second round trip to do the insert. But if you use foreign keys you only have to make one round trip. Since the overhead of making a trip to the server is often more than the time spent executing on the server, the foreign key solution will often perform twice as well as the application code solution.

In short, if you're worried about performance, use foreign keys, not application code.

Architecture Note: Server-Side Errors

Many programmers are taught not to use foreign keys and so they are not used to the idea that the database server will throw errors. Once you start using the database for all that it is worth, you will depend more and more on the errors it throws, so you want to make sure your framework can read them and report them to the user the same way it reports your application-generated errors.

Conclusion: Learn to Recognize Foreign Keys

User requirements will never be expressed as program code or table design, but we can recognize common patterns in them. One of those patterns is the Cross Reference Validation pattern, which we implement with a foreign key into a cross reference table. This and other patterns will stand out if we examine user requirements with an aim to identifying:

  • Lists of things you are keeping track of. These go into master tables, like courses and teachers.
  • Relationships between those master items, like a list of teacher-course qualifications.
  • Restrictions on how things can interact, so that a teacher must be qualified to teach courses means there will be a foreign key somehow into that teacher-course cross reference from some other table.

Next week we will examine Rules 2-4 and find out more about how unique constraints and their associated patterns can reduce the amount of code in our applications.

Next Essay: Limited Transaction Pattern

10 comments:

Anonymous said...

So what happens when a classroom is renamed or demolished? A period changes? A course is renamed? A teacher marries and changes surname?

A whole bunch of error prone manual update statements that could be avoided using integer primary keys?

KenDowns said...

Anonymous: I can tell you what happens if you use integer keys throughout: you take twice as long to write an app that performs half as well and is more error prone. All in the name of saving yourself from edge cases.

That being said, the edge cases must still be addressed:

1) A discontinued classroom (or terminated teacher, or discontinued course) is an example of a parent row you cannot use anymore. The kind of key you picked does not affect how you enforce that. There are different ways to answer this which will come up in a later entry.

2) Renaming courses or teachers does not require their codes to change, any more than it would require a meaningless and useless integer key to change.

3) I don't know what you mean by a changing period.

In short, these are edge cases that are distracting but not informative, and whose solutions are not affected by the kind of keys you choose. They should not play into the definitions of keys.

Fernando Ronci said...

Hello,

I'm following with great interest your weekly installments on DataBase design, but, as is always the case, every time I have to apply the concepts in real-world scenarios I'm clueless as to where to start. It seems I'll be an eternal beginner in the vast ocean of software development...but I digress. Case in point, could you please give me a hint as to how to tackle the following scenario?

A company manufactures 150 distinct products. Now, in the spirit of spelling out the data, I think this merits a table (e.g. Products) listing the 150 items, with Columns such as:
- ProductID (PK)
- Description
- Width
- Height
- Weight
- Amount of ingredient A
- Amount of ingredient B
- Number of molds

As you can infer, once defined, this table remains immutable.
Then, the objective is to keep daily records of the fabricated products, as well as sales. For this, I have thought of adding two more tables: One for keeping track of the manufactured products and the other one for keeping track of the sales. Keep in mind that not all 150 products are fabricated in every working day. As a reference, an average of 20 distinct products are fabricated in a single day, and it's in the varying nature of daily production where I'm stuck. I don't have the slightest idea on how to spell out these tables for keeping a journal of IN's (production) and OUT's (sales). For example, if I add one record per day in the "ManufacturedProducts" table, then I'll have something like 300 records in the first year (discounting Sundays and public holidays), 600 records in the second year...and so on. On the other hand, due to the granularity of sales, the "Sales" table (or whatever name it will ultimately be given) will have several records per day, each representing sort of a "log entry" that's added every time a product or group of products is removed from the inventory. Now, I don't think this scales well and I'm also not sure if it makes sense at all. For example, in the 5th year, everytime I ask the RDBMS for the # of units of any given product currently in the inventory, will it have to scan thousands of records in the tables and do the corresponding math to return the integer value I'm interested in? It seems pretty expensive to me. There must be a more efficient way of doing it. Obviously my reasoning must be (very) flawed.
I'm not implying that this case of simple inventory tracking is a difficult one, but it just happens that I don't (yet) have the necessary skills to do a proper modelling of the data.

In a few words, could you please tell me what the DB design pattern is for a simple case of inventory tracking like this one?
Thank you for reading and thank you for your help in advance.

Fernando

KenDowns said...

Fernando, that is a lot of questions, but I'll see what I can do, and hopefully some other readers will jump in as well.

Item 1: The "Amount of Ingredient A" and "Amount of Ingredient B" look like a violation of first normal form. The ingredients should be out in a separate table.

Item 2: The "Number of Molds" may also be a calculated value that does not belong in the table. Or at very least you may need a list of molds per item.

Item 3: "As you can infer, once defined, this table remains immutable." All tables are mutable forever. Never be afraid to add or remove columns as your understanding of the situation increases.

Item 4: "Then, the objective is to keep daily records of the fabricated products, as well as sales. For this, I have thought of adding two more tables" This is the right idea. You will do very well to put a row in a table every time somebody sneezes. Think of these highly detailed tables as the foundation upon which you can build anything. But if you leave out those details then when somebody asks for them you have say, "oh, we're not recording that, sorry." As for row counts, you will have no worries with the kind of row counts you are talking about.

Item 5: "everytime I ask the RDBMS for the # of units of any given product currently in the inventory, will it have to scan thousands of records...." This is where most people denormalize, including myself. You can have a column in the items table for "on hand inventory" which is the sum of all manufactures less all sales. Your program must increment and decrement the on-hand inventory whenever items go into the journal. The nice thing is that you can recover from bugs by recalculating from detail. I use database triggers myself to avoid mistakes in application code.

Fernando it sounds like you have the right instincts, don't be afraid to follow them. If you are afraid that 10,000 rows will take too long (and I can assure that is nothing), then write a program that inserts 10,000 rows into a table and then test the query time to get a SUM out of that table. You are on the right track, keep it up and report back!

Fernando Ronci said...

Thanks Ken,

It seems it's not as simple as I'd originally thought. But on the other hand it wouldn't be so difficult for someone moderately versed in DB design.
I'm going to re-read your whole series of weekly posts to refresh some concepts.
I also believe that (the inevitable) ongoing changes to the DB will break other layers and they will have to be regenerated.

Anyway, I'll give it a try.

Thanks again,
Fernando

Donald J Organ IV said...

Fernando,

I would suggest looking into using Andromeda
as what you are trying to do is something extremely similar to a basic form of the current project that I am working on.

Andromeda work off all of the information Ken shares in his blog, and he is the original creator of Andromeda. So I would really suggest taking a look at it.

Anonymous said...

I know this is a very old post but I felt I needed to comment on it.
You state that Foreign Keys should be used and they don't hurt performance.

This might be true if you exclude scalability from the "performance" factor.

What if instead of a database of students I have a database I don't know, with all the citizens in a country(Let's say, China) plus their 'tax-payment information'?

A single database wouldn't be enough in this case, the usual solution is to partition the database ... anything ringing yet?
Do you get why GoogleAppEngine's database doesn't allow relations in their database? Besides, CPU consumption when joining tables, goes really high and such...

KenDowns said...

@anonymous: you are dramatically mixing contexts. I spent a couple of years creating a marketing database system, we used no foreign keys. While it is not stated on every single post, it is often stated on the blog that it is crucial to know your context. This blog is almost entirely about business applications. Anybody with two brain cells firing uses foreign keys in biz apps, unless they began programming after about 2001 and confuse the web with programming.

Steffen said...

Hi Ken,

first of all:
- your blog is amazing - a fascinating source for inspiration. I just can't stop reading.
- I know, this article is an old one - however it is one that reflects your passion for character keys where it makes sense very much. So please forgive me, if I roll up old stuff.

Your arguments for having character keys rather than number keys for refence data I understood as following:
1. codes are understandable to an extends which prevents joins (increased performance)
2. higher storage requirements for characters can be seen as minor issue as storage is cheaper than in past

I wonder why the 'ressource is cheaper' argument does not play the INT key into hand as well. Regarding the mysqlperformanceblog you just have to plug in enough RAM into your system to allow >80% of your data being sql page cached there (I guess you know what I mean - if other readers do not: it is some sort of advanced caching). RAM is cheap and joins (especially with the right indeces in place) become extremly fast. I myself have implemented joins cross 10 and more tables (with at least one of them having more then a mio rows), that came up with results in a few milli seconds (the >80% in RAM applied there). I hear people screaming that there are occassions where you cannot have >80% in RAM. I know that - however these cases are very rare. In clouds you can plug 100 GB together with no worries and 100 GB allow some mio rows, which covers lots of use cases.

The other thing I try to get sorted directly relates to the The Cross Reference Validation Pattern:
I understood from your article that cross reference uniqueness check can be done best be achieved with character keys because INT keys are not self explaining. The implied risk is, that there are duplicated names in the referenced tables. I am not sure how to address this via character codes. If you have a plant table with a plant 'broad been' and another one called 'fava been' - your character code may be 'brobeen' and 'favbeen'. It both refers to the same thing, which is thick beens, your character code will not be able to solve the risk of duplication. However if you have a mechanism to clean duplicates from the referenced tables you can also do the cross reference validation based on INT keys. Or did I miss something?

There is also something related to this: While the time is running maturity of frameworks is increasing. Frameworks have to deal with the 'no duplicates in cross references'-on-int-key issue anyway, because they are not only dealing with reference data. But if the logic to do it is implemented in frameworks anyway and as frameworks become more popular what is the benefit of down cutting Cross Reference Validation to character keys?

Please do not get me wrong here. I am not an INT key fighter. I see some benefits in using character code key. I just like to understand how different (strong) concepts fit together.

Based on this would you now (5 years later) still implement character keys for reference data? Has the reasoning changed behind it?

Cheers,
Steffen

aparna john said...

Hi,Nevertheless web design, search engine optimization and copywriting is a really customized area. Design isn't really just what you see, it's also exactly what you think and feel as you browse a Web site Sure for Web Design Cochin, discovering HTML is an important part of creating an effective web site.Thanks...