Sunday, February 3, 2008

False Patterns Such as The Reverse Foreign Key

Welcome to the Database Programmer. This is a regular blog for anybody who wants practical information about how databases work and how your database and application code work together.

There is a new entry every Monday morning, and the complete table of contents is here. We are currently looking at Table Design Patterns and how they lead to tight and efficient code.

The Example: A School Class Schedule

In the past two weeks we have looked at the table of courses that teachers are teaching and students are enrolled in for a year. The five rules are:

  1. A teacher may not teach a class he/she is not qualified to teach.
  2. No two classes can be given in the same classroom in the same period.
  3. No teacher can be in two places at once, a teacher can only teach one class in one period.
  4. No student can be in two places at once, so no student can be in more than one class in the same period.
  5. A student cannot take the same class again after passing the class once.

Two weeks ago we looked at rule 1 and recognized a Cross Reference Validation Pattern, and last week we looked at rules 2 and 3 and saw a Limited Transactions pattern. This week we will look at rule 5 and see the Primary Key In Disguise, and next week we will wrap up this example by looking at how to do rule 4.

First Attempt: Restating Rule 5 In Terms of Tables

There are plenty of times when a customer will give you a requirement that sounds something like, "If X happens, Y may not happen." Rule 5 sounds like this kind of rule:

"If X Happens..."    | "...then Y may not happen"
If a Student passes  | the student may not
a course             | take the course again

The first thing we want to do is translate this into rules about entries in tables. Our first attempt might sound something like "An entry in the completed courses table prohibits an entry in this year's schedule." It might look like this:

HISTORY TABLE                                    THIS YEAR'S CLASSES

Student | Course                                 Student  |  Course        
--------+---------                              ----------+----------
Nirgal  | History       <-----------             Nirgal   |  Calculus
Jackie  | History     A "reverse" foreign        Nirgal   |  Geometry  
Nirgal  | Physics     key says entries in
Jackie  | Physics     history prevent entries
                      in current enrollment

The problem with this statement is that there is no way to enforce such an idea. It is not a foreign key, it is more like a reverse foreign key. A foreign key says a a parent row must exist but we are saying a parent row must not exist. If there were such a thing as a reverse foreign key we would be ready to move on to the next task. But since there is no such thing as a reverse foreign key, we must keep looking.

The Customer Does Not Design Tables

So far we have a formulation "An entry in the completed courses table prohibits an entry in this year's schedule." But here I have deliberately made a common mistake to demonstrate how programmers sometimes deal with the requirements given to us by end users. I have taken the rule and turned it into a statement about tables without first doing any real thinking. This is like translating a sentence word-for-word from one language to another. When you transliterate the user's requirements this way instead of translating, the result is exactly the same as for human languages: you get nonsense.

So we must now remember that the customer is not in the business of table design. The customer will explain their needs as best they can, but we should not expect the customers' statements to translate directly into table definitions.

So if we look at this rule again we realize that when one thing excludes another thing we may be looking at a primary key or unique constraint. If the student's completed courses and the student's current courses are stored in the same table then the problem is solved. We make a primary key or unique constraint on student + course and that is the end of that.

Relational Does Not Meet My Needs!

Nowadays (February 2008) it is pretty common to hear programmers who are relatively new to the database world say: "Relational just cannot do what I need, my customers have complex needs that don't fit into relational concepts." If that programmer is handling text, like books, or media files then they may be right. But if not, that programmer probably:

  • ...has been taking the customer's requests at face value instead of translating them into solid concepts.
  • not aware of dead-end patterns like the "Reverse Foreign Key" and how to recognize the valid patterns that are hiding behind them.

Now that we know we probably have a primary key, and not some kind of weird reverse foreign key, it is time to design the table.

A Primary Key In Disguise

Putting them into the same table is not that big a deal. It is nothing more than a list of courses the student has taken or is currently taking. The table would look something like this:

The students_x_courses Table

  Primary Key (or perhaps an int primary
       |       key and this is a 
       |       unique constraint)
  |         |
Student | Course   | Year  | Flag_hist |  Grade
 23     | HIST-302 | 2005  |  N        | 
 23     | PHYS-101 | 2003  |  Y        |  92     
 23     | CHEM-211 | 2004  |  Y        |  96       

Deletions Required

We should note that if a student fails a course, the approach we are taking requires that the row be deleted outright, otherwise they cannot take it again. All database programmers share a deep uneasiness about deleting data, it makes us nervous. There is good reason for this. As soon as you delete the failed classes somebody will ask for some statistics on how often students fail courses, which you cannot tell them because you deleted the data! Nobody wants to be having that conversation!

Therefore we should assume that we will have a separate table just to store a record of course failures. We will copy failure records to that table before we delete them.

Putting Current Data and Historical Data Together

There are plenty of strange ideas out there, and sometimes an idea from ages past will persist long after anybody remembers where the idea even came from. One such idea is that you should not mix together historical and current data.

This idea came from long ago when hard drives where extremely expensive, when many data files were stored on tapes or cartridges and never got near a hard drive. The idea was to separate your live and historical data because the bulk of your operations required one or the other but rarely both, and you could not afford to keep them both available at the same time. In really old fashioned batch operations something we take for granted like a simple query would be done by loading tape after tape onto a refrigerator-sized tape machine that executed some query program. If you kept history and live data together you might have to load 30 tapes, but if you kept them separate you would only have to load 3 tapes to pass the live data. That basic practice has stayed with us ever since in the form of prejudices and vague advice about "not mixing current and history."

But nowadays we do not need to worry too much about hard drives, there is no validity to most of the reasons for keeping this data separate.

Conclusion: Translation, Not Transliteration

This week we saw what happens when we take user requirements at face value. They lead us into dead-end design patterns, that is, patterns that are not valid and cannot be implemented. We saw how important it is to take the users' statements and seek out the "disguised" primary keys, foreign keys and so forth. This is the sad result of transliteration, the direct conversion of user statements into table designs.

The correct process is translation, looking at the user statements with a critical eye and seeking out the primary keys, foreign keys and table definitions that are lurking there. Those programmers who do not learn this skill will be led into the false belief that they are somehow dealing with problems that no human being has ever seen before, and that they need some post-relational or extra-relational or razmataz-relational system that will provide a unique, strange and clever solution to their troubles. The truth is usually much less interesting and not nearly so ego-inflating. The truth is that most data does fit into tables pretty neatly, and that most rules (but not all) can be expressed as unique constraints and foreign keys.

Next week we will wrap up this example by looking at rule 4, that no student can be in two places at once.

Next Week: The Framework and The Database


John "Z-Bo" Zabroski said...

Are you going to bring up "one fact, one place, one time"? This is the phrase that draws a hard line between the object-oriented model and the relational model.

You've mentioned hidden requirements, yet these hidden requirements can be discussed not just in terms of how obvious they are. Why not discuss hidden requirements by explaining "one fact, one place, one time"? You've alluded to how vulnerable an object model is to violating this constraint, but aren't addressing it directly.

John "Z-Bo" Zabroski said...

Also, it might be more mainstream to say "Table Design Anti-Patterns". At least, it would make sense to tag this post with that label. Right now, you haven't tagged this post with a "False Patterns" label, either.

Your remarks about programmers not understanding how to translate the user's requirements is interesting. However, so far it does not cover an argument I hear being made by proponents of XML: "Smashing the Enterprise Data Model". Many programmers seem to think the relational model requires a global enterprise data model in order to ensure integrity and consistency of the data set. As a result, many see the data set as difficult to put into a central respository like a RDBMS.

Tangentially, it seems many programmers do not know: how to ensure there is a single, authoritative schema; how to version databases; how to use source control to do development testing on a test database and manage updates to the schema; and how to ensure everyone knows what artifact represents the single, authoritative schema.

John "Z-Bo" Zabroski said...


I just noticed that you added support for Update Scripts to Andromeda when I went to pull the latest sources off SF.

KenDowns said...

John: the update scripts were conceived and put in within the space of a few days, definitely a record for Andromeda. They will prove very powerful for programmers after they have made a few iterations through their projects and released them into the wild.

Brian said...

Deleting failed courses and moving them to another table really doesn't sit well with me intuitively, but as I'm not a database master by any stretch I'm not sure what a better solution would be. Would it be possible to rig something up with a 'Passed' boolean column and a check constraint, so that we can keep failed courses in the same table while still enforcing this business rule?

Badger said...

A lot of good ideas in these posts.

I have a question. I would like to see how you would implement a more complex rule using the methods you have been describing, or if you would use code to accomplish it. Here is the new request that comes back from the customer:

"We have a lot of students repeating and failing courses, we would like that a student can only take a course 3 times regardless of whether they pass or fail on the third attempt"

How would you keep the data integrity here?

KenDowns said...

Badger, there are two ways to enforce a numerical limit.

If your server supports constraints on views, you can make a view that holds a COUNT on student+course and then constrain the value to be less than 4 ( or <= 3).

In my own framework we directly materialize these summaries in tables, and then we put the constraint on a table instead of a view.

Anonymous said...

This idea of deleting failed subjects sounds horrible to me :S

I hope no University is doing something like that.

I still don't get how you would link the enrollment table to the "failed subjects" so you can check whether the user is still allowed to enroll.

Andrey said...

Great article and beautiful blog

Again, deletion after completion is a little scary

Isn't it better to extend unique key to (student, course, year). Its just asks to be done :)

J-Man said...

I'm with everyone else that 'deleting' a record to a new table doesn't seem like the best solution. A query to find all the courses a student has taken would require checking two different tables (two queries). Also, if we're trying to use database-only methods, rather than application code, then the database would need to support something like triggers for that to work. Otherwise, the delete-and-move would have to be done manually or with application code (again, two steps: copy(insert) then delete).

It seems like a bad idea to use table location to give meaning to the data, rather than just storing the data; it reminds me of the magic values Ken talks about. Rather than just saying "Failed" or "Passed", we must interpret the data being in a particular table as failure. Also, simply reading the not-failed table doesn't tell us if a student has actually passed the course or is currently taking it (in other words, how do we represent a pending score?). And what if the school decides one day to change to a system based on actual scores vs a simple pass/fail? (Maybe a student with a C or a D in a class is given the option to retake, but an F is required to retake.) The two table system is set up to only handle the one situation; it's not very flexible. (I don't see why a real-world school would not want to store the actual scores anyway. Hidden requirement, anyone?)

BTW, Andrey, it isn't going to be as simple as a unique (student, course, year) key. That would keep a student from taking the same course more than once in a year, but doesn't look at all at the pass/fail status. In fact, if the student were to fail a course in the spring semester, this restriction would prevent them from retaking the course in the fall!

J-Man said...

Forgot to mention: thanks for the great series of articles, Ken! Great points and great discussion starters!

creatine said...

I could not find any other answer, because googling this is very hard: "reverse relationship" or "reverse foreign key" is all littered with unrelated django.core.urlresolvers.reverse references, and I have no idea how to name "*_set" thing to Google. A hint on how to google this one would be helpful as well.

Stew Ashton said...

The rule "a student cannot take a class he has already passed" is an example of what Tom Kyte calls "selective uniqueness". There is an Oracle-specific solution that uses a unique function-based index: the function returns student + class when the grade is passing or null, and null when the student failed. Since Oracle does not index null entries, the uniqueness is only enforced for classes the student did not fail.

Starting with Oracle 11g, one could use a virtual column and a unique constraint, which would be even clearer.