Sunday, April 19, 2009

The Relational Model

If you look at any system that was born on and for the internet, like Ruby on Rails, or the PHP language, you find an immense wealth of resources on the internet itself, in endless product web sites, blogs, and forums. But when you look for the same comprehensive information on products or ideas that matured before the web you find it is not there. Relational databases stand out as a product family that matured before the internet, and so their representation in cyberspace is very different from the newer stuff.

The Math Stuff

You may have heard relational theorists argue that the strength of relational databases comes from their solid mathematical foundations. Perhaps you have wondered, what does that mean? And why is that good?

To understand this, we have to begin with Edsger W. Dijkstra, a pioneer in the area of computer science with many accomplishments to his name. Dijkstra believed that the best way to develop a system or program was to begin with a mathematical description of the system, and then refine that system into a working program. When the program completely implemented the math, you were finished.

There is a really huge advantage to this approach. If you start out with a mathematical theory of some sort, which presumably has well known behaviors, then the working program will have all of those behaviors and, put simply, everybody will know what to expect of it.

This approach also reduces time wasted on creative efforts to work out how the program should behave. All those decisions collapse intot he simple drive to make the program mimic the math.

A Particular Bit of Math

It so happens that there is a particular body of math known as Relational Theory, which it seemd to E. F. Codd would be a very nice fit for storing business information. In his landmark 1970 paper A Relational Model of Data for Large Shared Data Banks (pdf) he sets out to show how these mathematical things called "relations" have behaviors that would be ideal for storing business models.

If we take the Dijkstra philosophy seriously, which is to build systems based on well-known mathematical theories, and we take Codd's claim that "Relations" match well to business record-keeping needs, the obvious conclusion is that we should build some kind of "Relational" datastore, and so we get the Relational Database systems of today.

So there in a nutshell is why relational theorists are so certain of the virtues of the relational model, it's behaviors are well-known, and if you can build something that matches them, you will have a very predictable system.

They are Still Talking About It

If you want to know more about the actual mathematics, check out the comp.databases.theory Usenet group, or check out Wikipedia's articles on Relational Algebra and Relational Calculus.

A Practical Downside

The downside to all of this comes whenever the mathematical model describes behaviors that are contrary to human goals or simply irrelevant to them. Examples are not hard to find.

When the web exploded in popularity, many programmers found that their greatest data storage needs centered on documents like web pages rather than collections of atomic values like a customer's discount code or credit terms. They found that relational databases were just not that good at storing documents, which only stands to reason because they were never intended to. In theory the model could be stretched, (if the programmer stretched as well), but the programmers could feel in their bones that the fit was not right, and they began searching for something new.

Another example is that of calculated values. If you have shopping cart, you probably have some field "TOTAL" somewhere that stores the final amount due for the customer. It so happens that such a thing violates relational theory, and there are some very bright theorists who will refuse all requests for assistance in getting that value to work, because you have violated their theory. This is probably the most shameful behavior that relational theorists exhibit - a complete refusal to consider extending the model to better reflect real world needs.

The Irony: There are No Relational Databases

The irony of it all is that when programmers set out to build relational systems, they ran into quite a few practical downsides and a sort of consensus was reached to break the model and create the SQL-based databases we have today. In a truly relational system a table would have quite a few more rules on it than we have in our SQL/TABLE based systems of today. But these rules must have seemed impractical or too difficult to implement, and they were scratched.

There is at least one product out there that claims to be truly relational, that is Dataphor.

The Weird Optional Part

Probably the grandest irony in the so-called relational database management systems is that any programmer can completely break the relational model by making bad table designs. If your tables are not normalized, you lose much of the benefits of the relational model, and you are completely free to make lots of non-normalized and de-normalized tables.

Conclusion

I have to admit I have always found the strength of relational databases to be their simplicy and power, and not so much their foundations (even if shaky) in mathematical theory. A modern database is very good at storing data in tabular form, and if you know how to design the tables, you've got a great foundation for a solid application. Going further, I've always found relational theorists to be unhelpful in the extreme in the edge cases where overall application needs are not fully met by the underlying mathematical model. The good news is that the products themselves have all of the power we need, so I left the relational theorists to their debates years ago.

10 comments:

Derek Neighbors said...

I am curious what you think of the new set of cloud databases.. Google's Big Table and Amazon's Cloud DB.

KenDowns said...

@Derek I'd be happy to write up something on them soon. Check back next week or week after.

Anonymous said...

"If you have shopping cart, you probably have some field "TOTAL" somewhere that stores the final amount due for the customer. It so happens that such a thing violates relational theory..."

How does it violate the Relational Model?

Andy said...

Having a "TOTAL" field in your "order" table *might* violate relational theory, but if you make it so that only a trigger can update it based on what's in your "order_item" table, then I think it's fine. You still get data integrity and that is what matters.

KenDowns said...

Andy: I use the trigger method myself. I normalize values that come from the outside, and then build the de-normalized derived values on top of them. That will pretty much take care of the problem.

But try telling a relational theorist that.

Boby Thomas said...

I have been following your serires for quite some time. You explain the fundamentatls very well. Thanks a lot and keep up the good work.

Boby Thomas
www.bobsindia.com

Anonymous said...

I still fail to see what you mean by the "calculated TOTALS field" (attribute, really) violates the Relational Model.

I'd like to see the schema you may be using as reference and an explanation of how you understand the failure.

I guess I'm pursuing this because lots of developers unfairly criticize "relational weenies" in error.

Also, lots of myths get perpetuated like "relational databases were just not that good at storing documents". Either your system needs specific info in a document, in which case it should be in the database "atomically", or you just want to store the whole doc, which you can as an attribute value with a document type.

I just want to make sure this isn't yet another case of it. It's a bit of the usual Internet case of:
http://xkcd.com/386/

(To add, "relational weenies" aren't the nicest of people at times, but that doesn't mean the other side gets to perpetuate the same mistakes! :) )

KenDowns said...

@anonymous: Having a TOTAL field is a violation of relational integrity because it is a non-key dependency. That's the official terminology.

The result of having the field (w/o the triggers Andy mentions) is what is called a DELETE ANOMALY. Picture an ORDERS table and a LINES table. Somebody has inserted 5 lines and the application sums them and writes to the TOTAL column. Now somebody deletes a line, invalidating the TOTAL. This is a DELETE ANOMALY because deleting this line ought to be a self-contained event, but in fact it causes other data elsewhere to now be incorrect (that is, the TOTAL value is now wrong).

This requirement that action in one table requires action in another is not supposed to happen if your tables are fully normalized.

Personally, I always put that TOTAL column in there, and as Andy says, I use triggers to update it so that the anomaly does not happen. This gives me the benefits of normalization and the benefits of automation in one stroke.

Jorge Monasterio said...

This was a great blog. Really good DB info. You should resume it.

lawtonterri said...

Nice post. Here’s a tutorial that shows how you can easily build an online database-driven web application with a parent-child table relationship, without codinghttp://blog.caspio.com/web-database/creating-one-to-many-relational-datapages/