Monday, December 31, 2007

Database Skills: Ringing in 2008

Happy New Year!

To celebrate the New Year, I am going to take a break from our normal schedule. I want to take advantage of this chance to review some smaller issues that otherwise might not come up for weeks or months.

A complete table of contents for this blog series is here. Next week we will resume with Third normal form and after that start talking about more complex table design issues.

What About ORM and ActiveRecord?

The term "ORM" means Object Relational Mapping and it basically means any system that tries to turn all data manipulation into object manipulation. One very popular ORM system is ActiveRecord, the ORM system in the popular Ruby on Rails framework.

All ORM efforts begin by recognizing that code and data are very different and operate on very different rules. This blog also begins by stating that code and data are very different and operate on very different principles. So far, so good!

The ORM crowd believes that having two different natures is an organic flaw. They borrow the term "impedance mismatch" from electronics to portray the situation as broken from the start. The solution, for the ORM crowd, is to find a way to treat data as objects in object-oriented code. So instead of issuing a query "Select .... from customers" you instantiate a customer object, set some properties, and then invoke object methods to retrieve data, change it, validate it, and send it back to the server.

On the other hand, the purpose of this blog is to explain how to use a database effectively, and that begins with understanding the database on its own terms. Because ORM tries to avoid the true nature of the database, we have to lay it aside for now and consider returning to it later, when we have mastered the database and may have some use for it.

Database Abstraction

Sometimes people use the term "ORM" to mean something much simpler, which is just database abstraction. For instance, in this blog you will often see my examples use a function "SQL_AllRows()", like here:

$rows = SQL_AllRows(
    "Select ssn,name_first,name_last from contacts"
);

This function does not exist in any library except my own (that I know of anyway, though it is part of the Andromeda Project.)

This function is a simple shorthand that issues several commands to the back end and gives me an array of associate arrays. The key idea is that with functions like this you can avoid using functions that are specific to only one server (MySQL or Postgres for instance), and can combine several annoying calls into one.

SQL Abstraction, like all of the simple things in life, is a Good Thing.

What is A Candidate Key?

So far in this series I have not mentioned Candidate Keys. To make matters worse, I have defined normalization in terms of Primary Keys, while really it is defined in terms of Candidate Keys. I did this for the simple reason that I find people can learn normalization better before learning about candidate keys, and that teaching candidate keys later tends to work better.

In short, a Candidate Key is any column or columns whose values must be unique. You can have more than one Candidate Key. For example, a table of teachers may be unique on Social Security Number, so that SSN is a candidate key, but it is also unique on First_name + Last_name, so the name makes another Candidate Key.

We will see a lot more about this when we get into the nitty gritty of choosing keys.

Should I Worry That SQL is not Relational?

Short answer: no. In fact, I am not even going to spend time explaining why SQL is not fully relational.

When a programmer first decides to master database ideas, he or she will ask a lot of questions, and get a lot of different answers. You will hear on newsgroups, email lists and blogs that the problems you are having are because SQL is not truly relational.

There are two important points here. The first point is that the world has somehow gotten along without this "vital" requirement for about 25 years now, which makes people like me wonder just how vital it is. The second point is that if you want to somehow make your database relational, you will have to brush up on your C or C++, read several volumes on database theory, and then start hacking the mySQL or Postgresql code to accomplish something that nobody else has found practical or necessary.

In short, we use databases that store data in tables. We use SQL to create those tables and manipulate that data. The design of these systems was originally inspired by the Relational Model as described by E.F. Codd, but when it came time to make real systems that worked, certain aspects of the relational got in the way, and they were dropped. Nobody (except a very few) have ever looked back.

So why are they called relational? Because once upon a time everybody decided that relational databases were the Next Big Thing. Then, as now, many IT directors would follow trends they did not fully understand. The salesmen and marketeers knew this, and so all table-based data handling systems were called Relational. This way the IT directors could tell their bosses they were moving systems onto relational databases, thereby protecting themselves by being part of the trend.

Should I Denormalize?

Short answer: yes.

Almost everybody does, and so do I. This blog will contain some advice about how to denormalize systematically and rationally. However, keep this in mind: you can only denormalize what has been normalized properly to begin with. That is why you must learn about normalization if you want to master table design.

Should I use Natural or Surrogate Keys?

If you want to see a good fight, walk into polite company and ask this question. In two weeks this blog will have a complete entry devoted solely to choosing and defining keys.

What about Views? Calculated Values?

Calculated values are a fact of life for any real-world programmer, but it turns out that relational theory does not allow them. This is of course a big problem, and so programmers have always happily stuffed calculated values into tables. There are many ways to deal with calculated values and I will spend at least two entries on it and possibly more.

Should I put Business Logic in The Server?

This is a very big question, and needs a lot of time to answer. I happen to believe the answer is yes, for reasons of both performance and integrity. However, most programmers do not believe this, and it is a very hard thing to do because coding stored procedures and triggers is very different from coding Java or PHP or Javascript and requires not just knowledge of the language but detailed knowledge of when to do things and why to do things one way and not another. Later in the series we will see triggers and stored procedures and talk about putting business logic into the server.

Conclusion: Happy New Year!

Databases are wonderful things, and they can do far more for you than most programmers know or even suspect. We have not even touched on topics like security and automation, but now it is time to take a little time off and enjoy the New Year's Holiday. Keep safe and see you next week!

Next Essay: Choosing Primary Keys

4 comments:

Jake said...

Really looking forward to the Natural or Surrogate Keys discussion, I can't remember the amount of time I've wasted at the start of every new project debating which I should use. Hopefully you'll shed some much needed light on the subject.

dportas said...

I'm not sure what you are referring to when you say that "certain aspects of the relational [model] got in the way, and they were dropped". I think the problems with SQL weren't widely understood until well after its adoption by vendors like Oracle. The informed and considered view of Codd and others seems to be that the designers of SQL failed to implement RM just because they didn't understand it properly, not for any sound reasons of design or because something "got in the way". Unfortunately the SQL ideas stuck. Practically every SQL developer has had cause to regret this at one time or other. For instance having to fix duplicate data in SQL or work around incomplete constraint support or write complex code to perform the equivalents of assignment or comparison operations.

I would also question the statement that "nobody else has found [the relational model] practical or necessary". Firstly, it has several practical implementations and is available in commercial products from companies including IBM and Dataphor for example. Secondly, software products have evolved which fill the gaps left by SQL's shortcomings (especially in the OLAP market for example). Since those products have earned multi-million dollar revenues it is at least arguable that DB customers might have been able to save those millions if they'd had access to relational products instead of SQL ones. If that's not a practical or necessary reason then I don't know what is :)

Should you worry? Yes. Database vendors have been selling you water bottled as wine. The best way to change that state of affairs is to educate people about what they are missing.

dportas[AT]acm.org

KenDowns said...

Dportas:

When I say, "certain aspects..." I mean first the "bags not sets" decision. All big systems out there implement tables as bags and not sets.

Thanks for the information on other products, I'm aware of Dataphor but I might wonder how many others are. You, me, and others who care to pursue such things, but not the typical programmer.

I stand by my statement that the practical programmer need not be worried. Pure RM has strengths but quasi-RM as implemented (read: SQL) is plenty Good Enough for most of us.

John "Z-Bo" Zabroski said...

Some of the products that fill in the gaps left by SQL come down to the interesting assumptions SQL makes. Many of the original designers of SQL originally came from the Bachman school of thought, and originally supported Navigational Style prior to becoming fascinated by the brevity of queries Codd would write.

The one big assumption SQL made is that it assumes you know EXACTLY what the TARGET of the query is. Obviously, this reflects the fact SQL's designers could not completely divorce themselves from the Navigational Style thought process of (roughly): "Hmm, let's see...I need to find a particular thing... so I need to a column name, and I already know what the type of the column I'll get back is going to be."

However, as Ken already alluded to, how many people who use MySQL actually need fuzzy querying capability instead of the crisp querying capability SQL provides?