Monday, June 30, 2008

Database Performance: The Web Layer

A database application is like a convoy of ships: it is only as fast as the slowest ship. The three "ships" in a web-based database application are the database itself, the web layer, and the browser. Today we continue our series on performance by examining how the web layer can efficiently retrieve data from the database.

Welcome to the Database Programmer blog. This blog is for anybody who wants to see practical examples of how databases work and how to create lean and efficient database applications. There is a Complete Table Of Contents that is updated each week, and a Master list of table design patterns that is updated whenever a new design pattern is presented.

Cost 1: Round Trips

The first basic cost of retrieving data is the "round trip". Database programmers speak of a "round trip" as occurring whenever you send a request to the server and retrieve some results. Each round trip to the server carries some overhead, as the server must do some basic work to allocate and release resources at the start and end of the request. This overhead is added to the base cost the server must pay to actually go out to disk to find and retrieve your data.

If your application makes more round trips than are necessary, then the program will be slower than it could be.

Cost 2: Retrieval Size

Every byte that the application retrieves from the server carries a cost at several points. The server must go to disk and read it, the wire must carry the load from the db server to the web server, and the web server must hold the result in memory. If your web code regularly retrieves more information than it needs, then the program will be slower than it could be.

This is why you will see advice that you should never use "SELECT *..." in direct queries: it is nearly certain that you are retrieving data you will not use. The better query names the exact columns you need so that you get maximum efficiency. This is especially important if your table contains text (aka clob) fields; if you use "SELECT *..." on one of those tables you risk pulling all kinds of data over the wire that is just going to be thrown away.
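
As a quick illustration (the column names here are made up), compare:

# Retrieves every column, including any large text/clob columns
$sq = "SELECT * FROM items WHERE sku = '$sku'";

# Retrieves exactly what the page will display
$sq = "SELECT sku,description,price FROM items WHERE sku = '$sku'";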

Example 1: One Round Trip By Using JOIN

Consider the basic case where you are retrieving and displaying the line items from an order (or shopping cart as people call it these days). Let us assume that you have an ORDER_LINES table that contains SKU, QTY, and PRICE among others. The item's description is in the ITEMS table. To display the lines, you must retrieve each line and also retrieve the item's description.
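
The tempting but slow way is to go row by row: fetch the order lines, then make a separate trip to the ITEMS table for each line's description. A rough sketch of that anti-pattern, using the same hypothetical framework call as the examples below:

# One round trip for the lines...
$lines = SQL_AllRows(
    "SELECT sku,price,qty,extended_price
       FROM ORDER_LINES
      WHERE order = $order"
);
foreach($lines as $line) {
    # ...plus one more round trip per line for the description
    $sku  = $line['sku'];
    $item = SQL_AllRows("SELECT description FROM ITEMS WHERE sku = '$sku'");
    #
    #  HTML rendering code here
    #
}

An order with 50 lines pays for 51 round trips this way; the JOIN below retrieves exactly the same information in one.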

To do this most efficiently, we can make a single round trip to the server that retrieves all of the columns we need in one shot, then do a loop to render them like so (the example is in PHP):

# Assume some function that gives you the order number,
# sanitized for safe substitution
$order = GetOrderNumber();

# Form the SQL
$sq="SELECT ol.sku,ol.price,ol.qty,ol.extended_price
           ,i.description
       FROM ORDER_LINES ol
       JOIN ITEMS       i   ON ol.sku = i.sku
      WHERE ol.order = $order";

# Most frameworks should have some command to retrieve
# all rows for a query, something like this:
$lines = SQL_AllRows($sq);

# Finally, render the HTML
foreach($lines as $line) {
    #
    #  HTML rendering code here
    #
}

I should stress that this example carries a reasonable expectation that the order is small enough that you don't start hitting the inefficiencies of your particular language. Rendering large result sets in a Web Application is severely problematic compared to the old desktop systems, and doing so requires separate techniques that will have to wait for a future essay.

Example 2: Caching Small Tables

Sometimes you will need to generate a printed report that involves many tables, including several description lookups. For instance, I have a medical application that generates statements for all patients who have a balance. A typical run will produce 100 or 200 statements, and each statement requires information from no fewer than 8 tables.

In cases like this you can simplify your queries by retrieving the small lookup tables in their entirety before going to the main query and loop. For the example of the medical program there are two tables that qualify for this treatment. These are the tables of "ICD9" codes and "CPT" codes. Both of these usually have only about 100 rows, and there are only 2 relevant columns in one and 3 in the other. Therefore there is a big gain to be had by simply loading them into RAM ahead of time and simplifying the resulting code.

This bare-bones example shows simply that the tables are loaded first, and then main execution begins.

# The function SQL_AllRows() gives me the complete result from
# a query, the 2nd argument says to return an associative
# array with key values made out of the named column.
# NOTE: an "icd9" code is a medical diagnosis code
$icd9codes = SQL_AllRows(
    "Select icd9code,description from icd9codes"
    ,"icd9code"
);

# NOTE: a "CPT" code is a medical procedure code
$cptcodes = SQL_AllRows(
    "select cptcode,description from cptcodes"
    ,"cptcode"
);

#  ...now continue by pre-fetching the list of patients
#     we will be dealing with, and then we can finally
#     go into the main loop and refer to the $icd9codes
#     and $cptcodes array as needed.
#
$patients = SQL_AllRows(
    "Select patient from patients where balance > 0
      order by last_name,first_name"
);
foreach($patients as $patient) {
    #
    # retrieve the statement information, use
    # arrays $cptcodes and $icd9codes to display
    # descriptions for those codes
    #
}

Knowing Your Context

There is one more piece of the puzzle that a programmer must have in order to make wise decisions when balancing round trips and retrieval size: a thorough knowledge of your context.

Some examples of context are:

  • Huge social networking site or portal with hundreds of hits per second.
  • eCommerce site.
  • Line of business program used by the staff of a company to do their daily work.

My own context is the third item, line of business applications. In this context the following realities hold:

  • A huge user base might be a few hundred, with never more than five or six simultaneous transactions going on.
  • A much more common user base is 10-20 users (or even 3 or 4!), with one transaction every 5-20 seconds.
  • The public website accessed by customers is limited to a few thousand potential users, of which you rarely if ever have two or more users on at the same time.

In this context I have a wealth of server resources, because my customer can spend as little as $1500.00 and get a server with more RAM than 10-20 users will ever use at the same time. Therefore, my own coding habits often tend toward caching lookup tables and pulling 300 rows into memory at one shot so that I can get them to the screen (or PDF, or CSV...) as fast as possible. But these decisions are guided by the context of my applications; if your context is different, you may be led to different conclusions.

Conclusion

It is not difficult to create database applications that perform well. The basic rules of thumb are to make a minimum number of round trips to the server and to retrieve precisely the values that you need and no more. These ideas work well because they minimize your most expensive operation, which is disk access.

It is also perfectly acceptable to denormalize your tables (following Denormalization Patterns) which simplifies your queries and reduces JOIN operations. Finally, you must know your context well, so that you can evaluate techniques such as caching lookup tables.

These ideas form the cornerstone of most performance optimization and you will find that applying them over and over rigorously will give you most of what you need to keep performance strong in the web layer.

Next Post: Pay Me Now or Pay Me Later

Sunday, June 22, 2008

Database Performance 1: Huge Inserts

The modern database server provides a wealth of features that provide robust and reliable storage. Understanding how these features work is vital if you want fast performance for your databases. This week we begin a series on performance by looking at "ACID" compliance and how it affects our handling of large operations.

Welcome to the Database Programmer blog. This blog is for anybody who wants to see practical examples of how databases work and how to create lean and efficient database applications. There is a Complete Table Of Contents that is updated each week, and a Master list of table design patterns that is updated whenever a new design pattern is presented.

What is ACID Compliance

The modern database provides a set of features known as ACID compliance which make the database very robust. To paraphrase the Wikipedia article, ACID compliance means that:

  • Each transaction is Atomic. It is completed in its entirety or not at all.
  • The database is always Consistent. No user ever sees the intermediate and possibly invalid state of the database while your transaction is in progress.
  • Each transaction is Isolated. Your changes do not get mixed up with other people's changes, even though they are executing at the same time (see more in the Wikipedia article on Serializability).
  • Each transaction is Durable. Once the database says the job is complete without errors, you are assured the database has checked all constraints and keys and the transaction is completely valid. In most cases we also take this to mean the data is safely on disk and pulling the plug will not corrupt it.

Maintaining ACID compliance is expensive for the database server. It must in effect keep two versions of every row in play, and it must do so while multiple users have multiple transactions running at the same time, even while other users may be trying to read the rows that are being affected. This cost is considered more than acceptable when the reliability requirement is high. But there is one case where the inevitable consequence of ACID compliance is to destroy performance, and this is on large UPDATES and INSERTS. Today we are going to look particularly at large INSERT operations.

Consider the case where you are creating a new system that must load 1 million rows into a single table from an older system. You go to your server's manual and find the command to do so (For PostgreSQL it is COPY..., for SQL Server it is BULK INSERT...). You work out the painful details of the command with a test file of 10 rows and get the syntax down. Then you issue the command with the real file of 1 million rows. After a minute or two you realize, well, 1 million rows is a lot, time to get some coffee. Returning with the coffee you see that it is still not finished, so time to check some email. An hour later it is still not done, and when you leave it running overnight and come back in the morning your machine is frozen.

The problem here is simply one of numbers. One million rows of average length of 100 characters (in ASCII or UTF-8) will be about 100 megabytes. The server must do no less than maintain two completely separate states for the database -- one state without your rows and one with your rows. The cost of this is several times the actual size of the input data, so the 1 million rows in this example will take several hundred megabytes of resources, at least! The server will be managing this process both on disk and in RAM. This will simply die on a laptop or development workstation, even one with a gig or two of RAM.

You can maybe go out and buy some RAM, but the purpose of this essay is to explain how to deal with those inevitable cases where the operation you are performing requires more resources than you have. This situation will always come up, so it is good to know how to deal with it.

Step 1: Drop Indexes and Keys

ACID compliance extends to indexes as well. When you INSERT many thousands or millions of rows to a single table in one shot, the server must maintain two separate versions of each index. This burden is laid on top of the burden of calculating the index keys for every single row one-by-one. If we began with a burden of several hundred megabytes of resources, just a few indexes on your table could end up more than doubling that.

This is why you will see advice on mailing lists and forums to drop indexes before doing large insert operations.

Your table will have one index automatically for the primary key, so you must drop the primary key. You will also want to drop foreign keys so that you do not waste time checking them (they also have indexes). Unique constraints also end up creating indexes automatically behind the scenes, so you must drop those. Other constraints must also be dropped to prevent having them checked for every single one of your million rows. Finally, you must drop all indexes you created yourself. After the load is complete, you must recreate these keys, indexes, and constraints.
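
As a sketch of that sequence, with hypothetical table, index, and constraint names, and using the same kind of super-simple SQL() command that appears in the loader code further down (the exact DDL varies by server):

# Drop everything the server would otherwise maintain row-by-row
SQL("ALTER TABLE bigtable DROP CONSTRAINT bigtable_customer_fkey");
SQL("ALTER TABLE bigtable DROP CONSTRAINT bigtable_pkey");
SQL("DROP INDEX bigtable_lastname_ix");

# ...perform the bulk load here (COPY, BULK INSERT, etc.)...

# Then put it all back, which builds each index in a single pass
SQL("ALTER TABLE bigtable ADD PRIMARY KEY (bigtable_id)");
SQL("ALTER TABLE bigtable ADD FOREIGN KEY (customer) REFERENCES customers");
SQL("CREATE INDEX bigtable_lastname_ix ON bigtable (lastname)");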

In some cases, where your load is small enough, this may be enough to get predictable load times, and you can stop here. But the larger the operation, the more likely that this will not be enough. In those cases, there is another step to take.

Step 2: Chunks

The basic problem described above is that the database performance has gone non-linear. When you double the number of rows, it does not take twice as long, but four times (or 3 or 10 or whatever). When you multiply the rows by 10, it may not take 10 times as long; you might see it take 100 times as long, or more! (Or maybe you just killed it after you came back in the morning and found your workstation frozen.)

We can restore linear performance if we break the input into chunks and load them one at a time in succession. You break the input into files that are small enough so that no individual file will send the server into non-linear hell. If we find that a chunk of 5000 rows loads in 4 seconds, and we have 2000 of these files, we now have a predictable load time. We have restored linear performance because we know that twice as many rows will take twice as long.

I am currently working on a system where I must periodically load 3 tables of about 3 million rows each from an old desktop Visual FoxPro system into a PostgreSQL database. The chunking code on the output looks something like this:

    * This is FOXPRO code, which is vaguely like BASIC...
  mCount     = 0
  mIncrement = 5000                    * Hardcoded chunk size
  mRN        = Reccount(p_tableName)   * Fox's command to get row count

    * these three commands get the record pointer to the top
    * of the table and turn off all indexes 
 SELECT (p_tableName) 
 locate
 set order to 0
 for rnStart = 1 TO mRN step mIncrement
  mCount = mCount + 1
 
        * each loop outputs the next 5000 rows and leaves the
        * record pointer ready for the next loop. 
        * Foxpro uses a semi-colon to mean continue onto next line
  COPY column1,column2,column3 ;
        TO (m_dir+p_tableName+"_"+PADL(mCount,6,'0')+".asc") DELIMITED ;
   WHILE recno() < (rnStart + mIncrement)
        * This is foxpro's 'echo' command, a question mark
  ? p_tableName + "  copied "+str(_TALLY)+" records"
 endfor

Then on the receiving end I need a program that reads the chunks and loads them in. The relevant portion of the code is here (the example is in PHP and loads to PostgreSQL):

#  Assume variable $fcnt holds the number of chunks to load
#  and that variables like $tabname, $ddir etc hold file names,
#  directory locations, column lists and so forth.
for($c = 1; $c<= $fcnt; $c++) {
    $insert = "_".str_pad($c,6,'0',STR_PAD_LEFT);
    LogEntry("     loading file # "
        .str_pad($c,6,' ',STR_PAD_LEFT)
        .' of '.str_pad($fcnt,6,' ',STR_PAD_LEFT)
    );
    $cmd="COPY $tabname ($collist) "
        ." FROM '$ddir$afile$insert.asc' DELIMITERS ',' CSV QUOTE '\"'";
    # This is my framework's super-simple direct SQL command
    SQL($cmd);
}

Conclusion: Chunks Restore Linear Performance

Database programmers depend heavily on "ACID" features to provide robust data storage. We depend upon these features so much that we will not consider using systems that cannot provide them (MySQL's MyISAM engine for instance). The cost of these features for performance is considered part of the bargain when robustness is required, but when you are doing a huge insert, the ACID features cause performance to go "non-linear", to become unpredictably long. As a first step you can drop indexes, keys, and constraints on a table to improve load times, but if that is not enough, you can restore linear performance by breaking the large operation into many "chunks", each of which is small enough to stay linear.

Next Essay: Performance in the Web Layer

Sunday, June 15, 2008

Why I Do Not Use ORM

An impedance mismatch occurs when two devices are connected so that neither is operating at peak efficiency. This lack of efficiency is not due to any intrinsic incompatibilities between the devices, it only exists once they are connected together the wrong way. Object Relational Mapping (ORM) does not cure a pre-existing impedance mismatch, it creates one, because it connects databases to applications in a way that hobbles both.

UPDATE: There is a newer Historical Review of ORM now available.

UPDATE: In response to comments below and on reddit.com, I have a new post that gives a detailed analysis of an algorithm implemented as a sproc, in app code with embedded SQL, and in ORM.

Welcome to the Database Programmer blog. This blog is for anybody who wants to see practical examples of how databases work and how to create lean and efficient database applications.

There are links to other essays at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Good Marketing, Terrible Analogy

Normally I like to reserve this space for a positive presentation of things that I have found that work, I don't like to waste time ranting against things I don't like. However, ORM is so pervasive in some circles that it is important to establish why you will not see it used here, so as to avoid a lot of unnecessary chatter about its absence.

The use of the term "impedance mismatch" is great marketing, worthy of a certain software company in the Northwest of the US, but the analogy is utterly wrong. The analogy is used incorrectly to imply an intrinsic incompatibility, but the real irony is that there is no such incompatibility, and if we want to use the analogy we are forced to say that ORM is the impedance mismatch, because it creates the inefficient connection.

It always comes back to the fact that modern databases were designed to provide highly reliable permanent storage, and they possess a slew of features that promote that end. Programming languages on the other hand are meant to process data in a stepwise fashion. When the two of them meet it is very important to establish a strategy that uses the strengths of both, instead of trying to morph one into the other, which never yields efficient results.

The Very Basics

The language SQL is the most widely supported, implemented, and used way to connect to databases. But since most of us have long lists of complaints about the language, we end up writing abstraction layers that make it easier for us to avoid coding SQL directly. For many of us, the following diagram is a fair (if simplified) representation of our systems:

This diagram is accurate for ORM systems, and also for non-ORM systems. What they all have in common is that they seek to avoid manually coding SQL in favor of generating SQL. All systems seek to give the programmer a set of classes or functions that makes it easy and natural to work with data without coding SQL manually.

This brings us to a very simple conclusion: the largest part of working out an efficient database strategy is working out a good SQL generation strategy. ORM is one such strategy, but it is far from the simplest or the best. To find the simplest and the best, we have to start looking at examples.

First Example: Single Row Operations

Consider the case of a generic CRUD interface to a database having a few dozen tables. These screens will support a few single-row operations, such as fetching a row, saving a new row, saving updates, or deleting a row.

We will assume a web form that has inputs with names that begin with "inp_", like "inp_name", "inp_add1", "inp_city" and so forth. The user has hit [NEW] on their AJAX form, filled in the values, and hit [SAVE]. What is the simplest possible way to handle this on the server? If we strip away all pre-conceived ideas about the "proper" thing to do, what is left? There are only these steps:

  1. Assemble the relevant POST variables into some kind of data structure
  2. Perform sanity checks and type-validations
  3. Generate and execute an insert statement
  4. Report success or failure to the user

The simplest possible code to do this looks something like this (the example is in PHP):

# This is a great routine to have.  If you don't have
# one that does this, write it today!  It should return
# an associative array equivalent to:
#    $row = array( 
#       'name'=>'....'
#      ,'add1'=>'....'
#    )
# This routine does NOT sanitize or escape special chars
$row = getPostStartingWith("inp_");

# get the table name.  
$table_id = myGetPostVarFunction('table_id');

# Call the insert generation program.  It should have
# a simple loop that sanitizes, does basic type-checking,
# and generates the INSERT.  After it executes the insert
# it must trap database errors for reporting to the user.
#
if (!SQLX_Insert($table_id,$row)) {
    myFrameworkErrorReporting();
}

Without all of my comments the code is 5 lines! The Insert generation program is trivial to write if you are Using a Data Dictionary, and it is even more trivial if you are using Server-side security and Triggers.

This is the simplest possible way to achieve the insert, and updates and deletes are just as easy. Given how simple this is (and how well it performs), any more complicated method must justify itself considerably in order to be considered.

ORM cannot be justified in this case because it is slower (objects are slower than procedural code), more complicated (anything more than 5 lines loses), and therefore more error-prone, and worst of all, it cannot accomplish any more for our efforts than we have already.

Objection! What About Business Logic?

The example above does not appear to allow for implementing business logic, but in fact it does. The SQLX_Insert() routine can call out to functions (fast) or objects (much slower) that massage data before and after the insert operation. I will be demonstrating some of these techniques in future essays, but of course the best performing and safest method is to use triggers.
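
A bare-bones sketch of what such a hook might look like, with entirely hypothetical function names (a real framework would pick its own convention for finding these hooks):

# A table-specific "before insert" function that massages the row.
# SQLX_Insert() would look for a function named after the table and
# call it before generating the INSERT.
function before_insert_customers(&$row) {
    # default a value, normalize a field, and so on
    if (empty($row['status'])) {
        $row['status'] = 'NEW';
    }
    $row['state'] = strtoupper($row['state']);
}

# ...and inside SQLX_Insert(), something like:
#    $fn = "before_insert_".$table_id;
#    if (function_exists($fn)) { $fn($row); }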

Example 2: Processes, Or, There Will Be SQL

Many programmers use the term "process" to describe a series of data operations that are performed together, usually on many rows in multiple tables. While processes are not common on a typical website, they are plenty common in line-of-business applications such as accounting, ERP, medical programs, and many many others.

Consider a time entry system, where the employees in a programming shop record their time, and once per week the bookkeeper generates invoices out of the time slips. When this is performed in SQL, we might first insert an entry into a table of BATCHES, obtain the batch number, and then enter a few SQL statements like this:

-- Step 1, mark the timeslips we will be working with
UPDATE timeslips SET batch = $batch
 WHERE batch IS NULL;
 
-- Step 2, generate invoices from unprocessed timeslips
INSERT INTO Invoices (customer,batch,billing,date)
SELECT CUSTOMER,$batch,SUM(billing) as billing,NOW()
  FROM timeslips
 WHERE batch = $batch
 GROUP BY customer;
 
-- Step 3, mark the timeslips with their invoices
UPDATE timeslips 
   SET invoice = invoices.invoice
  FROM invoices 
 WHERE timeslips.customer = invoices.customer
   AND timeslips.batch    = $batch;

While this example vastly simplifies the process, it ought to get across the basic idea of how to code things in SQL that end up being simple and straightforward.

Counter Example: The Disaster Scenario

The biggest enemy of any software project is success. Code that works wonderfully on the developer's laptop is suddenly thrown into a situation with datasets that are hundreds of times larger than the test data. That is when performance really matters. Processes that took 3 minutes on the laptop suddenly take 10 hours, and the customer is screaming. How do these things happen?

Mostly they happen because programmers ignore the realities of how databases work and try to deal with them in terms they understand, such as objects or even simple loops. Most often what happens is that the programmer writes code that ends up doing something like this:

foreach( $list_outer as $item_outer) {
    foreach( $list_inner as $item_inner) {
        ...some database operation
    }
}

The above example will perform terribly because it is executing round trips to the database server instead of working with sets. While nobody (hopefully) would knowingly write such code, ORM encourages you to do this all over the place, by hiding logic in objects that themselves are instantiating other objects. Any code that encourages you to go row-by-row, fetching each row as you need it, and saving them one-by-one, is going to perform terribly in a process. If the act of saving a row causes the object to load more objects to obtain subsidiary logic, the situation rapidly deteriorates into exactly the code snippet above - or worse!
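
To make the contrast concrete, here is a sketch of the invoice-marking step from the timeslip example done row-by-row, next to the single set-based statement (the tables and arrays are hypothetical, and SQL() is the same super-simple command used elsewhere in these essays):

# Row-by-row: one round trip for every single timeslip
foreach($invoices as $invoice) {
    foreach($invoice['timeslips'] as $slip) {
        SQL("UPDATE timeslips SET invoice = {$invoice['invoice']}
              WHERE timeslip = {$slip['timeslip']}");
    }
}

# Set-based: the same work in one round trip
SQL("UPDATE timeslips
        SET invoice = invoices.invoice
       FROM invoices
      WHERE timeslips.customer = invoices.customer
        AND timeslips.batch    = $batch");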

On a personal note, I have to confess that I am continually amazed and flabbergasted when I see blog posts or hear conversations in user groups about popular CMS systems and web frameworks that will make dozens of database calls to refresh a single page. A seasoned database programmer simply cannot write such a system, because they have habits and practices that instinctively guard against such disasters. The only possible explanation for these systems is the overall newness of the web and the extreme ignorance of database basics on the part of the CMS and framework authors. One can only hope the situation improves.

Sidebar: Object IDs are Still Good

There are some people who, like myself, examine how ORM systems work and say, "no way, not in my code." Sometimes they also go to the point of refusing to use a unique numeric key on a table, which is called by some people an "Object ID" or OID for short.

But these columns are very useful for single-row operations, which tend to dominate in CRUD screens (but not in processes). It is a bad idea to use them as primary keys (see A Sane Approach To Choosing Primary Keys), but they work wonderfully in any and all single-row operations. They make it much easier to code updates and deletes.
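
A small illustration, with hypothetical column names and "skey" standing in for the object id column:

# Without an object id, the UPDATE must repeat every column of the key
SQL("UPDATE orderlines SET qty = 5
      WHERE customer = 'ABC' AND order_date = '2008-06-15' AND sku = 'XY-9'");

# With an object id, one value identifies the row
SQL("UPDATE orderlines SET qty = 5 WHERE skey = 8877");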

Conclusions

The recurring theme of these essays is that you can write clean and efficient code if you know how databases work on their own terms. Huge amounts of application code can be swept away when you understand primary keys and foreign keys and begin to normalize your tables. The next step from there is knowing how to code queries, but sooner or later you have to grapple with the overall architecture. (Well supposedly you would do that first, but many of us seem to learn about architectural concerns only after we have coded long enough to recognize them).

A thorough knowledge of database behavior tends to lead a person away from ORM. First off, the two basic premises of ORM are factually incorrect: One, that there is some native incompatibility between databases and code, and two, that all the world must be handled in objects. These two misconceptions themselves might be excusable if they turned out to be harmless, but they are far from harmless. They promote a willful ignorance of actual wise database use, and in so doing are bound to generate methods that are inefficient at best and horrible at worst.

Overall, there are always simpler and better performing ways to do anything that ORM sets out to achieve.

Next Essay: Performance on Huge Inserts

Addendum June 19, 2008

After reading the comments on the blog over the last few days I have decided to put in this addendum rather than attempt to answer each comment independently. I have attempted to answer the objections in descending order of relevance.

The Example is Trivial or "Cheats"

This is a very compelling challenge to the article offered by bewhite and Michael Schuerig and it deserves a meaningful response. What I want to do is flesh out my approach and why I find it better than using ORM. While I do not expect this to lead to agreement, I hope that it answers their challenges.

  • My sphere of activity is business applications, where two dozen tables is trivial and the norm is for dozens or hundreds of tables.
  • When the table count grows beyond the trivial, many concerns come into play that do not appear at lower table counts.
  • I have found that a single unified description of the database works best for these situations, provided it can specify at very least schema, automations, constraints, and security. This is what I refer to as the data dictionary.
  • The first use of the data dictionary is to run a "builder" program that builds the database. This builder updates schemas, creates keys and indexes, and generates trigger code. The same builder is used for clean installs and upgrades.
  • The generated trigger code answers directly the challenges as to how non-trivial inserts are handled. Downstream effects are handled by the triggers, which were themselves generated out of the dictionary, and which implement security, automations, and constraints. No manual coding of SQL routines thank you very much.
  • All framework programs such as SQLX_Insert() read the dictionary and craft the trivial insert. The code does what you would expect, which is to check for type validity and truncate overlong values (or throw errors). But it does not need to know anything more than is required to generate an INSERT; all downstream activity occurs on the server.
  • The dictionary is further used to generate CRUD screens, using the definitions to do such things as gray out read-only fields, generate lookup widgets for foreign keys, and so forth. This generated code does not enforce these rules, the db server does that, it simply provides a convenient interface to the data.
  • A big consequence here is that there is no need for one-class-per-table, as most tables can be accessed by these 'free' CRUD screens.
  • That leaves special-purpose programs where 'free' CRUD screens don't reflect the work flow of the users. In a business app these usually come down to order entry, special inquiry screens and the lot. These can be programmed as purely UI elements that call the same simple SQLX_Insert() routines that the framework does, because the logic is handled on the server.
  • This approach is not so much about code reuse as code elimination. In particular, the philosophical goal is to put developer assets into data instead of code.
  • When this approach is taken to its full realization, you simply end up not needing ORM, it is an unnecessary layer of abstraction that contributes nothing to quality at any stage.

These ideas are implemented in my Andromeda framework. It is not the purpose of this blog to promote that framework, but it has been successfully used to produce the types of applications I describe on this blog. I make mention of it here for completeness.

So to conclude, both of these gentlemen are correct that the example says nothing about how the crucial SQLX_Insert() routine is coded, and I hope at least that this addendum fleshes this out and makes clear where it is different from ORM.

The Model Should Be Based On Classes

bewhite asks "Do you propose us to organize our applications in terms of tables and records instead of objects and classes?"

Yes. Well, maybe not you, but that's how I do it. I do not expect to reach agreement on this point, but here at least is why I do it this way:

  • My sphere of activity is business applications, things like accounting, ERP, medical management, job control, inventory, magazine distribution and so forth.
  • I have been doing business application programming for 15 years, but every program I have ever written (with a single recent exception) has replaced an existing application.
  • On every job I have been paid to migrate data, but the old program goes in the trash. Every program I have written will someday die, and every program written by every reader of this blog will someday die, but the data will be migrated again and again. (...and you may even be paid to re-deploy your own app on a new platform).
  • The data is so much more important than the code that it only makes sense to me to cast requirements in terms of data.
  • Once the data model is established, it is the job of the application and interface to give users convenient, accurate and safe access to their data.
  • While none of this precludes ORM per se, the dictionary-based approach described above allows me to write both procedural and OOP code and stay focused on what the customer is paying for: convenient, accurate and safe access.
  • The danger in casting needs in any other terms is that it places an architectural element above the highest customer need, which is suspect at best and just plain bad customer service at worst. We all love to write abstractions, but I much prefer the one that gets the job done correctly in the least time, rather than the one that, to me, appears to most in fashion.

Old Fashioned Technologies

More than one comment said simply that triggers and other server-side technologies "went out". Since I was there and watched it happen I would contend that when the web exploded a new generation came along with different needs. In particular the need for content and document management caused people to question all of the conventional uses of the SQL databases, and like all programmers they are convinced their world is the only world and all of the world, ergo, triggers are history because I don't use them. Nevertheless, those of us who continue to write business applications continue to use the technologies that worked well then and only work better now.

Ken Does Not Like OOP

I love OOP, especially for user interfaces. I just don't think it should own the domain model, and I don't think that "trapping" business logic inside of classes gives nearly the same independence as a data dictionary does. I've tried it both ways and I'll stick with the dictionary.

Any Use of OO Code Implies ORM

A few comments said outright that if you are using OOP code then you are by definition mapping. Technically this is untrue if you understand the use of the term "map" as opposed to "interface". Mapping is the process of creating a one-to-one correspondence between items in one group (the code) to items in the other (the database). A non-ORM interface is one in which any code, procedural or OOP, passes SQL and handles data without requiring a one-to-one mapping of tables or rows to classes or functions. My apps are not ORM because I have no such requirement that there be a class for every table, and no such requirement that there be any specific code to handle a table.

Don't Talk about Procedural Being Faster

At least three comments blasted this contention. To put things in context, performance in a database application goes in two stages. First and absolutely most critical is to be extremely smart about reducing database reads, since they are 1000's of times slower than in-memory operations. However, once that is done, there is no reason to ignore speed improvements that can be gained by optimizing the code itself. The commenters are correct that this gain is of a very low order, but I would stand by the statement after making this contextual addendum.

Thank You All

This was the most widely read piece in this series, definitely the most controversial. There will not likely be any other articles this controversial, as the main purpose of this essay was to provide regular readers with some background as to why they will not see ORM-based examples in future essays. Thanks for the comments!

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.


Monday, June 9, 2008

Using a Data Dictionary

Database applications can be made much simpler if you maintain a body of data that describes your tables. Every single programming task in a database application needs to know something about the tables it is working with, so every program in a framework and many programs in an application can benefit from a central store of information about the database.

Introducing the Data Dictionary

The term "data dictionary" is used by many, including myself, to denote a separate set of tables that describes the application tables. The Data Dictionary contains such information as column names, types, and sizes, but also descriptive information such as titles, captions, primary keys, foreign keys, and hints to the user interface about how to display the field.

A super-simple beginning Data Dictionary might look like this:

TABLE    | COLUMN  | TYPE | PREC* | SCALE | PK | AUTO** |DESCRIPTION
---------+---------+------+-------+-------+----+--------+-------------------
students | student | int  |  -    |  -    | Y  | IDENT  | Student ID
students | firstnm | char | 20    |  -    |    |        | First Name
students | lastnm  | char | 20    |  -    |    |        | Last Name
students | city    | char | 20    |  -    |    |        | City
students | gpa     | num  |  2    |  1    |    |        | Grade Point Avg

* Precision: Needed for chars, varchars, and numerics
** automation: to be described more below

The First Question: Where to Put It

It might seem that the first question about a data dictionary would be "What do we put in it?" We will get to this question in a moment, but the most important question is usually, "Where do we put it?" By this I mean do you put it into an XML file, or do you encode it into classes in program code? Or is it somehow directly entered into database tables? There are of course many opinions on this.

The first opinion is that you do not really need a data dictionary per se because you can get this information from the database server (if this is news to you, google "information_schema" and your favorite database). There are two major drawbacks when you depend on the server. First, you cannot build a database out of your data dictionary, because of course you don't have one until the database is built. The second drawback is much worse, which is that you cannot put any extended properties into the dictionary, and without those the dictionary is of little use.

Another method is to put the dictionary into classes, using the typical one-class-per-table approach and storing the data as properties of the class. There are multiple drawbacks to this approach:

  • Data that is "trapped" in code is very difficult to deal with efficiently, so many operations with the dictionary will be much harder to code and will be slower.
  • The dictionary is spread out into many files.
  • Ironically, a good data dictionary allows you to generate most CRUD forms, and so you don't need those one-class-per-table files filling up your directory. It seems silly to make a class that contains data that makes the class unnecessary.

The option I prefer is a plaintext file, which can be generated by a GUI or typed in by hand. The only requirement for the file is that it be easy to type and read, and easy to parse by a program. These requirements are well met by two formats: YAML and JSON (XML is a bletcherous horror to work with manually, so it is disqualified before the race starts). Both YAML and JSON enjoy parsers and writers in nearly all popular languages, so if you create your data dictionary in one of those you have a file that is human readable and writable, machine readable and writable, useful in nearly every language on every platform, easily placed in source control, and very flexible in what it can represent.

First Basic Use: Building and Upgrading Databases

A data dictionary is a wonderful way to build and upgrade databases. It is a real thrill to write your first SQL generator that scans a table of column definitions, writes out a CREATE TABLE statement, and executes it. The next step in the evolution of this process is to have the program query the INFORMATION_SCHEMA of the database, then work out which columns have been added to your data dictionary that are not in the table, and then upgrade the table with an ALTER TABLE statement.

This basic approach can easily be extended to include indexes and keys.
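
As a rough sketch of the first half of that idea, assuming a hypothetical $dictionary array shaped like the simple dictionary above and the same SQL() helper used in other examples:

# Build and execute a CREATE TABLE statement for one table out of
# the dictionary.  A full builder would also handle primary keys,
# defaults, and the INFORMATION_SCHEMA diff described above.
function buildTable($table_id) {
    global $dictionary;
    $coldefs = array();
    foreach ($dictionary[$table_id]['columns'] as $colname => $def) {
        $coldef = $colname.' '.$def['type_id'];
        if (!empty($def['column_precision'])) {
            $coldef .= '('.$def['column_precision'];
            if (!empty($def['column_scale'])) {
                $coldef .= ','.$def['column_scale'];
            }
            $coldef .= ')';
        }
        $coldefs[] = $coldef;
    }
    SQL("CREATE TABLE $table_id (".implode(', ', $coldefs).")");
}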

Sidebar: Data Dictionary Versus Upgrade Scripts

Many programmers use upgrade scripts to alter their schemas. The idea is that programmer Sax Russell adds a feature. Along with his code he writes a script that makes the necessary alterations to the database. Then comes Ann Clayborne who does the same thing, followed by Hiroko Ai, and then Sax again. When a customer upgrades their system, they run all four scripts in order. This can lead to horrible upgrade experiences in cases where multiple scripts are upgrading a large table several times. A data dictionary is far superior because it can simply examine the tables as they are, examine the data dictionary, work out a diff, and execute the smallest set of commands to bring the database current. This approach does require of course that the data dictionary be in source control like any other application file.

Second Basic Use: HTML Code Generation

A rich data dictionary can provide you with everything you need to provide "free" CRUD screens for most of your tables. Now to be sure, there are always those tasks that require special screens for the users, but there is just no reason to sit down and code up a screen to manage a table of customer types, zip codes, or even a general ledger chart of accounts. Here is an example of an expanded data dictionary in YAML format that contains enough information to generate screens with zero application code:

table customer_types:
    # Use this to generate menus!
    module: sales
    # Use this for the menu and the page title
    description: Customer Types
    
    column customer_type:
        type_id: char
        column_precision: 10
        # This is crucial for code generation
        description: Customer Type
        primary_key: "Y"
    column description:
        type_id: char
        column_precision: 40
        description: Description

The use of most of those properties should be pretty obvious, but I would like to point out one particular clever trick you can do. The "primary_key" flag can be used to enable a column during insert mode (if it is a user-entered key), and then to gray out the column in edit mode. When you consider this basic example it starts to become clear that nearly all of the code in most CRUD screens can be reduced to a few routines that read the data dictionary and generate HTML.

I would like to repeat that I do not mean to say that every single CRUD form in an application will work this way. In my experience 9 out of 10 tables can use "free" generated forms, but about 1 in 10 are used so often by users that you end up making special forms with shortcuts and wizards to speed up their work. The dictionary can help you there if you use it to generate the inputs, but it is no use trying to expand the dictionary to cover every conceivable case; there is always one more case that the next customer needs that would just plain go faster if you coded it up by hand.
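
To give a flavor of how little code a "free" form takes, here is a minimal sketch that turns a dictionary entry like the one above into HTML inputs (the array structure and function name are hypothetical):

# Emit one input per column, using the dictionary for labels and sizes.
# $mode is 'insert' or 'edit'; the primary key is grayed out in edit
# mode, which is the trick described above.
function freeForm($table_id, $mode) {
    global $dictionary;
    $html = '';
    foreach ($dictionary[$table_id]['columns'] as $colname => $def) {
        $disabled = ($mode == 'edit' && !empty($def['primary_key']))
                  ? ' disabled' : '';
        $html .= '<label>'.$def['description'].'</label>';
        $html .= '<input name="inp_'.$colname.'"'
               . ' maxlength="'.$def['column_precision'].'"'
               . $disabled.'>';
    }
    return $html;
}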

Third Basic Use: Trigger Generation

Generating triggers is a very powerful thing you can do with a data dictionary. We saw last week that the most complete encapsulation of code and data occurs when a trigger is placed on a table. Imagine you had a data dictionary that looked like this:

table orderlines:
    column extended_price:
        type_id: numeric
        column_precision: 10
        column_scale: 2
        calculate: @price * @qty

That magic little "calculate" value can be used to generate a trigger and put it onto the table. The trigger code might look something like this (The SQL version is PostgreSQL):

CREATE OR REPLACE FUNCTION example() RETURNS TRIGGER AS
$$
BEGIN
    -- direct assignments are an error
    if new.extended_price is not null then
        raise exception 'Direct assignment forbidden: extended_price';
    else
        new.extended_price = new.price * new.qty;
    end if;
    return new;
END;
$$
SECURITY DEFINER LANGUAGE PLPGSQL;

It is not my purpose here to explain how to generate that trigger, but simply to suggest that it is a very doable thing. Because I am not tracing out the steps, I do need to explain that I slipped those "@" signs into the "calculate" value so that when the trigger is built the builder program could detect column names and put the appropriate "old." and "new." prefixes in front of them.
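
Just to make the idea concrete, a minimal sketch of that substitution (the regular expression and function name are mine, not from any particular builder):

# Replace every @column token in a "calculate" formula with a "new."
# prefix, so that "@price * @qty" becomes "new.price * new.qty".
# A real builder would also verify each name is a column of the table.
function calcToTriggerExpr($formula) {
    return preg_replace('/@(\w+)/', 'new.$1', $formula);
}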

Fourth Basic Use: SQL Access

A database application is full of code that reads and writes to tables. All frameworks and even very modest websites end up with a handful of routines that handle the mundane tasks of sending inserts and updates to the tables.

The Data Dictionary is the perfect resource for these routines. It can be used to properly format data, put quotes where they belong, clip overlong values (or throw an error), prevent the user from changing a primary key, and many other things.
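
A bare-bones sketch of such a routine, assuming a hypothetical $dictionary array loaded from the YAML file and the SQL() command used earlier (a real version would also need escaping appropriate to your particular server):

# Generate and execute an INSERT for $table_id from an associative
# array of values, consulting the dictionary for column definitions.
function SQLX_Insert($table_id, $row) {
    global $dictionary;
    $cols = array();
    $vals = array();
    foreach ($dictionary[$table_id]['columns'] as $colname => $def) {
        if (!isset($row[$colname])) continue;
        $value = $row[$colname];
        # Clip character values to the dictionary's precision and quote
        # them; anything else must at least look like a number.
        if ($def['type_id'] == 'char') {
            $value = substr($value, 0, $def['column_precision']);
            $value = "'".str_replace("'", "''", $value)."'";
        } elseif (!is_numeric($value)) {
            return false;
        }
        $cols[] = $colname;
        $vals[] = $value;
    }
    $sql = "INSERT INTO $table_id (".implode(',', $cols).")"
         . " VALUES (".implode(',', $vals).")";
    return SQL($sql);
}

Updates and deletes follow the same pattern, with the dictionary's primary key information telling the routine which columns belong in the WHERE clause and which ones the user may not change.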

Fifth Basic Use: Documentation

If you have a "rich" data dictionary, one that contains lots of extended properties that describe everything about columns and tables, and if you build your database and generate your forms out of that dictionary, then the next and final natural step is to generate technical documentation out of it as well.

In a first pass, technical documentation is limited to simply displaying the columns that go into a table, which admittedly does not mean much even to technical users. But once you get past that basic task you can begin to layer on lists of parent and child tables (as links of course), descriptions of formulas, and so on.

In a similar vein, Tooltip descriptions go well in a data dictionary.

Conclusion: True Synchronization

The holy grail of database programming is synchronization of code and data. Synchronization means a lot more than simply making sure you delivered the right code and upgraded the tables. Complete synchronization means that the framework is intrinsically incapable of mistakenly accessing missing columns or failing to consider important columns. At the framework level, if you make use of a data dictionary as described above then your framework will be organically synchronized, which is to say that the synchronization is built into the code itself.

Next Essay: Why I Do Not Use ORM