Sunday, June 15, 2008

Why I Do Not Use ORM

An impedance mismatch occurs when two devices are connected so that neither is operating at peak efficiency. This lack of efficiency is not due to any intrinsic incompatibility between the devices; it exists only when they are connected together the wrong way. Object Relational Mapping (ORM) does not cure a pre-existing impedance mismatch, it creates one, because it connects databases to applications in a way that hobbles both.

UPDATE: There is a newer Historical Review of ORM now available.

UPDATE: In response to comments below and on reddit.com, I have a new post that gives a detailed analysis of an algorithm implemented as a sproc, in app code with embedded SQL, and in ORM.

Welcome to the Database Programmer blog. This blog is for anybody who wants to see practical examples of how databases work and how to create lean and efficient database applications.

There are links to other essays at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Good Marketing, Terrible Analogy

Normally I like to reserve this space for a positive presentation of things that I have found that work, I don't like to waste time ranting against things I don't like. However, ORM is so pervasive in some circles that it is important to establish why you will not see it used here, so as to avoid a lot of unnecessary chatter about its absence.

The use of the term "impedance mismatch" is great marketing, worthy of a certain software company in the Northwest of the US, but the analogy is utterly wrong. The analogy is used to imply an intrinsic incompatibility, but the real irony is that there is no such incompatibility, and if we want to keep the analogy we are forced to say that ORM is the impedance mismatch, because it creates the inefficient connection.

It always comes back to the fact that modern databases were designed to provide highly reliable permanent storage, and they possess a slew of features that promote that end. Programming languages on the other hand are meant to process data in a stepwise fashion. When the two of them meet it is very important to establish a strategy that uses the strengths of both, instead of trying to morph one into the other, which never yields efficient results.

The Very Basics

The language SQL is the most widely supported, implemented, and used way to connect to databases. But since most of us have long lists of complaints about the language, we end up writing abstraction layers that make it easier for us to avoid coding SQL directly. For many of us, the following diagram is a fair (if simplified) representation of our systems:

[Diagram: application code calling an abstraction layer that generates SQL, which is passed on to the database]

This diagram is accurate for ORM systems, and also for non-ORM systems. What they all have in common is that they seek to avoid manually coding SQL in favor of generating SQL. All systems seek to give the programmer a set of classes or functions that makes it easy and natural to work with data without coding SQL manually.

This brings us to a very simple conclusion: the largest part of working out an efficient database strategy is working out a good SQL generation strategy. ORM is one such strategy, but it is far from the simplest or the best. To find the simplest and the best, we have to start looking at examples.

First Example: Single Row Operations

Consider the case of a generic CRUD interface to a database having a few dozen tables. These screens will support a few single-row operations, such as fetching a row, saving a new row, saving updates, or deleting a row.

We will assume a web form that has inputs with names that begin with "inp_", like "inp_name", "inp_add1", "inp_city" and so forth. The user has hit [NEW] on their AJAX form, filled in the values, and hit [SAVE]. What is the simplest possible way to handle this on the server? If we strip away all pre-conceived ideas about the "proper" thing to do, what is left? There are only these steps:

  1. Assemble the relevant POST variables into some kind of data structure
  2. Perform sanity checks and type-validations
  3. Generate and execute an insert statement
  4. Report success or failure to the user

The simplest possible code to do this looks something like this (the example is in PHP):

# This is a great routine to have.  If you don't have
# one that does this, write it today!  It should return
# an associative array equivalent to:
#    $row = array( 
#       'name'=>'....'
#      ,'add1'=>'....'
#    )
# This routine does NOT sanitize or escape special chars
$row = getPostStartingWith("inp_");

# get the table name.  
$table_id = myGetPostVarFunction('table_id');

# Call the insert generation program.  It should have
# a simple loop that sanitizes, does basic type-checking,
# and generates the INSERT.  After it executes the insert
# it must cache database errors for reporting to the user.
#
if (!SQLX_Insert($table_id,$row)) {
    myFrameworkErrorReporting();
}
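The helper routine referenced in the comments at the top of the block might be sketched as follows. This is a hypothetical implementation, not the author's original code; it simply collects every POST variable whose name begins with the given prefix and strips the prefix:

```php
<?php
// Hypothetical sketch of getPostStartingWith().
// Returns an associative array of POST variables whose names
// begin with $prefix, with the prefix stripped from the keys.
// NOTE: it deliberately does NOT sanitize or escape values;
// that is the job of the insert generation routine.
function getPostStartingWith($prefix) {
    $row = array();
    $len = strlen($prefix);
    foreach ($_POST as $key => $value) {
        if (substr($key, 0, $len) === $prefix) {
            $row[substr($key, $len)] = $value;
        }
    }
    return $row;
}
```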

Without all of my comments, the save-handler above is just 5 lines! The insert generation program is trivial to write if you are Using a Data Dictionary, and it is even more trivial if you are using Server-side security and Triggers.
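To flesh out how such an insert generation routine can work, here is a hedged sketch of an SQLX_Insert()-style function driven by a data dictionary. The getDictionary() stub, the PDO connection, and the table and column names are all illustrative assumptions, not the author's actual framework code. Because the column names come from the dictionary rather than from user input, they are safe to place directly in the generated SQL:

```php
<?php
// Hypothetical stub: real versions would read the dictionary
// from files or tables rather than hard-coding it.
function getDictionary($table_id) {
    $dict = array(
        'customers' => array('name' => 'char', 'age' => 'int'),
    );
    return $dict[$table_id];
}

// Hypothetical sketch of an SQLX_Insert()-style routine: keep
// only columns the dictionary knows, do basic type checks, and
// generate a parameterized INSERT.
function SQLX_Insert($pdo, $table_id, $row) {
    $columns = getDictionary($table_id);
    $cols = array();
    $vals = array();
    foreach ($row as $col => $val) {
        if (!isset($columns[$col])) continue;       // drop unknown inputs
        if ($columns[$col] === 'int' && !is_numeric($val)) {
            return false;                           // basic type check
        }
        $cols[] = $col;
        $vals[] = $val;
    }
    if (!$cols) return false;                       // nothing to insert
    $sql = "INSERT INTO $table_id (" . implode(',', $cols) . ") "
         . "VALUES (" . implode(',', array_fill(0, count($cols), '?')) . ")";
    $stmt = $pdo->prepare($sql);
    return $stmt->execute($vals);
}
```

Note that the routine has no per-table code at all; everything it needs comes out of the dictionary at runtime.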

This is the simplest possible way to achieve the insert, and updates and deletes are just as easy. Given how simple this is (and how well it performs), any more complicated method must justify itself considerably in order to be considered.

ORM cannot be justified in this case because it is slower (objects are slower than procedural code), more complicated (anything more than 5 lines loses), and therefore more error-prone, and worst of all, it accomplishes nothing that the code above does not already accomplish.

Objection! What About Business Logic?

The example above does not appear to allow for implementing business logic, but in fact it does. The SQLX_Insert() routine can call out to functions (fast) or objects (much slower) that massage data before and after the insert operation. I will be demonstrating some of these techniques in future essays, but of course the best-performing and safest method is to use triggers.
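To illustrate the trigger approach, here is a minimal sketch in PostgreSQL syntax; the timeslips table and its hours, rate, and billing columns are hypothetical names, not from the original article. The point is that the logic runs on the server for every insert, no matter which program issued it:

```sql
-- Hypothetical sketch: business logic in a BEFORE INSERT trigger,
-- so every INSERT gets it regardless of the calling program.
CREATE FUNCTION timeslips_before_insert() RETURNS trigger AS $$
BEGIN
    -- massage data: default the date, derive the billing amount
    IF NEW.date IS NULL THEN
        NEW.date := CURRENT_DATE;
    END IF;
    NEW.billing := NEW.hours * NEW.rate;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER timeslips_bi BEFORE INSERT ON timeslips
    FOR EACH ROW EXECUTE PROCEDURE timeslips_before_insert();
```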

Example 2: Processes, Or, There Will Be SQL

Many programmers use the term "process" to describe a series of data operations that are performed together, usually on many rows in multiple tables. While processes are not common on a typical website, they are plenty common in line-of-business applications such as accounting, ERP, medical programs, and many many others.

Consider a time entry system, where the employees in a programming shop record their time, and once per week the bookkeeper generates invoices out of the time slips. When this is performed in SQL, we might first insert an entry into a table of BATCHES, obtain the batch number, and then enter a few SQL statements like this:

-- Step 1, mark the timeslips we will be working with
UPDATE timeslips SET batch = $batch
 WHERE batch IS NULL;
 
-- Step 2, generate invoices from unprocessed timeslips
INSERT INTO Invoices (customer,batch,billing,date)
SELECT CUSTOMER,$batch,SUM(billing) as billing,NOW()
  FROM timeslips
 WHERE batch = $batch
 GROUP BY customer;
 
-- Step 3, mark the timeslips with their invoices
UPDATE timeslips 
   SET invoice = invoices.invoice
  FROM invoices 
 WHERE timeslips.customer = invoices.customer
   AND timeslips.batch    = $batch;

While this example vastly simplifies the process, it ought to get across the basic idea of how to code things in SQL that end up being simple and straightforward.

Counter Example: The Disaster Scenario

The biggest enemy of any software project is success. Code that works wonderfully on the developer's laptop is suddenly thrown into a situation with datasets that are hundreds of times larger than the test data. That is when performance really matters. Processes that took 3 minutes on the laptop suddenly take 10 hours, and the customer is screaming. How do these things happen?

Mostly they happen because programmers ignore the realities of how databases work and try to deal with them in terms they understand, such as objects or even simple loops. Most often what happens is that the programmer writes code that ends up doing something like this:

foreach( $list_outer as $item_outer) {
    foreach( $list_inner as $item_inner) {
        ...some database operation
    }
}

The above example will perform terribly because it makes a round trip to the database server for every row instead of working with sets. While nobody (hopefully) would knowingly write such code, ORM encourages you to do this all over the place, by hiding logic in objects that themselves instantiate other objects. Any code that encourages you to go row-by-row, fetching each row as you need it and saving them one-by-one, is going to perform terribly in a process. If the act of saving a row causes the object to load more objects to obtain subsidiary logic, the situation rapidly deteriorates into exactly the code snippet above - or worse!
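To make the contrast concrete, here is a hypothetical sketch (table and column names invented, $pdo assumed to be a PDO connection) of the same step done row-by-row versus set-based:

```php
<?php
// Row-by-row: one round trip per row - the shape that hidden
// object logic tends to produce, and terrible at scale.
function markBatchSlow($pdo, $batch) {
    $ids = $pdo->query("SELECT id FROM timeslips WHERE batch IS NULL")
               ->fetchAll(PDO::FETCH_COLUMN);
    $stmt = $pdo->prepare("UPDATE timeslips SET batch = ? WHERE id = ?");
    foreach ($ids as $id) {
        $stmt->execute(array($batch, $id));   // one round trip per row
    }
}

// Set-based: exactly one round trip; the server does the iteration.
function markBatchFast($pdo, $batch) {
    $stmt = $pdo->prepare("UPDATE timeslips SET batch = ? WHERE batch IS NULL");
    $stmt->execute(array($batch));
}
```

Both functions produce the same final state; the difference is that the first one scales with the row count in network round trips while the second does not.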

On a personal note, I have to confess that I am continually amazed and flabbergasted when I see blog posts or hear conversations in user groups about popular CMS systems and web frameworks that will make dozens of database calls to refresh a single page. A seasoned database programmer simply cannot write such a system, because they have habits and practices that instinctively guard against such disasters. The only possible explanation for these systems is the overall newness of the web and the extreme ignorance of database basics on the part of the CMS and framework authors. One can only hope the situation improves.

Sidebar: Object IDs are Still Good

There are some people who, like myself, examine how ORM systems work and say, "no way, not in my code." Sometimes they go so far as to refuse to use a unique numeric key on a table, which some people call an "Object ID," or OID for short.

But these columns are very useful for single-row operations, which tend to dominate in CRUD screens (but not in processes). It is a bad idea to use them as primary keys (see A Sane Approach To Choosing Primary Keys), but they work wonderfully in any and all single-row operations. They make it much easier to code updates and deletes.

Conclusions

The recurring theme of these essays is that you can write clean and efficient code if you know how databases work on their own terms. Huge amounts of application code can be swept away when you understand primary keys and foreign keys and begin to normalize your tables. The next step from there is knowing how to code queries, but sooner or later you have to grapple with the overall architecture. (Well supposedly you would do that first, but many of us seem to learn about architectural concerns only after we have coded long enough to recognize them).

A thorough knowledge of database behavior tends to lead a person away from ORM. First off, the two basic premises of ORM are factually incorrect: One, that there is some native incompatibility between databases and code, and two, that all the world must be handled in objects. These two misconceptions themselves might be excusable if they turned out to be harmless, but they are far from harmless. They promote a willful ignorance of actual wise database use, and in so doing are bound to generate methods that are inefficient at best and horrible at worst.

Overall, there are always simpler and better performing ways to do anything that ORM sets out to achieve.

Next Essay: Performance on Huge Inserts

Addendum June 19, 2008

After reading the comments on the blog over the last few days I have decided to put in this addendum rather than attempt to answer each comment independently. I have attempted to answer the objections in descending order of relevance.

The Example is Trivial or "Cheats"

This is a very compelling challenge to the article offered by bewhite and Michael Schuerig and it deserves a meaningful response. What I want to do is flesh out my approach and why I find it better than using ORM. While I do not expect this to lead to agreement, I hope that it answers their challenges.

  • My sphere of activity is business applications, where two dozen tables is trivial and the norm is for dozens or hundreds of tables.
  • When table counts grow beyond the trivial, many concerns come into play that do not appear at lower table counts.
  • I have found that a single unified description of the database works best for these situations, provided it can specify at very least schema, automations, constraints, and security. This is what I refer to as the data dictionary.
  • The first use of the data dictionary is to run a "builder" program that builds the database. This builder updates schemas, creates keys and indexes, and generates trigger code. The same builder is used for clean installs and upgrades.
  • The generated trigger code answers directly the challenges as to how non-trivial inserts are handled. Downstream effects are handled by the triggers, which were themselves generated out of the dictionary, and which implement security, automations, and constraints. No manual coding of SQL routines thank you very much.
  • All framework programs such as SQLX_Insert() read the dictionary and craft the trivial insert. The code does what you would expect, which is check for type validity and truncate overlong values (or throw errors). But it does not need to know anything more than is required to generate an INSERT; all downstream activity occurs on the server.
  • The dictionary is further used to generate CRUD screens, using the definitions to do such things as gray out read-only fields, generate lookup widgets for foreign keys, and so forth. This generated code does not enforce these rules (the db server does that); it simply provides a convenient interface to the data.
  • A big consequence here is that there is no need for one-class-per-table, as most tables can be accessed by these 'free' CRUD screens.
  • That leaves special-purpose programs where 'free' CRUD screens don't reflect the work flow of the users. In a business app these usually come down to order entry, special inquiry screens and the lot. These can be programmed as purely UI elements that call the same simple SQLX_Insert() routines that the framework does, because the logic is handled on the server.
  • This approach is not so much about code reuse as code elimination. In particular, the philosophical goal is to put developer assets into data instead of code.
  • When this approach is taken to its full realization, you simply end up not needing ORM, it is an unnecessary layer of abstraction that contributes nothing to quality at any stage.
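As an illustration of what such a dictionary might contain, here is a hypothetical entry for one table, expressed as a PHP array. The format, field names, and options are all invented for the sketch; the real point is that schema, automations, constraints, and security live together in one declarative description:

```php
<?php
// Hypothetical data dictionary entry for one table. A builder
// program would read this to create the schema, generate the
// trigger code for the calculated column, and enforce security.
$dictionary['timeslips'] = array(
    'columns' => array(
        'id'       => array('type' => 'int',     'auto' => true),
        'employee' => array('type' => 'char',    'fk' => 'employees'),
        'date'     => array('type' => 'date',    'default' => 'today'),
        'hours'    => array('type' => 'numeric'),
        'rate'     => array('type' => 'numeric', 'source' => 'employees.rate'),
        'billing'  => array('type' => 'numeric', 'calc' => 'hours * rate'),
    ),
    'security' => array('insert' => 'staff', 'delete' => 'admin'),
);
```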

These ideas are implemented in my Andromeda framework. It is not the purpose of this blog to promote that framework, but it has been successfully used to produce the types of applications I describe on this blog. I make mention of it here for completeness.

So to conclude, both of these gentlemen are correct that the example says nothing about how the crucial SQLX_Insert() routine is coded, and I hope at least that this addendum fleshes this out and makes clear where it is different from ORM.

The Model Should Be Based On Classes

bewhite asks "Do you propose us to organize our applications in terms of tables and records instead of objects and classes?"

Yes. Well, maybe not you, but that's how I do it. I do not expect to reach agreement on this point, but here at least is why I do it this way:

  • My sphere of activity is business applications, things like accounting, ERP, medical management, job control, inventory, magazine distribution and so forth.
  • I have been doing business application programming for 15 years, but every program I have ever written (with a single recent exception) has replaced an existing application.
  • On every job I have been paid to migrate data, but the old program goes in the trash. Every program I have written will someday die, and every program written by every reader of this blog will someday die, but the data will be migrated again and again. (...and you may even be paid to re-deploy your own app on a new platform).
  • The data is so much more important than the code that it only makes sense to me to cast requirements in terms of data.
  • Once the data model is established, it is the job of the application and interface to give users convenient, accurate and safe access to their data.
  • While none of this precludes ORM per se, the dictionary-based approach described above allows me to write both procedural and OOP code and stay focused on what the customer is paying for: convenient, accurate and safe access.
  • The danger in casting needs in any other terms is that it places an architectural element above the highest customer need, which is suspect at best and just plain bad customer service at worst. We all love to write abstractions, but I much prefer the one that gets the job done correctly in the least time, rather than the one that, to me, appears to most in fashion.

Old Fashioned Technologies

More than one comment said simply that triggers and other server-side technologies "went out". Since I was there and watched it happen I would contend that when the web exploded a new generation came along with different needs. In particular the need for content and document management caused people to question all of the conventional uses of the SQL databases, and like all programmers they are convinced their world is the only world and all of the world, ergo, triggers are history because I don't use them. Nevertheless, those of us who continue to write business applications continue to use the technologies that worked well then and only work better now.

Ken Does Not Like OOP

I love OOP, especially for user interfaces. I just don't think it should own the domain model, and I don't think that "trapping" business logic inside of classes gives nearly the same independence as a data dictionary does. I've tried it both ways and I'll stick with the dictionary.

Any Use of OO Code Implies ORM

A few comments said outright that if you are using OOP code then you are by definition mapping. Technically this is untrue if you understand the use of the term "map" as opposed to "interface". Mapping is the process of creating a one-to-one correspondence between items in one group (the code) to items in the other (the database). A non-ORM interface is one in which any code, procedural or OOP, passes SQL and handles data without requiring a one-to-one mapping of tables or rows to classes or functions. My apps are not ORM because I have no such requirement that there be a class for every table, and no such requirement that there be any specific code to handle a table.

Don't Talk about Procedural Being Faster

At least three comments blasted this contention. To put things in context, performance in a database application goes in two stages. First and absolutely most critical is to be extremely smart about reducing database reads, since they are thousands of times slower than in-memory operations. However, once that is done, there is no reason to ignore speed improvements that can be gained by optimizing the code itself. The commenters are correct that this gain is of a very low order, but I would stand by the statement after making this contextual addendum.

Thank You All

This was the most widely read piece in this series, and definitely the most controversial. There will not likely be any other articles this controversial, as the main purpose of this essay was to provide regular readers with some background as to why they will not see ORM-based examples in future essays. Thanks for the comments!

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.
