An impedance mismatch occurs when two devices are connected so that
neither operates at peak efficiency. The inefficiency is not due
to any intrinsic incompatibility between the devices; it appears
only when they are connected together the wrong way.
Object Relational Mapping (ORM) does not cure a pre-existing
impedance mismatch; it creates one, because it connects databases
to applications in a way that hobbles both.
UPDATE: There is a newer Historical Review of
ORM now available.
UPDATE: In response to comments below and on reddit.com,
I have a new post that gives a detailed
analysis of an algorithm implemented as a sproc, in app
code with embedded SQL, and in ORM.
Welcome to the Database Programmer blog. This blog is for anybody
who wants to see practical examples of how databases work and how
to create lean and efficient database applications.
This blog has two tables of contents, the
Topical Table of Contents and the list
of
Database Skills.
Good Marketing, Terrible Analogy
Normally I like to reserve this space for a positive presentation
of things that I have found to work; I don't like to waste time
ranting against things I don't like. However, ORM is so pervasive
in some circles that it is important to establish why you will
not see it used here, so as to avoid a lot of unnecessary chatter
about its absence.
The use of the term "impedance mismatch" is great marketing, worthy
of a certain software company in the Northwest of the US, but the analogy is
utterly wrong. The analogy is used incorrectly to imply an
intrinsic incompatibility, but the real irony is that
there is no such incompatibility, and if we want to use the analogy
we are forced to say that ORM is the impedance mismatch,
because it creates the inefficient connection.
It always comes back to the fact that
modern databases were designed to provide
highly reliable permanent storage, and they possess a slew of
features that promote that end. Programming languages on the
other hand are meant to process data in a stepwise fashion.
When the two of them meet it is very important to establish a
strategy that uses the strengths of both, instead of trying to
morph one into the other, which never yields efficient results.
The Very Basics
The language SQL is the most widely supported, implemented, and
used way to connect to databases. But since most of us have long
lists of complaints about the language, we end up writing abstraction
layers that make it easier for us to avoid coding SQL directly.
For many of us, the following diagram is a fair (if simplified)
representation of our systems:
This diagram is accurate for ORM systems, and also for non-ORM
systems. What they all have in common is that they seek to
avoid manually coding SQL in favor of generating SQL. All
systems seek to give the programmer a set of classes or functions
that makes it easy and natural to work with data without coding
SQL manually.
This brings us to a very simple conclusion: the largest part of
working out an efficient database strategy is working out a
good SQL generation strategy. ORM is one such strategy, but
it is far from the simplest or the best. To find the simplest
and the best, we have to start looking at examples.
First Example: Single Row Operations
Consider the case of a generic CRUD interface to a database
having a few dozen tables. These screens will
support a few single-row operations, such as fetching a row,
saving a new row, saving updates, or deleting a row.
We will assume a web form that has inputs with names that
begin with "inp_", like "inp_name", "inp_add1", "inp_city"
and so forth. The user has hit [NEW] on their AJAX form,
filled in the values, and hit [SAVE]. What is the simplest
possible way to handle this on the server? If we strip away
all pre-conceived ideas about the "proper" thing to do,
what is left? There are only these steps:
- Assemble the relevant POST variables into some kind
of data structure
- Perform sanity checks and type-validations
- Generate and execute an insert statement
- Report success or failure to the user
The simplest possible code to do this looks something
like this (the example is in PHP):
# This is a great routine to have. If you don't have
# one that does this, write it today! It should return
# an associative array equivalent to:
#    $row = array(
#        'name'=>'....'
#       ,'add1'=>'....'
#    )
# This routine does NOT sanitize or escape special chars
$row = getPostStartingWith("inp_");

# Get the table name.
$table_id = myGetPostVarFunction('table_id');

# Call the insert generation program. It should have
# a simple loop that sanitizes, does basic type-checking,
# and generates the INSERT. After it executes the insert
# it must catch database errors for reporting to the user.
#
if (!SQLX_Insert($table_id,$row)) {
    myFrameworkErrorReporting();
}
Without all of my comments the code is 5 lines! The
Insert generation program is trivial to write if you
are Using a Data Dictionary, and it is even more
trivial if you are using Server-side security and Triggers.
This is the simplest possible way to achieve the insert,
and updates and deletes are just as easy. Given how simple
this is (and how well it performs), any more complicated
method must justify itself considerably in order to be
considered.
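To make the generator side concrete, here is a minimal sketch of what a dictionary-driven insert generator might look like. It is not the actual SQLX_Insert() from the example above; the dictionary layout, names, and rules are all invented, and it is written in Python purely for brevity:

```python
# Hypothetical sketch of a dictionary-driven insert generator.
# The "data dictionary" is plain data; one generic routine loops
# over it to drop unknown inputs, truncate overlong values, and
# build a parameterized INSERT. All names here are illustrative.
DICTIONARY = {
    "customers": {
        "name": {"type": "char", "size": 40},
        "add1": {"type": "char", "size": 60},
        "city": {"type": "char", "size": 30},
    },
}

def insert_sql(table_id, row):
    """Return (sql, params) for a validated INSERT, or raise ValueError."""
    columns = DICTIONARY.get(table_id)
    if columns is None:
        raise ValueError("unknown table: " + table_id)
    cols, params = [], []
    for col, value in row.items():
        spec = columns.get(col)
        if spec is None:
            continue  # silently drop inputs that are not real columns
        cols.append(col)
        params.append(str(value)[:spec["size"]])  # truncate overlong values
    if not cols:
        raise ValueError("no recognized columns for " + table_id)
    names = ", ".join(cols)
    marks = ", ".join(["?"] * len(cols))
    return "INSERT INTO %s (%s) VALUES (%s)" % (table_id, names, marks), params
```

The point of the sketch is that a single generic routine, driven by data, serves every table; there is no per-table class to write or maintain.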
ORM cannot be justified in this case because it is slower
(objects are slower than procedural code), more complicated
(anything more than 5 lines loses), and therefore more
error-prone; worst of all, it accomplishes nothing that the
simple version does not already accomplish.
Objection! What About Business Logic?
The example above does not appear to allow for implementing
business logic, but in fact it does. The SQLX_Insert()
routine can call out to functions (fast) or objects
(much slower) that massage data before and after the
insert operation. I will be demonstrating some of these
techniques in future essays, but of course the best
performing and safest method is to use triggers.
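For instance, the call-out mechanism can be as simple as a registry of functions keyed by table name. This sketch (in Python for brevity; the registry and the sample rule are invented) shows a before-insert massage with no per-table class:

```python
# Hypothetical sketch of before-insert call-outs keyed by table name.
# An insert routine like SQLX_Insert() would run these on the row
# before generating the INSERT statement.
HOOKS = {
    "timeslips": [
        # invented rule: normalize the customer code
        lambda row: {**row, "customer": row["customer"].strip().lower()},
    ],
}

def run_before_hooks(table_id, row):
    # Apply each registered hook in order; tables with no hooks
    # pass through untouched.
    for hook in HOOKS.get(table_id, []):
        row = hook(row)
    return row
```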
Example 2: Processes, Or, There Will Be SQL
Many programmers use the term "process" to describe a
series of data operations that are performed together,
usually on many rows in multiple tables.
While processes are not common on a typical website,
they are plenty common in line-of-business applications
such as accounting, ERP, medical programs, and many
many others.
Consider a time entry system, where the employees in a
programming shop record their time, and once per week
the bookkeeper generates invoices out of the time slips.
When this is performed in SQL, we might first insert
an entry into a table of BATCHES, obtain the batch
number, and then enter a few SQL statements like
this:
-- Step 1, mark the timeslips we will be working with
UPDATE timeslips SET batch = $batch
 WHERE batch IS NULL;

-- Step 2, generate invoices from the marked timeslips
INSERT INTO invoices (customer,batch,billing,date)
SELECT customer,$batch,SUM(billing) as billing,NOW()
  FROM timeslips
 WHERE batch = $batch
 GROUP BY customer;

-- Step 3, mark the timeslips with their invoices
UPDATE timeslips
   SET invoice = invoices.invoice
  FROM invoices
 WHERE timeslips.customer = invoices.customer
   AND timeslips.batch = $batch;
While this example vastly simplifies the process, it
ought to get across the basic idea of how to code things
in SQL that end up being simple and straightforward.
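As a sanity check on the idea, here is the same three-step process as a runnable miniature against an in-memory SQLite database (Python used for brevity; the schema and figures are invented, and the last step uses a correlated subquery because older SQLite versions lack UPDATE ... FROM):

```python
# A runnable miniature of the batch process above, on invented data.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE timeslips (customer TEXT, billing REAL, batch INTEGER, invoice INTEGER);
CREATE TABLE invoices  (invoice INTEGER PRIMARY KEY,
                        customer TEXT, batch INTEGER, billing REAL, date TEXT);
INSERT INTO timeslips (customer,billing) VALUES ('acme',10),('acme',15),('zenco',20);
""")
batch = 1

# Step 1, mark the timeslips we will be working with
con.execute("UPDATE timeslips SET batch = ? WHERE batch IS NULL", (batch,))

# Step 2, generate one invoice per customer from the marked slips
con.execute("""INSERT INTO invoices (customer,batch,billing,date)
               SELECT customer, ?, SUM(billing), DATE('now')
                 FROM timeslips WHERE batch = ? GROUP BY customer""",
            (batch, batch))

# Step 3, write each invoice number back onto its timeslips
con.execute("""UPDATE timeslips
                  SET invoice = (SELECT invoice FROM invoices
                                  WHERE invoices.customer = timeslips.customer
                                    AND invoices.batch = timeslips.batch)
                WHERE batch = ?""", (batch,))
```

However many timeslips exist, the whole process is still three statements.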
Counter Example: The Disaster Scenario
The biggest enemy of any software project is success.
Code that works wonderfully on the developer's laptop
is suddenly thrown into a situation with datasets that
are hundreds of times larger than the test data. That is
when performance really matters. Processes that took
3 minutes on the laptop suddenly take 10 hours, and the
customer is screaming. How do these things happen?
Mostly they happen because programmers ignore the
realities of how databases work and try to deal with them
in terms they understand, such as objects or even
simple loops. Most often what
happens is that the programmer writes code that ends up
doing something like this:
foreach ($list_outer as $item_outer) {
    foreach ($list_inner as $item_inner) {
        // ...some database operation
    }
}
The above example will perform terribly because it executes a
round trip to the database server for every row instead of
working with sets. While nobody (hopefully) would
knowingly write such code, ORM encourages you to do this
all over the place, by hiding logic in objects that are themselves
instantiating other objects. Any code that encourages
you to go row-by-row, fetching each row as you need it
and saving rows one-by-one, is going to perform terribly
in a process.
If the act of saving a row causes the object to load more
objects to obtain subsidiary logic, the situation rapidly
deteriorates into exactly the code snippet above - or worse!
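The difference is easy to see in a toy example (Python and in-memory SQLite for brevity, data invented): the row-at-a-time loop issues one statement per row, while the set-based version issues exactly one.

```python
# Row-at-a-time vs. set-based processing. The statement count is a
# stand-in for round trips to a real database server.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE timeslips (id INTEGER PRIMARY KEY, batch INTEGER)")
con.executemany("INSERT INTO timeslips (id) VALUES (?)", [(i,) for i in range(100)])

# The ORM-ish shape: fetch each row, save each row -- 100 statements.
row_trips = 0
for (slip_id,) in con.execute("SELECT id FROM timeslips WHERE batch IS NULL").fetchall():
    con.execute("UPDATE timeslips SET batch = 1 WHERE id = ?", (slip_id,))
    row_trips += 1

# The set-based shape: the same work in a single statement.
set_trips = 1
con.execute("UPDATE timeslips SET batch = 2 WHERE batch = 1")
```

With a network between the application and the database, those 100 round trips, not the in-memory work, are what turn a 3-minute process into a 10-hour one.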
On a personal note, I have to confess that I am continually
amazed and flabbergasted when I see blog posts or hear
conversations in user groups about
popular CMS systems and web frameworks that will make
dozens of database calls to refresh a single page. A
seasoned database programmer simply cannot write such a
system, because they have habits and practices that
instinctively guard against such disasters.
The only possible explanation for these systems
is the overall newness of
the web and the extreme ignorance
of database basics on the part of the CMS and framework
authors. One can only hope the situation improves.
Sidebar: Object IDs are Still Good
There are some people who, like myself, examine how ORM
systems work and say, "no way, not in my code." Sometimes
they go so far as to refuse to use a unique
numeric key on a table, which some people call
an "Object ID", or OID for short.
But these columns are very useful for single-row operations,
which tend to dominate in CRUD screens (but not in
processes). It is a bad idea to use them as primary keys
(see A Sane Approach To Choosing Primary Keys), but they work
wonderfully in any and all single-row operations. They make it
much easier to code updates and deletes.
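A trivial sketch shows why (Python for brevity; the helper names are invented): with a composite business key, every key column must be supplied to target one row, while with an OID any update or delete takes exactly one parameter.

```python
# Hypothetical sketch: single-row operations keyed by an OID column
# versus a composite business key.
def delete_by_oid(table, oid):
    # One parameter, regardless of the table's business key.
    return "DELETE FROM %s WHERE oid = ?" % table, (oid,)

def delete_by_key(table, key):
    # key is a dict of business-key columns to values; every
    # column must be present to target a single row.
    where = " AND ".join("%s = ?" % col for col in key)
    return "DELETE FROM %s WHERE %s" % (table, where), tuple(key.values())
```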
Conclusions
The recurring theme of these essays is that you can write
clean and efficient code if you know how databases work on
their own terms. Huge amounts of application code can be swept
away when you understand primary keys and foreign keys and
begin to normalize your tables. The next step from there
is knowing how to code queries, but sooner or later you have to
grapple with the overall architecture. (Well supposedly you would
do that first, but many of us seem to learn about architectural
concerns only after we have coded long enough to recognize
them).
A thorough knowledge of database behavior tends to lead a person
away from ORM. First off, the two basic premises of ORM are
factually incorrect: One, that there is some native incompatibility
between databases and code, and two, that all the world must be handled
in objects. These two misconceptions themselves might be
excusable if they turned out to be harmless, but they are far from harmless.
They promote a willful ignorance of sound database practice,
and in so doing are bound to generate methods that are
inefficient at best and horrible at worst.
Overall, there are always simpler and better performing ways
to do anything that ORM sets out to achieve.
Next Essay: Performance on Huge Inserts
Addendum June 19, 2008
After reading the comments on the blog over the last few days
I have decided to put in this addendum rather than attempt to
answer each comment independently. I have attempted to answer
the objections in descending order of relevance.
The Example is Trivial or "Cheats"
This is a very compelling challenge to the article offered
by bewhite and
Michael Schuerig
and it deserves a meaningful response. What I want to do is
flesh out my approach and why I find it better than using
ORM. While I do not expect this to lead to agreement, I hope
that it answers their challenges.
- My sphere of activity is business applications, where
two dozen tables is trivial and the norm is for dozens or
hundreds of tables.
- When table counts grow beyond the trivial, many concerns come
into play that do not appear at lower table counts.
- I have found that a single unified description of the
database works best for these situations, provided it can
specify at very least schema, automations, constraints,
and security. This is what I refer to as the data
dictionary.
- The first use of the data dictionary is to run a "builder"
program that builds the database. This builder updates
schemas, creates keys and indexes, and generates trigger
code. The same builder is used for clean installs and
upgrades.
- The generated trigger code answers directly the challenges
as to how non-trivial inserts are handled. Downstream
effects are handled by the triggers, which were themselves
generated out of the dictionary, and which implement
security, automations, and constraints. No manual coding of
SQL routines, thank you very much.
- All framework programs such as SQLX_Insert() read the
dictionary and craft the trivial insert. The code does
what you would expect, which is check for type validity and
truncate overlong values (or throw errors). But it does
not need to know anything more than is required to generate
an INSERT; all downstream activity occurs on the server.
- The dictionary is further used to generate CRUD screens,
using the definitions to do such things as gray out
read-only fields, generate lookup widgets for foreign
keys, and so forth. This generated code does not
enforce these rules (the db server does that); it
simply provides a convenient interface to the data.
- A big consequence here is that there is no need for
one-class-per-table, as most tables can be accessed by
these 'free' CRUD screens.
- That leaves special-purpose programs where 'free' CRUD
screens don't reflect the work flow of the users. In a
business app these usually come down to order entry,
special inquiry screens and the lot. These can be
programmed as purely UI elements that call the same
simple SQLX_Insert() routines that the framework does,
because the logic is handled on the server.
- This approach is not so much about code reuse as code
elimination. In particular, the philosophical goal is to
put developer assets into data instead of code.
- When this approach is taken to its full realization, you
simply end up not needing ORM; it is an unnecessary layer
of abstraction that contributes nothing to quality at
any stage.
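As a rough illustration of the builder idea, a fragment that turns dictionary entries into DDL might look like this (Python for brevity; the dictionary layout and type mapping are invented, and a real builder would also diff the live schema and emit keys, indexes, and trigger code):

```python
# Hypothetical sketch of a "builder" fragment: one dictionary entry
# becomes one CREATE TABLE statement.
TYPE_MAP = {
    "char":  "VARCHAR(%(size)s)",
    "int":   "INTEGER",
    "money": "NUMERIC(12,2)",
}

def build_table_sql(table_id, columns):
    # columns maps column name -> spec dict, e.g. {"type": "char", "size": 40}
    defs = ", ".join(
        "%s %s" % (col, TYPE_MAP[spec["type"]] % spec)
        for col, spec in columns.items()
    )
    return "CREATE TABLE %s (%s)" % (table_id, defs)
```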
These ideas are implemented in my Andromeda framework. It is
not the purpose of this blog to promote that framework, but
it has been successfully used to produce the types of
applications I describe on this blog. I make mention of it
here for completeness.
So to conclude, both of these gentlemen are correct that the
example says nothing about how the crucial SQLX_Insert() routine
is coded, and I hope at least that this addendum fleshes this
out and makes clear where it is different from ORM.
The Model Should Be Based On Classes
bewhite asks "Do you propose us to organize our applications
in terms of tables and records instead of objects and classes?"
Yes. Well, maybe not you, but that's how I do it. I do not
expect to reach agreement on this point, but here at least is
why I do it this way:
- My sphere of activity is business applications,
things like accounting, ERP, medical management, job control,
inventory, magazine distribution and so forth.
- I have been doing business application programming for
15 years, but every program I have ever written (with a single
recent exception) has replaced an existing application.
- On every job I have been paid to migrate data, but the
old program goes in the trash. Every program I have written
will someday die, and every program written by every reader of
this blog will someday die, but the data will be migrated again
and again. (...and you may even be paid to re-deploy your own
app on a new platform).
- The data is so much more important than the code that it only
makes sense to me to cast requirements in terms of data.
- Once the data model is established, it is the job of the
application and interface to give users convenient, accurate
and safe access to their data.
- While none of this precludes ORM per se, the dictionary-based
approach described above allows me to write both procedural
and OOP code and stay focused on what the customer is paying
for: convenient, accurate and safe access.
- The danger in casting needs in any other terms is that it places
an architectural element above the highest customer need, which
is suspect at best and just plain bad customer service at
worst. We all love to write abstractions, but I much prefer
the one that gets the job done correctly in the least time,
rather than the one that appears, to me, to be most in fashion.
Old-Fashioned Technologies
More than one comment said simply that triggers and other
server-side technologies "went out". Since I was there and
watched it happen I would contend that when the web exploded
a new generation came along with different needs. In particular
the need for content and document management caused people to
question all of the conventional uses of the SQL databases, and
like all programmers they are convinced their world is the only
world and all of the world, ergo, triggers are history because
I don't use them. Nevertheless, those of us who
continue to write business applications continue to use the
technologies that worked well then and only work better now.
Ken Does Not Like OOP
I love OOP, especially for user interfaces. I just don't think
it should own the domain model, and I don't think that
"trapping" business logic inside of classes gives nearly the
same independence as a data dictionary does. I've tried it
both ways and I'll stick with the dictionary.
Any Use of OO Code Implies ORM
A few comments said outright that if you are using OOP code then
you are by definition mapping. Technically this is untrue if
you understand the use of the term "map" as opposed to
"interface". Mapping is the process of creating a one-to-one
correspondence between items in one group (the code) to items
in the other (the database). A non-ORM interface is one
in which any code, procedural or OOP, passes SQL and
handles data without requiring a one-to-one mapping of tables
or rows to classes or functions. My apps are not ORM because
I have no such requirement that there be a class for every
table, and no such requirement that there be any specific code
to handle a table.
Don't Talk about Procedural Being Faster
At least three comments blasted this contention. To put things
in context, performance in a database application goes in
two stages. First and absolutely most critical is to be
extremely smart about reducing database reads, since they are
thousands of times slower than in-memory operations. However, once
that is done, there is no reason to ignore speed improvements
that can be gained by optimizing the code itself. The commenters
are correct that this gain is of a very low order, but I would
stand by the statement after making this contextual addendum.
Thank You All
This was the most widely read piece in this series, and
definitely the most controversial. There will not likely be any other
articles this controversial, as the main purpose of this essay
was to provide regular readers with some background as to why
they will not see ORM-based examples in future essays. Thanks
for the comments!