Thursday, December 16, 2010

Critical Analysis of an Algorithm: Sproc, Embedded SQL, and ORM

This is a follow-up to yesterday's historical perspective on ORM. In this essay we examine a particular class of business logic and ask what happens if we go server-side, embedded SQL, or ORM.

This blog has two tables of contents, the Complete Table of Contents and the list of Database Skills.

Processes

We are going to look at a process. The term is not well defined, but my own working definition is any operation that has as many of the following properties as I seem to think are important at the time I make the decision:

  • Is categorically not CRUD: requires much more than displaying data to user or sending a single-row operation to the server
  • Involves reading or writing many rows
  • Involves reading or writing from multiple tables
  • Involves multiple passes of the same data
  • Involves no user interaction while executing
  • If not coded to be idempotent can cause huge headaches
  • Will likely take longer than a user is willing to wait (these days 3-10 seconds) and so runs in the background.
  • Depends upon rules tables that control its behavior

A Particular Process: Magazine Regulation

I have a more complete description of this problem here, so this is going to be very short. The system handles magazine deliveries to stores. The shop running the system has thousands of stores and thousands of magazines. Every store has a default quantity of the particular magazines they carry. For one particular magazine, NewsTime, there are 1000 stores that get an average default quantity of 50, requiring 50,000 magazines each weak.

Here is the twist. You never get exactly 50,000, no matter what you tell the distributor. Some weeks you get 45,000, others you get 55,000, with any variation in between. So the Regulation Process adjusts the defaults for each store until the delivery amounts equal the on-hand total that was delivered on the truck.

The Naive or Simple Algorithm

In the first pass, we are going to consider an unrealistically simple version of Magazine Regulation, where we have too many magazines and must up the quantities until we're giving out the entire amount on-hand.

Assume a table has already been populated that has the default quantities for each store, where the relevant columns for the moment would be these:

 StoreId   |  MagazineId   |  QTY_DEFAULT | QTY_REGULATED 
-----------+---------------+--------------+---------------
    1      |      50       |      75      |      0        
    2      |      50       |      23      |      0        
    4      |      50       |      48      |      0        
   10      |      50       |      19      |      0        
   17      |      50       |     110      |      0        
   21      |      50       |      82      |      0        

We are told only that the increases must be evenly distributed, we can't just ship 5000 extra magazines to a single store. That makes sense. A simple algorithm to do this would be:

  1. Give each store one additional magazine until you run out of magazines or run out of stores.
  2. Repeat step 1 until you run out of magazines.

The Pseudo-Code

Let's mock something up in pseudo-code that shows the structure of the solution:

magsToDeliver = get total magazines...
magsDefault   = get total of defaults...

-- Outer loop implements rule 2:
-- "repeat until you run out of magazines"
while magsToDeliver > magsDefault {

   -- Inner loop implements rule 1:
   -- "increase each store by 1 until you
   --  run out of stores or run out of magazines"
   for each store getting this magazine {
        if magsToDeliver <= magsDefault break
   
        -- If you want to go nuts, and even allow
        -- the accidental simultaneous execution
        -- of two routines doing the same thing, 
        -- put these lines here instead
        magsToDeliver = get total magazines...
        magsDefault   = get total of defaults...
   
        -- This is the actual job getting done
        qty_regulate  +=1
        magsToDeliver -=1
   }
}

The Three Methods

Let's characterize what happens with our three choices of stored procedure, embedded SQL, or ORM.

Stored Procedure. Likely the fastest solution, considering all of that row-by-row going on. If that were in app code (ORM or not) we would be making two round trips to the server per iteration.

The really crazy thing about the stored procedure solution is that it is utterly neutral to the ORM question. The entire ORM good-bad debate dissolves because there is no OOP code involved. So this could be the magic solution that ORM lovers and haters could both agree upon.

App Code With Embedded SQL (no ORM). Just about everybody hates this idea these days, but it should be here for completeness, and because there are some advantages. The top of the pseudo-code requires to aggregate pulls, and if you are not afraid of SQL you can pull down the result in one pass, instead of iterating on the client. Further, the innermost operation can be coded in SQL as a "UPDATE deliveries from (SELECT TOP 1 deliverid From deliveries...)" so that you get only one round trip per iteration, where ORM will cost two.

Any Kind of ORM. By this I mean the code will contain no SQL, and the innermost loop will likely instantiate some "delivery" objects, one after another, increment their qty_regulated property by one, and flush them out. This is twice as expensive as embedded SQL because you have to fetch the row from the database and then write it back, where the embedded SQL can issue a single command that locates and updates the row in a single statement.

Some may argue that I misunderstand ORM here, in that the library may be smart enough to allow the write without the read, and without forcing you to embed SQL. It would have to be something like A) instantiate empty object with key, B) assign value as an expression, like "+=1", C) save. I welcome any such examples and will update the post accordingly if any are forthcoming. I am assuming that no ORM tool I have seen can do this and would be happy to be corrected.

If the ORM forces us to calculate the initial sum of QTY_Default by fetching each row as an object and summing them in the app, we get an extra complete set of round trips. Bummer. But if we say, "Hey my ORM tool lets me embed SQL in *emergencies*" then perhaps we can embed a query with an aggregrate and skip that cost. But oops, we've got some embedded SQL. Leaky abstraction.

The Score So Far

So how does it look so far? All methods have the same number of reads and writes to disk, so we are scoring them on round trips. If "X" is the number of extra magazines to be distributed, and "Y" is the number of stores getting the magazine, we have for round trips:

  • Stored Procedure: 1
  • Embedded SQL: X + 1 (the first pull plus one trip per extra copy of the magazine)
  • ORM, Hypothetical:X + 1 (if the ORM tool can figure out how to do an update without first reading the row to the app)
  • ORM, Best Case: 2X + 1 (if the first pull can be an aggregrate without embedding SQL, and two round trips per iteration)
  • ORM, Worst Case:2X + Y (if the first pull must aggregate app-side and there are two round trips per iteration)

Update: if you want a laugh, check out the image on the Wikipedia page for "Business Logic", it depicts aggregation occuring on the client side.

This gives us the shape of the engineering decision. With all options reading and updating the same number of rows, it call comes down to round trips. As soon as you go client side your round trips go way up, and if your ORM tool does not support Update without Select, then it doubles from there.

Now multiply this across the entire application, every single action in code, every bit of "business logic" with any kind of loop that iterates over rows.

It Gets Worse/Better: Considering SQL Possibilities

If you happen to know much about modern SQL, you may be aware of the amazingly cool SQL RANK() function. If this function is used in the sproc or embedded SQL approaches, you can execute the algorithm with only one loop, in a maximum of N=CEILING((Delivered-Regulated)/Stores) iterations. This will go much faster than the row-by-row, and now those two options are pulling even further ahead of the naive row-by-row methods encouraged by an ORM tool.

This ability of SQL will become extremely important, as we are about to blow apart the simplicity of the algorithm.

We Now Return You To The Real World

I have never been paid good money to write an algorithm as simple as the one described above. This is because mature businesses have always refined these simple methods for years or decades, and so the real situation is always more complex.

In a real magazine regulation algorithm, the rules tend to be more like this:

  1. Apply all of these rules whether you are increasing or decreasing the amounts to deliver
  2. Stop applying the rules when delivery amounts have been balanced to what we have on hand, no matter where you are in the process
  3. Always increase/decrease any particular store by exactly one on each iteration, no matter which rule you are working on
  4. Never decrease any store below 2
  5. Decrease any store whose past 3 issues sold less than 60% by 1, unless this would project their sales of this issue above 60%, and prioritize by prior sales % ascending.
  6. If the previous step completes, and we are short of magazines decrease each store by 1 by order of previous sales percents ascending. Repeat until we are in balance.
  7. If all stores are reduced to 2 and we are still short, abort and write error to log.
  8. If after the decreases we have excess magazines, increase any store whose past 3 issues sold more than 70% by 1, unless this would reduce their projected sales of this issue below 70%, and prioritize by prior sales % descending (so the stores with the most sales are handled first in case we don't get to all of them)
  9. If the previous step completes, and we are still in excess, increase each store by 1 in order of previous sales percents descending. Repeat until we are in balance.

This can also get doubled again if we must implement one set of rules when we start out with a too few magazines, and another set of rules when we start out with too many.

Well, it's not that hard. It actually comes down to having four outer loops in succession. The percent-based reduction, then the by 1 reduction, then the percent-based increase, then the by 1 increase.

But technically the more important matters are these:

  • We now have to grab the sales % for every store for this magazine on their past 3 issues and keep it handy throughout the routine.
  • The rules stated above contain constants like 70%, 60%. These would properly be in some parameter table to allow the user to control them, so those have to be fetched.
  • The loop through the stores is now much different, as we are filtering on prior sales percent for the percent-based passes, and filtering and ordering on prior sales percent for the by 1 passes.

Revisiting the Three Approaches

Now let's see how our three approaches would change.

The Improved Stored Procedure. If we change the sproc to use RANK() and make batch updates, we would pull the prior sales percents into a temp table and apply a covering index to cut the reads from that table in half. Our main loop would then simply join to this temp table and use it for both filtering and ordering.

The Embedded SQL. If we also changed the embedded SQL so it was making batch updates with RANK(), we would also generate a temp table. This option remains the same as the sproc except for where we put the SQL. However, it now has far fewer round trips, and starts to look much more like the sproc in terms of performance.

The ORM Approach. The idea here would be to get those prior sales percents down into an ordered collection and then use them as the basis for the loops. The thorny part is that they must be aggregated and sorted. If we want to avoid all embedded SQL, then the aggregation can be done client-side if we don't mind pulling down 3 times as many rows as are required. The sorting we can pull off if we put the objects into a collection such as an associative array, where the key is the sales percent, then we can use [language of choice]'s built-in sorting (hopefully), and we have escaped the dread evil of embedded SQL.

So we end up where we were, only more so. The sproc remains the fastest, and if we know how to code set-oriented nifty stuff with RANK() then the embedded SQL will run in almost the exact same time. The ORM requires most likely even more round trips and expensive app-side operations that are performed much more efficiently in the db server, unless we are willing to break the abstraction and embed a bit of SQL.

But in the end, if all of that cost of the ORM kicks a 3 second routine to 7 seconds, still well below what any user would notice, and you avoid embedded SQL, and it lets you keep your paradigm, who am I to judge?

Conclusions

I offer none. There are so many conventions in play regarding where to put your code, what tool you are already using, and so forth, that it is really up to the reader to draw conclusions. I only hope there is enough information here to do so.

11 comments:

Anonymous said...

On the topic of ORMs, I'm not very familiar with LINQ, but as far as I know LINQ is moving towards a more set-oriented approach. It is sophisticated enough to translate the LINQ syntax into set-oriented SQL, at least in some cases.

Which is well and good, I suppose, except that you have to learn the LINQ syntax (which is just as complex and proprietary as any SQL dialect), and still you have to check that it generates the desired (set-based) SQL, and try to modify your LINQ query in various ways to get it to behave the way you want.

In other words, you have to know both LINQ and SQL; your LINQ statements become embedded into your code just like embedded SQL would, and you can never be quite certain what is happening under the covers.

That, plus LINQ (to SQL) only supports Microsoft SQL Server.

Adam said...

By way of addressing Anonymous's comments regarding LINQ, LINQ itself is not a database interface provider. LINQ is a combination of a set of programming interfaces (IQueryable, for example) and language sugar (from clauses in C# and VB.NET, for example) that provide a generic way of dealing with sets of objects within the language, whether those sets are in-memory collections (by way of LINQ-to-Objects) or database providers (by way of LINQ-to-SQL, LINQ-to-Entities, LINQ-to-NHibernate, etc.)

Your comments are relevant to LINQ-to-SQL (since it is, by default, only compatible with SQL Server), but L2S has basically reached its end of life; it was only there as an intermediate layer before the Entity Framework (which is entirely database agnostic) was ready, which, with the introduction of EF4 in .NET 4, it basically is.

You're right that you need to know SQL with LINQ, just as you should know SQL regardless of the mechanism you're using to query. If you don't know SQL, you shouldn't be interacting with an RDBMS, in my opinion. The advantage that LINQ gives is compile-time safety. Hard-code your SQL statements and all of your debugging is done at runtime. Use LINQ with an ORM (that is up-to-date with the schema), and you're going to get SQL that is syntactically--if not necessarily semantically--correct.

Anonymous said...

@Adam: "LINQ is a combination of a set of programming interfaces and language sugar that provide a generic way of dealing with sets of objects within the language."

Yes, and I can see some benefit in a common way of querying, say, objects, XML files and database tables using the same syntax/API. However, for the typical data-centric business application, querying a database is by far the most common operation, which kind of forces you (or should force you) to be proficient with SQL anyway.

"If you don't know SQL, you shouldn't be interacting with an RDBMS, in my opinion."

Amen, brother! :-)

I think the author of this blog said it well in this post: http://database-programmer.blogspot.com/2010/11/prepare-now-for-possible-future-head.html

"The ORM meme complex instructs its victims that it is ok to put structured atomic values into a Relational Database, but when it comes time to access and use that data we will pretend we did not put it into tables and we will pretend that the data is in objects. In this sense the term Object-Relational Mapping is a complete misnomer, because the point is not to map data to objects but to create the illusion that the tables do not even exist.

Perhaps ORM should stand for Obscuring Reality Machine."

MuJiang said...

Hi Ken,

Could you show us your amazingly cool RANK() detail SQL?
I may build my Rank() SQL later.

I presented ORM Performance Anti-Patterns to database developers many times, but your view point is really new and practical to me.
Wonderful post, more questions coming later.

Thanks,
Charlie

ice said...

Very informative article. I would like to believe that any developer worth his/her salt, knows that the sproc approach will more often than not (if not always) execute faster so at the end of the day, it's more about what approach allows a developer to deliver working (scalable, fast etc etc) as soon as is possible i.e. what are they the most comfortable with.

MichaƂ said...

Can you expand a bit on why you're assuming you need to flush the changes to the db on every iteration? I guess that is to cover running multiple such operations at the same time, but is this really a valid scenario?

And if it is in fact necessary and you can run all these operations in the same process (separate threads) then it should be possible to have ORM share the same objects (store info) across different threads (operations).

Add some thread synchronization and you should be able to write the loop with no round trips at all.

Jim LoVerde said...

There are some advances in ORMs that you might not be aware of in your analysis. For example, Hibernate can perform DML to achieve the same X+1 interactions as your embedded sql example (update ... where ...)

http://docs.jboss.org/hibernate/core/3.6/reference/en-US/html/batch.html#batch-direct

And I'm pretty sure you could use the same RANK approach in DML as well for your more complicated algorithm.

mcekovic said...

This post is vary biased for several reasons:
- Good ORM with good mapping will need just two round-trips to the above problem: one to fetch the data (using eager pre-fetching of collections) and one to update the data (using batch updates feature of the data access layer)
- The ORM may even not need to fetch the data as it is most probably already cached (in the shared cache)
- The example used in this article is chosen to show ORM is slow, but on the contrary, in average, ORM is faster then stored procedures, as complex business logic expressed in SQL and stored procedures requires much more SQL parsing (soft or hard which eats CPU as cakes, been this, seen this) and much more disk read access. This is because ORM can do better job in caching then database, as ORM is handling objects which are closer model to business reality and much more expressive then tables, so believe it or not ORM is closer to the data then Stored procedures and SQL, a couple of method calls far, or if you prefer, a couple of CPU instructions far!

mdanielov said...

great debate, but like many debates on ORM it is only sprinkled with performance concerns, whereas I think performance should be a big consideration. In my imagination a system should be judged by its ability to implement the requested business logic faithfully, and by its ability to do it in a timely manner. While most ORMs address the first concern, the latter is often left out in the cold.

Also, the performance issue can be often cast in a wrong light. I would like to emphasize that a developer needs to have a clear understanding of what the capabilities of components at his/her disposal are. Sometimes it makes sense to use SQL server horsepower, sometimes it makes sense to perform operations in the code and just to use DB structures for persistence. One good rule of thumb that I use is that any set based operations are best performed by SQL engine, and any procedural calculation are best delegated to code. but that too is not a definitive rule. Each case demands individual attention.

CONCORDIA Programming said...

Very informative debate, nice post. Thanks for sharing

Anonymous said...

There are also alternatives to ORM, LINQ or 100% SQL approaches, such as this CodeFluent Entities product http://www.softfluent.com/products/codefluent-entities which tries to build a gap between pure OO design and pure SQL design. It's model-first - of course, you always have to start somewhere :-)