Monday, November 29, 2010

Loops Without Cursors

Looping Without Cursors

Sometimes you need to process a table row-by-row, and the established approach is to use cursors, which are verbose, slow, and painful to code and use.

The Cursor Example

Here is the basic minimum syntax required to loop through a table and get something done. The SQL flavor is MS SQL Server, but its not much better in any other flavor.

-- I coded this off the top of my head, there
-- may be a minor syntax error or two

-- Most of this is pseudo-code, but take
-- note that it is ordered on column1
declare someCursorName cursor for
 select column1, column2, column3 
   from anyTable
  ORDER BY column1

-- Have to do this now
open someCursorName

-- Now you need to declare some variables
-- For the example I'm just making everything int
declare @column1 int
      , @column2 int
      , @column3 int

-- Gosh, we're actually about to start the loop!  Finally!
fetch next from someCursorName into @column1,@column2,@column3
while @@fetch_status = 0 begin

   --  If you still remember what you actually wanted
   --  to do inside the loop, code it here:

-- Repeat this line from the top here again:
fetch next from someCursorName into @column1,@column2,@column3
end

-- Not done yet, these two lines are crucial
close someCursorName
deallocate someCursorName

Call me petty, but what I hate about that code is that I have to refer to specific columns of interest 3 times (not counting the declarations). You refer to them in the cursor declaration and in the two FETCH commands. With a little clever coding, we can vastly simplify this and do it only once.

Using An Ordered Column

We can execute the same loop without the cursor if one of the columns is ordered and unique. Let us say that column1 is the primary key, and is an auto-incremented integer. So it is ordered and unique. The code now collapses down to:

-- I coded this off the top of my head, there
-- may be a minor syntax error or two

-- We can't get around declaring the vars, so do that
declare @column1 int
      , @column2 int
      , @column3 int

-- If you know a safe value for initialization, you
-- can use the code below.  If this is not 100% 
-- safe, you must query for the value or it must
-- be supplied from some other source
set @column1 = -1

-- BONUS POINTS: Can this become an infinite loop?
while 1 = 1 begin

-- Now we code the query and exit condition
 select TOP 1
        @column1 = column1
      , @column2 = column2
      , @column3 = column3 
   from anyTable
  WHERE column1 > @column1  -- this is what advances the loop
  ORDER BY column1

if @@rowcount = 0 begin
    break
end

        -- Put the actions here        

end

Final Notes

The only requirement for this approach is that you have a unique ordered column. This usually means a unique key or primary key. If "column1" is not unique, the loop will skip all but the first value in each group.

Also, it is very nice if you know a safe value to use as an initializer. Without that, you must query for the minimum value that matches the condition and then decrement it by one.

Finally, can this loop become infinite? No. Well, if, in the extremely unlikely situation that rows are being added to the base table faster than you are processing them, then yes, it could go on for a very long time. But if that were happening I'd say there was a separate problem to look at.

It should probably go without saying, but if the particular loop is going to happen very often, the table should be indexed on your unique ordered column. If it is a primary key or you already have a unique constraint it is not necessary to create an index explicitly because there will be one as part of the key or constraint.

10 comments:

Anonymous said...

Wow, a lot simpler with Oracle cursors:

for curs in
(select column1, column2, column3
from anyTable
ORDER BY column1
)
loop
-- code what you want to do here
-- using curs.column_1, curs.column_2 and curs.column_3
end loop;

-- job done, cursor closed.

Anonymous said...

Your approach is horrible and typical of programmers who care about the "write", which happens once, and not the "run", which happens over and over again.

Executing a SELECT requires a round trip to the database and some work on the DBs part. With a cursor, the DB does one execute, plus with a correct fetch size (like 100) you get rid of almost all the round trips.

Multiplying response time by 100 to save a few lines of code. Geez!

P.S. the comment above refers to implicit PL/SQL cursors, which silently use a fetch size of 100.

Anonymous said...

Some co-workers of mine were curious, so they did some comparisons using the code you wrote.

As long as the looping column is an indexed (PK) identity, performance was pretty much identical to that of a cursor. They tested using small data sets and huge data sets.

If the looping column is an identity, but not indexed, it takes several orders of magnitude longer.

So it seems that it's a matter of personal preference. There are very few times in our business where we must have a cursor, so it's not really a big problem.

I don't find cursors to be as verbose, slow, and painful as you describe, so I'd probably go with a cursor since it's "established" as you say -- i.e. programmers will know how to use it without explanation. Plus, you don't gain any speed with your approach.

KenDowns said...

Anonymous #2: I'm glad you did the follow-up tests so I did not have to :) When last I tested this the loop was considerably faster than the cursor, but that was *years* ago and who knows what has shifted since then. It should also go without saying that any access of many rows on an unindexed column is going to be slow no matter how you do it. If this is a regular event, the column should be indexed.

As for Anonymous #1, filtering out the bile, he seems to be saying it will be slower, which we know it is not. Not sure what he means by "round trips" since this is server-side execution in both cases. Probably means that the cursor will pre-fetch and there is no independent execution of the command on each loop, but as Anonymous #2 verified, that is not an issue.

cypressx said...

Hi,

I have used this approach and it had significant performance improvement. But, when I moved the Query code into a Stored Procedure or Table Valued Function the whole performance drops drastically and it works times slower than the solution with the cursor.
The query is not a complex one and is using few parameters.

Do you have any idea, why this is happening? I think, there are some side effects in this optimizations and therefore it should be avoided.

Thanks,
Yosif

P.S. A colleague of mine advised, that the problem may be coming from the Parameters of the Stored Procedure and the SQL Server couldn't correctly handle when to do INDEX Searches and does only TABLE Scans, but I've compared both Execution plans (of the Query and the Stored Procedure) and they are both the same.

nyetter said...

Are you sure there's no better syntax for this in Microsoft? My first reaction was the same as the first Anonymous poster: this is easy and concise in Oracle.

Also, you asserted without argument that cursors are slow, and then presented an alternative that intuitively will be much slower. I think you need to present your case.

Joe said...

"Call me petty, but what I hate about that code is that I have to refer to specific columns of interest 3 times (not counting the declarations)."

I'll call you petty because (a) it's MS SQL Server specific syntax but more importantly (b) "@column1" is NOT the same thing as "column1", i.e., "@column1" could be "@c1" or "@host_variable1" or anything else you'd fancy.

That said, the "fetch ... end" syntax requiring you to repeat the opening "fetch" is truly atrocious language design. It's hard to imagine a worse design: think about fetching tens of columns.

The PostgreSQL approach is much simpler, either Embedded, PL/pgSQL or interactive SQL. The second example here shows what to do in ECPG.

ml_black said...

Possibly a small typo that is causing some people to question that this is faster. In your SELECT TOP 1... statement, I don't believe you need the ...ORDER BY clause, since you are only selecting a single record. (You do, of course, need the sort in the SELECT that occurs prior to that i.e. the resultset that you will loop through.) In my tests, removing that sort led to *extremely* significant improvement in the timings of the stored proc I was working with, which originally had three cursor loops.

ml_black said...

P.S. Thanks!

Anonymous said...

Looks more like a Database hack than a concrete solution