Saturday, November 6, 2010

Recursive Queries with Common Table Expressions

This week The Database Programmer returns after almost 18 months with an entry on using Common Table Expressions (CTEs) to do recursive queries. Relational databases were plagued from their inception with a lack of meaningful treatment for recursive operations, and CTEs have finally plugged that hole.

Common Table Expressions appeared in SQL Server 2005, and in PostgreSQL 8.4, and are also available in Oracle. As for mySQL, since I don't use it, I did a quick Google search and looked at the Docs for 5.5, and couldn't really find anything. I generally tend to assume mySQL cannot do the tough stuff.

But First, A Rant

There have always been plenty of people who claimed SQL was a bloated and clumsy language to work with. Most of the time I tend to agree, but I find the advantages of relational/SQL system to be so large that I'm willing to pay that price.

But with Commom Table Expressions (CTEs) I just can't help drifting into conspiracy theories involving the enemies of SQL infiltrating the committees and deliberately suggesting the most twisted, bloated, complicated way they could think of to do what is really a very basic operation. In other words, I am profoundly unimpressed with the syntax of CTEs, but as long as they are here and they work, we'll go along.

The Basic Example

Your basic recursive table contains a foreign key to itself, so that some rows in the table are children of some other row in the table. This recursion can nest to any depth, and the chart below shows a very simple example:

Primary_key   |   Parent_Key  |  Notes  
--------------+---------------+---------------------------
     A        |     null      |   top level row, no parent
     B        |      A        |   first level child of A
     C        |      B        |   child of B, grandchild
              |               |   of A
     D        |      C        |   child of C, grandchild 
              |               |   of B, great-grandchild 
              |               |   of A
     X        |     null      |   top level row, no parent
     Y        |      X        |   child of X
     Z        |      Y        |   child of Y, grandchild 
              |               |   of X

What we want is a query that can return a given row and all of its children out to any level, including helpful information about the structure of the recursion, something like this:

Primary_key | Level | Top_Key | Immediate_Parent_key 
------------+-------+---------+-----------------------
     A      |   1   |  A      | null
     B      |   2   |  A      | A    
     C      |   3   |  A      | B   
     D      |   4   |  A      | C
     X      |   1   |  X      | null
     Y      |   2   |  X      | X   
     Z      |   3   |  X      | Y   

And Another Rant

At this point the mind boggles at how long this blog entry needs to be to explain this simple operation. But lets get going anyway.

The First Step and Last Step

A Common Table Expression begins with the "WITH" clause and ends with a standard SQL Select:

;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
  ....we'll get to this below....
)
select * from myCTEName

The basic idea is that we are going to define a CTE with a name and a list of columns, and then SELECT out of it. Without that final SELECT statement the CTE does not actually do anything. The SELECT can also be arbitrarily complex, doing aggregrations, WHERE clauses and anything else you might need.

The first thing to notice is the leading semi-colon. This is a trick adopted by MS SQL Server users. SQL Server does not require statements to be terminated with a semi-colon, but a SQL Server CTE requires the previous statement to have been terminated with a semi-colon (nice huh?). So SQL Server programmers adopted the strategy of starting the CTE with a semi-colon, which keeps the syntactical requirement with the CTE, where it belongs.

A given CTE sort of has a name. That is, you have to name it something, but think of it as a table alias in a SQL SELECT, such as "Select * from myTable a JOIN otherTable b...", it exists only during the execution of the statement.

The columns listed in the parantheses can have any names (at least in SQL Server). But these column names are what you will refer to in the final SQL SELECT statement.

Coding The Inside of the CTE, Step 1

Now we code the inside of the CTE in two steps. The first step is called the "anchor", and it is a straightforward query to find the top-level rows:

;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
    select primary_key   as primary_key
         , 1             as level
         , primary_key   as top_key
         , null          as immediate_parent_key
      from myRecursiveTable
     where Parent_key is null
)
select * from myCTEName

This should be self-explanatory, we are querying only for rows that have no parent (WHERE Parent_key is null) and we are hardcoding the "level" column to 1, and we are also hardcoding the "immediate_parent_key" column to null.

This query alone would return two of the rows from our desired output:

Primary_key | Level | Top_Key | Immediate_Parent_key 
------------+-------+---------+-----------------------
     A      |   1   |  A      | null
     X      |   1   |  X      | null

Coding The Inside of the CTE, Step 2

Now we are going to add the actual recursion. When I first learned CTEs this was the hardest part to figure out, because it turned out my hard-won set-oriented thinking was actually slowing me down, I had to think like a procedural programmer when defining the second half of the query.

;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
    select primary_key,1,primary_key,null
      from myRecursiveTable
     where Parent_key is null
    UNION ALL
    select chd.primary_key,par.level+1,par.top_key,chd.parent_key
      FROM myCTEName        par
      JOIN myRecursiveTable chd ON chd.parent_key = par.primary_key
)
select * from myCTEName

Thinking step-wise, here is what is going on under the hood:

  1. The server executes the "anchor" query, generating a result set called "myCTEName" containing just the top level rows.
  2. The server then executes the second query. At this point the result set "myCTEName" exists and can be referenced, so that you can link children to their parents. (That's why you see the JOIN)
  3. Step 2 is repeated recursively, adding grand-children, great-grand-children, and so on, until no more rows are being added, at which point it stops, and...
  4. The final result set is passed to the trailing SELECT, which pulls results out of "myCTEName" as if it were a table or view.

So when we code the 2nd part of the inside of the CTE, the part after the UNION ALL, act as if the first query has already run and produced a table/view called "myCTEName" that can be referenced. Once you understand that, the query is pretty easy to understand:

  • The "From myCTEName par" clause tells us we are pulling from the previously generated set. I like to use the alias "par" for "parent" to remind myself that the prior result is the parent row.
  • We then join to the original source table and use the alias "chd" to remind ourselves we are pulling child rows from there. The "ON chd.parent_key = par.primary_key" defines how children are joined to parents.
  • Our first column, "chd.primary_key", is the unique key for the results.
  • Our second column, "par.level+1" gives us a nifty automatically incremented "level" column.
  • Our third column, "par.top_key" ensures that all rows contain a reference to their top-most parent.
  • Our final column, "chd.parent_key", makes sure each row contains a reference to its immediate parent.

Finding Various Statistics

Once you have the inside of the CTE coded, the fun part moves to the final SELECT, which is operating on the complete set of results. You do not necessarily have to pull the complete list. For instance, you may want to find out the maximum nesting level for each parent, or the count of children for each parent:

;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
    select primary_key,1,primary_key,null
      from myRecursiveTable
     where Parent_key is null
    UNION ALL
    select chd.primary_key,par.level+1,par.top_key,chd.parent_key
      FROM myCTEName        par
      JOIN myRecursiveTable chd ON chd.parent_key = par.primary_key
)
select top_key
     , max(level) as maxNestingLevel
     , count(*)   as countRows
     , count(*)-1 as countChildren
  from myCTEName

Conclusion

Common Table Expressions give SQL-based databases the (very) long-needed ability to execute recursive queries, albeit with a rather complex syntax. Once you grasp the basics of how to code them, there are many possible uses that go far beyond the simple example shown here.

11 comments:

zippy1981 said...

Great to see you back from the dead. I've used CTEs in Microsoft SQL Server, but didn't know Oracle implemented them.

Jorge Monasterio said...

Are these CTE's efficient?

Anonymous said...

I have tried this on Firebird SQL, and it does not work, because firebird complains about cyclic dependencies (doh).

Seems like the approach is a good one, though not valid for all databases.

Anonymous said...

At last! I've been visiting for ages and waiting for a new article - thought you'd gone away for good (and after I'd recommended you to others). Welcome back.

KenDowns said...

Anonymous and zippy: thanks much for compliments, it feels good to be writing The Database Programmer again.

KenDowns said...

Jorge: I honestly don't know, I haven't run speed tests with them, and my current use is in ad-hoc analysis queries where I'm not terribly concerned with performance.

Perhaps you can run some tests and let us know :)

Joe said...

Belated welcome back!

I thought you would've tested with Postgres. With PostgreSQL 8.4.5, the query that works is as follows:

WITH recursive myCTEName (primary_key,level,top_key,immediate_parent_key) as

( select primary_key,1,primary_key,null::text from myRecursiveTable where Parent_key is null

UNION ALL

select chd.primary_key,par.level+1,par.top_key,chd.parent_key FROM myCTEName par JOIN myRecursiveTable chd ON chd.parent_key = par.primary_key)

select * from myCTEName;

In other words, you have to use "WITH RECURSIVE" (otherwise you get: There is a WITH item named "myctename", but it cannot be referenced from this part of the query.) and you have to cast the null to something compatible with parent_key.

Joe said...

To make the last query work on PostgreSQL, you need to add "GROUP BY top_key" at the end.

With all due respect, I find your explanation somewhat misleading. Logically, the entire UNION SELECT is done first, but for the initial iteration the SELECT FROM myCTEname returns zero rows. The DBMS then recurses over the results, again and again, using the previous set of rows as the basis for the next iteration.

Lukas Eder said...

Hi Ken,

+1 for your ranting about syntax ;-) I'd like to hear your opinion about the marvelous syntax of the do-it-all MERGE statement, as supported by various RDBMS such as DB2, HSQLDB, Oracle, SQL Server, Sybase.

On the other hand, a comparison with Oracle's much simpler and more intuitive CONNECT BY syntax might be interesting as well. Have you done a comparison before? If the CONNECT BY syntax can be fully mapped onto CTE's, then I am going simulate the CONNECT BY clause for all RDBMS that support CTE's in http://www.jooq.org.

Cheers
Lukas

Anonymous said...

For FireBird 2.5+ just alter
'WITH myCTEName'
to
'WITH RECURSIVE myCTEName'

Anonymous said...

I can't thank you enough for this explanation. I knew I needed recursive queries to get the top X results for all subsets of results, but I didn't understand the example query well enough to translate it to the task I was performing.