Saturday, November 6, 2010

Recursive Queries with Common Table Expressions

This week The Database Programmer returns after almost 18 months with an entry on using Common Table Expressions (CTEs) to do recursive queries. Relational databases were plagued from their inception with a lack of meaningful treatment for recursive operations, and CTEs have finally plugged that hole.

Common Table Expressions appeared in SQL Server 2005, and in PostgreSQL 8.4, and are also available in Oracle. As for mySQL, since I don't use it, I did a quick Google search and looked at the Docs for 5.5, and couldn't really find anything. I generally tend to assume mySQL cannot do the tough stuff.

But First, A Rant

There have always been plenty of people who claimed SQL was a bloated and clumsy language to work with. Most of the time I tend to agree, but I find the advantages of relational/SQL system to be so large that I'm willing to pay that price.

But with Commom Table Expressions (CTEs) I just can't help drifting into conspiracy theories involving the enemies of SQL infiltrating the committees and deliberately suggesting the most twisted, bloated, complicated way they could think of to do what is really a very basic operation. In other words, I am profoundly unimpressed with the syntax of CTEs, but as long as they are here and they work, we'll go along.

The Basic Example

Your basic recursive table contains a foreign key to itself, so that some rows in the table are children of some other row in the table. This recursion can nest to any depth, and the chart below shows a very simple example:

Primary_key   |   Parent_Key  |  Notes  
--------------+---------------+---------------------------
     A        |     null      |   top level row, no parent
     B        |      A        |   first level child of A
     C        |      B        |   child of B, grandchild
              |               |   of A
     D        |      C        |   child of C, grandchild 
              |               |   of B, great-grandchild 
              |               |   of A
     X        |     null      |   top level row, no parent
     Y        |      X        |   child of X
     Z        |      Y        |   child of Y, grandchild 
              |               |   of X

What we want is a query that can return a given row and all of its children out to any level, including helpful information about the structure of the recursion, something like this:

Primary_key | Level | Top_Key | Immediate_Parent_key 
------------+-------+---------+-----------------------
     A      |   1   |  A      | null
     B      |   2   |  A      | A    
     C      |   3   |  A      | B   
     D      |   4   |  A      | C
     X      |   1   |  X      | null
     Y      |   2   |  X      | X   
     Z      |   3   |  X      | Y   

And Another Rant

At this point the mind boggles at how long this blog entry needs to be to explain this simple operation. But lets get going anyway.

The First Step and Last Step

A Common Table Expression begins with the "WITH" clause and ends with a standard SQL Select:

;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
  ....we'll get to this below....
)
select * from myCTEName

The basic idea is that we are going to define a CTE with a name and a list of columns, and then SELECT out of it. Without that final SELECT statement the CTE does not actually do anything. The SELECT can also be arbitrarily complex, doing aggregrations, WHERE clauses and anything else you might need.

The first thing to notice is the leading semi-colon. This is a trick adopted by MS SQL Server users. SQL Server does not require statements to be terminated with a semi-colon, but a SQL Server CTE requires the previous statement to have been terminated with a semi-colon (nice huh?). So SQL Server programmers adopted the strategy of starting the CTE with a semi-colon, which keeps the syntactical requirement with the CTE, where it belongs.

A given CTE sort of has a name. That is, you have to name it something, but think of it as a table alias in a SQL SELECT, such as "Select * from myTable a JOIN otherTable b...", it exists only during the execution of the statement.

The columns listed in the parantheses can have any names (at least in SQL Server). But these column names are what you will refer to in the final SQL SELECT statement.

Coding The Inside of the CTE, Step 1

Now we code the inside of the CTE in two steps. The first step is called the "anchor", and it is a straightforward query to find the top-level rows:

;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
    select primary_key   as primary_key
         , 1             as level
         , primary_key   as top_key
         , null          as immediate_parent_key
      from myRecursiveTable
     where Parent_key is null
)
select * from myCTEName

This should be self-explanatory, we are querying only for rows that have no parent (WHERE Parent_key is null) and we are hardcoding the "level" column to 1, and we are also hardcoding the "immediate_parent_key" column to null.

This query alone would return two of the rows from our desired output:

Primary_key | Level | Top_Key | Immediate_Parent_key 
------------+-------+---------+-----------------------
     A      |   1   |  A      | null
     X      |   1   |  X      | null

Coding The Inside of the CTE, Step 2

Now we are going to add the actual recursion. When I first learned CTEs this was the hardest part to figure out, because it turned out my hard-won set-oriented thinking was actually slowing me down, I had to think like a procedural programmer when defining the second half of the query.

;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
    select primary_key,1,primary_key,null
      from myRecursiveTable
     where Parent_key is null
    UNION ALL
    select chd.primary_key,par.level+1,par.top_key,chd.parent_key
      FROM myCTEName        par
      JOIN myRecursiveTable chd ON chd.parent_key = par.primary_key
)
select * from myCTEName

Thinking step-wise, here is what is going on under the hood:

  1. The server executes the "anchor" query, generating a result set called "myCTEName" containing just the top level rows.
  2. The server then executes the second query. At this point the result set "myCTEName" exists and can be referenced, so that you can link children to their parents. (That's why you see the JOIN)
  3. Step 2 is repeated recursively, adding grand-children, great-grand-children, and so on, until no more rows are being added, at which point it stops, and...
  4. The final result set is passed to the trailing SELECT, which pulls results out of "myCTEName" as if it were a table or view.

So when we code the 2nd part of the inside of the CTE, the part after the UNION ALL, act as if the first query has already run and produced a table/view called "myCTEName" that can be referenced. Once you understand that, the query is pretty easy to understand:

  • The "From myCTEName par" clause tells us we are pulling from the previously generated set. I like to use the alias "par" for "parent" to remind myself that the prior result is the parent row.
  • We then join to the original source table and use the alias "chd" to remind ourselves we are pulling child rows from there. The "ON chd.parent_key = par.primary_key" defines how children are joined to parents.
  • Our first column, "chd.primary_key", is the unique key for the results.
  • Our second column, "par.level+1" gives us a nifty automatically incremented "level" column.
  • Our third column, "par.top_key" ensures that all rows contain a reference to their top-most parent.
  • Our final column, "chd.parent_key", makes sure each row contains a reference to its immediate parent.

Finding Various Statistics

Once you have the inside of the CTE coded, the fun part moves to the final SELECT, which is operating on the complete set of results. You do not necessarily have to pull the complete list. For instance, you may want to find out the maximum nesting level for each parent, or the count of children for each parent:

;WITH myCTEName (primary_key,level,top_key,immediate_parent_key)
as (
    select primary_key,1,primary_key,null
      from myRecursiveTable
     where Parent_key is null
    UNION ALL
    select chd.primary_key,par.level+1,par.top_key,chd.parent_key
      FROM myCTEName        par
      JOIN myRecursiveTable chd ON chd.parent_key = par.primary_key
)
select top_key
     , max(level) as maxNestingLevel
     , count(*)   as countRows
     , count(*)-1 as countChildren
  from myCTEName

Conclusion

Common Table Expressions give SQL-based databases the (very) long-needed ability to execute recursive queries, albeit with a rather complex syntax. Once you grasp the basics of how to code them, there are many possible uses that go far beyond the simple example shown here.

23 comments:

zippy1981 said...

Great to see you back from the dead. I've used CTEs in Microsoft SQL Server, but didn't know Oracle implemented them.

Jorge Monasterio said...

Are these CTE's efficient?

Anonymous said...

I have tried this on Firebird SQL, and it does not work, because firebird complains about cyclic dependencies (doh).

Seems like the approach is a good one, though not valid for all databases.

Anonymous said...

At last! I've been visiting for ages and waiting for a new article - thought you'd gone away for good (and after I'd recommended you to others). Welcome back.

KenDowns said...

Anonymous and zippy: thanks much for compliments, it feels good to be writing The Database Programmer again.

KenDowns said...

Jorge: I honestly don't know, I haven't run speed tests with them, and my current use is in ad-hoc analysis queries where I'm not terribly concerned with performance.

Perhaps you can run some tests and let us know :)

Joe said...

Belated welcome back!

I thought you would've tested with Postgres. With PostgreSQL 8.4.5, the query that works is as follows:

WITH recursive myCTEName (primary_key,level,top_key,immediate_parent_key) as

( select primary_key,1,primary_key,null::text from myRecursiveTable where Parent_key is null

UNION ALL

select chd.primary_key,par.level+1,par.top_key,chd.parent_key FROM myCTEName par JOIN myRecursiveTable chd ON chd.parent_key = par.primary_key)

select * from myCTEName;

In other words, you have to use "WITH RECURSIVE" (otherwise you get: There is a WITH item named "myctename", but it cannot be referenced from this part of the query.) and you have to cast the null to something compatible with parent_key.

Joe said...

To make the last query work on PostgreSQL, you need to add "GROUP BY top_key" at the end.

With all due respect, I find your explanation somewhat misleading. Logically, the entire UNION SELECT is done first, but for the initial iteration the SELECT FROM myCTEname returns zero rows. The DBMS then recurses over the results, again and again, using the previous set of rows as the basis for the next iteration.

Lukas Eder said...

Hi Ken,

+1 for your ranting about syntax ;-) I'd like to hear your opinion about the marvelous syntax of the do-it-all MERGE statement, as supported by various RDBMS such as DB2, HSQLDB, Oracle, SQL Server, Sybase.

On the other hand, a comparison with Oracle's much simpler and more intuitive CONNECT BY syntax might be interesting as well. Have you done a comparison before? If the CONNECT BY syntax can be fully mapped onto CTE's, then I am going simulate the CONNECT BY clause for all RDBMS that support CTE's in http://www.jooq.org.

Cheers
Lukas

Anonymous said...

For FireBird 2.5+ just alter
'WITH myCTEName'
to
'WITH RECURSIVE myCTEName'

Anonymous said...

I can't thank you enough for this explanation. I knew I needed recursive queries to get the top X results for all subsets of results, but I didn't understand the example query well enough to translate it to the task I was performing.

JacobHarman said...

At the point when the organizations choose to enlist independent net designer, it implies that they have certainly figured out the limit that Microsoft's .Net innovation could ensure for the strong tech portrayal available. .Net is an open-source system for the product improvement that is used by net engineer independent experts as a controlled climate for introducing, executing, and supporting web applications and programming.

An independent asp net designer, an independent speck net engineer, and some other full-stack net designer favor this structure for moderate and simple to handle usefulness permits a rich scope of work area, versatile, and web applications working to upgrade business cutthroat chances according to the tech perspective>> .net developer freelance junior

Unknown said...

Does your business or enterprise urgently need a MATLAB embedded coder? Then I strongly recommend that you follow this link! To get information about hiring such a person, there are also many different services here that you can use too! It's very convenient and simple! Post your vacancy here, good luck!

Anonymous said...

Hello! More and more companies from all over the world are choosing to use the staffing model for their business. This service helps to quickly find and hire IT staff to fill the shortage of specialists for your project. If you are also interested in such a model, then I recommend that you go to our website and read more about our company that provides staff augmentation consulting services!

WaldoEffertz said...

Nowadays, if a startup plans to go to a high level, it needs an angular web developer, but if it is a startup then the financial resources are limited. There is a very profitable solution for hiring developers - outsourcing companies from Ukraine! You can hire developers in outsourcing companies, while the price for the work of this specialist will be several times lower.

Unknown said...

In our difficult times, recruitment services in the field of cloud integration automation service
are a great option for developing your business! Using these services, you will protect yourself from many pitfalls, and not only because there are many reasons that you can see right on this site! You should use these services, I recommend! Have a good day!

Unknown said...

The bot outsourcing model is used only by professionals who have extensive experience and knowledge, and that is why they decided to use this particular method in the development of the project. Our company can help you take your projects to the next level and significantly improve and optimize your services by hiring and providing the right staff.

DonLi said...

If you want to start using erp software development services but don't know where to start, then you need to ask for help from a team that specializes in this! They will help you quickly implement this service in your business projects, and you will immediately see and feel the difference and the result!

alexsaint said...

Scala is its focus on functional programming, which emphasizes immutability and declarative programming styles. This can lead to more concise and expressive code, as well as easier debugging and testing. I learned about this in an excellent article that clearly describes the work of specialists in this field and their benefits. hire scala engineer

Unknown said...

Perhaps you are currently doing business on the Internet and your budget does not allow to hire a freelance cyber security consultant who could greatly simplify your work, but outsourcing can quite easily accomplish this seemingly impossible task and help you get the specialist you need cheaper than if you hired it manually. You can read more about it on this site.

Unknown said...

Consider for the needs of your company - devops services and solutions provided by this advanced technology will definitely suit your taste, except then with successful cooperation with professionals at the following link, you can do it completely on a remote basis, which of course is a huge plus for you, I recommend it!

RemoteLabeler said...
This comment has been removed by the author.
data warehouse said...

Good information.