Sunday, March 30, 2008

How the SQL UNION Affects Table Design

Welcome to the Database Programmer!

There are links to other essays at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Introducing the UNION

Last week we saw a JOIN. A JOIN will combine values from two (or more) different tables and put them into the same row of the output.

In contrast to the JOIN, which puts values into the same row, the output of a UNION is to add together rows from more than one query to make a larger result. Here is an example of a UNION of a customers table and a vendors table:

The image shows that rows from two tables are combined one after the other in the results. The SQL is here:

SELECT customer as code,'CUST' as type,name,state
  FROM customers
UNION ALL
SELECT customer as code,'VEND' as type,name,state
  FROM vendors

You Usually Want UNION ALL

The example above uses the syntax "UNION ALL", with the keyword "ALL" being added. You will see this "ALL" keyword on every example on this blog and in many examples elsewhere.

If the "ALL" keyword is not included, most database servers will examine every row of the results and attempt to eliminate duplicates. They do this by examining every single column of every row. This means if you pull a 10-column query of 3000 rows out of one table and 6000 rows out of another, the database will attempt to de-duplicate 9000 rows based on all ten columns. This has two disadvantages, which are:

  1. It is slow, like a snail crawling through glue, and
  2. Most people don't actually expect or want the query to be de-duplicated.

Including the "ALL" keyword tells the server to return all rows and not bother to try to de-duplicate them.

Numbering Columns

When you do a GROUP BY (which we will see in later weeks) or an ORDER BY, most database servers require you to list the columns by number, not by name. This is because the values in the result may be coming from columns that had different names in the original base tables.

Object Oriented Influences

The example above shows only three columns each for the customers and vendors tables. We can assume of course that they have more columns, and we can also guess that many of those columns will be the same for both tables. In any case where you use a UNION, you may find yourself asking why you have two tables, and if you should have only one.

To make matters worse, if you learned your table design in the school of Object Orientation then you will have a very strong desire to make a base class called "trading partner" and make customers and vendors into subclasses of that base class. Then you want your tables to reflect your classes so there you are with one table.

Nevertheless, this is usually a mistake. It will make your code more complicated and error prone. To understand why, let's look at UNION again.

The UNION clause allows you to combine information from separate similar sources only when needed. In other words, the UNION clause lets you combine information upon demand, without requiring a permanent combination.

If you combine vendors and customers you have a problem that checks can be issued to customers, or orders can be entered against vendor accounts. To prevent this, you need complicated business logic in your application, and additional columns in the tables. Moreover, a customer will have columns a vendor does not and vice-versa, so you need more logic to ignore some columns or hardcode them or otherwise handle them based on what operation is begin performed. If they are separate, simple foreign keys do the trick and you don't need that extra code. Once you know about the UNION clause, there is little incentive to combine entities that should not be combined.

Conclusion: Combine at Output

If you go back and review the essays on table design patterns you will see that good table design is all about separating facts out into many different tables. The goal of the separation is to store each fact in exactly one place. Using primary keys and foreign keys on a well normalized database ensures that data is correct while it is on the way in.

However, on the way out, you need to recombine those facts. Two weeks ago we saw that the JOIN combines facts from different tables into a row. This week we saw that we can use the UNION to combine results vertically, that is, to add the results of one query to the results of another. Judicial use of UNION makes your application lean and efficient by letting you normalize data to ensure correctness on the way in, while still combining facts where necessary on the way out.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Other essays relating to SQL SELECT are:

82 comments: