Monday, April 14, 2008

JOINS Part Two, The Many Forms of JOIN

Welcome to the Database Programmer, the blog for anybody who wants to learn the practical realities of working with databases.

There are links to other essays at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

The Many Kinds of JOIN

When a programmer first decides to start learning SQL, the JOIN always appears simple at first. But it can produce some unexpected results. Sometimes a JOIN brings back more rows than the novice thinks it should, and sometimes less. This week we are going to concentrate on all of the various results that are produced by the JOIN.

A JOIN always operates on the same principle: it returns one row for each combination of rows from both sides of the match.

A Parent-Child JOIN Returns Number of Matching Children

Consider the following two simple tables:

CUSTOMER | ZIP               CUSTOMER | ORDER | DATE
---------+---------          ---------+-------+----------
 1234    | 11733              1234    |    57 | 3/1/08
 5283    | 77074              1234    |    78 | 3/15/08
                              1234    |    89 | 4/07/08
                              5283    |    23 | 2/13/08
                              5283    |    32 | 3/17/08

When you JOIN these two tables together, there are three combinations that match for customer 1234, and two combinations that match on 5283, so this query:

SELECT customers.zip
      ,orders.customer,orders.order,orders.date
  FROM customers
  JOIN orders    ON customers.customer = orders.customer

...will return 5 rows. Values from the child table appear once, and values from the parent tables repeat. The return count matches the number of rows in the child table:

 ZIP  | CUSTOMER | ORDER | DATE
------+----------+-------+----------
11733 |  1234    |    57 | 3/1/08
11733 |  1234    |    78 | 3/15/08
11733 |  1234    |    89 | 4/07/08
77074 |  5283    |    23 | 2/13/08
77074 |  5283    |    32 | 3/17/08

The Parent-Child And a Left JOIN

Now we consider the case when our Sales Manager comes down the hall and asks for a list of customers in zip code 11101, with their total sales. So you have this information in the database:

CUSTOMER | ZIP               CUSTOMER | ORDER | DATE    | AMOUNT
---------+---------          ---------+-------+---------+-------
 1234    | 11101              1234    |    57 | 3/1/08  |   25
 5283    | 11101              1234    |    78 | 3/15/08 |   35
 2938    | 11101              1234    |    89 | 4/07/08 |   45
                              5283    |    23 | 2/13/08 |   55
                              5283    |    32 | 3/17/08 |   65

...and you write this query:

SELECT c.customer,sum(o.amount) as amount
  FROM customers c
  JOIN orders    o ON c.customer = o.customer
 WHERE customer.zip = 11101

...and the result only has two rows:

CUSTOMER | AMOUNT   
---------+--------- 
 1234    |   105
 5283    |   120

The novice database programmer will look at this and say, "I don't get it, my filter says zip 11101, but I only got two rows back." This is because the JOIN clause acted as a filter, because the JOIN only returns where there is a match on both sides.

If you want to get all of the rows from the left side of the JOIN, even if they have no match on the right side, you use a LEFT JOIN:

SELECT c.customer,sum(o.amount) as amount
  FROM customers c
  LEFT JOIN orders o ON c.customer = o.customer
 WHERE customer.zip = 11101

...and now the answer is:

CUSTOMER | AMOUNT   
---------+--------- 
 1234    |   105
 5283    |   120
 2938    |  null

Of course we probably do not want that "null" because it messes up calculations, so you can use the COALESCE() function to replace nulls with zeroes. The COALESCE() function accepts a list of values and returns the first non-null:

SELECT c.customer,sum(COALESCE(o.amount,0)) as amount
  FROM customers c
  LEFT JOIN orders  o ON c.customer = o.customer
 WHERE customer.zip = 11101

...and this completes the picture:

CUSTOMER | AMOUNT   
---------+--------- 
 1234    |   105
 5283    |   120
 2938    |     0

You Can JOIN a Table to Itself

Recently a customer of mine dropped by with a problem he had in an insurance database. He had a table that listed various companies and their total premiums collected by year. He is required by various regulation to report the premiums as income at the rate of 28% in the year they are collected, with the remaining 72% being reported in the following year. It did not appear readily obvious how to do this.

This is a case of self-JOIN, where you JOIN a table to itself to match rows to each other. In this case we created the following query:

SELECT tab1.year
      ,tab1.premium * .28 as premium1
      ,tab2.premium * .72 as premium2
  FROM premiums tab1
  JOIN premiums tab2 ON tab1.company = tab2.company
 WHERE tab2.year = tab1.year + 1

Notice that the JOIN condition only matches company to company, and an additional WHERE clause was required to limit the matches. Technically we would prefer to see that second condition in the JOIN clause, but most database servers do not support that. It is a little more confusing to a newbie as written, but not terribly so.

Another SELF JOIN and a FULL OUTER JOIN

Another customer of mine called recently to complain that his legacy system is not doing something right. He said there was information missing.

In his business, which is magazine distribution to retailers, he has a table called "DEFAULTS" that lists the default number of each magazine given to each store. Sometimes he wants to see the defaults for two magazines side by side, for all of the stores they are delivered to. If his defaults table looked like this:

MAGAZINE | STORE | DEFAULT
---------+-------+--------
123      | A     |   5
123      | B     |  10
123      | C     |   7
456      | A     |   6
456      | D     |   3

...and he wanted to see magazines 123 and 456 side by side, he would expect output like this:

STORE | MAG 123  |  MAG 456 
------+----------+----------
A     |    5     |    6
B     |   10     |    0
C     |    7     |    0
D     |    0     |    3

Here is how you can recognize this request as needing a self-join. If it appears that you need two separate filters for the same table, and the two filters result in values that are meant to sit side-by-side, then you have a self-join. Our first stab at the query would be:

SELECT d1.store
      ,d1.default as mag123
      ,d2.default as def456
  FROM defaults d1
  JOIN defaults d2 on d1.store = d2.store
 WHERE d1.magazine = 123
   AND d2.magazine = 456

This is the query I found in his program, and as soon as I saw it his complaint made sense. He said basically, "There are only 5 rows for magazines x and y and there should be 500." The problem goes back to the fact that a JOIN limits the answer to those cases where there is a match on both sides. The query above would give only this:

STORE | MAG 123  |  MAG 456 
------+----------+----------
A     |    5     |    6

...because only store A has a row for both magazines. What we want is every entry for each magazine for either store even if there is no matching entry in the other store. The LEFT JOIN we saw above would help, but would only do half the job, it would give us:

STORE | MAG 123  |  MAG 456 
------+----------+----------
A     |    5     |    6
B     |   10     |    0
C     |    7     |    0

What we need is a FULL OUTER JOIN, which is basically a LEFT JOIN and a RIGHT JOIN. Its a LEFT JOIN that goes both ways. If we add in our COALESCE() functions the final query is:

SELECT d1.store
      ,COALESCE(d1.default,0) as mag123
      ,COALESCE(d2.default,0) as def456
  FROM defaults d1
  FULL OUTER JOIN defaults d2 on d1.store = d2.store
 WHERE d1.magazine = 123
   AND d2.magazine = 456

Conclusion: That your JOIN May Be Complete

The JOIN is a very powerful clause. The JOIN is the most-used method for combining information from multiple tables together. This week we saw that the JOIN has provisions for joining a table to itself, and for handling cases where the matches may not be all present.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Other essays relating to SQL SELECT are:

4 comments:

Alex said...

Thanks again for a concise article! Maybe you can leave some remarks about the "traditional" (within the where clause) way of joins in another article?! I find it easier to understand for beginners ;-)

Best regards

Alex

Anonymous said...

Thanks..
A very nice article for the novice database programmer.

rudy said...

This --

sum(COALESCE(o.amount,0))

is unnecessary.

Aggregate functions such as SUM ignore NULLs.

SUM(o.amount)

;o)

Douglas said...

@rudy So does AVERAGE. I'm thinking he's covering his bases here so that any math works out as expect rather than as correct.