This is the Database Programmer blog, for anybody who wants practical advice on database use. There is a complete table of contents here.
Last week we started a new sub-series on queries. Today we are going to see one of the most important parts of any query, the JOIN clause. The JOIN is the fundamental mechanism for combining data from different tables.
The Very Basics of JOIN
Sometimes you need information that has been separated out into different tables. For instance, imagine you are a programmer on an in-house eCommerce site for a company that sells computer parts around the world. The Sales Manager walks down the hall one day and says she wants a detail listing of customer types and order dates. Our customer types are in the CUSTOMERS table, and the order dates are in the ORDERS table, so what she wants is this:
A SELECT statement to pull these columns would look like this:
SELECT customers.customer,customers.custtype,orders.date FROM customers JOIN orders ON customer.customer = order.customer
Refining With Aliases
Because the JOIN is so basic and common, it can get very cumbersome to constantly spell out long table names in front of every column. Therefore you can use an "alias" in the JOIN clause to give each table a nickname:
SELECT c.customer,c.custtype,o.date FROM customers c JOIN orders o ON c.customer = o.customer
More Than One JOIN
Sometimes the information you need is in two tables that cannot be JOINed because they are not "next" to each other. Put another way, neither table has a foreign key to the other. Put a third way, they do not have any columns in column. Let us pretend our troublesome Sales Manager comes down the hall again and this time she wants a listing of every item ordered by every customer type. This time she is looking for the following:
This time we need to start with the CUSTOMERS table and then "go through" the ORDERS table to get to the information we need in the LINES table. This means two JOINs:
SELECT c.custtype,l.sku FROM customers c JOIN orders o ON c.customer = o.customer JOIN lines l ON o.order = l.order
Most Requests Will Make Sense
Sometimes a user's request will appear to make no sense. The user asks for a combination of values that appear to have no connection, and we programmers object, "that makes no sense! Why would you want that?"
The most important idea to keep in mind here is not a technical idea at all, it is more a matter of how to keep people happy. In my experience it is extremely rare for a customer to ask for a query that well and truly makes no sense. Just because I do not understand it does not mean it makes no sense! In fact the Sales Manager likely knows her job very well and if she is asking for items by customer type she must have a reason.
Now, with that being said, the technical solution in these cases is to follow the foreign keys. If a database has been designed well, all tables will be connected to each other through foreign keys, and you can trace out a path that connects the various data points by following these foreign keys.
Denormalizing For Performance
You have probably heard people say that sometimes you need to "denormalize for performance." Now we will look at what that means.
Consider an assignment given to two people, one of them a veteran database programmer and the other a newbie. It is guaranteed that the veteran's database design will have a lot more tables in it than the newbie's database. This is because the veteran knows he will have far fewer errors getting data in if he keeps a separate table for each level of detail required by the program. By contrast, the newbie will be guided by a strange desire to save on tables as if there is some kind of world-wide shortage of tables.
But the veteran now has a problem. While normalization is great for ensuring correctness on the way in, it tends to require more JOINs on the way out, and it so happens that JOINs are rather expensive for a database to perform. In fact, they are one of the most expensive operations there is, and they only get worse as the number of tables being JOINed increases.
Therefore, the veteran will sometimes take a design to the fullest of normalization, and then deliberately denormalize it to reduce JOINs. A very simple example is adding the CUSTTYPE column to the ORDERS table and then copying the value of Customer Type onto each order. If the programmer is confident that the value will always be copied correctly, then any report on sales that involves customer types can avoid an expensive JOIN between ORDERS and CUSTOMERS. This is the essence of the "Denormalizing for Performance" approach, and we will see more essays specifically on that topic later in this series.
Denormalized is not the same as non-normalized. The newbie will have fewer tables, tables that are non-normalized because they have values bunched together that do not belong together. The newbie will spend a lot of time correcting data errors as a result of this. The veteran however will have lots of normalized tables and will look for (or write) a framework that assists in controlling where a user can write values and when the values are copied around to the their de-normalized spots.
Conclusion: Do Not Fear the JOIN
Just as the foreign key is the fundamental (and in fact the only) mechanism that relates data together, so the JOIN is the basic building block that ties that information back together.
2 comments:
Great article again!
Maybe one should stress, that a FK relation is not mandatory for a join, but in most cases very useful ;-)
Outer joins soon to come? ;-)
Alex: thanks! You are correct that the FK is not mandatory, so thanks for adding that. I almost put into the article a comment that in the last 4 years I have been able to design databases so that every single JOIN follows an FK, but I left it out because I thought it would be stretching a point in an introductory article. Nevertheless, I do feel it is possible with a good design for all or nearly all JOINs to follow fks.
Outer joins will certainly be treated at some point soon.
Post a Comment