Sunday, March 9, 2008

Introduction To Queries

This is the Database Programmer blog, for anybody who wants practical advice on database use.

There are links to other essays at the bottom of this post.

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Today we are going to look at performance, direct SQL coding, and then begin with the basics of the SQL SELECT command.

Disk Activity Determines Performance

Before we even look at the SQL SELECT command, we must know what motivates the experienced database programmer to pick certain kinds of queries and avoid others. The first motivator is of course performance. We all want our programs to go fast. If you want a fast database program, then you have to think about the hard disk.

The slowest device in a computer is the disk drive. The price of disk reads is so much higher than in-memory operations that there is nothing to gain by optimizing code unless disk reads are optimized first. I often tell my programmers to consider in-memory operations to be "free" when coding, and to concentrate all optimization efforts on reducing disk reads. This approach may gloss over some important truths, but it is an excellent starting point for database beginners.

Optimizing disk reads comes down to writing efficient queries on top of well-designed tables. With that being said, we have one more short note to cover before going into the syntax of the queries.

A Quick Note on Inline SQL

Many authors and framework programmers discourage the use of SQL SELECT statements directly in application code. This essay contains no opinions on that question.

However, it is important to know that the SELECT statement must be coded somehow and sent to the server, whether you code it manually or some framework tool generates it for you. This essay is all about how to code (or generate) that SELECT statement. Knowing how it all works is a requirement if you code your own framework or if you suspect your chosen framework may not be your friend in all cases.

Introducing SQL Select

The simplest query is four words (or symbols) long. If your database has a table of countries then here is a very simple query that will work:

SELECT * FROM COUNTRIES

This query will return all rows and columns from a table. Depending on what language you code in, these may come back in an associative array, an array of generic objects (or similar), or some other special-purpose object like a ResultSet.

You have probably been told not to use the "*" in a SELECT, for performance reasons. This is usually good advice. In the simplest case, you save bandwidth by only retrieving the columns that are of interest to you. If your table contains long varchar or text (aka clob) columns there are even more reasons to avoid "SELECT *". When you have long varchar and text columns, they may be stored outside of the main storage for the table, causing the server to look in two places to retrieve each row. Therefore, avoiding "SELECT *" and always specifying just the columns you need reduces disk reads on the server and reduces bandwidth delivering the results.

But as this is an introductory essay it is important to know how to retrieve a complete table, so I have used the "SELECT *" here.

Filtering Results with WHERE

The WHERE clause limits which rows from the base table will go into the query results. You specify a WHERE clause as one or more boolean conditional expressions. Multiple expressions can be separated by AND and OR, using parentheses to group expressions. You can review your product's documentation to see all of the comparisons and functions that are available. A moderate WHERE clause might look like this:

SELECT country,name
  FROM COUNTRIES
 WHERE country like 'A%'
   AND (   name like 'D%'
        OR name like 'E%'
       )
   AND continent = 'Africa'

Filtering and Performance

The primary purpose of a WHERE clause is to obtain the correct result. However, it is also a very important performance tool. Here is why.

If you are completely and totally new to database programming, you may get the idea that you will skip the WHERE clause and do your filtering in the application. This may seem like a good idea because you save the trouble of learning two languages. Instead of learning SQL plus your application language, you can concentrate on just your application language. And so you make a reasonable decision to use as little SQL as possible and just do everything in application code.

The drawback to this perfectly reasonable suggestion is that it violates our first performance concept, it creates a huge disk read burden. If you need five rows out of 50,000, then filtering in the application requires the database server to read all 50,000 rows off the disk. On top of that, these have to be delivered to your application for processing.

Making use of the WHERE clause means that only 5 rows are read off the disk (this assumes the presence of an index which will be explained in a later essay). In this particular example, using a WHERE clause will perform 1000 times faster than not using it. Of course this is only a single very vague example, but since a database application is composed largely of queries, it is definitely a good idea to have all of these queries start out on solid ground.

Foreign Keys and JOIN

Next week we are going to look at JOINs in much detail, but I want to mention them here briefly. The JOIN clause lets you return results from more than one table, and the JOIN determines how the rows from multiple tables will be matched to each other.

For this week I will say only that good queries will almost always use foreign keys as the basis of their JOINs. We have seen in these essays more than once that the foreign key is the fundamental and only way to connect information in separate tables. Naturally, therefore, the foreign key will loom large in our discussion of JOIN, since JOIN controls the combined retrieval of information from separate tables.

Sorting Query Results

You can sort query results by including an ORDER BY clause in the query. Simply name the columns:

SELECT customer,order,date 
  FROM ORDERS
 WHERE date >= '2008-03-01'
 ORDER BY date,customer

Some database servers let you put an ASC or DESC in front of individual columns, while other servers can only apply a DESC or ASC term to the entire sort operation.

Overall application performance comes into play with ORDER BY clauses. It is almost universally true that you can sort faster on the database server than you can in your code. You want to make sure that your manual queries contain ORDER BY clauses, and that your framework is generating them. You do not want to be sorting in application code in most cases.

Order of Terms Matters

You have to put the various clauses into the right order or they will not work. The order is:

SELECT ....
  FROM ....
  JOIN .... ON ....
 WHERE ....
 ORDER BY...

Conclusions

This week we began to examine queries, by looking at the very basics of the SQL SELECT query syntax. Not surprisingly, performance issues came up for every single part of the query, from the column list to the ORDER BY.

For performance, we looked at the basic idea that disk reads determine performance, which we will see more of in later weeks. I also mentioned that table design determines query efficiency, but we have not gotten very deep into that yet.

Related Essays

This blog has two tables of contents, the Topical Table of Contents and the list of Database Skills.

Other essays relating to SQL SELECT are:

96 comments: