Sunday, May 18, 2008

Database Triggers, Encapsulation and Composition

When Object Oriented Programming burst into the mainstream, encapsulation rapidly came to be accepted as a basic tenet of sound programming. Modern database servers provide a wonderful way to tightly couple code and data, a way that is secure, non-subvertible, flexible and powerful. This week we will see the first use of the database trigger, and examine its curious ability to accomplish the same goal as OOP encapsulation. Furthermore, while the database requires no analog to the idea of inheritance, triggers also serve as a very powerful way to build up business processes through composition.

This is the Database Programmer blog, which is for anybody who wants to learn how databases work on their own terms. Right now we are in the middle of a series of introductions. Last week we had an Introduction To Security, this week is triggers, and next week will be meta-data and data dictionaries. We need to see these introductions because the advanced table design patterns that are coming up are based upon these tools.

Triggers 101

A "trigger" is a small computer program stored on the database server that fires whenever a table is written to. The trigger can validate and if necessary reject the operation, and it can also modify the data being written. These two simple abilities can be combined to implement incredibly powerful operations and combinations of operations.

Before we get to the first example, we need to review the four basic options that are available when you define a trigger. These four options will be used in the examples below, and this quick review can serve as a reference when you review the examples.

  • INSERT UPDATE and DELETE Triggers can be defined for each of these operations, and most databases allow you to assign define a trigger to fire on combinations, like INSERT and UPDATE but not DELETE. Most servers also allow you to define multiple triggers to fire on each event.
  • BEFORE or AFTER. A trigger can be defined to fire before or after the operation. Triggers that fire before an operation can modify the data being written or reject the operation. Triggers that fire afterwards are good for writing to history tables.
  • STATEMENT or ROW. Some SQL statements such as DELETE and UPDATE affect multiple rows. All popular servers (that I know of) fully support having a trigger fire on each row of the operation. It is sometimes desirable for performance reasons to fire a trigger only once, either before or after the rows are affected. Support for this ability varies widely in the real world.
  • SECURITY. Security models vary widely in the wild. My favorite amongst the servers I have used is Postgres, which lets you grant all users complete freedom to fire any trigger. This is so useful because it lets you define security just once in terms of table access, and since triggers fire on table access, the effective trigger security shows up for free.

The First Example: Denormalization Patterns

Three weeks ago I reviewed the three basic Denormalization Patterns that are used to put calculated values into a database. These denormalization patterns fit hand-in-glove with triggers. The following code shows three simple calculations that can be performed in a trigger. I should note that the syntax is peculiar to Postgres, and you will need to review your own server's documentation to get the syntax right. If you have never written a trigger before, I should further warn that Postgres has a perverse two-step process, it makes you write a function and then attach that function to a trigger. MS SQL Server lets you define the trigger and code together in one step, which makes much more sense. But that being said, here it is:

CREATE FUNCTION order_lines_before_row_func RETURNS TRIGGER AS
$BODY$
   -- execute the FETCH of the price
   SELECT INTO new.price price FROM items 
    WHERE items.sku = new.sku ;
    
   -- EXTEND The price
   SET new.extended_price = new.price * new.qty;
   
   -- AGGREGATE the total to the sales order
   UPDATE ORDERS set lines_total = COALESCE(lines_total,0) + new.extended_price;
$BODY$
SECURITY DEFINER LANGUAGE PLPGSQL;

-- This second step is a peculiar requirement of postgres, 
-- by contrast, MS SQL Server would have let me define these options
-- above and would not require this very annoying second step.
CREATE TRIGGER order_lines_before_row_trigger 
BEFORE INSERT OR UPDATE
    ON ORDER_LINES
   FOR EACH ROW
EXECUTE PROCEDURE order_lines_before_row_func;

It is no understatment to say that you can build huge applications of hundreds of tables that use little more than the example above repeated over and over across many tables. There is of course more to the story, lots more, but the example above lays the pattern for most of what follows.

The Second Example: Playlist Ordering

Some months ago a member of the Long Island PHP User's Group asked about a method for ensuring that an ordered playlist would never have two songs in the same position. This question struck me because it is a great example of the general problem of maintaining any ordered list when you may be moving items around. It turns out an extremely simple trigger takes care of it neatly.

CREATE FUNCTION playlist_before_row_func RETURNS TRIGGER AS
$BODY$
   -- given a column "position", if an item is given a position,
   -- already occupied by another row, bump that other row forward
   UPDATE playlist SET position = position + 1
    WHERE position = new.position;
$BODY$
SECURITY DEFINER LANGUAGE PLPGSQL;

-- This second step is a peculiar requirement of postgres, 
-- by contrast, MS SQL Server would have let me define these options
-- above and would not require this very annoying second step.
CREATE TRIGGER playlist_before_row_trigger 
    ON playlist
BEFORE INSERT OR UPDATE 
   FOR EACH ROW
 EXECUTE FUNCTION playlist_before_row_func();

This trigger works because triggers will cascade, which is explained next.

Some Notes on Cascading Triggers

A trigger can contain code that executes an UPDATE statement, which may then fire other triggers, which themselves contain UPDATE statements, and so on. This is called cascading.

The second example above only works if a trigger can cascade. The chart below shows how it works. The first column shows the value of "position" for a few rows. Imagine the user inserts a new row at position 1, "stepping on" the value that is already there. The trigger shown in example two takes care of everything:



- POSITION -
 Old | New | Trigger action
-----+-----+------------------------------------------------------ 
     | 1   | Insert causes trigger to fire the UPDATE
  1  | 2   | UPDATE flips this from 1 to 2, causing trigger to fire again
  2  | 3   | 2nd UPDATE flips this from 2 to 3, causing the trigger to fire again
  3  | 4   | 3rd UPDATE flips this from 3 to 4, trigger fires a 4th time
  6  | 6   | Fourth UPDATE finds no target row, execution ends, no action on this row
  7  | 7   | no action

This patterns has some latent performance issues that are too complex to get into until I can do a complete essay on this table pattern. But the point here is that those performance problems are a part of the pattern itself, and using a trigger for the pattern is actually the fastest way to implement the pattern.

Encapsulation, Subversion and Security

The term encapsulation is used (as far as I know) exclusively by Object Oriented theorists. Nevertheless, it is striking how well the term describes the effect of triggers. When a trigger is put onto a table, you have the tightest possible integration of code and data, at least as far as database applications are concerned. This is true because the trigger allows for non-subvertibility and integrated security.

A system is said to be "subvertible" if a program or user can, by accident or malice, disobey a business rule or circumvent any of the protections set up in the system. Triggers represent a non-subvertible implementation of business rules, because no application code or user can bypass their actions. If that is not encapsulation I don't know what is.

There is a really nifty advantage to triggers that falls into this general discussion. When you implement calculations in triggers, you can freely allow access to the database from outside of your own application. This is a crucial feature when selling to companies with large IT departments that expect this ability like a birthright. Going further along these lines, you can implement browser and desktop versions of the application without recoding all of the calculations.

Last week we had an Introduction To Security in which we saw that database security is implemented as restrictions on who can INSERT, UPDATE, DELETE or SELECT from a table. This dovetails perfectly with the use of triggers because it allows you to think of security entirely in terms of table access, which is what it all boils down to in the end anyway. Using triggers plus table security gives the tightest possible integration of security and business logic, and puts your entire security framework on a simple and consistent basis.

Third Example: Composition

The Object Oriented Programmer is used to instantiating and using objects together in different patterns. This technique, known as "composition", is one of the basic ways for classes to interact with each other. I will not argue here that triggers are an exact analog to composition, because such a statement is not true and will invite meaningless debates over terms. What I will argue is that within a database the use of triggers to write to other tables accomplishes the same goal: the interaction of code from different but related contexts.

To illustrate this, our third example makes use of the cascading effect described above to link together the ORDER_LINES table, the ORDERS table, and the CUSTOMERS table. The idea is this. One of the rules on the ORDER_LINES table is that the extended_price must be calculated whenever price or qty change. But there is also a rule on the ORDERS table that it must always contain the total of the line items of the orders. Now let us imagine that the CUSTOMERS table must always contain the total of all open orders for the customer. A complete example for INSERT, UPDATE and DELETE would be far more than will fit here, but if we concentrate just on UPDATE operations we can see the general idea:

-- This is a modified form of example one above
CREATE FUNCTION order_lines_before_row_func RETURNS TRIGGER AS
$BODY$
   -- EXTEND The price
   SET new.extended_price = new.price * new.qty
   
   -- AGGREGATE the total to the sales order,
   -- notice that we are subtracting the old value and
   -- adding the new
   UPDATE ORDERS set lines_total 
         = COALESCE(lines_total,0) 
         + new.extended_price
         - old.extended_price;
$BODY$
SECURITY DEFINER LANGUAGE PLPGSQL;

-- The trigger above affected the ORDERS table, and now
-- we go the next step and affect the customers table
CREATE FUNCTION orders_before_row_func RETURNS TRIGGER AS
$BODY$
   -- AGGREGATE the total to the customer
   UPDATE CUSTOMERS set orders_total 
         = COALESCE(orders_total,0) 
         + new.lines_total
         - old.lines_total;
$BODY$
SECURITY DEFINER LANGUAGE PLPGSQL;

-- Postgres requires this 2nd step, but I'm not going into it here
CREATE TRIGGER order_lines_before_row_trigger .....
CREATE TRIGGER orders_before_row_trigger.....

I want to repeat here that I am not claiming this mechanism is an exact analog to composition, but I do want to stress that when you go with triggers a lot of code disappears out of your application, and you may find yourself wondering how all of this fits with the OOP mentality. All programming systems require the basic ability to both isolate code while also defining and controlling how these isolated fragments can interact. OOP makes use of inheritance and composition to relate code segments to each other, but triggers change the scene quite radically. The trigger encapsulates the behavior of the table, and because triggers can update other tables, the ability to have cascading or secondary actions is satisfied.

My final note on this idea is to repeat something I have stressed in these essays many times. The seasoned and experienced database programmer thinks of things in terms of his basic building block, the table. The trigger gives him the ability to attach code directly to this building block. The fact that triggers can update other tables gives him the necessary mechanism to have operations that span multiple tables.

Future Essays on Triggers

There is far more to cover on triggers, and I have glossed over many details in this first introduction. Future essays will demonstrate table design patterns that have their cleanest implementations with triggers, and we will also examine grave performance issues that must be considered on large complex systems. Before we get to that however, next week we will have an introduction to meta data and data dictionaries, which we have to cover first because they lead to more essays on triggers and code generation, and the data dictionary will play more and more into future essays on table design patterns.

Conclusion

Triggers are a very powerful mechanism for creating business logic that is guaranteed to execute, cannot be subverted, dovetails nicely with security. The trigger has the curious property of strongly encapsulating code and data, even though we do not usually use the term encapsulation to describe this. Triggers put great power in the hands of the programmer.

Sunday, May 11, 2008

Introducing Database Security

For a database application, all security in the end becomes a question of who can read and write to what tables. While this is obvious to veteran database programmers, it is not always so obvious to relative newcomers. A programmer who thinks primarily in terms of code will be used to conceptualizing actions, steps, procedures and so forth. This programmer will therefore think of security in terms of who is allowed to perform what actions. However, when you examine the actions performed by database application code, you find that, no surprise, it is a lot of database reads and writes.

This week in The Database Programmer we are going to see the basics of table-based security. This is a blog for anybody who wants to learn the practical realities of working with databases. A new essay is published every Monday morning, and the Complete Table of Contents is here.

Table Permissions 101

The most basic unit of storage for a database is the table, and so security begins with deciding who can do what to a table. The SQL language has one read command and three write commands (INSERT, UPDATE, and DELETE), so there are four possible permissions that can be granted to a user. Most databases out there provide syntax that is pretty close to this, which makes some direct permission assignments to a group named "ARSTAFF"

CREATE GROUP ARSTAFF
GRANT SELECT ON RECEIVABLES TO ARSTAFF
GRANT INSERT ON RECEIVABLES TO ARSTAFF
GRANT UPDATE ON RECEIVABLES TO ARSTAFF
GRANT DELETE ON RECEIVABLES TO ARSTAFF

These commands should be run at the same time that your tables are built, they are part of the basic database build.

A Basic Security Dictionary

A very basic security dictionary will list all tables and show who can do what to what. The dictionary begins with a single table, which lists tables and groups and permissions, something like this:

TABLE        | GROUP     | SELECT | INSERT | UPDATE | DELETE
-------------+-----------+--------+--------+--------+--------
AR_INVOICES  | AR_STAFF  | 1      | 1      | 1      | 0
AR_PAYMENTS  | AR_STAFF  | 1      | 1      | 1      | O
CUSTOMERS    | AR_STAFF  | 1      | 1      | 1      | 0
CUST_TYPES   | AR_STAFF  | 1      | 0      | 0      | 0
CUST_TYPES   | AR_ADMIN  | 1      | 1      | 1      | 1

Security permissions should always be defined in terms of groups, not individual users. This is important because it lets you make the group definitions part of the basic definition of the system, leaving the system administrator free to add users to groups at his own discretion.

Deny By Default And Determining a User's Permissions

Most security systems make use of the idea of deny by default. This means you do not have to specify who cannot see a table, the system starts out by assuming that nobody can see anything. Users can then only work with a table if they are in a group that can work with that table.

Another key idea is this: if a user is in even one group that allows a permission, the user gets the permission, even if his other groups are not allowed in.

With these two ideas, deny-by-default, and any-group-gets-you-in, then assuming you have a table that lists what groups each user is in, you can find a particular user's permissions on a table with this query:

SELECT MAX(select) as select
      ,MAX(insert) as insert
      ,MAX(update) as update
      ,MAX(delete) as delete
 FROM secure_dictionary
WHERE table = 'table name'
  AND group IN (list,of,groups,user,is,in)

If the query comes back empty, the user has no rights, otherwise you will get the user's permissions for that table.

Enhancing Our Security Dictionary

The security dictionary portrayed above can become tedious to work with because it often happens that you want to make similar assignments to groups of tables. For instance, in an accounting application you may an "APSTAFF" group that has basic permissions on Accounts Payable tables, and an "APADMIN" group that can do anything to any table in Accounts Payable. However, those groups have no permissions on the Accounts Receivable tables, so you have two new groups, "ARSTAFF" and "ARADMIN". Likewise for inventory, general ledger, and so forth.

This can all be made much easier by organizing your tables into modules and defining default permissions within a module to each group. The expanded dictionary would have these tables:

THE TABLE OF TABLES

TABLE        | MODULE
-------------+---------
AP_INVOICES  | AP
AP_CHECKS    | AP
AR_INVOICES  | AR
AR_PAYMENTS  | AR
GL_ACCOUNTS  | GL
GL_LEDGER    | GL

MODULE PERMISSION

MODULE  | GROUP     | SELECT | INSERT | UPDATE | DELETE
--------+-----------+--------+--------+--------+--------
AP      | AP_STAFF  | 1      | 0      | 0      | 0
AP      | AP_ADMIN  | 1      | 1      | 1      | 1
AR      | AR_STAFF  | 1      | 0      | 0      | 0
AR      | AR_ADMIN  | 1      | 1      | 1      | 1
GL      | GL_STAFF  | 1      | 0      | 0      | 0
GL      | GL_ADMIN  | 1      | 1      | 1      | 1

The permissions above basically let the regular in each group see anything, but write nothing. The admin users in each group can do anything. Specific table assignments to particular tables can then override these defaults to give the *STAFF members increased access on a table-by-table basis.

Our First Secure Table Pattern: The Read-Only Lookup Table

We are now going to see the first table design pattern that is based on security: the read-only lookup table. Consider a table of customer types and discounts. This table does not change that often, and regular staff are not permitted to make changes to the table, only managers can do that.

I call this pattern the "Read-Only Lookup Table." The table holds values that everybody needs to see, but only a few people can modify. The SQL to create a table might look like this:

CREATE GROUP AR_STAFF
CREATE GROUP AR_ADMIN
CREATE TABLE CUSTTYPES (
   custtype char(10) primary key
  ,description char(25)
  ,discount numeric(4,2) 
)
DENY ALL     ON CUSTTYPES TO AR_STAFF
GRANT SELECT ON CUSTTYPES TO AR_STAFF
GRANT ALL    ON CUSTTYPES TO AR_ADMIN

By the way, this is yet another argument against the deep tendency that code grinders have to combine tables that have similar structures, especially lookup tables. We know in theory that combining lookup tables is bad because like things should be stored together in a table and unlike thing should be stored in separate tables. Now we see the practical impact, which is that you cannot implement security on tables that have been combined (just like you can't implement constraints, foreign keys, calculations...) The crucial point here is that table design is not so much about identifying column names and widths, at its heart it is about organizing data. When you mistake a superficial resemblance (similar column structure) for an actual similarity, you confound your own ability to manage either entity correctly.

The Connection, Users, and Groups

You can certainly implement security in your framework by doing on-the-fly lookups to see who is allowed to do what. However, your run-time framework code will be much simpler if the security definitions are built into the server when the tables are built. The drawback is that it requires a dramatic change to how you connect to the database.

Many programmers use frameworks (or write their own) that make only one connection to the database. This connection has the maximum allowed permissions that any user of the system has, and the framework determines what particular users can actually do. This system is so common that many programmers probably assume it is the 11th commandment, and these same programmers are unaware that there is any other way to do it.

But in fact databases are able to create users and groups just the same way that operating systems are, and then to allow you to connect as these various users. Making all connections as a super-user and trusting your code to get it right puts a huge burden on the programmer, and we have a ready example of what can go wrong by looking at a very popular operating system produced in Redmond, Washington. By contrast, a simple table-based system reduces the task to careful review of the much simpler security dictionary.

The decision on which way to go is not necessarily cut and dried. If you have a site with a very simple database and very simple security, and thousands or millions of users, there is much to be gained by caching connections and arbitrating security in code. However, if you are producing line of business programs like accounting, ERP, medical billing or any other app where your users are limited, there is nothing to be gained by caching connections and everything to be gained by simplifying the system and making it easier to code and maintain.

Beyond Table Security

While table security can be vastly simpler than trying to secure programs one by one, it cannot in fact handle all of the cases that might come up. In later essays we will examine column-level security and row-level security. These are not well supported in the popular products available today, so you may have to work it into your framework (which is what I did for my Andromeda framework).

Also, now that we have seen an introduction to database security, we will be seeing more secure table design patterns in the future.

Next Week: Database Triggers, Encapsulation and Composition.

Sunday, May 4, 2008

Minimize Code, Maximize Data

Early in my career, I was fortunate to receive some programming lessons from one of the early pioneers in computer age. While I am sure he would not remember me, I certainly remembered him, and he said one thing that I remember very much: Minimize Code, Maximize Data.

This is The Database Programmer blog, a weekly series of essays for anybody who wants to find out about the practical realities of dealing with databases on their own terms. The Complete Table of Contents is Here.

The Best Kept Secret in Programming

This week we are going to examine what I was told so many years ago:

Minimize Code, Maximize Data

Since then, and it was nearly 15 years ago, I have never once heard another programmer (except myself) express this very basic and simple idea. It is the best kept secret in programming. I can think of two reasons why most programmers working today have never heard the simple idea, "minimize code, maximize data."

First possibility: the guy was totally wrong. I highly doubt this however as his company is approaching its 40th year of worldwide growth, and how many companies in this field last that long, never mind prosper and grow?

The second possibility seems much more likely to me: most programmers just don't think that way. Most programmers would never stumble upon this idea on their own and if they do hear it they forget it or reject it. We programmers love to code, and are reluctant at best to accept the idea that he who codes least codes best.

Now we will see how this simple idea impacts the entire software cycle, from designer to user.

The Example: Magazine Regulation

Our example is fairly easy. Consider a magazine distributor, a company that receives magazines in bulk and distributes them to stores. His system contains a table called DEFAULTS that lists the default quantity of each individual magazine that is distributed to each store (a big fat cross-reference table).

Now for the tricky part. Let's say we run a "SELECT SUM(qty)" from the DEFAULTS table for TV GUIDE and we get 2123. That means he needs 2123 copies of TV GUIDE to give every store their default delivery. But when the magazines arrived from the supplier, they only received 1900. What does he do? Well it so happens this happens every day, the distributor in fact never receives his exact default requirements, they are always over or under. So the distributor must run a process called regulation.

Regulation is an automatic computer process that runs through the defaults, compares them to what you actually have, and increases or decreases the delivery amounts to each store until everything lines up. It is by nature iterative, you run through the stores over and over increasing amounts by one or two until you balance out. Next we will look at how to write it.

Writing a Regulation Program

Before we get to the details of the regulation program, I have to point out one detail about magazine distribution. It turns out that 50-80% of magazines sent to retail racks go back to the distributors and are shredded. This means it is not uncommon to talk about sales percentages of 50% or 60%. If a store is receiving too many magazines, they might easily have a sales percent as low as 10%.

So how do we write the regulation program? I'm going to avoid any long build-up here and just go straight to the answer. You make a table of rules that determine how the regulation process works. The owner of the distribution company will say things like, "If the store sold less than 20% of the TV GUIDES for the past 4 weeks, drop their amount by 2, but do not go below 2." So you make a table with columns like "THRESHOLD_PERCENT", "DECREMENT" and "ABSOLUTE_MINIMUM". There will also be rules that operate when the company has too many magazines. The owner says, "If his sales percent is greater than 80%, give him two more." So your table now contains percentages and increase amounts to use when raising the distribution amounts.

From here your program is reduced to a fairly simple affair, it is little more than a double-nested loop. You fetch a list of the rules, in order, and you begin to iterate through the magazines, applying the current rule to the current magazine. You keep applying each rule over and over until there are no more cases where it applies, then you move on to the next rule. As soon as the total magazines matches the amount you have, the program terminates. It is the very soul of simplicity, and a perfect example of what database applications can do so well.

The Wrong Way To Do It

It so happens that I have a customer who is a magazine distributor, which is how I learned about this process. The system he is using now, which I maintain for him, was not written as I described it above. There is no table of rules. The programmer coded up each different rule separately, and the programmer is now dead (I feel like Dave Barry saying this, but I swear I am not making this up). The owner and I are both afraid to touch the code, and so he lives with it while we plan for something better.

While I do not wish to speak ill of the deceased, it is safe to say that the original programmer did not know he was supposed to minimize code and maximize data. Let's examine the consequences of that ignorance:

The customer is not in total control of his own business, because he
cannot reliably modify one of his most basic operations.

So we can now draw the conclusion that the "minimize code, maximize data" has very direct consequences for the user experience, especially those very important users who sign the checks.

The Impact Upon Your Code

A code grinder who is not trying to minimize code usually ends up with very complicated programs. This is true even of veteran and expert programmers who supposedly know better. They set out with an idea to simplify things and then end up writing complex heavily inter-dependent class libraries. I contend that this happens because they don't know that they are supposed to minimize code by maximizing data.

On the other hand, the database programmer who knows to maximize data ends up with dramatically simpler code patterns. This happens because a lot of the complex conditionals and branching logic that is required for code-centric solutions is reduced to row scanning operations that contain much simpler algorithms in the innermost loop.

We can see this in the example above for the regulation program. The data-centric solution is basically one loop nested inside of another with the core operation occurring on the inside. The code-centric solution, which I mentioned we are afraid to touch, is full of conditionals and branches that make it dangerous to mess with for fear of causing unintended side-effects. The original programmer could probably do anything with it, everybody else is reduced to doing nothing.

The problem becomes more exaggerated as time goes on. As programs mature, their original simplicity is enhanced to handle more sophisticated edge cases and exceptions. As this process continues, many of these sophisticated edge cases and exceptions will be mutually exclusive or will interact in subtle ways. When this happens the code-centric program becomes increasingly difficult to understand and keep correct, as the layering of conditionals and branches and interdependencies makes it harder and harder to eliminate unwanted side effects. The data-centric solution on the other hand, while still becoming more difficult, is reduced to simply making sure that the tables provide the correct options for the code, and the code remains a matter of scanning the rules, picking precedence, and executing them.

The Impact on Debugging

It is much easier to debug data than code, especially when a simple operation has been matured to handle a collection of subtle edge cases and exceptions that may interact with each other.

If we continue to the example of the regulation process, the basic question arises, how do you test this thing? The entire concept of reliable testing is far more than we can cover in a single essay, but I do want to introduce the idea of testing with -- you guessed it, more data.

The regulation program as I have described it loops through rules and magazine quantities and adjusts those quantities. It is a row-by-row process, with each pass through the inner loop executing a single change. Debugging this process can be vastly simplified if a table is made that records, in order, each update that was made and what rule was used to make the update. Obvious errors could be detected if a rule was being applied out of order, or if the rule made the wrong change. Generating the test cases brings in yet again more data, creating a body of magazines, defaults and customers that deliberately stresses the system with extreme cases.

This debug-the-data approach can have a huge impact on boosting the customer's confidence and control. If he can directly view this log he has a perfect black-and-white explanation of how the process ran. If he does not like it he can change the rules. If you let your program run in "planning only" mode, where it generates these logs without making the changes, then your customer can play what-if and you will find you have truly made a new friend!

Meta-Data and the Data Dictionary

This week's essay leads naturally to the matter of meta-data and data dictionaries and how they can dramatically reduce the amount of code needed for routine table maintenance tasks. However, that is a large topic and must be reserved for one or more future essays.

When we get to the topic of data dictionaries, we will be looking at how a data dictionary can give you true zero-code generated table maintenance forms, among other things.

Conclusion

The motto "Minimize Code, Maximize Data" is not well known in popular discussions today, which I contend is a natural consequence of our basic personalities as programmers. Coding is what we do and we tend to think of solving problems as an exercise in code and more code.

Nevertheless, we have seen this week in a specific example (which could be repeated many times over) that the code-centric versus data-centric decision impacts everybody. The rule "Minimize Code, Maximize Data" has positive impacts and the coding process, the debugging process, the maintenance process, and the user experience. Since that covers all parties concerned with software development, it is safe to conclude that this is a crucial design concept.

Next Essay: Introducing Database Security

Sunday, April 27, 2008

Denormalization Patterns

Welcome to the Database Programmer! Every Monday morning this blog contains a new essay. This blog is for people who want to learn the practical realities of databases. Topics range from simple to advanced.

The main theme of these essays is that your applications will be leaner, faster, and easier to write and maintain if you understand how databases work.

The complete table of contents is here.

The Non-normalized Database

A denormalized database is one that has been meticulously normalized to eliminate redundancies, only to have redundancies deliberately put back in to meet other needs. That is the kind of database we are going to talk about today.

By contrast, a database is "non-normalized" if nobody ever bothered to normalize it. This is what you get when the programmer says, "I'm not concerned with the table structure yet, I'm working on the code first." In my experience this is something like saying, "I'm not worried about the plan for this work bench I'm building, I'm just going to throw the wood on the table saw and start cutting." We are not interested in non-normalized databases.

The Fully Normalized Database

The fully normalized database will have few if any redundancies, each fact will be stored in exactly one place. This database has one big advantage, which is that write actions, INSERT, UPDATE, and DELETE, will be very easy to code up correctly. The application code for a fully normalized database will be smooth, simple, easy to write and easy to maintain.

But decades of experience has shown that fully normalized databases have a few drawbacks, which is why practical-minded programmers always end up denormalizing their designs. The problems in particular are:

  • No calculated values. If you have a shopping cart with the columns "price" and "qty", it may seem natural to put in a column "extended_price" that holds price * qty. This is actually forbidden by third normal form, though almost everybody puts it in anyway.
  • Non-reproducible Calculations. Continuing from the point above, most shopping carts or ordering systems have many calculations that are far more complicated than simply doing price * qty. Often a calculation may depend on a dozen or more preceding calculations. In complex situations like this, it is all too common that changes in the program mean an invoice printed after an upgrade produces different numbers than the same invoice printed before the upgrade. You don't want that phone call!
  • JOIN Jungles. A fully normalized database will have information scattered in many different tables. Though this makes it easy to get it right going in, it can make it terribly difficult to get seemingly simple combinations of data back out.

The Denormalized Database

As I said above, the denormalized database is one that was first normalized by having its redundancies removed, only to have some redundancies deliberately put back in. This can solve all three of the problems listed above. There is a cost of course. When we denormalize we have to keep two things in mind:

  1. We must dream up a way to keep the redundant values correct. If we can pull this off we get all of the advantages of denormalization with no drawbacks.
  2. Following up on the first point, we have a better chance of getting it right if we can identify a set of denormalization patterns. Once they are identified, we can code up something in the framework that supports them, and now we can see the gold at the end of the rainbow.

The Foreign Key Is Our Friend (Again)

I have said many times in these essays that the foreign key is the only way to establish relationships between facts stored in different tables. We will now see how this relates to denormalizing our database.

Denormalization means introducing redundancies. In other words, a fact that was stored in only one place is now stored in two or more places. If we are going to copy a value from one table to another table, it stands to reason that there must be some logical relationship between those two tables. Since the only kind of relationship we can have between two tables is a foreign key, our denormalization patterns must in some way work with foreign keys.

The First Pattern: FETCH

Consider a shopping cart that has a column "sku" and another column "price." Most programmers lay out these tables and write some code that copies the price from the ITEMS table to the ORDER_LINES table. I call this pattern the "FETCH" because the price is FETCHed from the ITEMS table and written into the ORDER_LINES table.

Most programmers code up FETCH operations all over the place and do not ever realize they are denormalizing. I think this pattern is just so natural that most of us never think about it. If you examine your database applications you will likely see that you are doing this all over the place.

In order to get this pattern to operate correctly, your framework must make sure at very least that the SKU is not null when an INSERT is made to ORDER_LINES, and that the price is copied during the INSERT. You can maintain correctness by not allowing users to change the SKU on this table, if they change their minds they must delete a line and enter a new one. Or, you can make your framework a little more flexible and execute the FETCH again if the SKU changes on an UPDATE.

Sidebar: Is FETCH Really Denormalizing?

Die-hard relational theorists will tell you not to copy price from the items table. You are supposed to leave it where it belongs and use JOINs to pick up the price when it is needed. There are three arguments against this sort of purity.

The first practical argument is that it is horribly difficult to deal with complex calculations this way. It is far easier to copy the price when the line goes in, so you never have to "go looking" for it again.

The second practical argument is that performance tanks if you follow the die-hard relational approach. If you have to look in 6 tables every time somebody refreshes their cart you will have a much slower program than one that only has to look in one table.

But the third argument is more theoretical, and it is this: the FETCH is not really denormalizing. The idea is that when the customer makes an order your store has entered in an agreement to sell something at a particular price. That price is stored on the order and is now a fact about that order. It is not redundant because you are not storing the SKU's generic price, you are only storing the price that that customer is going to pay on this order. If the price changes 5 minutes after the customer places the order, they will expect to get the price as it was when they put it in the cart, and so you are actually doing the right thing by writing it to the order.

The Second Pattern: Aggregations

The FETCH that was described above is all about copying a value from a parent table to a child table. The opposite pattern occurs when you roll up values from children to parents. These are usually done as totals (SUMS), counts, averages, minimums, and maximums.

Looking at the ORDER_LINES again, if a customer has 3 items in their cart, it is perfectly natural to most programmers to put a column "PRODUCT_TOTAL" onto their ORDERS table that holds the sum of all of the lines. This is called an aggregation, because the result in the parent table is always some operation performed on the aggregation of all of the child rows.

Aggregrations are always denormalizing because they are values that could be derived from other values. To be specific, an aggregration violates third normal form because it introduces a non-key dependency - a value that is dependent not on the key but on values from a completely different table!

In order to make sure this value is always correct, the framework must always update the total on the parent table when any line in the child table changes. If your framework can do that successfully, your aggregations will always be correct.

The Third Pattern: EXTEND

The first two patterns we saw dealt with foreign keys. The first pattern, the FETCH, involves values travelling "downward" on a foreign key from parent to child. The second pattern involves values travelling "upward" on a foreign key from child to parent. The third and final denormalizing pattern involves calculated values within a row.

The example at the beginning of this essay was the column EXTENDED_PRICE, which holds the value of PRICE * QTY. This is an EXTEND operation, because it extends a row by adding a new redundant value. This is denormalizing because it violates third normal form, it introduces a value that is not dependent on any candidate key.

If you want to makes sure your EXTENDs are always correct then you need a framework that will always update the calculation when either of its dependent values changes.

Dependency Tracking

In describing the three denormalizing patterns above, I have explained what you need to make sure each one is performed successfully. There is a final requirement to keeping all of this correct, which is that the operations must be performed in the proper order.

Considering the shopping cart again, in particular the ORDER_LINES table, these three operations must occur in this order:

  1. The PRICE is FETCHed
  2. The EXTENDED_PRICE is calculated as an EXTEND
  3. The ORDERS table's PRODUCT_TOTAL value is adjusted.

Your framework must have a reasonable way to make sure that the operations are performed in the correct order, or they will not give the correct result. As a rule of thumb, in most systems the FETCHes come first, followed by the EXTENDs, and then the aggregations.

Meta-data can be a big help here. When I first contemplated these patterns about four years ago, it occurred to me that they could all be stored as formulas in the basic description of the database, and that a code generator would sequence them for me and generate the code, so that the operations would always occur in the correct order. I wrote the basic system in the fall of 2004 and have found it to work extremely well ever since. In my personal opinion, this is the only way to reliably handle these patterns.

Conclusion: Denormalization Also Follows Patterns

A fully normalized database makes it easy to get data in correctly, but makes it difficult to get it out. Denormalizing is the process of taking a normalized database and deliberately introducing redundancies to improve query writing, performance and correctness. Not surprisingly, denormalization has its own patterns. Two of these follow the foreign key, and the third one works inside of a single row. If you follow these patterns and fashion your framework to keep them correct, you get all of the benefits of denormalization without the concern for bad data.

Next Week: Minimize Code, Maximize Data.

Sunday, April 20, 2008

Advanced Table Design: Resolutions

Welcome to the Database Programmer! Every Monday morning this blog contains a new essay. This blog is for people who want to learn the practical realities of databases. Topics range from simple to advanced.

The main theme of these essays is that your applications will be leaner, faster, and easier to write and maintain if you understand how databases work.

The complete table of contents is here.

The Resolution Pattern

A resolution is an interesting database pattern because it involves both table design and a very specific form of SELECT statement. Just getting the tables down is not enough, you have to know how to write the SELECT that will return the correct value.

Basic Description and Example

A resolution pattern occurs when you need a value and there is more than one place where it might be. As an example, consider the case of a computer services shop that provides complete IT services, including programming. In their billing system, they have a simple table that lists the rates for their various activities.

ACTIVITY  | RATE
----------+-------
ITGENERAL | 100
PROJMGT   | 200
SOFTWARE  | 150

This is simple enough, but now suppose that you have a particular employee that you bill out at $175.00/hour for software development. This makes the picture a little more complicated. But suppose that it gets more complicated, suppose that you enter into an arrangement with a particular customer to do volume software development for them for $135.00. And just to make it interesting, suppose you have a very specific arrangement with a particular customer to provide the services of a particular employee for $185.00 for project management.

With this many possible billing arrangements, your super-simple invoicing program is suddenly not so simple. On any particular invoice line, you must resolve the actual hourly billing rate out of several possibilities. Because you must resolve the value, this pattern is called a resolution.

Precise Description of the Resolution Pattern

A resolution pattern has these characteristics:

  • The goal of a resolution is to find a particular value. In our example this is a billing rate.
  • Resolutions examine multiple possible values and pick the first match according to precedence.
  • Precedence usually begins with the most specific and falls back to the most general. In our example the most specific possible rate is defined for a customer-activity-employee, while the most general is the default rate for an activity.

Resolutions are not always easy to recognize. Mostly this is because customers do not tell you "we have a resolution." Instead they tell you they have a billing rate. The explanation of the special overrides for employees comes in a different conversation, and perhaps to a different member of your team. Then later comes the explanation of the other overrides. The resolution only becomes apparent when the various requirements are all sorted out and put next to each other. Then somebody says, "Hey, there are four different formulas for the billing rate!" Then you know you have a resolution.

The Table Design

A resolution requires one table for each possible level of detail where a value might be supplied. In our example there will be a table for:

  • Rates by activity-customer-employee
  • Rates by activity-customer
  • Rates by activity-employee
  • Final default values by activity

These table only contain values when they are relevant. The table of activity-customer does not contain a row for every possible combination of activities and customers, it only contains a row when there has been some agreement to provide an activity to a specific customer for a special rate.

Here are the tables:

ACTIVITY | CUSTOMER | RATE
---------+----------+------
PROJMGT  | PRAXIS   |  225
SOFTWARE | PRAXIS   |  235


ACTIVITY | EMPLOYEE | RATE
---------+----------+------
PROJMGT  | SRUSSEL  |  225


ACTIVITY | EMPLOYEE | CUSTOMER | RATE
---------+----------+----------+------
PROJMGT  | HIROKO   | PRAXIS   |  250

Resolving In Client Code Will Kill Performance

Now consider that the there is a table somewhere that is used to drive billing. Maybe the employees themselves record their time in this table, or maybe some clerical staff member is entering them. Whoever puts them in, each record has an activity, an employee, and a customer (and of course hours). You need to write a program that finds the correct billing rate for each row.

A die-hard code grinder will do all of this in the client. He will write a query to pull all of the rows from the time entry table. Then he will loop through these rows. For each line he will query the server for a the most detailed value, activity-employee-customer. If it is not found he will do a second query for the next table in line, and so forth. This will be a performance disaster because his program will be making a huge number of round trips to the server. If he understood the LEFT JOIN he would need only one trip to the server.

First Stab with A LEFT JOIN

Here is a query that does most of what we need for the resolution:

SELECT ol.activity,ol.employee,ol.customer
      ,aec.rate as aec_rate
      ,ac.rate  as ac_rate
      ,ae.rate  as ae_rate
      ,a.rate
  FROM orderlines ol
  LEFT JOIN act_emp_cust_rates     aec 
    ON ol.activity = aec.activity
   AND ol.customer = aec.customer
   AND ol.employee = aec.employee
  LEFT JOIN act_cust_rates         ac
    ON ol.activity = ae.activity
   AND ol.customer = ae.customer
  LEFT JOIN act_emp_rates          ae 
    ON ol.activity = aec.activity
   AND ol.employee = aec.employee
  JOIN activities                  a 
    ON ol.activity = a.activity  
 WHERE (....relevant search conditions....)

The LEFT JOIN tells the server to return all matching rows from the orderlines table, even if there is no match in the various override tables. The above query will return something like this:

ACTIVITY | EMPLOYEE | CUSTOMER | AEC_RATE | AC_RATE | AE_RATE | RATE
---------+----------+----------+----------+---------+---------+------
PROJMGT  | HIROKO   | PRAXIS   |      250 |  null   |  null   |  200
PROJMGT  | NIRGAL   | PRAXIS   |     null |   225   |  null   |  200
SOFTWARE | SRUSSEL  | PRAXIS   |     null |   235   |  null   |  150
PROJMGT  | SRUSSEL  | GE       |     null |  null   |   225   |  200
PROJMGT  | SRUSSEL  | NASA     |     null |  null   |   225   |  200
SOFTWARE | HIROKO   | PRAXIS   |     null |   235   |  null   |  150
SOFTWARE | HIROKO   | GE       |     null |  null   |  null   |  150

The Final Form of the Query

The first form of the query returns all four possible rates, and the effect of a LEFT JOIN is to have a NULL value where there was no match on the right side.

We can do better than this and return the actual rate by using a COALESCE function. A COALESCE allows us to list two or more values, and the function returns the first one that is not null. This lets us return the actual resolved value from the server:

SELECT ol.activity,ol.employee,ol.customer
      ,COALESCE(aec.rate,ac.rate,ae.rate,a.rate) as rate
  FROM orderlines ol
  LEFT JOIN act_emp_cust_rates     aec 
    ON ol.activity = aec.activity
   AND ol.customer = aec.customer
   AND ol.employee = aec.employee
  LEFT JOIN act_cust_rates         ac
    ON ol.activity = ae.activity
   AND ol.customer = ae.customer
  LEFT JOIN act_emp_rates          ae 
    ON ol.activity = aec.activity
   AND ol.employee = aec.employee
  JOIN activities                  a 
    ON ol.activity = a.activity  
 WHERE (....relevant search conditions....)

...which gives us the complete answer:

ACTIVITY | EMPLOYEE | CUSTOMER | RATE 
---------+----------+----------+------
PROJMGT  | HIROKO   | PRAXIS   |  250 
PROJMGT  | NIRGAL   | PRAXIS   |  225 
SOFTWARE | SRUSSEL  | PRAXIS   |  235 
PROJMGT  | SRUSSEL  | GE       |  225 
PROJMGT  | SRUSSEL  | NASA     |  225 
SOFTWARE | HIROKO   | PRAXIS   |  235 
SOFTWARE | HIROKO   | GE       |  150 

Conclusion: Tables and Queries Go Together

We have seen this week our first table design pattern that requires a certain form of query. This shows us as well that queries themselves will fall into patterns, and we will definitely see more of these patterns in future essays.

Monday, April 14, 2008

JOINS Part Two, The Many Forms of JOIN

Welcome to the Database Programmer, the blog for anybody who wants to learn the practical realities of working with databases.

There is a new entry every Monday morning, and the complete table of contents is here. This week we are in a series on queries.

The Many Kinds of JOIN

When a programmer first decides to start learning SQL, the JOIN always appears simple at first. But it can produce some unexpected results. Sometimes a JOIN brings back more rows than the novice thinks it should, and sometimes less. This week we are going to concentrate on all of the various results that are produced by the JOIN.

A JOIN always operates on the same principle: it returns one row for each combination of rows from both sides of the match.

A Parent-Child JOIN Returns Number of Matching Children

Consider the following two simple tables:

CUSTOMER | ZIP               CUSTOMER | ORDER | DATE
---------+---------          ---------+-------+----------
 1234    | 11733              1234    |    57 | 3/1/08
 5283    | 77074              1234    |    78 | 3/15/08
                              1234    |    89 | 4/07/08
                              5283    |    23 | 2/13/08
                              5283    |    32 | 3/17/08

When you JOIN these two tables together, there are three combinations that match for customer 1234, and two combinations that match on 5283, so this query:

SELECT customers.zip
      ,orders.customer,orders.order,orders.date
  FROM customers
  JOIN orders    ON customers.customer = orders.customer

...will return 5 rows. Values from the child table appear once, and values from the parent tables repeat. The return count matches the number of rows in the child table:

 ZIP  | CUSTOMER | ORDER | DATE
------+----------+-------+----------
11733 |  1234    |    57 | 3/1/08
11733 |  1234    |    78 | 3/15/08
11733 |  1234    |    89 | 4/07/08
77074 |  5283    |    23 | 2/13/08
77074 |  5283    |    32 | 3/17/08

The Parent-Child And a Left JOIN

Now we consider the case when our Sales Manager comes down the hall and asks for a list of customers in zip code 11101, with their total sales. So you have this information in the database:

CUSTOMER | ZIP               CUSTOMER | ORDER | DATE    | AMOUNT
---------+---------          ---------+-------+---------+-------
 1234    | 11101              1234    |    57 | 3/1/08  |   25
 5283    | 11101              1234    |    78 | 3/15/08 |   35
 2938    | 11101              1234    |    89 | 4/07/08 |   45
                              5283    |    23 | 2/13/08 |   55
                              5283    |    32 | 3/17/08 |   65

...and you write this query:

SELECT c.customer,sum(o.amount) as amount
  FROM customers c
  JOIN orders    o ON c.customer = o.customer
 WHERE customer.zip = 11101

...and the result only has two rows:

CUSTOMER | AMOUNT   
---------+--------- 
 1234    |   105
 5283    |   120

The novice database programmer will look at this and say, "I don't get it, my filter says zip 11101, but I only got two rows back." This is because the JOIN clause acted as a filter, because the JOIN only returns where there is a match on both sides.

If you want to get all of the rows from the left side of the JOIN, even if they have no match on the right side, you use a LEFT JOIN:

SELECT c.customer,sum(o.amount) as amount
  FROM customers c
  LEFT JOIN orders o ON c.customer = o.customer
 WHERE customer.zip = 11101

...and now the answer is:

CUSTOMER | AMOUNT   
---------+--------- 
 1234    |   105
 5283    |   120
 2938    |  null

Of course we probably do not want that "null" because it messes up calculations, so you can use the COALESCE() function to replace nulls with zeroes. The COALESCE() function accepts a list of values and returns the first non-null:

SELECT c.customer,sum(COALESCE(o.amount,0)) as amount
  FROM customers c
  LEFT JOIN orders  o ON c.customer = o.customer
 WHERE customer.zip = 11101

...and this completes the picture:

CUSTOMER | AMOUNT   
---------+--------- 
 1234    |   105
 5283    |   120
 2938    |     0

You Can JOIN a Table to Itself

Recently a customer of mine dropped by with a problem he had in an insurance database. He had a table that listed various companies and their total premiums collected by year. He is required by various regulation to report the premiums as income at the rate of 28% in the year they are collected, with the remaining 72% being reported in the following year. It did not appear readily obvious how to do this.

This is a case of self-JOIN, where you JOIN a table to itself to match rows to each other. In this case we created the following query:

SELECT tab1.year
      ,tab1.premium * .28 as premium1
      ,tab2.premium * .72 as premium2
  FROM premiums tab1
  JOIN premiums tab2 ON tab1.company = tab2.company
 WHERE tab2.year = tab1.year + 1

Notice that the JOIN condition only matches company to company, and an additional WHERE clause was required to limit the matches. Technically we would prefer to see that second condition in the JOIN clause, but most database servers do not support that. It is a little more confusing to a newbie as written, but not terribly so.

Another SELF JOIN and a FULL OUTER JOIN

Another customer of mine called recently to complain that his legacy system is not doing something right. He said there was information missing.

In his business, which is magazine distribution to retailers, he has a table called "DEFAULTS" that lists the default number of each magazine given to each store. Sometimes he wants to see the defaults for two magazines side by side, for all of the stores they are delivered to. If his defaults table looked like this:

MAGAZINE | STORE | DEFAULT
---------+-------+--------
123      | A     |   5
123      | B     |  10
123      | C     |   7
456      | A     |   6
456      | D     |   3

...and he wanted to see magazines 123 and 456 side by side, he would expect output like this:

STORE | MAG 123  |  MAG 456 
------+----------+----------
A     |    5     |    6
B     |   10     |    0
C     |    7     |    0
D     |    0     |    3

Here is how you can recognize this request as needing a self-join. If it appears that you need two separate filters for the same table, and the two filters result in values that are meant to sit side-by-side, then you have a self-join. Our first stab at the query would be:

SELECT d1.store
      ,d1.default as mag123
      ,d2.default as def456
  FROM defaults d1
  JOIN defaults d2 on d1.store = d2.store
 WHERE d1.magazine = 123
   AND d2.magazine = 456

This is the query I found in his program, and as soon as I saw it his complaint made sense. He said basically, "There are only 5 rows for magazines x and y and there should be 500." The problem goes back to the fact that a JOIN limits the answer to those cases where there is a match on both sides. The query above would give only this:

STORE | MAG 123  |  MAG 456 
------+----------+----------
A     |    5     |    6

...because only store A has a row for both magazines. What we want is every entry for each magazine for either store even if there is no matching entry in the other store. The LEFT JOIN we saw above would help, but would only do half the job, it would give us:

STORE | MAG 123  |  MAG 456 
------+----------+----------
A     |    5     |    6
B     |   10     |    0
C     |    7     |    0

What we need is a FULL OUTER JOIN, which is basically a LEFT JOIN and a RIGHT JOIN. Its a LEFT JOIN that goes both ways. If we add in our COALESCE() functions the final query is:

SELECT d1.store
      ,COALESCE(d1.default,0) as mag123
      ,COALESCE(d2.default,0) as def456
  FROM defaults d1
  FULL OUTER JOIN defaults d2 on d1.store = d2.store
 WHERE d1.magazine = 123
   AND d2.magazine = 456

Conclusion: That your JOIN May Be Complete

The JOIN is a very powerful clause. The JOIN is the most-used method for combining information from multiple tables together. This week we saw that the JOIN has provisions for joining a table to itself, and for handling cases where the matches may not be all present.

Next week we will see an example of a table design pattern that includes a required query structure in its definition.

Sunday, April 6, 2008

GROUP BY, HAVING, SUM, AVG, and COUNT(*)

Welcome to the Database Programmer!

Good programming skills do not lead magically to good database skills. Masterful use of the database requires knowledge of the database in its own terms. Step 1 is knowing your table design patterns, and Step 2 is knowing how to fashion efficient queries. Learning how to code good queries can lead to faster performance and better application code.

There is a new entry in this series every Monday morning, and the Complete Table of Contents is here.

Aggregation

You can use a SQL SELECT to aggregate data. Aggregation combines rows together and performs some operation on their combined values. Very common aggregations are COUNT, SUM, and AVG.

The simplest use of aggregations is to examine an entire table and pull out only the aggregations, with no other columns specified. Consider this SQL:

SELECT COUNT(*) as cnt
      ,SUM(sale_amount) as sum
      ,AVG(sale_amount) as avg
  FROM orders

If you have a very small sales order table, say about 7 rows, like this:

ORDER |  DATE      | STATE | SALE_AMOUNT
------+------------+-------+-------------
 1234 | 2007-11-01 | NY    |       10.00
 1235 | 2007-12-01 | TX    |       15.00
 1236 | 2008-01-01 | CA    |       20.00
 1237 | 2008-02-01 | TX    |       25.00
 1238 | 2008-03-01 | CA    |       30.00
 1237 | 2008-04-01 | NY    |       35.00
 1238 | 2008-05-01 | NY    |       40.00

Then the simple query above produces a one-row output:

CNT  | SUM  | AVG
-----+------+-----
  7  | 175  |  25

Some Notes on The Syntax

When we use COUNT(*) we always put the asterisk inside.

I have used the "as SUM" to specify a column name of the output. Without that I will get whatever the database server decides to call it, which will vary from platform to platform, so it is a good idea to learn to use the "AS" clause. Some folks would frown at using "SUM" as the name, since that is the name of the function and might be confusing, but I think we're all big kids and we can probably handle it.

The WHERE Clause Does What You Think

If you want to get just the sales from New York state, you can put a WHERE clause in:

SELECT COUNT(*) as cnt
      ,SUM(sale_amount) as sum
      ,AVG(sale_amount) as avg
  FROM orders
 WHERE state = 'NY'

...and you will get only the results for NY:

CNT | SUM  | AVG
----+------+----------
  3 |  85  |  28.33333

Notice of course that the average has a repeating decimal. Most databases have a ROUND function of some sort, so I can correct that with:

SELECT COUNT(*) as cnt
      ,SUM(sale_amount) as sum
      ,ROUND(AVG(sale_amount),0) as avg
  FROM orders
 WHERE state = 'NY'

The Fun Begins With GROUP BY

The query above is fine, but it would be very laborious if you had to issue the query (or write a program to do it) for every possible state. The answer is the GROUP BY clause. The GROUP BY clause says that the aggregations should be performed for the distinct values of a column or columns. It looks like this:

SELECT state,
      ,COUNT(*) as cnt
      ,SUM(sale_amount) as sum
      ,ROUND(AVG(sale_amount),0) as avg
  FROM orders
 GROUP BY state

Which gives us this result:

STATE | CNT | SUM  | AVG
------+-----+------+----
NY    |  3  |  85  |  28
TX    |  2  |  40  |  20
CA    |  2  |  50  |  25  

Note that if you try to include a column that you are not grouping on, such as zip code, most database servers will reject the query because there may be different values of zip code for the same value of state, and they have no way to know which one to pick for a given value of state.

HAVING Clause is Like WHERE after GROUP BY

The HAVING clause lets us put a filter on the results after the aggregation has taken place. If your Sales Manager wants to know which states have an average sale amount of $25.00 or more. Now our query looks like this:

SELECT state,
      ,COUNT(*) as cnt
      ,SUM(sale_amount) as sum
      ,ROUND(AVG(sale_amount),0) as avg
  FROM orders
 GROUP BY state
HAVING AVG(sale_amount) >= 25

Which gives us this result, notice that Texas is now missing, as they were just not selling big enough orders (sorry 'bout that Rhonda).

STATE | CNT | SUM  | AVG
------+-----+------+----
NY    |  3  |  85  |  28
CA    |  2  |  50  |  25  

The Hat Trick: All Three

You can pull some pretty nice results out of a database in a single query if you know how to combine the WHERE, GROUP BY, and HAVING. If you have ever worked with a Sales Manager, you know they constantly want to know strange numbers, so let's say our Sales Manager says, "Can you tell me the average order size by state for all orders greater than 20? And don't bother with any average less 30.00" We say, "Sure, don't walk away, I'll print it out right now."

SELECT state
      ,COUNT(*)
      ,SUM(sale_amount) as sum
      ,ROUND(AVG(sale_amount) as avg
  FROM orders
 WHERE sale_amount > 20
 GROUP BY state
HAVING avg(sale_amount) >= 30

How to Do a Weighted Average

Consider the case of a table that lists test, homework and quiz scores for the students in a certain course. Each particular score is worth a certain percentage of a student's grade, and the teacher wants the computer to calculate each student's file score. If the table looks like:

STUDENT     | WEIGHT | SCORE
------------+--------+-------
NIRGALAI    |     40 |    90
NIRGALAI    |     35 |    95
NIRGALAI    |     25 |    85
JBOONE      |     40 |    80
JBOONE      |     35 |    95
JBOONE      |     25 |    70
PCLAYBORNE  |     40 |    70
PCLAYBORNE  |     35 |    80
PCLAYBORNE  |     25 |    90

Then we can accomplish this in one pull like so:

SELECT student
      ,SUM(weight * score) / 100 as final
  FROM scores
 GROUP BY student

The nice thing about this query is that it works even if data is missing. If a student missed a test, they automatically get a zero averaged in.

Conclusion: Queries Are Where It's At

The only reason to put data into a database is to take it out again. The modern database has powerful strategies for ensuring the correctness of data going in (the primary key, foreign key and other constraints) and equally powerful tools for pulling the data back out.

Next Week: Joins Part Two, The Many Forms of JOIN