Sunday, April 6, 2008

GROUP BY, HAVING, SUM, AVG, and COUNT(*)

Welcome to the Database Programmer!

Good programming skills do not lead magically to good database skills. Masterful use of the database requires knowledge of the database in its own terms. Step 1 is knowing your table design patterns, and Step 2 is knowing how to fashion efficient queries. Learning how to code good queries can lead to faster performance and better application code.

There is a new entry in this series every Monday morning, and the Complete Table of Contents is here.

Aggregation

You can use a SQL SELECT to aggregate data. Aggregation combines rows together and performs some operation on their combined values. Very common aggregations are COUNT, SUM, and AVG.

The simplest use of aggregations is to examine an entire table and pull out only the aggregations, with no other columns specified. Consider this SQL:

SELECT COUNT(*) as cnt
      ,SUM(sale_amount) as sum
      ,AVG(sale_amount) as avg
  FROM orders

If you have a very small sales order table, say about 7 rows, like this:

ORDER |  DATE      | STATE | SALE_AMOUNT
------+------------+-------+-------------
 1234 | 2007-11-01 | NY    |       10.00
 1235 | 2007-12-01 | TX    |       15.00
 1236 | 2008-01-01 | CA    |       20.00
 1237 | 2008-02-01 | TX    |       25.00
 1238 | 2008-03-01 | CA    |       30.00
 1237 | 2008-04-01 | NY    |       35.00
 1238 | 2008-05-01 | NY    |       40.00

Then the simple query above produces a one-row output:

CNT  | SUM  | AVG
-----+------+-----
  7  | 175  |  25

Some Notes on The Syntax

When we use COUNT(*) we always put the asterisk inside.

I have used the "as SUM" to specify a column name of the output. Without that I will get whatever the database server decides to call it, which will vary from platform to platform, so it is a good idea to learn to use the "AS" clause. Some folks would frown at using "SUM" as the name, since that is the name of the function and might be confusing, but I think we're all big kids and we can probably handle it.

The WHERE Clause Does What You Think

If you want to get just the sales from New York state, you can put a WHERE clause in:

SELECT COUNT(*) as cnt
      ,SUM(sale_amount) as sum
      ,AVG(sale_amount) as avg
  FROM orders
 WHERE state = 'NY'

...and you will get only the results for NY:

CNT | SUM  | AVG
----+------+----------
  3 |  85  |  28.33333

Notice of course that the average has a repeating decimal. Most databases have a ROUND function of some sort, so I can correct that with:

SELECT COUNT(*) as cnt
      ,SUM(sale_amount) as sum
      ,ROUND(AVG(sale_amount),0) as avg
  FROM orders
 WHERE state = 'NY'

The Fun Begins With GROUP BY

The query above is fine, but it would be very laborious if you had to issue the query (or write a program to do it) for every possible state. The answer is the GROUP BY clause. The GROUP BY clause says that the aggregations should be performed for the distinct values of a column or columns. It looks like this:

SELECT state,
      ,COUNT(*) as cnt
      ,SUM(sale_amount) as sum
      ,ROUND(AVG(sale_amount),0) as avg
  FROM orders
 GROUP BY state

Which gives us this result:

STATE | CNT | SUM  | AVG
------+-----+------+----
NY    |  3  |  85  |  28
TX    |  2  |  40  |  20
CA    |  2  |  50  |  25  

Note that if you try to include a column that you are not grouping on, such as zip code, most database servers will reject the query because there may be different values of zip code for the same value of state, and they have no way to know which one to pick for a given value of state.

HAVING Clause is Like WHERE after GROUP BY

The HAVING clause lets us put a filter on the results after the aggregation has taken place. If your Sales Manager wants to know which states have an average sale amount of $25.00 or more. Now our query looks like this:

SELECT state,
      ,COUNT(*) as cnt
      ,SUM(sale_amount) as sum
      ,ROUND(AVG(sale_amount),0) as avg
  FROM orders
 GROUP BY state
HAVING AVG(sale_amount) >= 25

Which gives us this result, notice that Texas is now missing, as they were just not selling big enough orders (sorry 'bout that Rhonda).

STATE | CNT | SUM  | AVG
------+-----+------+----
NY    |  3  |  85  |  28
CA    |  2  |  50  |  25  

The Hat Trick: All Three

You can pull some pretty nice results out of a database in a single query if you know how to combine the WHERE, GROUP BY, and HAVING. If you have ever worked with a Sales Manager, you know they constantly want to know strange numbers, so let's say our Sales Manager says, "Can you tell me the average order size by state for all orders greater than 20? And don't bother with any average less 30.00" We say, "Sure, don't walk away, I'll print it out right now."

SELECT state
      ,COUNT(*)
      ,SUM(sale_amount) as sum
      ,ROUND(AVG(sale_amount) as avg
  FROM orders
 WHERE sale_amount > 20
 GROUP BY state
HAVING avg(sale_amount) >= 30

How to Do a Weighted Average

Consider the case of a table that lists test, homework and quiz scores for the students in a certain course. Each particular score is worth a certain percentage of a student's grade, and the teacher wants the computer to calculate each student's file score. If the table looks like:

STUDENT     | WEIGHT | SCORE
------------+--------+-------
NIRGALAI    |     40 |    90
NIRGALAI    |     35 |    95
NIRGALAI    |     25 |    85
JBOONE      |     40 |    80
JBOONE      |     35 |    95
JBOONE      |     25 |    70
PCLAYBORNE  |     40 |    70
PCLAYBORNE  |     35 |    80
PCLAYBORNE  |     25 |    90

Then we can accomplish this in one pull like so:

SELECT student
      ,SUM(weight * score) / 100 as final
  FROM scores
 GROUP BY student

The nice thing about this query is that it works even if data is missing. If a student missed a test, they automatically get a zero averaged in.

Conclusion: Queries Are Where It's At

The only reason to put data into a database is to take it out again. The modern database has powerful strategies for ensuring the correctness of data going in (the primary key, foreign key and other constraints) and equally powerful tools for pulling the data back out.

Next Week: Joins Part Two, The Many Forms of JOIN

9 comments:

Anonymous said...

This is so timely. You saved me countless hours of work.

Well I would probably have found the solution somewhere else, but your page was high up in the google result list ("group by" avg), and your explanation is easy to understand even for somebody whose DB experience consists of a few attempts to use MS Access.

Many thanks.

KenDowns said...

anonymous: glad to hear it!

rob hellest said...

Is it possible to add values returned from a count function? My count function works properly, and I need to add the values resulting from the count function.

Can the sum function be used on a count function? - example sum(count(*)) - not sure of the actual syntax. Thanks in advance!

See below for code sample -

select unique brtm.BROKER_TEAM_NAME, shhd.FILE_NO,
(select count(shhd.BROKER_TEAM) from tpsdba.shipment_header shhd
where brtm.BROKER_TEAM = shhd.BROKER_TEAM
and (shhd.DATE_ENTRY >= '20080401' and shhd.DATE_ENTRY <= '20080404') )
as NumEntries,
(select Count(*) from tpsdba.ci_lines cidt
where cidt.FILE_NO=shhd.FILE_NO) As CINum
from tpsdba.broker_team brtm
inner join
tpsdba.shipment_header shhd
on
brtm.BROKER_TEAM = shhd.BROKER_TEAM
where brtm.BROKER_TEAM = 'KXH'
and (shhd.DATE_ENTRY >= '20080401' and shhd.DATE_ENTRY <= '20080404')

KenDowns said...

Rob: In principle yes, you can sum counts. Your example shows a sub-query, which I have not written up on this series yet, but in short you can do a count in a subquery and then SUM the results in the outer query. Your example would be easier to work with if you stripped out everything except the count/sum columns, so off the top of my head you might have something like:


SELECT sum(thecount) as sum
from (select count(*)
as theCnt
from subtable
where...
) x

Yuri Sarlinski said...

Ken....u kick ass!

Anonymous said...

help please is this possible the scenario is this....

say i have a table named employee with the ff values.

employeesal employee employeegroup
100 a AA
200 b AA
300 c AB
400 d AB
500 e AA

i need to count employee by employeegroup and after that when i count them i need each row of count of employeegroup divided by the sum of employeesal.

this is my query...

select employeegroup, count(employeegroup), employeesal/select count(employeegroup) ... as total
.....


i cant seem to get it correctly the result of the toal is zero is there any work around query for this please email me thanks so much sherwin@pickle.ph

Anonymous said...

Your explanations and examples are clear and straightforward - hard to find sometimes.

I ran into a weirdness (or maybe not) recently. A co-worker used GROUP BY...HAVING but reversed the lines, as in

SELECT col1, count(1)
FROM tab1
HAVING count(1) > 1
GROUP BY col1

Strangely, it worked!

KenDowns said...

@anonymous: different servers enforce different rules for the ordering of the phrases. Back in foxpro you could mix them up any way you wanted.

Peggy said...

This is great information. Now what I don't understand is how to white the code/php that will display this information