DBMS, December 1997
DBMS Online: SQL for Smarties By Joe Celko

Thinking in SQL

NIST, DAMA, And SQL3...Oh My!


The National Committee for Information Technology Standards (NCITS; www.ncits.org) H2's secretary, Mike Gorman, has a way for the database user community to ask for things they want to have in SQL conformance requirements. Mike has a database application that you can grab off the Web and use to understand the relationships between SQL3's many features. Better yet, you can use this application to identify those SQL features that you absolutely, positively must have in the Standard for your applications. The goal is to distinguish those from other features that you could use but could also "program around" their absence, those features that you might use once in a blue moon, and those features that look like fun but you cannot imagine how you'd put them to practical use.

I urge you to go to Mike's company's (Whitemarsh Information Systems Corp.) Web site at www.wiscorp.com, find the application that lets you explore those feature relationships, and build up your feedback on which features you believe fit into which category. Your feedback will then be used as input to NCITS H2's discussions on SQL3's conformance requirements, whether leveling, packaging, or some other approach is used. Your confidentiality will be ensured as much as possible.

NIST and Validation Tests

While you are at Mike's Web site, you might also want to look at his adventures with the National Institute of Standards and Technology (NIST; www.nist.gov). He spoke to the Data Management Association (DAMA; www.dama.org) on September 12, 1997 about the cost and politics of NIST's decision to drop independent conformance testing of SQL products (FIPS 127-2, SQL and FIPS 128-2) back on July 1, 1997.

NIST formally protested Mike's appearance on the program to the DAMA board in Washington. DAMA reacted appropriately, and he spoke. The attendees were shocked and amazed. They had no idea and just could not understand why NIST would drop a program that cost only $600,000 and saved the Federal Government more than $30 million.

NIST has placed information about its SQL Project on the Internet at ftp://speckle.ncsl.nist.gov/sql-testing/contents_sql.htm. This site covers the following topics: description of SQL Standards, where to buy SQL Standards documents, FIPS PUB 127-2, how to download the SQL Test Suite, validated products list of tested SQL products, sample procurement language (solicitation wording), SQL testing laboratories, and a directory of SQL Testing Update newsletters since 1994.

Thinking in SQL

A method is what you do when you do not know what you should do next. When I teach a SQL class, one of the major problems is getting students to "unlearn" what they already know from procedural languages. One method I use is to stress thinking in terms of sets and not in terms of "one record at a time" processing.

For example, when you name a table that contains entities, use a collective or plural noun that implies an abstract set of a particular kind of entity. This means that calling a table Employee is not acceptable; calling it Employees is better, but the preferred name would be Personnel. "Employee" is singular; if you use this you have named the table after the name of a single record. And I hope that you have more than one employee working for your company. "Employees" conjures up a mental image of actual distinct people with arms, legs, names, and faces. This is useful, but it is not really abstract enough. "Personnel" is clearly an abstract class with no distinct people in its mental image. Just semantics? Sure, but it helps.

Likewise, I instruct people to ask set-oriented questions. For example, do not ask, "Who is John's boss?," but rather ask, "Who is in the set of John's superiors?" One method for building a query is to start with the SELECT clause and write down the columns or calculations desired in the result set rows. Then fill in the SELECT clause and just focus on the aggregates. All aggregates based on the same grouping, or subsets of the same grouping, can be retrieved simultaneously in one query. Usually the aggregates are explicitly stated and are easy to identify, but not always.

There is a surprising amount of misunderstanding as to how GROUP BY and HAVING clauses work. They are not like the old report writers that gave you aggregate results at the end of each group of records that you read out of a file.

A GROUP by clause first produces a partition of the table, which means it divides the original set into subsets such that the union of the subsets returns the original set and the intersection of the subsets is empty. These partitions are defined by having the same values in the columns listed in the GROUP BY clause. By convention in SQL, the NULLS are treated as equal to each other and therefore fall into one group.

Next, the original table is destroyed, demolished, disposed of, gone. Each group is reduced to a single row in the result table. This result table is called a grouped table, and all operations are now defined on its rows. The rows of the grouped table can only contain group attributes:

Some of his techniques are contrary to what most people do. In fact, some of his techniques are contrary to the way the language is taught in most classes and textbooks. I would like to share some of his ideas and present some supporting examples. He starts with three objectives for a solution:

The method is to ignore the select clause initially and follow these steps:

There is a strong temptation to write correlated subqueries to satisfy each part of a problem specification. This lets you read the specification and translate the English into SQL one sentence at a time. But correlated subqueries are expensive to execute and are often hard to read. A correlated subquery that has to be executed once for every row in the inner query is a real performance hog. A smart optimizer will attempt to "flatten" a query to eliminate the correlated subqueries, but this is not always possible. Let's take a look at some examples.

These examples will be based on a Personnel table, defined and populated as follows:

CREATE TABLE Personnel
(emp_nbr INTEGER NOT NULL PRIMARY KEY, 
name CHAR(12) NOT NULL, 
sex CHAR(1) NOT NULL, 
dept_nbr INTEGER NOT NULL, 
salary DECIMAL(12,2) NOT NULL);
Personnel
emp_nbr name sex dept_nbr salary 
==============================================
1001 Al M 501 30000.00
1002 Bob M 501 35000.00
1003 Charlie M 501 40000.00
1004 Anne F 501 30000.00
1005 Betty F 501 40000.00
1006 Cynthia F 501 50000.00
1007 Debbie F 501 60000.00
1008 Dave M 601 50000.00
1009 Ed M 601 60000.00
1010 Frank M 601 70000.00
1011 George M 601 80000.00
1012 Ellen F 601 50000.00
1013 Felicia F 601 60000.00
1014 Gail F 601 70000.00
1015 Howard M 701 65000.00
1016 Ira M 701 75000.00
1017 John M 701 85000.00
1018 Helen F 801 80000.00
1019 Jane F 801 90000.00
1020 Karen F 801 95000.00

Your problem is to find the departments that have more than two employees and an average salary of more than $61,000. Most SQL programmers have no problem generating a very straightforward solution:

SELECT dept_nbr
FROM Personnel
GROUP BY dept_nbr
HAVING COUNT(*) > 2
AND AVG(salary) > 61000

which produces:

Results
dept_nbr 
======== 
601 
701 
801 

But suppose we made a minor change to the problem: Find the departments that have more than two employees and an average salary for all male employees of more than $61,000. What we usually see is something like:

SELECT P1.dept_nbr
FROM Personnel AS P1
GROUP BY P1.dept_nbr
HAVING COUNT(*) > 2
AND (SELECT AVG(P2.salary)
FROM Personnel AS P2
WHERE P2.sex = 'M'
AND P2.dept_nbr = P1.dept_nbr) > 61000;

This is a perfectly valid solution and will produce the correct results:

Results
dept_nbr 
======= 
601 
701

But it is not a good solution. What are the aggregates in this problem, and what are they based on? The answer is that we want the COUNT of employees, by department and the average salary, by department for males. If we do a group by department for all employees, we can use a case statement to sift out only the male salaries.

SELECT dept_nbr
FROM Personnel
GROUP BY dept_nbr
HAVING COUNT(*) > 2
AND AVG(CASE WHEN sex = ıMı 
        THEN salary 
        ELSE NULL END) > 61000;

Remember that a null is dropped out of an aggregate function. We have eliminated the correlated subquery, reduced the code by half, and increased the performance. The CASE expression is going to be important in this technique. Now, let's make the problem more complicated. Find the departments that have more than two employees, an average salary for all male employees of more than $61,000, and a maximum female salary of more than $65,000. The typical solution will look something like:

SELECT P1.dept_nbr
  FROM Personnel AS P1
  GROUP BY P1.dept_nbr
HAVING COUNT(*) > 2
  AND (SELECT AVG(P2.salary)
    FROM Personnel AS P2
    WHERE P2.sex = 'M'
    AND P2.dept_nbr = P1.dept_nbr) > 61000
  AND (SELECT MAX(P3.salary)
    FROM Personnel AS P3
    WHERE P3.sex = 'F'
    AND P3.dept_nbr = P1.dept_nbr) > 65000;

Again, this will produce the correct results:

Results
dept_nbr     
========
601  

If we examine the aggregates we find we need the count of employees, by department and average salary by department for males and maximum salary by department for females. We can rewrite this query in the form:

SELECT dept_nbr
   FROM Personnel
  GROUP BY dept_nbr
HAVING COUNT(*) > 2
   AND AVG(CASE WHEN sex = 'M' 
        THEN salary 
        ELSE NULL END) > 61000
   AND MAX(CASE WHEN sex = 'F' 
        THEN salary 
        ELSE NULL END) > 65000;

Compare this to the subquery version. The real trick here is in using a CASE expression to build a subset from the rows in a common grouping. That lets us have aggregates for all employees, female employees, and male employees drawn from the same grouping and thereby eliminates the need for separate subqueries.

Try one more problem: Find the female employees who have a greater salary than the average salary of the males in their department. A common SQL-92 solution to this might make use of the scalar subquery expression and look like this:

SELECT P1.emp_nbr, P1.name, P1.salary, 
   (SELECT AVG(P2.salary)
   FROM Personnel AS P2
   WHERE P2.sex = 'M'
   AND P2.dept_nbr = P1.dept_nbr) AS male_avg_salary
 FROM Personnel AS P1
WHERE P1.sex = 'F'
 AND P1.salary > (SELECT AVG(P3.salary)
          FROM Personnel AS P3
          WHERE P3.dept_nbr = P1.dept_nbr
          AND P3.sex = 'M'));

Results
emp_nbr     name         salary    male_avg_salary
=================================================
1005        Betty        40000.00        35000.00                               
1006        Cynthia      50000.00        35000.00                               
1007        Debbie       60000.00        35000.00                               
1014        Gail         70000.00        65000.00       

This is a correct, but not very efficient, solution. Here we will eliminate the two correlated subqueries by doing a self join in the inner query. This query also uses only two copies of the Personnel table.

SELECT P1.emp_nbr, P1.name, P1.salary, 
   AVG(P2.salary) AS male_avg_salary
  FROM Personnel AS P1 
   INNER JOIN 
   Personnel AS P2
   ON P1.dept_nbr = P2.dept_nbr
  WHERE P1.sex = 'F'
   AND P2.sex = 'M'
  GROUP BY P1.emp_nbr, P1.name, P1.salary
HAVING P1.salary > AVG(P2.salary)

The key here that most people miss is in the GROUP BY and HAVING clauses. Once we GROUP BY emp_nbr, which is the PRIMARY KEY for the Personnel table, we can add as many columns as we want without affecting the results of the grouping. We can then reference the grouping column salary in the HAVING clause. You are not limited to just aggregate expressions in a HAVING clause.



Joe Celko is an Atlanta-based guru with Northern Lights Software Ltd. and a member of the ANSI X3H2 Database Standards Committee. He is also the author of two books on SQL: SQL For Smarties (Morgan-Kaufmann, 1995) and Instant SQL Programming (Wrox Press, 1995). You can contact Joe via email at 71062.1056@compuserve.com.
What did you think of this article? Send a letter to the editor.


Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
December 1997 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated October 31, 1997