DBMS
DBMS, October 1996
SQL for Smarties By Joe Celko

Going To Extremes

How to use SQL to get the most or least of what you want.

It's time for another SQL lesson, class! It's a fact of life that you care about the extremes of a data item - can you really imagine a news report that begins with the 15th place winners of a sporting event? We want to know the largest sales amount, the highest score, the lowest temperature, and so forth. The set functions that return these highs and lows are called extrema functions.

Now you are thinking that I am going to do a whole column on the MIN( ) and MAX( ) functions in SQL. Nope. The following is all I have to say about them:

  1. People seem to get confused when they use MIN( ) and MAX( ) functions with temporal data types. Time has an ordering, but people seem to focus on the present as an origin, and it messes up their thinking.
  2. The forms MIN(distinct <expression>) and MAX(distinct <expression>) are allowed by the standards. Nobody writes code with them because they are totally redundant.
  3. Remember that NULLs are dropped from consideration and that an empty set returns an extrema result of NULL.
Now let's move on to the exotic members of this family of functions.

Greatest and Least

Oracle has two functions that are quite useful - I wish they were available in standard SQL. The problem is not that they take a list of values of the same data type as arguments and not a fixed number of parameters. The SQL Standards Committee did not mind defining the COALESCE(<comma separated list>) function in SQL-92, which scans the list from left to right and returns the first non-NULL value it finds. These functions are GREATEST(<comma separated list>) and LEAST(<comma separated list>).

The GREATEST(<comma separated list>) function returns the largest value in the list. You can think of it as a row-oriented version of MAX( ). The LEAST(<comma separated list>) function returns the smallest value in the list. You can think of it as a row-oriented version of MIN( ).

You can write a two-parameter version of GREATEST(x1, x2) in SQL-92 by using the case expression: CASE WHEN (x > y) then x ELSE y END. This case expression can then be extended to three parameters by nesting case expressions, as in:

CASE WHEN (CASE WHEN (x > y) THEN x ELSE y END > z)
THEN (CASE WHEN (x > y) THEN x ELSE y END) ELSE z END

or by flattening the case expression, as in:

CASE WHEN x >= y AND x >= z THEN x
WHEN y >= x AND y >= z THEN y
WHEN z >= x AND z >= y THEN z
END
The flattened version has an advantage because the case statement works from left to right and stops when it finds the first true case. The nested version would have to go through all of the levels of nesting to produce a result. As a matter of taste, I think the flattened version is easier to generate and read. But it would be nice to be able to use the <comma separated list> construct.

Well, you can try something else in SQL-92, but I'm not sure how many products actually implement it. I'm talking about the subquery expression:

(SELECT MAX(x)
FROM VALUES ((x1), (x2), ... (xn)) AS Foobar(x))

What happens in this monster? The VALUES( ) expression constructs a table, which we name Foobar in the as clause. Foobar has rows, shown by the comma-separated list of parenthesized of x's in this schematic. Using the expression VALUES(x1, x2, ... xn) will not work because this expression says to create a row of (n) values, and we want a column instead.

You should be familiar with the VALUES(<comma separated list>) because you have used it for years in the INSERT INTO statement. This feature was generalized and made more orthogonal in SQL-92, so that you could use it to create tables within queries or other SQL statements.

The single column in Foobar is named x in the AS clause. The renaming of columns in the AS clause is also not widely supported; most products that implement the AS clause use it only to rename tables. Finally, the subquery expression pulls out the MAX( ) value from Foobar.

This expression will probably be slow. The obvious execution plan would be to build Foobar as a table at runtime. Foobar will probably fit into main storage (cache memory) because its list is usually small compared to a "real" table. The query can use the usual mechanism for finding the MAX( ). This table also won't have an index, so the MAX( ) will be done with a table scan.

If anyone can find a better way, please let me know.

Making it to the Top

I like to hang around the MSACCESS forum on CompuServe because I hate the product so much. Several other consultants at different local user group meetings told me this is a bad attitude; as they explained, they love Access because it guarantees them employment. People use the product, get in trouble, and pay these consultants to fix the problems. My main reason for my dislike of Access is my enthusiasm (a.k.a. "religious mania") for standards in general and SQL in particular, so I consider any product so far away from any standard to be in league with the devil.

However, after considering the views of my fellow consultants, I now regard the MSACCESS forum as a source of SQL puzzles for this column. Where else can you find so many people having database problems? (That was a rhetorical question - do not send me the name of your company.)

The Top Salesperson Contest

One feature that Access has and SQL does not is the SELECT TOP (n) clause. This feature will sort the result table and return the first (n) rows. This is not a relational operation - tables are not supposed to be sorted. But I will concede that it is handy to be able to ask questions such as, "Who are the top three salespersons in the company?" in a single statement. The bad news is that this is not as easy as it sounds.
  1. It is a bad specification. How do you define "best" in terms of an ordering? Is it total sales? Or do you use one of those quota point systems that sales managers seem to love? For the rest of this column, assume that you are using a simple table with a column that has the total sales for each salesperson.
  2. What if you have three or fewer salespersons in the company? Do you report all the salespersons you do have? Or do you return a null, empty result set, or error message?
  3. How do you handle two salespersons who tied? Include them all and permit the result set to be bigger than three? Pick an arbitrary subset and exclude someone? Or do you return a null, empty result set, or error message?
The way that you do this in other SQL products is to produce a result with an ORDER BY clause, then read the first (n) rows from that cursor result. Remember, only cursors have the order by clause. Sybase has a "SET ROWCOUNT N" option that clips the result set at exactly (n) rows. Oracle and others have a row number that is, in effect, a sequential number in a system-generated column that you can use to clip out the first (n) rows. Both options suffer from the same problem: Your results depend completely on the order in which the rows were constructed - and that order can change from one execution of the query to the next.

I will argue that SQL is a set-oriented language, so you ought to use a set-oriented approach. Let's say you want to define a subset of sales values that have a count of three. This subset will have a greatest value in it, and that greatest value will also be in the original set. But which value is the upper boundary for the subset you're seeking? Take each sales total (the boundary of the subset) and build a group of other sales totals (the elements of the subset) that are less than or equal to it (enclosed by the boundary). You want to see the group(s) with three rows.

SELECT Elements.name, Elements.tot_sales
FROM SalesReport AS Elements, SalesReport AS Boundary
WHERE Elements.tot_sales <= Boundary.tot_sales
GROUP BY Elements.name, Elements.tot_sales
HAVING COUNT(Boundary.tot_sales) <= 3;

It should be obvious that 3 can be replaced by any cardinal number, but let's just leave it. A more subtle point is that the predicate "Elements.tot_sales <= Boundary.tot_sales" will include the boundary value and therefore implies that if you have only three salespersons, you still have a contest. If you have fewer than three, you will always get an empty result, in effect calling off the sales contest. If you want to call off the competition for lack of a quorum, change the predicate to "Elements.tot_sales < Boundary.tot_sales" instead.

As an aside, if you were awake during your college set theory course, you will remember that John von Neuman's definition of ordinal numbers is based on nested sets. You can get a lot of ideas for self-joins from set theory theorems. (John von Neuman was one of the greatest mathematicians of this century before he invented the modern stored program computer and Game Theory. Know your nerd heritage!)

Another way to express the query would be:

SELECT Elements.name, Elements.tot_sales
FROM SalesReport AS Elements
WHERE (SELECT COUNT(')
FROM SalesReport AS Boundary
WHERE Elements.tot_sales > Boundary.tot_sales)>= 3;
If you want to swap the subquery and the constant for readability, that action is legal in SQL-92 but not in SQL-89. What if you want to allow ties? Just change count( ) to a COUNT(DISTINCT) function with a HAVING clause, thus:

SELECT Elements.name, Elements.tot_sales
FROM SalesReport AS Elements, SalesReport AS Boundary
WHERE Elements.tot_sales <= Boundary.tot_sales
GROUP BY Elements.name, Elements.tot_sales
HAVING COUNT(DISTINCT Boundary.tot_sales) <= 3;

This says that I want to count the values of tot_sales, not the salespersons, so that if two or more of the crew hit the same total, I will include them in the report as tied for a particular position. This also means that the results can allow more than three rows because I can have ties. As you can see, it is easy to get a subtle change in the results with just a few simple changes to predicates.

A third version - which will run only in SQL-92 - uses scalar subqueries:

SELECT
(SELECT MAX(tot_sales)
FROM SalesReport) AS s1,
(SELECT MAX(tot_sales)
FROM SalesReport
WHERE tot_sales NOT IN (s1)) AS s2,
(SELECT MAX(tot_sales)
FROM SalesReport
WHERE tot_sales NOT IN (s1, s2)) AS s3,
. . .
(SELECT MAX(tot_sales)
FROM SalesReport
WHERE tot_sales NOT IN (s1, ..., s[n-1])) AS sn
FROM Dummy;
Dummy is any table, even an empty one, used to construct the single row with the results. Because all of the results appear in one row, they are accessible to each other.

You might want to test each version to see which one runs faster on your particular SQL product. Use SQL-92 scalar subqueries based on these self-joins to do this sort of query. The following view will show the ranking of the salespersons as a number (1 = first place, and so on) and their standing:

CREATE VIEW SalesContest (name, tot_sales, rank, standing)
AS SELECT S0.name, S0.tot_sales,
(SELECT COUNT(DISTINCT tot_sales)
FROM SalesReport AS S1
WHERE S1.tot_sales >= S0.tot_sales),
(SELECT COUNT(tot_sales)
FROM SalesReport AS S1
WHERE S1.tot_sales >= S0.tot_sales)
FROM SalesReport AS S0;
Then trim off the personnel who did not make the grade with the WHERE clause in a simple SELECT on the SalesContest VIEW. The difference between a standing and a ranking is that a standing can have gaps in the numbering. Standings are used in the British school system, among other places.

Assume that you have four salespersons and the SalesContest looks like this:

VIEW SalesContest
name tot_sales rank standing
"Able" $1000 1 1
"Baker" $900 2 3
"Brown" $900 2 3
"Carey" $700 3 4
"Delta" $600 4 5
"Eddy" $500 5 6

The contest winners are:

SELECT name, tot_sales, rank
FROM SalesContest
WHERE rank <= 3;

This gives you a result set with a tie for second place: ("Able," "Baker," "Brown," "Carey"). In one sense of the problem, however, the set of your top three salespersons should be ("Able," "Baker," "Brown"). Your first thought might be to use the same query, replacing "rank <= 3" with "standing <= 3" in the where clause.

That query will produce the right result for this data but is not valid in general. What if Carey and Delta work hard and bring their sales up to $900, so that four people are ranked in second place? Their standings would all become 5, and the query would return only Ms. Able, a result which has many incorrect implications.

One solution is to make sure that you have a top partition of exactly three rows, using this query:

SELECT name, tot_sales
FROM SalesContest
WHERE standing <= 3
AND EXISTS (SELECT *
FROM SalesContest
WHERE standing = 3);
This query will return an empty result set when there is not a clear top partition of the desired size.

The Regional Sales Contest

Another problem that comes up a lot on the CompuServe MSACCESS forum is using a select top with a group by; the problem is that you cannot do it. Assume you have another table with the sales personnel broken into regions and you want to find the top two salespersons in each region for local awards: CREATE TABLE SalesRegions (name CHAR(10) NOT NULL, tot_sales DECIMAL(8,2) NOT NULL, region INTEGER NOT NULL);

INSERT INTO SalesRegions VALUES("Able," 100.000, 1);
INSERT INTO SalesRegions VALUES("Baker," 900.00, 1);
INSERT INTO SalesRegions VALUES("Brown," 900.00, 1);
INSERT INTO SalesRegions VALUES("Carey," 700.00, 2);
INSERT INTO SalesRegions VALUES("Delta," 600.00, 2);
INSERT INTO SalesRegions VALUES("Eddy," 500.00, 2);
INSERT INTO SalesRegions VALUES("Fred," 900.00, 3);
INSERT INTO SalesRegions VALUES("George," 700.00, 3);
INSERT INTO SalesRegions VALUES("Herman," 600.00, 3);

This problem is pretty simple after you have done the same for the company-wide case. The equality test is replaced by an in( ) predicate that uses a version of the self-join you have been using:

SELECT R1.region, R1.name, R1.tot_sales
FROM SalesRegions AS R1
WHERE R1.name IN
(SELECT R3.name
FROM SalesRegions AS R2, SalesRegions AS R3
WHERE R2.tot_sales <= R3.tot_sales
AND R1.region = R2.region
AND R1.region = R3.region
GROUP BY R3.name, R3.tot_sales
HAVING COUNT(DISTINCT R2.tot_sales) <= 2)
ORDER BY region, tot_sales DESC, name;
As you add more grouping columns, you will need to add pairs of and-ed equality tests to ensure that the subquery and the containing query are working on the same dataset.

Puzzle

Brendan Campbell (74267.224@compuserve.com) posted an interesting problem on the Oracle User Group forum on CompuServe in May 1996. He gave me permission to use it and to publish his alternative PL/SQL solution as a bad example, thus submitting himself to public humiliation and disgrace for the good of science. Donating your body is easy - you're dead -but donating your dignity is hard.

Say you want a query to pass to a report program that shows the names of the instructors for each course and student. Now here's the catch: You physically have room for only two instructor names on the printout.

If there is only one instructor, display the instructor's name in the first instructor column and set the second column to blanks or null. If there are exactly two instructors, display both names in ascending order. If there are more than two instructors, the report will display the name of the first instructor in the first column and the string "--More--" in the second instructor column.

Assume the necessary data is in a table like this:

CREATE TABLE Register
(course INTEGER NOT NULL,
student CHAR(10) NOT NULL,
instructor CHAR(10) NOT NULL,

Brendan's original solution was 70 lines long, whereas the pure SQL answer is about 12 lines of code in a single statement. (See Puzzle Answer.)


Joe Celko is a member of the ANSI X3H2 Database Standards Committee, a widely published author, and a consultant with OSoft Development Corp. in Atlanta. Joe is also a frequent contributor to the DBMS Forum on CompuServe. You can email him at 71062.1056@compuserve.com.

Puzzle Answer

This column is about the extrema functions, so I hope you have been thinking about using them.
SELECT R1.course, R1.student,
MIN(R1.instructor), NULL
FROM Register AS R1
GROUP BY R1.course, R1.student
HAVING COUNT(*) = 1
UNION
SELECT R1.course, R1.student,
MIN(R1.instructor), MAX(R1.instructor)
FROM Register AS R1
GROUP BY R1.course, R1.student
HAVING COUNT(*) = 2
UNION
SELECT R1.course, R1.student,
MIN(R1.instructor), '--More--'
FROM Register AS R1
GROUP BY R1.course, R1.student
HAVING COUNT(*) > 2;
Now, to go into my usual painful details. The first SELECT statement picks the course/student combinations with one and only one instructor. See why the MIN( ) works? Without any competition, the only instructor will be the minimum by default. I like to use NULLs for missing values, but you could use a string constant instead.

The second SELECT statement picks the course/student combinations with two and only two instructors. The MIN( ) and MAX( ) functions work and order the names because there are only two instructors.

The third SELECT statement picks the course/student combinations with more than two instructors. I use the MIN( ) to get the first instructor, and then a constant of "--More--" (as per the report specification) in the second column.

- Joe Celko


Table of Contents - October 1996 | Home Page
Copyright © 1996 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to mfrank@mfi.com
Updated Monday, September 23, 1996