
It's time for another SQL lesson, class! It's a fact of life that you care about the extremes of a data item - can you really imagine a news report that begins with the 15th place winners of a sporting event? We want to know the largest sales amount, the highest score, the lowest temperature, and so forth. The set functions that return these highs and lows are called extrema functions.
Now you are thinking that I am going to do a whole column on the MIN( ) and MAX( ) functions in SQL. Nope. The following is all I have to say about them:
The GREATEST(<comma separated list>) function returns the largest value in the list. You can think of it as a row-oriented version of MAX( ). The LEAST(<comma separated list>) function returns the smallest value in the list. You can think of it as a row-oriented version of MIN( ).
You can write a two-parameter version of GREATEST(x1, x2) in SQL-92 by using the case expression: CASE WHEN (x > y) then x ELSE y END. This case expression can then be extended to three parameters by nesting case expressions, as in:
CASE WHEN (CASE WHEN (x > y) THEN x ELSE y END > z)
THEN (CASE WHEN (x > y) THEN x ELSE y END) ELSE z END
or by flattening the case expression, as in:
Well, you can try something else in SQL-92, but I'm not sure how many products actually implement it. I'm talking about the subquery expression:
(SELECT MAX(x)
FROM VALUES ((x1), (x2), ... (xn)) AS Foobar(x))
What happens in this monster? The VALUES( ) expression constructs a table, which we name Foobar in the as clause. Foobar has rows, shown by the comma-separated list of parenthesized of x's in this schematic. Using the expression VALUES(x1, x2, ... xn) will not work because this expression says to create a row of (n) values, and we want a column instead.
You should be familiar with the VALUES(<comma separated list>) because you have used it for years in the INSERT INTO statement. This feature was generalized and made more orthogonal in SQL-92, so that you could use it to create tables within queries or other SQL statements.
The single column in Foobar is named x in the AS clause. The renaming of columns in the AS clause is also not widely supported; most products that implement the AS clause use it only to rename tables. Finally, the subquery expression pulls out the MAX( ) value from Foobar.
This expression will probably be slow. The obvious execution plan would be to build Foobar as a table at runtime. Foobar will probably fit into main storage (cache memory) because its list is usually small compared to a "real" table. The query can use the usual mechanism for finding the MAX( ). This table also won't have an index, so the MAX( ) will be done with a table scan.
If anyone can find a better way, please let me know.
However, after considering the views of my fellow consultants, I now regard the MSACCESS forum as a source of SQL puzzles for this column. Where else can you find so many people having database problems? (That was a rhetorical question - do not send me the name of your company.)
I will argue that SQL is a set-oriented language, so you ought to use a set-oriented approach. Let's say you want to define a subset of sales values that have a count of three. This subset will have a greatest value in it, and that greatest value will also be in the original set. But which value is the upper boundary for the subset you're seeking? Take each sales total (the boundary of the subset) and build a group of other sales totals (the elements of the subset) that are less than or equal to it (enclosed by the boundary). You want to see the group(s) with three rows.
SELECT Elements.name, Elements.tot_sales
FROM SalesReport AS Elements, SalesReport AS Boundary
WHERE Elements.tot_sales <= Boundary.tot_sales
GROUP BY Elements.name, Elements.tot_sales
HAVING COUNT(Boundary.tot_sales) <= 3;
It should be obvious that 3 can be replaced by any cardinal number, but let's just leave it. A more subtle point is that the predicate "Elements.tot_sales <= Boundary.tot_sales" will include the boundary value and therefore implies that if you have only three salespersons, you still have a contest. If you have fewer than three, you will always get an empty result, in effect calling off the sales contest. If you want to call off the competition for lack of a quorum, change the predicate to "Elements.tot_sales < Boundary.tot_sales" instead.
As an aside, if you were awake during your college set theory course, you will remember that John von Neuman's definition of ordinal numbers is based on nested sets. You can get a lot of ideas for self-joins from set theory theorems. (John von Neuman was one of the greatest mathematicians of this century before he invented the modern stored program computer and Game Theory. Know your nerd heritage!)
Another way to express the query would be:
SELECT Elements.name, Elements.tot_sales
FROM SalesReport AS Elements, SalesReport AS Boundary
WHERE Elements.tot_sales <= Boundary.tot_sales
GROUP BY Elements.name, Elements.tot_sales
HAVING COUNT(DISTINCT Boundary.tot_sales) <= 3;
This says that I want to count the values of tot_sales, not the salespersons, so that if two or more of the crew hit the same total, I will include them in the report as tied for a particular position. This also means that the results can allow more than three rows because I can have ties. As you can see, it is easy to get a subtle change in the results with just a few simple changes to predicates.
A third version - which will run only in SQL-92 - uses scalar subqueries:
You might want to test each version to see which one runs faster on your particular SQL product. Use SQL-92 scalar subqueries based on these self-joins to do this sort of query. The following view will show the ranking of the salespersons as a number (1 = first place, and so on) and their standing:
Assume that you have four salespersons and the SalesContest looks like this:
VIEW SalesContest
| name | tot_sales | rank | standing |
|---|---|---|---|
| "Able" | $1000 | 1 | 1 |
| "Baker" | $900 | 2 | 3 |
| "Brown" | $900 | 2 | 3 |
| "Carey" | $700 | 3 | 4 |
| "Delta" | $600 | 4 | 5 |
| "Eddy" | $500 | 5 | 6 |
The contest winners are:
SELECT name, tot_sales, rank
FROM SalesContest
WHERE rank <= 3;
This gives you a result set with a tie for second place: ("Able," "Baker," "Brown," "Carey"). In one sense of the problem, however, the set of your top three salespersons should be ("Able," "Baker," "Brown"). Your first thought might be to use the same query, replacing "rank <= 3" with "standing <= 3" in the where clause.
That query will produce the right result for this data but is not valid in general. What if Carey and Delta work hard and bring their sales up to $900, so that four people are ranked in second place? Their standings would all become 5, and the query would return only Ms. Able, a result which has many incorrect implications.
One solution is to make sure that you have a top partition of exactly three rows, using this query:
INSERT INTO SalesRegions VALUES("Able," 100.000, 1);
INSERT INTO SalesRegions VALUES("Baker," 900.00, 1);
INSERT INTO SalesRegions VALUES("Brown," 900.00, 1);
INSERT INTO SalesRegions VALUES("Carey," 700.00, 2);
INSERT INTO SalesRegions VALUES("Delta," 600.00, 2);
INSERT INTO SalesRegions VALUES("Eddy," 500.00, 2);
INSERT INTO SalesRegions VALUES("Fred," 900.00, 3);
INSERT INTO SalesRegions VALUES("George," 700.00, 3);
INSERT INTO SalesRegions VALUES("Herman," 600.00, 3);
This problem is pretty simple after you have done the same for the company-wide case. The equality test is replaced by an in( ) predicate that uses a version of the self-join you have been using:
Say you want a query to pass to a report program that shows the names of the instructors for each course and student. Now here's the catch: You physically have room for only two instructor names on the printout.
If there is only one instructor, display the instructor's name in the first instructor column and set the second column to blanks or null. If there are exactly two instructors, display both names in ascending order. If there are more than two instructors, the report will display the name of the first instructor in the first column and the string "--More--" in the second instructor column.
Assume the necessary data is in a table like this:
CREATE TABLE Register
(course INTEGER NOT NULL,
student CHAR(10) NOT NULL,
instructor CHAR(10) NOT NULL,
Brendan's original solution was 70 lines long, whereas the pure SQL answer is about 12 lines of code in a single statement. (See Puzzle Answer.)
The second SELECT statement picks the course/student combinations with two and only two instructors. The MIN( ) and MAX( ) functions work and order the names because there are only two instructors.
The third SELECT statement picks the course/student combinations with more than two instructors. I use the MIN( ) to get the first instructor, and then a constant of "--More--" (as per the report specification) in the second column.
- Joe Celko