DBMS, November 1996
SQL for Smarties By Joe Celko

Summing With SQL

Summation in a relational database model or in mathematics is tricky.

At the time I am writing this, the Olympics have just finished in Atlanta. Before the Olympics, I had always regarded living across from a MARTA (the Metropolitan Atlanta Rapid Transit Authority) station as an advantage. The Art Center station was a majo r nexus for Cobb County Transit buses coming into town, the volunteers at Georgia Tech, the midtown hotel district, the High Museum of Art, the Atlanta Symphony, and other Art activities. My dog thought it was great to have all those people out at 5:00 a .m. to play with.

From a database angle, the big news was the IBM scoring system screw-up, causing the Olympics to rely on hand-carried messages and the wire services to get the news out. They tried to get too fancy in too short a time for their own good.

A less-publicized computer problem was the trial run of the Visa cash card. The victim was supposed to buy a cash card and then use it by inserting it into a slot on a small terminal device connected to the cash register.

What Visa did not tell you, however, was that the terminal does not provide a password system, so losing your card is like losing your cash. Only a few of the chain-store merchants had the terminal hookups, and none of the small merchants selling Olympic paraphernalia did. Furthermore, if you had any credit left on your card when you went back home, you had to pay $3 to cash out the card. If you did not cash out your card, it would expire, and the surplus money would automatically go to a charity of the bank's choice.

Can anyone from Visa explain why people actually bought these things? A regular credit card would be more widely acceptable and would not expire; it would let me use my own money longer and pick my own charities. If I want to do electronic commerce, I ca n place orders on the Internet with more safety. There is still not one recorded case of a credit card number being stolen during a transaction on the Internet ("Electronic Commerce: Sign Here," The Economist, July 1996, page 54).

The Internet, Java, and SQL Again

According to an Associated Press report dated July 23, 1996, U.S. Justice Department officer Jamie Gorelick said that getting control over the Internet is as important to the U.S. Government as was the development of the first atomic bomb via the Manhatt an Project. This analogy is pretty strange, but I guess some people jump at any excuse to grab power. The atomic bomb resulted in "Balance of Terror" in the world - you could physically destroy any country that did not behave as you wished. I have not be en able to do that with my current Internet Access provider.

The trend toward making databases universally available on the Internet continues. Ensodex, which markets HotSockets for ODBC, has created a new product, HotSockets for JDBC. This product is an SDK and a JDBC server, for organizations wishing to connect their Internet/Intranet Web pages to their corporate data. This unique product enables corporate enterprise data to be connected via a Java applet. This capability means that all the client needs is a standard Web browser - and who doesn't have one of th ese? - to be Internet/Intranet-enabled for enterprise data.

HotSockets for JDBC works as a pass-through applet on any platform that supports Java-enabled Web browsers. That is, the applet created with HotSockets for JDBC is downloaded to the client when the Web page is loaded. It then connects back to the server from the applet, via HotSockets, to support database operations such as select, insert, update, delete, and scrolling displays. It also implements security, encryption, authentication, firewall services, and a packet-filtering API.

Summation versus Addition

SQL (and other database languages before SQL) has a function for performing a summation - in SQL, it is the sum( ) aggregate function. You would think that this function would be well understood. In June 1996, the DBMS Forum on CompuServe had a lively th read (also known as a "religious war") on this topic. It seems that summation is not actually all that easy in a relational database model or in mathematics.

Addition and summation are closely related but not quite the same thing. This slight difference is weird at first, but it is important. Addition is a binary operator, shown by a plus sign (+), that has some algebraic properties such as associativity and commutativity that we learned in high school algebra. You cannot add just one thing, and you cannot add an infinite set of things; the operation is simply not defined for either situation.

Summation is an aggregate or set-valued function that has an index set. The index set can be finite or infinite, but it cannot be empty. Each term in the summation has a one-to-one relationship with each element of the index set. If you are old enough yo u might remember the "big Sigma" () notation from college math. The index is the subscript variable, usually i, j, or k, which ranges from a starting value to an ending value. If you want the summation process to continue forever, yo u use the special symbol "infinity," shown as a figure eight on its side ().

The old figure eight "infinity" notation is procedural and not functional in nature. Think of it as hiding a small computer program that looks like this:

BEGIN
sum := 0;
WHILE i <= f
DO BEGIN
i := s
sum := sum + T(i);
i := i + 1;
END;
RETURN sum;
END;

When the <finish> value is infinity, think of the program as equivalent to:

BEGIN
sum := 0;
WHILE TRUE - the result of (<index> => INFINITY)
DO BEGIN
i := s
sum := sum + T(i);
i := i + 1;
END;
RETURN sum;
END;

Every programmer knows that this operation has some problems - we have all written an endless loop at some point in our careers. The answer in mathematics was to invent limits. The idea of a limit is that there is a value that the sum never exceeds, no m atter how many iterations of the loop are executed.

What happens if the index set is empty? Let's modify the original program again:

BEGIN
sum := 0;
WHILE FALSE - the result of having no <index> values
DO BEGIN
i := s
sum := sum + T(i);
i := i + 1;
END;
RETURN sum;
END;

Thus we might conclude that the summation of an empty set is zero, the identity element of addition. I will make several arguments that this is not the case.

First, the zero we are getting back from this program is something we created that was never in the original set of terms. The philosophical principle "ab nullo, ex nullo" ("from nothing, comes nothing") makes me uncomfortable about creating a result, ze ro, from an empty set.

Second, let's look at another way to write the procedural code:

BEGIN
sum := T(s);
DO BEGIN
i := i + 1;
sum := sum + T(i);
END;
UNTIL i < f;
RETURN sum;
END;

This code does not create any values that were not in the original summation. The empty set would result in a program failure on the first executable statement, because the initial term is not defined. In short, the procedural argument depends on your pr ocedure.

The third argument calls for a little background information about what has happened in mathematics since 1820. The current preferred notation is to show the index set as a set; you can see examples of this notation in Concrete Mathematics by Grah am, Knuth, and Patashnik (Addison-Wesley, 1994, ISBN 0-201-55802-5). There is no ordering, as there was in the old notation.

A set is a completed, bound thing that can be treated as a unit of operation in itself. You cannot sum an empty set with a process. Without going into the painful details, I'll say that mathematical proofs are completely different in those two cases. Bec ause the relational model and SQL are both set-oriented languages, I will argue that a set-oriented approach is better than a procedural one.

It is much easier to specify a complicated set using logical predicates than to write a complicated index expression inside a term. For example, I can define an index to be the set of all prime numbers, but nobody knows how to write an expression that wi ll generate the series of all prime numbers.

Let's create a simple Personnel table with just the employee's name, salary, and lottery winnings in each row. The lottery tickets are the employee retirement plan at this company. (We knew it would come to this someday.) We will model lottery tickets th at have not been checked against the winning numbers as a NULL. The value of the lottery winnings will probably resolve to zero, but you don't know that yet.

CREATE TABLE Personnel
(emp CHAR (5) NOT NULL PRIMARY KEY,
salary DECIMAL (8,2) NOT NULL,
lottery DECIMAL (8,2));

INSERT INTO Personnel VALUES ('Tom', 500.00, 200.00);
INSERT INTO Personnel VALUES ('Dick', 700.00, NULL);
INSERT INTO Personnel VALUES ('Harry', 800.00, NULL);

Now consider the straightforward statement to report the payroll:

SELECT emp, (salary + lottery) AS total_pay
FROM Personnel;

The result is:

emp total_pay
==================
'Tom', 700.00
'Dick' NULL
'Harry' NULL

The total_pay will be a NULL for all employees who have not scratched their tickets yet, because NULLs propagate in addition in SQL. This result is probably not what you wanted. Now look at the SUM( ) aggregate function in the statement:

SELECT emp, SUM(salary + lottery) AS total_pay
FROM Personnel
GROUP BY emp;

The result is:

emp total_pay
==================
'Tom', 700.00
'Dick' NULL
'Harry' NULL

Now, look at this query:

SELECT (SUM(salary) + SUM(lottery)) AS total_pay
FROM Personnel;

The result is:

total_pay
===========
2200.00

We have a different result than we would have gotten by summing the rows in the previous queries. This variation occurs because the index set for the SUM( ) aggregate function in SQL is all of the non-NULL values in the parameter expression.

What if you actually wanted SUM( ) to behave like "+" and propagate NULLs? Just use a fancy expression to see if any nulls exist in your expression, thus:


SELECT (SUM(salary + lottery) 
* CASE WHEN COUNT(*) = COUNT(salary + 
lottery) 
THEN 1 
ELSE NULL END) AS total_pay 
FROM Personnel;
What should the result of a SUM( ) function on an empty table be? Chris Date has advocated returning a zero, using the procedural code argument ("Empty Bags and Identity Crises," Database Programming & Design, April 1993, page 23). The zero result has also been a convention in mathematics since the days of the procedural notation. This convention made writing the indexing expressions easier in many cases because the zero would not matter in the final result, but it was known to be formally incorr ect (see exercise one of chapter two, page 62 of Concrete Mathematics for other situations in which the sequence is not well defined). Zero is called the identity element for addition - if you add zero to something, it does not change.

This convention is incorrect because nothing in the index is set to create a term in the summation. There is no one-to-one mapping from the index to the terms of the summation. Fourier did not have the idea of sets, empty or otherwise, let alone mappings , when he invented this notation in 1820. He was doing quite well to come up a procedural approach.

The argument for the NULL result was that there is a fundamental difference between an empty set and a non-empty set that totals to zero. Imagine that you are guarding a flock of sheep and I ask you how many sheep jumped over the fence into the other fie ld; you answer "zero." Now, if I ask you how many pink elephants jumped over the fence into the other field, would you answer "zero" or "I ain't got no elephants! I'm a shepherd, not a circus!" to that question? Hence, summation ought to return an error message, a NULL, or some indication that a set is missing.

Another property that is hard to understand at first is how GROUP BY works on an empty set. Consider this query:

1) SELECT SUM(x) FROM Empty;

which returns a result table with a single null. Now, consider:

2) SELECT SUM(x) FROM Empty GROUP BY x;

which returns an empty set result table (and probably a warning that no rows exist or satisfy the specified search criteria). The reason for the empty set result is that the GROUP BY partitions the table.

Time for a mathematical aside. A partitioning cuts the original set into a collection of subsets such that:

1) All of the subsets are non-empty.
2) The intersection of any distinct pair of the subsets is empty.
3) The union of all of the subsets together forms the original set.

You can get this definition from any book on elementary set theory (see Elementary Set Theory: Proof Techniques, Carol E. Gordon and Neil Hindman, Hafner Press, 1975, ISBN 0-02-845350-1). Because an empty set (table) has no partitions, the GROUP B Y cannot return a value. You could make an argument for an error message, but the convention in SQL is to return an empty result.

Another SQL convention is that if you get an empty result set in a subquery expression, it becomes a NULL. Because it is an expression, it must return something that can fit into a column in the result table. There is no value that would make sense, so w e use a NULL.

Partitions and summation act like parentheses and addition - both are ways of grouping operands that do not affect the operator. That is, just as ((a + b) + (c + d)) = ((a) + (b) + (c) + (d)), so does:

SELECT SUM(x)
FROM (SELECT SUM(x) FROM Pizza WHERE a = 'x1'
UNION
SELECT SUM(x) FROM Pizza WHERE a = 'x2'
...
UNION
SELECT SUM(x) FROM Pizza WHERE a = 'xN')
AS subtotal(x);

where ('x1', 'x2', ..., 'xN') are all the values in column x.

Another important property of summation is that if you partition a table on a key, each group has exactly one element. The result of the SUM( ) for each singleton group is the single element in it. This convention applies in the "big Sigma" notation when it has a single term; it's also another reason why addition, a binary operator, is not the same as summation, an n-ary operator (n > 0).

Set theory has the Axiom of Choice, which can be expessed in many different ways. Without being too formal about it, the Axiom of Choice says that if you have a partitioning of a set, even an infinite set, you can choose an element from each partition an d build a new set from it.

This axiom leads to a fundamental problem: If you believe that the summation over an empty set is zero, then what partition(s) of the original set does that sum map into? Oops - there are no partitions for the mapping. This violates the Axiom of Choice a nd destroys induction and bunch of other nice things in the system.

Puzzle

If the sum( ) aggregate function is related to the mathematician's "big Sigma" () notation, then we might be able to define a prod( ) aggregate function that is related to the "big Pi" () notation. The "big Pi" () or product notation is a way to represent repeated multiplication using an index set, which was introduced by Jacobi in 1829. In my August 1996 column ("Everything You Know is Wrong"), I showed a way to fake su ch an aggregate product function using a second table.

Frankly, it was ugly, but I did it that way because the SQL-92 standard doesn't have a good mathematics library. In real products, however, vendors have made most of the standard scalar arithmetic functions available.

Can you write a query to do a PROD( ) aggregate function?


Joe Celko is an Atlanta-based Guru with Northern Lights Software Ltd. and a member of the ANSI X3H2 Database Standards Committee. He is also the author of two books on SQL: SQL For Smarties (Morgan-Kaufmann) and Instant SQL (Wrox Press). You can contact Joe via CompuServe at 71062,1056 or via email at 71062.1056@compuserve.com.


* Ensodex Inc.; 1850 University Ave. W.; St. Paul, MN 55104; 612-641-8551 or fax 612-641-8566; http://www.ensodex.com; email info@ensodex.com.


Puzzle Answer

Anybody out there ever use a slide rule? It turns multiplication into addition by using logarithms. The expression for the product of a column from logarithm and exponential functions is:

SELECT ((EXP (SUM (LOG (CASE WHEN nbr = 0.00
THEN NULL 
ELSE ABS(nbr) END)))) 
* (CASE WHEN MIN (ABS (nbr)) = 0.00
THEN 0.00
ELSE 1.00 END)
* (CASE WHEN MOD (SUM (CASE WHEN SIGN(nbr) = -1 
THEN 1 
ELSE 0 END), 2) = 1
THEN -1.00
ELSE 1.00 END) AS Prod
FROM NumberTable;
The nice part of this is that you can also use the SUM (DISTINCT <expression>) option to get the equivalent of PROD (DISTINCT <expression>).

You should watch the data type of the column involved and use either integer 0 and 1 or decimal 0.00 and 1.00, as is appropriate in the CASE statements. It is worth studying the three CASE expressions that make up the terms of the PROD( ) calculation.

The first CASE expression ensures that all zeros and negative numbers are converted to a non-negative or NULL for the SUM( ) function, just in case your SQL raises an exception.

The second CASE expression returns zero as the answer if there was a zero in the nbr column of any selected row. The MIN (ABS (nbr)) is a handy trick for detecting the existence of a zero in a list of both positive and negative numbers with an aggregate function.

The third CASE expression returns minus one (-1) if there was an odd number of negative numbers in the nbr column. The innermost CASE expression uses a SIGN( ) function, which returns +1 for a positive number, -1 for a negative number, and 0 for a zero. The SUM( ) counts the -1 results, and then the MOD( ) functions determines if the count was odd or even.

As an aside, the book Bypasses: A Simple Approach To Complexity by Z. A. Melzak (Wiley-Interscience, 1983, ISBN 0-471-86854-X), is a short mathematical book on the general principle of conjugacy, the method of using a transform and its inverse to reduce the complexity of a calculation.

For bonus credits, can you give an argument as to why the PROD( ) of an empty set should be one, the identity element for multiplication, or NULL?

- Joe Celko


Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
November 1996 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1996 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Wednesday, October 23, 1996.