DBMS, February 1997
DBMS Online: SQL for Smarties By Joe Celko

Web Searching and Conjugacy

Joe explains his mysterious SQL Puzzle answers.

I attended Software Development East '96 in October 1996 to give a talk on representing tree structures in SQL. Software Development is a good conference. I define a good conference as one where I get so many new ideas that I know it will take me months to follow up on them. Unfortunately, I could only come up for a day and half at the end of the show and missed most of the talks I would like to have heard.

Karl Seiler's "Advanced Database Search and Retrieval Technology" talk was not a plug for his company and product (Level Five Research and Quest respectively), but rather a good, balanced presentation on tools for doing Web searches.

The first problem, as anyone who has tried a search knows, is getting back either too much information when your search is too loose ("9,543,234 pages match your criteria.") or no information when you try to tighten it up. What you really want is a research assistant who would look up the most relevant choices for you and give you a short summary of each one before you accessed the Web sites you needed. We are getting there.

The second problem is that search engines are not always smart about what they find. For example, a search on "domino" returns references to a pizza company, musicians, a pet Dalmatian, a Novell network product, some Latin references, some personal Web sites, and game-related sites.

The third problem is that Web site providers try to trick search engines into giving their sites higher scores by using keywords over and over in their text. In a huge result set, you have to be near the top of the list to be read. A page where "domino" appears 100 times will probably get a higher score than a page where the word appears only once, so the Webmaster simply repeats the same words over and over in the comments on a page.

Level\5's Quest Server version 2.0 lets the user input a query with weights to the criteria via a point-and-click interface. This dialog is translated into a search on the Web, and Quest returns the best matches it can find and sorts the results in order of weight. It always returns something, and it is not so easily fooled by Web site provider tricks. If you would like to look at the product, you can download a demo copy from the company's Web site at www.L5R.com.

Karl also mentioned a research project in England that reads documents and reduces the contents to summaries whose length -- a paragraph, a sentence --is determined by the user. The shortest summary is two words, but you will encounter a lot of semantic loss. It takes a long time to analyze a document, and a human expert with voice input might still be faster for one document, but this project demonstrates that computerized summarization is possible.

Nobel Prizes and Decision Making

The 1996 Nobel Prize in Economics went to James Mirrles and William Vickerey for their work in decision theory. Mr. Mirrles worked with asymmetric information situations, which were originally proposed by Vickerey. In English, that means what you should do to get the other guy to tell the truth when he knows more than you do. For example, a lender does not really know how likely a borrower is to repay the loan, an auctioneer does not know what value a bidder really puts on the item, and so forth.

You might like to experiment with a "Vickerey auction" either in a computer simulation or with friends. This procedure takes sealed bids and sells the item to the highest bidder, as expected. The gimmick is that the winner pays the second highest price. This procedure, which lets the seller obtain as high a price as with any other auction, has been used to sell oil field rights and broadcast spectrum space.

Conjugacy and Old Puzzles

One complaint I get is that when I do a SQL puzzle, I give an answer but I don't explain anything about techniques. So this month I review two old puzzles and demonstrate how they can be solved with a common technique. This method of using a transform and its inverse to reduce the complexity of a calculation is called Conjugacy. It is a general technique that is not just limited to mathematical expression: The idea is that you want to perform operation S on a set of values, x, and you cannot do it directly, so instead you:
  1. Apply a transform T( ) to a set of values, x.
  2. Apply an operation, S', on the result of T(x). These results should be easier to work with than the original data; otherwise this whole procedure is a waste of time.
  3. Apply the transform T-1( ) to the set of values S'(T(x)). This will give you the final results you wanted, S(x).
This technique depends on T( ) having a true inverse or the data being limited to a range of values in which a true inverse exists. You might want to look over the functions that you have in your SQL implementation. The most common function conjugate pairs are:
  1. String concatenation and SUBSTRING(<string expression>).
  2. REVERSE(<string expression>) and REVERSE(<string expression>). This process flips the string around. Yes, some functions are their own inverses.
  3. The trig function pairs: TAN(<numeric expression>) and ARCTAN(<numeric expression>); SIN(<numeric expression>) and ARCSIN(<numeric expression>); and COS(<numeric expression>) and ARCCOS(<numeric expression>).
  4. Logarithmic function pairs: ln(<numeric expression>) and exp(<numeric expression>) and log10(<numeric expression>) and power(<numeric expression>,10). You might also see a base two logarithm function in some products.
  5. Casting strings to a non-string data type, and vice versa.

Logarithms and Exponential

In the puzzle section of my November 1996 column, I asked the reader to write a query that would yield the product of all of the non-NULL numbers in a column expression. In effect, add a new aggregate function to SQL, which we could call PROD( ). My answer used the natural logarithm and exponential functions to compute the product:
SELECT ((EXP (SUM (LN (CASE WHEN nbr = 0.00
THEN NULL
ELSE ABS(nbr) END))))
* (CASE WHEN MIN (ABS (nbr)) = 0.00
THEN 0.00
ELSE 1.00 END)
* (CASE WHEN MOD (SUM (CASE WHEN SIGN(nbr) = -1
THEN 1
ELSE 0 END), 2) = 1
THEN -1.00
ELSE 1.00 END) AS Prod
FROM NumberTable;
The nice part of this is that you can also use the sum(distinct <expression>)option to get the equivalent of prod(distinct <expression>).

The natural logarithm, LN(x), is not defined for all numbers, so we don't have a true inverse over the domain of the problem. As x approaches zero, LN(x) goes toward negative infinity and it is not defined for negative numbers.

For that reason, I added the three CASE expressions in the PROD( ) query. They are a bit tricky, so follow me closely: The first CASE expression is to insure that negative numbers are converted to their absolute values so the logarithm will be defined. Zeros are also converted to NULLs in this expression so that SQL will not raise an exception; several SQLs tested did not report an error. Instead, they returned a NULL or even a zero, both of which are wrong. The NULLs will be dropped out in the SUM( ), and we will handle multiplication by zero later.

The second CASE expression will return zero as the answer if there was a zero in the nbr column of any selected row. The MIN(ABS(nbr)) is a handy trick for detecting the existence of a zero in a list of both positive and negative numbers with an aggregate function. Don't forget to file that one away in your notebooks!

The third CASE expression will return ý1 (minus one) if there was an odd number of negative numbers in the nbr column. The innermost CASE expression uses a SIGN( ) function that returns +1 for a positive number, ý1 for a negative number, and 0 for a zero. The SUM( ) counts the ý1 results, and then the MOD( ) function determines if the count was odd or even. I am not sure if changing the answers clauses of the CASE expression from 1 and 0 to 1.0 and 0.0 -- which would make them into floating point numbers -- would improve performance or not.

You still need to test your SQL to see if SIGN( ) and ABS( ) return NULL for a NULL argument. I did not find any exceptions, but you never know.

Concatenation and Substring

Several readers sent in other solutions to Jack Wells' October puzzle: Find the last two salary changes for all of the employees from a table with the personnel history. Mike Conway came up with an answer in Oracle, which I translated into SQL-92 with mixed results. The problem with the translation was that the Oracle version of SQL does not support the SQL-92 Standard OUTERJOIN syntax, and you have to watch the order of execution to get the right results. Syed Kadir, an associate application engineer at Oracle, sent in an improvement on my answer using the view that was created in the first solution:

SELECT S1.emp, S1.sal_date, S1.sal_amt, S2.sal_date, S2.sal_amt
FROM Salaries1 AS S1, Salaries2 AS S2 -- use the view
WHERE S1.emp = S2.emp
   AND S1.sal_date > S2.sal_date
UNION ALL
SELECT emp, MAX(sal_date), MAX(sal_amt), NULL, NULL
   FROM Salaries1
  GROUP BY emp
HAVING COUNT(*) = 1;
You might have to replace the last two columns with the expressions CAST(NULL AS DATE) and CAST(NULL AS DECIMAL(8,2)) to assure that they are of the right data types for a UNION. However, the best answers once more came from Richard Romley of Smith-Barney. His first solution takes advantage of the subquery table expression to avoid VIEWSs:
SELECT B.emp, B.maxdate, Y.sal_amt, B.maxdate2, Z.sal_amt
FROM (SELECT A.emp, A.maxdate, maxdate2 = MAX(X.sal_date)
     FROM (SELECT W.emp, maxdate = MAX(W.sal_date)
         FROM Salaries AS W
         GROUP BY W.emp) AS A
     LEFT OUTER JOIN Salaries AS X
     ON A.emp = X.emp
      AND A.maxdate > X.sal_date
     GROUP BY A.emp, A.maxdate) AS B
    LEFT OUTER JOIN Salaries AS Y
    ON B.emp = Y.emp
      AND B.maxdate = Y.sal_date
    LEFT OUTER JOIN Salaries AS Z
    ON B.emp = Z.emp
      AND B.maxdate2 = Z.sal_date;
If your SQL product supports only a TABLE or VIEW name in their outer joins, then you can convert some of the subqueries into VIEWs for the table subqueries named A and B.

Richard and I went back and forth via email on other possible solutions until we both came up an unorthodox approach, which I wanted to reject on artistic grounds. In the end, though, performance is the ultimate judge.

The trick is Conjugacy and functions involved are concatenation and its inverse, SUBSTRING( ). The trick is to convert the dates and salaries into strings, concatenate them with the date string first, and then split the salary back out.

SELECT A.emp,
    MAX(A.sal_date) AS date1,
    SUBSTRING (MAX(CAST(A.sal_date AS CHAR(10))
      || CAST(A.sal_amt AS CHAR(8))), 11, 8) AS sal1,
    MAX(B.sal_date) AS date2,
    SUBSTRING (MAX(CAST(B.sal_date AS CHAR(10))
      || CAST(B.sal_amt AS CHAR(8))), 11, 8) AS sal2
FROM Salaries AS A
    LEFT OUTER JOIN
    Salaries AS B
    ON A.emp = B.emp
     AND A.sal_date > B.sal_date
GROUP BY A.emp;
If you wanted to, you could also CAST( ) the whole substring expression back to DECIMAL(8,2) so you would have the original data type. Another advantage is that you can extend the pattern of this query by adding more MAX( ) and SUBSTRING( ) pairs with matching LEFT OUTER JOINS and performance does not degrade significantly.

Look back at the list of conjugate function pairs at the start of this section; you will see that I have not used the trig functions or string reversal. Frankly, I cannot think of any way to use the trig functions, but I know that some reader has a great example of string reversal. When you send it in to me, you'll get your name in print and a new book as a prize.

PUZZLE:

A version of this problem got posted on the Oracle User forum on CompuServe in November 1996. The programmer wants to put a query in a cursor that will return the rows from Table 1 in an order determined by a row in Table 2. The information in Table 2 can call for any combination of three columns in Table1 as the sort order.

Table2
sortnbr sortseq
=============================
0 'T1.a, T1.b, T1.c;'
1'T1.a, T1.c, T1.b;'
2 'T1.b, T1.a, T1.c;'
3 'T1.b, T1.c, T1.a;'
4 'T1.c, T1.a, T1.b;'
5 'T1.c, T1.b, T1.a;'

His first approach was to use a dynamic cursor that was built with code something like the following pseudocode. In the actual program, the sort order code came from another procedure, not from a dialog with the user.

BEGIN
DECLARE query1 CHAR(100)
    DEFAULT('SELECT * FROM Table1 AS T1 ORDER BY ');
  mysortcode INTEGER NOT NULL;
  mysortseq CHAR(30) NOT NULL;
CONNECT TO MyDatabase;
/* get the id number for the sort */
WRITELN ('give me a sort order code: ');
READLN (mysortcode);
/* get the text of the Order By clause from Table2 */
SELECT sortseq INTO mysortseq
  FROM Table2
WHERE sortnbr = mysortcode;
/* concatenate the strings to build the query */
SET query1 = query1 || mysortseq;
/* prepare and execute the dynamic SQL string */
PREPARE query1;
EXECUTE query1;
/* use the results .. */
...
END;
Can you think of a way to do this sorting problem that would avoid using dynamic SQL? That is, can you write a single SELECT statement that could be put into a static SQL statement of the form "EXECUTE MyAnswer USING . . ." instead?



PUZZLE ANSWER

What we would like is to have a column on the end of Table 1 that is in the right order. Such a column can be built without using Table 2 at all. The trick is to use a host language flag (:sortflag) that tells you how to concatenate all of the sort keys into a single string that is in the desired order.
EXEC SQL SELECT Table1.*,
       (CASE WHEN :sortflag = 0
          THEN Table1.a || Table1.b || Table1.c
          WHEN :sortflag = 1
          THEN Table1.a || Table1.c || Table1.b
          ...
          WHEN :sortflag = 5
          THEN Table1.c || Table1.b || Table1.a
          ELSE ' ' END) AS sortfield
      FROM Table1
      WHERE ...
      ORDER BY sortfield;

This answer makes the query self-contained and removes a redundant table from the schema. Obviously, you will have to CAST( ) the three columns strings if they are not already CHAR( ) data types. I left that part out because it would make the code too hard to read. This statement is static SQL and will run faster than a dynamic statement that must be re-compiled every time.


Joe Celko is an Atlanta-based guru with Northern Lights Software Ltd. and a member of the ANSI X3H2 Database Standards Committee. He is also the author of two books on SQL: SQL For Smarties (Morgan-Kaufmann, 1995) and Instant SQL Programming (Wrox Press, 1995). You can contact Joe via CompuServe at 71062,1056 or via email at 71062.1056@compuserve.com.
Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
February 1997 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Wednesday, January 22, 1997.