
My father used to tell me that I had to learn to spell because my pencil and my typewriter weren't going to learn for me. Well, not only can my "typewriter" spell for me, but it checks my grammar, looks for synonyms, maintains my files, and sends the mail out to my publisher, too.
My father also used to tell me to get an education because that was something that you could not lose or have taken from you (he grewup during the Great Depression, as you might have guessed). Well, he was wrong again. Nobody has to take your education away from you for you to lose it; technical knowledge has a half-life of its own.
This month, let's look at some myths of the trade that are not always true any more.
When I first learned about optimizing SQL, the general rule for an optimizer was to put the query first into a canonical form. In English, a canonical form means that you rewrite all expressions into a standard pattern in which the order of the operations is always the same and you have eliminated as many levels of nesting as possible. It makes the rest of the processing much easier. I mentioned a disjunctive canonical form for Boolean expressions in my July 1996 column.
The traditional pattern in SQL was to do all of the predicates that apply to a single table (projections) before doing the predicates involving two tables (joins). For example, given:
SELECT Table1.keycol, a, b FROM Table1, Table2 WHERE Table1.a = 42 AND Table2.b = 17 AND Table1.keycol = Table2.keycol;
The optimizer would first mark the rows in Table1 in which column a is equal to 42, then the rows in Table2 in which column b is equal to 17, and finally these rows would be paired based on their keycol values. The reason for this choice is obvious; if you first did the cross join (Cartesian product) of Table1 (say, 1000 rows) and Table2 (say, another 1000 rows), you would have a huge working table (one million rows) to prune down to the final result. However, you can expect that the number of rows in Table1 in which column a is equal to 42 will be fairly small (say, 100 rows), and likewise for Table2.b = 17 (say, 100 rows). This means that the working table will never be worse than 10,000 rows -- a table one hundredth the size of the other approach.
The folklore is that an expression of the form " I am not sure where this bit of folklore started, but it goes back at least as far as Michael Stonebraker and the original Ingres project group at the University of California, Berkeley. However, Stonebraker went on to do the Postgres project and had to unlearn this rule when he got to nontraditional data.
Imagine that you are working for a talent agency with a database that holds head shots of actors. You get a telephone call for a "Joe Celko look-alike" and the client has faxed over a copy of DBMS magazine with my column in it. The query might look like this:
You could go through all of the head shots with the expensive face-matching predicate's algorithm first. But here you are much better off performing the join between your client list and the SAG (Screen Actor Guild) table to find out if you have any dues-paid clients that fit the bill.
The details are in the paper "Predicate Migration: Optimizing Queries with Expensive Predicates" by Joseph M. Hellerstein and Michael Stonebraker. You can get a copy for $1.50 (U.S.) from the ACM. The reference is SIGMOD, 1993, ACM 0-89791-592-5/0005/0267.
At the CA-OpenIngres User Group meeting in Atlanta (May 7, 1996), I picked up a good trick that I had never thought of before. In the best of all SQL implementations, you would write all of your problems as single, Godzilla, moose-killer queries so that the perfect optimizer would have all the information it needs to calculate the perfect execution plan. In short, the world would look like my puzzle column answers.
However, you actually often create temporary tables, working tables, or views that will be materialized at runtime to improve performance in a chain of SQL statements. How these temporary tables are handled varies from product to product, but in most current products they do not appear in the schema information tables because they are created on the fly. The size and data distribution of such tables can vary with the data.
The new trick is to run statistics on the working or temporary table after creating it. This will give you the best possible execution plan for the chain of queries at the time the query is run. Before now, I had always thought of statistics as something to run on the weekend or after a large data load.
In SQL-92, you declare temporary tables as part of the database schema, so they are persistent but always start empty in a user session. This lets you declare referential and other constraints on them, which can be very helpful, but it is not widely implemented yet.
Obviously, the cost of running statistics on the tables involved has to be weighed against the improvement in performance. However, instead of the traditional single execution plan, I can imagine a SQL engine that looks only at the size of a working or temporary table, classifies it as small, medium, or large, and picks a plan dynamically, based on this single bit of critical information.
I had assumed that similar software projects would tend to have pretty much the same behavior regardless of the size of the company in which they are developed. That is, I assumed that the project itself - its resources, team size, budget, and so on - was more important than the project's environment for determining its outcome.
In the March 1995 issue of Open Computing, Adam McCauley cites a Standish Group report that gives some statistics about how software development is really done. Before completion, 31 percent of all software development projects will be canceled. Only an average of 16 percent of all software projects are completed on time and on budget in the trade as a whole. In larger companies, this drops to only nine percent of their software projects. The projects that are completed by the largest American organizations have only 42 percent of the originally proposed features and functions. But the report also states that for smaller companies, 78 percent of their software projects reach deployment with at least 74 percent of their original features and functions.
I do not have a good theory on this yet and I invite some reader feedback. There might be a Dilbert cartoon in this.
When the question of duplicates came up in SQL Committee meetings, we decided to leave it in the standard. The example we used internally, and which later appeared in the letter columns of Database Programming & Design and Datamation magazines in replies from other X3H2 Committee members, was a cash register receipt with multiple occurrences of cans of cat food on it. That is how this got to be the "cat food problem" in the literature.
The fundamental question is, what are you modeling in a table? Dr. Codd and Chris Date's position is that a table is a fact. Dr. Codd pointed out that once you state that a fact is true, it serves no useful purpose to state it again. The other position is that a table can represent an entity or a relationship among entities. With that approach, a duplicate row means more than one occurrence of an entity.
Dr. Codd added a "degree of duplication" (DoD) operator to his model in a note to the ANSI Committee as a way of handling duplicates when he realized that there is information in duplication that has to be handled. The DoD is not exactly a count(*) or a quantity column in the relation. It does not behave like a numeric column. For example, given two tables A and B:
When I union them, I eliminate duplicates and get a result of:
Assume that column y is replaced by the DoD and then do a union:
See the difference? It is an operator, not a value.
Having said all of this, I have been trying only to use duplicate rows for loading data into a SQL database from external or legacy data sources, such as the cash register tapes. I might also leave duplicates in subqueries because using a select distinct to remove them would cost too much sorting time or force an ordering in the working table of the subquery that results in bad performance. In short, I had been regarding duplicate rows like denormalization or dating women who ride motorcycles - you might do it, but you don't tell your mother.
It recently occurred to me that when you have a distributed database, the same logical table will be in several physical tables by definition. My design tools (both mental and software) are not geared for thinking like that, and I am going to need some new normal forms to handle these problems. I think that Dr. Codd and his DoD were ahead of me again.
Because the theme of this column is rethinking things we thought were wrong, let's try two problems that sound hard until you see the trick:
1. How do you ensure that a column will have a single alphabetic character string in it? In many languages, you can specify a column with a constraint such as "foobar char(14) not null is alpha,' or you can use Cobol-style picture specifications, a grep( ) style template, or a Fortran-style format statement.
SQL made a strong effort to separate the logical from the physical, so you don't get much help with specifying the physical layout of your data. When a programmer at our little shop came to me with this one, I came up with some really bad first tries using substrings and between predicates in a check( ) clause that was longer than the whole schema declaration.
2. In January 1996, Bob McGowan sent me a message on CompuServe asking for help with a problem. His client, a financial institution, tracks investment performance with a table constructed in this fashion (loosely translated):
How would you construct a query that would return one row for each portfolio's return over the date range? What he really wants is an aggregate function in the select clause to return a columnar product, such as the sum( ) that returns a columnar total. If you are a math major, you would write these functions as capital Sigma (ý) for summation and capital Pi (ý) for product. (See the Puzzle Answer below.)
1. I keep telling people to think in terms of whole sets and not in a "record at a time" mindset when they write SQL. The trick is to think in terms of whole strings and not in a "character at a time" mindset. Following is the answer:
First you prune off any blanks around the string - and leave any that are inside it. Next you concatenate the string 'AAAAA' to it; notice that this is five letters long (one less than the length of the column), so that if the original string was empty, this will fail the between predicate. If you want to allow a blank string, then make the string of 'A' characters the same length as the column. The third step is to uppercase the string and then see if the results are in the alphabetic range.
2. The nice part about SQL is that you can guess what the syntax would look like if something like this existed in the language:
PROD([DISTINCT] I am not sure that there is any use for the distinct option, but it should be there just the same. We actually have max(distinct x) and min(distinct x) defined in the standard that are totally redundant. This new aggregate function would let us write his problem simply as:
I have a trick that will do this operation in one query, but you need a second table that looks like the following for a period of five days:
Let's assume you want to look at January 6 to 10; you need to update the execute date column to that range, thus:
Now you perform the following query:
The answer is ('ABC', 1.76136576).
If anyone is missing a rate_of_return entry on a date in that range, their product will be zero. That might be fine, but if you need to get a null when you have missing data, then replace each sum( ) expression with:
The way this trick works is that, in effect, you have an identity matrix in a table. When you perform the cross join on the original table and the identity matrix table, only the matching execute_date value is multiplied by one in the sum( ); the other values are multiplied by zero and disappear. - Joe Celko
SELECT *
FROM TalentPool, SAG
WHERE TalentPool.member_no = SAG.member_no
AND head shot LOOKSLIKE { Joe Celko FAX object };
New Optimization Trick
Finishing Projects
Duplicate Rows
A B
x y x y
==== ====
a 4 a 7
b 3 b 3
A UNION B
x y
====
a 4
b 3
a 7
A UNION B
x DoD
======
a 11
b 6
Puzzle
CREATE TABLE Performance
(portfolio_id CHAR(5) NOT NULL,
execute_date DATE NOT NULL,
rate_of_return DECIMAL(5,2) NOT NULL,
PRIMARY KEY (portfolio_id, execute_date));
INSERT INTO Performance VALUES ('ABC', '1996-01-06',
0.15);
INSERT INTO Performance VALUES ('ABC', '1996-01-07',
0.10);
INSERT INTO Performance VALUES ('ABC', '1996-01-08',
0.12);
INSERT INTO Performance VALUES ('ABC', '1996-01-09',
0.11);
INSERT INTO Performance VALUES ('ABC', '1996-01-10',
0.12);
The formula to calculate a rate of return over a date range (1 to N) is:
(1.0 + (rate_of_return on day 1))
* (1.0 + (rate_of_return on day 2))
* ......
* (1.0 + (rate_of_return on day N))
Puzzle Answer
CREATE TABLE Foobar
(alpha_only VARCHAR(6)
CHECK ((UPPERCASE(TRIM(alpha_only)) || 'AAAAA')
BETWEEN 'AAAAAA' AND 'ZZZZZZ'),
);
SELECT portfolio_id, PROD(1.0 + rate_of_return)
FROM Performance
WHERE execute_date BETWEEN start_date AND end_date
GROUP BY portfolio_id;
CREATE TABLE BigPi
(execute_date DATE NOT NULL PRIMARY KEY,
day_1 INTEGER NOT NULL CHECK (day_1 IN (0, 1)),
day_2 INTEGER NOT NULL CHECK (day_2 IN (0, 1)),
day_3 INTEGER NOT NULL CHECK (day_3 IN (0, 1)),
day_4 INTEGER NOT NULL CHECK (day_4 IN (0, 1)),
day_5 INTEGER NOT NULL CHECK (day_5 IN (0, 1)));
INSERT INTO BigPi VALUES ('1996-01-06', 1, 0, 0, 0,
0);
INSERT INTO BigPi VALUES ('1996-01-07', 0, 1, 0, 0,
0);
INSERT INTO BigPi VALUES ('1996-01-08', 0, 0, 1, 0,
0);
INSERT INTO BigPi VALUES ('1996-01-09', 0, 0, 0, 1,
0);
INSERT INTO BigPi VALUES ('1996-01-10', 0, 0, 0, 0,
1);
SELECT portfolio_id,
(SUM((1.0 + P1.rate_of_return) * M1.day_1) *
SUM((1.0 + P1.rate_of_return) * M1.day_2) *
SUM((1.0 + P1.rate_of_return) * M1.day_3) *
SUM((1.0 + P1.rate_of_return) * M1.day_4) *
SUM((1.0 + P1.rate_of_return) * M1.day_5))
AS product
FROM Performance AS P1 CROSS JOIN BigPi AS M1
WHERE M1.execute_date = P1.execute_date
AND P1.execute_date
BETWEEN '1996-01-06' AND '1996-01-10'
GROUP BY portfolio_id;
IF SUM((1.0 + P1.rate_of_return) * M1.day_1) = 0
THEN NULL
ELSE SUM((1.0 + P1.rate_of_return) * M1.day_1)
ENDIF
Joe Celko is a member of the ANSI X3H2 Database Standards Committee, a widely published author, and a consultant with OSoft Development Corp. in Atlanta. Joe is also a frequent contributor to the DBMS Forum on CompuServe. You can email him at 71062.1056@compuserve.com.
Table of Contents - August 1996 | Home Page
Copyright © 1996 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to mfrank@mfi.com
Updated Monday, August 12, 1996