DBMS, October 1997
DBMS Online: SQL for Smarties By Joe Celko

Doing Bad Things Well -- Part 1

Working With Real Databases, Not the Textbook Examples

The sorry truth of life is that you will have to work with real databases that are not like those clean textbook examples. You will be asked to do strange things to make them spit forth the queries and reports that your users want. A lot of my consulting work consists of tuning SQL for performance. While you can get good results by doing system configuration and hardware tuning, the most payback for the least amount of hassle is in the SQL code. But half the time, the real problems are in the schema, and I am rarely allowed to correct the schema. It is not that my clients like bad performance; they just can't change the schema for physical or operational reasons.

Real databases are often improperly indexed, used to write reports, encoded in the most bizarre ways, denormalized (by accident and not intent), and have legacy operational requirements that make no sense in a relational world. I am going to take a look at these demons and give examples and solutions this month and next. Let's start with improper indexing.

Improper Indexing

At one client site, I found they had indexed a multimillion row table on a Microsoft SQL Server incorrectly by wasting a clustered index on the set of key columns for the table. (By clustered index, I mean that the table and the index are both kept in the same physical sorted order; Oracle uses the term clustered index differently.) This key was a combination of location and equipment identification numbers.

There are two obvious solutions to bad indexing. You can force the optimizer to use particular indexes in many products via explicit vendor extensions that pass hints to the optimizer. Or you can add another index and get your query to run faster.

The forced index trick works like putting a penny in a fuse box. It will work for a short time to get you around a problem, but when the power changes it can burn your house down. To understand what I mean, imagine a database with a column that has a skewed statistical distribution that fools the optimizer. The forced index trick can work wonders here. Over time, however, the distribution can change while the index hint never gets changed and continues to override the optimizer forever.

Adding an index will slow down every update, insert, and delete on the table. If your database uses a pessimistic locking scheme, it can also take also take time to build an index on a large table because it has to lock out all the users. This is a real consideration in an online system. Storage is also a potential problem.

But neither trick works for the client site I've been discussing. Obviously, a table can have only one clustered index and users needed the clustered index for the date stamp of each row because all their reports were based on the most recent week's transactions, not on the equipment itself. Changing the values in a key column is always expensive because of the indexing on it, but, in the case of SQL Server, you could have even more trouble. As of this writing, an update statement that changes a nonclustered key with a date in it using the same nonclustered index can fail to update what it should for some rows and update a few extra rows that it should not. Microsoft tech support has confirmed this to be a bug and it is taking care of it.

The DBA figured out that it would require more than 30 hours to copy the data to a tape, empty the tables, drop and recreate the indexes, sort the tape, and reload the table. Unfortunately, their system has to run online 2437; one hour offline in peak times can mean a loss of hundreds of thousands of dollars.

As an aside, my job was simply to make some critical queries run faster until they could find a solution on new hardware as a stop-gap. The DBA had a good solution to their problems, though. He decided to use a data warehouse approach for the changeover. The company would operate the new hardware in parallel with the old machine and move the database over a little at time while still running the old system in production. A set of scripts would begin running in the odd hours, ask for data, make any changes needed to columns, and store it in the new schema on the second machine. When the two databases held identical data, they could switch over. The only difference between this and a data warehouse extraction operation was that the data was going into a production database and not being summarized for a denormalized data warehouse schema.

Report Processing in SQL

SQL is not a report-writing language, but people keep using it that way and it almost always means getting denormalized result sets. Here is a version of a problem I got on CompuServe. They have a table of edited transactions, called clean transactions or ClnTrans, and a table called payments, and they both have the same structure:

CREATE TABLE ClnTrans 
(acct_id CHAR(5) NOT NULL PRIMARY KEY,
pmt1 DECIMAL(9,2) NOT NULL,
pmt2 DECIMAL(9,2) NOT NULL,
pmt3 DECIMAL(9,2) NOT NULL,
pmt4 DECIMAL(9,2) NOT NULL,
pmt5 DECIMAL(9,2) NOT NULL);

They want to insert the cleaned up transactions from ClnTrans into payments. Yes, I know that copying one table to another is a waste of time because you could use a view to hold the clean and dirty transactions in one table, but bear with me because this is not your ordinary insert. You have to insert the values in such a way that if you find missing payment gaps (shown as 0.00) between payments, you have to push the non-zero value after the gap forward to eliminate those gaps. The user is making us write a report in SQL for him. For example:

ClnTrans

acct_id pmt1 pmt2 pmt3 pmt4 pmt5 
====================================================
'Able' 10.50 15.00 0.00 0.00 30.50 
'Baker' 0.00 20.00 0.00 15.00 0.00 
'Chuck' 5.00 0.00 0.00 0.00 11.00

The ClnTrans table data would be inserted into the payments table this way: Payments

acct_id pmt1 pmt2 pmt3 pmt4 pmt5 
====================================================
'Able' 10.50 15.00 30.50 0.00 0.00 
'Baker' 20.00 15.00 0.00 0.00 0.00 
'Chuck' 5.00 11.00 0.00 0.00 0.00

The immediate approach is to insert the ClnTrans rows into payments, then run a string of updates against the result table with the general format:

UPDATE Payments
SET pmt1 = CASE WHEN pmt1 = 0.00
AND pmt2 > 0.00
THEN pmt2
ELSE 0.00,
pmt2 = CASE WHEN pmt1 = 0.00
AND pmt2 > 0.00
THEN 0.00
ELSE pmt2;

With a little experimentation, you will find some shortcuts, like putting payments 1 and 2 in the same update statement with payments 4 and 5 and what the minimum number of updates required to do the job will be.

But notice that you are violating my rule about thinking in sets and not in procedures. You are writing a step-by-step sorting procedure, using updates as the pairwise swap operators.

Let's try to write the solution as a single update statement that will run faster than anything else. The first step is to collect all the rules you know about the set of payments:

1. If all payments to the left of payment n are greater than 0.00, then payment n does not move.

2. If payment n is a zero, then it will be replaced by nearest nonzero value to its right when the moves are complete.

3. If there are k payments of zero dollars where k > 0, then payments 6 2 k through 5 are zero when the moves are complete.

4. If there are no payments of zero dollars, the moves are already complete. Now we start assembling the parts and pieces of an answer. Let's start with rule 4 in the list. We can outline the answer as an update statement of the form:

UPDATE Payments
SET pmt1 = ??,
pmt2 = ??,
pmt3 = ??,
pmt4 = ??,
pmt5 = ??
WHERE 0.00 IN (pmt1, pmt2, pmt3, pmt4, pmt5)

Rule 2 implies that we would like to use the coalesce() function which returns the first non-null value in a list. But that implies that we need to turn the zeros into nulls before we can use coalesce(). Write that bit of code for payment 1 and keep it aside.

PRE> COALESCE (NULLIF (pmt1, 0.00), NULLIF (pmt2, 0.00), NULLIF (pmt3, 0.00), NULLIF (pmt4, 0.00), NULLIF (pmt5, 0.00))

Rule 3 says that if any payment is a zero, then we know for sure that payment 5 will be a zero after the moves are completed. That can translate into a bit of code, too. We want a way to count the number of zeros in the payments. There is a sign() function that returns -1 for negative numbers, +1 for positive numbers and 0 for zeros. It will be handy for counting the non-zero and zero values on the left and right of a payment n:

pmtN = CASE WHEN (SIGN(pmt1) 
+ ... 
+ SIGN(pmt5)) < N
THEN 0.00
ELSE pmtN END;

An alternate version of the same expression would be to leave the value alone if everything to its left is nonzero:

pmtN = CASE WHEN (SIGN(pmt1) + ... + SIGN(pmtN)) = N THEN pmtN ELSE 0.00 END;

Putting this all into one monster query, we would get:

UPDATE Payments
SET pmt1 = CASE WHEN SIGN(pmt1) = 1 
THEN pmt1 
WHEN (SIGN(pmt1)
+ SIGN(pmt2) 
+ SIGN(pmt3) 
+ SIGN(pmt4) 
+ SIGN(pmt5)) = 0 
THEN 0.00 
WHEN pmt1 = 0.00
THEN COALESCE (NULLIF (pmt2, 0.00),
NULLIF (pmt3, 0.00), 
NULLIF (pmt4, 0.00), 
NULLIF (pmt5, 0.00)) 
END, 
pmt2 = CASE WHEN (SIGN(pmt1
+ SIGN(pmt2)) = 2 
THEN pmt2
WHEN (SIGN(pmt1)
+ SIGN(pmt2) 
+ SIGN(pmt3) 
+ SIGN(pmt4) 
+ SIGN(pmt5)) < 2 
THEN 0.00 
WHEN pmt2 = 0.00
THEN COALESCE (NULLIF (pmt3, 0.00), NULLIF (pmt4, 0.00), NULLIF (pmt5, 0.00)) END, pmt3 = CASE WHEN (SIGN(pmt1) + SIGN(pmt2) + SIGN(pmt3)) = 3 THEN pmt3 WHEN (SIGN(pmt1) + SIGN(pmt2) + SIGN(pmt3) + SIGN(pmt4) + SIGN(pmt5)) < 3 THEN 0.00 WHEN pmt3 = 0.00 THEN COALESCE (NULLIF (pmt4, 0.00), NULLIF (pmt5, 0.00)) END, pmt4 = CASE WHEN (SIGN(pmt1) + SIGN(pmt2) + SIGN(pmt3) + SIGN(pmt4)) = 4 THEN pmt4 WHEN (SIGN(pmt1) + SIGN(pmt2) + SIGN(pmt3) + SIGN(pmt4) + SIGN(pmt5)) < 4 THEN 0.00 WHEN pmt4 = 0.00 THEN pmt5 END, pmt5 = CASE WHEN (SIGN(pmt1) + SIGN(pmt2) + SIGN(pmt3) + SIGN(pmt4) + SIGN(pmt5)) = 5 THEN pmt5 ELSE 0.00 END;

This answer depends on the fact that a case expression executes in the order of the when clauses.

The first when clause says that if you have all nonzeros from the first payment through this payment, then leave this value alone. You are finished now.

The second when clause checks to see if you have enough zeros that this payment will be a zero after the shifting is done. You are finished now.

The third when clause says that if this payment is already a zero, then you need to set it to the first nonzero value to its right. Here is where we use the trick with the NULLIF() functions as parameters in the COALESCE() function to make it work the way we want it to. You are finished now.

There are some minor cleanups in this answer. For example, payment 5 will always stay the same or become a zero because there is no other value to its right to shift into it. Likewise, the coalesce() function parameter list does not need to include the current payment itself since we already know it is a zero.

Next month, I will deal with denormalizing data in the database.


Joe Celko is an Atlanta-based guru with Northern Lights Software Ltd. and a member of the ANSI X3H2 Database Standards Committee. He is also the author of two books on SQL: "SQL For Smarties" (Morgan-Kaufmann, 1995) and "Instant SQL Programming" (Wrox Press, 1995). You can contact Joe via 71062.1056@compuserve.com.

Puzzle Answer

Johannes Becher of CODATA in Germany came up with the following answer. It shows how big improvements can come from small changes.

SELECT DISTINCT C0.id
FROM Customers AS C0
WHERE EXISTS (SELECT C1.id
FROM Customers AS C1
WHERE C0.surname = C1.surname
AND C0.id <> C1.id
AND (CASE WHEN C0.fname = C1.fname 
THEN 1 ELSE 0 END)
+ (CASE WHEN C0.street = C1.street 
THEN 1 ELSE 0 END)
+ (CASE WHEN C0.city = C1.city 
THEN 1 ELSE 0 END)
+ (CASE WHEN C0.state = C1.state 
THEN 1 ELSE 0 END)
+ (CASE WHEN C0.phone = C1.phone 
THEN 1 ELSE 0 END) >= 2);

Note that he added the (C0.id <> D1.id) clause so that every row would not qualify as a duplicate of itself. Also, he moved the CASE expressions from the SELECT list, where it was a calculated column, to the WHERE clause. This made the addition an expression in a predicate where an optimizer could do shortcut evaluation. In a shortcut evaluation, as soon as you know the predicate is TRUE or FALSE, then you stop calculating and return an answer. As a calculated column, it had to be evaluated completely so it could be materialized in the VIEW. Going one step further, if you know the optimizer's order of evaluation (left to right or right to left) of the CASE expressions and the columns mostly to be matched, then you can arrange the expressions for even better performance. For example, city and state are probably not often misspelled and should come first.


What did you think of this article? Send a letter to the editor.


Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
October 1997 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated September 17, 1997