DBMS, November 1997
DBMS Online: SQL for Smarties By Joe Celko

Doing Bad Things Well -- Part II

Denormalizing for Performance and Data Encoding Schemes.


Moving right along with the theme from last month's column, let's look at another common example of bad SQL. A denormalized table is usually the result of moving a flat file directly into a table on the incorrect assumption that records and fields in a file are like rows and columns in a table.

One of the most common problems you will find is a repeated group in the table that has two or more related columns whose order does not matter. I will use teams of bridge partners as my example, but it might be dependents in a personnel table or anything where the order of the members of the group can appear in any order. The best answer is to make the order matter.

CREATE TABLE Teams
(team_nbr INTEGER NOT NULL PRIMARY KEY,
player1 CHAR(15) NOT NULL,
player2 CHAR(15) NOT NULL,
CHECK (player1 < player2));

If someone is allowed to play by himself, then use the constraint check (player1 <= player2) instead. From this point on, the front end should enforce this ordering on new or updated data coming into the table.

This is a nice theory, but the front end is probably already in place and the user has been typing things without worries about alphabetizing the data for a long time. The real trick is to make sure that the value from updates, insertions, and deletions also stay ordered on the database server side, no matter how the user submits them. This can be done with some case expressions in single SQL statements, or hidden in stored procedures, such as this:

a) Deletion by team number is trivial:

DELETE FROM Teams
WHERE team_nbr = :drop_team;

b) Deletion of a team with a single player player1 is done with the statement:

DELETE FROM Teams
WHERE :drop_player1 IN (player1, player2);

c) Deletion of a team with an unordered pair of players (:drop_player1, :drop_player2) is a bit trickier:

DELETE FROM Teams
WHERE :drop_player1 = 
CASE WHEN :drop_player1 < :drop_player2
THEN player1 
ELSE player2 END
AND :drop_player2 = 
CASE WHEN :drop_player1 < :drop_player2
THEN player2 
ELSE player1 END;

d) Insert a new team

(:team_parm, :inst_player1, :inst_
player2):
INSERT INTO Teams (team_nbr, player1, player2) 
VALUES (:team_parm,
CASE WHEN :inst_player1 < :inst_player2 
THEN :inst_player1 
ELSE :inst_player2 END,
CASE WHEN :inst_player1 < :inst_player2 
THEN :inst_player2 
ELSE :inst_player1 END);

e) Updating is done with similar case statements written as needed for the particular situation. For example to change a partner on a team:

UPDATE Teams
SET player1 = CASE 
WHEN player1 = :drop_player 
THEN CASE 
WHEN :new_player <= player2 
THEN :new_player 
ELSE player2 END 
END, 
player2 = CASE 
WHEN player2 = :drop_player 
THEN CASE 
WHEN :new_player >= player1
THEN :new_player 
ELSE player1 END
END
WHERE :drop_player IN (player1, player2); 

If you need to see the data on the teams with either player listed first, then you can create a view like this:

CREATE VIEW FlipFlop (team_nbr, player1, player2) 
AS SELECT team_nbr, player1, player2 
FROM Teams
UNION
SELECT team_nbr, player2, player1
FROM Teams;

This view should not be used when you have to count players or teams because players will be duplicated, and it is not updatable because of the union.

Constraints and Denormalization

Everyone tells you that you need to denormalize for performance, but nobody ever told you how to do it. May I make a suggestion? When you denormalize a set of tables, you are usually destroying a functional dependency or two. If you have a good implementation of declarative referential integrity and check() constraints in your SQL, you can enforce the functional dependency in the denormalized table.

Let's take a classic bad example. We have a table that shows the students, their advisors, and their major department in one table.

CREATE TABLE StudentAdvisors
(teacher CHAR(15) NOT NULL,
student CHAR(15) NOT NULL,
dept CHAR(15) NOT NULL);

Let's further assume that we know that each department has one and only one advisor, and each student has one and only one major and must declare it. Now consider the following data:

StudentAdvisors
teacher student department 
==============================
Celko Wilson Comp Sci
Celko Higgins Comp Sci
Kimball Inmon Logistics

If Inmon drops out of Kimball's department, the deletion will destroy the fact that Kimball is with the Logistics department (deletion anomaly). If Wilson changes over to the Logistics department (update anomaly), then Celko becomes the advisor to two departments. If the English department gets any student, then we have to make up an advisor for them before we can let them into school (insertion anomaly). The normalized version of this example would be two tables that look like this:

CREATE TABLE Advisors
(teacher CHAR(15) NOT NULL,
dept CHAR(15) NOT NULL,
PRIMARY KEY (advisor, dept));
CREATE TABLE StudentMajors 
(student CHAR(15) NOT NULL,
dept CHAR(15) NOT NULL,
PRIMARY KEY (student, dept));

Now, imagine that you cannot get rid of the original table and replace it with these two tables. The first thought is to use views that have select distinct clauses that will give us the appearance of these normalized tables. These views will not be updatable because their rows will not uniquely reference single rows in the base table and they will not enforce the two rules we have on the data because they have no constraints to do so.

The better approach for this particular situation would be unique() constraints in the table declaration, which mirror the primary key declarations in the normalized tables. This still does not solve the problems caused by missing data, so we need to add some default values. This gives us a new table declaration:

CREATE TABLE StudentAdvisors 
(teacher CHAR(15) NOT NULL DEFAULT '{ To be Assigned }',
student CHAR(15) NOT NULL DEFAULT '{ No Students }',
dept CHAR(15) NOT NULL DEFAULT '{ Undeclared Major }',
UNIQUE (student),
UNIQUE (student, dept), 
UNIQUE (dept, teacher)); 

However, you will need to do something about the defaults, or you will be allowing only one student to have an undeclared major, and so forth. You do this in the front-end code by adding a few characters to the end of each default value you insert. Then when the data is displayed, you only show the few characters of the string to hide their uniqueness. Before you ask, the unique () constraint allows only one null, so dropping the not null constraint will not help.

Is this an awkward kludge? Hard to program? Yes! This is just part of the penalty you have to pay for not normalizing your data.

TPC-D

The Transaction Processing Performance Council (www.tpc.org) is a consortium of computer companies that publishes benchmarks for OLTP systems. The A, B, C, and D benchmarks each examine a different aspect of an OLTP system. The rule of thumb is that the higher the letter on the end of the name, the more complex the data involved. Everyone has a different opinion about how useful these benchmarks are, but everyone agrees that they are the only game in town.

The TPC-A benchmark did little more than measure the access and transmission rates of the hardware involved. The TPC-B and TPC-C benchmarks were more realistic measurements of transaction processing systems. But the TPC-D benchmark breaks the pattern and examines system performance in a decision-support workload, which is where I do my work. It differs from the previous OLTP benchmarks in testing system performance in workloads with few transactions involving a large amount of data and more complex SQL statements.

Three of the five primary metrics are: Query Processing Performance, which measures raw query execution when all available resources are concentrated on a single query; Query Throughput, meant to show a system's ability to process concurrent queries; and Queries Per Hour, a measure of the price/performance ratio calculated by dividing the five-year cost of ownership by the composite of the two performance metrics.

It is definitely worth asking for the TPC-D results from your vendor before you commit to a data warehouse platform. "The sweet spot of the NT market today is between 10GB and 30GB of data, with the larger NT data marts and warehouses incorporating 100GB and beyond," stated Robert Craig, director of the Data Warehousing and Business Intelligence Service, Hurwitz Group, Inc. I would guess that the sweet spot of the Unix database market runs to bigger sizes, and that mainframes would be larger still.

Data Encoding Schemes

One of my big worries about databases is how the data is encoded, so I have an interest in standards for such things.

A good general source of information is at www.fedstats.gov, which has links and search utilities to find statistical information generated by any federal statistics agency without having to know in advance what agency produces or publishes the data you want. It presents you with an index that lists 275 predefined categories of statistical information. A user can limit a keyword search to specific statistics agencies or cover all federal statistics. The "Fast Facts" section contains often-requested graphics and tables similar to those on the Internet. The entire Statistical Abstract of the United States is also on the FedStats site.

Point your Web browser to link-usa.com/zipcode where you will find an offer for a database of latitude and longitude, and also sample programs in VB and C++ that show how to calculate the distance between two zip codes. Their price is between $50 and $100, depending on what other information you need. If you just need raw census data go to tiger.census.gov or you can also get a compressed version of the files at ftp://ftp.census.gov/pub/tiger/tms/gazetteer.

Most Americans know about ZIP codes but do not know about the international standards for two and three character codes for language names and country names. These standards are maintained by ISO TC37.

2-character language names -- ISO 639:1988 3-character language names -- ISO 639-2 (year unknown) 2-character country names -- ISO 3166 3-character country names -- ISO 3166

Unofficial extracts in English from the ISO 639:1998 standards are available as follows at www.indigo.ie/egt/standards/iso639/iso639-1-en.html. You can find copies of ISO 639 at www.free.net/Docs/standards/ISO/639.txt.

Banking on 2000

The August 18, 1997 issue of the Washington Perspective, a newsletter published by America's Community Bankers, reports: "Sen. Robert Bennett (R-Utah), chairman of the Senate Banking Committee's Financial Services and Technology Subcommittee, has asked the federal banking regulators to help him develop legislation to shield banks from Year 2000 liability."

The article goes on to say that while Bennett felt regulators should do more to increase awareness of the problem and encourage efforts to fix it, he also thought that once financial institutions have been certified as Year 2000-compliant, they should be afforded some protection from "liability due to failures in other systems over which they have no control."

The community banks are being surveyed about their plans and progress toward solving their millennium conversion problems. The Office of Thrift Supervision has appointed a high-level examiner, Dorothy van Cleve of the Midwest Region, as its Year 2000 Coordinator and put up a Y2K Web page (www.access.gpo.gov/ots/y2k.html) with some interesting links.


Puzzle

Clive M. Walden posted a problem on the CompuServe case forum that used a denormalized database, something like what we have discussed in this column. The table records which one of three possible supervisors have authorized a piece of work in a workflow model. The table looks like this:

CREATE TABLE WorkFlow 
(job CHAR(10) NOT NULL, 
auth1 CHAR(1), 
auth2 CHAR(1), 
auth3 CHAR(1));
INSERT INTO WorkFlow VALUES ('A', 'a', 'b', 'c'); 
INSERT INTO WorkFlow VALUES ('B', 'a', 'b', 'c'); 
INSERT INTO WorkFlow VALUES ('C', 'b', NULL, 'd'); 
INSERT INTO WorkFlow VALUES ('D', NULL, NULL, NULL);
CREATE TABLE Personnel 
(emp_id CHAR(1), 
fname CHAR(10));
INSERT INTO Personnel VALUES ('a', 'Joe'); 
INSERT INTO Personnel VALUES ('b', 'Mary'); 
INSERT INTO Personnel VALUES ('c', 'Ann'); 
INSERT INTO Personnel VALUES ('d', 'Bill');
The goal is to find all the ways to produce a result with the job and the first name of the person who authorized it.
Results
job authorized_by
=====================
A Joe
A Mary
A Ann
B Joe
B Mary
B Ann
C Mary
C Bill
D NULL 

Puzzle Answer

Anders Altberg came up with the most solutions on CompuServe. Here is his list:

SELECT W1.job, P1.fname AS authorized_by
FROM WorkFlow AS W1 
LEFT JOIN 
Personnel AS P1
ON POSITION (P1.id IN W1.auth1||W1.auth2|
|W1.auth3) > 0;
or like this:
SELECT W1.job, P1.fname AS authorized_by
FROM WorkFlow AS W1 
FULL OUTER JOIN 
Personnel AS P1
ON W1.auth1||W1.auth2||W1.auth3 LIKE 
'%'||P1.emp_id||'%'
or
SELECT W1.job, P1.fname AS authorized_by;
FROM WorkFlow AS W1 
LEFT OUTER JOIN 
Personnel AS P1
ON P1.emp_id IN (W1.auth1, 
W1.auth2, W1.auth3);
You can also display them across the page in their own columns.
SELECT W1.job, 
P3.fname AS name1, 
P2.fname AS name2, 
P1.fname AS name3
FROM Personnel AS P3
RIGHT OUTER JOIN Personnel AS P2
RIGHT OUTER JOIN Personnel AS P1
RIGHT OUTER JOIN WorkFlow AS W1
ON P1.emp_id = W1.auth1
ON P2.emp_id = W1.auth2
ON P3.emp_id = W1.auth3;
Results
job name1 name2 name3
==================================
A Ann Mary Joe
B Ann Mary Joe
C Bill NULL Mary
D NULL NULL NULL


Joe Celko is an Atlanta-based guru with Northern Lights Software Ltd. and a member of the ANSI X3H2 Database Standards Committee. He is also the author of two books on SQL: SQL For Smarties (Morgan-Kaufmann, 1995) and Instant SQL Programming (Wrox Press, 1995). You can contact Joe via email at 71062.1056@compuserve.com.
What did you think of this article? Send a letter to the editor.


Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
November 1997 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated October 1, 1997