Moving right along with the theme from last month's column, let's look at another common example of bad SQL. A denormalized table is usually the result of moving a flat file directly into a table on the incorrect assumption that records and fields in a file are like rows and columns in a table.
One of the most common problems you will find is a repeated group in the table that has two or more related columns whose order does not matter. I will use teams of bridge partners as my example, but it might be dependents in a personnel table or anything where the order of the members of the group can appear in any order. The best answer is to make the order matter.
CREATE TABLE Teams (team_nbr INTEGER NOT NULL PRIMARY KEY, player1 CHAR(15) NOT NULL, player2 CHAR(15) NOT NULL, CHECK (player1 < player2));
If someone is allowed to play by himself, then use the constraint check (player1 <= player2) instead. From this point on, the front end should enforce this ordering on new or updated data coming into the table.
This is a nice theory, but the front end is probably already in place and the user has been typing things without worries about alphabetizing the data for a long time. The real trick is to make sure that the value from updates, insertions, and deletions also stay ordered on the database server side, no matter how the user submits them. This can be done with some case expressions in single SQL statements, or hidden in stored procedures, such as this:
a) Deletion by team number is trivial:
DELETE FROM Teams WHERE team_nbr = :drop_team;
b) Deletion of a team with a single player player1 is done with the statement:
DELETE FROM Teams WHERE :drop_player1 IN (player1, player2);
c) Deletion of a team with an unordered pair of players (:drop_player1, :drop_player2) is a bit trickier:
DELETE FROM Teams WHERE :drop_player1 = CASE WHEN :drop_player1 < :drop_player2 THEN player1 ELSE player2 END AND :drop_player2 = CASE WHEN :drop_player1 < :drop_player2 THEN player2 ELSE player1 END;
d) Insert a new team
(:team_parm, :inst_player1, :inst_ player2): INSERT INTO Teams (team_nbr, player1, player2) VALUES (:team_parm, CASE WHEN :inst_player1 < :inst_player2 THEN :inst_player1 ELSE :inst_player2 END, CASE WHEN :inst_player1 < :inst_player2 THEN :inst_player2 ELSE :inst_player1 END);
e) Updating is done with similar case statements written as needed for the particular situation. For example to change a partner on a team:
UPDATE Teams SET player1 = CASE WHEN player1 = :drop_player THEN CASE WHEN :new_player <= player2 THEN :new_player ELSE player2 END END, player2 = CASE WHEN player2 = :drop_player THEN CASE WHEN :new_player >= player1 THEN :new_player ELSE player1 END END WHERE :drop_player IN (player1, player2);
If you need to see the data on the teams with either player listed first, then you can create a view like this:
CREATE VIEW FlipFlop (team_nbr, player1, player2) AS SELECT team_nbr, player1, player2 FROM Teams UNION SELECT team_nbr, player2, player1 FROM Teams;
This view should not be used when you have to count players or teams because players will be duplicated, and it is not updatable because of the union.
Everyone tells you that you need to denormalize for performance, but nobody ever told you how to do it. May I make a suggestion? When you denormalize a set of tables, you are usually destroying a functional dependency or two. If you have a good implementation of declarative referential integrity and check() constraints in your SQL, you can enforce the functional dependency in the denormalized table.
Let's take a classic bad example. We have a table that shows the students, their advisors, and their major department in one table.
CREATE TABLE StudentAdvisors (teacher CHAR(15) NOT NULL, student CHAR(15) NOT NULL, dept CHAR(15) NOT NULL);
Let's further assume that we know that each department has one and only one advisor, and each student has one and only one major and must declare it. Now consider the following data:
StudentAdvisors teacher student department ============================== Celko Wilson Comp Sci Celko Higgins Comp Sci Kimball Inmon Logistics
If Inmon drops out of Kimball's department, the deletion will destroy the fact that Kimball is with the Logistics department (deletion anomaly). If Wilson changes over to the Logistics department (update anomaly), then Celko becomes the advisor to two departments. If the English department gets any student, then we have to make up an advisor for them before we can let them into school (insertion anomaly). The normalized version of this example would be two tables that look like this:
CREATE TABLE Advisors (teacher CHAR(15) NOT NULL, dept CHAR(15) NOT NULL, PRIMARY KEY (advisor, dept)); CREATE TABLE StudentMajors (student CHAR(15) NOT NULL, dept CHAR(15) NOT NULL, PRIMARY KEY (student, dept));
Now, imagine that you cannot get rid of the original table and replace it with these two tables. The first thought is to use views that have select distinct clauses that will give us the appearance of these normalized tables. These views will not be updatable because their rows will not uniquely reference single rows in the base table and they will not enforce the two rules we have on the data because they have no constraints to do so.
The better approach for this particular situation would be unique() constraints in the table declaration, which mirror the primary key declarations in the normalized tables. This still does not solve the problems caused by missing data, so we need to add some default values. This gives us a new table declaration:
CREATE TABLE StudentAdvisors
(teacher CHAR(15) NOT NULL DEFAULT '{ To be Assigned }',
student CHAR(15) NOT NULL DEFAULT '{ No Students }',
dept CHAR(15) NOT NULL DEFAULT '{ Undeclared Major }',
UNIQUE (student),
UNIQUE (student, dept),
UNIQUE (dept, teacher));
However, you will need to do something about the defaults, or you will be allowing only one student to have an undeclared major, and so forth. You do this in the front-end code by adding a few characters to the end of each default value you insert. Then when the data is displayed, you only show the few characters of the string to hide their uniqueness. Before you ask, the unique () constraint allows only one null, so dropping the not null constraint will not help.
Is this an awkward kludge? Hard to program? Yes! This is just part of the penalty you have to pay for not normalizing your data.
The Transaction Processing Performance Council (www.tpc.org) is a consortium of computer companies that publishes benchmarks for OLTP systems. The A, B, C, and D benchmarks each examine a different aspect of an OLTP system. The rule of thumb is that the higher the letter on the end of the name, the more complex the data involved. Everyone has a different opinion about how useful these benchmarks are, but everyone agrees that they are the only game in town.
The TPC-A benchmark did little more than measure the access and transmission rates of the hardware involved. The TPC-B and TPC-C benchmarks were more realistic measurements of transaction processing systems. But the TPC-D benchmark breaks the pattern and examines system performance in a decision-support workload, which is where I do my work. It differs from the previous OLTP benchmarks in testing system performance in workloads with few transactions involving a large amount of data and more complex SQL statements.
Three of the five primary metrics are: Query Processing Performance, which measures raw query execution when all available resources are concentrated on a single query; Query Throughput, meant to show a system's ability to process concurrent queries; and Queries Per Hour, a measure of the price/performance ratio calculated by dividing the five-year cost of ownership by the composite of the two performance metrics.
It is definitely worth asking for the TPC-D results from your vendor before you commit to a data warehouse platform. "The sweet spot of the NT market today is between 10GB and 30GB of data, with the larger NT data marts and warehouses incorporating 100GB and beyond," stated Robert Craig, director of the Data Warehousing and Business Intelligence Service, Hurwitz Group, Inc. I would guess that the sweet spot of the Unix database market runs to bigger sizes, and that mainframes would be larger still.
One of my big worries about databases is how the data is encoded, so I have an interest in standards for such things.
A good general source of information is at www.fedstats.gov, which has links and search utilities to find statistical information generated by any federal statistics agency without having to know in advance what agency produces or publishes the data you want. It presents you with an index that lists 275 predefined categories of statistical information. A user can limit a keyword search to specific statistics agencies or cover all federal statistics. The "Fast Facts" section contains often-requested graphics and tables similar to those on the Internet. The entire Statistical Abstract of the United States is also on the FedStats site.
Point your Web browser to link-usa.com/zipcode where you will find an offer for a database of latitude and longitude, and also sample programs in VB and C++ that show how to calculate the distance between two zip codes. Their price is between $50 and $100, depending on what other information you need. If you just need raw census data go to tiger.census.gov or you can also get a compressed version of the files at ftp://ftp.census.gov/pub/tiger/tms/gazetteer.
Most Americans know about ZIP codes but do not know about the international standards for two and three character codes for language names and country names. These standards are maintained by ISO TC37.
2-character language names -- ISO 639:1988 3-character language names -- ISO 639-2 (year unknown) 2-character country names -- ISO 3166 3-character country names -- ISO 3166
Unofficial extracts in English from the ISO 639:1998 standards are available as follows at www.indigo.ie/egt/standards/iso639/iso639-1-en.html. You can find copies of ISO 639 at www.free.net/Docs/standards/ISO/639.txt.
The August 18, 1997 issue of the Washington Perspective, a newsletter published by America's Community Bankers, reports: "Sen. Robert Bennett (R-Utah), chairman of the Senate Banking Committee's Financial Services and Technology Subcommittee, has asked the federal banking regulators to help him develop legislation to shield banks from Year 2000 liability."
The article goes on to say that while Bennett felt regulators should do more to increase awareness of the problem and encourage efforts to fix it, he also thought that once financial institutions have been certified as Year 2000-compliant, they should be afforded some protection from "liability due to failures in other systems over which they have no control."
The community banks are being surveyed about their plans and progress toward solving their millennium conversion problems. The Office of Thrift Supervision has appointed a high-level examiner, Dorothy van Cleve of the Midwest Region, as its Year 2000 Coordinator and put up a Y2K Web page (www.access.gpo.gov/ots/y2k.html) with some interesting links.
Clive M. Walden posted a problem on the CompuServe case forum that used a denormalized database, something like what we have discussed in this column. The table records which one of three possible supervisors have authorized a piece of work in a workflow model. The table looks like this:
CREATE TABLE WorkFlow
(job CHAR(10) NOT NULL,
auth1 CHAR(1),
auth2 CHAR(1),
auth3 CHAR(1));
INSERT INTO WorkFlow VALUES ('A', 'a', 'b', 'c');
INSERT INTO WorkFlow VALUES ('B', 'a', 'b', 'c');
INSERT INTO WorkFlow VALUES ('C', 'b', NULL, 'd');
INSERT INTO WorkFlow VALUES ('D', NULL, NULL, NULL);
CREATE TABLE Personnel
(emp_id CHAR(1),
fname CHAR(10));
INSERT INTO Personnel VALUES ('a', 'Joe');
INSERT INTO Personnel VALUES ('b', 'Mary');
INSERT INTO Personnel VALUES ('c', 'Ann');
INSERT INTO Personnel VALUES ('d', 'Bill');
The goal is to find all the ways to produce a result with the job and the first name of the person who authorized it.
Results job authorized_by ===================== A Joe A Mary A Ann B Joe B Mary B Ann C Mary C Bill D NULL
Anders Altberg came up with the most solutions on CompuServe. Here is his list:
SELECT W1.job, P1.fname AS authorized_by FROM WorkFlow AS W1 LEFT JOIN Personnel AS P1 ON POSITION (P1.id IN W1.auth1||W1.auth2| |W1.auth3) > 0;or like this:
SELECT W1.job, P1.fname AS authorized_by FROM WorkFlow AS W1 FULL OUTER JOIN Personnel AS P1 ON W1.auth1||W1.auth2||W1.auth3 LIKE '%'||P1.emp_id||'%'or
SELECT W1.job, P1.fname AS authorized_by; FROM WorkFlow AS W1 LEFT OUTER JOIN Personnel AS P1 ON P1.emp_id IN (W1.auth1, W1.auth2, W1.auth3);You can also display them across the page in their own columns.
SELECT W1.job, P3.fname AS name1, P2.fname AS name2, P1.fname AS name3 FROM Personnel AS P3 RIGHT OUTER JOIN Personnel AS P2 RIGHT OUTER JOIN Personnel AS P1 RIGHT OUTER JOIN WorkFlow AS W1 ON P1.emp_id = W1.auth1 ON P2.emp_id = W1.auth2 ON P3.emp_id = W1.auth3; Results job name1 name2 name3 ================================== A Ann Mary Joe B Ann Mary Joe C Bill NULL Mary D NULL NULL NULL