Every February, I like to try to sneak a column with more of a mathematical, statistical, or theoretical content past my editor. I feel that a programmer without some mathematical, statistical, or theoretical content in his or her head is not going to be able to solve problems. What I usually do is start with a strong database tie-in, then move on to more abstract topics. So here is this yearıs attempt. [Editorıs note: We always ask Joe to write columns like this, and he only relents about once a year, I swear!]
A translation or lookup table is a table for converting encodings into something a human being or a query can use. (I prefer the term "encoding" to just "code" because the second term has so many meanings in this trade.) The usual schema for these tables is:
CREATE TABLE Translations (encode_1 <datatype> NOT NULL, encode_2 <datatype> NOT NULL, ... encode_n <datatype> NOT NULL, meaning <datatype> NOT NULL, PRIMARY KEY (encode_1, encode_2, ... encode_n));
They are tables in which one or more encodings lead to a unique meaning. You will find them all over the place in a relational database, and they can be used almost anywhere. An example of such a schema when n=1 would be a look up table for ZIP codes and the corresponding city and state (ZIP code, city-state). For n=2, we could use (longitude, latitude, city-state).
We had a thread on a CompuServe forum with a guy who had what he considered to be a novel approach to handling translation tables. Instead of keeping just the encodings and the meaning in the table, he added a third column that identified the code. This approach allowed him to put all the encodings in his system in one table, instead of 20 or 30 small tables, each holding one encoding.
CREATE TABLE Translations (encode_type CHAR(n) NOT NULL, encoding VARCHAR(n) NOT NULL, meaning VARCHAR(m) NOT NULL, PRIMARY KEY (encode_type, encoding));
Having them all in one table let him write a single front-end program to maintain all the encodings. I do not write front-end programs myself, but I was told that this is a common trick among front-end programmers.
I hope not. This method really stinks, and I want to spend some time to discourage it. Without looking at the following paragraphs, sit down and make a list of all the disadvantages, and see if you found anything I missed.
1. Storage size. The total storage required for the "monster table approach" is greater than the storage required for the "one encoding, one table" approach because of the redundant encoding type column.
2. Datatypes. All encodings are forced into one datatype that has to be a string of the largest length that any encoding uses in the system. But varchar(n) is not always the best way to represent data. char(n) data often has advantages for access and storage in many SQL products. Numeric encodings can take advantage of arithmetic operators for ranges, check digits, and so forth with check clauses. Dates can be used as codes that are translated into holidays and other events. Datatypes are not a "one size fits all" affair.
3. Validation. The only way to write a check() clause on the monster table is with a huge CASE expression of the form:
CREATE TABLE Translations (encode_type CHAR(n) NOT NULL CHECK (encode_type IN (<type 1>, . . . , <type n>)), encoding VARCHAR(n) NOT NULL CHECK (CASE WHEN encode_type = <type 1> AND <validation 1> THEN 1 = 1 . . . WHEN encode_type = <type n> AND <validation n> THEN 1 = 1 ELSE 1 = 2 END), meaning VARCHAR(m) NOT NULL, PRIMARY KEY (encode_type, encoding));
This means that validation is going to take a long time, because every change will have to be considered by all the when clauses in this oversized CASE expression until the SQL engine finds one that tests true. You also need to add a check() clause to the encode_type column to be sure that the user does not create an invalid encoding name.
4. Flexibility. The monster table is created with one column for the encoding, so it cannot be used for n-valued encodings where n > 1.
5. Security. To avoid exposing rows in one encoding scheme to unauthorized users, the monster table has to have views defined on it that restrict a user to the encode_types that he or she is allowed to update. At this point, some of the rationale for the single table is gone because the front end must now handle views in almost the same way that it would handle multiple tables. These views also must have the with check option clause so the user does not make a valid change outside the scope of his or her permissions.
6. Normalization. The real reason this approach does not work is that it is an attempt to violate first normal form (1NF). Yes, I can see that these tables have a primary key and that all the columns in a SQL database have to be scalar and of one datatype.
But I will still argue that it is not a 1NF table. I realize that people use the term "datatype" to mean different things, so let me clarify my terms. I mean those primitive, built-in things that come with the computer. I then use datatypes to build domains and use domains to represent the values of attributes.
For example, I might decide that I need to have a column for Dewey decimal codes in a table that represents library books. I can then pick decimal(6,3), numeric(6,3), real, float, or a pair of integers as the datatype to hold the "three digits, decimal point, three digits" format of the Dewey decimal code domain.
The fact that two domains use the same datatype does not make them the same attribute. The extra encode_type column changes the domain of the other columns and thus violates first normal form.
Before the more experienced SQL programmer asks, the indicator that is passed with a SQL variable to a host program in embedded SQL to indicate the presence of a null is not like this. It is not changing the domain of the SQL variable.
The Online Database Derby is a programming contest for object databases. The 1997 contest was organized by Doug Barry of Barry & Associates with sponsorship from IDX Systems Corp. (which provided the medical records database that was used for the programming problems), Distributed Object Computing Magazine, and our sister magazine Database Programming & Design.
The results were posted on November 12, 1997 at www.odbmsfacts.com/derby/. The contest problem was to develop and enhance an assigned medical records application and database.
Each contestant met all requirements of the Derby, including completion of all development; completion of an availability test and an enhancement test; providing all application, data definition, and database build source code; and making the Internet application available on the Web for 21 days of the availability test. The medical records database used had more than 8 million records and was about 1GB in size.
Two companies completed all the requirements of the Derby and received a Winnersı Circle Award. They were InterSystems Corp. (www.intersys.com) on a Digital Alpha Station 500 MHz and POET Software (www.poet.com) running on a Pentium 166 MHz.
IBEX Object Systems, O2 Technology, and Versant Object Technology also participated in the Derby. These companies did not reach the winnersı circle mainly because their applications were not running at the very beginning of the availability test period.
However, no participating database managers failed throughout the 21-day availability test, although there were a series of problems for some of the participants, including problems with machines, network failure, and changes made by service providers. According to Barry, "The sturdiness of the database managers indicates the level of development that all of these products have reached."
Visitors to the Derby site have an opportunity to get a great deal of information about real-world object databases from the Web site, including participant performance graphs for a 21-day period and a look at source code. Over the course of the event, more than 25,000 sessions were registered at the Derby Web site.
Since I have a client who is in this business, I have been looking at "online auctions" on the Internet. There seem to be two flavors of auctions right now, but I expect others to evolve. One is the request for quotations (RFQ) type and the other is the more traditional auction.
An RFQ site allows a purchaser, usually a company, to post a notice on a Web site that it is looking for a particular item or service and then to receive bids for it. In the old days, this used to be done by placing an announcement in The Federal Register if the requester was a federal government agency. Private companies tended to have a list of suppliers they would consider and the RFQs were mailed or faxed to that selected bidder list.
Some prequalification of the bidders makes sense; Joeıs Garage has no business bidding on a fighter plane. This prequalification can be done by the bidders themselves, by the buyers, or by the auctioneer.
The traditional auction actually comes in several flavors. The usual form in the United States is to offer an item for sale and start with an initial asking price. Then bidders can make larger offers during a certain time period, and when bidding is closed, the bidder with the highest offer gets the item.
This is fairly easy to set up on the Internet, as long as you can find people who are willing to cross over time zones to get a shot at something upon which they wish to bid. You can post the bids and their UTC time stamps in real time online.
The real problem is telling everyone what they are really going to have to pay.
Sometimes the sale is just forbidden. In Singapore, you canıt sell chewing gum because they want to keep their country spotless and they will throw you in jail for possession. In Georgia, you cannot sell vintage wines by mail supposedly because the state legislature wants to prevent teenagers with credit cards from getting drunk on a $100 bottle of California Merlot. I am sure that this is a serious problem in some areas.
If the bidders are using different currencies, then you need to provide currency rate conversions for the other bidders and the auctioneer. If you have friends in Europe, ask them about their Euro conversion problems.
If the bidders are in different locations, then you need to provide the delivery costs and time. Delivery costs are not affected by the bids, but rather by the nature of the goods, the origin, and the destination.
Right now, I am looking at a catalog of Japanese antiques in London that posts its prices in U.S. dollars as well as pounds sterling. But the dual prices are not automatically updated, and shipping has to be negotiated by email or by telephone discussion.
I think that this simple approach defeats the goal of using the Internet and a database to create automated auction commerce. To make things simple, assume that today I have $1,000, I want to buy a Tansu, and that the U.S. dollar is worth two pounds. I place my bid and I have that Tansu I wanted for $1,000, but the U.S. dollar went up to 2.20 pounds while I was transferring funds. Do I get a refund? Well, that depends on when the bidding stopped ı on receipt of payment or the drop of the gavel.
Shipping is not easy to compute. Packing antiques and getting them from place to place is not like mailing a letter. Very often, the bidder has a strong time preference so that the same item delivered late is not of the same value. Birthday and Christmas presents are the most common examples in everyday life.
The Internet also allows for testing novel forms of auctions. Here are a few variations.
In parts of the United States, merchants used to conduct "Dutch auctions" (also known as a reverse auction) when they had something they wanted to sell. This consists of putting the merchandise in the shop window with a price on it, and then reducing the price every day until the item was sold. The strategy for the auctioneer (merchant) was to hope that the item would move at a price at or above the lowest price he was willing to accept. The strategy for the bidders was to hold off buying until the price reached or dropped below the highest price they were willing to pay. The gimmick is that if one buyer has a higher maximum price than you, then she might buy the item before you do, and you have nothing. The merchant can profit by irrational behavior on the part of bidders who are trying to beat out other bidders and will pay a higher price for the item.
Another form of auction was invented in 1961 by economist William Vickrey and attempts to keep everyone happy. The process consists of taking sealed bids on an item. Then only the highest two bids are retained. The highest bidder gets the item, but he pays the second-highest price for it.
This system leaves the highest bidder happy because he got the item for less than he was willing to pay for it. The second through nth highest bidders canıt complain since they would have lost anyway and nobody got the item for less than they were willing to pay for it. Conspiracy to rig the top two bids doesnıt work because you and a co-conspirator can only make the second-highest bid higher and not lower than the unrigged bids.
This is a variation on a problem that was posted on the CompuServe ACCESS forum by Marc Jellinek. You have a log from an auction house that looks like this:
CREATE TABLE Bids
(bidder CHAR(10) NOT NULL,
auct_date DATE NOT NULL,
made_bid CHAR(1) NOT NULL CHECK (made_bid IN ('Y', 'N')));
It gives the name of the bidder, the consecutive dates of a series of auctions, and a note if a bid was made. Here is some sample data:
INSERT INTO Bids VALUES ('Curly', '1998-04-02', 'N');
INSERT INTO Bids VALUES ('Curly', '1998-04-03', 'Y');
INSERT INTO Bids VALUES ('Curly', '1998-04-04', 'N');
INSERT INTO Bids VALUES ('Curly', '1998-04-05', 'Y');
INSERT INTO Bids VALUES ('Curly', '1998-04-06', 'N');
INSERT INTO Bids VALUES ('Curly', '1998-04-07', 'Y');
INSERT INTO Bids VALUES ('Curly', '1998-04-08', 'N');
INSERT INTO Bids VALUES ('Dick', '1998-04-01', 'Y');
INSERT INTO Bids VALUES ('Dick', '1998-04-02', 'Y');
INSERT INTO Bids VALUES ('Dick', '1998-04-03', 'Y');
INSERT INTO Bids VALUES ('Dick', '1998-04-04', 'Y');
INSERT INTO Bids VALUES ('Dick', '1998-04-05', 'N');
INSERT INTO Bids VALUES ('Larry', '1998-04-04', 'N');
INSERT INTO Bids VALUES ('Larry', '1998-04-05', 'N');
INSERT INTO Bids VALUES ('Larry', '1998-04-06', 'N');
INSERT INTO Bids VALUES ('Larry', '1998-04-07', 'Y');
INSERT INTO Bids VALUES ('Moe', '1998-04-04', 'N');
INSERT INTO Bids VALUES ('Moe', '1998-04-05', 'N');
INSERT INTO Bids VALUES ('Moe', '1998-04-06', 'N');
INSERT INTO Bids VALUES ('Moe', '1998-04-07', 'N');
INSERT INTO Bids VALUES ('Tom', '1998-04-01', 'N');
INSERT INTO Bids VALUES ('Tom', '1998-04-02', 'N');
INSERT INTO Bids VALUES ('Tom', '1998-04-03', 'Y');
INSERT INTO Bids VALUES ('Tom', '1998-04-04', 'N');
We want to find out two things. First, which bidders have declined to bid three times or more during this period? Secondly, which bidders have declined to bid three times or more in a row during this period?
Answer:
The first problem asks for information about the group of bids made by each bidder, so we can use a GROUP BY clause to get the grouping, then use a HAVING clause to reduce the answer:
SELECT bidder FROM Bids GROUP BY bidder HAVING SUM(CASE WHEN made_bid = 'N' THEN 1 ELSE 0 END) > = 3;
The second query involves orderings, so it will be a little messier. We should try to write this query so that 3 DAYS is a parameter that can be changed. Here is one answer:
SELECT B1.bidder, B1.auct_date, B2.auct_date
FROM Bids AS B1, Bids AS B2
WHERE B1.bidder = B2.bidder
AND B1.auct_date < = B2.auct_date
AND (B2.auct_date - B1.auct_date + INTERVAL 1 DAY) = 3 DAYS
AND 'N' = ALL (SELECT made_bid
FROM Bids AS B3
WHERE B3.bidder = B1.bidder
AND B3.auct_date
BETWEEN B1.auct_date AND B2.auct_date);
B1 is the copy of the Bids table that serves as the starting point for the query. We then join it to copy B2 to set a second (future) end point in a time line for each bidder. Those endpoints have to enclose 3 days, and all the made_bid flags within those days have to be set to 'N'.
For extra credit, how would you do this query if the days when the bidder was absent were missing and only the days he attended were recorded as 'Y' and 'N' in the Bids table? To make that easier to understand, imagine a new bidder with this data:
INSERT INTO Bids VALUES ('Marc', '1998-04-02', 'N');
INSERT INTO Bids VALUES ('Marc', '1998-04-04', 'N');
Marc skipped an auction on 1998-04-03, but all his bids for the three-day period 1998-04-02 to 1998-04-04 were 'N'; so is he a cheapskate or not? That is a business decision, but one way to look at only the actual days in attendance is:
SELECT DISTINCT B1.bidder
FROM Bids AS B1, Bids AS B2
WHERE B1.biddate < = B2.biddate
AND B1.bidder = B2.bidder
AND B2.biddate - B1.biddate + 1 = 3
AND 3 = (SELECT COUNT(made_bid)
FROM Bids AS B3
WHERE B3.bidder = B1.bidder
AND B3.made_bid = 'N'
AND B3.biddate BETWEEN B1.biddate AND B2.biddate);
You still need a three-day duration, but you also have to have three made_bid flags that are set to 'N' within those days. You can also combine the last two predicates to find bidders who never make an offer.