DBMS, March 1997
DBMS Online: SQL for Smarties By Joe Celko

Random Data and Observations

Randomizing data is much like shuffling a deck of cards.

I recently bought an old house that needs some work (read: money pit), and so most of my life and worldly processions are still in cardboard boxes as I write this month's column. One friend who "helped" us move labeled all of her boxes "Living Room" because that is where she packed them. Although I might be short a pile of office supplies, half of my wedding photographs, and all of my tax records, at least she gave me a topic for the first section of this month's column: random functions and physical storage.

Shuffling Cards

Mark Betz (76605.2346@compuserve.com) wrote a computer backgammon game, MVP Backgammon, that has been out for about a year. Since the first beta two years ago, he has been dealing with complaints that the dice are biased in favor of the computer. He has tested his random number generator and posted his results, but people still do not believe him. One person wrote that the computer often got more of the "good" rolls during the game and therefore the dice could not be random. Betz had to point out that the computer was probably a better player and was setting itself up so that more of the rolls were "good" for the machine. The complainant never wrote back.

Sometimes you want the data in your database spread randomly over physical storage because you will be searching it randomly. One player (database user) will be favored over another simply by luck, and any estimate of performance must be worded with a margin of error. If all of the items being sought are on the same page of physical disk storage, the database will lock that page and make everyone wait his or her turn, and performance will suffer. This usually occurs in a situation where the queries fetch a single row. But it also happens when you load a sorted sequential file into a database.

Other times you want the data in your database clustered in some sorted order in physical storage because you will be searching it in that order. In this case, if all of the related items being sought are on the same page of storage, the database will lock only that page and overall performance will improve. This usually occurs in a situation where the queries aggregate multiple rows or are looking for a range or something similar.

What is a good way to randomize the physical table storage if we are loading data from a sorted sequential file? If I gave you a standard deck of 52 playing cards and the same question, you would shuffle them. A "perfect shuffle" is a magician's term meaning that you cut the deck into two equal packs and then interlace a card from one pack with one from the other. If you start with the right-hand pack, it is called an "inshuffle"; if you start with the left-hand pack, it is called an "outshuffle."

The general formula for computing (n) the number of outshuffles required to restore a deck of (2*d) cards is:

(2^n) mod (2*d-1) = 1

It takes only eight outshuffles, but 52 inshuffles, to restore a standard 52-card deck to its starting position. Also, if d is not a multiple of two, then the formula fails and the only known procedure is to work out each case.

Persi Diaconis at Harvard University demonstrated that the usual riffle shuffle -- split the deck into two packs and shuffle, but without perfect interlacing -- is very good at producing randomness in the deck. It takes seven riffle shuffles to randomize a deck of playing cards. Diaconis and other researchers also found out that certain-size decks would have sudden changes in their randomness; after six riffle shuffles, a deck is still visibly ordered, but this order disappears with the seventh shuffle.

Implications for computer people: Parallel computers whose processors perform the same task simultaneously and then mix their outputs are an example of what are called "shuffle networks." Algorithms that distribute data over storage pages are in effect shuffling the rows of a table and can have similar "catastrophe points"; this accounts for sudden changes in performance in multiuser systems.

Maintaining VIEWs

Pavel Benes (benes@lanprojekt.cz), a senior programmer with LAN-Projekt Plzen in the Czech Republic, had to create and modify rather complicated views based on scalar subqueries in WATCOM SQL (now Sybase SQL Anywhere). The tools in Sybase SQL Central or in PowerBuilder's view painter are not appropriate for views with many computed columns. These tools can only modify the CREATE VIEW statements with a text editor. Benes' tool analyzes the text of the CREATE VIEW statement from the system tables and displays the result in a data window for modifications. He placed this software and a brief description on the Web at www.lanprojekt.cz/viewpainter. This version runs on Windows 95 but can be ported to other platforms.

Computer Talks Under Torture

Science Magazine (November 1, 1996, page 716; www.sciencemag.org) reported a new strategy for breaking into virtually any public-key encryption system. Scientists at Princeton University and Bellcore have been working together to discover how they can force a computer or encoding chip to err in its calculations while encrypting a message and, at the same time, to leak information about the message being encrypted. The technique is called "differential fault analysis"; one method of implementing it is to irradiate the chip and then compare a number of error-ridden encryptions with a single flawless one. This technique was carried a step further by researchers at the Weizmann Institute in Israel, allowing them to decipher the secret key from a 56-bit Data Encryption Standard algorithm with little trouble.

The SQL/CLI

On October 28, 1996, ANSI (American National Standards Institute) approved ANSI/ISO/IEC 9075-3:1995 Information Technology - Database Languages - SQL - Part 3: Call-Level Interface (SQL/CLI) as a new standard.

This standard looks a lot like ODBC, but it is more generic. It is supposed to be the basis for version 3.0 of ODBC from Microsoft. This subject is worth an article -- or a book -- in itself, but for now, the SQL/CLI will add SQL-92 support and depend more on handles as parameters to the library calls. (See "Understanding ODBC 3.0 Standards and OLE DB" by Ken North, DBMS, April 1996, page S15.)

The End of Public Domain Data?

On November 10, 1996, Info-Policy-Notes (an email newsletter available from listproc@tap.org, a service provided by The Taxpayer Assets Project [TAP], which was founded by Ralph Nader in 1988 to monitor the management and sale of government property) reported on a proposal being pushed by Bruce Lehman, the head of the Patent and Trademark Office (PTO) under the Clinton administration to regulate factual databases. U.S. Government officials are pressing for the adoption of an international treaty that will, if enacted, significantly change the ways sports statistics are controlled and disseminated. The treaty isn't specifically directed at sports statistics -- it is a much broader attempt to create a new property right in facts and other data now in the public domain -- but it will have an enormous impact on the legal rights exercised by the National Football League (NFL), Major League Baseball (MLB), the National Basketball Association (NBA), the National Hockey League (NHL), and virtually all other professional or amateur athletic leagues.

The same treaty will also radically affect the way that stock prices, weather data, train schedules, and other facts are controlled. It just happens to occur at the same time that the NBA and other sports franchises are stepping up their efforts to control the realtime dissemination of sports statistics through the Internet or with wireless paging devices. (See, for example, "NBA Case No Slam Dunk," by Janet Kornblum, September 7, 1996, at www.news.com/News/Item/0,4,3208,00.html.)

If the treaty is approved and implemented, sports leagues will have far broader powers to dictate the terms and conditions under which sport statistics are reported and disseminated. According to the proposed treaty and legislation introduced in the 104th Congress to implement the treaty (HR 3531, a version of the database protection proposal), the NFL, NBA, NHL, and MLB will have the right to prevent anyone from publishing these and other statistics without express permission from the sports league. This will include the right to control access to the historical archives of sports statistics -- even the right to dictate who can publish, say, the box scores from a game.

The proposals for a new legal environment for publishing facts are outlined in a draft treaty on "databases" that was considered at a December 1996 meeting of the World Intellectual Property Organization (WIPO), in Geneva, Switzerland. The proposal would require the United States and other countries to create a new property right for public domain materials. "Texts, sounds, images, numbers, facts, or data representing any other matter or substance" will be protected.

This treaty represents "the end of the public domain," according to American University Law Professor Peter Jaszi. Copies of the proposed treaty, a federal register notice asking for public comment, and independent commentary can be found at the Union for the Public Domain Web site at www.public-domain.org/database/database.html.

Regular readers of this column will remember that in 1991, the U.S. Supreme Court ruled in what is now called "the Feist decision" that the facts from a telephone directory of names, addresses, and phone numbers were not protected under the copyright laws, and that in general "facts" could not be copyrighted by anyone. The Feist decision alarmed several large database vendors, who crafted this new sui generis property right that would protect facts (and just about everything else).

The database vendors have already succeeded in obtaining a directive on a database proposal from the European Union, although no European country has yet passed legislation to implement the treaty. The most active supporter of this new property right is West Publishing Co., the Canadian legal publisher. A West Publishing employee chairs a key American Bar Association subcommittee that wrote a favorable report on the treaty. A number of very large British and Dutch database vendors are also lobbying hard for the treaty.

West wants the new property right to protect the "page numbers" and "corrections" it adds to the judicial opinions it publishes in paper-bound books. Telephone companies want to protect the names, addresses, and telephone numbers they publish, and other database vendors want to protect scientific data or other non-copyrighted government information they publish. In seeking to protect these items, the treaty was written to stamp "owned by" labels on a vast sea of information now in the public domain. Copyright experts J.H. Reichman and Pamela Samuelson say the treaty advances the "least balanced and most potentially anti-competitive intellectual property rights ever created." (See the full document on the Web at ksgwww.harvard.edu/iip/reisamda.html.)

Televisions are Munitions

According to the New York Times ("U.S. Classifies Web TV Technology as a Weapon" by John Markoff, November 8, 1996; www.nytimes.com), Web TV, a $300 television-set-top device sold by Sony and Philips Electronics for browsing the World Wide Web, uses a 128-bit security that makes it a weapon requiring a special export license before it can be sold overseas. Few industry experts expect such a license to be granted, so the companies are unlikely to begin selling current versions of the American-made devices next year in Europe and Japan, as they had planned. And so we lose a little more in our balance of payments in spite of superior technology.

The administration has proposed a deal in which companies would be freed from export restrictions if they agreed to build such a "key recovery" mechanism into their products. If you were a foreign citizen, would you like the U.S. Government reading your email? Yet at the same time that television sets are getting hit under the munitions act, Tandem Computers Inc. (www.tandem.com) announced its first Internet Transaction Processing (ITP) products to create a secure, reliable environment for commerce on public networks such as the Internet.

Tandem's parallel hardware systems have already given the company a good presence in the OLTP (online transaction processing) market. Tandem hardware runs about 80 percent of the world's automatic teller machines, 66 percent of credit card transactions, and 90 percent of securities transactions. This recent move endorses Internet commerce.

Your Government at Work?

On another front, NIST has been closing down its FIPS conformance-testing lab. Unfortunately, there is supporting law that requires NIST to represent and perform data management standards activities. The four Public Laws that apply are OMB Circular 119 (now codified), Paperwork Reduction Act of 1995, PL 104-106, and PL 104-113.
  1. OMB Circular A-119 (October 20, 1993) states that:
      Standards are to be used as basis of procurement where ever possible.
    1. Participation in domestic and international standards organizations is encouraged.
    2. Agencies can provide financial support, secretarial functions, and cooperative conformance testing.
    3. Department of Commerce is to establish an interagency consultative mechanism for policy implementation.
    4. Heads of agencies must designate staff, establish procedures, and coordinate standards participation.
  2. Paperwork Reduction Act of 1996 (Dec 1, 1996), 104th Congress, 1st Session, rewrite of Chapter 35 of Title 44 the United States Code states that:
    1. Information technology is employed to reduce the paperwork burden on the public.
    2. An OMB director is established to develop, coordinate, and oversee implementation of Federal Information Resource Management policies, procedures, standards, and guidelines.
    3. Federal agencies shall work with NIST in the development of the IRM policies, procedures, standards, and guidelines for IT functions and activities of the Federal Government.
    4. Federal agencies shall work with GSA, NIST, and others in the development of a Government-wide strategic plan for IRM that includes objectives, plans, public access enhancement, strategies, and ongoing progress assessments.
    5. Each agency must carry out IRM to improve productivity, efficiency, and effectiveness; reduce burden on the public; and improve integrity, quality, and utility of information.
    6. NIST and others must review agency IRM activities to ascertain efficiency and effectiveness.
  3. Extracts from Report 104-390, 104th Congress, 1st Session, National Technology Transfer and Advancement Act of 1995 (PL 104-113, HR 2196, S1164), House of Representatives. Section 12, Standards Conformity, subsection 13 specifically requires NIST to coordinate Federal, State, local, and private-sector standards activities; it additionally states NIST's statutory mandate to act as the lead agency to ensure Federal use of standards developed by private consensus standards organizations. Finally, the law codifies OMB Circular 119.
  4. Extracts from Report 104-450, 104th Congress, 2nd Session, National Defense Authorization Act for Fiscal Year 1996 (PL 104-106, S1124), Subsection D, Section 5131 authorizes NIST through the Department of Commerce to promulgate standards and guidelines pertaining to Federal computer systems. Section 5132 states the Sense of the Congress requirement for all agencies to achieve at least a five percent decrease in IT cost and a five percent increase in efficiency by reason of improvements in IRM by the agency.

You might want to email your representatives and senators to see what they are doing about NIST operating in contempt of Congress and Public Law.

Puzzle

Magnus Bergh (70630.335@compuserve.com) posted the following scheduling problem on the MSACCESS forum on CompuServe in 1996: Mr. A wishes to arrange meetings with several other persons. Time is allocated in 16 intervals of 30 minutes during the meeting day, and Mr. A has a table of everyone's busy and free time slots. Assume Mr. A's calendar is completely open.

There are two problems:

  1. Find a common time slot that everyone has open for a group meeting.
  2. Find a time slot that each person has open so you can schedule one-on-one meetings in the same day.
Assume that your tables look like this:

CREATE TABLE Appointments
(person CHAR(10) NOT NULL,
period INTEGER NOT NULL CHECK(period BETWEEN 1 AND 16),
status CHAR(4) NOT NULL CHECK(status IN ('Open','Busy')),
PRIMARY KEY (person, period));

INSERT INTO Appointments VALUES ('Mr. B', 8, 'Busy');
INSERT INTO Appointments VALUES ('Mr. B', 9, 'Open');
INSERT INTO Appointments VALUES (ýMr. B', 10, 'Open');
INSERT INTO Appointments VALUES (ýMr. B', 11, 'Busy');
INSERT INTO Appointments VALUES ('Mr. B', 12, 'Open');
INSERT INTO Appointments VALUES ('Mr. B', 13, 'Busy');
INSERT INTO Appointments VALUES ('Mr. B', 14, 'Busy');
INSERT INTO Appointments VALUES ('Mr. B', 15, 'Busy');
INSERT INTO Appointments VALUES ('Mr. B', 16, 'Busy');
INSERT INTO Appointments VALUES ('Mr. C', 8, 'Open');
INSERT INTO Appointments VALUES ('Mr. C', 9, 'Open');
INSERT INTO Appointments VALUES ('Mr. C', 10, 'Open');
INSERT INTO Appointments VALUES ('Mr. C', 11, 'Busy');
INSERT INTO Appointments VALUES ('Mr. C', 12, 'Busy');
INSERT INTO Appointments VALUES ('Mr. C', 13, 'Busy');
INSERT INTO Appointments VALUES ('Mr. C', 14, 'Busy');
INSERT INTO Appointments VALUES ('Mr. C', 15, 'Busy');
INSERT INTO Appointments VALUES ('Mr. C', 16, 'Busy');
INSERT INTO Appointments VALUES ('Mr. D', 8, 'Open');
INSERT INTO Appointments VALUES ('Mr. D', 9, 'Busy');
INSERT INTO Appointments VALUES ('Mr. D', 10, 'Open');
INSERT INTO Appointments VALUES ('Mr. D', 11, 'Busy');
INSERT INTO Appointments VALUES ('Mr. D', 12, 'Busy');
INSERT INTO Appointments VALUES ('Mr. D', 13, 'Busy');
INSERT INTO Appointments VALUES ('Mr. D', 14, 'Busy');
INSERT INTO Appointments VALUES ('Mr. D', 15, 'Busy');
INSERT INTO Appointments VALUES ('Mr. D', 16, 'Busy');

and you need to build a table of your people that you want to meet with:

CREATE TABLE MyPeople
(person CHAR(10) NOT NULL PRIMARY KEY);

INSERT INTO MyPeople VALUES ('Mr. B');
INSERT INTO MyPeople VALUES ('Mr. C');
INSERT INTO MyPeople VALUES ('Mr. D');

Can you do these two queries?

CREATE VIEW Available (person, period, status)
AS SELECT person, period, status 
FROM Appointments 
WHERE status = 'Open';
Hint: A view of open time slots might help, but it is not needed. You can find this month's puzzle answer in the online version of this column on the DBMS Web site at www. dbmsmag.com.


Joe Celko is an Atlanta-based guru with Northern Lights Software Ltd. and a member of the ANSI X3H2 Database Standards Committee. He is also the author of two books on SQL: SQL For Smarties (Morgan-Kaufmann, 1995) and Instant SQL Programming (Wrox Press, 1995). You can contact Joe via CompuServe at 71062,1056 or via email at 71062.1056@compuserve.com.


PUZZLE ANSWER

To get everyone in one time slot, find which period(s) everyone is open:
SELECT period
FROM Appointments AS S1
WHERE status = 'Open'
AND person IN (SELECT person FROM MyPeople)
GROUP BY period 
HAVING COUNT(*) = (SELECT COUNT(*) FROM MyPeople);
Results

period
======
10

If this does not work, then you might try again for the next best meeting by looking for (n-1) persons with a common open slot:

SELECT period
FROM Appointments AS S1
WHERE status = 'Open'
AND person IN (SELECT person FROM MyPeople)
GROUP BY period 
HAVING COUNT(*) = (SELECT COUNT(*) FROM MyPeople) - 1;
Results

period
======
8
9

and so forth until you get a hit.

Obviously, the equality comparison can be replaced with ">=" to see if you can get a quorum for a meeting, instead of exactly (n-1) people. Notice that you are not told which (n-1) people can meet, only the period.

One way of doing the query for the second problem is a self-join, which has one copy of the view for each person involved.

SELECT A1.period AS B, A2.period AS C, A3.period AS D 
FROM Available AS A1,
Available AS A2,
Available AS A3 
WHERE (A1.person = 'Mr. B' 
AND A1.period NOT IN (A2.period, A3.period))
AND (A2.person = 'Mr. C' 
AND A2.period NOT IN (A1.period, A3.period))
AND (A3.person = 'Mr. D' 
AND A3.period NOT IN (A1.period, A2.period));
Results

B CD
===========
9 810
9 10 8
10 9 8
12 8 10
12 9 8
12 9 10
12 10 8

This pattern can easily be extended to any number of people. This strategy will return all possible combinations of valid meetings, if any exist. If your people are not very busy, then the number of possible combinations explodes. Using this sample data, compute the number of possible meetings if everyone is open.

This combinatory explosion might be a good thing, or it might not. I will offer a book as a prize for either a better complete solution or a SQL query that returns one and only one possible meeting schedule. No procedural code, or keep it to a minimum, please.

-- Joe Celko


Subscribe to DBMS and Internet Systems -- It's free for qualified readers in the United States
March 1997 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Tuesday, February 11, 1997.