DBMS, September 1998
DBMS Online: SQL for Smarties By Joe Celko

Yet Another SQL Sequel

What the future holds for Celko, Plato, and Aristotle.


I have been writing columns for the computer trade press for more than a decade, but people remember me best for a column I began doing in Database Programming & Design (Miller Freeman) in January 1990 under the title "Celko on SQL," which ran until June 1992. DBMS Magazine (M&T Publishing) hired me as a full-time employee, so I stopped freelancing for Database Programming & Design. I began my column in DBMS in May 1992 under the title "SQL Explorer," and it was renamed "SQL for Smarties" in April 1996.

While I was working at DBMS, the Berlin Wall had fallen, Germany had reunified, and this all lead to the Deutsche Bank calling in a note they had against M&T's parent company in Germany. So, at the end of this CASCADE ON UPDATE operation, the American operation was sold to Miller Freeman. My old editors and friends took over and immediately fired me, of course, because this was the period of downsizing or upsizing or rightsizing, after all. But I kept my column on a freelance basis and that is what matters.

Well, I am moving it again. By now everyone knows that DBMS and Database Programming & Design will be combined into a new magazine called Intelligent Enterprise. This is the last issue of DBMS, and the last issue of Database Programming & Design will have an October cover date. David Stodder, who used to edit me in Database Programming & Design, will be the editor-in-chief of Intelligent Enterprise.

I will be doing a bimonthly column in the new magazine; I have no idea who will rotate with me. However, Miller Freeman plans to expand the Web site and retain the archives in DBMS and Database Programming & Design. They want a stronger Web presence, and I will also be writing for the Web site; one proposed project is a SQL tutorial in monthly installments.

Databases and Property Rights

Perhaps writing more on the Internet will be good. I won't have to deal with the three-month lead time required for hardcopy media. For example, in May 1998, I needed to pass along a message from the Intellectual Property committee of the Institute of Electrical and Electronics Engineers asking you to contact your U.S. Representative in Congress over a really bad "Database Protection" bill they had before them. By the time you read my column, this could be history and not something you need to react to immediately. Likewise, I cannot easily post readers' solutions to my SQL puzzles in a print magazine.

The name of the controversial bill is the "Collections of Information Antipiracy Act" (H.R. 2652). It was put on suspension, which makes it impossible for members of Congress to discuss, debate, or amend from the House floor. The suspension rule is usually used when issues are noncontroversial - or when one political faction is trying to ram something through Congress.

H.R. 2652 was introduced as a new form of protection for databases not covered under copyright law. Under U.S. copyright law, a simple compilation of facts is not protected and never has been. Engineers and scientists who can now freely use data generated or compiled by other people will be violating freshly created proprietary rights on those compilations of technological and scientific information presently in the public domain.

The bill provides a 15-year term of protection, but once the database is enhanced or changed in any way, the protection would be perpetually renewed. Basically, you can kiss public domain access goodbye.

You can call the main Capitol switchboard at 202-225-3121 or go to www.house.gov/writerep/ to contact your representative.

Microsoft and Innovation

The story line of the propaganda out of Redmond lately has been that the evil Justice Department is preventing Microsoft from innovation. As much as I hate Janet Reno and the Clinton administration in general, I cannot call Microsoft an innovative company.

Can you name a single category of product it created? DOS was originally QDOS, Xerox invented the idea of Windows, which Apple perfected, and you can go right on down the line. Microsoft buys or copies (poorly) technology from smaller companies. The smaller company is always killed after the deal. Microsoft's only innovation is in contracts that lock people into its products or let Microsoft use its technology. Bill Gates is from a family of high-powered corporate lawyers, not engineers.

In May 1998, Cognos acquired the marketing and distribution rights for Aristotle from Panorama Software. Panorama Software Systems is the largest developer and marketer of business intelligence products and services in the Israeli market, but it is also the firm that sold OLAP technology to Microsoft.

Cognos's Aristotle is a front end to Microsoft's OLAP Plato Server. Cognos has a good set of tools - PowerPlay, Impromptu, Scenario, and 4Thought - and will be focusing on the emerging market for mainstream business intelligence tools. Cognos is an active member of Microsoft's Data Warehousing alliance, and Cognos and Microsoft are presenting DataSmart Symposiums worldwide.

Aristotle is a Plato-specific client. If Plato starts driving a mass market for data marts, and therefore OLAP tools, how long will it be before Microsoft eats Cognos or tries to write its own tools and deal them out? Has Cognos locked down the Plato technology well enough to survive?

Aristotle will be available for download from the Cognos Web site (www.cognos.com) during Plato beta 3. General availability will be timed to coincide with Microsoft's Plato Server. Pricing will be announced at the time of shipment.

Data Mining and Linearity

While we are on the subject of decision support, it is worth passing along the key points in a short article on data mining by Dr. Ben A. Hitt of the Gordian Institute. Market analysts wanted to have as much data about their prospective customers as possible. The idea was that you would cook it with statistical tools, and the correct marketing approach would fall out of the data and into our laps.

Well, we now have the massive pile of data, but that mass is overwhelming. The problem is that the statistical tools present you with both meaningful and meaningless facts in piles as massive as the original data. Did you know that bald men do not buy as much shampoo as their hairy friends? Sure you did, but now you can spend a lot of computer time and money proving it to three decimal places.

In short, brute force is not going to work; you are looking for nuggets of information, not extra cards for a game of Trivial Pursuit.

One basic problem is that human behavior is nonlinear by nature, so linear techniques will miss information contained in the database. For example, very different lifestyles can have the same external behaviors. A double income family without children may eat at a pizza parlor as frequently as a single mother with several children.

There are statistical methods that account for and even take advantage of nonlinearity, but they are highly sophisticated and work well, at least at this juncture, only in the hands of professional statisticians. A successful approach to modeling nonlinear relationships has been neural networks, which attempt to mimic learning, memory, and pattern recognition in the human brain. One particular neural network type, the back-propagation algorithm, has performed very well in this regard, and it is now accepted as a reliable method for data mining.

However, it has its shortcomings. The relationships between specific variables and the neural network results are difficult, at best, to explain. It would be nice if there was some cause-and-effect logic with the results.

How can you tell if a database is worth mining in the first place? You want to find previously undisclosed relationships, often referred to as knowledge discovery, or a better term, data prospecting. The unsupervised algorithms process the data and classify records in the database according to similarities in the patterns - a form of cluster analysis - but the newer techniques are also adaptive or vigilant because they can recognize novel patterns as they appear in the data.

A major advantage to all of these methods is that the weights are easily translated into real-world values. If relationships exist between variables, they are easily expressed in understandable terms. The algorithms include Fuzzy Adaptive Resonance Theory, Lead Clustering or Feature Mapping, and American Heuristic's Adaptive Fuzzy Feature Map. No, I am not going to give you a mathematical description of them, but you can use that list of names to start doing your own research.

JSQL

Caribou Lake Software (www.CaribouLake.com) announced new additions to its JSQL family, namely, Jpass and Jtunnel versions 3.0, and the new Jsockets product.

Jpass and Jtunnel can each access multiple combinations of JDBC data sources. These products let companies extend JDBC applications to a multitier architecture. Jpass makes pass-through JDBC calls from client applications to any database server via a Web server, so you are not stuck with only the database server from which you got the Java applet. Jtunnel can actually deliver JDBC through DNS-disabling firewalls and reverse proxies, allowing application deployment to the Internet.

Jsockets delivers Java applets and business objects from the application server to the Internet. Companies can deploy Java components to the client from server-side Java applications. Unlike other HTTP-tunneling socket solutions, Jsockets supports full-duplex communication, even letting a client listen for connections from other Jsockets clients. This lets Java developers ignore problems with firewalls, proxies, and reverse proxies without sacrificing the flexibility of socket-style communications. Jsockets can create a virtual private network between any nodes on the Internet, regardless of configuration.

Searching

Hylas International (www.hylas.com) has established a specialized site and search engine devoted to digital imaging. Hylas International previously created the Photo-on-PC (www.photo-on-pc.com) Web site for digital imaging.

This is a good idea and it needs to be generalized. The Dewey Decimal Catalog system attempted to catalog all of humanity's knowledge under one scheme for library work. But if you look at any specialized field of knowledge, it has its own terminology and taxonomy. Textbases had recognized this problem and tried to develop semantic models with only limited success.

One of the most common searches done in the real world is the "I don't know what you call it, but I know what it looks like!" search. Ask anyone in retail sales. What we really need is a search engine that would catalog pictures on Web sites and allow true image searches. This is not cheap or easy, but for certain sites with a high graphic content, it could be worth the effort.

Puzzle

You have a table that contains rows that represent before and after pairs of some data object defined by a three-part compound key. The first two are city and state, the third column is the timestamp for the change, and the fourth key column is an authorization code, which is either an 'A' or a 'B':

CREATE TABLE ChangeLog
(state INTEGER NOT NULL,
city INTEGER NOT NULL,
change_time TIMESTAMP NOT NULL,
authorize CHAR (1) NOT NULL CHECK (authorize IN 
('A','B')), change_date TIMESTAMP NOT NULL,
stuff INTEGER NOT NULL,
...,
PRIMARY KEY (state, city, change_time, authorize));
I want to find the rows within a certain date range that had both authorizations.

Answer

This view will give you all the rows that have both an 'A' and a 'B' image:

CREATE VIEW ChangePairs (state, city, start_date, 
end_date) 
SELECT state, city, MIN(change_time), MAX(change_time) 
FROM ChangeLog
GROUP BY state, city
HAVING COUNT(*) = 2 - assumes exactly one A and 
one B row 
AND MIN(change_time) <= MAX(change_time);
Then use it in a join query:
SELECT L1.*
FROM ChangeLog AS L1, ChangePairs AS P1
WHERE L1.state = P1.state
AND L1.city = P1.city
AND change_time BETWEEN :my_range_start AND 
:my_range_end;
It is possible to write this in one monster query, but that would be a real mess to read. You might also want to put an index on (state, city, change_time) to make this join go faster.


Joe Celko is an Atlanta-based independent consultant and a member of the ANSI X3H2 Database Standards Committee. He is also the author of two books on SQL: SQL For Smarties (Morgan-Kaufmann, 1995) and Instant SQL Programming (Wrox Press, 1995). You can contact Joe via email at 71062.1056@compuserve.com.
What did you think of this article? Send a letter to the editor.


Subscribe to DBMS -- It's free for qualified readers in the United States
September 1998 Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS (http://www.dbmsmag.com)
Copyright © 1998 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated August 7, 1998