I was on the road for the first three weeks of May this year, speaking at trade shows. Just before I left on my "world tour," my basement flooded and the floors collapsed, my wife's Pomeranian was attacked by our German Shepherd, and we moved another step forward in the process of adopting a kid. Anders Altberg described my life at that point by saying "That's one unknown, one delete, one update, and one insert in physical order."
I left to attend DB Forum in Sao Paulo, Brazil on May 4. The flight was on Varig Airlines, which still actually serves good food! This is a marketing concept that American flights could use to great advantage. The reason for the trip was that MantelMedia, the company presenting the DB Forum, now has the translation rights to DBMS Magazine in Brazil and wanted some of the columnists from their new acquisition to speak. (To see the Brazilian version of DBMS, go to www.mantel.com.br/revistas.)
I do not speak Portuguese -- in fact, after more than 10 years in Los Angeles, I never even picked up Spanish. MantelMedia had a driver for me at the airport, complete with a cardboard sign that had my name on it -- "MISTR SELK." Unfortunately, I had my Portuguese phrase book with me. I promptly walked to the gentleman and said, "Preciso duma lampada," which means "I wish to buy a lightbulb", and pointed at his sign. This immediately identified me as a tourist and got me to the hotel.
Sao Paulo itself was a crazy quilt of small, narrow unplanned lots with modern poured concrete buildings next to old red clay tile-roofed Spanish-style apartment units. The sidewalk was lower than the doorways to help keep water from the streets out. Each section of sidewalk was also different and matched the building it fronted -- maybe the city does not maintain, or even own, the sidewalks.
I did my scheduled sessions on May 6. But another speaker did not show up, so I did my "Year 2000" speech in his place. I had to marvel at the skill of the translation staff -- technical stuff is bad enough, but to do it without any preparation in realtime is very arduous. Simultaneous translation via radio headsets is interesting. It looked like a class where everyone had a Sony Walkman on, and when you told a joke, the laugh would come a few seconds later. I had never thought about it, but the Romance languages require about a third more time because of the case endings.
I also got to meet Rick van der Lans, a Dutch computer geek whose books and writings I have admired for years. He was an enjoyable human being and also the only person whose Portuguese was worse than mine -- Portuguese with a Dutch accent sounds really strange. The poor man could not even order a lightbulb! How did he ever get a taxi?
Rick is actually the guy who invented the nested sets model for representing trees in SQL that I discussed in my March through June 1996 DBMS columns. He beat me to publication by several months.
Because I stayed to do the extra session, I took a slightly later flight to Dallas-Fort Worth. The connecting flight required that we get off the plane, go through immigration, then go through customs, where we were asked questions like "Do you have a parrot in your suitcase?" or "Are you carrying drugs?" (I wonder how many smugglers they catch that way?)
Then we went over to another gate, got back on the same model plane, and took the same seats. Then the plane had to be pulled for maintenance. We had to deplane and walk to another gate that was just far enough away so you could not take the airport shuttle train and had to go on foot. The moving sidewalks were also broken. The second plane was a smaller model, so we had to stand in line and get new seats. My new seat had a rude, older Frenchman in it who wanted an aisle seat because he had an aisle seat on the other plane. I let it go, because I did not have a French phrase book with the appropriate language for the occasion -- at this point I was starting to miss my sleep and had no desire to order a lightbulb.
Fifteen hours later (on May 8), I got to the San Francisco PowerBuilder User Group -- just in time to change out of my dirty clothes, shower, and get to my presentation site 15 minutes before I was to speak. I am good at this! I didn't miss a beat! I made them get me a dark beer and hot rolls so I could eat something while I did my first speech.
The presentations went very well. Americans were asking questions and getting into the programming techniques that made up my lectures. The Brazilians I had encountered were very polite and I enjoyed them, but they did not ask questions or have a dialog with me as the Americans did. This might be a cultural difference about asking questions in public.
I spent the weekend in San Francisco because I was speaking at DB/Expo on May 14, and I like the show. It is where I could see a difference between the United States and the rest of the world.
Everyone agrees that the enterprise database will be hooked into the Internet. Although we in the U.S. are obsessed with Java as the new silver bullet, Europe and South America are not. Oh, they think Java will be important, but they want to see development tools that produce Java rather than learn to write native Java code.
I then went south to Los Angeles for the Computer Law Institute on May 15 and 16. Some interesting things here dealt with the position of databases as intellectual property. Before this time, the case of Feist vs. Rural Telephone Directory Inc. (which I have covered in prior columns) held that a mere compilation of fact without a creative touch cannot obtain copyright protection, and therefore a telephone directory is up for grabs.
But now, trade secret protection seems to be the way to protect a compilation. Some recent decisions are based on people who bought a database on CD and proceeded to resell the data to their customers via a Web site. The rulings state that the material had conditions of sale that restrict to whom, how, and how much of the data can be distributed to other people. Certain classes of "click to agree" contracts are also gaining more strength.
I then got back home in Atlanta on May 19 at 1:30 p.m., in time to wash up and rush to an awards presentation at Chamblee High, where Amanda, our potential adoptee, was receiving a pin for a 3.5+ grade point average. Our basement was still a wreck, but the Pomeranian is healing nicely. And of course there was email from my DBMS editor, Betsy, asking me where my August column was.
Although this statement is true, SQL is pretty weak on the real statistical work you do in a data warehouse, because it is limited to simple descriptive summary functions used with the GROUP BY clause.
The important part of having a data warehouse is to find correlations among the data elements, so that you can make predictions from the data. You really want to do cluster analysis, find correlations, and look for associations -- in short, get some prediction information out of the data.
Correlations are probably the "easiest ones of the hard stuff" to understand, so of course they are the most often misrepresented and misunderstood. First of all, correlation is not cause and effect. A necessary cause is one that must be present for an effect to happen -- a car must have gas to run. A sufficient cause will bring about the effect by itself -- dropping a hammer on your foot will make you scream in pain, but so will having your hard drive crash. A contributory cause is one that helps the effect along but would not be necessary or sufficient by itself to create the effect. Loading Windows 95 on your machine will not cause a crash per se, but it will contribute to it.
There are also coincidences, where one thing happens at the same time as another, but without a causal relationship at all.
A correlation between two measurements, say X and Y, is basically a formula that lets you predict one measurement given the other, plus or minus some error range. For example, if I shot a cannon locked at a certain angle and with a certain amount of gunpowder, I could expect to place the cannonball within a relatively small radius of the target most of the time. Once in a while, the cannonball will be dead on target, and other times it could land several meters away, thanks to variation in the shape and weight of the cannonball, the quality of the powder, the wind, and a million other unmeasurable things. But my formula is good enough to wage a war.
The strength of the prediction is called the coefficient of correlation and is denoted by the variable r where (-1 <= r <= 1) in statistics. A coefficient of correlation of negative one is absolute negative correlation -- when X happens, then Y never happens. Positive one is absolute negative correlation -- when X happens, then Y always happens. A zero coefficient of correlation means that X and Y happen independently of each other. A correlation of 0.7 or better is gratifying in the real world.
The confidence level is related to the coefficient of correlation, but it is expressed as a percentage. It says that x percent of the time, the relationship you have would not happen by chance.
The March 1994 issue of Discover magazine included commentary columns entitled "Counting on Dyscalculia" by John Allen Paulos, the man who coined the word "innumeracy" in his 1990 bestseller of the same title. His particular topic was health statistics, because those create a lot of "pop dread" when they get played in the media.
One of his examples in the article was a widely covered lawsuit by a man who alleged a causal connection between his wife's frequent use of a cellular phone and her subsequent brain cancer. But brain cancer is a rare disease that strikes approximately seven out of 100,000 people per year. Given the large population of the United States, this is still about 17,500 new cases per year, and that number has held pretty steady for years.
There are an estimated 10 million cellular phone users in the United States. If there were a causal relationship, then there would be an increase in brain cancer cases as cellular phone usage increased. In fact, if we found that there were fewer than 7,000 cases among cellular phone users, we could use the same "argument" to "prove" that cellular phones prevent brain cancer.
There are five ways two variables can be related to each other. The truth could be that X causes Y. You can estimate the chirp rate of a cricket from the temperature in Fahrenheit using the formula (chirps = 3.777 × degrees - 137.22), with r = 0.9919 accuracy. However, nobody believes that crickets cause temperature changes.
The second case is that Y causes X. Remember that a correlation does not point in one direction only. You can do a little algebra and write the same formula in terms of X or Y.
The third case is that X and Y interact with each other. Supply and demand curves in classic economics are an example where as one goes up, the other goes down (negative feedback in computer terms). A more horrible example is drug addiction, where the user requires larger and larger doses to get the desired effect (positive feedback in computer terms), as opposed to a habituation, where the usage hits an upper level and stays there.
The fourth case is that any relationship is pure chance. Any two trends in the same direction will have some correlation, so it should not surprise you that once in a while, two will match very closely.
The final case is where the two variables are effects of other variable(s), which are outside the study. The most common unseen variables are changes in a common environment. For example, severe hay fever attacks go up when corn prices go down. They share a common element -- good weather. Good weather means a bigger corn crop and hence lower prices, but it also means more ragweed and pollen and hence more hay fever attacks.
CREATE TABLE MyTable (class CHAR(1) NOT NULL CHECK(class BETWEEN 'A' AND 'Z'), state_list VARCHAR(50) NOT NULL);The state-list is a comma-separated string of two-letter state codes and data that looks like this:
| MyTable | |
| class | state_list |
| ==================== | |
| 'A' | 'NY,NJ,WA' |
| 'A' | 'OK,GA' |
| 'B' | 'AL' |
| 'C' | 'NY' |
| 'D' | MO,AZ' |
What we want to get out is a new table that is more relational. The state_list column violates the rule that a column is atomic. The new table will have a new identifier column and have only one state per row.
CREATE TABLE MyTable2 (id INTEGER NOT NULL, class CHAR(1) NOT NULL CHECK(class BETWEEN 'A' AND 'Z'), state CHAR(2) NOT NULL);Here is one way to do this:
INSERT INTO MyTable2
SELECT 1, class,
CASE WHEN POSITION(',' IN state_list) = 0
THEN state
ELSE SUBSTRING(state_list FROM 1 FOR 2)
AS state
UNION
SELECT 1, class,
CASE WHEN POSITION(',' IN state_list) = 0
THEN state
ELSE SUBSTRING(state_list FROM 4 FOR 2)
AS state
UNION
SELECT 1, class,
CASE WHEN POSITION(',' IN state_list) = 0
THEN state
ELSE SUBSTRING(state_list FROM 7 FOR 2)
AS state;
Now update the id column to make each id unique:
UPDATE MyTable2 SET id = (SELECT SUM(T1.id) FROM MyTable AS T1 WHERE (T1.class || T1.state) <= (MyTable2.class || MyTable2.state));Finally, give the table a primary key:
ALTER TABLE MyTable2 ADD CONSTRAINT PRIMARY KEY (id);
As you can see, this is pretty elaborate. Can you do better? (See Answer.)
Anders Altberg came up with the idea of creating a table of the two-letter state codes and using it in this query:
POSITION (S1.state IN T1.state_list) > 0
and it will probably run faster. The unique identifier could then be constructed and the primary key constraint added in the same manner as before.
Another advantage, or perhaps disadvantage, of this
approach is that invalid state codes are not represented in MyTable2. I will offer a copy of my new book, SQL Puzzles & Answers, as a prize for the best solution for finding all the rows in MyTable with invalid state codes in them.
ANSWER
Not only is that suggested answer not very pretty, it will fail if there are more than three state codes in the state_list in the string. Yes, you could look at the maximum length of the state_list string and extend the solution, but it is still a mess.
SELECT 1, T1.class, S1.state
FROM MyTable AS T1, StateCodes AS S1
WHERE T1.state_list LIKE '%' || S1.state || '%';
If your SQL has a POSITION( ) function, then you can
replace the like predicate with
What did you think of this article? Send a letter to the editor.