Surveying the current stock of picks and shovels for effective data warehousing.
As terabyte-size databases become prevalent and data warehouses commonplace, it is becoming more difficult to create value from the data -- there's just too much of it. The ability to create value from data is constrained by, among other things, the skill and experience of the analyst to make and verify hypotheses by sifting through large amounts of data. Data mining products attempt to overcome this limitation by using the computer, rather than humans, to discover data relationships that can be used to make predictions.
Although both data mining and decision-support applications can lead to the discovery of new data relationships, data mining tools often find unsuspected relationships in data that traditional techniques will overlook. This is because data mining applications attempt to make correlations and find relationships in data using what is called a "machine-based discovery" process. This process does not require people to create and test hypotheses about data relationships and correlations. For example, data mining can identify micro-market segments by analyzing hundreds of attributes in customer sales records rather than having a marketer rely on traditional age and income ranges. Users can also ask data mining tools to identify the variables that affect sales of a specific product. A marketer might guess that cents-off coupons or weather is the most significant indicator, when it may turn out that an unsuspected placement of complementary items is most important.
Although many tools specialize in only one technique, a growing number of tools use a combination of technologies, matching the technique to the problem at hand. Many algorithms such as CART, CHAID, and C4.5 are non-proprietary; product differentiation is in the areas of ease of setup, performance (particularly against large databases), presentation, and visualization of results.
Several types of business problems are appropriate for data mining analysis. A typical association problem is market basket analysis. A typical clustering problem is the grouping of customers into market segments. A common classification problem is identification of fraud. Several of these problems are explained in detail in Bruce Moxon's "Defining Data Mining." (See DBMS Data Warehousing Supplement, August 1996, page S11.)
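To make the market basket case concrete, here is a minimal sketch of association discovery over item pairs. It is not any vendor's algorithm: the function name `pair_rules`, the thresholds, and the baskets are all invented for illustration. It counts how often two items appear together (support) and how often the second item appears given the first (confidence).

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.4, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                  # share of baskets holding both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]   # P(rhs in basket | lhs in basket)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    # Strongest rules first; ties broken alphabetically for a stable order.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


# Hypothetical baskets, invented for illustration only.
baskets = [
    ["chips", "salsa", "soda"],
    ["chips", "salsa"],
    ["chips", "soda"],
    ["bread", "milk"],
    ["chips", "salsa", "milk"],
]
rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Commercial tools work on far larger item sets and rule lengths; the sketch only shows what the two measures mean.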
Although most OLAP and statistics vendors do not support the definition of data mining used in this article -- machine-generated rather than human-generated hypotheses -- several are adding data mining technologies into their tools. Pilot Software Inc.'s Discovery Server uses decision-tree-based technology to perform customer profiling. Once customer segments are defined, the segment definitions can be fed into the hierarchy metadata definitions in Pilot's multidimensional database for use by the OLAP Analysis Server. SPSS, the statistical package, has neural network and CHAID data mining components.
DataMind Corp., which developed what it calls "agent network technology" data mining, has a relationship with Arbor Software Corp., developer of the Essbase OLAP product. As each record is read from a database, DataMind's product maps every unique input criterion to one or more output criteria. A discovery model is created that contains agents (which represent input criteria and output conditions), connections between agents, and connection weights. Essbase and DataMind both use Microsoft's Excel as a common user interface. (See DBMS, October 1996, page 36, for a review of DataMind.)
Most data mining tools now use Windows clients and access large relational databases using ODBC. Some data mining tools such as Thinking Machines Corp.'s Darwin have specialized architectures to exploit high-performance, parallel-processing hardware and databases. Rather than simply creating statistical printouts of the results, sophisticated online visualization for end users is now available. Features can include the ability to display n-dimensional graphics, dynamically rotate graphs so they can be viewed from many different perspectives, and show related 3-D graphs on screens. Visible Decisions Inc.'s Discovery tool can use VRML 1.0 to interact with information over the Internet.
Information Discovery Inc.'s IDIS: The Information Discovery Engine is a client/server tool with Windows or Netscape clients that uses proprietary pattern-discovery algorithms to discover rules to classify data. It accesses large relational databases. The Mark Twain Edition component provides English text explanation of the rules it found. The IDIS:Predictive Modeler uses pattern matching to make predictions and forecasts based on previously generated IDIS rules. Information Discovery has retail and financial service industry applications available.
XpertRule Profiler by Attar Software Ltd. uses a rule induction process to create a decision tree that identifies which factors affect the desired outcome. An easy-to-understand Decision Tree View shows the number of database records and frequency of the desired outcome in each decision tree node. WizRule from WizSoft Inc. uses a proprietary mathematical algorithm to discover every rule under investigation in a relatively short time. Angoss Software International Ltd.'s KnowledgeSeeker specializes in market segmentation and target marketing.
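The kind of node summary described above (record count and frequency of the desired outcome per node) can be sketched as follows. This is an illustration only, not the induction method of any product named here: the field names, the sample records, and the use of Gini impurity to rank candidate splits are all assumptions.

```python
from collections import defaultdict


def node_summary(records, field, outcome):
    """For each value of `field`, return (record count, share with outcome True)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[field]].append(bool(rec[outcome]))
    return {value: (len(flags), sum(flags) / len(flags))
            for value, flags in groups.items()}


def weighted_gini(summary):
    """Weighted Gini impurity of the child nodes; lower means a cleaner split."""
    total = sum(count for count, _ in summary.values())
    return sum(count / total * 2 * p * (1 - p) for count, p in summary.values())


def best_split(records, fields, outcome):
    """Pick the field whose values separate the outcome most cleanly."""
    return min(fields,
               key=lambda f: weighted_gini(node_summary(records, f, outcome)))


# Hypothetical customer records, invented for illustration only.
records = [
    {"coupon": "yes", "region": "east", "bought": True},
    {"coupon": "yes", "region": "west", "bought": True},
    {"coupon": "yes", "region": "east", "bought": True},
    {"coupon": "no",  "region": "west", "bought": False},
    {"coupon": "no",  "region": "east", "bought": False},
    {"coupon": "no",  "region": "west", "bought": True},
]
split_field = best_split(records, ["coupon", "region"], "bought")
print(split_field, node_summary(records, split_field, "bought"))
```

A full tree would repeat the same step inside each child node until the nodes are small or pure enough.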
The key benefit of rules-based data mining approaches is that they are relatively easy to understand. However, these tools may not produce the best results on data that does not contain strong patterns or on database fields that are missing values.
NeuralWare's NeuralWorks Predict Professional uses feed-forward neural networks, as well as fuzzy logic, statistics, and genetic algorithms, to create predictive models. A Build Wizard walks users through the development process. Models are created in Microsoft Excel. Runtime portions of the model can be incorporated in C or C++ applications. The Explain component of NeuralWorks Predict attempts to explain the neural net decisions made in creating the predictive model. HNC Software Inc.'s DataBase Mining Workstation and DataBase Mining Marksman specialize in direct marketing data mining, modeling, and profiling of customer and prospect data.
Neural networks are best used to model non-linear data (data that cannot be expressed as a linear mathematical equation), noisy data, or data that is missing some values. One common complaint with neural network tools is that they do not explain the rationale for the clusters that are created. Another, lesser-known restriction is that neural networks only work on numeric data. A state field, for example, must often be translated into 50 fields, each with a value of either one or zero.
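The translation of a state field into one-or-zero fields can be sketched as below. The helper name `one_hot`, the record layout, and the shortened state list are assumptions made for illustration; a real encoding would list all 50 states.

```python
def one_hot(value, categories):
    """Encode `value` as a list of 0/1 flags, one flag per known category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# Only four states are listed to keep the sketch short; a full encoding
# would list all 50, giving 50 numeric fields per record.
STATES = ["CA", "NY", "TX", "WA"]

# Hypothetical customer record, invented for illustration only.
record = {"state": "NY", "income": 52000}

# The single text field becomes len(STATES) numeric fields.
numeric_inputs = one_hot(record["state"], STATES) + [record["income"]]
print(numeric_inputs)
```

Exactly one flag is set per record, so the network never sees the state codes as if they were ordered quantities.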
Thinking Machines Corp.'s Darwin product uses CART, neural network, genetic algorithm, and k-nearest-neighbor techniques to perform classification and prediction. Unlike many other data mining tools, Darwin has a parallel software architecture designed specifically to process large databases on high-performance, parallel-processing platforms such as DEC's AlphaServer, IBM Corp.'s SP2, and Sun Microsystems Inc.'s Ultra Enterprise servers.
A new entry in the data mining tools market is IBM's Intelligent Data Miner. Data Miner, in RS/6000 and SP2 versions, provides rules induction, neural network, and statistical analysis approaches to association discovery, sequential pattern discovery, and prediction. It runs against large DB2 databases and extracts from Oracle Corp.'s and Sybase Inc.'s databases.
NeoVista Solution Inc.'s new Decision Series consists of sophisticated data mining algorithms designed to be highly scalable. The products in the suite can transfer results and discovery factors into relational tables that can then be viewed by common query, reporting, and visualization tools.
Data mining products have different scalability requirements than OLAP or SQL query-based tools in accessing a data warehouse. Data mining applications often need to access all columns in all rows and to perform elaborate non-SQL algorithms upon the retrieved data. To develop the optimal data mining model, you must run the tools several times against the data, changing the model parameters each time. Thus data mining is compute-intensive as well as I/O-intensive.
Sampling -- use of a statistical subset of the entire database that has similar characteristics to the entire database -- is the primary technique used to overcome the lengthy response times that occur when building a model. A number of samples are often needed. One sample is used to "train" the application -- in other words, determine the parameters of the data mining model. Several validation trials against different database samples are often required to validate the model successfully. Samples are most often created either by choosing all data items in every nth record of the database or by choosing only a small number of items in every record. However, sampling intervals must sometimes be chosen with more care; analyzing a whole year's worth of retail sales records could easily mask the results of any short-term promotions. In such cases, only one or two weeks of records should be sampled, perhaps using only a subset of stores.
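The two sampling approaches described above, every nth record and a short date window over a subset of stores, can be sketched as follows. The function names, the record layout, and the sales data are assumptions made for illustration, not taken from any tool.

```python
from datetime import date, timedelta


def every_nth(records, n, offset=0):
    """Systematic sample: every nth record, starting at position `offset`."""
    if n < 1 or not 0 <= offset < n:
        raise ValueError("need n >= 1 and 0 <= offset < n")
    return records[offset::n]


def in_window(records, start, end, stores=None):
    """Keep records dated within [start, end], optionally for a subset of stores."""
    return [r for r in records
            if start <= r["date"] <= end
            and (stores is None or r["store"] in stores)]


# Hypothetical sales records: 60 days for each of two stores.
sales = [{"date": date(1996, 1, 1) + timedelta(days=d), "store": s, "amount": 100 + d}
         for s in ("A", "B") for d in range(60)]

# One sample trains the model; different offsets give disjoint validation samples.
train = every_nth(sales, 10, offset=0)
validation = [every_nth(sales, 10, offset=k) for k in (1, 2, 3)]

# A two-week window for one store, for studying a short-term promotion.
promo = in_window(sales, date(1996, 1, 8), date(1996, 1, 21), stores={"A"})
print(len(train), [len(v) for v in validation], len(promo))
```

Using a different offset for each sample keeps the training and validation records from overlapping. One caveat with every-nth sampling: if the records are stored in a repeating order (for example, stores interleaved day by day), a fixed interval can pick up the same store every time, so the interval should be checked against the file's ordering.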
The first step is to realize that many data mining solutions can be used to perform categorization, association, and prediction. But there are significant differences in the results depending on the algorithms used, specific data values, and visualization techniques.
When choosing which tools are appropriate to solve a problem, use several criteria to narrow down the list of products to be investigated. First, consider which problem is to be addressed and which results to obtain. Second, select tools that can process the amount of data that must be analyzed and from which predictions are needed. Most tools work from ASCII or spreadsheet files, and many products can use ODBC or other technology to access relational databases. Few tools really take advantage of parallel processing technology to access terabytes of data.
Different tools require varying levels of analyst resources, skills, and time to implement. Product prices range from hundreds of dollars to hundreds of thousands of dollars. Generally, the less expensive tools are more narrowly focused, and the more expensive tools use a variety of sophisticated techniques. Some vendors have vertical industry applications. Intrepid Systems (which acquired retail decision-support specialist Kelly Information Systems in March 1996) links operational retail information with its data mining Market Basket Workbench.
Once you have chosen several potential tools, evaluate each one against real data. There is no substitute for this step because of the significant and unpredictable differences among different products and technologies. Most companies I talked to continually test more than one technology. Typically, statistical, OLAP, rule induction, and neural network technology are all researched. Neural networks are sometimes the most effective technology, but at other times statistical algorithms outperform the more advanced technology.
Please refer to the accompanying product chart for more information on the companies mentioned in this article.
Also see the sidebar "Lessons from the Trenches: Knowledge, Discovery, and Data Mining" by Herb Edelstein and Janet Millenson.
TABLE 1. Comparing the Leading Data Mining Technologies.
| Technology | Advantages | Limitations |
|---|---|---|
| Rule-based analysis | Good for data that is "complete" with data relationships that can be modeled via if-then rules or decision trees. Rules are readable. | A large number of rules is difficult to understand. Data may not have strong rules-based relationships. |
| Neural networks | Good for data with non-linear relationships. Can work well if data is missing some values. | Inability to explain the found relationships, although some leading-edge tools are attempting to create explanations of the decisions. Requires non-numeric data to be converted to numeric data values. |
| Fuzzy logic | Can rank results based on closeness to the desired result. | Small number of applications and vendors. |
| K-nearest-neighbor | Good for discovering clusters; can utilize an entire data source rather than require sampling for training. | Requires a large amount of memory (this technology is also called memory-based reasoning). May be overly sensitive to closely matching records. |
| Genetic algorithms | Good for forecasting problems involving data with non-linear relationships. Can work well if data is missing some values. | Inability to explain the found relationships, although some leading-edge tools are attempting to create explanations of the decisions. Requires non-numeric data to be converted to numeric data values. |
| Advanced Visualization | Users control the discovery of relationships via highly sophisticated 3-D visualization. | Requires more user intervention than other methods. |
| Combination | Will choose the best technology for the problem or can compare the results of the different technologies. Users not required to learn several tools. | The tool itself can be complicated to use. May not provide best-in-class techniques for each technology. |
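
The k-nearest-neighbor row in Table 1 can be illustrated with a minimal sketch. It assumes numeric features, Euclidean distance, and a simple majority vote; the function name `knn_classify` and the sample points are invented for illustration and do not reflect any product's implementation.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Label `query` by majority vote among its k closest training records.

    `training` is a list of (features, label) pairs; every record is kept
    in memory and compared against the query, which is why the technique
    is memory-hungry on large data sources.
    """
    nearest = sorted(training, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature records, invented for illustration only.
training = [
    ((1, 1), "low"), ((1, 2), "low"), ((2, 1), "low"),
    ((8, 8), "high"), ((9, 8), "high"), ((8, 9), "high"),
]
print(knn_classify(training, (2, 2)))
print(knn_classify(training, (7, 9)))
```

There is no separate training step: the stored records are the model, so the whole data source can be used, at the cost of the memory needed to hold it.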