Data Mining Today

By Peter Brooks
DBMS, February 1997

Surveying the current stock of picks and shovels for effective data warehousing.


As terabyte-size databases become prevalent and data warehouses commonplace, it is becoming more difficult to create value from the data -- there's just too much of it. The ability to create value from data is constrained by, among other things, the skill and experience of the analyst to make and verify hypotheses by sifting through large amounts of data. Data mining products attempt to overcome this limitation by using the computer, rather than humans, to discover data relationships that can be used to make predictions.

Although both data mining and decision-support applications can lead to the discovery of new data relationships, data mining tools often find unsuspected relationships in data that traditional techniques will overlook. This is because data mining applications attempt to make correlations and find relationships in data using what is called a "machine-based discovery" process. This process does not require people to create and test hypotheses about data relationships and correlations. For example, data mining can identify micro-market segments by analyzing hundreds of attributes in customer sales records rather than having a marketer rely on traditional age and income ranges. Users can ask data mining tools to identify the variables that affect sales of a specific product. Rather than having to guess that either cents-off coupons or weather may be the most significant indicator, say, it may really turn out that an unsuspected product placement of complementary items is most important.
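As a rough illustration of machine-based discovery, the sketch below ranks every candidate variable by how strongly it tracks sales instead of starting from an analyst's guess. Commercial tools use far richer methods; the plain Pearson correlation, the field names, and the sales figures here are all assumptions made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

def rank_variables(records, target):
    """Return (variable, correlation) pairs, strongest relationship first."""
    target_values = [r[target] for r in records]
    scores = []
    for name in records[0]:
        if name == target:
            continue
        column = [r[name] for r in records]
        scores.append((name, pearson(column, target_values)))
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)

# Hypothetical weekly store records (invented numbers, illustration only).
weekly = [
    {"coupon": 1, "rainy_days": 2, "placed_near_chips": 0, "sales": 100},
    {"coupon": 0, "rainy_days": 5, "placed_near_chips": 1, "sales": 180},
    {"coupon": 1, "rainy_days": 1, "placed_near_chips": 1, "sales": 190},
    {"coupon": 0, "rainy_days": 3, "placed_near_chips": 0, "sales": 95},
    {"coupon": 1, "rainy_days": 4, "placed_near_chips": 1, "sales": 185},
    {"coupon": 0, "rainy_days": 2, "placed_near_chips": 0, "sales": 105},
]

ranking = rank_variables(weekly, "sales")
for name, score in ranking:
    print(f"{name:20s} {score:+.2f}")
```

In this invented data set the product placement variable comes out on top, ahead of coupons and weather, which mirrors the kind of unsuspected result described above.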

Data Mining Technology

In this article I look at tools that do not require the user to propose data relationships in advance. Thus I focus on tools that use rules-based analysis (such as the decision-tree methods Classification and Regression Trees, or CART, and Chi-squared Automatic Interaction Detection, or CHAID), neural networks, fuzzy logic, k-nearest-neighbor, genetic algorithms, and/or advanced visualization. Table 1 compares and contrasts these technologies.

Although many tools specialize in only one technique, a growing number of tools use a combination of technologies, matching the technique to the problem at hand. Many algorithms such as CART, CHAID, and C4.5 are non-proprietary; product differentiation is in the areas of ease of setup, performance (particularly against large databases), presentation, and visualization of results.

Several types of business problems are appropriate for data mining analysis. A typical association problem is market basket analysis. Clustering is often used to group customers into market segments. A common classification problem is identification of fraud. Bruce Moxon's "Defining Data Mining" explains several of these problems in detail. (See DBMS Data Warehousing Supplement, August 1996, page S11.)
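To make the association problem concrete, here is a minimal sketch of market basket analysis over item pairs. It is not taken from any product discussed here; the baskets, the thresholds, and the function name `pair_rules` are assumptions chosen for illustration. Support is the share of baskets containing both items, and confidence is the share of baskets with the first item that also contain the second.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.3, min_confidence=0.6):
    """Find rules 'A -> B' between item pairs that clear both thresholds."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))           # ignore duplicate items
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):     # test the rule in both directions
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    # Strongest rules first; item names break ties for a stable order.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))

# Hypothetical checkout baskets.
baskets = [
    ["chips", "salsa", "soda"],
    ["chips", "salsa"],
    ["bread", "milk"],
    ["chips", "soda"],
    ["bread", "chips", "salsa"],
]

rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support {support:.0%}, confidence {confidence:.0%}")
# salsa -> chips: support 60%, confidence 100%
# soda -> chips: support 40%, confidence 100%
# chips -> salsa: support 60%, confidence 75%
```

Real products must count far larger item combinations over millions of baskets, which is where the performance differences among tools show up.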

Although most OLAP and statistics vendors do not support the definition of data mining used in this article -- machine-generated rather than human-generated hypotheses -- several are adding data mining technologies into their tools. Pilot Software Inc.'s Discovery Server uses decision-tree-based technology to perform customer profiling. Once customer segments are defined, the segment definitions can be fed into the hierarchy metadata definitions in Pilot's multidimensional database for use by the OLAP Analysis Server. SPSS, the statistical package, has neural network and CHAID data mining components.

DataMind Corp., which developed what it calls "agent network technology" data mining, has a relationship with Arbor Software Corp., developer of the Essbase OLAP product. As each record is read from a database, DataMind's product maps every unique input criterion to one or more output criteria. A discovery model is created that contains agents (which represent input criteria and output conditions), connections between agents, and connection weights. Essbase and DataMind both use Microsoft's Excel as a common user interface. (See DBMS, October 1996, page 36, for a review of DataMind.)

Most data mining tools now use Windows clients and access large relational databases using ODBC. Some data mining tools such as Thinking Machines Corp.'s Darwin have specialized architectures to exploit high-performance, parallel-processing hardware and databases. Rather than simply creating statistical printouts of the results, sophisticated online visualization for end users is now available. Features can include the ability to display n-dimensional graphics, dynamically rotate graphs so they can be viewed from many different perspectives, and show related 3-D graphs on screens. Visible Decisions Inc.'s Discovery tool can use VRML 1.0 to interact with information over the Internet.

Rules-Based Tools

Rules-based tools use decision-tree, expert system, knowledge-based system, or similar algorithms to discover patterns in historical data. Specific rules are extracted from the overall set of discovered rules to make predictions. Minimizing credit card defaults by analyzing application information is an example of a problem addressed by rules-based tools.
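The credit card example can be sketched as a single decision-tree split. The code below picks the application field that best separates defaults from paid accounts using Gini impurity, the measure CART is generally associated with, and then prints each branch as a readable rule. It illustrates the general idea only, not any vendor's algorithm; the records and field names are invented.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(records, attributes, target):
    """Pick the attribute whose split gives the lowest weighted Gini impurity."""
    best = None
    for attr in attributes:
        groups = defaultdict(list)
        for r in records:
            groups[r[attr]].append(r[target])
        weighted = sum(len(g) / len(records) * gini(g) for g in groups.values())
        if best is None or weighted < best[1]:
            best = (attr, weighted, groups)
    return best

def describe(attr, groups, positive):
    """Turn each branch into an if-then rule with its record count."""
    lines = []
    for value, labels in sorted(groups.items()):
        rate = labels.count(positive) / len(labels)
        lines.append(f"IF {attr} = {value} THEN {positive} rate = {rate:.0%} "
                     f"({len(labels)} records)")
    return lines

# Hypothetical historical credit card applications.
applications = [
    {"owns_home": "yes", "employed": "yes", "outcome": "paid"},
    {"owns_home": "no",  "employed": "yes", "outcome": "paid"},
    {"owns_home": "yes", "employed": "no",  "outcome": "default"},
    {"owns_home": "no",  "employed": "no",  "outcome": "default"},
    {"owns_home": "yes", "employed": "yes", "outcome": "paid"},
    {"owns_home": "no",  "employed": "no",  "outcome": "default"},
    {"owns_home": "yes", "employed": "no",  "outcome": "paid"},
    {"owns_home": "no",  "employed": "yes", "outcome": "paid"},
]

attr, impurity, groups = best_split(applications, ["owns_home", "employed"], "outcome")
rule_text = describe(attr, groups, "default")
print("\n".join(rule_text))
# IF employed = no THEN default rate = 75% (4 records)
# IF employed = yes THEN default rate = 0% (4 records)
```

A full decision tree repeats this split on each branch until the nodes are pure enough or too small to divide further.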

Information Discovery Inc.'s IDIS: The Information Discovery Engine is a client/server tool with Windows or Netscape clients that uses proprietary pattern-discovery algorithms to discover rules to classify data. It accesses large relational databases. The Mark Twain Edition component provides English text explanation of the rules it found. The IDIS:Predictive Modeler uses pattern matching to make predictions and forecasts based on previously generated IDIS rules. Information Discovery has retail and financial service industry applications available.

XpertRule Profiler by Attar Software Ltd. uses a rule induction process to create a decision tree that identifies which factors affect the desired outcome. An easy-to-understand Decision Tree View shows the number of database records and frequency of the desired outcome in each decision tree node. WizRule from WizSoft Inc. uses a proprietary mathematical algorithm to discover every rule under investigation in a relatively short time. Angoss Software International Ltd.'s KnowledgeSeeker specializes in market segmentation and target marketing.

The key benefit of rules-based data mining approaches is that they are relatively easy to understand. However, these tools may not produce the best results on data that lacks strong patterns or on database fields with missing values.

Neural Network Tools

Neural networks are a series of software synapses used to create a prediction model by clustering information into natural groups and then predicting into which groups new records will fall. The first phase in the neural network process is a training phase against a subset of the data where the neural network "learns." Once trained, the model is validated against other subsets of the database and its predictive accuracy is determined. This combination of training and validation is repeated until the predictive accuracy no longer improves with additional training.
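The train-then-validate cycle can be sketched with the smallest possible network, a single logistic neuron. This is a deliberately tiny stand-in for the multi-layer networks commercial tools build; the data, the learning rate, and the `patience` parameter (how many rounds without improvement to tolerate) are assumptions for illustration.

```python
import math

def predict(weights, bias, x):
    """Output of a single logistic neuron for one input vector."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def accuracy(weights, bias, data):
    """Share of records whose predicted class matches the known label."""
    hits = sum((predict(weights, bias, x) >= 0.5) == bool(y) for x, y in data)
    return hits / len(data)

def train_until_no_gain(train, validation, rate=0.5, patience=3, max_rounds=200):
    """Alternate training passes and validation checks; stop when accuracy stalls."""
    weights, bias = [0.0] * len(train[0][0]), 0.0
    best_acc, best_model, stalled = -1.0, (list(weights), bias), 0
    for round_no in range(1, max_rounds + 1):
        for x, y in train:                      # one training pass
            error = predict(weights, bias, x) - y
            weights = [w - rate * error * xi for w, xi in zip(weights, x)]
            bias -= rate * error
        acc = accuracy(weights, bias, validation)
        if acc > best_acc:                      # keep the best model seen so far
            best_acc, best_model, stalled = acc, (list(weights), bias), 0
        else:
            stalled += 1
            if stalled >= patience:             # no gain for `patience` rounds
                break
    return best_model, best_acc, round_no

# Hypothetical records: two scaled numeric inputs and a 0/1 outcome.
train = [((-0.8, -0.6), 0), ((0.7, 0.9), 1), ((-0.5, -0.9), 0),
         ((0.6, 0.4), 1), ((-0.3, -0.4), 0), ((0.9, 0.5), 1)]
validation = [((-0.7, -0.7), 0), ((0.8, 0.6), 1),
              ((-0.4, -0.2), 0), ((0.5, 0.3), 1)]

model, best_acc, rounds = train_until_no_gain(train, validation)
print(f"validation accuracy {best_acc:.0%} after {rounds} rounds")
```

Because this toy data is cleanly separable, the validation accuracy peaks after the first pass and the loop stops three rounds later; on real data the curve is noisier, which is why the stopping rule matters.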

NeuralWare's NeuralWorks Predict Professional uses feed-forward neural networks to create predictive models as well as fuzzy logic, statistics, and genetic algorithms. A Build Wizard walks users through the development process. Models are created in Microsoft Excel. Runtime portions of the model can be incorporated in C or C++ applications. The Explain component of NeuralWorks Predict attempts to explain the neural net decisions made in creating the predictive model. HNC Software Inc.'s DataBase Mining Workstation and DataBase Mining Marksman specialize in direct marketing data mining, modeling, and profiling of customer and prospect data.

Neural networks are best used to model non-linear data (data that cannot be expressed as a linear mathematical equation), noisy data, or data that is missing some values. One common complaint with neural network tools is that they do not explain the rationale for the clusters that are created. Another lesser-known restriction is that neural networks only work on numeric data. A state field, for example, must often be translated into 50 fields, each with a value of either one or zero.
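The state-field translation looks like this in practice. The sketch uses four states instead of 50 to keep the output short; the record layout and the `income` field are assumptions for illustration.

```python
def one_hot(value, categories):
    """Return a list with 1 in the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

# A short list stands in for all 50 states.
STATES = ["CA", "MA", "NY", "TX"]

def encode_record(record):
    """Replace the text `state` field with one numeric field per state."""
    return [record["income"]] + one_hot(record["state"], STATES)

row = encode_record({"income": 52000, "state": "MA"})
print(row)  # [52000, 0, 1, 0, 0]
```

Each additional category adds another input to the network, so fields with many distinct values can inflate the model considerably.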

Tools That Use a Combination of Techniques

Some tools determine whether to use rules-based or neural network techniques, depending on the type of problem to be analyzed.

Thinking Machines Corp.'s Darwin product uses CART, neural network, genetic algorithms, and k-nearest-neighbor techniques to perform classification and prediction. Unlike many other data mining tools, Darwin has a parallel software architecture that lets it specifically process large databases on high-performance, parallel-processing platforms such as DEC's AlphaServer, IBM Corp.'s SP2, and Sun Microsystems Inc.'s Ultra Enterprise servers.

A new entry in the data mining tools market is IBM's Intelligent Data Miner. Data Miner, in RS/6000 and SP2 versions, provides rules induction, neural network, and statistical analysis approaches to association discovery, sequential pattern discovery, and prediction. It runs against large DB2 databases and extracts from Oracle Corp.'s and Sybase Inc.'s databases.

NeoVista Solution Inc.'s new Decision Series consists of sophisticated data mining algorithms designed to be highly scalable. The products in the suite can transfer results and discovery factors into relational tables that can then be viewed by common query, reporting, and visualization tools.

Scalability

Only in the last five years or so have data mining tools begun to overcome their traditional inability to access large amounts of data stored in relational databases. Previously, data mining tools would only access text or proprietary file formats. Relational data would have to be extracted into these files to be processed. Although many products simply send SQL queries to the database and let the database handle scalability issues, some products, such as Thinking Machines' Darwin, are specifically designed to support parallel processing.

Data mining products have different scalability requirements than OLAP or SQL query-based tools in accessing a data warehouse. Data mining applications often need to access all columns in all rows and to perform elaborate non-SQL algorithms upon the retrieved data. To develop the optimal data mining model, you must run the tools several times against the data, changing the model parameters each time. Thus data mining is compute-intensive as well as I/O-intensive.

Sampling -- use of a statistical subset of the entire database that has similar characteristics to the entire database -- is the primary technique used to overcome the lengthy response times that occur when building a model. A number of samples are often needed. One sample is used to "train" the application -- in other words, determine the parameters of the data mining model. Several validation trials against different database samples are often required to validate the model successfully. Samples are most often created either by choosing all data items in every nth record of the database or by choosing only a small number of items in every record. However, sampling intervals must sometimes be chosen with more care; analyzing a whole year's worth of retail sales records could easily mask the results of any short-term promotions. In such cases, only one or two weeks of records should be sampled, perhaps using only a subset of stores.
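The sampling approaches described above can be sketched in a few lines. The record layout, sampling interval, and promotion weeks below are invented for illustration; a real tool would usually draw these samples inside the database, not in application code.

```python
def every_nth(records, n, offset=0):
    """Systematic sample: keep records offset, offset+n, offset+2n, ..."""
    return records[offset::n]

def project(records, fields):
    """Column sample: keep only a few items from every record."""
    return [{f: r[f] for f in fields} for r in records]

# Hypothetical retail sales records.
sales = [{"id": i, "store": i % 5, "week": i % 52, "amount": 10 + i % 7}
         for i in range(1000)]

# One sample trains the model; others, drawn at different offsets,
# validate it against records the model has not seen.
train = every_nth(sales, 10, offset=0)
validation_samples = [every_nth(sales, 10, offset=k) for k in (3, 7)]

# Alternative: every record, but only a small number of items.
narrow = project(sales, ["store", "amount"])

# Short-term promotion: restrict to two weeks and a subset of stores
# so the effect is not averaged away across the whole year.
promo = [r for r in sales if r["week"] in (12, 13) and r["store"] in (0, 1)]

print(len(train), [len(v) for v in validation_samples], len(promo))
```

Using different offsets keeps the training and validation samples disjoint, so the validation trials measure how well the model generalizes instead of how well it memorized.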

Recommendation

The ability to extract information from data has significantly improved in the last five years. However, there is so much technology and so many choices! Can data mining techniques identify more interesting or higher-value relationships than decision-support techniques? Which is the best data mining technology to use?

The first step is to realize that many data mining solutions can be used to perform categorization, association, and prediction. But there are significant differences in the results depending on the algorithms used, specific data values, and visualization techniques.

When choosing which tools are appropriate to solve a problem, use several criteria to narrow down the list of products to be investigated. First, consider which problem is to be addressed and which results to obtain. Second, select tools that can process the volume of data that must be analyzed and from which predictions are needed. Most tools work from ASCII or spreadsheet files, and many products can use ODBC or other technology to access relational databases. Few tools really take advantage of parallel processing technology to access terabytes of data.

Different tools require varying levels of analyst resources, skills, and time to implement. Product prices range from hundreds of dollars to hundreds of thousands of dollars. Generally, the less expensive tools are more narrowly focused, and the more expensive tools use a variety of sophisticated techniques. Some vendors have vertical industry applications. Intrepid Systems (which acquired retail decision-support specialist Kelly Information Systems in March, 1996) links operational retail information with its data mining Market Basket Workbench.

Once you have chosen several potential tools, evaluate each one against real data. There is no substitute for this step because of the significant and unpredictable differences among different products and technologies. Most companies I talked to continually test more than one technology. Typically, statistical, OLAP, rule induction, and neural network technology are all researched. Neural networks are sometimes the most effective technology, but at other times statistical algorithms outperform the more advanced technology.

Out of the Frying Pan. . .

Data mining products are coming out of the lab and into the data center. As databases grow in size and complexity, organizations are looking for data mining products to provide insight into data relationships that are difficult for humans to find. Although not appropriate for all types of problems, data mining products are already producing powerful results for several classes of problems such as customer segmentation, fraud detection, and market basket analysis -- often better results than those from statistical or OLAP techniques. Data mining is the next wave of decision-support technology, and many companies have already used data mining technology successfully. Although not every company needs to begin data mining today, the longer they wait, the more perilous their avoidance will become.

Please refer to the accompanying product chart for more information on the companies mentioned in this article.

Also see the sidebar "Lessons from the Trenches: Knowledge, Discovery, and Data Mining" by Herb Edelstein and Janet Millenson.


Peter Brooks is a management consultant with the Advanced Technology Group of Coopers & Lybrand Consulting, based in Boston. You can email Peter at 74477.3043@compuserve.com.

TABLE 1. Comparing the Leading Data Mining Technologies.

| Technology | Advantages | Limitations |
| --- | --- | --- |
| Rule-based analysis | Good for data that is "complete" with data relationships that can be modeled via if . . . then rules or decision trees. Rules are readable. | Large number of rules are difficult to understand. Data may not have strong rules-based relationships. |
| Neural networks | Good for data with non-linear relationships. Can work well if data is missing some values. | Inability to explain the found relationships, although some leading-edge tools are attempting to create explanations of the decisions. Requires non-numeric data to be converted to numeric data values. |
| Fuzzy logic | Can rank results based on closeness to the desired result. | Small number of applications and vendors. |
| K-nearest-neighbor | Good for discovering clusters; can utilize an entire data source rather than require sampling for training. | Requires a large amount of memory (this technology is also called memory-based reasoning). May be overly sensitive to closely matching records. |
| Genetic algorithms | Good for forecasting problems involving data with non-linear relationships. Can work well if data is missing some values. | Inability to explain the found relationships, although some leading-edge tools are attempting to create explanations of the decisions. Requires non-numeric data to be converted to numeric data values. |
| Advanced visualization | Users control the discovery of relationships via highly sophisticated 3-D visualization. | Requires more user intervention than other methods. |
| Combination | Will choose the best technology for the problem or can compare the results of the different technologies. Users not required to learn several tools. | Can be complicated to use the tool because of its complexity. May not provide best-in-class techniques for each technology. |
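
Of the technologies in Table 1, k-nearest-neighbor is the simplest to show in code, and a sketch makes its memory requirement obvious: the "model" is the stored data itself. The customer records, the two features, and the segment labels below are invented for illustration.

```python
from collections import Counter
from math import dist

def knn_classify(stored, query, k=3):
    """Label `query` by majority vote among its k closest stored records."""
    nearest = sorted(stored, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical customers: (age, income in thousands) and a known segment.
customers = [
    ((25, 30), "budget"),
    ((30, 35), "budget"),
    ((28, 32), "budget"),
    ((50, 90), "premium"),
    ((55, 95), "premium"),
    ((48, 85), "premium"),
]

print(knn_classify(customers, (52, 88)))  # premium
print(knn_classify(customers, (27, 31)))  # budget
```

Every prediction scans all stored records, so both memory use and response time grow with the size of the data source.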



DBMS and Internet Systems (http://www.dbmsmag.com)
Copyright © 1997 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated Wednesday, January 22, 1997.