
As recently as two years ago, data mining was a new concept for many people. Data mining products were new and marred by unpolished interfaces. Only the most innovative or daring early adopters were trying to apply these emerging tools. Today's products have matured, and data mining is accessible to a much wider audience. We are even seeing the emergence of specialized vertical market data mining products.
But what kinds of business problems can data mining technology solve, and what must users understand to apply these tools effectively? These articles will dispel some of the mystery by explaining popular data mining techniques and how they can be applied to real-world problems. They will also discuss what you can realistically expect from modern data mining products.
Data mining extracts new information from data. Data mining tools do more than query and analysis tools, OLAP tools, or statistical techniques such as analysis of variance, to name just a few examples. Understanding the kinds of questions data mining tools can answer is the best way to appreciate how they differ from other approaches.
Other query and analysis tools can respond to questions such as "Do sales of Product X increase in November?" or "Do sales of Product X decrease when there is a promotion on Product Y?" In contrast, you can use a data mining tool to ask, "What are the factors that determine sales of Product X?"
A critical distinction has to do with who drives the bus, and the traditional approach is much more painstaking for the analyst. With traditional tools, the analyst starts with a question, an assumption, or perhaps just a hunch, then explores the data and builds a model step by step, working to prove or disprove a theory. It is the analyst's responsibility to propose each hypothesis, test it, propose an additional or substitute hypothesis, test it, and so on, and in this iterative way build a model. While this responsibility does not disappear entirely with data mining, data mining shifts much of the work of finding an appropriate model from the analyst to the computer. This has the following potential benefits:
Data mining tools create analytical models that are predictive, descriptive or both.
A predictive model can answer questions such as "Is this transaction fraudulent?", "How much profit will this customer generate?", or "Where in the body is the most likely location of this patient's primary tumor?"
Descriptive models provide information about the relationships in the underlying data, generating information of the form "A customer who purchases diapers is three times more likely to also purchase beer," or "Weight and age, together, are the most important factors for predicting the presence of disease X," or "Households with incomes between $60,000 and $80,000 and two or more cars are much more similar to each other than households with no children and incomes between $40,000 and $60,000."
Most but not all predictive models are also descriptive. A decision tree is an example of a model that is both (see Decision Trees). Some descriptive models cannot be used for prediction. An example is a model formed using a technique known as association (see Association and Sequencing).
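To make the idea of a model that is both predictive and descriptive concrete, here is a minimal sketch of a one-level decision tree (often called a stump). The customer data, the field names `income` and `cars`, and the helper names are all hypothetical, and a real decision tree product would grow many levels and use a more refined splitting criterion than a raw error count.

```python
from collections import Counter

def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature_names):
    """Search every feature and threshold for the split with the fewest errors."""
    best = None
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds lie midway between consecutive observed values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            left_label, right_label = majority(left), majority(right)
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best["errors"]:
                best = {"feature": name, "index": j, "threshold": threshold,
                        "left": left_label, "right": right_label, "errors": errors}
    return best

def predict(stump, row):
    """Predictive use: classify a new case."""
    side = "left" if row[stump["index"]] <= stump["threshold"] else "right"
    return stump[side]

def describe(stump):
    """Descriptive use: state the discovered relationship as a readable rule."""
    return (f"if {stump['feature']} <= {stump['threshold']}: {stump['left']}; "
            f"otherwise: {stump['right']}")

# Hypothetical history: (income in thousands, cars owned) -> responded to offer?
rows = [(35, 1), (42, 1), (48, 2), (55, 1), (72, 2), (80, 3), (90, 2), (65, 1)]
labels = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]

stump = fit_stump(rows, labels, ["income", "cars"])
rule = describe(stump)
new_customer_prediction = predict(stump, (75, 1))
print(rule)
print(new_customer_prediction)
```

The same fitted object answers a predictive question (what will this customer do?) and a descriptive one (which factor separates responders from non-responders?).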
Data mining is part of a larger iterative process called knowledge discovery. The steps in the knowledge discovery process are:
Defining the problem. Identify the goals of the knowledge discovery project. Verify that the goals are actionable; that is, if the goals are met, the business can put the newly discovered knowledge to use. You must also identify the data to be used.
Collecting, cleaning, and preparing data. Obtain necessary data from various internal and external sources. Resolve representation and encoding differences. Join data from various tables to create a homogeneous source. Check and resolve data conflicts, outliers (unusual or exception values), missing data, and ambiguity. Use conversions and combinations to generate new data fields such as ratios or rolled-up summaries. These steps require considerable effort, often 70 percent or more of the total data mining effort. If you already have a data warehouse (especially one that contains detailed transaction data), you probably have a head start on the data collection, integration, and cleaning that is needed.
Data mining. The model-building step involves selecting data mining tools, transforming data if the tool requires it, generating samples (as necessary) for training, testing and validating the model and, finally, using the tools to build, test and select models.
Validating the models. Test the model for accuracy on an independent dataset, one that has not been used to create the model. Assess the sensitivity of a model. Pilot test the model for usability. For example, if you are using a model to predict customer response, make a prediction and do a test mailing to a subset. See how closely the responses match your predictions.
Deploying the model. For a predictive model, use the model to predict results for new cases, then use the prediction to alter organizational behavior. Deployment may require building computerized systems that capture the appropriate data and generate a prediction in real time so that a decision maker can apply the prediction. For example, a model can determine if a credit card transaction is likely to be fraudulent.
Monitoring. Whatever you are modeling, it is likely to change over time. The economy changes, competitors introduce new products, or the news media finds a new hot topic. Any of these forces will alter customer behavior. So the model that was correct yesterday may no longer be very good tomorrow. Monitoring models requires constant revalidation of the model on new data to assess if the model is still appropriate.
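The validation and monitoring steps can be sketched in a few lines. This is only an illustration under stated assumptions: the "model" is a toy one-rule cutoff on income, the data is invented, the holdout set is every third case rather than a random sample, and the 10-point tolerance in `needs_rebuild` is an arbitrary choice.

```python
def split_holdout(cases, every=3):
    """Hold out every third case; a real project would usually sample at random."""
    holdout = cases[::every]
    training = [c for i, c in enumerate(cases) if i % every != 0]
    return training, holdout

def fit_threshold(cases):
    """Toy model: predict a response when income reaches a learned cutoff."""
    highest_no = max(x[0] for x, responded in cases if not responded)
    lowest_yes = min(x[0] for x, responded in cases if responded)
    cutoff = (highest_no + lowest_yes) / 2
    return lambda x: x[0] >= cutoff

def accuracy(model, cases):
    """Fraction of (features, label) cases the model predicts correctly."""
    return sum(model(x) == y for x, y in cases) / len(cases)

def needs_rebuild(model, baseline_accuracy, new_cases, tolerance=0.10):
    """Monitoring check: has accuracy on fresh data fallen too far below baseline?"""
    return accuracy(model, new_cases) < baseline_accuracy - tolerance

# Hypothetical history: customers with income of 60 (thousand) or more responded.
historical = [((income,), income >= 60) for income in range(30, 100, 5)]
training, holdout = split_holdout(historical)
model = fit_threshold(training)

# Validation: score the model on cases that were not used to build it.
baseline = accuracy(model, holdout)

# Monitoring: behavior has shifted, and now only incomes of 80 or more respond.
recent = [((income,), income >= 80) for income in range(30, 100, 5)]
rebuild = needs_rebuild(model, baseline, recent)
print(baseline, accuracy(model, recent), rebuild)
```

The important design point is that `baseline` is measured on data the model never saw, so it is a fair yardstick for later comparisons against new data.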
The knowledge discovery process is iterative. For example, while cleaning and preparing data you might discover that data from a certain source is unusable, or that you require data from a previously unidentified source to be merged with the other data. Often the first time through the data mining step will reveal that additional data cleaning is required.
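As a rough illustration of the cleaning and preparation work described above, the following sketch resolves an encoding difference, fills a missing value, and derives a ratio field. The record layout, the field names, and the choice to fill missing incomes with the median are assumptions made for the example, not recommendations for any particular dataset.

```python
from statistics import median

# Two hypothetical sources encode gender differently; map both to one code.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def prepare(records):
    """Return cleaned copies of the records with a derived debt-to-income ratio."""
    known_incomes = [r["income"] for r in records if r["income"] is not None]
    fill_income = median(known_incomes)  # assumed policy for missing incomes
    cleaned = []
    for r in records:
        income = r["income"] if r["income"] is not None else fill_income
        cleaned.append({
            "id": r["id"],
            "gender": GENDER_CODES.get(str(r["gender"]).strip().lower(), "unknown"),
            "income": income,
            # Keep a flag so later steps know the value was filled in, not observed.
            "income_was_missing": r["income"] is None,
            # Derived field; left empty when income is zero to avoid dividing by zero.
            "debt_to_income": round(r["debt"] / income, 3) if income else None,
        })
    return cleaned

raw = [
    {"id": 1, "gender": "M", "income": 50000, "debt": 10000},
    {"id": 2, "gender": "female", "income": None, "debt": 15000},
    {"id": 3, "gender": " f ", "income": 70000, "debt": 7000},
    {"id": 4, "gender": "Male", "income": 60000, "debt": 0},
]

cleaned = prepare(raw)
for row in cleaned:
    print(row)
```

Flagging filled-in values rather than silently overwriting them makes it easier to revisit the decision if a later pass through the process shows the field is unreliable.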
Data mining tools can model a number of different problems. The most common of these are:
Classification and regression represent the largest body of problems to which data mining is applied today, creating models to predict class membership (classification) or a value (regression). Three examples of these types of problems are predicting whether or not a loan applicant is a good credit risk, predicting the lifetime profitability of a customer, and predicting the probability that a patient has a certain disease. There are several classification and regression techniques including decision trees, neural networks, Naive Bayes and nearest neighbor. (For an overview, see Classification and Regression.)
Association and sequencing. Often called market basket analysis, these techniques generate descriptive models that discover rules such as "Customers who purchase pasta are three times more likely to purchase cheese than customers who don't buy pasta." (See Association and Sequencing.)
Clustering is a descriptive technique that groups similar entities together and puts dissimilar entities in different groups. It can be used in marketing for finding customer affinity groups, and in health care to find patients with similar profiles. Clustering techniques include a special type of neural net called a Kohonen net, as well as k-means and demographic algorithms. These techniques are not discussed in any of the articles in this issue.
Clustering is very subjective. Like the nearest neighbor technique, clustering requires a distance measure, and the clusters depend entirely on the measure you choose. As a result, two different data miners working with the same data could find two completely different ways of clustering that data, and ten data miners could find ten. That raises the question, "Who is right?" Or would the right clustering have been found if only an eleventh data miner had come into the picture? Therefore, clustering always requires significant involvement from a business or domain expert, who needs both to propose an appropriate distance measure and to judge whether the clusters are useful.
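The dependence on the distance measure is easy to demonstrate. The sketch below runs the same plain k-means procedure twice on four hypothetical households, once measuring distance on the raw fields and once after rescaling each field to the range 0 to 1. The data, the seeding rule (the first k points), and the function names are assumptions chosen to keep the example small.

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Plain k-means with Euclidean distance; the first k points seed the centers."""
    centers = [points[i] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest center.
        assignment = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment

def min_max_scale(points):
    """Rescale every field to the range 0..1 so no single field dominates."""
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    return [tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
            for p in points]

# Hypothetical households: (income in thousands of dollars, cars owned).
households = [(40, 1), (44, 3), (52, 1), (56, 3)]

by_raw = kmeans(households, k=2)                    # income differences dominate
by_scaled = kmeans(min_max_scale(households), k=2)  # car ownership now matters too
print(by_raw)
print(by_scaled)
```

On the raw fields the households group by income; after rescaling they group by number of cars. Neither answer is wrong, which is why a domain expert has to decide which notion of similarity fits the business question.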
Because a data mining or knowledge discovery exercise frequently needs more than one kind of algorithm (classification and clustering, say, or regression and association), many data mining products include a suite of tools. Even when tackling only one problem, it can be advantageous to have multiple algorithms at hand. That allows you to compare a decision tree model to a neural net model, increasing your confidence in the results when predictions from the two models are identical, and appropriately raising a flag when the two models disagree. Understanding how each data mining technique works can help you match the appropriate technique to your business problem.
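A simple way to put two models side by side is to score the same cases with both and set aside the cases where they disagree. In the sketch below a nearest neighbor model and a hand-written rule stand in for the decision tree and neural net mentioned above; the applicant data, the rule's cutoffs, and all names are hypothetical.

```python
from math import dist

def nearest_neighbor_model(training_cases):
    """Predict the label of the closest training case (1-nearest neighbor)."""
    def predict(x):
        _, label = min(training_cases, key=lambda case: dist(case[0], x))
        return label
    return predict

def rule_model(x):
    """Stand-in for a second model, such as a rule read off a decision tree."""
    income, years_in_home = x
    return "good" if income >= 50 and years_in_home >= 3 else "bad"

def compare(models, cases):
    """Separate the cases on which all models agree from those that need review."""
    agreed, flagged = [], []
    for x in cases:
        predictions = {name: model(x) for name, model in models.items()}
        if len(set(predictions.values())) == 1:
            agreed.append((x, predictions))
        else:
            flagged.append((x, predictions))
    return agreed, flagged

# Hypothetical loan history: (income in thousands, years in home) -> credit risk.
history = [((30, 1), "bad"), ((35, 6), "bad"), ((60, 4), "good"),
           ((80, 10), "good"), ((55, 1), "bad")]
models = {"nearest_neighbor": nearest_neighbor_model(history), "rule": rule_model}

applicants = [(75, 8), (32, 2), (58, 2)]
agreed, flagged = compare(models, applicants)
print("agreed:", agreed)
print("flagged for review:", flagged)
```

Agreement between models built in different ways is not proof that a prediction is right, but disagreement is a cheap signal that a case deserves a closer look.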
Analysts who use data mining tools must appreciate the critical distinction between causality and correlation. It is easy to confuse the two. A rule such as "Customers who purchase pasta are three times more likely to purchase cheese than customers who don't buy pasta" sounds as if buying pasta causes people to buy cheese. While that might indeed be the case, it could also be that buying cheese causes people to buy pasta. Or maybe neither of these is the case, and there is something else, like the sudden popularity of a book called You Can Lose Five Pounds a Week Eating Pasta With Cheese!, that is causing them to be bought together. Data mining will never tell you the cause. You might know the cause, but that knowledge comes from somewhere else, not from the data mining model.
The problems with causality also apply when you use a data mining model to make predictions. If a model predicts that a certain customer is likely to respond to a promotion because he owns two or more cars, has an income in excess of $70,000, and has owned his home for more than five years, the prediction is based on the observation that people with these characteristics have been more likely to respond to similar offers in the past. These factors may be causal, but there is no way to know from the model which are.
Data mining tools find correlations, not causes, and the rules and predictions that come out of data mining tools are based on correlation only.
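A small calculation shows why a rule of this kind cannot indicate the direction of cause. The sketch below computes the "times more likely" ratio from a set of invented market baskets, first from pasta to cheese and then from cheese to pasta. The basket counts and the function name are assumptions made up for the illustration.

```python
def times_more_likely(baskets, given, target):
    """How much more often `target` appears in baskets with `given` than without it."""
    with_given = [b for b in baskets if given in b]
    without_given = [b for b in baskets if given not in b]
    rate_with = sum(target in b for b in with_given) / len(with_given)
    rate_without = sum(target in b for b in without_given) / len(without_given)
    return rate_with / rate_without

# Hypothetical baskets: 6 with both items, 2 pasta only, 3 cheese only, 9 neither.
baskets = ([{"pasta", "cheese"}] * 6 + [{"pasta"}] * 2
           + [{"cheese"}] * 3 + [{"bread"}] * 9)

pasta_to_cheese = times_more_likely(baskets, "pasta", "cheese")
cheese_to_pasta = times_more_likely(baskets, "cheese", "pasta")
print(pasta_to_cheese)
print(cheese_to_pasta)
```

Both ratios come out well above 1, so the same baskets support a rule in either direction. Nothing in the counts distinguishes "pasta leads to cheese" from "cheese leads to pasta" or from a third factor driving both.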
What are some of the trends in data mining? Expect to see data mining products evolve into tools that support more than just the data mining step in knowledge discovery and that help encourage a better overall methodology. Data mining tools operate on data, so we can expect to see the algorithms move closer to the data, perhaps into the DBMS itself.
The major advantage that data mining tools have over the traditional analysis tools is that they use computer cycles to replace human cycles. The market will continue to build on that advantage with products that search larger and larger spaces to find the best model. This will occur in products that incorporate different modeling techniques in the search as well as ways of automatically creating new variables such as ratios or rollups. A new type of decision tree, called an oblique tree, will soon be available that generates splits based on compound relationships between independent variables, rather than the one-variable-at-a-time approach used today.
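The difference between a one-variable-at-a-time split and a split on a compound relationship can be sketched as follows. The loan data is invented, and the compound variable (income minus debt) is chosen by hand here purely for illustration; an oblique tree algorithm would search for such a combination itself.

```python
def split_errors(scores, labels, threshold):
    """Errors when each side of the threshold predicts its own majority class."""
    errors = 0
    for above in (True, False):
        side = [y for s, y in zip(scores, labels) if (s > threshold) == above]
        if side:
            majority = max(set(side), key=side.count)
            errors += sum(y != majority for y in side)
    return errors

def best_split_errors(scores, labels):
    """Fewest errors achievable with a single threshold on this one variable."""
    values = sorted(set(scores))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(split_errors(scores, labels, t) for t in thresholds)

# Hypothetical applicants: (income, debt) in thousands, and their credit outcome.
applicants = [(20, 10), (30, 40), (50, 30), (60, 70),
              (80, 60), (90, 100), (40, 20), (70, 90)]
labels = ["good", "bad", "good", "bad", "good", "bad", "good", "bad"]

incomes = [income for income, _ in applicants]
debts = [debt for _, debt in applicants]
# Compound variable an oblique split could use (picked by hand for this sketch).
income_minus_debt = [income - debt for income, debt in applicants]

errors_income = best_split_errors(incomes, labels)
errors_debt = best_split_errors(debts, labels)
errors_oblique = best_split_errors(income_minus_debt, labels)
print(errors_income, errors_debt, errors_oblique)
```

No single threshold on income or on debt separates the two classes cleanly, while one threshold on the combined variable does. A conventional tree would need several levels of splits to approximate the same boundary.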
Many data mining tools still require a significant level of expertise from users. Tool vendors must design better user interfaces if they hope to gain wider acceptance of their products, particularly in midsize and smaller companies. Easier interfaces will allow end user analysts with limited technical skills to achieve good results, yet let experts tweak models in any number of ways, and will help users at any level of expertise move quickly through their learning curves.