DBMS
Classification and Regression. Predicting outcomes is the most popular application of data mining.
DBMS, Data Mining Solutions Supplement

Classification and regression are the most common types of problems to which data mining is applied today. Data miners use classification and regression to predict customer behavior, to signal potentially fraudulent transactions, to predict store profitability, and to identify candidates for medical procedures, to name just a few of many applications. A common thread in these applications is that they have a very high payoff. Data mining can increase revenue, prevent theft, save lives, and help make better decisions.

Numerous applications of classification and regression can be found in marketing and customer relations alone. Typical examples, which span many industries, include predicting direct mail response and identifying cross-selling opportunities. More industry-specific applications can be found in telecommunications, where data mining is used to control churn (customer attrition), and in banking or insurance, where it is used for credit-risk management and to detect fraudulent transactions or claims.

Various data mining techniques are available for classification and regression problems, and some techniques have several algorithms. (See the Techniques section below and subsequent articles for more details.) While these techniques will produce very different models, each technique generates a predictive model based on historical data. The model then predicts outcomes of new cases.

What distinguishes classification from regression is the type of output that is predicted. Classification, as the name implies, predicts class membership. For example, a model predicts that Jane Doe, a potential customer, will respond to an offer. With classification the predicted output (the class) is categorical. A categorical variable has only a few possible values, such as "Yes" or "No," or "Low," "Middle," or "High."

Regression predicts a specific value. For example, a model predicts that John Doe's customer profitability will be $854. Regression is used in cases where the predicted output can take on an unlimited (or at least many) possible values. An example could be predicting the profitability of a new customer. Such output variables are usually called continuous, a term that is used even for variables, such as a person's age, that are not continuous in the traditional mathematical sense. The term regression, as used here, is closely related to, but should not be confused with, linear regression which, because it has been so broadly used, is often referred to without the adjective "linear." Linear regression is a mathematical technique that can be used to fit variables to a line (or, when dealing with more than two dimensions, a hyperplane) and to predict values.

Many data mining authors and product vendors draw a strong distinction between classification and regression. They may have inherited this distinction from the statistics community, which until recently only concerned itself with continuous variables and regression. Statistical techniques for categorical variables are relatively recent developments. However, we have come to view classification and regression as being very closely related. Not only are many of the same tools used for both classification and regression, but it only takes a few minor data conversions to transform a classification problem to a regression problem, and vice versa.

For example, when trying to predict whether or not a person is likely to respond to a direct mail promotion, we could generate a score that ranges from 0 to 1. The score is interpreted so that values close to 0 mean very unlikely to respond and values close to 1 mean very likely to respond. People with scores greater than .5 (or some other cutoff; see the discussion on lift below for how to pick the cutoff) are classified as responders. Now we have turned the classification problem into a regression problem. Or, by viewing customer profitability as falling into four categories: unprofitable (profit less than $0), low (profit between $0 and $1,000), medium (profit between $1,000 and $5,000) and high (profit greater than $5,000), we can turn a regression problem (predict profitability amount) into a classification problem (predict profitability class).

In general, a regression problem is turned into a classification problem by bracketing the predicted continuous variable into discrete categories, and a classification problem can be turned into a regression problem by predicting a score or probability for each category and assigning a range of scores to each category. Bracketing values (also called binning) also occurs when a regression model is built using a decision tree. (See "Decision Trees".) Classification with neural nets commonly predicts a score or probability.
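The two conversions described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the 0.5 cutoff is the example value from the text, and because the text leaves the bracket boundaries open, assigning exactly $1,000 and exactly $5,000 to the medium class is an assumption.

```python
# Illustrative sketch; function names and boundary handling are assumptions.

def classify_response(score, cutoff=0.5):
    """Turn a regression-style score between 0 and 1 into a class label."""
    return "responder" if score > cutoff else "non-responder"


def profitability_class(profit):
    """Bracket (bin) a continuous profit amount into one of four classes."""
    if profit < 0:
        return "unprofitable"
    if profit < 1000:
        return "low"
    if profit <= 5000:  # assumption: both boundary values fall in "medium"
        return "medium"
    return "high"


print(classify_response(0.72))   # score above the cutoff
print(profitability_class(854))  # John Doe's predicted profitability
```

In practice the cutoff would be chosen from a lift analysis rather than fixed at 0.5.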

Despite the fact that regression problems can be converted into classification problems, it is likely that a tool oriented to predicting values will be easier to use and will produce better results on a regression problem. A prediction tool will be easier to use because necessary conversions from continuous to categorical or categorical to continuous will more than likely be done for you. And it will produce better results because artifacts, such as binning, have been eliminated. As in so many other applications, it is important to match the tool to the task.

Predictive Modeling Techniques

There are four techniques that dominate the commercially available classification and regression tools today: decision trees, neural networks, Naive-Bayes, and k-nearest neighbor.

All these techniques can generate predictive models (though some products that use them do not). Some of these techniques also provide descriptive models, information that provides insight or further understanding of relationships in the data, independent of the predictive nature of the model. This information might be of the form "income is the most important factor in determining whether or not someone is a good credit risk." Such descriptive information may be presented in text form, as shown here, or through a visualization tool such as a decision tree graphic or sensitivity analysis.

Organizing the Data

In order to do predictions, either regressions or classifications, you need data to build a predictive model. In data mining we call this activity training or learning. The latter term comes from the field of machine learning, a subfield within artificial intelligence (AI), but is used in data mining even when the technique being used does not have its roots in AI.

The data necessary to build a predictive model is composed of cases where the outcome is known and included. The field or variable that contains the outcome, such as good credit risk, is called the dependent or target variable. All of the other fields that are used to build the predictive model, such as income or marital status, are called the independent variables. Note that there may be some fields present in the data, like customer ID or social security number, that are not used in the model and are therefore neither dependent nor independent variables. These fields may be used later when the result set is joined to another database to retrieve contact information for marketing purposes. It is also quite common for data mining to uncover errors in the data. Keeping identifiers with the data, even though they aren't used by the data mining itself, might prove useful when trying to track errors back into the source database.

For a classification model, each case (record) in the training data has been preclassified in the dependent variable. For a regression model, the value of the dependent variable is known for each case. Typically, training data is historical, but can comprise a set of examples provided or created by an expert if historical data is unavailable. For example, to build a model to predict the credit risk of loan applicants, the training data needs to have a column (the dependent variable) that indicates whether each person is a good risk or a bad risk. Historical data could be based on people who were previously lent money, with good or poor risk determined by whether or not they defaulted. If historical data is not available, experienced loan officers could generate a set of training data by scoring a number of loan applicants as good or poor risks.

For classification models, it is important that data be available for all possible outcomes so the model can learn about all cases. For example, in the credit risk example, you must include cases for both good and poor risk. In marketing, you cannot produce a model that predicts whether consumers will or will not respond to a future promotion if the only historical data available is for those who responded to past promotions.

                  Predicted
Actual      High   Middle    Low    Total
High         401       20      5      426
Middle        38      806     93      937
Low           11      115    630      756

Table 1. Confusion matrix comparing predicted blood pressure class to actual. In this example, the biggest discrepancy is the 115 cases in the Low group that were predicted to be in the Middle group.

Classification and regression learning is also called supervised learning. In supervised learning, training occurs with a dataset that includes known outcomes (known classes for classification and known values for regression). This distinguishes it from the type of learning used in clustering algorithms that employ unsupervised learning. In clustering, the target clusters are not already known. It is up to the clustering algorithm to find new and original ways to cluster (or group) instances together. While they seem similar (both do some kind of grouping), clustering and classification are actually very different. In fact, sometimes you might use a clustering algorithm and a classification algorithm together. First, you would use the clustering algorithm to find a good way to group similar instances together. Then you'd designate each cluster as a class and assign each instance to the class that corresponds to its cluster. Finally, you would use a classification algorithm to find the rules for assigning a new instance to a class.

The dataset that is used to build the model is called the training dataset. Usually the training dataset is a sample extracted from a larger dataset, possibly a data warehouse or data mart. Using a sample instead of the complete database allows for some data to be withheld from the model-building step so that it can be used to test the model after it has been built. This second dataset is commonly called the test dataset.

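While there will always be at least two datasets -- training and testing -- as many as four different datasets may be used to generate and validate a predictive model. These datasets, which we will here call the training, control, test and validation datasets, are used as follows: the training dataset is used by the tool to build models; the control dataset is used by the tool to select the best of the models it considers; the test dataset is used by the model builder to test the selected model; and the validation dataset is used to validate the final model.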

Why so many datasets? The reason is rooted in the fact that data mining is an iterative process, with several levels of nested loops. At each nested level a new independent dataset is needed to properly test or validate the model produced by the preceding level or loop. The data mining tool itself is in the innermost loop. It uses the training data to build a model. It may consider a large number of models, using the control data to select the best one. Once you select a model it must be tested. The model builder cannot use either the training or control data to test the model produced by the tool because the model has already been influenced by these cases. That's why an independent test set is required. If the test results are not satisfactory, then the model builder will likely change some learning parameters and build another model.

We have seen the terminology for these datasets vary dramatically between products. Sometimes we have even seen overlap in their usage where a test dataset doubles as a control dataset. We recommend that all these datasets be distinct, but certainly -- in all cases -- there must be a dataset not used during model generation which serves as an independent measure of accuracy.

There is another way of computing the accuracy of a model, called n-fold cross-validation, which eliminates the requirement to withhold some of the data for testing and so allows all of the data to be used to build the model. Because n-fold cross-validation requires generating many more models, it is expensive and should be used only when there is so little data that giving up a portion of it for testing risks losing information needed to build a good model.

In the same way that the model-building algorithm might evaluate a number of models, so too will the person building a model consider several models. The model builder chops the collection of all possible models into large chunks, leaving it to the model-building algorithm to search through each chunk. The algorithm uses the control set to pick the best model in a chunk and the model builder uses the test set to pick the best model overall.

Because the control and test samples were used to select the best combination of parameters, mappings, and so forth, the resulting model has now been influenced by the control and test samples. Therefore, another dataset -- the validation set -- is needed to validate the final model.

Sampling Techniques

Each dataset is a sample obtained from the original historical data. These samples should be distinct from each other; that is, instances used in one sample should not appear in any of the other samples. You must also use proper sampling techniques to obtain the datasets to avoid bias in the samples. Random sampling is the most important characteristic of proper sampling. In particular, a sample should not be taken from the beginning or the end of the larger dataset (for example, the first or last 10,000 records), nor should the sample be selected as every nth (that is, every other or every third, etc.) record. An easy way to select a true random sample is to use a random number generator to generate a number between 0 and 1 for each record in the dataset. To select a sample that makes up 15 percent (for example) of the larger file, select those records for which the random number is less than or equal to 0.15.
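The random-number approach just described can be sketched as follows. This is a minimal illustration, not any product's method: the function name, the seed, and the proportions (15 percent each for validation, test and control, the rest for training) are hypothetical. One random number is drawn per record and compared against cumulative cutoffs, so every record lands in exactly one sample.

```python
import random


def assign_datasets(record_ids, seed=0):
    """Assign each record to exactly one dataset using one random draw per record."""
    rng = random.Random(seed)
    # Hypothetical proportions, expressed as cumulative upper cutoffs.
    cutoffs = [
        (0.15, "validation"),
        (0.30, "test"),
        (0.45, "control"),
        (1.00, "training"),
    ]
    samples = {name: [] for _, name in cutoffs}
    for rid in record_ids:
        r = rng.random()  # uniform random number in [0, 1)
        for upper, name in cutoffs:
            if r <= upper:
                samples[name].append(rid)
                break
    return samples


samples = assign_datasets(range(100000))
for name, ids in samples.items():
    print(name, len(ids))
```

Because each record is placed by its own random draw, the samples are distinct and do not depend on the order of records in the source file.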

Another reason for using a sample is that processing fewer records reduces the amount of time that is required for training. Depending on the type of data mining technique that is being used, training can involve one pass through the data or up to thousands of passes. The Naive-Bayes technique requires only one pass. Decision trees require several passes, and neural networks require hundreds to thousands of passes through the data. When you consider that the model-building process requires the construction of a large number of models, with each model requiring at least one pass and possibly thousands of passes through the training data, you can see that the time required is closely related to the size of the training dataset.

Dataset size has already had a significant impact on development of data mining products. Some organizations that may have a high potential benefit from data mining have very large customer bases and correspondingly large data volumes. As a result, some data mining vendors have focused on building algorithms that can take advantage of parallel processor architectures. The idea is that a computer with five parallel processors, for example, can speed through the training data in as little as one fifth of the time of a single processor. However, because there is so much data, a sample limited to a small percentage of the whole dataset will still capture enough variability to be able to generate a good model. Recall that national elections, involving millions of voters, are usually predicted with samples of around 1,500 voters!
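The election example can be made concrete with the standard approximation for the 95 percent margin of error of an estimated proportion. The formula and its assumptions (a simple random sample and the worst case p = 0.5) are supplied here for illustration and do not come from the article; real polls use more elaborate sample designs.

```latex
\text{margin of error} \approx 1.96\sqrt{\frac{p(1-p)}{n}}
  = 1.96\sqrt{\frac{0.5 \times 0.5}{1500}} \approx 0.025
```

Under those assumptions a sample of 1,500 gives an estimate within about 2.5 percentage points, and the size of the full population does not appear in the formula.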

If you have a very large dataset that you want to use for data mining, the trade-off is the cost and administrative complexity of a system with any number of parallel processors as opposed to relying on a well-constructed sample.

What is the impact of sampling on small effects? Many have argued that sampling risks overlooking important but very infrequent effects such as the extremely profitable customer who is literally "one in a million." Sampling will reduce the likelihood that such unusual effects are modeled because they may not be in the sample. But we think that it is unlikely that a data mining tool could find such a small effect because it would be indistinguishable from noise. Noise results from errors that creep into the data in various ways and is present in almost all data. If noise has affected a predictive model then the model will make incorrect predictions to the degree that it has been affected by errors in the training data. Such a model is said to be overtrained.

Infrequent effects cannot be detected if their frequency is too close to the frequency of noise. When this happens the model will either learn neither or both. In the former case, predictions will not be based on the small effect. In the latter case, which is one of overtraining, predictions will be based on noise.

The ability of a learning algorithm to recognize an effect (a signal) and ignore noise is based on the signal to noise (S/N) ratio. Some algorithms allow you to increase the S/N ratio by artificially increasing the frequency of certain cases in the training sample. Of course, this will only help if the model builder is already aware of a particular low-frequency effect.
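One simple way to raise the frequency of known rare cases is to replicate them in the training sample. The sketch below assumes that approach; the function name, the `fraud` field, and the replication factor are hypothetical, and individual tools may instead use case weights or other mechanisms.

```python
def oversample(cases, is_rare, factor):
    """Return a training sample in which each rare case appears `factor` times."""
    boosted = []
    for case in cases:
        copies = factor if is_rare(case) else 1
        boosted.extend([case] * copies)
    return boosted


# Made-up data: 98 ordinary cases and 2 rare (fraudulent) ones.
cases = [{"id": i, "fraud": i < 2} for i in range(100)]
boosted = oversample(cases, lambda c: c["fraud"], factor=10)

rare_before = sum(c["fraud"] for c in cases) / len(cases)
rare_after = sum(c["fraud"] for c in boosted) / len(boosted)
print(f"rare share before: {rare_before:.3f}, after: {rare_after:.3f}")
```

Predicted probabilities from a model trained on such a sample reflect the boosted frequencies, so they generally need to be adjusted back before use.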

While you can use sampling to solve many of the problems posed by extremely large datasets, very small datasets pose problems as well. Sometimes very little data is available, with the number of instances in the hundreds or even fewer. In this case, holding out part of the data for validation may reduce the training dataset to a size that is too small to build a reliable model, and you may want to use n-fold cross-validation. As previously noted, n-fold cross-validation is an iterative technique that permits all the data to be used to build the model while still remaining statistically valid.
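The mechanics of n-fold cross-validation can be sketched as below. The function name and fold-assignment scheme are illustrative assumptions. The records are shuffled and dealt into n folds; each fold serves once as the held-out test data while the other folds are used for training, so every record is tested exactly once.

```python
import random


def n_fold_splits(n_records, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds groups of nearly equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx


splits = list(n_fold_splits(n_records=103, n_folds=10))
for train_idx, test_idx in splits:
    # A model would be built on train_idx and its accuracy measured on
    # test_idx; the n accuracy figures are then averaged.
    print(len(train_idx), len(test_idx))
```

The averaged accuracy is commonly taken as the estimate for a final model built on all of the data, which is why the technique costs n additional model builds.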

Model Measures: Accuracy and Impact

Testing and validation are essential components of the model-building process. Whether it is done automatically by the technique or manually by the model builder, testing selects the best model from a set of possible models. Before any model can be used as a predictor its general applicability needs to be measured by testing its accuracy on a separate validation dataset that does not share any records with the training or control datasets. For both model selection and validation, the model is run against the test or validation data and a computed measure gauges how accurately the model predicted the results.

For classification, this measure of accuracy includes, at a minimum, a percentage that measures the number of cases that were correctly predicted (an accuracy measure) or incorrectly predicted (an error measure). More useful is a confusion matrix that summarizes a comparison of actual and predicted outcomes. Table 1 shows a confusion matrix that summarizes the predictions from a blood pressure application. From this matrix it is apparent that the largest error is 115 cases of Low blood pressure predicted as Middle. Sometimes confusion matrices present percentages instead of counts, or both.
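The summary measures mentioned above can be computed directly from the counts in Table 1. The short sketch below copies those counts; the variable names are illustrative.

```python
# Counts copied from Table 1; rows are actual classes, columns are predicted.
labels = ["High", "Middle", "Low"]
matrix = [
    [401, 20, 5],
    [38, 806, 93],
    [11, 115, 630],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))  # diagonal cells
accuracy = correct / total
error_rate = 1 - accuracy

# Largest off-diagonal cell: the most frequent kind of misclassification.
worst = max(
    (matrix[i][j], labels[i], labels[j])
    for i in range(len(labels))
    for j in range(len(labels))
    if i != j
)

print(f"accuracy = {accuracy:.3f}, error = {error_rate:.3f}")
print(f"largest error: {worst[0]} cases of actual {worst[1]} predicted as {worst[2]}")
```

Of the 2,119 cases, 1,837 lie on the diagonal, giving an overall accuracy of about 87 percent.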

Error or accuracy of a regression model can be measured in a number of ways. The most common accuracy measure is R-squared. Many tools will also show accuracy through a scatter plot that plots predicted values against actual values. The scatter plot from a perfect model would form a straight line. Figure 1 shows a scatter plot generated by Clementine from Integral Solutions Ltd. MEDV is the actual median value of homes and $N-MEDV is the predicted value using a neural net. This particular scatter plot shows that there is some kind of problem with maximum actual values (MEDV around 50). CHAS = 1, shown in red, are for properties near the Charles River.
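For reference, R-squared can be computed from actual and predicted values as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses that common definition; the function name and the four data points are made up for illustration and are not taken from Figure 1.

```python
def r_squared(actual, predicted):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
print(round(r_squared(actual, predicted), 3))  # 0.948
```

A perfect model, whose scatter plot is a straight diagonal line, has an R-squared of exactly 1.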

Accuracy, by itself, is not a sufficient measure of a model's usefulness. Data mining models are generally built to aid decision making or analysis or to improve some aspect of business performance. It would serve little purpose to implement a model that is not likely to have the desired impact on the business. As a result, many classification and regression products include outputs that help evaluate what the impact of the model would be.

When using data mining to predict whether someone is likely to respond to a direct mail offer, for example, it can be more useful to rank the prospects according to a predicted response probability, rather than just classifying each candidate as responding or not responding. The probabilities can then be used to compute that the highest ranked decile (for example, the top 10 percent) will produce 40 percent of the total responses. A lift chart shows this graphically. From the lift chart in Figure 2, an analyst can determine the appropriate cutoff for any desired share of the total responses. For example, 90 percent of the responses can be obtained by sending the offer to only the top 55 percent of prospects. Though not shown here, lift information can be combined with cost and revenue information to compute the return on investment (ROI) for any cutoff point. This permits campaigns to be managed by using a target ROI to select the cutoff point in the rank ordered target file.
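The numbers behind a lift chart can be computed by ranking prospects on predicted probability and accumulating the responses captured at each cutoff. The sketch below is one plausible way to do this; the function names and the tiny ten-prospect dataset are invented for illustration and do not reproduce Figure 2.

```python
def cumulative_gains(scores, responded, n_bins=10):
    """Share of all responders captured after each successive bin of the ranking."""
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    total_responders = sum(r for _, r in ranked)
    n = len(ranked)
    gains = []
    captured = 0
    for b in range(1, n_bins + 1):
        start = (b - 1) * n // n_bins
        end = b * n // n_bins
        captured += sum(r for _, r in ranked[start:end])
        gains.append(captured / total_responders)
    return gains


def smallest_cutoff(gains, target):
    """Smallest fraction of the ranked file that captures the target share."""
    for b, g in enumerate(gains, start=1):
        if g >= target:
            return b / len(gains)
    return 1.0


# Made-up data: predicted probabilities and actual outcomes (1 = responded).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
responded = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

gains = cumulative_gains(scores, responded, n_bins=5)
print(gains)                        # [0.5, 0.75, 0.75, 1.0, 1.0]
print(smallest_cutoff(gains, 0.9))  # 0.8
```

Combining each cutoff's captured responses with per-piece mailing cost and revenue per response would give the ROI figure mentioned above.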

Predicting New Cases

When a classification or regression tool is used in predictive mode, the most important outputs are the predicted classes or values. The typical method for prediction from a data file is generation of an output file that copies the input records and appends an additional column containing the predicted value or class for each case. Some products import data into and export data out of various DBMSs.

In addition to predicting from a data file, some tools also support prediction for single cases. For example, Figure 3 shows the Case Prediction Tool available in Data Mine Builder from Red Brick Systems (also available as DataMind DataCruncher from DataMind Corp.). Through a pull-down menu, values can be set for each of the independent variables, and the predicted class is immediately shown. Probably more important than the ability to predict a single case, this kind of tool aids in understanding the model as well. It can help an analyst probe the model to find, for example, the point in a range of values for a single variable that is associated with a change in prediction, when all other variables are held constant.

Prediction Is Popular

Classification and regression can help organizations improve their operations and strategic planning. As a result, virtually every data mining product on the market today supports the ability to do classification and regression. A variety of techniques, including decision trees, neural networks, Naive-Bayes and k-nearest neighbor, are used to implement classification and regression.

Because classification and regression are the bread and butter of data mining, we expect to see a rapid evolution in these products. Products must improve the ability of the model builder to explore and understand the model. Another need is a tighter coupling between data mining tools and the source databases, allowing for data mining directly against the data mart or data warehouse and eliminating the need for repeated extraction of volumes of data. Finally, the algorithm implementations will improve. Expect to see increased use of automatic model search, better model controls and improved user interfaces that use visual programming.


Figure 1. This scatter plot produced by Clementine from Integral Solutions Ltd. shows how close predicted values are to actual values. A diagonal line with no dispersion would represent a perfect prediction.


Figure 2. Darwin from Thinking Machines Corp. can display a lift chart.


Figure 3. Data Mine Builder's Case Prediction tool lets the analyst set values for each of the independent variables from a pull-down and immediately displays the predicted outcome, here Good risk based on Low debt, High income and marital status Yes. Pressing the Why button shows the Prediction Influence, in this case that the most important applicable rule for Good was Income High & Married Yes.


Estelle Brand (estelle@xore.com) and Rob Gerritsen (rob@xore.com) are founders of Exclusive Ore Inc., based in Blue Bell, Pennsylvania, which is a consulting and training company specializing in data mining. During the last two years they have used more than a dozen data mining products. Their database management systems experience dates back to the dark ages. For more information about Exclusive Ore and data mining, see www.xore.com.

DBMS (http://www.dbmsmag.com)
Copyright © 1998 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Updated February 26, 1998