DBMS
Naıve-Bayes and Nearest Neighbors. These techniques quickly build classification models for prediction and description of data.
DBMS, Data Mining Solutions Supplement

Naıve-Bayes and Nearest Neighbor

Naıve-Bayes and nearest neighbor (more precisely, k-nearest neighbor where k represents a number of similar cases) are two less frequently implemented but useful classification techniques. Both algorithms are fast and easy to use. Their speed makes them ideal candidates for quickly building and testing classification models. In addition, the descriptive nature of a Naıve-Bayes model makes it an excellent exploratory tool. However, as we will discuss later in this article, these algorithms have well-known flaws that suggest they should be used with caution or augmented with other techniques.

Naıve-Bayes

Naıve-Bayes is named after Thomas Bayes (1702-1761), a British minister whose theory of probability was first published posthumously in 1764. Bayesıs Theorem is used in the Naıve-Bayes technique to compute the probabilities that are used to make predictions. The reason for the word "Naıve" in the name will be discussed later.

When a new case is analyzed, a prediction is made by combining the effects of the independent variables on the dependent variable (the outcome that is predicted).

Naıve-Bayes is a classification technique that is both predictive and descriptive. It analyzes the relationship between each independent variable and the dependent variable to derive a conditional probability for each relationship. When a new case is analyzed, a prediction is made by combining the effects of the independent variables on the dependent variable (the outcome that is predicted). In theory, a Native-Bayes prediction will only be correct if all the independent variables are statistically independent of each other, which is frequently not true. For example, data about people usually will contain multiple attributes (such as weight, education, income, and so forth) that are all correlated with age. In such a case, using Naıve-Bayes would be expected to overemphasize the effect of age. Notwithstanding these limitations, practice has shown that Naıve-Bayes produces good results, and its simplicity and speed make it an ideal tool for modeling and investigating simple relationships.

Naıve-Bayes requires only one pass through the training set to generate a classification model. This makes it the most efficient data mining technique. However, Naıve-Bayes does not handle continuous data, so any independent or dependent variables that contain continuous values must be binned or bracketed. For instance, if one of the independent variables is age, the values must be transformed from the specific value into ranges such as "less than 20 years," "21 to 30 years," "31 to 40 years" and so on. Binning is technically simple, and most algorithms automate it, but the selection of the ranges can have a dramatic impact on the quality of the model produced.

Using Naıve-Bayes for classification is a fairly simple process. During training, the probability of each outcome (dependent variable value) is computed by counting how many times it occurs in the training dataset. This is called the prior probability. For example, if the Good Risk outcome occurs 2 times in a total of 5 cases, then the prior probability for Good Risk is 0.4. You can think of the prior probability in the following way: "If I know nothing else about a loan applicant, there is a 0.4 probability that the applicant is a Good Risk." In addition to the prior probabilities, Naıve-Bayes also computes how frequently each independent variable value occurs in combination with each dependent (output) variable value. These frequencies are then used to compute conditional probabilities that are combined with the prior probability to make the predictions. In essence, Naıve-Bayes uses the conditional probabilities to modify the prior probabilities.

Let's explore how Naıve-Bayes can be applied to the credit risk classification problem (see the sidebar Predicting Credit Risk). The goal is to be able to predict whether an applicant will be a Good or a Poor Credit Risk. In our example, all the columns ı both independent and dependent -- are categorical, so we do not have to convert any values to categorical variables by binning (grouping) values. The first five columns of Table 2 contain the sample data.

From the sample data, we cross-tabulate counts of each Risk outcome (Good or Poor) and each value in the independent variable columns (see Table 1). For example, row 3 reports two cases where Income is High and Risk is Good and one case where Income is High and Risk is Poor.

CountsCountsProbabilities Probabilities
Independent
Variable
Value Good Risk Poor Risk Good Risk Poor Risk
Debt High 1 1 0.50 0.33
Debt Low 1 2 0.50 0.67
Income High 2 1 1.00 0.33
Income Low 0 2 0 0.67
Married Yes 2 2 1.00 0.67
Married No 0 1 0 0.33
Total by Risk 2 3

Table 1. Cross tabulated counts and computed probabilities for each combination of an independent variable and the dependent variable (Good or Poor Risk).

We can easily compute conditional probabilities using these counts, dividing by the "Total by Risk" numbers (on the bottom row) in each case. For example, the conditional probability that a Good Risk has High Income is 1.00 (see row 3). Thatıs because both instances of Good Risk have High Income. In the same way, the conditional probability that a Poor Risk has a Low Income is 0.67, because two out of three Poor risks have Low Income (see row 4).

The bottom row of Table 1 is also used to compute the prior probabilities for Good and Poor Risk. In our data the prior probability for Good is 0.40 (two of five cases) and the prior probability for Poor is 0.60 (three of five cases).

The probabilities in Table 1 might seem the reverse of what we want because they are all probabilities for the independent variables. You might be wondering why we computed these because what we are really trying to do is predict the dependent variable so we can determine whether or not someone is a Good or Poor risk. Thatıs where the beauty of Bayesı Theorem comes in, as it lets us compute the probability of the dependent variable based on the conditional probabilities of the independent variables! In fact, given a particular case, we can compute a score for each possible value of the dependent variable simply by multiplying the prior probability for the dependent variable times all the conditional probabilities from Table 1 for the independent variables. The highest scoring value becomes the predicted value.

For example, in Table 2 the first row for Joe in the training set has High Debt, High Income and Married is Yes. In Table 1 the probabilities associated with these values and Good Risk are 0.5, 1.0 and 1.0 respectively (see rows 1, 3 and 5). The product of these three numbers and the prior probability for Good (0.40) is 0.20 (0.50 x 1 x 1 x 0.40). For Poor Risk the probabilities (also from rows 1, 3 and 5) are 0.33, 0.33, and 0.67. With a prior probability for Poor Risk of 0.60, the score for Poor Risk is 0.044 (0.33 x 0.33 x 0.67 x 0.60). Because the score for Good is higher, we predict that Joe will be a Good risk.

Name Debt Income Married? Risk
Actual
Good Risk
Score
Poor Risk
Score
Risk Predicted
Joe High High Yes Good 0.200 0.044 Good
Sue Low High Yes Good 0.077 0.034 Good
John Low High No Poor 0 0.086 Poor
Mary High LowYes Poor 0 0.096 Poor
Fred Low Low Yes Poor 0 0.137 Poor

Table 2. The actual risk, the scores for Good risk and Poor risk, and the predicted risk for all cases in the sample data.

Table 2 presents the actual risk, the scores for Good risk and Poor risk, and the predicted risk for all cases in the sample data. With all outcomes predicted correctly, the algorithm has 100 percent accuracy on the training set. Although this is a good sign, we must also validate the model by using it to predict the outcomes for separate test data, but we will skip that step.

The scores that we computed can also be easily converted to probabilities, by dividing each score by the sum of all scores. For example, the probability that Joe is a Good Risk is approximately 82 percent (0.2 divided by 0.244, which is the sum of the Good Risk value of 0.2 and the Poor Risk value of 0.044.). The probability that Joe is a Poor Risk is 18 percent.

Note that we donıt need to know values for all the independent variables to make a prediction. In fact, if we know none of them, we can still predict using just the prior probability. If we know only the value for Income, we can use the conditional probabilities associated with Income to modify the prior probabilities, and make a prediction. This ability to make predictions from partial information is a significant advantage of Naıve-Bayes.

Bayesı Theorem permits us to compute the scores and probabilities by multiplying the conditional probabilities from Table 1. This is valid only if we assume a statistical independence between the various independent variables such as debt and income. This assumption is the origin of the adjective "naıve" in the name of the algorithm, presumably because this assumption is usually not correct. Nevertheless, the algorithm appears to have good results in practice, even when the independence assumption is false.

Extending the Bayesian Technique

It is possible to extend the Naıve-Bayes algorithm, making it less naıve, by capturing interactions between pairs of nonindependent columns, or even more elaborate combinations of columns, such as triples.

Considering a combination of columns can be costly because it brings the combinatorial explosion into play. For example, if there are 200 independent columns in your dataset, and each column typically has 5 values, then there will be 497,500 pairs to consider! This is because there are 200*199/2 (or 19,900) pairs of columns (computed from the standard formula for m things taken n at a time). For each pair of columns there are 52 (25) combinations of values. The product of these two numbers (19,900 x 25) is 497,500. Considering triples instead of pairs leads to more than 164 million combinations.

A counting algorithm would potentially need one counter for each possible combination (half a million for pairs, 164 million for triples), but the counting could still be done in one pass. Clearly not all combinations will occur, so the actual storage space needed for counters could be significantly less. Nevertheless, an algorithm that considers combinations will need to manage much more storage than a simple Bayesian algorithm. Implementers of these nonnaıve (or at least less naıve) Bayesian algorithms that consider combinations of columns might also want to use human interaction to specify which combinations of columns to consider, greatly reducing the overall number of pairs that need to be counted.

It is easy to see why Naıve-Bayes, in its purest form, makes an excellent exploratory tool. Itıs fast, it can make predictions from partial information, and the conditional probabilities provide useful descriptive feedback. Some data mining implementations of Naıve-Bayes, perhaps more properly called non-Naıve-Bayes, have extensions that enable them to analyze and identify more complicated relationships. Once trained, a Naıve-Bayes model must be tested with an independent dataset to insure that the relationships found are generally applicable before using the model for prediction. And as with all classification and regression models, it must be monitored over time to verify its continued relevance.

Nearest Neighbor

Nearest Neighbor (more precisely k-nearest neighbor, also k-NN) is a predictive technique suitable for classification models.

Unlike other predictive algorithms, the training data is not scanned or processed to create the model. Instead, the training data is the model. When a new case or instance is presented to the model, the algorithm looks at all the data to find a subset of cases that are most similar to it and uses them to predict the outcome.

There are two principal drivers in the k-NN algorithm: the number of nearest cases to be used (k) and a metric to measure what is meant by nearest.

Each use of the k-NN algorithm requires that we specify a positive integer value for k. This determines how many existing cases are looked at when predicting a new case. k-NN refers to a family of algorithms that we could denote as 1-NN, 2-NN, 3-NN, and so forth. For example, 4-NN indicates that the algorithm will use the four nearest cases to predict the outcome of a new case. Figure 1 shows how this value is set in Unica Technologyıs Pattern Recognition Workbench (PRW).

As the term nearest implies, k-NN is based on a concept of distance, and this requires a metric to determine distances. All metrics must result in a specific number for comparison purposes. Whatever metric is used is both arbitrary and extremely important. It is arbitrary because there is no preset definition of what constitutes a "good" metric. It is important because the choice of a metric greatly affects the predictions. Different metrics, used on the same training data, can result in completely different predictions. This means that a business expert is needed to help determine a good metric.

To classify a new case, the algorithm computes the distance from the new case to each case (row) in the training data. The new case is predicted to have the same outcome as the predominant outcome in the k closest cases in the training data.

Letıs see how this applies to the data in our credit risk example. For this discussion we will use k=3, or 3-NN. For our distance measure we will use a simple metric that is computed by summing scores for each of the three independent columns, where the score for a column is 0 if the values in the two instances are the same, and 1 if they differ.

We can now compute the distance between Joe and Sue. The three column scores for Joe and Sue are 1, 0, and 0 because they have different values only in the Debt column. The distance between Joe and Sue ı the sum of these scores ı is equal to 1. If two cases have all values in common, their distance is zero.

Table 3 summarizes all the distances between all the records in our training dataset. Note that this matrix is symmetrical; only the numbers below (or above) the diagonal of zeroes are needed. However, presenting it this way makes it easy to see all of the distances for any one case (row) at a glance.

Joe Sue John Mary Fred
Joe 0 1 2 1 2
Sue 1 0 1 2 1
John 2 1 0 3 2
Mary 1 2 3 0 1
Fred 2 1 2 1 0

Table 3. The computed distance measures for each combination of cases.

Now we are ready to apply the k-NN technique to see how it classifies our training data. Remember that we chose k=3, so we are interested in the three neighbors nearest to Joe. The distances in Table 4, column 1 show that Joe, Sue, and Mary are Joeıs three nearest neighbors because they have the lowest distance scores.

Whoa! How can Joe be his own neighbor? Isnıt this somehow invalid, to use the information we have about Joe to predict an outcome for Joe? Well, in one sense it is, but this is absolutely no different than for any other model when it is tested with the training data. This is why we always need to use a separate test dataset to validate a model.

The Risk values for Joeıs three nearest neighbors (Joe himself, Sue, and Mary) are Good, Good, and Poor, respectively. The predicted Risk for Joe is the value that is most frequent among the k neighbors, or Good, in this case.

Who are Sueıs nearest 3 neighbors? Clearly Sue herself, but what about Joe, John and Fred, who are all the same distance (1) from Sue? We could include all three, exclude all three, or include all three with a proportionate vote (2/3 each in this case). The decision is entirely up to the implementers of the algorithm. For our example, weıll use 2/3 vote each, resulting in votes of Good (Sue herself), 2/3 Good, 2/3 Poor, and 2/3 Poor, for a consensus of Good.

Table 4 enumerates the predictions from the 3-NN algorithm. The accuracy for 3-NN on the training set is 80 percent. Results from using 2-NN in this case would be exactly the same, which the reader can verify.

Name Debt Income Married? Risk 3-NN Prediction
Joe High High Yes Good Good
Sue Low High Yes Good Good
John Low High No Poor -
Mary High Low Yes Poor Poor
Fred Low Low Yes Poor Poor

Table 4. Predicted risks based on 3-NN or the three nearest neighbors for each case.

Since both the value of k and the distance metric have tremendous impact on the quality of the model it is important to choose them carefully.

What is a good value for k? The 1-NN algorithm tries to find one case in the training set that most closely matches the case we are trying to predict. When it is found, the predicted value is assumed to be the outcome of the matching case. For 1-NN accuracy on the training set will be 100 percent by definition. However, 1-NN, because it looks for only the one best match, is extremely susceptible to noise and will never reflect any kind of pattern in the data. Many commercial algorithms start with a default k value of 10. The best value for k will require some experimentation with the data at hand, comparing results for different k values against a test dataset.

In the beginning of this section, we pointed out that k-NN does not make a learning pass through the data. Unfortunately, this does not make k-NN an especially efficient algorithm, particularly with large training datasets. Thatıs because all the processing is postponed until predictions are being made. Then each new case being predicted requires a complete pass through the training data to compute the distance between the target instance and each instance in the training set. The algorithm keeps track of the k nearest instances as it proceeds through the training set, predicting an outcome when the pass is complete.

There are a number of issues associated with determining the metric used to compute the distance between instances. For some categorical data, such as the two-valued variables in our example, the very simple 0-1 metric (0 for same value, 1 for different value) that we used might suffice. For categorical data that is ordered (age groups, for example) a distance metric should recognize that the difference between the 21 to 30 age group and the 51 to 60 age group is larger than the difference between the 21 to 30 age group and the 41 to 50 age group. But is the distance between the 21 to 30 age group and the 41 to 50 age group the same as the distance between the 41 to 50 age group and the 61 to 70 age group? Numerically, the "difference" is the same, but if we are interested in modeling consumer behavior, for example, then the distances might be different, and that should be taken into account.

What about distance metrics for continuous data? In our example, all data was categorical., but this is coincidental. The k-NN algorithm works equally well with continuous data. A simple distance metric would use the arithmetic difference between values, but this can lead to problems, as we shall see.

Computing the distances between values in different columns is even more complicated than resolving the distance within a column. You are forced to mix apples and oranges. Consider, for example, a dataset with a categorical debt column (High and Low, as before), a categorical age column (11 to 20 coded as 1, 21 to 30 coded as 2, 31 to 40 coded as 3, and so on) and a continuous income column. If a 0-1 distance metric is used for the debt column, and value differences are used as the distance metrics in the other two columns, then differences in the income column, which will typically be in the 10,000 to 40,000 range will completely overpower differences in the other two columns. Scaling all differences to a 0-1 range will fix the obvious part of the problem. However, no matter what distance metric is used, it will require resolving assertions such as: "Two people with a $12,000 difference in income are just as similar as two people whose ages are in adjacent age groups."

While the nearest neighbor technique is simple in concept, the selection of k and the choice of distance metrics pose definite challenges. The absence of any real training process distinguishes the nearest neighbor technique from the other predictive methods described in this publication. The bottom line is that both k and the distance metrics need to be set, and both will require some experimentation. The model builder will need to build multiple models, validating each with independent test sets to determine which values are appropriate.

Understanding the Output

Naıve-Bayes and nearest neighbor both produce predictive models. As a predictor, Naıve-Bayes uses value frequencies to compute conditional probabilities used to compute predictions for new cases. Nearest neighbor uses the data itself to identify similar cases that are used to make predictions.

When you use Naıve-Bayes in predictive mode, the typical approach is to analyze new cases from a data file and generate an output file that copies the input records and appends a column containing the predicted class for each case. While you can use Naıve-Bayes effectively to predict large numbers of new cases, the multipass processing requirements of k-NN make it more likely to be used on a small number of instances.

Some implementations of Naıve-Bayes, such as Red Brick Systems' Data Mine Builder (which is based on DataCruncher from DataMind Corp.) also include the scores (called Level) for each possible class in the output file. An example of this is shown in Figure 2. This kind of output can be used to evaluate what might be called the "certainty" of a prediction (called Differentiation in Figure 2).

Naıve-Bayes has a descriptive aspect that the k-nearest neighbor model lacks. It is found in the cross-tabulations (counts) and the conditional probabilities derived from them. Most implementations also compute frequencies, and these can be used to review which relationships are weakly or strongly supported. For example, a high conditional probability can have a low frequency, calling into question its reliability because the relationship occurred very infrequently.

Figure 3, which shows the conditional probabilities computed by the Naıve-Bayes algorithm in MineSet from Silicon Graphics, is an example of how the computed probabilities can be used in a descriptive way. From the charts, a user can easily observe that Poor risk is predominant for Low income. The High Income pie (second pie on the Income row) is one-quarter yellow. This is the same information (a three to one ratio in favor of "Good"), as the relative percentages (100 percent and 33 percent) on line 3 of Table 1.

A user can interact with the MineSet chart to explore combined probabilities. Each pie on the left corresponds to a possible attribute value. If multiple pies on the left are selected (by clicking on them) the pie on the right is changed to reflect the relative probability of Poor and Good Risk based on the combination of selected attribute values.

Proceed with Caution

The nearest neighbor technique has a large number of drawbacks when compared to the other methods examined in this publication. These drawbacks include a lack of descriptive output, reliance on an arbitrary distance measure, and the performance implications of having to scan the entire dataset to predict each new case. In particular, reliance on an arbitrary distance measure makes this a subjective technique, and for that reason we would restrict its use to applications that have natural distance measures. Alternatively, it could be used to compare predictions to those that are being generated from a model based on a different technique such as Naıve-Bayes or decision tree.

The Naıve-Bayes method, on the other hand, has a number of qualities to recommend it. It is the most efficient technique with respect to training, and it can make predictions from partial data. Additionally, the descriptive information it provides can be a very important aid to understanding. In fact, researchers have found that some users like the descriptive information provided by Naıve-Bayes better than that provided by decision trees or rule systems. While the biggest theoretical drawback to Naıve-Bayes is the need for independence between the columns, this is usually not a problem in practice.


Figure 1. Unica Technology's Pattern Recognition Workbench (PRW) lets users set a parameter for the number of closest matching cases that will be used to predict the outcome of a new case when using the Nearest Neighbor algorithm.


Figure 2. Red Brickıs DataMine Builder (which is based on DataCruncher from DataMind Corp.) produces a score (called Level) for each possible outcome. This Differentiation column represents the certainty of each prediction.


Figure 3. Silicon Graphicsı MineSet product includes an Evidence Visualizer that graphically displays conditional probabilities from a Naıve-Bayes analysis.



Estelle Brand (estelle@xore.com) and Rob Gerritsen (rob@xore.com) are founders of Exclusive Ore Inc., based in Blue Bell, Pennsylvania, which is a consulting and training company specializing in data mining. During the last two years they have used more than a dozen data mining products. Their database management systems experience dates back to the dark ages. For more information about Exclusive Ore and data mining, see www.xore.com.
What did you think of this article? Send a letter to the editor.


Subscribe to DBMS -- It's free for qualified readers in the United States
Data Mining Solutions Table of Contents | Other Contents | Article Index | Search | Site Index | Home

DBMS (http://www.dbmsmag.com)
Copyright © 1998 Miller Freeman, Inc. ALL RIGHTS RESERVED
Redistribution without permission is prohibited.
Please send questions or comments to dbms@mfi.com
Updated February 27, 1998