
Have you ever made an extraordinary purchase on one of your credit cards and been somewhat embarrassed when the charge wasn't authorized or surprised when a credit card representative asked to speak to you? Somehow your transaction was flagged as possibly being fraudulent. Well, it wasn't the person you spoke to who picked your transaction out of the millions per hour that are being processed. It was, more than likely, a neural net.
How did the neural net recognize that your transaction was unusual? By having previously looked at the transactions of millions of other people, including transactions that turned out to be fraudulent, the neural net formed a model that allows it to separate good transactions from bad. Of course, the neural net can only pick transactions that are likely to be fraudulent. That's why a human must get involved to make the final determination. Luckily, if you remembered your mother's maiden name, the transaction was approved and you went home with your purchase.
Neural networks are among the most complicated of the classification and regression algorithms. Although training a neural network can be time consuming, a trained neural network can speedily make predictions for new cases. For example, a trained neural network can detect fraudulent transactions in real time. Neural networks can also be used for other data mining applications such as clustering, and in other applications as well, such as handwriting recognition or robot control.
Despite their broad application, we will restrict our discussion here to neural nets used for classification and regression. The output from a neural network is purely predictive. Because there is no descriptive component to a neural network model, a neural net's choices are hard to understand, and this often discourages its use. In fact, this technique is often referred to as a "black box" technology.
There are many different types of neural networks. In this article we will consider only the most common type used for classification and regression, the so-called "multilayer feed-forward backprop neural network." Don't worry too much about the name. We'll explain the meaning of these terms later on. For now, just recognize that there are many other types, too numerous to mention.
A key difference between neural networks and other techniques that we have examined is that neural nets only operate directly on numbers. As a result, any nonnumeric data in either the independent or dependent (output) columns must be converted to numbers before we can use the data with a neural net.
Neural networks are based on an early model of human brain function. Although it is described as a "network," a neural net is nothing more than a mathematical function that computes an output based on a set of input values. The network paradigm makes it easy to decompose the larger function to a set of related subfunctions, and it enables a variety of learning algorithms that can estimate the parameters of the subfunctions.
Table 1 shows our sample credit risk data (see the sidebar "Predicting Credit Risk") with all two-valued categorical variables converted into values of either 0 or 1. High Debt and Income, Married=Yes, and Good Risk were all replaced by the value 1. Low Debt and Income, Married=No, and Poor Risk were replaced by 0. No conversion is necessary for the Name column because it is not used as an independent or dependent column. The numeric assignments used for conversions are completely arbitrary, and need only be consistent within a column. For example, the substitution for High in the Debt column does not need to be the same as the substitution for High in the Income column. In fact, they could be reversed in the Income column. Nor are the substitutions limited to 0s and 1s. We could have used, say, 110.5 for High and 1392 for Low. In general, if the nonnumeric values are ordered, it is probably a good idea (but, interestingly, not a requirement) to preserve that ordering in the numeric encoding.
| Name | Debt | Income | Married? | Risk |
|---|---|---|---|---|
| Joe | 1 | 1 | 1 | 1 |
| Sue | 0 | 1 | 1 | 1 |
| John | 0 | 1 | 0 | 0 |
| Mary | 1 | 0 | 1 | 0 |
| Fred | 0 | 0 | 1 | 0 |
Table 1. Credit risk data with column values converted to numeric values.
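The conversion described above amounts to a per-column lookup. The following is a minimal sketch, assuming the raw values are the strings High/Low, Yes/No, and Good/Poor named in the text; the names `ENCODINGS` and `encode_row` are illustrative and do not come from any particular product. Only two rows (Joe and Mary) are shown.

```python
# Per-column lookup tables. The numbers are arbitrary; they only need to be
# consistent within a column. (Names here are illustrative assumptions.)
ENCODINGS = {
    "Debt": {"High": 1, "Low": 0},
    "Income": {"High": 1, "Low": 0},
    "Married": {"Yes": 1, "No": 0},
    "Risk": {"Good": 1, "Poor": 0},
}


def encode_row(row, encodings=ENCODINGS):
    """Return a copy of row with each encoded column replaced by its number.

    Columns without an entry in `encodings` (such as Name) pass through as-is.
    """
    return {
        column: encodings[column][value] if column in encodings else value
        for column, value in row.items()
    }


joe = {"Name": "Joe", "Debt": "High", "Income": "High", "Married": "Yes", "Risk": "Good"}
mary = {"Name": "Mary", "Debt": "High", "Income": "Low", "Married": "Yes", "Risk": "Poor"}

for person in (joe, mary):
    print(encode_row(person))
```

Keeping one lookup table per column reflects the point that the same label (High, for instance) may map to different numbers in different columns.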
Before we look at how the neural network training process works, let's look at a trained neural net that can be used to predict Good and Poor risks for our credit risk classification problem.
The neural net that we are going to use for this problem is shown in Figure 1. This network contains six nodes, which we have marked A through F. The yellow nodes (A, B and C) are input nodes and constitute the input layer. The input nodes correspond to the independent variable columns in the credit risk problem (Debt, Income, and Married).
The red node (F) is the output node and makes up the output layer. In this case there is only one output node, corresponding to Risk, the dependent column. But neural network techniques in general do not restrict the number of output columns. That there can be multiple outputs representing multiple simultaneous predictions is one way that neural nets differ from most other predictive techniques.
The two blue nodes (D and E) are the hidden nodes and constitute a single hidden layer. The number of hidden nodes and, for that matter, the number of hidden layers, are set at the user's discretion. The number of hidden nodes often increases with the number of inputs and the complexity of the problem. Too many hidden nodes can lead to overfitting, and too few hidden nodes can result in models with poor accuracy. Finding an appropriate number of hidden nodes is an important part of any data mining effort with neural nets. Various products offer numerous guidelines, but there are no hard and fast rules to apply. Several neural net products include search algorithms that evaluate nets with different numbers of hidden nodes to help find an appropriate number to use. But most products expect the model builder to build several models with different parameters and measure their accuracy on the test dataset.
The number of input, hidden, and output nodes is sometimes referred to as the neural net topology. Others refer to this as the network architecture. The arrangement of nodes in layers, as we have done here, is common but not essential. Other, less common arrangements are pyramid and recursive schemes. The particular net in Figure 1 is also called fully connected because each node in one layer has connections to every node in the next layer. Again, this is typical, but not required.
Figure 1 also shows weights on the arrows between the nodes. Typically, there are no weights on the arrows coming into the input layer or coming out of the output layer. The values of the other weights are determined during the neural net training or learning process.
Note that weights can be both positive and negative. For reasons we won't go into here, neural net algorithms usually restrict weights to a narrow range such as between plus and minus 1 or between plus and minus 10. Weights are typically real numbers with decimals; we have used integers to simplify our calculations.
The heart of the neural net algorithm involves a series of mathematical operations that use the weights to compute a weighted sum of the inputs at each node. In addition, each node also has a squashing function that converts the weighted sum of the inputs to an output value. For our neural net we will use a very simple squashing function: if the weighted sum of the inputs is greater than zero, the output is 1, otherwise the output is 0.
Equations for the output values at nodes D, E and F can now be written as follows:
D = If (A + 2B - C) > 0 Then 1 Else 0
E = If (-2A + 2B - 5C) > 0 Then 1 Else 0
F = If (D - 2E) > 0 Then 1 Else 0
Table 2 shows the sample data, with the three independent variables (Debt, Income, and Married) converted to numbers, the actual risk, and the computed values for nodes D, E, and F. The output value for F (1 on the first row) is the predicted value of Risk for Joe. It equals the actual value, so the neural net made a correct prediction in this case. In fact, the net in Figure 1 makes correct predictions for all five rows in the training set, as shown in Table 2.
| Name | A (Debt) | B (Income) | C (Married) | Risk (actual) | D | E | F (predicted) |
|---|---|---|---|---|---|---|---|
| Joe | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| Sue | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
| John | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| Mary | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| Fred | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
Table 2. Column F shows the computed or predicted risk values for each person. In this neural network, all the predictions (column F) match the actual risk.
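The node equations for D, E, and F can be checked mechanically. The sketch below hard-codes the weights from those equations and the article's simple squashing function, then computes the three node values for each person. The function names `step` and `predict` are illustrative choices.

```python
def step(weighted_sum):
    """The article's squashing function: 1 if the sum is positive, else 0."""
    return 1 if weighted_sum > 0 else 0


def predict(a, b, c):
    """Return (D, E, F) for inputs A=Debt, B=Income, C=Married."""
    d = step(a + 2 * b - c)           # hidden node D
    e = step(-2 * a + 2 * b - 5 * c)  # hidden node E
    f = step(d - 2 * e)               # output node F (predicted Risk)
    return d, e, f


# (name, Debt, Income, Married) for the five people in the sample data
ROWS = [
    ("Joe", 1, 1, 1),
    ("Sue", 0, 1, 1),
    ("John", 0, 1, 0),
    ("Mary", 1, 0, 1),
    ("Fred", 0, 0, 1),
]

for name, a, b, c in ROWS:
    d, e, f = predict(a, b, c)
    print(f"{name:5} D={d} E={e} F={f}")
```

Working through John's row by hand: D = step(0 + 2 - 0) = 1, E = step(0 + 2 - 0) = 1, and F = step(1 - 2) = 0, so the net predicts a Poor risk.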
In most neural nets, the goal of training is to find a set of weights that results in the best possible fit on the training data, without overfitting. Because weights are continuous numeric values, there is an infinite variety of combinations of weight values. Even if we artificially restrict weights to certain values within specific ranges, the number of combinations is staggering. Consider our simple problem. If we restrict each weight to the 21 integer values between plus and minus 10 (including 0), with 8 weights there will be almost 38 billion combinations. Actually, we have simplified our discussion so far by eliminating a weight called the bias that is added to each hidden and output node's input value. Three bias weights plus 8 other weights brings the total number of combinations to about 350 trillion!
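The size of this search space is easy to verify. The short calculation below counts the combinations for 21 possible values per weight, first for the 8 connection weights alone and then with the 3 bias weights added; the variable names are illustrative.

```python
# Integers from -10 to +10, including 0, give 21 possible values per weight.
values_per_weight = len(range(-10, 11))

without_bias = values_per_weight ** 8   # 8 connection weights
with_bias = values_per_weight ** 11     # 8 connection weights + 3 bias weights

print(f"{without_bias:,}")  # 37,822,859,361 (about 38 billion)
print(f"{with_bias:,}")     # 350,277,500,542,221 (about 350 trillion)
```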
Finding the best combination of weights is a significant searching problem that probably cannot be solved by enumeration. A number of search techniques, including genetic algorithms, have been used to search for the best combination of weights. The most common is a class of algorithms called gradient descent. A gradient descent algorithm starts with a solution, usually a set of weights that have been randomly generated. Then a case from the learning set is presented to the net. The net (initially with random weights) is used to compute an output, the output is compared to the desired result, and the difference, called the error, is computed. The weights are then altered slightly so that if the same case were presented again, the error would be less. This gradual reduction in error is the descent.
The most common gradient descent algorithm is called backpropagation (or sometimes just backprop). It uses a complicated mathematical procedure to work backward from the output node, computing in turn each prior node's contribution to the error. From this it is possible to compute not only each node's but also each weight's share of the error. In this way, the error is propagated backward through the entire network, resulting in adjustments to all weights that contributed to the error.
This cycle is repeated for each case in the training set, with small adjustments being made in the weights after each case. When the entire training set has been processed, it is processed again. Each run through the entire training set is called an epoch. It is quite possible that training the net will require several hundred or even several thousand epochs. In this way, even though each case results in only small adjustments to the weights in each epoch, each case is seen in each of several hundred (or several thousand) epochs, resulting in a much larger cumulative effect.
Making small adjustments for each case, but visiting each case many times, is somewhat analogous to a potter sitting at a wheel. The rapidly turning wheel makes it possible to shape the clay by applying only a slight amount of pressure during each revolution.
In a neural net, this pressure is called the learning rate. A typical learning rate value might be 5 percent. This means that after computing the total weight adjustment needed to get a perfect prediction for a case, the actual adjustment made in the next round will not be the total change, but only 5 percent of that amount.
Why not make the full adjustment by setting the learning rate to 100 percent? This would result in a net trained only for the last case that had been presented to it. By making small adjustments, the net retains (in the weights) information from preceding cases as well.
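The training cycle described above (present a case, compute the error, adjust every weight slightly, repeat for many epochs) can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the algorithm of any particular product. The on/off squashing function used earlier has no usable gradient, so the sketch swaps in a sigmoid. It includes the bias weights. The learning rate scales a gradient step, which is the usual formulation and a simplification of the "5 percent of the total adjustment" description. The risk values are those listed in Table 2, and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Smooth squashing function (an assumption; the article's net uses a step)."""
    return 1.0 / (1.0 + math.exp(-x))


# ((Debt, Income, Married), Risk) for the five sample cases.
CASES = [((1, 1, 1), 1), ((0, 1, 1), 1), ((0, 1, 0), 0), ((1, 0, 1), 0), ((0, 0, 1), 0)]


def init_net(rng, n_in=3, n_hidden=2):
    """Random starting weights; the last weight of each node is its bias."""
    hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(net, inputs):
    """Return (hidden node outputs, network output) for one case."""
    hidden, output = net
    h = [sigmoid(sum(w * x for w, x in zip(ws[:-1], inputs)) + ws[-1]) for ws in hidden]
    y = sigmoid(sum(w * v for w, v in zip(output[:-1], h)) + output[-1])
    return h, y


def train_case(net, inputs, target, rate):
    """One backpropagation step for a single case (squared-error loss)."""
    hidden, output = net
    h, y = forward(net, inputs)
    delta_out = (y - target) * y * (1 - y)
    # Each hidden node's share of the error, computed before any weight changes.
    delta_hid = [delta_out * output[j] * h[j] * (1 - h[j]) for j in range(len(h))]
    for j, hj in enumerate(h):
        output[j] -= rate * delta_out * hj
    output[-1] -= rate * delta_out
    for j, ws in enumerate(hidden):
        for i, x in enumerate(inputs):
            ws[i] -= rate * delta_hid[j] * x
        ws[-1] -= rate * delta_hid[j]


def mean_squared_error(net, cases):
    return sum((forward(net, x)[1] - t) ** 2 for x, t in cases) / len(cases)


rng = random.Random(0)  # fixed seed so runs are repeatable
net = init_net(rng)
error_before = mean_squared_error(net, CASES)
for epoch in range(2000):            # each pass through CASES is one epoch
    for inputs, target in CASES:
        train_case(net, inputs, target, rate=0.05)
error_after = mean_squared_error(net, CASES)
print(f"error before: {error_before:.4f}, after: {error_after:.4f}")
```

Each call to `train_case` moves the weights only a little, so no single case dominates; the cumulative effect comes from visiting every case in every epoch.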
Intricacies of neural nets require technical detail beyond the scope of this article. But, if you're interested, see the sidebar on "Neural Net Esoterics" for additional discussion of gradient descent, squashing functions, and other topics.
We have mentioned some of the parameters that a user may need to alter during the course of using a neural net. While there will be some differences from product to product, and while some products do a fairly decent job of setting defaults, a user will need to learn a lot about how they work in order to get the most out of a neural net algorithm. The list below summarizes these parameters as well as some others that we have not yet touched on.
During neural net training, the algorithm passes through the data numerous times. How does the training algorithm decide how many passes to make? Neural net algorithms use a number of different stopping rules to control when training ends. In fact, most products let users select from among several rules and let users set many parameters.
The four most common stopping rules are:
In addition to stopping rules, most products provide visual feedback during training so that an observer can decide to intervene and manually stop the training. Figure 2 provides an example from Unica Technology's Pattern Recognition Workbench (PRW). PRW's chart shows how the overall error and the test error are falling as training progresses. While the test error is fluctuating more rapidly than the average error, in general the spread between the red and blue lines is not increasing, suggesting that overfitting has not yet occurred. (Actually, there was a significant deviation earlier in training, but the error rates have converged since then.)
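The visual check described here, watching whether the spread between the two error curves is growing, can also be expressed as a simple test on the recorded error values. The sketch below is one possible formulation and is not taken from PRW or any other product; the function name `spread_is_widening` and the `window` parameter are hypothetical.

```python
def spread_is_widening(train_errors, test_errors, window=3):
    """Return True if the gap (test error minus training error) grew in each
    of the last `window` epochs, which may hint at overfitting."""
    gaps = [test - train for train, test in zip(train_errors, test_errors)]
    if len(gaps) <= window:
        return False  # not enough history to judge
    recent = gaps[-(window + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Made-up error histories, purely for illustration.
converging = ([0.50, 0.40, 0.30, 0.25], [0.60, 0.45, 0.33, 0.27])
diverging = ([0.50, 0.40, 0.30, 0.25, 0.20, 0.15],
             [0.55, 0.45, 0.36, 0.35, 0.36, 0.38])

print(spread_is_widening(*converging))  # False
print(spread_is_widening(*diverging))   # True
```

Requiring several consecutive increases, rather than a single one, allows for the kind of short-lived fluctuation in the test error that the chart shows.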
Because training can take a very long time (several days is not unheard of on very large datasets), it is important that results from training are usable independent of how training was stopped, including termination from a system failure. In addition, it should be possible to resume training from a previously trained state. This allows the data miner to stop training, evaluate a model, and then resume training to see if further improvement is possible.
In a neural net, the presence of so many controls such as topology, learning rate, and so forth makes training very difficult. Despite the generally good default settings provided with neural net implementations, users are often overwhelmed by the many parameters that must be set and adjusted when training a neural net. To ease this burden, some implementations such as Unica Technology's PRW, HNC's DataBase Mining Marksman, and Integrated Solutions' Clementine, among others, incorporate a built-in search technique that evaluates a number of different neural network topologies. This feature greatly reduces the effort required of the user to find a good model.
Regardless of how your best neural network model is selected (whether by the model builder or a search algorithm), the model needs to be tested before it is used to make predictions. Testing the model with an independent dataset measures its general applicability. Once testing is complete and you are satisfied with the results, you can use the model as a predictor. But don't forget to monitor the model over time to ensure that today's model meets tomorrow's needs.
The output from neural networks varies greatly. We saw earlier that some products provide feedback during training. Other common outputs are accuracy measures (confusion matrix, R-squared, and so forth) for validating the model. Unfortunately, none of these aids the user in understanding the model or the underlying data relationships. To help users better understand the model, some neural net products do a sensitivity analysis, often with an interactive component.
For example, SPSS's Neural Connection has a "What If" feature that displays a Sensitivity chart as shown in Figure 3. By moving the scrollbars along the graphics, the user can explore the impact on output from changes in values of two independent variables.
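A very crude form of sensitivity analysis can be run on the small credit risk net from Figure 1. The sketch below flips one input at a time for each of the five sample rows and counts how often the prediction changes. It is only an illustration of the idea and is far simpler than what interactive tools offer. The weights are taken from the node equations given earlier, and the function names are illustrative.

```python
def step(weighted_sum):
    """The article's squashing function: 1 if the sum is positive, else 0."""
    return 1 if weighted_sum > 0 else 0


def predict(a, b, c):
    """Predicted Risk (node F) for inputs A=Debt, B=Income, C=Married."""
    d = step(a + 2 * b - c)
    e = step(-2 * a + 2 * b - 5 * c)
    return step(d - 2 * e)


INPUT_NAMES = ("Debt", "Income", "Married")
ROWS = [(1, 1, 1), (0, 1, 1), (0, 1, 0), (1, 0, 1), (0, 0, 1)]  # Joe .. Fred


def flip_sensitivity(rows):
    """For each input, count the rows whose prediction changes when only
    that input is flipped between 0 and 1."""
    counts = {}
    for index, name in enumerate(INPUT_NAMES):
        changed = 0
        for row in rows:
            flipped = list(row)
            flipped[index] = 1 - flipped[index]
            if predict(*flipped) != predict(*row):
                changed += 1
        counts[name] = changed
    return counts


print(flip_sensitivity(ROWS))  # {'Debt': 1, 'Income': 4, 'Married': 3}
```

On this sample, flipping Income changes the prediction for four of the five people, while flipping Debt changes it for only one, which suggests that this net's output depends most heavily on Income.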
Some products offer special features with unique forms of output. For example, Figure 4 shows the output from Unica Technology's PRW that is generated during a search of network topologies. The left half of the screen contains error and accuracy measures. The right half shows the parameters of the search, in this case the number of epochs for training, the number of hidden nodes in the hidden layer, and the learning rate. By automating the search, PRW makes it much easier for a user to find a good combination of model parameters. A comparison of trends in the search results might also suggest additional parameter combinations to try out.
Although neural nets can be applied to a number of data mining problems, including classification, regression, and clustering, they are more complicated than some of the other techniques. There are many neural net algorithms, each with numerous parameters that users can set. The complexity, combined with the nondescriptive nature of neural network models, often discourages all but the most technical users from employing this data mining technique.
However, there can be a real payoff from this effort. Unlike the other techniques that have been explored in this issue, a neural net has almost no limitations with respect to the kinds of relationships that it can model. For example, a neural net can easily model a relationship between the dependent variable and a ratio of two of the independent variables. Such a relationship can only be approximated in a stepwise fashion in a model built with a decision tree or Naïve-Bayes. Neural nets also have no problem with trigonometric or logarithmic relationships, but either of these could be a real problem for the other techniques. In many real-life business problems, the approximations used in decision trees or Naïve-Bayes are more than good enough, but if precision is important, then a neural network may be the right way to go.
The only way to know if you can benefit from using a neural net rather than one of the other techniques is through experimentation. Build a neural net model and build a decision tree (or Naïve-Bayes) model. Compare their accuracy. If the neural net accuracy is not significantly better, this will increase your confidence in the decision tree (or Naïve-Bayes) model. On the other hand, if the neural net model is significantly better, this tells you that you need to do more work. Examine cases where the models disagree. Maybe you'll find a way to improve the decision tree by, for example, precomputing some ratios for it. Or maybe the best solution for your problem is the neural net model!

Figure 1. This neural network contains six nodes labeled A through F. The yellow nodes (A, B and C) are input nodes which correspond to the independent variable columns in the credit risk problem (Debt, Income, and Married). The red node (F) is the output node which corresponds to Risk, the dependent column. The numbers on the arrows are weights.
Figure 2. This chart from Unica Technology's Pattern Recognition Workbench (PRW) shows how the overall error and the test error are falling as training of a neural network progresses.
Figure 3. SPSS's Neural Connection has a "What If" feature that displays a Sensitivity chart. By moving the scrollbars along the graphics, the user can explore the impact of changes in values of two independent variables.
Figure 4. Unica Technology's Pattern Recognition Workbench (PRW) can display the results of a search of network topologies. The left half of the screen contains error and accuracy measures. The right half shows the parameters of the search. By automating the search, PRW makes it much easier for a user to find a good combination of model parameters.