
Finding the weights to use in a neural network is a critical part of the process. A popular technique for finding weights is known as a gradient descent. Except for the direction (down versus up), a gradient descent algorithm is just like a hill-climbing algorithm. Hill climbing algorithms are prone to finding a local optimum. A gradient descent algorithm can similarly be stymied by local optima. Gradient descent behaves like an ant looking for the deepest valley. Wherever the ant finds itself, it looks around its immediate vicinity to see which way goes down the fastest, and then heads off in that direction. After moving a short distance, it again looks around and heads off in what now appears to be the steepest way down. At some point it will find itself at the bottom of a valley and stop searching. However, it is quite possible that had the ant not always taken the steepest route down, that it would have found its way across a saddle to an even deeper valley just beyond. Getting stuck in a valley that is not the deepest is called finding a local optimum.
Neural net algorithms that employ a gradient descent algorithm usually employ some heuristics to cope with the local optimum problem. Included in these heuristics is making small random perturbations to the weights or occasionally taking a big step (i.e., using a larger learning rate for one epoch). Another way to deal with local optima is to run several training sessions, each starting with a different set of random weights. The hope is that by setting our ant down at different starting points, one of his searches will lead him to the deepest valley.
The mathematically astute reader might have noticed that the squashing function that we used in our example couldnęt be differentiated because it is not continuous (at 0). For this reason, a back propagation algorithm will not use a step function like we did, but instead use a continuous function. One of the most commonly used is the sigmoid function. The name, sigmoid, derives from the fact that the function is "S" shaped. Statisticians call this function the logistic function, perhaps because it is a function that approximates a flip-flop and can therefore be used to model the true/false used in logic.
Using I for input, O for output and with k as an arbitrary constant, the sigmoid function is:
As the constant k gets larger, this function approaches a step function. This can be seen in Figure 1, which plots the function for two values of k, 1 and 10. Note that no matter how large k becomes, this function remains continuous and differentiable. Many neural net implementations let the user set the value for k. Some may even permit different values for k in the hidden and output layers, for example.
Although larger values of k result in a function that more closely approaches a step function, this does not mean those larger values of k are preferable. In fact, neural net training frequently benefits from smaller values of k, probably because small changes in input are less likely to result in big changes in output. This also fits better with the "small step" approach used in neural net training.
The sigmoid function is not the only squashing function that is used in neural net products. Another commonly used s-shaped function is tanh (the hyper tangent).
The perceptron, the predecessor to the neural net, reached a dead end because it employed a 0-1 step function as a squashing function. The step function was used because it most closely modeled the neuron in the nervous system, which is thought to fire when its input crosses a threshold. Because the 0-1 step function is not continuous and therefore not differentiable, it was not possible to compute the contribution to error at any layer except the output layer and this made it impossible to train multi-layer networks. As a result, perceptrons could not be trained to model certain kinds of inter-variable relationships like the exclusive or. Unlike the regular or operator, which is true if either of its two components is true, the exclusive or is true only when exactly one of its two components is true. Once this limitation to perceptrons was publicized, work on perceptrons ceased for almost 15 years, until the idea to use a differentiable function like the sigmoid was advanced.
Figure 1. Two examples of the sigmoid function, with k=1 and k=10.
Estelle Brand (estelle@xore.com) and Rob Gerritsen (rob@xore.com) are founders of Exclusive Ore Inc., based in Bluebell, Pennsylvania, which is a consulting and training company specializing in data mining. During the last two years they have used more than a dozen data mining products. Their database management systems experience dates back to the dark ages. For more information about Exclusive Ore and data mining, see www.xore.com.