Top Document: comp.ai.neuralnets FAQ, Part 2 of 7: Learning
Previous Document: What does unsupervised learning learn?

The following advice is intended for inexperienced users. Experts may try more daring methods.

If you are using a multilayer perceptron (MLP):

 o Check data for outliers. Transform variables or delete bad cases as appropriate to the purpose of the analysis.
 o Standardize quantitative inputs as described in "Should I standardize the input variables?"
 o Encode categorical inputs as described in "How should categories be encoded?"
 o Make sure you have more training cases than the total number of input units. The number of training cases required depends on the amount of noise in the targets and the complexity of the function you are trying to learn, but as a starting point, it's a good idea to have at least 10 times as many training cases as input units. This may not be enough for highly complex functions. For classification problems, the number of cases in the smallest class should be at least several times the number of input units.
 o If the target is:
    o quantitative, then it is usually a good idea to standardize the target variable as described in "Should I standardize the target variables?" Use an identity (usually called "linear") output activation function.
    o binary, then use 0/1 coding and a logistic output activation function.
    o categorical with 3 or more categories, then use 1-of-C encoding as described in "How should categories be encoded?" and use a softmax output activation function as described in "What is a softmax activation function?"
 o Use a tanh (hyperbolic tangent) activation function for the hidden units. See "Why use activation functions?" for more information.
 o Use a bias term (sometimes called a "threshold") in every hidden and output unit. See "Why use a bias/threshold?" for an explanation of why biases are important.
 o When the network has hidden units, the results of training may depend critically on the random initial weights. You can set each initial weight (including biases) to a random number such as any of the following:
    o A uniform random variable between -2 and 2.
    o A uniform random variable between -0.2 and 0.2.
    o A normal random variable with a mean of 0 and a standard deviation of 1.
    o A normal random variable with a mean of 0 and a standard deviation of 0.1.
   If any layer in the network has a large number of units, you will need to adjust the initial weights (not including biases) of the connections from the large layer to subsequent layers. Generate random initial weights as described above, but then divide each of these random weights by the square root of the number of units in the large layer. More sophisticated methods are described by Bishop (1995).
   Train the network using several (anywhere from 10 to 1000) different sets of random initial weights. For the operational network, you can either use the weights that produce the smallest training error, or combine several trained networks as described in "How to combine networks?"
 o If possible, use conventional numerical optimization techniques as described in "What are conjugate gradients, Levenberg-Marquardt, etc.?" If those techniques are unavailable in the software you are using, get better software. If you can't get better software, use RPROP or Quickprop as described in "What is backprop?" Only as a last resort should you use standard backprop.
 o Use batch training, because there are fewer mistakes that can be made with batch training than with incremental (sometimes called "online") training. If you insist on using incremental training, present the training cases to the network in random order. For more details, see "What are batch, incremental, online, offline, deterministic, stochastic, adaptive, instantaneous, pattern, epoch, constructive, and sequential learning?"
 o If you have to use standard backprop, you must set the learning rate by trial and error. Experiment with different learning rates. If the weights and errors change very slowly, try higher learning rates. If the weights fluctuate wildly and the error increases during training, try lower learning rates. If you follow all the instructions given above, you could start with a learning rate of 0.1 for batch training or 0.01 for incremental training. Momentum is not as critical as learning rate, but to be safe, set the momentum to zero. A larger momentum requires a smaller learning rate. For more details, see "What learning rate should be used for backprop?"
 o Use a separate test set to estimate generalization error. If the test error is much higher than the training error, the network is probably overfitting. Read Part 3: Generalization of the FAQ and use one of the methods described there to improve generalization, such as early stopping, weight decay, or Bayesian learning.
 o Start with one hidden layer. For a classification problem with many categories, start with one unit in the hidden layer; otherwise, start with zero hidden units. Train the network, add one or a few hidden units, retrain the network, and repeat. When you get overfitting, stop adding hidden units. For more information on the number of hidden layers and hidden units, see "How many hidden layers should I use?" and "How many hidden units should I use?" in Part 3 of the FAQ.

If the generalization error is still not satisfactory, you can try:

 o adding a second hidden layer
 o using an RBF network
 o transforming the input variables
 o deleting inputs that are not useful
 o adding new input variables
 o getting more training cases
 o etc.

If you are writing your own software, the opportunities for mistakes are limitless. Perhaps the most critical thing for gradient-based algorithms such as backprop is that you compute the gradient (partial derivatives) correctly.
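
As a point of reference, here is a minimal sketch of a network set up along the lines recommended above: one hidden layer of tanh units, a bias in every hidden and output unit, an identity output, and small uniform initial weights. It is plain Python and is not taken from the FAQ. The function names (init_weights, forward, loss_and_gradient), the sum-of-squares batch objective with a factor of 1/2, the unconditional scaling of the hidden-to-output weights by the square root of the number of hidden units, and the sample training cases are all assumptions made for illustration.

```python
import math
import random


def init_weights(n_in, n_hidden, seed=0):
    """Uniform(-0.2, 0.2) initial weights for a one-hidden-layer network.

    Assumption for illustration: the hidden-to-output weights (not the
    output bias) are always divided by sqrt(n_hidden); the advice above
    only calls for this when the layer is large.
    """
    rng = random.Random(seed)
    # Each hidden unit: n_in input weights followed by one bias.
    hidden = [[rng.uniform(-0.2, 0.2) for _ in range(n_in + 1)]
              for _ in range(n_hidden)]
    # Output unit: n_hidden scaled weights followed by one unscaled bias.
    out = [rng.uniform(-0.2, 0.2) / math.sqrt(n_hidden)
           for _ in range(n_hidden)]
    out.append(rng.uniform(-0.2, 0.2))
    return hidden, out


def forward(hidden, out, x):
    """Return (hidden activations, network output) for one input vector."""
    h = [math.tanh(sum(w * xi for w, xi in zip(ws[:-1], x)) + ws[-1])
         for ws in hidden]
    y = sum(w * hj for w, hj in zip(out[:-1], h)) + out[-1]  # identity output
    return h, y


def loss_and_gradient(hidden, out, cases):
    """Batch objective 0.5 * sum((y - t)^2) and its backprop gradient."""
    g_hidden = [[0.0] * len(ws) for ws in hidden]
    g_out = [0.0] * len(out)
    loss = 0.0
    for x, t in cases:
        h, y = forward(hidden, out, x)
        err = y - t
        loss += 0.5 * err * err
        for j, hj in enumerate(h):
            g_out[j] += err * hj
            # Chain rule through tanh: d tanh(a)/da = 1 - tanh(a)^2.
            delta = err * out[j] * (1.0 - hj * hj)
            for i, xi in enumerate(x):
                g_hidden[j][i] += delta * xi
            g_hidden[j][-1] += delta  # hidden bias
        g_out[-1] += err  # output bias
    return loss, g_hidden, g_out


# Made-up, already standardized (inputs, target) pairs.
cases = [([0.5, -1.0], 0.3), ([-0.2, 0.7], -0.1), ([1.0, 0.4], 0.8)]
hidden, out = init_weights(n_in=2, n_hidden=3, seed=1)
loss, g_hidden, g_out = loss_and_gradient(hidden, out, cases)
print("initial batch loss:", loss)
```

The gradient comes back in the same nested shape as the weights, so a batch update subtracts the learning rate times each gradient entry from the corresponding weight.
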
The usual backpropagation algorithm will give you the partial derivatives of the objective function with respect to each weight in the network. You can check these partial derivatives by using finite-difference approximations (Gill, Murray, and Wright, 1981) as follows:

 1. Be sure to standardize the variables as described above.
 2. Initialize the weights W as described above. For convenience of notation, let's arrange all the weights in one long vector so we can use a single subscript i to refer to different weights W_i. Call the entire set of values of the initial weights w0. So W is a vector of variables, and w0 is a vector of values of those variables.
 3. Let's use the symbol F(W) to indicate the objective function you are trying to optimize with respect to the weights. If you are using batch training, F(W) is computed over the entire training set. If you are using incremental training, choose any one training case and compute F(W) for that single training case; use this same training case for all the following steps.
 4. Pick any one weight W_i. Initially, W_i = w0_i.
 5. Choose a constant called h with a value anywhere from 0.0001 to 0.00000001.
 6. Change the value of W_i from w0_i to w0_i + h. Do not change any of the other weights. Compute the value of the objective function f1 = F(W) using this modified value of W_i.
 7. Change the value of W_i to w0_i - h. Do not change any of the other weights. Compute another new value of the objective function f2 = F(W).
 8. The central finite-difference approximation to the partial derivative for W_i is (f1 - f2)/(2h). This value should usually be within about 10% of the partial derivative computed by backpropagation, except for derivatives close to zero. If the finite-difference approximation is very different from the partial derivative computed by backpropagation, try a different value of h.
    If no value of h provides close agreement between the finite-difference approximation and the partial derivative computed by backpropagation, you probably have a bug.
 9. Repeat the above computations for each weight W_i for i=1, 2, 3, ... up to the total number of weights.

References:

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.

   Gill, P.E., Murray, W. and Wright, M.H. (1981), Practical Optimization, London: Academic Press.

Next part is part 3 (of 7). Previous part is part 1.

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.

Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
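
The nine-step finite-difference check described above can be sketched in plain Python roughly as follows. This is an illustration under stated assumptions, not code from the FAQ: the helper name finite_difference_check, the toy objective F (a single tanh unit with a bias and a sum-of-squares error), its hand-derived gradient grad_F, and the made-up training cases are all hypothetical. Here f1 is the objective at w0_i + h and f2 the objective at w0_i - h, so the central difference is (f1 - f2)/(2h).

```python
import math

# Made-up, already standardized (inputs, target) pairs for a toy problem.
CASES = [([0.5, -1.0], 0.3), ([-0.2, 0.7], -0.1)]


def F(w):
    """Batch objective for one tanh unit: w = [w_1, w_2, bias]."""
    total = 0.0
    for x, t in CASES:
        y = math.tanh(w[0] * x[0] + w[1] * x[1] + w[2])
        total += 0.5 * (y - t) ** 2
    return total


def grad_F(w):
    """Hand-derived partial derivatives of F, standing in for backprop."""
    g = [0.0, 0.0, 0.0]
    for x, t in CASES:
        y = math.tanh(w[0] * x[0] + w[1] * x[1] + w[2])
        delta = (y - t) * (1.0 - y * y)
        g[0] += delta * x[0]
        g[1] += delta * x[1]
        g[2] += delta
    return g


def finite_difference_check(objective, gradient, w0, h=1e-5):
    """Return (i, analytic, numeric) for every weight, changing one at a time."""
    analytic = gradient(w0)
    report = []
    for i in range(len(w0)):
        w = list(w0)              # all other weights stay at their w0 values
        w[i] = w0[i] + h
        f1 = objective(w)
        w[i] = w0[i] - h
        f2 = objective(w)
        numeric = (f1 - f2) / (2.0 * h)
        report.append((i, analytic[i], numeric))
    return report


w0 = [0.5, -0.3, 0.2]
for i, analytic, numeric in finite_difference_check(F, grad_F, w0):
    # Relative difference; the guard avoids dividing by a near-zero derivative.
    scale = max(abs(analytic), abs(numeric), 1e-12)
    print(i, analytic, numeric, abs(analytic - numeric) / scale)
```

If the relative difference stays large for every h you try, the gradient code is the first place to look; a sign error or a missing tanh derivative term typically shows up on the first weight checked.
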
Last Update March 27 2014 @ 02:11 PM
