FAQ, Part 2 of 7: Learning
Section: Help! My NN won't learn! What should I do?

Previous Document: What does unsupervised learning learn?

The following advice is intended for inexperienced users. Experts may try
more daring methods. 

If you are using a multilayer perceptron (MLP): 

 o Check data for outliers. Transform variables or delete bad cases as
   appropriate to the purpose of the analysis. 

 o Standardize quantitative inputs as described in "Should I standardize the
   input variables?" 
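
   The standardization advice above can be sketched in a few lines of
   NumPy. The data values here are hypothetical; the important points are
   that each input is scaled to mean 0 and standard deviation 1, and that
   the training-set statistics are reused for new cases:

   ```python
   import numpy as np

   # Hypothetical training inputs: 5 cases, 2 quantitative variables.
   X = np.array([[1.0, 200.0],
                 [2.0, 180.0],
                 [3.0, 220.0],
                 [4.0, 210.0],
                 [5.0, 190.0]])

   # Standardize each input to mean 0 and standard deviation 1,
   # using statistics computed from the training set only.
   mean = X.mean(axis=0)
   std = X.std(axis=0)
   X_std = (X - mean) / std

   # New cases must be scaled with the *training* mean and std,
   # not with statistics computed from the new cases themselves.
   x_new = np.array([2.5, 205.0])
   x_new_std = (x_new - mean) / std
   ```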

 o Encode categorical inputs as described in "How should categories be
   encoded?" 
 o Make sure you have more training cases than the total number of input
   units. The number of training cases required depends on the amount of
   noise in the targets and the complexity of the function you are trying to
   learn, but as a starting point, it's a good idea to have at least 10
   times as many training cases as input units. This may not be enough for
   highly complex functions. For classification problems, the number of
   cases in the smallest class should be at least several times the number
   of input units. 

 o If the target is: 
    o quantitative, then it is usually a good idea to standardize the target
      variable as described in "Should I standardize the target variables?"
      Use an identity (usually called "linear") output activation function. 
    o binary, then use 0/1 coding and a logistic output activation function.
    o categorical with 3 or more categories, then use 1-of-C encoding as
      described in "How should categories be encoded?" and use a softmax
      output activation function as described in "What is a softmax
      activation function?" 
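
   The three output activation functions named above can be sketched as
   follows (the input values are arbitrary examples):

   ```python
   import numpy as np

   def identity(z):
       # "Linear" output activation for a standardized quantitative target.
       return z

   def logistic(z):
       # Logistic output activation for a 0/1-coded binary target.
       return 1.0 / (1.0 + np.exp(-z))

   def softmax(z):
       # Softmax output activation for a 1-of-C encoded categorical target;
       # subtracting the max guards against floating-point overflow.
       e = np.exp(z - z.max())
       return e / e.sum()

   def one_of_c(label, C):
       # 1-of-C encoding: category `label` out of C categories.
       t = np.zeros(C)
       t[label] = 1.0
       return t

   probs = softmax(np.array([2.0, 1.0, 0.1]))  # sums to 1 over 3 categories
   ```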

 o Use a tanh (hyperbolic tangent) activation function for the hidden units.
   See "Why use activation functions?" for more information. 

 o Use a bias term (sometimes called a "threshold") in every hidden and
   output unit. See "Why use a bias/threshold?" for an explanation of why
   biases are important. 

 o When the network has hidden units, the results of training may depend
   critically on the random initial weights. You can set each initial weight
   (including biases) to a random number such as any of the following: 
     o A uniform random variable between -2 and 2. 
     o A uniform random variable between -0.2 and 0.2. 
     o A normal random variable with a mean of 0 and a standard deviation
       of 1. 
     o A normal random variable with a mean of 0 and a standard deviation
       of 0.1. 

   If any layer in the network has a large number of units, you will need to
   adjust the initial weights (not including biases) of the connections from
   the large layer to subsequent layers. Generate random initial weights as
   described above, but then divide each of these random weights by the
   square root of the number of units in the large layer. More sophisticated
   methods are described by Bishop (1995). 

   Train the network using several (anywhere from 10 to 1000) different sets
   of random initial weights. For the operational network, you can either
   use the weights that produce the smallest training error, or combine
   several trained networks as described in "How to combine networks?" 
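
   The initialization scheme above (small random weights, scaled down by
   the square root of the size of a large preceding layer, with unscaled
   biases) can be sketched as follows; the layer sizes are hypothetical:

   ```python
   import numpy as np

   rng = np.random.default_rng(0)   # seeded so each restart is reproducible

   n_in, n_hidden = 100, 10         # hypothetical layer sizes; n_in is "large"

   # Uniform random weights between -2 and 2, divided by sqrt(n_in)
   # because the preceding layer has a large number of units.
   W = rng.uniform(-2.0, 2.0, size=(n_in, n_hidden)) / np.sqrt(n_in)

   # Biases are drawn the same way but are NOT rescaled.
   b = rng.uniform(-2.0, 2.0, size=n_hidden)
   ```

   To train from several different sets of random initial weights, repeat
   this with different seeds and keep the run with the smallest training
   error.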

 o If possible, use conventional numerical optimization techniques as
   described in "What are conjugate gradients, Levenberg-Marquardt, etc.?"
   If those techniques are unavailable in the software you are using, get
   better software. If you can't get better software, use RPROP or Quickprop
   as described in "What is backprop?" Only as a last resort should you use
   standard backprop. 

 o Use batch training, because there are fewer mistakes that can be made
   with batch training than with incremental (sometimes called "online")
   training. If you insist on using incremental training, present the
   training cases to the network in random order. For more details, see 
   "What are batch, incremental, on-line, off-line, deterministic,
   stochastic, adaptive, instantaneous, pattern, epoch, constructive, and
   sequential learning?" 

 o If you have to use standard backprop, you must set the learning rate by
   trial and error. Experiment with different learning rates. If the weights
   and errors change very slowly, try higher learning rates. If the weights
   fluctuate wildly and the error increases during training, try lower
   learning rates. If you follow all the instructions given above, you could
   start with a learning rate of .1 for batch training or .01 for
   incremental training. 

   Momentum is not as critical as learning rate, but to be safe, set the
   momentum to zero. A larger momentum requires a smaller learning rate. 

   For more details, see "What learning rate should be used for backprop?" 
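
   As a concrete illustration of standard batch backprop with a learning
   rate of .1 and zero momentum, here is a minimal sketch for a single
   linear output unit (the data are synthetic; a real MLP would also
   backpropagate through hidden layers):

   ```python
   import numpy as np

   rng = np.random.default_rng(1)

   # Hypothetical standardized inputs and a noise-free quantitative target.
   X = rng.normal(size=(50, 3))
   true_w = np.array([0.5, -1.0, 2.0])
   y = X @ true_w

   w = rng.uniform(-0.2, 0.2, size=3)   # small random initial weights
   lr = 0.1                             # batch learning rate suggested above

   for epoch in range(500):
       err = X @ w - y                  # batch error over all training cases
       grad = X.T @ err / len(X)        # gradient of mean squared error / 2
       w -= lr * grad                   # standard batch backprop update

   final_mse = float(np.mean((X @ w - y) ** 2))
   ```

   If the error instead increased during these epochs, the advice above
   would apply: lower the learning rate and try again.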

 o Use a separate test set to estimate generalization error. If the test
   error is much higher than the training error, the network is probably
   overfitting. Read Part 3: Generalization of the FAQ and use one of the
   methods described there to improve generalization, such as early
   stopping, weight decay, or Bayesian learning. 

 o Start with one hidden layer. 

   For a classification problem with many categories, start with one unit in
   the hidden layer; otherwise, start with zero hidden units. Train the
   network, add one or a few hidden units, retrain the network, and repeat.
   When you get overfitting, stop adding hidden units. For more information
   on the number of hidden layers and hidden units, see "How many hidden
   layers should I use?" and "How many hidden units should I use?" in Part 3
   of the FAQ. 

   If the generalization error is still not satisfactory, you can try: 
    o adding a second hidden layer 
    o using an RBF network 
    o transforming the input variables 
    o deleting inputs that are not useful 
    o adding new input variables 
    o getting more training cases 
    o etc. 

If you are writing your own software, the opportunities for mistakes are
limitless. Perhaps the most critical thing for gradient-based algorithms
such as backprop is that you compute the gradient (partial derivatives)
correctly. The usual backpropagation algorithm will give you the partial
derivatives of the objective function with respect to each weight in the
network. You can check these partial derivatives by using finite-difference
approximations (Gill, Murray, and Wright, 1981) as follows: 

1. Be sure to standardize the variables as described above. 

2. Initialize the weights W as described above. For convenience of
   notation, let's arrange all the weights in one long vector so we can use
   a single subscript i to refer to different weights W_i. Call the
   entire set of values of the initial weights w0. So W is a vector of
   variables, and w0 is a vector of values of those variables. 

3. Let's use the symbol F(W) to indicate the objective function you are
   trying to optimize with respect to the weights. If you are using batch
   training, F(W) is computed over the entire training set. If you are
   using incremental training, choose any one training case and compute 
   F(W) for that single training case; use this same training case for all
   the following steps. 

4. Pick any one weight W_i. Initially, W_i = w0_i. 

5. Choose a small constant called h with a value anywhere from .0001 to
   .001. 

6. Change the value of W_i from w0_i to w0_i + h. Do not change any
   of the other weights. Compute the value of the objective function f1 =
   F(W) using this modified value of W_i. 

7. Change the value of W_i to w0_i - h. Do not change any of the other
   weights. Compute another new value of the objective function f2 =
   F(W) using this modified value of W_i. 
8. The central finite difference approximation to the partial derivative for
   W_i is (f1-f2)/(2h). This value should usually be within about
   10% of the partial derivative computed by backpropagation, except for
   derivatives close to zero. If the finite difference approximation is very
   different from the partial derivative computed by backpropagation, try a
   different value of h. If no value of h provides close agreement between
   the finite difference approximation and the partial derivative computed
   by backpropagation, you probably have a bug. 

9. Repeat the above computations for each weight W_i for i=1, 2, 3,
   ... up to the total number of weights. 
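
   Steps 4 through 9 can be sketched as a short gradient-checking routine.
   The tiny sum-of-squares objective below stands in for your network's
   objective function F(W), and its hand-computed gradient stands in for
   the backpropagated one:

   ```python
   import numpy as np

   def numeric_gradient(F, w0, h=1e-4):
       """Central finite-difference approximation to the gradient of F
       at w0, perturbing one weight at a time (steps 4-9 above)."""
       grad = np.zeros_like(w0)
       for i in range(len(w0)):
           w_plus, w_minus = w0.copy(), w0.copy()
           w_plus[i] += h                                 # step 6: w0_i + h
           w_minus[i] -= h                                # step 7: w0_i - h
           grad[i] = (F(w_plus) - F(w_minus)) / (2 * h)   # step 8
       return grad

   # Hypothetical objective: sum-of-squares error of a tiny linear model.
   X = np.array([[1.0, 2.0], [3.0, 4.0]])
   y = np.array([1.0, 0.0])
   F = lambda w: 0.5 * np.sum((X @ w - y) ** 2)

   w0 = np.array([0.1, -0.2])
   analytic = X.T @ (X @ w0 - y)     # gradient derived by hand, playing the
                                     # role of the backpropagated gradient
   numeric = numeric_gradient(F, w0)
   ```

   If `analytic` and `numeric` disagreed by much more than the tolerance
   discussed in step 8 for every reasonable h, that would indicate a bug
   in the gradient code.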


References: 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Gill, P.E., Murray, W., and Wright, M.H. (1981), Practical Optimization,
   London: Academic Press. 


Next part is part 3 (of 7). Previous part is part 1. 


Warren S. Sarle       SAS Institute Inc.    The opinions expressed here
(919) 677-8000        SAS Campus Drive      are mine and not necessarily
                      Cary, NC 27513, USA   those of SAS Institute.

Send corrections/additions to the FAQ Maintainer: (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM