Subject: Help! My NN won't learn! What should I do?
The following advice is intended for inexperienced users. Experts may try
more daring methods.
If you are using a multilayer perceptron (MLP):
o Check data for outliers. Transform variables or delete bad cases as
appropriate to the purpose of the analysis.
o Standardize quantitative inputs as described in "Should I standardize the
input variables?" (A code sketch of this step, together with categorical
encoding, appears after this list.)
o Encode categorical inputs as described in "How should categories be
encoded?"
o Make sure you have more training cases than the total number of input
units. The number of training cases required depends on the amount of
noise in the targets and the complexity of the function you are trying to
learn, but as a starting point, it's a good idea to have at least 10
times as many training cases as input units. This may not be enough for
highly complex functions. For classification problems, the number of
cases in the smallest class should be at least several times the number
of input units.
o If the target is:
o quantitative, then it is usually a good idea to standardize the target
variable as described in "Should I standardize the target variables?"
Use an identity (usually called "linear") output activation function.
o binary, then use 0/1 coding and a logistic output activation function.
o categorical with 3 or more categories, then use 1-of-C encoding as
described in "How should categories be encoded?" and use a softmax
output activation function as described in "What is a softmax
activation function?"
o Use a tanh (hyperbolic tangent) activation function for the hidden units.
See "Why use activation functions?" for more information.
o Use a bias term (sometimes called a "threshold") in every hidden and
output unit. See "Why use a bias/threshold?" for an explanation of why
biases are important.
o When the network has hidden units, the results of training may depend
critically on the random initial weights. You can set each initial weight
(including biases) to a random number such as any of the following:
o A uniform random variable between -2 and 2.
o A uniform random variable between -0.2 and 0.2.
o A normal random variable with a mean of 0 and a standard deviation of
1.
o A normal random variable with a mean of 0 and a standard deviation of
0.1.
If any layer in the network has a large number of units, you will need to
adjust the initial weights (not including biases) of the connections from
the large layer to subsequent layers. Generate random initial weights as
described above, but then divide each of these random weights by the
square root of the number of units in the large layer. More sophisticated
methods are described by Bishop (1995). (A code sketch of this fan-in
scaling, combined with a simple batch-backprop loop, appears after this
list.)
Train the network using several (anywhere from 10 to 1000) different sets
of random initial weights. For the operational network, you can either
use the weights that produce the smallest training error, or combine
several trained networks as described in "How to combine networks?"
o If possible, use conventional numerical optimization techniques as
described in "What are conjugate gradients, Levenberg-Marquardt, etc.?"
If those techniques are unavailable in the software you are using, get
better software. If you can't get better software, use RPROP or Quickprop
as described in "What is backprop?" Only as a last resort should you use
standard backprop.
o Use batch training, because there are fewer mistakes that can be made
with batch training than with incremental (sometimes called "online")
training. If you insist on using incremental training, present the
training cases to the network in random order. For more details, see
"What are batch, incremental, on-line, off-line, deterministic,
stochastic, adaptive, instantaneous, pattern, epoch, constructive, and
sequential learning?"
o If you have to use standard backprop, you must set the learning rate by
trial and error. Experiment with different learning rates. If the weights
and errors change very slowly, try higher learning rates. If the weights
fluctuate wildly and the error increases during training, try lower
learning rates. If you follow all the instructions given above, you could
start with a learning rate of .1 for batch training or .01 for
incremental training.
Momentum is not as critical as learning rate, but to be safe, set the
momentum to zero. A larger momentum requires a smaller learning rate.
For more details, see "What learning rate should be used for backprop?"
o Use a separate test set to estimate generalization error. If the test
error is much higher than the training error, the network is probably
overfitting. Read Part 3: Generalization of the FAQ and use one of the
methods described there to improve generalization, such as early
stopping, weight decay, or Bayesian learning.
o Start with one hidden layer.
For a classification problem with many categories, start with one unit in
the hidden layer; otherwise, start with zero hidden units. Train the
network, add one or a few hidden units, retrain the network, and repeat.
When you get overfitting, stop adding hidden units. For more information
on the number of hidden layers and hidden units, see "How many hidden
layers should I use?" and "How many hidden units should I use?" in Part 3
of the FAQ.
If the generalization error is still not satisfactory, you can try:
o adding a second hidden layer
o using an RBF network
o transforming the input variables
o deleting inputs that are not useful
o adding new input variables
o getting more training cases
o etc.
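As a concrete illustration of the standardization and encoding steps above,
here is a minimal sketch in Python with NumPy. The array names (X_raw,
class_labels, and so on) and the toy numbers are hypothetical; the sketch
assumes quantitative inputs in the columns of a matrix and a single
categorical target with three categories.

    import numpy as np

    # Hypothetical raw data: rows are cases, columns are quantitative
    # inputs (e.g., height in cm and weight in kg).
    X_raw = np.array([[170.0, 65.0],
                      [180.0, 80.0],
                      [160.0, 55.0],
                      [175.0, 72.0]])

    # Standardize each input to mean 0 and standard deviation 1, as in
    # "Should I standardize the input variables?"
    X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

    # Hypothetical categorical target with 3 categories: use 1-of-C
    # coding as in "How should categories be encoded?", so each case
    # gets a row with a single 1. With a softmax output activation,
    # these rows are the targets for the three output units.
    class_labels = np.array([0, 2, 1, 0])
    T = np.zeros((len(class_labels), 3))
    T[np.arange(len(class_labels)), class_labels] = 1.0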
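Similarly, here is a minimal sketch, again in Python with NumPy, of the
initialization and batch-training advice above: uniform random initial
weights divided by the square root of the fan-in, tanh hidden units with
biases, an identity output, and standard backprop with a learning rate of
.1 on the whole batch. This only illustrates standard backprop as a last
resort; the conjugate-gradient or RPROP software recommended above is
preferable. The toy data and all names are hypothetical, and the inputs
and targets are assumed to be already standardized.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical standardized inputs (20 cases, 2 inputs) and a
    # standardized quantitative target (1 output).
    X = rng.standard_normal((20, 2))
    T = X[:, :1] * X[:, 1:] + 0.1 * rng.standard_normal((20, 1))

    n_in, n_hid, n_out = 2, 4, 1

    # Uniform random initial weights in [-0.2, 0.2]; the weights (not
    # the biases) are divided by sqrt(number of units feeding them).
    W1 = rng.uniform(-0.2, 0.2, (n_in, n_hid)) / np.sqrt(n_in)
    b1 = rng.uniform(-0.2, 0.2, n_hid)
    W2 = rng.uniform(-0.2, 0.2, (n_hid, n_out)) / np.sqrt(n_hid)
    b2 = rng.uniform(-0.2, 0.2, n_out)

    lr = 0.1  # suggested starting learning rate for batch training

    for epoch in range(1000):
        # Forward pass: tanh hidden units, identity ("linear") output.
        H = np.tanh(X @ W1 + b1)
        Y = H @ W2 + b2
        E = Y - T                           # errors on the whole batch

        # Backward pass: gradients of the mean squared error.
        dY = 2.0 * E / X.shape[0]
        dW2, db2 = H.T @ dY, dY.sum(axis=0)
        dH = (dY @ W2.T) * (1.0 - H ** 2)   # tanh derivative
        dW1, db1 = X.T @ dH, dH.sum(axis=0)

        # One batch gradient-descent step per pass through the data.
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

Rerunning this loop with several different random seeds corresponds to the
advice above about trying several sets of random initial weights.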
If you are writing your own software, the opportunities for mistakes are
limitless. Perhaps the most critical thing for gradient-based algorithms
such as backprop is that you compute the gradient (partial derivatives)
correctly. The usual backpropagation algorithm will give you the partial
derivatives of the objective function with respect to each weight in the
network. You can check these partial derivatives by using finite-difference
approximations (Gill, Murray, and Wright, 1981) as follows (a code sketch
of the complete check appears after step 9):
1. Be sure to standardize the variables as described above.
2. Initialize the weights W as described above. For convenience of
notation, let's arrange all the weights in one long vector so we can use
a single subscript i to refer to different weights W_i. Call the
entire set of values of the initial weights w0. So W is a vector of
variables, and w0 is a vector of values of those variables.
3. Let's use the symbol F(W) to indicate the objective function you are
trying to optimize with respect to the weights. If you are using batch
training, F(W) is computed over the entire training set. If you are
using incremental training, choose any one training case and compute
F(W) for that single training case; use this same training case for all
the following steps.
4. Pick any one weight W_i. Initially, W_i = w0_i.
5. Choose a constant called h with a value anywhere from .0001 to
.00000001.
6. Change the value of W_i from w0_i to w0_i + h. Do not change any
of the other weights. Compute the value of the objective function f1 =
F(W) using this modified value of W_i.
7. Change the value of W_i to w0_i - h. Do not change any of the other
weights. Compute another new value of the objective function f2 =
F(W).
8. The central finite difference approximation to the partial derivative for
W_i is (f1-f2)/(2h). This value should usually be within about
10% of the partial derivative computed by backpropagation, except for
derivatives close to zero. If the finite difference approximation is very
different from the partial derivative computed by backpropagation, try a
different value of h. If no value of h provides close agreement between
the finite difference approximation and the partial derivative computed
by backpropagation, you probably have a bug.
9. Repeat the above computations for each weight W_i for i=1, 2, 3,
... up to the total number of weights.
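The following sketch, in Python with NumPy, carries out steps 2 through 9
for a generic objective function. The names F, backprop_gradient, and w0
are hypothetical placeholders: F(W) is assumed to compute your objective
function for a NumPy weight vector W, and backprop_gradient(W) is assumed
to return the gradient your own backprop code computes at W.

    import numpy as np

    def check_gradient(F, backprop_gradient, w0, h=1e-4):
        # h is the constant from step 5; try values from 1e-4 to 1e-8.
        bp = backprop_gradient(w0)      # derivatives from backprop
        for i in range(len(w0)):        # step 9: loop over all weights
            W = w0.copy()
            W[i] = w0[i] + h            # step 6
            f1 = F(W)
            W[i] = w0[i] - h            # step 7
            f2 = F(W)
            fd = (f1 - f2) / (2.0 * h)  # step 8: central difference
            # Flag disagreements of more than about 10%, with a small
            # absolute slack for derivatives close to zero.
            if abs(fd - bp[i]) > 0.1 * max(abs(fd), abs(bp[i])) + 1e-6:
                print("weight %d: backprop %g vs. finite difference %g"
                      % (i, bp[i], fd))

If many weights are flagged no matter what value of h you choose, you
probably have a bug in your gradient code.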
References:
Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
Oxford University Press.
Gill, P.E., Murray, W., and Wright, M.H. (1981), Practical Optimization,
London: Academic Press.
------------------------------------------------------------------------
Next part is part 3 (of 7). Previous part is part 1.
--
Warren S. Sarle SAS Institute Inc. The opinions expressed here
saswss@unx.sas.com SAS Campus Drive are mine and not necessarily
(919) 677-8000 Cary, NC 27513, USA those of SAS Institute.