Search the FAQ Archives

3 - A - B - C - D - E - F - G - H - I - J - K - L - M
N - O - P - Q - R - S - T - U - V - W - X - Y - Z - Internet FAQ Archives FAQ, Part 2 of 7: Learning
Section - What are batch, incremental, on-line, off-line,

( Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page )
[ Usenet FAQs | Web FAQs | Documents | RFC Index | Houses ]

Top Document: FAQ, Part 2 of 7: Learning
Previous Document: What are combination, activation, error, and
Next Document: What is backprop?
See reader questions & answers on this topic! - Help others by sharing your knowledge
deterministic, stochastic, adaptive, instantaneous,
pattern, constructive, and sequential learning? 

There are many ways to categorize learning methods. The distinctions are
overlapping and can be confusing, and the terminology is used very
inconsistently. This answer attempts to impose some order on the chaos,
probably in vain. 

Batch vs. Incremental Learning (also Instantaneous, Pattern, and

Batch learning proceeds as follows: 

   Initialize the weights.
   Repeat the following steps:
      Process all the training data.
      Update the weights.

Incremental learning proceeds as follows: 

   Initialize the weights.
   Repeat the following steps:
      Process one training case.
      Update the weights.

In the above sketches, the exact meaning of "Process" and "Update" depends
on the particular training algorithm and can be quite complicated for
methods such as Levenberg-Marquardt (see "What are conjugate gradients,
Levenberg-Marquardt, etc.?"). Standard backprop (see What is backprop?) is
quite simple, though. Batch standard backprop (without momentum) proceeds as

   Initialize the weights W.
   Repeat the following steps:
      Process all the training data DL to compute the gradient
         of the average error function AQ(DL,W).
      Update the weights by subtracting the gradient times the
         learning rate.

Incremental standard backprop (without momentum) can be done as follows: 

   Initialize the weights W.
   Repeat the following steps for j = 1 to NL:
      Process one training case (y_j,X_j) to compute the gradient
         of the error (loss) function Q(y_j,X_j,W).
      Update the weights by subtracting the gradient times the
         learning rate.

Alternatively, the index j can be chosen randomly each type the loop is
executed, or j can be chosen from a random permutation. 

The question of when to stop training is very complicated. Some of the
possibilities are: 

 o Stop when the average error function for the training set becomes small. 
 o Stop when the gradient of the average error function for the training set
   becomes small. 
 o Stop when the average error function for the validation set starts to go
   up, and use the weights from the step that yielded the smallest
   validation error. For details, see "What is early stopping?" 
 o Stop when your boredom level is no longer tolerable. 

It is very important NOT to use the following method--which does not work
--but is often mistakenly used by beginners: 

   Initialize the weights W.
   Repeat the following steps for j = 1 to NL:
      Repeat the following steps until Q(y_j,X_j,W) is small: 
         Compute the gradient of Q(y_j,X_j,W).
         Update the weights by subtracting the gradient times the
            learning rate.

The reason this method does not work is that by the time you have finished
processing the second case, the network will have forgotten what it learned
about the first case, and when you are finished with the third case, the
network will have forgotten what it learned about the first two cases, and
so on. If you really need to use a method like this, see the section below
on "Sequential Learning, Catastrophic Interference, and the
Stability-Plasticity Dilemma". 

The term "batch learning" is used quite consistently in the NN literature,
but "incremental learning" is often used for on-line, constructive, or
sequential learning. The usage of "batch" and "incremental" in the NN FAQ
follows Bertsekas and Tsitsiklis (1996), one of the few references that
keeps the relevant concepts straight. 

"Epoch learning" is synonymous with "batch learning." 

"Instantaneous learning" and "pattern learning" are usually synonyms for
incremental learning. "Instantaneous learning" is a misleading term because
people often think it means learning instantaneously. "Pattern learning" is
easily confused with "pattern recognition". Hence these terms are not used
in the FAQ. 

There are also intermediate methods, sometimes called mini-batch: 

   Initialize the weights.
   Repeat the following steps:
      Process two or more, but not all training cases.
      Update the weights.

Conventional numerical optimization techniques (see "What are conjugate
gradients, Levenberg-Marquardt, etc.?") are batch algorithms. Conventional
stochastic approximation techniques (see below) are incremental algorithms.
For a theoretical discussion comparing batch and incremental learning, see
Bertsekas and Tsitsiklis (1996, Chapter 3). 

For more information on incremental learning, see Saad (1998), but note that
the authors in that volume do not reliably distinguish between on-line and
incremental learning. 

On-line vs. Off-line Learning

In off-line learning, all the data are stored and can be accessed
repeatedly. Batch learning is always off-line. 

In on-line learning, each case is discarded after it is processed and the
weights are updated. On-line training is always incremental. 

Incremental learning can be done either on-line or off-line. 

With off-line learning, you can compute the objective function for any fixed
set of weights, so you can see whether you are making progess in training.
You can compute a minimum of the objective function to any desired
precision. You can use a variety of algorithms for avoiding bad local
minima, such as multiple random initializations or global optimization
algorithms. You can compute the error function on a validation set and use 
early stopping or choose from different networks to improve generalization.
You can use cross-validation and bootstrapping to estimate generalization
error. You can compute prediction and confidence intervals (error bars).
With on-line learning you can do none of these things because you cannot
compute the objective function on the training set or the error function on
the validation set for a fixed set of weights, since these data sets are not
stored. Hence, on-line learning is generally more difficult and unreliable
than off-line learning. Off-line incremental learning does not have all
these problems of on-line learning, which is why it is important to
distinguish between the concepts of on-line and incremental learning. 

Some of the theoretical difficulties of on-line learning are alleviated when
the assumptions of stochastic learning hold (see below) and training is
assumed to proceed indefinitely. 

For more information on on-line learning, see Saad (1998), but note that the
authors in that volume do not reliably distinguish between on-line and
incremental learning. 

Deterministic, Stochastic, and Adaptive Learning

Deterministic learning is based on optimization of an objective function
that can be recomputed many times and always produces the same value given
the same weights. Deterministic learning is always off-line. 

Stochastic methods are used when computation of the objective function is
corrupted by noise. In particular, basic stochastic approximation is a form
of on-line gradient descent learning in which the training cases are
obtained by a stationary random process: 

   Initialize the weights.
   Initialize the learning rate.
   Repeat the following steps:
      Randomly select one (or possibly more) case(s)
         from the population.
      Update the weights by subtracting the gradient
         times the learning rate.
      Reduce the learning rate according to an
         appropriate schedule.

In stochastic on-line learning, the noise-corrupted objective function is
the error function for any given case, assuming that the case-wise error
function has some stationary random distribution. The learning rate must
gradually decrease towards zero during training to guarantee convergence of
the weights to a local minimum of the noiseless objective function. This
gradual reduction of the learning rate is often called "annealing." 

If the function that the network is trying to learn changes over time, the
case-wise error function does not have a stationary random distribution. To
allow the network to track changes over time, the learning rate must be kept
strictly away from zero. Learning methods that track a changing environment
are often called "adaptive" (as in adaptive vector quantization, Gersho and
Gray, 1992) or "continuous" rather than "stochastic". There is a trade-off
between accuracy and speed of adaptation. Adaptive learning does not
converge in a stationary environment. Hence the long-run properties of
stochastic learning and adaptive learning are quite different, even though
the algorithms may differ only in the sequence of learning rates. 

The object of adaptive learning is to forget the past when it is no longer
relevant. If you want to remember the past in a changing learning
environment, then you would be more interested in sequential learning (see

In stochastic learning with a suitably annealed learning rate, overtraining
does not occur because the more you train, the more data you have, and the
network converges toward a local optimum of the objective function for the
entire population, not a local optimum for a finite training set. Of course,
this conclusion does not hold if you train by cycling through a finite
training set instead of collecting new data on every step. 

For a theoretical discussion of stochastic learning, see Bertsekas and
Tsitsiklis (1996, Chapter 4). For further references on stochastic
approximation, see "What is backprop?" For adaptive filtering, see Haykin

The term "adaptive learning" is sometimes used for gradient methods in which
the learning rate is changed during training. 

Constructive Learning (Growing networks)

Constructive learning adds units or connections to the network during
training. Typically, constructive learning begins with a network with no
hidden units, which is trained for a while. Then without altering the
existing weights, one or more new hidden units are added to the network,
training resumes, and so on. Many variations are possible, involving
different patterns of connections and schemes for freezing and thawing
weights. The most popular constructive algorithm is cascade correlation
(Fahlman and Lebiere, 1990), of which many variations are possible (e.g.,
Littmann and Ritter, 1996; Prechelt, 1997). Various other constructive
algorithms are summarized in Smieja (1993), Kwok and Yeung (1997; also other
papers at, and Reed
and Marks (1999). For theory, see Baum (1989), Jones (1992), and Meir and
Maiorov (1999). Lutz Prechelt has a bibliography on constructive algorithms

Constructive algorithms can be highly effective for escaping bad local
minima of the objective function, and are often much faster than algorithms
for global optimization such as simulated annealing and genetic algorithms.
A well-known example is the two-spirals problem, which is very difficult to
learn with standard backprop but relatively easy to learn with cascade
correlation (Fahlman and Lebiere, 1990). Some of the Donoho-Johnstone
benchmarks (especially "bumps") are almost impossible to learn with standard
backprop but can be learned very accurately with constructive algorithms

Constructive learning is commonly used to train multilayer perceptrons in
which the activation functions are step functions. Such networks are
difficult to train nonconstructively because the objective function is
discontinuous and gradient-descent methods cannot be used. Several clever
constructive algorithms (such as Upstart, Tiling, Marchand, etc.) have been
devised whereby a multilayer perceptron is constructed by training a
sequence of perceptrons, each of which is trained by some standard method
such as the well-known perceptron or pocket algorithms. Most constructive
algorithms of this kind are designed so that the training error goes to zero
when the network gets sufficiently large. Such guarantees do not apply to
the generalization error; you should guard against overfitting when you are
using constructive algorithms just as with nonconstructive algorithms (see
part 3 of the FAQ, especially "What is overfitting and how can I avoid it?")

Logically, you would expect "destructive" learning to start with a large
network and delete units during training, but I have never seen this term
used. The process of deleting units or connections is usually called
"pruning" (Reed, 1993; Reed and Marks 1999). The term "selectionism" has
also been used as the opposite of "constructivism" in cognitive neuroscience
(Quartz and Sejnowski, 1997). 

Sequential Learning, Catastrophic Interference, and the
Stability-Plasticity Dilemma

"Sequential learning" sometimes means incremental learning but also refers
to a very important problem in neuroscience (e.g., McClelland, McNaughton,
and O'Reilly 1995). To reduce confusion, the latter usage should be

Sequential learning in its purest form operates on a sequence of training
sets as follows: 

   Initialize the weights.
   Repeat the following steps:
      Collect one or more training cases.
      Train the network on the current training set
         using any method.
      Discard the current training set and all other
         information related to training except the

Pure sequential learning differs from on-line learning in that most on-line
algorithms require storage of information in addition to the weights, such
as the learning rate or aproximations to the objective function or Hessian
Matrix. Such additional storage is not allowed in pure sequential learning. 

Pure sequential learning is important as a model of how learning occurs in
real, biological brains. Humans and other animals are often observed to
learn new things without forgetting old things. But pure sequential learning
tends not to work well in artificial neural networks. With a fixed
architecture, distributed (rather than local) representations, and a
training algorithm based on minimizing an objective function, sequential
learning results in "catastrophic interference", because the minima of the
objective function for one training set may be totally diferent than the
minima for subsequent training sets. Hence each successive training set may
cause the network to forget completely all previous training sets. This
problem is also called the "stability-plasticity dilemma." 

Successful sequential learning usually requires one or more of the

 o Noise-free data. 
 o Constructive architectures. 
 o Local representations. 
 o Storage of extra information besides the weights (this is impure
   sequential learning and is not biologically plausible unless a a
   biological mechanism for storing the extra information is provided) 

The connectionist literature on catastrophic interference seems oblivious to
statistical and numerical theory, and much of the research is based on the
ludicrous idea of using an autoassociative backprop network to model
recognition memory. Some of the problems with this approach are explained by
Sharkey and Sharkey (1995). Current research of the PDP variety is reviewed
by French (1999). PDP remedies for catastrophic forgetting generally require
cheating, i.e., storing information outside the network. For example,
pseudorehearsal (Robins, 1995) requires storing the distribution of the
inputs, although this fact is often overlooked. It appears that the only way
to avoid catastrophic interference in a PDP-style network is to combine two
networks modularly: a fast-learning network to memorize data, and a
slow-learning network to generalize from data memorized by the fast-learning
network (McClelland, McNaughton, and O'Reilly 1995). The PDP literature
virtually ignores the ART literature (See "What is ART?"), which provides a
localist constructive solution to what the ART people call the
"stability-plasticity dilemma." None of this research deals with sequential
learning in a statistically sound manner, and many of the methods proposed
for sequential learning require noise-free data. The statistical theory of
sufficient statistics makes it obvious that efficient sequential learning
requires the storage of additional information besides the weights in a
standard feedforward network. I know of no references to this subject in the
NN literature, but Bishop (1991) provided a mathematically sound treatment
of a closely-related problem. 


   Baum, E.B. (1989), "A proposal for more powerful learning algorithms,"
   Neural Computation, 1, 201-207. 

   Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic
   Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8. 

   Bishop, C. (1991), "A fast procedure for retraining the multilayer
   perceptron," International Journal of Neural Systems, 2, 229-236. 

   Fahlman, S.E., and Lebiere, C. (1990), "The Cascade-Correlation Learning
   Architecture", in Touretzky, D. S. (ed.), Advances in Neural Information
   Processing Systems 2,, Los Altos, CA: Morgan Kaufmann Publishers, pp.

   French, R.M. (1999), "Catastrophic forgetting in connectionist networks:
   Causes, consequences and solutions," Trends in Cognitive Sciences, 3,

   Gersho, A. and Gray, R.M. (1992), Vector Quantization and Signal
   Compression, Boston: Kluwer Academic Publishers. 

   Haykin, S. (1996), Adaptive Filter Theory, Englewood Cliffs, NJ:

   Jones, L. (1992), "A simple lemma on greedy approximation in Hilbert
   space and convergence rate for projection pursuit regression and neural
   network training," Annals of Statistics, 20, 608-613. 

   Kwok, T.Y. and Yeung, D.Y. (1997), "Constructive algorithms for structure
   learning in feedforward neural networks for regression problems," IEEE
   Transactions on Neural Networks, volume 8, 630-645. 

   Littmann, E., and Ritter, H. (1996), "Learning and generalization in
   cascade network architectures," Neural Computation, 8, 1521-1539. 

   McClelland, J., McNaughton, B. and O'Reilly, R. (1995), "Why there are
   complementary learning systems in the hippocampus and neocortex: Insights
   from the successes and failures of connectionist models of learning and
   memory," Psychological Review, 102, 419-457. 

   Meir, R., and Maiorov, V. (1999), "On the optimality of incremental
   neural network algorithms," in Kerans, M.S., Solla, S.A., amd Cohn, D.A.
   (eds.), Advances in Neural Information Processing Systems 11,
   Cambridge,MA: The MIT Press, pp. 295-301. 

   Prechelt, L. (1997), "Investigation of the CasCor Family of Learning
   Algorithms," Neural Networks, 10, 885-896, 

   Quartz, S.R., and Sejnowski, T.J. (1997), "The neural basis of cognitive
   development: A constructivist manifesto," Behavioral and Brain Sciences,
   20, 537-596, 

   Reed, R. (1993), "Pruning algorithms--A survey," IEEE Transactions on
   Neural Networks, 4, 740-747. 

   Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised
   Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The
   MIT Press, ISBN 0-262-18190-8.

   Robins, A. (1995), "Catastrophic Forgetting, Rehearsal, and
   Pseudorehearsal," Connection Science, 7, 123-146. 

   Saad, D., ed. (1998), On-Line Learning in Neural Networks, Cambridge:
   Cambridge University Press. 

   Sharkey, N.E., and Sharkey, A.J.C. (1995), "An analysis of catastrophic
   interference," Connection Science, 7, 301-329. 

   Smieja, F.J. (1993), "Neural Network Constructive Algorithms: Trading
   Generalization for Learning Efficiency?" Circuits, Systems and Signal
   Processing, 12, 331-374, 

User Contributions:

Comment about this article, ask questions, or add new information about this topic:

Top Document: FAQ, Part 2 of 7: Learning
Previous Document: What are combination, activation, error, and
Next Document: What is backprop?

Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page

[ Usenet FAQs | Web FAQs | Documents | RFC Index ]

Send corrections/additions to the FAQ Maintainer: (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM