Top Document: comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Previous Document: What are combination, activation, error, and
Next Document: What is backprop?
See reader questions & answers on this topic! - Help others by sharing your knowledge
deterministic, stochastic, adaptive, instantaneous, =================================================== pattern, constructive, and sequential learning? ================================================ There are many ways to categorize learning methods. The distinctions are overlapping and can be confusing, and the terminology is used very inconsistently. This answer attempts to impose some order on the chaos, probably in vain. Batch vs. Incremental Learning (also Instantaneous, Pattern, and ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Epoch) ++++++ Batch learning proceeds as follows: Initialize the weights. Repeat the following steps: Process all the training data. Update the weights. Incremental learning proceeds as follows: Initialize the weights. Repeat the following steps: Process one training case. Update the weights. In the above sketches, the exact meaning of "Process" and "Update" depends on the particular training algorithm and can be quite complicated for methods such as Levenberg-Marquardt (see "What are conjugate gradients, Levenberg-Marquardt, etc.?"). Standard backprop (see What is backprop?) is quite simple, though. Batch standard backprop (without momentum) proceeds as follows: Initialize the weights W. Repeat the following steps: Process all the training data DL to compute the gradient of the average error function AQ(DL,W). Update the weights by subtracting the gradient times the learning rate. Incremental standard backprop (without momentum) can be done as follows: Initialize the weights W. Repeat the following steps for j = 1 to NL: Process one training case (y_j,X_j) to compute the gradient of the error (loss) function Q(y_j,X_j,W). Update the weights by subtracting the gradient times the learning rate. Alternatively, the index j can be chosen randomly each type the loop is executed, or j can be chosen from a random permutation. The question of when to stop training is very complicated. Some of the possibilities are: o Stop when the average error function for the training set becomes small. o Stop when the gradient of the average error function for the training set becomes small. o Stop when the average error function for the validation set starts to go up, and use the weights from the step that yielded the smallest validation error. For details, see "What is early stopping?" o Stop when your boredom level is no longer tolerable. It is very important NOT to use the following method--which does not work --but is often mistakenly used by beginners: Initialize the weights W. Repeat the following steps for j = 1 to NL: Repeat the following steps until Q(y_j,X_j,W) is small: Compute the gradient of Q(y_j,X_j,W). Update the weights by subtracting the gradient times the learning rate. The reason this method does not work is that by the time you have finished processing the second case, the network will have forgotten what it learned about the first case, and when you are finished with the third case, the network will have forgotten what it learned about the first two cases, and so on. If you really need to use a method like this, see the section below on "Sequential Learning, Catastrophic Interference, and the Stability-Plasticity Dilemma". The term "batch learning" is used quite consistently in the NN literature, but "incremental learning" is often used for on-line, constructive, or sequential learning. The usage of "batch" and "incremental" in the NN FAQ follows Bertsekas and Tsitsiklis (1996), one of the few references that keeps the relevant concepts straight. "Epoch learning" is synonymous with "batch learning." "Instantaneous learning" and "pattern learning" are usually synonyms for incremental learning. "Instantaneous learning" is a misleading term because people often think it means learning instantaneously. "Pattern learning" is easily confused with "pattern recognition". Hence these terms are not used in the FAQ. There are also intermediate methods, sometimes called mini-batch: Initialize the weights. Repeat the following steps: Process two or more, but not all training cases. Update the weights. Conventional numerical optimization techniques (see "What are conjugate gradients, Levenberg-Marquardt, etc.?") are batch algorithms. Conventional stochastic approximation techniques (see below) are incremental algorithms. For a theoretical discussion comparing batch and incremental learning, see Bertsekas and Tsitsiklis (1996, Chapter 3). For more information on incremental learning, see Saad (1998), but note that the authors in that volume do not reliably distinguish between on-line and incremental learning. On-line vs. Off-line Learning +++++++++++++++++++++++++++++ In off-line learning, all the data are stored and can be accessed repeatedly. Batch learning is always off-line. In on-line learning, each case is discarded after it is processed and the weights are updated. On-line training is always incremental. Incremental learning can be done either on-line or off-line. With off-line learning, you can compute the objective function for any fixed set of weights, so you can see whether you are making progess in training. You can compute a minimum of the objective function to any desired precision. You can use a variety of algorithms for avoiding bad local minima, such as multiple random initializations or global optimization algorithms. You can compute the error function on a validation set and use early stopping or choose from different networks to improve generalization. You can use cross-validation and bootstrapping to estimate generalization error. You can compute prediction and confidence intervals (error bars). With on-line learning you can do none of these things because you cannot compute the objective function on the training set or the error function on the validation set for a fixed set of weights, since these data sets are not stored. Hence, on-line learning is generally more difficult and unreliable than off-line learning. Off-line incremental learning does not have all these problems of on-line learning, which is why it is important to distinguish between the concepts of on-line and incremental learning. Some of the theoretical difficulties of on-line learning are alleviated when the assumptions of stochastic learning hold (see below) and training is assumed to proceed indefinitely. For more information on on-line learning, see Saad (1998), but note that the authors in that volume do not reliably distinguish between on-line and incremental learning. Deterministic, Stochastic, and Adaptive Learning ++++++++++++++++++++++++++++++++++++++++++++++++ Deterministic learning is based on optimization of an objective function that can be recomputed many times and always produces the same value given the same weights. Deterministic learning is always off-line. Stochastic methods are used when computation of the objective function is corrupted by noise. In particular, basic stochastic approximation is a form of on-line gradient descent learning in which the training cases are obtained by a stationary random process: Initialize the weights. Initialize the learning rate. Repeat the following steps: Randomly select one (or possibly more) case(s) from the population. Update the weights by subtracting the gradient times the learning rate. Reduce the learning rate according to an appropriate schedule. In stochastic on-line learning, the noise-corrupted objective function is the error function for any given case, assuming that the case-wise error function has some stationary random distribution. The learning rate must gradually decrease towards zero during training to guarantee convergence of the weights to a local minimum of the noiseless objective function. This gradual reduction of the learning rate is often called "annealing." If the function that the network is trying to learn changes over time, the case-wise error function does not have a stationary random distribution. To allow the network to track changes over time, the learning rate must be kept strictly away from zero. Learning methods that track a changing environment are often called "adaptive" (as in adaptive vector quantization, Gersho and Gray, 1992) or "continuous" rather than "stochastic". There is a trade-off between accuracy and speed of adaptation. Adaptive learning does not converge in a stationary environment. Hence the long-run properties of stochastic learning and adaptive learning are quite different, even though the algorithms may differ only in the sequence of learning rates. The object of adaptive learning is to forget the past when it is no longer relevant. If you want to remember the past in a changing learning environment, then you would be more interested in sequential learning (see below). In stochastic learning with a suitably annealed learning rate, overtraining does not occur because the more you train, the more data you have, and the network converges toward a local optimum of the objective function for the entire population, not a local optimum for a finite training set. Of course, this conclusion does not hold if you train by cycling through a finite training set instead of collecting new data on every step. For a theoretical discussion of stochastic learning, see Bertsekas and Tsitsiklis (1996, Chapter 4). For further references on stochastic approximation, see "What is backprop?" For adaptive filtering, see Haykin (1996). The term "adaptive learning" is sometimes used for gradient methods in which the learning rate is changed during training. Constructive Learning (Growing networks) ++++++++++++++++++++++++++++++++++++++++ Constructive learning adds units or connections to the network during training. Typically, constructive learning begins with a network with no hidden units, which is trained for a while. Then without altering the existing weights, one or more new hidden units are added to the network, training resumes, and so on. Many variations are possible, involving different patterns of connections and schemes for freezing and thawing weights. The most popular constructive algorithm is cascade correlation (Fahlman and Lebiere, 1990), of which many variations are possible (e.g., Littmann and Ritter, 1996; Prechelt, 1997). Various other constructive algorithms are summarized in Smieja (1993), Kwok and Yeung (1997; also other papers at http://info.cs.ust.hk/faculty/dyyeung/paper/cnn.html), and Reed and Marks (1999). For theory, see Baum (1989), Jones (1992), and Meir and Maiorov (1999). Lutz Prechelt has a bibliography on constructive algorithms at http://wwwipd.ira.uka.de/~prechelt/NN/construct.bib Constructive algorithms can be highly effective for escaping bad local minima of the objective function, and are often much faster than algorithms for global optimization such as simulated annealing and genetic algorithms. A well-known example is the two-spirals problem, which is very difficult to learn with standard backprop but relatively easy to learn with cascade correlation (Fahlman and Lebiere, 1990). Some of the Donoho-Johnstone benchmarks (especially "bumps") are almost impossible to learn with standard backprop but can be learned very accurately with constructive algorithms (see ftp://ftp.sas.com/pub/neural/dojo/dojo.html.) Constructive learning is commonly used to train multilayer perceptrons in which the activation functions are step functions. Such networks are difficult to train nonconstructively because the objective function is discontinuous and gradient-descent methods cannot be used. Several clever constructive algorithms (such as Upstart, Tiling, Marchand, etc.) have been devised whereby a multilayer perceptron is constructed by training a sequence of perceptrons, each of which is trained by some standard method such as the well-known perceptron or pocket algorithms. Most constructive algorithms of this kind are designed so that the training error goes to zero when the network gets sufficiently large. Such guarantees do not apply to the generalization error; you should guard against overfitting when you are using constructive algorithms just as with nonconstructive algorithms (see part 3 of the FAQ, especially "What is overfitting and how can I avoid it?") Logically, you would expect "destructive" learning to start with a large network and delete units during training, but I have never seen this term used. The process of deleting units or connections is usually called "pruning" (Reed, 1993; Reed and Marks 1999). The term "selectionism" has also been used as the opposite of "constructivism" in cognitive neuroscience (Quartz and Sejnowski, 1997). Sequential Learning, Catastrophic Interference, and the +++++++++++++++++++++++++++++++++++++++++++++++++++++++ Stability-Plasticity Dilemma ++++++++++++++++++++++++++++ "Sequential learning" sometimes means incremental learning but also refers to a very important problem in neuroscience (e.g., McClelland, McNaughton, and O'Reilly 1995). To reduce confusion, the latter usage should be preferred. Sequential learning in its purest form operates on a sequence of training sets as follows: Initialize the weights. Repeat the following steps: Collect one or more training cases. Train the network on the current training set using any method. Discard the current training set and all other information related to training except the weights. Pure sequential learning differs from on-line learning in that most on-line algorithms require storage of information in addition to the weights, such as the learning rate or aproximations to the objective function or Hessian Matrix. Such additional storage is not allowed in pure sequential learning. Pure sequential learning is important as a model of how learning occurs in real, biological brains. Humans and other animals are often observed to learn new things without forgetting old things. But pure sequential learning tends not to work well in artificial neural networks. With a fixed architecture, distributed (rather than local) representations, and a training algorithm based on minimizing an objective function, sequential learning results in "catastrophic interference", because the minima of the objective function for one training set may be totally diferent than the minima for subsequent training sets. Hence each successive training set may cause the network to forget completely all previous training sets. This problem is also called the "stability-plasticity dilemma." Successful sequential learning usually requires one or more of the following: o Noise-free data. o Constructive architectures. o Local representations. o Storage of extra information besides the weights (this is impure sequential learning and is not biologically plausible unless a a biological mechanism for storing the extra information is provided) The connectionist literature on catastrophic interference seems oblivious to statistical and numerical theory, and much of the research is based on the ludicrous idea of using an autoassociative backprop network to model recognition memory. Some of the problems with this approach are explained by Sharkey and Sharkey (1995). Current research of the PDP variety is reviewed by French (1999). PDP remedies for catastrophic forgetting generally require cheating, i.e., storing information outside the network. For example, pseudorehearsal (Robins, 1995) requires storing the distribution of the inputs, although this fact is often overlooked. It appears that the only way to avoid catastrophic interference in a PDP-style network is to combine two networks modularly: a fast-learning network to memorize data, and a slow-learning network to generalize from data memorized by the fast-learning network (McClelland, McNaughton, and O'Reilly 1995). The PDP literature virtually ignores the ART literature (See "What is ART?"), which provides a localist constructive solution to what the ART people call the "stability-plasticity dilemma." None of this research deals with sequential learning in a statistically sound manner, and many of the methods proposed for sequential learning require noise-free data. The statistical theory of sufficient statistics makes it obvious that efficient sequential learning requires the storage of additional information besides the weights in a standard feedforward network. I know of no references to this subject in the NN literature, but Bishop (1991) provided a mathematically sound treatment of a closely-related problem. References: Baum, E.B. (1989), "A proposal for more powerful learning algorithms," Neural Computation, 1, 201-207. Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8. Bishop, C. (1991), "A fast procedure for retraining the multilayer perceptron," International Journal of Neural Systems, 2, 229-236. Fahlman, S.E., and Lebiere, C. (1990), "The Cascade-Correlation Learning Architecture", in Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems 2,, Los Altos, CA: Morgan Kaufmann Publishers, pp. 524-532, ftp://archive.cis.ohio-state.edu/pub/neuroprose/fahlman.cascor-tr.ps.Z, http://www.rafael-ni.hpg.com.br/arquivos/fahlman-cascor.pdf French, R.M. (1999), "Catastrophic forgetting in connectionist networks: Causes, consequences and solutions," Trends in Cognitive Sciences, 3, 128-135, http://www.fapse.ulg.ac.be/Lab/Trav/rfrench.html#TICS_cat_forget Gersho, A. and Gray, R.M. (1992), Vector Quantization and Signal Compression, Boston: Kluwer Academic Publishers. Haykin, S. (1996), Adaptive Filter Theory, Englewood Cliffs, NJ: Prentice-Hall. Jones, L. (1992), "A simple lemma on greedy approximation in Hilbert space and convergence rate for projection pursuit regression and neural network training," Annals of Statistics, 20, 608-613. Kwok, T.Y. and Yeung, D.Y. (1997), "Constructive algorithms for structure learning in feedforward neural networks for regression problems," IEEE Transactions on Neural Networks, volume 8, 630-645. Littmann, E., and Ritter, H. (1996), "Learning and generalization in cascade network architectures," Neural Computation, 8, 1521-1539. McClelland, J., McNaughton, B. and O'Reilly, R. (1995), "Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory," Psychological Review, 102, 419-457. Meir, R., and Maiorov, V. (1999), "On the optimality of incremental neural network algorithms," in Kerans, M.S., Solla, S.A., amd Cohn, D.A. (eds.), Advances in Neural Information Processing Systems 11, Cambridge,MA: The MIT Press, pp. 295-301. Prechelt, L. (1997), "Investigation of the CasCor Family of Learning Algorithms," Neural Networks, 10, 885-896, http://wwwipd.ira.uka.de/~prechelt/Biblio/#CasCor Quartz, S.R., and Sejnowski, T.J. (1997), "The neural basis of cognitive development: A constructivist manifesto," Behavioral and Brain Sciences, 20, 537-596, ftp://ftp.princeton.edu/pub/harnad/BBS/ Reed, R. (1993), "Pruning algorithms--A survey," IEEE Transactions on Neural Networks, 4, 740-747. Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The MIT Press, ISBN 0-262-18190-8. Robins, A. (1995), "Catastrophic Forgetting, Rehearsal, and Pseudorehearsal," Connection Science, 7, 123-146. Saad, D., ed. (1998), On-Line Learning in Neural Networks, Cambridge: Cambridge University Press. Sharkey, N.E., and Sharkey, A.J.C. (1995), "An analysis of catastrophic interference," Connection Science, 7, 301-329. Smieja, F.J. (1993), "Neural Network Constructive Algorithms: Trading Generalization for Learning Efficiency?" Circuits, Systems and Signal Processing, 12, 331-374, ftp://borneo.gmd.de/pub/as/janus/pre_6.ps
Top Document: comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Previous Document: What are combination, activation, error, and
Next Document: What is backprop?
Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page
Send corrections/additions to the FAQ Maintainer:
email@example.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM