comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Section - What is backprop?



Top Document: comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Previous Document: What are batch, incremental, on-line, off-line,
Next Document: What learning rate should be used for backprop?

"Backprop" is short for "backpropagation of error". The term 
backpropagation causes much confusion. Strictly speaking, backpropagation
refers to the method for computing the gradient of the case-wise error
function with respect to the weights for a feedforward network, a
straightforward but elegant application of the chain rule of elementary
calculus (Werbos 1974/1994). By extension, backpropagation or backprop
refers to a training method that uses backpropagation to compute the
gradient. By further extension, a backprop network is a feedforward network
trained by backpropagation. 
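
To make the chain-rule computation concrete, here is a minimal sketch (my own illustration, not part of the FAQ; the tiny one-input, one-hidden-unit network and all variable names are invented for the example). The "backward" pass is nothing more than the chain rule applied layer by layer, which a finite-difference check confirms:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_gradient(x, t, w1, w2):
    # Forward pass: h = sigmoid(w1*x), y = w2*h (linear output).
    h = sigmoid(w1 * x)
    y = w2 * h
    # Case-wise squared error E = 0.5*(y - t)^2; the backward pass
    # is the chain rule applied layer by layer.
    dE_dy = y - t
    dE_dw2 = dE_dy * h                   # dE/dw2 = dE/dy * dy/dw2
    dE_dh = dE_dy * w2                   # error "backpropagated" to h
    dE_dw1 = dE_dh * h * (1.0 - h) * x   # sigmoid'(z) = h*(1 - h)
    return dE_dw1, dE_dw2

# Sanity check against central finite differences.
x, t, w1, w2 = 0.5, 1.0, 0.3, -0.7
g1, g2 = backprop_gradient(x, t, w1, w2)

def error(w1, w2):
    y = w2 * sigmoid(w1 * x)
    return 0.5 * (y - t) ** 2

eps = 1e-6
fd1 = (error(w1 + eps, w2) - error(w1 - eps, w2)) / (2 * eps)
fd2 = (error(w1, w2 + eps) - error(w1, w2 - eps)) / (2 * eps)
```

The same pattern extends to any number of layers: each layer's error signal is the next layer's error signal times the local derivatives.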

"Standard backprop" is a euphemism for the generalized delta rule, the
training algorithm that was popularized by Rumelhart, Hinton, and Williams
in chapter 8 of Rumelhart and McClelland (1986), which remains the most
widely used supervised training method for neural nets. The generalized
delta rule (including momentum) is called the "heavy ball method" in the
numerical analysis literature (Polyak 1964, 1987; Bertsekas 1995, 78-79). 
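
The update rule itself is simple; as a sketch (my own one-dimensional example, not from the FAQ), each weight change is the negative gradient step plus a momentum fraction of the previous weight change, which is exactly the "heavy ball" recurrence:

```python
def heavy_ball(grad, w, lr=0.1, momentum=0.9, steps=500):
    # Generalized delta rule: delta_w(t) = -lr * dE/dw + momentum * delta_w(t-1).
    dw = 0.0
    for _ in range(steps):
        dw = -lr * grad(w) + momentum * dw
        w += dw
    return w

# Minimize E(w) = (w - 3)^2, whose gradient is 2*(w - 3).
w_star = heavy_ball(lambda w: 2.0 * (w - 3.0), w=0.0)
```

The momentum term lets the iterate "roll through" shallow curvature like a ball with inertia, which is where the numerical-analysis name comes from.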

Standard backprop can be used for both batch training (in which the weights
are updated after processing the entire training set) and incremental
training (in which the weights are updated after processing each case). For
batch training, standard backprop usually converges (eventually) to a local
minimum, if one exists. For incremental training, standard backprop does not
converge to a stationary point of the error surface. To obtain convergence,
the learning rate must be slowly reduced. This methodology is called
"stochastic approximation" or "annealing". 

The convergence properties of standard backprop, stochastic approximation,
and related methods, including both batch and incremental algorithms, are
discussed clearly and thoroughly by Bertsekas and Tsitsiklis (1996). For a
practical discussion of backprop training in MLPs, Reed and Marks (1999) is
the best reference I've seen. 

For batch processing, there is no reason to suffer through the slow
convergence and the tedious tuning of learning rates and momenta of standard
backprop. Much of the NN research literature is devoted to attempts to speed
up backprop. Most of these methods are inconsequential; two that are
effective are Quickprop (Fahlman 1989) and RPROP (Riedmiller and Braun
1993). Concise descriptions of these algorithms are given by Schiffmann,
Joost, and Werner (1994) and Reed and Marks (1999). But conventional methods
for nonlinear optimization are usually faster and more reliable than any of
the "props". See "What are conjugate gradients, Levenberg-Marquardt, etc.?".
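
To give the flavor of these adaptive methods, here is a rough sketch of the sign-based idea behind RPROP (my own simplification of Riedmiller and Braun's algorithm: the factors 1.2 and 0.5 are the published defaults, but the weight-backtracking step is omitted, so treat this as illustrative only, not the exact published method):

```python
def rprop_minimize(grad, w, steps=200, step=0.1,
                   eta_plus=1.2, eta_minus=0.5,
                   step_max=50.0, step_min=1e-6):
    # The step size depends only on the SIGN of the gradient:
    # grow it while the sign is stable, shrink it on a sign change.
    prev_g = 0.0
    for _ in range(steps):
        g = grad(w)
        if prev_g * g > 0:
            step = min(step * eta_plus, step_max)
        elif prev_g * g < 0:
            step = max(step * eta_minus, step_min)
        if g > 0:
            w -= step
        elif g < 0:
            w += step
        prev_g = g
    return w

# Minimize E(w) = (w - 3)^2 from a distant start: the step grows
# geometrically on the long approach, then shrinks near the minimum.
w_star = rprop_minimize(lambda w: 2.0 * (w - 3.0), w=-10.0)
```

Because only the gradient's sign is used, RPROP is insensitive to the scale of the error function, which is much of why it needs so little tuning compared with standard backprop.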

Incremental backprop can be highly efficient for some large data sets if you
select a good learning rate, but that can be difficult to do (see "What
learning rate should be used for backprop?"). Also, incremental backprop is
very sensitive to ill-conditioning (see 
ftp://ftp.sas.com/pub/neural/illcond/illcond.html). 

For more on-line info on backprop, see Donald Tveter's Backpropagator's
Review at http://www.dontveter.com/bpr/bpr.html or 
http://gannoo.uce.ac.uk/bpr/bpr.html. 

References on backprop: 

   Bertsekas, D. P. (1995), Nonlinear Programming, Belmont, MA: Athena
   Scientific, ISBN 1-886529-14-0. 

   Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic
   Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8. 

   Polyak, B.T. (1964), "Some methods of speeding up the convergence of
   iteration methods," Z. Vycisl. Mat. i Mat. Fiz., 4, 1-17. 

   Polyak, B.T. (1987), Introduction to Optimization, NY: Optimization
   Software, Inc. 

   Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised
   Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The
   MIT Press, ISBN 0-262-18190-8.

   Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1986), "Learning
   internal representations by error propagation", in Rumelhart, D.E. and
   McClelland, J. L., eds. (1986), Parallel Distributed Processing:
   Explorations in the Microstructure of Cognition, Volume 1, 318-362,
   Cambridge, MA: The MIT Press. 

   Werbos, P.J. (1974/1994), The Roots of Backpropagation, NY: John Wiley &
   Sons. Includes Werbos's 1974 Harvard Ph.D. thesis, Beyond Regression. 

References on stochastic approximation: 

   Robbins, H. & Monro, S. (1951), "A Stochastic Approximation Method",
   Annals of Mathematical Statistics, 22, 400-407. 

   Kiefer, J. & Wolfowitz, J. (1952), "Stochastic Estimation of the Maximum
   of a Regression Function," Annals of Mathematical Statistics, 23,
   462-466. 

   Kushner, H.J., and Yin, G. (1997), Stochastic Approximation Algorithms
   and Applications, NY: Springer-Verlag. 

   Kushner, H.J., and Clark, D. (1978), Stochastic Approximation Methods
   for Constrained and Unconstrained Systems, Springer-Verlag. 

   White, H. (1989), "Some Asymptotic Results for Learning in Single Hidden
   Layer Feedforward Network Models", J. of the American Statistical Assoc.,
   84, 1008-1013. 

References on better props: 

   Fahlman, S.E. (1989), "Faster-Learning Variations on Back-Propagation: An
   Empirical Study", in Touretzky, D., Hinton, G.E., and Sejnowski, T., eds., 
   Proceedings of the 1988 Connectionist Models Summer School, Morgan
   Kaufmann, 38-51. 

   Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised
   Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The
   MIT Press, ISBN 0-262-18190-8.

   Riedmiller, M. (199?), "Advanced supervised learning in multi-layer
   perceptrons--from backpropagation to adaptive learning algorithms," 
   ftp://i11s16.ira.uka.de/pub/neuro/papers/ 

   Riedmiller, M. and Braun, H. (1993), "A Direct Adaptive Method for Faster
   Backpropagation Learning: The RPROP Algorithm", Proceedings of the IEEE
   International Conference on Neural Networks 1993, San Francisco: IEEE. 

   Schiffmann, W., Joost, M., and Werner, R. (1994), "Optimization of the
   Backpropagation Algorithm for Training Multilayer Perceptrons," 
   ftp://archive.cis.ohio-state.edu/pub/neuroprose/schiff.bp_speedup.ps.Z 


Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM