Search the FAQ Archives

3 - A - B - C - D - E - F - G - H - I - J - K - L - M
N - O - P - Q - R - S - T - U - V - W - X - Y - Z - Internet FAQ Archives FAQ, Part 2 of 7: Learning
Section - What are conjugate gradients,

( Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page )
[ Usenet FAQs | Web FAQs | Documents | RFC Index | Property taxes ]

Top Document: FAQ, Part 2 of 7: Learning
Previous Document: What learning rate should be used for backprop?
Next Document: How does ill-conditioning affect NN training?
See reader questions & answers on this topic! - Help others by sharing your knowledge
Levenberg-Marquardt, etc.? 

Training a neural network is, in most cases, an exercise in numerical
optimization of a usually nonlinear objective function ("objective function"
means whatever function you are trying to optimize and is a slightly more
general term than "error function" in that it may include other quantities
such as penalties for weight decay; see "What are combination, activation,
error, and objective functions?" for more details). 

Methods of nonlinear optimization have been studied for hundreds of years,
and there is a huge literature on the subject in fields such as numerical
analysis, operations research, and statistical computing, e.g., Bertsekas
(1995), Bertsekas and Tsitsiklis (1996), Fletcher (1987), and Gill, Murray,
and Wright (1981). Masters (1995) has a good elementary discussion of
conjugate gradient and Levenberg-Marquardt algorithms in the context of NNs.

There is no single best method for nonlinear optimization. You need to
choose a method based on the characteristics of the problem to be solved.
For objective functions with continuous second derivatives (which would
include feedforward nets with the most popular differentiable activation
functions and error functions), three general types of algorithms have been
found to be effective for most practical purposes: 

 o For a small number of weights, stabilized Newton and Gauss-Newton
   algorithms, including various Levenberg-Marquardt and trust-region
   algorithms, are efficient. The memory required by these algorithms is
   proportional to the square of the number of weights. 
 o For a moderate number of weights, various quasi-Newton algorithms are
   efficient. The memory required by these algorithms is proportional to the
   square of the number of weights. 
 o For a large number of weights, various conjugate-gradient algorithms are
   efficient. The memory required by these algorithms is proportional to the
   number of weights. 

Additional variations on the above methods, such as limited-memory
quasi-Newton and double dogleg, can be found in textbooks such as Bertsekas
(1995). Objective functions that are not continuously differentiable are
more difficult to optimize. For continuous objective functions that lack
derivatives on certain manifolds, such as ramp activation functions (which
lack derivatives at the top and bottom of the ramp) and the
least-absolute-value error function (which lacks derivatives for cases with
zero error), subgradient methods can be used. For objective functions with
discontinuities, such as threshold activation functions and the
misclassification-count error function, Nelder-Mead simplex algorithm and
various secant methods can be used. However, these methods may be very slow
for large networks, and it is better to use continuously differentiable
objective functions when possible. 

All of the above methods find local optima--they are not guaranteed to find
a global optimum. In practice, Levenberg-Marquardt often finds better optima
for a variety of problems than do the other usual methods. I know of no
theoretical explanation for this empirical finding. 

For global optimization, there are also a variety of approaches. You can
simply run any of the local optimization methods from numerous random
starting points. Or you can try more complicated methods designed for global
optimization such as simulated annealing or genetic algorithms (see Reeves
1993 and "What about Genetic Algorithms and Evolutionary Computation?").
Global optimization for neural nets is especially difficult because the
number of distinct local optima can be astronomical. 

In most applications, it is advisable to train several networks with
different numbers of hidden units. Rather than train each network beginning
with completely random weights, it is usually more efficient to use
constructive learning (see "Constructive Learning (Growing networks)"),
where the weights that result from training smaller networks are used to
initialize larger networks. Constructive learning can be done with any of
the conventional optimization techniques or with the various "prop" methods,
and can be very effective at finding good local optima at less expense than
full-blown global optimization methods. 

Another important consideration in the choice of optimization algorithms is
that neural nets are often ill-conditioned (Saarinen, Bramley, and Cybenko
1993), especially when there are many hidden units. Algorithms that use only
first-order information, such as steepest descent and standard backprop, are
notoriously slow for ill-conditioned problems. Generally speaking, the more
use an algorithm makes of second-order information, the better it will
behave under ill-conditioning. The following methods are listed in order of
increasing use of second-order information: steepest descent, conjugate
gradients, quasi-Newton, Gauss-Newton, Newton-Raphson. Unfortunately, the
methods that are better for severe ill-conditioning are the methods that are
preferable for a small number of weights, and the methods that are
preferable for a large number of weights are not as good at handling severe
ill-conditioning. Therefore for networks with many hidden units, it is
advisable to try to alleviate ill-conditioning by standardizing input and
target variables, choosing initial values from a reasonable range, and using
weight decay or Bayesian regularization methods. For more discussion of
ill-conditioning in neural nets, see 

Writing programs for conventional optimization algorithms is considerably
more difficult than writing programs for standard backprop. As "Jive Dadson"
said in 

   Writing a good conjugate gradient algorithm turned out to be a lot of
   work. It's not even easy to find all the technical info you need. The
   devil is in the details. There are a lot of details. 

Indeed, some popular books on "numerical recipes" are notoriously bad (see for details). If you are not experienced in
both programming and numerical analysis, use software written by
professionals instead of trying to write your own. For a survey of
optimization software, see Moré and Wright (1993). 

For more on-line information on numerical optimization see: 

 o The kangaroos, a nontechnical description of various optimization
   methods, at 
 o Sam Roweis's paper on Levenberg-Marquardt at 
 o Jonathan Shewchuk's paper on conjugate gradients, "An Introduction to the
   Conjugate Gradient Method Without the Agonizing Pain," at 
 o Lester Ingber's page on Adaptive Simulated Annealing (ASA), karate, etc.
   at or 
 o The Netlib repository,, containing freely
   available software, documents, and databases of interest to the numerical
   and scientific computing community. 
 o The linear and nonlinear programming FAQs at 
 o Arnold Neumaier's page on global optimization at 
 o Simon Streltsov's page on global optimization at 


   Bertsekas, D. P. (1995), Nonlinear Programming, Belmont, MA: Athena
   Scientific, ISBN 1-886529-14-0. 

   Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic
   Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8. 

   Fletcher, R. (1987) Practical Methods of Optimization, NY: Wiley. 

   Gill, P.E., Murray, W. and Wright, M.H. (1981) Practical Optimization,
   Academic Press: London. 

   Levenberg, K. (1944) "A method for the solution of certain problems in
   least squares," Quart. Appl. Math., 2, 164-168. 

   Marquardt, D. (1963) "An algorithm for least-squares estimation of
   nonlinear parameters," SIAM J. Appl. Math., 11, 431-441. This is the
   third most frequently cited paper in all the mathematical sciences. 

   Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++
   Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0 

   Moré, J.J. (1977) "The Levenberg-Marquardt algorithm: implementation and
   theory," in Watson, G.A., ed., Numerical Analysis, Lecture Notes in
   Mathematics 630, Springer-Verlag, Heidelberg, 105-116. 

   Moré, J.J. and Wright, S.J. (1993), Optimization Software Guide,
   Philadelphia: SIAM, ISBN 0-89871-322-6. 

   Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised
   Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The
   MIT Press, ISBN 0-262-18190-8.

   Reeves, C.R., ed. (1993) Modern Heuristic Techniques for Combinatorial
   Problems, NY: Wiley. 

   Rinnooy Kan, A.H.G., and Timmer, G.T., (1989) Global Optimization: A
   Survey, International Series of Numerical Mathematics, vol. 87, Basel:
   Birkhauser Verlag. 

   Saarinen, S., Bramley, R., and Cybenko, G. (1993), "Ill-conditioning in
   neural network training problems," Siam J. of Scientific Computing, 14,

User Contributions:

Comment about this article, ask questions, or add new information about this topic:


Top Document: FAQ, Part 2 of 7: Learning
Previous Document: What learning rate should be used for backprop?
Next Document: How does ill-conditioning affect NN training?

Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page

[ Usenet FAQs | Web FAQs | Documents | RFC Index ]

Send corrections/additions to the FAQ Maintainer: (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM