## Search the FAQ Archives

3 - A - B - C - D - E - F - G - H - I - J - K - L - M
N - O - P - Q - R - S - T - U - V - W - X - Y - Z

# comp.ai.neural-nets FAQ, Part 2 of 7: LearningSection - What are conjugate gradients,

( Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page )
[ Usenet FAQs | Web FAQs | Documents | RFC Index | Property taxes ]

Top Document: comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Previous Document: What learning rate should be used for backprop?
Next Document: How does ill-conditioning affect NN training?
```Levenberg-Marquardt, etc.?
===========================

Training a neural network is, in most cases, an exercise in numerical
optimization of a usually nonlinear objective function ("objective function"
means whatever function you are trying to optimize and is a slightly more
general term than "error function" in that it may include other quantities
such as penalties for weight decay; see "What are combination, activation,
error, and objective functions?" for more details).

Methods of nonlinear optimization have been studied for hundreds of years,
and there is a huge literature on the subject in fields such as numerical
analysis, operations research, and statistical computing, e.g., Bertsekas
(1995), Bertsekas and Tsitsiklis (1996), Fletcher (1987), and Gill, Murray,
and Wright (1981). Masters (1995) has a good elementary discussion of
conjugate gradient and Levenberg-Marquardt algorithms in the context of NNs.

There is no single best method for nonlinear optimization. You need to
choose a method based on the characteristics of the problem to be solved.
For objective functions with continuous second derivatives (which would
include feedforward nets with the most popular differentiable activation
functions and error functions), three general types of algorithms have been
found to be effective for most practical purposes:

o For a small number of weights, stabilized Newton and Gauss-Newton
algorithms, including various Levenberg-Marquardt and trust-region
algorithms, are efficient. The memory required by these algorithms is
proportional to the square of the number of weights.
o For a moderate number of weights, various quasi-Newton algorithms are
efficient. The memory required by these algorithms is proportional to the
square of the number of weights.
o For a large number of weights, various conjugate-gradient algorithms are
efficient. The memory required by these algorithms is proportional to the
number of weights.

Additional variations on the above methods, such as limited-memory
quasi-Newton and double dogleg, can be found in textbooks such as Bertsekas
(1995). Objective functions that are not continuously differentiable are
more difficult to optimize. For continuous objective functions that lack
derivatives on certain manifolds, such as ramp activation functions (which
lack derivatives at the top and bottom of the ramp) and the
least-absolute-value error function (which lacks derivatives for cases with
zero error), subgradient methods can be used. For objective functions with
discontinuities, such as threshold activation functions and the
misclassification-count error function, Nelder-Mead simplex algorithm and
various secant methods can be used. However, these methods may be very slow
for large networks, and it is better to use continuously differentiable
objective functions when possible.

All of the above methods find local optima--they are not guaranteed to find
a global optimum. In practice, Levenberg-Marquardt often finds better optima
for a variety of problems than do the other usual methods. I know of no
theoretical explanation for this empirical finding.

For global optimization, there are also a variety of approaches. You can
simply run any of the local optimization methods from numerous random
starting points. Or you can try more complicated methods designed for global
optimization such as simulated annealing or genetic algorithms (see Reeves
1993 and "What about Genetic Algorithms and Evolutionary Computation?").
Global optimization for neural nets is especially difficult because the
number of distinct local optima can be astronomical.

In most applications, it is advisable to train several networks with
different numbers of hidden units. Rather than train each network beginning
with completely random weights, it is usually more efficient to use
constructive learning (see "Constructive Learning (Growing networks)"),
where the weights that result from training smaller networks are used to
initialize larger networks. Constructive learning can be done with any of
the conventional optimization techniques or with the various "prop" methods,
and can be very effective at finding good local optima at less expense than
full-blown global optimization methods.

Another important consideration in the choice of optimization algorithms is
that neural nets are often ill-conditioned (Saarinen, Bramley, and Cybenko
1993), especially when there are many hidden units. Algorithms that use only
first-order information, such as steepest descent and standard backprop, are
notoriously slow for ill-conditioned problems. Generally speaking, the more
use an algorithm makes of second-order information, the better it will
behave under ill-conditioning. The following methods are listed in order of
increasing use of second-order information: steepest descent, conjugate
gradients, quasi-Newton, Gauss-Newton, Newton-Raphson. Unfortunately, the
methods that are better for severe ill-conditioning are the methods that are
preferable for a small number of weights, and the methods that are
preferable for a large number of weights are not as good at handling severe
ill-conditioning. Therefore for networks with many hidden units, it is
advisable to try to alleviate ill-conditioning by standardizing input and
target variables, choosing initial values from a reasonable range, and using
weight decay or Bayesian regularization methods. For more discussion of
ill-conditioning in neural nets, see
ftp://ftp.sas.com/pub/neural/illcond/illcond.html

Writing programs for conventional optimization algorithms is considerably
more difficult than writing programs for standard backprop. As "Jive Dadson"
said in comp.ai.neural-nets:

Writing a good conjugate gradient algorithm turned out to be a lot of
work. It's not even easy to find all the technical info you need. The
devil is in the details. There are a lot of details.

Indeed, some popular books on "numerical recipes" are notoriously bad (see
http://math.jpl.nasa.gov/nr/ for details). If you are not experienced in
both programming and numerical analysis, use software written by
professionals instead of trying to write your own. For a survey of
optimization software, see Moré and Wright (1993).

For more on-line information on numerical optimization see:

o The kangaroos, a nontechnical description of various optimization
methods, at ftp://ftp.sas.com/pub/neural/kangaroos.
o Sam Roweis's paper on Levenberg-Marquardt at
http://www.gatsby.ucl.ac.uk/~roweis/notes.html
o Jonathan Shewchuk's paper on conjugate gradients, "An Introduction to the
Conjugate Gradient Method Without the Agonizing Pain," at
http://www.cs.cmu.edu/~jrs/jrspapers.html
o Lester Ingber's page on Adaptive Simulated Annealing (ASA), karate, etc.
at http://www.ingber.com/ or http://www.alumni.caltech.edu/~ingber/
o The Netlib repository, http://www.netlib.org/, containing freely
available software, documents, and databases of interest to the numerical
and scientific computing community.
o The linear and nonlinear programming FAQs at
http://www.mcs.anl.gov/home/otc/Guide/faq/.
o Arnold Neumaier's page on global optimization at
http://solon.cma.univie.ac.at/~neum/glopt.html.
o Simon Streltsov's page on global optimization at http://cad.bu.edu/go.

References:

Bertsekas, D. P. (1995), Nonlinear Programming, Belmont, MA: Athena
Scientific, ISBN 1-886529-14-0.

Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic
Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8.

Fletcher, R. (1987) Practical Methods of Optimization, NY: Wiley.

Gill, P.E., Murray, W. and Wright, M.H. (1981) Practical Optimization,

Levenberg, K. (1944) "A method for the solution of certain problems in
least squares," Quart. Appl. Math., 2, 164-168.

Marquardt, D. (1963) "An algorithm for least-squares estimation of
nonlinear parameters," SIAM J. Appl. Math., 11, 431-441. This is the
third most frequently cited paper in all the mathematical sciences.

Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++
Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0

Moré, J.J. (1977) "The Levenberg-Marquardt algorithm: implementation and
theory," in Watson, G.A., ed., Numerical Analysis, Lecture Notes in
Mathematics 630, Springer-Verlag, Heidelberg, 105-116.

Moré, J.J. and Wright, S.J. (1993), Optimization Software Guide,

Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised
Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The
MIT Press, ISBN 0-262-18190-8.

Reeves, C.R., ed. (1993) Modern Heuristic Techniques for Combinatorial
Problems, NY: Wiley.

Rinnooy Kan, A.H.G., and Timmer, G.T., (1989) Global Optimization: A
Survey, International Series of Numerical Mathematics, vol. 87, Basel:
Birkhauser Verlag.

Saarinen, S., Bramley, R., and Cybenko, G. (1993), "Ill-conditioning in
neural network training problems," Siam J. of Scientific Computing, 14,
693-714.

```

## User Contributions:

Top Document: comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Previous Document: What learning rate should be used for backprop?
Next Document: How does ill-conditioning affect NN training?

Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page

[ Usenet FAQs | Web FAQs | Documents | RFC Index ]

Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM