Top Document: comp.ai.neural-nets FAQ, Part 2 of 7: Learning Previous Document: What learning rate should be used for backprop? Next Document: How does ill-conditioning affect NN training? See reader questions & answers on this topic! - Help others by sharing your knowledge Levenberg-Marquardt, etc.? =========================== Training a neural network is, in most cases, an exercise in numerical optimization of a usually nonlinear objective function ("objective function" means whatever function you are trying to optimize and is a slightly more general term than "error function" in that it may include other quantities such as penalties for weight decay; see "What are combination, activation, error, and objective functions?" for more details). Methods of nonlinear optimization have been studied for hundreds of years, and there is a huge literature on the subject in fields such as numerical analysis, operations research, and statistical computing, e.g., Bertsekas (1995), Bertsekas and Tsitsiklis (1996), Fletcher (1987), and Gill, Murray, and Wright (1981). Masters (1995) has a good elementary discussion of conjugate gradient and Levenberg-Marquardt algorithms in the context of NNs. There is no single best method for nonlinear optimization. You need to choose a method based on the characteristics of the problem to be solved. For objective functions with continuous second derivatives (which would include feedforward nets with the most popular differentiable activation functions and error functions), three general types of algorithms have been found to be effective for most practical purposes: o For a small number of weights, stabilized Newton and Gauss-Newton algorithms, including various Levenberg-Marquardt and trust-region algorithms, are efficient. The memory required by these algorithms is proportional to the square of the number of weights. o For a moderate number of weights, various quasi-Newton algorithms are efficient. The memory required by these algorithms is proportional to the square of the number of weights. o For a large number of weights, various conjugate-gradient algorithms are efficient. The memory required by these algorithms is proportional to the number of weights. Additional variations on the above methods, such as limited-memory quasi-Newton and double dogleg, can be found in textbooks such as Bertsekas (1995). Objective functions that are not continuously differentiable are more difficult to optimize. For continuous objective functions that lack derivatives on certain manifolds, such as ramp activation functions (which lack derivatives at the top and bottom of the ramp) and the least-absolute-value error function (which lacks derivatives for cases with zero error), subgradient methods can be used. For objective functions with discontinuities, such as threshold activation functions and the misclassification-count error function, Nelder-Mead simplex algorithm and various secant methods can be used. However, these methods may be very slow for large networks, and it is better to use continuously differentiable objective functions when possible. All of the above methods find local optima--they are not guaranteed to find a global optimum. In practice, Levenberg-Marquardt often finds better optima for a variety of problems than do the other usual methods. I know of no theoretical explanation for this empirical finding. For global optimization, there are also a variety of approaches. You can simply run any of the local optimization methods from numerous random starting points. Or you can try more complicated methods designed for global optimization such as simulated annealing or genetic algorithms (see Reeves 1993 and "What about Genetic Algorithms and Evolutionary Computation?"). Global optimization for neural nets is especially difficult because the number of distinct local optima can be astronomical. In most applications, it is advisable to train several networks with different numbers of hidden units. Rather than train each network beginning with completely random weights, it is usually more efficient to use constructive learning (see "Constructive Learning (Growing networks)"), where the weights that result from training smaller networks are used to initialize larger networks. Constructive learning can be done with any of the conventional optimization techniques or with the various "prop" methods, and can be very effective at finding good local optima at less expense than full-blown global optimization methods. Another important consideration in the choice of optimization algorithms is that neural nets are often ill-conditioned (Saarinen, Bramley, and Cybenko 1993), especially when there are many hidden units. Algorithms that use only first-order information, such as steepest descent and standard backprop, are notoriously slow for ill-conditioned problems. Generally speaking, the more use an algorithm makes of second-order information, the better it will behave under ill-conditioning. The following methods are listed in order of increasing use of second-order information: steepest descent, conjugate gradients, quasi-Newton, Gauss-Newton, Newton-Raphson. Unfortunately, the methods that are better for severe ill-conditioning are the methods that are preferable for a small number of weights, and the methods that are preferable for a large number of weights are not as good at handling severe ill-conditioning. Therefore for networks with many hidden units, it is advisable to try to alleviate ill-conditioning by standardizing input and target variables, choosing initial values from a reasonable range, and using weight decay or Bayesian regularization methods. For more discussion of ill-conditioning in neural nets, see ftp://ftp.sas.com/pub/neural/illcond/illcond.html Writing programs for conventional optimization algorithms is considerably more difficult than writing programs for standard backprop. As "Jive Dadson" said in comp.ai.neural-nets: Writing a good conjugate gradient algorithm turned out to be a lot of work. It's not even easy to find all the technical info you need. The devil is in the details. There are a lot of details. Indeed, some popular books on "numerical recipes" are notoriously bad (see http://math.jpl.nasa.gov/nr/ for details). If you are not experienced in both programming and numerical analysis, use software written by professionals instead of trying to write your own. For a survey of optimization software, see Moré and Wright (1993). For more on-line information on numerical optimization see: o The kangaroos, a nontechnical description of various optimization methods, at ftp://ftp.sas.com/pub/neural/kangaroos. o Sam Roweis's paper on Levenberg-Marquardt at http://www.gatsby.ucl.ac.uk/~roweis/notes.html o Jonathan Shewchuk's paper on conjugate gradients, "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain," at http://www.cs.cmu.edu/~jrs/jrspapers.html o Lester Ingber's page on Adaptive Simulated Annealing (ASA), karate, etc. at http://www.ingber.com/ or http://www.alumni.caltech.edu/~ingber/ o The Netlib repository, http://www.netlib.org/, containing freely available software, documents, and databases of interest to the numerical and scientific computing community. o The linear and nonlinear programming FAQs at http://www.mcs.anl.gov/home/otc/Guide/faq/. o Arnold Neumaier's page on global optimization at http://solon.cma.univie.ac.at/~neum/glopt.html. o Simon Streltsov's page on global optimization at http://cad.bu.edu/go. References: Bertsekas, D. P. (1995), Nonlinear Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-14-0. Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8. Fletcher, R. (1987) Practical Methods of Optimization, NY: Wiley. Gill, P.E., Murray, W. and Wright, M.H. (1981) Practical Optimization, Academic Press: London. Levenberg, K. (1944) "A method for the solution of certain problems in least squares," Quart. Appl. Math., 2, 164-168. Marquardt, D. (1963) "An algorithm for least-squares estimation of nonlinear parameters," SIAM J. Appl. Math., 11, 431-441. This is the third most frequently cited paper in all the mathematical sciences. Masters, T. (1995) Advanced Algorithms for Neural Networks: A C++ Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0 Moré, J.J. (1977) "The Levenberg-Marquardt algorithm: implementation and theory," in Watson, G.A., ed., Numerical Analysis, Lecture Notes in Mathematics 630, Springer-Verlag, Heidelberg, 105-116. Moré, J.J. and Wright, S.J. (1993), Optimization Software Guide, Philadelphia: SIAM, ISBN 0-89871-322-6. Reed, R.D., and Marks, R.J, II (1999), Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The MIT Press, ISBN 0-262-18190-8. Reeves, C.R., ed. (1993) Modern Heuristic Techniques for Combinatorial Problems, NY: Wiley. Rinnooy Kan, A.H.G., and Timmer, G.T., (1989) Global Optimization: A Survey, International Series of Numerical Mathematics, vol. 87, Basel: Birkhauser Verlag. Saarinen, S., Bramley, R., and Cybenko, G. (1993), "Ill-conditioning in neural network training problems," Siam J. of Scientific Computing, 14, 693-714. User Contributions:Top Document: comp.ai.neural-nets FAQ, Part 2 of 7: Learning Previous Document: What learning rate should be used for backprop? Next Document: How does ill-conditioning affect NN training? Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page [ Usenet FAQs | Web FAQs | Documents | RFC Index ] Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
|
Comment about this article, ask questions, or add new information about this topic: