comp.ai.neural-nets FAQ, Part 2 of 7: Learning

What are conjugate gradients, Levenberg-Marquardt, etc.?
=========================================================

Training a neural network is, in most cases, an exercise in numerical optimization of a usually nonlinear objective function. ("Objective function" means whatever function you are trying to optimize; it is a slightly more general term than "error function" in that it may include other quantities such as penalties for weight decay. See "What are combination, activation, error, and objective functions?" for more details.)

Methods of nonlinear optimization have been studied for hundreds of years, and there is a huge literature on the subject in fields such as numerical analysis, operations research, and statistical computing, e.g., Bertsekas (1995), Bertsekas and Tsitsiklis (1996), Fletcher (1987), and Gill, Murray, and Wright (1981). Masters (1995) has a good elementary discussion of conjugate gradient and Levenberg-Marquardt algorithms in the context of NNs.

There is no single best method for nonlinear optimization. You need to choose a method based on the characteristics of the problem to be solved. For objective functions with continuous second derivatives (which would include feedforward nets with the most popular differentiable activation functions and error functions), three general types of algorithms have been found to be effective for most practical purposes:

 o For a small number of weights, stabilized Newton and Gauss-Newton algorithms, including various Levenberg-Marquardt and trust-region algorithms, are efficient. The memory required by these algorithms is proportional to the square of the number of weights.

 o For a moderate number of weights, various quasi-Newton algorithms are efficient.
The memory required by these algorithms is proportional to the square of the number of weights.

 o For a large number of weights, various conjugate-gradient algorithms are efficient. The memory required by these algorithms is proportional to the number of weights.

Additional variations on the above methods, such as limited-memory quasi-Newton and double dogleg, can be found in textbooks such as Bertsekas (1995).

Objective functions that are not continuously differentiable are more difficult to optimize. For continuous objective functions that lack derivatives on certain manifolds, such as ramp activation functions (which lack derivatives at the top and bottom of the ramp) and the least-absolute-value error function (which lacks derivatives for cases with zero error), subgradient methods can be used. For objective functions with discontinuities, such as threshold activation functions and the misclassification-count error function, the Nelder-Mead simplex algorithm and various secant methods can be used. However, these methods may be very slow for large networks, and it is better to use continuously differentiable objective functions when possible.

All of the above methods find local optima; they are not guaranteed to find a global optimum. In practice, Levenberg-Marquardt often finds better optima for a variety of problems than do the other usual methods. I know of no theoretical explanation for this empirical finding.

For global optimization, there are also a variety of approaches. You can simply run any of the local optimization methods from numerous random starting points. Or you can try more complicated methods designed for global optimization such as simulated annealing or genetic algorithms (see Reeves 1993 and "What about Genetic Algorithms and Evolutionary Computation?"). Global optimization for neural nets is especially difficult because the number of distinct local optima can be astronomical.
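To make the Levenberg-Marquardt idea concrete, here is a minimal sketch (not from the FAQ) of the damped Gauss-Newton update on a toy least-squares problem. The model y = a*exp(b*x), the damping schedule, and all names are illustrative assumptions; production code should use professionally written software, as discussed below.

```python
import numpy as np

# Toy model: fit y = a * exp(b * x) by minimizing the sum of squared residuals.
def residuals(p, x, y):
    a, b = p
    return y - a * np.exp(b * x)

def jacobian(p, x, y):
    a, b = p
    e = np.exp(b * x)
    # Columns are d(residual)/da and d(residual)/db.
    return np.column_stack([-e, -a * x * e])

def levenberg_marquardt(p, x, y, mu=1e-3, iters=100):
    for _ in range(iters):
        r = residuals(p, x, y)
        J = jacobian(p, x, y)
        # Damped Gauss-Newton system: (J'J + mu*I) step = -J'r.
        # Large mu ~ small steepest-descent step; small mu ~ Gauss-Newton step.
        step = np.linalg.solve(J.T @ J + mu * np.eye(len(p)), -J.T @ r)
        p_new = p + step
        if np.sum(residuals(p_new, x, y) ** 2) < np.sum(r ** 2):
            p, mu = p_new, mu * 0.5   # improvement: trust the quadratic model more
        else:
            mu *= 2.0                 # no improvement: increase damping
    return p

x = np.linspace(0.0, 1.0, 50)
y = 2.0 * np.exp(1.5 * x)            # noiseless data from a = 2.0, b = 1.5
p = levenberg_marquardt(np.array([1.0, 1.0]), x, y)
print(p)                             # should be close to [2.0, 1.5]
```

Note that J'J is a square matrix of order equal to the number of parameters, which is why memory for Gauss-Newton-type methods grows with the square of the number of weights.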
In most applications, it is advisable to train several networks with different numbers of hidden units. Rather than train each network beginning with completely random weights, it is usually more efficient to use constructive learning (see "Constructive Learning (Growing networks)"), where the weights that result from training smaller networks are used to initialize larger networks. Constructive learning can be done with any of the conventional optimization techniques or with the various "prop" methods, and can be very effective at finding good local optima at less expense than full-blown global optimization methods.

Another important consideration in the choice of optimization algorithms is that neural nets are often ill-conditioned (Saarinen, Bramley, and Cybenko 1993), especially when there are many hidden units. Algorithms that use only first-order information, such as steepest descent and standard backprop, are notoriously slow for ill-conditioned problems. Generally speaking, the more use an algorithm makes of second-order information, the better it will behave under ill-conditioning. The following methods are listed in order of increasing use of second-order information: steepest descent, conjugate gradients, quasi-Newton, Gauss-Newton, Newton-Raphson.

Unfortunately, the methods that are better for severe ill-conditioning are the methods that are preferable for a small number of weights, and the methods that are preferable for a large number of weights are not as good at handling severe ill-conditioning. Therefore, for networks with many hidden units, it is advisable to try to alleviate ill-conditioning by standardizing input and target variables, choosing initial values from a reasonable range, and using weight decay or Bayesian regularization methods.
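The effect of standardization on conditioning is easy to see directly in a simple case (a sketch, not from the FAQ): for a linear model with squared error, the Hessian is proportional to X'X, so its condition number can be inspected. The scales and data below are illustrative assumptions.

```python
import numpy as np

# Two inputs on very different scales (say, metres vs. millimetres).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([rng.normal(0.0, 1.0, n),
                     rng.normal(0.0, 1000.0, n)])

def hessian_cond(X):
    # For squared error and a linear model, the Hessian is proportional to X'X.
    return np.linalg.cond(X.T @ X)

c_raw = hessian_cond(X)
print(c_raw)        # on the order of 1e6: severely ill-conditioned

# Standardize each input to mean 0, standard deviation 1.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
c_std = hessian_cond(Xs)
print(c_std)        # near 1: well conditioned
```

With the raw inputs, a first-order method must use a step size small enough for the steep direction and therefore crawls along the shallow one; standardization makes the error surface much closer to spherical.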
For more discussion of ill-conditioning in neural nets, see ftp://ftp.sas.com/pub/neural/illcond/illcond.html

Writing programs for conventional optimization algorithms is considerably more difficult than writing programs for standard backprop. As "Jive Dadson" said in comp.ai.neural-nets:

   Writing a good conjugate gradient algorithm turned out to be a lot of work. It's not even easy to find all the technical info you need. The devil is in the details. There are a lot of details.

Indeed, some popular books on "numerical recipes" are notoriously bad (see http://math.jpl.nasa.gov/nr/ for details). If you are not experienced in both programming and numerical analysis, use software written by professionals instead of trying to write your own. For a survey of optimization software, see Moré and Wright (1993).

For more online information on numerical optimization, see:

 o The kangaroos, a nontechnical description of various optimization methods, at ftp://ftp.sas.com/pub/neural/kangaroos.
 o Sam Roweis's paper on Levenberg-Marquardt at http://www.gatsby.ucl.ac.uk/~roweis/notes.html
 o Jonathan Shewchuk's paper on conjugate gradients, "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain," at http://www.cs.cmu.edu/~jrs/jrspapers.html
 o Lester Ingber's page on Adaptive Simulated Annealing (ASA), karate, etc. at http://www.ingber.com/ or http://www.alumni.caltech.edu/~ingber/
 o The Netlib repository, http://www.netlib.org/, containing freely available software, documents, and databases of interest to the numerical and scientific computing community.
 o The linear and nonlinear programming FAQs at http://www.mcs.anl.gov/home/otc/Guide/faq/.
 o Arnold Neumaier's page on global optimization at http://solon.cma.univie.ac.at/~neum/glopt.html.
 o Simon Streltsov's page on global optimization at http://cad.bu.edu/go.

References:

   Bertsekas, D. P. (1995), Nonlinear Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-14-0.

   Bertsekas, D. P. and Tsitsiklis, J. N.
(1996), Neuro-Dynamic Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8.

   Fletcher, R. (1987), Practical Methods of Optimization, NY: Wiley.

   Gill, P.E., Murray, W., and Wright, M.H. (1981), Practical Optimization, London: Academic Press.

   Levenberg, K. (1944), "A method for the solution of certain problems in least squares," Quart. Appl. Math., 2, 164-168.

   Marquardt, D. (1963), "An algorithm for least-squares estimation of nonlinear parameters," SIAM J. Appl. Math., 11, 431-441. This is the third most frequently cited paper in all the mathematical sciences.

   Masters, T. (1995), Advanced Algorithms for Neural Networks: A C++ Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0.

   Moré, J.J. (1977), "The Levenberg-Marquardt algorithm: implementation and theory," in Watson, G.A., ed., Numerical Analysis, Lecture Notes in Mathematics 630, Heidelberg: Springer-Verlag, 105-116.

   Moré, J.J. and Wright, S.J. (1993), Optimization Software Guide, Philadelphia: SIAM, ISBN 0-89871-322-6.

   Reed, R.D., and Marks, R.J., II (1999), Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The MIT Press, ISBN 0-262-18190-8.

   Reeves, C.R., ed. (1993), Modern Heuristic Techniques for Combinatorial Problems, NY: Wiley.

   Rinnooy Kan, A.H.G., and Timmer, G.T. (1989), Global Optimization: A Survey, International Series of Numerical Mathematics, vol. 87, Basel: Birkhäuser Verlag.

   Saarinen, S., Bramley, R., and Cybenko, G. (1993), "Ill-conditioning in neural network training problems," SIAM J. of Scientific Computing, 14, 693-714.
Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
