Top Document: comp.ai.neuralnets FAQ, Part 3 of 7: Generalization Previous Document: What is early stopping? Next Document: What is Bayesian Learning? See reader questions & answers on this topic!  Help others by sharing your knowledge Weight decay adds a penalty term to the error function. The usual penalty is the sum of squared weights times a decay constant. In a linear model, this form of weight decay is equivalent to ridge regression. See "What is jitter?" for more explanation of ridge regression. Weight decay is a subset of regularization methods. The penalty term in weight decay, by definition, penalizes large weights. Other regularization methods may involve not only the weights but various derivatives of the output function (Bishop 1995). The weight decay penalty term causes the weights to converge to smaller absolute values than they otherwise would. Large weights can hurt generalization in two different ways. Excessively large weights leading to hidden units can cause the output function to be too rough, possibly with near discontinuities. Excessively large weights leading to output units can cause wild outputs far beyond the range of the data if the output activation function is not bounded to the same range as the data. To put it another way, large weights can cause excessive variance of the output (Geman, Bienenstock, and Doursat 1992). According to Bartlett (1997), the size (L_1 norm) of the weights is more important than the number of weights in determining generalization. Other penalty terms besides the sum of squared weights are sometimes used. Weight elimination (Weigend, Rumelhart, and Huberman 1991) uses: (w_i)^2 sum  i (w_i)^2 + c^2 where w_i is the ith weight and c is a userspecified constant. Whereas decay using the sum of squared weights tends to shrink the large coefficients more than the small ones, weight elimination tends to shrink the small coefficients more, and is therefore more useful for suggesting subset models (pruning). The generalization ability of the network can depend crucially on the decay constant, especially with small training sets. One approach to choosing the decay constant is to train several networks with different amounts of decay and estimate the generalization error for each; then choose the decay constant that minimizes the estimated generalization error. Weigend, Rumelhart, and Huberman (1991) iteratively update the decay constant during training. There are other important considerations for getting good results from weight decay. You must either standardize the inputs and targets, or adjust the penalty term for the standard deviations of all the inputs and targets. It is usually a good idea to omit the biases from the penalty term. A fundamental problem with weight decay is that different types of weights in the network will usually require different decay constants for good generalization. At the very least, you need three different decay constants for inputtohidden, hiddentohidden, and hiddentooutput weights. Adjusting all these decay constants to produce the best estimated generalization error often requires vast amounts of computation. Fortunately, there is a superior alternative to weight decay: hierarchical Bayesian learning. Bayesian learning makes it possible to estimate efficiently numerous decay constants. References: Bartlett, P.L. (1997), "For valid generalization, the size of the weights is more important than the size of the network," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambrideg, MA: The MIT Press, pp. 134140. Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press. Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and the Bias/Variance Dilemma", Neural Computation, 4, 158. Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press. Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991). Generalization by weightelimination with application to forecasting. In: R. P. Lippmann, J. Moody, & D. S. Touretzky (eds.), Advances in Neural Information Processing Systems 3, San Mateo, CA: Morgan Kaufmann. User Contributions:Top Document: comp.ai.neuralnets FAQ, Part 3 of 7: Generalization Previous Document: What is early stopping? Next Document: What is Bayesian Learning? Part1  Part2  Part3  Part4  Part5  Part6  Part7  Single Page [ Usenet FAQs  Web FAQs  Documents  RFC Index ] Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM

Comment about this article, ask questions, or add new information about this topic: