Search the FAQ Archives

3 - A - B - C - D - E - F - G - H - I - J - K - L - M
N - O - P - Q - R - S - T - U - V - W - X - Y - Z
faqs.org - Internet FAQ Archives

comp.ai.neural-nets FAQ, Part 3 of 7: Generalization
Section - What is weight decay?

( Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page )
[ Usenet FAQs | Web FAQs | Documents | RFC Index | Houses ]


Top Document: comp.ai.neural-nets FAQ, Part 3 of 7: Generalization
Previous Document: What is early stopping?
Next Document: What is Bayesian Learning?
See reader questions & answers on this topic! - Help others by sharing your knowledge

Weight decay adds a penalty term to the error function. The usual penalty is
the sum of squared weights times a decay constant. In a linear model, this
form of weight decay is equivalent to ridge regression. See "What is
jitter?" for more explanation of ridge regression. 

Weight decay is a subset of regularization methods. The penalty term in
weight decay, by definition, penalizes large weights. Other regularization
methods may involve not only the weights but various derivatives of the
output function (Bishop 1995). 

The weight decay penalty term causes the weights to converge to smaller
absolute values than they otherwise would. Large weights can hurt
generalization in two different ways. Excessively large weights leading to
hidden units can cause the output function to be too rough, possibly with
near discontinuities. Excessively large weights leading to output units can
cause wild outputs far beyond the range of the data if the output activation
function is not bounded to the same range as the data. To put it another
way, large weights can cause excessive variance of the output (Geman,
Bienenstock, and Doursat 1992). According to Bartlett (1997), the size (L_1
norm) of the weights is more important than the number of weights in
determining generalization. 

Other penalty terms besides the sum of squared weights are sometimes used. 
Weight elimination (Weigend, Rumelhart, and Huberman 1991) uses: 

          (w_i)^2
   sum -------------
    i  (w_i)^2 + c^2

where w_i is the ith weight and c is a user-specified constant. Whereas
decay using the sum of squared weights tends to shrink the large
coefficients more than the small ones, weight elimination tends to shrink
the small coefficients more, and is therefore more useful for suggesting
subset models (pruning). 

The generalization ability of the network can depend crucially on the decay
constant, especially with small training sets. One approach to choosing the
decay constant is to train several networks with different amounts of decay
and estimate the generalization error for each; then choose the decay
constant that minimizes the estimated generalization error. Weigend,
Rumelhart, and Huberman (1991) iteratively update the decay constant during
training. 

There are other important considerations for getting good results from
weight decay. You must either standardize the inputs and targets, or adjust
the penalty term for the standard deviations of all the inputs and targets.
It is usually a good idea to omit the biases from the penalty term. 

A fundamental problem with weight decay is that different types of weights
in the network will usually require different decay constants for good
generalization. At the very least, you need three different decay constants
for input-to-hidden, hidden-to-hidden, and hidden-to-output weights.
Adjusting all these decay constants to produce the best estimated
generalization error often requires vast amounts of computation. 

Fortunately, there is a superior alternative to weight decay: hierarchical 
Bayesian learning. Bayesian learning makes it possible to estimate
efficiently numerous decay constants. 

References: 

   Bartlett, P.L. (1997), "For valid generalization, the size of the weights
   is more important than the size of the network," in Mozer, M.C., Jordan,
   M.I., and Petsche, T., (eds.) Advances in Neural Information Processing
   Systems 9, Cambrideg, MA: The MIT Press, pp. 134-140. 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
   the Bias/Variance Dilemma", Neural Computation, 4, 1-58. 

   Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 

   Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991).
   Generalization by weight-elimination with application to forecasting. In:
   R. P. Lippmann, J. Moody, & D. S. Touretzky (eds.), Advances in Neural
   Information Processing Systems 3, San Mateo, CA: Morgan Kaufmann. 

User Contributions:

1
Majid Maqbool
Sep 27, 2024 @ 5:05 am
https://techpassion.co.uk/how-does-a-smart-tv-work-read-complete-details/
PDP++ is a neural-network simulation system written in C++, developed as an advanced version of the original PDP software from McClelland and Rumelhart's "Explorations in Parallel Distributed Processing Handbook" (1987). The software is designed for both novice users and researchers, providing flexibility and power in cognitive neuroscience studies. Featured in Randall C. O'Reilly and Yuko Munakata's "Computational Explorations in Cognitive Neuroscience" (2000), PDP++ supports a wide range of algorithms. These include feedforward and recurrent error backpropagation, with continuous and real-time models such as Almeida-Pineda. It also incorporates constraint satisfaction algorithms like Boltzmann Machines, Hopfield networks, and mean-field networks, as well as self-organizing learning algorithms, including Self-organizing Maps (SOM) and Hebbian learning. Additionally, it supports mixtures-of-experts models and the Leabra algorithm, which combines error-driven and Hebbian learning with k-Winners-Take-All inhibitory competition. PDP++ is a comprehensive tool for exploring neural network models in cognitive neuroscience.

Comment about this article, ask questions, or add new information about this topic:




Top Document: comp.ai.neural-nets FAQ, Part 3 of 7: Generalization
Previous Document: What is early stopping?
Next Document: What is Bayesian Learning?

Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page

[ Usenet FAQs | Web FAQs | Documents | RFC Index ]

Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)





Last Update March 27 2014 @ 02:11 PM