# comp.ai.neural-nets FAQ, Part 3 of 7: Generalization

## Section - What is weight decay?

Top Document: comp.ai.neural-nets FAQ, Part 3 of 7: Generalization
Previous Document: What is early stopping?
Next Document: What is Bayesian Learning?
```
Weight decay adds a penalty term to the error function. The usual penalty is
the sum of squared weights times a decay constant. In a linear model, this
form of weight decay is equivalent to ridge regression. See "What is
jitter?" for more explanation of ridge regression.

Weight decay is one of a family of regularization methods. The penalty
term in weight decay, by definition, penalizes large weights. Other
regularization methods may involve not only the weights but various
derivatives of the output function (Bishop 1995).

The weight decay penalty term causes the weights to converge to smaller
absolute values than they otherwise would. Large weights can hurt
generalization in two different ways. Excessively large weights leading to
hidden units can cause the output function to be too rough, possibly with
near discontinuities. Excessively large weights leading to output units can
cause wild outputs far beyond the range of the data if the output activation
function is not bounded to the same range as the data. To put it another
way, large weights can cause excessive variance of the output (Geman,
Bienenstock, and Doursat 1992). According to Bartlett (1997), the size (L_1
norm) of the weights is more important than the number of weights in
determining generalization.

Other penalty terms besides the sum of squared weights are sometimes used.
Weight elimination (Weigend, Rumelhart, and Huberman 1991) uses:

            (w_i)^2
   sum  -------------
    i   (w_i)^2 + c^2

where w_i is the ith weight and c is a user-specified constant. Whereas
decay using the sum of squared weights tends to shrink the large
coefficients more than the small ones, weight elimination tends to shrink
the small coefficients more, and is therefore more useful for suggesting
subset models (pruning).

The generalization ability of the network can depend crucially on the decay
constant, especially with small training sets. One approach to choosing the
decay constant is to train several networks with different amounts of decay
and estimate the generalization error for each; then choose the decay
constant that minimizes the estimated generalization error. Weigend,
Rumelhart, and Huberman (1991) iteratively update the decay constant during
training.

There are other important considerations for getting good results from
weight decay. You must either standardize the inputs and targets, or adjust
the penalty term for the standard deviations of all the inputs and targets.
It is usually a good idea to omit the biases from the penalty term.

A fundamental problem with weight decay is that different types of weights
in the network will usually require different decay constants for good
generalization. At the very least, you need three different decay constants
for input-to-hidden, hidden-to-hidden, and hidden-to-output weights.
Adjusting all these decay constants to produce the best estimated
generalization error often requires vast amounts of computation.

Fortunately, there is a superior alternative to weight decay: hierarchical
Bayesian learning. Bayesian learning makes it possible to estimate numerous
decay constants efficiently.

References:

Bartlett, P.L. (1997), "For valid generalization, the size of the weights
is more important than the size of the network," in Mozer, M.C., Jordan,
M.I., and Petsche, T., (eds.) Advances in Neural Information Processing
Systems 9, Cambridge, MA: The MIT Press, pp. 134-140.

Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
Oxford University Press.

Geman, S., Bienenstock, E., and Doursat, R. (1992), "Neural Networks and
the Bias/Variance Dilemma," Neural Computation, 4, 1-58.

Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge:
Cambridge University Press.

Weigend, A.S., Rumelhart, D.E., and Huberman, B.A. (1991),
"Generalization by weight-elimination with application to forecasting," in
Lippmann, R.P., Moody, J., and Touretzky, D.S. (eds.), Advances in Neural
Information Processing Systems 3, San Mateo, CA: Morgan Kaufmann.

```
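
The contrast drawn above between the sum-of-squared-weights penalty and weight elimination can be checked numerically. The following is a minimal sketch in plain Python, not code from the FAQ. The function names are illustrative, and the `decay` multiplier on the weight-elimination term is an assumption, since the formula above shows only the sum itself. Each gradient function returns the derivative of the penalty with respect to each weight, which is the extra "pull" toward zero that the penalty adds during gradient-based training.

```python
# Sketch of the two penalty terms and their gradients (names are illustrative).

def squared_weight_penalty(weights, decay):
    """Sum of squared weights times the decay constant."""
    return decay * sum(w * w for w in weights)


def squared_weight_gradient(weights, decay):
    """Derivative of the penalty w.r.t. each weight: 2 * decay * w_i."""
    return [2.0 * decay * w for w in weights]


def weight_elimination_penalty(weights, c, decay=1.0):
    """Sum over i of w_i^2 / (w_i^2 + c^2); the decay multiplier is an assumption."""
    return decay * sum(w * w / (w * w + c * c) for w in weights)


def weight_elimination_gradient(weights, c, decay=1.0):
    """Derivative w.r.t. each weight: 2 * decay * w_i * c^2 / (w_i^2 + c^2)^2."""
    return [2.0 * decay * w * c * c / (w * w + c * c) ** 2 for w in weights]


weights = [0.1, 1.0, 10.0]
decay_pull = squared_weight_gradient(weights, decay=0.5)
elimination_pull = weight_elimination_gradient(weights, c=1.0)

for w, g_decay, g_elim in zip(weights, decay_pull, elimination_pull):
    # Dividing by w gives the shrinkage relative to the size of the weight.
    print(f"w={w:5.1f}  decay: {g_decay / w:.4f}  elimination: {g_elim / w:.4f}")
```

With the squared penalty the pull is proportional to the weight, so the largest weights are reduced by the largest absolute amounts. With weight elimination the pull relative to the weight is close to 2/c^2 when |w| is much smaller than c and falls toward zero when |w| is much larger than c, so small weights are driven toward zero while large ones are left nearly untouched.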

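The procedure described above for choosing the decay constant (train with several amounts of decay, estimate the generalization error for each, keep the best) might look like the sketch below. Everything specific here is an assumption made for illustration: the synthetic data, the linear model, the candidate decay values, and the use of a held-out validation set as the estimate of generalization error. The inputs are drawn on a unit scale in place of an explicit standardization step, the targets are left unstandardized for brevity, and the bias is omitted from the penalty as the text recommends.

```python
import random


def predict(w, b, x):
    return b + sum(wj * xj for wj, xj in zip(w, x))


def fit_with_decay(xs, ys, decay, steps=2000, lr=0.05):
    """Gradient descent on mean squared error + decay * sum of squared weights."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            err = predict(w, b, x) - y
            grad_b += 2.0 * err / n
            for j in range(d):
                grad_w[j] += 2.0 * err * x[j] / n
        # Only the weights are penalized; the bias b is left out of the penalty.
        w = [wj - lr * (gj + 2.0 * decay * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def mean_squared_error(xs, ys, w, b):
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, rng):
    """Synthetic data: 5 standard-normal inputs, only the first two matter."""
    xs = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(n)]
    ys = [1.5 * x[0] - 0.5 * x[1] + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(15, rng)    # small training set
valid_x, valid_y = make_data(200, rng)   # held-out data to estimate generalization

candidates = [0.0, 0.01, 0.1, 1.0]
results = {}
for decay in candidates:
    w, b = fit_with_decay(train_x, train_y, decay)
    results[decay] = (mean_squared_error(valid_x, valid_y, w, b), w)

best_decay = min(results, key=lambda k: results[k][0])
for decay in candidates:
    print(f"decay={decay:<5} validation MSE={results[decay][0]:.3f}")
print("chosen decay constant:", best_decay)
```

Because this model is linear, the fit is a ridge regression, which matches the equivalence noted in the text. For a multilayer network the same loop applies, but each candidate requires a full training run, and using separate decay constants for different groups of weights multiplies the number of combinations to try.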
Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM