FAQ, Part 3 of 7: Generalization
Section - What is overfitting and how can I avoid it?


Top Document: FAQ, Part 3 of 7: Generalization
Previous Document: How does noise affect generalization?
Next Document: What is jitter? (Training with noise)

The critical issue in developing a neural network is generalization: how
well will the network make predictions for cases that are not in the
training set? NNs, like other flexible nonlinear estimation methods such as
kernel regression and smoothing splines, can suffer from either underfitting
or overfitting. A network that is not sufficiently complex can fail to
detect fully the signal in a complicated data set, leading to underfitting.
A network that is too complex may fit the noise, not just the signal,
leading to overfitting. Overfitting is especially dangerous because, with
many of the common types of NNs, it can easily lead to predictions that are
far beyond the range of the training data. Overfitting can also produce wild
predictions in multilayer perceptrons even with noise-free data. 

For an elementary discussion of overfitting, see Smith (1996). For a more
rigorous approach, see the article by Geman, Bienenstock, and Doursat (1992)
on the bias/variance trade-off (it's not really a dilemma). We are talking
about statistical bias here: the difference between the average value of an
estimator and the correct value. Underfitting produces excessive bias in the
outputs, whereas overfitting produces excessive variance. There are
graphical examples of overfitting and underfitting in Sarle (1995, 1999). 
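
As a rough illustration of the definition of bias given above, the
following Python sketch simulates two deliberately extreme estimators
rather than real NNs: one that predicts the mean of the training targets
everywhere (too simple), and one that memorizes the noisy target at each
training input (too flexible). The sine signal, the noise level, and all
names are assumptions chosen for this example only.

```python
import math
import random


def simulate(n_reps=2000, n_cases=20, noise_sd=0.5, seed=0):
    """Estimate bias and variance of two toy estimators at one input.

    Bias is the average prediction minus the correct value; variance is
    the spread of the prediction across repeated noisy training sets.
    """
    rng = random.Random(seed)
    xs = [2 * math.pi * i / n_cases for i in range(n_cases)]
    signal = [math.sin(x) for x in xs]
    i0 = n_cases // 4              # evaluate near x = pi/2, where signal = 1
    truth = signal[i0]

    under, over = [], []
    for _ in range(n_reps):
        targets = [s + rng.gauss(0, noise_sd) for s in signal]
        under.append(sum(targets) / n_cases)   # constant model: mean target
        over.append(targets[i0])               # memorizer: stored noisy target

    def bias_and_variance(preds):
        mean = sum(preds) / len(preds)
        var = sum((p - mean) ** 2 for p in preds) / len(preds)
        return mean - truth, var

    return {"underfit": bias_and_variance(under),
            "overfit": bias_and_variance(over)}


for name, (bias, var) in simulate().items():
    print(f"{name:9s} bias = {bias:+.3f}   variance = {var:.4f}")
```

With these settings the constant model should show a bias near -1 with a
small variance, and the memorizer a bias near zero with a variance close to
the noise variance.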

The best way to avoid overfitting is to use lots of training data. If you
have at least 30 times as many training cases as there are weights in the
network, you are unlikely to suffer from much overfitting, although you may
get some slight overfitting no matter how large the training set is. For
noise-free data, 5 times as many training cases as weights may be
sufficient. But you can't arbitrarily reduce the number of weights, because
too few weights leads to underfitting. 
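
To make the rule of thumb concrete, here is a small Python sketch that
counts the weights (including biases) in a fully connected feedforward
network and compares that count with the number of training cases. The
function names and the layer-size list are illustrative assumptions; the
factors 30 and 5 are the ones quoted in the paragraph above.

```python
def count_weights(layer_sizes, bias=True):
    """Weights in a fully connected feedforward net, e.g. [inputs, hidden, outputs]."""
    extra = 1 if bias else 0
    return sum((n_in + extra) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def enough_cases(n_cases, layer_sizes, noise_free=False):
    """Apply the rule of thumb: 30 cases per weight, or 5 if targets are noise-free."""
    factor = 5 if noise_free else 30
    return n_cases >= factor * count_weights(layer_sizes)


layers = [10, 5, 1]                      # 10 inputs, 5 hidden units, 1 output
print(count_weights(layers))             # (10+1)*5 + (5+1)*1 = 61
print(enough_cases(1000, layers))        # 1000 < 30*61 = 1830, so False
print(enough_cases(1000, layers, noise_free=True))   # 1000 >= 5*61 = 305, so True
```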

Given a fixed amount of training data, there are at least six approaches to
avoiding underfitting and overfitting, and hence getting good generalization:

 o Model selection 
 o Jittering 
 o Early stopping 
 o Weight decay 
 o Bayesian learning 
 o Combining networks 

The first five approaches are based on well-understood theory. Methods for
combining networks do not have such a sound theoretical basis but are the
subject of current research. These six approaches are discussed in more
detail under subsequent questions. 

The complexity of a network is related to both the number of weights and the
size of the weights. Model selection is concerned with the number of
weights, and hence the number of hidden units and layers. The more weights
there are, relative to the number of training cases, the more overfitting
amplifies noise in the targets (Moody 1992). The other approaches listed
above are concerned, directly or indirectly, with the size of the weights.
Reducing the size of the weights reduces the "effective" number of
weights--see Moody (1992) regarding weight decay and Weigend (1994)
regarding early stopping. Bartlett (1997) obtained learning-theory results
in which generalization error is related to the L_1 norm of the weights
instead of the VC dimension. 
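
For the special case of a linear model trained with quadratic weight decay,
the "effective" number of weights has a simple closed form: each eigenvalue
k of the unpenalized error Hessian contributes k/(k + decay) instead of 1.
The Python sketch below evaluates that sum. The eigenvalues are made-up
numbers, the decay constant is assumed to be on the same scale as the
eigenvalues, and this is only the linear-case form associated with Moody
(1992), not a general result for multilayer networks.

```python
def effective_parameters(eigenvalues, decay):
    """Sum of k / (k + decay) over positive Hessian eigenvalues k."""
    if decay < 0:
        raise ValueError("decay must be non-negative")
    if any(k <= 0 for k in eigenvalues):
        raise ValueError("eigenvalues are assumed to be positive")
    return sum(k / (k + decay) for k in eigenvalues)


eigs = [4.0, 1.0, 0.25]                  # hypothetical eigenvalues, 3 weights
for decay in (0.0, 0.1, 1.0, 10.0):
    print(decay, round(effective_parameters(eigs, decay), 3))
```

With no decay every weight counts fully (3 here); with decay 1.0 the three
weights count as about 1.5. Directions with small eigenvalues are
suppressed first.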

Overfitting is not confined to NNs with hidden units. Overfitting can occur
in generalized linear models (networks with no hidden units) if either or
both of the following conditions hold: 

1. The number of input variables (and hence the number of weights) is large
   with respect to the number of training cases. Typically you would want at
   least 10 times as many training cases as input variables, but with
   noise-free targets, twice as many training cases as input variables would
   be more than adequate. These requirements are smaller than those stated
   above for networks with hidden layers, because hidden layers are prone to
   creating ill-conditioning and other pathologies. 

2. The input variables are highly correlated with each other. This condition
   is called "multicollinearity" in the statistical literature.
   Multicollinearity can cause the weights to become extremely large because
   of numerical ill-conditioning--see "How does ill-conditioning affect NN
   training?"

Methods for dealing with these problems in the statistical literature
include ridge regression (similar to weight decay), partial least squares
(similar to early stopping), and various methods with even stranger names,
such as the lasso and garotte (van Houwelingen and le Cessie, ????). 
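
The effect of multicollinearity on the weights, and the way a ridge penalty
tames it, can be sketched with two nearly identical inputs and no hidden
units. This Python example is a toy under stated assumptions: synthetic
data, no intercept term, true weights of (1, 1), and a hand-picked ridge
constant of 1.0. The helper name is hypothetical.

```python
import random


def fit_two_inputs(x1, x2, y, ridge=0.0):
    """Solve the 2x2 (optionally penalized) normal equations, no intercept."""
    a = sum(u * u for u in x1) + ridge
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + ridge
    c1 = sum(u * t for u, t in zip(x1, y))
    c2 = sum(v * t for v, t in zip(x2, y))
    det = a * d - b * b          # close to zero when x1 and x2 are collinear
    return ((d * c1 - b * c2) / det, (a * c2 - b * c1) / det)


rng = random.Random(0)
x1 = [rng.gauss(0, 1) for _ in range(50)]
x2 = [u + rng.gauss(0, 0.01) for u in x1]            # almost a copy of x1
y = [u + v + rng.gauss(0, 0.5) for u, v in zip(x1, x2)]

w_ols = fit_two_inputs(x1, x2, y)                    # ordinary least squares
w_ridge = fit_two_inputs(x1, x2, y, ridge=1.0)       # ridge regression
print("least squares:", w_ols)
print("ridge:        ", w_ridge)
```

The least-squares weights will typically land far from (1, 1), often with
opposite signs, even though their sum stays near 2. The ridge weights are
pulled toward each other and never have a larger Euclidean norm than the
least-squares weights.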


   Bartlett, P.L. (1997), "For valid generalization, the size of the weights
   is more important than the size of the network," in Mozer, M.C., Jordan,
   M.I., and Petsche, T., (eds.) Advances in Neural Information Processing
   Systems 9, Cambridge, MA: The MIT Press, pp. 134-140. 

   Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
   the Bias/Variance Dilemma", Neural Computation, 4, 1-58. 

   Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of
   Generalization and Regularization in Nonlinear Learning Systems", in
   Moody, J.E., Hanson, S.J., and Lippmann, R.P., (eds.) Advances in Neural
   Information Processing Systems 4, 847-854. 

   Sarle, W.S. (1995), "Stopped Training and Other Remedies for
   Overfitting," Proceedings of the 27th Symposium on the Interface of
   Computing Science and Statistics, 352-360, (this is a very large
   compressed postscript file, 747K, 10 pages) 

   Sarle, W.S. (1999), "Donoho-Johnstone Benchmarks: Neural Net Results," 

   Smith, M. (1996). Neural Networks for Statistical Modeling, Boston:
   International Thomson Computer Press, ISBN 1-850-32842-0.

   van Houwelingen,H.C., and le Cessie, S. (????), "Shrinkage and penalized
   likelihood as methods to improve predictive accuracy," and 

   Weigend, A. (1994), "On overfitting and the effective number of hidden
   units," Proceedings of the 1993 Connectionist Models Summer School,



Send corrections/additions to the FAQ Maintainer: (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM