comp.ai.neural-nets FAQ, Part 3 of 7: Generalization
Section - How can generalization error be estimated?


There are many methods for estimating generalization error. 

Single-sample statistics: AIC, SBC, MDL, FPE, Mallows' C_p, etc. 
   In linear models, statistical theory provides several simple estimators
   of the generalization error under various sampling assumptions
   (Darlington 1968; Efron and Tibshirani 1993; Miller 1990). These
   estimators adjust the training error for the number of weights being
   estimated, and in some cases for the noise variance if that is known. 

   These statistics can also be used as crude estimates of the
   generalization error in nonlinear models when you have a "large" training
   set. Correcting these statistics for nonlinearity requires substantially
   more computation (Moody 1992), and the theory does not always hold for
   neural networks due to violations of the regularity conditions. 

   Among the simple generalization estimators that do not require the noise
   variance to be known, Schwarz's Bayesian Criterion (known as both SBC and
   BIC; Schwarz 1978; Judge et al. 1980; Raftery 1995) often works well for
   NNs (Sarle 1995, 1999). AIC and FPE tend to overfit with NNs. Rissanen's
   Minimum Description Length principle (MDL; Rissanen 1978, 1987, 1999) is
   closely related to SBC. A special issue of Computer Journal contains
   several articles on MDL, which can be found online at 
   http://www3.oup.co.uk/computer_journal/hdb/Volume_42/Issue_04/ 
   Several articles on SBC/BIC are available from the University of
   Washington's web site at http://www.stat.washington.edu/tech.reports
   (a small numerical sketch of these criteria appears at the end of this
   item).

   For classification problems, the formulas are not as simple as for
   regression with normal noise. See Efron (1986) regarding logistic
   regression. 
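
   As a purely illustrative sketch, the regression-case criteria above
   can be computed directly from the training error. The formulas below
   are common forms for regression with unknown noise variance (exact
   definitions vary slightly between references), and the data values
   are made up for the example.

      # n = number of training cases, p = number of estimated weights
      # (including biases), sse = sum of squared training errors.
      import math

      def fpe(sse, n, p):
          # Akaike's final prediction error
          return (sse / n) * (n + p) / (n - p)

      def aic(sse, n, p):
          # Akaike's information criterion (smaller is better)
          return n * math.log(sse / n) + 2.0 * p

      def sbc(sse, n, p):
          # Schwarz's Bayesian criterion (BIC); the penalty per weight
          # is log(n) rather than AIC's 2
          return n * math.log(sse / n) + p * math.log(n)

      # Hypothetical example: two nets fit to the same 100 cases
      for label, sse, p in [("small net", 52.0, 11),
                            ("large net", 40.0, 41)]:
          print(label, round(fpe(sse, 100, p), 3),
                round(aic(sse, 100, p), 1), round(sbc(sse, 100, p), 1))

   In this made-up example all three criteria prefer the smaller net: the
   larger net's reduction in training error does not pay for its extra 30
   weights, and SBC penalizes those weights more heavily than AIC does.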

Split-sample or hold-out validation. 
   The most commonly used method for estimating generalization error in
   neural networks is to reserve part of the data as a "test" set, which
   must not be used in any way during training. The test set must be a
   representative sample of the cases that you want to generalize to. After
   training, run the network on the test set; the resulting error is an
   unbiased estimate of the generalization error, provided that the test
   set was chosen randomly. The disadvantage of split-sample
   validation is that it reduces the amount of data available for both
   training and validation. See Weiss and Kulikowski (1991). 
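
   As a minimal sketch of the bookkeeping (assuming scikit-learn is
   available; MLPRegressor and the synthetic data below are only
   stand-ins for a real network and a real problem):

      import numpy as np
      from sklearn.neural_network import MLPRegressor

      rng = np.random.default_rng(0)
      X = rng.uniform(-1, 1, size=(200, 2))             # made-up inputs
      y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 200)

      # Reserve a random 25% of the cases as the test set; these cases
      # must not influence training in any way.
      idx = rng.permutation(len(X))
      test, train = idx[:50], idx[50:]

      net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000,
                         random_state=0)
      net.fit(X[train], y[train])

      # MSE on the held-out cases estimates the generalization error.
      test_mse = np.mean((net.predict(X[test]) - y[test]) ** 2)
      print("split-sample estimate of generalization MSE:", test_mse)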

Cross-validation (e.g., leave one out). 
   Cross-validation is an improvement on split-sample validation that allows
   you to use all of the data for training. The disadvantage of
   cross-validation is that you have to retrain the net many times. See 
   "What are cross-validation and bootstrapping?". 

Bootstrapping. 
   Bootstrapping is an improvement on cross-validation that often provides
   better estimates of generalization error at the cost of even more
   computing time. See "What are cross-validation and bootstrapping?". 
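
   One simple variant, sketched below with the same stand-ins as above
   (the helper name bootstrap_mse is made up), trains on resamples drawn
   with replacement and scores the cases left out of each resample;
   refinements such as Efron's .632 estimator (Efron and Tibshirani
   1993) adjust this further.

      import numpy as np
      from sklearn.neural_network import MLPRegressor

      def bootstrap_mse(X, y, n_boot=20, seed=0):
          # Average the error on cases omitted from each bootstrap
          # resample ("out-of-bag" cases).
          rng = np.random.default_rng(seed)
          n = len(X)
          errors = []
          for _ in range(n_boot):
              sample = rng.integers(0, n, size=n)       # with replacement
              out = np.setdiff1d(np.arange(n), sample)  # cases not drawn
              if len(out) == 0:                         # vanishingly rare
                  continue
              net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000,
                                 random_state=0)
              net.fit(X[sample], y[sample])
              errors.append(np.mean((net.predict(X[out]) - y[out]) ** 2))
          return float(np.mean(errors))

   Each call retrains the net n_boot times, which is the extra computing
   time referred to above.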

If you use any of the above methods to choose which of several different
networks to use for prediction purposes, the estimate of the generalization
error of the best network will be optimistic. For example, if you train
several networks using one data set, and use a second (validation set) data
set to decide which network is best, you must use a third (test set) data
set to obtain an unbiased estimate of the generalization error of the chosen
network. Hjorth (1994) explains how this principle extends to
cross-validation and bootstrapping. 
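
For illustration, here is a sketch of that three-way split, again with
MLPRegressor and made-up data standing in for real networks and a real
problem: several candidate architectures are trained on the training
set, the validation set picks the winner, and only the untouched test
set is used to estimate the winner's generalization error.

   import numpy as np
   from sklearn.neural_network import MLPRegressor

   rng = np.random.default_rng(2)
   X = rng.uniform(-1, 1, size=(300, 2))
   y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 300)

   idx = rng.permutation(300)
   train, valid, test = idx[:180], idx[180:240], idx[240:]

   def mse(net, rows):
       return np.mean((net.predict(X[rows]) - y[rows]) ** 2)

   # Train several candidate nets on the training set and let the
   # validation error choose among them.
   candidates = []
   for h in (2, 5, 10, 20):
       net = MLPRegressor(hidden_layer_sizes=(h,), max_iter=2000,
                          random_state=0)
       net.fit(X[train], y[train])
       candidates.append((mse(net, valid), h, net))

   best_valid_mse, best_h, best_net = min(candidates, key=lambda c: c[0])

   # The winner's validation error is optimistically biased because it
   # was used to choose the winner; the test set, untouched until now,
   # gives the unbiased estimate.
   print("chosen hidden units:", best_h)
   print("validation MSE of winner (optimistic):", best_valid_mse)
   print("test MSE of winner (unbiased):", mse(best_net, test))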

References: 

   Darlington, R.B. (1968), "Multiple Regression in Psychological Research
   and Practice," Psychological Bulletin, 69, 161-182. 

   Efron, B. (1986), "How biased is the apparent error rate of a prediction
   rule?" J. of the American Statistical Association, 81, 461-470. 

   Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap,
   London: Chapman & Hall. 

   Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods:
   Validation, Model Selection, and Bootstrap, London: Chapman & Hall. 

   Miller, A.J. (1990), Subset Selection in Regression, London: Chapman &
   Hall. 

   Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of
   Generalization and Regularization in Nonlinear Learning Systems", in
   Moody, J.E., Hanson, S.J., and Lippmann, R.P. (eds.), Advances in
   Neural Information Processing Systems 4, 847-854.

   Raftery, A.E. (1995), "Bayesian Model Selection in Social Research," in
   Marsden, P.V. (ed.), Sociological Methodology 1995, Cambridge, MA:
   Blackwell, ftp://ftp.stat.washington.edu/pub/tech.reports/ or 
   http://www.stat.washington.edu/tech.reports/bic.ps 

   Rissanen, J. (1978), "Modelling by shortest data description,"
   Automatica, 14, 465-471. 

   Rissanen, J. (1987), "Stochastic complexity" (with discussion), J. of the
   Royal Statistical Society, Series B, 49, 223-239. 

   Rissanen, J. (1999), "Hypothesis Selection and Testing by the MDL
   Principle," Computer Journal, 42, 260-269, 
   http://www3.oup.co.uk/computer_journal/hdb/Volume_42/Issue_04/ 

   Sarle, W.S. (1995), "Stopped Training and Other Remedies for
   Overfitting," Proceedings of the 27th Symposium on the Interface of
   Computing Science and Statistics, 352-360, 
   ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large
   compressed postscript file, 747K, 10 pages) 

   Sarle, W.S. (1999), "Donoho-Johnstone Benchmarks: Neural Net Results," 
   ftp://ftp.sas.com/pub/neural/dojo/dojo.html 

   Weiss, S.M. & Kulikowski, C.A. (1991), Computer Systems That Learn,
   Morgan Kaufmann. 

Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)




