Top Document: comp.ai.neuralnets FAQ, Part 3 of 7: Generalization
Previous Document: How many hidden units should I use?
Next Document: What are cross-validation and bootstrapping?

There are many methods for estimating generalization error.

Single-sample statistics: AIC, SBC, MDL, FPE, Mallows' C_p, etc.
   In linear models, statistical theory provides several simple estimators of the generalization error under various sampling assumptions (Darlington 1968; Efron and Tibshirani 1993; Miller 1990). These estimators adjust the training error for the number of weights being estimated, and in some cases for the noise variance if that is known.

   These statistics can also be used as crude estimates of the generalization error in nonlinear models when you have a "large" training set. Correcting these statistics for nonlinearity requires substantially more computation (Moody 1992), and the theory does not always hold for neural networks due to violations of the regularity conditions.

   Among the simple generalization estimators that do not require the noise variance to be known, Schwarz's Bayesian Criterion (known as both SBC and BIC; Schwarz 1978; Judge et al. 1980; Raftery 1995) often works well for NNs (Sarle 1995, 1999). AIC and FPE tend to overfit with NNs. Rissanen's Minimum Description Length principle (MDL; Rissanen 1978, 1987, 1999) is closely related to SBC. A special issue of Computer Journal contains several articles on MDL, which can be found online at http://www3.oup.co.uk/computer_journal/hdb/Volume_42/Issue_04/ Several articles on SBC/BIC are available at the University of Washington's web site at http://www.stat.washington.edu/tech.reports

   For classification problems, the formulas are not as simple as for regression with normal noise. See Efron (1986) regarding logistic regression.

Split-sample or hold-out validation.
   The most commonly used method for estimating generalization error in neural networks is to reserve part of the data as a "test" set, which must not be used in any way during training. The test set must be a representative sample of the cases that you want to generalize to. After training, run the network on the test set; the error on the test set provides an unbiased estimate of the generalization error, provided that the test set was chosen randomly. The disadvantage of split-sample validation is that it reduces the amount of data available for both training and validation. See Weiss and Kulikowski (1991).

Cross-validation (e.g., leave one out).
   Cross-validation is an improvement on split-sample validation that allows you to use all of the data for training. The disadvantage of cross-validation is that you have to retrain the net many times. See "What are cross-validation and bootstrapping?".

Bootstrapping.
   Bootstrapping is an improvement on cross-validation that often provides better estimates of generalization error at the cost of even more computing time. See "What are cross-validation and bootstrapping?".

If you use any of the above methods to choose which of several different networks to use for prediction purposes, the estimate of the generalization error of the best network will be optimistic. For example, if you train several networks using one data set, and use a second (validation set) data set to decide which network is best, you must use a third (test set) data set to obtain an unbiased estimate of the generalization error of the chosen network. Hjorth (1994) explains how this principle extends to cross-validation and bootstrapping.

References:

Darlington, R.B. (1968), "Multiple Regression in Psychological Research and Practice," Psychological Bulletin, 69, 161-182.

Efron, B. (1986), "How biased is the apparent error rate of a prediction rule?" J. of the American Statistical Association, 81, 461-470.

Efron, B. and Tibshirani, R.J.
(1993), An Introduction to the Bootstrap, London: Chapman & Hall.

Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods: Validation, Model Selection, and Bootstrap, London: Chapman & Hall.

Miller, A.J. (1990), Subset Selection in Regression, London: Chapman & Hall.

Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems," in Moody, J.E., Hanson, S.J., and Lippmann, R.P., Advances in Neural Information Processing Systems 4, 847-854.

Raftery, A.E. (1995), "Bayesian Model Selection in Social Research," in Marsden, P.V. (ed.), Sociological Methodology 1995, Cambridge, MA: Blackwell, ftp://ftp.stat.washington.edu/pub/tech.reports/ or http://www.stat.washington.edu/tech.reports/bic.ps

Rissanen, J. (1978), "Modelling by shortest data description," Automatica, 14, 465-471.

Rissanen, J. (1987), "Stochastic complexity" (with discussion), J. of the Royal Statistical Society, Series B, 49, 223-239.

Rissanen, J. (1999), "Hypothesis Selection and Testing by the MDL Principle," Computer Journal, 42, 260-269, http://www3.oup.co.uk/computer_journal/hdb/Volume_42/Issue_04/

Sarle, W.S. (1995), "Stopped Training and Other Remedies for Overfitting," Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 352-360, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large compressed postscript file, 747K, 10 pages)

Sarle, W.S. (1999), "Donoho-Johnstone Benchmarks: Neural Net Results," ftp://ftp.sas.com/pub/neural/dojo/dojo.html

Weiss, S.M. & Kulikowski, C.A. (1991), Computer Systems That Learn, Morgan Kaufmann.
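
Example (illustrative only):

The split-sample and cross-validation estimates described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a straight-line least-squares fit stands in for training a network, and the function names (fit_line, mse, holdout_error, kfold_error), the default split sizes, and the synthetic data are hypothetical choices, not part of the FAQ.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (a stand-in for training a network)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given cases."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def holdout_error(xs, ys, test_fraction=0.25, seed=0):
    """Split-sample estimate: train on one part, measure error on the rest."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)  # test cases are chosen at random
    n_test = max(1, int(len(idx) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    model = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return mse(model, [xs[i] for i in test], [ys[i] for i in test])


def kfold_error(xs, ys, k=5, seed=0):
    """k-fold cross-validation: each case is held out exactly once."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    squared_error = 0.0
    for fold in range(k):
        test = idx[fold::k]  # every k-th shuffled index forms one fold
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared_error += sum((a * xs[i] + b - ys[i]) ** 2 for i in test)
    return squared_error / len(xs)


# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
rng = random.Random(1)
xs = [i / 10 for i in range(40)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]
print("hold-out MSE: ", holdout_error(xs, ys))
print("5-fold CV MSE:", kfold_error(xs, ys))
```

Setting k equal to the number of cases gives leave-one-out cross-validation. Note that the hold-out version fits the model on only part of the data, while the cross-validation version refits it k times so that every case contributes to both training and validation.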
Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM

1. What is overfitting and how can I avoid it?
2. How many hidden layers should I use?
3. How many hidden units should I use?
4. What literature proves that one hidden layer is sufficient for the large majority of problems?