Top Document: comp.ai.neuralnets FAQ, Part 3 of 7: Generalization
Previous Document: How can generalization error be estimated?
Next Document: How to compute prediction and confidence

Cross-validation and bootstrapping are both methods for estimating generalization error based on "resampling" (Weiss and Kulikowski 1991; Efron and Tibshirani 1993; Hjorth 1994; Plutowski, Sakata, and White 1994; Shao and Tu 1995). The resulting estimates of generalization error are often used for choosing among various models, such as different network architectures.

Cross-validation
++++++++++++++++

In k-fold cross-validation, you divide the data into k subsets of (approximately) equal size. You train the net k times, each time leaving out one of the subsets from training, but using only the omitted subset to compute whatever error criterion interests you. If k equals the sample size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a more elaborate and expensive version of cross-validation that involves leaving out all possible subsets of v cases.

Note that cross-validation is quite different from the "split-sample" or "hold-out" method that is commonly used for early stopping in NNs. In the split-sample method, only a single subset (the validation set) is used to estimate the generalization error, instead of k different subsets; i.e., there is no "crossing". While various people have suggested that cross-validation be applied to early stopping, the proper way of doing so is not obvious.

The distinction between cross-validation and split-sample validation is extremely important because cross-validation is markedly superior for small data sets; this fact is demonstrated dramatically by Goutte (1997) in a reply to Zhu and Rohwer (1996). For an insightful discussion of the limitations of cross-validatory choice among several learning methods, see Stone (1977).

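The k-fold procedure described above can be sketched in a few lines of Python. This is only a minimal illustration, not code from the FAQ: the function names (k_fold_indices, cross_validate, fit_line, predict_line) are hypothetical, a one-input least-squares line stands in for training a net, and the data are synthetic. Setting k equal to the sample size gives leave-one-out cross-validation.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and deal it into k folds of (approximately) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k, seed=0):
    """Return the k-fold cross-validation estimate of the mean squared error."""
    folds = k_fold_indices(len(xs), k, seed)
    total_sq_err = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Train on everything except the omitted fold.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        # Only the omitted fold contributes to the error criterion.
        total_sq_err += sum((predict(model, xs[i]) - ys[i]) ** 2 for i in held_out)
    return total_sq_err / len(xs)

def fit_line(xs, ys):
    """Least squares for y = a + b*x (an assumed stand-in for training a net)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return (my - b * mx, b)

def predict_line(model, x):
    a, b = model
    return a + b * x

# Synthetic data (assumption): a noisy straight line.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

cv10 = cross_validate(xs, ys, fit_line, predict_line, k=10)
loo = cross_validate(xs, ys, fit_line, predict_line, k=len(xs))  # leave-one-out
```

Passing fit and predict as arguments keeps the resampling loop independent of the model, so the same loop could in principle wrap any training routine.
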
Jackknifing
+++++++++++

Leave-one-out cross-validation is also easily confused with jackknifing. Both involve omitting each training case in turn and retraining the network on the remaining subset. But cross-validation is used to estimate generalization error, while the jackknife is used to estimate the bias of a statistic. In the jackknife, you compute some statistic of interest in each subset of the data. The average of these subset statistics is compared with the corresponding statistic computed from the entire sample in order to estimate the bias of the latter. You can also get a jackknife estimate of the standard error of a statistic. Jackknifing can be used to estimate the bias of the training error and hence to estimate the generalization error, but this process is more complicated than leave-one-out cross-validation (Efron, 1982; Ripley, 1996, p. 73).

Choice of cross-validation method
+++++++++++++++++++++++++++++++++

Cross-validation can be used simply to estimate the generalization error of a given model, or it can be used for model selection by choosing one of several models that has the smallest estimated generalization error. For example, you might use cross-validation to choose the number of hidden units, or you could use cross-validation to choose a subset of the inputs (subset selection). A subset that contains all relevant inputs will be called a "good" subset, while the subset that contains all relevant inputs but no others will be called the "best" subset. Note that subsets are "good" and "best" in an asymptotic sense (as the number of training cases goes to infinity). With a small training set, it is possible that a subset that is smaller than the "best" subset may provide better generalization error.

Leave-one-out cross-validation often works well for estimating generalization error for continuous error functions such as the mean squared error, but it may perform poorly for discontinuous error functions such as the number of misclassified cases.
In the latter case, k-fold cross-validation is preferred. But if k gets too small, the error estimate is pessimistically biased because of the difference in training-set size between the full-sample analysis and the cross-validation analyses. (For model-selection purposes, this bias can actually help; see the discussion below of Shao, 1993.) A value of 10 for k is popular for estimating generalization error.

Leave-one-out cross-validation can also run into trouble with various model-selection methods. Again, one problem is lack of continuity: a small change in the data can cause a large change in the model selected (Breiman, 1996). For choosing subsets of inputs in linear regression, Breiman and Spector (1992) found 10-fold and 5-fold cross-validation to work better than leave-one-out. Kohavi (1995) also obtained good results for 10-fold cross-validation with empirical decision trees (C4.5). Values of k as small as 5 or even 2 may work even better if you analyze several different random k-way splits of the data to reduce the variability of the cross-validation estimate.

Leave-one-out cross-validation also has more subtle deficiencies for model selection. Shao (1995) showed that in linear models, leave-one-out cross-validation is asymptotically equivalent to AIC (and Mallows' C_p), but leave-v-out cross-validation is asymptotically equivalent to Schwarz's Bayesian criterion (called SBC or BIC) when v = n[1-1/(log(n)-1)], where n is the number of training cases. SBC provides consistent subset-selection, while AIC does not. That is, SBC will choose the "best" subset with probability approaching one as the size of the training set goes to infinity. AIC has an asymptotic probability of one of choosing a "good" subset, but less than one of choosing the "best" subset (Stone, 1979). Many simulation studies have also found that AIC overfits badly in small samples, and that SBC works well (e.g., Hurvich and Tsai, 1989; Shao and Tu, 1995).
Hence, these results suggest that leave-one-out cross-validation should overfit in small samples, but leave-v-out cross-validation with appropriate v should do better. However, when true models have an infinite number of parameters, SBC is not efficient, and other criteria that are asymptotically efficient but not consistent for model selection may produce better generalization (Hurvich and Tsai, 1989).

Shao (1993) obtained the surprising result that for selecting subsets of inputs in a linear regression, the probability of selecting the "best" subset does not converge to 1 (as the sample size n goes to infinity) for leave-v-out cross-validation unless the proportion v/n approaches 1. At first glance, Shao's result seems inconsistent with the analysis by Kearns (1997) of split-sample validation, which shows that the best generalization is obtained with v/n strictly between 0 and 1, with little sensitivity to the precise value of v/n for large data sets. But the apparent conflict is due to the fundamentally different properties of cross-validation and split-sample validation.

To obtain an intuitive understanding of Shao (1993), let's review some background material on generalization error. Generalization error can be broken down into three additive parts: noise variance + estimation variance + squared estimation bias. Noise variance is the same for all subsets of inputs. Bias is nonzero for subsets that are not "good", but it's zero for all "good" subsets, since we are assuming that the function to be learned is linear. Hence the generalization error of "good" subsets will differ only in the estimation variance. The estimation variance is (2p/t)s^2, where p is the number of inputs in the subset, t is the training set size, and s^2 is the noise variance. The "best" subset is better than other "good" subsets only because the "best" subset has (by definition) the smallest value of p.
But the t in the denominator means that differences in generalization error among the "good" subsets will all go to zero as t goes to infinity. Therefore it is difficult to guess which subset is "best" based on the generalization error even when t is very large. It is well known that unbiased estimates of the generalization error, such as those based on AIC, FPE, and C_p, do not produce consistent estimates of the "best" subset (e.g., see Stone, 1979).

In leave-v-out cross-validation, t=n-v. The differences of the cross-validation estimates of generalization error among the "good" subsets contain a factor 1/t, not 1/n. Therefore by making t small enough (and thereby making each regression based on t cases bad enough), we can make the differences of the cross-validation estimates large enough to detect. It turns out that to make t small enough to guess the "best" subset consistently, we have to have t/n go to 0 as n goes to infinity.

The crucial distinction between cross-validation and split-sample validation is that with cross-validation, after guessing the "best" subset, we train the linear regression model for that subset using all n cases, but with split-sample validation, only t cases are ever used for training. If our main purpose were really to choose the "best" subset, I suspect we would still have to have t/n go to 0 even for split-sample validation. But choosing the "best" subset is not the same thing as getting the best generalization. If we are more interested in getting good generalization than in choosing the "best" subset, we do not want to make our regression estimate based on only t cases as bad as we do in cross-validation, because in split-sample validation that bad regression estimate is what we're stuck with. So there is no conflict between Shao and Kearns, but there is a conflict between the two goals of choosing the "best" subset and getting the best generalization in split-sample validation.

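The role of the training-set size t = n - v in leave-v-out cross-validation for subset selection can be explored with a small simulation. The sketch below is an illustration under stated assumptions, not Shao's procedure: instead of enumerating all possible subsets of v cases it averages over a modest number of random splits (a Monte Carlo approximation), the data are synthetic with a single relevant input, the regression has no intercept, and all names (solve, fit_ols, leave_v_out_error) are hypothetical.

```python
import random

def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, ys):
    """Least squares without intercept via the normal equations X'X w = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)

def leave_v_out_error(x, y, subset, t, n_splits=200, seed=0):
    """Mean squared error on the v = n - t omitted cases, over random splits."""
    n = len(y)
    rng = random.Random(seed)  # same seed, so every subset sees the same splits
    total, count = 0.0, 0
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        train, held = idx[:t], idx[t:]
        w = fit_ols([[x[i][j] for j in subset] for i in train],
                    [y[i] for i in train])
        for i in held:
            pred = sum(wj * x[i][j] for wj, j in zip(w, subset))
            total += (pred - y[i]) ** 2
            count += 1
    return total / count

# Synthetic data (assumption): only input 0 is relevant; noise variance is 1.
data_rng = random.Random(42)
n = 60
x = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
y = [3.0 * row[0] + data_rng.gauss(0, 1) for row in x]

# (0,) is the "best" subset; (0, 1) and (0, 1, 2) are "good"; (1, 2) is not.
candidates = [(0,), (0, 1), (0, 1, 2), (1, 2)]
errors = {s: leave_v_out_error(x, y, s, t=15) for s in candidates}
chosen = min(errors, key=errors.get)
```

Re-running with a larger t (for example t=50) lets you compare how far apart the estimated errors of the "good" subsets are at each training-set size; with only a few hundred random splits the comparison is noisy, so treat any single run as suggestive only.
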
Bootstrapping
+++++++++++++

Bootstrapping seems to work better than cross-validation in many cases (Efron, 1983). In the simplest form of bootstrapping, instead of repeatedly analyzing subsets of the data, you repeatedly analyze subsamples of the data. Each subsample is a random sample with replacement from the full sample. Depending on what you want to do, anywhere from 50 to 2000 subsamples might be used. There are many more sophisticated bootstrap methods that can be used not only for estimating generalization error but also for estimating confidence bounds for network outputs (Efron and Tibshirani 1993). For estimating generalization error in classification problems, the .632+ bootstrap (an improvement on the popular .632 bootstrap) is one of the currently favored methods that has the advantage of performing well even when there is severe overfitting. Use of bootstrapping for NNs is described in Baxt and White (1995), Tibshirani (1996), and Masters (1995). However, the results obtained so far are not very thorough, and it is known that bootstrapping does not work well for some other methodologies such as empirical decision trees (Breiman, Friedman, Olshen, and Stone, 1984; Kohavi, 1995), for which it can be excessively optimistic.

For further information
+++++++++++++++++++++++

Cross-validation and bootstrapping become considerably more complicated for time series data; see Hjorth (1994) and Snijders (1988).

More information on jackknife and bootstrap confidence intervals is available at ftp://ftp.sas.com/pub/neural/jackboot.sas (this is a plain-text file).

References:

Baxt, W.G. and White, H. (1995), "Bootstrapping confidence intervals for clinical input variable effects in a network trained to identify the presence of acute myocardial infarction," Neural Computation, 7, 624-638.

Breiman, L. (1996), "Heuristics of instability and stabilization in model selection," Annals of Statistics, 24, 2350-2383.

Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J.
(1984), Classification and Regression Trees, Belmont, CA: Wadsworth.

Breiman, L., and Spector, P. (1992), "Submodel selection and evaluation in regression: The X-random case," International Statistical Review, 60, 291-319.

Dijkstra, T.K., ed. (1988), On Model Uncertainty and Its Statistical Implications, Proceedings of a workshop held in Groningen, The Netherlands, September 25-26, 1986, Berlin: Springer-Verlag.

Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling Plans, Philadelphia: SIAM.

Efron, B. (1983), "Estimating the error rate of a prediction rule: Improvement on cross-validation," J. of the American Statistical Association, 78, 316-331.

Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, London: Chapman & Hall.

Efron, B. and Tibshirani, R.J. (1997), "Improvements on cross-validation: The .632+ bootstrap method," J. of the American Statistical Association, 92, 548-560.

Goutte, C. (1997), "Note on free lunches and cross-validation," Neural Computation, 9, 1211-1215, ftp://eivind.imm.dtu.dk/dist/1997/goutte.nflcv.ps.gz.

Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods: Validation, Model Selection, and Bootstrap, London: Chapman & Hall.

Hurvich, C.M., and Tsai, C.-L. (1989), "Regression and time series model selection in small samples," Biometrika, 76, 297-307.

Kearns, M. (1997), "A bound on the error of cross validation using the approximation and estimation rates, with consequences for the training-test split," Neural Computation, 9, 1143-1161.

Kohavi, R. (1995), "A study of cross-validation and bootstrap for accuracy estimation and model selection," International Joint Conference on Artificial Intelligence (IJCAI), pp. ?, http://robotics.stanford.edu/users/ronnyk/

Masters, T. (1995), Advanced Algorithms for Neural Networks: A C++ Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0.

Plutowski, M., Sakata, S., and White, H. (1994), "Cross-validation estimates IMSE," in Cowan, J.D., Tesauro, G., and Alspector, J. (eds.)
Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan Kaufmann, pp. 391-398.

Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

Shao, J. (1993), "Linear model selection by cross-validation," J. of the American Statistical Association, 88, 486-494.

Shao, J. (1995), "An asymptotic theory for linear model selection," Statistica Sinica ?.

Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, New York: Springer-Verlag.

Snijders, T.A.B. (1988), "On cross-validation for predictor evaluation in time series," in Dijkstra (1988), pp. 56-69.

Stone, M. (1977), "Asymptotics for and against cross-validation," Biometrika, 64, 29-35.

Stone, M. (1979), "Comments on model selection criteria of Akaike and Schwarz," J. of the Royal Statistical Society, Series B, 41, 276-278.

Tibshirani, R. (1996), "A comparison of some error estimates for neural network models," Neural Computation, 8, 152-163.

Weiss, S.M. and Kulikowski, C.A. (1991), Computer Systems That Learn, Morgan Kaufmann.

Zhu, H., and Rohwer, R. (1996), "No free lunch for cross-validation," Neural Computation, 8, 1421-1426.

Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
