FAQ, Part 3 of 7: Generalization
Section - What are cross-validation and bootstrapping?


Top Document: FAQ, Part 3 of 7: Generalization
Previous Document: How can generalization error be estimated?
Next Document: How to compute prediction and confidence

Cross-validation and bootstrapping are both methods for estimating
generalization error based on "resampling" (Weiss and Kulikowski 1991; Efron
and Tibshirani 1993; Hjorth 1994; Plutowski, Sakata, and White 1994; Shao
and Tu 1995). The resulting estimates of generalization error are often used
for choosing among various models, such as different network architectures. 

Cross-validation


In k-fold cross-validation, you divide the data into k subsets of
(approximately) equal size. You train the net k times, each time leaving
out one of the subsets from training, but using only the omitted subset to
compute whatever error criterion interests you. If k equals the sample
size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a
more elaborate and expensive version of cross-validation that involves
leaving out all possible subsets of v cases. 
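
The k-fold procedure described above can be sketched as follows. This is only an illustration under stated assumptions: an ordinary least-squares fit stands in for "the net", the error criterion is mean squared error, and the names k_fold_indices and k_fold_mse and the synthetic data are made up for the example.

```python
import numpy as np

def k_fold_indices(n, k, rng):
    """Split a random permutation of range(n) into k nearly equal folds."""
    return np.array_split(rng.permutation(n), k)

def k_fold_mse(X, y, k, rng):
    """k-fold cross-validation estimate of mean squared error.

    Assumption: a least-squares fit is used in place of a neural net.
    """
    folds = k_fold_indices(len(y), k, rng)
    sq_err_sum = 0.0
    for i, test in enumerate(folds):
        # Train on every fold except fold i.
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # Compute the error criterion on the omitted fold only.
        sq_err_sum += np.sum((y[test] - X[test] @ coef) ** 2)
    return sq_err_sum / len(y)

# Synthetic data (hypothetical): linear target plus noise of variance 0.25.
rng = np.random.default_rng(0)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

cv10 = k_fold_mse(X, y, k=10, rng=rng)
loo = k_fold_mse(X, y, k=n, rng=rng)   # k equal to the sample size: leave-one-out
print(cv10, loo)
```

Setting k equal to the sample size turns the same function into leave-one-out cross-validation, at the cost of n separate fits.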

Note that cross-validation is quite different from the "split-sample" or
"hold-out" method that is commonly used for early stopping in NNs. In the
split-sample method, only a single subset (the validation set) is used to
estimate the generalization error, instead of k different subsets; i.e.,
there is no "crossing". While various people have suggested that
cross-validation be applied to early stopping, the proper way of doing so is
not obvious. 

The distinction between cross-validation and split-sample validation is
extremely important because cross-validation is markedly superior for small
data sets; this fact is demonstrated dramatically by Goutte (1997) in a
reply to Zhu and Rohwer (1996). For an insightful discussion of the
limitations of cross-validatory choice among several learning methods, see
Stone (1977). 

Jackknifing


Leave-one-out cross-validation is also easily confused with jackknifing.
Both involve omitting each training case in turn and retraining the network
on the remaining subset. But cross-validation is used to estimate
generalization error, while the jackknife is used to estimate the bias of a
statistic. In the jackknife, you compute some statistic of interest in each
subset of the data. The average of these subset statistics is compared with
the corresponding statistic computed from the entire sample in order to
estimate the bias of the latter. You can also get a jackknife estimate of
the standard error of a statistic. Jackknifing can be used to estimate the
bias of the training error and hence to estimate the generalization error,
but this process is more complicated than leave-one-out cross-validation
(Efron, 1982; Ripley, 1996, p. 73). 
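
A minimal sketch of the jackknife for a simple statistic may make the contrast with cross-validation clearer. It uses the standard jackknife formulas, bias = (n-1)(mean of the leave-one-out statistics - full-sample statistic) and the corresponding standard-error formula; the function name jackknife and the data values are hypothetical.

```python
import numpy as np

def jackknife(x, statistic):
    """Return (bias estimate, standard-error estimate) for statistic(x)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    full = statistic(x)
    # Recompute the statistic with each case omitted in turn.
    loo = np.array([statistic(np.delete(x, i)) for i in range(n)])
    # Compare the average subset statistic with the full-sample statistic.
    bias = (n - 1) * (loo.mean() - full)
    se = np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))
    return bias, se

# Made-up sample for illustration.
x = np.array([2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9])

# The variance with divisor n is biased downward; the jackknife estimates
# that bias, and subtracting it gives a bias-corrected estimate.
bias, se = jackknife(x, lambda v: np.var(v))
corrected = np.var(x) - bias
print(bias, se, corrected)
```

For this particular statistic the bias-corrected value coincides with the usual unbiased sample variance (divisor n-1), which is a convenient check that the formulas are coded correctly.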

Choice of cross-validation method

Cross-validation can be used simply to estimate the generalization error of
a given model, or it can be used for model selection by choosing one of
several models that has the smallest estimated generalization error. For
example, you might use cross-validation to choose the number of hidden
units, or you could use cross-validation to choose a subset of the inputs
(subset selection). A subset that contains all relevant inputs will be
called a "good" subset, while the subset that contains all relevant inputs
but no others will be called the "best" subset. Note that subsets are "good"
and "best" in an asymptotic sense (as the number of training cases goes to
infinity). With a small training set, it is possible that a subset that is
smaller than the "best" subset may provide better generalization error. 

Leave-one-out cross-validation often works well for estimating
generalization error for continuous error functions such as the mean squared
error, but it may perform poorly for discontinuous error functions such as
the number of misclassified cases. In the latter case, k-fold
cross-validation is preferred. But if k gets too small, the error estimate
is pessimistically biased because of the difference in training-set size
between the full-sample analysis and the cross-validation analyses. (For
model-selection purposes, this bias can actually help; see the discussion
below of Shao, 1993.) A value of 10 for k is popular for estimating
generalization error. 

Leave-one-out cross-validation can also run into trouble with various
model-selection methods. Again, one problem is lack of continuity--a small
change in the data can cause a large change in the model selected (Breiman,
1996). For choosing subsets of inputs in linear regression, Breiman and
Spector (1992) found 10-fold and 5-fold cross-validation to work better than
leave-one-out. Kohavi (1995) also obtained good results for 10-fold
cross-validation with empirical decision trees (C4.5). Values of k as small
as 5 or even 2 may work even better if you analyze several different random
k-way splits of the data to reduce the variability of the cross-validation
estimate.

Leave-one-out cross-validation also has more subtle deficiencies for model
selection. Shao (1995) showed that in linear models, leave-one-out
cross-validation is asymptotically equivalent to AIC (and Mallows' C_p), but
leave-v-out cross-validation is asymptotically equivalent to Schwarz's
Bayesian criterion (called SBC or BIC) when v =
n[1-1/(log(n)-1)], where n is the number of training cases. SBC
provides consistent subset-selection, while AIC does not. That is, SBC will
choose the "best" subset with probability approaching one as the size of the
training set goes to infinity. AIC has an asymptotic probability of one of
choosing a "good" subset, but less than one of choosing the "best" subset
(Stone, 1979). Many simulation studies have also found that AIC overfits
badly in small samples, and that SBC works well (e.g., Hurvich and Tsai,
1989; Shao and Tu, 1995). Hence, these results suggest that leave-one-out
cross-validation should overfit in small samples, but leave-v-out
cross-validation with appropriate v should do better. However, when true
models have an infinite number of parameters, SBC is not efficient, and
other criteria that are asymptotically efficient but not consistent for
model selection may produce better generalization (Hurvich and Tsai, 1989). 
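
To get a feel for how large v must be in the formula above, the following sketch evaluates v = n[1-1/(log(n)-1)] for a few sample sizes. It assumes log is the natural logarithm (the text does not say so explicitly), and the function name shao_v is made up for the example.

```python
import math

def shao_v(n):
    """v = n[1 - 1/(log(n) - 1)], assuming the natural logarithm.

    Only meaningful for n large enough that log(n) - 1 > 1.
    """
    return n * (1.0 - 1.0 / (math.log(n) - 1.0))

for n in (100, 1000, 10000, 100000):
    v = shao_v(n)
    # t = n - v is the number of cases left for training in each split.
    print(f"n={n:6d}  v={v:9.1f}  v/n={v / n:.3f}  t={n - v:8.1f}")
```

Under that assumption the proportion v/n grows slowly toward 1 as n increases, so most of the data is held out and each fit uses only a small fraction of the cases.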

Shao (1993) obtained the surprising result that for selecting subsets of
inputs in a linear regression, the probability of selecting the "best"
subset does not converge to 1 (as the sample size n goes to infinity) for
leave-v-out cross-validation unless the proportion v/n approaches 1. At
first glance,
Shao's result seems inconsistent with the analysis by Kearns (1997) of
split-sample validation, which shows that the best generalization is
obtained with v/n strictly between 0 and 1, with little sensitivity to the
precise value of v/n for large data sets. But the apparent conflict is due
to the fundamentally different properties of cross-validation and
split-sample validation. 

To obtain an intuitive understanding of Shao (1993), let's review some
background material on generalization error. Generalization error can be
broken down into three additive parts, noise variance + estimation variance
+ squared estimation bias. Noise variance is the same for all subsets of
inputs. Bias is nonzero for subsets that are not "good", but it's zero for
all "good" subsets, since we are assuming that the function to be learned is
linear. Hence the generalization error of "good" subsets will differ only in
the estimation variance. The estimation variance is (2p/t)s^2 where p
is the number of inputs in the subset, t is the training set size, and s^2
is the noise variance. The "best" subset is better than other "good" subsets
only because the "best" subset has (by definition) the smallest value of p.
But the t in the denominator means that differences in generalization error
among the "good" subsets will all go to zero as t goes to infinity.
Therefore it is difficult to guess which subset is "best" based on the
generalization error even when t is very large. It is well known that
unbiased estimates of the generalization error, such as those based on AIC,
FPE, and C_p, do not produce consistent estimates of the "best" subset
(e.g., see Stone, 1979). 

In leave-v-out cross-validation, t=n-v. The differences of the
cross-validation estimates of generalization error among the "good" subsets
contain a factor 1/t, not 1/n. Therefore by making t small enough (and
thereby making each regression based on t cases bad enough), we can make
the differences of the cross-validation estimates large enough to detect. It
turns out that to make t small enough to guess the "best" subset
consistently, we have to have t/n go to 0 as n goes to infinity. 
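
The following is a rough sketch of subset selection by leave-v-out cross-validation with a small training size t = n - v. Since leaving out all possible subsets of v cases is infeasible, random splits are used instead; that substitution, the function names (random_splits, leave_v_out_mse, select_subset), and the synthetic data are all assumptions made for the example, not a reproduction of Shao's procedure.

```python
import numpy as np
from itertools import combinations

def random_splits(n, v, n_splits, rng):
    """Draw n_splits random (train, test) index pairs with v test cases each."""
    splits = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        splits.append((perm[v:], perm[:v]))   # t = n - v training cases
    return splits

def leave_v_out_mse(X, y, cols, splits):
    """Average held-out squared error of least squares on the columns `cols`."""
    cols = list(cols)
    errs = []
    for train, test in splits:
        coef, *_ = np.linalg.lstsq(X[np.ix_(train, cols)], y[train], rcond=None)
        errs.append(np.mean((y[test] - X[np.ix_(test, cols)] @ coef) ** 2))
    return float(np.mean(errs))

def select_subset(X, y, splits):
    """Return the nonempty column subset with the smallest estimated error."""
    p = X.shape[1]
    candidates = [c for r in range(1, p + 1) for c in combinations(range(p), r)]
    scores = [leave_v_out_mse(X, y, c, splits) for c in candidates]
    return candidates[int(np.argmin(scores))]

# Synthetic data (hypothetical): only columns 0 and 1 are relevant.
rng = np.random.default_rng(0)
n, v = 200, 150
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

# The same splits are reused for every subset so the comparison is paired.
splits = random_splits(n, v, n_splits=200, rng=rng)
chosen = select_subset(X, y, splits)
print(chosen)
```

After a subset has been chosen this way, the final regression for that subset would be refit on all n cases, which is the point of contrast with split-sample validation.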

The crucial distinction between cross-validation and split-sample validation
is that with cross-validation, after guessing the "best" subset, we train
the linear regression model for that subset using all n cases, but with
split-sample validation, only t cases are ever used for training. If our
main purpose were really to choose the "best" subset, I suspect we would
still have to have t/n go to 0 even for split-sample validation. But
choosing the "best" subset is not the same thing as getting the best
generalization. If we are more interested in getting good generalization
than in choosing the "best" subset, we do not want to make our regression
estimate based on only t cases as bad as we do in cross-validation, because
in split-sample validation that bad regression estimate is what we're stuck
with. So there is no conflict between Shao and Kearns, but there is a
conflict between the two goals of choosing the "best" subset and getting the
best generalization in split-sample validation. 

Bootstrapping


Bootstrapping seems to work better than cross-validation in many cases
(Efron, 1983). In the simplest form of bootstrapping, instead of repeatedly
analyzing subsets of the data, you repeatedly analyze subsamples of the
data. Each subsample is a random sample with replacement from the full
sample. Depending on what you want to do, anywhere from 50 to 2000
subsamples might be used. There are many more sophisticated bootstrap
methods that can be used not only for estimating generalization error but
also for estimating confidence bounds for network outputs (Efron and
Tibshirani 1993). For estimating generalization error in classification
problems, the .632+ bootstrap (an improvement on the popular .632 bootstrap;
Efron and Tibshirani, 1997) is one of the currently favored methods and has
the advantage of performing well even when there is severe overfitting. Use
of bootstrapping for NNs is
described in Baxt and White (1995), Tibshirani (1996), and Masters (1995).
However, the results obtained so far are not very thorough, and it is known
that bootstrapping does not work well for some other methodologies such as
empirical decision trees (Breiman, Friedman, Olshen, and Stone, 1984;
Kohavi, 1995), for which it can be excessively optimistic. 
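
As a rough illustration of the simplest kind of bootstrap error estimate, the sketch below resamples cases with replacement, refits on each resample, and scores each fit on the cases that were not drawn. It then forms the plain .632 combination, 0.368 * training error + 0.632 * out-of-bag error. This is not the .632+ method, and several things are assumptions made for the example: a least-squares fit with squared error replaces a classifier, and the function name bootstrap_632_mse and the data are hypothetical.

```python
import numpy as np

def bootstrap_632_mse(X, y, n_boot, rng):
    """Return (training error, out-of-bag error, .632 estimate)."""
    n = len(y)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    train_err = np.mean((y - X @ coef) ** 2)

    sq_err = np.zeros(n)   # summed out-of-bag squared error for each case
    counts = np.zeros(n)   # number of resamples in which each case was left out
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # n cases drawn with replacement
        oob = np.setdiff1d(np.arange(n), idx)   # cases not drawn this time
        if oob.size == 0:
            continue
        b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        sq_err[oob] += (y[oob] - X[oob] @ b) ** 2
        counts[oob] += 1

    # Average per case first, then over the cases that were ever left out.
    seen = counts > 0
    oob_err = np.mean(sq_err[seen] / counts[seen])
    return train_err, oob_err, 0.368 * train_err + 0.632 * oob_err

# Synthetic data (hypothetical): linear target plus noise.
rng = np.random.default_rng(0)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

train_err, oob_err, est_632 = bootstrap_632_mse(X, y, n_boot=200, rng=rng)
print(train_err, oob_err, est_632)
```

The 200 resamples used here fall inside the 50 to 2000 range mentioned above; more resamples reduce the Monte Carlo variability of the estimate.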

For further information

Cross-validation and bootstrapping become considerably more complicated for
time series data; see Hjorth (1994) and Snijders (1988). 

More information on jackknife and bootstrap confidence intervals is
available at (this is a plain-text

References


   Baxt, W.G. and White, H. (1995), "Bootstrapping confidence intervals for
   clinical input variable effects in a network trained to identify the
   presence of acute myocardial infarction," Neural Computation, 7, 624-638.

   Breiman, L. (1996), "Heuristics of instability and stabilization in model
   selection," Annals of Statistics, 24, 2350-2383. 

   Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984), 
   Classification and Regression Trees, Belmont, CA: Wadsworth. 

   Breiman, L., and Spector, P. (1992), "Submodel selection and evaluation
   in regression: The X-random case," International Statistical Review, 60,

   Dijkstra, T.K., ed. (1988), On Model Uncertainty and Its Statistical
   Implications, Proceedings of a workshop held in Groningen, The
   Netherlands, September 25-26, 1986, Berlin: Springer-Verlag. 

   Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling
   Plans, Philadelphia: SIAM. 

   Efron, B. (1983), "Estimating the error rate of a prediction rule:
   Improvement on cross-validation," J. of the American Statistical
   Association, 78, 316-331. 

   Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap,
   London: Chapman & Hall. 

   Efron, B. and Tibshirani, R.J. (1997), "Improvements on cross-validation:
   The .632+ bootstrap method," J. of the American Statistical Association,
   92, 548-560. 

   Goutte, C. (1997), "Note on free lunches and cross-validation," Neural
   Computation, 9, 1211-1215. 

   Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods: Validation,
   Model Selection, and Bootstrap, London: Chapman & Hall. 

   Hurvich, C.M., and Tsai, C.-L. (1989), "Regression and time series model
   selection in small samples," Biometrika, 76, 297-307. 

   Kearns, M. (1997), "A bound on the error of cross validation using the
   approximation and estimation rates, with consequences for the
   training-test split," Neural Computation, 9, 1143-1161. 

   Kohavi, R. (1995), "A study of cross-validation and bootstrap for
   accuracy estimation and model selection," International Joint Conference
   on Artificial Intelligence (IJCAI), pp. ?, 

   Masters, T. (1995), Advanced Algorithms for Neural Networks: A C++
   Sourcebook, NY: John Wiley and Sons, ISBN 0-471-10588-0. 

   Plutowski, M., Sakata, S., and White, H. (1994), "Cross-validation
   estimates IMSE," in Cowan, J.D., Tesauro, G., and Alspector, J. (eds.) 
   Advances in Neural Information Processing Systems 6, San Mateo, CA:
   Morgan Kaufmann, pp. 391-398. 

   Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 

   Shao, J. (1993), "Linear model selection by cross-validation," J. of the
   American Statistical Association, 88, 486-494. 

   Shao, J. (1995), "An asymptotic theory for linear model selection,"
   Statistica Sinica ?. 

   Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, New York:

   Snijders, T.A.B. (1988), "On cross-validation for predictor evaluation in
   time series," in Dijkstra (1988), pp. 56-69. 

   Stone, M. (1977), "Asymptotics for and against cross-validation,"
   Biometrika, 64, 29-35. 

   Stone, M. (1979), "Comments on model selection criteria of Akaike and
   Schwarz," J. of the Royal Statistical Society, Series B, 41, 276-278. 

   Tibshirani, R. (1996), "A comparison of some error estimates for neural
   network models," Neural Computation, 8, 152-163. 

   Weiss, S.M. and Kulikowski, C.A. (1991), Computer Systems That Learn,
   Morgan Kaufmann. 

   Zhu, H., and Rohwer, R. (1996), "No free lunch for cross-validation,"
   Neural Computation, 8, 1421-1426. 


Send corrections/additions to the FAQ Maintainer: (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM