FAQ, Part 3 of 7: Generalization
Section - What is early stopping?


Top Document: FAQ, Part 3 of 7: Generalization
Previous Document: What is jitter? (Training with noise)
Next Document: What is weight decay?

NN practitioners often use nets with many times as many parameters as
training cases. E.g., Nelson and Illingworth (1991, p. 165) discuss training
a network with 16,219 parameters with only 50 training cases! The method
used is called "early stopping" or "stopped training" and proceeds as
follows:

1. Divide the available data into training and validation sets. 
2. Use a large number of hidden units. 
3. Use very small random initial values. 
4. Use a slow learning rate. 
5. Compute the validation error rate periodically during training. 
6. Stop training when the validation error rate "starts to go up". 
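
The six steps above can be sketched in a few lines of Python. This is only
an illustration under stated assumptions, not a reference implementation: a
degree-9 polynomial fitted by batch gradient descent stands in for a net with
many hidden units, the toy data (y = 2x plus noise), the learning rate, the
checking interval, and all function names are hypothetical, and "starts to go
up" is taken in its simplest form, the first periodic check at which the
validation error exceeds the previous check.

```python
import random

def features(x, degree):
    """Polynomial features 1, x, x^2, ..., x^degree for a scalar input x."""
    return [x ** k for k in range(degree + 1)]

def predict(w, x):
    return sum(wk * fk for wk, fk in zip(w, features(x, len(w) - 1)))

def mse(w, data):
    """Mean squared error of weights w on a list of (x, y) cases."""
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def gradient_step(w, data, lr):
    """One batch gradient-descent step on the mean squared error."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = predict(w, x) - y
        for k, fk in enumerate(features(x, len(w) - 1)):
            grad[k] += 2.0 * err * fk / len(data)
    return [wk - lr * gk for wk, gk in zip(w, grad)]

def train_early_stopping(train, valid, degree=9, lr=0.05,
                         check_every=10, max_iters=2000, seed=0):
    rng = random.Random(seed)
    # Step 2: more parameters than the problem needs (stand-in for hidden units).
    # Step 3: very small random initial values.
    w = [rng.uniform(-0.01, 0.01) for _ in range(degree + 1)]
    kept_w, kept_err, kept_iter = w, mse(w, valid), 0
    for it in range(1, max_iters + 1):
        w = gradient_step(w, train, lr)      # Step 4: slow learning rate.
        if it % check_every == 0:            # Step 5: periodic validation check.
            err = mse(w, valid)
            if err > kept_err:               # Step 6: validation error went up.
                break
            kept_w, kept_err, kept_iter = w, err, it
    # Return the weights from the last check before the error rose.
    return kept_w, kept_err, kept_iter

# Toy data (an assumption for illustration): y = 2x plus Gaussian noise.
rng = random.Random(1)
cases = [(x, 2.0 * x + rng.gauss(0.0, 0.3))
         for x in (rng.uniform(-1.0, 1.0) for _ in range(30))]
train, valid = cases[:20], cases[20:]        # Step 1: training and validation sets.
w, val_err, stop_iter = train_early_stopping(train, valid)
print(f"stopped at iteration {stop_iter}, validation MSE {val_err:.4f}")
```

The function keeps the weights from the previous check rather than the
current ones, since the current weights are by definition the ones at which
the validation error had already risen.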

It is crucial to realize that the validation error is not a good estimate
of the generalization error. One method for getting an unbiased estimate of
the generalization error is to run the net on a third set of data, the test
set, that is not used at all during the training process. For other methods,
see "How can generalization error be estimated?" 
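
A minimal sketch of a three-way split follows. The function name, the 25%
fractions, and the use of a random shuffle are all assumptions made for the
example; the FAQ does not prescribe proportions or a splitting method.

```python
import random

def split_cases(cases, frac_valid=0.25, frac_test=0.25, seed=0):
    """Randomly partition cases into training, validation, and test sets."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * frac_valid)
    n_test = int(len(shuffled) * frac_test)
    test = shuffled[:n_test]                   # never touched during training
    valid = shuffled[n_test:n_test + n_valid]  # used only to decide when to stop
    train = shuffled[n_test + n_valid:]        # used to update the weights
    return train, valid, test

train, valid, test = split_cases(range(100))
print(len(train), len(valid), len(test))  # prints: 50 25 25
```

Because the stopping point is chosen to minimize the validation error, only
the error on the test set remains untouched by the training process.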

Early stopping has several advantages: 

 o It is fast. 
 o It can be applied successfully to networks in which the number of weights
   far exceeds the sample size. 
 o It requires only one major decision by the user: what proportion of
   validation cases to use. 

But there are several unresolved practical issues in early stopping: 

 o How many cases do you assign to the training and validation sets? Rules
   of thumb abound, but appear to be no more than folklore. The only
   systematic results known to the FAQ maintainer are in Sarle (1995), which
   deals only with the case of a single input. Amari et al. (1995) attempt
   a theoretical approach, but the paper contains serious errors that completely
   invalidate the results, especially the incorrect assumption that the
   direction of approach to the optimum is distributed isotropically. 
 o Do you split the data into training and validation sets randomly or by
   some systematic algorithm? 
 o How do you tell when the validation error rate "starts to go up"? It may
   go up and down numerous times during training. The safest approach is to
   train to convergence, then go back and see which iteration had the lowest
   validation error. For more elaborate algorithms, see Prechelt (1994,
   1998). 
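
The difference between stopping at the first rise and picking the lowest
validation error after training to convergence can be seen on a made-up
validation curve. The numbers and function names below are hypothetical and
serve only to illustrate the point.

```python
def first_rise(val_errors):
    """Index of the last check before the validation error first increases."""
    for i in range(1, len(val_errors)):
        if val_errors[i] > val_errors[i - 1]:
            return i - 1
    return len(val_errors) - 1

def lowest_error(val_errors):
    """Index of the check with the lowest validation error over the whole run."""
    return min(range(len(val_errors)), key=val_errors.__getitem__)

# Hypothetical validation errors recorded at successive checks.
curve = [0.90, 0.50, 0.60, 0.40, 0.45, 0.70]
print(first_rise(curve), lowest_error(curve))  # prints: 1 3
```

Here the first-rise rule stops at check 1 (error 0.50) and misses the lower
error 0.40 at check 3, which is only found by continuing to train and
looking back.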

Statisticians tend to be skeptical of stopped training for two reasons. First,
it appears to be statistically inefficient because of the split-sample
technique: neither training nor validation makes use of the entire sample.
Second, the usual statistical theory does not apply. However, there has been
recent progress addressing both of these concerns (Wang 1994). 

Early stopping is closely related to ridge regression. If the learning rate
is sufficiently small, the sequence of weight vectors on each iteration will
approximate the path of continuous steepest descent down the error surface.
Early stopping chooses a point along this path that optimizes an estimate of
the generalization error computed from the validation set. Ridge regression
also defines a path of weight vectors by varying the ridge value. The ridge
value is often chosen by optimizing an estimate of the generalization error
computed by cross-validation, generalized cross-validation, or bootstrapping
(see "What are cross-validation and bootstrapping?"). There always exists a
positive ridge value that will improve the expected generalization error in
a linear model. A similar result has been obtained for early stopping in
linear models (Wang, Venkatesh, and Judd 1994). In linear models, the ridge
path lies close to, but does not coincide with, the path of continuous
steepest descent; in nonlinear models, the two paths can diverge widely. The
relationship is explored in more detail by Sjöberg and Ljung (1992). 
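
For a linear model the two paths can be compared directly. The sketch below
rests on assumptions not stated in the FAQ: the error function is
(1/2)||y - Xw||^2, the weights start at zero, and the weights are expressed
in the eigenvector basis of X'X with eigenvalues s_i. Under those
assumptions, continuous steepest descent at time t shrinks the i-th
least-squares coefficient by the factor 1 - exp(-s_i t), while ridge
regression with ridge value lam shrinks it by s_i / (s_i + lam). The
eigenvalues and the time t are made-up values.

```python
import math

def descent_shrinkage(s, t):
    """Shrinkage factors of continuous steepest descent from zero at time t."""
    return [1.0 - math.exp(-si * t) for si in s]

def ridge_shrinkage(s, lam):
    """Shrinkage factors of ridge regression with ridge value lam."""
    return [si / (si + lam) for si in s]

s = [1.0, 10.0]   # hypothetical eigenvalues of X'X
t = 0.5           # hypothetical stopping time

gd = descent_shrinkage(s, t)
# Choose the ridge value so that the two methods agree on the first coordinate.
lam = s[0] * (1.0 - gd[0]) / gd[0]
rr = ridge_shrinkage(s, lam)

print("steepest descent:", [round(f, 4) for f in gd])
print("ridge regression:", [round(f, 4) for f in rr])
```

Both methods shrink every coefficient toward zero, and shrink directions
with small eigenvalues more strongly, but once the factors are matched on one
coordinate they differ on the other, so the two paths do not coincide.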

References: 

   Amari, S., Murata, N., Müller, K.-R., Finke, M., and Yang, H. (1995),
   "Asymptotic Statistical Theory of Overtraining and Cross-Validation,"
   METR 95-06, Department of Mathematical Engineering and Information Physics,
   University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan. 

   Finnoff, W., Hergert, F., and Zimmermann, H.G. (1993), "Improving model
   selection by nonconvergent methods," Neural Networks, 6, 771-783. 

   Nelson, M.C. and Illingworth, W.T. (1991), A Practical Guide to Neural
   Nets, Reading, MA: Addison-Wesley. 

   Orr, G.B., and Mueller, K.-R., eds. (1998), Neural Networks: Tricks of
   the Trade, Berlin: Springer, ISBN 3-540-65311-2. 

   Prechelt, L. (1998), "Early stopping--But when?" in Orr and Mueller
   (1998), 55-69. 

   Prechelt, L. (1994), "PROBEN1--A set of neural network benchmark problems
   and benchmarking rules," Technical Report 21/94, Universität Karlsruhe,
   76128 Karlsruhe, Germany. 

   Sarle, W.S. (1995), "Stopped Training and Other Remedies for
   Overfitting," Proceedings of the 27th Symposium on the Interface of
   Computing Science and Statistics, 352-360. 

   Sjöberg, J. and Ljung, L. (1992), "Overtraining, Regularization, and
   Searching for Minimum in Neural Networks," Technical Report
   LiTH-ISY-I-1297, Department of Electrical Engineering, Linköping
   University, S-581 83 Linköping, Sweden. 

   Wang, C. (1994), A Theory of Generalisation in Learning Machines with
   Neural Network Application, Ph.D. thesis, University of Pennsylvania. 

   Wang, C., Venkatesh, S.S., and Judd, J.S. (1994), "Optimal Stopping and
   Effective Machine Complexity in Learning," NIPS6, 303-310. 

   Weigend, A. (1994), "On overfitting and the effective number of hidden
   units," Proceedings of the 1993 Connectionist Models Summer School,


Send corrections/additions to the FAQ Maintainer: (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM