Top Document: comp.ai.neuralnets FAQ, Part 3 of 7: Generalization
Previous Document: What is jitter? (Training with noise)
Next Document: What is weight decay?

NN practitioners often use nets with many times as many parameters as
training cases. E.g., Nelson and Illingworth (1991, p. 165) discuss
training a network with 16,219 parameters with only 50 training cases!
The method used is called "early stopping" or "stopped training" and
proceeds as follows:

 1. Divide the available data into training and validation sets.
 2. Use a large number of hidden units.
 3. Use very small random initial values.
 4. Use a slow learning rate.
 5. Compute the validation error rate periodically during training.
 6. Stop training when the validation error rate "starts to go up".

It is crucial to realize that the validation error is not a good
estimate of the generalization error. One method for getting an unbiased
estimate of the generalization error is to run the net on a third set of
data, the test set, that is not used at all during the training process.
For other methods, see "How can generalization error be estimated?"

Early stopping has several advantages:

 o It is fast.
 o It can be applied successfully to networks in which the number of
   weights far exceeds the sample size.
 o It requires only one major decision by the user: what proportion of
   validation cases to use.

But there are several unresolved practical issues in early stopping:

 o How many cases do you assign to the training and validation sets?
   Rules of thumb abound, but appear to be no more than folklore. The
   only systematic results known to the FAQ maintainer are in Sarle
   (1995), which deals only with the case of a single input. Amari et al.
   (1995) attempt a theoretical approach, but the paper contains serious
   errors that completely invalidate the results, especially the
   incorrect assumption that the direction of approach to the optimum is
   distributed isotropically.
 o Do you split the data into training and validation sets randomly or by
   some systematic algorithm?
 o How do you tell when the validation error rate "starts to go up"? It
   may go up and down numerous times during training. The safest approach
   is to train to convergence, then go back and see which iteration had
   the lowest validation error. For more elaborate algorithms, see
   Prechelt (1994, 1998).

Statisticians tend to be skeptical of stopped training for two reasons:
it appears to be statistically inefficient because of the split-sample
technique (i.e., neither training nor validation makes use of the entire
sample), and the usual statistical theory does not apply. However, there
has been recent progress addressing both of these concerns (Wang 1994).

Early stopping is closely related to ridge regression. If the learning
rate is sufficiently small, the sequence of weight vectors on each
iteration will approximate the path of continuous steepest descent down
the error surface. Early stopping chooses a point along this path that
optimizes an estimate of the generalization error computed from the
validation set. Ridge regression also defines a path of weight vectors by
varying the ridge value. The ridge value is often chosen by optimizing an
estimate of the generalization error computed by cross-validation,
generalized cross-validation, or bootstrapping (see "What are
cross-validation and bootstrapping?"). There always exists a positive
ridge value that will improve the expected generalization error in a
linear model. A similar result has been obtained for early stopping in
linear models (Wang, Venkatesh, and Judd 1994).
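
As a rough illustration of the procedure described above (very small
random initial weights, a small learning rate, validation error computed
during training, and a retrospective choice of the iteration with the
lowest validation error), here is a minimal sketch in Python for a linear
model trained by batch steepest descent. The synthetic data, the function
names, the fixed iteration budget standing in for "training to
convergence", and all numeric settings are assumptions made for this
example; they do not come from any of the papers cited here.

```python
import random


def predict(w, x):
    """Output of a linear model with weights w on input vector x."""
    return sum(wj * xj for wj, xj in zip(w, x))


def make_data(n_cases, n_inputs, noise_sd, rng):
    """Hypothetical synthetic data: a linear target plus Gaussian noise."""
    true_w = [1.0 if j < 2 else 0.0 for j in range(n_inputs)]
    xs, ys = [], []
    for _ in range(n_cases):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
        xs.append(x)
        ys.append(predict(true_w, x) + rng.gauss(0.0, noise_sd))
    return xs, ys


def mse(w, xs, ys):
    """Mean squared error of the linear model w on the cases (xs, ys)."""
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


def gradient_step(w, xs, ys, lr):
    """One batch steepest-descent step on the training MSE."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = predict(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / len(ys)
    return [wj - lr * gj for wj, gj in zip(w, grad)]


def early_stopping(train, valid, n_iters=1000, lr=0.01, seed=0):
    """Train for a fixed budget of iterations, then report the iteration
    with the lowest validation error (the retrospective rule, rather than
    stopping at the first rise in validation error)."""
    rng = random.Random(seed)
    (xs_t, ys_t), (xs_v, ys_v) = train, valid
    # Very small random initial values (assumed range).
    w = [rng.uniform(-0.01, 0.01) for _ in xs_t[0]]
    history = [mse(w, xs_v, ys_v)]  # validation error at each iteration
    best_w, best_iter = list(w), 0
    for it in range(1, n_iters + 1):
        w = gradient_step(w, xs_t, ys_t, lr)
        history.append(mse(w, xs_v, ys_v))
        if history[it] < history[best_iter]:
            best_w, best_iter = list(w), it
    return best_w, best_iter, history


if __name__ == "__main__":
    rng = random.Random(42)
    xs, ys = make_data(60, 20, 1.0, rng)
    # Three disjoint sets: training, validation, and an untouched test set.
    train = (xs[:30], ys[:30])
    valid = (xs[30:45], ys[30:45])
    test = (xs[45:], ys[45:])
    best_w, best_iter, history = early_stopping(train, valid)
    print("best iteration:", best_iter)
    print("validation MSE at best iteration:", history[best_iter])
    print("test MSE at best iteration:", mse(best_w, *test))
```

The test set is scored only once, after the stopping iteration has been
chosen, because the validation error was itself used to make that choice.
Keeping the whole validation history also makes it easy to see whether the
error went up and down several times before reaching its minimum.
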
In linear models, the ridge path lies close to, but does not coincide
with, the path of continuous steepest descent; in nonlinear models, the
two paths can diverge widely. The relationship is explored in more detail
by Sjöberg and Ljung (1992).

References:

   Amari, S., Murata, N., Müller, K.-R., Finke, M., and Yang, H. (1995),
   "Asymptotic Statistical Theory of Overtraining and Cross-Validation,"
   METR 95-06, Department of Mathematical Engineering and Information
   Physics, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan.

   Finnoff, W., Hergert, F., and Zimmermann, H.G. (1993), "Improving model
   selection by nonconvergent methods," Neural Networks, 6, 771-783.

   Nelson, M.C. and Illingworth, W.T. (1991), A Practical Guide to Neural
   Nets, Reading, MA: Addison-Wesley.

   Orr, G.B., and Mueller, K.-R., eds. (1998), Neural Networks: Tricks of
   the Trade, Berlin: Springer, ISBN 3-540-65311-2.

   Prechelt, L. (1998), "Early stopping--But when?" in Orr and Mueller
   (1998), 55-69.

   Prechelt, L. (1994), "PROBEN1--A set of neural network benchmark
   problems and benchmarking rules," Technical Report 21/94, Universität
   Karlsruhe, 76128 Karlsruhe, Germany,
   ftp://ftp.ira.uka.de/pub/papers/techreports/1994/.

   Sarle, W.S. (1995), "Stopped Training and Other Remedies for
   Overfitting," Proceedings of the 27th Symposium on the Interface of
   Computing Science and Statistics, 352-360,
   ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large
   compressed postscript file, 747K, 10 pages)

   Sjöberg, J. and Ljung, L. (1992), "Overtraining, Regularization, and
   Searching for Minimum in Neural Networks," Technical Report
   LiTH-ISY-I-1297, Department of Electrical Engineering, Linköping
   University, S-581 83 Linköping, Sweden, http://www.control.isy.liu.se .

   Wang, C. (1994), A Theory of Generalisation in Learning Machines with
   Neural Network Application, Ph.D. thesis, University of Pennsylvania.

   Wang, C., Venkatesh, S.S., and Judd, J.S. (1994), "Optimal Stopping and
   Effective Machine Complexity in Learning," NIPS 6, 303-310.

   Weigend, A.
   (1994), "On overfitting and the effective number of hidden units,"
   Proceedings of the 1993 Connectionist Models Summer School, 335-342.

Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
