Top Document: comp.ai.neural-nets FAQ, Part 3 of 7: Generalization Previous Document: How does noise affect generalization? Next Document: What is jitter? (Training with noise) See reader questions & answers on this topic! - Help others by sharing your knowledge The critical issue in developing a neural network is generalization: how well will the network make predictions for cases that are not in the training set? NNs, like other flexible nonlinear estimation methods such as kernel regression and smoothing splines, can suffer from either underfitting or overfitting. A network that is not sufficiently complex can fail to detect fully the signal in a complicated data set, leading to underfitting. A network that is too complex may fit the noise, not just the signal, leading to overfitting. Overfitting is especially dangerous because it can easily lead to predictions that are far beyond the range of the training data with many of the common types of NNs. Overfitting can also produce wild predictions in multilayer perceptrons even with noise-free data. For an elementary discussion of overfitting, see Smith (1996). For a more rigorous approach, see the article by Geman, Bienenstock, and Doursat (1992) on the bias/variance trade-off (it's not really a dilemma). We are talking about statistical bias here: the difference between the average value of an estimator and the correct value. Underfitting produces excessive bias in the outputs, whereas overfitting produces excessive variance. There are graphical examples of overfitting and underfitting in Sarle (1995, 1999). The best way to avoid overfitting is to use lots of training data. If you have at least 30 times as many training cases as there are weights in the network, you are unlikely to suffer from much overfitting, although you may get some slight overfitting no matter how large the training set is. For noise-free data, 5 times as many training cases as weights may be sufficient. But you can't arbitrarily reduce the number of weights for fear of underfitting. Given a fixed amount of training data, there are at least six approaches to avoiding underfitting and overfitting, and hence getting good generalization: o Model selection o Jittering o Early stopping o Weight decay o Bayesian learning o Combining networks The first five approaches are based on well-understood theory. Methods for combining networks do not have such a sound theoretical basis but are the subject of current research. These six approaches are discussed in more detail under subsequent questions. The complexity of a network is related to both the number of weights and the size of the weights. Model selection is concerned with the number of weights, and hence the number of hidden units and layers. The more weights there are, relative to the number of training cases, the more overfitting amplifies noise in the targets (Moody 1992). The other approaches listed above are concerned, directly or indirectly, with the size of the weights. Reducing the size of the weights reduces the "effective" number of weights--see Moody (1992) regarding weight decay and Weigend (1994) regarding early stopping. Bartlett (1997) obtained learning-theory results in which generalization error is related to the L_1 norm of the weights instead of the VC dimension. Overfitting is not confined to NNs with hidden units. Overfitting can occur in generalized linear models (networks with no hidden units) if either or both of the following conditions hold: 1. The number of input variables (and hence the number of weights) is large with respect to the number of training cases. Typically you would want at least 10 times as many training cases as input variables, but with noise-free targets, twice as many training cases as input variables would be more than adequate. These requirements are smaller than those stated above for networks with hidden layers, because hidden layers are prone to creating ill-conditioning and other pathologies. 2. The input variables are highly correlated with each other. This condition is called "multicollinearity" in the statistical literature. Multicollinearity can cause the weights to become extremely large because of numerical ill-conditioning--see "How does ill-conditioning affect NN training?" Methods for dealing with these problems in the statistical literature include ridge regression (similar to weight decay), partial least squares (similar to Early stopping), and various methods with even stranger names, such as the lasso and garotte (van Houwelingen and le Cessie, ????). References: Bartlett, P.L. (1997), "For valid generalization, the size of the weights is more important than the size of the network," in Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural Information Processing Systems 9, Cambrideg, MA: The MIT Press, pp. 134-140. Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and the Bias/Variance Dilemma", Neural Computation, 4, 1-58. Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems", in Moody, J.E., Hanson, S.J., and Lippmann, R.P., Advances in Neural Information Processing Systems 4, 847-854. Sarle, W.S. (1995), "Stopped Training and Other Remedies for Overfitting," Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 352-360, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large compressed postscript file, 747K, 10 pages) Sarle, W.S. (1999), "Donoho-Johnstone Benchmarks: Neural Net Results," ftp://ftp.sas.com/pub/neural/dojo/dojo.html Smith, M. (1996). Neural Networks for Statistical Modeling, Boston: International Thomson Computer Press, ISBN 1-850-32842-0. van Houwelingen,H.C., and le Cessie, S. (????), "Shrinkage and penalized likelihood as methods to improve predictive accuracy," http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/shrinkage.pdf and http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/figures.pdf Weigend, A. (1994), "On overfitting and the effective number of hidden units," Proceedings of the 1993 Connectionist Models Summer School, 335-342. User Contributions:Comment about this article, ask questions, or add new information about this topic:Top Document: comp.ai.neural-nets FAQ, Part 3 of 7: Generalization Previous Document: How does noise affect generalization? Next Document: What is jitter? (Training with noise) Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page [ Usenet FAQs | Web FAQs | Documents | RFC Index ] Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
|
PDP++ is a neural-network simulation system written in C++, developed as an advanced version of the original PDP software from McClelland and Rumelhart's "Explorations in Parallel Distributed Processing Handbook" (1987). The software is designed for both novice users and researchers, providing flexibility and power in cognitive neuroscience studies. Featured in Randall C. O'Reilly and Yuko Munakata's "Computational Explorations in Cognitive Neuroscience" (2000), PDP++ supports a wide range of algorithms. These include feedforward and recurrent error backpropagation, with continuous and real-time models such as Almeida-Pineda. It also incorporates constraint satisfaction algorithms like Boltzmann Machines, Hopfield networks, and mean-field networks, as well as self-organizing learning algorithms, including Self-organizing Maps (SOM) and Hebbian learning. Additionally, it supports mixtures-of-experts models and the Leabra algorithm, which combines error-driven and Hebbian learning with k-Winners-Take-All inhibitory competition. PDP++ is a comprehensive tool for exploring neural network models in cognitive neuroscience.