Top Document: comp.ai.neuralnets FAQ, Part 1 of 7: Introduction

There is considerable overlap between the fields of neural networks and statistics. Statistics is concerned with data analysis. In neural network terminology, statistical inference means learning to generalize from noisy data. Some neural networks are not concerned with data analysis (e.g., those intended to model biological systems) and therefore have little to do with statistics. Some neural networks do not learn (e.g., Hopfield nets) and therefore have little to do with statistics. Some neural networks can learn successfully only from noise-free data (e.g., ART or the perceptron rule) and therefore would not be considered statistical methods. But most neural networks that can learn to generalize effectively from noisy data are similar or identical to statistical methods. For example:

   o Feedforward nets with no hidden layer (including functional-link neural nets and higher-order neural nets) are basically generalized linear models.
   o Feedforward nets with one hidden layer are closely related to projection pursuit regression.
   o Probabilistic neural nets are identical to kernel discriminant analysis.
   o Kohonen nets for adaptive vector quantization are very similar to k-means cluster analysis.
   o Kohonen self-organizing maps are discrete approximations to principal curves and surfaces.
   o Hebbian learning is closely related to principal component analysis.

Some neural network areas that appear to have no close relatives in the existing statistical literature are:

   o Reinforcement learning (although this is treated in the operations research literature on Markov decision processes).
   o Stopped training (the purpose and effect of stopped training are similar to shrinkage estimation, but the method is quite different).
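The first correspondence in the list above (a feedforward net with no hidden layer is basically a generalized linear model) can be illustrated with a small sketch. It assumes a single logistic output unit trained by batch gradient descent on cross-entropy, and compares the learned weights with a logistic regression fitted by iteratively reweighted least squares. The data are synthetic, and all function names, learning rates, and iteration counts are hypothetical choices made only for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_no_hidden_layer_net(X, y, lr=0.5, epochs=5000):
    """Net with inputs wired directly to one logistic output unit,
    trained by batch gradient descent on mean cross-entropy."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        err = p - y                    # cross-entropy gradient w.r.t. the net input
        w -= lr * (X.T @ err) / n
        b -= lr * err.mean()
    return w, b

def fit_logistic_glm_irls(X, y, iters=25):
    """Logistic regression (a GLM) fitted by iteratively reweighted
    least squares, i.e. Newton's method on the Bernoulli likelihood."""
    n, d = X.shape
    A = np.hstack([np.ones((n, 1)), X])   # intercept column plus inputs
    beta = np.zeros(d + 1)
    for _ in range(iters):
        p = sigmoid(A @ beta)
        weights = p * (1.0 - p)
        hessian = A.T @ (A * weights[:, None])
        score = A.T @ (y - p)
        beta += np.linalg.solve(hessian, score)
    return beta[1:], beta[0]

# Synthetic noisy binary data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_p = sigmoid(X @ np.array([1.5, -1.0]) + 0.5)
y = (rng.random(200) < true_p).astype(float)

w_net, b_net = train_no_hidden_layer_net(X, y)
w_glm, b_glm = fit_logistic_glm_irls(X, y)
print("net weights:", w_net, "bias:", b_net)
print("GLM weights:", w_glm, "intercept:", b_glm)
```

Under these assumptions the two sets of estimates should agree to several decimal places, because both procedures maximize the same Bernoulli likelihood; only the optimization algorithm differs.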
Feedforward nets are a subset of the class of nonlinear regression and discrimination models. Statisticians have studied the properties of this general class but had not considered the specific case of feedforward neural nets before such networks were popularized in the neural network field. Still, many results from the statistical theory of nonlinear models apply directly to feedforward nets, and the methods that are commonly used for fitting nonlinear models, such as various Levenberg-Marquardt and conjugate gradient algorithms, can be used to train feedforward nets.

The application of statistical theory to neural networks is explored in detail by Bishop (1995) and Ripley (1996). Several summary articles have also been published relating statistical models to neural networks, including Cheng and Titterington (1994), Kuan and White (1994), Ripley (1993, 1994), Sarle (1994), and several articles in Cherkassky, Friedman, and Wechsler (1994). Among the many statistical concepts important to neural nets is the bias/variance tradeoff in nonparametric estimation, discussed by Geman, Bienenstock, and Doursat (1992). Some more advanced results of statistical theory applied to neural networks are given by White (1989a, 1989b, 1990, 1992a) and White and Gallant (1992), reprinted in White (1992b).

While neural nets are often defined in terms of their algorithms or implementations, statistical methods are usually defined in terms of their results. The arithmetic mean, for example, can be computed by a (very simple) backprop net, by applying the usual formula SUM(x_i)/n, or by various other methods. What you get is still an arithmetic mean regardless of how you compute it. So a statistician would consider standard backprop, Quickprop, and Levenberg-Marquardt as different algorithms for implementing the same statistical model such as a feedforward net.
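The arithmetic-mean remark above can be checked numerically. The following is a minimal sketch, assuming the "very simple backprop net" is a net with no inputs and a single bias weight, trained by batch gradient descent on squared error; the function names, learning rate, and data are hypothetical and chosen only for illustration.

```python
import numpy as np

def mean_by_formula(x):
    """Arithmetic mean via the usual formula SUM(x_i)/n."""
    return float(np.sum(x) / len(x))

def mean_by_gradient_descent(x, lr=0.1, epochs=500):
    """'Net' consisting of one bias weight b whose output is b for
    every case, trained by batch gradient descent on squared error."""
    b = 0.0
    for _ in range(epochs):
        grad = np.mean(2.0 * (b - x))   # derivative of mean squared error w.r.t. b
        b -= lr * grad
    return float(b)

x = np.array([2.0, 4.0, 4.0, 5.0, 10.0])
print(mean_by_formula(x))
print(mean_by_gradient_descent(x))
```

Both routes minimize the same least-squares criterion, so they arrive at the same estimate up to numerical tolerance; only the algorithm differs.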
On the other hand, different training criteria, such as least squares and cross entropy, are viewed by statisticians as fundamentally different estimation methods with different statistical properties.

It is sometimes claimed that neural networks, unlike statistical models, require no distributional assumptions. In fact, neural networks involve exactly the same sort of distributional assumptions as statistical models (Bishop, 1995), but statisticians study the consequences and importance of these assumptions while many neural networkers ignore them. For example, least-squares training methods are widely used by statisticians and neural networkers. Statisticians realize that least-squares training involves implicit distributional assumptions in that least-squares estimates have certain optimality properties for noise that is normally distributed with equal variance for all training cases and that is independent between different cases. These optimality properties are consequences of the fact that least-squares estimation is maximum likelihood under those conditions. Similarly, cross-entropy is maximum likelihood for noise with a Bernoulli distribution. If you study the distributional assumptions, then you can recognize and deal with violations of the assumptions. For example, if you have normally distributed noise but some training cases have greater noise variance than others, then you may be able to use weighted least squares instead of ordinary least squares to obtain more efficient estimates.

Hundreds, perhaps thousands of people have run comparisons of neural nets with "traditional statistics" (whatever that means). Most such studies involve one or two data sets, and are of little use to anyone else unless they happen to be analyzing the same kind of data.
But there is an impressive comparative study of supervised classification by Michie, Spiegelhalter, and Taylor (1994), which not only compares many classification methods on many data sets, but also provides unusually extensive analyses of the results. Another useful study on supervised classification by Lim, Loh, and Shih (1999) is available online. There is an excellent comparison of unsupervised Kohonen networks and k-means clustering by Balakrishnan, Cooper, Jacob, and Lewis (1994).

There are many methods in the statistical literature that can be used for flexible nonlinear modeling. These methods include:

   o Polynomial regression (Eubank, 1999)
   o Fourier series regression (Eubank, 1999; Haerdle, 1990)
   o Wavelet smoothing (Donoho and Johnstone, 1995; Donoho, Johnstone, Kerkyacharian, and Picard, 1995)
   o K-nearest neighbor regression and discriminant analysis (Haerdle, 1990; Hand, 1981, 1997; Ripley, 1996)
   o Kernel regression and discriminant analysis (Eubank, 1999; Haerdle, 1990; Hand, 1981, 1982, 1997; Ripley, 1996)
   o Local polynomial smoothing (Eubank, 1999; Wand and Jones, 1995; Fan and Gijbels, 1995)
   o LOESS (Cleveland and Gross, 1991)
   o Smoothing splines (such as thin-plate splines) (Eubank, 1999; Wahba, 1990; Green and Silverman, 1994; Haerdle, 1990)
   o B-splines (Eubank, 1999)
   o Tree-based models (CART, AID, etc.) (Haerdle, 1990; Lim, Loh, and Shih, 1997; Hand, 1997; Ripley, 1996)
   o Multivariate adaptive regression splines (MARS) (Friedman, 1991)
   o Projection pursuit (Friedman and Stuetzle, 1981; Haerdle, 1990; Ripley, 1996)
   o Various Bayesian methods (Dey, 1998)
   o GMDH (Farlow, 1984)

Why use neural nets rather than any of the above methods? There are many answers to that question depending on what kind of neural net you're interested in.
The most popular variety of neural net, the MLP, tends to be useful in the same situations as projection pursuit regression, i.e.:

   o the number of inputs is fairly large,
   o many of the inputs are relevant, but
   o most of the predictive information lies in a low-dimensional subspace.

The main advantage of MLPs over projection pursuit regression is that computing predicted values from MLPs is simpler and faster. Also, MLPs are better at learning moderately pathological functions than are many other methods with stronger smoothness assumptions (see ftp://ftp.sas.com/pub/neural/dojo/dojo.html) as long as the number of pathological features (such as discontinuities) in the function is not too large. For more discussion of the theoretical benefits of various types of neural nets, see "How do MLPs compare with RBFs?"

Communication between statisticians and neural net researchers is often hindered by the different terminology used in the two fields. There is a comparison of neural net and statistical jargon in ftp://ftp.sas.com/pub/neural/jargon

For free statistical software, see the StatLib repository at http://lib.stat.cmu.edu/ at Carnegie Mellon University.

There are zillions of introductory textbooks on statistics. One of the better ones is Moore and McCabe (1989). At an intermediate level, the books on linear regression by Weisberg (1985) and Myers (1986), on logistic regression by Hosmer and Lemeshow (1989), and on discriminant analysis by Hand (1981) can be recommended. At a more advanced level, the book on generalized linear models by McCullagh and Nelder (1989) is an essential reference, and the book on nonlinear regression by Gallant (1987) has much material relevant to neural nets.
Several introductory statistics texts are available on the web:

   o David Lane, HyperStat, at http://www.ruf.rice.edu/~lane/hyperstat/contents.html
   o Jan de Leeuw (ed.), Statistics: The Study of Stability in Variation, at http://www.stat.ucla.edu/textbook/
   o StatSoft, Inc., Electronic Statistics Textbook, at http://www.statsoft.com/textbook/stathome.html
   o David Stockburger, Introductory Statistics: Concepts, Models, and Applications, at http://www.psychstat.smsu.edu/sbk00.htm
   o University of Newcastle (Australia) Statistics Department, SurfStat Australia, http://surfstat.newcastle.edu.au/surfstat/

A more advanced book covering many topics that are also relevant to NNs is:

   o Frank Harrell, REGRESSION MODELING STRATEGIES With Applications to Linear Models, Logistic Regression, and Survival Analysis, at http://hesweb1.med.virginia.edu/biostat/rms/

References:

   Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994) "A study of the classification capabilities of neural networks using unsupervised learning: A comparison with k-means clustering", Psychometrika, 59, 509-525.
   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.
   Cheng, B. and Titterington, D.M. (1994), "Neural Networks: A Review from a Statistical Perspective", Statistical Science, 9, 2-54.
   Cherkassky, V., Friedman, J.H., and Wechsler, H., eds. (1994), From Statistics to Neural Networks: Theory and Pattern Recognition Applications, Berlin: Springer-Verlag.
   Cleveland and Gross (1991), "Computational Methods for Local Regression," Statistics and Computing 1, 47-62.
   Dey, D., ed. (1998) Practical Nonparametric and Semiparametric Bayesian Statistics, Springer Verlag.
   Donoho, D.L., and Johnstone, I.M. (1995), "Adapting to unknown smoothness via wavelet shrinkage," J. of the American Statistical Association, 90, 1200-1224.
   Donoho, D.L., Johnstone, I.M., Kerkyacharian, G., and Picard, D. (1995), "Wavelet shrinkage: asymptopia (with discussion)?" J. of the Royal Statistical Society, Series B, 57, 301-369.
   Eubank, R.L. (1999), Nonparametric Regression and Spline Smoothing, 2nd ed., Marcel Dekker, ISBN 0-8247-9337-4.
   Fan, J., and Gijbels, I. (1995), "Data-driven bandwidth selection in local polynomial: variable bandwidth and spatial adaptation," J. of the Royal Statistical Society, Series B, 57, 371-394.
   Farlow, S.J. (1984), Self-organizing Methods in Modeling: GMDH Type Algorithms, NY: Marcel Dekker. (GMDH)
   Friedman, J.H. (1991), "Multivariate adaptive regression splines", Annals of Statistics, 19, 1-141. (MARS)
   Friedman, J.H. and Stuetzle, W. (1981) "Projection pursuit regression," J. of the American Statistical Association, 76, 817-823.
   Gallant, A.R. (1987) Nonlinear Statistical Models, NY: Wiley.
   Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and the Bias/Variance Dilemma", Neural Computation, 4, 1-58.
   Green, P.J., and Silverman, B.W. (1994), Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, London: Chapman & Hall.
   Haerdle, W. (1990), Applied Nonparametric Regression, Cambridge Univ. Press.
   Hand, D.J. (1981) Discrimination and Classification, NY: Wiley.
   Hand, D.J. (1982) Kernel Discriminant Analysis, Research Studies Press.
   Hand, D.J. (1997) Construction and Assessment of Classification Rules, NY: Wiley.
   Hill, T., Marquez, L., O'Connor, M., and Remus, W. (1994), "Artificial neural network models for forecasting and decision making," International J. of Forecasting, 10, 5-15.
   Kuan, C.-M. and White, H. (1994), "Artificial Neural Networks: An Econometric Perspective", Econometric Reviews, 13, 1-91.
   Kushner, H. & Clark, D. (1978), Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer-Verlag.
   Lim, T.-S., Loh, W.-Y. and Shih, Y.-S. (1999?), "A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms," Machine Learning, forthcoming, preprint available at http://www.recursivepartitioning.com/mach1317.pdf, and appendix containing complete tables of error rates, ranks, and training times at http://www.recursivepartitioning.com/appendix.pdf
   McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd ed., London: Chapman & Hall.
   Michie, D., Spiegelhalter, D.J. and Taylor, C.C., eds. (1994), Machine Learning, Neural and Statistical Classification, NY: Ellis Horwood; this book is out of print but available online at http://www.amsta.leeds.ac.uk/~charles/statlog/
   Moore, D.S., and McCabe, G.P. (1989), Introduction to the Practice of Statistics, NY: W.H. Freeman.
   Myers, R.H. (1986), Classical and Modern Regression with Applications, Boston: Duxbury Press.
   Ripley, B.D. (1993), "Statistical Aspects of Neural Networks", in O.E. Barndorff-Nielsen, J.L. Jensen and W.S. Kendall, eds., Networks and Chaos: Statistical and Probabilistic Aspects, Chapman & Hall. ISBN 0 412 46530 2.
   Ripley, B.D. (1994), "Neural Networks and Related Methods for Classification," Journal of the Royal Statistical Society, Series B, 56, 409-456.
   Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.
   Sarle, W.S. (1994), "Neural Networks and Statistical Models," Proceedings of the Nineteenth Annual SAS Users Group International Conference, Cary, NC: SAS Institute, pp 1538-1550. (ftp://ftp.sas.com/pub/neural/neural1.ps)
   Wahba, G. (1990), Spline Models for Observational Data, SIAM.
   Wand, M.P., and Jones, M.C. (1995), Kernel Smoothing, London: Chapman & Hall.
   Weisberg, S. (1985), Applied Linear Regression, NY: Wiley.
   White, H. (1989a), "Learning in Artificial Neural Networks: A Statistical Perspective," Neural Computation, 1, 425-464.
   White, H. (1989b), "Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Network Models", J. of the American Statistical Assoc., 84, 1008-1013.
   White, H. (1990), "Connectionist Nonparametric Regression: Multilayer Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks, 3, 535-550.
   White, H. (1992a), "Nonparametric Estimation of Conditional Quantiles Using Neural Networks," in Page, C. and Le Page, R. (eds.), Computing Science and Statistics.
   White, H., and Gallant, A.R. (1992), "On Learning the Derivatives of an Unknown Mapping with Multilayer Feedforward Networks," Neural Networks, 5, 129-138.
   White, H. (1992b), Artificial Neural Networks: Approximation and Learning Theory, Blackwell.

Next part is part 2 (of 7).

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.

Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
