Top Document: FAQ, Part 1 of 7: Introduction
Section - How are NNs related to statistical methods?

There is considerable overlap between the fields of neural networks and
statistics. Statistics is concerned with data analysis. In neural network
terminology, statistical inference means learning to generalize from noisy
data. Some neural networks are not concerned with data analysis (e.g., those
intended to model biological systems) and therefore have little to do with
statistics. Some neural networks do not learn (e.g., Hopfield nets) and
therefore have little to do with statistics. Some neural networks can learn
successfully only from noise-free data (e.g., ART or the perceptron rule)
and therefore would not be considered statistical methods. But most neural
networks that can learn to generalize effectively from noisy data are
similar or identical to statistical methods. For example: 

 o Feedforward nets with no hidden layer (including functional-link neural
   nets and higher-order neural nets) are basically generalized linear
   models. 
 o Feedforward nets with one hidden layer are closely related to projection
   pursuit regression. 
 o Probabilistic neural nets are identical to kernel discriminant analysis. 
 o Kohonen nets for adaptive vector quantization are very similar to k-means
   cluster analysis. 
 o Kohonen self-organizing maps are discrete approximations to principal
   curves and surfaces. 
 o Hebbian learning is closely related to principal component analysis. 
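
The first equivalence above can be made concrete: a feedforward net with no
hidden layer and a logistic output unit computes sigmoid(w.x + b), which is
exactly the model fitted by logistic regression, a generalized linear model.
A minimal NumPy sketch, with made-up toy data and hand-picked learning
settings:

```python
import numpy as np

# A feedforward net with no hidden layer and a logistic output unit
# computes sigmoid(w.x + b) -- the same functional form as logistic
# regression. Training it with the cross-entropy criterion is
# maximum-likelihood logistic regression fitted by gradient descent.

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy two-class data: the class is determined by the sign of x1 + x2.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
lr = 0.5
for _ in range(2000):
    p = sigmoid(X @ w + b)           # network output = P(y=1 | x)
    grad_w = X.T @ (p - y) / len(y)  # gradient of mean cross-entropy
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(acc)
```

Because the data are linearly separable by construction, the fitted weights
align with the (1, 1) direction and training accuracy approaches 1.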

Some neural network areas that appear to have no close relatives in the
existing statistical literature are: 

 o Reinforcement learning (although this is treated in the operations
   research literature on Markov decision processes). 
 o Stopped training (the purpose and effect of stopped training are similar
   to shrinkage estimation, but the method is quite different). 

Feedforward nets are a subset of the class of nonlinear regression and
discrimination models. Statisticians have studied the properties of this
general class but did not consider the specific case of feedforward neural
nets until such networks were popularized in the neural network field.
Still, many results from the statistical theory of nonlinear models apply
directly to feedforward nets, and the methods that are commonly used for
fitting nonlinear models, such as various Levenberg-Marquardt and conjugate
gradient algorithms, can be used to train feedforward nets. The application
of statistical theory to neural networks is explored in detail by Bishop
(1995) and Ripley (1996). Several summary articles have also been published
relating statistical models to neural networks, including Cheng and
Titterington (1994), Kuan and White (1994), Ripley (1993, 1994), Sarle
(1994), and several articles in Cherkassky, Friedman, and Wechsler (1994).
Among the many statistical concepts important to neural nets is the
bias/variance trade-off in nonparametric estimation, discussed by Geman,
Bienenstock, and Doursat (1992). Some more advanced results of
statistical theory applied to neural networks are given by White (1989a,
1989b, 1990, 1992a) and White and Gallant (1992), reprinted in White
(1992b). 

While neural nets are often defined in terms of their algorithms or
implementations, statistical methods are usually defined in terms of their
results. The arithmetic mean, for example, can be computed by a (very
simple) backprop net, by applying the usual formula SUM(x_i)/n, or by
various other methods. What you get is still an arithmetic mean regardless
of how you compute it. So a statistician would consider standard backprop,
Quickprop, and Levenberg-Marquardt to be different algorithms for fitting
the same statistical model, such as a feedforward net. On the other hand,
different training criteria, such as least squares and cross entropy, are
viewed by statisticians as fundamentally different estimation methods with
different statistical properties. 
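
The point about the mean can be demonstrated directly: a "backprop net"
consisting of a single trainable constant, fitted by gradient descent on
the least-squares criterion, converges to the same value as the usual
formula. A toy sketch (the data and learning rate are arbitrary):

```python
import numpy as np

# The arithmetic mean via two "algorithms for the same statistic":
# the usual formula sum(x_i)/n, and a trivial net -- a single
# parameter m, trained by gradient descent on the least-squares
# criterion sum((x_i - m)^2)/n, whose minimizer is the mean.

x = np.array([2.0, 4.0, 6.0, 8.0])

mean_formula = x.sum() / len(x)

m = 0.0                                   # the network's one parameter
lr = 0.1
for _ in range(500):
    grad = -2.0 * np.sum(x - m) / len(x)  # d/dm of mean squared error
    m -= lr * grad

print(mean_formula, m)
```

Both routes arrive at the same statistic; only the algorithm differs.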

It is sometimes claimed that neural networks, unlike statistical models,
require no distributional assumptions. In fact, neural networks involve
exactly the same sort of distributional assumptions as statistical models
(Bishop, 1995), but statisticians study the consequences and importance of
these assumptions while many neural networkers ignore them. For example,
least-squares training methods are widely used by statisticians and neural
networkers. Statisticians realize that least-squares training involves
implicit distributional assumptions in that least-squares estimates have
certain optimality properties for noise that is normally distributed with
equal variance for all training cases and that is independent between
different cases. These optimality properties are consequences of the fact
that least-squares estimation is maximum likelihood under those conditions.
Similarly, cross-entropy is maximum likelihood for noise with a Bernoulli
distribution. If you study the distributional assumptions, then you can
recognize and deal with violations of the assumptions. For example, if you
have normally distributed noise but some training cases have greater noise
variance than others, then you may be able to use weighted least squares
instead of ordinary least squares to obtain more efficient estimates. 
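
A small simulation, with made-up noise levels, illustrates that efficiency
gain: when the noise variance differs across cases and is known, weighting
each case by the reciprocal of its variance gives slope estimates that vary
less across repeated samples than ordinary least squares does:

```python
import numpy as np

# Heteroscedastic noise: ordinary least squares is still unbiased, but
# weighted least squares (weights = 1/variance) is more efficient.
# Hypothetical setup: a slope-through-the-origin model whose second
# half of cases has a much larger noise standard deviation.

rng = np.random.default_rng(1)
n, true_slope = 100, 2.0
x = np.linspace(1.0, 10.0, n)
sigma = np.where(np.arange(n) < n // 2, 0.5, 5.0)  # unequal noise levels
weights = 1.0 / sigma**2

ols_est, wls_est = [], []
for _ in range(500):
    y = true_slope * x + rng.normal(scale=sigma)
    # closed-form slope-through-origin estimators:
    ols_est.append(np.sum(x * y) / np.sum(x * x))
    wls_est.append(np.sum(weights * x * y) / np.sum(weights * x * x))

print(np.std(ols_est), np.std(wls_est))  # WLS spreads less than OLS
```

Both estimators center on the true slope, but the weighted one has a
visibly smaller sampling standard deviation.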

Hundreds, perhaps thousands of people have run comparisons of neural nets
with "traditional statistics" (whatever that means). Most such studies
involve one or two data sets, and are of little use to anyone else unless
they happen to be analyzing the same kind of data. But there is an
impressive comparative study of supervised classification by Michie,
Spiegelhalter, and Taylor (1994), which not only compares many
classification methods on many data sets, but also provides unusually
extensive analyses of the results. Another useful study on supervised
classification by Lim, Loh, and Shih (1999) is available on-line. There is
an excellent comparison of unsupervised Kohonen networks and k-means
clustering by Balakrishnan, Cooper, Jacob, and Lewis (1994). 

There are many methods in the statistical literature that can be used for
flexible nonlinear modeling. These methods include: 

 o Polynomial regression (Eubank, 1999) 
 o Fourier series regression (Eubank, 1999; Haerdle, 1990) 
 o Wavelet smoothing (Donoho and Johnstone, 1995; Donoho, Johnstone,
   Kerkyacharian, and Picard, 1995) 
 o K-nearest neighbor regression and discriminant analysis (Haerdle, 1990;
   Hand, 1981, 1997; Ripley, 1996) 
 o Kernel regression and discriminant analysis (Eubank, 1999; Haerdle, 1990;
   Hand, 1981, 1982, 1997; Ripley, 1996) 
 o Local polynomial smoothing (Eubank, 1999; Wand and Jones, 1995; Fan and
   Gijbels, 1995) 
 o LOESS (Cleveland and Grosse, 1991) 
 o Smoothing splines (such as thin-plate splines) (Eubank, 1999; Wahba,
   1990; Green and Silverman, 1994; Haerdle, 1990) 
 o B-splines (Eubank, 1999) 
 o Tree-based models (CART, AID, etc.) (Haerdle, 1990; Lim, Loh, and Shih,
   1999; Hand, 1997; Ripley, 1996) 
 o Multivariate adaptive regression splines (MARS) (Friedman, 1991) 
 o Projection pursuit (Friedman and Stuetzle, 1981; Haerdle, 1990; Ripley,
   1996) 
 o Various Bayesian methods (Dey, 1998) 
 o GMDH (Farlow, 1984) 
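
As a concrete instance of one method on this list, here is a minimal
Nadaraya-Watson kernel regression smoother, with a Gaussian kernel and a
bandwidth chosen by hand rather than by any data-driven rule:

```python
import numpy as np

# Kernel regression: estimate f(x0) as a locally weighted average of
# the y_i, with weights that decay with distance from x0. The
# bandwidth h controls the smoothness of the fit (picked by eye here).

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, 200))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)  # noisy sine curve

def kernel_smooth(x0, x, y, h):
    w = np.exp(-0.5 * ((x0 - x) / h) ** 2)  # Gaussian kernel weights
    return np.sum(w * y) / np.sum(w)

fitted = np.array([kernel_smooth(x0, x, y, h=0.3) for x0 in x])
rmse = np.sqrt(np.mean((fitted - np.sin(x)) ** 2))
print(rmse)
```

The smoothed curve tracks the underlying sine function much more closely
than the raw noisy observations do.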

Why use neural nets rather than any of the above methods? There are many
answers to that question depending on what kind of neural net you're
interested in. The most popular variety of neural net, the MLP, tends to be
useful in the same situations as projection pursuit regression, i.e.: 

 o the number of inputs is fairly large, 
 o many of the inputs are relevant, but 
 o most of the predictive information lies in a low-dimensional subspace. 

The main advantage of MLPs over projection pursuit regression is that
computing predicted values from MLPs is simpler and faster. Also, MLPs are
better at learning moderately pathological functions than are many other
methods with stronger smoothness assumptions, as long as the number of
pathological features (such as discontinuities) in the function is not too
large. For more discussion of the theoretical benefits of various types of
neural nets, see How do MLPs compare with RBFs? 

Communication between statisticians and neural net researchers is often
hindered by the different terminology used in the two fields. A comparison
of neural net and statistical jargon is available on-line. 

For free statistical software, see the StatLib repository at Carnegie
Mellon University. 

There are zillions of introductory textbooks on statistics. One of the
better ones is Moore and McCabe (1989). At an intermediate level, the books
on linear regression by Weisberg (1985) and Myers (1986), on logistic
regression by Hosmer and Lemeshow (1989), and on discriminant analysis by
Hand (1981) can be recommended. At a more advanced level, the book on
generalized linear models by McCullagh and Nelder (1989) is an essential
reference, and the book on nonlinear regression by Gallant (1987) has much
material relevant to neural nets. 

Several introductory statistics texts are available on the web: 

 o David Lane, HyperStat 
 o Jan de Leeuw (ed.), Statistics: The Study of Stability in Variation 
 o StatSoft, Inc., Electronic Statistics Textbook 
 o David Stockburger, Introductory Statistics: Concepts, Models, and
   Applications 
 o University of Newcastle (Australia) Statistics Department, SurfStat 

A more advanced book covering many topics that are also relevant to NNs is: 

   Harrell, F.E. (2001), Regression Modeling Strategies: With
   Applications to Linear Models, Logistic Regression, and Survival Analysis,
   NY: Springer. 

References: 

   Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994) "A
   study of the classification capabilities of neural networks using
   unsupervised learning: A comparison with k-means clustering",
   Psychometrika, 59, 509-525. 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Cheng, B. and Titterington, D.M. (1994), "Neural Networks: A Review from
   a Statistical Perspective", Statistical Science, 9, 2-54. 

   Cherkassky, V., Friedman, J.H., and Wechsler, H., eds. (1994), From
   Statistics to Neural Networks: Theory and Pattern Recognition
   Applications, Berlin: Springer-Verlag. 

   Cleveland, W.S., and Grosse, E. (1991), "Computational Methods for Local
   Regression," Statistics and Computing, 1, 47-62. 

   Dey, D., ed. (1998) Practical Nonparametric and Semiparametric Bayesian
   Statistics, Springer Verlag. 

   Donoho, D.L., and Johnstone, I.M. (1995), "Adapting to unknown smoothness
   via wavelet shrinkage," J. of the American Statistical Association, 90,
   1200-1224. 

   Donoho, D.L., Johnstone, I.M., Kerkyacharian, G., and Picard, D. (1995),
   "Wavelet shrinkage: asymptopia?" (with discussion), J. of the Royal
   Statistical Society, Series B, 57, 301-369. 

   Eubank, R.L. (1999), Nonparametric Regression and Spline Smoothing, 2nd
   ed., Marcel Dekker, ISBN 0-8247-9337-4. 

   Fan, J., and Gijbels, I. (1995), "Data-driven bandwidth selection in
   local polynomial fitting: variable bandwidth and spatial adaptation,"
   J. of the Royal Statistical Society, Series B, 57, 371-394. 

   Farlow, S.J. (1984), Self-organizing Methods in Modeling: GMDH Type
   Algorithms, NY: Marcel Dekker. (GMDH) 

   Friedman, J.H. (1991), "Multivariate adaptive regression splines", Annals
   of Statistics, 19, 1-141. (MARS) 

   Friedman, J.H. and Stuetzle, W. (1981) "Projection pursuit regression,"
   J. of the American Statistical Association, 76, 817-823. 

   Gallant, A.R. (1987) Nonlinear Statistical Models, NY: Wiley. 

   Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
   the Bias/Variance Dilemma", Neural Computation, 4, 1-58. 

   Green, P.J., and Silverman, B.W. (1994), Nonparametric Regression and
   Generalized Linear Models: A Roughness Penalty Approach, London:
   Chapman & Hall. 

   Haerdle, W. (1990), Applied Nonparametric Regression, Cambridge:
   Cambridge University Press. 

   Hand, D.J. (1981) Discrimination and Classification, NY: Wiley. 

   Hand, D.J. (1982) Kernel Discriminant Analysis, Research Studies Press. 

   Hand, D.J. (1997) Construction and Assessment of Classification Rules,
   NY: Wiley. 

   Hill, T., Marquez, L., O'Connor, M., and Remus, W. (1994), "Artificial
   neural network models for forecasting and decision making," International
   J. of Forecasting, 10, 5-15. 

   Kuan, C.-M. and White, H. (1994), "Artificial Neural Networks: An
   Econometric Perspective", Econometric Reviews, 13, 1-91. 

   Kushner, H. & Clark, D. (1978), Stochastic Approximation Methods for
   Constrained and Unconstrained Systems, Springer-Verlag. 

   Lim, T.-S., Loh, W.-Y. and Shih, Y.-S. (1999), "A comparison of
   prediction accuracy, complexity, and training time of thirty-three old
   and new classification algorithms," Machine Learning, forthcoming;
   preprint and an appendix containing complete tables of error rates,
   ranks, and training times are available on-line. 

   McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
   ed., London: Chapman & Hall. 

   Michie, D., Spiegelhalter, D.J. and Taylor, C.C., eds. (1994), Machine
   Learning, Neural and Statistical Classification, NY: Ellis Horwood; this
   book is out of print but available on-line. 

   Moore, D.S., and McCabe, G.P. (1989), Introduction to the Practice of
   Statistics, NY: W.H. Freeman. 

   Myers, R.H. (1986), Classical and Modern Regression with Applications,
   Boston: Duxbury Press. 

   Ripley, B.D. (1993), "Statistical Aspects of Neural Networks", in O.E.
   Barndorff-Nielsen, J.L. Jensen and W.S. Kendall, eds., Networks and
   Chaos: Statistical and Probabilistic Aspects, Chapman & Hall. ISBN 0 412
   46530 2. 

   Ripley, B.D. (1994), "Neural Networks and Related Methods for
   Classification," Journal of the Royal Statistical Society, Series B, 56,
   409-456. 

   Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 

   Sarle, W.S. (1994), "Neural Networks and Statistical Models,"
   Proceedings of the Nineteenth Annual SAS Users Group International
   Conference, Cary, NC: SAS Institute, pp. 1538-1550. 

   Wahba, G. (1990), Spline Models for Observational Data, SIAM. 

   Wand, M.P., and Jones, M.C. (1995), Kernel Smoothing, London: Chapman &
   Hall. 
   Weisberg, S. (1985), Applied Linear Regression, NY: Wiley. 

   White, H. (1989a), "Learning in Artificial Neural Networks: A Statistical
   Perspective," Neural Computation, 1, 425-464. 

   White, H. (1989b), "Some Asymptotic Results for Learning in Single Hidden
   Layer Feedforward Network Models", J. of the American Statistical Assoc.,
   84, 1008-1013. 

   White, H. (1990), "Connectionist Nonparametric Regression: Multilayer
   Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks, 3,
   535-549. 

   White, H. (1992a), "Nonparametric Estimation of Conditional Quantiles
   Using Neural Networks," in Page, C. and Le Page, R. (eds.), Computing
   Science and Statistics. 

   White, H., and Gallant, A.R. (1992), "On Learning the Derivatives of an
   Unknown Mapping with Multilayer Feedforward Networks," Neural Networks,
   5, 129-138. 

   White, H. (1992b), Artificial Neural Networks: Approximation and
   Learning Theory, Blackwell. 


Next part is part 2 (of 7). 


Warren S. Sarle, SAS Institute Inc., SAS Campus Drive, Cary, NC 27513, USA,
(919) 677-8000. The opinions expressed here are mine and not necessarily
those of SAS Institute.


Send corrections/additions to the FAQ Maintainer: (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM