FAQ, Part 3 of 7: Generalization
Section - How is generalization possible?


During learning, the outputs of a supervised neural net come to approximate
the target values given the inputs in the training set. This ability may be
useful in itself, but more often the purpose of using a neural net is to
generalize--i.e., to have the outputs of the net approximate target values
given inputs that are not in the training set. Generalization is not always
possible, despite the blithe assertions of some authors. For example,
Caudill and Butler, 1990, p. 8, claim that "A neural network is able to
generalize", but they provide no justification for this claim, and they
completely neglect the complex issues involved in getting good
generalization. Anyone who reads this newsgroup is well aware from the
numerous posts pleading for help that artificial neural networks do not
automatically generalize. 

Generalization requires prior knowledge, as pointed out by Hume (1739/1978),
Russell (1948), and Goodman (1954/1983) and rigorously proved by Wolpert
(1995a, 1996a, 1996b). For any practical application, you have to know what
the relevant inputs are (you can't simply include every imaginable input).
You have to know a restricted class of input-output functions that contains
an adequate approximation to the function you want to learn (you can't use a
learning method that is capable of fitting every imaginable function). And
you have to know that the cases you want to generalize to bear some
resemblance to the training cases. Thus, there are three conditions that are
typically necessary--although not sufficient--for good generalization: 

1. The first necessary condition is that the inputs to the network contain
   sufficient information pertaining to the target, so that there exists a
   mathematical function relating correct outputs to inputs with the desired
   degree of accuracy. You can't expect a network to learn a nonexistent
   function--neural nets are not clairvoyant! For example, if you want to
   forecast the price of a stock, a historical record of the stock's prices
   is rarely sufficient input; you need detailed information on the
   financial state of the company as well as general economic conditions,
   and to avoid nasty surprises, you should also include inputs that can
   accurately predict wars in the Middle East and earthquakes in Japan.
   Finding good inputs for a net and collecting enough training data often
   take far more time and effort than training the network. 

2. The second necessary condition is that the function you are trying to
   learn (that relates inputs to correct outputs) be, in some sense, smooth.
   In other words, a small change in the inputs should, most of the time,
   produce a small change in the outputs. For continuous inputs and targets,
   smoothness of the function implies continuity and restrictions on the
   first derivative over most of the input space. Some neural nets can learn
   discontinuities as long as the function consists of a finite number of
   continuous pieces. Very nonsmooth functions such as those produced by
   pseudo-random number generators and encryption algorithms cannot be
   generalized by neural nets. Often a nonlinear transformation of the input
   space can increase the smoothness of the function and improve
   generalization.

   For classification, if you do not need to estimate posterior
   probabilities, then smoothness is not theoretically necessary. In
   particular, feedforward networks with one hidden layer trained by
   minimizing the error rate (a very tedious training method) are
   universally consistent classifiers if the number of hidden units grows at
   a suitable rate relative to the number of training cases (Devroye,
   Györfi, and Lugosi, 1996). However, you are likely to get better
   generalization with realistic sample sizes if the classification
   boundaries are smoother. 

   For Boolean functions, the concept of smoothness is more elusive. It
   seems intuitively clear that a Boolean network with a small number of
   hidden units and small weights will compute a "smoother" input-output
   function than a network with many hidden units and large weights. If you
   know a good reference characterizing Boolean functions for which good
   generalization is possible, please inform the FAQ maintainer.

3. The third necessary condition for good generalization is that the
   training cases be a sufficiently large and representative subset
   ("sample" in statistical terminology) of the set of all cases that you
   want to generalize to (the "population" in statistical terminology). The
   importance of this condition is related to the fact that there are,
   loosely speaking, two different types of generalization: interpolation
   and extrapolation. Interpolation applies to cases that are more or less
   surrounded by nearby training cases; everything else is extrapolation. In
   particular, cases that are outside the range of the training data require
   extrapolation. Cases inside large "holes" in the training data may also
   effectively require extrapolation. Interpolation can often be done
   reliably, but extrapolation is notoriously unreliable. Hence it is
   important to have sufficient training data to avoid the need for
   extrapolation. Methods for selecting good training sets are discussed in
   numerous statistical textbooks on sample surveys and experimental design.
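
The difference between interpolation and extrapolation in condition 3 can
be illustrated with a small sketch. Everything in it is an assumption made
for illustration: a Gaussian kernel smoother stands in for a flexible
nonlinear model (it is not a neural net), the target is x squared, and the
training inputs, test inputs, and bandwidth are arbitrary choices.

```python
import math

def kernel_predict(x, train_x, train_y, bandwidth=0.2):
    """Gaussian-kernel weighted average of the training targets.

    A stand-in for a flexible nonlinear model; not a neural net.
    """
    weights = [math.exp(-((x - xi) ** 2) / (2.0 * bandwidth ** 2))
               for xi in train_x]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)

def target(x):
    """Hypothetical smooth function the model is supposed to learn."""
    return x * x

# Training inputs cover only the interval [0, 3].
train_x = [i * 0.1 for i in range(31)]
train_y = [target(x) for x in train_x]

# x = 1.55 is surrounded by training cases: interpolation.
interp_error = abs(kernel_predict(1.55, train_x, train_y) - target(1.55))

# x = 5.0 lies outside the range of the training data: extrapolation.
extrap_error = abs(kernel_predict(5.0, train_x, train_y) - target(5.0))

print("interpolation error:", interp_error)
print("extrapolation error:", extrap_error)
```

Inside the training range the prediction stays close to the target.
Outside it, this particular model can only repeat the targets of the
nearest training cases, so the error is far larger. Other flexible models
fail in other ways, but the unreliability is of the same kind.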

Thus, for an input-output function that is smooth, if you have a test case
that is close to some training cases, the correct output for the test case
will be close to the correct outputs for those training cases. If you have
an adequate sample for your training set, every case in the population will
be close to a sufficient number of training cases. Hence, under these
conditions and with proper training, a neural net will be able to generalize
reliably to the population. 
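
The argument above can be checked numerically with a minimal sketch,
assuming a one-nearest-neighbor predictor in place of a trained net. The
functions smooth and scrambled, the hash-based construction of the
nonsmooth function, and the grid of training and test inputs are all
hypothetical choices made for illustration.

```python
import hashlib
import math

def smooth(x):
    """A smooth target: nearby inputs give nearby outputs."""
    return math.sin(2.0 * math.pi * x)

def scrambled(x):
    """A very nonsmooth target: hash the input to a value in [0, 1].

    Nearby inputs give essentially unrelated outputs, much like a
    pseudo-random number generator or an encryption algorithm.
    """
    digest = hashlib.sha256(repr(x).encode("ascii")).hexdigest()
    return int(digest, 16) / 2.0 ** 256

def nearest_neighbor_error(func, train_x, test_x):
    """Mean absolute error when each test case is predicted by the
    correct output of its closest training case."""
    total = 0.0
    for x in test_x:
        nearest = min(train_x, key=lambda t: abs(t - x))
        total += abs(func(nearest) - func(x))
    return total / len(test_x)

# Every test case lies halfway between two adjacent training cases.
train_x = [i / 100 for i in range(101)]
test_x = [(i + 0.5) / 100 for i in range(100)]

smooth_error = nearest_neighbor_error(smooth, train_x, test_x)
scrambled_error = nearest_neighbor_error(scrambled, train_x, test_x)

print("smooth target, mean error:   ", smooth_error)
print("scrambled target, mean error:", scrambled_error)
```

With the same training and test cases, closeness in the inputs helps only
for the smooth target. For the scrambled target, knowing the output at a
nearby training case tells you almost nothing about the test case.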

If you have more information about the function, e.g. that the outputs
should be linearly related to the inputs, you can often take advantage of
this information by placing constraints on the network or by fitting a more
specific model, such as a linear model, to improve generalization.
Extrapolation is much more reliable in linear models than in flexible
nonlinear models, although still not nearly as safe as interpolation. You
can also use such information to choose the training cases more efficiently.
For example, with a linear model, you should choose training cases at the
outer limits of the input space instead of evenly distributing them
throughout the input space. 
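
The advice about choosing training cases for a linear model follows from
the usual formula for the variance of a least-squares slope with one
input, which is the noise variance divided by the sum of squared deviations
of the inputs from their mean. The sketch below assumes that the linear
model is correct and that the noise is independent with equal variance.
The function names, the interval [0, 1], and the sample size of 10 are
illustrative assumptions.

```python
def sum_of_squares(xs):
    """Sum of squared deviations of the inputs from their mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

def slope_variance(xs, noise_sd=1.0):
    """Variance of the least-squares slope estimate for inputs xs,
    assuming a correct linear model with independent, equal-variance
    noise of standard deviation noise_sd."""
    return noise_sd ** 2 / sum_of_squares(xs)

n = 10

# Design 1: training cases spread evenly over the input space [0, 1].
evenly_spaced = [i / (n - 1) for i in range(n)]

# Design 2: half the cases at each outer limit of the input space.
at_the_limits = [0.0] * (n // 2) + [1.0] * (n // 2)

ratio = slope_variance(evenly_spaced) / slope_variance(at_the_limits)

print("slope variance, evenly spaced:", slope_variance(evenly_spaced))
print("slope variance, at the limits:", slope_variance(at_the_limits))
print("ratio:", ratio)
```

For these ten cases the evenly spaced design gives a slope variance about
2.45 times larger than the design at the outer limits. The comparison
holds only if the function really is linear, since a design with cases
only at the limits provides no data with which to detect curvature.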


   Caudill, M. and Butler, C. (1990), Naturally Intelligent Systems,
   Cambridge, MA: The MIT Press. 

   Devroye, L., Györfi, L., and Lugosi, G. (1996), A Probabilistic Theory of
   Pattern Recognition, NY: Springer. 

   Goodman, N. (1954/1983), Fact, Fiction, and Forecast, 1st/4th ed.,
   Cambridge, MA: Harvard University Press. 

   Holland, J.H., Holyoak, K.J., Nisbett, R.E., and Thagard, P.R. (1986),
   Induction: Processes of Inference, Learning, and Discovery, Cambridge, MA:
   The MIT Press. 

   Howson, C. and Urbach, P. (1989), Scientific Reasoning: The Bayesian
   Approach, La Salle, IL: Open Court. 

   Hume, D. (1739/1978), A Treatise of Human Nature, Selby-Bigge, L.A.,
   and Nidditch, P.H. (eds.), Oxford: Oxford University Press. 

   Plotkin, H. (1993), Darwin Machines and the Nature of Knowledge,
   Cambridge, MA: Harvard University Press. 

   Russell, B. (1948), Human Knowledge: Its Scope and Limits, London:

   Stone, C.J. (1977), "Consistent nonparametric regression," Annals of
   Statistics, 5, 595-645. 

   Stone, C.J. (1982), "Optimal global rates of convergence for
   nonparametric regression," Annals of Statistics, 10, 1040-1053. 

   Vapnik, V.N. (1995), The Nature of Statistical Learning Theory, NY:

   Wolpert, D.H. (1995a), "The relationship between PAC, the statistical
   physics framework, the Bayesian framework, and the VC framework," in
   Wolpert (1995b), 117-214. 

   Wolpert, D.H. (ed.) (1995b), The Mathematics of Generalization: The
   Proceedings of the SFI/CNLS Workshop on Formal Approaches to
   Supervised Learning, Santa Fe Institute Studies in the Sciences of
   Complexity, Volume XX, Reading, MA: Addison-Wesley. 

   Wolpert, D.H. (1996a), "The lack of a priori distinctions between
   learning algorithms," Neural Computation, 8, 1341-1390. 

   Wolpert, D.H. (1996b), "The existence of a priori distinctions between
   learning algorithms," Neural Computation, 8, 1391-1420. 


Send corrections/additions to the FAQ Maintainer: (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM