comp.ai.neural-nets FAQ, Part 3 of 7: Generalization
Section - What is Bayesian Learning?

By Radford Neal. 

Conventional training methods for multilayer perceptrons ("backprop" nets)
can be interpreted in statistical terms as variations on maximum likelihood
estimation. The idea is to find a single set of weights for the network that
maximize the fit to the training data, perhaps modified by some sort of
weight penalty to prevent overfitting. 

The Bayesian school of statistics is based on a different view of what it
means to learn from data, in which probability is used to represent
uncertainty about the relationship being learned (a use that is shunned in
conventional--i.e., frequentist--statistics). Before we have seen any data,
our prior opinions about what the true relationship might be can be
expressed in a probability distribution over the network weights that
define this relationship. After we look at the data (or after our program
looks at the data), our revised opinions are captured by a posterior
distribution over network weights. Network weights that seemed plausible
before, but which don't match the data very well, will now be seen as being
much less likely, while the probability for values of the weights that do
fit the data well will have increased. 
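
As a minimal sketch of this updating (not taken from any of the software
mentioned in this answer), consider a model with a single weight w,
y = w*x + Gaussian noise, and evaluate the prior and posterior on a grid of
candidate weights. The data, the N(0,1) prior, the noise level, and all
function names are assumptions chosen purely for illustration.

```python
import math

def grid_prior(grid, prior_sd=1.0):
    """Normalized N(0, prior_sd^2) prior evaluated on a grid of weights."""
    unnorm = [math.exp(-0.5 * (w / prior_sd) ** 2) for w in grid]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def grid_posterior(xs, ys, grid, prior_sd=1.0, noise_sd=0.5):
    """Posterior over the grid for y = w * x + noise (assumed Gaussian)."""
    log_post = []
    for w in grid:
        log_prior = -0.5 * (w / prior_sd) ** 2
        log_lik = sum(-0.5 * ((y - w * x) / noise_sd) ** 2
                      for x, y in zip(xs, ys))
        log_post.append(log_prior + log_lik)
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_post)
    unnorm = [math.exp(lp - peak) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Illustrative data, roughly following y = 2 * x.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]
grid = [i / 10 for i in range(-30, 31)]   # candidate weights -3.0 ... 3.0

prior = grid_prior(grid)
post = grid_posterior(xs, ys, grid)
w_best = grid[post.index(max(post))]
print("most probable weight on the grid:", w_best)
print("P(w = 0) before and after the data:", prior[30], post[30])
```

Weights near 2 gain probability, while a weight of 0, which was the most
plausible value under this prior, becomes negligible once the data are seen.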

Typically, the purpose of training is to make predictions for future cases
in which only the inputs to the network are known. The result of
conventional network training is a single set of weights that can be used to
make such predictions. In contrast, the result of Bayesian training is a
posterior distribution over network weights. If the inputs of the network
are set to the values for some new case, the posterior distribution over
network weights will give rise to a distribution over the outputs of the
network, which is known as the predictive distribution for this new case. If
a single-valued prediction is needed, one might use the mean of the
predictive distribution, but the full predictive distribution also tells you
how uncertain this prediction is. 
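
A rough sketch of how a predictive distribution might be summarized, assuming
the posterior is represented by a handful of weight sets. The tiny network,
the made-up weight values, and the function names are all hypothetical, and
noise in the targets is ignored for simplicity.

```python
import math
import statistics

def net_output(weights, x):
    """Tiny one-input network: tanh hidden units feeding a linear output."""
    hidden = [math.tanh(a * x + b) for a, b in weights["hidden"]]
    return weights["bias"] + sum(v * h for v, h in zip(weights["out"], hidden))

def predictive_summary(weight_sample, x):
    """Mean and standard deviation of the output over a sample of weight sets."""
    outputs = [net_output(w, x) for w in weight_sample]
    return statistics.mean(outputs), statistics.stdev(outputs)

# Made-up stand-in for a posterior sample; a real one would come from training.
weight_sample = [
    {"hidden": [(1.0, 0.0)], "out": [1.0], "bias": 0.0},
    {"hidden": [(1.2, 0.1)], "out": [0.9], "bias": 0.1},
    {"hidden": [(0.8, -0.1)], "out": [1.1], "bias": 0.2},
]

mean, spread = predictive_summary(weight_sample, x=0.5)
print(f"prediction {mean:.3f} +/- {spread:.3f}")
```

The mean serves as the single-valued prediction, and the spread across the
weight sets indicates how uncertain that prediction is.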

Why bother with all this? The hope is that Bayesian methods will provide
solutions to such fundamental problems as: 

 o How to judge the uncertainty of predictions. This can be solved by
   looking at the predictive distribution, as described above. 
 o How to choose an appropriate network architecture (e.g., the number of
   hidden layers, the number of hidden units in each layer). 
 o How to adapt to the characteristics of the data (e.g., the smoothness of
   the function, the degree to which different inputs are relevant). 

Good solutions to these problems, especially the last two, depend on using
the right prior distribution, one that properly represents the uncertainty
that you probably have about which inputs are relevant, how smooth the
function is, how much noise there is in the observations, etc. Such
carefully vague prior distributions are usually defined in a hierarchical
fashion, using hyperparameters, some of which are analogous to the weight
decay constants of more conventional training procedures. The use of
hyperparameters is discussed by MacKay (1992a, 1992b, 1995) and Neal (1993a,
1996), who in particular use an "Automatic Relevance Determination" scheme
that aims to allow many possibly-relevant inputs to be included without
damaging effects. 
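
To make the hierarchical idea concrete, here is a hedged sketch of drawing
weights from a prior with one hyperparameter per input, in the spirit of
Automatic Relevance Determination. The Gamma distribution on the precisions,
its shape value, and the function name are assumptions for illustration; they
are not the specific priors used by MacKay or Neal.

```python
import random

def sample_ard_prior(n_inputs, n_hidden, shape=2.0, mean_precision=1.0, seed=None):
    """Draw input-to-hidden weights from a two-level (hierarchical) prior.

    Each input i gets its own precision hyperparameter, drawn from an
    assumed Gamma distribution. All weights leaving input i are then
    drawn from a zero-mean Gaussian with that precision.
    """
    rng = random.Random(seed)
    scale = mean_precision / shape   # Gamma(shape, scale) has mean shape * scale
    precisions = [rng.gammavariate(shape, scale) for _ in range(n_inputs)]
    weights = [
        [rng.gauss(0.0, precision ** -0.5) for _ in range(n_hidden)]
        for precision in precisions
    ]
    return precisions, weights

precisions, weights = sample_ard_prior(n_inputs=3, n_hidden=4, seed=1)
for i, (tau, row) in enumerate(zip(precisions, weights)):
    print(f"input {i}: precision {tau:.2f}, weights {[round(w, 2) for w in row]}")
```

An input whose precision is large has all of its weights close to zero, so it
has little effect on the output. Learning the precisions from the data is what
lets irrelevant inputs be switched off.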

Selection of an appropriate network architecture is another place where
prior knowledge plays a role. One approach is to use a very general
architecture, with lots of hidden units, maybe in several layers or groups,
controlled using hyperparameters. This approach is emphasized by Neal
(1996), who argues that there is no statistical need to limit the complexity
of the network architecture when using well-designed Bayesian methods. It is
also possible to choose between architectures in a Bayesian fashion, using
the "evidence" for an architecture, as discussed by MacKay (1992a, 1992b). 

Implementing all this is one of the biggest problems with Bayesian methods.
Dealing with a distribution over weights (and perhaps hyperparameters) is
not as simple as finding a single "best" value for the weights. Exact
analytical methods for models as complex as neural networks are out of the
question. Two approaches have been tried: 

1. Find the weights/hyperparameters that are most probable, using methods
   similar to conventional training (with regularization), and then
   approximate the distribution over weights using information available at
   this maximum. 
2. Use a Monte Carlo method to sample from the distribution over weights.
   The most efficient implementations of this use dynamical Monte Carlo
   methods whose operation resembles that of backprop with momentum. 
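
The first approach can be sketched on a toy problem. The example below is an
assumption-laden illustration, not the procedure of any cited paper: it uses a
one-weight logistic model rather than a multilayer perceptron, keeps the prior
precision fixed instead of treating it as a hyperparameter, finds the most
probable weight by Newton's method, and uses the curvature of the log
posterior at that point to define a Gaussian approximation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_and_curvature(w, xs, ts, prior_precision):
    """First and second derivatives of the log posterior with respect to w.

    Model (assumed): t ~ Bernoulli(sigmoid(w * x)), w ~ N(0, 1 / prior_precision).
    """
    grad = -prior_precision * w
    curv = -prior_precision
    for x, t in zip(xs, ts):
        p = sigmoid(w * x)
        grad += (t - p) * x
        curv -= p * (1.0 - p) * x * x
    return grad, curv

def gaussian_approximation(xs, ts, prior_precision=1.0, n_iter=50):
    """Return the most probable weight and the variance of the Gaussian fit."""
    w = 0.0
    for _ in range(n_iter):
        grad, curv = grad_and_curvature(w, xs, ts, prior_precision)
        w -= grad / curv              # Newton step toward the maximum
    _, curv = grad_and_curvature(w, xs, ts, prior_precision)
    return w, -1.0 / curv             # variance = inverse of negative curvature

# Illustrative data: the target tends to be 1 for larger x.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ts = [0, 0, 1, 0, 1, 1]

w_map, variance = gaussian_approximation(xs, ts)
print(f"most probable weight {w_map:.3f}, approximate sd {math.sqrt(variance):.3f}")
```

With many weights the curvature becomes a matrix of second derivatives, and
computing and inverting it is a large part of the cost of this approach.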

The first method comes in two flavours. Buntine and Weigend (1991) describe
a procedure in which the hyperparameters are first integrated out
analytically, and numerical methods are then used to find the most probable
weights. MacKay (1992a, 1992b) instead finds the values for the
hyperparameters that are most likely, integrating over the weights (using an
approximation around the most probable weights, conditional on the
hyperparameter values). There has been some controversy regarding the merits
of these two procedures, with Wolpert (1993) claiming that analytically
integrating over the hyperparameters is preferable because it is "exact".
This criticism has been rebutted by MacKay (1993). It would be inappropriate
to get into the details of this controversy here, but it is important to
realize that the procedures based on analytical integration over the
hyperparameters do not provide exact solutions to any of the problems of
practical interest. The discussion of an analogous situation in a different
statistical context by O'Hagan (1985) may be illuminating. 

Monte Carlo methods for Bayesian neural networks have been developed by Neal
(1993a, 1996). In this approach, the posterior distribution is represented
by a sample of perhaps a few dozen sets of network weights. The sample is
obtained by simulating a Markov chain whose equilibrium distribution is the
posterior distribution for weights and hyperparameters. This technique is
known as "Markov chain Monte Carlo (MCMC)"; see Neal (1993b) for a review.
The method is exact in the limit as the size of the sample and the length of
time for which the Markov chain is run increase, but convergence can
sometimes be slow in practice, as for any network training method. 
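
For a feel of how a Markov chain can produce such a sample, here is a minimal
sketch using the simple random-walk Metropolis algorithm on a one-weight
model. This is deliberately not the dynamical (hybrid Monte Carlo) method
used by Neal, which is far more efficient for real networks. The model, data,
step size, and function names are assumptions for illustration.

```python
import math
import random

def metropolis(log_post, w0, n_steps, step_sd, burn_in, thin, seed=None):
    """Random-walk Metropolis sampler for a single real-valued weight."""
    rng = random.Random(seed)
    w, lp = w0, log_post(w0)
    kept = []
    for i in range(n_steps):
        proposal = w + rng.gauss(0.0, step_sd)
        lp_new = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            w, lp = proposal, lp_new
        # Discard the early part of the chain, then keep every thin-th state.
        if i >= burn_in and (i - burn_in) % thin == 0:
            kept.append(w)
    return kept

# Illustrative data for the assumed model y = w * x + Gaussian noise.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]

def log_post(w, prior_sd=1.0, noise_sd=0.5):
    """Unnormalized log posterior: N(0, prior_sd^2) prior plus Gaussian likelihood."""
    log_prior = -0.5 * (w / prior_sd) ** 2
    log_lik = sum(-0.5 * ((y - w * x) / noise_sd) ** 2 for x, y in zip(xs, ys))
    return log_prior + log_lik

sample = metropolis(log_post, w0=0.0, n_steps=5000, step_sd=0.3,
                    burn_in=500, thin=100, seed=0)
print(len(sample), "weights kept; mean", sum(sample) / len(sample))
```

The few dozen weights that are kept play the role of the posterior sample
described above. Predictions are then averaged over them.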

Work on Bayesian neural network learning has so far concentrated on
multilayer perceptron networks, but Bayesian methods can in principle be
applied to other network models, as long as they can be interpreted in
statistical terms. For some models (e.g., RBF networks), this should be a
fairly simple matter; for others (e.g., Boltzmann Machines), substantial
computational problems would need to be solved. 

Software implementing Bayesian neural network models (intended for research
use) is available from the home pages of David MacKay and Radford Neal. 

There are many books that discuss the general concepts of Bayesian
inference, though they mostly deal with models that are simpler than neural
networks. Here are some recent ones: 

   Bernardo, J. M. and Smith, A. F. M. (1994) Bayesian Theory, New York:
   John Wiley. 

   Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995) Bayesian
   Data Analysis, London: Chapman & Hall, ISBN 0-412-03991-5. 

   O'Hagan, A. (1994) Bayesian Inference (Volume 2B in Kendall's Advanced
   Theory of Statistics), ISBN 0-340-52922-9. 

   Robert, C. P. (1995) The Bayesian Choice, New York: Springer-Verlag. 

The following books and papers have tutorial material on Bayesian learning
as applied to neural network models: 

   Bishop, C. M. (1995) Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Lee, H. K. H. (1999) Model Selection and Model Averaging for Neural
   Networks, Doctoral dissertation, Carnegie Mellon University,
   Pittsburgh, USA, http://lib.stat.cmu.edu/~herbie/thesis.html 

   MacKay, D. J. C. (1995) "Probable networks and plausible predictions - a
   review of practical Bayesian methods for supervised neural networks",
   available at ftp://wol.ra.phy.cam.ac.uk/pub/www/mackay/network.ps.gz. 

   Mueller, P. and Insua, D.R. (1995) "Issues in Bayesian Analysis of Neural
   Network Models," Neural Computation, 10, 571-592, (also Institute of
   Statistics and Decision Sciences Working Paper 95-31), 
   ftp://ftp.isds.duke.edu/pub/WorkingPapers/95-31.ps 

   Neal, R. M. (1996) Bayesian Learning for Neural Networks, New York:
   Springer-Verlag, ISBN 0-387-94724-8. 

   Ripley, B. D. (1996) Pattern Recognition and Neural Networks,
   Cambridge: Cambridge University Press. 

   Thodberg, H. H. (1996) "A review of Bayesian neural networks with an
   application to near infrared spectroscopy", IEEE Transactions on Neural
   Networks, 7, 56-72. 

Some other references: 

   Bernardo, J.M., DeGroot, M.H., Lindley, D.V. and Smith, A.F.M., eds.,
   (1985), Bayesian Statistics 2, Amsterdam: Elsevier Science Publishers B.V.
   (North-Holland). 

   Buntine, W. L. and Weigend, A. S. (1991) "Bayesian back-propagation", 
   Complex Systems, 5, 603-643. 

   MacKay, D. J. C. (1992a) "Bayesian interpolation", Neural Computation,
   4, 415-447. 

   MacKay, D. J. C. (1992b) "A practical Bayesian framework for
   backpropagation networks," Neural Computation, 4, 448-472. 

   MacKay, D. J. C. (1993) "Hyperparameters: Optimize or Integrate Out?",
   available at ftp://wol.ra.phy.cam.ac.uk/pub/www/mackay/alpha.ps.gz. 

   Neal, R. M. (1993a) "Bayesian learning via stochastic dynamics", in C. L.
   Giles, S. J. Hanson, and J. D. Cowan (editors), Advances in Neural
   Information Processing Systems 5, San Mateo, California: Morgan
   Kaufmann, 475-482. 

   Neal, R. M. (1993b) Probabilistic Inference Using Markov Chain Monte
   Carlo Methods, available at 
   ftp://ftp.cs.utoronto.ca/pub/radford/review.ps.Z. 

   O'Hagan, A. (1985) "Shoulders in hierarchical models", in J. M. Bernardo,
   M. H. DeGroot, D. V. Lindley, and A. F. M. Smith (editors), Bayesian
   Statistics 2, Amsterdam: Elsevier Science Publishers B.V. (North-Holland),
   697-710. 

   Sarle, W. S. (1995) "Stopped Training and Other Remedies for
   Overfitting," Proceedings of the 27th Symposium on the Interface of
   Computing Science and Statistics, 352-360, 
   ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large
   compressed postscript file, 747K, 10 pages) 

   Wolpert, D. H. (1993) "On the use of evidence in neural networks", in C.
   L. Giles, S. J. Hanson, and J. D. Cowan (editors), Advances in Neural
   Information Processing Systems 5, San Mateo, California: Morgan
   Kaufmann, 539-546. 

Finally, David MacKay maintains a FAQ about Bayesian methods for neural
networks, at http://wol.ra.phy.cam.ac.uk/mackay/Bayes_FAQ.html . 

Comments on Bayesian learning
+++++++++++++++++++++++++++++

By Warren Sarle. 

Bayesian purists may argue over the proper way to do a Bayesian analysis,
but even the crudest Bayesian computation (maximizing over both parameters
and hyperparameters) is shown by Sarle (1995) to generalize better than
early stopping when learning nonlinear functions. This approach requires the
use of slightly informative hyperpriors and at least twice as many training
cases as weights in the network. A full Bayesian analysis by MCMC can be
expected to work even better under even broader conditions. Bayesian
learning works well by frequentist standards--what MacKay calls the
"evidence framework" is used by frequentist statisticians under the name
"empirical Bayes." Although considerable research remains to be done,
Bayesian learning seems to be the most promising approach to training neural
networks. 

Bayesian learning should not be confused with the "Bayes classifier." In the
latter, the distribution of the inputs given the target class is assumed to
be known exactly, and the prior probabilities of the classes are assumed
known, so that the posterior probabilities can be computed by a
(theoretically) simple application of Bayes' theorem. The Bayes classifier
involves no learning--you must already know everything that needs to be
known! The Bayes classifier is a gold standard that can almost never be used
in real life but is useful in theoretical work and in simulation studies
that compare classification methods. The term "Bayes rule" is also used to
mean any classification rule that gives results identical to those of a
Bayes classifier. 
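
As a small illustration of that computation, suppose (purely as an
assumption) that there is one input and that each class has a known Gaussian
input distribution and a known prior probability. The class names, parameter
values, and function names below are made up.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bayes_posterior(x, classes):
    """Posterior class probabilities from known priors and known densities.

    classes maps a class name to (prior probability, mean, sd).
    Nothing is estimated from data: everything is assumed known.
    """
    joint = {name: prior * normal_pdf(x, mean, sd)
             for name, (prior, mean, sd) in classes.items()}
    total = sum(joint.values())
    return {name: value / total for name, value in joint.items()}

# Hypothetical two-class problem with equal priors and unit variances.
classes = {"A": (0.5, 0.0, 1.0), "B": (0.5, 2.0, 1.0)}

for x in (0.0, 1.0, 2.0):
    print(x, bayes_posterior(x, classes))
```

At x = 1, midway between the two class means, the two posterior probabilities
are equal. On either side, the class with the nearer mean is favoured.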

Bayesian learning also should not be confused with the "naive" or "idiot's"
Bayes classifier (Warner et al. 1961; Ripley, 1996), which assumes that the
inputs are conditionally independent given the target class. The naive Bayes
classifier is usually applied with categorical inputs, and the distribution
of each input is estimated by the proportions in the training set; hence the
naive Bayes classifier is a frequentist method. 
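
A minimal sketch of such a classifier, with the input distributions estimated
by raw proportions in the training set. The toy data set and function names
are invented for illustration, and no smoothing is applied, so a value never
seen with a class gets probability zero for that class.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels):
    """Count classes and, for each class and input position, input values."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, input index) -> value counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts

def predict_proba(model, row):
    """Class probabilities assuming inputs are independent given the class."""
    class_counts, value_counts = model
    n = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / n                                  # class proportion
        for j, value in enumerate(row):
            score *= value_counts[(label, j)][value] / count   # input proportion
        scores[label] = score
    total = sum(scores.values())
    if total == 0:
        return scores          # every class ruled out by an unseen value
    return {label: s / total for label, s in scores.items()}

# Invented training set: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
labels = ["no", "no", "yes", "yes", "yes", "no"]

model = fit_naive_bayes(rows, labels)
probs = predict_proba(model, ("sunny", "mild"))
print(probs)
```

For the query ("sunny", "mild") the proportions give 1/2 * 2/3 * 1/3 for "no"
and 1/2 * 1/3 * 1/3 for "yes", so "no" receives probability 2/3.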

The term "Bayesian network" often refers not to a neural network but to a
belief network (also called a causal net, influence diagram, constraint
network, qualitative Markov network, or gallery). Belief networks are more
closely related to expert systems than to neural networks, and do not
necessarily involve learning (Pearl, 1988; Ripley, 1996). Here are some URLs
on Bayesian belief networks: 

 o http://bayes.stat.washington.edu/almond/belief.html 
 o http://www.cs.orst.edu/~dambrosi/bayesian/frame.html 
 o http://www2.sis.pitt.edu/~genie 
 o http://www.research.microsoft.com/dtg/msbn 

References for comments: 

   Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks
   of Plausible Inference, San Mateo, CA: Morgan Kaufmann. 

   Ripley, B. D. (1996) Pattern Recognition and Neural Networks,
   Cambridge: Cambridge University Press. 

   Warner, H.R., Toronto, A.F., Veasy, L.R., and Stephenson, R. (1961), "A
   mathematical model for medical diagnosis--application to congenital heart
   disease," J. of the American Medical Association, 177, 177-184. 

Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)





Last Update March 27 2014 @ 02:11 PM