Top Document: comp.ai.neuralnets FAQ, Part 3 of 7: Generalization Previous Document: What is weight decay? Next Document: How to combine networks? See reader questions & answers on this topic!  Help others by sharing your knowledge By Radford Neal. Conventional training methods for multilayer perceptrons ("backprop" nets) can be interpreted in statistical terms as variations on maximum likelihood estimation. The idea is to find a single set of weights for the network that maximize the fit to the training data, perhaps modified by some sort of weight penalty to prevent overfitting. The Bayesian school of statistics is based on a different view of what it means to learn from data, in which probability is used to represent uncertainty about the relationship being learned (a use that is shunned in conventionali.e., frequentiststatistics). Before we have seen any data, our prior opinions about what the true relationship might be can be expresssed in a probability distribution over the network weights that define this relationship. After we look at the data (or after our program looks at the data), our revised opinions are captured by a posterior distribution over network weights. Network weights that seemed plausible before, but which don't match the data very well, will now be seen as being much less likely, while the probability for values of the weights that do fit the data well will have increased. Typically, the purpose of training is to make predictions for future cases in which only the inputs to the network are known. The result of conventional network training is a single set of weights that can be used to make such predictions. In contrast, the result of Bayesian training is a posterior distribution over network weights. If the inputs of the network are set to the values for some new case, the posterior distribution over network weights will give rise to a distribution over the outputs of the network, which is known as the predictive distribution for this new case. If a singlevalued prediction is needed, one might use the mean of the predictive distribution, but the full predictive distribution also tells you how uncertain this prediction is. Why bother with all this? The hope is that Bayesian methods will provide solutions to such fundamental problems as: o How to judge the uncertainty of predictions. This can be solved by looking at the predictive distribution, as described above. o How to choose an appropriate network architecture (eg, the number hidden layers, the number of hidden units in each layer). o How to adapt to the characteristics of the data (eg, the smoothness of the function, the degree to which different inputs are relevant). Good solutions to these problems, especially the last two, depend on using the right prior distribution, one that properly represents the uncertainty that you probably have about which inputs are relevant, how smooth the function is, how much noise there is in the observations, etc. Such carefully vague prior distributions are usually defined in a hierarchical fashion, using hyperparameters, some of which are analogous to the weight decay constants of more conventional training procedures. The use of hyperparameters is discussed by Mackay (1992a, 1992b, 1995) and Neal (1993a, 1996), who in particular use an "Automatic Relevance Determination" scheme that aims to allow many possiblyrelevant inputs to be included without damaging effects. Selection of an appropriate network architecture is another place where prior knowledge plays a role. One approach is to use a very general architecture, with lots of hidden units, maybe in several layers or groups, controlled using hyperparameters. This approach is emphasized by Neal (1996), who argues that there is no statistical need to limit the complexity of the network architecture when using welldesigned Bayesian methods. It is also possible to choose between architectures in a Bayesian fashion, using the "evidence" for an architecture, as discussed by Mackay (1992a, 1992b). Implementing all this is one of the biggest problems with Bayesian methods. Dealing with a distribution over weights (and perhaps hyperparameters) is not as simple as finding a single "best" value for the weights. Exact analytical methods for models as complex as neural networks are out of the question. Two approaches have been tried: 1. Find the weights/hyperparameters that are most probable, using methods similar to conventional training (with regularization), and then approximate the distribution over weights using information available at this maximum. 2. Use a Monte Carlo method to sample from the distribution over weights. The most efficient implementations of this use dynamical Monte Carlo methods whose operation resembles that of backprop with momentum. The first method comes in two flavours. Buntine and Weigend (1991) describe a procedure in which the hyperparameters are first integrated out analytically, and numerical methods are then used to find the most probable weights. MacKay (1992a, 1992b) instead finds the values for the hyperparameters that are most likely, integrating over the weights (using an approximation around the most probable weights, conditional on the hyperparameter values). There has been some controversy regarding the merits of these two procedures, with Wolpert (1993) claiming that analytically integrating over the hyperparameters is preferable because it is "exact". This criticism has been rebutted by Mackay (1993). It would be inappropriate to get into the details of this controversy here, but it is important to realize that the procedures based on analytical integration over the hyperparameters do not provide exact solutions to any of the problems of practical interest. The discussion of an analogous situation in a different statistical context by O'Hagan (1985) may be illuminating. Monte Carlo methods for Bayesian neural networks have been developed by Neal (1993a, 1996). In this approach, the posterior distribution is represented by a sample of perhaps a few dozen sets of network weights. The sample is obtained by simulating a Markov chain whose equilibrium distribution is the posterior distribution for weights and hyperparameters. This technique is known as "Markov chain Monte Carlo (MCMC)"; see Neal (1993b) for a review. The method is exact in the limit as the size of the sample and the length of time for which the Markov chain is run increase, but convergence can sometimes be slow in practice, as for any network training method. Work on Bayesian neural network learning has so far concentrated on multilayer perceptron networks, but Bayesian methods can in principal be applied to other network models, as long as they can be interpreted in statistical terms. For some models (eg, RBF networks), this should be a fairly simple matter; for others (eg, Boltzmann Machines), substantial computational problems would need to be solved. Software implementing Bayesian neural network models (intended for research use) is available from the home pages of David MacKay and Radford Neal. There are many books that discuss the general concepts of Bayesian inference, though they mostly deal with models that are simpler than neural networks. Here are some recent ones: Bernardo, J. M. and Smith, A. F. M. (1994) Bayesian Theory, New York: John Wiley. Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995) Bayesian Data Analysis, London: Chapman & Hall, ISBN 0412039915. O'Hagan, A. (1994) Bayesian Inference (Volume 2B in Kendall's Advanced Theory of Statistics), ISBN 0340529229. Robert, C. P. (1995) The Bayesian Choice, New York: SpringerVerlag. The following books and papers have tutorial material on Bayesian learning as applied to neural network models: Bishop, C. M. (1995) Neural Networks for Pattern Recognition, Oxford: Oxford University Press. Lee, H.K.H (1999), Model Selection and Model Averaging for Neural Networks, Doctoral dissertation, Carnegie Mellon University, Pittsburgh, USA, http://lib.stat.cmu.edu/~herbie/thesis.html MacKay, D. J. C. (1995) "Probable networks and plausible predictions  a review of practical Bayesian methods for supervised neural networks", available at ftp://wol.ra.phy.cam.ac.uk/pub/www/mackay/network.ps.gz. Mueller, P. and Insua, D.R. (1995) "Issues in Bayesian Analysis of Neural Network Models," Neural Computation, 10, 571592, (also Institute of Statistics and Decision Sciences Working Paper 9531), ftp://ftp.isds.duke.edu/pub/WorkingPapers/9531.ps Neal, R. M. (1996) Bayesian Learning for Neural Networks, New York: SpringerVerlag, ISBN 0387947248. Ripley, B. D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press. Thodberg, H. H. (1996) "A review of Bayesian neural networks with an application to near infrared spectroscopy", IEEE Transactions on Neural Networks, 7, 5672. Some other references: Bernardo, J.M., DeGroot, M.H., Lindley, D.V. and Smith, A.F.M., eds., (1985), Bayesian Statistics 2, Amsterdam: Elsevier Science Publishers B.V. (NorthHolland). Buntine, W. L. and Weigend, A. S. (1991) "Bayesian backpropagation", Complex Systems, 5, 603643. MacKay, D. J. C. (1992a) "Bayesian interpolation", Neural Computation, 4, 415447. MacKay, D. J. C. (1992b) "A practical Bayesian framework for backpropagation networks," Neural Computation, 4, 448472. MacKay, D. J. C. (1993) "Hyperparameters: Optimize or Integrate Out?", available at ftp://wol.ra.phy.cam.ac.uk/pub/www/mackay/alpha.ps.gz. Neal, R. M. (1993a) "Bayesian learning via stochastic dynamics", in C. L. Giles, S. J. Hanson, and J. D. Cowan (editors), Advances in Neural Information Processing Systems 5, San Mateo, California: Morgan Kaufmann, 475482. Neal, R. M. (1993b) Probabilistic Inference Using Markov Chain Monte Carlo Methods, available at ftp://ftp.cs.utoronto.ca/pub/radford/review.ps.Z. O'Hagan, A. (1985) "Shoulders in hierarchical models", in J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith (editors), Bayesian Statistics 2, Amsterdam: Elsevier Science Publishers B.V. (NorthHolland), 697710. Sarle, W. S. (1995) "Stopped Training and Other Remedies for Overfitting," Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 352360, ftp://ftp.sas.com/pub/neural/inter95.ps.Z (this is a very large compressed postscript file, 747K, 10 pages) Wolpert, D. H. (1993) "On the use of evidence in neural networks", in C. L. Giles, S. J. Hanson, and J. D. Cowan (editors), Advances in Neural Information Processing Systems 5, San Mateo, California: Morgan Kaufmann, 539546. Finally, David MacKay maintains a FAQ about Bayesian methods for neural networks, at http://wol.ra.phy.cam.ac.uk/mackay/Bayes_FAQ.html . Comments on Bayesian learning +++++++++++++++++++++++++++++ By Warren Sarle. Bayesian purists may argue over the proper way to do a Bayesian analysis, but even the crudest Bayesian computation (maximizing over both parameters and hyperparameters) is shown by Sarle (1995) to generalize better than early stopping when learning nonlinear functions. This approach requires the use of slightly informative hyperpriors and at least twice as many training cases as weights in the network. A full Bayesian analysis by MCMC can be expected to work even better under even broader conditions. Bayesian learning works well by frequentist standardswhat MacKay calls the "evidence framework" is used by frequentist statisticians under the name "empirical Bayes." Although considerable research remains to be done, Bayesian learning seems to be the most promising approach to training neural networks. Bayesian learning should not be confused with the "Bayes classifier." In the latter, the distribution of the inputs given the target class is assumed to be known exactly, and the prior probabilities of the classes are assumed known, so that the posterior probabilities can be computed by a (theoretically) simple application of Bayes' theorem. The Bayes classifier involves no learningyou must already know everything that needs to be known! The Bayes classifier is a gold standard that can almost never be used in real life but is useful in theoretical work and in simulation studies that compare classification methods. The term "Bayes rule" is also used to mean any classification rule that gives results identical to those of a Bayes classifier. Bayesian learning also should not be confused with the "naive" or "idiot's" Bayes classifier (Warner et al. 1961; Ripley, 1996), which assumes that the inputs are conditionally independent given the target class. The naive Bayes classifier is usually applied with categorical inputs, and the distribution of each input is estimated by the proportions in the training set; hence the naive Bayes classifier is a frequentist method. The term "Bayesian network" often refers not to a neural network but to a belief network (also called a causal net, influence diagram, constraint network, qualitative Markov network, or gallery). Belief networks are more closely related to expert systems than to neural networks, and do not necessarily involve learning (Pearl, 1988; Ripley, 1996). Here are some URLs on Bayesian belief networks: o http://bayes.stat.washington.edu/almond/belief.html o http://www.cs.orst.edu/~dambrosi/bayesian/frame.html o http://www2.sis.pitt.edu/~genie o http://www.research.microsoft.com/dtg/msbn References for comments: Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, San Mateo, CA: Morgan Kaufmann. Ripley, B. D. (1996) Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press. Warner, H.R., Toronto, A.F., Veasy, L.R., and Stephenson, R. (1961), "A mathematical model for medical diagnosisapplication to congenital heart disease," J. of the American Medical Association, 177, 177184. User Contributions:Comment about this article, ask questions, or add new information about this topic:Top Document: comp.ai.neuralnets FAQ, Part 3 of 7: Generalization Previous Document: What is weight decay? Next Document: How to combine networks? Part1  Part2  Part3  Part4  Part5  Part6  Part7  Single Page [ Usenet FAQs  Web FAQs  Documents  RFC Index ] Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
