Top Document: comp.ai.neuralnets FAQ, Part 3 of 7: Generalization

How to compute prediction and confidence intervals (error bars)?
================================================================

(This answer is only about half finished. I will get around to the other
half eventually.)

In addition to estimating overall generalization error, it is often useful
to be able to estimate the accuracy of the network's predictions for
individual cases.

Let:

   Y      = the target variable
   y_i    = the value of Y for the ith case
   X      = the vector of input variables
   x_i    = the value of X for the ith case
   N      = the noise in the target variable
   n_i    = the value of N for the ith case
   m(X)   = E(Y|X) = the conditional mean of Y given X
   w      = a vector of weights for a neural network
   w^     = the weight obtained via training the network
   p(X,w) = the output of a neural network given input X and weights w
   p_i    = p(x_i,w)
   L      = the number of training (learning) cases, (y_i,x_i), i=1, ..., L
   Q(w)   = the objective function

Assume the data are generated by the model:

   Y = m(X) + N
   E(N|X) = 0
   N and X are independent

The network is trained by attempting to minimize the objective function
Q(w), which, for example, could be the sum of squared errors or the
negative log likelihood based on an assumed family of noise distributions.

Given a test input x_0, a 100c% prediction interval for y_0 is an interval
[LPB_0,UPB_0] such that Pr(LPB_0 <= y_0 <= UPB_0) = c, where c is typically
.95 or .99, and the probability is computed over repeated random selection
of the training set and repeated observation of Y given the test input x_0.
A 100c% confidence interval for p_0 is an interval [LCB_0,UCB_0] such that
Pr(LCB_0 <= p_0 <= UCB_0) = c, where again the probability is computed over
repeated random selection of the training set. Note that p_0 is a nonrandom
quantity, since x_0 is given.
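
The two coverage probabilities defined above can be checked by simulation.
The following is only a rough sketch under strong assumptions that are not
part of the FAQ: a straight line p(x,w) = w0 + w1*x stands in for the
network, the model is well specified with m(x) = 1 + 2x and Gaussian noise,
the intervals are the classical least-squares ones with a normal quantile
in place of Student's t, and the confidence interval is checked against the
true conditional mean m(x_0). The names fit_line, intervals, and coverage
are hypothetical helpers made up for this illustration.

```python
import random
from statistics import NormalDist


def fit_line(xs, ys):
    """Least-squares fit of the stand-in model p(x, w) = w0 + w1*x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = ybar - w1 * xbar
    rss = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))
    s = (rss / (n - 2)) ** 0.5  # estimated noise standard deviation
    return w0, w1, s, xbar, sxx


def intervals(xs, ys, x0, c=0.95):
    """Return ((LCB_0, UCB_0), (LPB_0, UPB_0)) at the test input x0."""
    n = len(xs)
    w0, w1, s, xbar, sxx = fit_line(xs, ys)
    p0 = w0 + w1 * x0
    z = NormalDist().inv_cdf(0.5 + c / 2)  # large-sample approximation to t
    lev = 1.0 / n + (x0 - xbar) ** 2 / sxx  # variance factor of the fit at x0
    half_ci = z * s * lev ** 0.5            # sampling variability only
    half_pi = z * s * (1.0 + lev) ** 0.5    # adds the noise in y_0
    return (p0 - half_ci, p0 + half_ci), (p0 - half_pi, p0 + half_pi)


def coverage(trials=2000, L=200, x0=0.5, sigma=1.0, c=0.95, seed=0):
    """Estimate both coverage probabilities over repeated training sets."""
    rng = random.Random(seed)

    def m(x):  # assumed true conditional mean
        return 1.0 + 2.0 * x

    hit_ci = hit_pi = 0
    for _ in range(trials):
        # Draw a fresh training set of L cases.
        xs = [rng.uniform(-1.0, 1.0) for _ in range(L)]
        ys = [m(x) + rng.gauss(0.0, sigma) for x in xs]
        (lcb, ucb), (lpb, upb) = intervals(xs, ys, x0, c)
        # Draw a fresh observation of Y at the test input x0.
        y0 = m(x0) + rng.gauss(0.0, sigma)
        hit_ci += lcb <= m(x0) <= ucb
        hit_pi += lpb <= y0 <= upb
    return hit_ci / trials, hit_pi / trials


if __name__ == "__main__":
    print(coverage())
```

Under these assumptions both estimated coverages should come out close to
c, and the confidence interval is always the narrower of the two because
its half-width omits the noise term.
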
A confidence interval is narrower than the corresponding prediction
interval, since the prediction interval must include variation due to noise
in y_0, while the confidence interval does not. Both intervals include
variation due to sampling of the training set and possible variation in the
training process due, for example, to random initial weights and local
minima of the objective function.

Traditional statistical methods for nonlinear models depend on several
assumptions (Gallant, 1987):

 1. The inputs for the training cases are either fixed or obtained by
    simple random sampling or some similarly well-behaved process.
 2. Q(w) has continuous first and second partial derivatives with respect
    to w over some convex, bounded subset S_W of the weight space.
 3. Q(w) has a unique global minimum at w^, which is an interior point of
    S_W.
 4. The model is well-specified, which requires (a) that there exist
    weights w$ in the interior of S_W such that m(x) = p(x,w$), and (b)
    that the assumptions about the noise distribution are correct. (Sorry
    about the w$ notation, but I'm running out of plain text symbols.)

These traditional methods are based on a linear approximation to p(x,w) in
a neighborhood of w$, yielding a quadratic approximation to Q(w). Hence the
Hessian of Q(w) (the square matrix of second-order partial derivatives with
respect to w) frequently appears in these methods.

Assumption (3) is not satisfied for neural nets, because networks with
hidden units always have multiple global minima, and the global minima are
often improper. Hence, confidence intervals for the weights cannot be
obtained using standard Hessian-based methods. However, Hwang and Ding
(1997) have shown that confidence intervals for predicted values can be
obtained because the predicted values are statistically identified even
though the weights are not.

Cardell, Joerding, and Li (1994) describe a more serious violation of
assumption (3), namely that for some m(x), no finite global minimum exists.
In such situations, it may be possible to use regularization methods such
as weight decay to obtain valid confidence intervals (De Veaux, Schumi,
Schweinsberg, and Ungar, 1998), but more research is required on this
subject, since the derivation in the cited paper assumes a finite global
minimum.

For large samples, the sampling variability in w^ can be approximated in
various ways:

 o Fisher's information matrix, which is the expected value of the Hessian
   of Q(w) divided by L, can be used when Q(w) is the negative log
   likelihood (Spall, 1998).
 o The delta method, based on the Hessian of Q(w) or the Gauss-Newton
   approximation using the cross-product Jacobian of Q(w), can also be used
   when Q(w) is the negative log likelihood (Tibshirani, 1996; Hwang and
   Ding, 1997; De Veaux, Schumi, Schweinsberg, and Ungar, 1998).
 o The sandwich estimator, a more elaborate Hessian-based method, relaxes
   assumption (4) (Gallant, 1987; White, 1989; Tibshirani, 1996).
 o Bootstrapping can be used without knowing the form of the noise
   distribution and takes into account variability introduced by local
   minima in training, but requires training the network many times on
   different resamples of the training set (Tibshirani, 1996; Heskes,
   1997).

References:

   Cardell, N.S., Joerding, W., and Li, Y. (1994), "Why some feedforward
   networks cannot learn some polynomials," Neural Computation, 6, 761-766.

   De Veaux, R.D., Schumi, J., Schweinsberg, J., and Ungar, L.H. (1998),
   "Prediction intervals for neural networks via nonlinear regression,"
   Technometrics, 40, 273-282.

   Gallant, A.R. (1987) Nonlinear Statistical Models, NY: Wiley.

   Heskes, T. (1997), "Practical confidence and prediction intervals," in
   Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural
   Information Processing Systems 9, Cambridge, MA: The MIT Press, pp.
   176-182.

   Hwang, J.T.G., and Ding, A.A. (1997), "Prediction intervals for
   artificial neural networks," J. of the American Statistical Association,
   92, 748-757.

   Nix, D.A., and Weigend, A.S. (1995), "Learning local error bars for
   nonlinear regression," in Tesauro, G., Touretzky, D., and Leen, T.,
   (eds.) Advances in Neural Information Processing Systems 7, Cambridge,
   MA: The MIT Press, pp. 489-496.

   Spall, J.C. (1998), "Resampling-based calculation of the information
   matrix in nonlinear statistical models," Proceedings of the 4th Joint
   Conference on Information Sciences, October 23-28, Research Triangle
   Park, NC, USA, Vol 4, pp. 35-39.

   Tibshirani, R. (1996), "A comparison of some error estimates for neural
   network models," Neural Computation, 8, 152-163.

   White, H. (1989), "Some Asymptotic Results for Learning in Single Hidden
   Layer Feedforward Network Models", J. of the American Statistical
   Assoc., 84, 1008-1013.

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.

Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com
(Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
