# comp.ai.neural-nets FAQ, Part 3 of 7: Generalization

Section - How to compute prediction and confidence intervals (error bars)?

Top Document: comp.ai.neural-nets FAQ, Part 3 of 7: Generalization
Previous Document: What are cross-validation and bootstrapping?
```
How to compute prediction and confidence intervals (error bars)?
================================================================

(This answer is only about half finished. I will get around to the other
half eventually.)

In addition to estimating overall generalization error, it is often useful
to be able to estimate the accuracy of the network's predictions for
individual cases.

Let:

Y      = the target variable
y_i    = the value of Y for the ith case
X      = the vector of input variables
x_i    = the value of X for the ith case
N      = the noise in the target variable
n_i    = the value of N for the ith case
m(X)   = E(Y|X) = the conditional mean of Y given X
w      = a vector of weights for a neural network
w^     = the weights obtained by training the network
p(X,w) = the output of a neural network given input X and weights w
p_i    = p(x_i,w)
L      = the number of training (learning) cases, (y_i,x_i), i=1, ..., L
Q(w)   = the objective function

Assume the data are generated by the model:

Y = m(X) + N
E(N|X) = 0
N and X are independent

The network is trained by attempting to minimize the objective function
Q(w), which, for example, could be the sum of squared errors or the
negative log likelihood based on an assumed family of noise distributions.

Given a test input x_0, a 100c% prediction interval for y_0 is an
interval [LPB_0,UPB_0] such that Pr(LPB_0 <= y_0 <=
UPB_0) = c, where c is typically .95 or .99, and the probability is
computed over repeated random selection of the training set and repeated
observation of Y given the test input x_0. A 100c% confidence interval
for p_0 is an interval [LCB_0,UCB_0] such that Pr(LCB_0 <=
p_0 <= UCB_0) = c, where again the probability is computed over
repeated random selection of the training set. Note that p_0 is a
nonrandom quantity, since x_0 is given. A confidence interval is narrower
than the corresponding prediction interval, since the prediction interval
must include variation due to noise in y_0, while the confidence interval
does not. Both intervals include variation due to sampling of the training
set and possible variation in the training process due, for example, to
random initial weights and local minima of the objective function.

Traditional statistical methods for nonlinear models depend on several
assumptions (Gallant, 1987):

1. The inputs for the training cases are either fixed or obtained by simple
random sampling or some similarly well-behaved process.
2. Q(w) has continuous first and second partial derivatives with respect
to w over some convex, bounded subset S_W of the weight space.
3. Q(w) has a unique global minimum at w^, which is an interior point of
S_W.
4. The model is well-specified, which requires (a) that there exist weights
w$ in the interior of S_W such that m(x) = p(x,w$), and (b)
that the assumptions about the noise distribution are correct. (Sorry
about the w$ notation, but I'm running out of plain text symbols.)

These traditional methods are based on a linear approximation to p(x,w)
in a neighborhood of w$, yielding a quadratic approximation to Q(w).
Hence the Hessian of Q(w) (the square matrix of second-order partial
derivatives with respect to w) frequently appears in these methods.

Assumption (3) is not satisfied for neural nets, because networks with
hidden units always have multiple global minima, and the global minima are
often improper. Hence, confidence intervals for the weights cannot be
obtained using standard Hessian-based methods. However, Hwang and Ding
(1997) have shown that confidence intervals for predicted values can be
obtained because the predicted values are statistically identified even
though the weights are not.

Cardell, Joerding, and Li (1994) describe a more serious violation of
assumption (3), namely that for some m(x), no finite global minimum
exists. In such situations, it may be possible to use regularization methods
such as weight decay to obtain valid confidence intervals (De Veaux, Schumi,
Schweinsberg, and Ungar, 1998), but more research is required on this
subject, since the derivation in the cited paper assumes a finite global
minimum.

For large samples, the sampling variability in w^ can be approximated in
various ways:

o Fisher's information matrix, which is the expected value of the Hessian
of Q(w) divided by L, can be used when Q(w) is the negative log
likelihood (Spall, 1998).
o The delta method, based on the Hessian of Q(w) or the Gauss-Newton
approximation using the cross-product Jacobian of Q(w), can also be
used when Q(w) is the negative log likelihood (Tibshirani, 1996; Hwang
and Ding, 1997; De Veaux, Schumi, Schweinsberg, and Ungar, 1998).
o The sandwich estimator, a more elaborate Hessian-based method, relaxes
assumption (4) (Gallant, 1987; White, 1989; Tibshirani, 1996).
o Bootstrapping can be used without knowing the form of the noise
distribution and takes into account variability introduced by local
minima in training, but requires training the network many times on
different resamples of the training set (Tibshirani, 1996; Heskes, 1997).

References:

Cardell, N.S., Joerding, W., and Li, Y. (1994), "Why some feedforward
networks cannot learn some polynomials," Neural Computation, 6, 761-766.

De Veaux, R.D., Schumi, J., Schweinsberg, J., and Ungar, L.H. (1998),
"Prediction intervals for neural networks via nonlinear regression,"
Technometrics, 40, 273-282.

Gallant, A.R. (1987), Nonlinear Statistical Models, NY: Wiley.

Heskes, T. (1997), "Practical confidence and prediction intervals," in
Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural
Information Processing Systems 9, Cambridge, MA: The MIT Press, pp.
176-182.

Hwang, J.T.G., and Ding, A.A. (1997), "Prediction intervals for
artificial neural networks," J. of the American Statistical Association,
92, 748-757.

Nix, D.A., and Weigend, A.S. (1995), "Learning local error bars for
nonlinear regression," in Tesauro, G., Touretzky, D., and Leen, T.,
(eds.) Advances in Neural Information Processing Systems 7, Cambridge,
MA: The MIT Press, pp. 489-496.

Spall, J.C. (1998), "Resampling-based calculation of the information
matrix in nonlinear statistical models," Proceedings of the 4th Joint
Conference on Information Sciences, October 23-28, Research Triangle
Park, NC, USA, Vol 4, pp. 35-39.

Tibshirani, R. (1996), "A comparison of some error estimates for neural
network models," Neural Computation, 8, 152-163.

White, H. (1989), "Some Asymptotic Results for Learning in Single Hidden
Layer Feedforward Network Models," J. of the American Statistical Assoc.,
84, 1008-1013.

------------------------------------------------------------------------

Next part is part 4 (of 7). Previous part is part 2.

--

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
```
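
The bootstrap approach listed in the answer above can be illustrated with a small sketch. It is not taken from the FAQ or from the cited papers: the function names (`predict`, `train`, `quantile`, `bootstrap_intervals`), the tiny one-input tanh network, and the plain stochastic gradient descent trainer are all assumptions made for illustration. The confidence interval is read off the percentiles of the bootstrap outputs p_0, and the prediction interval is formed, as one simple choice among several, by adding resampled training residuals to each bootstrap output. Training residuals tend to understate the noise, so the prediction interval here is probably somewhat too narrow.

```python
import math
import random


def predict(x, w):
    """p(x, w): one input, a few tanh hidden units, linear output."""
    bias, units = w
    return bias + sum(v * math.tanh(a * x + b) for a, b, v in units)


def train(cases, rng, hidden_units=3, epochs=150, rate=0.05):
    """Reduce the sum of squared errors Q(w) by stochastic gradient descent,
    starting from random initial weights (so local minima can differ)."""
    bias = 0.0
    units = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden_units)]
    for _ in range(epochs):
        for x, y in cases:
            acts = [math.tanh(u[0] * x + u[1]) for u in units]
            err = bias + sum(u[2] * h for u, h in zip(units, acts)) - y
            bias -= rate * err
            for u, h in zip(units, acts):
                back = err * u[2] * (1.0 - h * h)  # gradient through tanh
                u[0] -= rate * back * x
                u[1] -= rate * back
                u[2] -= rate * err * h
    return bias, units


def quantile(sorted_values, q):
    """Nearest-rank quantile of an already sorted list, 0 <= q <= 1."""
    return sorted_values[int(round(q * (len(sorted_values) - 1)))]


def bootstrap_intervals(cases, x0, c=0.95, resamples=30, seed=0):
    """Return ((LCB_0, UCB_0), (LPB_0, UPB_0)) at the test input x0.

    cases is a list of (x_i, y_i) pairs; the network is retrained on each
    resample, which is what makes the method expensive.
    """
    rng = random.Random(seed)
    outputs, simulated = [], []
    for _ in range(resamples):
        sample = [rng.choice(cases) for _ in cases]  # L cases, with replacement
        w = train(sample, rng)
        p0 = predict(x0, w)
        outputs.append(p0)
        # Assumption: resampled training residuals stand in for noise in y_0.
        simulated.extend(p0 + (y - predict(x, w)) for x, y in sample)
    outputs.sort()
    simulated.sort()
    lo, hi = (1.0 - c) / 2.0, (1.0 + c) / 2.0
    return ((quantile(outputs, lo), quantile(outputs, hi)),
            (quantile(simulated, lo), quantile(simulated, hi)))


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [rng.uniform(-2.0, 2.0) for _ in range(40)]
    cases = [(x, math.sin(x) + rng.gauss(0.0, 0.2)) for x in xs]  # toy data
    ci, pi = bootstrap_intervals(cases, x0=1.0)
    print("95% confidence interval:", ci)
    print("95% prediction interval:", pi)
```

With only 30 resamples the percentile endpoints are rough. The papers cited above for bootstrapping (Tibshirani, 1996; Heskes, 1997) describe more careful constructions, and a real application would use many more resamples and a proper training routine.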

Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM