Top Document: comp.ai.neuralnets FAQ, Part 3 of 7: Generalization
Previous Document: What is overfitting and how can I avoid it?
Next Document: What is early stopping?

Jitter is artificial noise deliberately added to the inputs during training. Training with jitter is a form of smoothing related to kernel regression (see "What is GRNN?"). It is also closely related to regularization methods such as weight decay and ridge regression.

Training with jitter works because the functions that we want NNs to learn are mostly smooth. NNs can learn functions with discontinuities, but the functions must be piecewise continuous in a finite number of regions if our network is restricted to a finite number of hidden units.

In other words, if we have two cases with similar inputs, the desired outputs will usually be similar. That means we can take any training case and generate new training cases by adding small amounts of jitter to the inputs. As long as the amount of jitter is sufficiently small, we can assume that the desired output will not change enough to be of any consequence, so we can just use the same target value. The more training cases, the merrier, so this looks like a convenient way to improve training. But too much jitter will obviously produce garbage, while too little jitter will have little effect (Koistinen and Holmström 1992).

Consider any point in the input space, not necessarily one of the original training cases. That point could possibly arise as a jittered input as a result of jittering any of several of the original neighboring training cases. The average target value at the given input point will be a weighted average of the target values of the original training cases.
For an infinite number of jittered cases, the weights will be proportional to the probability densities of the jitter distribution, located at the original training cases and evaluated at the given input point. Thus the average target values given an infinite number of jittered cases will, by definition, be the Nadaraya-Watson kernel regression estimator using the jitter density as the kernel. Hence, training with jitter is an approximation to training with the kernel regression estimator as target. Choosing the amount (variance) of jitter is equivalent to choosing the bandwidth of the kernel regression estimator (Scott 1992).

When studying nonlinear models such as feedforward NNs, it is often helpful first to consider what happens in linear models, and then to see what difference the nonlinearity makes. So let's consider training with jitter in a linear model. Notation:

   x_ij is the value of the jth input (j=1, ..., p) for the ith training case (i=1, ..., n).
   X={x_ij} is an n by p matrix.
   y_i is the target value for the ith training case.
   Y={y_i} is a column vector.

Without jitter, the least-squares weights are B = inv(X'X)X'Y, where "inv" indicates a matrix inverse and "'" indicates transposition. Note that if we replicate each training case c times, or equivalently stack c copies of the X and Y matrices on top of each other, the least-squares weights are inv(cX'X)cX'Y = (1/c)inv(X'X)cX'Y = B, same as before.

With jitter, x_ij is replaced by c cases x_ij+z_ijk, k=1, ..., c, where z_ijk is produced by some random number generator, usually with a normal distribution with mean 0 and standard deviation s, and the z_ijk's are all independent. In place of the n by p matrix X, this gives us a big matrix, say Q, with cn rows and p columns. To compute the least-squares weights, we need Q'Q.
Let's consider the jth diagonal element of Q'Q, which is

   sum_{i,k} (x_ij+z_ijk)^2 = sum_{i,k} (x_ij^2 + z_ijk^2 + 2 x_ij z_ijk)

which is approximately, for c large,

   c(sum_i x_ij^2 + ns^2)

that is, c times the sum of the corresponding diagonal element of X'X and ns^2. Now consider the u,vth off-diagonal element of Q'Q, which is

   sum_{i,k} (x_iu+z_iuk)(x_iv+z_ivk)

which is approximately, for c large,

   c(sum_i x_iu x_iv)

which is just c times the corresponding element of X'X. Thus, for c large, Q'Q is approximately c(X'X+ns^2I), where I is an identity matrix of appropriate size. Similar computations show that the cross-product of Q with the target values is approximately cX'Y. Hence the least-squares weights with jitter of variance s^2 are given by

   B(ns^2) = inv(c(X'X+ns^2I))cX'Y = inv(X'X+ns^2I)X'Y

In the statistics literature, B(ns^2) is called a ridge regression estimator with ridge value ns^2.

If we were to add jitter to the target values Y, the cross-product X'Y would not be affected for large c for the same reason that the off-diagonal elements of X'X are not affected by jitter. Hence, adding jitter to the targets will not change the optimal weights; it will just slow down training (An 1996).

The ordinary least squares training criterion is (Y-XB)'(Y-XB). Weight decay uses the training criterion (Y-XB)'(Y-XB)+d^2B'B, where d is the decay rate. Weight decay can also be implemented by inventing artificial training cases. Augment the training data with p new training cases containing the matrix dI for the inputs and a zero vector for the targets. To put this in a formula, let's use A;B to indicate the matrix A stacked on top of the matrix B, so (A;B)'(C;D)=A'C+B'D. Thus the augmented inputs are X;dI and the augmented targets are Y;0, where 0 indicates the zero vector of the appropriate size.
The squared error for the augmented training data is:

   ((Y;0)-(X;dI)B)'((Y;0)-(X;dI)B)
   = (Y;0)'(Y;0) - 2(Y;0)'(X;dI)B + B'(X;dI)'(X;dI)B
   = Y'Y - 2Y'XB + B'(X'X+d^2I)B
   = Y'Y - 2Y'XB + B'X'XB + B'(d^2I)B
   = (Y-XB)'(Y-XB)+d^2B'B

which is the weight-decay training criterion. Thus the weight-decay estimator is:

   inv[(X;dI)'(X;dI)](X;dI)'(Y;0) = inv(X'X+d^2I)X'Y

which is the same as the jitter estimator B(d^2), i.e. jitter with variance d^2/n. The equivalence between the weight-decay estimator and the jitter estimator does not hold for nonlinear models unless the jitter variance is small relative to the curvature of the nonlinear function (An 1996). However, the equivalence of the two estimators for linear models suggests that they will often produce similar results even for nonlinear models. Details for nonlinear models, including classification problems, are given in An (1996).

B(0) is obviously the ordinary least-squares estimator. It can be shown that as s^2 increases, the Euclidean norm of B(ns^2) decreases; in other words, adding jitter causes the weights to shrink. It can also be shown that under the usual statistical assumptions, there always exists some value of ns^2 > 0 such that B(ns^2) provides better expected generalization than B(0). Unfortunately, there is no way to calculate a value of ns^2 from the training data that is guaranteed to improve generalization. There are other types of shrinkage estimators called Stein estimators that do guarantee better generalization than B(0), but I'm not aware of a nonlinear generalization of Stein estimators applicable to neural networks.

The statistics literature describes numerous methods for choosing the ridge value. The most obvious way is to estimate the generalization error by cross-validation, generalized cross-validation, or bootstrapping, and to choose the ridge value that yields the smallest such estimate.
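
The linear-model results above can be checked numerically. The following is only a minimal sketch in Python with NumPy, not part of the original FAQ: the function names, the synthetic data, and the settings (n=50, p=3, s=0.5, c=2000) are all illustrative assumptions. It compares the ridge estimator B(ns^2), least squares on c jittered copies of the data, and least squares on the augmented data (X;dI), (Y;0) with d^2 = ns^2.

```python
import numpy as np

def ridge_weights(X, Y, ridge):
    """Ridge estimator inv(X'X + ridge*I) X'Y; ridge=0 gives ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(p), X.T @ Y)

def jitter_weights(X, Y, s, c, rng):
    """Least-squares weights fitted to c jittered copies of every training case."""
    n, p = X.shape
    # Q has c*n rows: c stacked copies of X plus independent N(0, s^2) noise.
    Q = np.tile(X, (c, 1)) + rng.normal(0.0, s, size=(c * n, p))
    T = np.tile(Y, c)  # the targets are reused unchanged
    return np.linalg.lstsq(Q, T, rcond=None)[0]

def weight_decay_weights(X, Y, d):
    """Least squares on the augmented data (X; dI) and (Y; 0)."""
    p = X.shape[1]
    X_aug = np.vstack([X, d * np.eye(p)])
    Y_aug = np.concatenate([Y, np.zeros(p)])
    return np.linalg.lstsq(X_aug, Y_aug, rcond=None)[0]

# Synthetic linear data (assumed purely for illustration).
rng = np.random.default_rng(0)
n, p, s = 50, 3, 0.5
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

b_ols = ridge_weights(X, Y, 0.0)                        # B(0)
b_ridge = ridge_weights(X, Y, n * s**2)                 # B(ns^2)
b_jitter = jitter_weights(X, Y, s, c=2000, rng=rng)     # approximates B(ns^2)
b_decay = weight_decay_weights(X, Y, d=np.sqrt(n) * s)  # d^2 = ns^2

print("B(0)        :", b_ols)
print("B(ns^2)     :", b_ridge)
print("jitter      :", b_jitter)
print("weight decay:", b_decay)
print("norms, B(0) vs B(ns^2):", np.linalg.norm(b_ols), np.linalg.norm(b_ridge))
```

The weight-decay and ridge weights should agree to rounding error, since that equivalence is exact. The jitter weights should only come close to them, because the approximation Q'Q = c(X'X+ns^2I) holds for c large and this sketch uses a finite c.
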
There are also quicker methods based on empirical Bayes estimation, one of which yields the following formula, useful as a first guess:

   s_1^2 = p(Y-XB(0))'(Y-XB(0)) / [n(n-p)B(0)'B(0)]

You can iterate this a few times:

   s_{l+1}^2 = p(Y-XB(0))'(Y-XB(0)) / [n(n-p)B(s_l^2)'B(s_l^2)]

Here B(s_l^2) presumably denotes the weights obtained with jitter variance s_l^2, that is, with ridge value ns_l^2. Note that the more training cases you have, the less noise you need.

References:

   An, G. (1996), "The effects of adding noise during backpropagation training on a generalization performance," Neural Computation, 8, 643-674.

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.

   Holmström, L. and Koistinen, P. (1992), "Using additive noise in back-propagation training," IEEE Transactions on Neural Networks, 3, 24-38.

   Koistinen, P. and Holmström, L. (1992), "Kernel regression and backpropagation training with noise," NIPS 4, 1033-1039.

   Reed, R.D., and Marks, R.J., II (1999), Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The MIT Press, ISBN 0-262-18190-8.

   Scott, D.W. (1992), Multivariate Density Estimation, Wiley.

   Vinod, H.D. and Ullah, A. (1981), Recent Advances in Regression Methods, NY: Marcel Dekker.

Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
