FAQ, Part 3 of 7: Generalization
Section - What is jitter? (Training with noise)


Top Document: FAQ, Part 3 of 7: Generalization
Previous Document: What is overfitting and how can I avoid it?
Next Document: What is early stopping?

Jitter is artificial noise deliberately added to the inputs during training.
Training with jitter is a form of smoothing related to kernel regression
(see "What is GRNN?"). It is also closely related to regularization methods
such as weight decay and ridge regression. 

Training with jitter works because the functions that we want NNs to learn
are mostly smooth. NNs can learn functions with discontinuities, but the
functions must be piecewise continuous in a finite number of regions if our
network is restricted to a finite number of hidden units. 

In other words, if we have two cases with similar inputs, the desired
outputs will usually be similar. That means we can take any training case
and generate new training cases by adding small amounts of jitter to the
inputs. As long as the amount of jitter is sufficiently small, we can assume
that the desired output will not change enough to be of any consequence, so
we can just use the same target value. The more training cases, the merrier,
so this looks like a convenient way to improve training. But too much jitter
will obviously produce garbage, while too little jitter will have little
effect (Koistinen and Holmström 1992). 
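As a rough illustration of generating jittered cases, here is a minimal
NumPy sketch. The function name jitter_cases and its parameters (copies,
s, rng) are made up for this example, and normally distributed jitter with
standard deviation s is assumed.

```python
import numpy as np

def jitter_cases(X, y, copies, s, rng):
    """Return `copies` noisy replicas of each training case.

    X: (n, p) inputs, y: (n,) targets, s: jitter standard deviation.
    The target of each replica is the unchanged original target.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_rep = np.repeat(X, copies, axis=0)   # each row repeated `copies` times
    y_rep = np.repeat(y, copies, axis=0)   # same target for every replica
    noise = rng.normal(loc=0.0, scale=s, size=X_rep.shape)
    return X_rep + noise, y_rep

rng = np.random.default_rng(0)
X = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
y = np.array([0.0, 1.0, 2.0])
Xj, yj = jitter_cases(X, y, copies=4, s=0.1, rng=rng)
```

In practice one would usually draw fresh jitter on every pass through the
training data rather than store the enlarged data set.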

Consider any point in the input space, not necessarily one of the original
training cases. That point could possibly arise as a jittered input as a
result of jittering any of several of the original neighboring training
cases. The average target value at the given input point will be a weighted
average of the target values of the original training cases. For an infinite
number of jittered cases, the weights will be proportional to the
probability densities of the jitter distribution, located at the original
training cases and evaluated at the given input point. Thus the average
target values given an infinite number of jittered cases will, by
definition, be the Nadaraya-Watson kernel regression estimator using the
jitter density as the kernel. Hence, training with jitter is an
approximation to training with the kernel regression estimator as target.
Choosing the amount (variance) of jitter is equivalent to choosing the
bandwidth of the kernel regression estimator (Scott 1992). 
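The weighted average described above can be sketched as follows, assuming
an isotropic normal jitter density with standard deviation s. The function
name nadaraya_watson and the toy data are illustrative only.

```python
import numpy as np

def nadaraya_watson(x0, X, y, s):
    """Kernel regression estimate at x0 with an isotropic Gaussian kernel.

    X: (n, p) training inputs, y: (n,) targets, s: kernel bandwidth
    (the jitter standard deviation in the interpretation above).
    """
    sq_dist = np.sum((X - x0) ** 2, axis=1)
    # Subtract the minimum before exponentiating for numerical stability;
    # the common factor cancels in the ratio.
    w = np.exp(-(sq_dist - sq_dist.min()) / (2.0 * s ** 2))
    return np.sum(w * y) / np.sum(w)

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 4.0])
small = nadaraya_watson(np.array([1.0]), X, y, s=0.05)   # dominated by the case at x=1
large = nadaraya_watson(np.array([1.0]), X, y, s=100.0)  # close to the mean of y
```

With a very small bandwidth the estimate reproduces the nearest target;
with a very large one it flattens toward the overall mean of the targets,
which mirrors the "too little" and "too much" jitter cases.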

When studying nonlinear models such as feedforward NNs, it is often helpful
first to consider what happens in linear models, and then to see what
difference the nonlinearity makes. So let's consider training with jitter in
a linear model. Notation: 

   x_ij is the value of the jth input (j=1, ..., p) for the
        ith training case (i=1, ..., n).
   X={x_ij} is an n by p matrix.
   y_i is the target value for the ith training case.
   Y={y_i} is a column vector.

Without jitter, the least-squares weights are B = inv(X'X)X'Y, where
"inv" indicates a matrix inverse and "'" indicates transposition. Note that
if we replicate each training case c times, or equivalently stack c copies
of the X and Y matrices on top of each other, the least-squares weights are
inv(cX'X)cX'Y = (1/c)inv(X'X)cX'Y = B, same as before. 
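A quick numerical check of the replication claim, as a sketch with
made-up random data (the sizes n, p, c and the generating weights are
arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, c = 20, 3, 5
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Least-squares weights B = inv(X'X)X'Y, computed without an explicit inverse.
B = np.linalg.solve(X.T @ X, X.T @ Y)

# Stack c copies of X and Y on top of each other and refit.
X_rep = np.tile(X, (c, 1))
Y_rep = np.tile(Y, c)
B_rep = np.linalg.solve(X_rep.T @ X_rep, X_rep.T @ Y_rep)
```

B and B_rep should agree up to rounding error.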

With jitter, x_ij is replaced by c cases x_ij+z_ijk, k=1, ...,
c, where z_ijk is produced by some random number generator, usually with
a normal distribution with mean 0 and standard deviation s, and the 
z_ijk's are all independent. In place of the n by p matrix X, this
gives us a big matrix, say Q, with cn rows and p columns. To compute the
least-squares weights, we need Q'Q. Let's consider the jth diagonal
element of Q'Q, which is 

   sum_{i,k} (x_ij+z_ijk)^2 = sum_{i,k} (x_ij^2 + z_ijk^2 + 2 x_ij z_ijk)

which is approximately, for c large (the z_ijk^2 terms average to s^2 and
the cross-product terms average to zero), 

   c(sum_i x_ij^2 + ns^2)

which is c times the sum of the corresponding diagonal element of X'X and
ns^2.
Now consider the u,vth off-diagonal element of Q'Q, which is 

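   sum_{i,k} (x_iu+z_iuk)(x_iv+z_ivk)

which is approximately, for c large (every term involving z averages to
zero because z_iuk and z_ivk are independent with mean 0), 

   c(sum_i x_iu x_iv)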

which is just c times the corresponding element of X'X. Thus, Q'Q equals
c(X'X+ns^2I), where I is an identity matrix of appropriate size.
Similar computations show that the crossproduct of Q with the target values
is cX'Y. Hence the least-squares weights with jitter of variance s^2 are
given by 

   B(ns^2) = inv(c(X'X+ns^2I))cX'Y = inv(X'X+ns^2I)X'Y

In the statistics literature, B(ns^2) is called a ridge regression
estimator with ridge value ns^2. 
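The large-c approximation can be checked by simulation. The following is
only a sketch: the data, the seed, and the values of n, p, c, and s are
arbitrary assumptions, and the agreement is approximate because c is
finite.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, c, s = 50, 3, 4000, 0.5
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Q: each training case replicated c times with independent N(0, s^2) jitter.
Q = np.repeat(X, c, axis=0) + rng.normal(scale=s, size=(n * c, p))
T = np.repeat(Y, c)                      # targets are reused unchanged
B_jitter = np.linalg.solve(Q.T @ Q, Q.T @ T)

# Ridge estimator B(ns^2) = inv(X'X + ns^2 I) X'Y.
B_ridge = np.linalg.solve(X.T @ X + n * s**2 * np.eye(p), X.T @ Y)

# Ordinary least squares, B(0), for comparison.
B_ols = np.linalg.solve(X.T @ X, X.T @ Y)
```

B_jitter should be close to B_ridge and noticeably different from B_ols;
increasing c tightens the agreement.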

If we were to add jitter to the target values Y, the cross-product X'Y
would not be affected for large c for the same reason that the off-diagonal
elements of X'X are not affected by jitter. Hence, adding jitter to the
targets will not change the optimal weights; it will just slow down training
(An 1996). 
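For the linear model this is easy to check numerically. The sketch below
uses made-up data and arbitrary values of n, p, c, and s; it adds normal
noise to the targets only and compares the fitted weights with ordinary
least squares.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, c, s = 50, 3, 2000, 1.0
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

B_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Replicate every case c times and add N(0, s^2) noise to the targets only.
X_rep = np.repeat(X, c, axis=0)
Y_noisy = np.repeat(Y, c) + rng.normal(scale=s, size=n * c)
B_noisy = np.linalg.solve(X_rep.T @ X_rep, X_rep.T @ Y_noisy)
```

B_noisy should match B_ols up to sampling error that shrinks as c grows.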

The ordinary least squares training criterion is (Y-XB)'(Y-XB).
Weight decay uses the training criterion (Y-XB)'(Y-XB)+d^2B'B,
where d is the decay rate. Weight decay can also be implemented by
inventing artificial training cases. Augment the training data with p new
training cases containing the matrix dI for the inputs and a zero vector
for the targets. To put this in a formula, let's use A;B to indicate the
matrix A stacked on top of the matrix B, so (A;B)'(C;D)=A'C+B'D.
Thus the augmented inputs are X;dI and the augmented targets are Y;0,
where 0 indicates the zero vector of the appropriate size. The squared error
for the augmented training data is: 

   ((Y;0)-(X;dI)B)'((Y;0)-(X;dI)B)
   = (Y;0)'(Y;0) - 2(Y;0)'(X;dI)B + B'(X;dI)'(X;dI)B
   = Y'Y - 2Y'XB + B'(X'X+d^2I)B
   = Y'Y - 2Y'XB + B'X'XB + B'(d^2I)B
   = (Y-XB)'(Y-XB)+d^2B'B

which is the weight-decay training criterion. Thus the weight-decay
estimator is: 

    inv[(X;dI)'(X;dI)](X;dI)'(Y;0) = inv(X'X+d^2I)X'Y

which is the same as the jitter estimator B(d^2), i.e. jitter with
variance d^2/n. The equivalence between the weight-decay estimator and
the jitter estimator does not hold for nonlinear models unless the jitter
variance is small relative to the curvature of the nonlinear function (An
1996). However, the equivalence of the two estimators for linear models
suggests that they will often produce similar results even for nonlinear
models. Details for nonlinear models, including classification problems, are
given in An (1996). 
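The augmented-data construction for the linear case can be verified
directly. This is a minimal sketch with arbitrary random data and an
arbitrary value of d:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, d = 30, 4, 1.5
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Augmented data: inputs X;dI and targets Y;0.
X_aug = np.vstack([X, d * np.eye(p)])
Y_aug = np.concatenate([Y, np.zeros(p)])

# Ordinary least squares on the augmented data.
B_aug, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)

# Closed-form weight-decay estimator inv(X'X + d^2 I) X'Y.
B_decay = np.linalg.solve(X.T @ X + d**2 * np.eye(p), X.T @ Y)
```

The two weight vectors should agree up to rounding error.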

B(0) is obviously the ordinary least-squares estimator. It can be shown
that as s^2 increases, the Euclidean norm of B(ns^2) decreases; in
other words, adding jitter causes the weights to shrink. It can also be
shown that under the usual statistical assumptions, there always exists some
value of ns^2 > 0 such that B(ns^2) provides better expected
generalization than B(0). Unfortunately, there is no way to calculate a
value of ns^2 from the training data that is guaranteed to improve
generalization. There are other types of shrinkage estimators called Stein
estimators that do guarantee better generalization than B(0), but I'm not
aware of a nonlinear generalization of Stein estimators applicable to neural
networks.
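The shrinkage of the weights is easy to see numerically. In this sketch
the helper name ridge, the data, and the list of ridge values are all
assumptions made for the example; it illustrates only the shrinking norm,
not the generalization claim.

```python
import numpy as np

def ridge(X, Y, k):
    """B(k) = inv(X'X + kI) X'Y for ridge value k >= 0."""
    return np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ Y)

rng = np.random.default_rng(5)
n, p = 40, 5
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.normal(size=n)

ridge_values = [0.0, 1.0, 10.0, 100.0, 1000.0]   # values of ns^2
norms = [np.linalg.norm(ridge(X, Y, k)) for k in ridge_values]
```

The entries of norms should decrease as the ridge value increases.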

The statistics literature describes numerous methods for choosing the ridge
value. The most obvious way is to estimate the generalization error by
cross-validation, generalized cross-validation, or bootstrapping, and to
choose the ridge value that yields the smallest such estimate. There are
also quicker methods based on empirical Bayes estimation, one of which
yields the following formula, useful as a first guess: 

   s_1^2 = p(Y-XB(0))'(Y-XB(0)) / [n(n-p)B(0)'B(0)]

You can iterate this a few times, recomputing the weights with the current
jitter variance (that is, with ridge value ns_l^2) at each step: 

   s_{l+1}^2 = p(Y-XB(0))'(Y-XB(0)) / [n(n-p)B(ns_l^2)'B(ns_l^2)]

Note that the more training cases you have, the less noise you need. 
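A sketch of the first guess and its iteration for a linear model follows.
The function names ridge and jitter_variance_guess are invented for this
example. The code assumes n > p, and it reads the weights in the
denominator of the iteration as the ridge weights at ridge value n times
the current jitter variance, which is an interpretation of the notation
rather than something spelled out above.

```python
import numpy as np

def ridge(X, Y, k):
    """B(k) = inv(X'X + kI) X'Y for ridge value k >= 0."""
    return np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ Y)

def jitter_variance_guess(X, Y, iterations=0):
    """Empirical-Bayes-style guess at the jitter variance s^2.

    iterations=0 returns the first guess; each further iteration
    recomputes the weights at ridge value n * s^2 and updates s^2.
    """
    n, p = X.shape
    B0 = ridge(X, Y, 0.0)
    resid = Y - X @ B0
    sse = resid @ resid                          # (Y-XB(0))'(Y-XB(0))
    s2 = p * sse / (n * (n - p) * (B0 @ B0))     # first guess
    for _ in range(iterations):
        B = ridge(X, Y, n * s2)                  # assumed reading of B(.)
        s2 = p * sse / (n * (n - p) * (B @ B))
    return s2

rng = np.random.default_rng(6)
n, p = 40, 5
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.normal(size=n)

s2_first = jitter_variance_guess(X, Y)
s2_iter = jitter_variance_guess(X, Y, iterations=3)
```

Because the ridge weights shrink as the ridge value grows, each iteration
can only increase the estimate, so it is worth checking the result against
a cross-validation estimate before relying on it.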


References: 

   An, G. (1996), "The effects of adding noise during backpropagation
   training on a generalization performance," Neural Computation, 8,

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Holmström, L. and Koistinen, P. (1992), "Using additive noise in
   back-propagation training," IEEE Transactions on Neural Networks, 3,

   Koistinen, P. and Holmström, L. (1992), "Kernel regression and
   backpropagation training with noise," NIPS4, 1033-1039. 

   Reed, R.D., and Marks, R.J., II (1999), Neural Smithing: Supervised
   Learning in Feedforward Artificial Neural Networks, Cambridge, MA: The
   MIT Press, ISBN 0-262-18190-8. 

   Scott, D.W. (1992), Multivariate Density Estimation, Wiley. 

   Vinod, H.D. and Ullah, A. (1981), Recent Advances in Regression Methods,
   NY: Marcel Dekker. 


Send corrections/additions to the FAQ Maintainer: (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM