comp.ai.neural-nets FAQ, Part 3 of 7: Generalization
Section - How does noise affect generalization?


"Statistical noise" means variation in the target values that is
unpredictable from the inputs of a specific network, regardless of the
architecture or weights. "Physical noise" refers to variation in the target
values that is inherently unpredictable regardless of what inputs are used.
Noise in the inputs usually refers to measurement error, so that if the same
object or example is presented to the network more than once, the input
values differ. 

Noise in the actual data is never a good thing, since it limits the accuracy
of generalization that can be achieved no matter how extensive the training
set is. On the other hand, injecting artificial noise (jitter) into the
inputs during training is one of several ways to improve generalization for
smooth functions when you have a small training set. 
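
As a minimal sketch of what injecting jitter means in practice, the
following draws fresh noise for the inputs on every pass through the
training set while leaving the targets untouched. The function name
jittered_batch, the noise standard deviation 0.1, and the toy data are
illustrative assumptions, not part of any particular package; choosing
the amount of jitter is discussed under "What is jitter?".

```python
import numpy as np

def jittered_batch(inputs, targets, noise_sd, rng):
    """Return a copy of the inputs with fresh Gaussian jitter added.

    The targets are passed through unchanged; only the inputs are perturbed.
    """
    noise = rng.normal(0.0, noise_sd, size=inputs.shape)
    return inputs + noise, targets

rng = np.random.default_rng(0)
inputs = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])   # toy training inputs
targets = np.array([0.0, 1.0, 0.0])                       # toy training targets

for epoch in range(3):
    # New noise is drawn every epoch, so the net never sees the same
    # perturbed inputs twice.
    noisy_inputs, batch_targets = jittered_batch(inputs, targets, 0.1, rng)
    # ... one training step on (noisy_inputs, batch_targets) would go here ...
```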

Certain assumptions about noise are necessary for theoretical results.
Usually, the noise distribution is assumed to have zero mean and finite
variance. The noise in different cases is usually assumed to be independent
or to follow some known stochastic model, such as an autoregressive process.
The more you know about the noise distribution, the more effectively you can
train the network (e.g., McCullagh and Nelder 1989). 

If you have noise in the target values, what the network learns depends
mainly on the error function. For example, if the noise is independent with
finite variance for all training cases, a network that is well-trained using
least squares will produce outputs that approximate the conditional mean of
the target values (White, 1990; Bishop, 1995). Note that for a binary 0/1
variable, the mean is equal to the probability of getting a 1. Hence, the
results in White (1990) immediately imply that for a categorical target with
independent noise using 1-of-C coding (see "How should categories be
encoded?"), a network that is well-trained using least squares will produce
outputs that approximate the posterior probabilities of each class (see
Rojas, 1996, if you want a simple explanation of why least-squares estimates
probabilities). Posterior probabilities can also be learned using
cross-entropy and various other error functions (Finke and Müller, 1994;
Bishop, 1995). The conditional median can be learned by least-absolute-value
training (White, 1992a). Conditional modes can be approximated by yet other
error functions (e.g., Rohwer and van der Rest 1996). For noise
distributions that cannot be adequately approximated by a single location
estimate (such as the mean, median, or mode), a network can be trained to
approximate quantiles (White, 1992a) or mixture components (Bishop, 1995;
Husmeier, 1999). 
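
The link between the error function and what is learned can be checked
numerically in the simplest possible setting: a "network" that outputs a
single constant for one fixed input. The sketch below is only an
illustration under that assumption (the helper best_constant, the grid of
candidate outputs, and the exponential noise are all made up for the
example). It searches for the constant that minimizes each error function
and compares it with the sample mean, the sample median, and, for 0/1
targets, the fraction of 1s.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_constant(targets, loss):
    """Return the constant output (from a grid) with the smallest average loss."""
    candidates = np.linspace(0.0, 3.0, 301)            # grid step 0.01
    residuals = targets[:, None] - candidates[None, :]
    return candidates[loss(residuals).mean(axis=0).argmin()]

# Skewed noise, so the conditional mean and median are clearly different.
skewed = rng.exponential(scale=1.0, size=5000)
ls_output = best_constant(skewed, np.square)    # least squares
lav_output = best_constant(skewed, np.abs)      # least absolute value

# Binary 0/1 targets: the mean is the fraction of 1s in the sample.
binary = (rng.random(5000) < 0.3).astype(float)
ls_binary = best_constant(binary, np.square)

print("least squares:", ls_output, " sample mean:", skewed.mean())
print("least abs val:", lav_output, " sample median:", np.median(skewed))
print("least squares on 0/1:", ls_binary, " fraction of 1s:", binary.mean())
```

Up to the resolution of the grid, the least-squares constant matches the
sample mean and the least-absolute-value constant matches the sample
median. A real network does the same thing separately for each region of
the input space, which is why the results are stated in terms of
conditional means and medians.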

If you have noise in the target values, the mean squared generalization
error can never be less than the variance of the noise, no matter how much
training data you have. But you can estimate the mean of the target values,
conditional on a given set of input values, to any desired degree of
accuracy by obtaining a sufficiently large and representative training set,
assuming that the function you are trying to learn is one that can indeed be
learned by the type of net you are using, and assuming that the complexity
of the network is regulated appropriately (White 1990). 
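
The noise floor can be illustrated with a small simulation. This is a
sketch under assumed conditions, not a general proof: the true function
is taken to be sin(x), the noise is Gaussian with standard deviation 0.5
(variance 0.25), and a degree-5 polynomial fitted by least squares stands
in for the network.

```python
import numpy as np

rng = np.random.default_rng(1)
noise_sd = 0.5                        # noise variance is noise_sd**2 = 0.25

def f(x):
    return np.sin(x)                  # hypothetical true function

def sample(n):
    """Draw n cases with uniformly distributed inputs and noisy targets."""
    x = rng.uniform(-3.0, 3.0, size=n)
    return x, f(x) + rng.normal(0.0, noise_sd, size=n)

x_test, y_test = sample(100_000)

# Even the true function cannot beat the noise variance on noisy targets.
mse_of_true_f = np.mean((y_test - f(x_test)) ** 2)

# Fitted models approach that floor as the training set grows.
test_mse = {}
for n_train in (20, 200, 2000):
    x_tr, y_tr = sample(n_train)
    coefs = np.polyfit(x_tr, y_tr, deg=5)
    test_mse[n_train] = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)

print("true function:", mse_of_true_f)
print("fitted models:", test_mse)
```

The test error of the fitted models shrinks toward the noise variance as
the training set grows but does not fall below it, while the fitted curve
itself gets closer and closer to the conditional mean sin(x).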

Noise in the target values increases the danger of overfitting (Moody 1992).

Noise in the inputs limits the accuracy of generalization, but in a more
complicated way than does noise in the targets. In a region of the input
space where the function being learned is fairly flat, input noise will have
little effect. In regions where that function is steep, input noise can
degrade generalization severely. 
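
To first order, input noise with standard deviation s changes the output
by roughly |f'(x)| * s, which is why the slope matters. The sketch below
checks this for an assumed target function tanh(4x), which is steep near
x = 0 and nearly flat near x = 2.5; the function, the two input points,
and the noise level 0.1 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return np.tanh(4.0 * x)           # steep near 0, flat for large |x|

def rms_output_error(x0, noise_sd=0.1, n=10_000):
    """RMS change in f when the input x0 is observed with Gaussian noise."""
    d = rng.normal(0.0, noise_sd, size=n)
    return np.sqrt(np.mean((f(x0 + d) - f(x0)) ** 2))

rms_steep = rms_output_error(0.0)     # slope of f at 0 is 4
rms_flat = rms_output_error(2.5)      # slope of f at 2.5 is nearly 0

print("steep region:", rms_steep)
print("flat region: ", rms_flat)
```

The same amount of input noise produces a large output error in the steep
region and a negligible one in the flat region.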

Furthermore, if the target function is Y=f(X), but you observe noisy inputs
X+D, you cannot obtain an arbitrarily accurate estimate of f(X) given X+D no
matter how large a training set you use. The net will not learn f(X), but
will instead learn a convolution of f(X) with the distribution of the noise
D (see "What is jitter?").
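
The convolution can be seen in a simulation. What a least-squares net
approximates is the conditional mean of f(X) given the observed X+D; the
sketch below assumes that X is spread roughly uniformly over a range much
wider than the noise, in which case that conditional mean coincides with
the convolution of f with the noise density. The choices f = sin,
Gaussian noise with standard deviation 0.5, and the evaluation point
z0 = pi/2 are illustrative assumptions, and a local average over a narrow
window stands in for a well-trained net.

```python
import numpy as np

rng = np.random.default_rng(3)
noise_sd = 0.5
z0 = np.pi / 2                         # observed (noisy) input of interest

def f(x):
    return np.sin(x)                   # hypothetical target function

# True inputs X, input noise D, and the noisy inputs Z = X + D we observe.
x = rng.uniform(-10.0, 10.0, size=400_000)
z = x + rng.normal(0.0, noise_sd, size=x.size)
y = f(x)                               # targets depend on the true inputs

# Stand-in for a well-trained least-squares net: the average target among
# cases whose observed input falls in a narrow window around z0.
window = np.abs(z - z0) < 0.05
learned = y[window].mean()

# Convolution of f with the Gaussian noise density, evaluated at z0.
d = np.linspace(-6 * noise_sd, 6 * noise_sd, 4001)
pdf = np.exp(-0.5 * (d / noise_sd) ** 2) / (noise_sd * np.sqrt(2 * np.pi))
convolved = np.sum(f(z0 - d) * pdf) * (d[1] - d[0])

print("f(z0):         ", f(z0))
print("learned value: ", learned)
print("convolution:   ", convolved)
```

The learned value agrees with the convolution rather than with f(z0), and
collecting more training cases would not close that gap. For this
particular f the convolution has the closed form
exp(-noise_sd**2 / 2) * sin(z0), so the peak of the sine is flattened.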

For more details, see one of the statistically oriented references on neural
nets, such as: 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press, especially section 6.4. 

   Finke, M., and Müller, K.-R. (1994), "Estimating a-posteriori
   probabilities using stochastic network models," in Mozer, Smolensky,
   Touretzky, Elman, & Weigend, eds., Proceedings of the 1993 Connectionist
   Models Summer School, Hillsdale, NJ: Lawrence Erlbaum Associates, pp.
   324-331. 

   Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
   the Bias/Variance Dilemma", Neural Computation, 4, 1-58. 

   Husmeier, D. (1999), Neural Networks for Conditional Probability
   Estimation: Forecasting Beyond Point Predictions, Berlin: Springer
   Verlag, ISBN 185233095. 

   McCullagh, P. and Nelder, J.A. (1989), Generalized Linear Models, 2nd
   ed., London: Chapman & Hall. 

   Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of
   Generalization and Regularization in Nonlinear Learning Systems", in
   Moody, J.E., Hanson, S.J., and Lippmann, R.P., Advances in Neural
   Information Processing Systems 4, 847-854. 

   Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 

   Rohwer, R., and van der Rest, J.C. (1996), "Minimum description length,
   regularization, and multimodal data," Neural Computation, 8, 595-609. 

   Rojas, R. (1996), "A short proof of the posterior probability property of
   classifier neural networks," Neural Computation, 8, 41-43. 

   White, H. (1990), "Connectionist Nonparametric Regression: Multilayer
   Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks, 3,
   535-550. Reprinted in White (1992b). 

   White, H. (1992a), "Nonparametric Estimation of Conditional Quantiles
   Using Neural Networks," in Page, C. and Le Page, R. (eds.), Proceedings
   of the 23rd Symposium on the Interface: Computing Science and Statistics,
   Alexandria, VA: American Statistical Association, pp. 190-199. Reprinted
   in White (1992b). 

   White, H. (1992b), Artificial Neural Networks: Approximation and
   Learning Theory, Blackwell. 


Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)





Last Update March 27 2014 @ 02:11 PM