FAQ, Part 2 of 7: Learning
Section - Should I nonlinearly transform the data?


Top Document: FAQ, Part 2 of 7: Learning
Previous Document: Should I normalize/standardize/rescale the
Next Document: How to measure importance of inputs?

Most importantly, nonlinear transformations of the targets matter when the
data are noisy, because of their effect on the error function. Many commonly
used error functions depend only on the absolute difference
abs(target-output). Nonlinear transformations (unlike linear
transformations) change the relative sizes of these differences. With most
error functions, the net will expend more effort, so to speak, trying to
learn target values for which abs(target-output) is large. 

For example, suppose you are trying to predict the price of a stock. If the
price of the stock is 10 (in whatever currency unit) and the output of the
net is 5 or 15, yielding a difference of 5, that is a huge error. If the
price of the stock is 1000 and the output of the net is 995 or 1005,
yielding the same difference of 5, that is a tiny error. You don't want the
net to treat those two differences as equally important. By taking
logarithms, you are effectively measuring errors in terms of ratios rather
than differences, since a difference between two logs corresponds to the
ratio of the original values. This has approximately the same effect as
looking at percentage differences, abs(target-output)/target or
abs(target-output)/output, rather than simple differences. 
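The stock-price example can be checked numerically. The following is only a
minimal sketch using the numbers from the text; the function names are
made up for illustration and do not come from any particular NN package.

```python
import math

def abs_error(target, output):
    """Plain difference, as used by many error functions."""
    return abs(target - output)

def log_error(target, output):
    """Difference after a log transform of both values (a log ratio)."""
    return abs(math.log(target) - math.log(output))

def relative_error(target, output):
    """Percentage-style difference, abs(target-output)/target."""
    return abs(target - output) / target

# The two cases from the text: both are off by 5 on the original scale.
cases = [(10.0, 15.0), (1000.0, 1005.0)]
for target, output in cases:
    print(target, output,
          abs_error(target, output),
          round(log_error(target, output), 5),
          round(relative_error(target, output), 5))
```

On the original scale both cases have the same error, while on the log
scale the error for the cheap stock is far larger. For small errors the log
difference and the relative difference nearly coincide.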

Less importantly, smooth functions are usually easier to learn than rough
functions. Generalization is also usually better for smooth functions. So
nonlinear transformations (of either inputs or targets) that make the
input-output function smoother are usually beneficial. For classification
problems, you want the class boundaries to be smooth. When there are only a
few inputs, it is often possible to transform the data to a linear
relationship, in which case you can use a linear model instead of a more
complex neural net, and many things (such as estimating generalization error
and error bars) will become much simpler. A variety of NN architectures (RBF
networks, B-spline networks, etc.) amount to using many nonlinear
transformations, possibly involving multiple variables simultaneously, to
try to make the input-output function approximately linear (Ripley 1996,
chapter 4). There are particular applications, such as signal and image
processing, in which very elaborate transformations are useful (Masters
1994). 
It is usually advisable to choose an error function appropriate for the
distribution of noise in your target variables (McCullagh and Nelder 1989).
But if your software does not provide a sufficient variety of error
functions, then you may need to transform the target so that the noise
distribution conforms to whatever error function you are using. For example,
if you have to use least-(mean-)squares training, you will get the best
results if the noise distribution is approximately Gaussian with constant
variance, since least-(mean-)squares is maximum likelihood in that case.
Heavy-tailed distributions (those in which extreme values occur more often
than in a Gaussian distribution, often as indicated by high kurtosis) are
especially of concern, due to the loss of statistical efficiency of
least-(mean-)squares estimates (Huber 1981). Note that what is important is
the distribution of the noise, not the distribution of the target values. 
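To see why heavy tails are a concern for least squares, consider the
simplest possible model, a single constant. The least-squares estimate of a
constant is the sample mean, whereas the least-absolute-error estimate is
the median. The sketch below uses made-up numbers and is meant only as an
illustration, not as a recommendation of a particular robust method.

```python
import statistics

# Hypothetical targets scattered around 10, then one extreme value added.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
contaminated = clean + [60.0]

# How far each estimate moves when the extreme value is included.
mean_shift = abs(statistics.mean(contaminated) - statistics.mean(clean))
median_shift = abs(statistics.median(contaminated) - statistics.median(clean))

print("shift in least-squares estimate (mean):", mean_shift)
print("shift in least-absolute estimate (median):", median_shift)
```

A single extreme value moves the least-squares estimate a long way while
barely changing the median.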

The distribution of inputs may suggest transformations, but this is by far
the least important consideration among those listed here. If an input is
strongly skewed, a logarithmic, square root, or other power (between -1 and
1) transformation may be worth trying. If an input has high kurtosis but low
skewness, an arctan transform can reduce the influence of extreme values: 

             input - mean
   arctan( c ------------ )
             stand. dev.

where c is a constant that controls how far the extreme values are brought
in towards the mean. Arctan usually works better than tanh, which squashes
the extreme values too much. Using robust estimates of location and scale
(Iglewicz 1983) instead of the mean and standard deviation will work even
better for pathological distributions. 
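The arctan transform above might be sketched as follows. The function name
arctan_squash, the use of the sample standard deviation, and the choice of
the median and the scaled median absolute deviation (MAD) as robust
estimates are assumptions made for this example; the source does not
prescribe particular robust estimators.

```python
import math
import statistics

def arctan_squash(values, c=1.0, robust=False):
    """Return arctan(c * (value - location) / scale) for each value.

    robust=False uses the mean and sample standard deviation.
    robust=True uses the median and 1.4826 * MAD (one common choice).
    """
    if robust:
        location = statistics.median(values)
        deviations = [abs(v - location) for v in values]
        scale = 1.4826 * statistics.median(deviations)
    else:
        location = statistics.mean(values)
        scale = statistics.stdev(values)
    if scale == 0:
        raise ValueError("scale estimate is zero; cannot standardize")
    return [math.atan(c * (v - location) / scale) for v in values]

# Hypothetical input with one extreme value.
values = [-2.0, -1.0, 0.0, 1.0, 2.0, 50.0]
plain = arctan_squash(values)
robust = arctan_squash(values, robust=True)
print([round(v, 3) for v in plain])
print([round(v, 3) for v in robust])

# Arctan keeps large standardized values distinguishable; tanh does not.
atan_gap = math.atan(10.0) - math.atan(5.0)
tanh_gap = math.tanh(10.0) - math.tanh(5.0)
print(atan_gap, tanh_gap)
```

Larger values of c pull the extreme values in more strongly. The results
always lie strictly between -pi/2 and pi/2, and the ordering of the inputs
is preserved.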


References:

   Atkinson, A.C. (1985), Plots, Transformations and Regression, Oxford:
   Clarendon Press. 

   Carroll, R.J. and Ruppert, D. (1988), Transformation and Weighting in
   Regression, London: Chapman and Hall. 

   Huber, P.J. (1981), Robust Statistics, NY: Wiley. 

   Iglewicz, B. (1983), "Robust scale estimators and confidence intervals
   for location", in Hoaglin, D.C., Mosteller, M. and Tukey, J.W., eds., 
   Understanding Robust and Exploratory Data Analysis, NY: Wiley. 

   McCullagh, P. and Nelder, J.A. (1989), Generalized Linear Models, 2nd
   ed., London: Chapman and Hall. 

   Masters, T. (1994), Signal and Image Processing with Neural Networks: A
   C++ Sourcebook, NY: Wiley.

   Ripley, B.D. (1996), Pattern Recognition and Neural Networks,
   Cambridge: Cambridge University Press. 


Send corrections/additions to the FAQ Maintainer: (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM