FAQ, Part 2 of 7: Learning
Section - How should categories be encoded?


Top Document: FAQ, Part 2 of 7: Learning
Previous Document: How does ill-conditioning affect NN training?
Next Document: Why not code binary inputs as 0 and 1?

First, consider unordered categories. If you want to classify cases into one
of C categories (i.e. you have a categorical target variable), use 1-of-C
coding. That means that you code C binary (0/1) target variables
corresponding to the C categories. Statisticians call these "dummy"
variables. Each dummy variable is given the value zero except for the one
corresponding to the correct category, which is given the value one. Then
use a softmax output activation function (see "What is a softmax activation
function?") so that the net, if properly trained, will produce valid
posterior probability estimates (McCullagh and Nelder, 1989; Finke and
Müller, 1994). If the categories are Red, Green, and Blue, then the data
would look like this: 

   Category  Dummy variables
   --------  ---------------
    Red        1   0   0
    Green      0   1   0
    Blue       0   0   1
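
The table above can be generated mechanically. The following is a minimal
sketch, not code from any particular package: the names one_of_c and softmax
and the net inputs passed to softmax are made up for illustration.

```python
import math

CATEGORIES = ["Red", "Green", "Blue"]  # illustrative category list


def one_of_c(category, categories=CATEGORIES):
    """Return the 1-of-C dummy vector for `category`."""
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    return [1 if c == category else 0 for c in categories]


def softmax(net_inputs):
    """Softmax over the output units; subtracting the max avoids overflow."""
    m = max(net_inputs)
    exps = [math.exp(z - m) for z in net_inputs]
    total = sum(exps)
    return [e / total for e in exps]


# Target vectors for the three categories, in table order.
targets = [one_of_c(c) for c in CATEGORIES]

# Made-up net inputs to the output layer; the outputs are positive and sum to one.
probs = softmax([2.0, 0.5, -1.0])
```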

When there are only two categories, it is simpler to use just one dummy
variable with a logistic output activation function; this is equivalent to
using softmax with two dummy variables. 
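
The equivalence can be checked numerically. In this sketch (function names
and net inputs are illustrative assumptions), the first softmax output for
two units equals a logistic function of the difference of the two net inputs.

```python
import math


def logistic(z):
    """Plain logistic function; adequate for the moderate values used here."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax2(z1, z2):
    """Softmax over exactly two output units."""
    m = max(z1, z2)
    e1, e2 = math.exp(z1 - m), math.exp(z2 - m)
    return e1 / (e1 + e2), e2 / (e1 + e2)


z1, z2 = 1.7, -0.4              # made-up net inputs to the two output units
p1, p2 = softmax2(z1, z2)
single_output = logistic(z1 - z2)

# The two formulations agree up to rounding error.
agree = abs(p1 - single_output) < 1e-12
```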

The common practice of using target values of 0.1 and 0.9 instead of 0 and 1
prevents the outputs of the network from being directly interpretable as
posterior probabilities, although it is easy to rescale the outputs to
produce probabilities (Hampshire and Pearlmutter, 1991, figure 3). This
practice has also been advocated on the grounds that infinite weights are
required to obtain outputs of 0 or 1 from a logistic function, but in fact,
weights of about 10 to 30 will produce outputs close enough to 0 and 1 for
all practical purposes, assuming standardized inputs. Large weights will not
cause overflow if the activation functions are coded properly; see "How to
avoid overflow in the logistic function?" 
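
As a rough illustration of both points, here is a sketch under stated
assumptions. The branch on the sign of z is one common way to keep exp() out
of the overflow range and is not necessarily the method given in the FAQ
answer cited above. The rescaling function assumes the simple linear map from
the interval [0.1, 0.9] back onto [0, 1]; the cited figure may present the
rescaling differently.

```python
import math


def logistic(z):
    """Logistic function arranged so exp() never sees a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z < 0, so this underflows to 0.0 at worst
    return e / (1.0 + e)


def rescale_to_probability(output, low=0.1, high=0.9):
    """Assumed linear rescaling of an output trained on low/high targets."""
    p = (output - low) / (high - low)
    return min(1.0, max(0.0, p))  # clip small excursions outside [0, 1]


# Net inputs of 10 and 30 already give outputs very close to 1.
gap_10 = 1.0 - logistic(10.0)
gap_30 = 1.0 - logistic(30.0)

# Extreme net inputs do not raise OverflowError.
extremes = (logistic(-1000.0), logistic(1000.0))
```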

Another common practice is to use a logistic activation function for each
output. Thus, the outputs are not constrained to sum to one, so they are not
admissible posterior probability estimates. The usual justification advanced
for this procedure is that if a test case is not similar to any of the
training cases, all of the outputs will be small, indicating that the case
cannot be classified reliably. This claim is incorrect, since a test case
that is not similar to any of the training cases will require the net to
extrapolate, and extrapolation is thoroughly unreliable; such a test case
may produce all small outputs, all large outputs, or any combination of
large and small outputs. If you want a classification method that detects
novel cases for which the classification may not be reliable, you need a
method based on probability density estimation. For example, see "What is

It is very important not to use a single variable for an unordered
categorical target. Suppose you used a single variable with values 1, 2, and
3 for red, green, and blue, and the training data with two inputs looked
like this: 

      |    1    1
      |   1   1
      |       1   1
      |     1   1
      |      X
      |    3   3           2   2
      |     3     3      2
      |  3   3            2    2
      |     3   3       2    2

Consider a test point located at the X. The correct output would be that X
has about a 50-50 chance of being a 1 or a 3. But if you train with a single
target variable with values of 1, 2, and 3, the output for X will be the
average of 1 and 3, so the net will say that X is definitely a 2! 
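
The arithmetic behind this can be sketched directly. Least-squares training
drives an output toward the mean of the targets for similar cases, so the
code below simply averages the targets of some hypothetical training cases
near X (the labels are made up for illustration).

```python
nearby_labels = [1, 3, 1, 3]  # hypothetical labels of training cases near X

# Single target variable: the mean of the labels is a category nobody observed.
single_target = sum(nearby_labels) / len(nearby_labels)


def one_of_c(label, n_categories=3):
    """1-of-C dummy vector for a label in 1..n_categories."""
    return [1 if label == k else 0 for k in range(1, n_categories + 1)]


# 1-of-C targets: the column means are the category proportions near X.
dummies = [one_of_c(label) for label in nearby_labels]
mean_dummies = [sum(column) / len(dummies) for column in zip(*dummies)]

print(single_target)  # 2.0
print(mean_dummies)   # [0.5, 0.0, 0.5]
```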

If you are willing to forgo the simple posterior-probability interpretation
of outputs, you can try more elaborate coding schemes, such as the
error-correcting output codes suggested by Dietterich and Bakiri (1995). 

For an input with categorical values, you can use 1-of-(C-1) coding if the
network has a bias unit. This is just like 1-of-C coding, except that you
omit one of the dummy variables (it does not much matter which one). Using all C
of the dummy variables creates a linear dependency on the bias unit, which
is not advisable unless you are using weight decay or Bayesian learning or
some such thing that requires all C weights to be treated on an equal basis.
1-of-(C-1) coding looks like this: 

   Category  Dummy variables
   --------  ---------------
    Red        1   0
    Green      0   1
    Blue       0   0
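
A minimal sketch of this coding follows. The function name and the choice of
the last category as the omitted one are assumptions for illustration. The
last lines show the linear dependency mentioned above: with all C dummy
variables, each row sums to 1, which is exactly the constant input supplied
by the bias unit.

```python
CATEGORIES = ["Red", "Green", "Blue"]  # illustrative category list


def one_of_c_minus_1(category, categories=CATEGORIES, omitted=None):
    """1-of-(C-1) dummy vector; `omitted` defaults to the last category."""
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    if omitted is None:
        omitted = categories[-1]
    kept = [c for c in categories if c != omitted]
    return [1 if c == category else 0 for c in kept]


rows = [one_of_c_minus_1(c) for c in CATEGORIES]

# Full 1-of-C rows always sum to 1, duplicating the bias unit's constant input.
full_rows = [[1 if c == cat else 0 for c in CATEGORIES] for cat in CATEGORIES]
row_sums = [sum(row) for row in full_rows]
```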

If you use 1-of-C or 1-of-(C-1) coding, it is important to standardize the
dummy inputs; see "Should I standardize the input variables?" and "Why not
code binary inputs as 0 and 1?" for details. 
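
One plausible reading of "standardize" here is to subtract the mean and
divide by the standard deviation of each dummy column; the cited FAQ answers
discuss the alternatives. A small sketch, with a made-up data column:

```python
import statistics


def standardize(column):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    if sd == 0:
        raise ValueError("a constant column cannot be standardized")
    return [(x - mean) / sd for x in column]


red_dummy = [1, 0, 0, 1, 0, 0, 0, 1]  # made-up dummy input column
red_standardized = standardize(red_dummy)
```

After this step the column has mean 0 and standard deviation 1, and it still
takes only two distinct values.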

Another possible coding is called "effects" coding or "deviations from
means" coding in statistics. It is like 1-of-(C-1) coding, except that when
a case belongs to the category for the omitted dummy variable, all of the
dummy variables are set to -1, like this: 

   Category  Dummy variables
   --------  ---------------
    Red        1   0
    Green      0   1
    Blue      -1  -1

As long as a bias unit is used, any network with effects coding can be
transformed into an equivalent network with 1-of-(C-1) coding by a linear
transformation of the weights, so if you train to a global optimum, there
will be no difference in the outputs for these two types of coding. One
advantage of effects coding is that the dummy variables require no
standardizing, since effects coding directly produces values that are
approximately standardized. 
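
The weight transformation can be written out for a single linear unit fed by
the dummy variables. This is only a sketch: the function names, the choice of
the last category as the omitted one, and the numeric weights are assumptions
for illustration.

```python
def dummy_code(index, n):
    """1-of-(C-1) code with the last of n categories omitted."""
    return [1 if k == index else 0 for k in range(n - 1)]


def effects_code(index, n):
    """Effects code: the omitted (last) category is coded as all -1."""
    if index == n - 1:
        return [-1] * (n - 1)
    return dummy_code(index, n)


def linear_unit(bias, weights, inputs):
    """Net input of one unit: bias plus weighted sum of the inputs."""
    return bias + sum(w * x for w, x in zip(weights, inputs))


def effects_to_dummy_weights(bias, weights):
    """Weights giving the same net inputs under 1-of-(C-1) coding."""
    total = sum(weights)
    return bias - total, [w + total for w in weights]


n = 3
effects_bias, effects_weights = 0.5, [1.25, -0.75]  # made-up weights
dummy_bias, dummy_weights = effects_to_dummy_weights(effects_bias, effects_weights)

net_effects = [linear_unit(effects_bias, effects_weights, effects_code(i, n))
               for i in range(n)]
net_dummy = [linear_unit(dummy_bias, dummy_weights, dummy_code(i, n))
             for i in range(n)]
```

Both lists hold the same net input for every category, so everything
downstream of the unit is unchanged.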

If you are using weight decay, you want to make sure that shrinking the
weights toward zero biases ('bias' in the statistical sense) the net in a
sensible, usually smooth, way. If you use 1-of-(C-1) coding for an input,
weight decay biases the output for the C-1 categories towards the output for
the one omitted category, which is probably not what you want, although there
might be special cases where it would make sense. If you use 1-of-C coding
for an input, weight decay biases the output for all C categories roughly
towards the mean output for all the categories, which is smoother and
usually a reasonable thing to do. 
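
A toy calculation shows the difference. The setup is an assumption chosen so
that the penalized least-squares solution has a closed form: a single linear
output unit, one training case per category, squared error, and a decay
penalty of lam times the sum of squared dummy weights with the bias left
unpenalized. The target values are made up.

```python
def fit_one_of_c(y, lam):
    """Fitted outputs under 1-of-C coding with decay `lam` on the dummy weights."""
    bias = sum(y) / len(y)  # the unpenalized bias settles at the mean target
    return [bias + (yc - bias) / (1.0 + lam) for yc in y]


def fit_one_of_c_minus_1(y, lam):
    """Same model, but the last category has no dummy (its output is the bias)."""
    k = lam / (1.0 + lam)
    kept, omitted = y[:-1], y[-1]
    bias = (omitted + k * sum(kept)) / (1.0 + k * len(kept))
    return [bias + (yc - bias) / (1.0 + lam) for yc in kept] + [bias]


targets = [1.0, 2.0, 6.0]  # made-up target values for Red, Green, Blue

print(fit_one_of_c(targets, 1.0))          # [2.0, 2.5, 4.5]
print(fit_one_of_c_minus_1(targets, 1.0))  # [2.375, 2.875, 3.75]
```

With 1-of-C coding every output moves halfway toward the mean target of 3.
With 1-of-(C-1) coding the Red and Green outputs move halfway toward the Blue
output, and the categories are no longer treated symmetrically.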

Now consider ordered categories. For inputs, some people recommend a
"thermometer code" (Smith, 1996; Masters, 1993) like this: 

   Category  Dummy variables
   --------  ---------------
    Red        1   1   1
    Green      0   1   1
    Blue       0   0   1

However, thermometer coding is equivalent to 1-of-C coding, in that for any
network using 1-of-C coding, there exists a network with thermometer coding
that produces identical outputs; the weights in the thermometer-encoded
network are just the differences of successive weights in the 1-of-C-encoded
network. To get a genuinely ordinal representation, you must constrain the
weights connecting the dummy variables to the hidden units to be nonnegative
(except for the first dummy variable). Another approach that makes some use
of the order information is to use weight decay or Bayesian learning to
encourage the weights for all but the first dummy variable to be small. 
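
The equivalence with 1-of-C coding can be checked for a single linear unit.
This sketch follows the orientation of the thermometer table above (Red sets
all three dummy variables, Blue only the last); the function names and the
numeric weights are assumptions for illustration.

```python
def thermometer(index, n):
    """Thermometer code: category `index` sets dummies index..n-1 to 1."""
    return [1 if k >= index else 0 for k in range(n)]


def thermometer_weights(one_of_c_weights):
    """Differences of successive 1-of-C weights; the last weight is kept as is."""
    a = one_of_c_weights
    return [a[k] - a[k + 1] for k in range(len(a) - 1)] + [a[-1]]


def weighted_sum(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))


one_of_c_w = [0.3, -1.2, 2.0]  # made-up 1-of-C weights for Red, Green, Blue
thermo_w = thermometer_weights(one_of_c_w)

# Under 1-of-C coding the net input for category i is just one_of_c_w[i];
# the thermometer-coded unit reproduces it for every category.
net_thermo = [weighted_sum(thermo_w, thermometer(i, 3)) for i in range(3)]
```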

It is often effective to represent an ordinal input as a single variable
like this: 

   Category  Input
   --------  -----
    Red        1
    Green      2
    Blue       3

Although this representation involves only a single quantitative input,
given enough hidden units, the net is capable of computing nonlinear
transformations of that input that will produce results equivalent to any of
the dummy coding schemes. But using a single quantitative input makes it
easier for the net to use the order of the categories to generalize when
that is appropriate. 

B-splines provide a way of coding ordinal inputs into fewer than C variables
while retaining information about the order of the categories. See Brown and
Harris (1994) or Gifi (1990, 365-370). 

Target variables with ordered categories require thermometer coding. The
outputs are thus cumulative probabilities, so to obtain the posterior
probability of any category except the first, you must take the difference
between successive outputs. It is often useful to use a proportional-odds
model, which ensures that these differences are positive. For more details
on ordered categorical targets, see McCullagh and Nelder (1989, chapter 5). 
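
The differencing step, and the way a proportional-odds model keeps the
differences positive, can be sketched as follows. The parameterization used
here, P(category <= j) = logistic(cutpoint_j - eta) with strictly increasing
cutpoints, is one standard form of the model; the names and numbers are
assumptions for illustration, with eta standing for whatever the net computes
from the inputs.

```python
import math


def logistic(z):
    """Logistic function arranged to avoid overflow in exp()."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cumulative_probs(eta, cutpoints):
    """P(category <= j) for each cutpoint, followed by 1.0 for the last category."""
    if any(a >= b for a, b in zip(cutpoints, cutpoints[1:])):
        raise ValueError("cutpoints must be strictly increasing")
    return [logistic(theta - eta) for theta in cutpoints] + [1.0]


def category_probs(cumulative):
    """First output as is, then differences between successive outputs."""
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]


cumulative = cumulative_probs(eta=0.4, cutpoints=[-1.0, 1.5])  # made-up values
probs = category_probs(cumulative)  # posterior probabilities for 3 categories
```

Because every cumulative output shares the same eta and the cutpoints
increase, the cumulative outputs increase too, so each difference is positive.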


   Brown, M., and Harris, C. (1994), Neurofuzzy Adaptive Modelling and
   Control, NY: Prentice Hall. 

   Dietterich, T.G. and Bakiri, G. (1995), "Error-correcting output codes: A
   general method for improving multiclass inductive learning programs," in
   Wolpert, D.H. (ed.), The Mathematics of Generalization: The Proceedings
   of the SFI/CNLS Workshop on Formal Approaches to Supervised Learning,
   Santa Fe Institute Studies in the Sciences of Complexity, Volume XX,
   Reading, MA: Addison-Wesley, pp. 395-407. 

   Finke, M. and Müller, K.-R. (1994), "Estimating a-posteriori
   probabilities using stochastic network models," in Mozer, M., Smolensky,
   P., Touretzky, D., Elman, J., and Weigend, A. (eds.), Proceedings of the
   1993 Connectionist Models Summer School, Hillsdale, NJ: Lawrence
   Erlbaum Associates, pp. 324-331. 

   Gifi, A. (1990), Nonlinear Multivariate Analysis, NY: John Wiley & Sons,
   ISBN 0-471-92620-5. 

   Hampshire II, J.B., and Pearlmutter, B. (1991), "Equivalence proofs for
   multi-layer perceptron classifiers and the Bayesian discriminant
   function," in Touretzky, D.S., Elman, J.L., Sejnowski, T.J., and Hinton,
   G.E. (eds.), Connectionist Models: Proceedings of the 1990 Summer
   School, San Mateo, CA: Morgan Kaufmann, pp. 159-172. 

   Masters, T. (1993), Practical Neural Network Recipes in C++, San Diego:
   Academic Press. 

   McCullagh, P. and Nelder, J.A. (1989), Generalized Linear Models, 2nd
   ed., London: Chapman & Hall. 

   Smith, M. (1996), Neural Networks for Statistical Modeling, Boston:
   International Thomson Computer Press, ISBN 1-850-32842-0.


Send corrections/additions to the FAQ Maintainer: (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM