comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Section - Why not code binary inputs as 0 and 1?


The importance of standardizing input variables is discussed in detail under
"Should I standardize the input variables?" But for the benefit of those
people who don't believe in theory, here is an example using the 5-bit
parity problem. The unstandardized data are: 

   x1    x2    x3    x4    x5    target

    0     0     0     0     0       0  
    1     0     0     0     0       1  
    0     1     0     0     0       1  
    1     1     0     0     0       0  
    0     0     1     0     0       1  
    1     0     1     0     0       0  
    0     1     1     0     0       0  
    1     1     1     0     0       1  
    0     0     0     1     0       1  
    1     0     0     1     0       0  
    0     1     0     1     0       0  
    1     1     0     1     0       1  
    0     0     1     1     0       0  
    1     0     1     1     0       1  
    0     1     1     1     0       1  
    1     1     1     1     0       0  
    0     0     0     0     1       1  
    1     0     0     0     1       0  
    0     1     0     0     1       0  
    1     1     0     0     1       1  
    0     0     1     0     1       0  
    1     0     1     0     1       1  
    0     1     1     0     1       1  
    1     1     1     0     1       0  
    0     0     0     1     1       0  
    1     0     0     1     1       1  
    0     1     0     1     1       1  
    1     1     0     1     1       0  
    0     0     1     1     1       1  
    1     0     1     1     1       0  
    0     1     1     1     1       0  
    1     1     1     1     1       1  
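
The target is simply the parity (exclusive-or) of the five inputs: it is 1
when an odd number of the inputs is 1. If you want to regenerate the table,
here is a minimal sketch in Python with NumPy (the row order matches the
listing above):

   import itertools
   import numpy as np

   # Each target is 1 when an odd number of the five inputs is 1.
   # Reversing each tuple makes x1 cycle fastest, as in the table above.
   X = np.array(list(itertools.product([0, 1], repeat=5)))[:, ::-1]
   y = X.sum(axis=1) % 2

   for inputs, target in zip(X, y):
       print("  ".join(str(v) for v in inputs), "   ", target)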

The network characteristics were: 

Inputs:                       5
Hidden units:                 5
Outputs:                      1
Activation for hidden units:  tanh
Activation for output units:  logistic
Error function:               cross-entropy
Initial weights:              random normal with mean=0,
                                 st.dev.=1/sqrt(5) for input-to-hidden
                                        =1         for hidden-to-output
Training method:              batch standard backprop
Learning rate:                0.1
Momentum:                     0.9
Minimum training iterations:  100
Maximum training iterations:  10000
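
The experiment described below (100 networks trained from different random
initial weights, for each input coding) can be reproduced approximately with
the following sketch in Python with NumPy. The FAQ does not say how the
biases were initialized or what stopping rule was applied between the 100-
and 10000-iteration limits, so zero initial biases and a fixed number of
iterations are assumed; expect the run to take a few minutes.

   import itertools
   import numpy as np

   def parity_data(low, high):
       # 5-bit parity data, with the inputs coded as {low, high}.
       X = np.array(list(itertools.product([0, 1], repeat=5)), dtype=float)
       y = (X.sum(axis=1) % 2).reshape(-1, 1)
       return low + (high - low) * X, y

   def sigmoid(z):
       return 1.0 / (1.0 + np.exp(-z))

   def train_once(X, y, rng, lr=0.1, mom=0.9, iters=10000):
       # One network: 5 tanh hidden units, 1 logistic output, cross-entropy
       # error, batch standard backprop with momentum.  Biases start at
       # zero (an assumption; the FAQ specifies only the weight scales).
       n_in, n_hid = X.shape[1], 5
       W1 = rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_hid))
       W2 = rng.normal(0.0, 1.0, (n_hid, 1))
       b1 = np.zeros(n_hid); b2 = np.zeros(1)
       vW1 = np.zeros_like(W1); vW2 = np.zeros_like(W2)
       vb1 = np.zeros_like(b1); vb2 = np.zeros_like(b2)
       for _ in range(iters):
           h = np.tanh(X @ W1 + b1)                 # hidden activations
           p = sigmoid(h @ W2 + b2)                 # output probabilities
           p = np.clip(p, 1e-12, 1 - 1e-12)         # guard against log(0)
           ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
           d_out = (p - y) / len(X)                 # dCE/dlogit, averaged
           gW2 = h.T @ d_out;   gb2 = d_out.sum(0)
           d_hid = (d_out @ W2.T) * (1.0 - h ** 2)  # backprop through tanh
           gW1 = X.T @ d_hid;   gb1 = d_hid.sum(0)
           vW1 = mom * vW1 - lr * gW1;  W1 += vW1   # momentum updates
           vb1 = mom * vb1 - lr * gb1;  b1 += vb1
           vW2 = mom * vW2 - lr * gW2;  W2 += vW2
           vb2 = mom * vb2 - lr * gb2;  b2 += vb2
       return ce                                    # final average cross-entropy

   rng = np.random.default_rng(0)
   for low, high in [(0.0, 1.0), (-1.0, 1.0)]:
       X, y = parity_data(low, high)
       ces = [train_once(X, y, rng) for _ in range(100)]
       print("coding {%g,%g}: median final cross-entropy = %.3f"
             % (low, high, np.median(ces)))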

One hundred networks were trained from different random initial weights. The
following bar chart shows the distribution of the average cross-entropy
after training: 

Frequency

   |                                                              ****
   |                                                              ****
80 +                                                              ****
   |                                                              ****
   |                                                              ****
   |                                                              ****
60 +                                                              ****
   |                                                              ****
   |                                                              ****
   |                                                              ****
40 +                                                              ****
   |                                                              ****
   |                                                              ****
   |                                                              ****
20 +                                                              ****
   |                                                              ****
   |                                                              ****
   |  ****        ****                                            ****
   ---------------------------------------------------------------------
       0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0

                           Average Cross-Entropy

As you can see, very few networks found a good (near zero) local optimum.
Recoding the inputs from {0,1} to {-1,1} produced the following distribution
of the average cross-entropy after training: 

Frequency

   |              ****
   |  ****        ****  ****
   |  ****        ****  ****
20 +  ****        ****  ****
   |  ****        ****  ****
   |  ****        ****  ****
   |  ****        ****  ****
   |  ****        ****  ****
10 +  ****  ****  ****  ****
   |  ****  ****  ****  ****                                      ****
   |  ****  ****  ****  ****  ****                                ****
   |  ****  ****  ****  ****  ****        ****                    ****
   |  ****  ****  ****  ****  ****  ****  ****                    ****
   ---------------------------------------------------------------------
       0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0

                           Average Cross-Entropy

The results are dramatically better. The difference is due to simple
geometry. The initial hyperplanes pass fairly near the origin. If the data
are centered near the origin (as with {-1,1} coding), the initial
hyperplanes will cut through the data in a variety of directions. If the
data are offset from the origin (as with {0,1} coding), many of the initial
hyperplanes will miss the data entirely, and those that pass through the
data will provide only a limited range of directions, making it difficult
to find local optima whose hyperplanes point in a variety of directions.
If the data are far from the origin (as with {9,10} coding), most of the
initial hyperplanes will miss the data entirely, which will cause most of
the hidden units to saturate and make any learning difficult. See "Should I
standardize the input variables?" for more information. 
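
One rough way to see the geometry numerically is to draw random initial
input-to-hidden weight vectors (with the scale used above, and zero biases
assumed) and compare, for each resulting hyperplane w.x = 0, how far the
data centroid lies from the hyperplane with how widely the data spread
along w. A ratio much larger than 1 means the hyperplane misses the data
cloud entirely. A sketch in Python with NumPy:

   import itertools
   import numpy as np

   rng = np.random.default_rng(0)

   def offset_vs_spread(low, high, trials=1000):
       # Average ratio of (distance from the data centroid to the
       # hyperplane w.x = 0) to (spread of the data along w), over random
       # initial weight vectors w.  Both quantities scale with ||w||, so
       # the ratio depends only on the direction of w.
       X = low + (high - low) * np.array(
           list(itertools.product([0, 1], repeat=5)), dtype=float)
       W = rng.normal(0.0, 1.0 / np.sqrt(5), size=(5, trials))
       proj = X @ W                        # projections of the 32 points
       offset = np.abs(proj.mean(axis=0))  # centroid offset from hyperplane
       spread = proj.std(axis=0)           # data spread along w
       return np.mean(offset / spread)

   for low, high in [(-1, 1), (0, 1), (9, 10)]:
       print("{%g,%g} coding: offset/spread = %.2f"
             % (low, high, offset_vs_spread(low, high)))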


Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)




