comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Section - Why not code binary inputs as 0 and 1?

Top Document: comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Previous Document: How should categories be encoded?
Next Document: Why use a bias/threshold?

The importance of standardizing input variables is discussed in detail under
"Should I standardize the input variables?" But for the benefit of those
people who don't believe in theory, here is an example using the 5-bit
parity problem. The unstandardized data are: 

   x1    x2    x3    x4    x5    target

    0     0     0     0     0       0  
    1     0     0     0     0       1  
    0     1     0     0     0       1  
    1     1     0     0     0       0  
    0     0     1     0     0       1  
    1     0     1     0     0       0  
    0     1     1     0     0       0  
    1     1     1     0     0       1  
    0     0     0     1     0       1  
    1     0     0     1     0       0  
    0     1     0     1     0       0  
    1     1     0     1     0       1  
    0     0     1     1     0       0  
    1     0     1     1     0       1  
    0     1     1     1     0       1  
    1     1     1     1     0       0  
    0     0     0     0     1       1  
    1     0     0     0     1       0  
    0     1     0     0     1       0  
    1     1     0     0     1       1  
    0     0     1     0     1       0  
    1     0     1     0     1       1  
    0     1     1     0     1       1  
    1     1     1     0     1       0  
    0     0     0     1     1       0  
    1     0     0     1     1       1  
    0     1     0     1     1       1  
    1     1     0     1     1       0  
    0     0     1     1     1       1  
    1     0     1     1     1       0  
    0     1     1     1     1       0  
    1     1     1     1     1       1  
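
For readers who would rather generate the table than type it in, the
patterns can be produced programmatically. Here is a minimal sketch in
Python (the function and variable names are illustrative, not part of the
FAQ); the target is simply the parity of the inputs, i.e. the sum of the
five bits modulo 2:

   import numpy as np

   def parity_data(n_bits=5):
       """All 2**n_bits binary input patterns with their parity target."""
       rows = []
       for i in range(2 ** n_bits):
           bits = [(i >> b) & 1 for b in range(n_bits)]   # x1 ... x5
           rows.append(bits + [sum(bits) % 2])            # target = parity
       return np.array(rows, dtype=float)

   data = parity_data()
   X, y = data[:, :5], data[:, 5:]    # 32 patterns, 5 inputs, 1 target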

The network characteristics were: 

Inputs:                       5
Hidden units:                 5
Outputs:                      1
Activation for hidden units:  tanh
Activation for output units:  logistic
Error function:               cross-entropy
Initial weights:              random normal with mean=0,
                                 st.dev.=1/sqrt(5) for input-to-hidden
                                        =1         for hidden-to-output
Training method:              batch standard backprop
Learning rate:                0.1
Momentum:                     0.9
Minimum training iterations:  100
Maximum training iterations:  10000
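
The sketch below shows how such a network and training run might be coded
from scratch in Python/NumPy, following the characteristics listed above
(tanh hidden layer, logistic output, cross-entropy error, batch backprop
with momentum). It is an illustrative reconstruction under stated
assumptions (zero initial biases, no early stopping), not the program that
produced the results below.

   import numpy as np

   def train_parity_net(X, y, n_hidden=5, lr=0.1, momentum=0.9,
                        max_iter=10000, rng=None):
       """Batch standard backprop on a 5-5-1 network as specified above."""
       rng = np.random.default_rng() if rng is None else rng
       n_in = X.shape[1]
       # Initial weights: normal, st.dev. 1/sqrt(5) for input-to-hidden,
       # 1 for hidden-to-output; biases start at zero (an assumption).
       W1 = rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_hidden))
       b1 = np.zeros(n_hidden)
       W2 = rng.normal(0.0, 1.0, (n_hidden, 1))
       b2 = np.zeros(1)
       params = [W1, b1, W2, b2]
       vel = [np.zeros_like(w) for w in params]

       for _ in range(max_iter):
           # Forward pass: tanh hidden units, logistic output unit.
           h = np.tanh(X @ W1 + b1)
           p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
           # Backward pass for the average cross-entropy error.
           d_out = (p - y) / len(X)
           d_hid = (d_out @ W2.T) * (1.0 - h ** 2)
           grads = [X.T @ d_hid, d_hid.sum(0), h.T @ d_out, d_out.sum(0)]
           # Momentum update, applied in place.
           for i, (param, g) in enumerate(zip(params, grads)):
               vel[i] = momentum * vel[i] - lr * g
               param += vel[i]

       # Return the average cross-entropy after training.
       h = np.tanh(X @ W1 + b1)
       p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
       p = np.clip(p, 1e-12, 1.0 - 1e-12)   # guard against log(0)
       return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

Calling such a function one hundred times with different random seeds on
the {0,1} inputs, and again on the recoded {-1,1} inputs, is how one would
reproduce the kind of distributions shown in the two bar charts below.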

One hundred networks were trained from different random initial weights. The
following bar chart shows the distribution of the average cross-entropy
after training: 

Frequency

   |                                                              ****
   |                                                              ****
80 +                                                              ****
   |                                                              ****
   |                                                              ****
   |                                                              ****
60 +                                                              ****
   |                                                              ****
   |                                                              ****
   |                                                              ****
40 +                                                              ****
   |                                                              ****
   |                                                              ****
   |                                                              ****
20 +                                                              ****
   |                                                              ****
   |                                                              ****
   |  ****        ****                                            ****
   ---------------------------------------------------------------------
       0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0

                           Average Cross-Entropy

As you can see, very few networks found a good (near zero) local optimum.
Recoding the inputs from {0,1} to {-1,1} produced the following distribution
of the average cross-entropy after training: 

Frequency

   |              ****
   |  ****        ****  ****
   |  ****        ****  ****
20 +  ****        ****  ****
   |  ****        ****  ****
   |  ****        ****  ****
   |  ****        ****  ****
   |  ****        ****  ****
10 +  ****  ****  ****  ****
   |  ****  ****  ****  ****                                      ****
   |  ****  ****  ****  ****  ****                                ****
   |  ****  ****  ****  ****  ****        ****                    ****
   |  ****  ****  ****  ****  ****  ****  ****                    ****
   ---------------------------------------------------------------------
       0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0

                           Average Cross-Entropy

The results are dramatically better. The difference is due to simple
geometry. The initial hyperplanes pass fairly near the origin. If the data
are centered near the origin (as with {-1,1} coding), the initial
hyperplanes will cut through the data in a variety of directions. If the
data are offset from the origin (as with {0,1} coding), many of the initial
hyperplanes will miss the data entirely, and those that pass through the
data will provide only a limited range of directions, making it difficult
to find local optima that use hyperplanes that go in different directions.
If the data are far from the origin (as with {9,10} coding), most of the
initial hyperplanes will miss the data entirely, which will cause most of
the hidden units to saturate and make any learning difficult. See "Should I
standardize the input variables?" for more information. 
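
The recoding itself is just a linear rescaling of the inputs. The short
sketch below (with illustrative names, not part of the FAQ) applies it and
prints how far each coding places the centroid of the data from the
origin, which is the quantity the geometric argument turns on:

   import numpy as np
   from itertools import product

   # The 32 five-bit input patterns from the table above.
   X = np.array(list(product([0.0, 1.0], repeat=5)))

   X_pm1 = 2.0 * X - 1.0     # recode {0,1} -> {-1,1}; targets unchanged
   X_910 = X + 9.0           # the {9,10} coding mentioned above

   # Distance of the centroid of each coding from the origin:
   for name, data in [("{0,1}", X), ("{-1,1}", X_pm1), ("{9,10}", X_910)]:
       print(name, np.linalg.norm(data.mean(axis=0)))
   # {0,1}  : about 1.1  (offset from the origin)
   # {-1,1} : 0.0        (centered at the origin)
   # {9,10} : about 21.2 (far from the origin)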



Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)




