The importance of standardizing input variables is discussed in detail
under "Should I standardize the input variables?" But for the benefit of
those who don't believe in theory, here is an example using the 5-bit
parity problem. The unstandardized data are:

   x1  x2  x3  x4  x5   target

    0   0   0   0   0     0
    1   0   0   0   0     1
    0   1   0   0   0     1
    1   1   0   0   0     0
    0   0   1   0   0     1
    1   0   1   0   0     0
    0   1   1   0   0     0
    1   1   1   0   0     1
    0   0   0   1   0     1
    1   0   0   1   0     0
    0   1   0   1   0     0
    1   1   0   1   0     1
    0   0   1   1   0     0
    1   0   1   1   0     1
    0   1   1   1   0     1
    1   1   1   1   0     0
    0   0   0   0   1     1
    1   0   0   0   1     0
    0   1   0   0   1     0
    1   1   0   0   1     1
    0   0   1   0   1     0
    1   0   1   0   1     1
    0   1   1   0   1     1
    1   1   1   0   1     0
    0   0   0   1   1     0
    1   0   0   1   1     1
    0   1   0   1   1     1
    1   1   0   1   1     0
    0   0   1   1   1     1
    1   0   1   1   1     0
    0   1   1   1   1     0
    1   1   1   1   1     1

The network characteristics were:

   Inputs:                       5
   Hidden units:                 5
   Outputs:                      1
   Activation for hidden units:  tanh
   Activation for output units:  logistic
   Error function:               cross-entropy
   Initial weights:              random normal with mean=0,
                                 st.dev.=1/sqrt(5) for input-to-hidden,
                                        =1 for hidden-to-output
   Training method:              batch standard backprop
   Learning rate:                0.1
   Momentum:                     0.9
   Minimum training iterations:  100
   Maximum training iterations:  10000

One hundred networks were trained from different random initial weights.
The following bar chart shows the distribution of the average
cross-entropy after training:

   [Bar chart: frequency of networks by average cross-entropy, on a scale
    from 0.0 to 1.0. One bin holds roughly 90 of the 100 networks; only
    two other bins are occupied, each by about 5 networks.]

As you can see, very few networks found a good (near zero) local optimum.
Recoding the inputs from {0,1} to {-1,1} produced the following
distribution of the average cross-entropy after training:

   [Bar chart: frequency of networks by average cross-entropy, on a scale
    from 0.0 to 1.0. The networks spread over eight bins: three bins of
    roughly 22-26 networks each, the rest tapering off from about 10 down
    to 2.]

The results are dramatically better. The difference is due to simple
geometry. The initial hyperplanes pass fairly near the origin. If the
data are centered near the origin (as with {-1,1} coding), the initial
hyperplanes will cut through the data in a variety of directions. If the
data are offset from the origin (as with {0,1} coding), many of the
initial hyperplanes will miss the data entirely, and those that do pass
through the data will provide only a limited range of directions, making
it difficult to find local optima whose hyperplanes go in different
directions. If the data are far from the origin (as with {9,10} coding),
most of the initial hyperplanes will miss the data entirely, which will
cause most of the hidden units to saturate and make any learning
difficult.

See "Should I standardize the input variables?" for more information.
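For readers who want to try the experiment, here is a minimal NumPy
sketch. It is not the code that produced the results above: the
architecture, activations, error function, initial-weight distributions,
learning rate, and momentum follow the description, but the fixed
iteration count (2000), the number of random restarts (20), and the zero
initialization of the biases are assumptions made for this sketch.

# Minimal sketch of the 5-bit parity experiment, not the FAQ's original code.
import itertools
import numpy as np

def parity_data(zero_one=True):
    """All 32 patterns of 5 bits; target = parity (number of 1s mod 2)."""
    X = np.array(list(itertools.product([0, 1], repeat=5)), dtype=float)
    y = (X.sum(axis=1) % 2).reshape(-1, 1)
    if not zero_one:
        X = 2.0 * X - 1.0                           # recode {0,1} -> {-1,1}
    return X, y

def train(X, y, iters=2000, lr=0.1, mom=0.9, seed=0):
    """Batch backprop on a 5-5-1 net: tanh hidden, logistic output,
    cross-entropy error; returns the final average cross-entropy."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0 / np.sqrt(5), (5, 5))  # input -> hidden
    W2 = rng.normal(0.0, 1.0, (5, 1))               # hidden -> output
    b1, b2 = np.zeros(5), np.zeros(1)               # biases: assumed zero
    vW1, vW2, vb1, vb2 = [np.zeros_like(a) for a in (W1, W2, b1, b2)]
    n = len(X)
    for _ in range(iters):
        h = np.tanh(X @ W1 + b1)                    # forward pass
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # logistic output
        d2 = (p - y) / n          # dE/dlogit for cross-entropy + logistic
        d1 = (d2 @ W2.T) * (1 - h * h)              # tanh derivative
        gW2, gb2 = h.T @ d2, d2.sum(0)
        gW1, gb1 = X.T @ d1, d1.sum(0)
        vW1 = mom * vW1 - lr * gW1;  W1 += vW1      # momentum updates
        vW2 = mom * vW2 - lr * gW2;  W2 += vW2
        vb1 = mom * vb1 - lr * gb1;  b1 += vb1
        vb2 = mom * vb2 - lr * gb2;  b2 += vb2
    h = np.tanh(X @ W1 + b1)                        # final forward pass
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    eps = 1e-12                                     # avoid log(0)
    return float(-(y * np.log(p + eps)
                   + (1 - y) * np.log(1 - p + eps)).mean())

for zero_one, label in [(True, "{0,1}"), (False, "{-1,1}")]:
    X, y = parity_data(zero_one)
    ces = [train(X, y, seed=s) for s in range(20)]
    print(label, "coding: mean final cross-entropy %.3f" % np.mean(ces))

With these reduced settings the absolute numbers will not match the
charts above, but the {-1,1} runs should end at noticeably lower average
cross-entropy than the {0,1} runs.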
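The geometric explanation can also be checked without any training. The
sketch below draws random initial hidden-unit weight vectors (biases
again assumed zero, so each initial hyperplane passes exactly through the
origin rather than merely near it) and reports, for each coding, how
often the hyperplane actually cuts through the 32 data points and how
saturated the tanh units start out. With {9,10} coding you should see
that only a small fraction of the hyperplanes touch the data and that the
units begin almost completely saturated.

# Sketch of the geometry argument; zero initial biases are an assumption.
import itertools
import numpy as np

rng = np.random.default_rng(0)
bits = np.array(list(itertools.product([0, 1], repeat=5)), dtype=float)

for lo, hi in [(0, 1), (-1, 1), (9, 10)]:
    X = lo + (hi - lo) * bits                 # recode the 32 patterns
    cut, sat, trials = 0, 0.0, 10000
    for _ in range(trials):
        w = rng.normal(0, 1 / np.sqrt(5), 5)  # one hidden unit's weights
        z = X @ w                             # net input (bias = 0)
        cut += bool(z.min() < 0.0 < z.max())  # hyperplane cuts the data?
        sat += np.abs(np.tanh(z)).mean()      # 1.0 = fully saturated
    print("{%g,%g}: %4.1f%% of hyperplanes cut the data, mean |tanh| = %.2f"
          % (lo, hi, 100.0 * cut / trials, sat / trials))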