FAQ, Part 3 of 7: Generalization
Section - How many hidden layers should I use?


You may not need any hidden layers at all. Linear and generalized linear
models are useful in a wide variety of applications (McCullagh and Nelder
1989). And even if the function you want to learn is mildly nonlinear, you
may get better generalization with a simple linear model than with a
complicated nonlinear model if there is too little data or too much noise to
estimate the nonlinearities accurately. 
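
As a rough illustration of that last point, the sketch below fits a straight
line and a far more flexible model to a dozen noisy samples of a mildly
nonlinear function. Everything in it is an assumption made for illustration:
the target function, the noise level, and the use of a high-degree polynomial
as a stand-in for "a complicated nonlinear model" (it is not a neural net).

```python
import numpy as np

def rmse(pred, truth):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

def target(x):
    """Hypothetical, mildly nonlinear function (not from the FAQ)."""
    return x + 0.1 * x ** 2

def compare_fits(n_train=12, noise_sd=0.3, seed=0):
    """Return {model: (training RMSE, test RMSE)} for two polynomial fits."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(-1.0, 1.0, n_train)
    y_train = target(x_train) + rng.normal(0.0, noise_sd, n_train)
    x_test = np.linspace(-1.0, 1.0, 201)
    results = {}
    # Degree 1 is the linear model; degree n_train - 1 interpolates the noise.
    for name, degree in [("linear", 1), ("flexible", n_train - 1)]:
        coefs = np.polyfit(x_train, y_train, degree)
        results[name] = (rmse(np.polyval(coefs, x_train), y_train),
                         rmse(np.polyval(coefs, x_test), target(x_test)))
    return results

for name, (train_rmse, test_rmse) in compare_fits().items():
    print(f"{name:9s} train RMSE {train_rmse:.3f}  test RMSE {test_rmse:.3f}")
```

The flexible model reproduces the noisy training values almost exactly, so
its training error says nothing about how well it generalizes; the test error
is measured against the noise-free function on a finer grid.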

In MLPs with step/threshold/Heaviside activation functions, you need two
hidden layers for full generality (Sontag 1992). For further discussion, see
Bishop (1995, 121-126). 

In MLPs with any of a wide variety of continuous nonlinear hidden-layer
activation functions, one hidden layer with an arbitrarily large number of
units suffices for the "universal approximation" property (e.g., Hornik,
Stinchcombe and White 1989; Hornik 1993; for more references, see Bishop
1995, 130, and Ripley, 1996, 173-180). But there is no theory yet to tell
you how many hidden units are needed to approximate any given function. 

If you have only one input, there seems to be no advantage to using more
than one hidden layer. But things get much more complicated when there are
two or more inputs. To illustrate, examples with two inputs and one output
will be used so that the results can be shown graphically. In each example
there are 441 training cases on a regular 21-by-21 grid. The test sets have
1681 cases on a regular 41-by-41 grid over the same domain as the training
set. If you are reading the HTML version of this document via a web browser,
you can see surface plots based on the test set by clicking on the file
names mentioned in the following text. Each plot is a GIF file, approximately
9K in size. 
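
The training and test grids described above can be built as follows. This is
a minimal sketch: the text does not state the input domain, so the range
[-1, 1] on both inputs is an assumption, and regular_grid is a hypothetical
helper name.

```python
import numpy as np

def regular_grid(points_per_side, low=-1.0, high=1.0):
    """Return all points of a square grid as an (n*n, 2) array of inputs."""
    axis = np.linspace(low, high, points_per_side)
    x1, x2 = np.meshgrid(axis, axis)
    return np.column_stack([x1.ravel(), x2.ravel()])

train_inputs = regular_grid(21)   # 21 * 21 = 441 training cases
test_inputs = regular_grid(41)    # 41 * 41 = 1681 test cases
print(train_inputs.shape, test_inputs.shape)
```

Because both grids cover the same domain, every training point coincides
with a test point, and the remaining test points fall midway between
training points, where spurious ridges or spikes would show up.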

Consider a target function of two inputs, consisting of a Gaussian hill in
the middle of a plane (hill.gif). An MLP with an identity output activation
function can easily fit the hill by surrounding it with a few sigmoid
(logistic, tanh, arctan, etc.) hidden units, but there will be spurious
ridges and valleys where the plane should be flat (h_mlp_6.gif). It takes
dozens of hidden units to flatten out the plane accurately (h_mlp_30.gif). 
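
A sketch of the pieces involved, under stated assumptions: the exact hill
used for the plots is not given, so gaussian_hill below (center, width) is
hypothetical, and tanh stands in for the sigmoid hidden units.

```python
import numpy as np

def gaussian_hill(x, center=(0.0, 0.0), width=0.3):
    """Hypothetical target: a Gaussian bump on a plane whose value is ~0."""
    d2 = np.sum((x - np.asarray(center)) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * width ** 2))

def hidden_unit(x, w, b):
    """One sigmoid (tanh) hidden unit; x has shape (n_cases, 2)."""
    return np.tanh(x @ w + b)

def mlp_one_hidden(x, w_hidden, b_hidden, w_out, b_out):
    """One tanh hidden layer followed by an identity output activation."""
    hidden = np.tanh(x @ w_hidden + b_hidden)   # (n_cases, n_hidden)
    return hidden @ w_out + b_out               # (n_cases,)

# A hidden unit depends on x only through w . x, so its output is constant
# along any direction perpendicular to w: it is a ridge, not a bump.
w, b = np.array([2.0, 1.0]), 0.5
perp = np.array([-1.0, 2.0])                    # perpendicular to w
point = np.array([[0.2, -0.3]])
print(hidden_unit(point, w, b), hidden_unit(point + 3.0 * perp, w, b))
```

Each hidden unit extends unchanged across the whole plane, which is one way
to see why a handful of them can enclose the hill but leave ridges and
valleys far from it that only additional units can cancel.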

Now suppose you use a logistic output activation function. As the input to a
logistic function goes to negative infinity, the output approaches zero. The
plane in the Gaussian target function also has a value of zero. If the
weights and bias for the output layer yield large negative values outside
the base of the hill, the logistic function will flatten out any spurious
ridges and valleys. So fitting the flat part of the target function is easy 
(h_mlpt_3_unsq.gif and h_mlpt_3.gif). But the logistic function also tends
to lower the top of the hill. 
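
The flattening effect can be checked numerically. The sketch below measures
how much of a ripple in the net input to the output unit survives the
logistic function; the ripple size and the net input values are arbitrary
choices for illustration.

```python
import math

def logistic(z):
    """Standard logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def output_ripple(net_input, ripple=0.5):
    """Change in output when the net input swings by +/- ripple."""
    return logistic(net_input + ripple) - logistic(net_input - ripple)

for net in (-8.0, 0.0, 3.0):
    print(f"net input {net:5.1f}: output {logistic(net):.4f}, "
          f"ripple {output_ripple(net):.4f}")
```

At a large negative net input the same ripple is compressed to a tiny
fraction of what it is near zero, which is why the plane comes out flat. The
same compression acts near an output of one, so a net input of 3.0 still
gives an output visibly below one, consistent with the lowered hilltop.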

If, instead of a rounded hill, the target function were a mesa with a large,
flat top with a value of one, the logistic output activation function would
be able to smooth out the top of the mesa just as it smooths out the plane
below. Target functions like this, with large flat areas with values of
either zero or one, are just what you have in many noise-free classification
problems. In such cases, a single hidden layer is likely to work well. 

When using a logistic output activation function, it is common practice to
scale the target values to a range of .1 to .9. Such scaling is bad in a
noise-free classification problem, because it prevents the logistic function
from smoothing out the flat areas (h_mlpt1-9_3.gif). 
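
One way to see the effect is to ask what net input the output unit needs in
order to hit a scaled target, and how steep the logistic function is there.
This is a small sketch; scale_targets, logit, and logistic_slope are
hypothetical helper names, and the comparison value 0.001 is an arbitrary
stand-in for "nearly zero".

```python
import math

def scale_targets(t, low=0.1, high=0.9):
    """Map a target in [0, 1] linearly onto [low, high]."""
    return low + (high - low) * t

def logit(p):
    """Net input at which the logistic function outputs p."""
    return math.log(p / (1.0 - p))

def logistic_slope(p):
    """Derivative of the logistic function at the point where it outputs p."""
    return p * (1.0 - p)

for label, p in [("scaled target", scale_targets(0.0)), ("near zero", 0.001)]:
    print(f"{label:13s} output {p:.3f}: net input {logit(p):6.2f}, "
          f"slope {logistic_slope(p):.4f}")
```

With targets scaled to .1, the flat area has to be fitted at a moderate net
input where the slope is about 0.09, so ripples in the net input pass through
to the output; with a target of zero, the net input can be pushed far into
the negative tail, where the slope is close to zero.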

For the Gaussian target function, [.1,.9] scaling would make it easier to
fit the top of the hill, but would reintroduce undulations in the plane. It
would be better for the Gaussian target function to scale the target values
to a range of 0 to .9. But for a more realistic and complicated target
function, how would you know the best way to scale the target values? 

By introducing a second hidden layer with one sigmoid activation function
and returning to an identity output activation function, you can let the net
figure out the best scaling (h_mlp1_3.gif). Actually, the bias and weight
for the output layer scale the output rather than the target values, and you
can use whatever range of target values is convenient. 
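
A minimal sketch of such a network, assuming tanh for the sigmoid units and
made-up weights. The function name and the weight shapes are illustrative,
not taken from the FAQ.

```python
import numpy as np

def mlp_two_hidden(x, w1, b1, w2, b2, w_out, b_out):
    """Two tanh hidden layers followed by an identity output activation."""
    h1 = np.tanh(x @ w1 + b1)      # first hidden layer,  (n_cases, n_h1)
    h2 = np.tanh(h1 @ w2 + b2)     # second hidden layer, (n_cases, n_h2)
    return h2 @ w_out + b_out      # identity output,     (n_cases,)

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=(50, 2))
w1, b1 = rng.normal(size=(2, 3)), rng.normal(size=3)
w2, b2 = rng.normal(size=(3, 1)), rng.normal(size=1)   # one unit in layer 2

# With a single tanh unit in the second layer, the output is confined to
# the interval b_out +/- |w_out|, whatever the other weights are.
w_out, b_out = np.array([0.45]), 0.5
y = mlp_two_hidden(x, w1, b1, w2, b2, w_out, b_out)
print(y.min(), y.max())            # stays within [0.05, 0.95]
```

The single second-layer unit plays the role that the logistic output
activation played before, while the output weight and bias are free
parameters that training can adjust to set the range of the squashed output.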

For more complicated target functions, especially those with several hills
or valleys, it is useful to have several units in the second hidden layer.
Each unit in the second hidden layer enables the net to fit a separate hill
or valley. So an MLP with two hidden layers can often yield an accurate
approximation with fewer weights than an MLP with one hidden layer (Chester
1990). 

To illustrate the use of multiple units in the second hidden layer, the next
example resembles a landscape with a Gaussian hill and a Gaussian valley,
both elliptical (hillanvale.gif). The table below gives the RMSE (root mean
squared error) for the test set with various architectures. If you are
reading the HTML version of this document via a web browser, click on any
number in the table to see a surface plot of the corresponding network
output. 

The MLP networks in the table have one or two hidden layers with a tanh
activation function. The output activation function is the identity. Using a
squashing function on the output layer is of no benefit for this function,
since the only flat area in the function has a target value near the middle
of the target range. 

          Hill and Valley Data: RMSE for the Test Set
              (Number of weights in parentheses)
                         MLP Networks

HUs in                  HUs in Second Layer
First  ----------------------------------------------------------
Layer     0           1           2           3           4
 1     0.204(  5)  0.204(  7)  0.189( 10)  0.187( 13)  0.185( 16)
 2     0.183(  9)  0.163( 11)  0.147( 15)  0.094( 19)  0.096( 23)
 3     0.159( 13)  0.095( 15)  0.054( 20)  0.033( 25)  0.045( 30)
 4     0.137( 17)  0.093( 19)  0.009( 25)  0.021( 31)  0.016( 37)
 5     0.121( 21)  0.092( 23)              0.010( 37)  0.011( 44)
 6     0.100( 25)  0.092( 27)              0.007( 43)  0.005( 51)
 7     0.086( 29)  0.077( 31)
 8     0.079( 33)  0.062( 35)
 9     0.072( 37)  0.055( 39)
10     0.059( 41)  0.047( 43)
12     0.047( 49)  0.042( 51)
15     0.039( 61)  0.032( 63)
20     0.025( 81)  0.018( 83)  
25     0.021(101)  0.016(103)  
30     0.018(121)  0.015(123)  
40     0.012(161)  0.015(163)  
50     0.008(201)  0.014(203)  

For an MLP with only one hidden layer (column 0), Gaussian hills and valleys
require a large number of hidden units to approximate well. When there is
one unit in the second hidden layer, the network can represent one hill or
valley easily, which is what happens with three to six units in the first
hidden layer. But having only one unit in the second hidden layer is of
little benefit for learning two hills or valleys. Using two units in the
second hidden layer enables the network to approximate two hills or valleys
easily; in this example, only four units are required in the first hidden
layer to get an excellent fit. Each additional unit in the second hidden
layer enables the network to learn another hill or valley with a relatively
small number of units in the first hidden layer, as explained by Chester
(1990). In this example, having three or four units in the second hidden
layer helps little, and with four units in the first hidden layer it
actually produces a worse approximation, owing to problems with local
minima. 
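
The weight counts shown in parentheses in the table can be reproduced by
counting one weight per connection plus one bias per hidden and output unit,
with two inputs, one output, and no direct input-to-output connections. That
bookkeeping is inferred from the numbers rather than stated in the text, and
mlp_weight_count is a hypothetical helper.

```python
def mlp_weight_count(n_inputs, hidden_sizes, n_outputs=1):
    """Count weights and biases in a fully connected layered MLP.

    Hidden layers of size 0 are skipped, matching the table's column 0.
    """
    sizes = [n_inputs] + [h for h in hidden_sizes if h > 0] + [n_outputs]
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

print(mlp_weight_count(2, [4, 2]))    # 25, the 4-2 network (RMSE 0.009)
print(mlp_weight_count(2, [50, 0]))   # 201, the 50-0 network (RMSE 0.008)
```

Read against the table, the comparison is stark: the 4-2 network reaches a
test RMSE of 0.009 with 25 weights, whereas the single-hidden-layer networks
need 50 hidden units (201 weights) to reach 0.008.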

Unfortunately, using two hidden layers exacerbates the problem of local
minima, and it is important to use lots of random initializations or other
methods for global optimization. Local minima with two hidden layers can
have extreme spikes or blades even when the number of weights is much
smaller than the number of training cases. One of the few advantages of 
standard backprop is that it is so slow that spikes and blades will not
become very sharp for practical training times. 
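
The simplest of these remedies, many random initializations, can be sketched
as follows. Everything here is a stand-in chosen to keep the example short:
plain batch gradient descent as the optimizer, a made-up hill-and-valley
target on an assumed [-1, 1] grid, and arbitrary settings for the step size,
number of steps, and number of restarts.

```python
import numpy as np

def init_params(rng, sizes, scale=0.5):
    """Random (weights, biases) for each layer of a layered MLP."""
    return [(rng.normal(0.0, scale, (m, n)), rng.normal(0.0, scale, n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Activations of every layer: tanh hidden layers, identity output."""
    acts = [x]
    for i, (w, b) in enumerate(params):
        z = acts[-1] @ w + b
        acts.append(z if i == len(params) - 1 else np.tanh(z))
    return acts

def train(params, x, y, steps=500, lr=0.05):
    """Gradient descent on mean squared error; updates params in place."""
    for _ in range(steps):
        acts = forward(params, x)
        delta = 2.0 * (acts[-1] - y) / len(x)     # gradient at the output
        for i in reversed(range(len(params))):
            w, b = params[i]
            grad_w, grad_b = acts[i].T @ delta, delta.sum(axis=0)
            if i > 0:                             # back through tanh layer i
                delta = (delta @ w.T) * (1.0 - acts[i] ** 2)
            params[i] = (w - lr * grad_w, b - lr * grad_b)
    return float(np.sqrt(np.mean((forward(params, x)[-1] - y) ** 2)))

def best_of_restarts(x, y, sizes, n_restarts=10, seed=0):
    """Train from several random starts; keep the lowest training RMSE."""
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(n_restarts):
        params = init_params(rng, sizes)
        runs.append((train(params, x, y), params))
    best_rmse, best_params = min(runs, key=lambda run: run[0])
    return best_rmse, best_params, [run[0] for run in runs]

axis = np.linspace(-1.0, 1.0, 21)
x1, x2 = np.meshgrid(axis, axis)
x = np.column_stack([x1.ravel(), x2.ravel()])
# Hypothetical target: one Gaussian hill and one Gaussian valley.
y = (np.exp(-((x[:, 0] + 0.4) ** 2 + x[:, 1] ** 2) / 0.1)
     - np.exp(-((x[:, 0] - 0.4) ** 2 + x[:, 1] ** 2) / 0.1))[:, None]

best_rmse, best_params, all_rmse = best_of_restarts(x, y, sizes=[2, 4, 2, 1])
print("best:", best_rmse, "worst:", max(all_rmse))
```

The spread between the best and worst runs gives a rough idea of how much
the result depends on the starting point. The numbers will not match the
table, since the target and the training method differ from those used there.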

More than two hidden layers can be useful in certain architectures such as
cascade correlation (Fahlman and Lebiere 1990) and in special applications,
such as the two-spirals problem (Lang and Witbrock 1988) and ZIP code
recognition (Le Cun et al. 1989). 

RBF networks are most often used with a single hidden layer. But an extra,
linear hidden layer before the radial hidden layer enables the network to
ignore irrelevant inputs (see "How do MLPs compare with RBFs?"). The linear
hidden layer allows the RBFs to take elliptical, rather than radial
(circular), shapes in the space of the inputs. Hence the linear layer gives
you an elliptical basis function (EBF) network. In the hill and valley
example, an ORBFUN network requires nine hidden units (37 weights) to get
the test RMSE below .01, but by adding a linear hidden layer, you can get an
essentially perfect fit with three linear units followed by two radial units
(20 weights). 
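
The effect of the linear hidden layer can be sketched as below. The Gaussian
form of the radial units, the absence of biases in the linear layer, and all
names and numbers are assumptions made for illustration; the text does not
spell out the exact parameterization it used.

```python
import numpy as np

def rbf_layer(z, centers, widths):
    """Gaussian radial units: exp(-||z - c_j||^2 / w_j^2) for each unit j."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / widths ** 2)

def ebf_forward(x, a, centers, widths, w_out, b_out):
    """Linear hidden layer (matrix a), then radial units, then linear output."""
    z = x @ a                                   # linear hidden layer
    return rbf_layer(z, centers, widths) @ w_out + b_out

# Elliptical contours: shrinking input 2 by half in the linear layer makes
# the unit as active at (0, 2) as it is at (1, 0).
a_ellipse = np.array([[1.0, 0.0],
                      [0.0, 0.5]])
pts = np.array([[1.0, 0.0], [0.0, 2.0]])
print(rbf_layer(pts @ a_ellipse, np.zeros((1, 2)), np.ones(1)))

# Irrelevant input: a zero row in the linear layer removes input 2 entirely.
a_ignore = np.array([[1.0],
                     [0.0]])
centers, widths = np.array([[-0.5], [0.5]]), np.array([0.4, 0.4])
w_out, b_out = np.array([1.0, -1.0]), 0.0
same_x1 = np.array([[0.3, -1.0], [0.3, 5.0]])
print(ebf_forward(same_x1, a_ignore, centers, widths, w_out, b_out))
```

Without the linear layer, a radial unit can only rescale all inputs together
through its width, so every input influences the distance to the center and
the contours stay circular.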


   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Chester, D.L. (1990), "Why Two Hidden Layers are Better than One,"
   IJCNN-90-WASH-DC, Lawrence Erlbaum, 1990, volume 1, 265-268. 

   Fahlman, S.E. and Lebiere, C. (1990), "The Cascade Correlation Learning
   Architecture," NIPS2, 524-532. 

   Hornik, K., Stinchcombe, M. and White, H. (1989), "Multilayer feedforward
   networks are universal approximators," Neural Networks, 2, 359-366. 

   Hornik, K. (1993), "Some new results on neural network approximation,"
   Neural Networks, 6, 1069-1072. 

   Lang, K.J. and Witbrock, M.J. (1988), "Learning to tell two spirals
   apart," in Touretzky, D., Hinton, G., and Sejnowski, T., eds., 
   Proceedings of the 1988 Connectionist Models Summer School, San Mateo,
   CA: Morgan Kaufmann. 

   Le Cun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E.,
   Hubbard, W., and Jackel, L.D. (1989), "Backpropagation applied to
   handwritten ZIP code recognition," Neural Computation, 1, 541-551. 

   McCullagh, P. and Nelder, J.A. (1989), Generalized Linear Models, 2nd
   ed., London: Chapman & Hall. 

   Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 

   Sontag, E.D. (1992), "Feedback stabilization using two-hidden-layer
   nets," IEEE Transactions on Neural Networks, 3, 981-990. 


Send corrections/additions to the FAQ Maintainer: (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM