comp.ai.neural-nets FAQ, Part 3 of 7: Generalization

You may not need any hidden layers at all. Linear and generalized linear models are useful in a wide variety of applications (McCullagh and Nelder 1989). And even if the function you want to learn is mildly nonlinear, you may get better generalization with a simple linear model than with a complicated nonlinear model if there is too little data or too much noise to estimate the nonlinearities accurately.

In MLPs with step/threshold/Heaviside activation functions, you need two hidden layers for full generality (Sontag 1992). For further discussion, see Bishop (1995, 121-126).

In MLPs with any of a wide variety of continuous nonlinear hidden-layer activation functions, one hidden layer with an arbitrarily large number of units suffices for the "universal approximation" property (e.g., Hornik, Stinchcombe and White 1989; Hornik 1993; for more references, see Bishop 1995, 130, and Ripley 1996, 173-180). But there is no theory yet to tell you how many hidden units are needed to approximate any given function.

If you have only one input, there seems to be no advantage to using more than one hidden layer. But things get much more complicated when there are two or more inputs. To illustrate, examples with two inputs and one output will be used so that the results can be shown graphically. In each example there are 441 training cases on a regular 21-by-21 grid. The test sets have 1681 cases on a regular 41-by-41 grid over the same domain as the training set. If you are reading the HTML version of this document via a web browser, you can see surface plots based on the test set by clicking on the file names mentioned in the following text. Each plot is a gif file, approximately 9K in size.
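As a rough illustration of the data layout just described, the following sketch builds a 21-by-21 training grid and a 41-by-41 test grid over the same square. The domain [-1, 1] x [-1, 1] and the helper name make_grid are assumptions made for illustration only; the domain actually used for the plots is not stated here.

```python
def make_grid(n, lo=-1.0, hi=1.0):
    """Return the n*n input pairs (x1, x2) of a regular n-by-n grid.

    The bounds lo and hi are illustrative assumptions, not values
    taken from the FAQ.
    """
    step = (hi - lo) / (n - 1)
    axis = [lo + i * step for i in range(n)]
    return [(x1, x2) for x1 in axis for x2 in axis]


# Training and test inputs cover the same domain at different resolutions.
train_inputs = make_grid(21)   # 21 * 21 = 441 cases
test_inputs = make_grid(41)    # 41 * 41 = 1681 cases

print(len(train_inputs), len(test_inputs))
```

Because the test grid is twice as fine as the training grid, roughly three quarters of the test cases fall between training points, so the test RMSE reflects interpolation between training cases rather than recall of them.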
Consider a target function of two inputs, consisting of a Gaussian hill in the middle of a plane (hill.gif). An MLP with an identity output activation function can easily fit the hill by surrounding it with a few sigmoid (logistic, tanh, arctan, etc.) hidden units, but there will be spurious ridges and valleys where the plane should be flat (h_mlp_6.gif). It takes dozens of hidden units to flatten out the plane accurately (h_mlp_30.gif).

Now suppose you use a logistic output activation function. As the input to a logistic function goes to negative infinity, the output approaches zero. The plane in the Gaussian target function also has a value of zero. If the weights and bias for the output layer yield large negative values outside the base of the hill, the logistic function will flatten out any spurious ridges and valleys. So fitting the flat part of the target function is easy (h_mlpt_3_unsq.gif and h_mlpt_3.gif). But the logistic function also tends to lower the top of the hill.

If, instead of a rounded hill, the target function were a mesa with a large, flat top with a value of one, the logistic output activation function would be able to smooth out the top of the mesa just as it smooths out the plane below. Target functions like this, with large flat areas with values of either zero or one, are just what you have in many noise-free classification problems. In such cases, a single hidden layer is likely to work well.

When using a logistic output activation function, it is common practice to scale the target values to a range of .1 to .9. Such scaling is bad in a noise-free classification problem, because it prevents the logistic function from smoothing out the flat areas (h_mlpt19_3.gif).

For the Gaussian target function, [.1,.9] scaling would make it easier to fit the top of the hill, but would reintroduce undulations in the plane. It would be better for the Gaussian target function to scale the target values to a range of 0 to .9.
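The effect of target scaling on the flat regions can be checked numerically. The sketch below is an illustration under stated assumptions, not a reproduction of the networks behind the plots: it supposes that the hidden units leave a ripple of +/- 1 in the net input to the output unit, and compares how much of that ripple survives the logistic function when the flat target is 0 (net input pushed to about -10, an arbitrary choice) and when the flat target is scaled to .1 (net input forced to sit near logit(.1)). The names output_ripple, ripple_unscaled and ripple_scaled are hypothetical.

```python
import math


def logistic(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of the logistic function, for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def output_ripple(center, amplitude):
    """Spread in the logistic output when the net input varies
    by +/- amplitude around center."""
    return logistic(center + amplitude) - logistic(center - amplitude)


# Flat target of 0: the net input can be driven far negative
# (-10 is an assumed, arbitrary value), where the logistic is nearly flat.
ripple_unscaled = output_ripple(-10.0, 1.0)

# Flat target scaled to .1: the net input must stay near logit(.1),
# about -2.2, where the logistic still has an appreciable slope.
ripple_scaled = output_ripple(logit(0.1), 1.0)

print(ripple_unscaled, ripple_scaled)
```

With these assumed numbers, the ripple that reaches the output is on the order of 1e-4 for the unscaled target but close to 0.2 for the scaled one, which is the sense in which [.1,.9] scaling prevents the logistic function from smoothing out the flat areas.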
But for a more realistic and complicated target function, how would you know the best way to scale the target values? By introducing a second hidden layer with one sigmoid activation function and returning to an identity output activation function, you can let the net figure out the best scaling (h_mlp1_3.gif). Actually, the bias and weight for the output layer scale the output rather than the target values, and you can use whatever range of target values is convenient.

For more complicated target functions, especially those with several hills or valleys, it is useful to have several units in the second hidden layer. Each unit in the second hidden layer enables the net to fit a separate hill or valley. So an MLP with two hidden layers can often yield an accurate approximation with fewer weights than an MLP with one hidden layer (Chester 1990).

To illustrate the use of multiple units in the second hidden layer, the next example resembles a landscape with a Gaussian hill and a Gaussian valley, both elliptical (hillanvale.gif). The table below gives the RMSE (root mean squared error) for the test set with various architectures. If you are reading the HTML version of this document via a web browser, click on any number in the table to see a surface plot of the corresponding network output.

The MLP networks in the table have one or two hidden layers with a tanh activation function. The output activation function is the identity. Using a squashing function on the output layer is of no benefit for this function, since the only flat area in the function has a target value near the middle of the target range.
              Hill and Valley Data: RMSE for the Test Set
                  (Number of weights in parentheses)
                             MLP Networks

HUs in                    HUs in Second Layer
First
Layer       0           1           2           3           4
   1    0.204(  5)  0.204(  7)  0.189( 10)  0.187( 13)  0.185( 16)
   2    0.183(  9)  0.163( 11)  0.147( 15)  0.094( 19)  0.096( 23)
   3    0.159( 13)  0.095( 15)  0.054( 20)  0.033( 25)  0.045( 30)
   4    0.137( 17)  0.093( 19)  0.009( 25)  0.021( 31)  0.016( 37)
   5    0.121( 21)  0.092( 23)              0.010( 37)  0.011( 44)
   6    0.100( 25)  0.092( 27)              0.007( 43)  0.005( 51)
   7    0.086( 29)  0.077( 31)
   8    0.079( 33)  0.062( 35)
   9    0.072( 37)  0.055( 39)
  10    0.059( 41)  0.047( 43)
  12    0.047( 49)  0.042( 51)
  15    0.039( 61)  0.032( 63)
  20    0.025( 81)  0.018( 83)
  25    0.021(101)  0.016(103)
  30    0.018(121)  0.015(123)
  40    0.012(161)  0.015(163)
  50    0.008(201)  0.014(203)

For an MLP with only one hidden layer (column 0), Gaussian hills and valleys require a large number of hidden units to approximate well. When there is one unit in the second hidden layer, the network can represent one hill or valley easily, which is what happens with three to six units in the first hidden layer. But having only one unit in the second hidden layer is of little benefit for learning two hills or valleys.

Using two units in the second hidden layer enables the network to approximate two hills or valleys easily; in this example, only four units are required in the first hidden layer to get an excellent fit. Each additional unit in the second hidden layer enables the network to learn another hill or valley with a relatively small number of units in the first hidden layer, as explained by Chester (1990). In this example, having three or four units in the second hidden layer helps little, and actually produces a worse approximation when there are four units in the first hidden layer due to problems with local minima.

Unfortunately, using two hidden layers exacerbates the problem of local minima, and it is important to use lots of random initializations or other methods for global optimization.
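The weight counts given in parentheses in the table are consistent with fully connected layers in which every hidden and output unit has its own bias, with two inputs and one output. The following sketch reproduces them; the function name mlp_weight_count is hypothetical, and a second-layer size of 0 is treated as "no second hidden layer", matching column 0 of the table.

```python
def mlp_weight_count(n_inputs, hidden_sizes, n_outputs=1):
    """Count weights and biases in a fully connected MLP.

    hidden_sizes lists the number of units in each hidden layer;
    entries equal to 0 are skipped, so (4, 0) means one hidden layer.
    Each unit receives one weight per unit in the previous layer
    plus one bias.
    """
    sizes = [n_inputs] + [h for h in hidden_sizes if h > 0] + [n_outputs]
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))


# Entries from the hill and valley table (two inputs, one output).
print(mlp_weight_count(2, (4, 2)))   # -> 25
print(mlp_weight_count(2, (50, 0)))  # -> 201
```

Seen this way, the comparison in the table is between a net with 4 and 2 hidden units (25 weights, RMSE 0.009) and a single-hidden-layer net with 50 units (201 weights, RMSE 0.008): similar accuracy with about one eighth as many weights.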
Local minima with two hidden layers can have extreme spikes or blades even when the number of weights is much smaller than the number of training cases. One of the few advantages of standard backprop is that it is so slow that spikes and blades will not become very sharp for practical training times.

More than two hidden layers can be useful in certain architectures such as cascade correlation (Fahlman and Lebiere 1990) and in special applications, such as the two-spirals problem (Lang and Witbrock 1988) and ZIP code recognition (Le Cun et al. 1989).

RBF networks are most often used with a single hidden layer. But an extra, linear hidden layer before the radial hidden layer enables the network to ignore irrelevant inputs (see "How do MLPs compare with RBFs?"). The linear hidden layer allows the RBFs to take elliptical, rather than radial (circular), shapes in the space of the inputs. Hence the linear layer gives you an elliptical basis function (EBF) network. In the hill and valley example, an ORBFUN network requires nine hidden units (37 weights) to get the test RMSE below .01, but by adding a linear hidden layer, you can get an essentially perfect fit with three linear units followed by two radial units (20 weights).

References:

Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford: Oxford University Press.

Chester, D.L. (1990), "Why Two Hidden Layers are Better than One," IJCNN-90-WASH-DC, Lawrence Erlbaum, 1990, volume 1, 265-268.

Fahlman, S.E. and Lebiere, C. (1990), "The Cascade Correlation Learning Architecture," NIPS2, 524-532, ftp://archive.cis.ohio-state.edu/pub/neuroprose/fahlman.cascor-tr.ps.Z.

Hornik, K., Stinchcombe, M. and White, H. (1989), "Multilayer feedforward networks are universal approximators," Neural Networks, 2, 359-366.

Hornik, K. (1993), "Some new results on neural network approximation," Neural Networks, 6, 1069-1072.

Lang, K.J. and Witbrock, M.J. (1988), "Learning to tell two spirals apart," in Touretzky, D., Hinton, G., and Sejnowski, T., eds., Proceedings of the 1988 Connectionist Models Summer School, San Mateo, CA: Morgan Kaufmann.

Le Cun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., and Jackel, L.D. (1989), "Backpropagation applied to handwritten ZIP code recognition," Neural Computation, 1, 541-551.

McCullagh, P. and Nelder, J.A. (1989), Generalized Linear Models, 2nd ed., London: Chapman & Hall.

Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

Sontag, E.D. (1992), "Feedback stabilization using two-hidden-layer nets," IEEE Transactions on Neural Networks, 3, 981-990.

Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
