Top Document: comp.ai.neuralnets FAQ, Part 2 of 7: Learning
Previous Document: Why not code binary inputs as 0 and 1?
Next Document: Why use activation functions?

Sigmoid hidden and output units usually use a "bias" or "threshold" term in computing the net input to the unit. For a linear output unit, a bias term is equivalent to an intercept in a regression model.

A bias term can be treated as a connection weight from a special unit with a constant, nonzero activation value. The term "bias" is usually used with respect to a "bias unit" with a constant value of one. The term "threshold" is usually used with respect to a unit with a constant value of negative one. Not all authors follow this distinction. Regardless of the terminology, whether biases or thresholds are added or subtracted has no effect on the performance of the network.

The single bias unit is connected to every hidden or output unit that needs a bias term. Hence the bias terms can be learned just like other weights.

Consider a multilayer perceptron with any of the usual sigmoid activation functions. Choose any hidden unit or output unit. Let's say there are N inputs to that unit, which define an N-dimensional space. The given unit draws a hyperplane through that space, producing an "on" output on one side and an "off" output on the other. (With sigmoid units the plane will not be sharp -- there will be some gray area of intermediate values near the separating plane -- but ignore this for now.)

The weights determine where this hyperplane lies in the input space. Without a bias term, this separating hyperplane is constrained to pass through the origin of the space defined by the inputs. For some problems that's OK, but in many problems the hyperplane would be much more useful somewhere else.
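As a rough illustration of the hyperplane argument, here is a minimal Python sketch. It is not part of the original FAQ: the function names, the two-input unit, and the weight values are all made up for the example. It treats the bias as a weight on a unit with constant activation, and shows that without a bias the output at the origin is stuck at 0.5 whatever the weights are.

```python
import math

def sigmoid(z):
    """Logistic activation, one of the usual sigmoid functions."""
    return 1.0 / (1.0 + math.exp(-z))

def unit_output(weights, inputs, bias_weight=0.0, bias_activation=1.0):
    """Sigmoid unit whose bias is a weight on a constant-activation unit.

    bias_activation=1.0 is the "bias" convention; -1.0 is the "threshold"
    convention. bias_weight=0.0 means the unit has no bias term.
    (Hypothetical helper written for this example.)
    """
    net = sum(w * x for w, x in zip(weights, inputs))
    net += bias_weight * bias_activation
    return sigmoid(net)

weights = [2.0, -3.0]   # hypothetical weights for a 2-input unit
origin = [0.0, 0.0]

# Without a bias the net input at the origin is always 0, so the output
# there is 0.5: the separating hyperplane passes through the origin.
no_bias_at_origin = unit_output(weights, origin)

# With a bias weight b, the boundary w.x + b = 0 moves away from the origin.
b = 4.0
with_bias_at_origin = unit_output(weights, origin, bias_weight=b)
on_boundary = unit_output(weights, [1.0, 2.0], bias_weight=b)  # 2 - 6 + 4 = 0

# A threshold of -b on a unit with constant activation -1 is the same model.
as_threshold = unit_output(weights, origin, bias_weight=-b, bias_activation=-1.0)

print(no_bias_at_origin, with_bias_at_origin, on_boundary, as_threshold)
```

With the bias weight included, the point (1, 2) lies on the separating line while the origin falls on the "on" side. The last call gives the same output as the bias version, which is the sense in which adding a bias or subtracting a threshold makes no difference to the network.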
If you have many units in a layer, they share the same input space and without bias they would ALL be constrained to pass through the origin.

The "universal approximation" property of multilayer perceptrons with most commonly-used hidden-layer activation functions does not hold if you omit the bias terms. But Hornik (1993) shows that a sufficient condition for the universal approximation property without biases is that no derivative of the activation function vanishes at the origin, which implies that with the usual sigmoid activation functions, a fixed nonzero bias term can be used instead of a trainable bias.

Typically, every hidden and output unit has its own bias term. The main exception to this is when the activations of two or more units in one layer always sum to a nonzero constant. For example, you might scale the inputs to sum to one (see "Should I standardize the input cases?"), or you might use a normalized RBF function in the hidden layer (see "How do MLPs compare with RBFs?"). If there do exist units in one layer whose activations sum to a nonzero constant, then any subsequent layer does not need bias terms if it receives connections from the units that sum to a constant, since using bias terms in the subsequent layer would create linear dependencies.

If you have a large number of hidden units, it may happen that one or more hidden units "saturate" as a result of having large incoming weights, producing a constant activation. If this happens, then the saturated hidden units act like bias units, and the output bias terms are redundant. However, you should not rely on this phenomenon to avoid using output biases, since networks without output biases are usually ill-conditioned (see ftp://ftp.sas.com/pub/neural/illcond/illcond.html) and harder to train than networks that use output biases.

Regarding bias-like terms in RBF networks, see "How do MLPs compare with RBFs?"

Reference: Hornik, K.
(1993), "Some new results on neural network approximation," Neural Networks, 6, 1069-1072.

User Contributions:

1. Andy, Apr 24, 2015 @ 7:19 pm
Why is it generally a good idea to omit the biases from the penalty term for weight decay?

Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
