# comp.ai.neural-nets FAQ, Part 2 of 7: Learning. Section - Why use a bias/threshold?


Top Document: comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Previous Document: Why not code binary inputs as 0 and 1?
Next Document: Why use activation functions?

Sigmoid hidden and output units usually use a "bias" or "threshold" term in
computing the net input to the unit. For a linear output unit, a bias term
is equivalent to an intercept in a regression model.

A bias term can be treated as a connection weight from a special unit with a
constant, nonzero activation value. The term "bias" is usually used with
respect to a "bias unit" with a constant value of one. The term "threshold"
is usually used with respect to a unit with a constant value of negative
one. Not all authors follow this distinction. Regardless of the terminology,
whether biases or thresholds are added or subtracted has no effect on the
performance of the network.

The single bias unit is connected to every hidden or output unit that needs
a bias term. Hence the bias terms can be learned just like other weights.
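
As a minimal sketch of the equivalence described above, the following plain Python computes a unit's net input three ways: with an explicitly added bias, with a "bias unit" fixed at one, and with a "threshold unit" fixed at negative one whose connection weight has the opposite sign. The function names and the numeric values are illustrative assumptions, not part of the FAQ.

```python
def net_input_explicit(weights, inputs, bias):
    """Net input with the bias added as a separate term."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def net_input_augmented(weights, inputs, constant):
    """Net input where the last weight connects to a constant-activation unit."""
    augmented_inputs = list(inputs) + [constant]
    return sum(w * x for w, x in zip(weights, augmented_inputs))


weights = [0.5, -1.5]   # illustrative values
inputs = [2.0, 1.0]
bias = 0.75

explicit = net_input_explicit(weights, inputs, bias)

# "Bias" convention: constant unit = +1, connection weight = bias.
as_bias_unit = net_input_augmented(weights + [bias], inputs, 1.0)

# "Threshold" convention: constant unit = -1, connection weight = -bias.
as_threshold_unit = net_input_augmented(weights + [-bias], inputs, -1.0)

print(explicit, as_bias_unit, as_threshold_unit)
```

All three values agree, which is why the sign convention does not affect what the network can compute, and why the bias can be trained as just one more connection weight.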

Consider a multilayer perceptron with any of the usual sigmoid activation
functions. Choose any hidden unit or output unit. Let's say there are N
inputs to that unit, which define an N-dimensional space. The given unit
draws a hyperplane through that space, producing an "on" output on one side
and an "off" output on the other. (With sigmoid units the plane will not be
sharp -- there will be some gray area of intermediate values near the
separating plane -- but ignore this for now.)

The weights determine where this hyperplane lies in the input space. Without
a bias term, this separating hyperplane is constrained to pass through the
origin of the space defined by the inputs. For some problems that's OK, but
in many problems the hyperplane would be much more useful somewhere else. If
you have many units in a layer, they share the same input space, and without
bias terms their hyperplanes would ALL be constrained to pass through the
origin.
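
To make the origin constraint concrete, here is a small sketch assuming a logistic activation and a two-input unit; the weights, the bias, and the target boundary "x1 + x2 = 3" are illustrative assumptions. Without a bias, the unit outputs exactly 0.5 at the origin whatever the weights are, so the origin always lies on the separating hyperplane. With a bias, the boundary can be moved elsewhere.

```python
import math


def logistic(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def unit_output(weights, inputs, bias=0.0):
    """Output of a single logistic unit with an optional bias term."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return logistic(net)


origin = [0.0, 0.0]

# Without a bias, the net input at the origin is 0 for any weights,
# so the output there is always the boundary value 0.5.
weight_choices = ([1.0, 2.0], [-3.0, 0.5], [10.0, -10.0])
outputs_at_origin = [unit_output(w, origin) for w in weight_choices]

# With a bias, the boundary can sit away from the origin.
# Illustrative target: "on" when x1 + x2 > 3.
weights = [4.0, 4.0]
bias = -12.0
on_side = unit_output(weights, [2.0, 2.0], bias)    # x1 + x2 = 4
off_side = unit_output(weights, [1.0, 1.0], bias)   # x1 + x2 = 2
at_origin = unit_output(weights, origin, bias)      # now clearly "off"

print(outputs_at_origin, on_side, off_side, at_origin)
```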

The "universal approximation" property of multilayer perceptrons with most
commonly-used hidden-layer activation functions does not hold if you omit
the bias terms. But Hornik (1993) shows that a sufficient condition for the
universal approximation property without biases is that no derivative of the
activation function vanishes at the origin, which implies that with the
usual sigmoid activation functions, a fixed nonzero bias term can be used
instead of a trainable bias.

Typically, every hidden and output unit has its own bias term. The main
exception to this is when the activations of two or more units in one layer
always sum to a nonzero constant. For example, you might scale the inputs to
sum to one (see Should I standardize the input cases?), or you might use a
normalized RBF function in the hidden layer (see How do MLPs compare with
RBFs?). If there do exist units in one layer whose activations sum to a
nonzero constant, then any subsequent layer does not need bias terms if it
receives connections from the units that sum to a constant, since using bias
terms in the subsequent layer would create linear dependencies.
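
The linear dependency can be seen in a short sketch. Assuming every input case sums to one (the weights, bias, and cases below are illustrative values), a bias b can be folded into the weights, because w.x + b = (w + b).x whenever the components of x sum to one. A separate bias therefore adds nothing that the weights cannot already express.

```python
def net_with_bias(weights, inputs, bias):
    """Net input using weights plus a separate bias term."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def net_without_bias(weights, inputs):
    """Net input using weights only."""
    return sum(w * x for w, x in zip(weights, inputs))


weights = [0.2, -0.7, 1.1]   # illustrative values
bias = 0.4

# Fold the bias into every weight; valid only because each case sums to one.
absorbed_weights = [w + bias for w in weights]

cases = [
    [0.5, 0.25, 0.25],
    [0.1, 0.6, 0.3],
    [1.0, 0.0, 0.0],
]

# Difference between the two formulations, up to floating-point rounding.
differences = [
    abs(net_with_bias(weights, x, bias) - net_without_bias(absorbed_weights, x))
    for x in cases
]

print(differences)
```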

If you have a large number of hidden units, it may happen that one or more
hidden units "saturate" as a result of having large incoming weights,
producing a constant activation. If this happens, then the saturated hidden
units act like bias units, and the output bias terms are redundant. However,
you should not rely on this phenomenon to avoid using output biases, since
networks without output biases are usually ill-conditioned (see
ftp://ftp.sas.com/pub/neural/illcond/illcond.html) and harder to train than
networks that use output biases.
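
As a rough illustration of saturation, the sketch below assumes a logistic hidden unit whose incoming weights, including the weight from its own bias unit, are large, and inputs that lie in [-1, 1]. All values are hypothetical. The unit's activation is then practically constant across cases, so its outgoing weight behaves like a bias for the next layer.

```python
import math


def logistic(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical hidden unit with large incoming weights (illustrative values).
hidden_weights = [40.0, 40.0]
hidden_bias = 100.0   # weight from the hidden unit's own bias unit

# Input cases assumed to lie in [-1, 1] for this sketch.
cases = [[-1.0, -1.0], [0.0, 0.0], [0.3, -0.8], [1.0, 1.0]]

activations = [
    logistic(sum(w * x for w, x in zip(hidden_weights, case)) + hidden_bias)
    for case in cases
]

# The activation barely varies, so the unit is effectively a constant.
spread = max(activations) - min(activations)

print(activations, spread)
```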

Regarding bias-like terms in RBF networks, see "How do MLPs compare with
RBFs?"

Reference:

Hornik, K. (1993), "Some new results on neural network approximation,"
Neural Networks, 6, 1069-1072.

Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM