## Search the FAQ Archives

3 - A - B - C - D - E - F - G - H - I - J - K - L - M
N - O - P - Q - R - S - T - U - V - W - X - Y - Z

# comp.ai.neural-nets FAQ, Part 2 of 7: LearningSection - How do MLPs compare with RBFs?

( Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page )
[ Usenet FAQs | Web FAQs | Documents | RFC Index | Property taxes ]

Top Document: comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Previous Document: What is the curse of dimensionality?
Next Document: What are OLS and subset/stepwise regression?
```
Multilayer perceptrons (MLPs) and radial basis function (RBF) networks are
the two most commonly-used types of feedforward network. They have much more
in common than most of the NN literature would suggest. The only fundamental
difference is the way in which hidden units combine values coming from
preceding layers in the network--MLPs use inner products, while RBFs use
Euclidean distance. There are also differences in the customary methods for
training MLPs and RBF networks, although most methods for training MLPs can
also be applied to RBF networks. Furthermore, there are crucial differences
between two broad types of RBF network--ordinary RBF networks and normalized
RBF networks--that are ignored in most of the NN literature. These
differences have important consequences for the generalization ability of
the networks, especially when the number of inputs is large.

Notation:

a_j     is the altitude or height of the jth hidden unit
b_j     is the bias of the jth hidden unit
f       is the fan-in of the jth hidden unit
h_j     is the activation of the jth hidden unit
s       is a common width shared by all hidden units in the layer
s_j     is the width of the jth hidden unit
w_ij    is the weight connecting the ith input to
the jth hidden unit
w_i     is the common weight for the ith input shared by
all hidden units in the layer
x_i     is the ith input

The inputs to each hidden or output unit must be combined with the weights
to yield a single value called the "net input" to which the activation
function is applied. There does not seem to be a standard term for the
function that combines the inputs and weights; I will use the term
"combination function". Thus, each hidden or output unit in a feedforward
network first computes a combination function to produce the net input, and
then applies an activation function to the net input yielding the activation
of the unit.

A multilayer perceptron has one or more hidden layers for which the
combination function is the inner product of the inputs and weights, plus a
bias. The activation function is usually a logistic or tanh function.
Hence the formula for the activation is typically:

h_j = tanh( b_j + sum[w_ij*x_i] )

The MLP architecture is the most popular one in practical applications. Each
layer uses a linear combination function. The inputs are fully connected to
the first hidden layer, each hidden layer is fully connected to the next,
and the last hidden layer is fully connected to the outputs. You can also
have "skip-layer" connections; direct connections from inputs to outputs are
especially useful.

Consider the multidimensional space of inputs to a given hidden unit. Since
an MLP uses linear combination functions, the set of all points in the space
having a given value of the activation function is a hyperplane. The
hyperplanes corresponding to different activation levels are parallel to
each other (the hyperplanes for different units are not parallel in
general). These parallel hyperplanes are the isoactivation contours of the
hidden unit.

Radial basis function (RBF) networks usually have only one hidden layer for
which the combination function is based on the Euclidean distance between
the input vector and the weight vector. RBF networks do not have anything
that's exactly the same as the bias term in an MLP. But some types of RBFs
have a "width" associated with each hidden unit or with the the entire
hidden layer; instead of adding it in the combination function like a bias,
you divide the Euclidean distance by the width.

To see the similarity between RBF networks and MLPs, it is convenient to
treat the combination function as the square of distance/width. Then the
familiar exp or softmax activation functions produce members of the
popular class of Gaussian RBF networks. It can also be useful to add another
term to the combination function that determines what I will call the
"altitude" of the unit. The altitude is the maximum height of the Gaussian
curve above the horizontal axis. I have not seen altitudes used in the NN
literature; if you know of a reference, please tell me (saswss@unx.sas.com).

The output activation function in RBF networks is usually the identity. The
identity output activation function is a computational convenience in
training (see Hybrid training and the curse of dimensionality) but it is
possible and often desirable to use other output activation functions just
as you would in an MLP.

There are many types of radial basis functions. Gaussian RBFs seem to be the
most popular by far in the NN literature. In the statistical literature,
thin plate splines are also used (Green and Silverman 1994). This FAQ will
concentrate on Gaussian RBFs.

There are two distinct types of Gaussian RBF architectures. The first type
uses the exp activation function, so the activation of the unit is a
Gaussian "bump" as a function of the inputs. There seems to be no specific
term for this type of Gaussian RBF network; I will use the term "ordinary
RBF", or ORBF, network.

The second type of Gaussian RBF architecture uses the softmax activation
function, so the activations of all the hidden units are normalized to sum
to one. This type of network is often called a "normalized RBF", or NRBF,
network. In a NRBF network, the output units should not have a bias, since
the constant bias term would be linearly dependent on the constant sum of
the hidden units.

While the distinction between these two types of Gaussian RBF architectures
is sometimes mentioned in the NN literature, its importance has rarely been
appreciated except by Tao (1993) and Werntges (1993). Shorten and
Murray-Smith (1996) also compare ordinary and normalized Gaussian RBF
networks.

There are several subtypes of both ORBF and NRBF architectures. Descriptions
and formulas are as follows:

ORBFUN
Ordinary radial basis function (RBF) network with unequal widths
h_j = exp( - s_j^-2 * sum[(w_ij-x_i)^2] )
ORBFEQ
Ordinary radial basis function (RBF) network with equal widths
h_j = exp( - s^-2 * sum[(w_ij-x_i)^2] )
NRBFUN
Normalized RBF network with unequal widths and heights
h_j = softmax(f*log(a_j) - s_j^-2 *
sum[(w_ij-x_i)^2] )
NRBFEV
Normalized RBF network with equal volumes
h_j = softmax( f*log(s_j) - s_j^-2 *
sum[(w_ij-x_i)^2] )
NRBFEH
Normalized RBF network with equal heights (and unequal widths)
h_j = softmax( - s_j^-2 * sum[(w_ij-x_i)^2] )
NRBFEW
Normalized RBF network with equal widths (and unequal heights)
h_j = softmax( f*log(a_j) - s^-2 *
sum[(w_ij-x_i)^2] )
NRBFEQ
Normalized RBF network with equal widths and heights
h_j = softmax( - s^-2 * sum[(w_ij-x_i)^2] )

To illustrate various architectures, an example with two inputs and one
output will be used so that the results can be shown graphically. The
function being learned resembles a landscape with a Gaussian hill and a
logistic plateau as shown in ftp://ftp.sas.com/pub/neural/hillplat.gif.
There are 441 training cases on a regular 21-by-21 grid. The table below
shows the root mean square error (RMSE) for a test data set. The test set
has 1681 cases on a regular 41-by-41 grid over the same domain as the
training set. If you are reading the HTML version of this document via a web
browser, click on any number in the table to see a surface plot of the
corresponding network output (each plot is a gif file, approximately 9K).

The MLP networks in the table have one hidden layer with a tanh activation
function. All of the networks use an identity activation function for the
outputs.

Hill and Plateau Data: RMSE for the Test Set

HUs  MLP   ORBFEQ  ORBFUN  NRBFEQ  NRBFEW  NRBFEV  NRBFEH  NRBFUN

2  0.218   0.247   0.247   0.230   0.230   0.230   0.230   0.230
3  0.192   0.244   0.143   0.218   0.218   0.036   0.012   0.001
4  0.174   0.216   0.096   0.193   0.193   0.036   0.007
5  0.160   0.188   0.083   0.086   0.051   0.003
6  0.123   0.142   0.058   0.053   0.030
7  0.107   0.123   0.051   0.025   0.019
8  0.093   0.105   0.043   0.020   0.008
9  0.084   0.085   0.038   0.017
10  0.077   0.082   0.033   0.016
12  0.059   0.074   0.024   0.005
15  0.042   0.060   0.019
20  0.023   0.046   0.010
30  0.019   0.024
40  0.016   0.022
50  0.010   0.014

The ORBF architectures use radial combination functions and the exp
activation function. Only two of the radial combination functions are useful
with ORBF architectures. For radial combination functions including an
altitude, the altitude would be redundant with the hidden-to-output weights.

Radial combination functions are based on the Euclidean distance between the
vector of inputs to the unit and the vector of corresponding weights. Thus,
the isoactivation contours for ORBF networks are concentric hyperspheres. A
variety of activation functions can be used with the radial combination
function, but the exp activation function, yielding a Gaussian surface, is
the most useful. Radial networks typically have only one hidden layer, but
it can be useful to include a linear layer for dimensionality reduction or
oblique rotation before the RBF layer.

The output of an ORBF network consists of a number of superimposed bumps,
hence the output is quite bumpy unless many hidden units are used. Thus an
ORBF network with only a few hidden units is incapable of fitting a wide
variety of simple, smooth functions, and should rarely be used.

The NRBF architectures also use radial combination functions but the
activation function is softmax, which forces the sum of the activations for
the hidden layer to equal one. Thus, each output unit computes a weighted
average of the hidden-to-output weights, and the output values must lie
within the range of the hidden-to-output weights. Therefore, if the
hidden-to-output weights are within a reasonable range (such as the range of
the target values), you can be sure that the outputs will be within that
same range for all possible inputs, even when the net is extrapolating. No
comparably useful bound exists for the output of an ORBF network.

If you extrapolate far enough in a Gaussian ORBF network with an identity
output activation function, the activation of every hidden unit will
approach zero, hence the extrapolated output of the network will equal the
output bias. If you extrapolate far enough in an NRBF network, one hidden
unit will come to dominate the output. Hence if you want the network to
extrapolate different values in a different directions, an NRBF should be

Radial combination functions incorporating altitudes are useful with NRBF
architectures. The NRBF architectures combine some of the virtues of both
the RBF and MLP architectures, as explained below. However, the
isoactivation contours are considerably more complicated than for ORBF
architectures.

Consider the case of an NRBF network with only two hidden units. If the
hidden units have equal widths, the isoactivation contours are parallel
hyperplanes; in fact, this network is equivalent to an MLP with one logistic
hidden unit. If the hidden units have unequal widths, the isoactivation
contours are concentric hyperspheres; such a network is almost equivalent to
an ORBF network with one Gaussian hidden unit.

If there are more than two hidden units in an NRBF network, the
isoactivation contours have no such simple characterization. If the RBF
widths are very small, the isoactivation contours are approximately
piecewise linear for RBF units with equal widths, and approximately
piecewise spherical for RBF units with unequal widths. The larger the
widths, the smoother the isoactivation contours where the pieces join. As
Shorten and Murray-Smith (1996) point out, the activation is not necessarily
a monotone function of distance from the center when unequal widths are
used.

The NRBFEQ architecture is a smoothed variant of the learning vector
quantization (Kohonen 1988, Ripley 1996) and counterpropagation
(Hecht-Nielsen 1990), architectures. In LVQ and counterprop, the hidden
units are often called "codebook vectors". LVQ amounts to nearest-neighbor
classification on the codebook vectors, while counterprop is
nearest-neighbor regression on the codebook vectors. The NRBFEQ architecture
uses not just the single nearest neighbor, but a weighted average of near
neighbors. As the width of the NRBFEQ functions approaches zero, the weights
approach one for the nearest neighbor and zero for all other codebook
vectors. LVQ and counterprop use ad hoc algorithms of uncertain reliability,
but standard numerical optimization algorithms (not to mention backprop) can
be applied with the NRBFEQ architecture.

In a NRBFEQ architecture, if each observation is taken as an RBF center, and
if the weights are taken to be the target values, the outputs are simply
weighted averages of the target values, and the network is identical to the
well-known Nadaraya-Watson kernel regression estimator, which has been
reinvented at least twice in the neural net literature (see "What is
GRNN?"). A similar NRBFEQ network used for classification is equivalent to
kernel discriminant analysis (see "What is PNN?").

Kernels with variable widths are also used for regression in the statistical
literature. Such kernel estimators correspond to the the NRBFEV
architecture, in which the kernel functions have equal volumes but different
altitudes. In the neural net literature, variable-width kernels appear
always to be of the NRBFEH variety, with equal altitudes but unequal
volumes. The analogy with kernel regression would make the NRBFEV
architecture the obvious choice, but which of the two architectures works
better in practice is an open question.

Hybrid training and the curse of dimensionality
+++++++++++++++++++++++++++++++++++++++++++++++

A comparison of the various architectures must separate training issues from
architectural issues to avoid common sources of confusion. RBF networks are
often trained by "hybrid" methods, in which the hidden weights (centers)
are first obtained by unsupervised learning, after which the output weights
are obtained by supervised learning. Unsupervised methods for choosing the
centers include:

1. Distribute the centers in a regular grid over the input space.
2. Choose a random subset of the training cases to serve as centers.
3. Cluster the training cases based on the input variables, and use the mean
of each cluster as a center.

Various heuristic methods are also available for choosing the RBF widths
(e.g., Moody and Darken 1989; Sarle 1994b). Once the centers and widths are
fixed, the output weights can be learned very efficiently, since the
computation reduces to a linear or generalized linear model. The hybrid
training approach can thus be much faster than the nonlinear optimization
that would be required for supervised training of all of the weights in the
network.

Hybrid training is not often applied to MLPs because no effective methods
are known for unsupervised training of the hidden units (except when there
is only one input).

Hybrid training will usually require more hidden units than supervised
training. Since supervised training optimizes the locations of the centers,
while hybrid training does not, supervised training will provide a better
approximation to the function to be learned for a given number of hidden
units. Thus, the better fit provided by supervised training will often let
you use fewer hidden units for a given accuracy of approximation than you
would need with hybrid training. And if the hidden-to-output weights are
learned by linear least-squares, the fact that hybrid training requires more
hidden units implies that hybrid training will also require more training
cases for the same accuracy of generalization (Tarassenko and Roberts 1994).

The number of hidden units required by hybrid methods becomes an
increasingly serious problem as the number of inputs increases. In fact, the
required number of hidden units tends to increase exponentially with the
number of inputs. This drawback of hybrid methods is discussed by Minsky and
Papert (1969). For example, with method (1) for RBF networks, you would need
at least five elements in the grid along each dimension to detect a moderate
degree of nonlinearity; so if you have Nx inputs, you would need at least
5^Nx hidden units. For methods (2) and (3), the number of hidden units
increases exponentially with the effective dimensionality of the input
distribution. If the inputs are linearly related, the effective
dimensionality is the number of nonnegligible (a deliberately vague term)
eigenvalues of the covariance matrix, so the inputs must be highly
correlated if the effective dimensionality is to be much less than the
number of inputs.

The exponential increase in the number of hidden units required for hybrid
learning is one aspect of the curse of dimensionality. The number of
training cases required also increases exponentially in general. No neural
network architecture--in fact no method of learning or statistical
estimation--can escape the curse of dimensionality in general, hence there
is no practical method of learning general functions in more than a few
dimensions.

Fortunately, in many practical applications of neural networks with a large
number of inputs, most of those inputs are additive, redundant, or
irrelevant, and some architectures can take advantage of these properties to
yield useful results. But escape from the curse of dimensionality requires
fully supervised training as well as special types of data. Supervised
training for RBF networks can be done by "backprop" (see "What is
backprop?") or other optimization methods (see "What are conjugate
gradients, Levenberg-Marquardt, etc.?"), or by subset regression "What are
OLS and subset/stepwise regression?").

+++++++++++++++

An additive model is one in which the output is a sum of linear or nonlinear
transformations of the inputs. If an additive model is appropriate, the
number of weights increases linearly with the number of inputs, so high
dimensionality is not a curse. Various methods of training additive models
are available in the statistical literature (e.g. Hastie and Tibshirani
1990). You can also create a feedforward neural network, called a
Additive models have been proposed in the neural net literature under the
name "topologically distributed encoding" (Geiger 1990).

Projection pursuit regression (PPR) provides both universal approximation
and the ability to avoid the curse of dimensionality for certain common
types of target functions (Friedman and Stuetzle 1981). Like MLPs, PPR
computes the output as a sum of nonlinear transformations of linear
combinations of the inputs. Each term in the sum is analogous to a hidden
unit in an MLP. But unlike MLPs, PPR allows general, smooth nonlinear
transformations rather than a specific nonlinear activation function, and
allows a different transformation for each term. The nonlinear
transformations in PPR are usually estimated by nonparametric regression,
but you can set up a projection pursuit network (PPN), in which each
nonlinear transformation is performed by a subnetwork. If a PPN provides an
adequate fit with few terms, then the curse of dimensionality can be
avoided, and the results may even be interpretable.

If the target function can be accurately approximated by projection pursuit,
then it can also be accurately approximated by an MLP with a single hidden
layer. The disadvantage of the MLP is that there is little hope of
interpretability. An MLP with two or more hidden layers can provide a
parsimonious fit to a wider variety of target functions than can projection
pursuit, but no simple characterization of these functions is known.

Redundant inputs
++++++++++++++++

With proper training, all of the RBF architectures listed above, as well as
MLPs, can process redundant inputs effectively. When there are redundant
inputs, the training cases lie close to some (possibly nonlinear) subspace.
If the same degree of redundancy applies to the test cases, the network need
produce accurate outputs only near the subspace occupied by the data. Adding
redundant inputs has little effect on the effective dimensionality of the
data; hence the curse of dimensionality does not apply, and even hybrid
methods (2) and (3) can be used. However, if the test cases do not follow
the same pattern of redundancy as the training cases, generalization will
require extrapolation and will rarely work well.

Irrelevant inputs
+++++++++++++++++

MLP architectures are good at ignoring irrelevant inputs. MLPs can also
select linear subspaces of reduced dimensionality. Since the first hidden
layer forms linear combinations of the inputs, it confines the networks
attention to the linear subspace spanned by the weight vectors. Hence,
adding irrelevant inputs to the training data does not increase the number
of hidden units required, although it increases the amount of training data
required.

ORBF architectures are not good at ignoring irrelevant inputs. The number of
hidden units required grows exponentially with the number of inputs,
regardless of how many inputs are relevant. This exponential growth is
related to the fact that ORBFs have local receptive fields, meaning that
changing the hidden-to-output weights of a given unit will affect the output
of the network only in a neighborhood of the center of the hidden unit,
where the size of the neighborhood is determined by the width of the hidden
unit. (Of course, if the width of the unit is learned, the receptive field
could grow to cover the entire training set.)

Local receptive fields are often an advantage compared to the distributed
architecture of MLPs, since local units can adapt to local patterns in the
data without having unwanted side effects in other regions. In a distributed
architecture such as an MLP, adapting the network to fit a local pattern in
the data can cause spurious side effects in other parts of the input space.

However, ORBF architectures often must be used with relatively small
neighborhoods, so that several hidden units are required to cover the range
of an input. When there are many nonredundant inputs, the hidden units must
cover the entire input space, and the number of units required is
essentially the same as in the hybrid case (1) where the centers are in a
regular grid; hence the exponential growth in the number of hidden units
with the number of inputs, regardless of whether the inputs are relevant.

You can enable an ORBF architecture to ignore irrelevant inputs by using an
extra, linear hidden layer before the radial hidden layer. This type of
network is sometimes called an "elliptical basis function" network. If the
number of units in the linear hidden layer equals the number of inputs, the
linear hidden layer performs an oblique rotation of the input space that can
suppress irrelevant directions and differentally weight relevant directions
according to their importance. If you think that the presence of irrelevant
inputs is highly likely, you can force a reduction of dimensionality by
using fewer units in the linear hidden layer than the number of inputs.

Note that the linear and radial hidden layers must be connected in series,
not in parallel, to ignore irrelevant inputs. In some applications it is
useful to have linear and radial hidden layers connected in parallel, but in
such cases the radial hidden layer will be sensitive to all inputs.

For even greater flexibility (at the cost of more weights to be learned),
you can have a separate linear hidden layer for each RBF unit, allowing a
different oblique rotation for each RBF unit.

NRBF architectures with equal widths (NRBFEW and NRBFEQ) combine the
advantage of local receptive fields with the ability to ignore irrelevant
inputs. The receptive field of one hidden unit extends from the center in
all directions until it encounters the receptive field of another hidden
unit. It is convenient to think of a "boundary" between the two receptive
fields, defined as the hyperplane where the two units have equal
activations, even though the effect of each unit will extend somewhat beyond
the boundary. The location of the boundary depends on the heights of the
hidden units. If the two units have equal heights, the boundary lies midway
between the two centers. If the units have unequal heights, the boundary is
farther from the higher unit.

If a hidden unit is surrounded by other hidden units, its receptive field is
indeed local, curtailed by the field boundaries with other units. But if a
hidden unit is not completely surrounded, its receptive field can extend
infinitely in certain directions. If there are irrelevant inputs, or more
generally, irrelevant directions that are linear combinations of the inputs,
the centers need only be distributed in a subspace orthogonal to the
irrelevant directions. In this case, the hidden units can have local
receptive fields in relevant directions but infinite receptive fields in
irrelevant directions.

For NRBF architectures allowing unequal widths (NRBFUN, NRBFEV, and NRBFEH),
the boundaries between receptive fields are generally hyperspheres rather
than hyperplanes. In order to ignore irrelevant inputs, such networks must
be trained to have equal widths. Hence, if you think there is a strong
possibility that some of the inputs are irrelevant, it is usually better to
use an architecture with equal widths.

References:

There are few good references on RBF networks. Bishop (1995) gives one of
the better surveys, but also see Tao (1993) and Werntges (1993) for the
importance of normalization. Orr (1996) provides a useful introduction.

Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
Oxford University Press.

Friedman, J.H. and Stuetzle, W. (1981), "Projection pursuit regression,"
J. of the American Statistical Association, 76, 817-823.

Geiger, H. (1990), "Storing and Processing Information in Connectionist
Systems," in Eckmiller, R., ed., Advanced Neural Computers, 271-277,
Amsterdam: North-Holland.

Green, P.J. and Silverman, B.W. (1994), Nonparametric Regression and
Generalized Linear Models: A roughness penalty approach,, London:
Chapman & Hall.

Hastie, T.J. and Tibshirani, R.J. (1990) Generalized Additive Models,
London: Chapman & Hall.

Kohonen, T (1988), "Learning Vector Quantization," Neural Networks, 1
(suppl 1), 303.

Minsky, M.L. and Papert, S.A. (1969), Perceptrons, Cambridge, MA: MIT
Press.

Moody, J. and Darken, C.J. (1989), "Fast learning in networks of
locally-tuned processing units," Neural Computation, 1, 281-294.

Orr, M.J.L. (1996), "Introduction to radial basis function networks,"
http://www.anc.ed.ac.uk/~mjo/papers/intro.ps or
http://www.anc.ed.ac.uk/~mjo/papers/intro.ps.gz

Ripley, B.D. (1996), Pattern Recognition and Neural Networks,
Cambridge: Cambridge University Press.

Sarle, W.S. (1994a), "Neural Networks and Statistical Models," in SAS
Institute Inc., Proceedings of the Nineteenth Annual SAS Users Group
International Conference, Cary, NC: SAS Institute Inc., pp 1538-1550,
ftp://ftp.sas.com/pub/neural/neural1.ps.

Sarle, W.S. (1994b), "Neural Network Implementation in SAS Software," in
SAS Institute Inc., Proceedings of the Nineteenth Annual SAS Users
Group International Conference, Cary, NC: SAS Institute Inc., pp
1551-1573, ftp://ftp.sas.com/pub/neural/neural2.ps.

Shorten, R., and Murray-Smith, R. (1996), "Side effects of normalising
radial basis function networks" International Journal of Neural Systems,
7, 167-179.

Tao, K.M. (1993), "A closer look at the radial basis function (RBF)
networks," Conference Record of The Twenty-Seventh Asilomar
Conference on Signals, Systems and Computers (Singh, A., ed.), vol 1,
401-405, Los Alamitos, CA: IEEE Comput. Soc. Press.

Tarassenko, L. and Roberts, S. (1994), "Supervised and unsupervised
learning in radial basis function classifiers," IEE Proceedings-- Vis.
Image Signal Processing, 141, 210-216.

Werntges, H.W. (1993), "Partitions of unity improve neural function
approximation," Proceedings of the IEEE International Conference on
Neural Networks, San Francisco, CA, vol 2, 914-918.

```

## User Contributions:

Top Document: comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Previous Document: What is the curse of dimensionality?
Next Document: What are OLS and subset/stepwise regression?

Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page

[ Usenet FAQs | Web FAQs | Documents | RFC Index ]

Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM