# comp.ai.neural-nets FAQ, Part 2 of 7: Learning

Section - What are combination, activation, error, and

```
objective functions?
=====================

Most neural networks involve combination, activation, error, and objective
functions.

Combination functions
+++++++++++++++++++++

Each non-input unit in a neural network combines values that are fed into it
via synaptic connections from other units, producing a single value called
the "net input". There is no standard term in the NN literature for the
function that combines values. In this FAQ, it will be called the
"combination function". The combination function is a vector-to-scalar
function. Most NNs use either a linear combination function (as in MLPs) or
a Euclidean distance combination function (as in RBF networks). There is a
detailed discussion of networks using these two kinds of combination
function under "How do MLPs compare with RBFs?"

Activation functions
++++++++++++++++++++

Most units in neural networks transform their net input by using a
scalar-to-scalar function called an "activation function", yielding a value
called the unit's "activation". Except possibly for output units, the
activation value is fed via synaptic connections to one or more other units.
The activation function is sometimes called a "transfer" function.
Activation functions with a bounded range, such as the commonly used tanh
(hyperbolic tangent) and logistic (1/(1+exp(-x))) functions, are often called
"squashing" functions. If a unit does not transform its net input, it is said
to have an "identity" or "linear" activation function. The reason for using
non-identity activation functions is explained under "Why use activation
functions?"

Error functions
+++++++++++++++

Most methods for training supervised networks require a measure of the
discrepancy between the network's output value and the target (desired
output) value. Even unsupervised networks may require such a measure of
discrepancy; see "What does unsupervised learning learn?"

Let:

 o j be an index for cases
 o X or X_j be an input vector
 o W be a collection (vector, matrix, or some more complicated structure)
   of weights and possibly other parameter estimates
 o y or y_j be a target scalar
 o M(X,W) be the output function computed by the network (the letter M
   is used to suggest "mean", "median", or "mode")
 o p or p_j = M(X_j,W) be an output (the letter p is used to suggest
   "predicted value" or "posterior probability")
 o r or r_j = y_j - p_j be a residual
 o Q(y,X,W) be the case-wise error function written to show the
   dependence on the weights explicitly
 o L(y,p) be the case-wise error function in simpler form where the
   weights are implicit (the letter L is used to suggest "loss" function)
 o D be a list of indices designating a data set, including inputs and
   target values
 o DL designate the training (learning) set
 o DV designate the validation set
 o DT designate the test set
 o #(D) be the number of elements (cases) in D
 o NL be the number of cases in the training (learning) set
 o NV be the number of cases in the validation set
 o NT be the number of cases in the test set
 o TQ(D,W) be the total error function
 o AQ(D,W) be the average error function

The difference between the target and output values for case j, r_j =
y_j - p_j, is called the "residual" or "error". This is NOT the
"error function"! Note that the residual can be either positive or negative,
and negative residuals with large absolute values are typically considered
just as bad as large positive residuals. Error functions, on the other hand,
are defined so that bigger is worse.

Usually, an error function Q(y,X,W) is applied to each case and is
defined in terms of the target and output values: Q(y,X,W) =
L(y,M(X,W)) = L(y,p). Error functions are also called "loss"
functions, especially when the two usages of the term "error" would sound
silly when used together. For example, instead of the awkward phrase
"squared-error error", you would typically use "squared-error loss" to mean
an error function equal to the squared residual, L(y,p) = (y -
p)^2. Another common error function is the classification loss for a
binary target y in {0, 1}:

   L(y,p) = 0  if |y-p| < 0.5
            1  otherwise

The error function for an entire data set is usually defined as the sum of
the case-wise error functions for all the cases in a data set:

   TQ(D,W) =   sum    Q(y_j,X_j,W)
              j in D

Thus, for squared-error loss, the total error is the sum of squared errors
(i.e., residuals), abbreviated SSE. For classification loss, the total error
is the number of misclassified cases.

It is often more convenient to work with the average (i.e., arithmetic mean)
error:

AQ(D,W) = TQ(D,W)/#(D)

For squared-error loss, the average error is the mean or average of squared
errors (i.e., residuals), abbreviated MSE or ASE (statisticians have a
slightly different meaning for MSE in linear models,
TQ(D,W)/[#(D)-#(W)], where #(W) is the number of weights). For
classification loss, the average error is the proportion of misclassified
cases. The average error is also called the "empirical risk."

Using the average error instead of the total error is especially convenient
when using batch backprop-type training methods where the user must supply a
learning rate to multiply by the negative gradient to compute the change in
the weights. If you use the gradient of the average error, the choice of
learning rate will be relatively insensitive to the number of training
cases. But if you use the gradient of the total error, you must use smaller
learning rates for larger training sets. For example, consider any training
set DL_1 and a second training set DL_2 created by duplicating every
case in DL_1. For any set of weights, DL_1 and DL_2 have the same
average error, but the total error of DL_2 is twice that of DL_1. Hence
the gradient of the total error of DL_2 is twice the gradient for DL_1.
So if you use the gradient of the total error, the learning rate for DL_2
should be half the learning rate for DL_1. But if you use the gradient of
the average error, you can use the same learning rate for both training
sets, and you will get exactly the same results from batch training.

The term "error function" is commonly used to mean any of the functions
Q(y,X,W), L(y,p), TQ(D,W), or AQ(D,W). You can usually tell
from the context which function is the intended meaning. The term "error
surface" refers to TQ(D,W) or AQ(D,W) as a function of W.

Objective functions
+++++++++++++++++++

The objective function is what you directly try to minimize during training.

Neural network training is often performed by trying to minimize the total
error TQ(DL,W) or, equivalently, the average error AQ(DL,W) for the
training set, as a function of the weights W. However, as discussed in Part
3 of the FAQ, minimizing training error can lead to overfitting and poor
generalization if the number of training cases is small relative to the
complexity of the network. A common approach to improving generalization is
regularization, i.e., trying to minimize an objective function that is the
sum of the total error function and a regularization function. The
regularization function is a function of the weights W or of the output
function M(X,W). For example, in weight decay, the regularization
function is the sum of squared weights. A crude form of Bayesian learning
can be done using a regularization function that is the negative log of the
prior density of the weights (weight decay is a special case of this). For
more information on regularization, see Part 3 of the FAQ.

If no regularization function is used, the objective function is equal to
the total or average error function (or perhaps some other monotone function
thereof).

```
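
The combination and activation functions described above can be illustrated with a minimal Python sketch. It is not taken from the FAQ: the function names, the optional bias term, the RBF "center" argument, and the example numbers are all assumptions chosen for illustration.

```python
import math

def linear_combination(inputs, weights, bias=0.0):
    """Linear combination function (as in MLPs).

    The bias term is an assumption of this sketch; the FAQ text only
    says "linear combination function".
    """
    return bias + sum(w * x for w, x in zip(weights, inputs))

def euclidean_combination(inputs, center):
    """Euclidean distance combination function (as in RBF networks).

    `center` is a hypothetical name for the unit's weight vector.
    """
    return math.sqrt(sum((x - c) ** 2 for x, c in zip(inputs, center)))

def logistic(net):
    """Logistic squashing function 1/(1+exp(-x)), with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def identity(net):
    """Identity ("linear") activation: the net input passes through unchanged."""
    return net

# Illustrative values only.
x = [0.5, -1.0, 2.0]
w = [0.1, 0.4, -0.3]

# Vector-to-scalar step: the net input.
net = linear_combination(x, w, bias=0.2)           # approximately -0.75
dist = euclidean_combination([3.0, 4.0], [0.0, 0.0])

# Scalar-to-scalar step: the unit's activation.
act_tanh = math.tanh(net)        # bounded in (-1, 1)
act_logistic = logistic(net)     # bounded in (0, 1)
act_identity = identity(net)     # unbounded
```

Each combination function maps a vector of incoming values to one scalar (the net input), and each activation function then maps that scalar to another scalar (the activation).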

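The error and objective functions above, and the duplicated-training-set argument about learning rates, can be checked numerically with a small Python sketch. The one-weight model M(x,w) = w*x, the data values, the function names, and the decay constant are assumptions made for illustration and are not part of the FAQ.

```python
def squared_error(y, p):
    """Case-wise squared-error loss L(y,p) = (y - p)^2."""
    return (y - p) ** 2

def classification_loss(y, p):
    """Case-wise classification loss for a binary target y in {0, 1}."""
    return 0 if abs(y - p) < 0.5 else 1

def total_error(loss, targets, outputs):
    """TQ: sum of the case-wise losses over the data set."""
    return sum(loss(y, p) for y, p in zip(targets, outputs))

def average_error(loss, targets, outputs):
    """AQ: total error divided by the number of cases."""
    return total_error(loss, targets, outputs) / len(targets)

def grad_total_sse(w, xs, ys):
    """d TQ / d w for the hypothetical model M(x,w) = w*x with squared error."""
    return sum(-2.0 * x * (y - w * x) for x, y in zip(xs, ys))

def grad_average_sse(w, xs, ys):
    """d AQ / d w for the same model."""
    return grad_total_sse(w, xs, ys) / len(xs)

# Training set DL_1 and DL_2, where DL_2 duplicates every case of DL_1.
xs1, ys1 = [1.0, 2.0, 3.0], [1.0, 3.0, 2.0]
xs2, ys2 = xs1 * 2, ys1 * 2
w = 0.5

outputs1 = [w * x for x in xs1]
sse = total_error(squared_error, ys1, outputs1)      # 4.5
mse = average_error(squared_error, ys1, outputs1)    # 1.5

# Total-error gradient doubles; average-error gradient is unchanged.
g_total_1 = grad_total_sse(w, xs1, ys1)      # -12.0
g_total_2 = grad_total_sse(w, xs2, ys2)      # -24.0
g_avg_1 = grad_average_sse(w, xs1, ys1)      # -4.0
g_avg_2 = grad_average_sse(w, xs2, ys2)      # -4.0

# Classification loss: total = number misclassified, average = proportion.
targets = [0, 1, 1, 0]
probs = [0.2, 0.7, 0.4, 0.9]
misclassified = total_error(classification_loss, targets, probs)   # 2
error_rate = average_error(classification_loss, targets, probs)    # 0.5

# Objective = total error + regularization (weight decay).
# The decay constant is an assumption of this sketch.
decay = 0.1
objective = sse + decay * sum(wi ** 2 for wi in [w])
```

With the total-error gradient, the step for DL_2 is twice the step for DL_1 at the same learning rate, so the learning rate would have to be halved to get the same weight change. With the average-error gradient, the two steps are identical.
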
Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM