comp.ai.neural-nets FAQ, Part 2 of 7: Learning

What are combination, activation, error, and objective functions?
=================================================================

Most neural networks involve combination, activation, error, and objective
functions.

Combination functions
+++++++++++++++++++++

Each non-input unit in a neural network combines values that are fed into it
via synaptic connections from other units, producing a single value called
the "net input". There is no standard term in the NN literature for the
function that combines values. In this FAQ, it will be called the
"combination function". The combination function is a vector-to-scalar
function. Most NNs use either a linear combination function (as in MLPs) or
a Euclidean distance combination function (as in RBF networks). There is a
detailed discussion of networks using these two kinds of combination
function under "How do MLPs compare with RBFs?"

Activation functions
++++++++++++++++++++

Most units in neural networks transform their net input by using a
scalar-to-scalar function called an "activation function", yielding a value
called the unit's "activation". Except possibly for output units, the
activation value is fed via synaptic connections to one or more other
units. The activation function is sometimes called a "transfer function",
and activation functions with a bounded range are often called "squashing"
functions, such as the commonly used tanh (hyperbolic tangent) and logistic
(1/(1+exp(-x))) functions. If a unit does not transform its net input, it
is said to have an "identity" or "linear" activation function. The reason
for using non-identity activation functions is explained under "Why use
activation functions?"
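The two kinds of combination function and a squashing activation can be
sketched in a few lines. This is a minimal illustration, not code from the
FAQ; the function names and the example weights are made up here.

```python
import math

def linear_combination(inputs, weights, bias):
    """Net input of an MLP-style unit: weighted sum of the inputs plus a bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def euclidean_combination(inputs, center):
    """Net input of an RBF-style unit: Euclidean distance to the unit's center."""
    return math.sqrt(sum((x - c) ** 2 for x, c in zip(inputs, center)))

def logistic(net):
    """Logistic squashing activation, 1/(1+exp(-net)); bounded range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

# A unit with inputs (1, 2), weights (0.5, -0.25), and bias 0.1:
net = linear_combination([1.0, 2.0], [0.5, -0.25], 0.1)  # net input 0.1
act = math.tanh(net)  # tanh squashes the net input into (-1, 1)
```

With an identity activation, the unit's activation would simply equal
`net`; the squashing functions above bound it instead.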
Error functions
+++++++++++++++

Most methods for training supervised networks require a measure of the
discrepancy between the network's output value and the target (desired
output) value (even unsupervised networks may require such a measure of
discrepancy; see "What does unsupervised learning learn?").

Let:

   o  j be an index for cases
   o  X or X_j be an input vector
   o  W be a collection (vector, matrix, or some more complicated
      structure) of weights and possibly other parameter estimates
   o  y or y_j be a target scalar
   o  M(X,W) be the output function computed by the network (the letter M
      is used to suggest "mean", "median", or "mode")
   o  p or p_j = M(X_j,W) be an output (the letter p is used to suggest
      "predicted value" or "posterior probability")
   o  r or r_j = y_j - p_j be a residual
   o  Q(y,X,W) be the case-wise error function written to show the
      dependence on the weights explicitly
   o  L(y,p) be the case-wise error function in simpler form where the
      weights are implicit (the letter L is used to suggest "loss"
      function)
   o  D be a list of indices designating a data set, including inputs and
      target values
   o  DL designate the training (learning) set
   o  DV designate the validation set
   o  DT designate the test set
   o  #(D) be the number of elements (cases) in D
   o  NL be the number of cases in the training (learning) set
   o  NV be the number of cases in the validation set
   o  NT be the number of cases in the test set
   o  TQ(D,W) be the total error function
   o  AQ(D,W) be the average error function

The difference between the target and output values for case j,
r_j = y_j - p_j, is called the "residual" or "error". This is NOT the
"error function"! Note that the residual can be either positive or
negative, and negative residuals with large absolute values are typically
considered just as bad as large positive residuals. Error functions, on the
other hand, are defined so that bigger is worse.
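The distinction between the residual and the error function can be made
concrete with a toy model. This is a hedged sketch: M here is a made-up
one-weight linear model, not anything prescribed by the FAQ.

```python
def M(X, W):
    """Toy output function: a linear model with W = (weight, bias)."""
    w, b = W
    return w * X + b

def L(y, p):
    """Case-wise squared-error loss: always nonnegative, bigger is worse."""
    return (y - p) ** 2

W = (2.0, 1.0)
y, X = 0.0, 1.0
p = M(X, W)    # output (predicted value): 3.0
r = y - p      # residual: -3.0, and it CAN be negative...
loss = L(y, p) # ...but the error function gives 9.0, never negative
```

A residual of -3 and a residual of +3 yield the same loss of 9, matching
the point that large negative residuals are just as bad as large positive
ones.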
Usually, an error function Q(y,X,W) is applied to each case and is defined
in terms of the target and output values:

   Q(y,X,W) = L(y,M(X,W)) = L(y,p)

Error functions are also called "loss" functions, especially when the two
usages of the term "error" would sound silly when used together. For
example, instead of the awkward phrase "squared-error error", you would
typically use "squared-error loss" to mean an error function equal to the
squared residual, L(y,p) = (y - p)^2. Another common error function is the
classification loss for a binary target y in {0, 1}:

   L(y,p) = 0 if |y-p| < 0.5
            1 otherwise

The error function for an entire data set is usually defined as the sum of
the case-wise error functions for all the cases in the data set:

   TQ(D,W) =  sum  Q(y_j,X_j,W)
             j in D

Thus, for squared-error loss, the total error is the sum of squared errors
(i.e., residuals), abbreviated SSE. For classification loss, the total
error is the number of misclassified cases.

It is often more convenient to work with the average (i.e., arithmetic
mean) error:

   AQ(D,W) = TQ(D,W)/#(D)

For squared-error loss, the average error is the mean or average of squared
errors (i.e., residuals), abbreviated MSE or ASE (statisticians have a
slightly different meaning for MSE in linear models, TQ(D,W)/[#(D)-#(W)]).
For classification loss, the average error is the proportion of
misclassified cases. The average error is also called the "empirical risk."

Using the average error instead of the total error is especially convenient
when using batch backprop-type training methods where the user must supply
a learning rate to multiply by the negative gradient to compute the change
in the weights. If you use the gradient of the average error, the choice of
learning rate will be relatively insensitive to the number of training
cases. But if you use the gradient of the total error, you must use smaller
learning rates for larger training sets.
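The definitions of TQ and AQ, and the point about insensitivity to
training-set size, can be sketched as follows. The function names and the
tiny data set are illustrative, not from the FAQ.

```python
def squared_error(y, p):
    """Squared-error loss L(y,p) = (y - p)^2."""
    return (y - p) ** 2

def classification_loss(y, p):
    """Classification (0-1) loss for a binary target y in {0, 1}."""
    return 0 if abs(y - p) < 0.5 else 1

def total_error(L, targets, outputs):
    """TQ: sum of the case-wise errors over a data set."""
    return sum(L(y, p) for y, p in zip(targets, outputs))

def average_error(L, targets, outputs):
    """AQ: total error divided by #(D) -- the empirical risk."""
    return total_error(L, targets, outputs) / len(targets)

targets = [1, 0, 1, 1]
outputs = [0.9, 0.2, 0.4, 0.8]

sse = total_error(squared_error, targets, outputs)        # ~0.45
mse = average_error(squared_error, targets, outputs)      # ~0.1125
nmiss = total_error(classification_loss, targets, outputs)  # 1 case

# Duplicating every case doubles the total error but leaves the average
# error unchanged -- which is why a learning rate chosen against the
# average error is insensitive to the number of training cases.
doubled = (targets * 2, outputs * 2)
assert abs(total_error(squared_error, *doubled) - 2 * sse) < 1e-12
assert abs(average_error(squared_error, *doubled) - mse) < 1e-12
```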
For example, consider any training set DL_1 and a second training set DL_2
created by duplicating every case in DL_1. For any set of weights, DL_1 and
DL_2 have the same average error, but the total error of DL_2 is twice that
of DL_1. Hence the gradient of the total error of DL_2 is twice the
gradient for DL_1. So if you use the gradient of the total error, the
learning rate for DL_2 should be half the learning rate for DL_1. But if
you use the gradient of the average error, you can use the same learning
rate for both training sets, and you will get exactly the same results from
batch training.

The term "error function" is commonly used to mean any of the functions
Q(y,X,W), L(y,p), TQ(D,W), or AQ(D,W). You can usually tell from the
context which function is the intended meaning. The term "error surface"
refers to TQ(D,W) or AQ(D,W) as a function of W.

Objective functions
+++++++++++++++++++

The objective function is what you directly try to minimize during
training. Neural network training is often performed by trying to minimize
the total error TQ(DL,W) or, equivalently, the average error AQ(DL,W) for
the training set, as a function of the weights W. However, as discussed in
Part 3 of the FAQ, minimizing training error can lead to overfitting and
poor generalization if the number of training cases is small relative to
the complexity of the network.

A common approach to improving generalization is regularization, i.e.,
trying to minimize an objective function that is the sum of the total error
function and a regularization function. The regularization function is a
function of the weights W or of the output function M(X,W). For example, in
weight decay, the regularization function is the sum of squared weights. A
crude form of Bayesian learning can be done using a regularization function
that is the negative log of the prior density of the weights (weight decay
is a special case of this). For more information on regularization, see
Part 3 of the FAQ.
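A regularized objective of the kind described above can be sketched in a
few lines. This is an illustrative toy, assuming average squared-error as
the error function and a made-up decay constant `lam`; none of these names
come from the FAQ.

```python
def average_squared_error(W, data):
    """AQ(D,W) for a toy linear model p = w*x + b over (x, y) pairs."""
    w, b = W
    return sum((y - (w * x + b)) ** 2 for x, y in data) / len(data)

def weight_decay_penalty(W):
    """Weight-decay regularization function: the sum of squared weights."""
    return sum(v ** 2 for v in W)

def objective(W, data, lam=0.01):
    """Objective function = error function + lam * regularization function."""
    return average_squared_error(W, data) + lam * weight_decay_penalty(W)

DL = [(1.0, 2.0), (2.0, 3.0)]
W = (1.0, 1.0)
# With W = (1, 1) the toy model fits DL exactly, so the error term is 0
# and the objective reduces to the penalty alone: 0.01 * (1 + 1) = 0.02.
```

With `lam = 0` the objective collapses to the plain average error,
matching the unregularized case discussed next.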
If no regularization function is used, the objective function is equal to
the total or average error function (or perhaps some other monotone
function thereof).

Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
