Top Document: comp.ai.neuralnets FAQ, Part 2 of 7: Learning Previous Document: Why use a bias/threshold? Next Document: How to avoid overflow in the logistic function? See reader questions & answers on this topic!  Help others by sharing your knowledge Activation functions for the hidden units are needed to introduce nonlinearity into the network. Without nonlinearity, hidden units would not make nets more powerful than just plain perceptrons (which do not have any hidden units, just input and output units). The reason is that a linear function of linear functions is again a linear function. However, it is the nonlinearity (i.e, the capability to represent nonlinear functions) that makes multilayer networks so powerful. Almost any nonlinear function does the job, except for polynomials. For backpropagation learning, the activation function must be differentiable, and it helps if the function is bounded; the sigmoidal functions such as logistic and tanh and the Gaussian function are the most common choices. Functions such as tanh or arctan that produce both positive and negative values tend to yield faster training than functions that produce only positive values such as logistic, because of better numerical conditioning (see ftp://ftp.sas.com/pub/neural/illcond/illcond.html). For hidden units, sigmoid activation functions are usually preferable to threshold activation functions. Networks with threshold units are difficult to train because the error function is stepwise constant, hence the gradient either does not exist or is zero, making it impossible to use backprop or more efficient gradientbased training methods. Even for training methods that do not use gradientssuch as simulated annealing and genetic algorithmssigmoid units are easier to train than threshold units. With sigmoid units, a small change in the weights will usually produce a change in the outputs, which makes it possible to tell whether that change in the weights is good or bad. With threshold units, a small change in the weights will often produce no change in the outputs. For the output units, you should choose an activation function suited to the distribution of the target values: o For binary (0/1) targets, the logistic function is an excellent choice (Jordan, 1995). o For categorical targets using 1ofC coding, the softmax activation function is the logical extension of the logistic function. o For continuousvalued targets with a bounded range, the logistic and tanh functions can be used, provided you either scale the outputs to the range of the targets or scale the targets to the range of the output activation function ("scaling" means multiplying by and adding appropriate constants). o If the target values are positive but have no known upper bound, you can use an exponential output activation function, but beware of overflow. o For continuousvalued targets with no known bounds, use the identity or "linear" activation function (which amounts to no activation function) unless you have a very good reason to do otherwise. There are certain natural associations between output activation functions and various noise distributions which have been studied by statisticians in the context of generalized linear models. The output activation function is the inverse of what statisticians call the "link function". See: McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd ed., London: Chapman & Hall. Jordan, M.I. (1995), "Why the logistic function? A tutorial discussion on probabilities and neural networks", MIT Computational Cognitive Science Report 9503, http://www.cs.berkeley.edu/~jordan/papers/uai.ps.Z. For more information on activation functions, see Donald Tveter's Backpropagator's Review. User Contributions:1 Andy Apr 24, 2015 @ 7:19 pm Why is it generally a good idea to omit the biases from the penalty term for weight decay? Comment about this article, ask questions, or add new information about this topic:Top Document: comp.ai.neuralnets FAQ, Part 2 of 7: Learning Previous Document: Why use a bias/threshold? Next Document: How to avoid overflow in the logistic function? Part1  Part2  Part3  Part4  Part5  Part6  Part7  Single Page [ Usenet FAQs  Web FAQs  Documents  RFC Index ] Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
