comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Section - Why use activation functions?

Activation functions for the hidden units are needed to introduce
nonlinearity into the network. Without nonlinearity, hidden units would not
make nets more powerful than just plain perceptrons (which do not have any
hidden units, just input and output units). The reason is that a linear
function of linear functions is again a linear function. However, it is the
nonlinearity (i.e., the capability to represent nonlinear functions) that
makes multilayer networks so powerful. Almost any nonlinear function does
the job, except for polynomials. For backpropagation learning, the
activation function must be differentiable, and it helps if the function is
bounded; the sigmoidal functions such as logistic and tanh and the Gaussian
function are the most common choices. Functions such as tanh or arctan that
produce both positive and negative values tend to yield faster training than
functions that produce only positive values such as logistic, because of
better numerical conditioning (see 
ftp://ftp.sas.com/pub/neural/illcond/illcond.html). 
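
A minimal sketch (in plain NumPy; the weights and inputs are illustrative)
of a few of these activation functions, and of the point that a linear
function of linear functions is again a linear function:

   import numpy as np

   def logistic(x):
       return 1.0 / (1.0 + np.exp(-x))   # bounded in (0, 1), positive only

   def gaussian(x):
       return np.exp(-x * x)             # bounded in (0, 1]

   # np.tanh is bounded in (-1, 1) and produces both signs.

   rng = np.random.default_rng(0)
   x  = rng.normal(size=(5, 3))          # 5 cases, 3 inputs
   W1 = rng.normal(size=(3, 4))          # input-to-hidden weights
   W2 = rng.normal(size=(4, 2))          # hidden-to-output weights

   # Two purely linear layers collapse into a single linear layer:
   assert np.allclose((x @ W1) @ W2, x @ (W1 @ W2))

   # With a nonlinear hidden layer there is no such collapse:
   assert not np.allclose(np.tanh(x @ W1) @ W2, x @ (W1 @ W2))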

For hidden units, sigmoid activation functions are usually preferable to
threshold activation functions. Networks with threshold units are difficult
to train because the error function is stepwise constant, hence the gradient
either does not exist or is zero, making it impossible to use backprop or
more efficient gradient-based training methods. Even for training methods
that do not use gradients--such as simulated annealing and genetic
algorithms--sigmoid units are easier to train than threshold units. With
sigmoid units, a small change in the weights will usually produce a change
in the outputs, which makes it possible to tell whether that change in the
weights is good or bad. With threshold units, a small change in the weights
will often produce no change in the outputs. 
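
A small numerical illustration (hypothetical inputs and weights) of this
point: a small change to one weight moves the sigmoid unit's output, but,
in this example, leaves the threshold unit's output unchanged, so there is
nothing to tell whether the change was good or bad:

   import numpy as np

   def logistic(x):
       return 1.0 / (1.0 + np.exp(-x))

   def threshold(x):
       return (x > 0.0).astype(float)

   x = np.array([0.5, -1.2, 2.0])            # one input case
   w = np.array([0.3, 0.8, -0.5])            # current weights
   w_new = w + np.array([1e-3, 0.0, 0.0])    # small change to one weight

   print(logistic(x @ w_new) - logistic(x @ w))    # small but nonzero
   print(threshold(x @ w_new) - threshold(x @ w))  # exactly zero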

For the output units, you should choose an activation function suited to the
distribution of the target values (see the sketch after this list): 

 o For binary (0/1) targets, the logistic function is an excellent choice
   (Jordan, 1995). 
 o For categorical targets using 1-of-C coding, the softmax activation
   function is the logical extension of the logistic function. 
 o For continuous-valued targets with a bounded range, the logistic and tanh
   functions can be used, provided you either scale the outputs to the range
   of the targets or scale the targets to the range of the output activation
   function ("scaling" means multiplying by and adding appropriate
   constants). 
 o If the target values are positive but have no known upper bound, you can
   use an exponential output activation function, but beware of overflow. 
 o For continuous-valued targets with no known bounds, use the identity or
   "linear" activation function (which amounts to no activation function)
   unless you have a very good reason to do otherwise. 
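
A minimal sketch (plain NumPy; the net inputs are made up for illustration)
of the output activation functions listed above, applied to the same vector
of net inputs:

   import numpy as np

   def logistic(x):                 # binary (0/1) targets
       return 1.0 / (1.0 + np.exp(-x))

   def softmax(x):                  # categorical targets, 1-of-C coding
       e = np.exp(x - np.max(x))    # subtract the max to avoid overflow
       return e / e.sum()

   def exponential(x):              # positive targets, no known upper bound
       return np.exp(x)             # beware of overflow for large net inputs

   def identity(x):                 # continuous targets, no known bounds
       return x

   net_input = np.array([-2.0, 0.5, 3.0])
   print(logistic(net_input))       # each output in (0, 1)
   print(softmax(net_input))        # nonnegative outputs summing to 1
   print(exponential(net_input))    # strictly positive outputs
   print(identity(net_input))       # outputs equal to the net inputs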

There are certain natural associations between output activation functions
and various noise distributions which have been studied by statisticians in
the context of generalized linear models. The output activation function is
the inverse of what statisticians call the "link function". See: 

   McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
   ed., London: Chapman & Hall. 

   Jordan, M.I. (1995), "Why the logistic function? A tutorial discussion on
   probabilities and neural networks", MIT Computational Cognitive Science
   Report 9503, http://www.cs.berkeley.edu/~jordan/papers/uai.ps.Z. 
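
A small numerical check (with made-up probabilities) of this relationship
for the binomial case, where the logistic output activation is the inverse
of the logit link function:

   import numpy as np

   def logit(p):                    # the link function: probability -> real line
       return np.log(p / (1.0 - p))

   def logistic(x):                 # its inverse: real line -> probability
       return 1.0 / (1.0 + np.exp(-x))

   p = np.array([0.1, 0.5, 0.9])
   assert np.allclose(logistic(logit(p)), p)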

For more information on activation functions, see Donald Tveter's 
Backpropagator's Review. 


Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)