comp.ai.neural-nets FAQ, Part 2 of 7: LearningSection - What is a softmax activation function?

Top Document: comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Previous Document: How to avoid overflow in the logistic function?
Next Document: What is the curse of dimensionality?

See reader questions & answers on this topic! - Help others by sharing your knowledge


If you want the outputs of a network to be interpretable as posterior
probabilities for a categorical target variable, it is highly desirable for
those outputs to lie between zero and one and to sum to one. The purpose of
the softmax activation function is to enforce these constraints on the
outputs. Let the net input to each output unit be q_i, i=1,...,c,
where c is the number of categories. Then the softmax output p_i is: 

           exp(q_i)
   p_i = ------------
          c
         sum exp(q_j)
         j=1

Unless you are using weight decay or Bayesian estimation or some such thing
that requires the weights to be treated on an equal basis, you can choose
any one of the output units and leave it completely unconnected--just set
the net input to 0. Connecting all of the output units will just give you
redundant weights and will slow down training. To see this, add an arbitrary
constant z to each net input and you get: 

           exp(q_i+z)       exp(q_i) exp(z)       exp(q_i)    
   p_i = ------------   = ------------------- = ------------   
          c                c                     c            
         sum exp(q_j+z)   sum exp(q_j) exp(z)   sum exp(q_j)  
         j=1              j=1                   j=1

so nothing changes. Hence you can always pick one of the output units, and
add an appropriate constant to each net input to produce any desired net
input for the selected output unit, which you can choose to be zero or
whatever is convenient. You can use the same trick to make sure that none of
the exponentials overflows. 

Statisticians usually call softmax a "multiple logistic" function. It
reduces to the simple logistic function when there are only two categories.
Suppose you choose to set q_2 to 0. Then 

           exp(q_1)         exp(q_1)              1
   p_1 = ------------ = ----------------- = -------------
          c             exp(q_1) + exp(0)   1 + exp(-q_1)
         sum exp(q_j)
         j=1

and p_2, of course, is 1-p_1. 

The softmax function derives naturally from log-linear models and leads to
convenient interpretations of the weights in terms of odds ratios. You
could, however, use a variety of other nonnegative functions on the real
line in place of the exp function. Or you could constrain the net inputs to
the output units to be nonnegative, and just divide by the sum--that's
called the Bradley-Terry-Luce model. 

The softmax function is also used in the hidden layer of normalized
radial-basis-function networks; see "How do MLPs compare with RBFs?" 

References: 

   Bridle, J.S. (1990a). Probabilistic Interpretation of Feedforward
   Classification Network Outputs, with Relationships to Statistical Pattern
   Recognition. In: F.Fogleman Soulie and J.Herault (eds.), Neurocomputing:
   Algorithms, Architectures and Applications, Berlin: Springer-Verlag, pp.
   227-236. 

   Bridle, J.S. (1990b). Training Stochastic Model Recognition Algorithms as
   Networks can lead to Maximum Mutual Information Estimation of Parameters.
   In: D.S.Touretzky (ed.), Advances in Neural Information Processing
   Systems 2, San Mateo: Morgan Kaufmann, pp. 211-217. 

   McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
   ed., London: Chapman & Hall. See Chapter 5.

User Contributions:

Majid Maqbool

⚠

Sep 27, 2024 @ 5:05 am

https://techpassion.co.uk/how-does-a-smart-tv-work-read-complete-details/
PDP++ is a neural-network simulation system written in C++, developed as an advanced version of the original PDP software from McClelland and Rumelhart's "Explorations in Parallel Distributed Processing Handbook" (1987). The software is designed for both novice users and researchers, providing flexibility and power in cognitive neuroscience studies. Featured in Randall C. O'Reilly and Yuko Munakata's "Computational Explorations in Cognitive Neuroscience" (2000), PDP++ supports a wide range of algorithms. These include feedforward and recurrent error backpropagation, with continuous and real-time models such as Almeida-Pineda. It also incorporates constraint satisfaction algorithms like Boltzmann Machines, Hopfield networks, and mean-field networks, as well as self-organizing learning algorithms, including Self-organizing Maps (SOM) and Hebbian learning. Additionally, it supports mixtures-of-experts models and the Leabra algorithm, which combines error-driven and Hebbian learning with k-Winners-Take-All inhibitory competition. PDP++ is a comprehensive tool for exploring neural network models in cognitive neuroscience.

Comment about this article, ask questions, or add new information about this topic:

Archived related questions and answers

Top Document: comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Previous Document: How to avoid overflow in the logistic function?
Next Document: What is the curse of dimensionality?

Part1 - Part2 - Part3 - Part4 - Part5 - Part6 - Part7 - Single Page

[ Usenet FAQs | Web FAQs | Documents | RFC Index ]

Send corrections/additions to the FAQ Maintainer:
saswss@unx.sas.com (Warren Sarle)

Last Update March 27 2014 @ 02:11 PM

comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Section - What is a softmax activation function?

Search the FAQ Archives

comp.ai.neural-nets FAQ, Part 2 of 7: Learning
Section - What is a softmax activation function?

User Contributions:

Comment about this article, ask questions, or add new information about this topic: