Top Document: comp.ai.neuralnets FAQ, Part 2 of 7: Learning
Previous Document: How to avoid overflow in the logistic function?
Next Document: What is the curse of dimensionality?

If you want the outputs of a network to be interpretable as posterior probabilities for a categorical target variable, it is highly desirable for those outputs to lie between zero and one and to sum to one. The purpose of the softmax activation function is to enforce these constraints on the outputs. Let the net input to each output unit be q_i, i=1,...,c, where c is the number of categories. Then the softmax output p_i is:

$$p_i = \frac{\exp(q_i)}{\sum_{j=1}^{c} \exp(q_j)}$$

Unless you are using weight decay or Bayesian estimation or some such thing that requires the weights to be treated on an equal basis, you can choose any one of the output units and leave it completely unconnected: just set its net input to 0. Connecting all of the output units will just give you redundant weights and will slow down training. To see this, add an arbitrary constant z to each net input and you get:

$$p_i = \frac{\exp(q_i + z)}{\sum_{j=1}^{c} \exp(q_j + z)} = \frac{\exp(q_i)\exp(z)}{\sum_{j=1}^{c} \exp(q_j)\exp(z)} = \frac{\exp(q_i)}{\sum_{j=1}^{c} \exp(q_j)}$$

so nothing changes. Hence you can always pick one of the output units, and add an appropriate constant to each net input to produce any desired net input for the selected output unit, which you can choose to be zero or whatever is convenient. You can use the same trick to make sure that none of the exponentials overflows.

Statisticians usually call softmax a "multiple logistic" function. It reduces to the simple logistic function when there are only two categories. Suppose you choose to set q_2 to 0. Then, with c=2,

$$p_1 = \frac{\exp(q_1)}{\sum_{j=1}^{c} \exp(q_j)} = \frac{\exp(q_1)}{\exp(q_1) + \exp(0)} = \frac{1}{1 + \exp(-q_1)}$$

and p_2, of course, is 1 - p_1.
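The points above can be checked numerically. The following is only a minimal sketch in plain Python, not code from the FAQ: the function names softmax and logistic and the sample net inputs are illustrative assumptions. It uses the largest net input as the constant to subtract, which is one convenient choice of z that keeps every exponent at or below zero.

```python
import math

def softmax(q):
    """Softmax of a list of net inputs q (illustrative helper, not from the FAQ)."""
    # Shift every net input by z = -max(q). The outputs are unchanged by the
    # shift, and every exponent is <= 0, so exp() cannot overflow.
    m = max(q)
    e = [math.exp(x - m) for x in q]
    total = sum(e)
    return [x / total for x in e]

def logistic(x):
    """Simple logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Assumed sample net inputs for c = 3 categories.
q = [2.0, -1.0, 0.5]
p = softmax(q)

# Adding an arbitrary constant z to every net input changes nothing.
shifted = softmax([x + 100.0 for x in q])

# Pinning the last output unit's net input to 0 gives the same outputs.
pinned = softmax([x - q[-1] for x in q])

# Net inputs this large would overflow a naive exp(q_i).
big = softmax([1000.0, 999.0, 998.0])

# Two categories with q_2 = 0: p_1 is the simple logistic function of q_1.
two = softmax([0.7, 0.0])

print(p, sum(p))
print(two[0], logistic(0.7))
```

The outputs lie between zero and one and sum to one, the shifted and pinned versions agree with the original up to rounding error, and the two-category case matches the logistic function.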
The softmax function derives naturally from log-linear models and leads to convenient interpretations of the weights in terms of odds ratios. You could, however, use a variety of other nonnegative functions on the real line in place of the exp function. Or you could constrain the net inputs to the output units to be nonnegative, and just divide by the sum; that's called the Bradley-Terry-Luce model.

The softmax function is also used in the hidden layer of normalized radial-basis-function networks; see "How do MLPs compare with RBFs?"

References:

Bridle, J.S. (1990a). Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. In: F. Fogelman Soulié and J. Hérault (eds.), Neurocomputing: Algorithms, Architectures and Applications, Berlin: Springer-Verlag, pp. 227-236.

Bridle, J.S. (1990b). Training Stochastic Model Recognition Algorithms as Networks can lead to Maximum Mutual Information Estimation of Parameters. In: D.S. Touretzky (ed.), Advances in Neural Information Processing Systems 2, San Mateo: Morgan Kaufmann, pp. 211-217.

McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models, 2nd ed., London: Chapman & Hall. See Chapter 5.

Send corrections/additions to the FAQ Maintainer: saswss@unx.sas.com (Warren Sarle)
Last Update March 27 2014 @ 02:11 PM
