
Why the Softmax Function?

tldr; Why is the softmax function commonly used as the last layer of a neural network in a classification problem? The answer is that (1) assuming a generative model for the data and (2) applying Bayes' rule gives us the softmax function as a natural representation for the posterior distribution in a multi-class classification problem.

Why is the softmax function commonly used as the last layer of a neural network in a classification problem?

Some recent evidence [1] suggests that using the softmax as an activation function is helpful for amplifying error gradients. But we think that the more satisfying answer has already been given in a 1995 article written by Michael Jordan [2].

He discussed only the logistic function \(f(t)=\frac{1}{1+e^{-t}}\) in the case of binary classification, but we show that it generalizes to the softmax function \(f(t_i)=\frac{e^{t_i}}{\sum_j e^{t_j}}\) in the case where there are more than two classes.

We summarize relevant points in the article below.

Discriminative vs. Generative Classifiers

Assuming that every datapoint \(x\) has an associated class variable \(\omega\), we can model a classifier in one of two ways.

In a discriminative classifier, we model the conditional probability \(p(\omega|x)\) directly. This is commonly done using a neural network.

In a generative classifier, we instead assume a generative model for the data, specifying both the likelihood \(p(x|\omega)\) and the prior \(p(\omega)\). The posterior \(p(\omega|x)\) is subsequently inferred using Bayes’ rule.

Posterior of a Generative Classifier

We can derive the posterior with Bayes’ rule as follows in the binary case.

\[\begin{eqnarray} p(\omega_0 | x) & = & \frac{p(x|\omega_0) p(\omega_0)}{p(x)} \\ & = & \frac{p(x|\omega_0) p(\omega_0)}{p(x|\omega_0) p(\omega_0) + p(x|\omega_1) p(\omega_1)} \\ & = & \frac{1}{1 + \frac{p(x|\omega_1)}{p(x|\omega_0)} \frac{p(\omega_1)}{p(\omega_0)}} \\ & = & \frac{1}{1 + \exp\left(-\log\left(\frac{p(x|\omega_0)}{p(x|\omega_1)} \frac{p(\omega_0)}{p(\omega_1)}\right)\right)} \\ & = & \frac{1}{1+e^{-f(x)}} \end{eqnarray}\]

And we find that the posterior is a logistic function of the log odds \(f(x) = \log\left(\frac{p(x|\omega_0)\, p(\omega_0)}{p(x|\omega_1)\, p(\omega_1)}\right)\)! This result is a consequence of nothing more than the assumption of a generative model for the data and an application of Bayes' rule.

Note that the prior \(p(\omega_i)\) can be estimated as the proportion of the data belonging to class \(i\).

So long as the class conditional density belongs to the exponential family \(p(x|\omega) = \exp\left(\eta(\omega) T(x) - A(\omega) + B(x)\right)\) with \(T(x)\) linear, the posterior distribution will be a logistic-linear function, i.e. \(f(x)\) is linear in \(x\).
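
To see why, expand \(f(x)\) from the derivation above for two exponential family class conditional densities: the \(B(x)\) terms cancel, and what remains is linear in \(x\) whenever \(T(x)\) is.

\[\begin{eqnarray} f(x) & = & \log\left(\frac{p(x|\omega_0)}{p(x|\omega_1)} \frac{p(\omega_0)}{p(\omega_1)}\right) \\ & = & \left(\eta(\omega_0) - \eta(\omega_1)\right) T(x) - \left(A(\omega_0) - A(\omega_1)\right) + \log\frac{p(\omega_0)}{p(\omega_1)} \end{eqnarray}\]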

Example: \(p(x|\omega)\) is Gaussian

Given unit-variance Gaussian class conditional densities with means at \(+1\) and \(-1\) and equal priors, we find that the posterior distribution is the logistic function \(y = \frac{1}{1 + e^{-2x}}\), which smoothly interpolates between the two classes depending on the distance of \(x\) to the class means.
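
The same calculation can be checked numerically. Below is a minimal sketch (the means at \(\pm 1\), unit variances, and equal priors are the assumptions above, not anything from the article) that computes the posterior by Bayes' rule and compares it to the logistic function.

```python
import numpy as np
from scipy.stats import norm

# Two unit-variance Gaussian class conditionals with means +1 and -1,
# and equal priors p(w0) = p(w1) = 0.5 (illustrative assumptions).
x = np.linspace(-5.0, 5.0, 101)
likelihood_0 = norm.pdf(x, loc=+1.0, scale=1.0)
likelihood_1 = norm.pdf(x, loc=-1.0, scale=1.0)

# Posterior p(w0 | x) computed directly from Bayes' rule.
posterior_bayes = (likelihood_0 * 0.5) / (likelihood_0 * 0.5 + likelihood_1 * 0.5)

# The logistic function 1 / (1 + exp(-2x)) predicted by the derivation.
posterior_logistic = 1.0 / (1.0 + np.exp(-2.0 * x))

assert np.allclose(posterior_bayes, posterior_logistic)
```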

Softmax

Generalizing this to more than two classes is straightforward.

\[\begin{eqnarray} p(\omega_i | x) & = & \frac{p(x|\omega_i) p(\omega_i)}{p(x)} \\ & = & \frac{p(x|\omega_i) p(\omega_i)}{\sum_j p(x|\omega_j) p(\omega_j)} \\ & = & \frac{\exp\left(\log\left(p(x|\omega_i) p(\omega_i)\right)\right)}{\sum_j \exp\left(\log\left(p(x|\omega_j) p(\omega_j)\right)\right)} \\ & = & \frac{e^{f_i (x)}}{\sum_j e^{f_j (x)}} \end{eqnarray}\]

We find that indeed the posterior can be written as a softmax of \(f_i(x) = \log\left(p(x|\omega_i)\, p(\omega_i)\right)\). As long as the class conditional densities are in the exponential family with \(T(x)\) and \(B(x)\) linear, the posterior distribution will be a softmax-linear function.
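
Here is a minimal numerical sketch of the multi-class case (the three Gaussian class conditionals, their means, and the priors are illustrative assumptions): the posterior computed by Bayes' rule matches a softmax over the logits \(f_i(x) = \log\left(p(x|\omega_i)\, p(\omega_i)\right)\).

```python
import numpy as np
from scipy.stats import norm

def softmax(t):
    """Numerically stable softmax."""
    t = t - t.max()
    e = np.exp(t)
    return e / e.sum()

# Three unit-variance Gaussian class conditionals (illustrative means and priors).
means = np.array([-2.0, 0.0, 3.0])
priors = np.array([0.2, 0.5, 0.3])

x = 1.7  # an arbitrary query point

# Posterior via Bayes' rule: p(w_i | x) = p(x|w_i) p(w_i) / sum_j p(x|w_j) p(w_j).
joint = norm.pdf(x, loc=means, scale=1.0) * priors
posterior_bayes = joint / joint.sum()

# The same posterior as a softmax of the logits f_i(x) = log(p(x|w_i) p(w_i)).
logits = norm.logpdf(x, loc=means, scale=1.0) + np.log(priors)
posterior_softmax = softmax(logits)

assert np.allclose(posterior_bayes, posterior_softmax)
```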

Preferring a Discriminative Classifier

A neural network is a discriminative classifier because it directly models \(p(\omega|x)\) without specifying how the data is generated.

Using a discriminative classifier is helpful because:

  1. The processes responsible for generating the data might be very complicated and we do not know how to model them.
  2. The discriminative approach is robust to model mis-specification: the same logistic/softmax form is valid for an entire family of generative models (any exponential family class conditional density with linear \(T(x)\)). If the specified generative model is not a good match for the data, the performance of the generative classifier will suffer.
  3. It is more parameter efficient. In the Gaussian case, the generative classifier is parametrized by \(O(n^2)\) numbers (the covariance matrix and the mean vectors), whereas the discriminative classifier is parametrized by \(O(n)\) numbers (the coefficients of the linear combination); a rough count is sketched below.
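
As a rough parameter count for the binary Gaussian case with \(n\)-dimensional inputs (assuming a single shared covariance matrix):

\[\underbrace{2n}_{\text{two means}} + \underbrace{\tfrac{n(n+1)}{2}}_{\text{shared covariance}} = O(n^2) \qquad \text{vs.} \qquad \underbrace{n}_{\text{weights}} + \underbrace{1}_{\text{bias}} = O(n).\]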

On the other hand, one reason we might prefer a generative classifier is that, if the model is specified correctly, estimating its parameters can be more sample efficient (about \(30\%\) more efficient in the Gaussian case [3]).

Read "On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes" by Andrew Ng and Michael Jordan for a more detailed discussion.

Neural Networks are a Generalization

With exponential family class conditional densities, the decision boundaries between the classes are linear. It is possible to generalize this by specifying another class of generative models for which we find that the posterior gives non-linear decision boundaries.

Instead of doing this, though, a simpler approach is to retain the logistic/softmax function and replace the linear function with a non-linear representation of the input. This is exactly what Neural Networks (and Generalized Additive Models) do.
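
As a minimal sketch of that idea (the layer sizes, activation, and random weights below are illustrative assumptions, not anything from the article), a small network computes non-linear logits and then applies a softmax:

```python
import numpy as np

def softmax(t):
    """Numerically stable softmax over the last axis."""
    t = t - t.max(axis=-1, keepdims=True)
    e = np.exp(t)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 4, 16, 3

# Randomly initialized (untrained) weights for a one-hidden-layer network.
W1 = rng.normal(size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

def predict_proba(x):
    """Discriminative model of p(w | x): non-linear logits followed by a softmax."""
    h = np.tanh(x @ W1 + b1)   # non-linear replacement for the linear f_i(x)
    logits = h @ W2 + b2       # one logit per class
    return softmax(logits)

x = rng.normal(size=(5, n_features))  # a batch of 5 random example inputs
print(predict_proba(x).sum(axis=1))   # each row of probabilities sums to 1
```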

To sum up, the softmax function arises as a natural representation for the posterior distribution in a multi-class classification problem assuming a generative classifier. Using a neural network with a softmax at the end as a discriminative classifier allows us to bypass the need to specify a generative model for the data, is possibly more parameter efficient, and generalizes to non-linear decision boundaries [4].

  1. Be Careful What You Backpropagate: A Case For Linear Output Activations & Gradient Boosting. Anders Øland, Aayush Bansal, Roger B. Dannenberg, Bhiksha Raj. 

  2. Why the logistic function? A tutorial discussion on probabilities and neural networks. Michael I. Jordan. 

  3. The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis. Bradley Efron. 

  4. Note, however, that the existence of adversarial perturbations on a wide variety of classification problems demonstrates that the non-linear decision boundaries learned by neural networks are highly inaccurate.