NOTE

This is part of the 7th homework, titled Deep Learning I, for the course Machine Learning (IN2064) in the Winter Semester 2024/25 at TUM.

Problem 3:

In machine learning you often come across problems which contain the following quantity:

$$\log \sum_{i=1}^{n} \exp(x_i)$$

For example, if we want to calculate the log-likelihood of a neural network with a softmax output, we get this quantity due to the normalisation constant. If you try to calculate it naively, you will quickly encounter underflows or overflows, depending on the scale of the $x_i$. Even though we work in log-space, the limited floating-point precision of computers is not enough, and the result will be $-\infty$ or $\infty$. To combat this issue, we typically use the following identity:

$$\log \sum_{i=1}^{n} \exp(x_i) = a + \log \sum_{i=1}^{n} \exp(x_i - a)$$

for an arbitrary $a \in \mathbb{R}$. This means you can shift the center of the exponential sum. A typical choice is $a = \max_i x_i$, which forces the largest exponent to be zero, so even if the other terms underflow, you still get a reasonable result.

Your task is to show that the identity holds.

Solution:

We pull the constant factor $\exp(a)$ out of the sum:

$$\log \sum_{i=1}^{n} \exp(x_i) = \log \sum_{i=1}^{n} \exp(x_i - a)\,\exp(a) = \log \left( \exp(a) \sum_{i=1}^{n} \exp(x_i - a) \right) = a + \log \sum_{i=1}^{n} \exp(x_i - a)$$

In the first step we used the fact that $\exp(u + v) = \exp(u)\exp(v)$, and in the last step that $\log(uv) = \log u + \log v$ together with $\log \exp(a) = a$. In this manner we have proved that LHS = RHS. Q.E.D.
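For illustration, here is a minimal NumPy sketch of the trick (the function names and the example values are illustrative, not part of the exercise):

```python
import numpy as np

def logsumexp_naive(x):
    # Direct evaluation: exp(x) overflows for large x and underflows for very negative x.
    return np.log(np.sum(np.exp(x)))

def logsumexp_stable(x):
    # Shift by a = max(x) so the largest exponent is exactly 0,
    # which keeps exp(x - a) in a representable range.
    a = np.max(x)
    return a + np.log(np.sum(np.exp(x - a)))

x = np.array([1000.0, 1001.0, 1002.0])
print(logsumexp_naive(x))   # inf (overflow in exp)
print(logsumexp_stable(x))  # ~1002.4076
```

The same idea is what `scipy.special.logsumexp` implements.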

Problem 4:

Similar to the previous exercise, we can compute the output of the softmax function

$$\mathrm{softmax}(\mathbf{x})_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}$$

in a numerically stable way by shifting the inputs by an arbitrary constant $a$:

$$\frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} = \frac{\exp(x_i - a)}{\sum_{j=1}^{n} \exp(x_j - a)}$$

Often, $a = \max_j x_j$. Show that the above identity holds.

Solution:

The proof is rather trivial:

$$\frac{\exp(x_i - a)}{\sum_{j=1}^{n} \exp(x_j - a)} = \frac{\exp(x_i)\exp(-a)}{\sum_{j=1}^{n} \exp(x_j)\exp(-a)} = \frac{\exp(-a)\,\exp(x_i)}{\exp(-a)\sum_{j=1}^{n} \exp(x_j)} = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}$$

Here we used $\exp(u - v) = \exp(u)\exp(-v)$ and then cancelled the common factor $\exp(-a) > 0$ from the numerator and the denominator. Q.E.D.
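Analogously, a minimal NumPy sketch of the shifted softmax (the function names `softmax_naive` and `softmax_stable` are illustrative, not from the exercise sheet):

```python
import numpy as np

def softmax_naive(x):
    # exp(x) overflows for large entries, giving inf / inf = nan.
    e = np.exp(x)
    return e / np.sum(e)

def softmax_stable(x):
    # Subtracting max(x) leaves the output unchanged (see the identity above)
    # but keeps every exponent <= 0, so nothing overflows.
    z = x - np.max(x)
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(x))   # [nan nan nan]
print(softmax_stable(x))  # [0.09003057 0.24472847 0.66524096]
```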