Entropy and Cross-Entropy

Cross-entropy is often used in machine learning as a loss function. We describe some technical foundations of entropy and cross-entropy in this article.

Entropy

  • $\text{Surprise}(x_i) = \text{log}(\frac{1}{P(x_i)})$
  • Surprise has an inverse relationship to Probability
  • e.g. when probability $P(x_i) = 1$, Surprise $= 0$; when $P(x_i) \to 0$, Surprise $\to \infty$
  • Given a sequence $X = [x_0, \ldots, x_n]$ of independent outcomes, $\text{Surprise}(X) = \sum_{i} \text{Surprise}(x_i)$ (see the sketch after this list)
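
A minimal Python sketch of the definitions above. The helper name `surprise` is just illustrative, and the logs are natural logs, so the values are in nats:

```python
import math

def surprise(p: float) -> float:
    """Surprise (self-information) of an outcome with probability p, in nats."""
    return math.log(1.0 / p)

# Surprise is inversely related to probability.
print(surprise(1.0))    # 0.0     -- a certain event carries no surprise
print(surprise(0.5))    # ~0.693
print(surprise(0.01))   # ~4.605  -- rare events are very surprising

# For a sequence of independent outcomes, the surprises add up.
sequence_probs = [0.5, 0.25, 0.9]
print(sum(surprise(p) for p in sequence_probs))
```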

  • Entropy is the expected value of the Surprise (i.e. the average Surprise that we could expect). Entropy is a function of a single distribution $P$ and represents the expected amount of information in an event sampled from $P$. Hence for a variable with $C$ classes \(\text{Entropy} = \sum_{k=1}^{C} \text{Surprise}(x_k) \, \text{P}(x_k) \\= \sum_{k=1}^{C} \text{log} \frac{1}{P(x_k)} \, \text{P}(x_k) = \sum_{k=1}^{C} \text{P}(x_k) [\text{log}(1) - \text{log} P(x_k)] \\= \sum_{k=1}^{C} \text{P}(x_k) [0 - \text{log} P(x_k)] = - \sum_{k=1}^{C} \text{P}(x_k) \text{ log} P(x_k)\) (see the sketch below)
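
A small sketch of the entropy formula above, again using natural logs and an illustrative helper name `entropy`:

```python
import math

def entropy(probs: list[float]) -> float:
    """Entropy of a discrete distribution: the probability-weighted average Surprise."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A uniform distribution has maximal entropy (every outcome is equally surprising) ...
print(entropy([0.25, 0.25, 0.25, 0.25]))   # log(4) ~= 1.386
# ... while a peaked distribution has low entropy (outcomes are predictable).
print(entropy([0.97, 0.01, 0.01, 0.01]))   # ~0.168
```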

Cross-Entropy

  • Cross Entropy is a function of two distributions $P$ and $Q$. The underlying data is drawn from a data-generating distribution $P$, but messages are encoded via the incorrect distribution $Q$. Then cross entropy is the expected length of a message encoded according to $Q$ but drawn according to $P$. If $Q$ exactly matches $P$, the cross entropy equals the entropy of $P$, which is its minimum possible value (the gap between the two is the KL divergence). So in machine learning, we want to minimize cross entropy, which drives $Q$ toward $P$.
  • For instance, assume we are modeling a problem with $C$ different classes. Let distribution $P$ be the gold/observed distribution, and let $Q$ represent the model's predicted distribution. Then the cross entropy for a sample/example $x$ is: $- \sum_{c \in C} P_c(x) \times \text{log}(Q_c(x))$, where $P_c(x)$ and $Q_c(x)$ are the probabilities each distribution assigns to class $c$.
    • But $P_c(x) = 1$ for the correct class $c$ and $0$ for all other classes. So this reduces to just $- \text{log } Q_c(x)$, the negative log probability the model assigns to the correct class (see the sketch after this list).
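
A minimal sketch of this reduction, not tied to any particular ML library; `cross_entropy`, `p_gold`, and `q_pred` are illustrative names:

```python
import math

def cross_entropy(p: list[float], q: list[float]) -> float:
    """Cross entropy H(P, Q): expected Surprise under Q for data drawn from P."""
    return -sum(pc * math.log(qc) for pc, qc in zip(p, q) if pc > 0.0)

# One-hot gold distribution over 3 classes (the correct class is index 1)
# and a model's predicted probabilities.
p_gold = [0.0, 1.0, 0.0]
q_pred = [0.1, 0.7, 0.2]

# With a one-hot P, the sum collapses to -log of the probability the model
# assigns to the correct class.
print(cross_entropy(p_gold, q_pred))   # ~0.357
print(-math.log(q_pred[1]))            # same value: -log Q_c(x)

# If Q matched P exactly, the cross entropy would equal the entropy of P
# (which is 0 for a one-hot distribution).
print(cross_entropy(p_gold, p_gold))   # 0.0
```
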
Written on December 7, 2022