Cross-Entropy can be thought of as a measure of how much one distribution differs from a 2nd distribution, which is the ground truth. It is one of the two terms in the KL-Divergence, which is also a sort-of measure of (non-symmetric) distance between two distribution, the other term being the ground-truth entropy. Here is the KL for the discrete case:

\[D_{KL}(P||Q) = - \sum_x P(x)\log{Q(x)} + \sum_x P(x)\log{P(x)} = H(P,Q) - H(P)\]

\(H(P)\) being the entropy, and \(H(P,Q)\) being the cross-entropy.

But in machine learning, the data distribution is usually given, so you cannot change the data entropy. It is a constant. Instead you focus on the cross entropy term, trying to minimize it as much as possible. So let’s focus on this cross entropy term:

\[- \sum_x P(x)\log{Q(x)}\]

In classification problems the distributions are discrete, and often there are cases where the ground truth, the “real” labels, are 1-hot vectors. i.e. \(P(x)\) is 1 for a specific value, and 0 for all the rest. This reduces the cross entropy to (switching to y instead of x, which is usually denoting labels):


Where i is the index of the correct label. If \(Q(x_i)=1\) we get 0, but as it goes further from 1 and more into 0, the value grows up to infinity. We can think of the cross entropy as a objective/loss function, which we want to minimize.

It can also be that the classification is binary - e.g. “SPAM” and “NOT-SPAM”. In that case we can write the complete cross entropy of a single observation as follows:

\[-y_i \log{\hat y_i} - (1-y_i)\log{1-\hat y_i}\]

Here \(\hat y_i\) is the predicted value, taken from the distribution we control, i.e. it is \(Q(y_i)\). Notice that if \(y_i=1\), the 2nd term zeros, and we get the same expression as before. If \(y_i=0\) then the first term zeros, and now we want that \(\hat y_i\) will be as close as possible to 0.

This is called Binary-Cross-Entropy, and is often introduced in intro to machine learning, as it’s the most simple form to understand.

You can define this loss with 2 graphs: 1 for when you have a positive sample, and one when you have a negative sample.

But the ground-truth may not be a 1-hot vector. Cross-Entropy can capture the distance for any valid distribution of \(y_i\), meaning for any \(y_i\in[0,1]\). You can check a 3D graph I made in GeoGebra here.

You can see that the values at \(y=1\) and at \(y=0\) are the same as before, but in between we get some interpolation between the two. Nevertheless this indeed captures distance between \(y\) and \(\hat y\), but now the minimal value \(y=\hat y\) will not be 0, but something small. If you play around with this function, you can see it doesn’t fit for values outside of the \([0,1]\) interval.

The general cross-entropy however is a generalization to more than just 2 values (and even more than just discrete set of values). Here is a contour plot of the 2 first values in a 3 values distribution. The real distribution (\(y_i\)) was (0.1, 0.2, 0.7). And I calculated the cross-entropy for different values of the first 2 values. You can see that indeed the minimal score is around (0.1,0.2):

Or if you prefer a 3D graph:

So in conclusion cross entropy is an important meaure of distance between any 2 distributions, they are a very popular choice for discrete classification problems, where the ground truth labels are 1-hot vectors, and are even more simple and useful in binary classifications, where the loss can be viewed in a single 2D graph.