The definition of (Shannon) Entropy wasn’t intuition at first sight for me. Had to get abstract a bit to get a sense of what it meant. Here’s an explanation flow that might work for some of you to get there.
Imagine a coin toss machine that tosses a coin and sends out a signal indicating what the outcome of the toss was. For a fair coin, there are two possible outcomes with equal probability of 0.5. How many bits of information does the machine need to communicate the outcome? Well it’s 1. Say 0 for heads and 1 for tails. Put differently, how many binary answer (yes/no) questions do we need to determine the outcome? It’s 1, the answer to “Was is heads?” will tell us what the outcome was.
Imagine two coins being tossed. Four possible outcomes (hh, ht, th, tt) with equal probabilities of 0.25. In this case, we need two bits or binary-answer-questions (00, 01, 10, 11) to know what happened. For three coins, it will be 3 and so on.
So what is this “No. of bit” number? This is the minimum number of bits needed to convey the outcome of the coin toss. A coin toss machine with one coin, will require to send one bit (0 or 1) to the receiver to communicate what was the output of a given toss. Similarly, two and three coin toss machines will have to send 2 and 3 bits respectively to communicate the output. Using the definition of “log”, this number is also the binary log of the possible outcomes. i.e, ex: 2 power 3 = 8 => 8 possible outcomes requires 3 binary bits to represent outcome. This is also the “entropy” for the event.
So is entropy the minimum number of bits we need to communicate an outcome? Well, not quite, but that’s not a bad place to start. Imagine this scenario: what if the coin toss outcomes are not equally likely? What if we used coins such that it always landed on heads? i.e. the probability of heads is 1 and tails 0. Well, then the machine will always send a “1”. There is nothing we will learn from that signal.
And what if — in the example of two coin toss example, the probabilities are not equal? Like depicted below, we could come up with a coding scheme to represent outputs such that it is different for different outcomes. We can’t really apply “binary log of the possible outcomes”, because there are still four possible outcomes, but it’s not equally probable and so “2 bits” is not universally required.
Here’s where we need to adjust the way we defined Entropy a bit. What we are really interested in, is not how many bits of data we need to communicate the outcomes, but rather it is“how much information we are really learning from what was communicated”, regardless of the actual physical bits used to communicate.
In other words, we want Entropy to represent how much “information content” is present in the outcome — however it is communicated to us. We want it to be a quantitative measure of the information content aka uncertainty associated with the event. It will be “0”, meaning “I learnt nothing” or “I was completely certain about the outcome”, for a biased coin that always turns up heads on tossing. And a “1” for a fair coin toss outcome, meaning “I learnt a lot” or “I am most uncertain about” a fair coin toss output as it could either be heads or tails. This uncertainty or information content is what is measured by Entropy.
While in a way, this might give us a theoretical limit on the number of bits needed to communicate the outcome — let’s not constrain ourselves with that as the basis of the definition. Bits come in full — there’s no “half a bit”. But our refreshed definition of Entropy is a measure of “news” in the communication. So similar to other measures like area, length, weight, etc., it is continuous and can be any number or fraction. This is Property #1 of Entropy.
We will go back to Figure 1 and Figure 2 examples to introduce another property for this measure. When we are dealing with events containing equally probable outcomes, the measure should increase with the number of possible outcomes. Figure 1 had a “1” bit entropy and Figure 2 has “2” bits and so on. This is Property #2 of Entropy.
Side note: Such equally probable instances where the number of possible outcomes is a power of 2 are realistically the places where we can apply the theoretical limit of “what I learnt, i.e. information entropy” = “no. of bits needed to communicate the outcome”. In many other cases, this “shortest code” becomes just a theoretical limit and in reality, more bits than information in the message might be needed to communicate the outcomes of the event.
I said “many other cases” in the side note, because in addition to such scenarios, it is also possible to achieve the theoretical “shortest code” in scenarios with the number of possible outcomes that are not a power of two and are not equally probable. In scenarios where it is possible to group the outcomes such that the groups are uniformly distributed, we could achieve this. Take a look at the example below with three outcomes.
Going back to determining how many binary questions we need to derive the outcome, first, we could ask “Is the outcome x1?” — this is akin to grouping x1 into one and x2,x3 into the other. Both these groups have equal probability of being the outcomes (Probability of x1 = 50% = Sum of Probabilities of x2 and x3 = 33.33% + 16.66%). If the answer is “no”, then we can ask “is it x2? And that’s it, the max questions we need to ask to determine the outcome is 2.
The equivalent bit representation for this “one question is all I need if outcome is x1 and two is max questions I need to ask otherwise” is provided in the table illustration. Figure 6 below provides a binary tree view of the same. As opposed to previous examples, in this non-equally probable example, you might notice that the no. of bits required to represent outcomes vary by outcome. Using the first bit or question to determine if the most likely outcomes have occurred makes sense if we want to ask fewer questions or fewer bits overall to represent outcomes of such a source.
This leads us to Property #3 of Entropy. If a choice can be broken down into two successive choices — i.e. choosing between x1, x2, x3 was broken down into choosing between (x1) and (x2, x3) in the example above — then the original Entropy should be the weighted sum of the individual Entropies for the groups. In other words, the total expected entropy at a node of the binary tree representation = weighted sum of entropies from the two child nodes, where the weights are the probability that particular child node is true.
So how do we define this measure? Let’s go back to the “binary log of the possible outcomes” definition — which we saw did not hold true for non-equally probable outcomes. If you think about it, the number of possible outcomes in an event with equally probable outcomes is the same as the inverse of the probability. i.e. in a coin toss, no. of possible outcomes = 2 = 1/0.5. Where 0.5 is the probability of the outcomes. For the two coin toss example, no. of possible outcomes = 4 = 1/0.25. So perhaps the right way to define it is not “binary log of the possible outcomes”, but it is “binary log of (1/p)”, where “p” is the probability of a given outcome. While this is for an outcome, the Shannon Entropy for the event can then be described as the expected Information content from the event.
Using this definition, the information content and the expected Entropy for the different examples we have seen so far is presented in the tables in Figures 8 and 9 below. In each, we can notice the three properties have been met. The measure is continuous, it increases monotonically (compare single and double fair coin toss examples) and by definition of expectation — it is a weighted sum of individual entropies.
Let’s look at the single unfair coin toss example a bit more closely. You might notice the information content associated with heads will be binary log of (1/0.9) = 0.15 bits. Compare this with a normal coin with 50% probability of heads, the binary log of (1/0.5) = 1 bit. The biased coin has less information associated with heads, as it is heads 90% of the times, i.e. almost always. With such a coin, getting a tail is much more newsworthy than getting a head. That is why the information associated with tails is, binary log of (1/0.1) = 3.32 bits, a lot more than the information associated with a head.
Put differently, the more rare an outcome is, the more questions we need to ask to determine if it had occurred. “Is the asteroid in our solar system?” is perhaps the first question we need to answer before probing with further, harder, questions to plot it’s course. And before we can confirm that the extremely unlikely scenario of collision with earth is going to occur, we might have asked a whole lot more questions. This is all to say that more uncommon outcomes or scenarios with a lot of uncertainty require learning a lot more information associated with that outcome. Using “minimum bits to represent” is just a very “sender of message” specific way of saying the same thing. And Entropy is this measure of information value associated with an event.
Below is a plot of Entropy in the case of two possibilities with probabilities p and (1-p), that will help notice a couple of interesting observations, reiterating what we saw above. We can notice H is 0 in both the extreme cases of probability 1 and 0. In both those cases, we don’t really learn anything. It is the highest when the possibilities are equally probable. That also makes sense since this is the situation where no bias is involved and where things are completely random and anything can happen.
Cross Entropy & KL Divergence
Let’s refer back to the examples in Figure 9. Imagine that we were planning to model the communication of the outcomes of the “two coin toss” event. Let’s say we started with the assumption that these are fair coins, so we predicted all outcomes are equally probable and therefore started used two bits to representing the outcomes — as depicted in the left table of Figure 9 (all 0.25). After a while, we observe that these are not really fair coins and the real probability distribution is as shown in the right table of Figure 9 (0.5, 0.25, 0.125, 0.125).
Now, obviously we are not being efficient here. The Expected entropies for these two distributions are different, 1.75 vs 2, the individual information content for the different outcomes are different and it is obvious our coding scheme could be better. For example, we don’t need two bits to represent the first outcome that has 0.5 probability, just one should be fine.
How do we get a sense of how much inefficiency is here? This is exactly what Cross Entropy and KL Divergence help us do. Cross Entropy is the expected entropy under the true distribution P when you use a coding scheme optimized for a predicted distribution Q. The table in Figure 10 demonstrates how Cross Entropy is calculated. The information content of outcomes (aka, the coding scheme used for that outcome) is based on Q, but the true distribution P is used as weights for calculating the expected Entropy. This is the Cross Entropy for distributions P, Q. And the Kullback–Leibler divergence is the difference between the Cross Entropy H for PQ and the true Entropy H for P.
KL for (P||Q) gives the average extra bits required when true distribution P is represented using a coding scheme optimized for Q. Put differently, it would be the information gain we will achieve if we start representing the same event using P, the true distribution, rather than Q the prior distribution. Which would, of course, be different if P were to be the predicted distribution and if a coding scheme optimized for P was used to represent true distribution Q.
Both a careful look at the formula and intuitively, the KL Divergence, though it is a measure of the difference between the two distributions, it is not a true metric i.e. KL (P||Q) <> KL for (Q||P). However, it can be easily shown that this divergence reduces as Q gets closer to P and will be 0 when P=Q. This is demonstrated in Figure 12.
And this is what we use as a loss function while training Neural Networks. When we have an image classification problem, the training data and corresponding correct labels represent P, the true distribution. The NN predictions are our estimations Q. The math involved in such single label classifications is relatively more simple, as P will be 1 for a given label and 0 for others.
Writing things down as an equation (and applying the power rules to get the -1 out):
Binary Cross Entropy
In a multi-class classification problem, “n” represents the number of classes. In the example in Figure 13, this was 4. In a binary classification problem, i.e. just like in a coin toss, the number of classes/outputs is 2 and either should be true, and the probability of one is (1 — probability of other), we can write it down as below:
We can use this binary cross entropy representation for multi-label classification problems as well. In the example seen in Figure 13, it was a multi-class classification problem where only output can be true i.e. only one label can be tagged to an image. In cases where we can have multiple labels associated with an image, i.e. where the probabilities for each label or output by itself can be anything between 0 and 1 and they are independent of each other, we can still leverage cross entropy for our loss. Each label classification is an independent binary cross entropy problem by itself and the global error can be the sum of the Binary cross entropies across the predicted probabilities of all labels.
From this point onwards, the appropriate usage of sigmoid vs softmax for multi-classification vs multi-label problems should become a lot more apparent. And hopefully, though a tad long-winded, this post provides a bit of clarity on how these functions work.