KL divergence, or relative entropy, is a measure of how one probability distribution differs from a reference probability distribution.
Calculate the distance between two probability distributions
Log Likelihood Ratio
Let's say we have data x with two different distributions p and q, and we want to calculate the difference between these two distributions for x. The most straightforward way could be:

p(x) - q(x)

If we take a log on both distributions (which prevents very small probabilities from rounding to zero), the formula becomes:

log p(x) - log q(x)

To represent this formula in another way:

log(p(x) / q(x))

which is called the Log Likelihood Ratio.
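As a quick illustration (a hypothetical sketch with made-up probabilities, not values from the derivation above), the pointwise log likelihood ratio for a single event can be computed directly:

```python
from math import log2

# hypothetical probabilities of the same event under two distributions
p_x = 0.40  # probability of the event under p
q_x = 0.15  # probability of the event under q

# pointwise log likelihood ratio, log(p(x) / q(x)), here in bits
llr = log2(p_x / q_x)
print('log likelihood ratio: %.3f bits' % llr)
```

A positive value means the event is more likely under p than under q; a negative value means the reverse.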
Before we go any further, let's think about what we are truly looking for: a number that represents the average difference between the two distributions. Since we are dealing with random variables, there is no such term as "average"; we use the Expected Value instead.
The expected value for a discrete random variable is:

E[X] = Σ x · p(x)

This is also called the weighted average of the instances of the random variable. For a continuous random variable, the formula becomes:

E[X] = ∫ x p(x) dx
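To make the discrete expected value concrete, here is a minimal sketch (the outcomes and probabilities are illustrative, not from the text):

```python
# expected value of a discrete random variable: E[X] = sum of x * p(x)
values = [1, 2, 3]       # illustrative outcomes
probs = [0.2, 0.5, 0.3]  # illustrative probabilities (sum to 1)

# weighted average of the outcomes, weighted by their probabilities
expected = sum(x * p for x, p in zip(values, probs))
print('E[X] = %.2f' % expected)  # 1*0.2 + 2*0.5 + 3*0.3 = 2.10
```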
If we look closer, the Log Likelihood Ratio is just a function of the random variable. Since we are looking for the "average" difference between the two distributions, we apply the weights p(x) and take the expected value, and the formula becomes:

E_p[ log(p(x) / q(x)) ]
For distributions P and Q of a continuous random variable, the Kullback-Leibler divergence is computed as an integral:

KL(P || Q) = ∫ p(x) log(p(x) / q(x)) dx

If P and Q represent the probability distributions of a discrete random variable, the Kullback-Leibler divergence is calculated as a summation:

KL(P || Q) = Σ p(x) log(p(x) / q(x))
Example: Calculating KL Divergence in Python
We can make the KL divergence concrete with a worked example.
Consider a random variable with three events as different colors. We may have two different probability distributions for this variable; for example:
```python
...
# define distributions
events = ['red', 'green', 'blue']
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
```
We can plot a bar chart of these probabilities to compare them directly as probability histograms.
The complete example is listed below.
```python
# plot of distributions
from matplotlib import pyplot

# define distributions
events = ['red', 'green', 'blue']
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
print('P=%.3f Q=%.3f' % (sum(p), sum(q)))

# plot first distribution
pyplot.subplot(2, 1, 1)
pyplot.bar(events, p)

# plot second distribution
pyplot.subplot(2, 1, 2)
pyplot.bar(events, q)

# show the plot
pyplot.show()
```
Running the example creates a histogram for each probability distribution, allowing the probabilities for each event to be directly compared.
We can see that indeed the distributions are different.
Next, we can develop a function to calculate the KL divergence between the two distributions.
We will use log base-2 to ensure the result has units in bits.
We can then use this function to calculate the KL divergence of P from Q, as well as the reverse, Q from P.
```python
# example of calculating the kl divergence between two mass functions
from math import log2

# calculate the kl divergence
def kl_divergence(p, q):
    return sum(p[i] * log2(p[i] / q[i]) for i in range(len(p)))

# define distributions
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

# calculate (P || Q)
kl_pq = kl_divergence(p, q)
print('KL(P || Q): %.3f bits' % kl_pq)

# calculate (Q || P)
kl_qp = kl_divergence(q, p)
print('KL(Q || P): %.3f bits' % kl_qp)
```
```
KL(P || Q): 1.927 bits
KL(Q || P): 2.022 bits
```
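If we use the natural log instead of log base-2, the result is reported in nats rather than bits; the two units differ only by a factor of ln(2). A quick sketch reusing the same two distributions:

```python
from math import log

# kl divergence in nats, using the natural log
def kl_divergence_nats(p, q):
    return sum(p[i] * log(p[i] / q[i]) for i in range(len(p)))

# same distributions as above
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

kl_pq_nats = kl_divergence_nats(p, q)
print('KL(P || Q): %.3f nats' % kl_pq_nats)

# dividing by ln(2) recovers the value in bits
print('KL(P || Q): %.3f bits' % (kl_pq_nats / log(2)))
```

This is a useful sanity check: converting the nats result back to bits reproduces the 1.927 computed earlier.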