KL divergence, or relative entropy, is a measure of how one probability distribution differs from a reference probability distribution.
Core concept
Quantify the difference between two probability distributions (note that KL divergence is not symmetric, so it is not a true distance).
Log Likelihood Ratio
Let's say we have data $x$, with two different distributions $p(x)$ and $q(x)$, and we want to quantify the difference between these two distributions at $x$. The most straightforward way could be the plain difference:

$$p(x) - q(x)$$
If we take the log of both distributions (which keeps very small probabilities from being rounded to zero), the formula becomes:

$$\log p(x) - \log q(x)$$
To represent this formula in another way,

$$\log \frac{p(x)}{q(x)}$$

which is called the Log Likelihood Ratio.
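As a quick illustration of the identity above, here is a minimal sketch for a single outcome $x$; the probabilities 0.40 and 0.15 are made-up values chosen purely for the example:

```python
# minimal sketch: the log likelihood ratio for a single outcome x
# (p_x and q_x are made-up probabilities used only for illustration)
from math import log2

p_x = 0.40  # probability of x under distribution p
q_x = 0.15  # probability of x under distribution q

# the difference of logs equals the log of the ratio
print('log2(p(x)) - log2(q(x)) = %.3f' % (log2(p_x) - log2(q_x)))
print('log2(p(x) / q(x))       = %.3f' % log2(p_x / q_x))
```

Both lines print the same value, which is exactly why we can rewrite the difference of logs as a single ratio.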
Expected Value
Before we go any further, let's think about what we are truly looking for: a single number that represents the average difference between the two distributions.
Since we are dealing with random variables, we don't speak of a plain "average" but of the Expected Value instead.
The expected value of a discrete random variable $X$ is:

$$E[X] = \sum_{x} x \, p(x)$$
This is also called the weighted average of the values of the random variable, with the probabilities as weights.
For a continuous random variable $X$, the formula becomes:

$$E[X] = \int_{-\infty}^{\infty} x \, p(x) \, dx$$
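A minimal sketch of the discrete case follows; the values and probabilities below are made-up numbers used only for illustration:

```python
# minimal sketch: expected value of a discrete random variable
# (values and probs are made-up numbers used only for illustration)
values = [1, 2, 3]
probs = [0.10, 0.40, 0.50]

# probability-weighted average of the values
expected_value = sum(v * p for v, p in zip(values, probs))
print('E[X] = %.2f' % expected_value)  # 0.1*1 + 0.4*2 + 0.5*3 = 2.40
```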
Formula
If we look closer, the Log Likelihood Ratio is just a function of the random variable. Since we are looking for the "average" difference between the two distributions, we weight it by $p(x)$ and take the expected value, and the formula becomes:

$$D_{KL}(P \| Q) = E_{x \sim p}\left[ \log \frac{p(x)}{q(x)} \right]$$
For distributions P and Q of a continuous random variable $X$, the Kullback-Leibler divergence is computed as an integral:

$$D_{KL}(P \| Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$
If P and Q represent the probability distributions of a discrete random variable $X$, the Kullback-Leibler divergence is calculated as a summation:

$$D_{KL}(P \| Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$
Example: Calculating KL Divergence in Python
We can make the KL divergence concrete with a worked example.
Consider a random variable with three events, each represented by a different color. We may have two different probability distributions for this variable; for example:
```python
...
# define distributions
events = ['red', 'green', 'blue']
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
```
We can plot a bar chart of these probabilities to compare them directly as probability histograms.
The complete example is listed below.
```python
# plot of distributions
from matplotlib import pyplot

# define distributions
events = ['red', 'green', 'blue']
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
print('P=%.3f Q=%.3f' % (sum(p), sum(q)))
# plot first distribution
pyplot.subplot(2, 1, 1)
pyplot.bar(events, p)
# plot second distribution
pyplot.subplot(2, 1, 2)
pyplot.bar(events, q)
# show the plot
pyplot.show()
```
Running the example first confirms that both distributions sum to one:

```
P=1.000 Q=1.000
```

It then creates a histogram for each probability distribution, allowing the probabilities for each event to be directly compared.
We can see that indeed the distributions are different.
Next, we can develop a function to calculate the KL divergence between the two distributions.
We will use log base-2 to ensure the result has units in bits.
We can then use this function to calculate the KL divergence of P from Q, as well as the reverse, Q from P.
```python
# example of calculating the kl divergence between two mass functions
from math import log2

# calculate the kl divergence
def kl_divergence(p, q):
	return sum(p[i] * log2(p[i] / q[i]) for i in range(len(p)))

# define distributions
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
# calculate (P || Q)
kl_pq = kl_divergence(p, q)
print('KL(P || Q): %.3f bits' % kl_pq)
# calculate (Q || P)
kl_qp = kl_divergence(q, p)
print('KL(Q || P): %.3f bits' % kl_qp)
```
```
KL(P || Q): 1.927 bits
KL(Q || P): 2.022 bits
```
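Note that the two results differ: KL divergence is not symmetric, so KL(P || Q) is generally not equal to KL(Q || P). As an optional cross-check, here is a minimal sketch that recomputes the same values with SciPy (this assumes SciPy is installed; scipy.special.rel_entr works in nats, so we convert to bits, while scipy.stats.entropy accepts a base argument directly):

```python
# cross-check the hand-rolled function with SciPy (assumes scipy is installed)
from math import log
from scipy.special import rel_entr
from scipy.stats import entropy

# same distributions as above
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

# rel_entr computes the elementwise terms p*log(p/q) in nats; sum and convert to bits
kl_pq_nats = sum(rel_entr(p, q))
print('KL(P || Q): %.3f bits' % (kl_pq_nats / log(2)))

# entropy(p, q) returns the KL divergence directly; base=2 gives bits
print('KL(P || Q): %.3f bits' % entropy(p, q, base=2))
print('KL(Q || P): %.3f bits' % entropy(q, p, base=2))
```

Both approaches should reproduce the values printed above.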