One intuition is that KL-divergence represents a sort of “distance” between probability distributions. However, this isn’t quite right, as it doesn’t satisfy some basic properties a real distance (a norm) would: it isn’t symmetric, KL(P,Q) != KL(Q,P), and it doesn’t satisfy the triangle inequality. Nonetheless, KL(P,Q) gives you a good idea of how “far” P is from Q: in the context of encoding, if you wanted to come up with an ideal encoding of symbols coming from P, but you guessed Q as the distribution of these symbols, then KL(P,Q) is the expected number of extra bits you’d have to use per symbol. One nice property is that if KL(P,Q) = 0, then P and Q are equal (almost everywhere, a caveat that’s irrelevant for most applications). This makes it useful in the ML context: you can minimize KL divergence and know that the resulting “guessed” distribution, produced by some parametrized function (an NN), is getting closer to the data distribution you’re trying to match.
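A quick sketch of both points (the two-symbol distributions here are made up purely for illustration): computing KL in base-2 log gives the "extra bits" reading directly, and swapping the arguments shows the asymmetry.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in bits: the expected extra bits per symbol when
    encoding data drawn from P with a code optimized for Q.
    Terms with p_i == 0 contribute nothing (0 * log 0 := 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over a two-symbol alphabet.
p = [0.5, 0.5]   # the "true" data distribution
q = [0.9, 0.1]   # our (wrong) guess

print(kl_divergence(p, q))  # extra bits paid for guessing Q when data follows P
print(kl_divergence(q, p))  # different value: KL is not symmetric
print(kl_divergence(p, p))  # 0.0: KL vanishes when the distributions match
```

Minimizing the first quantity in p's place with a parametrized distribution is exactly the "get closer to the data distribution" objective mentioned above (and is equivalent to maximum-likelihood training).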
kgwgk|2 years ago
Not sure about "real", but one can have useful distances which are not symmetric, like the distance between cities measured in time or in gallons.
leourbina|2 years ago
In comparison, both of your examples are much closer to norms, since they both satisfy the triangle inequality.
For reference, this is what I’m referring to when I say a “norm”:
https://en.m.wikipedia.org/wiki/Norm_(mathematics)