One intuition is that KL-divergence represents a sort of “distance” between probability distributions. However, this isn’t quite right, as it doesn’t satisfy some basic properties a real distance (a norm) would: it isn’t symmetric, KL(P,Q) != KL(Q,P), and it doesn’t satisfy the triangle inequality. Nonetheless, KL(P,Q) gives you a good idea of how “far” P is from Q: in the context of encoding, if you wanted to come up with an ideal encoding of symbols coming from P, but you guessed Q as the distribution of these symbols, then KL(P,Q) is the expected number of extra bits you’d have to use per symbol. One nice property is that if KL(P,Q) = 0, then P and Q are equal (almost everywhere, a caveat that’s irrelevant for most applications). This makes it useful in the ML context: you can minimize KL divergence and know that the resulting “guessed” distribution, produced by some parametrized function (an NN), is getting closer to the data distribution you’re trying to match.
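A quick sketch of both points (the two-symbol distributions here are made up purely for illustration): computing KL in base-2 log gives the "extra bits" reading directly, and swapping the arguments shows the asymmetry.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in bits: the expected extra bits per symbol when
    encoding data drawn from P with a code optimized for Q.
    Terms with p_i == 0 contribute nothing (0 * log 0 := 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over a two-symbol alphabet.
p = [0.5, 0.5]   # the "true" data distribution
q = [0.9, 0.1]   # our (wrong) guess

print(kl_divergence(p, q))  # extra bits paid for guessing Q when data follows P
print(kl_divergence(q, p))  # different value: KL is not symmetric
print(kl_divergence(p, p))  # 0.0: KL vanishes when the distributions match
```

Minimizing the first quantity in p's place with a parametrized distribution is exactly the "get closer to the data distribution" objective mentioned above (and is equivalent to maximum-likelihood training).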
kgwgk|2 years ago
Not sure about "real", but one can have useful distances which are not symmetric, like the distance between cities measured in time or in gallons.
leourbina|2 years ago
In comparison, both of your examples are much closer to norms, since they both satisfy the triangle inequality.
For reference, this is what I’m referring to when I say a “norm”:
https://en.m.wikipedia.org/wiki/Norm_(mathematics)