KL (Kullback-Leibler) Divergence

Approximating KL Divergence

KL divergence, also known as relative entropy or information divergence (information gain), is an asymmetric measure of the difference between two probability distributions \(P\) and \(Q\). It measures the number of extra bits needed to encode samples from \(P\) when using a code optimized for \(Q\), where \(P\) is the true data distribution and \(Q\) is an approximation of \(P\).

\[D_{KL}(P \Vert Q) = - \sum_{x \in X} P(x) \log \frac{1}{P(x)} + \sum_{x \in X} P(x) \log \frac{1}{Q(x)} = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}.\]
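
As a quick numerical illustration of the discrete sum above, here is a minimal NumPy sketch (the helper name `kl_divergence` is only illustrative, not a library function); natural logarithms give the result in nats, dividing by \(\log 2\) gives bits:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with P(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.1, 0.4, 0.5]                   # "true" distribution P
q = [0.2, 0.3, 0.5]                   # approximating distribution Q
print(kl_divergence(p, q))            # extra coding cost of using a code built for Q on data from P
```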

Properties of KL divergence:

  • It is asymmetric (see the numerical check after this list);
  • It is non-negative, which follows from Jensen's inequality and the convexity of \(-\log\);
  • It does not satisfy the triangle inequality.
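
A minimal numerical check of the first two properties, with an arbitrarily chosen pair of distributions:

```python
import numpy as np

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])

kl_pq = np.sum(p * np.log(p / q))     # D_KL(P || Q)
kl_qp = np.sum(q * np.log(q / p))     # D_KL(Q || P)

print(kl_pq, kl_qp)                   # asymmetry: the two values differ
assert kl_pq >= 0 and kl_qp >= 0      # non-negativity
assert np.isclose(np.sum(p * np.log(p / p)), 0.0)   # zero only when the distributions coincide
```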

Limitations of KL divergence:

When the two distributions are far apart and do not overlap at all, the KL divergence loses its meaning: it blows up to \(+\infty\) and no longer reflects how far apart the distributions are.

Probability density of the univariate Gaussian distribution:

\[N(x\vert \mu,\sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}}\exp\left\{ - \frac{1}{2\sigma^2}(x - \mu)^2 \right\}.\]

Probability density of the multivariate Gaussian distribution, where \(\Sigma\) is the covariance matrix:

\[N(x\vert \mu,\Sigma) = \frac{1}{(2\pi)^{n/2} \lvert \Sigma \rvert^{1/2}}\exp\left\{ -\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x-\mu) \right\}.\]
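
As a sanity check on both density formulas, the sketch below (assuming NumPy and SciPy are available; the helper names are illustrative) evaluates them directly and compares against `scipy.stats`:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate Gaussian density, written out exactly as in the formula above.
def gauss_pdf_1d(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Multivariate Gaussian density with covariance matrix Sigma.
def gauss_pdf_nd(x, mu, Sigma):
    n = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

print(gauss_pdf_1d(1.0, 0.0, 2.0), norm.pdf(1.0, loc=0.0, scale=np.sqrt(2.0)))

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, 0.5])
print(gauss_pdf_nd(x, mu, Sigma), multivariate_normal.pdf(x, mean=mu, cov=Sigma))
```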

Properties of the matrix trace (checked numerically in the sketch after the list):

  • \(tr(\alpha A + \beta B) = \alpha tr(A) + \beta tr(B)\).
  • \(tr(A) = tr(A^T)\).
  • \(tr(AB) = tr(BA)\).
  • \(tr(ABC) = tr(CAB) = tr(BCA)\).
  • \(\lambda^T A \lambda = tr(\lambda^T A \lambda) = tr(A \lambda \lambda^T)\).
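
These identities are easy to verify numerically; a minimal NumPy sketch with randomly drawn matrices, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A, B, C = rng.normal(size=(3, n, n))
lam = rng.normal(size=(n, 1))
alpha, beta = 2.0, -3.0

assert np.isclose(np.trace(alpha * A + beta * B), alpha * np.trace(A) + beta * np.trace(B))
assert np.isclose(np.trace(A), np.trace(A.T))
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
# lambda^T A lambda is a 1x1 matrix, so it equals its own trace,
# and by the cyclic property it also equals tr(A lambda lambda^T).
assert np.isclose((lam.T @ A @ lam).item(), np.trace(A @ lam @ lam.T))
```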

Properties of the mean and covariance of a multivariate distribution:

\[\mathbb{E}[xx^T] = \Sigma + \mu\mu^T.\]

\[\begin{aligned} \Sigma & = \mathbb{E}\left[ (x-\mu)(x- \mu)^T \right] \\ & = \mathbb{E}[xx^T - x\mu^T - \mu x^T + \mu\mu^T] \\ & = \mathbb{E}[xx^T] - \mu\mu^T. \end{aligned}\]

\[\begin{aligned} \mathbb{E}(x^T A x) & = \mathbb{E}[tr(x^T A x)] \\ & = \mathbb{E}[tr(Axx^T)] \\ & = tr[\mathbb{E}(Axx^T)] \\ & = tr[A\mathbb{E}(xx^T)] \\ & = tr[A(\Sigma + \mu\mu^T)] \\ & = tr(A\Sigma) + tr(A\mu\mu^T) \\ & = tr(A\Sigma)+ tr(\mu^T A \mu) \\ & = tr(A\Sigma) + \mu^T A \mu. \end{aligned}\]
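
Both identities can be checked by Monte Carlo sampling; a minimal NumPy sketch (the sample size and the particular \(\mu\), \(\Sigma\), \(A\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
mu = np.array([1.0, -2.0, 0.5])
L = rng.normal(size=(n, n))
Sigma = L @ L.T + n * np.eye(n)          # a positive definite covariance matrix
A = rng.normal(size=(n, n))

x = rng.multivariate_normal(mu, Sigma, size=200_000)   # samples, shape (N, n)

# E[x x^T] = Sigma + mu mu^T  (print the largest entrywise error)
E_xxT = x.T @ x / len(x)
print(np.max(np.abs(E_xxT - (Sigma + np.outer(mu, mu)))))

# E[x^T A x] = tr(A Sigma) + mu^T A mu
quad = np.einsum('ni,ij,nj->n', x, A, x).mean()
print(quad, np.trace(A @ Sigma) + mu @ A @ mu)
```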

When comparing policies in reinforcement learning:

Define the policies as multivariate Gaussian distributions:

\[\pi_1(a\vert s) \sim N(\mu_1, \Sigma_1), \quad \pi_2(a \vert s) \sim N(\mu_2, \Sigma_2),\]

where

  • \(\mu_1,\mu_2\) are produced by the neural networks \(\pi_{\theta_1}(s), \pi_{\theta_2}(s)\).
  • \(\Sigma = \Sigma_1 = \Sigma_2\) is a diagonal matrix independent of state \(s\).

Then,

\[\begin{aligned} D_{KL}(\pi_1 \Vert \pi_2) & = \mathbb{E}_{\pi_1}[\log \pi_1 - \log \pi_2] \\ & = \frac{1}{2}\mathbb{E}_{\pi_1}\left[ -\log\lvert\Sigma_1 \rvert - (x - \mu_1)^T \Sigma_1^{-1}(x - \mu_1) + \log\lvert\Sigma_2 \rvert + (x - \mu_2)^T \Sigma_2^{-1}(x - \mu_2) \right] \\ & = \frac{1}{2}\log\frac{\lvert\Sigma_2 \rvert}{\lvert\Sigma_1 \rvert} + \frac{1}{2}\mathbb{E}_{\pi_1}\left[ - (x - \mu_1)^T \Sigma_1^{-1}(x - \mu_1) + (x - \mu_2)^T \Sigma_2^{-1}(x - \mu_2) \right] \\ & = \frac{1}{2}\log\frac{\lvert\Sigma_2 \rvert}{\lvert\Sigma_1 \rvert} + \frac{1}{2}\mathbb{E}_{\pi_1}\left\{ -tr\left[ \Sigma_1^{-1} (x - \mu_1) (x - \mu_1)^T \right] + tr\left[ \Sigma_2^{-1} (x - \mu_2) (x - \mu_2)^T \right] \right\} \\ & = \frac{1}{2}\log\frac{\lvert\Sigma_2 \rvert}{\lvert\Sigma_1 \rvert} + \frac{1}{2}\mathbb{E}_{\pi_1}\left\{ -tr\left[ \Sigma_1^{-1} (x - \mu_1) (x - \mu_1)^T \right] \right\} + \frac{1}{2}\mathbb{E}_{\pi_1}\left\{ tr\left[ \Sigma_2^{-1} (x - \mu_2) (x - \mu_2)^T \right] \right\} \\ & = \frac{1}{2}\log\frac{\lvert\Sigma_2 \rvert}{\lvert\Sigma_1 \rvert} - \frac{1}{2} tr \left\{ \mathbb{E}_{\pi_1}\left[ \Sigma_1^{-1} (x - \mu_1) (x - \mu_1)^T \right] \right\} + \frac{1}{2} tr \left\{ \mathbb{E}_{\pi_1}\left[ \Sigma_2^{-1} (x - \mu_2) (x - \mu_2)^T \right] \right\} \\ & = \frac{1}{2}\log\frac{\lvert\Sigma_2 \rvert}{\lvert\Sigma_1 \rvert} - \frac{1}{2} tr \left\{ \Sigma_1^{-1} \mathbb{E}_{\pi_1}\left[ (x - \mu_1) (x - \mu_1)^T \right] \right\} + \frac{1}{2} tr \left\{ \mathbb{E}_{\pi_1}\left[ \Sigma_2^{-1} (xx^T - \mu_2 x^T - x\mu_2^T + \mu_2 \mu_2^T) \right] \right\} \\ & = \frac{1}{2}\log\frac{\lvert\Sigma_2 \rvert}{\lvert\Sigma_1 \rvert} - \frac{1}{2} tr \left\{ \Sigma_1^{-1} \Sigma_1 \right\} + \frac{1}{2} tr \left\{ \Sigma_2^{-1} \mathbb{E}_{\pi_1}\left[ xx^T - \mu_2 x^T - x\mu_2^T + \mu_2 \mu_2^T \right] \right\} \\ & = \frac{1}{2}\log\frac{\lvert\Sigma_2 \rvert}{\lvert\Sigma_1 \rvert} - \frac{1}{2} tr \left\{ \Sigma_1^{-1} \Sigma_1 \right\} + \frac{1}{2} tr \left\{ \Sigma_2^{-1} \left[ \Sigma_1 + \mu_1 \mu_1^T - \mu_2 \mu_1^T - \mu_1 \mu_2^T + \mu_2 \mu_2^T \right] \right\} \\ & = \frac{1}{2} \left\{ \log\frac{\lvert\Sigma_2 \rvert}{\lvert\Sigma_1 \rvert} - tr \left\{ \Sigma_1^{-1} \Sigma_1 \right\} + tr(\Sigma_2^{-1}\Sigma_1) + tr \left\{ \Sigma_2^{-1} \left[ \mu_1 \mu_1^T - \mu_2 \mu_1^T - \mu_1 \mu_2^T + \mu_2 \mu_2^T \right] \right\} \right\} \\ & = \frac{1}{2} \left\{ \log\frac{\lvert\Sigma_2 \rvert}{\lvert\Sigma_1 \rvert} - tr \left\{ \Sigma_1^{-1} \Sigma_1 \right\} + tr(\Sigma_2^{-1}\Sigma_1) + tr \left\{ \Sigma_2^{-1} \mu_1 \mu_1^T - \Sigma_2^{-1}\mu_2 \mu_1^T - \Sigma_2^{-1}\mu_1 \mu_2^T + \Sigma_2^{-1}\mu_2 \mu_2^T \right\} \right\} \\ & = \frac{1}{2} \left\{ \log\frac{\lvert\Sigma_2 \rvert}{\lvert\Sigma_1 \rvert} - tr \left\{ \Sigma_1^{-1} \Sigma_1 \right\} + tr(\Sigma_2^{-1}\Sigma_1) + tr \left\{ \mu_1^T\Sigma_2^{-1}\mu_1 - 2 \mu_1^T\Sigma_2^{-1}\mu_2 + \mu_2^T\Sigma_2^{-1}\mu_2 \right\} \right\} \\ & = \frac{1}{2} \left\{ \log\frac{\lvert\Sigma_2 \rvert}{\lvert\Sigma_1 \rvert} - tr \left\{ \Sigma_1^{-1} \Sigma_1 \right\} + tr(\Sigma_2^{-1}\Sigma_1) + (\mu_2 - \mu_1)^T \Sigma_2^{-1}(\mu_2 - \mu_1) \right\} \\ & = \frac{1}{2} \left\{ \log\lvert\Sigma_2 \Sigma_1^{-1} \rvert + tr(\Sigma_2^{-1}\Sigma_1) + (\mu_2 - \mu_1)^T \Sigma_2^{-1}(\mu_2 - \mu_1) - tr(\Sigma_1^{-1} \Sigma_1) \right\}. \end{aligned}\]
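
In the last line, \(tr(\Sigma_1^{-1} \Sigma_1) = tr(I) = n\), the dimension of the action space. The closed form can be implemented directly; below is a minimal sketch (the function `gaussian_kl` is illustrative, not a library API) that also cross-checks the result against a Monte Carlo estimate of \(\mathbb{E}_{\pi_1}[\log \pi_1 - \log \pi_2]\) using `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_kl(mu1, Sigma1, mu2, Sigma2):
    """D_KL(N(mu1, Sigma1) || N(mu2, Sigma2)), following the last line of the derivation."""
    n = len(mu1)
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    log_det_ratio = np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1))
    return 0.5 * (log_det_ratio
                  + np.trace(Sigma2_inv @ Sigma1)
                  + diff @ Sigma2_inv @ diff
                  - n)                             # tr(Sigma1^{-1} Sigma1) = n

mu1, mu2 = np.array([0.0, 0.0]), np.array([1.0, -0.5])
Sigma1 = np.array([[1.0, 0.2], [0.2, 0.5]])
Sigma2 = np.array([[1.5, 0.0], [0.0, 1.0]])

analytic = gaussian_kl(mu1, Sigma1, mu2, Sigma2)

# Monte Carlo estimate of E_{pi_1}[log pi_1 - log pi_2].
rng = np.random.default_rng(2)
x = rng.multivariate_normal(mu1, Sigma1, size=200_000)
mc = np.mean(multivariate_normal.logpdf(x, mu1, Sigma1)
             - multivariate_normal.logpdf(x, mu2, Sigma2))
print(analytic, mc)   # the two values should agree up to sampling noise
```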

JS (Jensen-Shannon) Divergence

JS divergence is a variant of KL divergence that measures the similarity between two probability distributions. It fixes the asymmetry of KL divergence: it is symmetric, and its value lies between 0 and 1 when base-2 logarithms are used (between 0 and \(\log 2\) with natural logarithms).

\[D_{JS}(P \Vert Q) = \frac{1}{2} D_{KL}\left( P \bigg\Vert \frac{P + Q}{2} \right) + \frac{1}{2} D_{KL}\left( Q \bigg\Vert \frac{P + Q}{2} \right).\]
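
A minimal sketch for the discrete case (the helper names are illustrative); note how disjoint supports saturate at \(\log 2\):

```python
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """D_JS(P || Q) with natural logs, so the value lies in [0, log 2]."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p, q = [0.1, 0.4, 0.5], [0.2, 0.3, 0.5]
print(js_divergence(p, q), js_divergence(q, p))          # symmetric
print(js_divergence([1.0, 0.0], [0.0, 1.0]), np.log(2))  # disjoint supports give exactly log 2
```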

Properties of JS divergence:

  • Symmetry.

Limitations of JS divergence:

When the two distributions are far apart and do not overlap at all, the JS divergence is a constant (\(\log 2\)). In a learning algorithm this means the gradient at such a point is 0, i.e., the gradient vanishes.

Wasserstein Distance

The Wasserstein distance measures the distance between two probability distributions.

\[W(P,Q) = \inf_{\gamma \sim \Pi(P,Q)} \mathbb{E}_{(x,y) \sim \gamma}[\Vert x-y \Vert].\]

The Wasserstein distance of order \(\lambda\) is:

\[W_\lambda(P,Q) = \left( \inf_{\gamma \sim \Pi(P,Q)} \mathbb{E}_{(x,y) \sim \gamma} \left[\Vert x-y \Vert^\lambda \right] \right)^{1/\lambda}.\]

Here, \(\Pi(P,Q)\) is the set of all possible joint distributions whose marginals are \(P\) and \(Q\). For a candidate joint distribution \(\gamma\), sample a pair \((x,y)\) from it and compute the distance \(\Vert x-y \Vert\) between the two samples; averaging gives the expected distance between sample pairs under that joint distribution. Taking the infimum of this expected distance over all possible joint distributions yields the Wasserstein distance.
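
For one-dimensional samples of equal size, the optimal coupling simply matches sorted samples, which makes the Wasserstein-1 distance easy to estimate; a minimal sketch, assuming SciPy's `scipy.stats.wasserstein_distance` is available as a cross-check:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)
x = rng.normal(loc=0.0, scale=1.0, size=10_000)   # samples from P
y = rng.normal(loc=3.0, scale=1.0, size=10_000)   # samples from Q

# Matching sorted samples gives the optimal 1-D coupling, so W_1 is the
# mean absolute difference of the sorted values.
w1_sorted = np.mean(np.abs(np.sort(x) - np.sort(y)))
print(w1_sorted, wasserstein_distance(x, y))       # both close to |3 - 0| = 3
```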

Intuition: the expected distance between sample pairs can be read as the cost of moving the earth pile \(P\) onto the earth pile \(Q\) under the transport plan \(\gamma\); the Wasserstein distance is then the minimum cost under the optimal plan. For this reason, the Wasserstein distance is also called the Earth-Mover distance.

Advantages of the Wasserstein distance

Even when the two distributions overlap little or not at all, it still reflects how far apart they are. For example:

Consider two distributions \(P\) and \(Q\) in two-dimensional space that are uniform on the line segments AB and CD, respectively, where a parameter \(\theta\) controls how far apart the two distributions are. Then:

\[D_{KL}(P \Vert Q) = \begin{cases} +\infty, & \theta \neq 0 \\ 0, & \theta = 0. \end{cases}\]

\[D_{JS}(P \Vert Q) = \begin{cases} \log 2, & \theta \neq 0 \\ 0, & \theta = 0. \end{cases}\]

\[W(P,Q) = \vert \theta \vert.\]

When the two distributions do not overlap, the KL and JS divergences change abruptly and carry no information about how far apart the distributions are, whereas the Wasserstein distance is smooth and still reflects their separation. If gradient descent is used to optimize the parameter \(\theta\), KL and JS divergences provide no useful gradient. The Wasserstein distance therefore has a clear advantage over KL and JS divergences and a wider range of applicability.
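
A one-dimensional analogue of this example (point masses at \(0\) and \(\theta\) rather than the two segments above, but showing the same behavior) can be checked numerically; the helper functions below are purely illustrative:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def kl(p, q):
    """Discrete KL; +inf when Q has zero mass where P has positive mass."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.any((p > 0) & (q == 0)):
        return np.inf
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def js(p, q):
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

for theta in [0.5, 1.0, 2.0, 4.0]:
    # P is a point mass at 0, Q a point mass at theta (atoms {0, theta}).
    p, q = [1.0, 0.0], [0.0, 1.0]
    w = wasserstein_distance([0.0], [theta])   # 1-D Wasserstein between the two point masses
    print(theta, kl(p, q), js(p, q), w)
# KL stays +inf and JS stays log(2) ~= 0.693 for every theta != 0,
# while the Wasserstein distance grows linearly as |theta|.
```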

TV (Total Variation) Distance

  • A distance measure for probability distributions.
  • It is the largest possible difference between the probabilities that the two distributions assign to the same event.

Definition:

\[D_{TV}(P,Q) = \sup_{A \subseteq X}\lvert P(A) - Q(A) \rvert.\]

Relation to KL-divergence:

\[D_{TV}(P,Q) \le \sqrt{\frac{1}{2}D_{KL}(P\Vert Q)}.\]

When the set is countable, the TV distance is related to the \(L_1\) norm:

\[D_{TV}(P,Q) = \frac{1}{2}\lVert P-Q \rVert_1 = \frac{1}{2}\sum_{x\in X}\lvert P(x) - Q(x)\rvert.\]
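
For discrete distributions the \(L_1\) formula is the easiest to compute, and Pinsker's inequality above can be checked directly; a minimal sketch with an arbitrary pair of distributions:

```python
import numpy as np

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])

tv = 0.5 * np.sum(np.abs(p - q))      # D_TV via the L1 formula
kl = np.sum(p * np.log(p / q))        # D_KL(P || Q); both supports are full here

print(tv)                             # 0.5 * (0.1 + 0.1 + 0.0) = 0.1
assert tv <= np.sqrt(0.5 * kl)        # Pinsker's inequality holds
```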