
Theoretical background of the cross-entropy method

This section is optional and included for readers who are interested in why the method works. If you wish, you can refer to the original paper on the cross-entropy method, which is referenced at the end of this section.

The basis of the cross-entropy method lies in the importance sampling theorem, which states this:

$$\mathbb{E}_{x \sim p(x)}[H(x)] = \int_x p(x) H(x)\, dx = \int_x q(x) \frac{p(x)}{q(x)} H(x)\, dx = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)} H(x)\right]$$
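To see the identity in action, here is a minimal NumPy sketch (my own illustration, not from the book): it estimates the expectation of an arbitrary toy function H(x) = x² under p = N(0, 1) both directly and by sampling from a different distribution q = N(1, 2) and reweighting each sample by p(x)/q(x); the two estimates agree.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2), used to form the importance weights p(x)/q(x).
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def H(x):
    # Arbitrary toy "reward" function for this example.
    return x ** 2

# Plain Monte Carlo estimate of E_{x~p}[H(x)] with p = N(0, 1).
xs_p = rng.normal(0.0, 1.0, size=100_000)
direct = H(xs_p).mean()

# Importance-sampled estimate: draw from q = N(1, 2), reweight by p(x)/q(x).
xs_q = rng.normal(1.0, 2.0, size=100_000)
weights = normal_pdf(xs_q, 0.0, 1.0) / normal_pdf(xs_q, 1.0, 2.0)
importance = (weights * H(xs_q)).mean()

print(direct, importance)  # both values are close to 1.0, the true E[x^2] under N(0, 1)
```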

In our RL case, H(x) is a reward value obtained by some policy x and p(x) is a distribution of all possible policies. We don't want to maximize our reward by searching all possible policies; instead, we want to find a way to approximate p(x)H(x) by q(x), iteratively minimizing the distance between them. The distance between two probability distributions is calculated by Kullback-Leibler (KL) divergence, which is as follows:

$$KL(p_1(x) \,\|\, p_2(x)) = \mathbb{E}_{x \sim p_1(x)} \log \frac{p_1(x)}{p_2(x)} = \mathbb{E}_{x \sim p_1(x)}[\log p_1(x)] - \mathbb{E}_{x \sim p_1(x)}[\log p_2(x)]$$

The first term in KL is called entropy; it does not depend on $p_2(x)$, so it can be omitted during the minimization. The second term is called cross-entropy and is a very common optimization objective in deep learning.
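As a quick numeric check (my own sketch, not from the book), the snippet below computes the KL divergence between two small discrete distributions both from the definition and as the sum of the two terms above, showing that only the cross-entropy part involves $p_2(x)$.

```python
import numpy as np

# Two small discrete distributions over the same three outcomes.
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.4, 0.4, 0.2])

# Definition: KL(p1 || p2) = E_{x~p1} log(p1(x) / p2(x)).
kl = np.sum(p1 * np.log(p1 / p2))

# Decomposition into the two terms from the formula above: the first term
# depends only on p1, the second (cross-entropy) is the only part involving p2.
first_term = np.sum(p1 * np.log(p1))          # E_{x~p1} log p1(x)
cross_entropy = -np.sum(p1 * np.log(p2))      # -E_{x~p1} log p2(x)

print(kl, first_term + cross_entropy)  # the two numbers match
```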

Combining both formulas, we can get an iterative algorithm, which starts with $q_0(x) = p(x)$ and on every step improves the approximation of p(x)H(x) with the update:

$$q_{i+1}(x) = \underset{q_{i+1}(x)}{\operatorname{argmin}} \; -\mathbb{E}_{x \sim q_i(x)} \frac{p(x)}{q_i(x)} H(x) \log q_{i+1}(x)$$

This is a generic cross-entropy method, which can be significantly simplified in our RL case. First, we replace our H(x) with an indicator function, which is 1 when the reward for the episode is above the threshold and 0 if the reward is below. Our policy update will look like this:

$$\pi_{i+1}(a \mid s) = \underset{\pi_{i+1}}{\operatorname{argmin}} \; -\mathbb{E}_{z \sim \pi_i(a \mid s)} \left[R(z) \ge \psi_i\right] \log \pi_{i+1}(a \mid s)$$

Here, $z$ is an episode sampled with the current policy $\pi_i$, $R(z)$ is its total reward, $\psi_i$ is the reward threshold for iteration $i$, and the square bracket is the indicator function.

Strictly speaking, the preceding formula misses the normalization term, but it still works in practice without it. So, the method is quite clear: we sample episodes using our current policy (starting with some random initial policy) and minimize the negative log-likelihood of the actions taken in the most successful episodes under our policy.
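In code, this recipe looks very much like a supervised classification step. The following PyTorch sketch is my own minimal illustration, not the book's training loop: it assumes a Gymnasium-style `env` with a discrete action space, a policy network `net` that outputs action logits, and a hypothetical `sample_episode` helper. It filters a batch of episodes by a reward percentile and minimizes the cross-entropy between the network's outputs and the actions taken in the elite episodes.

```python
import numpy as np
import torch
import torch.nn as nn

def sample_episode(env, net):
    # Hypothetical helper (not from the book): play one episode with the current
    # policy and return its total reward plus the observations and actions seen.
    obs, _ = env.reset()
    total_reward, obs_list, act_list = 0.0, [], []
    done = False
    while not done:
        obs_v = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            probs = torch.softmax(net(obs_v), dim=1)
        action = torch.multinomial(probs, num_samples=1).item()
        obs_list.append(obs)
        act_list.append(action)
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward
    return total_reward, obs_list, act_list

def cross_entropy_step(env, net, optimizer, batch_size=16, percentile=70):
    # 1. Sample a batch of episodes with the current (initially random) policy.
    batch = [sample_episode(env, net) for _ in range(batch_size)]
    rewards = [r for r, _, _ in batch]
    threshold = np.percentile(rewards, percentile)

    # 2. Keep only the "elite" episodes whose total reward reaches the threshold:
    #    this plays the role of the indicator function that replaced H(x).
    train_obs, train_act = [], []
    for r, obs_list, act_list in batch:
        if r >= threshold:
            train_obs.extend(obs_list)
            train_act.extend(act_list)

    # 3. Minimize the negative log-likelihood (cross-entropy) of the elite actions.
    obs_v = torch.as_tensor(np.array(train_obs), dtype=torch.float32)
    act_v = torch.as_tensor(train_act, dtype=torch.long)
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(net(obs_v), act_v)
    loss.backward()
    optimizer.step()
    return float(np.mean(rewards)), float(loss)
```

In a full training loop, `cross_entropy_step` would be called repeatedly until the mean reward of the batch crosses the environment's solved boundary.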

There is a whole book dedicated to this method, written by Dirk P. Kroese. A shorter description can be found in the Cross-Entropy Method paper by Dirk P. Kroese (https://people.smp.uq.edu.au/DirkKroese/ps/eormsCE.pdf).