Deep Reinforcement Learning Hands-On

Practical cross-entropy

The description of the cross-entropy method is split into two unequal parts: practical and theoretical. The practical part is intuitive in nature, while the theoretical explanation of why cross-entropy works, and what is happening under the hood, is more sophisticated.

You may remember that the central, trickiest thing in RL is the agent, which is trying to accumulate as much total reward as possible by communicating with the environment. In practice, we follow a common ML approach and replace all of the complications of the agent with some kind of nonlinear trainable function, which maps the agent's input (observations from the environment) to some output. The details of the output that this function produces may depend on a particular method or family of methods, as described in the previous section (such as value-based versus policy-based methods). As our cross-entropy method is policy-based, our nonlinear function (neural network) produces a policy, which basically says, for every observation, which action the agent should take.

Figure 1: A high-level approach to RL
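
To make this concrete, a minimal sketch of such a policy network in PyTorch could look like the following. The two-layer architecture and the HIDDEN_SIZE value are illustrative assumptions, not something prescribed by the cross-entropy method itself:

```python
import torch
import torch.nn as nn

HIDDEN_SIZE = 128  # illustrative choice, not prescribed by the method


class PolicyNet(nn.Module):
    """Maps an observation vector to unnormalized action scores (logits)."""

    def __init__(self, obs_size: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, HIDDEN_SIZE),
            nn.ReLU(),
            nn.Linear(HIDDEN_SIZE, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Return raw logits; softmax is applied later, either when sampling
        # an action or implicitly by the cross-entropy loss during training.
        return self.net(x)
```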

In practice, the policy is usually represented as a probability distribution over actions, which makes it very similar to a classification problem, with the number of classes being equal to the number of actions we can carry out. This abstraction makes our agent very simple: it needs to pass an observation from the environment to the network, get a probability distribution over actions, and perform random sampling from this distribution to get an action to carry out. This random sampling adds randomness to our agent, which is a good thing, as at the beginning of training, when our weights are random, the agent behaves randomly. After the agent gets an action to issue, it fires the action to the environment and obtains the next observation and the reward for the last action. Then the loop continues.
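
A sketch of one step of this loop is shown below. It assumes the hypothetical PolicyNet from the previous sketch and an environment exposing the classic Gym step() API that returns four values; the function name and its signature are our own illustrative choices:

```python
import numpy as np
import torch
import torch.nn.functional as F


def play_step(env, net, obs):
    """Sample one action from the policy and advance the environment by one step."""
    obs_v = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        act_probs = F.softmax(net(obs_v), dim=1).numpy()[0]
    # Random sampling keeps the agent exploratory while the weights are still random
    action = np.random.choice(len(act_probs), p=act_probs)
    # Classic Gym API assumed here: step() returns (obs, reward, done, info)
    next_obs, reward, done, _ = env.step(action)
    return action, next_obs, reward, done
```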

During the agent's lifetime, its experience is presented as episodes. Every episode is a sequence of observations that the agent has got from the environment, actions it has issued, and rewards for these actions. Imagine that our agent has played several such episodes. For every episode, we can calculate the total reward that the agent has claimed. It can be discounted or not discounted, but for simplicity, let's assume a discount factor of gamma = 1, which means just a sum of all local rewards for every episode. This total reward shows how good this episode was for the agent. Let's illustrate this with a diagram, which contains four episodes (note that different episodes contain different observations, actions, and rewards):

Figure 2: Sample episodes with their observations, actions, and rewards
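
One simple way to record this experience is with a pair of lightweight containers, one per step and one per episode, with the total reward computed as a plain sum of the local rewards (gamma = 1). These container names are our own illustrative choices, not library types:

```python
from collections import namedtuple

# Hypothetical containers for recorded experience (not library types)
EpisodeStep = namedtuple("EpisodeStep", ["observation", "action", "reward"])
Episode = namedtuple("Episode", ["total_reward", "steps"])


def make_episode(steps):
    """Wrap a list of EpisodeStep into an Episode. With gamma = 1,
    the total reward is just the sum of all local rewards."""
    total_reward = sum(step.reward for step in steps)
    return Episode(total_reward=total_reward, steps=steps)
```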

Every cell represents the agent's step in the episode. Due to randomness in the environment and the way that the agent selects actions to take, some episodes will be better than others. The core of the cross-entropy method is to throw away bad episodes and train on better ones. So, the steps of the method are as follows (a code sketch of the filtering step appears after the list):

  1. Play N episodes using our current model and environment.
  2. Calculate the total reward for every episode and decide on a reward boundary. Usually, we use some percentile of all rewards, such as the 50th or 70th.
  3. Throw away all episodes with a reward below the boundary.
  4. Train on the remaining "elite" episodes using observations as the input and issued actions as the desired output.
  5. Repeat from step 1 until we become satisfied with the result.
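
Steps 2 and 3, the percentile-based filtering, might be sketched as follows, assuming the Episode container from the earlier sketch; the function and variable names, as well as the default percentile of 70, are illustrative assumptions. It returns the observations and actions of the elite episodes as tensors, ready to serve as input and target in step 4:

```python
import numpy as np
import torch


def filter_elite(episodes, percentile=70):
    """Keep only the episodes whose total reward is at or above the chosen percentile."""
    rewards = [ep.total_reward for ep in episodes]
    reward_bound = np.percentile(rewards, percentile)
    elite = [ep for ep in episodes if ep.total_reward >= reward_bound]

    # Flatten the elite episodes into (observation, action) training pairs
    train_obs = [step.observation for ep in elite for step in ep.steps]
    train_acts = [step.action for ep in elite for step in ep.steps]
    return (
        torch.as_tensor(np.array(train_obs), dtype=torch.float32),
        torch.as_tensor(train_acts, dtype=torch.long),
        reward_bound,
    )
```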

That's all there is to the cross-entropy method description. With the preceding procedure, our neural network learns how to repeat actions that lead to a larger reward, constantly moving the boundary higher and higher. Despite the simplicity of this method, it works well in simple environments, it's easy to implement, and it's quite robust to hyperparameter changes, which makes it an ideal baseline method to try. Let's now apply it to our CartPole environment.
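
Putting the pieces together, one iteration of the outer loop could look roughly like the sketch below. It reuses the hypothetical PolicyNet and filter_elite from the earlier sketches; the play_episodes roll-out helper, the batch size, the learning rate, and the network sizes are all illustrative assumptions, not the implementation we will build for CartPole:

```python
import torch
import torch.nn as nn

# PolicyNet and filter_elite refer to the hypothetical sketches above;
# play_episodes is an assumed helper that rolls out N full episodes.
net = PolicyNet(obs_size=4, n_actions=2)   # sizes chosen for a CartPole-like task
objective = nn.CrossEntropyLoss()          # matches the classification view of the policy
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)

for iteration in range(100):
    episodes = play_episodes(env, net, n=16)               # step 1: collect episodes
    obs_v, acts_v, reward_bound = filter_elite(episodes)   # steps 2-3: keep the elite

    optimizer.zero_grad()
    loss = objective(net(obs_v), acts_v)   # step 4: imitate the elite actions
    loss.backward()
    optimizer.step()
```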