
Cross-entropy on CartPole
The whole code for this example is in Chapter04/01_cartpole.py, but the following are the most important parts. Our model's core is a one-hidden-layer neural network with ReLU and 128 hidden neurons (which is absolutely arbitrary). Other hyperparameters are also set almost randomly and aren't tuned, as the method is robust and converges very quickly.
HIDDEN_SIZE = 128
BATCH_SIZE = 16
PERCENTILE = 70
We define constants at the top of the file and they include the count of neurons in the hidden layer, the count of episodes we play on every iteration (16), and the percentile of episodes' total rewards that we use for elite episode filtering. We'll take the 70th percentile, which means that we'll leave the top 30% of episodes sorted by reward:
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)
There is nothing special about our network; it takes a single observation from the environment as an input vector and outputs a number for every action we can perform. The output from the network is a probability distribution over actions, so a straightforward way to proceed would be to include a softmax nonlinearity after the last layer. However, in the preceding network we don't apply softmax, which increases the numerical stability of the training process. Rather than calculating softmax (which uses exponentiation) and then calculating cross-entropy loss (which uses the logarithm of probabilities), we'll use the PyTorch class nn.CrossEntropyLoss, which combines both softmax and cross-entropy in a single, more numerically stable expression. CrossEntropyLoss requires raw, unnormalized values from the network (also called logits), and the downside of this is that we need to remember to apply softmax every time we need to get probabilities from our network's output.
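To make the logits convention concrete, here is a tiny standalone check (not part of the book's code, purely an illustration with made-up numbers): the value computed by nn.CrossEntropyLoss on raw scores equals the negative log of the softmax probability assigned to the taken action.

import torch
import torch.nn as nn

logits = torch.tensor([[1.5, -0.3]])   # raw network output for one observation
target = torch.tensor([0])             # index of the action that was taken
loss = nn.CrossEntropyLoss()(logits, target)

probs = nn.Softmax(dim=1)(logits)      # what we do whenever we need probabilities
manual = -torch.log(probs[0, target[0]])
print(loss.item(), manual.item())      # both print the same value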
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])
Here we will define two helper classes that are named tuples from the collections package in the standard library (a tiny usage example follows the list):

- EpisodeStep: This will be used to represent one single step that our agent made in the episode, and it stores the observation from the environment and the action the agent took. We'll use episode steps from elite episodes as training data.
- Episode: This is a single episode stored as the total undiscounted reward and a collection of EpisodeStep.
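For illustration only, the records are created and read by field name (the numbers here are made up, not taken from a real run):

step = EpisodeStep(observation=[0.02, -0.10, 0.03, 0.21], action=1)
episode = Episode(reward=2.0, steps=[step, step])
print(episode.reward, len(episode.steps), episode.steps[0].action)   # 2.0 2 1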
Let's look at a function that generates batches with episodes:
def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    sm = nn.Softmax(dim=1)
The preceding function accepts the environment (the Env class instance from the Gym library), our neural network, and the count of episodes it should generate on every iteration. The batch variable will be used to accumulate our batch (which is a list of Episode instances). We also declare a reward counter for the current episode and its list of steps (the EpisodeStep objects). Then we reset our environment to obtain the first observation and create a softmax layer, which will be used to convert the network's output to a probability distribution over actions. That's all of our preparations, so we're ready to start the environment loop:
    while True:
        obs_v = torch.FloatTensor([obs])
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
At every iteration, we convert our current observation to a PyTorch tensor and pass it to the network to obtain action probabilities. There are several things to note here:
- All nn.Module instances in PyTorch expect a batch of data items, and the same is true for our network, so we convert our observation (which is a vector of four numbers in CartPole) into a tensor of size 1 × 4 (to achieve this, we pass the observation in a single-element list).
- As we haven't used nonlinearity at the output of our network, it outputs raw action scores, which we need to feed through the softmax function.
- Both our network and the softmax layer return tensors that track gradients, so we need to unpack this by accessing the tensor.data field and then converting the tensor into a NumPy array. This array will have the same two-dimensional structure as the input, with the batch dimension on axis 0, so we need to get the first batch element to obtain a one-dimensional vector of action probabilities:

        action = np.random.choice(len(act_probs), p=act_probs)
        next_obs, reward, is_done, _ = env.step(action)
Now that we have the probability distribution of actions, we can use it to obtain the actual action for the current step by sampling this distribution with NumPy's random.choice() function. After this, we pass the action to the environment to get our next observation, our reward, and the indication of the episode ending:
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))
The reward is added to the current episode's total reward, and our list of episode steps is also extended with an (observation, action) pair. Note that we save the observation that was used to choose the action, and not the observation returned by the environment as a result of the action. This is one of those tiny but important details that you need to keep in mind.
        if is_done:
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            if len(batch) == batch_size:
                yield batch
                batch = []
This is how we handle the situation when the current episode is over (in the case of CartPole, the episode ends when the stick has fallen down despite our efforts). We append the finalized episode to the batch, saving the total reward (as the episode has been completed and we've accumulated all reward) and steps we've taken. Then we reset our total reward accumulator and clean the list of steps. After that, we reset our environment to start over.
If our batch has reached the desired count of episodes, we return it to the caller for processing using yield. Our function is a generator, so every time the yield statement is executed, control is transferred to the outer iteration loop and then continues after the yield line. If you're not familiar with Python's generator functions, refer to the Python documentation. After the caller has processed the batch, we clear it (the batch = [] line) and continue gathering episodes:
        obs = next_obs
The last, but very important, step in our loop is to assign an observation obtained from the environment to our current observation variable. After that, everything repeats infinitely: we pass the observation to the net, sample the action to perform, ask the environment to process the action, and remember the result of this processing.
One very important fact to understand about this function's logic is that the training of our network and the generation of our episodes are performed at the same time. They are not completely parallel, but every time our loop accumulates enough episodes (16), it passes control to the function's caller, which is supposed to train the network using gradient descent. So, when the generator resumes after yield, the network will have different, slightly better (we hope) behavior.

We don't need to bother with proper synchronization, as our training and data gathering run in the same thread of execution, but you need to be aware of those constant jumps from network training to its utilization.
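If generators are new to you, here is a minimal standalone sketch (independent of the RL code; the names are made up for illustration) of how control bounces back and forth between a generator and the loop that consumes it:

def count_batches():
    n = 0
    while True:
        n += 1
        yield n            # control jumps to the caller here...
        print("resumed")   # ...and execution continues here on the next iteration

for batch_no in count_batches():
    print("got batch", batch_no)   # "training" would happen at this point
    if batch_no == 3:
        break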
Okay, now we need to define yet another function and we'll be ready to switch to the training loop:
def filter_batch(batch, percentile):
    rewards = list(map(lambda s: s.reward, batch))
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards))
This function is at the core of the cross-entropy method: from the given batch of episodes and percentile value, it calculates a boundary reward, which is used to filter the elite episodes to train on. To obtain the boundary reward, we use NumPy's percentile function, which, given a list of values and the desired percentile, calculates the percentile's value. Then we calculate the mean reward, which is used only for monitoring.
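To make the boundary concrete, here is a tiny standalone illustration with made-up rewards (not part of the book's code):

import numpy as np

rewards = [10.0, 12.0, 15.0, 21.0, 25.0, 30.0, 44.0, 50.0, 57.0, 80.0]
print(np.percentile(rewards, 70))   # 45.8 -> episodes below this are dropped, the top ~30% remain
print(float(np.mean(rewards)))      # 34.4 -> reported for monitoring only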
    train_obs = []
    train_act = []
    for example in batch:
        if example.reward < reward_bound:
            continue
        train_obs.extend(map(lambda step: step.observation, example.steps))
        train_act.extend(map(lambda step: step.action, example.steps))
Next, we filter our episodes. For every episode in the batch, we check whether its total reward is at least our boundary and, if it is, we populate the lists of observations and actions that we will train on.
    train_obs_v = torch.FloatTensor(train_obs)
    train_act_v = torch.LongTensor(train_act)
    return train_obs_v, train_act_v, reward_bound, reward_mean
As the final step of the function, we will convert our observations and actions from elite episodes into tensors, and return a tuple of four: observations, actions, the boundary of reward, and the mean reward. The last two values will be used only to write them into TensorBoard to check the performance of our agent.
Now, the final chunk of code that glues everything together and mostly consists of the training loop is as follows:
if __name__ == "__main__":
    env = gym.make("CartPole-v0")
    # env = gym.wrappers.Monitor(env, directory="mon", force=True)
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n

    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    objective = nn.CrossEntropyLoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.01)
    writer = SummaryWriter()
In the beginning, we will create all the required objects: the environment, our neural network, the objective function, the optimizer, and the summary writer for TensorBoard. The commented line creates a monitor to write videos of your agent's performance.
    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
        obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
        optimizer.zero_grad()
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v)
        loss_v.backward()
        optimizer.step()
In the training loop, we iterate over our batches (each a list of Episode objects) and perform filtering of the elite episodes using the filter_batch function. The result is variables of observations and taken actions, the reward boundary used for filtering, and the mean reward. After that, we zero the gradients of our network and pass the observations to the network, obtaining its action scores. These scores are passed to the objective function, which calculates cross-entropy between the network's output and the actions that the agent took. The idea of this is to reinforce our network to carry out those "elite" actions which have led to good rewards. Then we calculate gradients on the loss and ask the optimizer to adjust the network.
print("%d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % ( iter_no, loss_v.item(), reward_m, reward_b)) writer.add_scalar("loss", loss_v.item(), iter_no) writer.add_scalar("reward_bound", reward_b, iter_no) writer.add_scalar("reward_mean", reward_m, iter_no)
The rest of the loop is mostly the monitoring of progress. On the console, we show iteration number, loss, the mean reward of the batch, and the reward boundary. We also write the same values to TensorBoard, to get a nice chart of the agent's learning performance.
        if reward_m > 199:
            print("Solved!")
            break
    writer.close()
The last check in the loop is the comparison of the mean reward of our batch's episodes with 199. When the mean reward becomes greater than 199, we stop our training. Why 199? In Gym, the CartPole environment is considered to be solved when the mean reward over the last 100 episodes is greater than 195, but our method converges so quickly that 100 episodes are usually all we need. A properly trained agent can balance the stick infinitely long (obtaining any amount of score), but the length of an episode in CartPole is limited to 200 steps (if you look at the env variable of CartPole, you may notice the TimeLimit wrapper, which stops the episode after 200 steps). With all this in mind, we will stop training once the mean reward in the batch is greater than 199, which is a good indication that our agent knows how to balance the stick like a pro.
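If you want to see that limit for yourself, a quick check looks like this (a sketch assuming the classic Gym API used in this chapter; the exact repr and attribute names may differ between Gym versions):

import gym

env = gym.make("CartPole-v0")
print(env)                          # something like <TimeLimit<CartPoleEnv<CartPole-v0>>>
print(env.spec.max_episode_steps)   # 200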
That's it. So let's start our first RL training!
rl_book_samples/Chapter04$ ./01_cartpole.py
[2017-10-04 12:44:39,319] Making new env: CartPole-v0
0: loss=0.701, reward_mean=18.0, reward_bound=21.0
1: loss=0.682, reward_mean=22.6, reward_bound=23.5
2: loss=0.688, reward_mean=23.6, reward_bound=25.5
3: loss=0.675, reward_mean=22.8, reward_bound=22.0
4: loss=0.658, reward_mean=31.9, reward_bound=34.0
.........
36: loss=0.527, reward_mean=135.9, reward_bound=168.5
37: loss=0.527, reward_mean=147.4, reward_bound=160.5
38: loss=0.528, reward_mean=179.8, reward_bound=200.0
39: loss=0.530, reward_mean=178.7, reward_bound=200.0
40: loss=0.532, reward_mean=192.1, reward_bound=200.0
41: loss=0.523, reward_mean=196.8, reward_bound=200.0
42: loss=0.540, reward_mean=200.0, reward_bound=200.0
Solved!
It usually doesn't take the agent more than 50 batches to solve the environment. My experiments show something from 25 to 45 iterations, which is really good learning performance (remember, we need to play only 16 episodes for every batch). TensorBoard shows our agent consistently making progress, pushing the reward boundary up at almost every batch (there are some periods of rolling down, but most of the time it improves).

Figure 3: Loss, reward boundary, and reward during the training
To check our agent in action, you can enable Monitor by uncommenting the next line after the environment creation. After restarting (possibly with xvfb-run to provide a virtual X11 display), our program will create a mon directory with videos recorded at different training steps:
rl_book_samples/Chapter04$ xvfb-run -s "-screen 0 640x480x24" ./01_cartpole.py
[2017-10-04 13:52:23,806] Making new env: CartPole-v0
[2017-10-04 13:52:23,814] Creating monitor directory mon
[2017-10-04 13:52:23,920] Starting new video recorder writing to mon/openaigym.video.0.4430.video000000.mp4
[2017-10-04 13:52:25,229] Starting new video recorder writing to mon/openaigym.video.0.4430.video000001.mp4
[2017-10-04 13:52:25,771] Starting new video recorder writing to mon/openaigym.video.0.4430.video000008.mp4
0: loss=0.682, reward_mean=18.9, reward_bound=20.5
[2017-10-04 13:52:26,297] Starting new video recorder writing to mon/openaigym.video.0.4430.video000027.mp4
1: loss=0.687, reward_mean=16.6, reward_bound=19.0
2: loss=0.677, reward_mean=21.1, reward_bound=21.0
[2017-10-04 13:52:26,964] Starting new video recorder writing to mon/openaigym.video.0.4430.video000064.mp4
3: loss=0.653, reward_mean=33.2, reward_bound=48.5
4: loss=0.642, reward_mean=37.4, reward_bound=42.5
.........
29: loss=0.561, reward_mean=111.6, reward_bound=122.0
30: loss=0.540, reward_mean=135.1, reward_bound=166.0
[2017-10-04 13:52:40,176] Starting new video recorder writing to mon/openaigym.video.0.4430.video000512.mp4
31: loss=0.546, reward_mean=147.5, reward_bound=179.5
32: loss=0.559, reward_mean=140.0, reward_bound=171.5
33: loss=0.558, reward_mean=160.4, reward_bound=200.0
34: loss=0.547, reward_mean=167.6, reward_bound=195.5
35: loss=0.550, reward_mean=179.5, reward_bound=200.0
36: loss=0.563, reward_mean=173.9, reward_bound=200.0
37: loss=0.542, reward_mean=162.9, reward_bound=200.0
38: loss=0.552, reward_mean=159.1, reward_bound=200.0
39: loss=0.548, reward_mean=189.6, reward_bound=200.0
40: loss=0.546, reward_mean=191.1, reward_bound=200.0
41: loss=0.548, reward_mean=199.1, reward_bound=200.0
Solved!
As you can see from the output, the Monitor periodically records the agent's activity into separate video files, which can give you an idea of what your agent's sessions look like.

Figure 4: Visualization of the CartPole state
Let's now pause a bit and think about what's just happened. Our neural network has learned how to play the environment purely from observations and rewards, without a single word of interpretation of the observed values. The environment could easily be not a cart with a stick but, say, a warehouse model with product quantities as an observation and money earned as a reward. Our implementation doesn't depend on environment details. This is the beauty of the RL model, and in the next section, we'll look at how exactly the same method can be applied to a different environment from the Gym collection.