
Step 7: Add a player trained with Reinforcement Learning using DQN

This part of the tutorial follows step 5 and step 6; make sure you have gone through either one of them before starting this one. Alternatively, the completed step 5 can be retrieved from the tutorial's repository.

In this step of the tutorial, we will go over yet another actor implementation, this time one that learns from its experience. We will implement an RPS player using Reinforcement Learning (RL), more precisely a Deep Q Network (DQN), one of the foundational algorithms of modern RL.

While we will explain some aspects of RL and DQN along the way, we won't go into all the details. Interested readers can refer to "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto or to the original Deep Q Network article linked above.

Creating an actor service

Back in step 4, we created a new implementation of the player actor class in the same service as the previous one. It was a sound choice for this implementation because it was small and didn't require additional dependencies. In some cases it makes more sense to create a fully separated service for a new actor implementation. This is what we will do here.

Start by copy/pasting the random_agent folder and name the copy dqn_agent. Then clean up dqn_agent/ to keep only a single actor implementation and name it dqn_agent. You should end up with something like the following.

import cog_settings
from data_pb2 import PlayerAction

import cogment

import asyncio
import random

async def dqn_agent(actor_session):
    # ...

async def main():
    print("Deep Q Network agents service up and running.")

    context = cogment.Context(cog_settings=cog_settings, user_id="rps")

    await context.serve_all_registered(cogment.ServedEndpoint(port=9000))

if __name__ == "__main__":
    asyncio.run(main())

Since we have created a new service we need to reference it at several places for everything to work properly. First, let's edit docker-compose.yaml to add the new service. To do that, simply add the following under the services key: it tells docker-compose about the new service.

dqn-agent:
    build:
        context: dqn_agent
        dockerfile: ../py_service.dockerfile

Then we need to edit cogment.yaml so that cogment run copy also copies files to the new service's directory, and so that cogment run build and cogment run start respectively build and start it. To do so, we change the copy, build and start keys under commands.

Note: the generate command is only needed if you are running things outside of docker; otherwise the code generation is done in the build step.

    copy: cogment copy cogment.yaml *.proto client environment random_agent dqn_agent
    # ...
    build: docker-compose build client dashboard metrics orchestrator environment random-agent dqn-agent
    # ...
    start: docker-compose up dashboard metrics orchestrator environment random-agent dqn-agent

Finally, the metrics server needs to know about this new data source. In metrics/prometheus.yml, add a new item under the scrape_configs key.

- job_name: "dqn-agent"
  dns_sd_configs:
      - names:
            - "dqn-agent"
        type: "A"
        port: 8000
        refresh_interval: 5s

Playing against the heuristic player

We will train our new player against the heuristic player we previously developed. We first need to update the trial config in cogment.yaml: player_1 will be our new actor implementation while player_2 will be the heuristic implementation. Trials will be 20 games long to generate enough meaningful data between each training step.

trial_params:
    environment:
        endpoint: grpc://environment:9000
        config:
            target_game_score: 2
            target_games_count: 20
    actors:
        - name: player_1
          actor_class: player
          implementation: dqn_agent
          endpoint: grpc://dqn-agent:9000
        - name: player_2
          actor_class: player
          implementation: heuristic_agent
          endpoint: grpc://random-agent:9000

We can also update client/ to run a bunch of trials sequentially.

async def main():
    print("Client starting...")

    context = cogment.Context(cog_settings=cog_settings, user_id="rps")

    # Create a controller
    controller = context.get_controller(endpoint=cogment.Endpoint("orchestrator:9000"))

    # Start a trial campaign
    for i in range(1000):
        trial_id = await controller.start_trial(trial_config=TrialConfig())
        print(f"Running trial #{i+1} with id '{trial_id}'")

        # Wait for the trial to end by itself
        async for trial_info in controller.watch_trials(
            trial_state_filters=[cogment.TrialState.ENDED]
        ):
            if trial_info.trial_id == trial_id:
                break

You can now build and run the application. It should take a few minutes to run as it goes through the trial campaign.

Implementing the Deep Q Network

Now that everything is set up, we can focus on implementing our DQN agent.

A Deep Q Network is a neural network that takes an observation as input and outputs the Q value for each action in the action space. The Q value is an estimate of the expected sum of future rewards if a given action is taken. The DQN agent's action policy is therefore to take the action with the largest predicted Q value. Let's start by implementing this part; we will deal with training the model afterwards.
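To make the greedy policy concrete, here is a minimal sketch in plain Python. The Q values are invented numbers standing in for the network's output, the ordering of the moves is an assumption, and `greedy_action` is a hypothetical helper that is not part of the tutorial's code.

```python
# Illustrative only: these Q values are made up, not produced by a model.
MOVE_NAMES = ["ROCK", "PAPER", "SCISSORS"]  # assumed ordering of the moves


def greedy_action(q_values):
    """Return the index of the action with the largest Q value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


q_values = [0.1, 0.7, -0.3]  # one estimate per move
action = greedy_action(q_values)
print(MOVE_NAMES[action])  # PAPER
```

The model built in the rest of this section plays the same role as the hard-coded `q_values` list: it produces one estimate per move, and the agent picks the index of the largest one.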

In the rest of this tutorial we will use TensorFlow and its Keras API for the model itself, as well as NumPy for data structures. Let's add these to dqn_agent/requirements.txt and import them at the top of dqn_agent/

import numpy as np
import tensorflow as tf

Let's get into the meat of the matter by implementing a function to create our model. We are using Keras functional API to create the following layers:

  1. Two scalar inputs, the last moves of the player and the opponent.
  2. Each input is one-hot encoded to avoid assuming an unwanted ordering and quantitative relationship between the moves.
  3. The two encoded inputs are concatenated to a single vector.
  4. A dense non-linear hidden layer is added.
  5. The output layer estimates the Q value for each move.

Everything then gets wrapped up and returned.

This function is then used to create a global _model that we will use in the actor implementation.

actions_count = len(MOVES)

def create_model():
    # 1. Input layers: one scalar per observation for each player's last move
    in_me_last_move = tf.keras.Input(name="obs_me_last_move", shape=(1,))
    in_them_last_move = tf.keras.Input(name="obs_them_last_move", shape=(1,))
    # 2. One hot encoding of the inputs (the same layer is shared by both)
    one_hot_move = tf.keras.layers.experimental.preprocessing.CategoryEncoding(
        name="one_hot_move", max_tokens=len(MOVES), output_mode="binary"
    )
    one_hot_me_last_move = one_hot_move(in_me_last_move)
    one_hot_them_last_move = one_hot_move(in_them_last_move)
    # 3. Concatenating the two inputs
    concat_ins = tf.keras.layers.concatenate(
        [one_hot_me_last_move, one_hot_them_last_move]
    )
    # 4. Dense hidden layer
    hidden_layer = tf.keras.layers.Dense(24, activation="relu")(concat_ins)
    # 5. Output: one Q value per possible move
    outs = tf.keras.layers.Dense(actions_count, activation="linear")(hidden_layer)
    return tf.keras.Model(
        inputs=[in_me_last_move, in_them_last_move], outputs=outs, name="rps_dqn_policy"
    )

_model = create_model()
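The encoding performed by steps 2 and 3 of the model can be illustrated without TensorFlow. The sketch below is plain Python; `one_hot` and `encode_observation` are hypothetical helpers, and the move indices 0 to 2 are an assumption about how the moves are numbered.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def encode_observation(me_last_move, them_last_move, moves_count=3):
    """Concatenate the one-hot encodings of both players' last moves."""
    return one_hot(me_last_move, moves_count) + one_hot(them_last_move, moves_count)


features = encode_observation(me_last_move=0, them_last_move=2)
print(features)  # [1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

With three moves, the hidden layer therefore receives a vector of six values in which exactly two are set, so no move is treated as "larger" than another.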

The other piece of the puzzle is implementing a small function that will convert our observations into inputs for the model we just created. As most of the encoding is handled by the model itself it's fairly straightforward.

def model_ins_from_observations(observations):
    return {
        "obs_me_last_move": np.array(
            [[o.snapshot.me.last_move] for o in observations]
        ),
        "obs_them_last_move": np.array(
            [[o.snapshot.them.last_move] for o in observations]
        ),
    }

Finally we can make it work together by replacing the random choice of action by the use of the model. At the moment the model will just use the random initialization weights so don't expect much!

Here is how the event loop in the dqn_agent function will need to be updated:

  1. Use model_ins_from_observations to compute the model inputs,
  2. Use the model in inference mode to compute the q value of each of the possible actions,
  3. Finally, do the action having the largest q value.

if event.observation:
  model_ins = model_ins_from_observations([event.observation])
  if event.type == cogment.EventType.ACTIVE:
    model_outs = _model(model_ins, training=False)
    action = tf.math.argmax(model_outs[0]).numpy()

You can now build and run the application. It should take a few minutes to run as it goes through the trial campaign.

In this example we define _model (and other variables in the following sections) as global mutable variables. It works in our case because the dqn agents are neither distributed nor multithreaded.

Random exploration

With the previous code, you might have noticed that the agent always plays the same action given the same set of observations. This is because the weights of the model are fixed. However, especially at the beginning of the training process, we want the agent to experience a variety of situations. We address this issue by introducing a decaying exploration rate, epsilon.

First we will define the parameters for this epsilon value as global variables: its minimum value, its maximum and initial value and its decay per tick. We also define as a global variable the current value of epsilon. You can add the following after the imports in dqn_agent/

epsilon_min = 0.05
epsilon_max = 1.0
epsilon_decay_per_tick = (
  epsilon_max - epsilon_min
) / 1000.0  # Linearly reach the lowest exploration rate after 1000 ticks

_epsilon = epsilon_max

We then create a simple function that retrieves and updates _epsilon; it can be called every time an action needs to be taken.

def get_and_update_epsilon():
  global _epsilon
  current_epsilon = _epsilon
  _epsilon -= epsilon_decay_per_tick
  _epsilon = max(_epsilon, epsilon_min)
  return current_epsilon
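As a quick check of the schedule, the sketch below computes the exploration rate for a given tick in closed form. It reuses the tutorial's parameter values, but `epsilon_at` is a hypothetical helper written for illustration; the agent itself keeps using the stateful get_and_update_epsilon.

```python
EPSILON_MIN = 0.05
EPSILON_MAX = 1.0
DECAY_PER_TICK = (EPSILON_MAX - EPSILON_MIN) / 1000.0


def epsilon_at(tick):
    """Exploration rate used for the action taken at `tick` (0-based)."""
    return max(EPSILON_MIN, EPSILON_MAX - tick * DECAY_PER_TICK)


# Print the rate at the start, halfway, at the floor, and well past it.
for tick in (0, 500, 1000, 5000):
    print(tick, round(epsilon_at(tick), 3))
```

The rate starts at 1.0 (every action is random), is roughly halfway down after 500 ticks, and stays at the 0.05 floor from tick 1000 onwards.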

This function can then be used to occasionally do random actions, to facilitate the exploration. To do that, we need to slightly modify how the actions are computed and submitted.

if event.type == cogment.EventType.ACTIVE:
  if np.random.rand(1)[0] < get_and_update_epsilon():
    # Take random action
    action = np.random.choice(actions_count)
  else:
    # Take the action with the largest predicted Q value
    model_outs = _model(model_ins, training=False)
    action = tf.math.argmax(model_outs[0]).numpy()
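The same epsilon-greedy decision can be sketched in isolation with the standard library only. `epsilon_greedy_action` is a hypothetical helper and the Q values are invented; passing a seeded random.Random instance is a choice made here to keep runs repeatable.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick a random action with probability `epsilon`, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


q_values = [0.1, 0.7, -0.3]  # invented Q values, one per move
rng = random.Random(0)  # seeded so that runs are repeatable

# With epsilon=0.0 the random branch is never taken: the result is greedy.
print(epsilon_greedy_action(q_values, epsilon=0.0, rng=rng))  # 1
```

With epsilon set to 1.0 the helper ignores the Q values entirely, which is the situation of the agent at the very start of training.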

You can now build and run the application. Nothing should appear different at this stage.

Replay buffer

In our journey to train a model, the next stage is to build an experience replay buffer that collects action/observation/reward triples over the course of the trials. This data can then be used to train the model.

We will start by creating the data structure. We are using a column-oriented structure relying on NumPy arrays, as they interoperate easily with TensorFlow and support the needed manipulation primitives. Each row is a sample corresponding to one tick: the received observation and reward, the selected action, as well as the next tick's received observation.

def create_replay_buffer():
  return {
    "obs_me_last_move": np.array([]),
    "obs_them_last_move": np.array([]),
    "action": np.array([]),
    "reward": np.array([]),
    "next_obs_me_last_move": np.array([]),
    "next_obs_them_last_move": np.array([]),
  }

_rb = create_replay_buffer()

During each trial the agent will collect its data points in a trial replay buffer then append it to the global one. To achieve that we will first create the function in charge of the appending then collect data during the trial and call the "append" function.

The following function will take a trial replay buffer and append it to the global _rb. To avoid memory overflow the replay buffer size is capped.

_collected_samples_count = 0
max_replay_buffer_size = 100000

def append_trial_replay_buffer(trial_rb):
  global _rb
  global _collected_samples_count

  trial_rb_size = len(trial_rb["obs_me_last_move"])

  for key in _rb.keys():
    # Append the trial data to the current vector
    _rb[key] = np.append(_rb[key], trial_rb[key])
    # Enforce the size limit by discarding older data
    if len(_rb[key]) > max_replay_buffer_size:
        _rb[key] = _rb[key][-max_replay_buffer_size:]

  _collected_samples_count += trial_rb_size
  rb_size = len(_rb["obs_me_last_move"])

  # Sanity check, all vectors in the replay buffer should have the same size
  for key in _rb.keys():
    assert rb_size == len(_rb[key])

  print(
    f"{trial_rb_size} new samples stored after a trial, now having {rb_size} samples over a total of {_collected_samples_count} collected samples."
  )
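The capping rule, which keeps only the most recent samples, can be seen on a tiny example. The sketch below uses collections.deque with maxlen from the standard library; this is an alternative shown for illustration, not what the tutorial uses (the tutorial slices NumPy arrays so that whole columns can be sampled at once), and the cap of 5 is chosen only to keep the output short.

```python
from collections import deque

MAX_SIZE = 5  # tiny cap for the demo; the tutorial uses 100000

buffer = deque(maxlen=MAX_SIZE)
buffer.extend(range(4))     # first trial: samples 0..3
buffer.extend(range(4, 8))  # second trial: samples 4..7

# The three oldest samples were discarded to respect the cap.
print(list(buffer))  # [3, 4, 5, 6, 7]
```

This gives the same result as slicing with [-MAX_SIZE:] after appending, which is the approach taken in append_trial_replay_buffer.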

The dqn_agent function can then be updated to collect received observations, rewards and sent actions. By default every action gets a zero reward. When a reward for a specific tick is received, its value gets updated.

async def dqn_agent(actor_session):

  trial_rb = create_replay_buffer()

  async for event in actor_session.event_loop():
    if event.observation:
      model_ins = model_ins_from_observations([event.observation])
      if event.type == cogment.EventType.ACTIVE:
        # [...]
        trial_rb["obs_me_last_move"] = np.append(
            trial_rb["obs_me_last_move"], model_ins["obs_me_last_move"]
        )
        trial_rb["obs_them_last_move"] = np.append(
            trial_rb["obs_them_last_move"], model_ins["obs_them_last_move"]
        )
        trial_rb["action"] = np.append(trial_rb["action"], [action])
        trial_rb["reward"] = np.append(trial_rb["reward"], [0.0])
      else:
        # Final observation: no action is taken, but it is kept to serve
        # as the "next observation" of the last sample
        trial_rb["obs_me_last_move"] = np.append(
            trial_rb["obs_me_last_move"], model_ins["obs_me_last_move"]
        )
        trial_rb["obs_them_last_move"] = np.append(
            trial_rb["obs_them_last_move"], model_ins["obs_them_last_move"]
        )
    for reward in event.rewards:
      trial_rb["reward"][reward.tick_id] = reward.value

  # Shifting the observations to get the next observations
  trial_rb["next_obs_me_last_move"] = trial_rb["obs_me_last_move"][1:]
  trial_rb["next_obs_them_last_move"] = trial_rb["obs_them_last_move"][1:]
  # Dropping the last row, as it only contains the last observations
  trial_rb["obs_me_last_move"] = trial_rb["obs_me_last_move"][:-1]
  trial_rb["obs_them_last_move"] = trial_rb["obs_them_last_move"][:-1]

  # Store the samples of this trial in the global replay buffer
  append_trial_replay_buffer(trial_rb)
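The shifting step at the end of the function can be hard to picture, so here is a small sketch on placeholder data. `to_transitions` is a hypothetical helper working on plain lists; it assumes, as in the code above, that there is one more observation than there are actions.

```python
def to_transitions(observations, actions, rewards):
    """Pair each observation with the one that followed it.

    `observations` has one more entry than `actions` and `rewards`,
    because the final observation has no associated action.
    """
    current = observations[:-1]   # drop the last observation
    following = observations[1:]  # shift by one tick
    return list(zip(current, actions, rewards, following))


observations = ["o0", "o1", "o2", "o3"]  # placeholder labels
actions = ["a0", "a1", "a2"]
rewards = [0.0, 1.0, -1.0]

for transition in to_transitions(observations, actions, rewards):
    print(transition)
```

Each printed tuple is one row of the replay buffer: the observation, the action taken, the reward received, and the observation that came next.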

You can now build and run the application. The behavior should be the same but the log should confirm that data gets accumulated.


All the pieces are now in place, so we can implement the training proper. The function is a standard implementation of DQN and is decomposed into 4 steps:

  1. Select a random batch of samples from the replay buffer
  2. Compute the target Q value for each sample from the received reward and the next observation using a previous version of the model.
  3. (Re)compute the estimated Q value of each sample from the selected action and observation using the current version of the model.
  4. Perform an optimization step of the model parameters to reduce the loss between the samples' estimated and target Q values.

batch_size = 50  # Size of batch taken from replay buffer
gamma = 0.99  # Discount factor for future rewards
optimizer = tf.keras.optimizers.Adam(learning_rate=0.00025, clipnorm=1.0)
loss_function = tf.keras.losses.Huber()
target_model_update_interval = 1000

_target_model = create_model()

def train():
  global _model
  global _target_model

  rb_size = len(_rb["obs_me_last_move"])

  if rb_size >= batch_size:
    # Step 1 - Randomly select a batch
    batch_indices = np.random.choice(range(rb_size), size=batch_size)
    batch_rb = create_replay_buffer()
    for key in batch_rb.keys():
        batch_rb[key] = np.take(_rb[key], batch_indices)

    # Step 2 - Compute target q values
    ## Predict the expected reward for the next observation of each sample
    ## Use the target model for stability
    target_actions_q_values = _target_model(
      {
        "obs_me_last_move": batch_rb["next_obs_me_last_move"],
        "obs_them_last_move": batch_rb["next_obs_them_last_move"],
      }
    )

    ## target Q value = reward + discount factor * expected future reward
    target_q_values = batch_rb["reward"] + gamma * tf.reduce_max(
      target_actions_q_values, axis=1
    )

    # Step 3 - Compute estimated q values
    ## Create masks of the taken actions to later select relevant q values
    ## (tf.one_hot needs integer indices; the replay buffer stores floats)
    selected_actions_masks = tf.one_hot(
      batch_rb["action"].astype(np.int32), actions_count
    )

    with tf.GradientTape() as tape:
      ## Recompute q values for all the actions at each sample
      estimated_actions_q_values = _model(
        {
          "obs_me_last_move": batch_rb["obs_me_last_move"],
          "obs_them_last_move": batch_rb["obs_them_last_move"],
        }
      )

      ## Apply the masks to get the Q value for taken actions
      estimated_q_values = tf.reduce_sum(
        tf.multiply(estimated_actions_q_values, selected_actions_masks), axis=1
      )

      ## Compute loss between the target Q values and the estimated Q values
      loss = loss_function(target_q_values, estimated_q_values)

    # Step 4 - Backpropagation!
    grads = tape.gradient(loss, _model.trainable_variables)
    optimizer.apply_gradients(zip(grads, _model.trainable_variables))

    # Update the target model with the weights of the trained model
    if _collected_samples_count % target_model_update_interval == 0:
      _target_model.set_weights(_model.get_weights())
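The arithmetic behind steps 2 and 3 can be worked through on a single sample in plain Python. All the numbers below are invented, and `td_target` and `selected_q_value` are hypothetical helpers that mirror what the tensor operations compute for each row of the batch.

```python
GAMMA = 0.99  # same discount factor as in the tutorial


def td_target(reward, next_q_values, gamma=GAMMA):
    """Target Q value: reward + discounted best Q value of the next state."""
    return reward + gamma * max(next_q_values)


def selected_q_value(q_values, action):
    """Q value of the action actually taken (what the one-hot mask selects)."""
    return q_values[action]


# Invented numbers for a single sample
reward = 1.0
next_q_values = [0.2, 0.5, -0.1]    # would come from the target model
current_q_values = [0.3, 0.9, 0.0]  # would come from the trained model
action = 1

target = td_target(reward, next_q_values)
estimate = selected_q_value(current_q_values, action)
print(target, estimate, target - estimate)
```

Here the target is 1.0 + 0.99 * 0.5, about 1.495, while the current estimate for the taken action is 0.9. The optimization step nudges the model so that this gap shrinks, and the Huber loss limits the influence of samples where the gap is very large.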

This function then needs to be called at the end of each trial after the call to append_trial_replay_buffer.

You can now build and run the application. The dqn agent will start to learn and quickly prevail against the heuristic implementation.

This can be observed by opening the dashboard at http://localhost:3003 and opening the reward page. You should be able to track the progression of the dqn implementation.

Cumulative reward by agent type diagram showing the dqn implementation prevailing against the heuristic agent

This concludes step 7 of the tutorial: you implemented your first trained actor!