Step 2: Implement a first actor and environment

This part of the tutorial follows step 1; make sure you've gone through it before starting this one. Alternatively, the completed step 1 can be retrieved from the tutorial's repository.

In this step of the tutorial, we will implement the (very simple) decision logic for the random player, as well as the base mechanics of RPS (i.e. the rules of the game) in the environment service.

Random player agent

In the rps directory, the random_agent directory contains the Python implementation of the eponymous service. You'll find two files here:

  • requirements.txt is a pip requirements file defining the dependencies of the service. For the moment it only lists cogment, Cogment's Python SDK.
  • The other file contains the implementation of the service.

Open it and take a look at the generated content.

At the bottom you'll find the main function. It initializes Cogment's context, registers the random_agent actor's implementation, then starts the service itself on TCP port 9000 and awaits its termination.

Cogment's Python SDK leverages Python's asynchronous features, so you'll need a basic understanding of them.

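async def main():
  print("Random-Agent actor service up and running.")

  context = cogment.Context(cog_settings=cog_settings, user_id="rps")
  # Register the actor implementation under the name "random_agent"
  context.register_actor(impl=random_agent, impl_name="random_agent")

  # Serve the registered implementation on TCP port 9000 until termination
  await context.serve_all_registered(cogment.ServedEndpoint(port=9000))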
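
As a reminder of the asynchronous features mentioned above, here is a minimal sketch that uses only Python's standard asyncio module. The fake_service coroutine is a hypothetical stand-in for a long-running awaitable such as serving an endpoint; it is not part of Cogment.

```python
import asyncio


async def fake_service(name, delay):
    # Hypothetical stand-in for a long-running awaitable
    # (for example, a service waiting for its own termination).
    await asyncio.sleep(delay)
    return f"{name} terminated"


async def main():
    # "await" suspends main() until the awaited coroutine completes,
    # without blocking the event loop in the meantime.
    return await fake_service("random_agent", 0.01)


# asyncio.run creates an event loop, runs main() to completion, then closes the loop.
result = asyncio.run(main())
print(result)
```

The same pattern applies to the main function above: it is a coroutine, and it awaits the served endpoint until the service terminates.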

At the beginning of the file, the function random_agent is the actor's implementation. This function is called once per actor and per trial and handles the full lifetime of the actor, in three phases:

  • The actor's initialization, before the async for. This is where, for example, the actor's internal data can be defined before calling actor_session.start() to notify that it is ready.
  • Its event loop, the content of the async for. This is where the implementation of the actor's response to various events resides.
  • Its termination, after the async for.
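
The three phases can be illustrated without Cogment at all. The sketch below uses a hypothetical FakeSession class whose event_loop method is a plain async generator; it only mimics the shape of an actor session and makes no claim about how the SDK implements it.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for an actor session; not Cogment's class."""

    def __init__(self, events):
        self._events = events
        self.started = False

    def start(self):
        self.started = True

    async def event_loop(self):
        # An async generator: events are yielded one at a time as they "arrive".
        for event in self._events:
            await asyncio.sleep(0)
            yield event


async def toy_actor(session, log):
    # 1. Initialization: set up internal data, then signal readiness.
    seen = 0
    session.start()

    # 2. Event loop: react to each incoming event.
    async for event in session.event_loop():
        seen += 1
        log.append(f"event: {event}")

    # 3. Termination: reached once the event loop is exhausted.
    log.append(f"done after {seen} events")


log = []
session = FakeSession(["obs-1", "obs-2"])
asyncio.run(toy_actor(session, log))
print(log)
```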

The generated implementation is very simple:

  • it handles the three main kinds of events: observations, rewards and messages,
  • it does a default action whenever required, i.e. in response to an observation.

We will learn more about how to use observations in step 4 and rewards in step 3. Messages are out of scope for this basic tutorial.

Please note the import and usage of PlayerAction, which is the data structure from data.proto defining the actor's action space.

async def random_agent(actor_session):
    # Initialization: notify that the actor is ready
    actor_session.start()

    async for event in actor_session.event_loop():
        if event.observation:
            observation = event.observation
            print(f"'{actor_session.name}' received an observation: '{observation}'")
            if event.type == cogment.EventType.ACTIVE:
                # Default action, sent in response to the observation
                action = PlayerAction()
                actor_session.do_action(action)
        for reward in event.rewards:
            print(f"'{actor_session.name}' received a reward for tick #{reward.tick_id}: {reward.value}")
        for message in event.messages:
            print(f"'{actor_session.name}' received a message from '{message.sender_name}': - '{message.payload}'")

Our goal is to implement an actor playing at random. We first need to import the different Move values, as defined in our data structures. We also need to import random, the Python package for generating random numbers.

from data_pb2 import ROCK, PAPER, SCISSORS

import random

# All the moves the player can choose from
MOVES = [ROCK, PAPER, SCISSORS]


Once this is available, we can simply update the decision-making part of the actor's implementation to compute a random move whenever one is needed.

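if event.observation:
    observation = event.observation
    print(f"'{actor_session.name}' received an observation: '{observation}'")
    if event.type == cogment.EventType.ACTIVE:
        # Pick a move uniformly at random and send it as this actor's action
        action = PlayerAction(move=random.choice(MOVES))
        actor_session.do_action(action)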
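
To see the random selection in isolation, here is a minimal sketch that runs without the generated data_pb2 module. The integer values given to ROCK, PAPER and SCISSORS are stand-ins chosen for this example, and pick_move is a hypothetical helper; the seed is there only to make the demo repeatable.

```python
import random
from collections import Counter

# Stand-in values for the enum constants normally imported from data_pb2 (assumed).
ROCK, PAPER, SCISSORS = 0, 1, 2
MOVES = [ROCK, PAPER, SCISSORS]


def pick_move(rng=random):
    # Pick one of the three moves uniformly at random.
    return rng.choice(MOVES)


rng = random.Random(42)  # seeded only so the demo is repeatable
counts = Counter(pick_move(rng) for _ in range(3000))
print(counts)
```

Each move should be picked roughly a third of the time, which is what makes this player a fair but unpredictable opponent.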

Modify the random_agent/ file to include the above additions.

Implementing the rules of the game

In the rps directory, the environment directory contains the Python implementation of the eponymous service. As with the actor's service, you will find two files here: requirements.txt and the file containing the implementation.

Open the latter and take a look at the generated content.

The code is very similar to the random_agent's. In the main function, register_environment is used instead of register_actor. The implementation function, called environment here, is structured similarly to the actor's but handles two kinds of events: actions (and the last actions of a trial, final_actions) and messages. Environments don't perform actions; they produce observations that are sent to the actors participating in the trial.

Please note the import and usage of Observation, which is the data structure defined in data.proto that describes the actors' observation space.

async def environment(environment_session):
    print("environment starting")
    # Create the initial observation
    observation = Observation()

    # Start the trial and send that observation to all actors
    environment_session.start([("*", observation)])

    async for event in environment_session.event_loop():
        if event.actions:
            actions = event.actions
            print(f"environment received the actions")
            for actor, recv_action in zip(environment_session.get_active_actors(), actions):
                print(f" actor '{actor.actor_name}' did action '{recv_action.action}'")
            observation = Observation()
            if event.type == cogment.EventType.ACTIVE:
                # The trial is active
                environment_session.produce_observations([("*", observation)])
            else:
                # The trial termination has been requested
                environment_session.end([("*", observation)])
        for message in event.messages:
            print(f"environment received a message from '{message.sender_name}': - '{message.payload}'")

    print("environment end")

Our goal, in this section, is to implement how the environment computes the observations from the actions done by the actors at a given timestep.

We first import the needed data structure and define a dictionary mapping each move to the move that defeats it.

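from data_pb2 import PlayerState, ROCK, PAPER, SCISSORS

# Each move maps to the move that defeats it
DEFEATS = {
    ROCK: PAPER,
    SCISSORS: ROCK,
    PAPER: SCISSORS
}

# Human-readable move names, used when printing each round
MOVES_STR = {ROCK: "rock", PAPER: "paper", SCISSORS: "scissors"}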


In the initialization phase of the environment implementation, i.e. before the async for, we create a simple state data structure that keeps track of the number of rounds played and the number won by each of the two players.

We then compute the initial observation for each of the two players. One instance of PlayerState is created per player; each instance serves as the me state in that player's observation and as the them state in the opponent's.

state = {
    "rounds_count": 0,
    "p1": {
        "won_rounds_count": 0
    },
    "p2": {
        "won_rounds_count": 0
    },
}
print("environment starting")
[p1, p2] = environment_session.get_active_actors()
p1_state = PlayerState(won_last=False, last_move=None)
p2_state = PlayerState(won_last=False, last_move=None)
# Start the trial and send each player its own initial observation
environment_session.start([
    (p1.actor_name, Observation(me=p1_state, them=p2_state)),
    (p2.actor_name, Observation(me=p2_state, them=p1_state)),
])

In the event loop, we implement how the environment produces observations based on the actors' actions.

We start by retrieving each player's action and computing who won the round. Then, we update the internal state. Finally, we produce up-to-date observations for the players.

if event.actions:
    [p1_action, p2_action] = [recv_action.action for recv_action in event.actions]
    print(f"{p1.actor_name} played {MOVES_STR[p1_action.move]}")
    print(f"{p2.actor_name} played {MOVES_STR[p2_action.move]}")

    # Compute who wins, if the two players had the same move, nobody wins
    p1_state = PlayerState(
        won_last=p1_action.move == DEFEATS[p2_action.move],
        last_move=p1_action.move
    )
    p2_state = PlayerState(
        won_last=p2_action.move == DEFEATS[p1_action.move],
        last_move=p2_action.move
    )
    # Update the internal state
    state["rounds_count"] += 1
    if p1_state.won_last:
        state["p1"]["won_rounds_count"] += 1
        print(f"{p1.actor_name} wins!")
    elif p2_state.won_last:
        state["p2"]["won_rounds_count"] += 1
        print(f"{p2.actor_name} wins!")

    # Generate and send observations
    observations = [
        (p1.actor_name, Observation(me=p1_state, them=p2_state)),
        (p2.actor_name, Observation(me=p2_state, them=p1_state)),
    ]
    if event.type == cogment.EventType.ACTIVE:
        # The trial is active
        environment_session.produce_observations(observations)
    else:
        # The trial termination has been requested
        environment_session.end(observations)
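
The win rule used above, where a player wins when their move is the one that defeats the opponent's, can be checked in isolation. The sketch below is a standalone illustration: the integer values for the moves are stand-ins for the data_pb2 constants, and resolve_round is a hypothetical helper that is not part of the tutorial's code.

```python
# Stand-in values for the enum constants normally imported from data_pb2 (assumed).
ROCK, PAPER, SCISSORS = 0, 1, 2

# Each move maps to the move that defeats it.
DEFEATS = {ROCK: PAPER, PAPER: SCISSORS, SCISSORS: ROCK}


def resolve_round(p1_move, p2_move):
    """Return (p1_won, p2_won); both are False on a draw."""
    p1_won = p1_move == DEFEATS[p2_move]
    p2_won = p2_move == DEFEATS[p1_move]
    return p1_won, p2_won


print(resolve_round(PAPER, ROCK))     # (True, False): paper defeats rock
print(resolve_round(ROCK, ROCK))      # (False, False): same move, nobody wins
print(resolve_round(SCISSORS, ROCK))  # (False, True): rock defeats scissors
```

Because no move defeats itself, identical moves leave both flags False, which is how draws are handled without a dedicated branch.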

Finally, in the termination phase, we print some stats about the trial itself.

print("environment end")
print(f"\t * {state['rounds_count']} rounds played")
print(f"\t * {p1.actor_name} won {state['p1']['won_rounds_count']} rounds")
print(f"\t * {p2.actor_name} won {state['p2']['won_rounds_count']} rounds")
print(f"\t * {state['rounds_count'] - state['p1']['won_rounds_count'] - state['p2']['won_rounds_count']} draws")

Modify the environment/ file to include the above additions. Please note that this code makes assumptions about the number of actors and their classes. Production code should handle non-standard cases more robustly.

You can now build and run the application. Given the nature of the game and the fully random plays, you should see around 1/3 of the rounds won by player 1, 1/3 won by player 2, and 1/3 ending in a draw.

This concludes step 2 of the tutorial: you implemented your first actor and your first environment.
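
The 1/3 split follows from counting: two uniformly random players produce 9 equally likely move pairs, of which 3 are draws, 3 are wins for player 1 and 3 are wins for player 2. As a quick sanity check, here is a small standalone simulation. It does not use Cogment; the integer move values and the simulate helper are assumptions made for this sketch.

```python
import random

# Stand-in values for the enum constants normally imported from data_pb2 (assumed).
ROCK, PAPER, SCISSORS = 0, 1, 2
MOVES = [ROCK, PAPER, SCISSORS]
DEFEATS = {ROCK: PAPER, PAPER: SCISSORS, SCISSORS: ROCK}


def simulate(rounds, seed=0):
    # Play the given number of rounds between two uniformly random players.
    rng = random.Random(seed)
    tally = {"p1": 0, "p2": 0, "draws": 0}
    for _ in range(rounds):
        p1_move, p2_move = rng.choice(MOVES), rng.choice(MOVES)
        if p1_move == DEFEATS[p2_move]:
            tally["p1"] += 1
        elif p2_move == DEFEATS[p1_move]:
            tally["p2"] += 1
        else:
            tally["draws"] += 1
    return tally


tally = simulate(30000)
# Print the share of each outcome; each should be close to 1/3.
print({key: round(value / 30000, 3) for key, value in tally.items()})
```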

Let’s move on to learning more about rewards in step 3.