Step 3: Upgrade our components

Improve the Environment

Building on the client implementation from the previous step, we can display each player's previous decision by adding previous_p1_action and previous_p2_action to the observation. Modify the .proto file as follows:

data.proto

...
message Observation {
    int32 p1_score = 1;
    int32 p2_score = 2;
    PlayerAction previous_p1_action = 3;
    PlayerAction previous_p2_action = 4;
}
...

Whenever a .proto file is modified, the configuration must be regenerated:

cogment generate --python_dir=.
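Once the files are regenerated, the new fields are available from the generated data_pb2 module. As an optional sanity check, here is a minimal sketch (assuming it is run from the project root, where data_pb2.py is generated):

# Optional sanity check of the regenerated protobuf module (hypothetical snippet).
from data_pb2 import Observation, PlayerAction, ROCK, PAPER

obs = Observation()
obs.previous_p1_action.CopyFrom(PlayerAction(decision=ROCK))
obs.previous_p2_action.CopyFrom(PlayerAction(decision=PAPER))
print(obs)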

Modify the environment file as follows so that these values are updated at each update:

env.py

    ...
    def update(self, actions):
        print("environment updating")

        # Record each player's latest action so the client can display it.
        self.observation.previous_p1_action.CopyFrom(actions.player[0])
        self.observation.previous_p2_action.CopyFrom(actions.player[1])

        p1_decision = actions.player[0].decision
    ...
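Note that CopyFrom() is required here: protobuf's generated Python classes do not allow assigning a message-typed field directly, so writing self.observation.previous_p1_action = actions.player[0] would raise an AttributeError.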

Send feedback

Rewarding the agent is an important part of training a model. In this example, after each update, the environment will send feedback to both players.

The orchestrator calculates the average of all the feedback values in order to create a single reward.
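As a rough illustration of that aggregation (this is not the orchestrator's actual code, and the confidence weighting shown is an assumption):

# Conceptual sketch of how several feedback values collapse into a single reward.
def aggregate_feedback(feedback):
    """feedback: list of (value, confidence) pairs gathered during one step."""
    total_confidence = sum(c for _, c in feedback)
    return sum(v * c for v, c in feedback) / total_confidence

print(aggregate_feedback([(1, 1.0)]))            # one source  -> 1.0
print(aggregate_feedback([(1, 1.0), (0, 1.0)]))  # two sources -> 0.5

The environment itself only has to emit the per-player feedback values: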

env.py

    ...
    def update(self, actions):
        ...
        print(f"p1 played {p1_decision} - p2 played {p2_decision}")

        # Zero-sum feedback: the winner gets +1, the loser -1, a tie gives 0 to both.
        p1_feedback = 0
        p2_feedback = 0
        if p1_decision == p2_decision:
            # Tie: no score change, no feedback.
            pass
        elif ((p1_decision == ROCK and p2_decision == SCISSOR) or
              (p1_decision == PAPER and p2_decision == ROCK) or
              (p1_decision == SCISSOR and p2_decision == PAPER)):
            # p1 wins this round.
            self.observation.p1_score += 1
            p1_feedback = 1
            p2_feedback = -p1_feedback
        else:
            # p2 wins this round.
            self.observation.p2_score += 1
            p1_feedback = -1
            p2_feedback = -p1_feedback

        print(f"p1 score {self.observation.p1_score} - p2 score {self.observation.p2_score}")

        self.trial.actors.player[0].add_feedback(value=p1_feedback, confidence=1)
        self.trial.actors.player[1].add_feedback(value=p2_feedback, confidence=1)

        return self.observation
    ...

Restart the environment by running:

docker-compose restart env

Improve the Agent

This section outlines how to include a supervised learning model for RPS that predicts actions for the player. Download the model to the root of the project (the “rps” folder) from GitLab.

New Python dependencies are required, so we’ll update the Dockerfile:

Dockerfile

FROM python:3.7

RUN pip install cogment \
    keras==2.2.4 \
    tensorflow==1.14.0 \
    numpy==1.17.2

WORKDIR /app

Rebuild the image:

docker-compose build

Update the agent:

player.py

import cog_settings
import numpy as np
import tensorflow as tf

from collections import deque
from keras.models import load_model
from data_pb2 import PlayerAction, NONE, ROCK, PAPER, SCISSOR
from cogment import Agent, GrpcServer

# Keras + TF 1.x: keep a handle on the default graph so that predictions made
# from the server's worker threads run in the graph the model was loaded into.
graph = tf.get_default_graph()
model = load_model('model_supervised.h5')


def GetPrediction(moves):
    # The model expects a fixed-length input of 18 moves; pad the history with zeros.
    if len(moves) < 18:
        moves = moves + ([0] * (18 - len(moves)))

    moves_np = np.array([moves])

    with graph.as_default():
        prediction = model.predict(moves_np).argmax()

    return prediction


def GetAgentMove(moves):
    # The model predicts the next move; the agent plays the move that beats it.
    prediction = GetPrediction(moves)
    to_play = {
        ROCK: PAPER,
        PAPER: SCISSOR,
        SCISSOR: ROCK,
    }

    return to_play[prediction]


class Player(Agent):
    VERSIONS = {"player": "2.0.0"}
    actor_class = cog_settings.actor_classes.player

    def __init__(self, trial, actor):
        super().__init__(trial, actor)
        self.moves_history = deque(maxlen=18)

    def decide(self, observation):
        # p1 is the AI and p2 the human.
        # Skip the very first observation, where there is no previous action yet.
        if observation.previous_p2_action.decision != NONE:
            self.moves_history.append(observation.previous_p1_action.decision)
            self.moves_history.append(observation.previous_p2_action.decision)

        action = PlayerAction()
        action.decision = GetAgentMove(list(self.moves_history))

        print(f"Player decide {action.decision}")
        return action

    def reward(self, reward):
        print("Player reward")

    def end(self):
        print("Player end")


if __name__ == '__main__':
    server = GrpcServer(Player, cog_settings)
    server.serve()
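To sanity-check the model outside of a trial, the helper functions can be exercised directly from inside the agent container. A hypothetical check (it assumes model_supervised.h5 sits next to player.py, that the history alternates p1/p2 decisions as in decide(), and that the model's output classes line up with the PlayerAction enum, as the to_play lookup above already assumes):

# check_model.py -- hypothetical offline check, to be run inside the agent container
from player import GetAgentMove
from data_pb2 import ROCK, PAPER, SCISSOR

history = [ROCK, PAPER, PAPER, SCISSOR]  # p1, p2, p1, p2, ...
print(GetAgentMove(history))             # the counter-move to the predicted action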