How does the self-play trainer choose the learning agent?

Hi!

I’ve set up a self-play game, but for some reason ELO is consistently going down rather than up. I think it’s tied to how the learning agent is chosen.

I have two agents, let’s call them Left and Right. Left is assigned to Team 0 and Right to Team 1, and they appear in the order Left, then Right in the GameObject hierarchy. My camera follows player Left, so I assumed Left would be the training player and Right would be the snapshot swapped in from the past.

However, the mlagents-learn script shows that Helicopter?team=1 is the learning player. This is rather counterintuitive. When I see a win from the point of view of Left, the game records a drop in ELO for Right.

How does it determine the learning agent?

Why would it choose ?team=1 player over ?team=0?

Their weights are loaded in order from ?team=0 to ?team=1.

Furthermore, I think the Right agent gets into a downward spiral because it is always facing snapshots from its own past, all of which have a higher ELO and are therefore favored to win again. I also recorded the demonstration from the point of view of player Left, which might work against player Right.

OK, after some debugging, here is what I found:

env_manager.external_brains is a dict,
env_manager.external_brains.keys() returns a set-like view,
new_behavior_ids is a set built from env_manager.external_brains.keys().

And sets don’t guarantee iteration order. The dict stores the items in the correct order, but once the keys end up in a set, that order is no longer guaranteed.
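To illustrate, here is a minimal sketch using hypothetical behavior IDs modeled on the ones above (the string values stand in for the real brain specs):

```python
# Hypothetical behavior IDs mirroring the two teams in this post.
external_brains = {
    "Helicopter?team=0": "left-agent-spec",
    "Helicopter?team=1": "right-agent-spec",
}

# Dicts preserve insertion order (Python 3.7+), so the keys view
# still reflects the order the brains were registered in:
in_order = list(external_brains.keys())
assert in_order == ["Helicopter?team=0", "Helicopter?team=1"]

# But a set built from that view carries no ordering guarantee; its
# iteration order depends on string hashing and can vary between
# interpreter runs (hash randomization), so the first element picked
# out of it is not necessarily ?team=0.
new_behavior_ids = set(external_brains.keys())
```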

To fix it, I altered the for loop in trainer_controller._create_trainers_and_managers:

for behavior_id in sorted(behavior_ids):
    self._create_trainer_and_manager(env_manager, behavior_id)

I added the sorted() call. These behavior_ids are plain strings, so ?team=0 sorts before ?team=1.
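The reason this works is just lexicographic string comparison: '0' orders before '1', so the team=0 ID always comes out first no matter how the set happened to iterate. A quick check:

```python
# Set literal in deliberately "wrong" order; sorted() is
# deterministic regardless of set iteration order.
ids = {"Helicopter?team=1", "Helicopter?team=0"}
print(sorted(ids))  # ['Helicopter?team=0', 'Helicopter?team=1']
```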