A Self-Play Environment with 2 Agents and 2 Policies

My understanding of the self-play environment with TWO AGENTS (Agent1, Agent2) and ONE POLICY (a shared policy) is as follows.

• Agent1 has a policy that is updated over several training steps (controlled by the self-play parameter save_steps).
• Agent2 plays a fixed policy that was learned in the past (controlled by the self-play parameter window).
• The learning agent is switched every several training steps (controlled by the self-play parameter team_change); that is, Agent2 becomes the learning agent.
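For reference, these parameters live under a `self_play` block in the trainer configuration file. The sketch below is only an illustration: the behavior name and the specific values are placeholders I made up, and defaults may differ between ML-Agents versions.

```yaml
behaviors:
  ChessAgent:                    # hypothetical behavior name
    trainer_type: ppo
    # ... usual trainer hyperparameters ...
    self_play:
      save_steps: 20000          # how often a policy snapshot is saved
      window: 10                 # how many past snapshots opponents are sampled from
      play_against_latest_model_ratio: 0.5
      swap_steps: 2000           # how often the opponent snapshot is swapped
      team_change: 100000        # how often the learning team switches (asymmetric games)
```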

In contrast, how does the environment work with TWO AGENTS and TWO POLICIES (Agent1 has policy1 and Agent2 has policy2)?

• First, which policy will Agent1 train against? (A past snapshot of Agent1's policy1, or Agent2's policy2?)
• After the training team changes, which policy will Agent2 train against? (A past snapshot of Agent2's policy2, or Agent1's policy1?)

The reason I'm asking is that I'm wondering whether I should separate the policies in a turn-based board-game environment like chess.

Chess looks like a symmetric adversarial game, but the direction of token movement differs for each player. For example, moving forward for Agent1 is +1 on the y-axis, while for Agent2 it is -1.

My idea for learning this environment correctly is to separate the policies and train each one. But if Agent1 trains against Agent1's own past policy, this idea may not work, because using the same policy as the opponent is not appropriate for learning, as I said above.

[EDIT] This statement was incorrect, please disregard [/EDIT]

Chess is a symmetrical adversarial game because, from the perspective of the agent, each side has exactly the same options/pieces/rules/strategies. Here's a quick explanation. Symmetry doesn't have anything to do with the board itself. Imagine you and I are playing chess across from each other in real life. At the start of the game, if I reach out my left hand and move the queen's knight, that's a legal move, right? Now imagine the same scenario but you go first: could you make the exact same move? That's symmetry. You can make the same statement about any move in chess.

This symmetry allows ML-Agents' version of self-play to work for chess without the need to introduce a second policy.

A quick extra note I realized I didn't specify: when I claim that chess is symmetrical, I'm referring to a series of games (like an agent would play), not a single game. You could argue that a single game of chess is non-symmetrical because one player goes first and the other never can.

But the main claim is that board asymmetry and symmetry can be abstracted out and don't affect training in this case.

This is not correct; we support asymmetric self-play with agents that have different policies. See the StrikerVsGoalie config and example scene.

I agree that you should be able to use a single policy in this case, though. You should modify the observations so that the player’s starting side is in a consistent position.
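To make the "consistent position" idea concrete, here is a minimal sketch of how the second player's observation could be canonicalized. This is an illustration I wrote, not ML-Agents API code: the board encoding (a 2D array where positive values are the observing side's pieces and negative values are the opponent's) is a hypothetical choice.

```python
import numpy as np

def canonical_observation(board: np.ndarray, is_second_player: bool) -> np.ndarray:
    """Return the board from the current player's own perspective.

    `board` is a hypothetical (ranks, files) array where positive values
    are the first player's pieces and negative values are the second
    player's. Flipping the rank axis and negating the piece signs means
    "forward" is always +1 on the y-axis for whichever agent observes,
    so a single shared policy sees both sides the same way.
    """
    if not is_second_player:
        return board
    # Mirror along the rank (y) axis and swap piece-ownership signs.
    return -board[::-1, :]

# Example: player one's "pawn" (value 1) on rank 1 of an 8x8 board,
# player two's pawn (value -1) on rank 6.
board = np.zeros((8, 8), dtype=int)
board[1, 0] = 1
board[6, 0] = -1

obs_p2 = canonical_observation(board, is_second_player=True)
# From player two's perspective, its own pawn now sits on rank 1
# with a positive sign, matching player one's view of its own pawn.
```

The same trick (plus remapping actions back to board coordinates) is what lets the single-policy self-play setup treat both sides identically.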

Note that chess might not be easily solvable (without a lot of computing power) using the algorithms in the current version of ML-Agents.

Oh, my bad! Either it didn't use to be supported or my memory is going. Thanks for the correction @celion_unity; I shouldn't have blabbed without looking it up.

No worries. The initial version of self-play only supported symmetric agents, but asymmetric support has been available since the 0.16.0 release.

Totally agree with you. After I wrote this post, I noticed that I should create inverted observations for Agent2.

I'm new to ML. I'm curious about which algorithm the current version of Unity ML-Agents uses, and which algorithm is best for solving a complicated problem like chess.

Luke-Houlihan Thanks for your explanation. I'm trying to build single-policy self-play again.

The two main algorithms that ML-Agents provides are:

• PPO (Proximal Policy Optimization)
• SAC (Soft Actor-Critic)

The state-of-the-art for board games would be something like AlphaZero or MuZero, but we don’t have implementations for these (and I’m not sure we’re planning on adding them either).