TicTacToe with self-play?

HI all,

I’m working on a personal project to train a TicTacToe agent with self-play. Here is my setup (a rough sketch of the matching Agent code is below the list):

  • Input (features): Player (1 integer: Circle or Cross), Board (3x3: Blank, Circle, or Cross), 10 integers in total
  • Stacked Vectors: 3
  • Output: 1 discrete branch, branch 0 size 9 (actions 0 to 8)
  • Reward: Win (+1), Tie (0), Lose (-1)
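
The observation part of that setup looks roughly like this in ML-Agents terms (a simplified sketch, not my exact code; class and field names are placeholders):

using Unity.MLAgents;
using Unity.MLAgents.Sensors;

public class TicTacToeAgent : Agent
{
    public int[] board = new int[9]; // 0 = Blank, 1 = Circle, 2 = Cross for each cell
    public int player;               // 1 = Circle, 2 = Cross: which side this agent plays

    public override void CollectObservations(VectorSensor sensor)
    {
        sensor.AddObservation(player);      // 1 integer for the player
        foreach (int cell in board)
        {
            sensor.AddObservation(cell);    // 9 integers for the board
        }
        // 10 values total; "Stacked Vectors: 3" is set on the Behavior
        // Parameters component rather than in code.
    }
}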

I first trained it against a random AI, and it easily learned to beat the random AI. I then trained it against a basic AI that plays randomly except that (1) it blocks when the opponent would win on the next turn, and (2) it makes the finishing move when it already has two in a row and there is an empty spot to complete the line. That produced a much better model, but it still wasn’t smart enough.
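
For reference, that basic AI is essentially this heuristic (a simplified sketch in plain C#, not the exact code): complete my own two-in-a-row if possible, otherwise block the opponent’s, otherwise move randomly.

using System;
using System.Linq;

public static class BasicTicTacToeAI
{
    // board[i]: 0 = Blank, 1 = Circle, 2 = Cross; myMark is 1 or 2.
    static readonly int[][] Lines =
    {
        new[] { 0, 1, 2 }, new[] { 3, 4, 5 }, new[] { 6, 7, 8 }, // rows
        new[] { 0, 3, 6 }, new[] { 1, 4, 7 }, new[] { 2, 5, 8 }, // columns
        new[] { 0, 4, 8 }, new[] { 2, 4, 6 }                     // diagonals
    };

    static readonly Random Rng = new Random();

    public static int ChooseMove(int[] board, int myMark)
    {
        int opponentMark = (myMark == 1) ? 2 : 1;

        // 1) Finish my own two-in-a-row if there is one.
        int move = FindFinishingMove(board, myMark);
        if (move >= 0) return move;

        // 2) Block the opponent's two-in-a-row.
        move = FindFinishingMove(board, opponentMark);
        if (move >= 0) return move;

        // 3) Otherwise pick a random empty cell.
        var empty = Enumerable.Range(0, 9).Where(i => board[i] == 0).ToList();
        return empty[Rng.Next(empty.Count)];
    }

    // Returns the empty cell that completes a line for 'mark', or -1 if none.
    static int FindFinishingMove(int[] board, int mark)
    {
        foreach (var line in Lines)
        {
            if (line.Count(i => board[i] == mark) == 2 &&
                line.Count(i => board[i] == 0) == 1)
            {
                return line.First(i => board[i] == 0);
            }
        }
        return -1;
    }
}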

I’m now trying ML-Agents’ self-play feature. It gives me high hopes, since I don’t have to create a clever AI first or tune the reward system to produce a better model. For the previous single-agent case, I trained the model for the Circle player only. Now I have two different agents: one is Circle (Team Id: 0) and the other is Cross (Team Id: 1). After about 7M games, Elo more or less stopped growing at around 6000, so I took the model and played against it myself.

The result was very disappointing: the model was much worse than the one trained against the basic AI. I suspected the problem might be which player (Circle or Cross) the model plays, but no matter which side it takes, it plays badly. Can anyone give me some ideas or hints about what I’m missing here?

  • Should I invert the board state for the Cross player agent so that both agents are trained as a Circle player?

  • Should I change any of the settings in the config?

  • Maybe TicTacToe is not a good target for self-play?

  • Here is the yaml config for your information. Thank you!

behaviors:
  TicTacToe:
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 3.0e-4
      beta: 5.0e-3
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    self_play:
      save_steps: 10000
      swap_steps: 10000
      play_against_current_self_ratio: 0.5
      window: 5
    max_steps: 10000000
    time_horizon: 64
    summary_freq: 10000

Okay, I’ve just noticed that I’m actually getting the warning below. Does anyone know if this is an issue?

Here are a few things I would change:

  1. The agent should be agnostic to whether it is playing circle or cross. Observe only the board state (3x3), with 0 being an empty slot, 1 = the agent occupies the slot, -1 = the opponent occupies the slot. So yes, board states would be inverted for the two agents (see the sketch after this list).
  2. I don’t think you need stacked observations, because all information needed for deciding on a move is in the current state.
  3. You’re using discrete actions, right? Set lower batch_size and buffer_size values (see ml-agents/docs/Training-Configuration-File.md at release_5_docs in the Unity-Technologies/ml-agents GitHub repo).
  4. You probably won’t need 2x512 hidden_units for this; try a smaller network first.
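
Something along these lines is what I mean by points 1 and 2 (just a sketch; the field names are made up, not your actual code):

using Unity.MLAgents;
using Unity.MLAgents.Sensors;

public class TicTacToeAgent : Agent
{
    public int[] board = new int[9]; // absolute marks: 0 = empty, 1 = Circle, 2 = Cross
    public int myMark;               // 1 if this agent plays Circle, 2 if it plays Cross

    public override void CollectObservations(VectorSensor sensor)
    {
        for (int i = 0; i < 9; i++)
        {
            int value = 0;                          // empty slot
            if (board[i] == myMark) value = 1;      // my own mark
            else if (board[i] != 0) value = -1;     // opponent's mark
            sensor.AddObservation(value);
        }
        // Only 9 values, no "which player am I" flag, and no stacking needed.
    }
}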

What are you doing if an agent makes an invalid move, like choosing a slot that’s already occupied? You could end the episode right away in this case, with that agent being the loser.
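
If you go that route, it could look roughly like this inside the same Agent subclass (a sketch using the older float[] action signature from the release_5 docs; adapt it to your ML-Agents version and your own turn handling):

// Assumes the board/myMark fields from the sketch above.
public override void OnActionReceived(float[] vectorAction)
{
    int cell = (int)vectorAction[0];   // branch 0: which slot to mark (0..8)

    if (board[cell] != 0)              // slot already occupied -> invalid move
    {
        SetReward(-1f);                // count it as a loss for this agent
        EndEpisode();
        return;
    }

    board[cell] = myMark;              // otherwise play the move
    // ... then check for win / tie and pass the turn to the other agent
}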

Thank you @mbaske! Your suggestions all make sense.

I’m masking impossible moves by overriding CollectDiscreteActionMasks(). Thanks!
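
For anyone curious, the masking part is roughly this (a simplified sketch; the full code is in the repo below). It goes in the same Agent subclass that holds the board array:

public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker)
{
    // Collect every cell that already has a mark on it...
    var occupied = new List<int>();
    for (int i = 0; i < 9; i++)
    {
        if (board[i] != 0)
            occupied.Add(i);
    }

    // ...and mask those indices out of branch 0, so the policy can only
    // choose empty slots. (DiscreteActionMasker comes from Unity.MLAgents;
    // List<int> needs System.Collections.Generic.)
    actionMasker.SetMask(0, occupied);
}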

It is working great now. Here is my repo if anyone is interested. 🙂
https://github.com/young2code/TicTacToeML


Thanks for sharing the source!