Hi all,
I’m building a personal project that trains a TicTacToe agent with self-play.
- Input (features): player (1 integer: Circle or Cross) and board (3x3 cells: blank, Circle, or Cross), 10 integers in total
- Stacked vectors: 3
- Output: 1 discrete action branch, branch 0 of size 9 (a cell index between 0 and 8)
- Reward: win = 1, tie = 0, lose = -1
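In case it helps, here is roughly how the 10-integer observation is put together, written as a simplified Python sketch (the real code lives on the Unity side; the exact mark values shown are just for illustration):

# Simplified sketch of the flat observation fed to the agent each turn.
# Mark values used here for illustration only: 0 = blank, 1 = Circle, 2 = Cross.
def encode_observation(current_player, board):
    """current_player: 1 (Circle) or 2 (Cross); board: 3x3 nested list of cell values."""
    obs = [current_player]      # 1 integer: which player the agent is
    for row in board:
        obs.extend(row)         # 9 integers: the board cells
    return obs                  # 10 integers; ML-Agents stacks 3 of these vectors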
I first trained it against a random AI, and it easily learned to beat it. I then trained it against a basic AI that plays randomly except that (1) it blocks when the opponent would win on the next turn, and (2) it makes the winning move when it already has two in a row and an empty cell to complete the line. This produced a much better model, but it still was not smart enough.
I’m now trying ML-Agents’ self-play feature. It gives me high hopes, since I don’t have to build a clever opponent first or tune the reward system to get a better model. In the previous single-agent case, I trained the model for the Circle player only. Now I have two agents, one for Circle (team id 0) and one for Cross (team id 1). After about 7M games, Elo pretty much stopped growing at around 6000, so I took the model and played against it.
The result was very disappointing: it plays much worse than the model trained against the basic AI. I suspected the problem might be which side (Circle or Cross) the model plays, but it plays badly no matter which side it takes. Can anyone give me some ideas or hints about what I’m missing here?
- Should I invert the board state for the Cross agent so that both agents are trained as the Circle player? (See the sketch after this list.)
- Should I change any of the settings in the config?
- Maybe TicTacToe is just not a good target for self-play?
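To clarify what I mean by “inverting” in the first question: something like the remapping below, so that the Cross agent always sees the position as if it were playing Circle (same simplified Python sketch and illustrative mark values as above):

# Remap marks so the Cross agent sees itself as Circle.
# Illustrative encoding as above: 0 = blank, 1 = Circle, 2 = Cross.
def invert_for_cross(obs):
    swap = {0: 0, 1: 2, 2: 1}   # exchange Circle and Cross marks
    player, cells = obs[0], obs[1:]
    return [swap[player]] + [swap[c] for c in cells]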
Here is the YAML config for reference. Thank you!
behaviors:
  TicTacToe:
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 3.0e-4
      beta: 5.0e-3
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 512
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    self_play:
      save_steps: 10000
      swap_steps: 10000
      play_against_current_self_ratio: 0.5
      window: 5
    max_steps: 10000000
    time_horizon: 64
    summary_freq: 10000