Questions about self-play

I’m trying to train some agents using self-play, and I’m currently a bit confused by the output.

First, a behavior called “My Behavior” is detected that I cannot find anywhere in my scene and that should not be there.

```
2021-04-05 22:54:35 WARNING [trainer_factory.py:60] Behavior name My Behavior does not match any behaviors specified in the trainer configuration file: ['ShipAI']
2021-04-05 22:54:35 INFO [stats.py:186] Hyperparameters for behavior name My Behavior:
```

During training, the ELO seems to update at first, but eventually it stops being reported. Why does it suddenly stop? Also, why is the ELO decreasing while the mean group reward is positive?
The group rewards are assigned perfectly symmetrically, so anything above 0 should mean the agent did better than its opponent.

```
2021-04-05 23:03:05 INFO [stats.py:180] ShipAI. Step: 20480. Time Elapsed: 517.991 s. Mean Reward: 0.000. Mean Group Reward: 0.699. Training. ELO: 1000.744.
2021-04-05 23:10:33 INFO [stats.py:180] ShipAI. Step: 25600. Time Elapsed: 965.617 s. Mean Reward: 0.000. Mean Group Reward: 0.389. Training. ELO: 998.989.
2021-04-05 23:17:54 INFO [stats.py:180] ShipAI. Step: 30720. Time Elapsed: 1406.548 s. Mean Reward: 0.000. Mean Group Reward: 0.453. Training. ELO: 998.492.
2021-04-05 23:25:15 INFO [stats.py:180] ShipAI. Step: 35840. Time Elapsed: 1848.152 s. Mean Reward: 0.000. Mean Group Reward: 0.335. Training.
2021-04-05 23:32:45 INFO [stats.py:180] ShipAI. Step: 40960. Time Elapsed: 2297.405 s. Mean Reward: 0.000. Mean Group Reward: 0.351. Training.
2021-04-05 23:39:56 INFO [stats.py:180] ShipAI. Step: 46080. Time Elapsed: 2729.359 s. Mean Reward: 0.000. Mean Group Reward: 0.434. Training.
2021-04-05 23:40:52 INFO [stats.py:180] ShipAI. Step: 51200. Time Elapsed: 2785.278 s. Mean Reward: 0.000. Mean Group Reward: 0.279. Training.
2021-04-05 23:50:49 INFO [stats.py:180] ShipAI. Step: 56320. Time Elapsed: 3381.634 s. Mean Reward: 0.000. Mean Group Reward: 0.312. Training.
```

Lastly, one of the training environments seems to time out, which I haven’t been able to reproduce when running it in the editor. This causes all environments to shut down, with mlagents throwing a UnityTimeoutException.

```
2021-04-05 23:52:11 INFO [subprocess_env_manager.py:220] UnityEnvironment worker 5: environment stopping.
2021-04-05 23:53:11 INFO [environment.py:431] Environment timed out shutting down. Killing…
2021-04-05 23:58:04 INFO [model_serialization.py:183] Converting to results/TestAI_5/ShipAI/ShipAI-58027.onnx
2021-04-05 23:58:04 INFO [model_serialization.py:195] Exported results/TestAI_5/ShipAI/ShipAI-58027.onnx
2021-04-05 23:58:04 INFO [torch_model_saver.py:116] Copied results/TestAI_5/ShipAI/ShipAI-58027.onnx to results/TestAI_5/ShipAI.onnx.
2021-04-05 23:58:04 INFO [model_serialization.py:183] Converting to results/TestAI_5/My Behavior/My Behavior-0.onnx
2021-04-05 23:58:04 INFO [model_serialization.py:195] Exported results/TestAI_5/My Behavior/My Behavior-0.onnx
2021-04-05 23:58:04 INFO [torch_model_saver.py:116] Copied results/TestAI_5/My Behavior/My Behavior-0.onnx to results/TestAI_5/My Behavior.onnx.
2021-04-05 23:58:04 INFO [trainer_controller.py:81] Saved Model
2021-04-05 23:58:07 INFO [environment.py:429] Environment shut down with return code 0.
2021-04-05 23:58:07 INFO [environment.py:429] Environment shut down with return code 0.
2021-04-05 23:58:07 INFO [environment.py:429] Environment shut down with return code 0.
2021-04-05 23:58:07 INFO [environment.py:429] Environment shut down with return code 0.
2021-04-05 23:58:07 INFO [environment.py:429] Environment shut down with return code 0.
2021-04-05 23:58:07 INFO [environment.py:429] Environment shut down with return code 0.
2021-04-05 23:58:07 INFO [environment.py:429] Environment shut down with return code 0.
```


I would appreciate any help figuring out what is going on here.

First, “My Behavior” is the behavior name assigned in the agent’s Behavior Parameters in the editor. It should be the same name as the behavior name in your config file.
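For reference, here is a minimal sketch of what the matching part of the trainer config could look like (the values are just illustrative placeholders, not recommendations). The important part is that the key under `behaviors:` is exactly the Behavior Name set on the Behavior Parameters component:

```yaml
behaviors:
  ShipAI:                  # must match the Behavior Name field in the editor
    trainer_type: ppo
    max_steps: 500000
    time_horizon: 64
    self_play:
      save_steps: 20000    # trainer steps between saving policy snapshots used as opponents
      swap_steps: 2000     # steps between swapping in a different opponent snapshot
      window: 10           # size of the pool of past snapshots to sample opponents from
      play_against_latest_model_ratio: 0.5
```

Any behavior name that is not listed here falls back to default trainer settings, which is presumably why the warning above only complains instead of failing outright.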

The reason ELO is sometimes not reported is that it is only reported if the agents completed a game within that summary window. You can try increasing the summary period to get better stats.
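If the summary period corresponds to `summary_freq` in your trainer config (that is the name of the option in recent ML-Agents releases), the change would just be bumping that value, e.g.:

```yaml
behaviors:
  ShipAI:
    summary_freq: 20000   # report stats (including ELO) every 20k steps instead of every ~5k
```

That way each reported window is more likely to contain at least one completed game.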

To your question “why is ELO decreasing while the mean group reward is positive?”: ELO can increase or decrease while the reward is positive. It’s possible that you perform a bit worse in a certain period while still being stronger than your opponent. It would also be better to judge whether ELO is really decreasing over a longer horizon; your summary period is pretty short, so this might just be fluctuation.
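For intuition, here is the textbook Elo update, which (as far as I know) is what the self-play ELO in ML-Agents is based on:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K\,(S_A - E_A)$$

where $S_A$ is 1 for a win, 0.5 for a draw, 0 for a loss, and $E_A$ is the expected score given the current ratings. The rating only rises when the actual score beats the expected score, so a policy can keep winning more than it loses (positive mean group reward) and still shed a few points in a window where it wins less often than the rating gap predicts.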

Lastly, the timeout problem. If it’s a one-time thing, I’d say it’s some fluke on your machine. If it happens consistently and you have steps for reproducing it, please report a bug to us.

The problem is that I don’t have any agent named “My Behavior”; they are all named ShipAI, which is detected correctly, and all agents are instantiated from that one prefab. That is why it is puzzling me.

The decision making is turn-based, and a single game involves at most ~700 decisions, so getting no report for 20k steps seems a bit odd, especially since before that every 5k steps showed one.

As for the crash, it seems to be recurring, but I’m not sure what is causing it. Are logs of the individual environments stored anywhere?

I just checked: “My Behavior” is the default behavior name if you don’t specify one, so it might be that somewhere in your scene there is an agent whose behavior name wasn’t specified, for example an agent whose Behavior Parameters component still has the default name.

The total step count includes steps from all of the agents, so you might need to multiply 700 by the total number of agents you have in the scene (with, say, 10 agents per game, a single game could account for up to ~7,000 steps). It’s possible that in the early stages of training the episodes were shorter, so you were seeing the reports, but once the agents learned something the episodes became longer.

You can find the player logs in your results folder. There should be a file named “Player_x.log”.

I ran a much longer sequence, and the ELO varies between 1010 and 995, never really seeming to increase even though the reward is always positive (around 0.4).
I don’t understand how that can happen, since the rewards are perfectly symmetrical: a draw nets 0, a loss -1, and a win 1.
A mean of 0.4 would mean the policy being trained is winning most of the time, so shouldn’t the ELO increase?

Thanks, I was able to identify the bug that was causing the environment to freeze.

I checked, and there is no behavior named “My Behavior” anywhere in my scene. All my behaviors are named “ShipAI” and are instantiated from a single prefab, so I can’t really explain where that default name is coming from.