Hi, I’d be interested in your thoughts on automating curricula with symmetric self-play. I’m not super familiar with the theory behind this, so I might be reinventing the wheel here or missing some crucial aspects.
I have two humanoid agents pitted against each other in a boxing match. Episodes are open-ended and can terminate in one of three ways:
- Whenever an agent strikes its opponent, I add the magnitude of its fist velocity to a float field that is reset to zero on episode start. Once the accumulated velocities exceed a given max value, that agent wins and the opponent loses, even if the difference between the two agents’ accumulated velocities is minimal.
- When an agent’s head drops below some threshold height, it is considered down and a counter starts. If the agent doesn’t manage to get its head back up within a given number of steps, it loses and the opponent wins.
- If, while one agent is down, its opponent goes down as well, the episode ends immediately in a draw (a code sketch of this termination logic follows the list).
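To make the termination logic concrete, here is a simplified Python sketch of the per-step check. The field names, placeholder values and helper functions are just illustrative, not my actual implementation:

```python
from dataclasses import dataclass
from enum import Enum

HEAD_DOWN_THRESHOLD = 0.5  # placeholder: head height below which an agent counts as "down"

class Outcome(Enum):
    NONE = 0
    WIN_BY_STRIKE = 1   # accumulated fist velocities exceeded the max
    LOSE_BY_DOWN = 2    # head stayed below the threshold for too long
    DRAW = 3            # both agents were down at the same time

@dataclass
class FighterState:
    accumulated_velocity: float = 0.0  # sum of fist-velocity magnitudes, reset on episode start
    head_height: float = 1.7           # current head height, updated by the physics sim
    down_steps: int = 0                # consecutive steps the head has been below the threshold

def on_strike(striker: FighterState, fist_velocity_magnitude: float) -> None:
    """Whenever an agent lands a strike, add the magnitude of its fist velocity."""
    striker.accumulated_velocity += fist_velocity_magnitude

def check_termination(agent: FighterState, opponent: FighterState,
                      max_accumulated_velocities: float, max_down_steps: int) -> Outcome:
    """Per-step check of the three termination conditions from one agent's perspective."""
    # Win by strike: the accumulated total exceeds the max, regardless of how
    # close the opponent's own total is.
    if agent.accumulated_velocity > max_accumulated_velocities:
        return Outcome.WIN_BY_STRIKE

    # Track how long the head has been below the threshold height.
    if agent.head_height < HEAD_DOWN_THRESHOLD:
        agent.down_steps += 1
    else:
        agent.down_steps = 0

    # Draw: while this agent is down, the opponent is down as well.
    if agent.down_steps > 0 and opponent.down_steps > 0:
        return Outcome.DRAW

    # Lose by down: the agent failed to get its head back up in time.
    if agent.down_steps >= max_down_steps:
        return Outcome.LOSE_BY_DOWN

    return Outcome.NONE
```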
The challenge with this (besides fidgety physics) is that my values for max_accumulated_velocities and max_down_steps need to change as training progresses. In the beginning, agents lose their balance almost immediately, and for them to understand that falling down is bad, max_down_steps has to be tiny; otherwise both agents go down and all episodes end in a draw. Likewise, max_accumulated_velocities has to start low so that agents can learn that punching their opponent results in a win. Later, agents need to learn to strike harder and repeatedly, as well as how to get up after falling down, for which the respective max values have to be higher.
I’m trying to solve this with a separate auto-curriculum class that receives OnEpisodeStop events from all agents in a scene, indicating why an episode has ended. This class keeps a few counters: total_episode_count, win_by_strike_count, lose_by_down_count and draw_count. When an episode ends, it compares the counter values against each other and updates the max values for all agents accordingly. For instance, if more than some fraction of all episodes ended because of win_by_strike, then max_accumulated_velocities is incremented; otherwise it is decremented. Similarly for max_down_steps: its value is lowered if too many episodes ended in a draw.
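In rough Python pseudocode (reusing the Outcome enum from the sketch above), the class looks something like this. The fractions, step sizes and start values are placeholders I’d still have to tune:

```python
class AutoCurriculum:
    """Adjusts the termination thresholds based on how recent episodes ended.
    All numeric values below are illustrative placeholders, not tuned numbers."""

    def __init__(self):
        self.total_episode_count = 0
        self.win_by_strike_count = 0
        self.lose_by_down_count = 0
        self.draw_count = 0
        # Current thresholds shared by all agents in the scene.
        self.max_accumulated_velocities = 5.0   # placeholder start value
        self.max_down_steps = 10                # placeholder start value

    def on_episode_stop(self, outcome: Outcome) -> None:
        """Called by an agent when its episode ends, with the reason."""
        self.total_episode_count += 1
        if outcome == Outcome.WIN_BY_STRIKE:
            self.win_by_strike_count += 1
        elif outcome == Outcome.LOSE_BY_DOWN:
            self.lose_by_down_count += 1
        elif outcome == Outcome.DRAW:
            self.draw_count += 1
        self._update_thresholds()

    def _update_thresholds(self) -> None:
        total = self.total_episode_count
        if total == 0:
            return
        # If enough episodes end in a win-by-strike, make striking harder;
        # otherwise make it easier again.
        if self.win_by_strike_count / total > 0.5:          # placeholder fraction
            self.max_accumulated_velocities += 0.5          # placeholder step size
        else:
            self.max_accumulated_velocities = max(
                1.0, self.max_accumulated_velocities - 0.5)
        # If too many episodes end in a draw, give agents less time on the ground;
        # otherwise grant more time, so they can eventually learn to get back up.
        if self.draw_count / total > 0.3:                   # placeholder fraction
            self.max_down_steps = max(1, self.max_down_steps - 1)
        else:
            self.max_down_steps += 1
```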
Does this sound like a reasonable approach to the problem? Thanks!