Curriculum learning and self-play

Hi,

  1. Question about mean reward in self-play: maybe a silly question, but is it correct that the reward values written to TensorBoard are those of the team that is currently training?
  2. In self-play, when the reward is only +1 for a win, -1 for a loss, and 0 for a draw, the mean reward can’t really get to 1 because the “opponent” keeps getting better too, is that correct? The mean reward should just hover around 0-ish…
  3. Curriculum learning is advanced by measuring reward or progress. I have a game with 2 tanks, and progress is not a real measure of success. However, if my statement in question 2 is correct, how can you use curriculum learning (using the built-in Unity mechanism, not a self-written one) in self-play, where the reward hardly represents improvement?
  4. @andrewcoh_unity, in the thread “Can I add a reward in self-play which would make the game to be non-zero-sum?”, stated that “If the agents seem to ignore winning and losing, you can also try using a curriculum.” Can someone elaborate on how a curriculum would help?

thank you all

  1. The reward that is reported is from whichever team is learning.

  2. There are a number of hyperparameters that come into play here. Please see ml-agents/docs/Training-Configuration-File.md at release_8_branch · Unity-Technologies/ml-agents · GitHub for an overview. If the opponent pool contains a lot of old policies, the reward should end up much closer to +1 rather than hovering around 0 (see the self_play sketch after 3/4 below).

3/4. You can use a curriculum to speed up learning by giving the agents rewards for intermediate decisions as opposed to just winning and losing. This is useful if it’s hard to win or lose by just following a random policy. For example, for a soccer agent with a really large field, it might be incredibly unlikely to actually score a goal. To speed this up, I might give the agents a reward for colliding with the ball. However, I ultimately want the agents to learn behavior that’s optimal for winning, and not necessarily optimal for colliding with the ball as many times as possible and then scoring! So, I might use a curriculum to teach the agents that the ball is important at first and then decay this reward once the agents start winning efficiently (see the curriculum sketch below).
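
To make (2) concrete, here is a minimal, illustrative self_play block; the behavior name Tank and all of the values are assumptions rather than recommendations. window controls how many past snapshots sit in the opponent pool, and play_against_latest_model_ratio controls how often the learning team faces the very latest policy instead of an older snapshot:

```yaml
behaviors:
  Tank:                                      # hypothetical behavior name
    trainer_type: ppo
    # ...other trainer settings omitted...
    self_play:
      save_steps: 20000                      # how often a policy snapshot is added to the pool
      team_change: 100000                    # steps before the learning team switches
      swap_steps: 2000                       # steps between swapping in a different opponent snapshot
      window: 10                             # number of past snapshots kept in the opponent pool
      play_against_latest_model_ratio: 0.5   # 0.5 = half the games are against the latest policy
      initial_elo: 1200.0
```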
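
And a sketch of what 3/4 can look like in the curriculum config. Here ball_touch_reward and Striker are hypothetical names; the environment code would read the parameter (e.g. via Academy.Instance.EnvironmentParameters.GetWithDefault) and hand out that much reward per ball touch, so the shaping reward shrinks to zero as the lessons advance:

```yaml
environment_parameters:
  ball_touch_reward:            # hypothetical shaping-reward scale read by the environment
    curriculum:
      - name: TouchTheBall
        completion_criteria:
          measure: reward
          behaviors:
            - Striker           # hypothetical behavior name
          signal_smoothing: true
          min_lesson_length: 100
          threshold: 0.5        # illustrative value
        value: 0.1              # small reward per ball touch while learning the basics
      - name: WinOnly
        value: 0.0              # shaping reward decayed away once agents win reliably
```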

Let me know if this helps or if there are other questions.


Thank you for your reply.
I understand the concept of curriculum learning, but my issue is with advancing the lessons.
I use a simple +1 -1 reward, and simplify the task using the curriculum system (bigger ball or bigger net).
Because play_against_latest_model_ratio is 0.5, the reward hovers around 0, so lessons never advance when using completion_criteria with measure: reward (and episode length is irrelevant here).
I now understand that I could set play_against_latest_model_ratio to a lower number; however, that would alter the learning process, which is not my intention…

What are my options?
Should I write my own completion criteria?
Is there an official example of symmetric self-play and curriculum learning? (I feel that in the current implementation curriculum learning does not suit symmetric self-play)
thank you

Has there been any clarity on how to set up curriculum learning for collaborative Group rewards?

I have not been able to find any clear documentation on the correct configuration for using the Group Reward as the measure that advances curriculum training. “measure: reward” does not seem to recognise the group reward signal.

Looking through the ML-Agents code on GitHub, Group Reward is only implemented for the PPO trainer, and I cannot find any implementation of group reward as a measure for advancing curriculum learning. It might be there; I just can’t see where.

Note: you can always do this manually though.
Lesson 1: Set the parameters and do a training run until you’re happy with the performance.
Lesson 2: Change the parameters and start a new run with --initialize-from set to the previous run ID (sketch below).
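
A minimal sketch of that two-run workflow, assuming hypothetical run IDs and config file name:

```
mlagents-learn tank_config.yaml --run-id=tanks_lesson1
# ...adjust the environment parameters (e.g. bigger ball / bigger net), then continue from the trained weights:
mlagents-learn tank_config.yaml --run-id=tanks_lesson2 --initialize-from=tanks_lesson1
```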

Yep, thanks - it looks as though the only option is manual curriculum learning.

It’s a shame that the Group Reward is not accessible via the Academy API in some way, so that we could code our own curriculum threshold step-ups to help automate the process (overnight while we are sleeping).

Suggestion:
If you are not using parallel executable instances, and are instead running multiple clones of the environment within the same Unity instance, then instead of using the default SimpleMultiAgentGroup class that comes with Unity ML-Agents, you could wrap or subclass it and intercept its add-reward and end-episode methods.

You could then implement your own per-episode tracking of the group reward, collate it into stats, and programmatically adjust your environment’s difficulty parameters.
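
A minimal sketch of that idea, using composition so it does not depend on the group’s methods being virtual; all names here are hypothetical and the threshold logic is only illustrative:

```csharp
using System.Collections.Generic;
using Unity.MLAgents;

// Hypothetical wrapper around SimpleMultiAgentGroup that tracks the cumulative
// group reward per episode and exposes a rolling mean, which an environment
// controller can use as its own "completion criteria" for stepping up difficulty.
public class TrackedMultiAgentGroup
{
    readonly SimpleMultiAgentGroup m_Group = new SimpleMultiAgentGroup();
    readonly Queue<float> m_RecentEpisodeRewards = new Queue<float>();
    float m_CurrentEpisodeReward;
    const int k_WindowSize = 100;   // episodes to average over (illustrative)

    public void RegisterAgent(Agent agent) => m_Group.RegisterAgent(agent);

    public void AddGroupReward(float reward)
    {
        m_CurrentEpisodeReward += reward;   // track it before forwarding
        m_Group.AddGroupReward(reward);
    }

    public void EndGroupEpisode()
    {
        m_RecentEpisodeRewards.Enqueue(m_CurrentEpisodeReward);
        if (m_RecentEpisodeRewards.Count > k_WindowSize)
            m_RecentEpisodeRewards.Dequeue();
        m_CurrentEpisodeReward = 0f;
        m_Group.EndGroupEpisode();
    }

    // Mean cumulative group reward over the recent window of episodes.
    public float MeanRecentReward
    {
        get
        {
            if (m_RecentEpisodeRewards.Count == 0) return 0f;
            float sum = 0f;
            foreach (var r in m_RecentEpisodeRewards) sum += r;
            return sum / m_RecentEpisodeRewards.Count;
        }
    }
}
```

The environment controller would call AddGroupReward / EndGroupEpisode on this wrapper instead of on the group directly, and check MeanRecentReward after each episode to decide when to bump its own scene difficulty (e.g. shrink the ball back down).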
