Hi, I’m debugging an issue with the SAC trainer where agents learn to play very well and then regress to doing nothing.

In my setup, there are two competing teams, and I use zero-sum rewards (i.e., if Team A gets +R, Team B gets -R, for every reward). The setup works very well up to a point where everything goes wrong and all the agents decide to just do nothing. (If I save the model before that point, they actually play pretty well.)

I have tried exiting and reloading at regular intervals to make sure it’s not a problem with running the simulation itself for a long period of time. It’s really not that.

I’ve attached a screenshot of an example tensorboard. A few notes:

Cumulative rewards are always zero due to zero-sum.

Episode length goes down if one team can fulfill the winning condition before time runs out, so it’s a good thing if the episode ends early.

After a while, the episode length goes back to the max because all the agents are doing nothing (not because of better defense, for instance).

Now, I find the Extrinsic Value Estimate graph confusing, because the value estimate seems to go down when one team is fulfilling the winning condition!
Is there a reason a state where one team gets +R and the other gets -R would have a lower value estimate than a state where both get 0?

When one team reaches the winning condition, is the specific observation exactly the same for both teams? If the value function is being trained to give the exact same state an estimate of both +R and -R, it would cause issues.
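To see why conflicting targets are a problem, here is a minimal illustrative sketch (not ML-Agents code; the function name and numbers are made up): an MSE-trained critic that repeatedly sees the same state labeled both +R and -R has its estimate driven toward the mean of the two targets, i.e. 0, no matter how large R is.

```python
def train_value_on_conflicting_targets(R=10.0, lr=0.1, steps=200):
    """Scalar 'value estimate' for one shared state, fit by gradient descent."""
    v = 5.0  # arbitrary starting estimate
    for _ in range(steps):
        # Batched MSE gradient over one sample per team, same state:
        # d/dv [(v - R)^2 + (v + R)^2] = (2)(v - R) + (2)(v + R) = 4v.
        grad = (v - R) + (v - (-R))  # the +R and -R targets cancel out
        v -= lr * grad
    return v

print(train_value_on_conflicting_targets())  # a value very close to 0
```

So if both teams really did share the exact same observation at the winning state, the critic could not represent "+R for the winner, -R for the loser" and would collapse toward 0.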

If the agents on each team are exactly the same (i.e. use the same policy), you might want to check out our new self-play feature that will be part of the next release.

No, the observations are team-dependent; for instance, +1 if your team has the ball and -1 if the other team has it.

Yes, I just saw the self-play feature (thanks for that!). I intend to try it out, but I’d like to figure out the current issue before adding more stuff.

I just find it very confusing that the value estimate goes down the more rewards are given. Theoretically, if two states are discovered, one with +R and one with -R, shouldn’t the mean value estimate remain at 0?

But it keeps going down the higher the magnitude of R is, and training goes well until it drops below 0, and then the model just prefers to do nothing and keep it at 0.

Well, ideally, trainers would generate a policy that assigns lower probability to the actions that lead to -R and higher probability to those that lead to +R. The value estimate is the expected discounted return when following a particular policy (the current policy in PPO, the optimal maximum-entropy policy in SAC). In either case, the policy should favor the actions that lead to +R, and the value estimate should reflect that. However, looking at your tensorboard, I see what you mean.
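As a toy illustration of that last point (hypothetical numbers, a single terminal ±R reward at the end of the episode): the expected discounted return is 0 under a uniform policy, and positive as soon as the policy favors the winning action.

```python
def expected_discounted_return(p_win, R=1.0, gamma=0.99, horizon=10):
    """Expected discounted return with one terminal reward of +R (win) or -R (lose)."""
    return (gamma ** horizon) * (p_win * R + (1 - p_win) * (-R))

print(expected_discounted_return(0.5))  # uniform policy -> 0.0
print(expected_discounted_return(0.9))  # policy favoring +R -> positive
```

A persistently *negative* estimate, as in your tensorboard, is what you'd expect if the critic is being fit against mixed-up or symmetric targets rather than each team's own returns.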

Can you share your observation space? It’s possible there is some weirdness there, since symmetric observations can be tricky for agents on opposing teams. In our soccer and tennis environments, we don’t use an explicit -1/+1 observation to specify teams; instead, we invert the appropriate observation values/indices. For example, the agent’s x coordinate is inverted in tennis because the agents move in opposite directions in the env. In the soccer environment, we have specific ray-cast slots for ‘teammate’ and ‘opponent’, which differ for the purple/blue teams. Those might be helpful to look at too.