Last reward when the agent is destroyed

Hi! I am working on a competitive self-play project in which the agents get destroyed when the area is reset. Since the last reward (+1 if the team has won, -1 if it has lost) matters both for getting appropriate results and for the ELO measure, I wanted to know whether that last reward is actually received by the trainer, or whether it gets lost because the agent is destroyed right after the reward is added.

You should be able to see whether the agents are receiving the rewards via the TensorBoard curves. Additionally, if the ELO curve increases reasonably, this would also indicate the agents are correctly receiving rewards. To learn more about using TensorBoard, please see our documentation here: ml-agents/docs/Using-Tensorboard.md at main · Unity-Technologies/ml-agents · GitHub

The ELO curve is decreasing, which is what made me think something was going wrong in the first place. On the other hand, I am not sure I can tell whether the agent is receiving the reward, because I don't know if the Cumulative Reward graph in TensorBoard shows the mean reward of the agents on the learning team only, or of both teams.

So, in case this is the problem, how can I ensure that the agents receive that last reward?

Only the reward of the learning agent is reported in the reward curve. Can you try resetting the agents instead of destroying them? Or do they need to be destroyed for some reason?

I am destroying the agents because the size of the teams changes each episode, randomly between 2 and 5, so when the area is reset I may no longer need all the agents that I had in the last episode.

I'm having what seems to be the same issue: I'm training a 2v2 fighting scenario, and ELO keeps decreasing.

The same 1v1 situation works fine: when one of the two agents gets hit, I give a reward to each agent and destroy them both. When time runs out, the duel ends in a draw and both agents are destroyed (I have a scenario manager maintaining several duel instances in parallel, so another duel starts again soon). In this case, everything seems to work: ELO is increasing and behavior seems to improve.

In the 2v2 case, I can't just destroy an agent when it dies because the reward is tied to the end result. Instead, I disable its gameobject, wait for the fight to be decided or the time to run out, assign rewards accordingly (+1 for all agents in the winning team, -1 for the losing team, 0 for a draw), then destroy the parent scenario and all agents with it. In this case, ELO is steadily decreasing, and I'm not sure what I'm missing. Looking into the ml-agents code now to better understand, let me know if you have ideas.

My questions:
- Is disabling agent GameObjects a supported use case? (Expected: the agent no longer observes or takes actions, but I'd still like to set its final reward.)
- Is there an easy way to log or debug an agent's final reward? I'm running a lot of instances at the same time, so the TensorBoard curves are hard to read, and I also have small incentive rewards, so the cumulative reward differs slightly from [1, -1, 0].

Thanks!

FWIW, here is my reward assignment code:

  public void EndDuel(List<BaseAgent> winners, List<BaseAgent> losers) {
    if (ended) return;
    ended = true;
    foreach (var agent in winners) {
      if (agent == null) continue;
      agent.gameObject.SetActive(true);  // Agent might be disabled if it was killed earlier.
      agent.SetReward(1f);
      agent.EndEpisode();  // Trying to force the final reward to be taken into account.
      Destroy(agent.gameObject);  // Trying to make sure the agent doesn't take another step or restart an episode.
    }
    foreach (var agent in losers) {
      if (agent == null) continue;
      agent.gameObject.SetActive(true);
      agent.SetReward(-1f);
      agent.EndEpisode();
      Destroy(agent.gameObject);
    }
    scenarioStats.SetDecided();
    scenario.Expire(); // This has Destroy(scenario.gameObject, 0.01f);
  }

ELO (orange is 1v1 agent, blue is 2v2 agent)

Answering my own questions:

1 - Disabling agents temporarily is not supported (see OnDisable in Agent.cs:498 in 1.7.2-preview): when the agent is disabled, its episode ends, so I can't use this to set a "delayed" reward. It would be helpful to make this a supported use case (or to allow the agent to "sleep" and not observe or take actions, just wait for the final reward). I'm looking for a way around this and will post here if I find one.

2 - AFAIK there is no built-in logging, but I managed to edit the package and add logging to Agent.NotifyAgentDone() by doing the following (hacky):

  • Copy ml-agents and barracuda from Library/PackageCache into Assets/Packages
  • Reinstall Burst + restart editor
  • Fix assembly definition dependencies to point to the right packages
  • Fix missing scripts on prefabs (IDs for ML Agent scripts changed, so some scripts went missing)
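The logging change itself is small. Something like the following inside Agent.NotifyAgentDone() should do it (a sketch only: the exact field name and placement depend on the ml-agents version you copied, so check your local Agent.cs):

```csharp
// Hypothetical addition inside Agent.NotifyAgentDone(DoneReason doneReason),
// before the reward/cumulative-reward buffers are reset. The field holding the
// episode's cumulative reward is named m_CumulativeReward in the versions I
// looked at, but verify against your copy of Agent.cs.
Debug.Log($"[{gameObject.name}] episode done ({doneReason}), " +
          $"cumulative reward = {m_CumulativeReward}");
```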

Hey kodobolt - glad you were able to figure it out. One way to delay the reward would be to disable the Decision Requester temporarily (you'll notice the agent has a Decision Requester script attached to it) and call RequestDecision manually. Then, when you don't want the agent to take any actions, you can stop calling RequestDecision. Another way would be to simply ignore the effects of actions for the agent. You'll want to feed in a new 0-1 observation (isAlive) so the agent can tell when it's dead. This is something we'd like to support in the future, so we'll look into it.
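A rough sketch of the first approach (names like PausableAgent and Pause are illustrative; this assumes a DecisionRequester component is attached to the agent, which stops driving decisions when disabled):

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

// Sketch of an agent that can be "paused" when killed mid-episode,
// instead of being disabled or destroyed, so it can still receive
// the final team reward when the fight is decided.
public class PausableAgent : Agent
{
    DecisionRequester requester;
    bool isAlive = true;

    public override void Initialize()
    {
        requester = GetComponent<DecisionRequester>();
    }

    // Called by the scenario manager when this agent is killed.
    public void Pause()
    {
        isAlive = false;
        // Disabling the DecisionRequester stops automatic
        // RequestDecision calls; the agent just waits.
        requester.enabled = false;
    }

    public override void CollectObservations(VectorSensor sensor)
    {
        // Extra 0-1 observation so the policy can tell dead from alive.
        sensor.AddObservation(isAlive ? 1f : 0f);
        // ... other observations ...
    }
}
```

The scenario manager can then call SetReward and EndEpisode on paused agents once the duel resolves, as in the EndDuel code above.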

For #2 - to make modifications to ML-Agents code, you can clone the Github repo and change your project’s manifest.json to point to the local package instead of the one in PackageCache.
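Concretely, that means adding a file: entry to your project's Packages/manifest.json pointing at the package folder inside your clone (the relative path below is an example; adjust it to wherever you cloned the repo):

```json
{
  "dependencies": {
    "com.unity.ml-agents": "file:../../ml-agents/com.unity.ml-agents"
  }
}
```

Unity will then load the package from that folder, so edits to its source take effect without the PackageCache copy-and-fix steps.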

Thanks for the reply!

I'd like to disable the agent GameObject because it also has a number of components attached (a Rigidbody plus many custom behaviors - think animated 3D character). I'm trying to evaluate how to integrate RL into a Unity/game-dev workflow, so maybe matching the usual expectation that objects are inert but keep their state while disabled would make sense here?

For now I modified the base Agent code to introduce a paused state and disconnect the agent from the Academy. It seems to be working (the step count for paused agents matches their time active), but it does not fully fix my ELO leak, so maybe I'm missing something. I'll update this thread if I figure it out.

Thanks for the info for #2.