Self-play ignores old policy from pretrained weights

I have a small square arena environment in which two agents shoot each other. If an agent touches the wall, it loses, the other agent wins, and the episode ends. If an agent dies, it loses, the other agent wins, and the episode ends.
At first, I trained the agents without self-play, with both agents sharing the same policy, to speed up training. That worked very well.
After that, I added self-play to the config file, changed learning_rate_schedule from “linear” to “constant”, and used --resume to continue training on that run ID.
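
For context, the kind of change I made looks roughly like this (the behavior name and the self-play values below are illustrative placeholders, not my exact config):

    behaviors:
      FighterDemo:                           # placeholder behavior name
        trainer_type: ppo
        hyperparameters:
          learning_rate: 3.0e-4
          learning_rate_schedule: constant   # changed from linear
        self_play:
          save_steps: 50000
          team_change: 200000
          swap_steps: 2000
          window: 10
          play_against_latest_model_ratio: 0.5
          initial_elo: 1200.0
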
Here is my current reward code:

    // Called when an agent dies: the dead agent is penalized, the survivor is rewarded,
    // and the winner's reward shrinks the longer the episode lasted.
    public void AgentDie(GameObject whatAgent){
        // Time penalty: grows from 0 toward penaltyRate as resetTimer approaches MaxEnvironmentSteps.
        float rewardPenalty = penaltyRate * (float)resetTimer / (float)MaxEnvironmentSteps;
        if (whatAgent == GreenAgent){
            GreenAgent.GetComponent<Agent_Fighter_Demo>().AddReward(-1f);
            RedAgent.GetComponent<Agent_Fighter_Demo>().AddReward(+1f - rewardPenalty);
        }
        else{
            GreenAgent.GetComponent<Agent_Fighter_Demo>().AddReward(+1f - rewardPenalty);
            RedAgent.GetComponent<Agent_Fighter_Demo>().AddReward(-1f);
        }
        ResetScene();
    }
    // Called when an agent touches the wall: the touching agent loses, the other wins (no time penalty here).
    public void AgentTouchWall(GameObject whatAgent){
        if (whatAgent == GreenAgent){
            GreenAgent.GetComponent<Agent_Fighter_Demo>().AddReward(-1f);
            RedAgent.GetComponent<Agent_Fighter_Demo>().AddReward(+1f);
        }
        else{
            GreenAgent.GetComponent<Agent_Fighter_Demo>().AddReward(+1f);
            RedAgent.GetComponent<Agent_Fighter_Demo>().AddReward(-1f);
        }
        ResetScene();
    }
    // When a bullet hits an enemy, the shooter is rewarded in proportion to the damage dealt.
    Shooter.GetComponent<Agent_Fighter_Demo>().AddReward(damage / maxHealth);

After many attempts to reshape the reward and retrain, my agents tend to stop shooting at each other; they just shoot into the air. Even when I trained for a long time (overnight), the result was the same.
(Screenshot attached.)


Does anyone know what is wrong here?
Why do my agents no longer hit each other?

Is this caused by my reward shaping?

Absolutely no idea :) Nope. No clue. Nada, zip.

However, if I may suggest:

  1. Instead of resuming the run, try --initialize-from with a new run ID (see the example command below).
    From somewhere in the docs: “You may want to do this, for instance, if your environment changed and you want a new model, but the old behavior is still better than random. You can do this by specifying --initialize-from=<run-identifier>, where <run-identifier> is the old run ID.”
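
For example, something along these lines (the run IDs and config file name are just placeholders):

    mlagents-learn config/SelfPlayFighter.yaml --run-id=FighterSelfPlay --initialize-from=FighterNoSelfPlay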

Or maybe:

  1. After training without self-play, record a bunch of demo files using the trained agents.
  2. Then turn on self-play in a brand-new run (not initialized from the previous run), but use GAIL with the recorded demo files (a config sketch follows below).
  3. Then, with self-play still turned on, do another run initialized from the previous run (step 2), but without GAIL.
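
Something like this could be the step-2 config (the behavior name, demo path, and GAIL strength are placeholders, adjust them to your project). For step 3 you would drop the gail block and launch the new run with --initialize-from pointing at the step-2 run ID:

    behaviors:
      FighterDemo:                              # placeholder behavior name
        trainer_type: ppo
        reward_signals:
          extrinsic:
            gamma: 0.99
            strength: 1.0
          gail:                                 # imitation signal from the recorded demos
            strength: 0.5
            demo_path: Demos/FighterDemos.demo  # placeholder path to the recorded demo files
        self_play:
          save_steps: 50000
          team_change: 200000
          swap_steps: 2000
          window: 10
          play_against_latest_model_ratio: 0.5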