UnityEnvironment worker 0: environment stopping.

Hi everyone, I was training my agent yesterday and everything was going perfectly until I reached 10 million steps. Then I noticed a sudden drop in performance: reward and value loss plummeted, policy loss skyrocketed, and entropy started increasing.

Here are some TensorBoard graphs: (Imgur link)

On top of that, this morning I discovered that the training had stopped after 27 million steps with the message:
UnityEnvironment worker 0: environment stopping.

It’s the first time I’ve seen something like this. What could be the reason for such a sudden change?
And what does “worker 0” mean? That no agent is responding? That no agent is present in the scene?

How is that possible? Currently the only thing that resets the agents is the maxStep parameter in the Inspector: after 5k steps they start a new episode. There is no other line of code that resets the agents or the environment.

I don’t know if the performance drop and the sim stop are related; maybe I’m asking two separate questions.

My agent has an observation size of 229, a continuous action space of size 20, and no stacked observations.
Here’s the config file:

behaviors:
  Walker_4Legs:
    trainer_type: ppo
    hyperparameters:
      batch_size: 2024
      buffer_size: 20240
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 512
      num_layers: 3
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.995
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 50000000
    time_horizon: 1000
    summary_freq: 100000
    threaded: true

Any ideas about what the issue could be?

They might be. Did you check the console log for NaN observation error messages?

There were no error messages in the console.
I’ve researched the topic and found that the performance drop might be related to the PPO algorithm itself.
Noobish explanation, in my own words:
if the agent reaches a high score for too many iterations (e.g. it has almost solved the environment perfectly, but there is still 75% of the training to go), it might start trying different things due to the high entropy. If it does too many weird things, the bad experiences will clog the next batches, corrupting the training performance beyond recovery.

I saw suggestions about lowering the learning rate and tweaking the batch/buffer size, to avoid taking steps that are too big during the policy update; a sketch of what that could look like is below.
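For reference, here is a minimal sketch of how those tweaks could look in the same config format. The specific values are just assumptions for illustration, not something I’ve validated on this environment:

behaviors:
  Walker_4Legs:
    trainer_type: ppo
    hyperparameters:
      # lower learning rate -> smaller policy updates per epoch
      learning_rate: 0.0001
      # bigger buffer relative to batch -> each update averages over more experience
      batch_size: 2048
      buffer_size: 40960
      # smaller epsilon clips the policy ratio harder, also limiting the step size
      epsilon: 0.15
      learning_rate_schedule: linear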

I’ve implemented curriculum learning, which should avoid the problem of getting to a solution too early; a config sketch is below.
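In case it helps, this is roughly what a curriculum looks like in the ML-Agents config, under environment_parameters. The parameter name target_distance and the threshold values are placeholders I made up for illustration; the scene has to read the parameter (e.g. via Academy.Instance.EnvironmentParameters) and apply it:

environment_parameters:
  target_distance:    # hypothetical parameter, read and applied by the scene
    curriculum:
      - name: Lesson0
        completion_criteria:
          measure: reward
          behavior: Walker_4Legs
          signal_smoothing: true
          min_lesson_length: 100
          threshold: 0.3
        value: 2.0
      - name: Lesson1
        completion_criteria:
          measure: reward
          behavior: Walker_4Legs
          signal_smoothing: true
          min_lesson_length: 100
          threshold: 0.6
        value: 5.0
      - name: Lesson2
        value: 10.0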

Still no idea what caused the premature stop of the previous training session, though.

Interesting! Could you share the source where you found this?
I’ve seen sudden performance drops a couple of times, but was assuming there’s something wrong with my agent design.

I haven’t found a specific paper on the subject, but I’ve read every Reddit thread and article talking about:
performance drops, reward collapse, PPO learning instability.
I’ve played around with these keywords, and every now and then someone describes this exact problem.

I’ve noticed the issue in other simulations I wrote, but more often than not the agent recovered, so I just thought it was related to some exploration of the action space.

Tonight I ran another training session, this time with the curriculum enabled, and it went on for 50 million steps with no problems or performance drops.
Last time it went crazy after 10M steps, having been around the max score for the last 4M (no curriculum).

Have you ever noticed the problem with a curriculum enabled?

No, not specifically with a curriculum. I guess the problem being related to high entropy makes sense, although my naive thinking so far was that the algorithm incrementally tries variations on the current policy all the time, and that it would always plateau if no better one can be found, rather than completely degrade all of a sudden.
But yeah, I think I’ve seen these kinds of drops more often with the learning rate set to constant instead of linear.
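For anyone reading along: that’s the learning_rate_schedule field in the trainer config. With linear the learning rate decays towards zero over max_steps, while constant keeps it fixed, so late-training updates stay just as large. Roughly (same keys as the config above):

    hyperparameters:
      learning_rate: 0.0003
      # linear: decays towards 0 by max_steps; constant: stays at 0.0003 for the whole run
      learning_rate_schedule: linear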


There must be something that screws up the policy somehow. I understand “jumping off the cliff” to see what happens, but after doing that 100 times with no result, it should go back to the previous working strategy, not keep trying different jump styles over and over.