Weird change in cumulative reward

I was working on a project and my cumulative reward changed so weirdly that I thought I should post it here.


I read that curiosity can lead to this kind of behavior; however, I am only using the extrinsic reward.
My configuration file:

behaviors:
  PandemicAgent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.0003
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 6
      learning_rate_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 512 #256
      num_layers: 4 #2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5
    checkpoint_interval: 500000
    max_steps: 1.0e7
    time_horizon: 128
    summary_freq: 10000
    threaded: true
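
For completeness, I launch training with the standard mlagents-learn CLI, roughly like this (the file name and run ID below are just placeholders, not my real ones):

mlagents-learn config.yaml --run-id=PandemicAgent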

The task is simple: the blue agent tries to collect yellow cubes as fast as possible.

Any idea why this happened?

Do you have a max_step set for an episode, or is it a continuous environment that does not reset? My guess is that the agent encountered a situation it couldn't get out of, like being stuck in a corner, or that the network learned a flaw during training where the agent wants to go forward based on the pixel values it sees in the corner and therefore keeps walking against it.

There was a maximum_step, and it wasn't a continuous environment, so I don't think that was it. When I watched the simulation, I saw the cube kept spinning in place rather than going to the reward. @BotAcademy

That's interesting. Hopefully someone from the dev team can help you out!

Would you mind sharing your policy/value loss and policy entropy curves? Also, you could try running with threaded: false, which might help stability (see the snippet below).
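
To be concrete, the only line that changes is the last one in your behavior block; everything else stays as you posted it:

threaded: false  # step environments and update the model sequentially instead of in parallel

The loss and entropy curves show up in TensorBoard if you point it at the results directory, e.g. tensorboard --logdir results.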