Cumulative reward decreases while episode length is also decreasing.

Hi,

The cumulative reward is decreasing while the episode length is also decreasing. When the agent wins, the episode ends and it receives 20 points. Why does the cumulative reward decrease even though the agent is getting better and better at solving the task? (Attaching stats for the agent.)

Rewards:

-0.2 for colliding with the border
+2 for taking a treasure from a chamber
+4 for carrying that treasure to the agent's own chamber
+20 for winning the game
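
For context, a rough sketch of how rewards like these could be assigned in an ML-Agents 1.0 agent. The TreasureAgent class name and the On... hooks are made up for illustration (the real project will have its own collision and pickup logic); AddReward and EndEpisode are actual ML-Agents API calls:

using Unity.MLAgents;

// Hypothetical sketch; the event hooks below stand in for whatever
// collision/pickup callbacks the project actually uses.
public class TreasureAgent : Agent
{
    void OnBorderCollision()   { AddReward(-0.2f); }  // collided with the border
    void OnTreasurePickedUp()  { AddReward(2f); }     // took a treasure from a chamber
    void OnTreasureDelivered() { AddReward(4f); }     // carried it to own chamber
    void OnGameWon()           { AddReward(20f); EndEpisode(); }  // won the game
}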

Config file:

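# Defaults applied to every behavior; the PlayerAgent section below overrides a subset.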
default:
    trainer: ppo
    batch_size: 1024
    beta: 5.0e-3
    buffer_size: 10240
    epsilon: 0.2
    hidden_units: 128
    lambd: 0.95
    learning_rate: 3.0e-4
    learning_rate_schedule: linear
    max_steps: 5.0e5
    memory_size: 128
    normalize: false
    num_epoch: 3
    num_layers: 2
    time_horizon: 64
    sequence_length: 64
    summary_freq: 20000
    use_recurrent: false
    vis_encode_type: simple
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
        curiosity:
            strength: 0.02
            gamma: 0.99
            encoding_size: 64
            learning_rate: 3.0e-3

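# Overrides applied only to the PlayerAgent behavior.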
PlayerAgent:
    time_horizon: 256
    batch_size: 4096
    buffer_size: 40960
    hidden_units: 512
    max_steps: 5.0e6
    beta: 7.5e-3

GitHub repository (you need to manually import the ML-Agents 1.0.8 package into the project):

https://github.com/Badsalt/AI

/Melvin

8845120–1205575–AI_292_PlayerAgent.zip (4.6 KB)

Cumulative reward is the total reward earned over the episode, i.e. the sum of every reward the agent receives between the start of the episode and its end.

Imagine you earn $1 per hour and you work 8 hours. Your cumulative reward is $8.

Now imagine you are paid by the hour to do a task, and you learn to do it faster, finishing in only 3 hours. Your cumulative reward is now only $3, even though you got better at the task.

So it's OK for cumulative reward to go down, by the way.

Oh, you don't have per-timestep rewards. Never mind then :P That's not the reason in your case.

I'd recommend watching what the agent is actually doing. How is it earning reward? Why are the episodes getting shorter? There's either a bug in your code, or the agent is somehow 'gaming' your rewards. Watching what it actually does will likely give you some insight.

I solved the problem by scaling the final reward based on the number of steps taken, so faster wins now earn a larger reward.
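
Something along these lines, for anyone finding this later. This is a minimal sketch, not the exact code from the project: OnGameWon is a hypothetical hook and the clamping values are a guess, but StepCount and MaxStep are real members of the ML-Agents 1.0 Agent class.

using Unity.MLAgents;
using UnityEngine;

public class TreasureAgent : Agent
{
    void OnGameWon()
    {
        // Fewer steps used -> larger share of the 20-point win bonus,
        // so cumulative reward now rises as episodes get shorter.
        // Assumes MaxStep is set (> 0) on the agent.
        float speedFactor = 1f - (float)StepCount / MaxStep;
        AddReward(20f * Mathf.Clamp(speedFactor, 0.1f, 1f));
        EndEpisode();
    }
}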