Agent performing worse during inference

Hi everyone, I’m having a really frustrating problem with an agent that performs much worse during inference than it did during training. I’ll try to include as much detail as I can:

Environment:
The agent acts in a 3D environment, controlling a rocket booster that starts in the air a hundred metres or so above the landing pad. Its goal is to land softly. The agent has individual control over the booster’s nine engines.

Actions:
The agent takes both discrete and continuous actions:

  • It has 9 discrete actions, one for each engine, with 5 choices each: no action, start up engine, shut down engine, throttle up, throttle down.
  • It has 18 continuous actions: the ability to gimbal each engine in the x and y directions.
  • Action masking is implemented to prevent the agent from, for example, starting up an engine that is already running.
  • A decision is requested every 5 steps (the ML-Agents default), and actions are taken between decisions. A simplified sketch of the action handling follows this list.
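
To illustrate, the action handling looks roughly like this. This is a simplified sketch rather than my exact code: EngineController and its methods (StartUp, SetGimbal, IsRunning, etc.) are placeholders for my actual engine script, and the masking uses the SetActionEnabled API from recent ML-Agents releases.

using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using UnityEngine;

public class BoosterAgent : Agent
{
    // Placeholder per-engine controller; 9 entries, one per engine.
    [SerializeField] EngineController[] engines;

    public override void OnActionReceived(ActionBuffers actions)
    {
        var discrete = actions.DiscreteActions;      // 9 branches, 5 choices each
        var continuous = actions.ContinuousActions;  // 18 values: gimbal x and y per engine

        for (int i = 0; i < engines.Length; i++)
        {
            switch (discrete[i])
            {
                case 1: engines[i].StartUp();      break;
                case 2: engines[i].ShutDown();     break;
                case 3: engines[i].ThrottleUp();   break;
                case 4: engines[i].ThrottleDown(); break;
                // case 0: no action
            }
            engines[i].SetGimbal(continuous[2 * i], continuous[2 * i + 1]);
        }
    }

    public override void WriteDiscreteActionMask(IDiscreteActionMask actionMask)
    {
        for (int i = 0; i < engines.Length; i++)
        {
            // Can't start an engine that is already running, or shut down one that isn't.
            actionMask.SetActionEnabled(i, 1, !engines[i].IsRunning);
            actionMask.SetActionEnabled(i, 2, engines[i].IsRunning);
        }
    }
}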

Observations:
The agent receives 23 observations (a sketch of how they are collected follows this list), including:

  • Position relative to pad
  • Rotation
  • Velocity
  • Angular velocity
  • Fuel remaining
  • Current engine throttles
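
Concretely, a breakdown that sums to 23 looks something like this (again a sketch in the same Agent class as above, assuming rotation is added as a quaternion; pad, rb, and fuelRemaining are placeholder fields, and VectorSensor comes from Unity.MLAgents.Sensors):

public override void CollectObservations(VectorSensor sensor)
{
    sensor.AddObservation(transform.position - pad.position);  // 3: position relative to pad
    sensor.AddObservation(transform.rotation);                  // 4: rotation (quaternion)
    sensor.AddObservation(rb.velocity);                         // 3: velocity
    sensor.AddObservation(rb.angularVelocity);                  // 3: angular velocity
    sensor.AddObservation(fuelRemaining);                       // 1: fuel remaining
    foreach (var engine in engines)
        sensor.AddObservation(engine.Throttle);                 // 9: current engine throttles
}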

Rewards:

  • The agent receives a penalty each timestep dependent on its angular velocity, to encourage stable flight.
  • It does not receive any reward at the end of the episode unless it hits the pad, in which case it receives a positive reward that is higher for a better landing (determined by landing speed, distance to the centre of the pad, etc.). A sketch of this reward logic follows this list.
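
In code, the reward logic amounts to roughly the following (a sketch in the same Agent class; the penalty scale, pad tag, and landing score formula are stand-ins for my actual values):

void FixedUpdate()
{
    // Small per-timestep penalty based on angular velocity, to encourage stable flight.
    AddReward(-angularVelocityPenaltyScale * rb.angularVelocity.magnitude);
}

void OnCollisionEnter(Collision collision)
{
    // Terminal reward only when the booster touches the pad; episodes that end any
    // other way get no terminal reward.
    if (collision.gameObject.CompareTag("LandingPad"))
    {
        float landingSpeed = rb.velocity.magnitude;
        float distanceToCentre = Vector3.Distance(transform.position, pad.position);
        // Slower, more central landings score higher.
        AddReward(maxLandingReward / (1f + landingSpeed + distanceToCentre));
        EndEpisode();
    }
}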

Training:

  • Training takes place on a remote Linux machine that is provided by my university. I use GitHub to transfer the built environment to the remote computer and the .onnx models back to mine for testing.
  • I run 8 instances of the environment, each of which contains 8 training areas.
  • Training occurs in real time (1x timescale).
  • My config file is below:
behaviors:
    BoosterLanding:
        trainer_type: ppo
        max_steps: 1.0e9
        time_horizon: 128
        summary_freq: 100000
        hyperparameters:
            batch_size: 2048
            beta: 2.5e-3
            buffer_size: 81920
            epsilon: 0.2
            num_epoch: 3
            lambd: 0.95
            learning_rate: 2.0e-4
            learning_rate_schedule: linear
        network_settings:
            memory:
                memory_size: 128
                sequence_length: 64
            hidden_units: 128
            num_layers: 3
            vis_encode_type: simple
            normalize: true
        # use_recurrent: false
        reward_signals:
            extrinsic:
                strength: 1.0
                gamma: 0.99

The results of training, according to the TensorBoard statistics, were nearly perfect: the agent was consistently receiving close to the theoretical maximum reward. However, when I transferred the model back to my computer, it clearly performed worse than the statistics suggested. The statistics showed it consistently landing at below 1 m/s, but in inference it was landing at over 20 m/s.

If anyone has any suggestions as to what could be going wrong or tests I could carry out to determine the problem, I’d really appreciate it. Thanks.

So is it actually getting a lower reward than the training said it was, or is it just finding another way to get the “max” reward?
If you’re not happy with the landing, you might just need to increase the reward it gets for landing well. Now that it has a good baseline to train from, it shouldn’t take much to improve the landing (assuming the flight is OK; it’s not clear from your post).