Hi everyone, I’m having a really frustrating problem with an agent that performs much worse in inference than it did during training. I’ll try to include as much detail as I can:
Environment:
The agent acts in a 3D environment, controlling a rocket booster that starts in the air a hundred metres or so above the landing pad. Its goal is to land softly. The agent has individual control over the booster’s nine engines.
Actions:
The agent takes both discrete and continuous actions:
- It has 9 discrete action branches, one for each engine, with 5 choices each: no action, start up engine, shut down engine, throttle up, throttle down.
- It has 18 continuous actions: the ability to gimbal each engine in the x and y directions.
- Action masking is implemented to prevent the agent from, for example, starting up an engine that is already running (see the sketch after this list).
- A decision is requested every 5 steps (the ML-Agents default). Actions are taken between decisions.
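For reference, the action setup and masking look roughly like this (a simplified sketch against the ML-Agents 2.x C# API; the engine-state bookkeeping and field names are placeholders, not my exact code):

    using Unity.MLAgents;
    using Unity.MLAgents.Actuators;

    // Illustrative agent class. Behavior Parameters are set to 9 discrete branches
    // of size 5 plus 18 continuous actions. Discrete choices per branch:
    // 0 = no action, 1 = start up, 2 = shut down, 3 = throttle up, 4 = throttle down.
    public class BoosterAgent : Agent
    {
        bool[] engineRunning = new bool[9]; // placeholder engine-state tracking

        public override void WriteDiscreteActionMask(IDiscreteActionMask actionMask)
        {
            for (int engine = 0; engine < 9; engine++)
            {
                // Can't start an engine that is already running, and can't shut down
                // or change throttle on an engine that is off.
                actionMask.SetActionEnabled(engine, 1, !engineRunning[engine]);
                actionMask.SetActionEnabled(engine, 2, engineRunning[engine]);
                actionMask.SetActionEnabled(engine, 3, engineRunning[engine]);
                actionMask.SetActionEnabled(engine, 4, engineRunning[engine]);
            }
        }

        public override void OnActionReceived(ActionBuffers actions)
        {
            for (int engine = 0; engine < 9; engine++)
            {
                int choice = actions.DiscreteActions[engine];
                float gimbalX = actions.ContinuousActions[2 * engine];
                float gimbalY = actions.ContinuousActions[2 * engine + 1];
                // ...apply the discrete choice and the gimbal commands to this engine...
            }
        }
    }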
Observations:
The agent receives 23 observations (collected roughly as in the sketch after this list), including:
- Position relative to pad
- Rotation
- Velocity
- Angular velocity
- Fuel remaining
- Current engine throttles
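To illustrate how those add up, a CollectObservations along these lines would give 23 values, assuming rotation is passed as a quaternion (sketch only, not my exact code; field names are placeholders):

    using Unity.MLAgents;
    using Unity.MLAgents.Sensors;
    using UnityEngine;

    // Same illustrative agent class as above, showing only the observation side.
    public class BoosterAgent : Agent
    {
        Transform pad;                          // landing pad reference (placeholder)
        Rigidbody body;                         // booster rigidbody
        float fuelRemaining;                    // remaining fuel (placeholder units)
        float[] engineThrottles = new float[9]; // current throttle per engine

        public override void CollectObservations(VectorSensor sensor)
        {
            sensor.AddObservation(transform.position - pad.position); // 3: position relative to pad
            sensor.AddObservation(transform.rotation);                // 4: rotation as a quaternion
            sensor.AddObservation(body.velocity);                     // 3: velocity
            sensor.AddObservation(body.angularVelocity);              // 3: angular velocity
            sensor.AddObservation(fuelRemaining);                     // 1: fuel remaining
            foreach (var throttle in engineThrottles)                 // 9: current engine throttles
            {
                sensor.AddObservation(throttle);
            }
            // 3 + 4 + 3 + 3 + 1 + 9 = 23
        }
    }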
Rewards:
- The agent receives a penalty each timestep, based on its angular velocity, to encourage stable flight.
- It does not receive any reward at the end of the episode unless it hits the pad, in which case it receives a positive reward that is higher for a better landing, judged by landing speed, distance to the centre of the pad, etc. (sketched below).
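In code, the reward shaping amounts to something like the following (sketch only; the penalty scale, pad tag and landing-quality formula are placeholders, not my actual values):

    using Unity.MLAgents;
    using UnityEngine;

    // Same illustrative agent class as above, showing only the reward side.
    public class BoosterAgent : Agent
    {
        Transform pad;
        Rigidbody body;
        const float AngularVelocityPenaltyScale = 0.001f; // placeholder scale

        void FixedUpdate()
        {
            // Per-timestep penalty based on angular velocity, to encourage stable flight.
            AddReward(-AngularVelocityPenaltyScale * body.angularVelocity.magnitude);
        }

        void OnCollisionEnter(Collision collision)
        {
            if (collision.collider.CompareTag("LandingPad")) // placeholder tag
            {
                // Terminal reward only if the booster reaches the pad; higher for a slower,
                // more central touchdown (the real formula is more involved than this).
                float landingSpeed = collision.relativeVelocity.magnitude;
                float distanceToCentre = Vector3.Distance(transform.position, pad.position);
                float quality = Mathf.Clamp01(1f - landingSpeed / 20f)
                              * Mathf.Clamp01(1f - distanceToCentre / 10f);
                AddReward(1f + quality);
                EndEpisode();
            }
        }
    }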
Training:
- Training takes place on a remote Linux machine that is provided by my university. I use GitHub to transfer the built environment to the remote computer and the .onnx models back to mine for testing.
- I run 8 instances of the environment, each of which contains 8 training areas.
- Training occurs in real time (1x timescale); the launch command is sketched after the config below.
- My config file is below:
behaviors:
  BoosterLanding:
    trainer_type: ppo
    max_steps: 1.0e9
    time_horizon: 128
    summary_freq: 100000
    hyperparameters:
      batch_size: 2048
      beta: 2.5e-3
      buffer_size: 81920
      epsilon: 0.2
      num_epoch: 3
      lambd: 0.95
      learning_rate: 2.0e-4
      learning_rate_schedule: linear
    network_settings:
      memory:
        memory_size: 128
        sequence_length: 64
      hidden_units: 128
      num_layers: 3
      vis_encode_type: simple
      normalize: true
      # use_recurrent: false
    reward_signals:
      extrinsic:
        strength: 1.0
        gamma: 0.99
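For completeness, training is launched on the remote machine with a command along these lines (the run-id, build path and --no-graphics flag here are placeholders; --num-envs=8 gives the 8 environment instances and --time-scale=1 keeps the simulation at real time):

    mlagents-learn booster_config.yaml --run-id=BoosterLanding --env=builds/BoosterLanding.x86_64 --num-envs=8 --time-scale=1 --no-graphics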
The results of training, according to the TensorBoard statistics, looked nearly perfect: the agent was consistently receiving close to the theoretical maximum reward, and the stats showed it consistently touching down at under 1 m/s. However, when I transferred the model back to my computer and ran it in inference, it was clearly performing far worse than the statistics suggested, landing at over 20 m/s.
If anyone has any suggestions as to what could be going wrong or tests I could carry out to determine the problem, I’d really appreciate it. Thanks.