SAC not learning while PPO works fine in the same environment

Unity 2023.2.13f1
ML-Agents 1.0.0

I have been struggling to get SAC to work in Unity ML-Agents, trying multiple different config parameters. In the same environment, PPO learns and converges within 1.5 million steps, while SAC still fails to find a solution after 6 million steps and also suffers catastrophic forgetting. During SAC training, the environment also freezes repeatedly for long stretches after about 2,000 steps.

The environment is identical when training with both SAC and PPO.
Is there a known issue with SAC in ML-Agents?

Agent Action Space: Discrete

This is the learning graph for PPO, which shows steady learning:
[image: PPO learning graph]

This is the learning graph for SAC, which is very unstable and struggles to learn:
[image: SAC learning graph]

Config for PPO:

FindExitAgent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 120
      buffer_size: 12000
      learning_rate: 0.0003
      beta: 0.001
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 10
    max_steps: 500000000
    time_horizon: 1000
    summary_freq: 12000
    threaded: true

Config for SAC:

FindExitAgent:
    trainer_type: sac
    hyperparameters:
      learning_rate: 0.0003
      learning_rate_schedule: constant
      batch_size: 256
      buffer_size: 500000
      buffer_init_steps: 0
      tau: 0.005
      steps_per_update: 20.0
      save_replay_buffer: false
      init_entcoef: 1.0
      reward_signal_steps_per_update: 20.0
    network_settings:
      normalize: true
      hidden_units: 512
      num_layers: 3
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.995
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 5000000
    time_horizon: 1000
    summary_freq: 30000

Additionally, training with PPO took only around 30 minutes to reach a million steps, while SAC needed nearly 2 hours. I expected SAC to be slower per step because of the replay buffer and other factors, but the difference is immense.
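In rough numbers (back-of-the-envelope arithmetic from the wall-clock times above):

```python
# Approximate environment-step throughput from the timings above
ppo_rate = 1_000_000 / (30 * 60)   # ~556 steps/s for PPO
sac_rate = 1_000_000 / (120 * 60)  # ~139 steps/s for SAC
print(round(ppo_rate / sac_rate, 1))  # SAC is roughly 4x slower overall
```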
I was curious, so I also tracked the time between each Academy step (orange = PPO, gray = SAC).
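For reference, a minimal sketch of how the per-step timing can be measured from the Python side. `StepTimer` is a hypothetical helper, and the lambda stands in for whatever call actually advances the environment (e.g. `env.step()` when driving a `UnityEnvironment` through the mlagents_envs low-level API), so this is illustrative rather than my exact code:

```python
import time

class StepTimer:
    """Wraps a step function and records wall-clock time between calls."""

    def __init__(self, step_fn):
        self.step_fn = step_fn  # e.g. env.step for a UnityEnvironment
        self.deltas = []        # seconds elapsed between consecutive calls
        self._last = None

    def step(self, *args, **kwargs):
        now = time.perf_counter()
        if self._last is not None:
            self.deltas.append(now - self._last)
        self._last = now
        return self.step_fn(*args, **kwargs)

# Dummy step function standing in for the real environment step:
timer = StepTimer(lambda: None)
for _ in range(5):
    timer.step()
print(len(timer.deltas))  # 4 intervals recorded for 5 calls
```

Plotting `timer.deltas` for both trainers is what produced the orange/gray comparison above.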

What could be the issue here?

Edit: For SAC I have tried numerous different configs, varying network size, number of layers, and batch size, but nothing has been successful.
