SAC Training: Rewards Dropping Off

Hello. I have been attempting to transition from PPO to SAC as my trainer of choice. SAC seems very promising because it can potentially find more general solutions and, when used well, offers higher sample efficiency.

However, so far SAC training has largely been a failure for me compared to PPO. Training is far slower and more unstable, and it often seems to collapse and flatline in the end. I was hoping someone could help me understand what I am doing wrong.


The plot below excludes the red run, which had an extreme starting entropy coefficient.
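
As an aside on that starting entropy coefficient: it is usually something you can set explicitly rather than leave at its default. Here is a minimal sketch, assuming Stable-Baselines3 and a Gymnasium environment (the post does not name its framework, so the environment and values are placeholders), of lowering the initial coefficient while keeping automatic entropy tuning enabled:

```python
# Hedged sketch: assumes Stable-Baselines3 and Gymnasium, which may not be
# the framework behind the plots above; environment and values are placeholders.
import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make("Pendulum-v1")  # placeholder continuous-control environment

# "auto_0.1" keeps automatic entropy tuning but starts the coefficient at 0.1
# instead of the default 1.0; a very large starting value keeps the policy
# close to uniform for a long time, which can show up as flat or collapsing
# reward curves early in training.
model = SAC("MlpPolicy", env, ent_coef="auto_0.1", verbose=1)
model.learn(total_timesteps=100_000)
```

With automatic tuning, the target entropy matters more than the starting value over a long run, but an extreme initial coefficient can still waste a lot of early samples on near-random behaviour.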


Curiosity is not used here. The pink run uses PPO, unlike the others, which use SAC, and has only been training for a small number of steps. Even at this tiny step count it is already outperforming the SAC runs that have trained for 50M-100M steps.

The pink PPO run has finished training (the big temporary drop is from me stopping and restarting the training several times). As you can see, it does far better in an identical environment and even takes less time per step :confused:

More data: the green/turquoise run has fewer resources available to live on in its environment and therefore performs a bit worse. Higher lifetime is better; 1200 is the maximum lifetime.