Hello. I have been attempting to transition from PPO to SAC as my trainer of choice. SAC seems very promising because of its potentially more general solutions and its potentially higher sample efficiency when used right.
However, up to now, SAC training has largely been a failure for me compared to PPO. Training is far slower and less stable, and it often collapses and flatlines in the end. I was hoping someone could help me understand what I am doing wrong.
Without the red run, which had an extreme starting entropy coefficient:
Curiosity is unused here. The pink run uses PPO, unlike the others, which use SAC, and it has only trained for a small number of steps. Even at this minuscule number of steps, it is already outperforming the SAC runs that have trained for 50m to 100m steps.