Sorry if this is too basic, I'm a beginner with ML-Agents.
I built a basic model with one discrete action branch of size two. Both values (0 and 1) can earn positive rewards, depending on the environment values. The issue is that during training the agent picks 0 more often, learns that 0 is the only correct value, and almost stops using 1, even though 1 still earns rewards sometimes. Basically the model learns, but only from one action. The agent observes the current action (1 observation) and the environment values (2 observations).
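To make the setup concrete, here's a minimal Python sketch of roughly what my environment does. The real agent is a Unity C# script; the names, reward rule, and values below are placeholders I made up for illustration, not my actual code:

```python
import random

class ToyEnv:
    """Simplified stand-in for my Unity environment (placeholder logic)."""

    def __init__(self):
        self.last_action = 0
        self.env_values = [0.0, 0.0]

    def observe(self):
        # 3 observations: the current/last action plus the two environment values
        return [float(self.last_action), self.env_values[0], self.env_values[1]]

    def step(self, action):  # action is 0 or 1
        # Both actions can be rewarded; which one is "correct" depends on
        # the environment values (hypothetical rule, for illustration only).
        correct = 1 if self.env_values[0] > self.env_values[1] else 0
        reward = 1.0 if action == correct else -0.1
        self.last_action = action
        self.env_values = [random.random(), random.random()]
        return self.observe(), reward
```

I've tried: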
- playing with the hyperparameters, especially beta, batch and buffer size, normalize, and the learning rate, but with no success (I focused on beta since it controls the entropy bonus; see the sketch after this list)
- both PPO and SAC
- tweaking the reward values; AddReward() with +1 and -0.1 seems to work better, but still nothing
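For reference, this is what I mean by the policy collapsing, and why I thought beta (which, as I understand it, scales PPO's entropy bonus in ML-Agents) was the right knob. The probabilities here are made up for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete action distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Early in training the policy is close to uniform over the two actions...
print(entropy([0.5, 0.5]))    # ~0.693, the maximum for two actions
# ...but it collapses toward always picking action 0.
print(entropy([0.99, 0.01]))  # ~0.056, almost deterministic

# PPO adds beta * entropy to the objective, so a larger beta should,
# in theory, penalize this kind of collapse more strongly.
beta_low, beta_high = 5e-3, 5e-2
print(beta_low * entropy([0.99, 0.01]), beta_high * entropy([0.99, 0.01]))
```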
Any help would be appreciated. Thanks!