How to reduce the agent's excessive use of actions during early training

Hello,

Let's take a simple example: a cube with movement plus one attack.
Two discrete branches:

  • Movement, size 5
  • Attack, size 2 (attack or don't attack)

And let's say the attack takes 2 seconds to complete. During this time, movement is not allowed. (Because it's a simplified example, one discrete branch would be enough, but in our project it's a bit more complicated.)
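To make the setup concrete, here is a rough sketch in plain Python of how I picture the two branches and the 2-second lockout. Everything in it (the `CubeEnv` class, the 0.1 s decision interval, the exact branch layout) is only an illustrative assumption for this simplified example, not our actual code.

```python
import numpy as np

STEP_DT = 0.1                       # assumed decision interval of 0.1 s (illustrative)
ATTACK_STEPS = int(2.0 / STEP_DT)   # so a 2-second attack blocks 20 decisions

MOVE_BRANCH_SIZE = 5                # branch 0: stand still + 4 directions
ATTACK_BRANCH_SIZE = 2              # branch 1: don't attack / attack


class CubeEnv:
    """Toy stand-in for the real environment, just to show the lockout logic."""

    def __init__(self):
        self.attack_timer = 0       # decisions left until the current attack resolves

    def action_mask(self):
        """Which actions are legal at the next decision, one bool array per branch."""
        move_mask = np.ones(MOVE_BRANCH_SIZE, dtype=bool)
        attack_mask = np.ones(ATTACK_BRANCH_SIZE, dtype=bool)
        if self.attack_timer > 0:
            move_mask[1:] = False    # movement is not allowed while the attack resolves
        return move_mask, attack_mask

    def step(self, move_action, attack_action):
        if self.attack_timer > 0:
            self.attack_timer -= 1   # attack still in progress, movement is ignored
        elif attack_action == 1:
            self.attack_timer = ATTACK_STEPS
        else:
            pass                     # apply move_action to the cube here
        # ... observations, reward and episode termination go here ...
```

Masking the movement branch this way only encodes the "no movement during an attack" rule; it doesn't by itself stop the agent from starting a new attack the moment the previous one finishes, which is the spam I'd like to avoid.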

How can we get the cube to not spam attacks all the time? I know that eventually it will reduce its use of them to optimise its reward. But because of how RL works, with a high amount of random actions at the start, the agent will obviously attack all the time, and for the majority of training it never tries to just walk. This seems like a very inefficient way of learning, which is why I'm trying to find a better approach.

  • Curriculum learning could be a solution, but teaching it first to reach the target feels a bit like cheating.

  • Adding a cooldown on the attack just for training seems like a bad idea, since the task would no longer be the same.

  • A penalty on missed attacks doesn't seem to be effective enough with a long horizon.

  • GAIL doesn't seem to be compatible with our project, due to the need for heavy generalisation.

Does anyone have ideas for an approach to this problem?

Thank you

Can you predict what the likely outcome of the attack will be in advance? If so, you could give it an immediate reward for a successful attack and a (small!) negative reward for a failed or useless attack.
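Roughly something like this, where `predicted_hit` stands in for whatever check your game can run at the moment the attack is thrown, and the magnitudes are placeholder values rather than a recipe:

```python
HIT_BONUS = 0.1        # keep shaping rewards small next to the main task reward
MISS_PENALTY = -0.02   # small, so attacking isn't discouraged entirely

def attack_shaping_reward(predicted_hit: bool) -> float:
    """Immediate shaping reward handed out the moment the attack is thrown."""
    return HIT_BONUS if predicted_hit else MISS_PENALTY
```

The point is that the feedback arrives at the step where the attack decision was made, so the credit doesn't have to propagate back across a long horizon.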

Thanks for your answer.
I can't know in advance; in the game, some attacks can hit 3 seconds after being thrown. I already have a penalty on missed attacks and a bonus on successful ones (after 1-3 sec). However, it seems that because of the very long horizon, those rewards are not really effective.