I’m currently trying to make an AI that can fly a drone (quadcopter) to a goal with a randomized location and hover at it until the time runs out. The drone receives a negative reward when it flies out of bounds, which ends the episode early. It should get a reward after each decision based on its distance to the goal, encouraging it to stay as close to the goal as possible.
This is where I have a problem, however. If the drone happens to spawn further away from the goal than in a different episode, it gets a lower reward for doing the same thing, which makes the reward inconsistent. My solution was to take the distance the drone starts out at and give a reward based on the percentage of that distance the drone has crossed. However, this also has a problem: if the drone is headed towards the goal at a constant velocity, it will cross a higher percentage of the distance if it started out closer to the goal than if it spawned further away, making it inconsistent again. What would be a solution to this?
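To make that concrete, the two schemes I’ve tried look roughly like this (a simplified Python sketch, not my actual code; the names and scaling are placeholders):

```python
import numpy as np

def raw_distance_reward(drone_pos, goal_pos, max_dist):
    """Reward per decision based only on the current distance to the goal.
    Problem: an episode that spawns far from the goal earns less for the
    same behaviour than one that spawns close."""
    dist = np.linalg.norm(np.asarray(goal_pos) - np.asarray(drone_pos))
    return 1.0 - dist / max_dist  # 1 at the goal, 0 at max_dist away

def percent_progress_reward(drone_pos, goal_pos, start_dist):
    """Reward based on the fraction of the starting distance already covered.
    Problem: at a constant speed, a drone that spawned close covers a larger
    fraction of its starting distance per step than one that spawned far away."""
    dist = np.linalg.norm(np.asarray(goal_pos) - np.asarray(drone_pos))
    return 1.0 - dist / start_dist
```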
Assuming your goal is a single point in space, perhaps introducing a goal zone would give you the kind of reward consistency you are looking for.
So as an example: a sphere collider or circle collider (depending on whether it is 3D or 2D space) of fixed radius with the goal at its centre. If the AI enters the zone it is rewarded, and if you wish to increase its reward based on the distance to the centre of the goal, you can then continue to reward it based on how well it maintains its distance to the centre point. That would normalise the rewards the AI would be receiving, would it not?
You don’t even really need to use physics; this could be done mathematically, where you only reward the AI once it is inside a maximum distance, and the reward is judged on how close it stays to the centre relative to the edge of the zone. Hopefully this idea leads you to a solution!
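As a rough sketch of that maths-only version (the radius value and function name here are just placeholders to tune for your arena):

```python
import numpy as np

GOAL_RADIUS = 5.0  # fixed radius of the goal zone; tune to the size of your training area

def goal_zone_reward(drone_pos, goal_pos, radius=GOAL_RADIUS):
    """No reward outside the zone; inside it, scale from 0 at the edge up to
    1 at the centre. The result no longer depends on where the drone spawned."""
    dist = np.linalg.norm(np.asarray(goal_pos) - np.asarray(drone_pos))
    if dist >= radius:
        return 0.0
    return 1.0 - dist / radius
```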
That could work, and I’ll try it out once I get home, but I’m worried the AI won’t be able to find the goal at first, meaning it won’t know how to get the reward.
If you don’t mind me asking, how does your agent currently observe the world? That would help determine whether it’s likely to be able to figure out the way towards the goal when using a goal zone.
If, at a minimum, it knows the position of the zone and knows its own velocity and position, it ‘should’ be able to work out how to get rewards… eventually! (You may be right, though; statistically it’s possible for it not to.) You could use GAIL demonstrations to help guide it initially, or even with just curiosity rewards enabled it ‘should’ at least find the goal at some point.
It knows its own position, rotation and transformation, its distance to the goal on each axis, and its distance to the outer bounds of the training area.
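Roughly, per decision, the observation vector is built up like this (a simplified sketch of what I just described, not my actual code):

```python
import numpy as np

def collect_observations(drone_pos, drone_rot, goal_pos, bounds_min, bounds_max):
    """Builds the observation vector described above: own position and rotation,
    per-axis distance to the goal, and per-axis distance to the bounds of the
    training area."""
    drone_pos = np.asarray(drone_pos)
    to_goal = np.asarray(goal_pos) - drone_pos      # distance to the goal on each axis
    to_lower = drone_pos - np.asarray(bounds_min)   # distance to the lower bounds
    to_upper = np.asarray(bounds_max) - drone_pos   # distance to the upper bounds
    return np.concatenate([drone_pos, np.asarray(drone_rot), to_goal, to_lower, to_upper])
```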