I’ll start with an example: an agent has to navigate a map by following a path composed of N checkpoints. The state is the relative position of the next 4 checkpoints. The agent receives +1 every time it reaches a checkpoint and -0.01 at every time-step. During training, maps have different sizes and numbers of checkpoints, so the total achievable reward per episode varies with the number of checkpoints. Does this negatively affect training with PPO?
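To make the setup concrete, here is a rough sketch of the reward logic as I described it (the class name, dynamics, and the 0.5 capture radius are just illustrative choices, not my actual implementation):

```python
import numpy as np

class CheckpointEnv:
    """Toy sketch of the checkpoint-following task (illustrative only)."""

    def __init__(self, checkpoints):
        self.checkpoints = [np.asarray(c, dtype=float) for c in checkpoints]
        self.next_idx = 0                # index of the next checkpoint to reach
        self.agent_pos = np.zeros(2)

    def _observation(self):
        # State: relative positions of the next 4 checkpoints (zero-padded near the end).
        rel = [self.checkpoints[i] - self.agent_pos
               for i in range(self.next_idx, min(self.next_idx + 4, len(self.checkpoints)))]
        while len(rel) < 4:
            rel.append(np.zeros(2))
        return np.concatenate(rel)

    def step(self, action):
        self.agent_pos = self.agent_pos + np.asarray(action, dtype=float)  # simplistic dynamics
        reward = -0.01                                    # per-step time penalty
        if np.linalg.norm(self.checkpoints[self.next_idx] - self.agent_pos) < 0.5:
            reward += 1.0                                 # +1 for reaching a checkpoint
            self.next_idx += 1
        done = self.next_idx >= len(self.checkpoints)
        return self._observation(), reward, done, {}
```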
I’m torn between two answers:
A) Yes, it negatively affects the training because the value of the exact same state changes with the map.
B) No, because PPO considers the temporal difference error, not the total reward from a given state to the end of the episode.
However, if B is correct, how can the agent then learn to “go fast”?
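To make my reading of B explicit, this is roughly what I understand by “temporal difference error” in PPO, i.e. advantages built from one-step TD errors via GAE (a sketch written from memory, not from a specific implementation; the γ and λ values are placeholders):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation from one-step TD errors.

    rewards: per-step rewards (here: +1 on a checkpoint, -0.01 otherwise), length T
    values:  value estimates V(s_0), ..., V(s_T), length T + 1
    """
    # One-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(len(rewards), dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

Here `values` would come from the learned critic, which is where answer A’s concern about the same state having different values on different maps would enter.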