I’m training a drone by MBaske (mbaske/ml-drone-collection). It comes with an example that shows you how to make the drone hover, which pretty much means it has to learn to fly in the process. I’m training it to hover until it points at a target and then follow a path to a destination while staying oriented toward the target. In other words, I’m trying to make a follow-me drone for bike riding.
So anyway, I have been asking ChatGPT a lot of questions. I was wondering whether it’s better to send rewards individually or to sum them all up and send them once. I couldn’t find the answer on Google or any other search engine.
Here is the entire conversation. Hope it helps someone. I asked because I’m currently giving rewards individually rather than summing them up.
================================================================
Hi ChatGPT. Can you tell me whether, in Unity ML-Agents using the PPO algorithm, it’s better to add up all the rewards and send them with AddReward() as one value, or to use AddReward() multiple times during an episode?
ChatGPT
In Unity ML-Agents with the Proximal Policy Optimization (PPO) algorithm, the choice of how and when to use the AddReward() function can impact the learning process. Generally, it is more effective to use AddReward() multiple times during an episode, rather than adding up all the rewards and sending them as a single value.
The reason for this is that PPO relies on estimating the advantage function, which measures how much better or worse an action is compared to the average action taken in that state. By using AddReward() multiple times, you provide more fine-grained feedback to the agent about the consequences of its actions at different points within the episode.
When rewards are accumulated and sent as a single value using AddReward(), it can lead to delayed or sparse rewards. Delayed rewards refer to situations where the agent receives a reward only at the end of a long sequence of actions, making it challenging for the agent to associate its actions with specific outcomes. Sparse rewards occur when rewards are infrequent, making it difficult for the agent to learn optimal behavior.
By using AddReward() multiple times, you can provide more immediate feedback to the agent, guiding it towards desirable behavior. This approach helps reduce the temporal credit assignment problem, making it easier for the agent to understand the impact of its actions on its immediate environment.
In summary, using AddReward() multiple times during an episode in Unity ML-Agents with the PPO algorithm generally leads to better learning outcomes compared to accumulating rewards and sending them as a single value.
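For anyone who wants to see what the two approaches look like in code, here is a minimal C# sketch of an ML-Agents agent. It is not the actual mbaske drone agent; the class name, the target reference, and the reward weights are made-up placeholders, and the summed-at-the-end variant is left commented out just for comparison.

using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Actuators;

// Hypothetical example agent. The names (FollowMeAgent, target) and the
// reward weights are illustrative assumptions, not code from ml-drone-collection.
public class FollowMeAgent : Agent
{
    public Transform target;      // assumed reference to the thing being followed
    float episodeRewardSum;       // only used by the summed variant below

    public override void OnEpisodeBegin()
    {
        episodeRewardSum = 0f;
    }

    public override void OnActionReceived(ActionBuffers actions)
    {
        // ... apply rotor thrust from the actions here ...

        // Per-step (dense) shaping: small rewards at every decision step,
        // so the trainer can credit individual actions.
        float facing = Vector3.Dot(transform.forward,
            (target.position - transform.position).normalized);
        AddReward(0.01f * facing);   // reward for pointing at the target
        AddReward(-0.001f);          // small per-step time penalty

        // Summed alternative (delayed/sparse): accumulate locally and hand
        // everything over only when the episode ends.
        // episodeRewardSum += 0.01f * facing - 0.001f;
        // if (StepCount >= MaxStep - 1)
        // {
        //     AddReward(episodeRewardSum);
        //     EndEpisode();
        // }
    }
}

The uncommented per-step calls are what the answer above recommends; the commented-out block is the summed-once version I was asking about, kept only to show the difference.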