Hi all,
I’m trying to understand exactly how gradient updates are implemented, so I can make sure I’m using the right horizon / batch size settings.
From the documentation:
- Batch_size is the number of experiences used for one iteration of a gradient descent update.
- Time_horizon corresponds to how many steps of experience to collect per-agent before adding it to the experience buffer.
If, e.g., batch_size is set to 32 and time_horizon to 64:
Does that mean that each of the 32 samples in the batch is one randomly drawn horizon of 64 experiences, with one corresponding total (expected) reward for that set?
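To make it concrete, here’s roughly the mental model I have, in pseudocode (the names and structure are my own guesses, not taken from the ML-Agents source):

```python
# Rough sketch of my current mental model (not the actual ML-Agents code):
# collect per-agent trajectories, cut them into time_horizon-sized chunks,
# then sample batch_size chunks per gradient update.
import random

TIME_HORIZON = 64
BATCH_SIZE = 32

def chunk_trajectory(trajectory, horizon=TIME_HORIZON):
    """Split one agent's list of (obs, action, reward) steps into horizon-sized chunks."""
    return [trajectory[i:i + horizon] for i in range(0, len(trajectory), horizon)]

def total_reward(chunk):
    """One scalar reward per chunk -- is this what the update actually targets?"""
    return sum(step[2] for step in chunk)

def one_gradient_update(buffer):
    """Sample batch_size chunks and (conceptually) do one gradient step on them."""
    batch = random.sample(buffer, BATCH_SIZE)
    targets = [total_reward(chunk) for chunk in batch]
    # ... compute the loss against `targets` and step the optimizer here ...
    return targets
```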
If so, since a bigger horizon contains more steps and therefore accumulates more rewards, will a longer time horizon give greater variance at each gradient update?
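For example, with made-up Gaussian per-step rewards, the spread of the summed returns roughly doubles going from a 16-step to a 64-step horizon:

```python
# Made-up illustration of the variance concern: the spread of summed rewards
# grows with the number of steps being summed.
import numpy as np

rng = np.random.default_rng(0)
per_step_rewards = rng.normal(loc=0.1, scale=1.0, size=(10_000, 64))

short = per_step_rewards[:, :16].sum(axis=1)   # 16-step horizon
long = per_step_rewards[:, :64].sum(axis=1)    # 64-step horizon

print(f"std of 16-step returns: {short.std():.2f}")  # ~4
print(f"std of 64-step returns: {long.std():.2f}")   # ~8
```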
In a reward-dense environment, I’ll probably also have to decrease my learning rate / extrinsic reward strength when I increase time_horizon, right?
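i.e. something along these lines (the numbers are just my own guesses, and the keys are flattened/named by me, not values from the docs):

```python
# Hypothetical adjustment I'm imagining when lengthening the horizon.
old_settings = {"time_horizon": 64,  "learning_rate": 3e-4, "extrinsic_strength": 1.0}
new_settings = {"time_horizon": 256, "learning_rate": 1e-4, "extrinsic_strength": 0.5}
```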
The docs make it sound like the expected reward for a horizon is the expected reward until the end of the entire episode (agent reset?), but that seems strange to me, since each horizon would then have a very different expected reward depending on how early in the episode it starts.
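Just to illustrate what seems odd to me, with toy numbers of my own: if the target really is the (discounted) reward until the episode ends, two horizons cut from the same 500-step episode get very different targets, even with a constant reward per step:

```python
# Toy example of my worry: a horizon starting at step 0 and one starting at
# step 400 of a 500-step episode get very different "reward until episode end"
# targets, even with a constant reward of 1 per step.
GAMMA = 0.99
EPISODE_LENGTH = 500

def discounted_reward_to_episode_end(start_step, reward_per_step=1.0, gamma=GAMMA):
    steps_remaining = EPISODE_LENGTH - start_step
    return sum(reward_per_step * gamma ** t for t in range(steps_remaining))

print(discounted_reward_to_episode_end(0))    # ~99.3
print(discounted_reward_to_episode_end(400))  # ~63.4
```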
Thanks!