Hey guys,

So as I understand it, at every timestep t (or after a batched number of timesteps), the NN receives the state and reward.
If I call AddReward(0.1) at timestep 0 and then AddReward(0.1) again at timestep 1, is it like: reward t_0 = 0.1, reward t_1 = 0.2? So the cumulative reward would be 0.3?
And if I call SetReward(0.1) at t_0 and at t_1, would my cumulative reward be 0.2?

Does this mean that if I want to add a negative reward of -0.1 at every timestep, I should do it with SetReward(-0.1) instead of AddReward(-0.1), since AddReward would not result in a linear increase of the reward, but an exponential one?

Also, should I set the reward in the `CollectObservations` or `OnActionReceived` function? Shouldn't there simply be an additional function for this?

AddReward and SetReward modify the reward for a single timestep. At the next timestep, the reward is reset to 0.0.

If r_t = 0.0 is your reward at timestep t, AddReward(value) accumulates reward as:

r_t += value

whereas SetReward(value) sets reward as:

r_t = value
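As a sketch, these per-step semantics can be modeled like this (plain Python toy model, not the actual ML-Agents implementation):

```python
class RewardModel:
    """Toy model of an agent's per-step reward (not the real ML-Agents Agent class)."""

    def __init__(self):
        self.reward = 0.0

    def add_reward(self, value):
        # AddReward: accumulates within the current timestep
        self.reward += value

    def set_reward(self, value):
        # SetReward: overwrites whatever was accumulated this timestep
        self.reward = value

    def end_step(self):
        # At the next timestep, the reward starts again from 0.0
        r, self.reward = self.reward, 0.0
        return r

agent = RewardModel()

# Two AddReward calls in the SAME timestep accumulate:
agent.add_reward(0.1)
agent.add_reward(0.1)
print(agent.end_step())  # 0.2

# One AddReward per timestep does NOT compound -- each step starts from 0.0:
agent.add_reward(0.1)    # t_0
r0 = agent.end_step()    # 0.1
agent.add_reward(0.1)    # t_1
r1 = agent.end_step()    # 0.1
print(r0 + r1)           # cumulative reward 0.2
```

So to the question above: AddReward(0.1) at t_0 and at t_1 gives r_0 = 0.1 and r_1 = 0.1 (not 0.2), for a cumulative reward of 0.2, the same as with SetReward in this case.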

Whether you should set the reward in CollectObservations or OnActionReceived depends on the environment and the reward. For per-timestep penalties, we usually use OnActionReceived, since its call count usually corresponds to max step, except when the Take Actions Between checkbox is unchecked on the DecisionRequester.
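To the earlier question: because the reward is reset between timesteps, AddReward(-0.1) applied once per timestep grows the cumulative penalty linearly, not exponentially. A toy simulation (assumed model, not Unity code):

```python
# Apply a -0.1 penalty via "AddReward" on each of 10 timesteps.
step_reward = 0.0
cumulative = 0.0
for t in range(10):
    step_reward += -0.1      # AddReward(-0.1) within timestep t
    cumulative += step_reward
    step_reward = 0.0        # the per-step reward resets at the next timestep
print(round(cumulative, 6))  # -1.0 -- linear in the number of steps
```

So AddReward(-0.1) and SetReward(-0.1) behave identically here, as long as only one reward call happens per timestep.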

Is a "timestep" defined as the DecisionPeriod set in the DecisionRequester?
Where one step within that period would be a single EnvironmentStep call on the Academy (i.e. FixedUpdate by default)?

Also, I don't quite understand how taking actions between decisions works - isn't the definition of a decision deciding on the next action to take (i.e. one input->output pass through the Python interface)? How is the action vector decided upon between decisions?

What do you mean by "corresponding to max step"?

My current understanding of the academy flow is:

• Setup environment, connect with python, etc.
Loop:
• EnvironmentStep is called
• If there is no episode running, start a new one
• Once the EnvironmentStep count is >= the decision period, continue
• CollectObservations is called, "consuming" the reward accumulated so far and resetting it to 0
• The collected observations and rewards are sent to the Python interface to go through the NN
• Wait for the return of the action vectors from Python
• Wait for the next EnvironmentStep and loop
And somewhere in between batches, the Python interface does some backpropagation for training. This is also where freezes occur during training, since we have to wait for it to finish before requesting the next action.

Is something wrong in my assumptions or did I miss something?
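In pseudocode, the loop I have in mind looks roughly like this (a toy model with an assumed per-step penalty; only names like EnvironmentStep and CollectObservations come from ML-Agents, the rest is made up):

```python
def run_loop(decision_period, max_fixed_updates):
    """Toy model of the stepping loop as I understand it (not actual ML-Agents code)."""
    pending_reward = 0.0
    sent = []  # (fixed_update, reward) pairs handed off to the Python trainer
    for fixed_update in range(1, max_fixed_updates + 1):
        # EnvironmentStep: physics/game logic advances; the agent may AddReward here
        pending_reward += -0.1  # e.g. a per-step penalty
        if fixed_update % decision_period == 0:
            # CollectObservations runs; the accumulated reward is "consumed"
            sent.append((fixed_update, round(pending_reward, 6)))
            pending_reward = 0.0
            # ...then we would wait for the action vector from Python...
    return sent

print(run_loop(decision_period=5, max_fixed_updates=10))  # [(5, -0.5), (10, -0.5)]
```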

Sorry, I should have been more careful. By timestep, I was referring to the number of fixed updates that occur between decisions since an agent can accumulate reward in any of these fixed updates. After the decision interval has elapsed, the reward is reset to 0.

The agent's policy will be queried for a new action given the current observation every decision interval. In the DecisionRequester, there is a checkbox for "Take Actions Between". If this is checked, the agent will continue executing the same action in between decisions.

In the agent script, a max step field is available which corresponds to the maximum number of fixed updates before a new episode begins. OnActionReceived is called on every fixed update (if take actions between is checked).

That understanding of the flow looks good to me. Let me know if I can clarify anything else.
