Are ML-Agents trainers on-policy?

In the code, I see the following snippet and comment in rl_trainer.py around line 133:

```python
with hierarchical_timer("process_trajectory"):
    for traj_queue in self.trajectory_queues:
        # We grab at most the maximum length of the queue.
        # This ensures that even if the queue is being filled faster than it is
        # being emptied, the trajectories in the queue are on-policy.
        _queried = False
        for _ in range(traj_queue.qsize()):
            _queried = True
            try:
                t = traj_queue.get_nowait()
                self._process_trajectory(t)
            except AgentManagerQueue.Empty:
                break
```
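
Restated as a minimal, runnable sketch (using the standard-library `queue` module; `Trajectory` and `policy_version` are made-up names, not the real ML-Agents types), the drain pattern looks like this. The comment's claim is that because the whole queue is emptied on every trainer step, anything consumed for an update was generated by the current policy:

```python
import queue

# Hypothetical stand-in for illustration; not the real ML-Agents class.
class Trajectory:
    def __init__(self, policy_version: int):
        self.policy_version = policy_version

def drain(traj_queue: "queue.Queue[Trajectory]") -> list:
    """Mirror the qsize()/get_nowait() loop: take everything currently queued."""
    drained = []
    for _ in range(traj_queue.qsize()):
        try:
            drained.append(traj_queue.get_nowait())
        except queue.Empty:  # another consumer emptied it first
            break
    return drained

# Simulate one trainer step: actors filled the queue while the policy
# was at version 3; the trainer drains it all before any update.
q: "queue.Queue[Trajectory]" = queue.Queue()
for _ in range(5):
    q.put(Trajectory(policy_version=3))

batch = drain(q)
assert all(t.policy_version == 3 for t in batch)
# Only after draining would the trainer update and bump the version.
```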

I can't understand why the trajectories in the queue are on-policy. Trajectories that are still sitting in the queue when the policy gets updated seem to have been generated by the old policy.

"On-policy" is a reinforcement learning technical term meaning that a policy update should only be computed with trajectories sampled from that same policy.
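
A toy illustration of that definition (all names here are invented for the example):

```python
current_version = 7

trajectories = [
    {"policy_version": 7, "steps": [...]},  # sampled from the current policy
    {"policy_version": 6, "steps": [...]},  # sampled from an older policy
]

# An on-policy update may only consume the version-7 trajectory; reusing
# the version-6 one would make the update off-policy.
on_policy_batch = [t for t in trajectories if t["policy_version"] == current_version]
```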

Does this answer your question, or are you concerned with the implementation?

I am concerned with the implementation, because I don't see how training can stay on-policy when a queue is used. GA3C and IMPALA also use queues to communicate between the trainer and the actors, and both of them are off-policy.
Does ML-Agents stop stepping the environment and train the policy on all queued trajectories, even when the number of experience steps hasn't reached the maximum?
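
For context, a typical on-policy trainer gates its update on an accumulated step count rather than on the queue itself; the sketch below is hypothetical (not the actual ML-Agents logic), with `buffer_size` echoing the PPO hyperparameter name:

```python
# Hypothetical sketch of on-policy update gating; not ML-Agents code.
class TrainerSketch:
    def __init__(self, buffer_size: int):
        self.buffer_size = buffer_size  # steps to accumulate before updating
        self.experiences: list = []

    def process_trajectory(self, trajectory: list) -> None:
        # Called for each trajectory drained from the queue.
        self.experiences.extend(trajectory)

    def advance(self) -> None:
        # Update only once enough steps have accumulated, then clear the
        # buffer so no stale experience is reused by the next update.
        if len(self.experiences) >= self.buffer_size:
            self.update_policy(self.experiences)
            self.experiences.clear()

    def update_policy(self, batch: list) -> None:
        pass  # gradient step would go here
```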