I want to know the details of the self-play algorithm implementation in ML-Agents. Which paper should I refer to?
Some self-play papers describe using two memory pools: one for supervised learning and the other for RL training (a rough sketch of what I mean is below). Does this exist in ML-Agents' self-play?
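To clarify what I mean by "two memory pools", here is a minimal sketch of that structure. The names (`TwoPoolSelfPlay`, `add_supervised`, `add_rl`) are placeholders I made up for illustration, not ML-Agents classes:

```python
from collections import deque

class TwoPoolSelfPlay:
    """Sketch of the two-pool setup some self-play papers describe (not ML-Agents code)."""

    def __init__(self, supervised_capacity=100_000, rl_capacity=20_480):
        # Pool 1: (state, target_action) pairs used for supervised updates.
        self.supervised_pool = deque(maxlen=supervised_capacity)
        # Pool 2: full transitions used for the RL (policy/value) updates.
        self.rl_pool = deque(maxlen=rl_capacity)

    def add_supervised(self, state, target_action):
        self.supervised_pool.append((state, target_action))

    def add_rl(self, state, action, reward, next_state, done):
        self.rl_pool.append((state, action, reward, next_state, done))
```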
And how does the buffer in ML-Agents' self-play work? For example, suppose side 1 collects 2048 trajectories and saves them in the buffer, and then the training side switches: does the data collected by side 1 need to be cleared from the buffer, or should I use a separate buffer for each side's agent? I sketch the two options I have in mind below.
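To make the question concrete, these are the two buffer-handling schemes I can imagine. This is purely my own sketch of the alternatives, not ML-Agents code; `swap_learning_side` and the class names are made up:

```python
from collections import deque

BUFFER_SIZE = 2048  # number of trajectories per side, as in my example

# Option A: one shared buffer, cleared when the learning side switches.
class SharedBuffer:
    def __init__(self):
        self.buffer = deque(maxlen=BUFFER_SIZE)
        self.learning_side = 0

    def add(self, trajectory, side):
        # Only the currently learning side's trajectories are stored.
        if side == self.learning_side:
            self.buffer.append(trajectory)

    def swap_learning_side(self):
        self.learning_side = 1 - self.learning_side
        self.buffer.clear()  # drop the previous side's 2048 trajectories?

# Option B: a separate buffer per side; nothing is cleared on the swap.
class PerSideBuffers:
    def __init__(self):
        self.buffers = {0: deque(maxlen=BUFFER_SIZE),
                        1: deque(maxlen=BUFFER_SIZE)}

    def add(self, trajectory, side):
        self.buffers[side].append(trajectory)
```

Which of these (if either) matches what ML-Agents actually does?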
Could anyone tell me about this? Thanks a lot.