The details of the self-play algorithm implementation

I want to know the details of the self-play algorithm implementation in ML-Agents. Which paper should I refer to?

Some self-play papers say they use two memory pools: one for supervised learning and the other for RL training. Does this exist in ML-Agents' self-play?

And how does the buffer in ML-Agents' self-play work? For example, suppose side 1 collects 2048 trajectories and saves them in the buffer. When the training side switches, does the data collected by side 1 need to be cleared from the buffer? Or should I use a separate buffer for each side's agent? (A rough sketch of the two options I have in mind is below.)
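
To make the scenario concrete, here is a minimal sketch of the two alternatives I am asking about. All names here (`TrajectoryBuffer`, `on_team_switch_option_a`, etc.) are hypothetical and not actual ML-Agents classes or functions:

```python
# Hypothetical illustration of the two buffer-handling options in question.
from collections import defaultdict

class TrajectoryBuffer:
    def __init__(self):
        self.trajectories = []

    def add(self, trajectory):
        self.trajectories.append(trajectory)

    def clear(self):
        self.trajectories.clear()

# Option A: one shared buffer, cleared whenever the learning side switches,
# so the new learning side never trains on the other side's data.
shared_buffer = TrajectoryBuffer()

def on_team_switch_option_a():
    shared_buffer.clear()

# Option B: a separate buffer per side, so side 1's trajectories and
# side 2's trajectories never mix in the first place.
per_team_buffers = defaultdict(TrajectoryBuffer)

def add_trajectory_option_b(team_id, trajectory):
    per_team_buffers[team_id].add(trajectory)
```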

Could anyone tell me about this? Thanks a lot.

ML-Agents' implementation of self-play largely follows this OpenAI paper: https://arxiv.org/pdf/1710.03748.pdf

Here's the blog post too, if you want something more digestible: https://openai.com/research/competitive-self-play
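
To give a flavor of the scheme described in that paper, here is a minimal sketch of the opponent-sampling idea: the learning policy periodically saves frozen snapshots of itself, and the opponent is drawn either from the latest snapshot or uniformly from a window of past snapshots. This is only an illustration under those assumptions, not ML-Agents' actual code; names like `OpponentPool` and `play_against_latest_ratio` are made up here.

```python
# Illustrative sketch of snapshot-based opponent sampling (not ML-Agents code).
import random

class OpponentPool:
    def __init__(self, window=10, play_against_latest_ratio=0.5):
        self.window = window                  # how many past snapshots to keep
        self.latest_ratio = play_against_latest_ratio
        self.snapshots = []                   # frozen copies of past policy weights

    def save_snapshot(self, policy_weights):
        self.snapshots.append(policy_weights)
        if len(self.snapshots) > self.window:
            self.snapshots.pop(0)             # drop the oldest snapshot

    def sample_opponent(self):
        if not self.snapshots:
            return None
        if random.random() < self.latest_ratio:
            return self.snapshots[-1]         # play against the most recent self
        return random.choice(self.snapshots)  # or a uniformly sampled past self
```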

Hi, if I want to decide which agent's behavior gets trained first, what can I do? Are there any instructions that make this happen?
Thanks!