OnActionReceived() called before SetMask()

I’m making a chess AI through self-play.
There are two players, A and B, and A plays first.
When A plays, everything works normally: the program calls SetMask() and then OnActionReceived().
[Screenshot: console log when player A moves]
When B plays, it calls OnActionReceived() before SetMask() (the 3rd and 5th lines in the screenshot below).
[Screenshot: console log when player B moves]
Even weirder, when player B makes his first move, the agent uses player A’s mask. After that first move, player B still calls OnActionReceived() before SetMask(), but he uses the mask from the previous round.
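For context, a minimal sketch of where the mask gets set, assuming SetMask() in the log is DiscreteActionMasker.SetMask() called from the CollectDiscreteActionMasks() override (the class name and GetIllegalMoveIndices() are placeholders, and exact type/namespace names depend on the ML-Agents version):

```csharp
using System.Collections.Generic;
using Unity.MLAgents;

public class ChessAgent : Agent
{
    // Placeholder for the actual move generation: returns the flattened action
    // indices that are illegal in the current position.
    private List<int> GetIllegalMoveIndices()
    {
        return new List<int>();
    }

    // ML-Agents is expected to call this before OnActionReceived() for each
    // requested decision, so the mask should match the player about to move.
    public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker)
    {
        actionMasker.SetMask(0, GetIllegalMoveIndices());
    }
}
```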
Here are my agent settings:
Player A:
[Screenshot: player A’s agent settings]
Player B:
[Screenshot: player B’s agent settings]

Well, I spent a few hours and finally found where the problem is: I should wait for WaitForFixedUpdate before I request the agent’s decision.
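In other words, instead of calling RequestDecision() right after the previous move is applied, I wait one WaitForFixedUpdate first. A minimal sketch of the idea (TurnController and RequestNextMove() are just placeholder names, not my exact code):

```csharp
using System.Collections;
using UnityEngine;
using Unity.MLAgents;

public class TurnController : MonoBehaviour
{
    // Called by the game logic when it's the next player's turn.
    public void RequestNextMove(Agent nextPlayer)
    {
        StartCoroutine(RequestAfterFixedUpdate(nextPlayer));
    }

    private IEnumerator RequestAfterFixedUpdate(Agent nextPlayer)
    {
        // Waiting for the end of the current FixedUpdate lets the previous
        // decision finish (OnActionReceived etc.) before the next one is
        // requested, so the mask is collected with the up-to-date board state.
        yield return new WaitForFixedUpdate();
        nextPlayer.RequestDecision();
    }
}
```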

Glad you were able to solve this. Out of curiosity, how come you are using stacked observations for a Chess AI?

In general, I’d be really interested to see how this turns out.

My observation is simply the whole chess board. It’s like an image, and each pixel represents a single cell of the board. The reason the number of observations is 100 instead of 64 is that I’m using a 10×10 chess board :stuck_out_tongue:
The stacked-observation idea comes from AlphaGo’s architecture: AlphaGo used 17 stacked observations (the current state plus the past few moves) for training. So here I also use stacked observations, just a relatively small number.
[Screenshot from the YouTube video mentioned below]
This is a screenshot I took from a YouTube video (AlphaGo Zero Tutorial Part 3 - Neural Network Architecture), and I believe it’s accurate.
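As a rough sketch of what that board observation amounts to (the board array and the piece encoding are placeholders, not my exact code; the stacking itself is set via the Stacked Vectors field in the Behavior Parameters, so nothing extra is written here):

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Sensors;

public class ChessAgent : Agent
{
    // Placeholder encoding: 0 = empty, positive = own pieces, negative = opponent's.
    private readonly int[,] board = new int[10, 10];

    // Writes one value per cell, i.e. a vector observation of size 100.
    public override void CollectObservations(VectorSensor sensor)
    {
        for (int row = 0; row < 10; row++)
        {
            for (int col = 0; col < 10; col++)
            {
                sensor.AddObservation(board[row, col]);
            }
        }
    }
}
```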

When it comes to training performance, I don’t really know whether I benefit from the stacked observations. I tried different settings, like a stack of 3, a stack of 4, etc., but there isn’t much difference between the stack sizes.
The real problem for me is that at a certain point (say, after 1 million steps) both players stagnate: they stop exploring new policies and more often just repeat a certain (crappy) strategy. So it’s hard to tell whether training with stacked observations gives me a better policy.