I'm looking for some help understanding the Elo results from the training process of a game I created:
The game is a symmetric 17+4 variant for 2 players (i.e. a Blackjack-like game with 32 cards, card values that differ from Blackjack, and no dealer).
In the first example the values of the 8 cards in every suit are 1, 2, 3, 4, 5, 6, 7, 8. This seems to work well: PPO with self-play delivers these graphs for the mean reward and Elo:


If I play against the resulting brain model, it plays very well; I couldn't detect any mistakes made by the model.
In a second example the values of the 8 cards in every suit are 2, 3, 4, 7, 8, 9, 10, 11. The same PPO/self-play training now delivers these graphs for the mean reward and Elo:
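For completeness, the two deck configurations can be written down like this (just an illustrative Python sketch, not my actual environment code; the second value set is the one from the second example described below):

```python
# Not my actual environment code, just to make the two deck configurations explicit:
# 4 suits x 8 ranks = 32 cards; only the per-suit card values differ between the runs.
VALUES_RUN_1 = [1, 2, 3, 4, 5, 6, 7, 8]
VALUES_RUN_2 = [2, 3, 4, 7, 8, 9, 10, 11]   # second example, see below

def build_deck(values, n_suits=4):
    """32-card deck: each value occurs once per suit (the suit itself does not matter here)."""
    return [v for _ in range(n_suits) for v in values]

print(len(build_deck(VALUES_RUN_1)), sum(build_deck(VALUES_RUN_1)))  # 32 cards, total value 144
print(len(build_deck(VALUES_RUN_2)), sum(build_deck(VALUES_RUN_2)))  # 32 cards, total value 216
```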

There is a small bias in the cumulative rewards and, after roughly 1 million steps, a decrease in Elo.
But: the code has been reviewed intensively, so I am quite confident the bias isn't caused by my code. And: the trained model again plays very well against a human player, which is why the decreasing Elo puzzles me.
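To make the "bias" concrete: since the game is symmetric, I would expect the mean reward over many self-play games to be close to zero once both seats are alternated. A rough sketch of the kind of check I mean (hypothetical helper names, with a dummy game function so the snippet runs; not my real code):

```python
import random

def play_game(first_player, second_player, rng):
    """Stand-in for one game of the 17+4 variant, returning the first player's
    reward (+1 win, 0 draw, -1 loss). Here just a coin flip so the sketch runs;
    the real check would call the actual environment."""
    return rng.choice([1.0, 0.0, -1.0])

def seat_balanced_mean_reward(policy, opponent, n_games=10_000, seed=0):
    """Mean reward of `policy`, alternating who acts first so that a first-move
    advantage cannot show up as a spurious reward bias."""
    rng = random.Random(seed)
    total = 0.0
    for i in range(n_games):
        if i % 2 == 0:
            total += play_game(policy, opponent, rng)   # policy acts first
        else:
            total -= play_game(opponent, policy, rng)   # opponent acts first
    return total / n_games

print(seat_balanced_mean_reward("current_policy", "older_snapshot"))
```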
My questions: Do you have an idea what could cause such effects? And where can I find really detailed documentation of the training process and the Elo calculation?
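For reference, my current understanding of the Elo calculation is the standard update sketched below; I don't know whether the self-play trainer computes it exactly this way (e.g. which K factor it uses, or whether it only tracks the learning policy's rating against frozen snapshots):

```python
def elo_update(rating_a, rating_b, result_a, k=16.0):
    """Standard Elo update: result_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (result_a - expected_a)
    new_b = rating_b + k * ((1.0 - result_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: the current policy (rated 1250) beats an older snapshot (rated 1200).
print(elo_update(1250.0, 1200.0, 1.0))   # roughly (1256.9, 1193.1)
```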
Thanks a lot!