Deterministic inference runs of learned models - ML-Agents

Hello there,

I’m currently working on a project involving ML-Agents. This has led me to a point where I ran a trained model in inference with 16 agents under the exact same starting conditions. I expected every agent to choose the exact same actions, since all starting conditions are equal and there should be no random influence involved, yet not all of them behaved identically. In some instances, some agents decide on different actions than the majority.

This made me wonder: Is this intended behaviour? Am I doing something wrong here? Is this just a sign that the agent has not explored the action space to its fullest and is still taking a few peeks here and there? (Though that should not happen, since I am not training anymore.) Should I expect random behaviour whenever the agent encounters a state where multiple equally good actions are available?

I can’t seem to find any information on this topic in the documentation provided with the ml-agents repository. I also can’t find any information on whether a trained NN model can be tweaked or modified after the learning process has finished.
If someone can shed some light on this, I’d be very grateful.


Some info on my setup: The agent learns to move on a grid with a discretized action space (6 actions, one for moving one step along each axis direction). There are no visual observations, only an observation vector. This contains the last action taken and 10 values gained from a cubecast variation: per cubecast, the agent receives the distance to the next surface in that direction as well as the surface area size.
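
Sketched below is roughly how that observation vector is assembled (Python for illustration only; the actual agent code is C#, the names are made up, and I’m assuming the 10 cubecast values come from 5 casts contributing 2 values each):

```python
import numpy as np

NUM_CUBECASTS = 5  # assumption: 5 casts x 2 values each = the 10 cubecast values


def build_observation(last_action, distances, surface_areas):
    """Hypothetical layout: [last_action, dist_0, area_0, ..., dist_4, area_4]."""
    obs = [float(last_action)]
    for dist, area in zip(distances, surface_areas):
        obs.extend([float(dist), float(area)])
    return np.asarray(obs, dtype=np.float32)


# Example: the agent last took action 3, plus per-cast readings.
obs = build_observation(
    last_action=3,
    distances=[1.5, 0.5, 2.0, 4.0, 1.0],
    surface_areas=[4.0, 1.0, 9.0, 16.0, 1.0],
)
print(obs.shape)  # (11,) -> 1 last-action value + 10 cubecast values
```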

For anyone interested:
I opened an issue on GitHub concerning this and got my answer there:
https://github.com/Unity-Technologies/ml-agents/issues/2643

TL;DR:
PPO is a stochastic policy implementation: at inference, actions are sampled from the policy’s probability distribution rather than always taking the most likely one, which lets agents escape possible deadlocks. Unity has acknowledged the request from multiple people for a feature to deactivate random actions in trained models. ETA unknown.
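
To illustrate the difference, here is a minimal sketch (plain PyTorch, not ML-Agents’ actual inference code; the logits are made up) of stochastic sampling versus deterministic greedy action selection:

```python
import torch

# Hypothetical action logits produced by a trained policy network
# for one observation; 6 discrete actions as in my setup.
logits = torch.tensor([2.0, 1.9, -1.0, -1.0, -3.0, -3.0])
probs = torch.softmax(logits, dim=-1)

# Stochastic inference (what PPO does): sample from the distribution.
# Two near-equally-good actions (indices 0 and 1) each get picked often,
# so identical agents can diverge even with identical observations.
dist = torch.distributions.Categorical(probs=probs)
sampled_action = dist.sample()

# Deterministic inference (the requested feature): always take the argmax.
# Identical observations then always yield the same action.
deterministic_action = torch.argmax(probs)

print(sampled_action.item(), deterministic_action.item())
```

That also explains my question about equally good actions: with sampling, several near-equal actions each get picked with similar probability, which is exactly the divergence I saw across the 16 agents.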
