Training stops with RuntimeError: dictionary changed size during iteration

Hi, sometimes my training stops with a "dictionary changed size during iteration" error. Is anyone familiar with it? I’m using concurrent environments and ML-Agents Release 2.

This was from the command prompt:
  File "c:\users\hello\desktop\project\ml-agents-release_2\ml-agents\mlagents\trainers\stats.py", line 344, in write_stats
    for key in StatsReporter.stats_dict[self.category]:
RuntimeError: dictionary changed size during iteration

This was from my build’s debug log:
Unable to save timers to file C:/Users/hello/Desktop/project/builds/7_3_2/agents2_Data\ML-Agents\Timers\Clay3D_timers.json
(Filename: C:\buildslave\unity\build\Runtime/Export/Debug/Debug.bindings.h Line: 35)

Any idea what’s happening, or ways to stop this behavior?

The stats.py error is really weird. I understand that modifying a dictionary while iterating over it is bad, but I don’t see how that could be happening here. Could you open a GitHub issue with the full call stack (and maybe some more info about your Python version)?

The “Unable to save timers to file” message should be harmless.

I had the same issue, using Release 3 and the default 3DBall environment with SAC. I only changed max_steps to 1 million, and the error occurred at around 700k steps, so it should (hopefully) be easy to reproduce.

I tried a few times but can’t reproduce the problem (3DBall, release 3, SAC, max_steps=1000000). Can you please post the full call stack of the error, the command line args you’re using to run, and the output from “python --version”?

Still can’t reproduce it, but I have a theory - I think StatsReporter is getting called from different threads simultaneously, so one thread causes a new key to be added (via add_stat or set_stat) while write_stats is being called.
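To make the theory concrete, here is a minimal single-threaded sketch of the same failure. It is not ML-Agents code: `iterate_while_adding` and the sample keys are hypothetical, and the insertion inside the loop only stands in for what a second thread calling add_stat could do at an unlucky moment.

```python
def iterate_while_adding(stats):
    """Iterate over a dict while a new key is added mid-loop.

    The insertion simulates a second thread adding a stat while the
    first thread is still iterating. Returns the error message, or
    None if no error was raised.
    """
    try:
        for key in stats:
            # Simulated "other thread": a brand-new key changes the dict size.
            stats["new_stat"] = [0.0]
    except RuntimeError as err:
        return str(err)
    return None


message = iterate_while_adding({"reward": [1.0], "loss": [0.5]})
print(message)
```

In a real multi-threaded run the insertion only occasionally lands in the middle of the loop, which would explain why the crash shows up late and intermittently.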

We were able to reproduce the problem (decreasing the summary frequency to 1 and adding a sleep in the loop makes it happen almost immediately). The PR with the fix is here: "[bug-fix] Make StatsReporter thread-safe" by ervteng (Pull Request #4201, Unity-Technologies/ml-agents on GitHub).
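For readers wondering what "thread-safe" could look like here, below is a rough sketch that guards both writes and iteration with one lock. This is an assumption about the general approach, not the contents of the PR: `SafeStats`, its methods, and the `worker` function are hypothetical stand-ins rather than the real StatsReporter API.

```python
import threading
from collections import defaultdict


class SafeStats:
    """Hypothetical stand-in for a stats store; not the ML-Agents class."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stats = defaultdict(list)

    def add_stat(self, key, value):
        # Writers hold the lock, so they cannot resize the dict mid-iteration.
        with self._lock:
            self._stats[key].append(value)

    def write_stats(self):
        # Readers hold the same lock for the whole iteration.
        with self._lock:
            return {key: sum(vals) / len(vals) for key, vals in self._stats.items()}


stats = SafeStats()


def worker(n):
    # Each worker keeps adding brand-new keys, like new stats appearing.
    for i in range(200):
        stats.add_stat(f"stat_{n}_{i}", float(i))


threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
# Summarize repeatedly while the writers are still adding keys.
for _ in range(200):
    stats.write_stats()
for t in threads:
    t.join()

summary = stats.write_stats()
print(len(summary))
```

The trade-off is that writers wait while a summary is being produced, which is usually acceptable when summaries are infrequent.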

Thanks for reporting this!


Great! Happy it got solved

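If this is causing a problem for training, and you’re comfortable modifying the Python code, a simpler workaround is to convert the loop in question so that it iterates over a copy of the keys, for example by wrapping the dictionary in list(...), rather than over the live dictionary.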
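A minimal sketch of one such workaround, assuming the idea is to iterate over a snapshot of the keys: wrapping the dictionary in `list(...)` copies the keys before the loop starts, so keys added afterwards no longer affect the iteration. The helper `write_stats_snapshot` and the sample data are hypothetical and only mirror the shape of the loop in the traceback; this is not necessarily the exact change that was suggested or merged.

```python
def write_stats_snapshot(stats_dict, category):
    """Hypothetical helper mirroring the loop shape from the traceback."""
    written = []
    # list(...) copies the keys up front; later insertions don't affect this loop.
    for key in list(stats_dict[category]):
        written.append(key)
        # Simulate another thread adding a stat mid-loop.
        stats_dict[category]["late_" + key] = [0.0]
    return written


stats_dict = {"3DBall": {"reward": [1.0], "loss": [0.5]}}
written = write_stats_snapshot(stats_dict, "3DBall")
print(written)
print(len(stats_dict["3DBall"]))
```

Keys added during the loop are simply picked up on the next call. This narrows the window for the crash rather than making the class fully thread-safe, so a lock-based fix is still the more robust option.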

The fix will be in the next release, tentatively scheduled for next week.
