Hey folks,
Getting into training with multiple simultaneous environments… Right now I have 8 environments going, and it's giving me a 2x speedup. CPU is at 15%, memory is at 40%, and disk/network/GPU usage are all negligible. Any thoughts on what's causing the bottleneck here? Also, below is a screenshot of the reward: orange is 1 instance, blue is 8 instances. It's the same task in both runs, so any clues as to why the curve has gone all… saw-tooth-y?
I'd like to know as well. On my system it seems only two cores (of 8 cores / 16 threads total) are working hard. Increasing environments doesn't help (much): going from 8 to 16 only eats up more RAM, nothing more. CPU keeps bouncing around 25 to 33% and the SSD is idle. I tried increasing buffer sizes in the config file, but nothing seems to help. I did notice 2 instances of Python running, which would explain the 2-core load; no idea how to increase that to eight.
That's because PyTorch is configured to use at most 4 threads. You can change that in venv\Lib\site-packages\mlagents\torch_utils\cpu_utils.py, in `get_num_threads_to_use()`: in the line `return max(min(num_cpus // 2, 4), 1) if num_cpus is not None else None`, change the 4 to the number of threads you want.
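For reference, here's a sketch of what that edit looks like. The surrounding code is paraphrased (the CPU-count helper's exact name and behavior vary by version, so a stand-in is shown here); the only change that matters is the cap in the return line:

```python
import os
from typing import Optional

def get_num_available_cpus() -> Optional[int]:
    # Stand-in for the module's own CPU-count helper
    # (exact name and logic may differ by ML-Agents version).
    return os.cpu_count()

def get_num_threads_to_use() -> Optional[int]:
    num_cpus = get_num_available_cpus()
    # Original line capped PyTorch at 4 threads:
    #   return max(min(num_cpus // 2, 4), 1) if num_cpus is not None else None
    # Raising the cap (here to 8) allows PyTorch to use more threads:
    return max(min(num_cpus // 2, 8), 1) if num_cpus is not None else None
```

As far as I can tell, ML-Agents feeds this value to `torch.set_num_threads()` at startup, so the change takes effect on your next training run.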
A note about PyTorch and CPU threads: for the small networks we're using in ML-Agents, increasing the number of threads that PyTorch uses will increase your CPU usage, but it won't actually make training much faster. This is because parallelizing small ops is less beneficial than parallelizing large ops (e.g., in the case of CNNs).
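You can see this for yourself with a quick timing sketch (a minimal example; the layer sizes, batch size, and iteration count are made-up values, and absolute timings will vary by machine):

```python
import time
import torch

# A small policy-sized MLP; dimensions are illustrative, not ML-Agents' actual ones.
net = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
)
x = torch.randn(256, 64)

for n_threads in (1, 4, 8):
    torch.set_num_threads(n_threads)
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(1000):
            net(x)  # time many small forward passes at each thread count
    print(f"{n_threads} threads: {time.perf_counter() - start:.3f}s")
```

Past a few threads, each individual op is too small to split profitably, so the extra threads mostly just burn CPU.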
As for the sawtooth problem: you're likely going to have to increase your summary frequency (summary_freq in the trainer config). It looks like many more short episodes are completing in between each summary write because of the increase in environments.
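If it helps, summary_freq is set per behavior in the trainer config YAML; a minimal sketch (the behavior name and values here are placeholders, and I believe the default is 50000 steps):

```yaml
behaviors:
  MyBehavior:              # placeholder: use your agent's behavior name
    trainer_type: ppo
    summary_freq: 200000   # steps between stat writes; raise it so each
                           # plotted point averages over more episodes
```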