I am new to ML-Agents and PyTorch, and I'm using ml-agents 0.28.0. I also just got a GPU (well, not new, but I picked up a cheap used Nvidia one, a GT 730). I built my PyTorch binary from source since the official builds don't support its compute capability. I wonder if there was a setting when building the binary that would make PyTorch utilize the GPU more.
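For what it's worth, the only build-time setting I know of is TORCH_CUDA_ARCH_LIST, which controls which compute capabilities get compiled into the binary (3.5 for the Kepler variant of the GT 730, if I have that right); as far as I can tell it does not affect how much the GPU gets used, only whether the card is supported at all:

    set TORCH_CUDA_ARCH_LIST=3.5
    python setup.py install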
My issue is that the GPU is hardly being utilized at all while training. Looking at Task Manager, I can see only 6% of the GPU being used while the CPU sits at 100%. About 5.1% of that GPU usage comes from the Python process I launched from the command prompt.
There is hardly any change in training speed from when I didn't have the GPU. There are spikes in GPU usage as training starts to get stuck: the CPU drops to 70%, the GPU rises to 20%, and some memory is cleared.

By the way, can anyone recommend a better tool than Task Manager? I have seen screenshots where Task Manager showed a CUDA graph, but I don't have that option. I don't even know if the visual observations I set up are being used at all (Unity ML-Agents package 2.0.0, mlagents 0.28.0). This matters because visual observations are GPU heavy, and my game is 95% visual observations.

The GPU idles at 2% and spikes to 10% without training; during training it sits mostly at 6% and spikes to 20% for a second before dropping back. I want to offload the work to the GPU, and I would like the GPU to consistently sit at 45% or more by default and spike when necessary. Can I set any PyTorch settings through the command line before running? Should I rebuild with specific config settings?
Watching a YouTube video raises GPU usage by 35%, yet CUDA training barely budges it.
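For reference, here is how I understand you can check whether a source-built PyTorch sees the card at all (standard PyTorch calls; the device index 0 assumes the GT 730 is the only CUDA device):

    import torch

    # Does this build of PyTorch see a CUDA device at all?
    print(torch.cuda.is_available())       # should print True
    print(torch.version.cuda)              # CUDA version the build was compiled against
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))        # e.g. "GeForce GT 730"
        print(torch.cuda.get_device_capability(0))  # compute capability, e.g. (3, 5)

As far as I understand, if is_available() returns False, mlagents quietly falls back to training on the CPU, which would explain numbers like the above.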
What helped me was increasing the number of environments to generate more data. Still, that only works as long as your CPU is strong enough to simulate the extra environments.
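For example, with a built executable you can pass --num-envs to mlagents-learn (the config path, build path, and run id here are placeholders):

    mlagents-learn config/trainer_config.yaml --env=Builds/MyGame --run-id=more_envs --num-envs=8

Note that spawning multiple concurrent environments requires a built player; it does not work when training in-editor.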
I figured it out.
It has to do with the batch size that is written to and processed by the GPU. While I was playing around with settings, I had odd numbers set for the batch and buffer sizes. The batch was either so large that it would take 4 minutes to fill up and then throw a CUDA out-of-memory error, or too small to utilize the GPU at all. I read in another answer that this had to do with the batch size. Based on that, I was able to calculate how many samples it takes to fill a GB and set the batch to utilize all 3 GB of my GPU's memory. In the end, however, I settled for the recommended batch size of 1024, which doesn't utilize all of the GPU. I learned that I don't need to max out the GPU for training; the batch size serves a specific purpose, which is processing the appropriate amount of data at a time. Buffer size seems irrelevant in terms of CPU, GPU, memory, and storage usage, but the recommended size is 10240 (a multiple of the batch size, not a fraction; I could be wrong, as I did not assess the buffer size relationships well).
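For reference, both values live under hyperparameters in the trainer config YAML that you pass to mlagents-learn. A sketch of roughly what I ended up with (the behavior name is a placeholder for whatever your agent's behavior is called):

    behaviors:
      MyAgent:
        trainer_type: ppo
        hyperparameters:
          batch_size: 1024
          buffer_size: 10240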
Additionally, I want to mention that I moved the training over to an AWS instance. The training time didn't improve as much as I hoped: running 17 environments instead of 5 only halved the time it takes to reach 50k steps. I gave my AWS instance enough storage to handle a larger batch and buffer size, though it depends on how fast and how much data I need to feed in to train my AI, so I did not max out the GPU with a super large batch size. I don't know how to speed up training from there, and I could still use more advice here. The specs of the AWS instance look good on paper, but even with 17 environments and no crashes, it isn't training as fast as I had hoped.
So I'm trying to use CUDA during training, but I cannot find an up-to-date guide on how to get everything working. I installed CUDA from Nvidia and I have an RTX 3080, but the CUDA usage stays at 0% in the GPU monitor in Task Manager during training. Can anyone explain how to get it working?
When I increased my batch size, I finally saw the GPU tick up to a higher utilization after a few minutes, while waiting for the batch to fill up. IIRC, the CUDA graph in Task Manager is only shown for newer GPUs, but when it is available it will tick and show usage regardless of how small your batch size is, as long as training is actually taking place. So if it isn't showing any utilization, there may be a problem with how you have configured the training, and you are not training on the GPU at all. Make sure you are training with the GPU. If you are, then re-check your Unity configuration and code.
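If your Task Manager doesn't offer the CUDA graph, nvidia-smi (it ships with the Nvidia driver) is a more direct way to watch utilization; for example, refreshing once per second:

    nvidia-smi -l 1

The process list at the bottom also shows whether your Python process is actually touching the GPU.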