In TensorFlow and PyTorch you can create tensors directly on the GPU.
Some people might find this useful. Using the ExecuteOperatorOnTensor sample I created a little helper function like so:
TensorFloat CreateTensorOnGPU(TensorShape shape, float[] data)
{
    // Create a CPU-side tensor from the float array, then copy it to the GPU
    // via the ops object; the CPU copy is disposed when this method returns.
    using TensorFloat input = new TensorFloat(shape, data);
    return s_Ops.Copy(input) as TensorFloat;
}
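For example, calling it might look like this (a small hypothetical usage; the shape and values are made up):

// Hypothetical usage: build a 1x4 tensor on the GPU from a plain float array.
TensorShape shape = new TensorShape(1, 4);
float[] data = { 1f, 2f, 3f, 4f };
TensorFloat gpuTensor = CreateTensorOnGPU(shape, data);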
It seems to work: calling it lots of times, the GPU memory goes up while the RAM stays fairly constant (just a little spike as you create the float array).
This might be useful if you have a lot of weights or data on your hard drive and you want to transfer it to the GPU but you don't have much RAM.
Someone let me know if this function is totally wrong!
Edit: OK, I've just seen there is already a function to do this, UploadToDevice(ITensorData), so I could use that instead:
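Something along these lines, perhaps (a minimal, untested sketch; it assumes ComputeTensorData is the GPU-backed ITensorData implementation and that its constructor takes the tensor shape):

TensorFloat CreateTensorOnGPUViaUpload(TensorShape shape, float[] data)
{
    // Create the tensor on the CPU first.
    TensorFloat tensor = new TensorFloat(shape, data);
    // Assumption: ComputeTensorData(shape) allocates the GPU-side storage.
    tensor.UploadToDevice(new ComputeTensorData(shape));
    return tensor;
}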
Interestingly, though, the first method seems to produce a smaller spike in RAM.
BTW, I tried my method of taking a really big ONNX file and turning all the weights into inputs with a Python script, then putting this small ONNX with no weights into Unity, then using the top function to push the weights into the inputs one by one, bypassing the RAM (a sketch of the idea is below). It worked very well, with virtually zero RAM used either in the Editor or during runtime.
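Roughly like this (an illustrative sketch; weightNames, LoadWeightArray and worker are hypothetical stand-ins for your list of weight inputs, however you read the weight files from disk, and the IWorker you created for the model):

// Sketch: push each weight from disk to the GPU and bind it as a model input.
foreach (string weightName in weightNames)
{
    // Hypothetical helper that reads one raw weight file and returns its shape.
    float[] data = LoadWeightArray(weightName, out TensorShape shape);
    TensorFloat weight = CreateTensorOnGPU(shape, data);   // helper from above
    worker.SetInput(weightName, weight);                    // feed it as a model input
}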
Hi @yoonitee, do you mind sharing the Python script that takes a large ONNX model, turns all weights into inputs and produces the tiny ONNX file? It sounds quite practical for large models.
Sure, I put them here: GitHub - pauldog/FastOnnxLoader: Loads in onnx files with less RAM. I was using it first with ONNX Runtime instead of Unity, so you'd have to write your own C# script to load the weights, but that's quite simple as the weight files just store giant arrays. You also need to modify the names of the input and output files in the script. (Also you need to do a "pip install onnx" to get the onnx library.)
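The C# side could be as simple as something like this (a sketch, assuming each weight file is just a flat array of little-endian 32-bit floats):

using System;
using System.IO;

// Sketch: read one raw weight file into a float array.
float[] LoadWeights(string path)
{
    byte[] bytes = File.ReadAllBytes(path);
    float[] weights = new float[bytes.Length / sizeof(float)];
    Buffer.BlockCopy(bytes, 0, weights, 0, bytes.Length);
    return weights;
}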
The downside is that, done this way, Unity can't optimise the model. So there are pros and cons with this method.
BTW, another thing with this method is that you can use whatever compression/quantization algorithm you like to store the weights on disk. E.g. you could store the weights on disk as 16-bit floats and decompress them to 32-bit floats to save disk space, provided you had a fast way of doing that.
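For example, a 16-bit-to-32-bit expansion could look roughly like this (a sketch; it assumes the file is a flat array of little-endian half floats):

using UnityEngine;

// Sketch: load weights stored as 16-bit halves and expand them to 32-bit floats.
float[] LoadHalfWeights(string path)
{
    byte[] bytes = System.IO.File.ReadAllBytes(path);
    float[] weights = new float[bytes.Length / 2];
    for (int i = 0; i < weights.Length; i++)
    {
        ushort half = (ushort)(bytes[2 * i] | (bytes[2 * i + 1] << 8)); // little-endian
        weights[i] = Mathf.HalfToFloat(half);                           // 16-bit -> 32-bit
    }
    return weights;
}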
Thank you! For small-size models it works OK, but for larger models (>2 GB) it blocks the UI indefinitely and can ultimately crash the Editor (e.g. on macOS) due to insufficient app memory.
@gilescoope Thank you once again! Yes, ModelWriter.Save() is now public and works properly for small-size models. The issue with larger models still remains though, and can lead to a Unity Editor crash due to insufficient memory. I suppose either some temporary objects are not disposed properly, or GC needs to be called from time to time to clean up.
It's due to the Unity asset pipeline not cleaning up loaded objects. We can't do much at the moment to fix it, but you can work around it by saving the model in the StreamingAssets folder, either from code or by clicking the "Serialize To StreamingAssets" button in the Inspector window.
@liutaurasvysniauskas_unity Thank you for the info and for the suggestion! When I use the "Serialize to StreamingAssets" button, the result is the same. The system reports insufficient app memory.
Is this the same issue as this one: Memory Leak importing ONNX? - #3
Maybe someone on the "Unity Asset Pipeline Team" knows how to fix it. I don't know if such a team exists, but if it does they'd probably know how to fix it. Otherwise maybe there is a clever way to bypass the asset pipeline altogether. IDK.
One possibility, if the pipeline can't be patched, might be a standalone app or command-line utility that converts the ONNX file into a streaming asset. Or alternatively a button in Unity where you enter a filename for an ONNX file and it converts it to a streaming asset. This might fix one part of the RAM problem, if not all of it. I don't know what the best solution would be.
So, we've debugged this quite a bit.
The issue is that Unity keeps the memory allocated during asset import cached.
We've reduced that amount to a minimum, but the model is essentially loaded twice.
The only way around this is to use Serialize to StreamingAssets or ModelWriter.Save.
Of course this uses memory, so if your computer has limited memory then you'll run into insufficient app memory, or your computer will start paging.
The workaround is to make sure that your computer is using as little RAM as possible when you save your model.
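Saving from code could look something like this (a sketch; onnxModelAsset stands in for your imported ModelAsset reference, and this would live inside an Editor script or method):

using System.IO;
using Unity.Sentis;
using UnityEngine;

// Sketch: serialize an imported ONNX model to a .sentis file in StreamingAssets.
Model model = ModelLoader.Load(onnxModelAsset);
string path = Path.Combine(Application.streamingAssetsPath, "model.sentis");
ModelWriter.Save(path, model);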
ModelWriter.Save() time taken: 10 minutes (running in the Editor).
If a 1GB model takes over 6GB of RAM to save it, presumably a 2GB model will need over 12GB of RAM to save it. That's more than most people have. For a 7GB model such as the smallest Llama, we would need 42GB of RAM to save it and it would take over 1 hour.
Latest. I think ModelWriter.Save is only available in the latest version, right?
BTW, on another topic: LoadModelDesc() and LoadModelWeights() don't seem to have versions that work with the streaming asset, as in where you put in a filename. Hopefully a version of LoadModelDesc() would be able to get the model description from the streaming asset without loading the whole thing into RAM. That would be quite useful I think. And being able to stream in individual weights from the streaming asset one at a time would also be useful for developers to have full control. Perhaps something like: StreamModelWeight(path, "weight-name", CompressionType, GPU). Something for the future maybe.
Experimental Faster Model Save Hack: Up to 60x Faster
@roumenf I came up with a hack that reduces the save time from 10 minutes down to 10 seconds and uses about 80% less RAM! No idea if it will work for all models. Try it out if you like.
The gist of it is that it saves the big block of memory that holds the weights into a separate file in one go.
So now you have two files: model.sentis and model.weights. No python hacking needed.
It is a bit hacky (as in this is not how the API is designed to be used) so use at your own risk!
It assumes all the constants point to the same weight block of memory. Maybe this is not always true, but it shouldn't be hard to alter it.
Save the Model
// Load the model from the asset.
model = ModelLoader.Load(onnx);

// The weights of all the constants seem to point to the same massive block of memory.
NativeTensorArray weights = model.constants[0].weights;

// Remove the weight reference from each constant so it isn't serialized with the model.
for (int i = 0; i < model.constants.Count; i++)
{
    model.constants[i].weights = null;
}

// Save the weights to a separate file (assumes 32-bit floats, hence Length * 4 bytes).
using (BinaryWriter writer = new BinaryWriter(File.Open("model.weights", FileMode.Create)))
{
    writer.Write(weights.AsReadOnlySpan<byte>(weights.Length * 4));
}

// Save the model itself (now without weights).
ModelWriter.Save("model.sentis", model);
Load The Model
(Unfortunately this is inefficient in terms of RAM since it has to load into a buffer first... maybe someone can improve on this? I don't know if you can load from the hard disk straight into a NativeTensorArray.)
// Read the weights into a temporary buffer.
byte[] buffer = File.ReadAllBytes("model.weights");

// Load the model with blank weights.
Model model2 = ModelLoader.Load("model.sentis");

// Get a reference to the blank weights block.
NativeTensorArray weights = model2.constants[0].weights;

// Copy the buffer into the weights block, then release the buffer.
NativeTensorArray.BlockCopy(buffer, 0, weights, 0, buffer.Length);
buffer = null;
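One possible refinement (untested, and it assumes the byte[] overload of NativeTensorArray.BlockCopy takes byte offsets and counts): read the file in chunks and copy each chunk into the weight block, so only a small managed buffer is alive at any time.

// Sketch: chunked copy from disk into the weight block to reduce peak managed RAM.
const int chunkSize = 64 * 1024 * 1024; // 64 MB per read
byte[] chunk = new byte[chunkSize];
int offset = 0;
using (FileStream stream = File.OpenRead("model.weights"))
{
    int read;
    while ((read = stream.Read(chunk, 0, chunkSize)) > 0)
    {
        // Assumption: offsets/counts are in bytes for this overload.
        NativeTensorArray.BlockCopy(chunk, 0, weights, offset, read);
        offset += read;
    }
}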
It's just a proof of concept.
Conclusion
I think the reason why the API save-model method takes so long and uses a lot of RAM may be that it manipulates lots of NativeTensorArrays. Perhaps they are laid out in memory in such a way that it fragments the memory, so big blocks can no longer fit and the memory just keeps expanding. That is just a WILD guess as I have no idea how it's implemented!
That code won't work.
Constants might have different weight buffers and you might bust the int32 length limit…
I'll test out the split, maybe FileStream gets slow with a large array.