Possible memory leak in 0-dimensional tensors

While playing around with different models on CPU and GPU, I noticed that for tensors where one dimension is 0, the tensorOnDevice value is sometimes null.

I couldn’t reproduce the error reliably (which is why it looks like a memory leak to me), but it seems to occur more often in the GPU runtime than in the CPU runtime. I would expect a tensor with a 0-dimension to contain an empty array rather than null. (If I create such a tensor via TensorInt.Zeros(new TensorShape(0)), its tensorOnDevice contains an empty array on both CPU and GPU.)

Some operations like “Concat” don’t seem to have this problem. However, I noticed it with “Gather”, and other operations may be affected as well.

Minimal code to reproduce the behaviour:

ITensorAllocator allocator = new TensorCachingAllocator();
Ops gpuOps = WorkerFactory.CreateOps(BackendType.GPUCompute, allocator);

TensorInt input = new TensorInt(new TensorShape(3, 1), new int[] { 0, 1, 2 });
TensorInt indices = TensorInt.Zeros(new TensorShape(0));

Debug.Log(indices.tensorOnDevice); //Works fine on CPU

ComputeTensorData.Pin(indices);

Debug.Log(indices.tensorOnDevice); //Works fine on GPU

Tensor result = gpuOps.Gather(input, indices, 0);

Debug.Log("Result: " + result.shape);

//This one sometimes evaluates to true -> branches of my models cannot be completed
Debug.Log("Result: " + (result.tensorOnDevice == null)); 

I observed a similar behavior some time ago with an ONNX “Shape” operation. I couldn’t perform any direct “Add” operations on the output of the “Shape” node. However, when I used a Gather operation in between (“Shape” → “Gather (0)” → “Add”), it worked.

Another strange observation, which is not necessarily related to the points above:
If I run a model on the GPU and then iterate over all layers and log the deviceType, most of the outputs are still on the CPU (maybe this is due to my device, as I use a Mac).

(...)
//Add all layers to outputs
foreach(Unity.Sentis.Layers.Layer l in modelLoader.layers)
{
    modelLoader.AddOutput(l.name);
}
IWorker worker = WorkerFactory.CreateWorker(BackendType.GPUCompute, modelLoader);

//Execute the model
(...)

foreach (Unity.Sentis.Layers.Layer l in modelLoader.layers)
{
    Tensor testTensor = worker.PeekOutput(l.name);
    Debug.Log(l.name + ": " + testTensor.tensorOnDevice.deviceType);
}

I would expect all tensors that are the output of a GPU-supported operation (according to Supported ONNX operators | Sentis | 1.2.0-exp.2) to also be on the GPU after executing the model.
However, that is not the case.
Maybe others have already stumbled across these issues and have hints or explanations 😎.

0-dim tensors by definition don’t have data 🙂

  • ComputeBuffers cannot be empty, and with native memory you’ll also run into some issues.

So now we have two choices:

  • allocate a tensorOnDevice of size 1
  • have a null tensorOnDevice

I think the internal behaviour might differ depending on CPU/GPU, but I’d need to check.
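
In the meantime, a defensive check on your side could look something like this (just a sketch; I use shape.length to detect zero-sized tensors):

Tensor result = gpuOps.Gather(input, indices, 0);

//a zero-sized tensor has no elements, so a null tensorOnDevice carries no data anyway
if (result.shape.length == 0 && result.tensorOnDevice == null)
    Debug.Log("Empty result, nothing to read back");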

For Shape, its output is a 1D tensor of shape (shape.length).
You can perform Add on it, but you’d need to make sure you respect the broadcast rules.
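
For example (a sketch, assuming an int Add overload on the Ops object, like the gpuOps in your snippets):

//Shape-like output: a 1D int tensor, e.g. (N, C, H, W) of an image batch
TensorInt shapeOut = new TensorInt(new TensorShape(4), new int[] { 1, 3, 224, 224 });

//the second operand must broadcast against shape (4): another (4) tensor
//works elementwise, a scalar would broadcast, but e.g. a (3) tensor would not
TensorInt pad = new TensorInt(new TensorShape(4), new int[] { 0, 0, 16, 16 });
Tensor padded = gpuOps.Add(shapeOut, pad);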

For the CPU comment, not all layers are necessarily on the GPU.
We try to be smart: for layers that need a readback on the CPU, we try to run them on the CPU and propagate this state to their inputs.
For example:
shape → gather → add → div → upsample.shape
All of shape, gather, add, div will be on the CPU, since they require a readback and performing them on the GPU would be unnecessarily slow.
You can check LayerCPUFallback, but it’s internal unfortunately.
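
Roughly, the fallback works like this (a paraphrase in code, not the actual internal implementation):

using System.Collections.Generic;
using System.Linq;

//Seed with the tensors that must be read back on the CPU, then walk the
//graph backwards so their producers are also scheduled on the CPU
static HashSet<string> CpuFallback(Unity.Sentis.Model model, HashSet<string> cpuReads)
{
    var cpu = new HashSet<string>(cpuReads);
    foreach (var layer in Enumerable.Reverse(model.layers))
    {
        if (!cpu.Contains(layer.name))
            continue;

        //this layer's output is needed on the CPU -> run the layer on the
        //CPU and require all of its inputs on the CPU as well
        foreach (var input in layer.inputs)
            cpu.Add(input);
    }
    return cpu;
}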

Hi Alexandre, thanks for your explanations.

As I understand you, it is not technically possible to create empty ComputeBuffers.
However, when initializing a new tensor (for example TensorInt a = TensorInt.Zeros(new TensorShape(0))), I guess its tensorOnDevice attribute points to an empty array/ComputeBuffer (CPU: array [0] / GPU: (0) buffer: UnityEngine.ComputeBuffer).

From my perspective there is no general problem with 0-dimensional tensors. My specific problem is that the 0-dimensional result tensor of the Gather operation has null as its tensorOnDevice attribute (which seems like a bug to me). I revised the code again to better describe the exact problem and the corresponding log output:

ITensorAllocator allocator = new TensorCachingAllocator();
Ops opsCPU = WorkerFactory.CreateOps(BackendType.CPU, allocator);

TensorInt a = new TensorInt(new TensorShape(3), new int[] { 0, 1, 2 });
TensorInt b = TensorInt.Zeros(new TensorShape(0));
Tensor c = opsCPU.Gather(a, b, 0);

Debug.LogFormat("CPU b.shape: {0}; b.tensorOnDevice: {1}", b.shape, b.tensorOnDevice);
Debug.LogFormat("CPU c.shape: {0}; c.tensorOnDevice: {1}", c.shape, c.tensorOnDevice);

ComputeTensorData.Pin(a);
ComputeTensorData.Pin(b);

Ops opsGPU = WorkerFactory.CreateOps(BackendType.GPUCompute, allocator);
Tensor cGPU = opsGPU.Gather(a, b, 0);

Debug.LogFormat("GPU b.shape: {0}; b.tensorOnDevice: {1}", b.shape, b.tensorOnDevice);
Debug.LogFormat("GPU c.shape: {0}; c.tensorOnDevice: {1}", cGPU.shape, cGPU.tensorOnDevice);

The corresponding log is:

Ok thanks, that explains a lot.
I think the problem was a Shape operation at the beginning of the graph that caused almost all of the rest of the graph to be executed on the CPU.

In the documentation (Supported ONNX operators | Sentis | 1.3.0-pre.1), only minus signs are entered for Shape. I didn’t realize that this meant the operation was not supported; other operations explicitly say “Not supported”.

Is there a chance that the Shape operation can be executed on the GPU in the future?
Or is there a method/operation with which you can explicitly push tensors back to the GPU? (For single tensors, ComputeTensorData.Pin seems to do this, as in the sketch below, but I don’t see how to apply it to intermediate layer outputs.)
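
What I mean for single tensors (the same Pin call and deviceType check as in my earlier snippets):

TensorInt t = TensorInt.Zeros(new TensorShape(4));
Debug.Log(t.tensorOnDevice.deviceType); //CPU after creation (on my machine)

ComputeTensorData.Pin(t); //uploads/pins the data to a GPU ComputeBuffer
Debug.Log(t.tensorOnDevice.deviceType); //now GPU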

Thanks for the code, I’ll check it out; seems straightforward to check.
I’m not sure I follow your comment on the shape and minus sign…
Also, could you detail why you’d like Shape to be on the GPU?
If nothing is reading back from the shape on the CPU, it’ll be on the GPU…

It may be easier to understand using an example. This is the beginning of my model. I need the shape to pad the image accordingly.

Concerning the “minus sign” thing, I just meant that I misinterpreted this notation, as for some other operations (e.g. NonMaxSuppression) the corresponding entry is “Not supported”:

I experimented a bit regarding the CPU/GPU issue and generated a few minimal (ONNX) examples with which I was able to reproduce strange behavior.

However, I will start a new thread for this because it doesn’t really match the original topic.
New thread at: Strange executions of some layers on the CPU instead of GPU

As I continued debugging my models, I noticed that this memory problem affects not only 0-dimensional tensors but, more generally, tensors that change their shape at runtime. I have observed that tensors that have a different shape when the model is executed again sometimes point to compute buffers that definitely do not belong to them (their shape is identical to that of another tensor appearing earlier in the graph). I tried to break down the problem and created an ONNX model with a few dummy operations:

Executing the model with the following code:

Model modelLoader = ModelLoader.Load(modelAsset);
modelLoader.AddOutput("reduced_other_vec");

IWorker worker = WorkerFactory.CreateWorker(BackendType.GPUCompute, modelLoader);

Tensor output;

output = TestMemory(new TensorInt(0), new TensorInt(2), worker, modelLoader);
Debug.LogFormat("1 -- shape: {0} content: {1}", output.shape, output.tensorOnDevice);

output = TestMemory(new TensorInt(42), new TensorInt(2), worker, modelLoader);
Debug.LogFormat("2 -- shape: {0} content: {1}", output.shape, output.tensorOnDevice);

output = TestMemory(new TensorInt(0), new TensorInt(2), worker, modelLoader);
Debug.LogFormat("3 -- shape: {0} content: {1}", output.shape, output.tensorOnDevice);

worker.Dispose();

Tensor TestMemory(TensorInt zeroInput, TensorInt otherInput, IWorker worker, Model modelLoader)
{
    Dictionary<string, Tensor> inputTensors = new()
    {
        { "zero_input", zeroInput },
        { "other_input", otherInput },
    };

    worker.Execute(inputTensors);

    zeroInput.Dispose();
    otherInput.Dispose();

    return worker.PeekOutput("output");
}

One would assume that the first and third log prints would show the same thing. However, this is the output instead:

I know that the example is very artificial and here the issue is still relatively harmless, but I also observed cases where a tensor of rank 1 (containing a single value) suddenly pointed to a buffer of rank 4 (containing several hundred values).

For example, if models contain operations such as NonMaxSuppression and the shapes change with each execution, this behavior leads to errors that are difficult to debug. As far as I can tell, the issue only affects tensors whose shape has to be calculated at runtime (e.g. NMS, Tile, Range, …). The issue does not affect shapes defined via dynamic_axes.

Possible solutions:

  1. Call worker.Dispose(); and WorkerFactory.CreateWorker(...) each time you execute the model (see the sketch after this list). However, this seems inefficient.
  2. I tried worker.PrepareForInput(...), but it did not work.
  3. The best option would be if you could allocate the layer outputs manually. However, I have no idea how to do this.
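
For completeness, this is what I mean by option 1 (a sketch; I assume MakeReadable and ToReadOnlyArray behave as documented, and I copy the values out before disposing, since the worker owns the peeked tensor):

int[] RunOnce(Model model, Dictionary<string, Tensor> inputs)
{
    //inefficient: a fresh worker (and allocator) per execution, so stale
    //compute buffers cannot be reused across runs
    IWorker worker = WorkerFactory.CreateWorker(BackendType.GPUCompute, model);
    worker.Execute(inputs);

    TensorInt output = worker.PeekOutput("output") as TensorInt;
    output.MakeReadable();                   //wait for the GPU and download the data
    int[] values = output.ToReadOnlyArray(); //copy before the worker is disposed

    worker.Dispose();
    return values;
}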

This is known internally as Issue 207.