Strange execution of some layers on the CPU instead of GPU

In some models I have observed that many layers are executed on the CPU even though I use the GPUCompute backend and all of the operations should be supported on the GPU.

I built models to reproduce the behavior:
The first model just takes a rendered image (on the GPU) as input and calculates its height and width using the Shape operation. This model is just to prove that Shape can be executed on the GPU. This is what the model looks like:


After execution, all tensors were on the GPU!

The next model is artificially blown up, but it simply creates identity matrices (one per batch element of the input). It is based on this PyTorch code:

import torch

class IdentityModel(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        identity_mat = torch.zeros(x.shape[0], 3, 3)
        identity_mat[:, 0, 0] = 1
        identity_mat[:, 1, 1] = 1
        identity_mat[:, 2, 2] = 1
        return identity_mat

model = IdentityModel()
model.eval()

test = torch.rand(5,3,112,224)
model(test)
torch.onnx.export(
    model, test, "identity_test.onnx",
    verbose=False, opset_version=11,
    input_names=['input'], output_names=['identity_matrices'],
    dynamic_axes={'input': {0: 'batch', 2: 'height', 3: 'width'}},
    do_constant_folding=True)

This is how the beginning of the model looks (I added CPU and GPU tags depending on where the corresponding tensor ended up in my experiment):


So for some reason, “Shape” seems to work on the GPU for the first example but not for the second :thinking:

(The discussion started at Possible memory leak in 0-dimensional tensors - #5 by josh-o)

I tried to replace the Shape operation with indexing and sum operations. However, those operations still run on the CPU.
The modified model looks like this (untagged operations are still on the same device [see above]):

Oh, I had wrongly assumed that the input is always already on the GPU, since I use a rendered texture.
However, the input texture seems to be on the CPU for some reason (at the graph execution stage):

Model modelLoader = ModelLoader.Load(modelAsset);
modelLoader.AddOutput("input");

IWorker worker = WorkerFactory.CreateWorker(BackendType.GPUCompute, modelLoader);
Tensor image = TextureConverter.ToTensor(textureInput);

Debug.Log("Initial input: " + image.tensorOnDevice.deviceType); //prints: GPU

worker.Execute(image);

Tensor inputInGraph = worker.PeekOutput("input");
Debug.Log("Final input: " + inputInGraph.tensorOnDevice.deviceType); //prints: CPU

This is known internally as Issue 208

Hey @josh-o,

This behaviour is on purpose.
Since ConstantOfShape needs to do a readback on the value of its input shape, we purposely flag this input as needing to be on the CPU. This gets propagated back to all the connected inputs (more or less).
This avoids having to do a GPU/CPU sync, which is extremely costly.

I know it's private, but with reflection you can get hold of m_LayerCPUFallback in Model; it shows which layers need to run where.
We also have (another private class) CPUFallbackPass that implements the logic I mentioned to select which layers need to run where.
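
A minimal sketch of that reflection trick, using the modelLoader variable from the snippet earlier in the thread. It assumes m_LayerCPUFallback is a non-public instance field on the loaded Model; its exact type varies between package versions, so this just logs whatever the collection holds:

using System.Reflection;
using Unity.Sentis;
using UnityEngine;

var field = typeof(Model).GetField("m_LayerCPUFallback",
    BindingFlags.NonPublic | BindingFlags.Instance);
if (field?.GetValue(modelLoader) is System.Collections.IEnumerable fallback)
    foreach (var entry in fallback)
        Debug.Log("CPU fallback entry: " + entry); // which layers run on the CPU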

If you check

    [Serializable]
    [Optimization.CPUFallback.CPUReadInputs(0)]
    public class ConstantOfShape : Layer

ConstantOfShape is flagged as having input 0 needing a readback. We use this info to propagate that a layer's value needs to be read, and thus computed, on the CPU.
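
To make the propagation concrete, here is a rough, hypothetical sketch (not the actual CPUFallbackPass code) of how such flags could be flood-filled backwards through the graph:

using System.Collections.Generic;

static class CpuFallbackSketch
{
    // producers: for each layer name, the names of the layers feeding its inputs
    // (assumed graph structure); readbackRoots: the layers producing values
    // tagged with CPUReadInputs.
    public static HashSet<string> MarkCpuLayers(
        Dictionary<string, List<string>> producers,
        IEnumerable<string> readbackRoots)
    {
        var onCpu = new HashSet<string>();
        var work = new Stack<string>(readbackRoots);
        while (work.Count > 0)
        {
            var layer = work.Pop();
            if (!onCpu.Add(layer))
                continue;                // already marked
            if (producers.TryGetValue(layer, out var feeds))
                foreach (var producer in feeds)
                    work.Push(producer); // whatever computes a CPU-read value also runs on the CPU
        }
        return onCpu;
    }
}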

Hey Alexandre, thank you again for your explanation. I took a look at a few files, and I think I now know which operations (or combinations of operations) I should avoid. I guess the CPU readback is always needed so that shape inference can be kept up to date on the CPU, right? In some cases it is possible to revise the model so that the unknown parameter in the shape can be defined as a dynamic axis (this allows operations like Tile or ConstantOfShape to be bypassed).

Unfortunately, this is not possible with other models because the shape depends on the input.
A common example:

  • Your ONNX model takes the input {1,3,height,width}.
  • However, for the subsequent convolution layers, MaxPool layers and skip connections to be executed correctly, height and width must be multiples of 32.
  • If you insert an ONNX Pad operation at the beginning of the model with an unknown padding input (derived from the image dimensions), I guess the entire image is loaded from the GPU to the CPU (and the following convolution layers are executed on the CPU), although actually only the dimensions of the image are needed on the CPU.

That's my understanding at the moment; maybe my reasoning is wrong.

I’m wondering now:

  • Is it planned for future releases that all operations will work without readbacks, or is that unrealistic?
  • Would it be possible, for the padding example, to copy the image tensor to the CPU asynchronously (for shape inference) while the graph continues to execute with the tensor on the GPU?
  • Developers might also be able to set the values manually for unknown shapes. In the padding example the developer knows the exact output shape of the Pad operation, which is
    (height + (32 - height % 32)) and (width + (32 - width % 32)) respectively (see the small sketch after this list).
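
To illustrate that last point, here is a minimal C# sketch (PaddedSize is a hypothetical helper, not a Sentis API) of the padded extent a developer could supply:

// Hypothetical helper implementing the formula from the bullet above.
// Note it always pads: a dimension that is already a multiple of 32 still
// gains a full extra 32, matching the Mod -> Sub chain in the ONNX model below.
static int PaddedSize(int size, int multiple = 32)
{
    return size + (multiple - size % multiple);
}

// e.g. PaddedSize(112) == 128, PaddedSize(100) == 128, PaddedSize(224) == 256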

For the padding example:
GPU     [-------------- CPU --------------]
image → shape → (32 - shape % 32) → pad.padArray
image ──────────────────────────────→ pad.input
GPU
Pad has two inputs: the image and the padding values.
You can see in the sketch that

  • only the second input, i.e. the padding values, will be on the CPU.
  • the first input, i.e. the image, will be on the GPU.

So we are not downloading the image to the CPU to do the shape calculation; that is exactly what we are trying to avoid.

I tested it again with a reduced model, and it worked as you explained:


So I traced the strange behavior in my model to a TopK layer further down, and I attached a TopK operation to the model shown. This then leads to layers being executed on the CPU that are separated from the actual “readback paths” (the one for the padding values of Pad [Shape → (…) → Concat] and the one for the k of TopK [ReduceMin → (…) → Cast]). I marked the unexplained layers with a question mark:

In my original model the distance between the input and the TopK operation is more than 20 nodes. Nevertheless, the input is copied to the CPU.

OK, share the model with me and I'll confirm what is happening.

This is how I built the example model in Python:

import onnx

nodes = []

node = onnx.helper.make_node('Constant', inputs=[], outputs=['shape_indices'], value=onnx.helper.make_tensor(name='const_tensor', data_type=7, dims=[2], vals=[2,3]))
nodes.append(node)

node = onnx.helper.make_node('Constant', inputs=[], outputs=['divisor'], value=onnx.helper.make_tensor(name='const_tensor', data_type=7, dims=[], vals=[32]))
nodes.append(node)

# Readback chain feeding Pad's pads input: Shape -> Gather -> Mod -> Sub -> Concat
node = onnx.helper.make_node('Shape', inputs=['input'], outputs=['input_shape'])
nodes.append(node)

node = onnx.helper.make_node('Gather', inputs=['input_shape', 'shape_indices'], outputs=['image_shape'])
node.attribute.append(onnx.helper.make_attribute("axis", 0))
nodes.append(node)

node = onnx.helper.make_node('Mod', inputs=['image_shape','divisor'], outputs=['remnant'])
nodes.append(node)

node = onnx.helper.make_node('Sub', inputs=['divisor', 'remnant'], outputs=['padding'])
nodes.append(node)

node = onnx.helper.make_node('Constant', inputs=[], outputs=['zeros'], value=onnx.helper.make_tensor(name='const_tensor', data_type=7, dims=[6], vals=[0,0,0,0,0,0]))
nodes.append(node)

node = onnx.helper.make_node('Concat', inputs=['zeros', 'padding'], outputs=['full_padding'])
node.attribute.append(onnx.helper.make_attribute("axis", 0))
nodes.append(node)

# Pad consumes the image (first input, GPU) and the computed pads (second input, CPU)
node = onnx.helper.make_node('Pad', inputs=['input', 'full_padding'], outputs=['padded_input'])
node.attribute.append(onnx.helper.make_attribute("mode", "constant"))
nodes.append(node)

node = onnx.helper.make_node('Constant', inputs=[], outputs=['weights'], value=onnx.helper.make_tensor(name='const_tensor', data_type=1, dims=[1, 3,3,3], vals=[1]*3*3*3))
nodes.append(node)

node = onnx.helper.make_node('Constant', inputs=[], outputs=['weights_flat'], value=onnx.helper.make_tensor(name='const_tensor', data_type=1, dims=[1, 1, 3,3], vals=[1]*3*3))
nodes.append(node)

node = onnx.helper.make_node('Conv', inputs=['padded_input', 'weights'], outputs=['conv_1'], name="Conv1")
node.attribute.append(onnx.helper.make_attribute("dilations", [1,1]))
node.attribute.append(onnx.helper.make_attribute("group", 1))
node.attribute.append(onnx.helper.make_attribute("kernel_shape", [3,3]))
node.attribute.append(onnx.helper.make_attribute("pads", [0,0,0,0]))
node.attribute.append(onnx.helper.make_attribute("strides", [1,1]))
nodes.append(node)

node = onnx.helper.make_node('Conv', inputs=['conv_1', 'weights_flat'], outputs=['conv_2'], name="Conv2")
node.attribute.append(onnx.helper.make_attribute("dilations", [1,1]))
node.attribute.append(onnx.helper.make_attribute("group", 1))
node.attribute.append(onnx.helper.make_attribute("pads", [0,0,0,0]))
node.attribute.append(onnx.helper.make_attribute("strides", [1,1]))
nodes.append(node)


node = onnx.helper.make_node('Conv', inputs=['conv_2', 'weights_flat'], outputs=['output_1'], name="Conv3")
node.attribute.append(onnx.helper.make_attribute("dilations", [1,1]))
node.attribute.append(onnx.helper.make_attribute("group", 1))
node.attribute.append(onnx.helper.make_attribute("pads", [0,0,0,0]))
node.attribute.append(onnx.helper.make_attribute("strides", [1,1]))
nodes.append(node)

node = onnx.helper.make_node('Conv', inputs=['conv_2', 'weights_flat'], outputs=['conv_3'], name="Conv4")
node.attribute.append(onnx.helper.make_attribute("dilations", [1,1]))
node.attribute.append(onnx.helper.make_attribute("group", 1))
node.attribute.append(onnx.helper.make_attribute("pads", [0,0,0,0]))
node.attribute.append(onnx.helper.make_attribute("strides", [1,1]))
nodes.append(node)

# k for TopK is derived from tensor values, forcing another readback: ReduceMin -> Add -> Cast
node = onnx.helper.make_node('ReduceMin', inputs=['conv_3'], outputs=['min'], name="K")
node.attribute.append(onnx.helper.make_attribute("axes", [1, 2,3]))
node.attribute.append(onnx.helper.make_attribute("keepdims", 0))
nodes.append(node)

node = onnx.helper.make_node('Constant', inputs=[], outputs=['one'], value=onnx.helper.make_tensor(name='const_tensor', data_type=1, dims=[], vals=[1]))
nodes.append(node)

node = onnx.helper.make_node('Add', inputs=['min','one'], outputs=['k'])
nodes.append(node)

node = onnx.helper.make_node('Cast', inputs=['k'], outputs=['k_int'])
node.attribute.append(onnx.helper.make_attribute("to", onnx.TensorProto.INT64))
nodes.append(node)

node = onnx.helper.make_node('TopK', inputs=['conv_3', 'k_int'], outputs=['output_2', '_'])
node.attribute.append(onnx.helper.make_attribute("axis", 2))
node.attribute.append(onnx.helper.make_attribute("largest", 1))
nodes.append(node)

inputs = []
inputs.append(onnx.helper.make_tensor_value_info("input", onnx.TensorProto.FLOAT, [1,3,"width","height"]))

outputs = []
outputs.append(onnx.helper.make_tensor_value_info("output_1", onnx.TensorProto.FLOAT, [1,1,"width_padded","height_padded"]))
outputs.append(onnx.helper.make_tensor_value_info("output_2", onnx.TensorProto.FLOAT, [1,1,"unknown", "height_padded"]))

graph = onnx.helper.make_graph(nodes, "graph", inputs, outputs)
model = onnx.helper.make_model(graph, producer_name="", opset_imports=[onnx.helper.make_opsetid("", 11)])

onnx.checker.check_model(model)
onnx.save(model, "test_padding_and_topk.onnx")

I noticed another problem with the automatic “CPU readback tracing”.
If you have an input that is initially on the CPU (k from the first TopK) but is then loaded onto the GPU for another operation (Sub), a third operation (the k of the second TopK) can no longer read the value. In this example, I have marked the corresponding position with exclamation marks:

Running the model on the GPU I get an error message in the log. This is how I create the input tensors:

TensorFloat arrayTensor = new TensorFloat(new TensorShape(5), new float[] { 1,2,3,4,5 });
TensorInt arrayLength = new TensorInt(new TensorShape(1), new int[] { 5 });

Dictionary<string, Tensor> inputTensors = new()
{
      { "array", arrayTensor },
      { "array_length", arrayLength },
};

worker.Execute(inputTensors);

Hi Josh, this seems like expected behaviour. The shape of a tensor is only a few numbers, so it would not be efficient to operate on it on the GPU.
For example, summing two numbers would be very inefficient if you sent them to the GPU and then sent the result back, whereas summing a thousand numbers is worth sending to the GPU.

Operations on the tensors themselves, which are large arrays of numbers, are most efficient on the GPU.
The CPU and GPU are constantly communicating.

In general you will have to call MakeReadable() before reading values.
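
A minimal sketch of that pattern, assuming a Sentis 1.x-style API (names may differ slightly between package versions) and the output name from the model above:

TensorFloat output = worker.PeekOutput("output_1") as TensorFloat;
output.MakeReadable();                    // blocking GPU -> CPU download of the tensor data
Debug.Log("first value: " + output[0]);   // safe to index only after MakeReadable()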

For your above example, if you got that error without trying to read anything from the output, that looks like a bug. If you can supply the model and source code, we'll take a look at it. (Make sure you are running Unity 2023 and have updated to the latest package.)