Sentis inference is extremely slow

Posting here because I've been trying to optimize my NN inference runtime in several ways without success, and I still get horrible performance.

I am running inference on some data through a first function (Run) and collecting it through a second one (GetOutput) to allow the operation to run in a thread during the execution of my remaining Update loops.

I think this implementation should be roughly equivalent to the one in the Read output asynchronously example.

This is a snippet of my code:

private void Run()
{
    FloatsFromVec3();

    DisposeTensors(_inputTensor, _outputTensor);

    _inputTensor = new TensorFloat(new TensorShape(1, 63), _inputs);
    _engine.Execute(_inputTensor);

    // Peek the value from Sentis, without taking ownership of the Tensor (see PeekOutput docs for details).
    _outputTensor = _engine.PeekOutput() as TensorFloat;
    _outputTensor.AsyncReadbackRequest(ReadbackCallback);
}

void ReadbackCallback(bool completed)
{
    if (!completed)
    {
        DebugUtils.LogAvatarInput("ReadbackCallback failed: not completed", DebugLevelEnum.Debug);
        return;
    }

    // Put the downloaded tensor data into a readable tensor before indexing.
    _outputTensor.MakeReadable();

    DebugUtils.LogAvatarInput($"Output tensor processed", DebugLevelEnum.Debug);
}

public int GetOutput()
{
    if (_skip)
        return _outPose;

    while (!_outputTensor.IsAsyncReadbackRequestDone())
    {
        DebugUtils.LogAvatarInput("Waiting for async readback to complete ....", DebugLevelEnum.Debug);
    }

    float[] tensorVals = new float[N];
    float outVal;
    if (_outputTensor != null)
    {
        tensorVals = _outputTensor.ToReadOnlyArray();

        // get argmax
        outVal = tensorVals.ToList().IndexOf(tensorVals.Max());
    }
    ...
}

Now, there are some things that don't make sense to me.

  1. I am importing an extremely simple NN (2 layers x 16 neurons, with 64 inputs, for a total of ~1,000 FLOPs). Processing it takes ~0.5ms to 1ms, compared to ~3µs running the same model on tflite.
  2. I am trying to simulate the runtime cost of my main app using a single delayer class (DataDrivenDelayer.cs) that just counts to N (see the sketch after this list). My expectation was that, while I wait, my NN would run in a thread, and the total cost of the class running Sentis inference (DataDriven.cs) would not contribute to the PlayerLoop. Apparently that's not the case.
  3. If I try to run a set of models in parallel, their runtimes don't sum linearly: running 4-5 simple models seems to be much less expensive than 5 times the cost of a single one.
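
For context, the delayer is nothing more than this kind of component; a simplified sketch of what I mean, not the exact file:

using UnityEngine;

// Simplified stand-in for DataDrivenDelayer.cs: burn CPU in Update to simulate
// the frame cost of the rest of the app.
public class DataDrivenDelayer : MonoBehaviour
{
    [SerializeField] private int _iterations = 1_000_000; // "N": tune to simulate load

    private long _sink; // keeps the loop from being optimized away

    private void Update()
    {
        long acc = 0;
        for (int i = 0; i < _iterations; i++)
            acc += i;
        _sink = acc;
    }
}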

Maybe this?

from the manual

You must call Dispose on outputs if you obtain them via worker.FinishExecutionAndDownloadOutput or if you take ownership of them by calling tensor.TakeOwnership

I think the output disposal is a product of different iterations. I’m pretty sure I had the same problem before adding it and it didn’t help.
I will double-check though and let you know. Thank you @BackgroundMover.

Confirming that commenting out the first

DisposeTensors(_inputTensor, _outputTensor);
does not solve the problem.


@mmacchini your code looks correct, though you might want to do the argmax in the model instead:

model.layers.Add(new ArgMax("output_argmax", model.outputs[0], ..));
model.outputs[0] = "output_argmax";
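
If you keep the argmax on the C# side instead, a plain loop over the readback array also avoids the ToList()/Max() allocations in GetOutput; a minimal sketch:

// Argmax over the readback array without LINQ allocations.
static int ArgMax(float[] values)
{
    int best = 0;
    for (int i = 1; i < values.Length; i++)
    {
        if (values[i] > values[best])
            best = i;
    }
    return best;
}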

For the 0.5ms scheduling cost: we are aware that our scheduling costs are too high, and we are working to fix it. The issue is linked to our allocator; we try to be smart and re-use memory, but that currently has too big a cost on the main thread. Look out for the fix in an upcoming version.

A few things you can do in the meantime.

  • Make sure you are running the Editor or a runtime build as a release build, not a debug build
  • Since your model is extremely light, you can always use our GPUComputeBackend/CPUBackend to manually call Dense
    This has the least amount of scheduling cost, and you are in total control of all allocations

Hi @alexandreribard_unity, thank you for your answer, this makes a lot of sense.

I think this scheduling cost should be documented somehow, especially as it seems to weigh on the main thread, which is not what I got from your docs.

Anyway, are you saying that I can manually run the NN to prevent/limit these scheduling costs? Can you provide sample code for that? Can it be threaded?

What I’m saying is:

  • we know scheduling costs are too high and we are fixing it
  • in the meantime, since your NN is tiny, dispatching it manually would solve your scheduling cost issues:
 GPUComputeBackend backend = new GPUComputeBackend();
...
X, Y, O = new TensorFloat ...
W0 = model.constants[0].DataSetToTensorView();
B0 = model.constants[1].DataSetToTensorView();
...
backend.Dense(X, W0, B0, Y, Layers.FusableActivation.None);
backend.Dense(Y, W1, B1, O, Layers.FusableActivation.None);

You still get dispatch cost, and that's only threadable in 2023 LTS with Awaitable:
https://github.com/keijiro/AsyncCaptureTest/blob/master/Assets/AsyncCapture.cs
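
For completeness, the Awaitable pattern in 2023 LTS looks roughly like this (a minimal sketch: the method and the work it wraps are placeholders, and anything that touches UnityEngine objects still has to run on the main thread):

using UnityEngine;

public class ManualDispatchRunner : MonoBehaviour
{
    // Sketch: hop to a background thread for CPU-only work, then back.
    async Awaitable RunOffMainThreadAsync()
    {
        await Awaitable.BackgroundThreadAsync(); // continue on a background thread
        DispatchDenseLayers();                   // placeholder for the manual dispatch work
        await Awaitable.MainThreadAsync();       // hop back before using Unity APIs
    }

    void DispatchDenseLayers()
    {
        // placeholder: the backend.Dense(...) calls would go here
    }
}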

Thank you @alexandreribard_unity!
I tried this in this code:

// X, Y, O = new TensorFloat ...
TensorFloat mlp0w = (TensorFloat)_runtimeModel.constants[6].DataSetToTensor(); //.DataSetToTensorView();
TensorFloat mlp0b = (TensorFloat)_runtimeModel.constants[0].DataSetToTensor();

TensorFloat mlp2w = (TensorFloat)_runtimeModel.constants[7].DataSetToTensor();
TensorFloat mlp2b = (TensorFloat)_runtimeModel.constants[1].DataSetToTensor();

TensorFloat scale11 = (TensorFloat)_runtimeModel.constants[4].DataSetToTensor();
TensorFloat bias11 = (TensorFloat)_runtimeModel.constants[5].DataSetToTensor();

TensorFloat mlp6w = (TensorFloat)_runtimeModel.constants[2].DataSetToTensor();
TensorFloat mlp6b = (TensorFloat)_runtimeModel.constants[3].DataSetToTensor();

TensorFloat X = _inputTensor;
TensorFloat X1 = new TensorFloat(new TensorShape(1, 16), new float[16]);
TensorFloat X2 = new TensorFloat(new TensorShape(1, 16), new float[16]);
TensorFloat B = new TensorFloat(new TensorShape(1, 16), new float[16]); // batchNorm
TensorFloat Y_ = new TensorFloat(new TensorShape(1, 9), new float[9]);
TensorFloat Y = new TensorFloat(new TensorShape(1, 9), new float[9]);

backend.Dense(X, mlp0w, mlp0b, X1, Unity.Sentis.Layers.FusableActivation.Relu);
backend.Dense(X1, mlp2w, mlp2b, X2, Unity.Sentis.Layers.FusableActivation.Relu);

backend.ScaleBias(X2, scale11, bias11, B);

backend.MatMul2D(B, mlp6w, Y_, false, true);
backend.Add(Y_, mlp6b, Y);

_outputTensor = (TensorFloat)Y.DeepCopy();

DebugUtils.LogAvatarInput("Sentis inference done manually", DebugLevelEnum.Debug);

I guess I can get the constants once in the constructor, no big deal.
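
Something like this, roughly (field names here are just illustrative; the constant indices are the ones from the snippet above):

// One-time setup so the per-frame path only contains the backend calls.
private TensorFloat _mlp0w, _mlp0b, _mlp2w, _mlp2b, _scale11, _bias11, _mlp6w, _mlp6b;
private TensorFloat _x1, _x2, _bn, _yPre, _y;

private void InitManualDispatch()
{
    _mlp0w = (TensorFloat)_runtimeModel.constants[6].DataSetToTensor();
    _mlp0b = (TensorFloat)_runtimeModel.constants[0].DataSetToTensor();
    _mlp2w = (TensorFloat)_runtimeModel.constants[7].DataSetToTensor();
    _mlp2b = (TensorFloat)_runtimeModel.constants[1].DataSetToTensor();
    _scale11 = (TensorFloat)_runtimeModel.constants[4].DataSetToTensor();
    _bias11 = (TensorFloat)_runtimeModel.constants[5].DataSetToTensor();
    _mlp6w = (TensorFloat)_runtimeModel.constants[2].DataSetToTensor();
    _mlp6b = (TensorFloat)_runtimeModel.constants[3].DataSetToTensor();

    // Pre-allocate the intermediates once instead of on every call.
    _x1 = new TensorFloat(new TensorShape(1, 16), new float[16]);
    _x2 = new TensorFloat(new TensorShape(1, 16), new float[16]);
    _bn = new TensorFloat(new TensorShape(1, 16), new float[16]);
    _yPre = new TensorFloat(new TensorShape(1, 9), new float[9]);
    _y = new TensorFloat(new TensorShape(1, 9), new float[9]);
}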

The problem here is that the CPU backend is still extremely slow at performing these operations! You can see the sizes of the tensors; the largest matmul is 63x16. Even so, it still takes something on the order of 2-3ms on my machine.

Is it reasonable for the backend to be THIS slow? A 9-element Add taking 0.16ms?

Double-check that you are in release mode and not debug.
Otherwise, share the model and we'll file a bug and fix it.
0.16ms is not normal, especially since this is measuring scheduling cost and not the Burst job underneath.

I confirm the same behavior in a release run.

Here is my .onnx:

Thanks for your help, it’s appreciated!

Btw, I also tried implementing MatMul, Add, BatchNorm, and ReLU myself, and that code runs ~20x faster.
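
For reference, the hand-rolled version is just plain loops over float arrays; a simplified sketch of what I mean (not my exact code):

// Hand-rolled dense layer with fused ReLU: y = relu(W * x + b).
// x has inDim elements, w is [outDim, inDim], b and y have outDim elements.
static void DenseRelu(float[] x, float[,] w, float[] b, float[] y)
{
    int outDim = y.Length;
    int inDim = x.Length;
    for (int o = 0; o < outDim; o++)
    {
        float sum = b[o];
        for (int i = 0; i < inDim; i++)
            sum += w[o, i] * x[i];
        y[o] = sum > 0f ? sum : 0f; // ReLU
    }
}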