What are the best practices for running inference on webcams/videos with Sentis?

Hi everyone,

I’m currently working on a project that involves running inference (e.g., segmentation) on webcam input using Unity Sentis, and I’m targeting edge devices such as mobiles. I’m seeking advice on best practices to achieve the highest FPS and lowest latency.

Here are a few specific questions and points I’m looking for guidance on:

  1. Inference Backend:
    Which backend should I choose for optimal performance on mobile devices? My experience with GPUCompute has shown it to be slower in some cases. Are there alternatives or specific configurations I should consider, or does it depend on the type of operations I am performing?

  2. Mask Application and Resizing:
    What are the best practices for applying masks to the input frames? Any tips on efficiently overlaying segmentation masks on the webcam feed without introducing significant latency? My current approach uses for loops to check each pixel's score against a threshold, Graphics.Blit to resize the texture, and GetPixels()/SetPixels() to manipulate the webcam texture.
    How should I handle resizing masks to match the input frame size? Are there efficient methods or built-in functions in Unity Sentis that can help with this?

  3. General Optimization:
    Any general tips or tricks for optimizing inference performance on edge devices? Specific settings, code optimizations, or hardware considerations that can help improve both FPS and latency?

  4. Using Sentis Backend Ops:
    When should I be using the Backend class operations such as ArgMax? Would they work well on mobile devices or am I better off using some custom functions?

I appreciate any insights or experiences you can share. Looking forward to your responses!

Best, Namas

Great questions:

  1. When dealing with images, the data is already on the GPU, so it’s best to keep it there and download the result asynchronously (a synchronous GPU readback causes a frame hitch if you do it on the same frame).
    => use the GPU backend and follow the asynchronous readback tutorial so you don’t interrupt the system too much
  2. Use a compute shader; it’s the most flexible option. You could also add a multiply-by-mask operator to your model, that would work too: you define the mask as a second input that gets multiplied with your input image.
  3. GPU, keep everything in the same pipe. Don’t use the CPU for processing images; if you need a result back, read it asynchronously.
  4. Try the functional API to add an op to the end of your model. It’s much more intuitive and you’ll benefit from our model optimizations.

If the GPU backend is slower than the CPU for image-based models, let us know; that’s typically not normal.
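For point 4, here is a hypothetical sketch of what appending an op with the functional API can look like. The class name and lambda shape are illustrative, and the exact Functional.Compile / Functional.ArgMax signatures vary between Sentis versions, so treat this as an outline rather than copy-paste code:

```csharp
using Unity.Sentis;
using UnityEngine;

// Hypothetical sketch: bake a channel-wise ArgMax into the compiled model so
// the per-pixel class index is computed on the GPU with the rest of the graph.
// Exact functional API signatures depend on your Sentis version.
public class AppendArgMax : MonoBehaviour
{
    public ModelAsset modelAsset;
    Model m_RuntimeModel;

    void Start()
    {
        var sourceModel = ModelLoader.Load(modelAsset);

        m_RuntimeModel = Functional.Compile(
            input =>
            {
                // Run the original model; output is assumed to be (1, C, H, W) logits.
                var logits = sourceModel.Forward(input)[0];
                // ArgMax over the class channel; keep the dim so the result
                // can still be converted back to a texture.
                return Functional.ArgMax(logits, 1, keepdim: true);
            },
            InputDef.FromModel(sourceModel));
    }
}
```

Because the ArgMax is part of the compiled model, it benefits from the model optimizer and never leaves the GPU pipe.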

  1. Nice one! This is good and fast. To resize back to the final output on the screen, follow com.unity.sentis\Samples\Copy a texture tensor to the screen\RunModelOnFullScreen.cs:
void OnRenderImage(RenderTexture source, RenderTexture destination)
{
    ...
    TextureConverter.RenderToScreen(output);
}
  1. Nope, all good: everything will run on the GPU.

Now back to 1.

  1. That depends on your model and the temporal frequency you can afford. You can play around with the input dimensions; it all depends on how much of a ms budget you have. If your model is fast enough you don’t need it; if it is slow, you might want to spread inference over 1 or 2 frames to smooth out the frame rate.
    You can also play around with lowering the input resolution and upscaling afterwards. You might want to look into temporal upscaling and temporal re-projection to combine two frames’ depth into one higher-resolution image.
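A minimal sketch of the lower-the-resolution idea, assuming an existing worker `m_Engine` and a running `webCamTexture` (the 256x256 target is an arbitrary example size, not a recommendation):

```csharp
// Sketch: run the model on a smaller input and let the final GPU blit
// upscale the mask back to display resolution.
var lowResTransform = new TextureTransform()
    .SetDimensions(256, 256, 3)           // example size; tune to your ms budget
    .SetTensorLayout(TensorLayout.NCHW);

using var input = TextureConverter.ToTensor(webCamTexture, lowResTransform);
m_Engine.Execute(input);
var output = m_Engine.PeekOutput() as TensorFloat;

// RenderToScreen blits on the GPU, so the low-res mask is upscaled
// to the display resolution as part of the draw, with no CPU readback.
TextureConverter.RenderToScreen(output);
```

Halving the input width and height roughly quarters the per-layer work for convolutional models, which is usually the single biggest lever on mobile.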

Cheers @alexandreribard_unity! You have helped me a great deal. After carefully understanding your suggestions, I have finally come up with a solution that works; however, I am still confused about a few things. Here’s the code that I am using:

using UnityEngine;
using UnityEngine.UI;
using Unity.Sentis;
using System.Collections;

public class Main : MonoBehaviour
{
    // Displaying stuff
    public RawImage cameraFeed;
    public Text FPSText;
    private WebCamTexture webCamTexture;
    
    // Constants
    private static int modelInputWidth;
    private static int modelInputHeight;
    private float deltaTime = 0.0f;

    // Sentis stuff
    public ModelAsset modelAsset;
    IWorker m_Engine;
    Tensor m_Input;
    TensorFloat m_OutputTensor;
    private TextureTransform transform;

    // Parameters for model execution
    const int k_LayersPerFrame = 20;
    IEnumerator m_Schedule;
    bool m_Started = false;
    

    void ReadbackCallback(bool completed)
    {
        // The call to `CompleteOperationsAndDownload` will no longer block with a readback as the data is already on the CPU
        m_OutputTensor.CompleteOperationsAndDownload();
        // The output tensor is now in a readable state on the CPU
    }


    void Start()
    {
        // Webcam stuff

        // Load the ONNX model via Sentis.
        var model = ModelLoader.Load(modelAsset);
        var output = model.outputs[0];

         // Model processing stuff
        
        // Add the operators/operations to the model and compile a new one.
        var finalModel = Functional.Compile(ModelWithPrePostProcessing, InputDef.FromModel(model));
    
        // Initialize the transform to convert a Texture input to Tensor for model inference.
        transform = new TextureTransform()
            .SetDimensions(modelInputWidth, modelInputHeight, 3)
            .SetTensorLayout(TensorLayout.NCHW);
        
        // Initialize model with Backend.
        m_Engine = WorkerFactory.CreateWorker(BackendType.GPUCompute, finalModel);
    }

    bool executionStarted = false;
    IEnumerator executionSchedule;

    void Update()
    {   
        deltaTime += (Time.deltaTime - deltaTime) * 0.1f;
        FPSText.text = string.Format("FPS: {0:0.}", 1.0f / deltaTime);

        // // ###### SPLIT FRAME EXECUTION #############
        if (!m_Started)
        {
            m_Input = TextureConverter.ToTensor(webCamTexture, transform);
            m_Schedule = m_Engine.ExecuteLayerByLayer(m_Input);
            m_Started = true;
        }

        int it = 0;
        while (m_Schedule.MoveNext())
        {
            if (++it % k_LayersPerFrame == 0)
                return;
        }
        // // ##########################################

        // ##### UNCOMMENT FOR NORMAL EXECUTION #####
        // m_Input = TextureConverter.ToTensor(webCamTexture, transform);
        // m_Engine.Execute(m_Input);
        // ##########################################

        // Get the output and perform ASYNC operation to move tensor data from GPU to CPU.
        m_OutputTensor = m_Engine.PeekOutput() as TensorFloat;
        //m_OutputTensor.ReadbackRequest(ReadbackCallback); // Comment for SYNC
        //m_OutputTensor.CompleteOperationsAndDownload(); // Comment for ASYNC

        // Convert the tensor to a RenderTexture
        //TextureConverter.RenderToScreen(m_OutputTensor);
        RenderTexture outputRenderTexture = TextureConverter.ToTexture(m_OutputTensor, transform);

        // Assign the RenderTexture to the RawImage for display
        cameraFeed.texture = outputRenderTexture;

        // Dispose stuff
        m_Started = false;
        m_Input.Dispose();
        m_OutputTensor.Dispose();
    }

    void OnDisable()
    {
        m_Engine.Dispose();
        m_Input.Dispose();
        m_OutputTensor.Dispose();
    }

    void OnDestroy()
    {
        m_Engine.Dispose();
        m_Input.Dispose();
        m_OutputTensor.Dispose();
    }
}

My questions are:

  1. While using the split frame inference on my Android phone (Pixel 7 Pro), with layers=2, the FPS counter prints ~60 and the latency is about 16 ms, which seems correct, but the video rendered on the screen is extremely choppy and laggy (it looks like about 5 fps max). However, when I use layers=20, the FPS still prints ~60 and the latency is also 16 ms, but the displayed frames with the masks are very smooth. I am trying to understand why this is the case. Could you point out if I am missing something in my code, perhaps in how I am displaying things?

  2. When using the GPUCompute backend here, do I need to call CompleteOperationsAndDownload() on the output tensor before rendering it with RenderTexture outputRenderTexture = TextureConverter.ToTexture(m_OutputTensor, transform);? If not, why? Does TextureConverter use the GPU by default? I have tried both the async and sync approaches, and the FPS and latency are almost identical; however, I have noticed that when using GPUCompute and not calling m_OutputTensor.ReadbackRequest(ReadbackCallback); or m_OutputTensor.CompleteOperationsAndDownload();, the latency remains constant with fewer drops.

  3. Just wanted to point out that I tried using Functional.ArgMax() on an input of size (1,21,257,257) and it would throw an error stating that the size is too big. Note: this only happened while using the GPUCompute backend. It worked with the CPU backend.

  4. What would be an ideal choice of backend when trying to run things on mobile phones? What’s the difference between GPUPixel and GPUCompute? Is either better than the other?

Please let me know if you have any other questions regarding this implementation.

I really appreciate your help @alexandreribard_unity!

Best, Namas

Your code looks good. Just be aware that
RenderTexture outputRenderTexture = TextureConverter.ToTexture(m_OutputTensor, transform); will allocate a RenderTexture every frame, so it’s not ideal memory-wise :slight_smile:
I’d suggest allocating it once and then using TextureConverter.RenderToTexture(m_OutputTensor, outputRenderTexture, transform); instead
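A sketch of that allocate-once pattern (the field names here are illustrative, and the exact RenderToTexture signature may differ between Sentis versions):

```csharp
// Sketch: allocate the output RenderTexture once and reuse it every frame,
// instead of letting TextureConverter.ToTexture allocate a new one per call.
RenderTexture m_OutputRT;

void Start()
{
    // Sized to the model output; depth buffer not needed for a blit target.
    m_OutputRT = new RenderTexture(modelInputWidth, modelInputHeight, 0);
}

void Update()
{
    // ... schedule the model and peek m_OutputTensor as before ...
    TextureConverter.RenderToTexture(m_OutputTensor, m_OutputRT, transform);
    cameraFeed.texture = m_OutputRT;
}

void OnDestroy()
{
    m_OutputRT.Release();
}
```

This keeps the per-frame path allocation-free, which also avoids GC and driver churn on mobile.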

  1. Yeah, that’s normal.
    Assume your model has 100 layers.
    At 2 layers per frame there is no strain on the GPU and the FPS is smooth at 60 fps, but it takes 50 frames, so a bit under a second, to finish. The visual frequency of the output mask is now about 1 fps, so things look very choppy.
    At 20 layers per frame there is a bit more strain on the GPU, the FPS is still smooth at 60 fps, and it now takes 5 frames to finish. That’s roughly 83 ms, or 12 fps, which is visually OK, so it’s less choppy.

  2. No, you don’t want to download the result from the GPU. Your code is correct: TextureConverter keeps things on the GPU, so everything stays smooth.

  3. What error do you get? What’s the final tensor output size?

  4. Use GPUCompute; GPUPixel works but is very slow.