High performance inferencing

Hey, I’m trying to get high performance inferencing working. It looks like there is some effort in the Tensor interface to allow for asynchronous readback from the GPU, which I have working and seems fine.

The problem I am running into is that the IWorker.Execute() call is taking 12-13 milliseconds!!! I expected it to start a thread and schedule the work so I could fetch the results at some future point in time, but it appears to block the main thread entirely while it runs. Is this true?

The alternative of scheduling and running layers manually seems… not ideal either. If I have a really deep model, it will take many frames to complete. I’m trying to do realtime voice work, and it does not appear I can do anything in realtime this way.

How do I kick off a model and have it not touch the main thread until I get notified it’s done?

Thanks,
JH

Further questions… I have a number of different inferences that I need to do on the same model, in sequence, and it’s likely a previous one is not complete before the next one begins.

It appears that if I were to reuse an IWorker that is not quite done with a previous inference, it trashes the active context. I guess that makes sense, but wanted to verify that is expected behavior?

Given this, I must create a separate IWorker for each active inference I want to perform on the same model. It appears that if I create an IWorker with a model, it compiles shaders; if I create 5 effectively identical IWorkers, I get 5x the compilation time (paid on first inference, not on creation). There’s no apparent caching of that work, or reuse of those shaders. Depending on the number of simultaneous inferences I have in-flight, this can be prohibitive.

Assuming the above is true, I built a simple recycling cache and I reuse an IWorker if the model matches (it would be convenient if I could ask the IWorker what model it’s associated with; I have to store that myself). Is this the correct approach?
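
My cache is roughly this shape (just a sketch of my own bookkeeping, not Sentis API; the helper names are mine, and I key on a model name because IWorker won’t tell me its model):

// Rough sketch of my worker-recycling cache (helper names are mine, not Sentis).
private readonly Dictionary<string, Stack<IWorker>> _idleWorkers = new Dictionary<string, Stack<IWorker>>();

private IWorker AcquireWorker(string modelName, Model model)
{
	// Reuse an idle worker if we have one; its shaders are already warm.
	if (_idleWorkers.TryGetValue(modelName, out Stack<IWorker> pool) && pool.Count > 0)
		return pool.Pop();

	// Otherwise pay the creation (and first-inference shader) cost.
	return WorkerFactory.CreateWorker(BackendType.GPUCompute, model);
}

private void ReleaseWorker(string modelName, IWorker worker)
{
	if (!_idleWorkers.TryGetValue(modelName, out Stack<IWorker> pool))
		_idleWorkers[modelName] = pool = new Stack<IWorker>();

	// Do NOT Dispose here -- that would throw away the compiled shaders.
	pool.Push(worker);
}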

My last big concern is how to deal with tensor data that has been used as inputs or outputs of a model. The memory model seems pretty loose; maybe reference counting would help.

If I am calling SetInput and passing that tensor in, I can’t call Dispose() at that time or it would destroy the underlying buffers. I’m guessing the API expects me to call Dispose() when inferencing completes. I’m not doing that currently; instead I just call SetInput again and expect it to dispose of tensors that have not already been disposed. It does not appear to do that. I’m seeing tons of leaks after reusing the same IWorker for multiple inferences.
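
Concretely, per inference I’m doing roughly this (the tensor and input/output names here are placeholders for my real ones):

// Placeholder names; my real inputs come from a dictionary.
TensorFloat inputTensor = new TensorFloat(new TensorShape(1, 16000), new float[16000]);
engine.SetInput("input_values", inputTensor);
engine.Execute();
Tensor outputTensor = engine.PeekOutput("logits");
outputTensor.AsyncReadbackRequest(null);

// Next inference: I just call SetInput again with fresh tensors and assume
// the previous ones get cleaned up -- which apparently they don't.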

It would be incredibly handy if there was a way to reset an IWorker’s tensors without needing to know all the inputs and outputs that might be referenced inside it. Or better yet, if SetInput would clean up and dispose any tensors that nobody called TakeOwnership on.

Calling Dispose on an IWorker destroys the shaders (which cost me 500ms to build), so that’s not really an option.

I’ll try to manually Dispose all the tensors myself after inferencing and see if that patches up the leaks.

Lots of interesting questions!
I’ll do my best to answer them all, but maybe it’s best to break them up into different threads

Last question is the easiest.

  • Any tensor you allocate via new Tensor... is your responsibility to Dispose.
  • Anything else you don’t need to Dispose, unless you call TakeOwnership on it.
    So for the outputs of a model, there’s no need to dispose of them.
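
In code, the rule looks roughly like this (just a sketch of the ownership rule, not a complete example):

// A tensor you allocate yourself: your responsibility to Dispose.
TensorFloat input = new TensorFloat(new TensorShape(1, 4), new float[] { 1, 2, 3, 4 });

worker.Execute(input);

// A tensor the worker hands back: owned by the worker, so don't Dispose it
// unless you explicitly called TakeOwnership on it.
Tensor output = worker.PeekOutput();

input.Dispose();  // the input was yours, so clean it up when you're done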

I’ll refer you to the docs, where we detail this:
https://docs.unity3d.com/Packages/com.unity.sentis@1.2/manual/get-the-output.html
https://docs.unity3d.com/Packages/com.unity.sentis@1.2/manual/manage-memory.html?q=dispose

If that is hard, you can use an Allocator to keep track of your tensor allocations and then dispose of that.
C.f. ~\Samples\Do an operation on a tensor

I don’t get why disposing a worker would dump all the shaders and force a recompile; that is strange behaviour.
If you have nice repro steps, please submit a bug report.

Second question:
Model chaining.

First, do remember that model.Execute only schedules the model’s kernels on the GPU/Burst.
We handle dependencies correctly.
If you do

model.Execute(input);
output = model.PeekOutput();
model.Execute(output);

This will work perfectly as we are chaining dependencies between jobs.
Only when you do tensor.CompletePendingTransactions() do we wait for the jobs to be done.
Else everything is happening asynchronously.

So do not create 5 workers if you are chaining one model many times; do it as you would expect :)
We’ll add that to the docs; I realize it might not have been explained enough.
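
For example, a rough sketch of grabbing a result without stalling the main thread:

worker.Execute(input);
TensorFloat output = worker.PeekOutput() as TensorFloat;

// Kick off a non-blocking GPU -> CPU readback.
output.AsyncReadbackRequest(null);

// ...later (e.g. a few frames on), once the readback has finished:
if (output.IsAsyncReadbackRequestDone())
{
	output.MakeReadable();
	float[] values = output.ToReadOnlyArray();
}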

First point, this thread should help

model.Execute schedules the work onto the GPU/Burst. There might be some layers that need to run on the CPU (shape calculations…) too.
It could very well be that the cumulative cost of those two is 12 ms…
To alleviate this, I would try two things:

If you think the 12ms is excessive, do file a bug report

Thank you for the helpful responses! I did make some effort toward cleaning up tensors. It’s very unfortunate that IWorker does not provide a method for iterating over its input and output tensors without knowing the model. More unfortunate still, IWorker does not let you retrieve the Model it is associated with, which does have that capability. I would highly recommend adding some linkage back to those classes for developer convenience. I had to create several dictionaries so I can look that up when all I have is the IWorker, several frames later, at the point where the async callback for the output tensor fires and the inference completes.

With every inference, I add an entry to this dictionary:

		private Dictionary<int, (string, IWorker, Tensor, List<Tensor>)> _idToOutputsAndInputs;  // id -> (model, engine, outputTensor, inputTensors)

In my MonoBehaviour’s Update I poll the output tensor’s IsAsyncReadbackRequestDone(). When it’s done, I grab the output tensor contents, then dispose of all the inputs, and for good measure TakeOwnership of the output and dispose of it as well. After only maybe 10 or so inferences, I shut the program down and get a slew of these warnings:

Found unreferenced, but undisposed ComputeTensorData which might lead to GPU resource leak
UnityEngine.Debug:LogWarning (object)
Unity.Sentis.D:LogWarning (object) (at Library/PackageCache/com.unity.sentis@1.2.0-exp.2/Runtime/Core/Internals/Debug.cs:72)
Unity.Sentis.ComputeTensorData:Finalize () (at Library/PackageCache/com.unity.sentis@1.2.0-exp.2/Runtime/Core/Backends/GPUCompute/ComputeTensorData.cs:115)

And a similar number of these:

GarbageCollector disposing of ComputeBuffer allocated in D:\UnityPiecemealPOC\Library\PackageCache\com.unity.sentis@1.2.0-exp.2\Runtime\Core\Backends\GPUCompute\ComputeTensorData.cs at line 96. Please use ComputeBuffer.Release() or .Dispose() to manually release the buffer.
UnityEngine.ComputeBuffer:Finalize ()

When I say a slew, I mean over 1000 of each. There’s something wrong.


What I am seeing in the Execute call is approximately 12 ms of time. This is unreasonably slow for realtime work. From what I see in the GenericWorker code, it queues up all the layers in one go, and with the wav2vec2 model this takes a really long time. Is it possible for this work to be done on a different thread somehow? Or to prepare all these dispatches on a separate thread and then kick them off on the main thread as soon as possible? If generating dispatch instructions can be this expensive, it is worth the effort to move it off the main thread. In Unity, main-thread time is the most precious commodity, and we gamedevs spend most of our time trying to find ways to get it back.

You mentioned perhaps we can optimize our model somehow to flow through Sentis more efficiently. What guidance can you give for that, based on what I’m seeing?

I still fail to see why you’d want to query the worker’s model if you are already providing it at Execute…

Dictionary<string, Tensor> firstInputs;

worker.Execute(firstInputs);
var output = worker.PeekOutput();
worker.Execute(output);

foreach (var input in firstInputs)
    input.Value.Dispose();

You don’t need to TakeOwnership of the output if you are chaining it to another model execution or just reading from it.

Scheduling needs to happen on the main thread because we are scheduling graphics or unity jobs.
You can use the CommandBuffer backend to build the graphics job queue just once.
https://docs.unity3d.com/Packages/com.unity.sentis@1.2/manual/use-command-buffer.html?q=command
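
Roughly like this (a sketch only; the exact extension method that schedules a worker onto a CommandBuffer is documented on the page above and may be named differently in your Sentis version):

using UnityEngine.Rendering;

// Build the command buffer once up front.
CommandBuffer cb = new CommandBuffer();

// Assumed call: records the worker's dispatches into the buffer
// (check the linked doc page for the exact API in your version).
cb.ExecuteWorker(worker, inputTensor);

// Replay the pre-built buffer whenever you want an inference,
// instead of paying the per-layer scheduling cost every time.
Graphics.ExecuteCommandBuffer(cb);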

From looking at your model, you just seem to have a lot of layers, and job/compute scheduling takes a few ms every time.
So you seem to only be able to afford a certain amount of job scheduling every frame.
I would try to reduce the number of layers in your model or split execution over N frames.

Isolating problems with one IWorker is easier than trying to fix them when they are being reused. I just switched my code to reuse a single IWorker and it seems to work fine now.

I also stopped taking ownership of the output tensor and stopped disposing it. Seems happy.

Finally, I discovered I was not disposing the IWorkers when shutting down; fixing that cleared up tons of internal leaks. As far as I can tell, it’s working, just slow to schedule now.

Thanks for the guidance. I’ll review the command buffer backend and see how hard that is to get working. The performance for large models on main thread CPU is untenable through the simpler interface.

Actually, running the same IWorker multiple times in a row doesn’t work properly. It seems like it does, but the output is different from what I get when I run the same input tensors on separate workers. This is why I split it up into a pool of them in the first place… to isolate the issues I was seeing. Now that I’ve folded it back together, it’s broken again.

So… let me just paste my code here and see if you see anything wrong with my approach:

using System.Collections.Generic;
using System.Diagnostics;
using UnityEngine;
using Unity.Sentis;
using System;

namespace Inferencer
{
	public class SentisInferencer : IInferencer
	{	
		private int                                                      _inferenceCounter;
		private Dictionary<string, Model>                                _models               = new Dictionary<string, Model>();
		private Dictionary<string, IWorker>                              _workers              = new Dictionary<string, IWorker>();  // it's apparently legal to reuse the same worker for multiple executions.  I assume at .Execute() it generates the output tensors.
		private Dictionary<int, (string, IWorker, Tensor, List<Tensor>)> _idToOutputsAndInputs = new Dictionary<int, (string, IWorker, Tensor, List<Tensor>)>();  // id -> (model, engine, outputTensor, inputTensors)

		public void LoadModel(string name, string modelPath, bool useCPU=false)
		{
			try
			{
				ModelAsset modelAsset = Resources.Load<ModelAsset>($"Models/{modelPath}");
				Model model = ModelLoader.Load(modelAsset);
				_models.Add(name, model);

				IWorker engine = ProduceWorker(name);
				_workers.Add(name, engine);
			}
			catch (Exception e)
			{
				UnityEngine.Debug.LogException(e);
			}
		}

		public int Inference(string name, string outputName, Dictionary<string, (float[], int[], DataType)> nameAndData)
		{
			if (_workers.TryGetValue(name, out IWorker engine))
			{
				// put placeholder in the inference outputs dictionary
				_inferenceCounter++;

				List<Tensor> inputTensors = new List<Tensor>();
				foreach (KeyValuePair<string, (float[], int[], DataType)> kvp in nameAndData)
				{
					float[] srcData = kvp.Value.Item1;
					int[] srcDims = kvp.Value.Item2;
					DataType dtype = kvp.Value.Item3;

					switch (dtype)
					{
						case Inferencer.DataType.Float:
						{
							// Sentis considers a single value not a dimensioned tensor.  Weird.
							if (srcDims==null)
							{
								TensorFloat tensorFloat = new TensorFloat(srcData[0]);
								inputTensors.Add(tensorFloat);
								engine.SetInput(kvp.Key, tensorFloat);
							}
							else
							{
								TensorFloat tensorFloat = new TensorFloat(new TensorShape(srcDims), srcData);
								inputTensors.Add(tensorFloat);
								engine.SetInput(kvp.Key, tensorFloat);
							}
							break;
						}
						case Inferencer.DataType.Integer:
						{
							if (srcDims==null)
							{
								TensorInt tensorInt = new TensorInt((int)srcData[0]); 
								inputTensors.Add(tensorInt);
								engine.SetInput(kvp.Key, tensorInt);
							}
							else
							{
								int[] data = new int[srcData.Length];
								for (int i=0; i<data.Length; i++)
								{
									data[i] = (int)srcData[i];
								}
								TensorShape inputDataShape = new TensorShape(srcDims);
								TensorInt tensorInt = new TensorInt(inputDataShape, data); 
								inputTensors.Add(tensorInt);
								engine.SetInput(kvp.Key, tensorInt);
							}
							break;
						}
						default:
							UnityEngine.Debug.Log($"Unrecognized input type for {name} = {dtype}");
							break;
					}
				}

				Stopwatch stopWatchInf = Stopwatch.StartNew();

				// Start the inferencer
				engine.Execute();

				UnityEngine.Debug.Log($"{name} Inference execute {stopWatchInf.ElapsedMilliseconds}");

				Tensor outputTensor = engine.PeekOutput(outputName);  // Grab a reference to the output so we can stash it
				outputTensor.AsyncReadbackRequest(null);

				// Store off the unfinished inference for later
				_idToOutputsAndInputs.Add(_inferenceCounter, (name, engine, outputTensor, inputTensors));
				return _inferenceCounter;
			}
			return -1;
		}

		// Returns null if the data isn't present yet.
		public (float[], int[]) GetData(int outputID)
		{
			if (_idToOutputsAndInputs.TryGetValue(outputID, out (string modelName, IWorker engine, Tensor output, List<Tensor> inputs) x))
			{
				(float[] data, int[] dims) results = (null, null);

				if (x.output.IsAsyncReadbackRequestDone())
				{
					_idToOutputsAndInputs.Remove(outputID);

					x.output.MakeReadable(); 

					// Read the outputs here and fill the output dictionary for this inference, then put it into the outputDict
					Dictionary<string, (float[], int[], DataType)> outputs = new Dictionary<string, (float[], int[], DataType)>();
					results.dims = x.output.shape.ToArray();
					results.data = ((TensorFloat)x.output).ToReadOnlyArray();  // only support a single TensorFloat as the outputs for now.

					// Dispose the input tensors
					foreach (Tensor t in x.inputs)
					{
						t.Dispose();
					}
				}
				return results;
			}
			throw new Exception($"Caller requested inference results from an invalid ID {outputID}");
		}

		public void Shutdown()
		{
			// Make sure we dispose of all the in-flight tensors, as they may be hanging onto GPU resources.
			foreach (KeyValuePair<int, (string modelName, IWorker engine, Tensor output, List<Tensor> inputs)> kvp in _idToOutputsAndInputs)
			{
				kvp.Value.output.Dispose();
				foreach (Tensor t in kvp.Value.inputs)
				{
					t.Dispose();
				}
			}
			_idToOutputsAndInputs.Clear();

			// Release all the workers
			foreach (KeyValuePair<string, IWorker> kvp in _workers)
			{
				kvp.Value.Dispose();
			}
			_workers.Clear();
			_models.Clear();
		}

		private IWorker ProduceWorker(string modelName)
		{
			IWorker engine = null;

			// Find the model first, then spin up a worker for it.
			if (_models.TryGetValue(modelName, out Model model))
			{
				try
				{
					engine = WorkerFactory.CreateWorker(BackendType.GPUCompute, model);
					UnityEngine.Debug.Log($"{modelName} is connected to the GPU Compute");
				}
				catch
				{
					try
					{
						engine = WorkerFactory.CreateWorker(BackendType.GPUCommandBuffer, model);
						UnityEngine.Debug.Log($"{modelName} is connected to the GPU CommandBuffer");
					}
					catch
					{
						try
						{
							engine = WorkerFactory.CreateWorker(BackendType.CPU, model);
							UnityEngine.Debug.Log($"{modelName} is connected to the CPU");
						}
						catch
						{
							try
							{
								engine = WorkerFactory.CreateWorker(BackendType.GPUPixel, model);
								UnityEngine.Debug.Log($"{modelName} is connected to the GPU via GPUPixel");
							}
							catch
							{
								UnityEngine.Debug.Log($"SentisInferencer Inference failed to create worker for {modelName}");
							}
						}
					}
				}
			}
			else
			{
				UnityEngine.Debug.Log($"SentisInferencer model {modelName} not loaded");
			}
			return engine;
		}
	}
}

Generally what happens is: I call LoadModel and pass in the name; my code loads the model and produces a worker. When I want to run inference, I call Inference and pass in the model name, the name of the output tensor, and a dictionary of the inputs. That gets translated to what Sentis wants, and I get back an int that lets the caller poll in Update every frame until the data is ready.

This same interface has been implemented with DirectML and directly with OnnxRuntime, among others. It helps isolate the issues with a specific provider and keep the caller code agnostic.

And I should point out: when I had a list of IWorker objects instead of just one, I would take one off the list, use it, then return it when the GetData() call succeeded. That approach produced correct output. This one does not.

Can you send me a PM? I can follow up with you offline and investigate your model.