AsyncReadbackRequest

I am currently working on a Voice Conversion feature that runs inference on Wav2Vec 2.0, Fastspeech2, and HifiGAN. I am attempting to use AsyncReadbackRequest to have inference run on a different thread so that MakeReadable() does not block the main thread. However, when I check which thread the callback runs on, it always returns 1, which is the ID of the main thread.

Questions:

  • Is there a way to access the output data without using MakeReadable()? I have checked the samples for Compute Buffer and Async Process, but it looks like MakeReadable() is required to interact with the data even if you pin the data.
  • Are there other steps that I need to take into consideration to allow for Inference() to run on a separate thread?
  • Is there a better method (e.g. profiling in Unity) to determine which thread Inference() is running on?

Code sample:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using System.Threading;
using UnityEngine;
using Unity.Sentis;

using System.Linq;

namespace Inferencer
{
	public class SentisInferencer : IInferencer
	{	
		private Model model;
		private Dictionary<string, IWorker> _modelSessions  = new Dictionary<string, IWorker>();
		private TensorInt tensorInt;
		private TensorFloat tensorFloat;
		private TensorFloat outputTensor;
		private TensorShape inputDataShape;
		private Dictionary<string, Tensor> inputs;
		private Dictionary<string, (float[], int[], DataType)> outputs;
		private string outputName;
		private string modelName;
		private Stopwatch stopWatchInf;

		public SentisInferencer(){}


		public void LoadModel(string name, string modelPath, bool useCPU=false)
		{
			ModelAsset modelAsset = Resources.Load<ModelAsset>($"Models/{modelPath}");
			IWorker m_Engine = null;
			try
			{
				try
				{
					model = ModelLoader.Load(modelAsset);
					m_Engine = WorkerFactory.CreateWorker(BackendType.GPUCompute, model);
					UnityEngine.Debug.Log($"{name} is connected to the GPU");
				}
				catch
				{
					model = ModelLoader.Load(modelAsset);
					m_Engine = WorkerFactory.CreateWorker(BackendType.CPU, model);
					UnityEngine.Debug.Log($"{name} is connected to the CPU");
				}
			}
			catch (Exception e)
			{
				UnityEngine.Debug.Log(e);
			}
			
			
			if (m_Engine != null)
			{
				// Remember the session, inputDimensions, outputDimensions so we know how to form the inputs and receive the outputs
				_modelSessions.Add(name, m_Engine);
			}
			else
			{
				UnityEngine.Debug.Log($"SentisInferencer LoadModel failed {name} to load {modelPath}");
			}
		}

		private void ReadbackCallback(bool completed)
		{
			UnityEngine.Debug.Log($"ReadbackCallback on main thread: {isMainThread()}");
			int[] outputDims = outputTensor.shape.ToArray();
			outputTensor.MakeReadable();
			float[] output = outputTensor.ToReadOnlyArray();

			// There is now only one output per refactored model vs multiple
			outputs[outputName] = (output, outputDims, DataType.Float);
		}

		public async Task<Dictionary<string, (float[], int[], DataType)>> Inference(string name, string outputNameParam, Dictionary<string, (float[], int[], DataType)> nameAndData)
		{
			outputs = new Dictionary<string, (float[], int[], DataType)>();
			outputName = outputNameParam;
			modelName = name;

			if (_modelSessions.TryGetValue(name, out IWorker m_Engine))
			{
				inputs = new Dictionary<string, Tensor>();

				foreach (KeyValuePair<string, (float[], int[], DataType)> kvp in nameAndData)
				{
					float[] srcData = kvp.Value.Item1;
					int[] srcDims = kvp.Value.Item2;
					DataType dtype = kvp.Value.Item3;

					switch (dtype)
					{
						case Inferencer.DataType.Float:
						{
							inputDataShape = new TensorShape(srcDims);
							tensorFloat = new TensorFloat(inputDataShape, srcData); 
							inputs.Add(kvp.Key, tensorFloat);
							break;
						}
						case Inferencer.DataType.Integer:
						{
							int[] data = new int[srcData.Length];
							for (int i=0; i<data.Length; i++)
							{
								data[i] = (int)srcData[i];
							}
							inputDataShape = new TensorShape(srcDims);
							tensorInt = new TensorInt(inputDataShape, data); 
							inputs.Add(kvp.Key, tensorInt);
							break;
						}
						default:
							UnityEngine.Debug.Log($"Unrecognized input type for {name} = {kvp.Value.Item2}");
							break;
					}
				}

				try
				{
					stopWatchInf = Stopwatch.StartNew();
					m_Engine.Execute(inputs);

					// Return type is currently only float data
					outputTensor = m_Engine.PeekOutput(outputName) as TensorFloat;
					UnityEngine.Debug.Log($"Inference - On MainThread?: {isMainThread()}");
					outputTensor.AsyncReadbackRequest(ReadbackCallback);
					bool shouldPrint = true;
					// Busy-wait until the async readback finishes (note: this still occupies the calling thread)
					while (!outputTensor.IsAsyncReadbackRequestDone())
					{
						if (shouldPrint) { UnityEngine.Debug.Log("Waiting on async process to finish..."); shouldPrint = false; }

						if (!isMainThread())
							UnityEngine.Debug.Log($"Is on MainThread: {isMainThread()}");
					}
					stopWatchInf.Stop();
					long wallClockTime = stopWatchInf.ElapsedMilliseconds;
					UnityEngine.Debug.Log($"{modelName} Inference() time: {wallClockTime}");
				}
				catch (Exception e)
				{
					UnityEngine.Debug.Log(e);
				}
			}
			else
			{
				UnityEngine.Debug.Log($"SentisInferencer model {name} not loaded");
			}

			return outputs;
		}

		private bool isMainThread(){
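			// NOTE: this assumes Unity's main thread always has managed thread id 1 (true in practice, but not guaranteed by .NET)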
			return Thread.CurrentThread.ManagedThreadId.Equals(1);
		}

		public void Shutdown()
		{
			// Make sure we dispose of all the InferenceSession objects, as they may be hanging onto GPU resources.
			foreach (KeyValuePair<string, IWorker> kvp in _modelSessions) kvp.Value.Dispose();
			_modelSessions.Clear();

			foreach (KeyValuePair<string, Tensor> kvp in inputs) kvp.Value.Dispose();
			inputs.Clear();

			if (tensorFloat != null) tensorFloat.Dispose();
			if (tensorInt != null) tensorInt.Dispose();
			if (outputTensor != null) outputTensor.Dispose();
		}
	}
}

Ok so first things first I’ll refer you to the corresponding docs:
https://docs.unity3d.com/Packages/com.unity.sentis@1.2/manual/read-output-async.html
https://docs.unity3d.com/Packages/com.unity.sentis@1.2/manual/access-tensor-data-directly.html

Then answering your questions more directly:

  • Execute happens on the main thread
    This is because we are scheduling the GPU/job workload, and that scheduling needs to happen on the main thread.
    But it also means that the workload itself (the math for each layer) runs on separate threads or on the GPU, leaving the main thread free (minus the scheduling cost, of course).
  • Blocking the main thread.
    Calling MakeReadable on a tensor is a blocking call: it synchronizes the worker threads/GPU with the main thread.
    If you do not want this behavior you need to call tensor.AsyncReadbackRequest first.
    Once the callback is called, or tensor.IsAsyncReadbackRequestDone() == true, you can do a MakeReadable/ToReadOnlyNativeArray/… without any blocking issue (see the sketch after this list).
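
For reference, here is a minimal sketch of that flow, modelled on the read-output-async docs linked above (the MonoBehaviour name, model asset, and input shape are placeholders, not your actual setup):

using Unity.Sentis;
using UnityEngine;

public class AsyncReadbackExample : MonoBehaviour
{
	public ModelAsset modelAsset;   // placeholder: assign in the Inspector
	IWorker m_Engine;
	TensorFloat m_Input;
	TensorFloat m_Output;

	void Start()
	{
		var model = ModelLoader.Load(modelAsset);
		m_Engine = WorkerFactory.CreateWorker(BackendType.GPUCompute, model);
		m_Input = new TensorFloat(new TensorShape(1, 16), new float[16]);   // hypothetical shape

		// Scheduling happens on the main thread; the layer math runs on the GPU / worker threads.
		m_Engine.Execute(m_Input);
		m_Output = m_Engine.PeekOutput() as TensorFloat;

		// Start the non-blocking download; the callback fires once the data is on the CPU.
		m_Output.AsyncReadbackRequest(OnReadback);
	}

	void OnReadback(bool completed)
	{
		// The data has already been downloaded, so MakeReadable no longer stalls the main thread.
		m_Output.MakeReadable();
		float[] results = m_Output.ToReadOnlyArray();
		Debug.Log($"Readback complete: {results.Length} values");
	}

	void OnDestroy()
	{
		m_Input?.Dispose();
		m_Engine?.Dispose();
	}
}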

You might also want to operate on the tensor’s data natively without dealing with this callback.
See the package samples ~Samples/Use a compute buffer and Use Burst to write data.

For this you do
var dataBurst = BurstTensorData.Pin(tensor);
var dataCompute = ComputeTensorData.Pin(tensor);
depending on whether your tensor is on the CPU or the GPU.

You then have access to:

  • the data pointer + read/write job fences for BurstTensorData
  • the ComputeBuffer for ComputeTensorData

And now you are back in regular job/compute land and can do whatever you want with it (see the sketch below).
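
A rough sketch of the pinning route (the member names such as buffer and fence follow the access-tensor-data-directly page linked above; double-check them against your exact Sentis version, and the shape and shader names here are placeholders):

using Unity.Sentis;
using UnityEngine;

public class PinTensorDataExample : MonoBehaviour
{
	public ModelAsset modelAsset;   // placeholder: assign in the Inspector
	IWorker m_Engine;
	TensorFloat m_Input;

	void Start()
	{
		var model = ModelLoader.Load(modelAsset);
		m_Engine = WorkerFactory.CreateWorker(BackendType.GPUCompute, model);
		m_Input = new TensorFloat(new TensorShape(1, 16), new float[16]);   // hypothetical shape
		m_Engine.Execute(m_Input);
		var output = m_Engine.PeekOutput() as TensorFloat;

		// GPU backend: pin to get the ComputeBuffer backing the tensor and
		// bind it to your own compute shader instead of reading it back.
		ComputeTensorData gpuData = ComputeTensorData.Pin(output);
		ComputeBuffer buffer = gpuData.buffer;
		// yourComputeShader.SetBuffer(kernelIndex, "Xptr", buffer);   // hypothetical shader/kernel

		// If the tensor lives on the CPU backend instead, pin to Burst and
		// chain your own jobs on its read/write fences:
		// var cpuData = BurstTensorData.Pin(output);
		// cpuData.fence = yourJob.Schedule(cpuData.fence);
	}

	void OnDestroy()
	{
		m_Input?.Dispose();
		m_Engine?.Dispose();
	}
}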

Maybe I’m missing something, but isn’t this how I currently have the code structured? Aside from

while (!outputTensor.IsAsyncReadbackRequestDone())
{
	if (shouldPrint) { UnityEngine.Debug.Log("Waiting on async process to finish..."); shouldPrint = false; }

	if (!isMainThread())
		UnityEngine.Debug.Log($"Is on MainThread: {isMainThread()}");
}

the code should for the most part match the code from the resource you shared for read-output-async.

The answer is in my post, but here is the breakdown:

  • You cannot do m_Engine.Execute(inputs); in a thread 🙂
  • Inference scheduling needs to happen on the main thread.
  • To ensure that MakeReadable doesn’t block the main thread you need to do
tensor.AsyncReadbackRequest()

on the main thread. Then you can do

while (!outputTensor.IsAsyncReadbackRequestDone())

on the main thread (a frame-by-frame version of that wait is sketched below).

  • To interact with the data without using MakeReadable, you pin to whichever backend you want and manage the read/write fences or ComputeBuffer yourself.
  • Run model inference and use the Unity Profiler: you’ll see that all layers are scheduled on the main thread, but the work (the heavy math) happens on different threads.
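
To make that concrete, here is a minimal non-blocking variant of the Inference flow as a sketch (placeholder model and shape; the callback passed to AsyncReadbackRequest is a no-op here because we poll IsAsyncReadbackRequestDone() from a coroutine instead of spinning):

using System.Collections;
using Unity.Sentis;
using UnityEngine;

public class PollReadbackExample : MonoBehaviour
{
	public ModelAsset modelAsset;   // placeholder: assign in the Inspector
	IWorker m_Engine;
	TensorFloat m_Input;
	TensorFloat m_Output;

	void Start()
	{
		var model = ModelLoader.Load(modelAsset);
		m_Engine = WorkerFactory.CreateWorker(BackendType.GPUCompute, model);
		m_Input = new TensorFloat(new TensorShape(1, 16), new float[16]);   // hypothetical shape

		// Scheduling must happen on the main thread.
		m_Engine.Execute(m_Input);
		m_Output = m_Engine.PeekOutput() as TensorFloat;

		// Request the download but don't wait for it; we poll its status below.
		m_Output.AsyncReadbackRequest(_ => { });
		StartCoroutine(WaitForReadback());
	}

	IEnumerator WaitForReadback()
	{
		// Yield once per frame instead of spinning in a while loop,
		// so the main thread keeps running while the readback completes.
		while (!m_Output.IsAsyncReadbackRequestDone())
			yield return null;

		m_Output.MakeReadable();   // no longer blocks: the data is already on the CPU
		float[] results = m_Output.ToReadOnlyArray();
		Debug.Log($"Got {results.Length} output values");
	}

	void OnDestroy()
	{
		m_Input?.Dispose();
		m_Engine?.Dispose();
	}
}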