After struggling with yolo-pose for months, I finally have some idea of how Sentis works on the CPU, and I found a way to make it run asynchronously in 1.3.0-pre.3. (There may be a better way once they finally fix NMS.)
Before that, I want to explain the "async" I need: a way to run models in the background without affecting the framerate, just like this post: [Feature Request] Asynchronous execution (or doing graphics updates automatically)
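1. The first thing I found: Execute is just an allocator. It only hands out (schedules) the work; it does not do the work itself.
TextureConverter.ToTensor(source, _inputTensor, new TextureTransform());
_worker.Execute(_inputTensor);
It must run on the main thread, where it starts assigning work to the GPU or to Burst (CPU). You cannot move Execute onto a background Thread, and wrapping it in a Coroutine does not take it off the main thread either.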
2. Execute is already "async"; it does not block the CPU.
After the first Conv layer was allocated, it started running on Burst, and several layers were already running before allocation finished. If we follow "Run a model a layer at a time | Sentis | 1.3.0-pre.3", the layers are allocated across different frames, but they are not forced to run in different frames.
For example, if I make a model "run" over two frames, the first half of the model is allocated in the first frame and starts running. When the second frame arrives, the first half may still be running; the second half (already allocated) runs once the first half is done. The key point is that allocation (Execute or m_Schedule.MoveNext()) takes very little time, while the actual execution costs a lot.
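To make the split between allocation and execution concrete, here is a minimal sketch of spreading the allocation over several frames. It is only an illustration based on the "Run a model a layer at a time" page; the class name, the modelAsset field, the 640x640 input shape and the LayersPerFrame budget are all my own assumptions, not part of the solution further down.

```csharp
using System.Collections;
using Unity.Sentis;
using UnityEngine;

// Hypothetical component: all names here are illustrative assumptions.
public class LayerAtATimeSketch : MonoBehaviour
{
    public ModelAsset modelAsset;   // assumed to be assigned in the Inspector
    const int LayersPerFrame = 20;  // arbitrary budget, tune per model

    IWorker _worker;
    TensorFloat _input;
    IEnumerator _schedule;
    bool _started;

    void Start()
    {
        var model = ModelLoader.Load(modelAsset);
        _worker = WorkerFactory.CreateWorker(BackendType.CPU, model);
        _input = TensorFloat.Zeros(new TensorShape(1, 3, 640, 640));
    }

    void Update()
    {
        if (!_started)
        {
            // Begin a new pass; nothing has been allocated yet.
            _schedule = _worker.StartManualSchedule(_input);
            _started = true;
        }

        // Each MoveNext() only allocates one layer. The jobs it creates
        // keep running in the background after Update() returns.
        var allocated = 0;
        while (_schedule.MoveNext())
        {
            if (++allocated % LayersPerFrame == 0)
                return; // continue allocating in the next frame
        }

        // Every layer has been allocated. Reading the output right here
        // would block the main thread until the jobs are finished.
        _started = false;
    }

    void OnDestroy()
    {
        _worker?.Dispose();
        _input?.Dispose();
    }
}
```

Note that this only spreads the allocation cost. It does nothing about how long the layers actually take to run, which is why it does not fix the stall described in the next point.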
3. NMS only works on the main thread. It waits for its inputs, which is why you can see a long idle period inside NMS.
NMS seems to take 32.07ms, but the real NMS work only starts in the last 0.4ms (the little red line after the 31.72ms JobHandle.Complete).
Execute does not block, and NMS itself does not block. The blocking comes from the main thread waiting for the Burst (CPU) or GPU work to finish (well, it is NMS that causes the wait).
If NMS could run in Burst or on the GPU, it would be much better. But in 1.3.0-pre.3, we can only separate it out by slicing these layers.
After trying "Run a model a layer at a time | Sentis | 1.3.0-pre.3" and "Read output from a model asynchronously | Sentis | 1.3.0-pre.3", and reading all the NMS and async topics in the forum, I finally found a way to run it asynchronously. It is quite simple, but it took a lot of time to understand what was happening.
- Understand that the model has two parts: (1) everything before NMS, and (2) NMS.
- "Run a layer in a frame" for the first part.
- Use PeekOutput("boxCoords") to get the input of NMS.
- Use IsReadbackRequestDone() (a hidden API?) to check whether it is ready for NMS (Burst jobs completed). If it returns false, end this Update.
- Run the NMS part in a single frame at the end.
The result: I got 400fps for yolov8n-pose (640*640) on i7-12800h+3080Ti+32G and 45fps on i3-10100+8G. Before that, I could only get 65fps on the i7 and 25fps on the i3. This is a CPU backend result, so it might also work on Android devices.
Here is some helpful code:
// Assumes: using Unity.Sentis; using Lays = Unity.Sentis.Layers;
private void SetupModel()
{
    // Load model
    var model = ModelLoader.Load(Application.streamingAssetsPath + "/" + modelName);
    _inputTensor = TensorFloat.Zeros(new TensorShape(1, 3, ModelSize.x, ModelSize.y));

    // Set constants
    model.AddConstant(new Lays.Constant("0", new int[] { 0 }));
    model.AddConstant(new Lays.Constant("1", new int[] { 1 }));
    model.AddConstant(new Lays.Constant("4", new int[] { 4 }));
    model.AddConstant(new Lays.Constant("5", new int[] { 5 }));
    model.AddConstant(new Lays.Constant("56", new int[] { 56 }));
    // 1 + 4: one known class (person only) plus 4 box coordinates
    model.AddConstant(new Lays.Constant("classes_plus_4", new int[] { 5 }));
    model.AddConstant(new Lays.Constant("maxOutputBoxes", new int[] { MaxOutputBoxes }));
    model.AddConstant(new Lays.Constant("iouThreshold", new float[] { IouThreshold }));
    model.AddConstant(new Lays.Constant("scoreThreshold", new float[] { ScoreThreshold }));

    // Add layers
    // --- first part: box coordinates (2 layers)
    model.AddLayer(new Lays.Slice("boxCoords0", "output0", "0", "4", "1"));
    model.AddLayer(new Lays.Transpose("boxCoords", "boxCoords0", new int[] { 0, 2, 1 }));
    // --- second part: scores and NMS (the last 3 layers of the model)
    model.AddLayer(new Lays.Slice("scores0", "output0", "4", "classes_plus_4", "1"));
    model.AddLayer(new Lays.ReduceMax("scores", new[] { "scores0", "1" }));
    model.AddLayer(new Lays.NonMaxSuppression("NMS", "boxCoords", "scores",
        "maxOutputBoxes", "iouThreshold", "scoreThreshold",
        centerPointBox: Lays.CenterPointBox.Center
    ));

    // Total layer count, used later to stop allocating just before the NMS part
    _modelLayerCount = model.layers.Count;

    model.outputs.Clear();
    model.AddOutput("boxCoords");
    model.AddOutput("NMS");

    _ops = WorkerFactory.CreateOps(Backend, null);
    _engine = WorkerFactory.CreateWorker(Backend, model);
}
private void ExecuteMl(Texture source)
{
    // If allocation has not started yet, allocate the first part of the model.
    if (_firstPart)
    {
        TextureConverter.ToTensor(source, _inputTensor, new TextureTransform());
        _executionSchedule = _engine.StartManualSchedule(_inputTensor);
        // Allocate everything except the last 3 layers (the NMS part).
        for (var i = 0; i < _modelLayerCount - 3; i++)
        {
            _executionSchedule.MoveNext();
        }
        _firstPart = false;
        // Allocation of the first part is done; exit this update immediately.
        return;
    }

    _boxCoords = _engine.PeekOutput("boxCoords") as TensorFloat;
    // The second part may only be allocated once boxCoords is ready,
    // i.e. once the Burst jobs of the first part have completed.
    _secondPart = _boxCoords.IsReadbackRequestDone();
    // If the second part is not allowed yet, exit this update.
    if (!_secondPart)
    {
        return;
    }

    // Allocate the last segment of the model. The NMS part has 3 layers
    // (Slice, ReduceMax, NonMaxSuppression), so move 3 times to get the NMS output.
    for (var i = 0; i < 3; i++)
    {
        _executionSchedule.MoveNext();
    }
    _nms = _engine.PeekOutput("NMS") as TensorInt;
    // Asynchronous readback: ReadbackCallback runs when the NMS result is available.
    _nms.ReadbackRequest(ReadbackCallback);

    // Reset for the next inference
    _firstPart = true;
    _secondPart = false;
}
void ReadbackCallback(bool completed)
{
    // Take the box index column of the NMS output and flatten it
    using var boxIDs = _ops.Slice(_nms, new int[] { 2 }, new int[] { 3 }, new int[] { 1 }, new int[] { 1 });
    using var boxIDsFlat = boxIDs.ShallowReshape(new TensorShape(boxIDs.shape.length)) as TensorInt;
    // Pick the boxes that survived NMS
    using var boxOutput = _ops.Gather(_boxCoords, boxIDsFlat, 1);
    boxOutput.MakeReadable();

    ClearAnnotations();

    float displayWidth = _rawImageSize.x;
    float displayHeight = _rawImageSize.y;
    // Scale from model input size to display size
    var scaleX = displayWidth / ModelSize.x;
    var scaleY = displayHeight / ModelSize.y;

    var boxNum = boxOutput.shape[1];
    for (var n = 0; n < boxNum; n++)
    {
        var box = new BoundingBox
        {
            // Shift so that the origin is at the center of the display
            CenterX = boxOutput[0, n, 0] * scaleX - displayWidth / 2,
            CenterY = boxOutput[0, n, 1] * scaleY - displayHeight / 2,
            Width = boxOutput[0, n, 2] * scaleX,
            Height = boxOutput[0, n, 3] * scaleY,
        };
        DrawBox(box, n);
    }
}
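For completeness, this is roughly how the methods above could be driven from a MonoBehaviour. It is only a sketch: the _sourceTexture field is a hypothetical name I made up for whatever texture you feed in, and it assumes _firstPart starts out as true so that the very first call allocates the first part of the model.

```csharp
// Hypothetical members of the same MonoBehaviour as the code above.
[SerializeField] private Texture _sourceTexture; // assumed input, e.g. a camera RenderTexture

private void Start()
{
    SetupModel();
}

private void Update()
{
    // Each call does exactly one of three things: allocate the first part,
    // return early while the Burst jobs are still busy, or allocate the NMS part.
    ExecuteMl(_sourceTexture);
}

private void OnDestroy()
{
    // Release the worker, the ops object and the input tensor.
    _engine?.Dispose();
    _ops?.Dispose();
    _inputTensor?.Dispose();
}
```

Because ExecuteMl returns early whenever boxCoords is not ready, one inference usually spans several frames, and the annotations are only updated when ReadbackCallback fires.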
Please correct me if I got anything wrong. I will be very happy if this post helps you.