I got real-time 3D pose estimation somewhat working using YOLOv8-pose and MotionBERT models that I converted from PyTorch to ONNX. The current implementation is very basic, since I only got into Unity a couple of months ago and I'm still pretty much a novice.
All that aside, the current implementation runs frame-by-frame inference through both the YOLOv8 and MotionBERT workers, so MotionBERT only ever sees the single-frame prediction output from YOLOv8.
To improve MotionBERT's prediction accuracy, I want to store the N most recent predictions from YOLOv8 and feed them to MotionBERT together. I've looked at some example code on GitHub and Hugging Face, but I'm having a hard time arriving at a reasonable solution. Maybe store the last N predictions in an array and use the backend to concatenate them?
MotionBERT's input shape is [Batch, # of frames, # of joints, xyAndScore].
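The sketch below is roughly what I have in mind: a helper (the class name and fields are my own invention) that keeps the last N downloaded YOLOv8 frames in a flat ring buffer and packs them into a [1, N, numJoints, 3] TensorFloat for MotionBERT. It assumes the same Sentis 1.x API my code already uses (CPU indexing on a downloaded TensorFloat, and the TensorFloat(TensorShape, float[]) constructor):

using System;
using Unity.Sentis;
using UnityEngine;

// Hypothetical helper: keeps the last N downloaded YOLOv8-pose frames in a
// flat ring buffer and packs them into a [1, N, numJoints, 3] TensorFloat.
public class PoseFrameBuffer
{
    const int Channels = 3;      // x, y, confidence score

    readonly int numFrames;      // N
    readonly int numJoints;
    readonly float[] buffer;     // flat ring buffer, N * numJoints * 3 floats
    int framesStored;            // frames pushed so far, capped at N
    int writeIndex;              // next frame slot to overwrite

    public PoseFrameBuffer(int numFrames, int numJoints)
    {
        this.numFrames = numFrames;
        this.numJoints = numJoints;
        buffer = new float[numFrames * numJoints * Channels];
    }

    public bool IsFull => framesStored >= numFrames;

    // Copy one downloaded YOLOv8 prediction (shape [1, 1, numJoints, 3])
    // into the slot currently holding the oldest frame.
    public void Push(TensorFloat twoDJoints)
    {
        int offset = writeIndex * numJoints * Channels;
        for (int j = 0; j < numJoints; j++)
            for (int c = 0; c < Channels; c++)
                buffer[offset + j * Channels + c] = twoDJoints[0, 0, j, c];

        writeIndex = (writeIndex + 1) % numFrames;
        framesStored = Mathf.Min(framesStored + 1, numFrames);
    }

    // Build the MotionBERT input tensor, oldest frame first.
    // The caller owns the returned tensor and must Dispose() it.
    public TensorFloat ToTensor()
    {
        var ordered = new float[buffer.Length];
        int frameSize = numJoints * Channels;
        for (int f = 0; f < numFrames; f++)
        {
            // Once the buffer is full, the oldest frame sits at writeIndex.
            int src = ((writeIndex + f) % numFrames) * frameSize;
            Array.Copy(buffer, src, ordered, f * frameSize, frameSize);
        }
        return new TensorFloat(new TensorShape(1, numFrames, numJoints, Channels), ordered);
    }
}

Building the tensor on the CPU every frame like this feels wasteful, though, which is why I'm wondering whether concatenating on the backend would be the better approach.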
The inference part of my code is below; it pretty much follows the YOLOv8 implementation uploaded to Hugging Face.
public bool RunML(WebCamTexture webcamTexture) {
    bool hasPredicted = false;

    // Convert the current webcam frame into an input tensor for YOLOv8-pose.
    inputTensor?.Dispose();
    inputTensor = TextureConverter.ToTensor(webcamTexture, textureTransform);

    // 2D pose estimation (YOLOv8-pose).
    twoDPoseWorker.Execute(inputTensor);
    var twoDJointsTensor = twoDPoseWorker.PeekOutput() as TensorFloat;
    twoDJointsTensor.CompleteOperationsAndDownload();

    // Proceed only when the output has the expected joint layout.
    if (twoDJointsTensor.shape[2] == numJoints && twoDJointsTensor.shape[3] == 3) {
        // 2D-to-3D lifting (MotionBERT) on the single-frame prediction.
        threeDPoseWorker.Execute(twoDJointsTensor);
        var threeDJointsTensor = threeDPoseWorker.PeekOutput() as TensorFloat;
        threeDJointsTensor.CompleteOperationsAndDownload();

        // Copy the 3D joint positions out of the tensor.
        for (int idx = 0; idx < numJoints; idx++) {
            threeDJointsVector[idx].x = threeDJointsTensor[0, 0, idx, 0];
            threeDJointsVector[idx].y = threeDJointsTensor[0, 0, idx, 1];
            threeDJointsVector[idx].z = threeDJointsTensor[0, 0, idx, 2];
        }
        hasPredicted = true;
    }
    return hasPredicted;
}
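If the buffer idea is sound, I imagine replacing the MotionBERT part of RunML with something like this. Again just a sketch: poseBuffer would be a PoseFrameBuffer field constructed with (N, numJoints), motionBertInput a TensorFloat field so it can be disposed on the next frame, and I'm assuming the ONNX export keeps MotionBERT's [Batch, # of frames, # of joints, xyz] output shape so the last frame index is the most recent one:

twoDJointsTensor.CompleteOperationsAndDownload();
poseBuffer.Push(twoDJointsTensor);

// Only run MotionBERT once we have a full window of N frames.
if (poseBuffer.IsFull) {
    motionBertInput?.Dispose();
    motionBertInput = poseBuffer.ToTensor();   // [1, N, numJoints, 3]

    threeDPoseWorker.Execute(motionBertInput);
    var threeDJointsTensor = threeDPoseWorker.PeekOutput() as TensorFloat;
    threeDJointsTensor.CompleteOperationsAndDownload();

    // Assuming one 3D pose per input frame, read out the most recent frame.
    int lastFrame = N - 1;
    for (int idx = 0; idx < numJoints; idx++) {
        threeDJointsVector[idx].x = threeDJointsTensor[0, lastFrame, idx, 0];
        threeDJointsVector[idx].y = threeDJointsTensor[0, lastFrame, idx, 1];
        threeDJointsVector[idx].z = threeDJointsTensor[0, lastFrame, idx, 2];
    }
    hasPredicted = true;
}

Does this look like a reasonable direction, or is there a better way to batch the frames, e.g. concatenating tensors on the backend instead of rebuilding the input on the CPU?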