3D Pose Estimation with YOLOv8 and MotionBERT

I got real-time 3D pose estimation somewhat working using YOLOv8-pose and MotionBERT models that I converted to ONNX from PyTorch. The implementation is currently very basic, since I only got into Unity a couple of months ago and am still pretty much a novice.

All that aside, the current implementation runs frame-by-frame inference for both the YOLOv8 and MotionBERT workers, so MotionBERT makes its prediction from a single frame of YOLOv8 output.

To improve MotionBERT's prediction accuracy, I want to store the N most recent predictions from YOLOv8 and feed them to MotionBERT. I've looked at some example code on GitHub and Hugging Face, but I'm having a hard time arriving at a reasonable solution. Maybe store the last N predictions in an array and use the backend to concatenate them? :thinking:

The MotionBERT input shape is [Batch, # of frames, # of joints, xyAndScore].
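To make the idea concrete, here is a minimal sketch of the kind of thing I have in mind (all the names here, like PoseFrameBuffer and Push, are made up, and N = 243 is just MotionBERT's usual clip length; it should match whatever the exported model expects): keep a flat CPU-side float array laid out like the MotionBERT input, shift it by one frame per call, and rebuild the input tensor from it.

using Unity.Sentis;

public class PoseFrameBuffer {

    const int N = 243;          // MotionBERT clip length (match your exported model)
    const int numJoints = 17;   // COCO keypoints from YOLOv8-pose
    const int stride = numJoints * 3;

    // Flat window laid out like [1, N, numJoints, 3]. It starts zero-filled,
    // so the first N frames are effectively zero-padded.
    readonly float[] window = new float[N * stride];

    // twoDJoints: downloaded YOLOv8 output of shape [1, 1, numJoints, 3]
    // (it must be CPU-readable, e.g. after CompleteOperationsAndDownload()).
    // The caller owns, and must eventually Dispose, the returned tensor.
    public TensorFloat Push(TensorFloat twoDJoints) {

        // Drop the oldest frame by shifting the window one frame to the left.
        System.Array.Copy(window, stride, window, 0, (N - 1) * stride);

        // Append the newest frame at the end of the window.
        System.Array.Copy(twoDJoints.ToReadOnlyArray(), 0, window, (N - 1) * stride, stride);

        // Rebuild the MotionBERT input tensor [1, N, numJoints, 3].
        return new TensorFloat(new TensorShape(1, N, numJoints, 3), window);
    }
}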

The inference part of my code is below; it pretty much follows the YOLOv8 implementation uploaded on Hugging Face.

public bool RunML(WebCamTexture webcamTexture) {

    bool hasPredicted = false;

    // Convert the current webcam frame into an input tensor for YOLOv8-pose.
    inputTensor?.Dispose();
    inputTensor = TextureConverter.ToTensor(webcamTexture, textureTransform);

    // 2D pose inference; the output layout is [.., .., numJoints, 3] (x, y, score).
    twoDPoseWorker.Execute(inputTensor);
    var twoDJointsTensor = twoDPoseWorker.PeekOutput() as TensorFloat;
    twoDJointsTensor.CompleteOperationsAndDownload();

    // Only run MotionBERT when the 2D output has the expected shape.
    if (twoDJointsTensor.shape[2] == numJoints && twoDJointsTensor.shape[3] == 3) {

        // 3D lift: MotionBERT consumes the 2D joints tensor directly.
        threeDPoseWorker.Execute(twoDJointsTensor);
        var threeDJointsTensor = threeDPoseWorker.PeekOutput() as TensorFloat;
        threeDJointsTensor.CompleteOperationsAndDownload();

        // Copy the predicted 3D joints into Unity vectors.
        for (int idx = 0; idx < numJoints; idx++) {
            threeDJointsVector[idx].x = threeDJointsTensor[0, 0, idx, 0];
            threeDJointsVector[idx].y = threeDJointsTensor[0, 0, idx, 1];
            threeDJointsVector[idx].z = threeDJointsTensor[0, 0, idx, 2];
        }

        hasPredicted = true;
    }

    return hasPredicted;
}
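With a buffer like the sketch above, I imagine the MotionBERT part of RunML would turn into something like this (hypothetical and untested; poseBuffer would be a PoseFrameBuffer field):

// Feed the stacked window instead of the single frame.
using (TensorFloat windowTensor = poseBuffer.Push(twoDJointsTensor)) {
    threeDPoseWorker.Execute(windowTensor);
    var threeDJointsTensor = threeDPoseWorker.PeekOutput() as TensorFloat;
    threeDJointsTensor.CompleteOperationsAndDownload();
    // The frame axis now has N entries, so read the frame of interest,
    // e.g. the most recent one: threeDJointsTensor[0, N - 1, idx, 0].
}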

Can I try using the YOLOv8-pose model from here - Xenova/yolov8x-pose-p6 · Hugging Face?

You can use them, but to get faster inference I exported from Ultralytics' pose estimation model with a smaller input image size.

I found a solution to my problem using IBackend. I'm starting to get the hang of when and what to dispose for memory management :smile:

Just FYI, if you want to debug inference differences between Sentis and PyTorch, you can serialize tensors as JSON and check midway whether the intermediate results match.
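For example, something like this minimal helper (TensorDump and SaveAsJson are just illustrative names):

using System.IO;
using Unity.Sentis;
using UnityEngine;

[System.Serializable]
public class TensorDump {
    public int[] shape;
    public float[] data;
}

public static class TensorDebug {

    // Dump a CPU-readable tensor (e.g. after CompleteOperationsAndDownload)
    // to JSON so it can be diffed against the matching PyTorch tensor.
    public static void SaveAsJson(TensorFloat tensor, string path) {

        var dump = new TensorDump { data = tensor.ToReadOnlyArray() };

        dump.shape = new int[tensor.shape.rank];
        for (int i = 0; i < dump.shape.Length; i++)
            dump.shape[i] = tensor.shape[i];

        File.WriteAllText(path, JsonUtility.ToJson(dump));
    }
}

On the PyTorch side, tensor.flatten().tolist() gives you the same flat layout to compare against.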
