Extracting Keypoints from YOLOv8-pose Model

Hi,
I want to extract keypoints from the YOLOv8-pose model: Pose - Ultralytics YOLO Docs
I can run the model just fine, but I do not know how to extract keypoints from the model output.
I can do this in Python, so this is more of a Sentis-specific issue. The ONNX model just returns one big tensor that is much harder to interpret; that is, I do not know which parts of the output tensor correspond to which keypoint coordinates.
Any pointers are much appreciated!

Hi, yes it’s quite hard to find good documentation on the YOLO models. A good place to start is having a look at some Python implementations, such as this one, to help you interpret the outputs.

It will be similar to the YOLOv8n implementation we have on Hugging Face, but as well as the box coordinates it will have extra information for the pose.

Why not post what you have so far, and we’ll try and talk it through together.

In fact, the most challenging bit will probably be deciding how best to draw the lines for the pose model.
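
Something along these lines might do for the skeleton, assuming the usual COCO ordering of the 17 keypoints; the index pairs below are my guess at a sensible set of limb connections, so double-check them against your model's keypoint order:

// Sketch: keypoint index pairs to join with lines, assuming the standard COCO
// ordering (0 nose, 1-2 eyes, 3-4 ears, 5-6 shoulders, 7-8 elbows, 9-10 wrists,
// 11-12 hips, 13-14 knees, 15-16 ankles).
static readonly (int a, int b)[] Skeleton =
{
    (5, 7), (7, 9),      // left arm
    (6, 8), (8, 10),     // right arm
    (11, 13), (13, 15),  // left leg
    (12, 14), (14, 16),  // right leg
    (5, 6), (11, 12),    // shoulders and hips
    (5, 11), (6, 12),    // torso sides
};

void DrawSkeleton(Vector2[] keypoints)
{
    // Debug.DrawLine is just the quickest way to see something in the Scene view;
    // a LineRenderer or UI lines would be the nicer option for a build.
    foreach (var (a, b) in Skeleton)
        Debug.DrawLine(keypoints[a], keypoints[b], Color.green);
}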


Hi, thanks for the fast response.

I do not use the YOLOv8 model from Hugging Face, but converted the YOLOv8-pose model to ONNX and run that: Pose - Ultralytics YOLOv8 Docs

So, this is my code:

[SerializeField]
RawImage display;
[SerializeField]
Texture2D image;
[SerializeField]
ModelAsset modelAsset;
IWorker m_Engine;
TensorFloat m_Input;

void Start()
{
    // Load the ONNX model and create a GPU compute worker for it.
    var model = ModelLoader.Load(modelAsset);
    m_Engine = WorkerFactory.CreateWorker(BackendType.GPUCompute, model);

    // Convert the source texture to a 640x640x3 input tensor and display it.
    m_Input = TextureConverter.ToTensor(image, 640, 640, 3);
    display.texture = TextureConverter.ToTexture(m_Input);
    m_Input.CompleteOperationsAndDownload();
}


void Update()
{
    // Run inference, then read the output back to the CPU and log its shape.
    m_Engine.Execute(m_Input);

    TensorFloat outputTensor = m_Engine.PeekOutput() as TensorFloat;
    outputTensor.CompleteOperationsAndDownload();
    Debug.Log(outputTensor.shape);
}

void OnDisable()
{
    // Clean up Sentis resources.
    m_Engine.Dispose();
    m_Input.Dispose();
}

It is pretty simple for now and mostly inspired by your examples for the Sentis package. For an image with just one person, I get (1, 56, 8400) as the output shape.

Briefly, the output tensor format is (batch, coordinate_number, box_number).

The model predicts 56 values for each of 8400 candidate boxes.

The first 5 of those 56 describe a box: (x, y, width, height, confidence score).
(The 8400 candidate positions come from a fixed grid, so they cover the same image locations whatever the input.)

The next 51 give the coordinates of 17 body parts as (x, y, visibility) triplets.

So you would probably want to take the box coordinates and scores and feed them through an NMS to pick out the best boxes (as in our example), which gives a list of IDs you can then use to pick out the corresponding poses. (See also the BlazeFace example, which does something similar.)

I recommend printing out some of the output numbers to the console to get the hang of them, and trying to draw some on the screen.
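
Something like this (an untested sketch; a plain score threshold stands in for the NMS, and the 0.5 cutoff is arbitrary) should print the interesting numbers once the tensor has been downloaded:

// Sketch: decode a (1, 56, 8400) YOLOv8-pose output tensor that has already
// been downloaded with CompleteOperationsAndDownload.
void DecodePose(TensorFloat output, float scoreThreshold = 0.5f)
{
    int numCandidates = output.shape[2];  // 8400
    for (int i = 0; i < numCandidates; i++)
    {
        float score = output[0, 4, i];
        if (score < scoreThreshold)
            continue;  // skip low-confidence candidates

        // Box in 640x640 input pixels: centre x, centre y, width, height.
        float cx = output[0, 0, i];
        float cy = output[0, 1, i];
        float w = output[0, 2, i];
        float h = output[0, 3, i];
        Debug.Log($"candidate {i}: score {score:F2}, box ({cx:F0}, {cy:F0}, {w:F0}, {h:F0})");

        // 17 keypoints, each (x, y, visibility), also in input pixels.
        for (int k = 0; k < 17; k++)
        {
            float x = output[0, 5 + 3 * k, i];
            float y = output[0, 6 + 3 * k, i];
            float v = output[0, 7 + 3 * k, i];
            Debug.Log($"  keypoint {k}: ({x:F0}, {y:F0}), visibility {v:F2}");
        }
    }
}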


Wow, what you just explained was a HUGE help.

I will see what I can do with the information and maybe come back when I have more questions.

Another question: how can I compute the NMS to get the best box without the model returning something like a confidence score?

I have tried drawing all keypoints using the following methods, but got rather strange results.

void getKeypoints(TensorFloat tensor)
{
    for (int i = 0; i < 8400; i++)
    {
        for (int j = 0; j < 17; j++)
        {
            // Keypoint j of candidate i: x at channel 5 + 3*j, y at 6 + 3*j.
            CreateCircle(new Vector2(tensor[0, j * 3 + 5, i], tensor[0, j * 3 + 6, i]));
        }
    }
}

private void CreateCircle(Vector2 center)
{
    GameObject gameObject = new GameObject("circle", typeof(Image));
    gameObject.transform.SetParent(graphContainer, false);
    gameObject.GetComponent<Image>().sprite = circleSprite;
    RectTransform rectTransform = gameObject.GetComponent<RectTransform>();
    rectTransform.anchoredPosition = center;
    rectTransform.sizeDelta = new Vector2(11, 11);
    rectTransform.anchorMin = new Vector2(0, 0);
    rectTransform.anchorMax = new Vector2(0, 0);
}

Do you have any ideas why this happens?
Also, notice that no points are drawn at the correct locations, e.g. the right wrist.

That looks about right. It is just using a coordinate system where (0,0) is the centre of the image.

Was there a second output with the score for each box perhaps?

Setting a particular i value, such as i = 100, might give something recognisable.
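
If it helps, here is a rough sketch of mapping a keypoint into the UI container, assuming the model gives coordinates in 640x640 input pixels (worth verifying by printing a few values, since the origin convention may differ) and that your anchors sit at the bottom-left as in your snippet:

// Sketch: map a keypoint from 640x640 input-pixel coordinates (assumed origin
// top-left, y down) into the container's local space (origin bottom-left, y up).
Vector2 ModelToUI(Vector2 keypoint, RectTransform container)
{
    Rect rect = container.rect;
    float x = keypoint.x / 640f * rect.width;
    float y = (1f - keypoint.y / 640f) * rect.height;  // flip Y for UI space
    return new Vector2(x, y);
}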

Yeah, I messed up the coordinate system in the screenshot.
Nevertheless, I have not found a way to properly extract information from the CNN. I also tried setting i to specific values.

Was there a second output with the score for each box perhaps?

No, the model has only one output.

Since ML is a hard constraint for this project, I decided to move on from Unity to a Python game engine.
Thanks for the help!

Hey, I know I’m a bit late to the party, but I’m writing this so that somebody else can benefit.

The YOLOv8 pose estimation output is an object with several fields; in particular, its keypoints have two arrays:

  • xy: keypoint coordinates in pixels (not normalized)
  • xyn: keypoint coordinates normalized to 0-1

This code (in Python) produces the points. I’m not sure how to convert it inside Unity (if you do, please let me know :smiley: ):

from ultralytics import YOLO

model = YOLO("yolov8m-pose.pt")  # load a pretrained pose model

results = model('samples/running.jpg')

# Process results
for r in results:
    height, width = r.orig_shape                # original image size in pixels
    keypoints = r.keypoints.xyn.cpu().numpy()   # normalized (x, y) keypoints, shape (people, 17, 2)
    for kp in keypoints[0]:                     # keypoints of the first detected person
        x, y = int(kp[0] * width), int(kp[1] * height)  # denormalize, if needed