World position from depth image

Hello!

This is my first post, and I am very new to Unity and computer graphics in general,
so please do not be too harsh. :smile:

I am trying to reconstruct the world position from a depth image.

[Image: upload_2022-9-10_12-49-25.png]
[Image: upload_2022-9-10_12-50-6.png]

My shader:

// Upgrade NOTE: replaced 'mul(UNITY_MATRIX_MVP,*)' with 'UnityObjectToClipPos(*)'

Shader "Hidden/Depth"{

    Properties
     {
         _MainTex ("Base (RGB)", 2D) = "white" {}
         _DepthLevel ("Depth Level", Range(1, 3)) = 1
     }
     SubShader
     {
         Pass
         {
             CGPROGRAM
             #pragma vertex vert
             #pragma fragment frag
             #include "UnityCG.cginc"

             uniform sampler2D _MainTex;
             uniform sampler2D _CameraDepthTexture;
             uniform fixed _DepthLevel;
             uniform half4 _MainTex_TexelSize;
             struct input
             {
                 float4 pos : POSITION;
                 half2 uv : TEXCOORD0;
             };
             struct output
             {
                 float4 pos : SV_POSITION;
                 half2 uv : TEXCOORD0;
             };
             output vert(input i)
             {
                 output o;
                 o.pos = UnityObjectToClipPos(i.pos);
                 o.uv = MultiplyUV(UNITY_MATRIX_TEXTURE0, i.uv);
                 // Flip the image
                 #if UNITY_UV_STARTS_AT_TOP
                 if (_MainTex_TexelSize.y < 0)
                         o.uv.y = 1 - o.uv.y;
                 #endif
                 return o;
             }

             fixed4 frag(output o) : COLOR
             {
                 float depth = UNITY_SAMPLE_DEPTH(tex2D(_CameraDepthTexture, o.uv));
                 depth = pow(Linear01Depth(depth), _DepthLevel);
                 return depth;
             }
             ENDCG
         }
     }
}

I am not trying to shoot rays through the pixels, because I want to learn more
about camera parameters and matrices (or at least, I do not think I want to).

What I actually want to do is use the camera parameters and the depth I get from
the image. One idea for easily getting the parameters I need would be to use
the view matrix and the inverse projection matrix. I do not even know if I need the view matrix.

Can I calculate an imaginary line from these parameters (start point, end point from the depth
value, and angle from one of the matrices) to back-calculate the world position from
screen/image space (my depth image)?
That would probably be my ultimate goal.

Is it bad that my camera has a world position with x = 5 and not 0?
I just realized that while writing this.

What I also tried was multiplying the vector made of the pixel coordinate
in image/screen space and the z from my depth values with the inverse
projection matrix.

I tried to reverse the perspective divide, which, as far as I understand,
takes place after the projection from world space to image/screen space. Then I multiplied that
vector with the inverse projection matrix of the camera and added my depth value.

I know that we use the w component of the Vec4 to keep the original z value
of the vertex for further operations with textures and stuff.

The "zValue" I am using here, tho, is the ??linear depth?? from my image
(0 to 1, from the RGB value(s) of the pixel(s)).
So I multiplied that with my far clip plane, as shown below, to get back
the actual distance in world space (w).
Do I even need to do this, or can I also just use the
projection matrix somehow for this?

However, the x and y coordinates are way off after that. Only the z value is legit
(since I am reading it from my depth image).

Vector3 ScreenToWorld(Vector3 screenPos, float zValue)
    {
        Matrix4x4 matInv = _camera.projectionMatrix.inverse;
        screenPos.x *= zValue * _camera.farClipPlane; // Trying to reverse the perspective divide here
        screenPos.y *= zValue * _camera.farClipPlane;
        Vector4 WorldPos = matInv * screenPos;
        WorldPos.z = zValue * _camera.farClipPlane;

        return WorldPos;
    }

This is all done with a camera in Unity and a depth image generated from the scene with a shader,
because I cannot get my hands on, say, a Kinect, nor do I want to yet.
But that would be the project after this one: calculating from actual camera parameters.

Could I also use the "real" camera parameters from the camera in Unity? From what I know,
it would be handier to use matrices for now.

Just to make it clear, I do not want to use anything in the scene itself.
Only code to do the image processing to get the world position.
I do not want to use camera.ScreenToWorldPoint to shoot a ray through
the pixel and get the point at the depth value to get the world position.
I already did this and it works.
Maybe I want to write something similar to it tho with the help of the
camera parameters/matrices.

Help with both solutions would be very appreciated!

It would also be nice if someone could help me understand the theory behind
all of this a little better (maybe even for both solutions).

Thank you so much! <3

Do you want to do this in a shader or in C# code?

This is how it works in a shader (writing it from memory, could be buggy):

Texture2D _CameraDepthTexture; // Texture2D instead of sampler2D so texels can be loaded by integer pixel coordinate
float4x4 _ScreenToWorldSpaceMatrix;

float4 frag(float4 pos: SV_Position) : SV_Target
{
    float depth = _CameraDepthTexture.Load(int3(pos.x, pos.y, 0)).r;

    float4 screenPosition = float4(pos.x, pos.y, depth, 1.0);
    float4 worldPosition = mul(_ScreenToWorldSpaceMatrix, screenPosition);
    worldPosition /= worldPosition.w;

    return worldPosition;
}

where

Matrix4x4 ScreenToWorldSpaceMatrix =
    camera.cameraToWorldMatrix *
    GL.GetGPUProjectionMatrix(camera.projectionMatrix, camera.targetTexture != null).inverse *
    Matrix4x4.Translate(new Vector3(-1.0f, -1.0f, 0.0f)) *
    Matrix4x4.Scale(new Vector3(2.0f / camera.pixelWidth, -2.0f / camera.pixelHeight, 1.0f));  // not sure about the minus
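
On the uncertain minus sign: the scale and translate steps together only have to map pixel coordinates onto the [-1, 1] range of normalized device coordinates. A minimal sketch of that mapping in plain C#, using the hypothetical helper names `BottomUp` and `TopDown` (neither is a Unity API). Which one applies depends on whether the pixel origin sits at the bottom or at the top of the image, which in turn depends on the platform and on whether the camera renders into a texture:

```csharp
using System;

static class PixelToNdc
{
    // Pixel origin at the bottom/left: scale by 2/size, then translate by -1.
    public static float BottomUp(float pixel, float size)
    {
        return 2.0f * pixel / size - 1.0f;
    }

    // Pixel origin at the top: scale by -2/size, then translate by +1.
    public static float TopDown(float pixel, float size)
    {
        return -2.0f * pixel / size + 1.0f;
    }
}
```

Note that a negative Y scale combined with a -1 translation would produce values in [-3, -1], so if the minus is kept, the Y translation has to become +1.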

@c0d3_m0nk3y

Thank you so much for your reply!

I think, I actually want it as a C# script.
But seeing it in a shader will probably also help me to understand
how this stuff works.

Would you be so kind as to show me what this would look like in a script?
I think I need the values in a script because I want to have a virtual robot
with inverse kinematics point at the coordinates later on.
I also do not want to have to write the coordinates into the fragments as RGB.

I want to try to get all the coordinates of say a plane object representing a person
and then find the average of them to point at the middle of it, also when it is moving.

The last part I think I can do on my own. I just need the world positions.

I also already have the depth values in the script from the depth image, if
that makes it easier, and I think I want to use these values for z so there
would be no need to get them again. But if possible, that would also be interesting
to see. I just want to take them from my image, 'cause that is pretty much what
I am trying to achieve with it in the future.

Thank you! :)

Well, theoretically it's the same in C#:

Vector3 ScreenToWorld(Vector2 screenPos, float zValue)
{
    // 1.0f rather than 1.0: the Vector4 constructor takes floats
    Vector4 worldPosition = ScreenToWorldSpaceMatrix * new Vector4(screenPos.x, screenPos.y, zValue, 1.0f);
    return worldPosition / worldPosition.w;
}

using the ScreenToWorldSpaceMatrix from above.

However, you'd have to read the depth buffer pixels to get them on the CPU which is slow. The CPU can be up to 3 frames ahead of the GPU so reading the pixels back would stall the CPU and you'd lose all parallelism between CPU and GPU. Also, you'd have the depth value of the last frame but you probably need the value for the current frame - which doesn't exist yet because the frame hasn't been rendered yet.

On the CPU, using raycasts is actually the way to go.
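
For reference, a rough sketch of the raycast route, assuming the objects of interest have colliders. The class name `PixelRaycaster` and the method `TryGetWorldPoint` are made up for illustration; `Camera.ScreenPointToRay` and `Physics.Raycast` are the Unity calls doing the work:

```csharp
using UnityEngine;

public class PixelRaycaster : MonoBehaviour
{
    [SerializeField] private Camera _camera;

    // Returns true and the hit point if a collider lies behind the given pixel.
    public bool TryGetWorldPoint(Vector2 pixel, out Vector3 worldPoint)
    {
        // Ray from the camera through the pixel (screen origin at the bottom-left).
        Ray ray = _camera.ScreenPointToRay(new Vector3(pixel.x, pixel.y, 0.0f));

        if (Physics.Raycast(ray, out RaycastHit hit, _camera.farClipPlane))
        {
            worldPoint = hit.point;
            return true;
        }

        worldPoint = Vector3.zero;
        return false;
    }
}
```

This needs no depth texture at all, but it only sees geometry that takes part in physics.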

@c0d3_m0nk3y

Thank you!

Would there be a way for me to get the world position from the shader into a script tho?

Sorry, don't understand what you mean. Can you elaborate?

Do you mean calculating the world position in a shader, storing it in a render target and then reading the render target on the CPU?

It's the same problem, you'd either stall the CPU (to get 1 frame old data) or get 3 frames old data with async readback.
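
A sketch of the async variant, in case stale data is acceptable. It assumes a single-channel float render texture (the hypothetical `_depthTexture` field) that some earlier pass has already filled with depth values; the class and method names are illustrative only:

```csharp
using Unity.Collections;
using UnityEngine;
using UnityEngine.Rendering;

public class DepthReadback : MonoBehaviour
{
    // Assumed: an RFloat render texture that already holds the depth values.
    [SerializeField] private RenderTexture _depthTexture;

    public void RequestDepth()
    {
        // Queues the copy; the callback runs a few frames later without stalling the CPU.
        AsyncGPUReadback.Request(_depthTexture, 0, OnReadback);
    }

    private void OnReadback(AsyncGPUReadbackRequest request)
    {
        if (request.hasError)
        {
            Debug.LogWarning("Depth readback failed.");
            return;
        }

        // Row-major pixel data, one float per pixel for an RFloat texture.
        NativeArray<float> depths = request.GetData<float>();
        int centerIndex = (request.height / 2) * request.width + request.width / 2;
        Debug.Log("Depth at image center (possibly several frames old): " + depths[centerIndex]);
    }
}
```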

float4 frag(float4 pos: SV_Position) : SV_Target
{
    float depth = _CameraDepthTexture.Load(int3(pos.x, pos.y, 0)).r;

    float4 screenPosition = float4(pos.x, pos.y, depth, 1.0);
    float4 worldPosition = mul(_ScreenToWorldSpaceMatrix, screenPosition);
    worldPosition /= worldPosition.w;

    return worldPosition;
}

I can't get this return value from the shader into a script, right?

I also tried the script version: once written by me based on what you wrote in the shader,
and once with the one you just posted.

But it gives me very weird numbers.

I could probably use camera.ScreenToWorldPoint(). I tried it and it seems to work.
But then I would understand pretty much nothing of what is actually going on.
Too bad Unity is not open source.

Basically I would need a self-written ScreenToWorldPoint() and add my depth afterwards.
Or go with the shader, if possible. But if the result of the matrix multiplication is already weird,
then I don't know where to start.

Now ScreenToWorldPoint also gives me some weird numbers. I really don't know where to start anymore. :smile:

The thing is, I cannot shoot rays into the scene, 'cause I technically don't have a scene.
I could not check for collisions.
I could, however, probably calculate a ray and get its point at the depth value.
The thing is, tho, I cannot just draw a ray through the pixel, because it would calculate
the angle by itself.

I need that angle, tho. Any idea how I would get that from the matrices and/or the
camera's intrinsic and extrinsic parameters?

I will have to do it with these parameters and matrices of a physical camera later on
anyway.

Totally forgot about Camera.ScreenToWorldPoint.

The difference between Camera.ScreenToWorldPoint and my ScreenToWorld is that Camera.ScreenToWorldPoint takes a world-space depth (the distance from the camera in world units), while my version takes NDC-space depth.
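
To make the world-space-depth variant concrete, here is a rough sketch of what a hand-written equivalent could look like. This is not Unity's actual implementation, just the usual unprojection math; the class name `DepthUnprojection` is made up, and the sketch assumes a perspective camera whose viewport starts at pixel (0, 0):

```csharp
using UnityEngine;

public static class DepthUnprojection
{
    // pixel: screen coordinates with (0, 0) at the bottom-left;
    //        pass x + 0.5f, y + 0.5f to address pixel centers.
    // eyeDepth: distance from the camera plane in world units.
    public static Vector3 ScreenToWorld(Camera camera, Vector2 pixel, float eyeDepth)
    {
        // 1. Pixel -> normalized device coordinates in [-1, 1].
        float ndcX = 2.0f * pixel.x / camera.pixelWidth - 1.0f;
        float ndcY = 2.0f * pixel.y / camera.pixelHeight - 1.0f;

        // 2. Undo the perspective divide and the projection scale.
        //    m00 and m11 encode field of view and aspect; m02 and m12 are
        //    zero unless the frustum is off-center.
        Matrix4x4 p = camera.projectionMatrix;
        float viewX = eyeDepth * (ndcX + p.m02) / p.m00;
        float viewY = eyeDepth * (ndcY + p.m12) / p.m11;

        // 3. View space -> world space. Unity's camera space follows the
        //    OpenGL convention, so the camera looks down the negative Z axis.
        Vector3 viewPos = new Vector3(viewX, viewY, -eyeDepth);
        return camera.cameraToWorldMatrix.MultiplyPoint3x4(viewPos);
    }
}
```

Step 2 is where the "angle" of the ray through a pixel comes from: the projection matrix stores the field of view, and scaling the NDC coordinate by the depth moves the point along that ray. Step 3 is the only place the camera position (for example x = 5) enters.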

As I said, the problem is not the calculation itself, it's where to get the depth value from.

Okay, let me show you, what I am doing.

Vector3 ScreenToWorld(Vector2 screenPos, float zValue)
    {
        Matrix4x4 ScreenToWorldSpaceMatrix =
            _camera.cameraToWorldMatrix *
            GL.GetGPUProjectionMatrix(_camera.projectionMatrix, _camera.targetTexture != null).inverse *
            Matrix4x4.Translate(new Vector3(-1.0f, -1.0f, 0.0f)) *
            Matrix4x4.Scale(new Vector3(2.0f / _camera.pixelWidth, -2.0f / _camera.pixelHeight, 1.0f));  // not sure about the minus

        Vector4 worldPosition = ScreenToWorldSpaceMatrix * new Vector4(screenPos.x, screenPos.y, zValue, 1.0f);
        Debug.Log("World position of pixel: " + "x: " + worldPosition.x + "y: " +  worldPosition.y + "z: " + worldPosition.z);

        return worldPosition / worldPosition.w;
    }

And then:

ScreenToWorld(new Vector2(x, y), _screenShot.GetPixel(x,y).r);

_screenShot.GetPixel(x,y).r is my NDC depth.

The result is:
[Image: upload_2022-9-11_16-48-5.png]

[Image: upload_2022-9-11_16-48-17.png]

The weirdest thing is, if I move the cube, x still stays around 5.

Camera has 84x84 resolution at the moment.

[Image: upload_2022-9-11_16-52-17.png]

How do you take the screenshot? Are you sure .r is NDC depth? Can you log it out?

Debug.Log("NDC depth: " + _screenShot.GetPixel(x,y).r);
Debug.Log("World depth: " + _screenShot.GetPixel(x,y).r * _camera.farClipPlane);

[Image: upload_2022-9-11_17-42-12.png]
[Image: upload_2022-9-11_17-42-21.png]

Since 14.5 is pretty much off center by 0.5, this should be correct.
Or do I need the non-linearized depth?

I pretty much just get the depth in the shader and assign it to RGB (shader code from the original post).
Then I put a screenshot into a Texture2D and get the pixel.

Thank you for helping me out again!
I am going to get dinner now but will be right back.

Unity uses reversed depth by default on most modern platforms (Direct3D 11/12, Metal, Vulkan). Also, NDC depth is non-linear, so you can't just multiply it by the far plane distance, as you already suspected.

Call LinearEyeDepth(ndcDepth) in the shader and store the result in a floating-point texture (you can also do the calculation on the CPU, but this is easier).

That should give you view-space depth that you can pass to Camera.ScreenToWorldPoint().
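
For the CPU route, a sketch of the same conversion in plain C#. As far as I know it mirrors the documented `_ZBufferParams` definition for a perspective camera; the class name `DepthConversion` is made up, and the `reversedZ` flag is something the caller would have to supply (in Unity, `SystemInfo.usesReversedZBuffer` reports it):

```csharp
using System;

static class DepthConversion
{
    // Raw depth-buffer value -> distance from the camera in world units.
    // reversedZ: true when the buffer stores 1 at the near plane and 0 at the far plane.
    public static float LinearEyeDepth(float rawDepth, float near, float far, bool reversedZ)
    {
        float x = reversedZ ? far / near - 1.0f : 1.0f - far / near;
        float y = reversedZ ? 1.0f : far / near;

        // Same form as the shader helper: 1 / (z * d + w) with z = x / far and w = y / far.
        return 1.0f / (x / far * rawDepth + y / far);
    }

    // Raw depth-buffer value -> linear depth, near / far at the near plane and 1 at the far plane.
    public static float Linear01Depth(float rawDepth, float near, float far, bool reversedZ)
    {
        return LinearEyeDepth(rawDepth, near, far, reversedZ) / far;
    }
}
```

The second helper also shows that a value produced by Linear01Depth only needs to be multiplied by the far plane distance to become view-space depth; it is the raw buffer value that must not be scaled that way.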

This will only be correct if neither the camera nor objects are moving because of the lag.

Also, just to be sure: where do you get the NDC depth from in the shader?

fixed4 frag(output o) : COLOR
{
    float depth = UNITY_SAMPLE_DEPTH(tex2D(_CameraDepthTexture, o.uv));
    depth = pow(Linear01Depth(depth), _DepthLevel);
    return depth;
}

From here. :)

Ok, so you've already linearized the depth. Make sure that _DepthLevel is 1, otherwise this won't work.
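
If _DepthLevel ever has to differ from 1, the pow could in principle be undone on the CPU before scaling by the far plane. A small sketch in plain C#, with the made-up helper `ToEyeDepth`; it assumes the stored value is exactly pow(Linear01Depth(d), depthLevel) and ignores quantization in the texture:

```csharp
using System;

static class StoredDepth
{
    // stored: value written by the shader, pow(Linear01Depth(d), depthLevel).
    // Returns the distance from the camera in world units.
    public static float ToEyeDepth(float stored, float depthLevel, float far)
    {
        // Invert the pow, then scale the 0..1 linear depth by the far plane distance.
        double linear01 = Math.Pow(stored, 1.0 / depthLevel);
        return (float)(linear01 * far);
    }
}
```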

What's the texture format of the RenderTexture that you are using?

In which render pass do you render this?

I actually do this in a script.

In Start():

    void Start()
    {
        _camera = GetComponent<Camera>();
        _rect = new Rect(0, 0, _camera.pixelWidth, _camera.pixelHeight);
        _renderTexture = new RenderTexture(_camera.pixelWidth, _camera.pixelHeight, 24);
        _screenShot = new Texture2D(_camera.pixelWidth, _camera.pixelHeight, TextureFormat.RGBA32, false);

        _camera.targetTexture = _renderTexture;
    }

Then:

    private void GetValues()
    {
        _camera.Render();
        RenderTexture.active = _renderTexture;
        _screenShot.ReadPixels(_rect, 0, 0);
        _camera.targetTexture = null;
        RenderTexture.active = null;

        bool pixelsFound = false;
        _pixelVals = new List<Vector3>();

        // Use the screenshot's own size: valid indices run from 0 to size - 1, and
        // _camera.pixelWidth/pixelHeight change once targetTexture is set to null.
        for (int y = _screenShot.height - 1; y >= 0; y--)
        {
            for (int x = _screenShot.width - 1; x >= 0; x--)
            {
                if (_screenShot.GetPixel(x, y) != Color.white)
                {
                    Debug.Log("pixel found: " + x + "   " + y);
                    Vector3 worldPoint = ScreenToWorld(new Vector3(x, y, _screenShot.GetPixel(x, y).r));
...

        Destroy(_renderTexture);
    }
Vector3 worldPoint = _camera.ScreenToWorldPoint(new Vector3(x, y, _screenShot.GetPixel(x,y).r * _camera.farClipPlane));

This seems to work.
[Image: upload_2022-9-11_19-57-52.png]
This would be a fine outcome.

Well... or it sometimes works and sometimes:
[Image: upload_2022-9-11_19-58-23.png]
:smile:

The problem still is, tho, I have no idea what it is doing. :smile:

Would be nice if someone could write me the ScreenToWorldPoint function
so I could see what it is doing. :smile:

Also what is happening here? :smile: :
[Image: upload_2022-9-11_20-3-45.png]
[Image: upload_2022-9-11_20-3-53.png]

Ohhhh....:
[Image: upload_2022-9-11_20-4-35.png]

I strongly recommend using a floating-point texture format for both the render texture and the screenshot. 8 bits per channel is not enough:
RenderTextureFormat.RFloat
TextureFormat.RFloat
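
A sketch of how the setup could look with these formats. It assumes the depth shader already writes into the camera's target texture (for example via an image effect, which is not shown in this thread); the class name `FloatDepthCapture` and the method `ReadDepth` are made up for illustration:

```csharp
using UnityEngine;

public class FloatDepthCapture : MonoBehaviour
{
    private Camera _camera;
    private RenderTexture _renderTexture;
    private Texture2D _screenShot;

    private void Start()
    {
        _camera = GetComponent<Camera>();
        int width = _camera.pixelWidth;
        int height = _camera.pixelHeight;

        // Single-channel 32-bit float color; 24 is the depth-buffer bit count used while rendering.
        _renderTexture = new RenderTexture(width, height, 24, RenderTextureFormat.RFloat);
        _screenShot = new Texture2D(width, height, TextureFormat.RFloat, false);
        _camera.targetTexture = _renderTexture;
    }

    // Renders the camera and returns the stored depth at one pixel.
    public float ReadDepth(int x, int y)
    {
        _camera.Render();
        RenderTexture.active = _renderTexture;
        _screenShot.ReadPixels(new Rect(0, 0, _renderTexture.width, _renderTexture.height), 0, 0);
        RenderTexture.active = null;
        return _screenShot.GetPixel(x, y).r;
    }

    private void OnDestroy()
    {
        _renderTexture.Release();
    }
}
```

For comparison, an 8-bit channel can only hold 256 distinct values, so a linear 0..1 depth scaled by a far plane distance F moves in steps of about F / 255 world units.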