UnityObjectToClipPos transforms the local vertex position into clip space. In clip space the "perspective divide" hasn't happened yet, so the "w" component holds the normalizing value you still have to divide by. So to answer the question directly: the x and y coordinates are given in the range "-w" to "w".
So you usually divide your position by "w" to get the NDC coordinates (Normalized Device Coordinates), which are in the range -1 to 1. After that you enter viewport space, which is 0 to 1, simply by adding 1 and dividing by 2 (or dividing by 2 and adding 0.5). Viewport space is then turned into screen space by multiplying by the width / height of the screen.
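To make that concrete, here is a minimal Cg/HLSL sketch of the math just described. It assumes a typical Unity vertex input "v" with a vertex position; _ScreenParams is Unity's built-in variable whose xy components hold the render target width and height:

```hlsl
// Clip space: what UnityObjectToClipPos returns (x and y in -w..w).
float4 clipPos = UnityObjectToClipPos(v.vertex);

// Perspective divide: clip space -> NDC, x and y now in -1..1.
float2 ndc = clipPos.xy / clipPos.w;

// NDC -> viewport space, 0..1.
float2 viewport = ndc * 0.5 + 0.5;

// Viewport -> screen space in pixels.
float2 screenPos = viewport * _ScreenParams.xy;
```

Putting it all together, the full chain of spaces is: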
1. Local space
2. World space (model matrix / the object's Transform)
3. Camera space (view matrix / inverse of the camera's Transform)
4. Clip space (projection matrix / also given by the Camera)
5. NDC (normalizing by w)
6. Viewport space (p * 0.5 + 0.5)
7. Screen space (p * (width, height))
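Steps 2 to 4 could also be written out one transform at a time in a vertex shader. Just as a sketch, using Unity's built-in matrix variables:

```hlsl
float4 worldPos  = mul(unity_ObjectToWorld, v.vertex); // step 2: model matrix
float4 cameraPos = mul(UNITY_MATRIX_V, worldPos);      // step 3: view matrix
float4 clipPos   = mul(UNITY_MATRIX_P, cameraPos);     // step 4: projection matrix
// Steps 5 to 7 (NDC, viewport, screen) happen after the vertex shader.
```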
Steps 2 to 4 are usually done at once through the combined MVP matrix (model, view, projection). This is what the vertex shader outputs and what UnityObjectToClipPos calculates. Everything that follows is done by the hardware internally: the clipping, the perspective divide and the transform into screen space all happen on the GPU.
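For illustration, in UnityCG.cginc UnityObjectToClipPos uses the pre-multiplied view-projection matrix UNITY_MATRIX_VP, so the combined version is essentially:

```hlsl
// The usual one-liner:
float4 clipPos = UnityObjectToClipPos(v.vertex);

// Which is effectively the same as:
float4 clipPos2 = mul(UNITY_MATRIX_VP, mul(unity_ObjectToWorld, float4(v.vertex.xyz, 1.0)));
```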
Some sources claim that NDC and clip space are the same thing, but they aren't. NDC is what you get after the normalization / perspective divide / homogeneous divide.