Sorry, but what you are trying to do is impossible. However, it is difficult to explain why.
Let me try anyway:
- as you can already see in your image, your y-axis is not truly orthographic. If it were, the back vertical edges of your cube would be the same height as the front vertical edges. They are not: they are perspectively foreshortened (=fewer pixels).
- orthographic projections don’t have vanishing points. Parallel lines are always parallel, no matter which direction they’re facing
- perspective projections have exactly one vanishing point for every set (=direction) of parallel lines (lines parallel to the image plane stay parallel, i.e. their vanishing point lies at infinity)
- what you would need is one vanishing point for every “row” of objects (=horizontally grouped cubes in your example), which would be something like a vertical “vanishing line”. That is not possible, since any specific x/y coordinate in the front would map to an infinite number of y coordinates in the back (see image)
- or somewhat more mathematical: perspective projection does not distort “left” or “right”, nor “top” or “bottom”; it distorts depth (=front and back). For this, it uses homogeneous coordinates, and the resulting divide affects x and y alike. A projection either scales by (the inverse of) depth, or it doesn’t. You simply can’t have both.
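To make the last point a bit more concrete, here is a minimal Python sketch of the homogeneous divide. The eye distance `D`, the two matrices, the helper `apply` and the sample points are all made up for illustration (they are not Unity's matrices). It only shows that a single 4x4 matrix yields one `w` per point, and that x and y are both divided by it.

```python
def apply(matrix, point):
    """Multiply a 4x4 matrix with (x, y, z, 1), then do the homogeneous divide."""
    v = (point[0], point[1], point[2], 1.0)
    x, y, z, w = (sum(row[c] * v[c] for c in range(4)) for row in matrix)
    return (x / w, y / w)

D = 10.0  # assumed distance of the eye in front of the z=0 plane (hypothetical)

# Orthographic: w is always 1, so depth never changes x or y.
ORTHO = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Simple perspective with +Z = depth: w = 1 + z / D grows with depth.
PERSP = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0 / D, 1.0],
]

front = (-4.0, -1.0, 0.0)
back = (-4.0, -1.0, 2.0)  # same x/y, two units deeper

ortho_back = apply(ORTHO, back)    # unchanged by depth
persp_front = apply(PERSP, front)
persp_back = apply(PERSP, back)

# Both axes shrink by the same factor 1/w; there is no way to divide only x.
scale_x = persp_back[0] / persp_front[0]
scale_y = persp_back[1] / persp_front[1]
print(ortho_back, persp_back, scale_x, scale_y)
```

Whatever values you put into the matrix, the divide by `w` happens once per vertex and hits x and y together, which is why “perspective in x, orthographic in y” does not fit into one matrix.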
Look at the image below: Assuming “standard” coordinate axes +X=right, +Y=up, +Z=depth, the top left corner of the bottom cube has the same coordinates as the bottom left corner of the cube above it, in this case (-6,-1,0). If the cubes are 2x2x2, the bottom left corner of the bottom cube is therefore (-6,-3,0), bottom right (-4,-3,0), and top right (-4,-1,0). Now go into the depth to the back face of that cube and look at its top right corner: it is located at Z+2 and becomes (-4,-1,2). But the bottom right corner of the center cube (on its back face) also has the coordinates (-4,-1,2). Yet they are at different pixel positions! Furthermore, the area between these two “identical” points cannot be reached by any valid Cartesian coordinates.
What kind of workarounds are available?
Well, this highly depends on what you want to do.
In theory, you would have to render each object using its own off-axis projection matrix. AFAIK, this is not possible in Unity. I tried some experiments once using OnRenderObject, Graphics.DrawMeshNow, etc., but it seems you can’t modify the projection matrix during rendering. Even if you could, it would create undefined areas that can never be filled (again, see image). Also, you couldn’t “share” the depth buffer between these sections, for the same reason.
If you “only” have a grid of vertical rows and don’t intend to move objects smoothly between them, or even interact, then you could split your screen into several individual cameras, each one rendering one particular row. You wouldn’t even need adapted projection matrices in this case.
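As a rough sketch of the split-screen idea, the snippet below computes one full-width viewport strip per row in normalized screen coordinates (0..1, origin at the bottom left). The function name `row_viewports` and the tuple layout `(x, y, width, height)` are my own choices for illustration. In Unity you would assign such a rectangle to each camera's viewport rect, but treat that mapping as an assumption to check against the docs.

```python
def row_viewports(row_count):
    """Return one (x, y, width, height) strip per row, bottom row first."""
    if row_count < 1:
        raise ValueError("row_count must be at least 1")
    height = 1.0 / row_count
    return [(0.0, i * height, 1.0, height) for i in range(row_count)]

# Four rows of cubes -> four cameras, each limited to its own strip.
viewports = row_viewports(4)
for index, rect in enumerate(viewports):
    print(index, rect)
```

Each camera would then look straight at its own row, so every row gets the same vertical perspective while x still converges normally.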
A more flexible method would be to first render each object into a render texture (using an off-axis projection matrix with an x-shear depending on where you want to place that render texture on your screen), which you can then arbitrarily shift along y without changing its perspective, hence simulating an “orthographic view” along that axis. But you would lose the 3D (=depth) information of your objects, and therefore occlusion, and so on.
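For reference, here is a small Python sketch of what an off-axis (off-center) projection matrix does. It uses the classic OpenGL `glFrustum` layout, where the camera looks down -Z, so it does not match the +Z=depth convention from above. The helper names and sample values are hypothetical. It shows that moving the frustum window only translates the projected image by a constant amount, independent of depth, so the perspective itself stays the same.

```python
def off_center_frustum(left, right, bottom, top, near, far):
    """OpenGL-style off-center perspective matrix (camera looks down -Z)."""
    return [
        [2.0 * near / (right - left), 0.0, (right + left) / (right - left), 0.0],
        [0.0, 2.0 * near / (top - bottom), (top + bottom) / (top - bottom), 0.0],
        [0.0, 0.0, -(far + near) / (far - near), -2.0 * far * near / (far - near)],
        [0.0, 0.0, -1.0, 0.0],
    ]

def to_ndc(matrix, point):
    """Project (x, y, z) and return normalized device x/y after the divide."""
    v = (point[0], point[1], point[2], 1.0)
    clip = [sum(row[c] * v[c] for c in range(4)) for row in matrix]
    return (clip[0] / clip[3], clip[1] / clip[3])

centered = off_center_frustum(-1.0, 1.0, -1.0, 1.0, 1.0, 100.0)
# Same window size, but moved up by 0.5 on the near plane.
shifted = off_center_frustum(-1.0, 1.0, -0.5, 1.5, 1.0, 100.0)

near_point = (0.5, 0.5, -2.0)
far_point = (0.5, 0.5, -8.0)

# Difference in projected y between the two matrices, for each point.
offsets = [to_ndc(shifted, p)[1] - to_ndc(centered, p)[1]
           for p in (near_point, far_point)]
print(offsets)  # the same value for both depths
```

The same reasoning applies to the x-shear mentioned above: the shear picks which part of the perspective image you keep, and sliding the resulting texture along y afterwards is a pure 2D translation.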
But there is no “clean” solution, or even a single 4x4 matrix which can accomplish that.
