Mixing orthographic and perspective rendering for 2.5D and generating 2D colliders from mesh silhouettes

So I’m trying to prototype a side-scrolling game where all the gameplay happens on a 2D plane, using Box2D and 2D colliders.

However, all the objects are 3D meshes and everything is rendered through a perspective camera. I think this is usually called 2.5D.

In this case Z=0 is the gameplay plane with all the colliders, and anything else is background or foreground, with no physics interaction.

My initial approach was to tag objects as obstacles if they intersect the Z=0 plane, and use an editor script to generate a 2D collider wrapping around that intersection. This seemed to work OK, but it leaves the player guessing where the collider actually is, unless the meshes are oriented in a very specific way. I tried to remedy this by projecting a thin line effect around the player on the gameplay plane, but ultimately wasn’t satisfied with it.
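To make that first approach concrete, here is a minimal sketch of the plane-intersection step, in plain Python rather than as a Unity editor script. The function name `slice_mesh_at_z0` and the cube data are made up for illustration. It only returns loose line segments; chaining them into closed collider paths is left out.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Point2 = Tuple[float, float]
Segment = Tuple[Point2, Point2]


def slice_mesh_at_z0(vertices: Sequence[Vec3],
                     triangles: Sequence[Tuple[int, int, int]]) -> List[Segment]:
    """Return the 2D segments where each triangle crosses the Z=0 plane."""
    segments: List[Segment] = []
    for i, j, k in triangles:
        a, b, c = vertices[i], vertices[j], vertices[k]
        hits: List[Point2] = []
        for p, q in ((a, b), (b, c), (c, a)):
            # Classify each endpoint as "z >= 0" or "z < 0". An edge crosses
            # the plane only when its endpoints fall on different sides.
            if (p[2] >= 0) != (q[2] >= 0):
                t = p[2] / (p[2] - q[2])  # denominator is never zero here
                hits.append((p[0] + t * (q[0] - p[0]),
                             p[1] + t * (q[1] - p[1])))
        # A triangle has either 0 or 2 crossing edges; skip zero-length hits
        # (a triangle that only touches the plane at one vertex).
        if len(hits) == 2 and hits[0] != hits[1]:
            segments.append((hits[0], hits[1]))
    return segments


# Hypothetical test mesh: a box spanning x,y in [0,1] and z in [-1,1].
CUBE_VERTICES: List[Vec3] = [
    (0, 0, -1), (1, 0, -1), (1, 1, -1), (0, 1, -1),
    (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1),
]
CUBE_TRIANGLES = [
    (0, 1, 2), (0, 2, 3), (4, 5, 6), (4, 6, 7),   # bottom, top
    (0, 1, 5), (0, 5, 4), (1, 2, 6), (1, 6, 5),   # sides
    (2, 3, 7), (2, 7, 6), (3, 0, 4), (3, 4, 7),
]
outline = slice_mesh_at_z0(CUBE_VERTICES, CUBE_TRIANGLES)
```

For the box, the segments trace the unit square where it cuts through the gameplay plane. It also shows where the ambiguity comes from: the slice says nothing about the parts of the mesh in front of or behind Z=0, and those are what the perspective camera actually shows.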

So I want to try another approach, which is going to look weird, but should have zero ambiguity: any mesh that intersects the gameplay plane is rendered orthographically, and its collider should match the resulting silhouette.
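As a rough sketch of the collider half of this idea: with an orthographic camera looking straight down the Z axis, projecting a vertex just means dropping its Z coordinate. The snippet below (plain Python, with a hypothetical function name `ortho_silhouette_hull`) takes the convex hull of the projected vertices. That is a simplification: it matches the silhouette exactly only for convex shapes, and a concave mesh would need a real outline trace or a split into convex pieces.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Point2 = Tuple[float, float]


def _cross(o: Point2, a: Point2, b: Point2) -> float:
    """Z component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def ortho_silhouette_hull(vertices: Sequence[Vec3]) -> List[Point2]:
    """Project vertices onto the XY plane and return their convex hull,
    counter-clockwise, using Andrew's monotone chain algorithm."""
    # Orthographic projection along Z: keep x and y, discard z.
    pts = sorted({(x, y) for x, y, _z in vertices})
    if len(pts) <= 2:
        return pts

    lower: List[Point2] = []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper: List[Point2] = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Hypothetical mesh vertices at assorted depths.
hull = ortho_silhouette_hull([
    (0.0, 0.0, -1.0), (2.0, 0.0, 3.0), (2.0, 1.0, 0.0),
    (0.0, 1.0, 5.0), (1.0, 0.5, 2.0),
])
```

In Unity, a point list like this could presumably be fed to a `PolygonCollider2D` from an editor script, after transforming the mesh vertices into world space first.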

This would effectively make the obstacles look like sprites, but to maintain a little sense of depth in them, I’d still want 3D shadows to be cast on the mesh itself. So baking them to 2D sprites isn’t an option either.

I feel like the fixed perspective flattening is basically the same process for both the rendering and the collider generation, so the two issues might have a bunch of overlap.

I also realize using one large terrain is a bit of a problem in a setup like this, and I might need to generate two terrains: one perspective terrain that covers all of the background and foreground, and one thin sliver of terrain on top of it for the fixed perspective and colliders.

The objects on the obstacle plane aren’t going to rotate around the X or Y axis, so the colliders don’t need to be (re)generated at runtime. Still, ideally I’d want something that is as automated as possible, so level design can be streamlined.

So far I have gotten somewhat close with a three-camera, three-layer setup:

  • Orthographic overlay camera with culling mask for only the midground (obstacles layer)
  • Perspective overlay camera with culling mask for only foreground
  • Perspective base camera rendering all except those two layers, with the midground and foreground overlays stacked.

But this kind of setup doesn’t get me the shadows on the obstacles, and doesn’t solve the collider generation.

I’m struggling to implement this in a way that feels robust, so I’d appreciate tips or any thoughts on how you’d go about building a similar setup.

For reference, I think the fantastic game Getting Over It with Bennett Foddy achieves something like this with Unity.

Here are two screenshots from slightly offset positions, overlaid on top of each other, which show how there’s dynamic perspective in the background and fixed perspective in the foreground.
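To put numbers on that difference, here is a tiny sketch comparing the two projections. It assumes a simple pinhole camera looking along +Z with a made-up focal length of 1, and the function names are only for illustration. Under perspective, two points at the same X but different depths drift apart on screen when the camera moves sideways. Under orthographic projection they stay locked together.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def perspective_x(point: Vec3, cam_x: float, cam_z: float,
                  focal: float = 1.0) -> float:
    """Screen X of a point for a pinhole camera at (cam_x, 0, cam_z)
    looking along +Z. Assumes the point is in front of the camera."""
    x, _y, z = point
    return focal * (x - cam_x) / (z - cam_z)


def orthographic_x(point: Vec3, cam_x: float) -> float:
    """Screen X under orthographic projection: depth has no effect."""
    return point[0] - cam_x


near: Vec3 = (0.0, 0.0, 0.0)    # on the gameplay plane
far: Vec3 = (0.0, 0.0, 10.0)    # same X, further back

for cam_x in (0.0, 1.0):
    persp_gap = perspective_x(far, cam_x, -10.0) - perspective_x(near, cam_x, -10.0)
    ortho_gap = orthographic_x(far, cam_x) - orthographic_x(near, cam_x)
    print(f"cam_x={cam_x}: perspective gap={persp_gap:+.3f}, "
          f"orthographic gap={ortho_gap:+.3f}")
```

The perspective gap goes from zero to non-zero once the camera shifts, while the orthographic gap stays at zero, which is the fixed look I’m after for the obstacle layer.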

And this demonstrates the dynamic shadow I mentioned.

So basically I’m looking to build a very similar rendering setup, but I haven’t figured out how he pulled it off.