How to handle reference frames with XR Interaction Toolkit?

Hi, I'm a bit confused about how to handle reference frames when using the XR Interaction Toolkit (bonus question about VisionOS PolySpatial at the end).

Reference frames have always been especially troublesome with hand tracking when the user has to interact with both a virtual world they can navigate freely and the virtual controls that let them perform said navigation. My current use case is a virtual two-dimensional joystick/slider that lets the user freely relocate inside the virtual scene, but a clearer example might be a racing simulator. In that driving example, all of the virtual world should “move” around the user, except for the virtual car and the affordances needed to drive it, which should stay in a fixed position relative to the “real world” frame (you could call it the user’s frame, even though that wouldn’t be entirely accurate).

In the past, I found that the easiest way to achieve this is to have two large hierarchies in the scene: one that represents the real-world reference frame (with the XR Origin as its root) and one that represents the virtual-world reference frame. Since the XR Interaction Toolkit is heavily based on physics (correct me if I’m wrong), I’m forced to offset my virtual-world frame to make the user “move” inside the world. If I moved my XR Origin around instead (which would make more sense, both conceptually and performance-wise), all of the interactions (XR Grab Interactable, etc.) would become super jittery, with the physics engine struggling to keep up with both the affordance and the tracked hand moving through Unity’s world coordinates.
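For concreteness, this is roughly the “move the world” trick I’m describing; just a minimal sketch, where `virtualWorldRoot`, `ReadJoystick()` and `ReadTurnInput()` are placeholders for my actual hierarchy and input affordances:

```csharp
using UnityEngine;

// Sketch of the "offset the virtual world" locomotion I'm using today.
// Instead of moving the XR Origin, the whole virtual-world hierarchy is
// translated/rotated in the opposite direction of the desired user motion.
public class VirtualWorldLocomotion : MonoBehaviour
{
    [SerializeField] Transform virtualWorldRoot;   // root of the virtual-world hierarchy
    [SerializeField] Transform xrOrigin;           // stays put; used as the pivot for rotation
    [SerializeField] float moveSpeed = 2f;
    [SerializeField] float turnSpeedDegrees = 60f;

    // Placeholders: in my project these come from the virtual joystick/slider affordance.
    Vector2 ReadJoystick() => Vector2.zero;
    float ReadTurnInput() => 0f;

    void Update()
    {
        Vector2 input = ReadJoystick();
        // Desired user motion, expressed on the horizontal plane in world space.
        Vector3 userMotion = new Vector3(input.x, 0f, input.y) * moveSpeed * Time.deltaTime;

        // Moving the world by the inverse makes the user appear to move forward.
        virtualWorldRoot.position -= userMotion;

        // Rotating the world around the user by the inverse angle makes the user appear to turn.
        float turn = ReadTurnInput() * turnSpeedDegrees * Time.deltaTime;
        virtualWorldRoot.RotateAround(xrOrigin.position, Vector3.up, -turn);
    }
}
```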

Now I find myself in a situation where I can no longer use the trick of moving the whole scene around, so I need to move the user’s reference frame instead. What would be the way to go here? The only ideas that come to mind are to use discrete move/rotate mechanisms, i.e., teleport instead of continuous movement, which I don’t really like for my use case, or to ditch Unity’s physics entirely and hand-roll the interactions between the hand skeleton and the affordances using transforms…
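To make the first option concrete, this is the kind of discrete snap movement I have in mind, again just a sketch: `ReadJoystick()` is a placeholder for my virtual affordance, and I’m not sure how well XRI’s physics-based interactables would actually cope with the origin jumping like this:

```csharp
using Unity.XR.CoreUtils;
using UnityEngine;

// Sketch of the discrete ("teleport-style") alternative: the XR Origin jumps
// in fixed steps whenever the joystick crosses a threshold, instead of being
// translated continuously every frame.
public class SnapStepLocomotion : MonoBehaviour
{
    [SerializeField] XROrigin xrOrigin;
    [SerializeField] float stepDistance = 1.5f;    // meters per snap
    [SerializeField] float inputThreshold = 0.8f;

    bool stepArmed = true;

    // Placeholder for the virtual joystick/slider affordance in my scene.
    Vector2 ReadJoystick() => Vector2.zero;

    void Update()
    {
        Vector2 input = ReadJoystick();

        if (stepArmed && input.magnitude > inputThreshold)
        {
            // Step along the camera's forward direction, flattened onto the ground plane.
            Vector3 forward = xrOrigin.Camera.transform.forward;
            forward.y = 0f;
            forward.Normalize();

            xrOrigin.transform.position += forward * stepDistance;
            stepArmed = false; // require the stick to return to center before the next step
        }
        else if (input.magnitude < 0.2f)
        {
            stepArmed = true;
        }
    }
}
```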

Am I missing something here? I feel like it would be so much easier to have two separate physics engines (or at least two physics reference frames) available when dealing with XR: one for the real world and one for the virtual world…

BONUS QUESTION: Any particular suggestions for achieving all of this with VisionOS and PolySpatial? It is confusing that in Mixed Reality Unbounded mode there needs to be an XR Origin with its own (unused) camera AND a Volume Camera, which seems to be what determines the user’s reference frame in the virtual scene even though it doesn’t really represent what the user is actually seeing. I’m using PolySpatial instead of fully immersive mode in order to have passthrough (and also because gaze-pinch interactions were not working otherwise, but I’m not sure if that’s on me). I can’t find good Unity documentation about these things (the Volume Camera docs are about two paragraphs in total), so feel free to link any useful sources you happen to know (: