So far we have focused on using all cores for the simulation rather than on making it asynchronous, which means a job almost always completes within 1 frame. The 4 frame limit is a side effect of the specialized allocator we use for this case; it is not something we actively enforce.
Long-running asynchronous jobs are slightly different and require some tweaking, but letting you choose a different allocator for such jobs, one that does not require them to complete within 4 frames, seems like a good start.
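
To illustrate why an allocator can impose a frame limit without anyone enforcing one, here is a rough sketch. It assumes the specialized allocator behaves like a ring of per-frame arenas that are recycled after a fixed number of frames; that is only a guess for illustration, not a description of the actual implementation. All names (FrameRingAllocator, Block, kFramesInFlight) are hypothetical, and the sketch only tracks offsets rather than real memory.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: none of these names come from the actual engine.
// Assumed to match the 4 frame limit discussed above.
constexpr std::size_t kFramesInFlight = 4;

struct Block {
    std::size_t offset;   // position inside its arena
    std::uint64_t frame;  // frame in which the block was allocated
};

class FrameRingAllocator {
public:
    explicit FrameRingAllocator(std::size_t bytesPerFrame)
        : capacity_(bytesPerFrame) {
        used_.fill(0);
    }

    // Bump-allocate from the arena that belongs to the current frame.
    // Returns false if that arena does not have enough room left.
    bool allocate(std::size_t bytes, Block& out) {
        std::size_t& used = used_[frame_ % kFramesInFlight];
        if (bytes > capacity_ - used) return false;
        out = Block{used, frame_};
        used += bytes;
        return true;
    }

    // Advance one frame and recycle the arena that is about to be reused.
    // Nothing checks whether jobs still hold blocks from that arena.
    void nextFrame() {
        ++frame_;
        used_[frame_ % kFramesInFlight] = 0;
    }

    // A block is only safe to use until its arena has been recycled,
    // which happens kFramesInFlight frames after it was allocated.
    bool isValid(const Block& b) const {
        return frame_ - b.frame < kFramesInFlight;
    }

private:
    std::size_t capacity_;
    std::uint64_t frame_ = 0;
    std::array<std::size_t, kFramesInFlight> used_;
};
```

In this sketch the limit falls out of the recycling in nextFrame(): allocation and release are very cheap, but any job still running when its arena comes around again would be reading recycled memory. A long-running job would need an allocator whose blocks live until they are explicitly freed.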