Calculate Accurate Time for Blit Loops in a Coroutine

I’m performing a long, drawn out process using a Coroutine, where I track the elapsed time and, if it exceeds a threshold (e.g. 1/60 second), I yield return null and allow a frame to render before continuing the cycle.

While this works fine for a CPU-bound process, it runs into unusual dilemmas for a GPU-driven one. My loop, in brief, looks like this:

// Requires: using System.Diagnostics; (for Stopwatch)
Stopwatch sw = new Stopwatch();
sw.Start();
bool working = true;
Texture2D inputTexture;
RenderTexture[] outputTexture = new RenderTexture[2];
// skipping texture preparations
Material mat; // includes the shader used to process the output texture(s)
int texIndex = 0;
while(working)
{
	// Incorporates the previous resulting output in the new pass,
	// hence the XOR cycled index
	mat.SetTexture("_CurrentSampler", outputTexture[texIndex]);
	texIndex ^= 1;
	Graphics.Blit(inputTexture, outputTexture[texIndex], mat, 0);
	// Once the frame's time budget is spent, yield so a frame can render before continuing
	if(sw.Elapsed.TotalSeconds >= (1.0 / 60.0))
	{
		yield return null;
		sw.Restart();
	}
	if([process completed])
	{
		working = false;
	}
}
sw.Stop();

At a fundamental level, this works. When the process takes a while, it renders a new frame and resumes.

However, this has actually been a double-edged sword:

  1. If the loop is given too little time (e.g. 0.0 seconds, rather than 1/60), the process takes a very long time, running only a single pass per frame.
  2. If the loop is given too much time (e.g. 1.0 second), the process also takes a long time - I’m presuming this is due to the CPU and GPU processes falling out of sync, but I can’t claim that with 100% certainty.

The truly confounding factor, however, is that the Blit() calls seem to be falling behind: even if I actually *USE* the 1/60-second delay in my example, it still winds up running slowly and, in turn, taking longer than it otherwise could.

To give an example, I’m currently processing a 256x256 texture, where it takes ~20 seconds with no rendered frames in between, ~13 seconds with a 1/60-second (specified) delay (running poorly), and ~8 seconds with a 1/480-second (specified) delay (running at near-60 fps). (For reference, it takes 500+ seconds when yielding on every pass, since it’s very heavy-duty texture processing overall, otherwise getting through ~4000+ passes per frame.)

All of this suggests that a degree of processor and video desynchronization is resulting in a delay that is proving difficult to accurately time, based on the results of 1/60 vs. 1/480-second cycle timers.

Furthermore, because processors, video cards, and texture sizes will vary, there’s no magic number that will inherently prove satisfactory here. While a [256x256] texture works well for my hardware with 1/480-second (specified) timing, a [512x512] texture will choke and take 200+ seconds to process (and yes, it’s still getting through plenty of cycles per rendered frame).


So, my question(s) is(are), essentially:

Is there any reasonable way to accurately track and react to time spent in a Blit()-heavy loop?

Why *does* the process take longer when there are *fewer* rendered frames?

Edit: clarity, cleanup, formatting

Blit is not a synchronous operation. It’s merely a command that’s queued “somewhere” abstracted away from your control.

The Blit call itself returns quite fast, so basically what you are doing is scheduling a LOT of Blits in that 1/60th of a second, which will be processed some time later on the render thread, possibly taking a lot more time.

Perhaps what you’d need to do is somehow measure how long the blit, or a series of blits, actually takes. Unfortunately, I am not aware of any tool in Unity that would help you do that in the running game. You could measure it with tools like RenderDoc, but that would give you numbers on your computer only.
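One rough, untested sketch of how that measurement could work at runtime (my assumption, not something from your code): issue a batch of blits, then request an AsyncGPUReadback on the last output and yield until it completes, so the stopwatch captures when the GPU actually finished the batch rather than when the commands were merely queued. The blitsPerBatch parameter and the method name are hypothetical, and AsyncGPUReadback isn’t supported everywhere (check SystemInfo.supportsAsyncGPUReadback).

// Hypothetical sketch: time a batch of blits including GPU execution by waiting
// for an async readback of the final output, instead of timing only submission.
// Requires: using System.Collections; using System.Diagnostics;
//           using UnityEngine; using UnityEngine.Rendering;
IEnumerator TimeBlitBatch(Texture2D inputTexture, RenderTexture[] outputTexture,
	Material mat, int blitsPerBatch)
{
	Stopwatch sw = Stopwatch.StartNew();
	int texIndex = 0;
	for (int i = 0; i < blitsPerBatch; i++)
	{
		mat.SetTexture("_CurrentSampler", outputTexture[texIndex]);
		texIndex ^= 1;
		Graphics.Blit(inputTexture, outputTexture[texIndex], mat, 0);
	}
	// The readback only completes once the GPU has finished all work queued
	// before it, including the blits above, so it acts as a sync point.
	AsyncGPUReadbackRequest req = AsyncGPUReadback.Request(outputTexture[texIndex]);
	while (!req.done)
		yield return null;
	sw.Stop();
	UnityEngine.Debug.Log(blitsPerBatch + " blits took ~" + sw.Elapsed.TotalMilliseconds + " ms (GPU included)");
}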

One possible solution would be to simply benchmark it beforehand on the user’s device to come up with a number of Blits per frame that meets your timing criteria.

Maybe this would help

Instead of looping in a coroutine, prepare a commandBuffer that does all the blits you want (using compute shaders / async permitted commands), and run it asynchronously. Never tried it though.
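For what it’s worth, a minimal sketch of the batching half of that suggestion (just a guess at the shape of it, not tested): record the ping-pong blits into a CommandBuffer and submit the whole batch with one Graphics.ExecuteCommandBuffer call per frame. It assumes the shader can read _CurrentSampler as a shader global (set via SetGlobalTexture) rather than a per-material property, and it leaves out the async-compute part, since a regular Blit isn’t an async-queue-compatible command.

// Hypothetical sketch: batch the ping-pong blits into a CommandBuffer.
// Requires: using UnityEngine; using UnityEngine.Rendering;
CommandBuffer BuildBlitBatch(Texture2D inputTexture, RenderTexture[] outputTexture,
	Material mat, int blitsPerBatch, ref int texIndex)
{
	var cb = new CommandBuffer { name = "Blit batch" };
	for (int i = 0; i < blitsPerBatch; i++)
	{
		// Assumes the shader samples _CurrentSampler as a global texture.
		cb.SetGlobalTexture("_CurrentSampler", outputTexture[texIndex]);
		texIndex ^= 1;
		cb.Blit(inputTexture, outputTexture[texIndex], mat, 0);
	}
	return cb;
}

// Usage, once per frame:
// Graphics.ExecuteCommandBuffer(BuildBlitBatch(inputTexture, outputTexture, mat, 64, ref texIndex));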

That’s not a bad idea, and something I can look into. It would definitely involve a thorough rewrite of that section of code, so it may or may not wind up being high priority, depending on how things pan out.

Anyway, all of that’s good food for thought. The part that surprises me most, really, is the fact that I’m alternating back and forth between two RenderTexture buffers and still winding up with a stuttering backlog the way I am. I would have assumed that I would have forced myself to be more CPU-bound, and not caught in a GPU-driven queue, under the circumstances.

Actually, that’s a lie. The part that surprises me most is still that dedicating *MORE* time to rendering new frames and, by extension, taking the current state of the overall texture and CPU-reading it into a separate texture for rendering takes LESS time than a basic, single-long-frame loop.

(I know I hadn’t mentioned re-writing the texture data in a different format before, but it’s neither relevant nor does it have any impact on the performance related to the question itself.)

Ahh, right. Command Buffers. Handy if you like things with minimal documentation and seemingly/virtually zero reference points for… eccentric use.

Anyway, I think I found a way that I can effectively “benchmark” performance.

First, what doesn’t work: I can’t rely on timestamps. These loops will report time as if the framerate isn’t bad/choppy, even when it’s clearly, visibly running drastically worse (e.g. ~5-10 fps will still report a worst frame time of ~17 ms, i.e. 60 fps).

So, what can work here?

First, fundamentally, there are three (kind of four) things to look for:

  1. When too few loop cycles occur per frame, too much time is wasted waiting for the next frame to render.
  2. When too many loop cycles occur per frame, it *could* be getting held up by Time.maximumDeltaTime, though I haven’t actually seen that *explicitly* stated in my testing.
  3. Between those two points, it ranges from ~maximum framerate (e.g. 60) to lousy (and stuttering), but maintains a high relative throughput in terms of cycles-per-second.

This suggests that a sliding scale of benchmarking should be able to swiftly determine, per computer and per texture resolution, where the optimal range (approximately) lies (with some margin of error for arbitrary, errant hiccups). Then, by aiming for the “better framerate” end of that scale, it would effectively wind up with the same result while (somehow) minimizing the processing time for the loop.


Edit - So, here’s the follow-up with regard to what I went with.

In the end, I settled on just counting up in terms of performance. Start at 1 cycle through the loop, noting how long it takes while factoring in the Coroutine’s yield return null to wait for a new frame. Divide the time spent by the number of passes and you have a simple estimate of time taken per cycle.

Then, double the number of cycles per frame over and over until there isn’t a consistent ~80+% performance gain, and assume it’s reached peak efficiency.
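For concreteness, here’s a minimal sketch of that counting-up benchmark as described (the doOnePass delegate, the onDone callback, and the exact stopping rule are placeholders here, not my actual implementation):

// Hypothetical sketch of the doubling benchmark: run 1, 2, 4, ... passes per frame,
// estimate throughput after each frame, and stop doubling once the gain over the
// previous step falls below ~80%. doOnePass() stands in for one blit pass.
// Requires: using System; using System.Collections; using System.Diagnostics;
IEnumerator BenchmarkPassesPerFrame(Action doOnePass, Action<int> onDone)
{
	int passesPerFrame = 1;
	double previousThroughput = 0.0; // in passes per second
	while (true)
	{
		Stopwatch sw = Stopwatch.StartNew();
		for (int i = 0; i < passesPerFrame; i++)
			doOnePass();
		yield return null; // include the wait for the next rendered frame
		sw.Stop();
		double throughput = passesPerFrame / sw.Elapsed.TotalSeconds;
		if (previousThroughput > 0.0 && throughput < previousThroughput * 1.8)
			break; // doubling stopped paying off; treat this as (near) peak efficiency
		previousThroughput = throughput;
		passesPerFrame *= 2;
	}
	onDone(passesPerFrame);
}

In practice, a single run like this would presumably be repeated and averaged, per the multi-stage averaging mentioned below.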

Unfortunately, it’s not super reliable yet, even with multi-stage benchmark averaging (as it were), but it has at least reached a state where one of the typical top-three-performing counts gets chosen for the rest of the texture processing, albeit with no particular consistency.

So for now, it’s at least a viable start.
