Programmer Ramblings – Screwing Around With std::thread

Step 2 – Increasing the size of a work unit

Luckily, this step is pretty easy. Pump more shit into a thread, let it run, see what happens. It only took a few minutes to confirm I was barking up the right tree. My first pass at increasing the work size was simply to make each thread render a full row of the output image, rather than a single pixel. Each thread now runs about 80,000 rays instead of 100. Simple change, large result. The code ends up looking like this:

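// Note: hardware_concurrency() may return 0 if the value cannot be determined.
uint32_t MaxThreadCount = std::thread::hardware_concurrency();
{
	// Sweep the thread count in steps of 4 so the timings can be compared.
	for (uint32_t ThreadCount = 4; ThreadCount <= MaxThreadCount; ThreadCount += 4)
	{
		// Render the image in batches of ThreadCount rows, one thread per row.
		for (uint32_t j = 0; j < YSize; j += ThreadCount)
		{
			std::vector<std::thread> Threads;
			Threads.reserve(ThreadCount);
			for (uint32_t t = 0; t < ThreadCount && (j + t) < YSize; ++t)
			{
				Threads.emplace_back(&SphereTest_ThreadIter1::BatchThread, this, j + t, SampleCount, XSize, XSizef, YSizef, Shapes, OutputImage);
			}
			// Wait for the whole batch; the slowest row gates the start of the next batch.
			for (std::thread& Thread : Threads)
			{
				Thread.join();
			}
		}
	}
}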
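// Renders one full row (j) of the image. Shapes is taken by const reference to avoid an extra copy per thread.
void SphereTest_ThreadIter1::BatchThread(uint32_t j, uint32_t SampleCount, uint32_t XSize, float XSizef, float YSizef, const std::vector<Shape*>& Shapes, class Image* OutputImage)
{
	// Default-seeded, so every row draws the same jitter sequence.
	std::default_random_engine generator;
	std::uniform_real_distribution<float> distribution(-0.5f, 0.5f);

	for (uint32_t i = 0; i < XSize; ++i)
	{
		Color PixelColor = Color();
		for (uint32_t s = 0; s < SampleCount; ++s)
		{
			// Jitter the sample position within the pixel.
			float U = float(i + distribution(generator)) / XSizef;
			float V = float(j + distribution(generator)) / YSizef;

			Ray TraceRay(Origin, BottomLeft + U * HorizSize + V * VertSize);
			PixelColor += GetColorForRay(Shapes, TraceRay);
		}
		// Average the samples. The write of PixelColor into OutputImage is not shown in this listing.
		PixelColor /= float(SampleCount);
	}
}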

This gives me the following timings:

Sphere Threading Iteration 1 - Total Time Taken (ms):829 for thread count:4
Sphere Threading Iteration 1 - Total Time Taken (ms):429 for thread count:8
Sphere Threading Iteration 1 - Total Time Taken (ms):323 for thread count:12
Sphere Threading Iteration 1 - Total Time Taken (ms):256 for thread count:16
Sphere Threading Iteration 1 - Total Time Taken (ms):211 for thread count:20
Sphere Threading Iteration 1 - Total Time Taken (ms):186 for thread count:24
Sphere Threading Iteration 1 - Total Time Taken (ms):174 for thread count:28
Sphere Threading Iteration 1 - Total Time Taken (ms):239 for thread count:32

Simple change, huge results, and a confirmation that I’m barking up the right tree. In this case, the worker threads now account for ~80% of the total execution time of the program. The timings improve as thread count increases, but the gains start to peter out at the higher counts. The jump in cost at 32 threads is probably the point where the cost of the join and the variation in thread completion times start causing bottlenecks, since every batch has to wait on its slowest row.
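To put a number on how the gains peter out, here is a small sketch that computes speedup and scaling efficiency from the Iteration 1 timings above, using the 4-thread run as the baseline. The `Timing` struct and the helper names are made up for illustration and are not part of the original test code.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper type: one line of the timing output above.
struct Timing
{
	uint32_t ThreadCount;
	double Ms;
};

// The Iteration 1 timings, copied from the log output.
inline std::vector<Timing> Iteration1Timings()
{
	return {
		{4, 829.0}, {8, 429.0}, {12, 323.0}, {16, 256.0},
		{20, 211.0}, {24, 186.0}, {28, 174.0}, {32, 239.0},
	};
}

// How many times faster T is than the baseline run.
inline double RelativeSpeedup(const Timing& Baseline, const Timing& T)
{
	return Baseline.Ms / T.Ms;
}

// Measured speedup divided by the ideal speedup (the ratio of thread counts).
// 1.0 would mean perfect scaling from the baseline.
inline double ScalingEfficiency(const Timing& Baseline, const Timing& T)
{
	const double Ideal = double(T.ThreadCount) / double(Baseline.ThreadCount);
	return RelativeSpeedup(Baseline, T) / Ideal;
}
```

Run against the numbers above, 8 threads comes out at roughly 1.93x over 4 threads (ideal 2x), 28 threads at roughly 4.8x (ideal 7x, so about 68% efficiency), and 32 threads drops back to roughly 3.5x.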

Out of curiosity, I did another quick test where I split each worker thread into some percentage of the image, rather than per-row. This gave me the following timings:

Sphere Threading Iteration 2 - Total Time Taken (ms):822 for thread count:4
Sphere Threading Iteration 2 - Total Time Taken (ms):432 for thread count:8
Sphere Threading Iteration 2 - Total Time Taken (ms):321 for thread count:12
Sphere Threading Iteration 2 - Total Time Taken (ms):232 for thread count:16
Sphere Threading Iteration 2 - Total Time Taken (ms):196 for thread count:20
Sphere Threading Iteration 2 - Total Time Taken (ms):168 for thread count:24
Sphere Threading Iteration 2 - Total Time Taken (ms):158 for thread count:28
Sphere Threading Iteration 2 - Total Time Taken (ms):146 for thread count:32
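The code for this second test isn't shown, so the following is only a rough sketch of what a per-band split could look like, not the actual implementation. `RenderInBands` and the `RowFn` callback are hypothetical names; the callback stands in for whatever renders a single row (the body of `BatchThread` in the earlier listing).

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-row rendering work.
using RowFn = std::function<void(uint32_t /*Row*/)>;

// Splits rows [0, YSize) into contiguous bands, spawns one thread per band,
// and joins exactly once at the end.
void RenderInBands(uint32_t YSize, uint32_t ThreadCount, const RowFn& RenderRow)
{
	if (YSize == 0 || ThreadCount == 0)
	{
		return;
	}
	ThreadCount = std::min(ThreadCount, YSize);

	// Round up so the last band picks up any leftover rows.
	const uint32_t RowsPerBand = (YSize + ThreadCount - 1) / ThreadCount;

	std::vector<std::thread> Threads;
	Threads.reserve(ThreadCount);
	for (uint32_t Begin = 0; Begin < YSize; Begin += RowsPerBand)
	{
		const uint32_t End = std::min(Begin + RowsPerBand, YSize);
		Threads.emplace_back([Begin, End, &RenderRow]
		{
			for (uint32_t j = Begin; j < End; ++j)
			{
				RenderRow(j);
			}
		});
	}

	// Single join point for the whole image.
	for (std::thread& Thread : Threads)
	{
		Thread.join();
	}
}
```

Each thread is created once and owns a fixed slice of the image, so the only synchronization is the final join. The tradeoff is that a band full of expensive rows can still finish later than the others.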

This one is a little more expected. Since I’m only running N total threads for the life of execution, the total time keeps dropping as threads are added, all the way up to 32. Nothing waits on a per-batch join, and the cost of repeatedly creating threads is eliminated. This is especially apparent at the high thread count end. If I were to adapt a message pump to the smaller full-row batches, I suspect I’d see a similarly steady drop in the total execution time of the program, with some additional overhead for spawning the threads.
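As a rough sketch of that last idea: instead of a full message pump, the version below uses a shared atomic row counter as a simpler stand-in. A fixed set of threads each grab the next unrendered row until none are left. `RenderRowsFromSharedCounter` and `RowFn` are hypothetical names, and I haven't timed this against the numbers above.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-row rendering work.
using RowFn = std::function<void(uint32_t /*Row*/)>;

// Spawns ThreadCount long-lived workers that pull row indices from a shared
// counter, so a thread that finishes a cheap row immediately takes another.
void RenderRowsFromSharedCounter(uint32_t YSize, uint32_t ThreadCount, const RowFn& RenderRow)
{
	if (ThreadCount == 0)
	{
		ThreadCount = 1;
	}

	std::atomic<uint32_t> NextRow{0};
	auto Worker = [&NextRow, YSize, &RenderRow]
	{
		for (;;)
		{
			// fetch_add hands each row index to exactly one thread.
			const uint32_t j = NextRow.fetch_add(1, std::memory_order_relaxed);
			if (j >= YSize)
			{
				break;
			}
			RenderRow(j);
		}
	};

	std::vector<std::thread> Threads;
	Threads.reserve(ThreadCount);
	for (uint32_t t = 0; t < ThreadCount; ++t)
	{
		Threads.emplace_back(Worker);
	}
	for (std::thread& Thread : Threads)
	{
		Thread.join();
	}
}
```

This keeps the small per-row work unit from the first test but spawns each thread only once, so there is no per-batch join for fast rows to stall on.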
