CP3 vs CP4

I wanted to see how much faster the CP4 is than the CP3, so I wrote a small benchmark program that counts the prime numbers between 2 and a given limit. I’ve added a FINDPRIMES command to the console so I can run the test a few different ways. Here is an example running on a CP3:

findprimes 20
2 3 5 7 11 13 17 19
Found 8 primes in 00:00:00.0005093

After testing each number between 2 and 20, the program found 8 that are prime, and it took 509.3 microseconds from start to finish.

Single-Threaded

The algorithm I’m using for this first run of tests is pretty awful. It runs in quadratic time, so doubling the limit roughly quadruples the total number of loops. There are better ways of finding all the primes below a certain number, but I just need something that will tax the processor for a bit. It also runs everything in a single thread.
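
The post doesn’t show the program’s source, so the following is only a minimal sketch of what a quadratic prime search could look like. It is written in Python rather than whatever runs on the processor, and the names is_prime and find_primes are made up for illustration. Each candidate is tested against every smaller divisor, so a prime n costs about n loops, and the total grows with the square of the limit.

```python
def is_prime(n):
    """Naive trial division: try every candidate divisor from 2 up to n - 1."""
    for d in range(2, n):
        if n % d == 0:
            return False  # found a divisor, so n is composite
    return n >= 2


def find_primes(limit):
    """Return every prime between 2 and limit, inclusive."""
    return [n for n in range(2, limit + 1) if is_prime(n)]


print(find_primes(20))  # [2, 3, 5, 7, 11, 13, 17, 19]
```

Stopping the divisor loop at the square root of n would be one of the "better ways" mentioned above, but a slow search is the point here.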

I’m using the Stopwatch class to time how long it takes to find all prime numbers between 2 and 20. Printing each number to the console takes a good chunk of that time. Here’s another example where I tell the command to be quiet, and the time improves significantly, from 509.3 microseconds to 28.3:

findprimes 20 quiet
Found 8 primes in 00:00:00.0000283
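
As a rough illustration of the measurement, here is a sketch that times the same search with and without console output. It uses Python’s time.perf_counter as a stand-in for Stopwatch; the quiet parameter and the function names are assumptions made for this example, not the command’s real implementation.

```python
import time


def count_primes(limit, quiet=False):
    """Count primes up to limit with trial division, optionally printing each one."""
    found = 0
    for n in range(2, limit + 1):
        if all(n % d for d in range(2, n)):
            found += 1
            if not quiet:
                print(n, end=" ")
    if not quiet:
        print()
    return found


def timed(limit, quiet):
    """Return (count, elapsed seconds) for one run."""
    start = time.perf_counter()
    found = count_primes(limit, quiet)
    return found, time.perf_counter() - start


for quiet in (False, True):
    found, elapsed = timed(20, quiet)
    print(f"Found {found} primes in {elapsed:.7f} s (quiet={quiet})")
```

The absolute times will look nothing like the ones from the control processor. The only thing the sketch is meant to show is that the printing sits inside the timed region.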

If we raise the limit to 50,000, it takes a little over 20 seconds to finish:

findprimes 50000 quiet
Found 5133 primes in 00:00:20.8147098

There is a problem if we run our test for too long: a watchdog timer notices that we haven’t been treating the system very fairly and notifies our thread to wrap things up. On the CP3, I can only check below 80,000 before we exceed this timer. Here’s a graph of how long it takes for each set of numbers:

Vertical axis is seconds, horizontal is iterations

Based on this, the CP4 performs a little better, but not as much as I was expecting. I think this chart shows that single-threaded programs will receive only a slight boost in performance. Let’s see how multi-threaded programming compares!

Multi-Threaded

We can modify our program a bit to hand off the task of determining whether a number is prime. These tests can run in other threads, and our main thread can wait until they finish before moving on to the next batch. Running the test this way incurs more overhead (we need to start other threads and coordinate communication between them), but it should improve performance in the end.
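
The hand-off could be structured along these lines. This is a sketch only: it assumes the range is cut into one contiguous chunk per worker, and the names count_in_range and count_primes_threaded are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_in_range(start, stop):
    """Count primes in [start, stop) using naive trial division."""
    return sum(1 for n in range(start, stop)
               if n >= 2 and all(n % d for d in range(2, n)))


def count_primes_threaded(limit, workers=4):
    """Split 2..limit into contiguous chunks, one per worker, and sum the counts."""
    # Ceiling division over the limit - 1 candidates; never let the step be 0.
    size = max(1, -(-(limit - 1) // workers))
    starts = range(2, limit + 1, size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(count_in_range, s, min(s + size, limit + 1))
                   for s in starts]
        # Leaving the with-block waits for every worker to finish.
    return sum(f.result() for f in futures)


print(count_primes_threaded(1000))  # 168
```

One caveat: in standard CPython the global interpreter lock keeps threads from running this kind of CPU-bound loop in parallel, so the sketch shows the structure of the hand-off rather than the speedup measured on the multi-core processor.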

One side-effect of running multiple threads is that our results can be returned out of order:

findprimes 20
2 3 5 7 17 13 11 19
Found 8 primes in 00:00:00.0026434

If we cared about the order, we could wait for threads to complete in order using a queue or some other data structure.
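
To make the ordering point concrete, here is a small sketch that collects results both ways. It is an assumption about how this might be done, not the program’s actual code: the unordered version takes results as threads finish, while the ordered version relies on Executor.map, which yields results in the order the inputs were submitted.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def is_prime(n):
    """Naive trial division, as in the single-threaded test."""
    return n >= 2 and all(n % d for d in range(2, n))


def primes_unordered(limit, workers=4):
    """Collect primes in whatever order the worker threads finish."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(is_prime, n): n for n in range(2, limit + 1)}
        return [futures[f] for f in as_completed(futures) if f.result()]


def primes_ordered(limit, workers=4):
    """Collect primes in submission order; map() yields results in input order."""
    numbers = range(2, limit + 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [n for n, prime in zip(numbers, pool.map(is_prime, numbers)) if prime]


print(primes_ordered(20))  # [2, 3, 5, 7, 11, 13, 17, 19]
```

The ordered version may hold finished results back while it waits for an earlier one, which is the price of keeping the output sorted.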

Here’s how the multi-threaded program compares on both architectures:

Vertical axis is seconds, horizontal is iterations

You can see the single-core CP3 performs roughly the same (actual times are a little worse compared to single-threaded), but the multi-core CP4 really shines once we start giving each thread a sufficiently large chunk of work. I’m sure if I divided up the work between threads a bit more intelligently, I could get the slope of the CP4 line to look even better. The way I wrote the program, the second half of the worker threads end up doing 75% of the work. Here’s how lopsided it is when running FINDPRIMES 1000:

Each thread counting how many loops it makes
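
The lopsidedness follows from the quadratic cost. If testing n costs roughly n loops, the upper half of the range accounts for about 1 - (1/2)^2 = 3/4 of the total. The sketch below counts loops per worker under two assumptions that are not confirmed by the post: contiguous chunks, and a divisor loop that stops at the first divisor it finds. The function names are hypothetical.

```python
def loops_for(n):
    """Trial-division iterations spent on n: stop at the first divisor found."""
    count = 0
    for d in range(2, n):
        count += 1
        if n % d == 0:
            break
    return count


def chunk_loops(limit, workers):
    """Loop totals per worker when 2..limit is split into contiguous chunks."""
    size = max(1, -(-(limit - 1) // workers))
    return [sum(loops_for(n) for n in range(s, min(s + size, limit + 1)))
            for s in range(2, limit + 1, size)]


loops = chunk_loops(1000, 10)
share = sum(loops[len(loops) // 2:]) / sum(loops)
for worker, total in enumerate(loops, start=1):
    print(f"worker {worker:2d}: {total} loops")
print(f"second half of the workers: {share:.0%} of all loops")
```

The exact share depends on how the real program counts its loops, so the number printed here should only be expected to land in the same neighborhood as the 75% figure.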

Summary

This wasn’t an exhaustive benchmark. I only ran each test once to get a rough idea of what performance looked like, although I did notice that running the same test again immediately gave slightly better results. If you have a task that can be parallelized well, the 4-series should definitely give you better performance. And the good thing about UI logic and automation is that it lends itself well to multiprocessing, because events can all fire at the same time.

And just because I was curious how the CP4 performs against my desktop PC (AMD Ryzen 7 2700 Eight-Core 3.40GHz with 32GB RAM), here’s what I got:

CP4 (10 worker threads): 9592 prime numbers in 00:01:15.7732802 (limit=100,000)
PC (10 worker threads):  9592 prime numbers in 00:00:00.4565772 (limit=100,000)
PC (20 worker threads):  9592 prime numbers in 00:00:00.3910194 (limit=100,000)
PC (20 worker threads): 78498 prime numbers in 00:00:24.2096742 (limit=1,000,000)

Under similar test parameters, my PC is almost 166 times faster. If I double the number of worker threads, it ends up being almost 194 times faster. I’d also like to see how a Raspberry Pi 4 compares. Maybe I can try that in the future.
