What would be the best way to measure the speedup of my program assuming I only have 4 cores? Obviously I could measure it up to 4, however it would be nice to know for 8, 16, and so on.
Ideally I'd like to know the amount of speedup per number of threads, similar to this graph:
Is there any way I can do this? Perhaps a method of simulating multiple cores?
asked Mar 9 '12 at 22:03
I'm sorry, but in my opinion the only reliable measurement is to actually get a machine with 8, 16 or more cores and test on that.
Memory bandwidth saturation, number of CPU functional units and other hardware bottlenecks can have a huge impact on scalability. I know from personal experience that if a program scales on 2 cores and on 4 cores, it might dramatically slow down when run on 8 cores, simply because it's not enough to have 8 cores to be able to scale 8x.
You could try to predict what will happen, but there are a lot of factors that need to be taken into account:
- caches - size, number of layers, shared / non-shared
- memory bandwidth
- number of cores vs. number of processors i.e. is it an 8-core machine or a dual-quad-core machine
- interconnection between cores - a lower number of cores (2, 4) can still work reasonably well with a bus, but for 8 or more cores a more sophisticated interconnection is needed.
- memory access - again, a lower number of cores works well with the SMP (symmetric multiprocessing) model, while a higher number of cores needs a NUMA (non-uniform memory access) model.
I don't think there is a real way to do this either, but one thing that comes to mind is using a virtual machine to simulate more cores. In VirtualBox, for example, you can select up to 16 cores from the standard menu, and I am fairly confident there are hacks that allow more than that. Other virtual machines such as VMware might even support more out of the box.
bamboon and doron are correct that many variables are at play, but if you have a tunable input size n, you can figure out the strong scaling and weak scaling of your code.
Strong scaling refers to fixing the problem size (e.g. n = 1M) and varying the number of threads available for computation. Weak scaling refers to fixing the problem size per thread (e.g. n = 10k/thread) and varying the number of threads available for computation.
It's true there are a lot of variables at work in any program. However, if you have some basic input size n, it's possible to get some semblance of scaling. On an n-body simulator I developed a few years back, I varied the number of threads for a fixed size, and separately the input size per thread, and was able to calculate a rough measure of how well the multithreaded code scaled.
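To make the two measurements concrete, here is a minimal sketch of a timing harness. It assumes you can wrap your program in a callable run(n, threads) that solves a problem of total size n with the given number of threads. That callable and the helper names time_run, strong_scaling and weak_scaling are made up for illustration, not part of any library.

```python
import time

def time_run(run, n, threads, repeats=3):
    """Best-of-`repeats` wall-clock time of run(n, threads), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run(n, threads)  # hypothetical entry point into your own program
        best = min(best, time.perf_counter() - start)
    return best

def strong_scaling(run, n, thread_counts):
    """Fixed total problem size n.

    Returns {threads: (speedup, efficiency)}, where speedup = T(1) / T(p)
    and efficiency = speedup / p (1.0 is ideal).
    """
    base = time_run(run, n, 1)
    result = {}
    for p in thread_counts:
        speedup = base / time_run(run, n, p)
        result[p] = (speedup, speedup / p)
    return result

def weak_scaling(run, n_per_thread, thread_counts):
    """Fixed problem size per thread, so the total size grows with p.

    Returns {threads: efficiency}, where efficiency = T(1) / T(p)
    and 1.0 means the run time stayed flat as work and threads grew together.
    """
    base = time_run(run, n_per_thread, 1)
    return {p: base / time_run(run, n_per_thread * p, p) for p in thread_counts}
```

On a 4-core machine you would call these with thread_counts such as [1, 2, 4]. The resulting curve only describes that machine, so it shows the trend up to 4 threads rather than predicting 8 or 16.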
Since you only have 4 cores, you can only feasibly compute the scaling up to 4 threads. This severely limits your ability to see how well it scales to heavily threaded loads. But this may not be an issue if your application is only used on machines with small core counts.
You really need to ask yourself the question: Is this going to be used on 10, 20, 40+ threads? If it is, the only way to accurately determine scaling to those regimes is to actually benchmark it on a platform where you have that hardware available.
Side note: Depending on your application, it may not matter that you only have 4 cores. Some workloads scale with increasing threads regardless of the real number of cores available, if many of those threads spend time "waiting" for something to happen (e.g. web servers). If you're doing pure computation though, this won't be the case.
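As a rough illustration of the "waiting" case, the sketch below uses time.sleep as a stand-in for blocking I/O. The task count, the delay and the function names are arbitrary choices for the example. Because the threads spend their time blocked rather than computing, the elapsed time keeps dropping as threads are added, even past the number of physical cores.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def wait_bound_task(delay):
    """Stand-in for a task that mostly waits (network, disk, ...)."""
    time.sleep(delay)

def elapsed_with_threads(threads, tasks=16, delay=0.05):
    """Wall-clock seconds to finish `tasks` waiting tasks on `threads` threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() forces all tasks to complete before the timer stops.
        list(pool.map(wait_bound_task, [delay] * tasks))
    return time.perf_counter() - start

if __name__ == "__main__":
    for threads in (1, 2, 4, 8, 16):
        print(threads, round(elapsed_with_threads(threads), 3))
```

Swapping the sleep for a CPU-bound loop should make the curve flatten once the thread count reaches the core count (and in CPython, because of the global interpreter lock, even earlier).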
I don't believe this is possible, since there are too many variables to be able to accurately extrapolate performance. Even assuming your code is 100% parallel, there are other factors like bus speed and cache misses that might limit your performance, not to mention peripheral performance. How all of these factors affect your code can only be determined by measuring on your specific hardware platform.
I take it you are asking about measurement, so I won't address the issue of predicting the effect on higher numbers of cores.
This question can be viewed another way: how busy can you keep each thread, and what do they total up to? For example, six threads running at, say, 50% utilization each give you the equivalent of 3 processors running. Dividing that by, say, four processors means that your methods are achieving 75% utilization. Comparing that utilization against the actual wall-clock speedup tells you how much of your utilization is new overhead and how much is real speedup. Isn't that what you are really interested in?
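The arithmetic above can be written down as a small sketch. The function names are made up, and the overhead formula is one possible reading of "comparing utilization against actual speedup": whatever part of the busy time did not show up as speedup is counted as overhead.

```python
def equivalent_processors(thread_utilizations):
    """Sum of per-thread utilizations, e.g. six threads at 0.5 each -> 3.0."""
    return sum(thread_utilizations)

def machine_utilization(thread_utilizations, cores):
    """Fraction of the machine kept busy, e.g. 3.0 / 4 cores -> 0.75."""
    return equivalent_processors(thread_utilizations) / cores

def overhead_fraction(thread_utilizations, measured_speedup):
    """Share of busy time that did not turn into speedup.

    Assumption: measured_speedup is T(single-threaded) / T(multithreaded),
    taken from wall-clock timings of the same workload.
    """
    busy = equivalent_processors(thread_utilizations)
    return 1.0 - measured_speedup / busy

if __name__ == "__main__":
    threads = [0.5] * 6
    print(equivalent_processors(threads))    # 3.0
    print(machine_utilization(threads, 4))   # 0.75
    # 2.4 is a placeholder speedup for the example, not a measurement.
    print(overhead_fraction(threads, 2.4))
```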
The processor utilization can be computed in real time in a couple of different ways. Threads can independently ask the system for their thread times, compute ratios and maintain global totals. If you have total control over your blocking states, you don't even need the system calls, because you can just keep track of the ratio of blocking to nonblocking machine cycles to compute utilization. A real-time multithreading instrumentation package I developed uses such methods and they work well. The CPU clock counter in newer CPUs can be read in under 20 machine cycles.
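A minimal sketch of the first approach, where each thread asks the system for its own thread time and adds it to global totals, might look like this. It is not the instrumentation package mentioned above. UtilizationTotals and instrumented are hypothetical names, and it relies on time.thread_time(), which needs Python 3.7 or later and a platform that exposes per-thread CPU clocks.

```python
import threading
import time

class UtilizationTotals:
    """Global totals of CPU time and wall time reported by worker threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.cpu_seconds = 0.0
        self.wall_seconds = 0.0

    def add(self, cpu, wall):
        with self._lock:
            self.cpu_seconds += cpu
            self.wall_seconds += wall

    def utilization(self):
        """Average per-thread utilization: CPU time / wall time over all reports."""
        with self._lock:
            if self.wall_seconds == 0.0:
                return 0.0
            return self.cpu_seconds / self.wall_seconds

def instrumented(work, totals):
    """Wrap `work` so the thread running it reports its own CPU and wall time."""
    def wrapper(*args):
        wall_start = time.perf_counter()
        cpu_start = time.thread_time()  # CPU time of the calling thread only
        try:
            work(*args)
        finally:
            totals.add(time.thread_time() - cpu_start,
                       time.perf_counter() - wall_start)
    return wrapper
```

A thread would be started with threading.Thread(target=instrumented(my_work, totals)), where my_work is your own function. Time spent blocked counts toward wall time but not CPU time, so a thread that mostly waits reports a low ratio.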