How can I measure how my multithreaded code scales (speedup)?

What would be the best way to measure the speedup of my program, assuming I only have 4 cores? Obviously I could measure it up to 4; however, it would be nice to know for 8, 16, and so on.

Ideally I'd like to know the amount of speedup per number of threads, similar to this graph:

Amdahl's law diagram

Is there any way I can do this? Perhaps a method of simulating multiple cores?

asked Mar 9 '12 at 22:03

+1 for visuals. Short answer: you can't, aside from making educated guesses. - Mysticial

@Mysticial But shouldn't you be able to measure with a tool like Intel's VTune? - ConradFrix

@ConradFrix Not when you're trying to guess the performance on 16 cores that you don't have. You can, on the other hand, use VTune to profile the performance on 4 cores and, based on those numbers, attempt to extrapolate to 16 cores. That would be, more or less, an "educated guess". - Mysticial
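One way to make such an educated guess concrete is to fit Amdahl's law to the speedup you can measure on your 4 cores and extrapolate, keeping in mind that this ignores the hardware bottlenecks discussed in the answers below. A minimal C++ sketch (the 3.2x measurement is hypothetical):

    #include <cstdio>

    // Amdahl's law: speedup on p cores when a fraction f of the work is
    // parallelizable is S(p) = 1 / ((1 - f) + f / p).
    double amdahl_speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    // Inverting Amdahl's law: estimate f from a speedup s measured on p
    // cores, f = (1 - 1/s) / (1 - 1/p).
    double estimate_parallel_fraction(double s, int p) {
        return (1.0 - 1.0 / s) / (1.0 - 1.0 / p);
    }

    int main() {
        // Hypothetical measurement: 3.2x speedup on the 4 cores you do have.
        double f = estimate_parallel_fraction(3.2, 4);
        std::printf("estimated parallel fraction: %.3f\n", f);
        for (int p : {8, 16, 32})
            std::printf("predicted speedup on %2d cores: %.2fx\n",
                        p, amdahl_speedup(f, p));
    }

This is still only a best-case guess: memory bandwidth and the other factors mentioned below usually pull the real curve under the Amdahl prediction.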

5 Answers

I'm sorry, but in my opinion the only reliable measurement is to actually get a machine with 8, 16, or more cores and test on that.

Memory bandwidth saturation, number of CPU functional units and other hardware bottlenecks can have a huge impact on scalability. I know from personal experience that if a program scales on 2 cores and on 4 cores, it might dramatically slow down when run on 8 cores, simply because it's not enough to have 8 cores to be able to scale 8x.

You could try to predict what will happen, but there are a lot of factors that need to be taken into account:

  1. caches - size, number of layers, shared / non-shared
  2. memory bandwidth
  3. number of cores vs. number of processors i.e. is it an 8-core machine or a dual-quad-core machine
  4. interconnection between cores - a lower number of cores (2, 4) can still work reasonably well with a bus, but for 8 or more cores a more sophisticated interconnection is needed.
  5. memory access - again, a lower number of cores works well with the SMP (symmetric multiprocessing) model, while a higher number of cores needs a NUMA (non-uniform memory access) model.

answered Mar 10 '12 at 11:03

I don't think there is a real way to do this either, but one thing that comes to mind is that you could use a virtual machine to simulate more cores. In VirtualBox, for example, you can select up to 16 cores from the standard menu, but I am fairly confident there are hacks that can go beyond that, and other virtual machines like VMware might even support more out of the box.

VirtualBox processor settings screenshot

answered Mar 9 '12 at 23:03

How can VirtualBox simulate more cores? - CMCDragonkai

@CMCDragonkai Well, it's virtualization. It can tell the guest operating system whatever it wants. - Stephan Dollberg

Does it then schedule those simulated cores onto the real physical cores? So if I have 4 cores, I can then create 100 simulated cores using VirtualBox? I didn't know about such a capability! - CMCDragonkai

@CMCDragonkai yeah, it somehow has to schedule them. Whether this idea works actually depends on how this scheduling works. Maybe just try it out and see. - Stephan Dollberg

bamboon and doron are correct that many variables are at play, but if you have a tunable input size n, you can figure out the strong scaling and weak scaling of your code.

Strong scaling refers to fixing the problem size (e.g. n = 1M) and varying the number of threads available for computation. Weak scaling refers to fixing the problem size per thread (e.g. n = 10k/thread) and varying the number of threads available for computation.

It's true there are a lot of variables at work in any program; however, if you have some basic input size n, it's possible to get some semblance of scaling. On an n-body simulator I developed a few years back, I varied both the number of threads for a fixed input size and the input size per thread, and was able to reasonably calculate a rough measure of how well the multithreaded code scaled.
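As a rough illustration of how such a measurement can be set up, here is a minimal C++ sketch; work() is just a placeholder kernel standing in for your real computation, and the sizes are arbitrary:

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Placeholder kernel: stands in for the real per-thread computation.
    void work(long n) {
        volatile double acc = 0.0;
        for (long i = 0; i < n; ++i) acc = acc + i * 0.5;
    }

    // Run `threads` workers, each doing `n_per_thread` units of work;
    // return elapsed wall time in seconds.
    double run(int threads, long n_per_thread) {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (int i = 0; i < threads; ++i) pool.emplace_back(work, n_per_thread);
        for (auto& t : pool) t.join();
        return std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        const long total = 100000000;  // fixed total problem size (strong scaling)
        double base = run(1, total);
        for (int p : {1, 2, 4}) {
            double strong = run(p, total / p);  // same total work, split p ways
            double weak = run(p, total);        // same work *per thread*
            std::printf("p=%d  strong speedup=%.2fx  weak efficiency=%.2f\n",
                        p, base / strong, base / weak);
        }
    }

Plotting the strong speedup against the thread count gives the kind of curve shown in the question, and the weak-scaling efficiency (ideally close to 1.0) tells you whether per-thread throughput holds up as threads are added.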

Since you only have 4 cores, you can only feasibly compute the scaling up to 4 threads. This severely limits your ability to see how well it scales to heavily threaded loads, but this may not be an issue if your application is only used on machines with small core counts.

You really need to ask yourself the question: Is this going to be used on 10, 20, 40+ threads? If it is, the only way to accurately determine scaling to those regimes is to actually benchmark it on a platform where you have that hardware available.


Side note: Depending on your application, it may not matter that you only have 4 cores. Some workloads scale with an increasing number of threads regardless of the real number of cores available, if many of those threads spend time "waiting" for something to happen (e.g. web servers). If you're doing pure computation, though, this won't be the case.

answered Mar 10 '12 at 02:03

I think Amdahl's law only makes sense for tasks consuming CPU time. - André Carón

I don't believe this is possible, since there are too many variables to accurately extrapolate performance, even assuming you are 100% parallel. There are other factors, like bus speed and cache misses, that might limit your performance, not to mention peripheral performance. How all of these factors affect your code can only be determined through measuring on your specific hardware platform.

answered Mar 10 '12 at 01:03

I take it you are asking about measurement, so I won't address the issue of predicting the effect on higher numbers of cores.

This question can be viewed another way: how busy can you keep each thread, and what does that total up to? So six threads, each running at say 50% utilization, means you have the equivalent of 3 processors running. Dividing that by, say, four processors means that your methods are achieving 75% utilization. Comparing that utilization against the wall-clock speedup you actually measure tells you how much of your utilization is new overhead and how much is real speedup. Isn't that what you are really interested in?

The processor utilization can be computed in real time in a couple of different ways. Threads can independently ask the system for their thread times, compute ratios, and maintain global totals. If you have total control over your blocking states, you don't even need the system calls, because you can just keep track of the ratio of blocking to non-blocking machine cycles for computing utilization. A real-time multithreading instrumentation package I developed uses such methods, and they work well. The CPU clock counter in newer CPUs can be read in under 20 machine cycles.
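On POSIX systems, for instance, each thread can query its own CPU time with clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...). The C++ sketch below (a generic illustration, not the instrumentation package mentioned above; the busy-work loop is a placeholder) totals per-thread CPU time and compares it to wall time to get the "equivalent processors" figure:

    #include <chrono>
    #include <cstdio>
    #include <ctime>    // clock_gettime, CLOCK_THREAD_CPUTIME_ID (POSIX)
    #include <thread>
    #include <vector>

    // CPU time consumed so far by the calling thread, in seconds.
    double thread_cpu_seconds() {
        timespec ts;
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main() {
        const int nthreads = 6;
        std::vector<double> cpu(nthreads, 0.0);

        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (int i = 0; i < nthreads; ++i) {
            pool.emplace_back([&cpu, i] {
                volatile double acc = 0.0;                 // placeholder busy work
                for (long j = 0; j < 50000000; ++j) acc = acc + j;
                cpu[i] = thread_cpu_seconds();             // record own CPU time
            });
        }
        for (auto& t : pool) t.join();
        double wall = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();

        double total_cpu = 0.0;
        for (double c : cpu) total_cpu += c;
        // total_cpu / wall = how many "equivalent processors" were kept busy;
        // dividing by the 4 physical cores gives the utilization figure above.
        std::printf("equivalent processors: %.2f  utilization on 4 cores: %.0f%%\n",
                    total_cpu / wall, 100.0 * total_cpu / wall / 4.0);
    }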

answered Mar 11 '12 at 09:03
