Performance of splitting one loop into two loops

Good morning,

Suppose that you have a simple for loop like below...

for(int i=0;i<10;i++)
{
    //statement 1
    //statement 2
}

Assume that statement 1 and statement 2 are O(1). Besides the small overhead of "starting" another loop, would breaking that for loop down into two (not nested, but sequential) loops be equally fast? For example...

for(int i=0;i<10;i++)
{
    //statement 1
}
for(int i=0;i<10;i++)
{
    //statement 2
}

The reason I ask such a silly question is that I have a Collision Detection System (CDS) that has to loop through all the objects. I want to "compartmentalize" the functionality of my CDS so I can simply call

cds.update(objectlist);

instead of having to break my CDS system up. (Don't worry too much about my CDS implementation... I think I know what I am doing, I just don't know how to explain it. What I really need to know is whether I take a huge performance hit by looping through all my objects again.)

asked Mar 9 '12 at 13:03

6 Answers

In terms of algorithmic complexity, splitting the loops makes no difference.

In terms of real-world performance, splitting the loops could improve performance, worsen it, or make no difference - it depends on the OS, the hardware and - of course - what statement 1 and statement 2 are.

answered Mar 9 '12 at 13:03

It depends on your application.

Possible drawbacks (of splitting):

  • your data does not fit into the L1 data cache, therefore you load it once for the first loop and then reload it for the second loop

Possible gains (of splitting):

  • your loop contains many variables; splitting helps reduce register/stack pressure, and the optimizer turns it into better machine code
  • the functions you call thrash the L1 instruction cache, so the cache is reloaded on each iteration, whereas by splitting you manage to load it only once, at the first iteration of each loop

These lists are certainly not comprehensive, but you can already sense the tension between code and data. So it is difficult for us to make even an educated guess when we know neither.

When in doubt: profile. Use callgrind, check the cache misses in each case, check the number of instructions executed, and measure the time spent.

answered Mar 9 '12 at 14:03

With two loops you will be paying for:

  • increased generated code size
  • twice as many branch predictions
  • depending on the data layout of statement 1 and statement 2, you could be reloading data into the cache

The last point could have a huge impact in either direction. As with any performance optimization, you should measure.

answered Mar 9 '12 at 13:03

Your third point is likely the most important. It comes down to whether your data fits within the first-level CPU cache or not. If the combined data of both loops fits in the cache, splitting is unlikely to help; but if the combined data is too big for the cache while each half alone is small enough, the gains could be substantial. - edA-qa mort-ora-y

As far as big-O complexity is concerned, this makes no difference: if one loop is O(n), then so is the two-loop solution.
As far as micro-optimization goes, it is hard to say. The cost of a loop is rather small; we don't know what the cost of accessing your objects is (if they are in a vector, it should be rather small too), but there is a lot to consider before giving a useful answer.

answered Mar 9 '12 at 13:03

As noted, the complexity remains.

But in the real world, it is impossible for us to predict which version runs faster. The following factors all play roles, huge ones:

  • Data caching
  • Instruction caching
  • Speculative execution
  • Branch prediction
  • Branch target buffers
  • Number of available registers on the CPU
  • Cache sizes

(Note: hanging over all of them is the sword of Damocles of misprediction; all of these terms are wikipedizable and googlable.)

The last factor especially makes it sometimes impossible to compile the one true binary for code whose performance relies on specific cache sizes. Some applications will run faster on CPUs with huge caches while running slower on small caches, and for some other applications it will be the opposite.

Solutions:

  • Let your compiler do the job of loop transformation. Modern g++ versions are quite good in that discipline. Another discipline g++ is good at is automatic vectorization. Be aware that compilers know more about computer architecture than almost all people.
  • Ship different binaries and a dispatcher.
  • Use cache-oblivious data structures/layouts and algorithms that adapt to the target cache.

It is always a good idea to strive for software that adapts to the target, ideally without sacrificing code quality. And before doing manual optimization, whether microscopic or macroscopic, measure real-world runs; then, and only then, optimize.

Literature:

  • Agner Fog's Guides
  • Intel's Guides

answered Mar 16 '12 at 16:03

You're correct in noting that there will be some performance overhead from creating a second loop. Therefore, it cannot be "equally fast": that overhead, while small, is still overhead.

I won't try to speak intelligently about how collision systems should be built, but if you're trying to optimize performance it's better to avoid building unnecessary control structures if you can manage it without pulling your hair out.

Remember that premature optimization is one of the worst things you can do. Worry about optimization when you have a performance problem, in my opinion.

answered Mar 9 '12 at 13:03

As stefaanv noted, the cost of looping through all your objects a second time is indeterminate with the information you've given. - patricio

I would also note that the two control structures you posted solve different problems, and thus are not easily compared in the context of performance. - patricio

Without knowing more details, and without actual measurement, it is impossible to say which version is faster. Caching, both data and instruction, as well as branch prediction (and prediction tables) and speculative execution add a lot of complexity to today's optimization. Good point, though, on premature optimization: measure first in the real world, then optimize. - Sebastián Mach
