Efficient partial reductions given arrays of elements, offsets and lengths of sublists

For my application I have to handle a bunch of objects (let's say ints) that get subsequently divided and sorted into smaller buckets. To this end, I store the elements in a single contiguous array

arr = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14...}

and the information about the buckets (sublists) is given by offsets to the first element in the respective bucket and the lengths of the sublists.

So, for instance, given

offsets = {0,3,8,..}
sublist_lengths = {3,5,2,...}

would result in the following splits:

0 1 2 || 3 4 5 6 7 || 8 9 || ...

What I am looking for is a somewhat general and efficient way to run algorithms, like reductions, on the buckets only using either custom kernels or the thrust library. Summing the buckets should give:

3 || 25 || 17 || ...

What I came up with:

  • option 1: custom kernels require quite a bit of tinkering: copies into shared memory, a proper choice of block and grid sizes, and my own implementation of algorithms like scan, reduce, etc. Also, every single operation would require its own custom kernel. In general it is clear to me how to do this, but after having used thrust for the last couple of days I have the impression that there might be a smarter way

  • option 2: generate an array of keys from the offsets ({0,0,0,1,1,1,1,1,2,2,3,...} in the above example) and use thrust::reduce_by_key. I don't like the extra list generation, though.

  • option 3: use thrust::transform_iterator together with thrust::counting_iterator to generate the above key list on the fly. Unfortunately, I can't come up with an implementation that doesn't require incrementing an index into the offset list on the device, which defeats parallelism.

What would be the most sane way to implement this?

asked Feb 1 '12 at 4:02

2 Answers

Within Thrust, I can't think of a better solution than Option 2. The performance will not be terrible, but it's certainly not optimal.

Your data structure bears similarity to the Compressed Sparse Row (CSR) format for storing sparse matrices, so you could use techniques developed for computing sparse matrix-vector multiplies (SpMV) for such matrices if you want better performance. Note that the "offsets" array of the CSR format has length (N+1) for a matrix with N rows (i.e. buckets in your case), where the last offset value is the length of arr. The CSR SpMV code in Cusp is a bit convoluted, but it serves as a good starting point for your kernel. Simply remove any reference to Aj or x from the code and pass offsets and arr in the Ap and Av arguments, respectively.

answered Feb 1 '12 at 10:02

The similarity to compressed sparse row matrices struck me too. - garras

You didn't mention how big the buckets are. If the buckets are big enough, maybe you can get away with copying the offsets and sublist_lengths to the host, iterating over them and doing a separate Thrust call for each bucket. Fermi can have 16 kernels in flight at the same time, so on that architecture you might be able to handle smaller buckets and still get good utilization.

answered Feb 3 '12 at 4:02

Thank you for your answer. I am going to settle on a relatively small fixed bucket size such that every bucket is processed within a single block using shared memory. Could you point me to literature about the limitations of launching multiple concurrent kernels? Thanks! - bbtrb
