Algorithms to represent a set of integers with only one integer

This may not be a programming question, but it's a problem that arose recently at work. Some background: a big C development effort with a special interest in performance.

I have a set of integers and want to test the membership of another given integer. I would love to implement an algorithm that can check it with a minimal set of algebraic functions, using only a single integer to represent the whole space of integers contained in the first set.

I've tried a composite Cantor pairing function, for instance, but with a 30-element set it seems too complicated, and with a focus on performance it makes no sense. I played with some operations, like XORing and negating, but they give me poor estimates of membership. Then I tried successions of additions and finally got lost.

Any ideas?

asked Aug 28 '12 at 10:08

What is the range of the integers? -

Note that a 32-bit integer can represent a set drawn from a range of at most 32 possible elements. The reason is that if you have k>32 possible elements, there are 2^k > 2^32 possible sets, and by the pigeonhole principle two sets will be mapped to the same integer. -
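To illustrate the comment above, here is a minimal sketch of the one case where a single integer really can hold the whole set: a domain limited to the values 0..31. The names `small_set`, `small_set_add` and `small_set_contains` are made up for this example, and the restriction to 0..31 is an assumption that does not hold for the 30-bit values in the question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type: a subset of {0, ..., 31} packed into one 32-bit word.
 * Bit x is set exactly when x is a member. */
typedef uint32_t small_set;

/* Return the set s with x added. Only values below 32 can be represented. */
static small_set small_set_add(small_set s, unsigned x)
{
    assert(x < 32);
    return s | (UINT32_C(1) << x);
}

/* Membership test: one shift and one mask. Values outside the domain
 * are never members. */
static bool small_set_contains(small_set s, unsigned x)
{
    return x < 32 && ((s >> x) & 1u) != 0;
}
```

Starting from `small_set s = 0;`, adding 3 and 30 makes `small_set_contains(s, 3)` true and `small_set_contains(s, 4)` false. With a domain of 2^30 possible values the same trick would need 2^30 bits, which is the pigeonhole argument in practice.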

Sorry for not making the domain clear: they're 30-bit unsigned integers. -

3 Answers

For sets of unsigned long of size 30, the following is one fairly obvious way to do it:

  • store each set as a sorted array, 30 * sizeof(unsigned long) bytes per set.
  • to look up an integer, do a few steps of a binary search, followed by a linear search (profile in order to figure out how many steps of binary search is best - my wild guess is 2 steps, but you might find out different, and of course if you test bsearch and it's fast enough, you can just use it).
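The two steps above could be sketched roughly as follows. `SET_SIZE`, `BINARY_STEPS` and `set_contains` are names invented for this example, and the value 2 for `BINARY_STEPS` is only the wild guess from the text, to be tuned by profiling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SET_SIZE     30  /* elements per set, as in the question */
#define BINARY_STEPS 2   /* assumed starting point; profile to tune */

/* `sorted` must hold SET_SIZE values in ascending order. */
static bool set_contains(const unsigned long sorted[SET_SIZE], unsigned long x)
{
    assert(sorted != NULL);

    size_t lo = 0, hi = SET_SIZE;   /* candidates live in sorted[lo..hi) */

    /* A few halving steps to narrow the range. */
    for (int step = 0; step < BINARY_STEPS && lo < hi; ++step) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] == x)
            return true;
        if (sorted[mid] < x)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* Linear scan over what is left; stop early once we pass x. */
    for (size_t i = lo; i < hi; ++i) {
        if (sorted[i] == x)
            return true;
        if (sorted[i] > x)
            break;
    }
    return false;
}
```

With two halving steps the linear scan covers at most 15 / 2, so about 7 elements, which should sit comfortably in one or two cache lines.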

So the next question is why you want a big-maths solution, which will tell me what's wrong with this solution other than "it is insufficiently pleasing".

I suspect that any big-math solution will be slower than this. A single arithmetic operation on an N-digit number takes at least linear time in N. A single number to represent a set can't be very much smaller than the elements of the set laid end to end with a separator in between. So even a linear search in the set is about as fast as a single arithmetic operation on a big number. With the possible exception of a Goedel representation, which could do it in one division once you've found the nth prime number, any clever mathematical representation of sets is going to take multiple arithmetic operations to establish membership.
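For what it's worth, here is a toy sketch of the Gödel-style encoding mentioned above, shrunk to a domain of 15 possible elements so that the product of primes still fits in a `uint64_t`. All names (`godel_primes`, `godel_add`, `godel_contains`) and the 15-element limit are assumptions made for the example. For the 30-bit values in the question the encoded number would be astronomically large, which is the point of the paragraph.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GODEL_DOMAIN 15  /* 2*3*...*47 is the largest such product below 2^64 */

/* Element x of the domain {0, ..., 14} is represented by the x-th prime. */
static const uint64_t godel_primes[GODEL_DOMAIN] = {
    2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47
};

/* The empty set is encoded as 1. Adding x multiplies in its prime once. */
static uint64_t godel_add(uint64_t code, unsigned x)
{
    assert(x < GODEL_DOMAIN);
    if (code % godel_primes[x] != 0)
        code *= godel_primes[x];
    return code;
}

/* Membership is a single divisibility test. */
static bool godel_contains(uint64_t code, unsigned x)
{
    return x < GODEL_DOMAIN && code % godel_primes[x] == 0;
}
```

Even in this tiny form the encoding spends 64 bits on what a 15-bit mask would store, and the division is no faster than a bit test.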

Note also that there are two different reasons you might care about the performance of "look up an integer in a set":

  • You are looking up lots of different integers in a single set, in which case you might be able to go faster by constructing a custom lookup function for that data. Of course in C that means you need either (a) a simple virtual machine to execute that "function", or (b) runtime code generation, or (c) to know the set at compile time. None of which is necessarily easy.
  • You are looking up the same integer in lots of different sets (to get a sequence of all the sets it belongs to), in which case you might benefit from a combined representation of all the sets you care about, rather than considering each set separately.
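One possible shape for the "combined representation" in the second bullet, purely as a sketch: a single table sorted by value, where each entry carries a bitmask of the sets that contain that value. The `struct membership` layout, the function names and the limit of 64 sets are assumptions for illustration; only `bsearch` is a real library function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical combined entry: bit i of `sets` is set when set number i
 * contains `value`. This layout handles at most 64 sets. */
struct membership {
    uint32_t value;
    uint64_t sets;
};

/* bsearch comparator: the first argument is the key, the second an element. */
static int cmp_membership(const void *key, const void *elem)
{
    uint32_t k = *(const uint32_t *)key;
    uint32_t v = ((const struct membership *)elem)->value;
    return (k > v) - (k < v);
}

/* `table` must be sorted by `value` with no duplicate values.
 * Returns the bitmask of sets containing x, or 0 if no set does. */
static uint64_t sets_containing(const struct membership *table, size_t n,
                                uint32_t x)
{
    assert(table != NULL || n == 0);
    const struct membership *hit =
        bsearch(&x, table, n, sizeof *table, cmp_membership);
    return hit ? hit->sets : 0;
}
```

One lookup then answers the membership question for every set at once, instead of searching each set separately.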

I suppose that very occasionally, you might be looking up lots of different integers, each in a different set, and so neither of the reasons applies. If this is one of them, you can ignore that stuff.

answered Aug 28 '12 at 11:08

Maybe you're right and I'm in a lost quest for a "pleasant" math solution, but I'll sleep like a baby if I find one ;) - manuel abeledo

It took me a few hours but you finally convinced me. - manuel abeledo

One good start is to try Bloom filters. Basically, a Bloom filter is a probabilistic data structure that gives you no false negatives, but some false positives. So when an integer matches a Bloom filter, you then have to check whether it really is in the set, but it's a big speedup because it greatly reduces the number of sets to check.
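A very small sketch of that idea, assuming a 64-bit filter per set and two multiplicative hashes. The function names and both multiplier constants are arbitrary choices for this example, not anything from the question, and a real filter for 30 elements would probably want more than 64 bits to keep the false-positive rate useful.

```c
#include <stdbool.h>
#include <stdint.h>

/* Map x to a bit position 0..63 using the top 6 bits of a 32-bit
 * multiplicative hash. The multiplier is an arbitrary odd constant. */
static unsigned bloom_hash(uint32_t x, uint32_t multiplier)
{
    return (unsigned)((uint32_t)(x * multiplier) >> 26);
}

/* The (up to) two filter bits that belong to x. */
static uint64_t bloom_mask(uint32_t x)
{
    return (UINT64_C(1) << bloom_hash(x, 2654435761u))
         | (UINT64_C(1) << bloom_hash(x, 0x85EBCA6Bu));
}

/* Return the filter with x recorded in it. */
static uint64_t bloom_add(uint64_t filter, uint32_t x)
{
    return filter | bloom_mask(x);
}

/* false: x is definitely not in the set.
 * true:  x may be in the set, so the exact check is still required. */
static bool bloom_maybe_contains(uint64_t filter, uint32_t x)
{
    uint64_t m = bloom_mask(x);
    return (filter & m) == m;
}
```

A lookup would call `bloom_maybe_contains` first and only search the real set when it returns true.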

answered Aug 28 '12 at 10:08

Bloom filters use multiple bitsets, I am not sure it fits here. - amit

That's a method to test membership of an element in a set. I'm not sure I understood the whole problem, but it seems that's a membership problem :-) - Scharron

Strictly speaking, a Bloom filter is a way to speed up membership testing; you still need to implement the fallback test :-) - Steve Jessop

@SteveJessop Right :-) (that's in my answer ;-) ) - Scharron

From personal experience, Bloom filters are usually good for a huge/infinite range of elements, and require much more than 32 bits to be effective. If anyone has ever encountered something different, I'll be glad to hear about it. - amit

If I've understood you correctly, here is a Python example:

>>> a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
>>> len_a = len(a)
>>> b = [1]
>>> if len(set(a) - set(b)) < len_a:
...     print('this integer exists in set')
...
this integer exists in set

math base: http://en.wikipedia.org/wiki/Euler_diagram

answered Aug 28 '12 at 10:08

This would only work with constant successions like [Xi | Xi = Xi-1 + 1]. I'm looking for something that looks in sets like [30, 40, 120] - manuel abeledo

It does work on all sets: len(set(a) - set([1])) < len(a) is just a fun way of saying 1 in set(a), because if 1 is not in a, then set(a) - set(b) == set(a) and hence has the same length as a. Introducing the set difference appears to me to be a way to satisfy the requirement in the question, "must have spurious mathematical basis" ;-) - Steve Jessop

@SteveJessop I wrote this statement (len(....<) instead of in because SO readers may not understand Python. - Dmitri Zagorulkin

Ok, I see it now, sorry. Yes, but simplifying the code does not make it suitable for the task I've in mind :) - manuel abeledo
