Good hash function to remove duplicates from an array

Can anyone suggest a good hash function for removing duplicates from an array while keeping memory consumption moderate? Note that I am already using a hash map solution for this, but I want a good hash function; otherwise the memory consumption depends on the largest element of the array.

It's an array of integers.

asked Mar 10 '12 at 14:03

C or C++? It surely makes a difference. Which hash table implementation do you use?

It's hard to answer this without any information about the keys of the hash. Strings, integers?

What types do you have in the array, and what is the range of values?

What kind of integers are they? What is their range? Can you tell us anything else about them?

@AbdulSamad: you are mistaken about something. The goal of a hash function is to map an input to an integer within a specified range. However, the hash itself is not used raw; instead it serves as a hint for the hash table, whose size depends only on the number of elements it holds.

3 Answers

Your question lacks details, so I'll just make them up.

Hashing an integer is usually useless. An integer is its own hash.

What matters most is the size of the integer (how many bits), the number of distinct elements (so that we know how much the side table will grow) and the number of elements in the array (to estimate how many operations it will take).

The simplest way to eliminate duplicates is usually to sort and then drop adjacent repeats. In Unix:

cat list | sort -u

In C++, this can be achieved with the <algorithm> header:

// v is a std::vector<int>; requires <algorithm> and <vector>
std::sort(v.begin(), v.end());
v.erase(std::unique(v.begin(), v.end()), v.end());

However, this obviously sorts the array, which may not be desirable. In that case, you can always use a side table.

  • If the range of the integers is small (say, all in [0, 65536)), then just use a regular table with the integers as indexes. A bitset with one bit per possible value is enough to record which values have already been seen.
  • If the range grows, things depend more on how sparse the range is.
    • For a sparse range, a hash table can indeed be a good approach.
    • However, for a densely populated range (e.g., very few duplicates and a large number of elements), the hash table will grow enormously and might become too big; in that case a Bloom filter (i.e., a probabilistic approach) might work better.
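As a rough sketch of the two side-table variants above (the function names `dedup_small_range` and `dedup_hashed` are made up for illustration, and the element types are assumptions): the first uses a `std::bitset` for values known to lie in [0, 65536), the second uses `std::unordered_set` for a wider, sparse range. Both keep the first occurrence of each value and preserve the original order.

```cpp
#include <bitset>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Small range: assumes every value lies in [0, 65536), which the
// uint16_t element type guarantees. One bit per possible value
// (8 KiB in total) records what has already been seen.
std::vector<std::uint16_t> dedup_small_range(const std::vector<std::uint16_t>& in) {
    std::bitset<65536> seen;
    std::vector<std::uint16_t> out;
    for (std::uint16_t x : in) {
        if (!seen[x]) {
            seen[x] = true;
            out.push_back(x);
        }
    }
    return out;
}

// Larger, sparse range: memory grows with the number of distinct
// values, not with the largest value in the array.
std::vector<int> dedup_hashed(const std::vector<int>& in) {
    std::unordered_set<int> seen;
    std::vector<int> out;
    for (int x : in) {
        // insert() returns {iterator, bool}; the bool is true only
        // when x was not already in the set.
        if (seen.insert(x).second) {
            out.push_back(x);
        }
    }
    return out;
}
```

In both cases the side table is only consulted for membership, so the input array itself is never reordered.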

answered Mar 10 '12 at 14:03

There is very little point in hashing an integer, since it is already small enough to compare directly. You can sort the array and then easily remove consecutive elements that are equal. If you really want to hash them, just take, for example, the first two bytes into a short and use that as your hash.
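One possible reading of "first two bytes into a short", assuming 32-bit unsigned keys and taking "first two bytes" to mean the low-order 16 bits (which is what sits first in memory on a little-endian machine); `hash16` is a hypothetical name:

```cpp
#include <cstdint>

// Keeps only the low-order 16 bits of the key.
// Assumption: keys are 32-bit unsigned integers.
inline std::uint16_t hash16(std::uint32_t x) {
    return static_cast<std::uint16_t>(x & 0xFFFFu);
}
```

Keys that differ only in their upper 16 bits collide under this scheme, so the table still has to compare the full keys.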

answered Mar 10 '12 at 14:03

You could use the MAD (multiply, add and divide) method, which helps eliminate repeated patterns in a set of integer keys:

h(k) = |ak + b| mod N,

where N is a prime number, and a and b are non-negative integers randomly chosen so that a mod N != 0. But you still need to deal with collisions.
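A minimal sketch of that formula, assuming 32-bit signed keys and a prime N that fits in 31 bits so the intermediate product stays within a 64-bit integer. The names `MadHash` and `make_mad_hash` are invented here, and the choice of random engine is only one option:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>

// h(k) = |a*k + b| mod n
// Assumptions: n is prime and n < 2^31, 0 < a < n, 0 <= b < n.
// Under these bounds |a*k + b| < 2^63, so int64_t does not overflow.
struct MadHash {
    std::int64_t a;
    std::int64_t b;
    std::int64_t n;

    std::size_t operator()(std::int32_t k) const {
        std::int64_t v = a * static_cast<std::int64_t>(k) + b;
        if (v < 0) {
            v = -v;  // absolute value, as in the formula
        }
        return static_cast<std::size_t>(v % n);
    }
};

// Draws a from [1, n-1] (so a mod n != 0) and b from [0, n-1].
inline MadHash make_mad_hash(std::int64_t n, std::mt19937_64& rng) {
    std::uniform_int_distribution<std::int64_t> pick_a(1, n - 1);
    std::uniform_int_distribution<std::int64_t> pick_b(0, n - 1);
    return MadHash{pick_a(rng), pick_b(rng), n};
}
```

For example, with a = 3, b = 4 and N = 101, the key 10 maps to 34 and the key 100 maps to 304 mod 101 = 1. Two different keys can still land in the same bucket, so the table has to resolve collisions by comparing the keys themselves.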

answered Mar 23 '12 at 14:03
