Expected number of hash collisions

I feel like I'm way overthinking this problem, but here goes anyway...

I have a hash table with M slots in its internal array. I need to insert N elements into the hash table. Assuming that I have a hash function that places each element into a slot uniformly at random (every slot equally likely), what's the expected value of the total number of hash collisions?

(Sorry that this is more of a math question than a programming question).

Edit: Here's some code I have to simulate it using Python. I'm getting numerical answers, but having trouble generalizing it to a formula and explaining it.

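import random

N = 5  # number of elements to insert
M = 8  # number of slots in the table

NUM_ITER = 100000

def get_collisions(table):
    # Every element beyond the first in a slot counts as one collision.
    col = 0
    for item in table:
        if item > 1:
            col += (item - 1)
    return col

def run():
    # table[i] holds the number of elements that landed in slot i.
    table = [0] * M

    for i in range(N):
        table[random.randrange(M)] += 1

    return get_collisions(table)

# Main

total = 0
for i in range(NUM_ITER):
    total += run()

# Average number of collisions per run.
print(total / NUM_ITER)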

asked Feb 1, 2012 at 22:02

How do you want "triplet" collisions measured? -

Whatever makes the most sense I guess. So I'll go with counting it as two collisions (one per new element added after the first) -

The best measure appears to be the amount of work to retrieve all items, which is SUM(x * (x+1) / 2), where x is the number of items in a bucket and the sum is over all buckets. -

2 Answers

You'll find the answer here: Quora.com. The expected number of collisions for m buckets and n inserts is

n - m * (1 - ((m-1)/m)^n).
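
The formula can be read as n minus the expected number of occupied buckets: a given bucket stays empty with probability ((m-1)/m)^n, so by linearity of expectation m * (1 - ((m-1)/m)^n) buckets are occupied on average, and every insert beyond the first in a bucket counts as one collision. A minimal sketch to check this against a simulation follows; the function names, the trial count, and the fixed seed are illustrative assumptions, not part of the original answer.

```python
import random

def expected_collisions(n, m):
    """Closed form: n inserts minus the expected number of occupied buckets."""
    return n - m * (1 - ((m - 1) / m) ** n)

def simulate_collisions(n, m, trials=100_000, seed=0):
    """Monte Carlo estimate; seed is fixed only to make runs repeatable."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Distinct buckets hit by the n inserts in this trial.
        occupied = {rng.randrange(m) for _ in range(n)}
        # Each insert that did not open a new bucket is one collision.
        total += n - len(occupied)
    return total / trials

print(expected_collisions(5, 8))   # closed form for the question's N=5, M=8
print(simulate_collisions(5, 8))   # should land close to the closed form
```

For the question's N = 5 and M = 8 the closed form gives about 1.1033, which is the kind of value the simulation in the question should be converging to.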

answered Jul 6, 2012 at 16:07

There is also a proof of this on the Math StackExchange. - ShreevatsaR

Answer should include the proof. - MVTC

Is there a table available for generic m values (such as 2^32)? - Cian

IMHO, the number of collisions is not the same as the number of elements sharing a bucket/slot. In the context of the birthday paradox, if 4 people share the same birthday, then the answer to the latter question (number of people sharing a birthday) would be 4. However, for the former question, the number of birthday collisions is often considered to be 4-1=3. The rationale is that without any three of the four people, there is no collision. The difference is minor, yet worth noting so as not to get confused. - KGhatak

Is there a way to show the variance of the number of collisions? - zyxue

The formula for the SUM(x*(x+1)/2) metric can be found here, and the expected value appears to be (n/(2m)) * (n + 2m - 1).
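
The expected value above can be checked numerically with a short sketch like the one below. It assumes uniformly random bucket assignment, as in the question; the function names, the trial count, and the fixed seed are illustrative assumptions.

```python
import random

def expected_retrieval_work(n, m):
    """Closed form for E[SUM(x*(x+1)/2)] over all buckets: (n/(2m)) * (n + 2m - 1)."""
    return n * (n + 2 * m - 1) / (2 * m)

def simulate_retrieval_work(n, m, trials=100_000, seed=0):
    """Monte Carlo estimate; seed is fixed only to make runs repeatable."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = [0] * m
        for _ in range(n):
            counts[rng.randrange(m)] += 1
        # A bucket holding x items costs 1 + 2 + ... + x probes to read them all.
        total += sum(x * (x + 1) // 2 for x in counts)
    return total / trials

print(expected_retrieval_work(5, 8))   # closed form for n=5, m=8
print(simulate_retrieval_work(5, 8))   # should land close to the closed form
```

With n = 5 and m = 8 the closed form evaluates to 6.25.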

Don't know about the variance, IANAM.

answered Sep 28, 2015 at 13:09
