Efficiently compute pairwise squared Euclidean distance in Matlab

Given two sets of d-dimensional points, how can I most efficiently compute the pairwise squared Euclidean distance matrix in Matlab?

Notation: Set one is given by a (numA,d)-matrix A and set two is given by a (numB,d)-matrix B. The resulting distance matrix shall have size (numA,numB).

Example points:

d = 4;            % dimension
numA = 100;       % number of set 1 points
numB = 200;       % number of set 2 points
A = rand(numA,d); % set 1 given as matrix A
B = rand(numB,d); % set 2 given as matrix B

asked May 28 '14 at 13:05

Have you looked at the pdist2 function? mathworks.com/help/stats/pdist2.html -

@rayryeng yes, take a look at my evaluation part in my answer please :) -

2 Answers

The usually given answer here is based on bsxfun (cf. e.g. [1]). My proposed approach is based on matrix multiplication and turns out to be much faster than any comparable algorithm I could find:

helpA = zeros(numA,3*d);
helpB = zeros(numB,3*d);
for idx = 1:d
    helpA(:,3*idx-2:3*idx) = [ones(numA,1), -2*A(:,idx), A(:,idx).^2 ];
    helpB(:,3*idx-2:3*idx) = [B(:,idx).^2 ,    B(:,idx), ones(numB,1)];
end
distMat = helpA * helpB';
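Why this works: for each dimension k, the three columns contribute 1*b_k^2 + (-2*a_k)*b_k + a_k^2*1 = (a_k - b_k)^2, and the matrix product sums these terms over all k. The sketch below checks this against a brute-force double loop. The small sizes and the names refMat and maxErr are chosen only for illustration and are not part of the benchmark.

```matlab
% Small illustrative sizes (assumed for this check only)
d = 3; numA = 5; numB = 7;
A = rand(numA,d);
B = rand(numB,d);

% Matrix-multiplication method as proposed above
helpA = zeros(numA,3*d);
helpB = zeros(numB,3*d);
for idx = 1:d
    helpA(:,3*idx-2:3*idx) = [ones(numA,1), -2*A(:,idx), A(:,idx).^2];
    helpB(:,3*idx-2:3*idx) = [B(:,idx).^2,     B(:,idx), ones(numB,1)];
end
distMat = helpA * helpB';

% Brute-force reference: one squared distance per (i,j) pair
refMat = zeros(numA,numB);
for i = 1:numA
    for j = 1:numB
        refMat(i,j) = sum((A(i,:) - B(j,:)).^2);
    end
end

% Largest absolute deviation; should be at rounding-error level
maxErr = max(abs(distMat(:) - refMat(:)));
```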

Please note: for constant d one can replace the for-loop with hardcoded implementations, e.g.

helpA = [ones(numA,1), -2*A(:,1), A(:,1).^2, ... % d == 2
         ones(numA,1), -2*A(:,2), A(:,2).^2 ];   % etc.

Evaluation:

%% create some points
d = 2; % dimension
numA = 20000;
numB = 20000;
A = rand(numA,d);
B = rand(numB,d);

%% pairwise distance matrix
% proposed method:
tic;
helpA = zeros(numA,3*d);
helpB = zeros(numB,3*d);
for idx = 1:d
    helpA(:,3*idx-2:3*idx) = [ones(numA,1), -2*A(:,idx), A(:,idx).^2 ];
    helpB(:,3*idx-2:3*idx) = [B(:,idx).^2 ,    B(:,idx), ones(numB,1)];
end
distMat = helpA * helpB';
toc;

% compare to pdist2:
tic;
pdist2(A,B).^2;
toc;

% compare to [1]:
tic;
bsxfun(@plus,dot(A,A,2),dot(B,B,2)')-2*(A*B');
toc;

% Another method: added 07/2014
% compare to ndgrid method (cf. Dan's comment)
tic;
[idxA,idxB] = ndgrid(1:numA,1:numB);
distMat = zeros(numA,numB);
distMat(:) = sum((A(idxA,:) - B(idxB,:)).^2,2);
toc;

Result:

Elapsed time is 1.796201 seconds.
Elapsed time is 5.653246 seconds.
Elapsed time is 3.551636 seconds.
Elapsed time is 22.461185 seconds.

For a more detailed evaluation w.r.t. dimension and number of data points, follow the discussion below (@comments). It turns out that different algorithms should be preferred in different settings. In non-time-critical situations just use the pdist2 version.

Further development: One can think of replacing the squared Euclidean distance by any other metric based on the same principle:

help = zeros(numA,numB,d);
for idx = 1:d
    help(:,:,idx) = [ones(numA,1), A(:,idx)     ] * ...
                    [B(:,idx)'   ; -ones(1,numB)];
end
distMat = sum(ANYFUNCTION(help),3);

Nevertheless, this is quite time consuming. For smaller d it could be useful to replace the 3-dimensional matrix help by d 2-dimensional matrices. Especially for d = 1 it provides a method to compute the pairwise differences B(j) - A(i) by a simple matrix multiplication:

pairDiffs = [ones(numA,1), A ] * [B'; -ones(1,numB)];
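A tiny worked example of the d = 1 case, with hand-picked values that are purely illustrative. Note the sign convention: entry (i,j) is B(j) - A(i), which does not matter once a symmetric function such as squaring or abs is applied.

```matlab
% Hand-picked 1-d points (illustrative values only)
A = [1; 2; 3];
B = [10; 20];
numA = numel(A);
numB = numel(B);

% Entry (i,j) equals B(j) - A(i)
pairDiffs = [ones(numA,1), A] * [B'; -ones(1,numB)];
% pairDiffs is [9 19; 8 18; 7 17]

% Squared Euclidean distance in 1-d is just the squared difference
distMat = pairDiffs.^2;
```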

Do you have any further ideas?

answered Feb 25 '15 at 7:02

Really interesting! +1. On another note: on my machine, starting at about d>30, bsxfun will win again due to lower memory overhead. - knedlsepp

@knedlsepp Thanks for taking the time to put all those together! Well, I did benchmark those two vectorized versions against the loop-based version as proposed here and I didn't see a lot of difference, at least not for small to decent sized dims. - Divakar

@Divakar: As for my machine: If we want squared distances, your Vec1 version is the fastest for lower dimensions until it gets beaten by bsxfun. If we want the actual sqrt-distances, pdist2 is faster until it also gets beaten by bsxfun eventually. After doing all this comparing: I guess, even though it's nice to know that we can squeeze the last bit of speed from all of this, I somehow get the feeling that simply going with pdist2 is a no-brainer if you have the statistics toolbox installed, as it is flexible yet still very, very fast. - knedlsepp

@knedlsepp Thanks a lot - this is a very interesting evaluation! I would just like to add that the time scale in log10 is a little bit misleading, since the relevance of computation time does not live on a log scale (e.g. a factor of 2 is really worth saving, but looks like nothing on a log10 scale). My conclusion: It pays off to test different algorithms for a time-critical implementation (which is the case especially for large numbers of points). E.g. for large numbers of 2d data points it turns out to be useful to use my implementation. I really like our collection of algorithms! :) - Matheburg

This is a very interesting suggestion and comparison. It seems that the pdist2 version lacks in efficiency due mainly to the element-wise squares, while Matlab now provides the 'squaredeuclidean' option to get this directly. With this, the proposed method and pdist2 appear to be very close (and perhaps pdist2 is faster in some regimes). The option may be more recent than the posted answer. - akkapi
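For reference, the option mentioned in this comment can be used as sketched below. This assumes the Statistics and Machine Learning Toolbox is installed; the small sizes are illustrative and say nothing about timing, so benchmark on your own data before picking a method.

```matlab
% Illustrative sizes only (not the benchmark setup)
d = 2; numA = 5; numB = 7;
A = rand(numA,d);
B = rand(numB,d);

% Squared Euclidean distances directly, without the extra .^2 pass
% over the (numA,numB) result that pdist2(A,B).^2 needs
distSq = pdist2(A, B, 'squaredeuclidean');
```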

For squared Euclidean distance one can also use the following formula

||a-b||^2 = ||a||^2 + ||b||^2 - 2<a,b>

where <a,b> is the dot product between a and b

nA = sum( A.^2, 2 ); %// squared norms of A's rows
nB = sum( B.^2, 2 ); %// squared norms of B's rows
distMat = bsxfun( @plus, nA, nB' ) - 2 * A * B' ;

Recently, I've been told that as of R2016b this method for computing squared Euclidean distance is faster than the accepted method.

answered Mar 11 '20 at 17:03
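Since R2016b also introduced implicit expansion, the same formula can be written without bsxfun. A minimal sketch, assuming R2016b or newer; the sizes are illustrative, and the final max(...,0) is an optional safeguard I added because the subtraction can produce tiny negative values from rounding when two points nearly coincide.

```matlab
% Illustrative sizes only
A = rand(5,3);
B = rand(7,3);

nA = sum(A.^2, 2);   % squared norms of A's rows, size (5,1)
nB = sum(B.^2, 2);   % squared norms of B's rows, size (7,1)

% (5,1) column plus (1,7) row expands implicitly to (5,7) in R2016b+
distMat = nA + nB' - 2 * (A * B');

% Optional: clamp rounding-induced negatives to zero
distMat = max(distMat, 0);
```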
