# MATLAB: palabras que coinciden entre matrices de celdas de cadenas

I’m trying to solve the following problem, and I need to do it as efficiently as possible (i.e. trying to avoid loops as much as I can).

I have two cell arrays, namely A and B. Each cell of A and B contains a string of characters. The length of these strings of characters is variable. Let’s say:

``````A={‘life is wonderful’, ‘matlab makes your dreams come true’};

B={‘life would be meaningless without wonderful matlab’, ‘what a wonderful world’, ‘the shoemaker makes shoes’, ‘rock and roll baby’};
``````

Moreover, the number of elements of cell array B is around three orders of magnitude larger than that of cell array A.

My goal is to find how many words of every char string in A also appear in every char string of B.

For the previous example, a suitable result could be something like:

``````match = [2 1 0 0
1 0 1 0]
``````

The first row indicates how many words in the 1st char string of A appear in the four char strings of B. And the second row, the same for the 2nd char string of A.

The double loop implementation is straightforward, but very time consuming especially because of the length of cell array B (over 3 million cells).

¿Algunas ideas? Muchas gracias.

Javier

preguntado el 08 de noviembre de 11 a las 11:11

## 3 Respuestas

Let me get this started by posting the "straightforward" solution (at least others have a baseline to compare against):

``````A = {'life is wonderful', 'matlab makes your dreams come true'};
B = {'life would be meaningless without wonderful matlab', 'what a wonderful world', 'the shoemaker makes shoes', 'rock and roll baby'};

count = zeros(numel(A),numel(B));

%# for each string
for i=1:numel(A)
%# split into words
str = textscan(A{i}, '%s', 'Delimiter',' '); str = str{1};

%# for each word
for j=1:numel(str)
%# count occurences
count(i,:) = count(i,:) + cellfun(@numel, strfind(B,str{j}));
end
end
``````

``````>> count
count =
2     1     0     0
1     0     1     0
``````

A better algorithm could be building some kind of index or hash-table...

respondido 08 nov., 11:17

The solution is not complex.

split the sentences apart with:

``````a_words = regexp(A,'(\w+)','match')
b_words = regexp(B,'(\w+)','match')
``````

then compare in a loop:

``````match = nan(numel(a_words),numel(b_words));
for i = 1:numel(a_words)
for j = 1:numel(b_words)
match(i,j) = sum(ismember(a_words{i},b_words{j}));
end
end
``````

but to make it quicker - I'm not really sure. You definetly can put the inner loop in a parfor that should parallelize that. If it's really a lot of words maybe put them in a database. That will do the indexing for you.

respondido 08 nov., 11:17

Puedes explotar Mapa, which offers you an efficient dictionary based structure:

For each word save the vector showing the occurrences in each string:

``````A = {'life is wonderful', 'matlab makes your dreams come true'};
B = {'life would be meaningless without wonderful matlab', 'what a wonderful world', 'the shoemaker makes shoes', 'rock and roll baby'};

mapA = containers.Map();
sizeA = size(A,2);
for i = 1:size(A,2)         % for each string
a = regexpi(A(i),'\w+','match');
for w = a{:}                % for each word extracted
str = cell2mat(w);
if(mapA.isKey(str))     % if word already indexed
occ = mapA(str);
else                    % new key
occ = zeros(1,sizeA);
end
occ(i) = occ(i)+1;
mapA(str) = occ;
end
end

% same for B
mapB = containers.Map();
sizeB = size(B,2);
for i = 1:size(B,2)
a = regexpi(B(i),'\w+','match');
for w = a{:}
str = cell2mat(w);
if(mapB.isKey(str))
occ = mapB(str);
else
occ = zeros(1,sizeB);
end
occ(i) = occ(i)+1;
mapB(str) = occ;
end
end
``````

then, for each unique word found in A, compute the matches with B

``````match = zeros(size(A,2),size(B,2));
for w = mapA.keys
str = cell2mat(w);
if (mapB.isKey(str))
match = match + diag(mapA(str))*ones(size(match))*diag(mapB(str));
end
end
``````

``````match =

2     1     0     0
1     0     1     0
``````

this way you have a complexity of #wordsA + #wordsB + #singleWordsA instead of #wordsA*#wordsB

EDITAR: or, if you don't like `Map`, you can save the word-occurrence-vectors in alphabetically ordered vector. Then you can look for matches simultaneously checking both vectors:

(suppose we are using a struct where the `w` attribute is the word string and `occ` is the occurrence vector)

``````i = 1; j = 1;
while(i<=size(wordsA,2) && i<=size(wordsB,2))
if(strcmp(wordsA(i).w, wordsB(j).w))
% update match
else
if(before(wordsA(i).w, wordsA(i).w)) % before: fancy function returning 1 if the first argument comes (alphabetically) before the second one (no builtin function comes to my mind)
i = i+1;
else
j = j+1;
end
end
``````

if you are looking for 'matlab' and you know in the 10th position is stored 'life' is useless to check the positions before, since the vector is alphabetically ordered. So we have #wordsA+#wordsB iteration vs. #wordsA*#wordsB of the nested loops solution.

respondido 08 nov., 11:23

Did you time it against @Amro's solution? - Jonas

seems it clocks about the same, even with a larger dataset - Cavaz

No es la respuesta que estás buscando? Examinar otras preguntas etiquetadas or haz tu propia pregunta.