MATLAB: matching words between cell arrays of strings

I’m trying to solve the following problem, and I need to do it as efficiently as possible (i.e. trying to avoid loops as much as I can).

I have two cell arrays, namely A and B. Each cell of A and B contains a string of characters. The length of these strings of characters is variable. Let’s say:

A = {'life is wonderful', 'matlab makes your dreams come true'};

B = {'life would be meaningless without wonderful matlab', 'what a wonderful world', 'the shoemaker makes shoes', 'rock and roll baby'};

Moreover, the number of elements of cell array B is around three orders of magnitude larger than that of cell array A.

My goal is to find, for every pair of strings, how many words of the string from A also appear in the string from B.

For the previous example, a suitable result could be something like:

match = [2 1 0 0
         1 0 1 0]

The first row indicates how many words in the 1st char string of A appear in the four char strings of B. And the second row, the same for the 2nd char string of A.

The double-loop implementation is straightforward but very time-consuming, especially because of the length of cell array B (over 3 million cells).

Any ideas? Thanks a lot.


asked Nov 8 '11 at 11:11

3 Answers

Let me get this started by posting the "straightforward" solution (at least others have a baseline to compare against):

A = {'life is wonderful', 'matlab makes your dreams come true'};
B = {'life would be meaningless without wonderful matlab', 'what a wonderful world', 'the shoemaker makes shoes', 'rock and roll baby'};

count = zeros(numel(A),numel(B));

%# for each string in A
for i=1:numel(A)
    %# split into words
    str = textscan(A{i}, '%s', 'Delimiter',' '); str = str{1};

    %# for each word
    for j=1:numel(str)
        %# count occurrences in every string of B
        %# (strfind matches substrings, not only whole words)
        count(i,:) = count(i,:) + cellfun(@numel, strfind(B,str{j}));
    end
end

The result:

>> count
count =
     2     1     0     0
     1     0     1     0

A better algorithm could be building some kind of index or hash-table...
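One way to read the "index" idea, as a minimal sketch rather than a tuned implementation: give every distinct word an integer id, store which strings contain which ids in sparse matrices, and get all counts with one sparse product. The variable names (vocab, countA, hasB, ...) are assumptions made for this illustration, and it counts whole words only, so a word of A that is repeated is counted once per repetition.

```matlab
A = {'life is wonderful', 'matlab makes your dreams come true'};
B = {'life would be meaningless without wonderful matlab', ...
     'what a wonderful world', 'the shoemaker makes shoes', 'rock and roll baby'};

% split every string into whole words (one cell of words per string)
wordsA = regexp(A, '\w+', 'match');
wordsB = regexp(B, '\w+', 'match');
allA = [wordsA{:}];
allB = [wordsB{:}];

% shared vocabulary: ids(k) is the position in vocab of the k-th word
[vocab, ~, ids] = unique([allA, allB]);
idsA = ids(1:numel(allA));
idsB = ids(numel(allA)+1:end);

% index of the source string for every word (repelem needs R2015a or later)
docA = repelem(1:numel(A), cellfun(@numel, wordsA));
docB = repelem(1:numel(B), cellfun(@numel, wordsB));

% string-by-word matrices; sparse() sums repeated (row, column) pairs
countA = sparse(docA(:), idsA(:), 1, numel(A), numel(vocab));
hasB   = double(sparse(docB(:), idsB(:), 1, numel(B), numel(vocab)) > 0);

% match(i,j): number of words of A{i} that also occur in B{j}
match = countA * hasB';
disp(full(match))
```

For the example data, full(match) equals [2 1 0 0; 1 0 1 0]. With millions of strings in B it is probably better to keep match sparse and skip the call to full.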

answered Nov 08, 11:17

The solution is not complex.

Split the sentences into words with:

a_words = regexp(A,'(\w+)','match');
b_words = regexp(B,'(\w+)','match');

Then compare in a loop:

match = nan(numel(a_words),numel(b_words));
for i = 1:numel(a_words)
    for j = 1:numel(b_words)
        % number of words of A{i} that also occur in B{j}
        match(i,j) = sum(ismember(a_words{i},b_words{j}));
    end
end

As for making it quicker, I'm not really sure. You can definitely put the inner loop in a parfor, which should parallelize it. If it's really a lot of words, maybe put them in a database, which will do the indexing for you.
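A possible shape for the parfor idea, as a sketch only. It assumes the Parallel Computing Toolbox is available (without it, parfor simply runs like an ordinary for loop), and it puts the parfor on the loop over B, the long dimension, building one column per iteration; that placement is a choice made for this illustration.

```matlab
A = {'life is wonderful', 'matlab makes your dreams come true'};
B = {'life would be meaningless without wonderful matlab', ...
     'what a wonderful world', 'the shoemaker makes shoes', 'rock and roll baby'};

a_words = regexp(A, '(\w+)', 'match');
b_words = regexp(B, '(\w+)', 'match');

match = zeros(numel(a_words), numel(b_words));
parfor j = 1:numel(b_words)              % iterations over B are independent
    col = zeros(numel(a_words), 1);      % temporary column for string B{j}
    for i = 1:numel(a_words)
        col(i) = sum(ismember(a_words{i}, b_words{j}));
    end
    match(:, j) = col;                   % sliced output, one column per iteration
end
disp(match)
```

Whether this pays off depends on the pool start-up cost and on how much work each iteration does, so it is worth timing against the plain loop.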

answered Nov 08, 11:17

You can exploit Map (containers.Map), which offers an efficient dictionary-based structure:

For each word, save the vector showing its occurrences in each string:

A = {'life is wonderful', 'matlab makes your dreams come true'};
B = {'life would be meaningless without wonderful matlab', 'what a wonderful world', 'the shoemaker makes shoes', 'rock and roll baby'};

mapA = containers.Map();
sizeA = size(A,2);
for i = 1:size(A,2)             % for each string
    a = regexpi(A(i),'\w+','match');
    for w = a{:}                % for each word extracted
        str = cell2mat(w);
        if(mapA.isKey(str))     % if word already indexed
            occ = mapA(str);
        else                    % new key
            occ = zeros(1,sizeA);
        end
        occ(i) = occ(i)+1;      % count the word for string i
        mapA(str) = occ;
    end
end

% same for B
mapB = containers.Map();
sizeB = size(B,2);
for i = 1:size(B,2)
    a = regexpi(B(i),'\w+','match');
    for w = a{:}
        str = cell2mat(w);
        if(mapB.isKey(str))     % if word already indexed
            occ = mapB(str);
        else                    % new key
            occ = zeros(1,sizeB);
        end
        occ(i) = occ(i)+1;
        mapB(str) = occ;
    end
end

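Then, for each unique word found in A, compute the matches with B:

match = zeros(size(A,2),size(B,2));
for w = mapA.keys
    str = cell2mat(w);
    if (mapB.isKey(str))
        % outer product of the two occurrence vectors; same result as
        % diag(mapA(str))*ones(size(match))*diag(mapB(str)), without the large diagonal matrices
        match = match + mapA(str)' * mapB(str);
    end
end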


match =

     2     1     0     0
     1     0     1     0

This way you have a complexity of #wordsA + #wordsB + #singleWordsA (where #singleWordsA is the number of distinct words in A) instead of #wordsA*#wordsB.

EDIT: or, if you don't like Map, you can save the word-occurrence vectors in an alphabetically ordered vector. Then you can look for matches by scanning both vectors simultaneously:

(suppose we are using a struct where the w attribute is the word string and occ is the occurrence vector)

i = 1; j = 1;
while (i<=size(wordsA,2) && j<=size(wordsB,2))
    if (strcmp(wordsA(i).w, wordsB(j).w))
        % same word in both lists: update match, then advance both
        match = match + wordsA(i).occ' * wordsB(j).occ;
        i = i+1;
        j = j+1;
    elseif (before(wordsA(i).w, wordsB(j).w)) % before: returns 1 if the first argument comes (alphabetically) before the second one (no builtin function comes to my mind)
        i = i+1;
    else
        j = j+1;
    end
end

If you are looking for 'matlab' and you know that 'life' is stored in the 10th position, it is useless to check the earlier positions, since the vector is alphabetically ordered. So we have #wordsA+#wordsB iterations vs. the #wordsA*#wordsB of the nested-loops solution.
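The before function used above is not a builtin. One possible way to write it, as a sketch that assumes plain character-code ordering is acceptable, is to lean on sort, which accepts a cell array of strings:

```matlab
% hypothetical helper: true if a sorts strictly before b.
% sort on a cell array of strings orders by character code,
% so uppercase letters come before lowercase ones.
before = @(a, b) ~strcmp(a, b) && isequal(sort({a, b}), {a, b});

r1 = before('life', 'matlab');   % 'life' comes first
r2 = before('matlab', 'life');   % wrong order
r3 = before('life', 'life');     % equal strings are not "before" each other
disp([r1 r2 r3])
```

If the word lists are produced with unique or with the keys of a containers.Map, they already come back sorted in this same order, so the helper stays consistent with them.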

answered Nov 08, 11:23

Did you time it against @Amro's solution? - Jonas

seems it clocks about the same, even with a larger dataset - Cavaz
