¿Estructura de datos eficiente para la búsqueda de subcadenas?

Assume I have a set of strings S and a query string q. I want to know if any member of S is a substring of q. (For the purpose of this question substring includes equality, e.g. "foo" is a substring of "foo".) For example assume the function that does what I want is called anySubstring:

S = ["foo", "baz"]
q = "foobar"
assert anySubstring(S, q)  # "foo" is a substring of "foobar"

S = ["waldo", "baz"]
assert not anySubstring(S, q)

Is there any easy-to-implement algorithm to do this with time complexity sublinear in len(S)? It's ok if S has to be processed into some clever data structure first because I will be querying each S with a lot of q strings, so the amortized cost of this preprocessing might be reasonable.

EDIT: To clarify, I don't care el cual member of S is a substring of q, only whether at least one is. In other words, I , solamente care about a boolean answer.

preguntado el 09 de marzo de 12 a las 15:03

Are the strings in S short? How does the length of the longest string in S compare with the length of S? -

3 Respuestas

Creo que el Algoritmo Aho-Corasick does what you want. I think there is another solution which is very simple to implement, it's Algoritmo de Karp-Rabin.

respondido 09 mar '12, 16:03

I cannot believe I am the only person to upvote the right answer. - Nemo

And there's a nice #Python library for Aho-Corasick pypi.org/project/pyahocorasick ! :-D very useful answer! - Alisa

So if the length of S is way less then the sum of the lengths of the potential substrings your best option would be to build a árbol de sufijo from S and then do a search in it. This is linear with respect to the length of S plus the summar length of the candidate substrings. Of course there can not be an algorithm with better complexity as you have to pass through all the input at least. If the case is opposite i.e. the length of s is more then the summar length of the substrings your best option would be aho-corasick.

Espero que esto ayude.

respondido 09 mar '12, 15:03

It fits for the other way around - querying many times with a single q, but in here - the S is constant, and q is changing... - amit

@amit I think Aho-Corasick does what OP wants.. It operates much like your solution except that the trie has failure transitions that take you to the right position in the automaton. - aelguindy

@aelguindy: I was speaking only about the suffix tree approach, I am afraid I am not familiar with the mentioned algorithm, but I will read how it works later today. - amit

@amit Yes, you are right regarding your comment about the suffix tree, I though you're commenting on the entire answer, sorry :-). - aelguindy

Create a regular expression .*(S1|S2|...|Sn).* and construct its minimal DFA.

Run your query string through the DFA.

respondido 09 mar '12, 22:03

@SaeedAmiri If OP does not care at all about the efficiency of the preprocessing, then once the DFA is created, all queries run in time linear in the length of q, independent of any aspect of S. - aelguindy

@aelguindy, suffix tree works in similar Order with less constant factor and less memory, in fact you should convert your DFA finally to something like trie. - saeed amiri

@SaeedAmiri How do you check that your query string has the suffix tree string as a substring? Suffix trees should be built when the text is fixed not the patterns.. - aelguindy

No es la respuesta que estás buscando? Examinar otras preguntas etiquetadas or haz tu propia pregunta.