Fast regular expression search

What would be a way to somehow index 50-100GB of text lines and then be able to perform fast regex searches? At least faster than going line by line. The regex pattern is not always the same so can't take it into account when building the index.

Is it possible to achieve something like this with Lucene? I know it might be possible with suffix trees but the index takes too much memory (much more than those 100GB).

asked Nov 8 '11 at 17:11

1 Answer

The main thing you have to do is identify the common search terms in advance, and then index based on that.

For instance, maybe you anticipate that there will be a lot of searches for lines starting with "Foo". Then you can run that search in advance and store a list of lines starting with "Foo". Then, if someone searches for lines starting with "Foobar", you've already got a narrowed-down subset of lines to search.
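A minimal sketch of that idea in Python (the helper names and the in-memory list of lines are my own assumptions, not from the question — at 50-100GB you would store the posting lists on disk):

```python
import re
from collections import defaultdict

def build_prefix_index(lines, prefixes):
    """For each anticipated prefix (e.g. "Foo"), precompute the
    numbers of the lines that start with it."""
    index = defaultdict(list)
    for lineno, line in enumerate(lines):
        for p in prefixes:
            if line.startswith(p):
                index[p].append(lineno)
    return index

def search(lines, index, known_prefix, pattern):
    """Run the full regex only on the narrowed-down subset of lines
    whose precomputed prefix matches; fall back to a full scan if
    the prefix was never indexed."""
    candidates = index.get(known_prefix, range(len(lines)))
    rx = re.compile(pattern)
    return [lines[i] for i in candidates if rx.search(lines[i])]
```

For example, with `"Foo"` indexed in advance, a later search for `^Foobar` only scans the lines already known to start with `"Foo"`.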

If you want to get really clever, you can programmatically analyze common searches to find recurring search components, and then index based on those common components.

answered Nov 8 '11 at 11:21

I've also found an article that describes a similar approach. It suggests indexing k-grams (every run of k consecutive characters from those lines). The problem is that a search can include any number of characters, and indexing every k-gram would take too much memory. In any case, this is worth testing. - user16367
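For reference, the k-gram scheme the comment mentions can be sketched like this (a toy in-memory version with k=3; the function names are illustrative, and a real deployment would keep the posting sets in a compressed on-disk index):

```python
from collections import defaultdict

def build_kgram_index(lines, k=3):
    """Map every k consecutive characters of every line to the set
    of line numbers containing that k-gram."""
    index = defaultdict(set)
    for lineno, line in enumerate(lines):
        for i in range(len(line) - k + 1):
            index[line[i:i + k]].add(lineno)
    return index

def search_literal(lines, index, literal, k=3):
    """Given a literal substring extracted from the regex, intersect
    the posting sets of its k-grams to get candidate lines, then
    confirm each candidate with a real substring check."""
    grams = [literal[i:i + k] for i in range(len(literal) - k + 1)]
    if not grams:
        candidates = range(len(lines))  # literal too short to filter
    else:
        candidates = sorted(set.intersection(*(index[g] for g in grams)))
    return [lines[i] for i in candidates if literal in lines[i]]
```

The trick is that any literal fragment of the regex at least k characters long can be decomposed into overlapping k-grams, so the index narrows the scan without knowing the pattern in advance; the memory cost grows with the number of distinct k-grams, which is the concern raised above.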
