Tool for creating custom rules for word lemmatization and similar tasks

I do a lot of natural language processing with somewhat unusual requirements. I often get tasks similar to lemmatization: given a word (or just a piece of text), I need to find some pattern and transform the word accordingly. For example, I may need to correct misspellings, e.g. transform the word "eatin" into "eating". Or I may need to transform the words "ahahaha", "ahahahaha", etc. to just "ahaha", and so on.

So I'm looking for a generic tool that lets me define transformation rules for such cases. Rules might look something like this:

 {w}in   ->  {w}ing
 aha(ha)+  ->  ahaha

That is, I need to be able to use patterns captured on the left side in the right side.
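
To make the intended semantics concrete, here is a minimal Java sketch of how the `{w}` notation could be mapped onto `java.util.regex` named groups. The class name `PlaceholderRule` and the choice to treat every placeholder as `\w+` are assumptions made for illustration; they do not come from any existing tool.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any existing library: compiles one rule
// written in the {name} notation into a java.util.regex pattern.
class PlaceholderRule {
    // Matches placeholders such as {w} or {chars1}.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\{([A-Za-z][A-Za-z0-9]*)\\}");

    private final Pattern pattern;
    private final String replacement;

    PlaceholderRule(String left, String right) {
        // Left side: {w} becomes the named group (?<w>\w+); everything else
        // is assumed to already be regex syntax, e.g. aha(ha)+.
        String regex = PLACEHOLDER.matcher(left).replaceAll("(?<$1>\\\\w+)");
        // Anchor the rule so that it has to match the whole word.
        this.pattern = Pattern.compile("^(?:" + regex + ")$");
        // Right side: {w} becomes the group reference ${w}.
        this.replacement = PLACEHOLDER.matcher(right).replaceAll("\\${$1}");
    }

    String apply(String word) {
        // replaceAll returns the word unchanged when the pattern does not match.
        return pattern.matcher(word).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(new PlaceholderRule("{w}in", "{w}ing").apply("eatin"));
        System.out.println(new PlaceholderRule("aha(ha)+", "ahaha").apply("ahahahaha"));
    }
}
```

Note that a rule as broad as `{w}in -> {w}ing` would also rewrite correctly spelled words such as "begin", so a real rule set would probably need exceptions or a dictionary check.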

I work with linguists who don't know programming at all, so ideally this tool should use external files and a simple language for rules.
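
As a rough sketch of what "external files and a simple language" could mean in practice, the Java snippet below reads rules from lines of the form `regex -> replacement`. The file format, the `#` comment convention and the class name `RuleFile` are all hypothetical; only `java.util.regex` and `java.nio.file` are real APIs here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical loader for a plain-text rule file, one rule per line.
class RuleFile {
    static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(Pattern pattern, String replacement) {
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    // Parses lines such as:  \b(\w+)in\b -> $1ing
    // Blank lines and lines starting with '#' are skipped (assumed convention).
    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] parts = trimmed.split("\\s*->\\s*", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed rule: " + line);
            }
            rules.add(new Rule(Pattern.compile(parts[0]), parts[1]));
        }
        return rules;
    }

    // Applies the rules in file order; each rule sees the previous rule's output.
    static String apply(List<Rule> rules, String text) {
        String result = text;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a rule file, args[1] the text to transform.
        List<Rule> rules = parse(Files.readAllLines(Paths.get(args[0])));
        System.out.println(apply(rules, args[1]));
    }
}
```

With this layout the linguists would only edit the text file; the pattern side is still plain regex syntax, which may or may not be simple enough for them.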

I'm doing this project in Clojure, so ideally this tool should be a library for one of the JVM languages (Java, Scala, Clojure), but other languages or command-line tools are OK too.

There are several very cool NLP projects, including GATE, Stanford Core NLP, NLTK and others, and I'm not an expert in all of them, so I may have missed the tool I need there. If so, please let me know.

Note that I work with several languages and perform very different tasks, so specific lemmatizers, stemmers, spelling correctors and so on for specific languages do not fit my needs - I really need a more generic tool.

UPD. It seems like I need to give some more details/examples of what I need.

Basically, I need a function for replacing text by some kind of regex (similar to Java's String.replaceAll()), but with the ability to use the captured text in the replacement string. For example, in real-world text people often repeat characters to put emphasis on a particular word, e.g. someone may write "This film is soooo boooring...". I need to be able to replace this repeated "oooo" with a single character. So there may be a rule like this (in syntax similar to what I used earlier in this post):

{chars1}<char>+{chars2}?  ->  {chars1}<char>{chars2}

that is, replace a word that starts with some characters (chars1), contains at least 3 repetitions of <char>, and possibly ends with some other characters (chars2) with a similar string that contains only a single <char>. The key point here is that we capture <char> on the left side of a rule and use it on the right side.
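
For comparison, the repeated-character rule can be approximated with a backreference in plain `java.util.regex`, the same engine that `String.replaceAll()` uses. This is only a sketch: the class name `RepeatCollapser` is made up, and the threshold of three repetitions is taken from the description above.

```java
import java.util.regex.Pattern;

// Hypothetical helper: collapses a letter repeated three or more times
// into a single occurrence of that letter.
class RepeatCollapser {
    // (\p{L}) captures one letter of any script; \1{2,} requires the same
    // letter at least two more times, i.e. three or more in a row.
    private static final Pattern REPEATED = Pattern.compile("(\\p{L})\\1{2,}");

    static String collapse(String text) {
        // $1 in the replacement refers to the letter captured on the left.
        return REPEATED.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("This film is soooo boooring..."));
    }
}
```

Because the run is always reduced to one letter, a word with a legitimate double letter, such as "coooool", would come out as "col"; handling that would need something beyond a single regex rule.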

asked Mar 10 '12 at 02:03

Do not use vulgarity here please. -

@AndrewMarshall: the word mentioned is just one of the most frequently misspelled words in user-generated texts like tweets, and thus it is a good example. But thanks to Gangadhar - he found another nice example, so there's no more need for vulgarity. -

Doesn't matter, it has no place here, and obviously there are many other examples that could have been used. Please refrain from it in the future. -

@AndrewMarshall: I disagree with you. This is really not a place for swearing between users - I agree with that. But programmers work with the real world and all its unpleasant things, and sometimes these things do matter. In particular, people tend to hide vulgar words in their messages and thus misspell them on purpose. Moreover, since these words are considered bad, they are not included in many language resources and thus cannot be caught by dictionary lookup - such details are directly relevant to the subject matter. But surely, it's worth keeping the number of such cases as small as possible. -

2 Answers

I am not an expert in NLP, but I believe Snowball might be of interest to you. It's a language for representing stemming algorithms. Its stemmer is used in the Lucene search engine.

answered Mar 10 '12 at 02:03

Thanks for the suggestion - Snowball is a really good tool for manipulating endings (like "-in" -> "-ing"). However, it is not flexible enough for other tasks like root manipulation ("ahahahaha" -> "ahaha"). - amigo

I have found http://userguide.icu-project.org/transforms/general to be useful as well for some general pattern/transform tasks like this. Ignore the stuff about transliteration; it's nice for doing a lot of things.

You can just load up rules from a file into a String and register them, etc.


answered Mar 10 '12 at 14:03

Interesting framework, thanks. However, it is also too weak for my tasks. I need to be able to at least catch repetitive patterns like "aha(ha)+" or "l(o)+l" (in regex syntax). - amigo

Well, this still doesn't fit all my needs (see my update), but it is the closest tool, so I accept this answer. - amigo
