Tunable diff algorithm

I'm interested in finding a more-sophisticated-than-typical algorithm for finding differences between strings, that can be "tuned" via some parameters, to balance between such things as "maximize count of identical characters" vs. "maximize the length of spans" vs. "try to keep whole words intact".

Ultimately, I want to make the results as human-readable as possible. For instance, if a long sentence has been replaced with an entirely new sentence, and the only things it has in common with the original are the words "the", "and", and "a" in that order, I might want it treated as if the whole sentence changed, rather than as four separate changed spans, just as a reasonable person would see it.
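To make the "whole sentence changed" idea concrete, here is a minimal sketch of a word-level diff with a single tuning knob. It is not taken from any existing library: the names `tokenize`, `lcsTable`, `wordDiff`, and the `minSimilarity` parameter are all hypothetical, and the similarity measure (shared words divided by the longer text's word count) is just one assumption among many possible ones.

```javascript
// Split on whitespace so that whole words are the unit of comparison.
function tokenize(text) {
  return text.split(/\s+/).filter(Boolean);
}

// t[i][j] = length of the longest common subsequence of a[i..] and b[j..].
function lcsTable(a, b) {
  const t = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      t[i][j] = a[i] === b[j] ? t[i + 1][j + 1] + 1 : Math.max(t[i + 1][j], t[i][j + 1]);
    }
  }
  return t;
}

// minSimilarity (hypothetical knob): below this share of common words,
// report one delete plus one insert instead of many small spans.
function wordDiff(oldText, newText, minSimilarity = 0.5) {
  const a = tokenize(oldText);
  const b = tokenize(newText);
  const t = lcsTable(a, b);
  const similarity = t[0][0] / Math.max(a.length, b.length, 1);
  const ops = [];

  if (similarity < minSimilarity) {
    if (a.length) ops.push({ op: 'delete', text: a.join(' ') });
    if (b.length) ops.push({ op: 'insert', text: b.join(' ') });
    return ops;
  }

  // Append a word, merging it into the previous span when the op matches.
  const push = (op, word) => {
    const last = ops[ops.length - 1];
    if (last && last.op === op) last.text += ' ' + word;
    else ops.push({ op, text: word });
  };

  // Walk the table to recover an edit script.
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { push('equal', a[i]); i++; j++; }
    else if (t[i + 1][j] >= t[i][j + 1]) { push('delete', a[i]); i++; }
    else { push('insert', b[j]); j++; }
  }
  while (i < a.length) push('delete', a[i++]);
  while (j < b.length) push('insert', b[j++]);
  return ops;
}
```

Calling `wordDiff('the quick fox and a dog', 'the slow cat and a bird', 0.6)` collapses the result into a single delete and a single insert, because only half of the words are shared; lowering the threshold to `0.5` gives the fine-grained spans instead.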

Does such a thing exist? Although I'm working in JavaScript/Node.js, an algorithm in any language would be helpful.

I'm actually ok with something that uses Monte Carlo methods or the like, if its results are better. Computation time is not an issue (within reason), nor is determinism.

Note: although this is beyond the scope of what I'm asking, I'll throw one more thing out there just in case. It would also be great if it could recognize changes that are out of order. For instance, if someone swaps the order of two paragraphs while leaving them otherwise identical, it would be awesome if it recognized that as a simple move, rather than as one deletion and one unrelated addition.
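As a rough illustration of the move idea, the sketch below matches identical paragraphs between two versions and then keeps the longest run of matches that are still in their original relative order; anything matched but outside that run is reported as a move. The function names `longestIncreasingPositions` and `detectMoves` are hypothetical, paragraphs are assumed to be separated by blank lines, and only exact (unedited) paragraphs are matched.

```javascript
// Positions (into `values`) of one longest strictly increasing subsequence.
function longestIncreasingPositions(values) {
  const len = new Array(values.length).fill(1);
  const prev = new Array(values.length).fill(-1);
  let best = -1;
  for (let i = 0; i < values.length; i++) {
    for (let k = 0; k < i; k++) {
      if (values[k] < values[i] && len[k] + 1 > len[i]) {
        len[i] = len[k] + 1;
        prev[i] = k;
      }
    }
    if (best === -1 || len[i] > len[best]) best = i;
  }
  const keep = new Set();
  for (let i = best; i !== -1; i = prev[i]) keep.add(i);
  return keep;
}

function detectMoves(oldText, newText) {
  const split = (t) => t.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const oldParas = split(oldText);
  const newParas = split(newText);

  // Queue of old positions for each distinct paragraph text.
  const positions = new Map();
  oldParas.forEach((p, i) => {
    if (!positions.has(p)) positions.set(p, []);
    positions.get(p).push(i);
  });

  const matches = [];
  const inserts = [];
  const matchedOld = new Set();
  newParas.forEach((p, to) => {
    const queue = positions.get(p);
    if (queue && queue.length) {
      const from = queue.shift();
      matchedOld.add(from);
      matches.push({ from, to, text: p });
    } else {
      inserts.push({ op: 'insert', to, text: p });
    }
  });

  // Matches that preserve the original order stay put; the rest moved.
  const keep = longestIncreasingPositions(matches.map((m) => m.from));
  const ops = matches.map((m, idx) => ({ op: keep.has(idx) ? 'keep' : 'move', ...m }));

  const deletes = [];
  oldParas.forEach((p, from) => {
    if (!matchedOld.has(from)) deletes.push({ op: 'delete', from, text: p });
  });
  return ops.concat(inserts, deletes);
}
```

For two versions that differ only by swapping the first two paragraphs, this reports a single `move` and no insertions or deletions. Paragraphs that were both moved and edited would still show up as a delete plus an insert, so a fuzzier match would be needed to cover that case.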

asked Aug 27, 2011 at 19:08

Are you comparing specific input, say (programming) source code, or is it just free/plain text? If you're comparing text in some well-defined (programming) language, perhaps you could compare the ASTs instead of using a "diff-like" approach.

I want to work on the text. In some cases it may be source code, but in other cases not. What you describe is an interesting approach, but not what I'm after for this project.

Too bad. The idea is not mine, I must confess. I got it from a Google Tech Talk by one of the users from SO: Ira Baxter.

+1 I'm interested in something like this as well.

2 Answers

I've had good luck with diff_match_patch. There are some good options for tuning it for readability.

answered Aug 28, 2011 at 21:08

Thank you, that does exactly what I need. (Actually I found it right before you posted, by looking at the answers under "related" over on the right :) ) Google is awesome sometimes... they've got it in 7 freakin' languages! - robar

Try [...]. Its code is already formatted for compatibility with CommonJS, which is the module format Node uses.

answered Nov 25, 11:15
