Multiple string replacement in C#

I am dynamically editing a regex that matches text in a PDF; the text can contain hyphenation at the end of some lines.

Example:

Source string:

"consecuti?vely"

Replace rules:

 .Replace("cuti?",@"cuti?(-\s+)?")
 .Replace("con",@"con(-\s+)?")
 .Replace("consecu",@"consecu(-\s+)?")

Desired output:

"con(-\s+)?secu(-\s+)?ti?(-\s+)?vely"

The replace rules are built dynamically; this is just an example that causes problems.

What's the best way to perform such a multiple replace so that it produces the desired output?

So far I have thought about using Regex.Replace and interleaving the word to replace with an optional (-\s+)? between each character, but that would not work, because the word to replace already contains characters that have special meaning in a regex.

EDIT: My current code, which doesn't work when replace rules overlap as in the example above:

private string ModifyRegexToAcceptHyphensOfCurrentPage(string regex, int searchedPage)
    {
        var originalTextOfThePage = mPagesNotModified[searchedPage];
        var hyphenatedParts = Regex.Matches(originalTextOfThePage, @"\w+\-\s");
        for (int i = 0; i < hyphenatedParts.Count; i++)
        {
            var partBeforeHyphen = String.Concat(hyphenatedParts[i].Value.TakeWhile(c => c != '-'));

            regex = regex.Replace(partBeforeHyphen, partBeforeHyphen + @"(-\s+)?");
        }
        return regex;
    }
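
To make the overlap problem concrete, here is a minimal sketch that applies the three example rules in order, the same way the loop above does. The names `OverlapDemo` and `ApplyRulesSequentially` are made up for illustration and are not part of the original code.

```csharp
using System;

static class OverlapDemo
{
    // Applies the three example rules one after another (hypothetical helper).
    public static string ApplyRulesSequentially(string regex)
    {
        const string token = @"(-\s+)?";
        string[] partsBeforeHyphen = { "cuti?", "con", "consecu" };
        foreach (var part in partsBeforeHyphen)
        {
            // string.Replace is a plain substring replacement; no regex is involved.
            regex = regex.Replace(part, part + token);
        }
        return regex;
    }

    static void Main()
    {
        // Once the "con" rule has run, "consecu" no longer occurs as a contiguous
        // substring, so the third rule silently does nothing.
        Console.WriteLine(ApplyRulesSequentially("consecuti?vely"));
        // prints: con(-\s+)?secuti?(-\s+)?vely
    }
}
```

The token after "secu" is missing because each rule is matched against the already modified string instead of the original one.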

asked Jul 31 '12 at 09:07

I am struggling to see a definitive set of rules in your example. I assume this hyphenation can happen in any string, so the example you have shown applies to that string only. You need to provide more general rules for how the regex should be built. -

This is the definitive set of rules for this example. I scan the PDF page for the pattern word-hyphen-newline and collect all strings that are hyphenated. In this case, that produced only the 3 rules above. -

Sorry, but what you're saying does not match your example: 1) ? is not a hyphen, 2) there are no new lines. So are you saying that for each individual string there are specific rules? -

I will put my current code in the question. This works for all the cases in which the replace rules do not overlap. -

4 Answers

The output of this program is "con(-\s+)?secu(-\s+)?ti?(-\s+)?vely", and as I understand your problem, this code should solve it.

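using System;
using System.Collections.Generic;
using System.Linq;

class Program
    {
        class somefields
        {
            public string first;
            public string secound;
            public string Add;
            // -1 marks a rule whose search string was not found in the input
            public int index = -1;
            public somefields(string F, string S)
            {
                first = F;
                secound = S;
            }

        }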
    static void Main(string[] args)
    {
        // declaring input
        string input = "consecuti?vely";
        List<somefields> rules=new List<somefields>();
        //declaring rules
        rules.Add(new somefields("cuti?",@"cuti?(-\s+)?"));
        rules.Add(new somefields("con",@"con(-\s+)?"));
        rules.Add(new somefields("consecu",@"consecu(-\s+)?"));
        // finding the string which must be added to output string and index of that
        foreach (var rul in rules)
        {
            var index=input.IndexOf(rul.first);
            if (index != -1)
            {
                var add = rul.secound.Remove(0,rul.first.Count());
                rul.Add = add;
                rul.index = index+rul.first.Count();
            }

        }
        // sort rules by index
        for (int i = 0; i < rules.Count(); i++)
        {
            for (int j = i + 1; j < rules.Count(); j++)
            {
                if (rules[i].index > rules[j].index)
                {
                    somefields temp;
                    temp = rules[i];
                    rules[i] = rules[j];
                    rules[j] = temp;
                }
            }
        }

        string output = input.ToString();
        int k=0;
        foreach(var rul in rules)
        {
            if (rul.index != -1)
            {
                output = output.Insert(k + rul.index, rul.Add);
                k += rul.Add.Length;
            }
        }
        System.Console.WriteLine(output);
        System.Console.ReadLine();
    }
} 

answered Jul 31 '12 at 12:07

The idea is good; however, it would not work when a single replace rule has multiple occurrences. - Tomás Grosup

And a +1 for the bubble sort, made me smile :)) - Tomás Grosup

It is possible to collect the rules in your loop and execute them all after the loop. - Ali_D

You should probably write your own parser; it's probably easier to maintain :).

Maybe you could add special marker characters such as "##" around the patterns to protect them, provided the strings do not already contain them.

answered Jul 31 '12 at 09:07

That's the plan if no one comes up with an easy solution :( - Tomás Grosup

I would basically build an Aho-Corasick search machine that skips special-meaning characters and, instead of reporting a found string, inserts my replacement sequence @"(-\s+)?". - Tomás Grosup

Try this:

var final = Regex.Replace(originalTextOfThePage, @"(\w+)(?:\-[\s\r\n]*)?", "$1");
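
This one-liner appears to take a different route: rather than editing the search regex, it strips the hyphenation from the page text before matching. Here is a small sketch of what the pattern does on two sample inputs; the names `DehyphenateDemo` and `RemoveHyphenation` are just illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

static class DehyphenateDemo
{
    // Removes a hyphen, plus any whitespace after it, that directly follows a word.
    public static string RemoveHyphenation(string text)
    {
        return Regex.Replace(text, @"(\w+)(?:\-[\s\r\n]*)?", "$1");
    }

    static void Main()
    {
        // A word split across a line break is joined again.
        Console.WriteLine(RemoveHyphenation("consecu-\ntively"));  // prints: consecutively
        // The whitespace after the hyphen is optional, so ordinary
        // hyphenated words lose their hyphen too.
        Console.WriteLine(RemoveHyphenation("a well-known fact")); // prints: a wellknown fact
    }
}
```

If genuine hyphens must survive, requiring at least one whitespace character after the hyphen (`\s+` instead of `[\s\r\n]*`) would be one way to narrow it, though that is an assumption about what the page text looks like.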

answered Jul 31 '12 at 11:07

I had to give up on an easy solution and did the editing of the regex myself. As a side effect, the new approach goes through the string only twice.

private string ModifyRegexToAcceptHyphensOfCurrentPage(string regex, int searchedPage)
    {
        var indexesToInsertPossibleHyphenation = GetPossibleHyphenPositions(regex, searchedPage);
        var hyphenationToken = @"(-\s+)?";
        return InsertStringTokenInAllPositions(regex, indexesToInsertPossibleHyphenation, hyphenationToken);
    }

    private static string InsertStringTokenInAllPositions(string sourceString, List<int> insertionIndexes, string insertionToken)
    {
        if (insertionIndexes == null || string.IsNullOrEmpty(insertionToken)) return sourceString;

        var sb = new StringBuilder(sourceString.Length + insertionIndexes.Count * insertionToken.Length);
        var linkedInsertionPositions = new LinkedList<int>(insertionIndexes.Distinct().OrderBy(x => x));
        for (int i = 0; i < sourceString.Length; i++)
        {
            if (!linkedInsertionPositions.Any())
            {
                sb.Append(sourceString.Substring(i));
                break;
            }
            if (i == linkedInsertionPositions.First.Value)
            {
                sb.Append(insertionToken);
            }
            if (i >= linkedInsertionPositions.First.Value)
            {
                linkedInsertionPositions.RemoveFirst();
            }
            sb.Append(sourceString[i]);
        }
        return sb.ToString();
    }

    private List<int> GetPossibleHyphenPositions(string regex, int searchedPage)
    {
        var originalTextOfThePage = mPagesNotModified[searchedPage];
        var hyphenatedParts = Regex.Matches(originalTextOfThePage, @"\w+\-\s");
        var indexesToInsertPossibleHyphenation = new List<int>();
        //....
        // Aho-Corasick to find all occurrences of all
        // strings in "hyphenatedParts" in the "regex" string
        // ....
        return indexesToInsertPossibleHyphenation;
    }
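
The Aho-Corasick step is elided above. As a rough stand-in, the sketch below finds the insertion positions with a plain `IndexOf` loop instead (simpler, but slower when there are many parts) and checks the result against the example from the question. `HyphenPositions`, `FindInsertionIndexes` and `InsertToken` are hypothetical names, and unlike the search machine described in the comments, this version does not skip regex metacharacters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

static class HyphenPositions
{
    // Naive substitute for the Aho-Corasick search: for every part, record the
    // index just past each of its occurrences in the ORIGINAL regex string.
    public static List<int> FindInsertionIndexes(string regex, IEnumerable<string> parts)
    {
        var indexes = new SortedSet<int>();
        foreach (var part in parts.Where(p => !string.IsNullOrEmpty(p)))
        {
            int start = 0;
            int found;
            while ((found = regex.IndexOf(part, start, StringComparison.Ordinal)) >= 0)
            {
                indexes.Add(found + part.Length);
                start = found + 1; // step by one so overlapping occurrences are found too
            }
        }
        return indexes.ToList();
    }

    // Inserts the token at every index, right to left, so earlier indexes stay valid.
    public static string InsertToken(string source, IEnumerable<int> indexes, string token)
    {
        var sb = new StringBuilder(source);
        foreach (var index in indexes.Distinct().OrderByDescending(i => i))
        {
            sb.Insert(index, token);
        }
        return sb.ToString();
    }

    static void Main()
    {
        var regex = "consecuti?vely";
        var parts = new[] { "cuti?", "con", "consecu" };
        var result = InsertToken(regex, FindInsertionIndexes(regex, parts), @"(-\s+)?");
        Console.WriteLine(result); // prints: con(-\s+)?secu(-\s+)?ti?(-\s+)?vely
    }
}
```

Because all positions are computed on the unmodified string before anything is inserted, overlapping rules no longer interfere with each other.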

answered Jul 31 '12 at 12:07
