How do I return only non-null capture groups for each match?

I'm using this regex to parse lines of a CSV in Apex:

Pattern csvPattern = Pattern.compile('(?:^|,)(?:\"([^\"]+|\"\")*\"|([^,]+)*)');

It works great, but returns two groups for each match (one for the quoted values, and one for non-quoted values). See below:

Matcher csvMatcher = csvPattern.matcher('"hello",world');
Integer m = 1;
while (csvMatcher.find()) {
    System.debug('Match ' + m);
    for (Integer i = 1; i <= csvMatcher.groupCount(); i++) {
        System.debug('Capture group ' + i + ': ' + csvMatcher.group(i));
    }
    m++;
}

Running this code will return the following:

[5]|DEBUG|Match 1
[7]|DEBUG|Capture group 1: hello
[7]|DEBUG|Capture group 2: null
[5]|DEBUG|Match 2
[7]|DEBUG|Capture group 1: null
[7]|DEBUG|Capture group 2: world

I'd like for each match to only return the non-null capture. Is that possible?

asked Mar 10 '12 at 16:03

Your regex is a bit broken; note that something like (a)* will capture only one of the matched a's. So you need to change ([^\"]+|\"\")* to ((?:[^\"]+|\"\")*) if you want to capture the whole contents of a double-quoted string that contains "". (There are a few other issues as well; your regex is not written very defensively IMHO.) Also -- it seems like you'd kind of want the two variants to show up in separate capture-groups, wouldn't you? Because the quoted variant will require post-processing to convert "" to ", whereas the unquoted variant will not. -

I asked because I wanted your NSHO! Thanks for the ideas. I just started this, so they're very helpful. -
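To illustrate the comment above about repeated capture groups, here is a minimal sketch in Java. Java is used on the assumption that Apex's Pattern and Matcher behave like java.util.regex, which they are modeled on; the class name, the helper firstGroup, and the sample field are made up for this illustration and are not from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // Hypothetical helper: returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The CSV field "say ""hi"" now", including its surrounding quotes.
        String field = "\"say \"\"hi\"\" now\"";

        // Shape used in the question: the capture group itself is repeated,
        // so it keeps only the text of its last iteration.
        String repeated = "^\"([^\"]+|\"\")*\"$";

        // Shape suggested in the comment: repeat a non-capturing group
        // inside a single capture group, so the whole body is kept.
        String wrapped = "^\"((?:[^\"]+|\"\")*)\"$";

        System.out.println("[" + firstGroup(repeated, field) + "]");
        System.out.println("[" + firstGroup(wrapped, field) + "]");
    }
}
```

Run as written, the first line printed is [ now] and the second is [say ""hi"" now]: the repeated group holds only its final iteration, while the wrapped version holds the full field body.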

2 Answers

This is actually a difficult thing to do.
It could be done with lookahead/behind assertions.
Not very intuitive though.

It looks something like this:
(?:^|,)(\s*"(?=(?:[^"]+|"")*"\s*(?:,|$)))?((?<=")(?:[^"]+|"")*(?="\s*(?:,|$))|[^,]*)

It works by lining up the match position on the text body just after the first quote " of a valid quoted field. If it's not a valid quoted field, it lines up on the quote itself. At that point the text body can be captured either as an un-quoted field or as a quoted field minus the quotes, in a single capture buffer.

This is a fairly heavy-duty regex, but it gives a precise solution without the need for follow-up code. I could be missing something, but I see no way to do this without lookaround assertions, so your engine must support them. If it doesn't, you'll have to pick the non-null group out in code, as in your solution above.

Here is a prototype in Perl, with a commented expanded regex below it.
Good luck!

$samp = '  "hello " , world",,me,and,th""is, or , "tha""t"  ';

$regex = '
  (?: ^ | , )
  (\s*" (?= (?:[^"]+|"")* " \s*(?:,|$) ) )?
  (
     (?<=") (?:[^"]+|"")* (?="\s*(?:,|$) )
   |
     [^,]*
  )
';
while ($samp =~ /$regex/xg)
{
   print "'$2'\n";
}

Output

'hello '
' world"'
''
'me'
'and'
'th""is'
' or '
'tha""t'

Commented

(?: ^ | , )          # Consume comma (or BOL is fine)

(                    # Capture group 1, capture '"' only if a complete quoted field
   \s*                  # Optional many spaces
   "
   (?=                  # Lookahead, check for a valid quoted field, determines if a '"' will be consumed
      (?:[^"]+|"")*
      "
      \s*
      (?:,|$)
   )
)?                   # End capt grp 1. 0 or 1 quote

(                    # Capture group 2, the body of text
   (?<=")                 # If there is a '"' behind us, we have consumed a '"' in capture grp 1, so this is valid
   (?:[^"]+|"")*
   (?="\s*(?:,|$) )
 |                      # OR,
   [^,]*                  # Just get up to the next ',' This could be incomplete quoted fields
)                    # End capt grp 2
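Since the question is about Apex, here is a rough Java port of the same regex. It assumes Apex's Pattern and Matcher behave like java.util.regex (which they are modeled on) and was not run in Apex itself; the class name and the fields helper are hypothetical names chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvLookaroundDemo {
    // The lookaround regex from this answer, split into its three parts.
    static final Pattern FIELD = Pattern.compile(
          // Comma or start of line.
          "(?:^|,)"
          // Group 1: the opening quote, consumed only for a complete quoted field.
        + "(\\s*\"(?=(?:[^\"]+|\"\")*\"\\s*(?:,|$)))?"
          // Group 2: the field body, quoted (minus the quotes) or unquoted.
        + "((?<=\")(?:[^\"]+|\"\")*(?=\"\\s*(?:,|$))|[^,]*)");

    // Hypothetical helper: collects group 2 of every match, one string per field.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("\"hello\",world"));
    }
}
```

For the question's input "hello",world this prints [hello, world]. Every field body lands in group 2, so the calling loop never has to check which group is null. Doubled quotes inside a quoted field are still returned as "" and would need unescaping afterwards.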

Extension

If you do end up using this, it can be sped up by consuming a backreferenced quoted field
instead of matching a quoted field twice. Backreferences usually resolve to a single string
comparison API call such as strncmp() in C, making it much faster.
As a side note, whitespace before/after the field body of non-quoted fields can be trimmed
within the regex with a little extra notation.
Good luck!

Compressed

(?:^|,)(?:\s*"(?=((?:[^"]+|"")*)"\s*(?:,|$)))?((?<=")\1|[^,]*)

Expanded

(?: ^|, )
(?: \s* " (?=  ( (?:[^"]+|"")* )  " \s*  (?: ,|$ )  ))?
( (?<=") \1 | [^,]* )

Expanded with comments

(?: ^ | , )          # Consume comma (or BOL is fine)

(?:                  # Start grouping
   \s*                  # Spaces, then double quote '"' (consumed if valid quoted field)
   "                    #
   (?=                  # Lookahead, nothing consumed (check for valid quoted field)
      (                     # Capture grp 1
         (?:[^"]+|"")*          # Body of quoted field  (stored for later consumption)
      )                     # End capt grp 1
      "                     # Double quote '"'
      \s*                   # Optional spaces
      (?: , | $ )           # Comma or EOL
   )                    # End lookahead
)?                   # End grouping, optionally matches and consumes '\s*"'

(                    # Capture group 2, consume FIELD BODY
   (?<=")                 # Lookbehind, if there is a '"' behind us the field is quoted
   \1                     # Consume capt grp 1
 |                      # OR,
   [^,]*                  # Invalid-quoted or Non-quoted field, get up to the next ','
)                    # End capt grp 2
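The backreference variant can be sketched in Java the same way. This relies on java.util.regex keeping a group that was captured inside a positive lookahead, and on a backreference to an unset group simply failing so that the [^,]* alternative takes over. Whether Apex behaves identically is an assumption, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvBackrefDemo {
    static final Pattern FIELD = Pattern.compile(
          // Comma or start of line.
          "(?:^|,)"
          // Consume the opening quote if a valid quoted field follows;
          // group 1 stores the quoted body from inside the lookahead.
        + "(?:\\s*\"(?=((?:[^\"]+|\"\")*)\"\\s*(?:,|$)))?"
          // Group 2: replay group 1 after a quote, or take everything up to the next comma.
        + "((?<=\")\\1|[^,]*)");

    // Hypothetical helper: the field body is still group 2, as in the first version.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("\"hello\",world"));
    }
}
```

For "hello",world this also prints [hello, world]. Note that group 1 now holds the lookahead's copy of the quoted body rather than the opening quote, so callers should keep reading group 2.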

answered Mar 12 '12 at 16:03

Wow. I learned a lot from that answer and it worked well. I agree that it's not too intuitive, but I'll link to this answer in my code for when I get stumped in the future. - apenas conocido

With some inspiration from ruakh, I updated the regex to return only one capture group per match (and handle quotes within the field and white spaces).

(?:^|[\s]*?,[\s]*)(\"(?:(?:[^\"]+|\"\")*)[^,]*|(?:[^,])*)
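A quick way to see what this single-group regex returns is the Java sketch below (again assuming Apex's regex classes behave like java.util.regex). The surrounding quotes stay inside the capture, so the sketch adds an unquote step. That helper, like the class and method names, is an assumption of this example and not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SingleGroupCsvDemo {
    // The regex from this answer, written as a Java string literal.
    static final Pattern FIELD = Pattern.compile(
        "(?:^|[\\s]*?,[\\s]*)(\"(?:(?:[^\"]+|\"\")*)[^,]*|(?:[^,])*)");

    // Hypothetical helper: collects group 1 of every match, quotes included.
    static List<String> rawFields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // Hypothetical post-processing step: strip the surrounding quotes
    // and turn each doubled quote "" back into a single ".
    static String unquote(String raw) {
        if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
            return raw.substring(1, raw.length() - 1).replace("\"\"", "\"");
        }
        return raw;
    }

    public static void main(String[] args) {
        for (String raw : rawFields("\"hello\",world")) {
            System.out.println(raw + " -> " + unquote(raw));
        }
    }
}
```

For "hello",world the raw captures are "hello" (with its quotes) and world; after unquote they become hello and world.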

answered Mar 10 '12 at 20:03
