For regular expressions, what does r'[!-\.&]' mean?

Is it !-\ (characters from 33 = ord('!') to 92 = ord('\\')), plus '.' and '&', in a set?

I think my interpretation is incorrect based on my test.

But the Python reference doesn't say anything that contradicts my interpretation. http://docs.python.org/library/re.html

asked Nov 8 '11 at 16:11

4 Answers

In short, r'[!-\.&]' is just a complicated way of writing r'[!-.]'.

It matches all characters whose ord is between 33 = ord('!') and 46 = ord('.'), i.e. any of the following:

!"#$%&'()*+,-.

The escaping backslash before . is ignored in character classes; it is unnecessary (. matching all characters in a character class wouldn't make any sense). Since the ampersand & is already in the character class, it is superfluous as well.
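
A quick way to check this is to enumerate the ASCII characters that each pattern accepts. This is a minimal sketch, assuming Python 3 and str patterns; the names `pattern`, `simpler` and `matched_chars` are made up for the illustration.

```python
import re

pattern = re.compile(r'[!-\.&]')   # the pattern from the question
simpler = re.compile(r'[!-.]')     # the proposed equivalent

def matched_chars(regex):
    """Return every ASCII character that the regex matches on its own."""
    return ''.join(chr(i) for i in range(128) if regex.fullmatch(chr(i)))

matched = matched_chars(pattern)
matched_simpler = matched_chars(simpler)

print(matched)                      # !"#$%&'()*+,-.
print(matched == matched_simpler)   # True
```

Both patterns accept the same 14 characters, chr(33) through chr(46).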

answered 08 Nov., 11:21

@FelixKling Precisely. Copied your comment into the answer. - Phihag

@Eugene Yup. Character class is the formal name of a set of characters or the stuff in brackets. - Phihag

An escaping backslash is not ignored in a bracketed character class. For example, it must be used to prefix a literal backslash if a backslash should be part of the class. It would be correct to say the backslash is optional before '.' because '.' loses its special meaning in a bracketed character class. - MetaEd

Tests may show that the pattern matches chr(33) through chr(46), but the pattern is not guaranteed to work that way on all systems. Here's why: character sets vary from system to system.

This is why the Perl regex documentation specifically recommends “to use only ranges that begin from and end at either alphabetics of equal case ([a-e], [A-E]), or digits ([0-9]). Anything else is unsafe.” (The Perl documentation is relevant because Python's re module follows Perl-style regular expression syntax.)

So, if this pattern is ever run on an EBCDIC based platform, it will match a different set of characters. It is only correct to say that the pattern matches chr(33) through chr(46) on ASCII based platforms.

answered 08 Nov., 11:21
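
To illustrate how the endpoints of the range move between character sets, the sketch below looks up the byte values of "!" and "." in ASCII and in one EBCDIC code page. It assumes that Python's `cp037` codec is a fair stand-in for "an EBCDIC based platform"; the helper `endpoints` is hypothetical. Note that a Python 3 str pattern always compares Unicode code points, so this only shows the ordering in the encodings, not a change in how `re` behaves.

```python
def endpoints(encoding):
    """Return the byte values of '!' and '.' in the given single-byte encoding."""
    start = '!'.encode(encoding)[0]
    end = '.'.encode(encoding)[0]
    return start, end

for encoding in ('ascii', 'cp037'):
    start, end = endpoints(encoding)
    order = 'ascending' if start <= end else 'descending'
    print(encoding, start, end, order)
```

In ASCII the endpoints are 33 and 46. In cp037 "!" sorts after ".", so a range written as !-. would not even be ascending there.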

Does your warning apply to UTF-8/16? - eugene

@Eugene: It is very perilous to use any regexes on bytestrings encoded in any multibyte encoding. Decode them to unicode first. - John Machin
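
A small sketch of what can go wrong, assuming Python 3: the character U+2121 (TELEPHONE SIGN) is outside the range "!" to ".", but its UTF-16-LE encoding happens to be two bytes that both equal 0x21, the byte for "!". The variable names are illustrative only.

```python
import re

text = '\u2121'                    # TELEPHONE SIGN, not between '!' and '.'
data = text.encode('utf-16-le')    # two bytes, each 0x21, i.e. b'!!'

# A bytes pattern sees individual bytes, so it matches pieces of one character.
byte_hits = re.findall(rb'[!-.]', data)

# A str pattern sees whole code points, so it finds nothing here.
char_hits = re.findall(r'[!-.]', text)

print(byte_hits)   # [b'!', b'!']
print(char_hits)   # []
```

Decoding first and matching against the str avoids the false hits.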

It seems that the intention of this regex is to match any character between "!" and "." (notice that the backslash is escaping the "." character), which are ! " # $ % & ' ( ) * + , - . (from the Unicode table at http://www.tamasoft.co.jp/en/general-info/unicode.html).

Two comments about the expression:

  1. Usually, you don't need to escape characters within brackets [] (except, maybe, for the \ itself).
  2. The ampersand symbol "&" is already contained in the range defined by "!-.", so it is redundant.

answered 09 Nov., 11:03

1) There are other characters that need to be escaped, e.g. -[]^, depending on where you put them. 2) I think it's possible that the character class was meant to match !-.& only, i.e. - is not meant to denote a range, but the hyphen itself. - NullUserException
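
If the intent really was the four literal characters ! - . &, the hyphen has to be escaped or moved to a position where it cannot form a range. A minimal sketch, assuming Python 3; the pattern names and the `matched_chars` helper are made up for the comparison.

```python
import re

range_class = re.compile(r'[!-.&]')       # '-' forms the range '!' to '.'
literal_escaped = re.compile(r'[!\-.&]')  # escaped '-' is a literal hyphen
literal_last = re.compile(r'[!.&-]')      # '-' placed last is also literal

def matched_chars(regex):
    """Return every ASCII character that the regex matches on its own."""
    return ''.join(chr(i) for i in range(128) if regex.fullmatch(chr(i)))

print(matched_chars(range_class))      # !"#$%&'()*+,-.
print(matched_chars(literal_escaped))  # !&-.
print(matched_chars(literal_last))     # !&-.
```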

The backslash escapes the dot, so the range runs from ! to . and the regex will match:

!"#$%&'()*+,-.

The last & is not necessary since it's included in the range, and escaping a dot is not needed either since it's inside a character class.

answered 08 Nov., 11:20
