List of unicode strings

If I have a list of unicode strings

lst = [ u"aaa", u"bbb", u"foo", u"bar", ... u"baz", u"zzz" ]

is it necessary to write a u prefix before every string? Can I use a construct that declares every element of lst to be a unicode string, and then write the strings without the u prefix?

asked Feb 1 '12 at 14:02

This depends on whether you are using Python 2 or Python 3. -

I'm using Python 2.7.2+, but if you know the answer for both, it could be useful in the future. -

In Python 3.x all strings are unicode by default, and any channel dealing with text I/O (files, databases, printing) either requires an explicit encoding or uses the system-wide encoding by default. -

2 Answers

In Python 2.7 (also Python 2.6) you can make unicode literals the default for a module:

from __future__ import unicode_literals

You must include the import at the top of the file, and it then applies to all string literals in the file. Use a b prefix to force byte strings:

>>> from __future__ import unicode_literals
>>> "sss"
u'sss'
>>> b"x"
'x'

answered Feb 1 '12 at 18:02

If your intention is to convert a set of standard strings to unicode, you could map that function onto your list:

lst = ["aaa", "bbb", "ccc"]
map(unicode, lst)

which gives

[u"aaa", u"bbb", u"ccc"]

If lst contains a string with non-ASCII characters, you'll have to prefix that particular string with u. If you don't, you'll get this error on conversion:

lst = ["\xe4"]
map(unicode,lst)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
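One possible way around this error, as a rough sketch: decode each byte string with an explicit codec instead of relying on the implicit ASCII default. The latin-1 codec and the name ENCODING below are assumptions chosen only for illustration (in latin-1, byte 0xe4 is "ä"); substitute whatever encoding your data is really in. The b prefix makes the snippet behave the same on Python 2.6+ and Python 3.

```python
# Assumption: the byte strings are latin-1 encoded. This is a guess made
# for the example; use the real encoding of your data.
ENCODING = "latin-1"

# Byte strings, one of them containing the non-ASCII byte 0xe4.
lst = [b"aaa", b"\xe4"]

# Decode every element explicitly instead of calling unicode() on it,
# which would fall back to the ASCII codec in Python 2.
decoded = [s.decode(ENCODING) for s in lst]
```

Choosing the wrong codec will not necessarily raise an error; it can silently produce the wrong characters, so the encoding should come from knowledge of the data, not from trial and error.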

As noted in the comments, the answer differs between Python 2.x and 3.x. In Python 3, everything changes:

Everything you thought you knew about binary data and Unicode has changed. Python 3.0 uses the concepts of text and (binary) data instead of Unicode strings and 8-bit strings. All text is Unicode; however encoded Unicode is represented as binary data. The type used to hold text is str, the type used to hold data is bytes. The biggest difference with the 2.x situation is that any attempt to mix text and data in Python 3.0 raises TypeError, whereas if you were to mix Unicode and 8-bit strings in Python 2.x, it would work if the 8-bit string happened to contain only 7-bit (ASCII) bytes, but you would get UnicodeDecodeError if it contained non-ASCII values. This value-specific behavior has caused numerous sad faces over the years.
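To illustrate the quoted passage, here is a minimal Python 3 sketch (not part of the quote) of the text/data split it describes. The variable names are made up for the example.

```python
text = "aaa"   # str: Unicode text in Python 3, no u prefix needed
data = b"aaa"  # bytes: binary data

# Mixing text and data raises TypeError in Python 3, even though the
# bytes here are pure ASCII (Python 2 would have coerced them silently).
try:
    text + data
except TypeError:
    mixing_allowed = False
else:
    mixing_allowed = True

# The conversion has to be spelled out, with a named encoding.
combined = text + data.decode("ascii")  # bytes -> str
encoded = text.encode("utf-8")          # str -> bytes
```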

answered Feb 1 '12 at 19:02

I'd just like a better declaration that saves me typing. Something like lst = u["aaa", "bbb", "ccc"], which would say that every string in lst is unicode. - xralf
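Python has no u[...] list syntax, but one way to get close to it, sketched here under the assumption that none of the strings contain whitespace, is to write a single unicode literal and split it. The six items are the ones named in the question.

```python
# One u prefix for the whole list: splitting a unicode string yields
# unicode strings. Assumption: no item contains whitespace itself.
lst = u"aaa bbb foo bar baz zzz".split()
```

If the items can contain spaces, the same idea works with a delimiter of your choosing passed to split(), for example u"new york|los angeles".split(u"|").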

-1 for not knowing about unicode encodings, and thinking that "ASCII is OK" - please read joelonsoftware.com/articles/Unicode.html - jsbueno

@jsbueno - Nowhere did I say that the default Python 2.x encoding (ASCII) is "OK". I simply stated a quick and dirty method for converting the OP's list of what looked to be only ASCII encodings into the unicode representation with the explicit ASCII encoding. Is it what he wanted? I'm not entirely sure as he didn't specify the encoding he wanted, so I took a guess. To add value to this site, if you feel that a detailed explanation about the different encodings is needed, please provide it in another answer! - Enganchado
