Problem with Unicode and Python (accessing the Unicode code charts)
Yesterday I wrote the following function to convert an integer to Persian:
def integerToPersian(number):
    listedPersian = ['۰','۱','۲','۳','۴','۵','۶','۷','۸','۹']
    listedEnglish = ['0','1','2','3','4','5','6','7','8','9']
    returnList = list()
    listedTmpString = list(str(number))
    for i in listedTmpString:
        returnList.append(listedPersian[listedEnglish.index(i)])
    return ''.join(returnList)
When you call it, for example integerToPersian(3455), it returns ۳۴۵۵, which is the equivalent of 3455 in Persian and Arabic. When you read a number, for example from a database, and want to show it in a widget, this function is very useful.
I have downloaded the Unicode code charts from http://unicode.org because I need to write PersianToInteger('unicodeString'). As I understand it, it should take UTF-8 as its parameter, and UTF-8 stores 2 bytes. Also, I'm a newbie in Python.
My questions are: how can I store 2 bytes? How does UTF-8 store them? How can I split a Unicode string into another format? How can I use the Unicode code charts?
Note: I found the int() built-in function, but I couldn't get it to work. Maybe you can
1 Answer
You need to read the Python Unicode HOWTO for either Python 2.x or 3.x, as appropriate. But I can give you brief answers to your questions.
My questions are, how can store 2bytes? how can utf8 store , how can split an unicode string to another format ?
A unicode object holds characters; a bytes object holds bytes.
Note that in Python 2.x, str is the same thing as bytes; in 3.x, it's the same thing as unicode. And in both languages, a literal with neither a u nor a b prefix is a str. Since you didn't tell us whether you're using Python 2 or 3, I'll use explicit unicode and bytes, and u and b prefixes, everywhere.
You convert between them by picking an encoding (in this case, UTF-8) and using the encode and decode methods. For example:
>>> my_str = u'۰۱'
>>> my_bytes = b'\xdb\xb0\xdb\xb1'
>>> my_str.encode('utf-8') == my_bytes
True
>>> my_bytes.decode('utf-8') == my_str
True
If you have a UTF-8 bytes object, you should decode it to unicode as early as possible, and do all your work with it in Unicode. Then you don't have to worry about how many bytes something takes; just treat each character as a character. If you need UTF-8 output, encode back as late as possible.
(Very occasionally, the performance cost of decoding and encoding is too high, and you need to deal with UTF-8 directly. But unless that really is a bottleneck in your code, don't do it.)
So, let's say you wanted to adapt your integerToPersian to take a UTF-8 English digit string instead of an integer, and to return a UTF-8 Persian digit string instead of a Unicode one. (I'm assuming Python 3 for the purposes of this example.) All you need to do is change str(number) to number.decode('utf-8'), and change return ''.join(returnList) to return ''.join(returnList).encode('utf-8'), and that's it.
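Spelled out, the adapted function might look roughly like this. It is only a sketch assuming Python 3; the name integerToPersianUtf8 is made up here to keep it apart from your original, and the Persian digits are written as escapes (U+06F0 to U+06F9) so the source file encoding doesn't matter:

```python
def integerToPersianUtf8(number):
    # Persian digits U+06F0..U+06F9, i.e. '۰' through '۹'.
    listedPersian = list('\u06f0\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9')
    listedEnglish = list('0123456789')
    returnList = list()
    # Changed: decode the UTF-8 bytes instead of calling str(number).
    listedTmpString = list(number.decode('utf-8'))
    for i in listedTmpString:
        returnList.append(listedPersian[listedEnglish.index(i)])
    # Changed: encode the result back to UTF-8 bytes on the way out.
    return ''.join(returnList).encode('utf-8')
```

Calling integerToPersianUtf8(b'3455') gives back a bytes object that decodes to ۳۴۵۵. Any byte that isn't an ASCII digit raises ValueError from index(), same as in your original.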
how can use unicode code charts?
Python already comes with the Unicode code charts (and the right ones to match your version of Python) compiled into the unicodedata module, so usually it's a lot easier to just use those than to try to use the charts yourself. For example:
>>> import unicodedata
>>> unicodedata.digit(u'۱')
1
… i need to wrote PersianToInteger('unicodeString')
You really shouldn't need to. Unless you're using a very old Python, int should do it for you. For example, in 2.6:
>>> int(u'۱۱')
11
If it's not working for you, unicodedata is the simplest solution:
>>> numeral = u'۱۱'
>>> [unicodedata.digit(ch) for ch in numeral]
[1, 1]
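That gives you a list of digit values rather than a number. To fold them into a single integer, something like the following should do (digits_to_int is a hypothetical helper name, not something unicodedata provides):

```python
import unicodedata

def digits_to_int(numeral):
    """Build an integer from a string of Unicode decimal digits."""
    value = 0
    for ch in numeral:
        # unicodedata.digit raises ValueError if ch is not a digit.
        value = value * 10 + unicodedata.digit(ch)
    return value
```

For instance, digits_to_int(u'۱۱') returns 11.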
However, either of these will convert digits in any script to a number, not just Persian. And there's nothing in the Unicode charts that will directly tell you that a digit is Persian; the best you can do is parse the name:
>>> all('ARABIC-INDIC DIGIT' in unicodedata.name(ch) for ch in numeral)
True
>>> all('ARABIC-INDIC DIGIT' in unicodedata.name(ch) for ch in '123')
False
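One caveat: the substring test above also matches the Arabic digits (U+0660 to U+0669), whose names are 'ARABIC-INDIC DIGIT ...'. The Persian ones (U+06F0 to U+06F9) are named 'EXTENDED ARABIC-INDIC DIGIT ...'. If you need to tell the two apart, a rough sketch could look like this (is_persian_numeral is a name invented for this example):

```python
import unicodedata

PERSIAN_NAME_PREFIX = 'EXTENDED ARABIC-INDIC DIGIT'

def is_persian_numeral(text):
    """True only if text is non-empty and every character is U+06F0..U+06F9."""
    return bool(text) and all(
        # The '' default avoids ValueError for characters without a name.
        unicodedata.name(ch, '').startswith(PERSIAN_NAME_PREFIX)
        for ch in text
    )
```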
If you really want to do things in either direction by mapping digits from one script to another, here's a better solution:
listedPersian = ['۰','۱','۲','۳','۴','۵','۶','۷','۸','۹']
listedEnglish = ['0','1','2','3','4','5','6','7','8','9']
# Lookup tables for mapping digits in each direction.
persianToEnglishMap = dict(zip(listedPersian, listedEnglish))
englishToPersianMap = dict(zip(listedEnglish, listedPersian))

def persianToNumber(persian_numeral):
    # Map each Persian digit to its English counterpart, then let int parse it.
    english_numeral = ''.join(persianToEnglishMap[digit] for digit in persian_numeral)
    return int(english_numeral)
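If you're on Python 3, the same idea can also be sketched with str.maketrans and str.translate, which handle the per-character mapping for you. This is an alternative to the dict version, not what the code above does, and all the names below are made up for illustration:

```python
# Persian digits U+06F0..U+06F9 ('۰' through '۹'), written as escapes.
PERSIAN_DIGITS = '\u06f0\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9'
ENGLISH_DIGITS = '0123456789'

# Translation tables for both directions.
PERSIAN_TO_ENGLISH = str.maketrans(PERSIAN_DIGITS, ENGLISH_DIGITS)
ENGLISH_TO_PERSIAN = str.maketrans(ENGLISH_DIGITS, PERSIAN_DIGITS)

def persian_to_int(persian_numeral):
    # Characters not in the table pass through unchanged; int() then
    # raises ValueError if the result is not a valid number.
    return int(persian_numeral.translate(PERSIAN_TO_ENGLISH))

def int_to_persian(number):
    return str(number).translate(ENGLISH_TO_PERSIAN)
```

With these, persian_to_int(u'۳۴۵۵') returns 3455 and int_to_persian(3455) returns the string ۳۴۵۵.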
Answered Sep 10 '13 at 19:09
If you read again, my integerToPersian works fine; I need help with persianToInteger. - Persian Gulf
@MohsenPahlevanzadeh: If you would actually answer people's questions instead of writing the same comment over and over, it would be much easier to help you. - abarnert
Are you using python2 or python3? - Robᵩ
Note that Python comes with all the information from the Unicode charts built-in (and they're guaranteed to match the version of Unicode your Python version works with) in the unicodedata module. - abarnert
As a side note, listedEnglish.index(i) is intended to be just int(i), right? Which is a lot simpler, and means you can get rid of listedEnglish entirely… - abarnert