Problem with Unicode and Python (accessing the Unicode code charts)

Yesterday I wrote the following function to convert an integer to Persian:

def integerToPersian(number):
    listedPersian = ['۰','۱','۲','۳','۴','۵','۶','۷','۸','۹']
    listedEnglish = ['0','1','2','3','4','5','6','7','8','9']    
    returnList = list()

    listedTmpString = list(str(number))

    for i in listedTmpString:
        returnList.append(listedPersian[listedEnglish.index(i)])

    return ''.join(returnList)

When you call it, for example integerToPersian(3455), it returns ۳۴۵۵, which is the equivalent of 3455 in the Persian and Arabic languages. This function is very useful when you read a number from a source such as a database and want to show it in a widget.

I have downloaded the Unicode code charts from http://unicode.org, because I need to write PersianToInteger('unicodeString'). As I understand it, the function should take UTF-8 as its parameter, and UTF-8 stores 2 bytes. Also, I'm a newbie in Python.

My questions are: how can I store 2 bytes? How does UTF-8 store them? How can I split a Unicode string into another format? How can I use the Unicode code charts?

Note: I found that I could use the int() built-in function, but I couldn't get it to work. Maybe you can.

asked Sep 9 '13 at 22:09

Are you using Python 2 or Python 3?

Note that Python comes with all the information from the Unicode charts built in (and it's guaranteed to match the version of Unicode your Python version works with) in the unicodedata module.

As a side note, listedEnglish.index(i) is intended to be just int(i), right? Which is a lot simpler, and means you can get rid of listedEnglish entirely…

1 Answer

You need to read the Python Unicode HOWTO for either Python 2.x or 3.x, as appropriate. But I can give you brief answers to your questions.

My questions are, how can store 2bytes? how can utf8 store , how can split an unicode string to another format ?

A unicode object holds characters; a bytes object holds bytes.

Note that in Python 2.x, str is the same thing as bytes; in 3.x, it's the same thing as unicode. And in both languages, a literal with neither a u nor a b prefix is a str. Since you didn't tell us whether you're using Python 2 or 3, I'll use explicit unicode and bytes, and u and b prefixes, everywhere.

You convert between them by picking an encoding (in this case, UTF-8) and using the encode and decode methods. For example:

>>> my_str = u'۰۱'
>>> my_bytes = b'\xdb\xb0\xdb\xb1'
>>> my_str.encode('utf-8') == my_bytes
True
>>> my_bytes.decode('utf-8') == my_str
True

If you have a UTF-8 bytes object, you should decode it to unicode as early as possible, and do all your work with it in Unicode. Then you don't have to worry about how many bytes something takes; just treat each character as a character. If you need UTF-8 output, encode back as late as possible.

(Very occasionally, the performance cost of decoding and encoding is too high, and you need to deal with UTF-8 directly. But unless that really is a bottleneck in your code, don't do it.)
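
As a rough illustration of "decode early, encode late", here is a minimal sketch assuming Python 3. The helper name reverse_utf8 is hypothetical, not something from the question. It also shows that UTF-8 uses a different number of bytes for different characters, which is why working on decoded text is simpler than counting bytes.

```python
# Sketch only, assuming Python 3; reverse_utf8 is a hypothetical helper.

mixed = u'1\u06f1'  # ASCII digit one followed by Persian digit one
# Number of UTF-8 bytes each character occupies.
bytes_per_char = [len(ch.encode('utf-8')) for ch in mixed]
print(bytes_per_char)


def reverse_utf8(utf8_data):
    """Reverse the characters of a UTF-8 encoded bytes object."""
    text = utf8_data.decode('utf-8')      # bytes -> characters, at the boundary
    reversed_text = text[::-1]            # all real work happens on characters
    return reversed_text.encode('utf-8')  # characters -> bytes, on the way out


print(reverse_utf8(b'\xdb\xb0\xdb\xb1'))
```

Reversing the raw bytes instead would split the two-byte sequences and produce invalid UTF-8, which is the kind of problem decoding first avoids.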

So, let's say you wanted to adapt your integerToPersian to take a UTF-8 English digit string instead of an integer, and to return a UTF-8 Persian digit string instead of a Unicode one. (I'm assuming Python 3 for the purposes of this example.) All you need to do is change str(number) to number.decode('utf-8'), and change return ''.join(returnList) to return ''.join(returnList).encode('utf-8'), and that's it.
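
Applied to the function from the question, those two changes might look like the sketch below. It assumes Python 3, and the name utf8DigitsToPersian is made up here to keep it apart from the original integerToPersian.

```python
# Sketch only, assuming Python 3. utf8DigitsToPersian is a hypothetical name;
# the body is the question's integerToPersian with the two changes applied.

def utf8DigitsToPersian(number):
    # number is a UTF-8 bytes object of English digits, such as b'3455'.
    listedPersian = ['۰', '۱', '۲', '۳', '۴', '۵', '۶', '۷', '۸', '۹']
    listedEnglish = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
    returnList = list()

    # Change 1: decode the incoming bytes instead of calling str(number).
    listedTmpString = list(number.decode('utf-8'))

    for i in listedTmpString:
        returnList.append(listedPersian[listedEnglish.index(i)])

    # Change 2: encode the result back to UTF-8 on the way out.
    return ''.join(returnList).encode('utf-8')


print(utf8DigitsToPersian(b'3455'))
```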

how can use unicode code charts?

Python already comes with the Unicode code charts (and the right ones to match your version of Python) compiled into the unicodedata module, so usually it's a lot easier to just use those than to try to use the charts yourself. For example:

>>> import unicodedata
>>> unicodedata.digit(u'۱')
1
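
To get a feel for what the module knows about a character, a small sketch like the following (assuming Python 3) prints the code point, the official name, and the digit value for three different "one" characters. The escapes are used only so that the look-alike Persian and Arabic digits can be told apart in the source.

```python
# Sketch only, assuming Python 3.
import unicodedata

# ASCII one, Persian one (U+06F1), Arabic one (U+0661).
samples = u'\u0031\u06f1\u0661'

for ch in samples:
    # Code point in hex, official Unicode name, numeric digit value.
    print(hex(ord(ch)), unicodedata.name(ch), unicodedata.digit(ch))

names = [unicodedata.name(ch) for ch in samples]
```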

… i need to wrote PersianToInteger('unicodeString')

You really shouldn't need to. Unless you're using a very old Python, int should do it for you. For example, in 2.6:

>>> int(u'۱۱')
11

If it's not working for you, unicodedata is the simplest solution:

>>> numeral = u'۱۱'
>>> [unicodedata.digit(ch) for ch in numeral]
[1, 1]
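
That list of digits still has to be folded into a single number. One way to do that, as a minimal sketch assuming Python 3 (digitsToInteger is a hypothetical name), is to accumulate base-10 place values:

```python
# Sketch only, assuming Python 3; digitsToInteger is a hypothetical helper.
import unicodedata


def digitsToInteger(numeral):
    """Build an integer from a string of Unicode digits, left to right."""
    value = 0
    for ch in numeral:
        # unicodedata.digit raises ValueError if ch has no digit value.
        value = value * 10 + unicodedata.digit(ch)
    return value


print(digitsToInteger(u'۱۱'))
print(digitsToInteger(u'۳۴۵۵'))
```

A non-digit character makes unicodedata.digit raise ValueError, so bad input fails loudly instead of being silently skipped.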

However, either of these will convert digits in any script to a number, not just Persian. And there's nothing in the Unicode charts that will directly tell you that a digit is Persian; the best you can do is parse the name:

>>> all('ARABIC-INDIC DIGIT' in unicodedata.name(ch) for ch in numeral)
True
>>> all('ARABIC-INDIC DIGIT' in unicodedata.name(ch) for ch in u'123')
False
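
One caveat with the substring test: the Persian digits (U+06F0 to U+06F9) are named EXTENDED ARABIC-INDIC DIGIT …, while the Arabic digits (U+0660 to U+0669) are named ARABIC-INDIC DIGIT …, so the check above accepts both. If only the Persian forms should pass, a stricter sketch, assuming Python 3 and using the hypothetical name isPersianNumeral, could test the name prefix instead:

```python
# Sketch only, assuming Python 3; isPersianNumeral is a hypothetical helper.
import unicodedata

PERSIAN_DIGIT_PREFIX = 'EXTENDED ARABIC-INDIC DIGIT'


def isPersianNumeral(numeral):
    """True if numeral is non-empty and every character is a Persian digit."""
    # The '' default keeps unnamed characters from raising ValueError.
    return bool(numeral) and all(
        unicodedata.name(ch, '').startswith(PERSIAN_DIGIT_PREFIX)
        for ch in numeral
    )


print(isPersianNumeral(u'\u06f1\u06f1'))  # Persian digits
print(isPersianNumeral(u'\u0661\u0661'))  # Arabic digits
print(isPersianNumeral(u'123'))           # ASCII digits
```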

If you really want to do things in either direction by mapping digits from one script to another, here's a better solution:

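# The u prefixes keep each digit a single Unicode character on Python 2 as well as Python 3.3+.
listedPersian = [u'۰',u'۱',u'۲',u'۳',u'۴',u'۵',u'۶',u'۷',u'۸',u'۹']
listedEnglish = [u'0',u'1',u'2',u'3',u'4',u'5',u'6',u'7',u'8',u'9']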
persianToEnglishMap = dict(zip(listedPersian, listedEnglish))
englishToPersianMap = dict(zip(listedEnglish, listedPersian))

def persianToNumber(persian_numeral):
    # Map each Persian digit to its ASCII counterpart, then parse the result.
    english_numeral = ''.join(persianToEnglishMap[digit] for digit in persian_numeral)
    return int(english_numeral)
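
The other direction can use the second map in the same way. The following is a self-contained sketch assuming Python 3 and a non-negative integer; numberToPersian is a hypothetical name, and the maps are redefined here only so the block runs on its own.

```python
# Sketch only, assuming Python 3 and non-negative integers.
# numberToPersian is a hypothetical name for the reverse conversion.

listedPersian = ['۰', '۱', '۲', '۳', '۴', '۵', '۶', '۷', '۸', '۹']
listedEnglish = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
persianToEnglishMap = dict(zip(listedPersian, listedEnglish))
englishToPersianMap = dict(zip(listedEnglish, listedPersian))


def numberToPersian(number):
    # A minus sign is not in the map, so a negative number raises KeyError.
    return ''.join(englishToPersianMap[digit] for digit in str(number))


def persianToNumber(persian_numeral):
    english_numeral = ''.join(persianToEnglishMap[digit] for digit in persian_numeral)
    return int(english_numeral)


print(numberToPersian(3455))
print(persianToNumber(numberToPersian(3455)))
```

A dict lookup also avoids the repeated list.index scan in the original integerToPersian, though for ten digits the difference is unlikely to matter.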

answered Sep 10 '13 at 19:09

If you read again, my integerToPersian works fine; I need help with persianToInteger. - Golfo pérsico

@MohsenPahlevanzadeh: If you would actually answer people's questions instead of writing the same comment over and over, it would be much easier to help you. - abarnert
