How to read large files with Unicode in Python 3

Hello, I have a large file that contains Unicode characters, and when I try to open it in Python 3 this is the error I get:

  File "addRNC.py", line 47, in add_rnc()
  File "addRNC.py", line 13, in init
    for value in rawDoc.readline():
  File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 158: invalid continuation byte

I have tried everything and nothing worked. Here is the code:

    rawDoc = io.open("/root/potential/rnc_lst.txt", 'r', encoding='utf8')
    result = []
    for value in rawDoc.readline():

        if len(value.split('|')[9]) > 0 and len(value.split('|')[10]) > 0: 
            if value.split('|')[9] == 'ACTIVO' and value.split('|')[10] == 'NORMAL':
                address = ''
                for piece in value.split('|')[4:7]:
                    address += piece
                if value.split('|')[8] != '':
                    rawdate = value.split('|')[8].split('/')
                    _date = rawdate[2]+"-"+rawdate[1]+"-"+rawdate[0]
                else:
                    _date = 'NULL'

                id = db.prepare("SELECT id FROM potentials_reg WHERE(rnc = '%s')"%(value.split('|')[0]))()

                if len(id) == 0:
                    if _date == 'NULL':
                        db.prepare("INSERT INTO potentials_reg (rnc, _name, _owner, work_type, address, telephone, constitution, active)"+ 
                                "VALUES('%s', '%s', '%s', '%s', '%s', '%s', NULL, '%s')"%(value.split('|')[0], value.split('|')[1], 
                                                                        value.split('|')[2],value.split('|')[3],address, 
                                                                        value.split('|')[7], 'true'))()
                    else:
                        db.prepare("INSERT INTO potentials_reg (rnc, _name, _owner, work_type, address, telephone, constitution, active)"+ 
                                "VALUES('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')"%(value.split('|')[0], value.split('|')[1], 
                                                                        value.split('|')[2],value.split('|')[3],address, 
                                                                        value.split('|')[7],_date, 'true'))()
                else:
                    pass

    db.close()

asked Feb 01 '12 at 04:02

What makes you think that the file is a Unicode file that’s encoded in UTF-8? Byte 0xD3 is a U+201D RIGHT DOUBLE QUOTATION MARK in the MacRoman encoding, for example. Does the file validate as UTF-8?

1 Answer

Your file actually contains invalid UTF-8.

When you say "contains unicode characters", you should be aware that Unicode doesn't specify how the characters are represented. So even if the file represents Unicode data, it could be in UTF-8, UTF-16 (UTF-16BE or UTF-16LE, each with or without a BOM), the deprecated UCS-2, or perhaps even one of the more esoteric forms...
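To make that concrete, here is a small sketch showing that one and the same character turns into different bytes depending on the encoding. The choice of 'Ó' and the list of encodings are just for illustration; nothing here is taken from your file.

```python
# One Unicode character, U+00D3 (capital O with acute accent),
# serialized under several encodings.
char = "\u00d3"

encoded = {}
for name in ("utf-8", "utf-16-le", "utf-16-be", "latin-1", "mac_roman"):
    encoded[name] = char.encode(name)

# Print the raw bytes each encoding produces for the same character.
for name, raw in encoded.items():
    print(name, raw)
```

Only the single-byte legacy encodings can produce a lone byte for this character; UTF-8 always needs two bytes for it.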

Double check that the file is valid; I'd bet that you indeed have a byte 0xD3 (11010011), which in UTF-8 must be the leading byte of a two-byte character, followed by a byte that is not a valid continuation byte (in other words, the byte right after 0xD3 does not have a binary representation beginning with 10, so it falls outside the range 0x80 to 0xBF).
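One way to check this is to read the file in binary mode and look at the bytes around the position the decoder complains about. The following is only a sketch: `show_decode_error` is a hypothetical helper name, and the sample bytes are made up to stand in for your data.

```python
def show_decode_error(raw, encoding="utf-8", context=10):
    """Return (position, nearby bytes) for the first decode failure, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte the decoder rejected.
        start = max(exc.start - context, 0)
        return exc.start, raw[start:exc.start + context]
    return None

# Illustrative data: the Latin-1 bytes of "CONSTRUCCIÓN" are not valid UTF-8,
# because 0xD3 is followed by the ASCII byte for "N".
sample = "CONSTRUCCI\u00d3N".encode("latin-1")
print(show_decode_error(sample))
```

For the file in the question you would pass in the result of `open("/root/potential/rnc_lst.txt", "rb").read()` instead of `sample`, and then look at what surrounds the byte at position 158.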

The most likely reason for this is that your file contains non-ASCII characters, but isn't in UTF-8.
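If that is the case, the fix is to open the file with the encoding it was actually written in. Below is a rough sketch of how you might try a few candidates. The candidate list (`utf-8`, `cp1252`, `latin-1`) is my assumption, not something I know about your file, `first_working_encoding` is a made-up helper name, and the temporary file with its field values is just a stand-in so the example runs on its own.

```python
import io
import os
import tempfile

def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes raw without error, else None.

    A successful decode does not prove the guess is right: latin-1 accepts
    any byte sequence, so the decoded text still has to be checked by eye.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

# Stand-in for the real file: one pipe-delimited record written as Latin-1.
# The field values are invented for this example.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("123|CONSTRUCCI\u00d3N|ACTIVO|NORMAL\n".encode("latin-1"))
    path = tmp.name

with open(path, "rb") as handle:
    guess = first_working_encoding(handle.read())

# Iterate over the file object itself to get one line at a time.
with io.open(path, "r", encoding=guess) as raw_doc:
    rows = [line.rstrip("\n").split("|") for line in raw_doc]

os.remove(path)
print(guess, rows)
```

Note that the loop iterates over the file object, not over `rawDoc.readline()`: `readline()` returns a single string, so looping over it walks through the characters of the first line rather than the lines of the file.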

answered Feb 01 '12 at 08:02

I think it is a Unicode character because at position 158 there is an 'Ó'. - hidura

@hidura: Unicode and UTF-8 are not the same thing. Yes, your file contains Unicode characters. That does not mean it is encoded in UTF-8. HTTP://regebro.wordpress.com/2011/03/23/… There is no character at all at position 158, there is a NUMBER. That number is 211 (0xD3). In Latin-1, that's an Ó, correct. In MacRoman, it's a quotation mark. Does the Ó make sense? What is at positions 157 and 159? - Lennart Regebro

@hidura Not all non-English characters are Unicode. Many legacy documents use what are called code pages ( en.wikipedia.org/wiki/Code_page ). - Borealid

@Borealid: Which contains characters that are all part of Unicode, and hence are Unicode characters. :-) - Lennart Regebro

@LennartRegebro Not exactly true; Unicode consolidated multiple previously-distinct glyphs into single characters in some cases, and the reverse in some cases. See the notes below the table on en.wikipedia.org/wiki/Code_page_437 ; the characters in a code page are tied to an intended visual representation, not a semantic meaning! - Borealid
