Encoding problem when trying to scrape a page

I'm using BeautifulSoup to scrape a page that has an ISO-8859-1 encoding; however, I've run into a little hiccup.

I have a line that reads:

logging.info("Processing [%s]" % (link))

The variable link is one of the values scraped with BeautifulSoup. It is a Unicode string, and I can print it by typing print link. It shows up on the console exactly the way it was scraped, but the line above throws this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)

I've just read up on Unicode, but I can't figure out why Python is able to print the string but can't log it.

The string in question is this:

booba-concert-à-bercy

Any ideas on where I'm mucking this up?

Thanks for your attention.

Asked 02 Feb 12, 10:02

2 Answers

logging doesn't like unicode; pass it bytes.

logging.info("Processing [%s]" % (link.encode('utf-8')))
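A slightly different approach from encoding at the call site, sketched here as an assumption rather than as the answerer's method: normalize the value to text first, then let logging do the formatting. The helper name to_text is hypothetical, and the sample value assumes the scraped link may arrive as UTF-8 bytes.

```python
import logging

logging.basicConfig(level=logging.INFO)


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Assumed sample value: the scraped link as UTF-8 bytes.
link = b"booba-concert-\xc3\xa0-bercy"

# Passing the argument separately lets logging format the message itself.
logging.info(u"Processing [%s]", to_text(link))
```

Because to_text leaves text values untouched, the same call works whether the scraper hands back bytes or an already-decoded string.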

Answered 02 Feb 12, 14:02

Hi Ignacio. That didn't work for me; I still get the same error. Are there other settings in my environment that could cause this? Thanks. - Mridang Agarwalla

If I encode the string to cp850, which seems to be the Windows terminal encoding, it works fine. But I'm wondering why, even though I specified utf-8 as the encoding as in your example, it still tried to use ASCII. - Mridang Agarwalla
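The error message can be reproduced in isolation. This sketch does not claim to show where the decode happened in the original program, which the thread does not establish; it only shows that decoding the UTF-8 bytes of the string in question as ASCII fails at byte 0xc3, position 14, as in the traceback. On Python 2, implicit conversions between byte strings and unicode objects use the default codec (ASCII), whatever codec a later explicit encode call names. The variable names are illustrative.

```python
# The string from the question, written out as UTF-8 bytes (an assumption
# about the form the value had when the error was raised).
raw = b"booba-concert-\xc3\xa0-bercy"

error = None
try:
    raw.decode("ascii")  # ASCII cannot represent bytes >= 0x80
except UnicodeDecodeError as exc:
    error = exc  # keep a reference for inspection after the block

# Decoding with the actual encoding gives text that can then be
# re-encoded explicitly for a given terminal, for example cp850.
text = raw.decode("utf-8")
for_console = text.encode("cp850")
```

Here error.start is the offset of the offending byte, which lines up with the position reported in the question.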

I managed to solve this by adding a file called sitecustomize.py to my Python/Lib/site-packages directory. This file contained two lines: import sys and sys.setdefaultencoding('utf-8'). The default encoding before that was ASCII, hence the errors. Now I don't need to specify an explicit encoding for the link variable, as it is converted using the default encoding, i.e. utf-8. Is this a good solution to the issue? Of course, I'll never see the characters properly until my terminal uses the same encoding, but that won't break my code. - Mridang Agarwalla

I managed to solve this by adding a file called sitecustomize.py to my Python/Lib/site-packages directory. This file contained two lines: import sys and sys.setdefaultencoding('utf-8').

The default encoding before that was ASCII, hence the errors. Now I don't need to specify an explicit encoding for the link variable, as it is converted using the default encoding, i.e. utf-8.

Of course, I'll never see the characters properly until my terminal uses the same encoding, but that won't break my code.
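As a possible alternative that leaves the interpreter-wide default alone, the encoding can be set on the logging handler itself. This is a minimal sketch, not something taken from the thread: the log path and logger name are hypothetical, and it assumes logging to a file is acceptable. logging.FileHandler accepts an encoding argument.

```python
import io
import logging
import os
import tempfile

# Hypothetical log location; any writable path would do.
log_path = os.path.join(tempfile.mkdtemp(), "scrape.log")

logger = logging.getLogger("scraper_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep these records away from the root handlers

# The handler encodes each record as UTF-8 when writing to the file,
# independently of the interpreter's default encoding.
handler = logging.FileHandler(log_path, encoding="utf-8")
logger.addHandler(handler)

link = u"booba-concert-\u00e0-bercy"
logger.info(u"Processing [%s]", link)

# Flush and release the file before reading it back.
handler.close()
logger.removeHandler(handler)

with io.open(log_path, encoding="utf-8") as f:
    logged = f.read()
```

Reading the file back as UTF-8 should show the accented character intact, whatever encoding the terminal uses.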

Answered 07 Feb 12, 18:02
