BeautifulSoup tag extraction

I am trying to get the body tag from http://feeds.reuters.com/~r/reuters/technologyNews/~3/ZyAuZq5Cbz0/story01.htm

but BeautifulSoup doesn't find it. Is this because of invalid HTML? If so, how can I prevent this?

I also tried to fix the HTML errors beforehand using PyTidyLib (http://countergram.com/open-source/pytidylib/docs/index.html)

Here is part of the code:

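import urllib2
from bs4 import BeautifulSoup as bs
from tidylib import tidy_document

def getContent(url, parser="lxml"):
    request = urllib2.Request(url)
    try:
        # opener is a urllib2 opener created elsewhere (not shown here)
        response = opener.open(request).read()
    except Exception:
        print 'EMPTY CONTENT', url
        return None
    # tidy_document returns the cleaned-up markup and a report of the errors found
    doc, errors = tidy_document(response)
    return parse(url, doc, parser)

def parse(url, response, parser="lxml"):
    try:
        soup = bs(response, parser)
    except UnicodeDecodeError as e:
        # Retry once with html5lib if lxml cannot decode the input
        if parser == "lxml":
            return parse(url, response, "html5lib")
        else:
            print e, url
            print 'EMPTY CONTENT', url
            return None

    body = soup.body
    ...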

When I print soup, I can see the opening and closing body tags, but after body = soup.body, body is None.

I am using Python 2.7.3 and BeautifulSoup4. It seems to work with BeautifulSoup3, but I need to stick with BS4 for performance reasons.

asked May 5, 2013 at 12:05

This may help you; I had a similar problem: stackoverflow.com/questions/15290991/…

1 Answer

I finally got it running. Here is the code:

import urllib2
from lxml import html

url = "http://www.reuters.com/article/2013/04/17/us-usa-immigration-tech-idUSBRE93F1DL20130417?feedType=RSS&feedName=technologyNews"
response = urllib2.urlopen(url).read().decode("utf-8")
test = html.fromstring(response)

for p in test.body.iter('p'):
    print p.text_content()
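
For readers who have to stay on BeautifulSoup 4, as the question requires, one possible approach is to retry with a different parser whenever soup.body comes back as None, rather than only on UnicodeDecodeError. The following is a minimal sketch written for Python 3, not a tested fix for the Reuters page. The function name first_soup_with_body, the parser order, and the make_soup parameter are all hypothetical and were chosen for illustration.

```python
def first_soup_with_body(markup,
                         parsers=("lxml", "html5lib", "html.parser"),
                         make_soup=None):
    """Return (soup, parser_name) for the first parser that yields a body.

    make_soup is a hypothetical hook: any callable taking (markup, parser).
    When it is None, bs4.BeautifulSoup is used.
    """
    if make_soup is None:
        # Imported lazily so the helper can be exercised without bs4 installed
        from bs4 import BeautifulSoup
        make_soup = BeautifulSoup

    for name in parsers:
        try:
            soup = make_soup(markup, name)
        except Exception:
            # Parser not installed or input not decodable: try the next one
            continue
        if soup.body is not None:
            return soup, name

    # No parser produced a body element
    return None, None
```

It could be called as soup, used_parser = first_soup_with_body(response), after which soup.body is only read when soup is not None. Whether any of these parsers recovers the body for this particular page is an assumption that would need checking.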

answered May 6, 2013 at 19:05

In for p in test.body.iter('p'): what does the ('p') stand for? Is it the <p> tag? - Vincent
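
On the question in the comment above: the argument to iter() is a tag name, so iter('p') walks the subtree and yields only the <p> elements. lxml follows the ElementTree API here, so the behavior can be illustrated with the standard library alone. This is a small sketch in Python 3; the markup string is made up for the example and has nothing to do with the Reuters page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup used only for illustration
markup = (
    "<html><body>"
    "<h1>Title</h1>"
    "<p>First</p>"
    "<div><p>Second</p></div>"
    "</body></html>"
)

root = ET.fromstring(markup)
body = root.find("body")

# iter("p") yields every <p> element below body, at any depth, in document order
paragraphs = [p.text for p in body.iter("p")]

# iter() without a tag yields body itself and all of its descendants
tags = [el.tag for el in body.iter()]

print(paragraphs)
print(tags)
```

Passing a different tag name, such as 'div', would select those elements instead.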
