BeautifulSoup parser confusion - HTML

I'm trying to scrape some content off another site and I'm not sure why BeautifulSoup is producing this output. It finds only a blank space inside the matched div, but the HTML I see in the browser contains a large amount of markup there. I apologize if this is something stupid on my part. I'm new to Python.

Here is my code:

import sys
import os
import mechanize
import re
from BeautifulSoup import BeautifulSoup

def scrape_trails(BASE_URL, data):
    # Find the div that should hold the trail list
    soup = BeautifulSoup(data)
    sitesDiv = soup.findAll("div", attrs={"id" : "sitesDiv"})
    print sitesDiv
    return sitesDiv


def main():
    BASE_URL = "http://www.dnr.state.mn.us/skiing/skipass/list.html"
    br = mechanize.Browser()
    data = br.open(BASE_URL).get_data()
    links = scrape_trails(BASE_URL, data)


if __name__ == '__main__':
    main()

If you follow that URL you can see the sitesDiv contains a lot of markup. I'm not sure if I'm doing something wrong or if this is just malformed markup that the script can't handle. Thanks!

asked Jan 8 '11 at 20:01

1 Answer

The problem is that in the HTML served from that URL, the div with id sitesDiv (div#sitesDiv) contains nothing but a non-breaking space:

<div id="sitesDiv">&nbsp;</div>

There's a script on the page that fills in the div after the page is loaded. Your Python code doesn't execute the JavaScript, so the div is never modified and is still empty when your code parses it.
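To see the effect without involving the site, here is a minimal sketch that parses a static snippet the way any non-JavaScript parser would. It uses the standard library's html.parser in Python 3 rather than BeautifulSoup, and the static_html string, the DivTextCollector class, and the div_text helper are all made up for illustration; only the sitesDiv line is taken from the page.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collects the text found inside the <div> with the given id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0    # <div> nesting depth inside the target; 0 means outside
        self.chunks = []  # text fragments seen inside the target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # a nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def div_text(html, target_id):
    """Return the concatenated text inside the div with id == target_id."""
    parser = DivTextCollector(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Hypothetical stand-in for the served page: the script is never executed here.
static_html = (
    '<html><body><div id="sitesDiv">&nbsp;</div>'
    "<script>/* fills sitesDiv after load */</script></body></html>"
)
text = div_text(static_html, "sitesDiv")
print(repr(text))
```

The only thing the parser can report is the non-breaking space, because the script body is just text to it and never runs.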

The good news is that the data you're looking for is served to the page as JSON (GeoJSON, going by the qformat=geojson parameter) from this URL: http://maps.dnr.state.mn.us/cgi-bin/mapserv54?map=/usr/local/mapserver/apps/prk/ski_pass/sites.map&mode=nquery&qformat=geojson . So you can skip BeautifulSoup altogether and just fetch and parse the JSON directly to get the info you want.
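Something along these lines might work as a starting point, written for Python 3 with only the standard library. I haven't checked the exact shape of the response, so this assumes it is a standard GeoJSON FeatureCollection encoded as UTF-8; the function names and the "name" key in the sample are hypothetical, and you'd need to look at the real response to see which property keys it carries.

```python
import json
from urllib.request import urlopen

# URL from the answer above; the response format is assumed, not verified.
SITES_URL = (
    "http://maps.dnr.state.mn.us/cgi-bin/mapserv54"
    "?map=/usr/local/mapserver/apps/prk/ski_pass/sites.map"
    "&mode=nquery&qformat=geojson"
)


def fetch_text(url=SITES_URL, timeout=30):
    """Download the raw response body as text (assumes UTF-8)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def feature_properties(geojson_text):
    """Return the 'properties' dict of every feature in a FeatureCollection."""
    document = json.loads(geojson_text)
    return [
        feature.get("properties") or {}
        for feature in document.get("features", [])
    ]


# Offline sample with a made-up property key, just to show the shape.
sample = (
    '{"type": "FeatureCollection", "features": ['
    '{"type": "Feature", "geometry": null,'
    ' "properties": {"name": "Example Trail"}}]}'
)
sample_properties = feature_properties(sample)
print(sample_properties)
```

Against the live service you would call feature_properties(fetch_text()) and then pick out whichever keys hold the trail names.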

answered Jan 9 '11 at 00:01
