Problems with XPath syntax for retrieving the img src attribute using Python

I've been trying to figure out the XPath syntax to parse this HTML, but I haven't been getting the same results as others. I've been modeling my work after http://docs.python-guide.org/en/latest/scenarios/scrape/#web-scraping, but I can't get it to work for my HTML.

<div id="sku-8103">
    <!-- B:649 -->
    <input type="hidden" id="productIdPDP" value="1218866963585"/>
    <input type="hidden" id="skuIdPDP" value="8240103" />
    <input type="hidden" id="enableLightbox" value="" />
    <!-- B:780 -->
    <img src="http://images.bestbuy.com/BestBuy_US/en_US/images/global/buttons/btn_notorderable_pdp.gif" alt="Not Orderable" border="0" id="notorderable" />
    <input name="8240103" type="hidden" value="1">
    <!-- E:780 -->
    <!-- E:649 -->
    </div>

My code:

import pycurl
import sys
import cStringIO
from lxml import etree
from lxml import html

buf = cStringIO.StringIO()

c = pycurl.Curl()
c.setopt(c.URL, 'http://www.bestbuy.com/site/sony-playstation-4-500gb/8240103.p?id=1218866963585&skuId=8240103')
c.setopt(c.WRITEFUNCTION, buf.write)
c.perform()

data = buf.getvalue()
buf.close()

tree = html.fromstring(data)


product = tree.xpath('//div[@id="sku-8240103"]/img[@src]')
print product

The result is [] instead of the src value of the image. I also tried:

product = tree.xpath('//div[@id="sku-8240103"]/img[@src]/text()')

but that didn't seem to work either.

asked Nov 27 '13 at 1:11

1 Answer

Your HTML has this:

<div id="sku-8103">

You're searching with this:

product = tree.xpath('//div[@id="sku-8240103"]/img[@src]')

Notice the different SKU number? There are no matching nodes, and therefore you get back the empty list, [].

If you change it like this:

product = tree.xpath('//div[@id="sku-8103"]/img[@src]')

You now get a single-element list, like this:

[<Element img at 0x10c85b890>]

And if you do this:

print product[0].attrib['src']

… you get this:

http://images.bestbuy.com/BestBuy_US/en_US/images/global/buttons/btn_notorderable_pdp.gif

Really, you don't need the [@src] part there; if you're attempting to restrict it to imgs that have a src attribute… what other imgs do you expect to see?
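As a side note, here is a minimal sketch (written for Python 3, with a trimmed copy of the HTML from the question) of asking XPath for the attribute itself with /img/@src, rather than selecting the element and reading .attrib afterwards. It also shows why the /text() attempt comes back empty: an img element has no text children. The snippet variable is only a stand-in for the downloaded page.

```python
from lxml import html

# Trimmed copy of the markup from the question (stand-in for the real page).
snippet = '''
<div id="sku-8103">
    <input type="hidden" id="skuIdPDP" value="8240103" />
    <img src="http://images.bestbuy.com/BestBuy_US/en_US/images/global/buttons/btn_notorderable_pdp.gif" alt="Not Orderable" border="0" id="notorderable" />
</div>
'''

tree = html.fromstring(snippet)

# Selecting @src returns the attribute values as strings, not Element objects.
srcs = tree.xpath('//div[@id="sku-8103"]/img/@src')

# text() selects text nodes inside the img; there are none, so this is empty.
texts = tree.xpath('//div[@id="sku-8103"]/img/text()')

print(srcs)
print(texts)
```

Both calls return lists, so you still index with [0] when you want a single value.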

answered Nov 27, 13:01

Good catch! ...but that was just a typo in my post. However, it looks like print(product[0].attrib['src']) does work, instead of just print product. Do you know why print product doesn't work? Isn't it just a list? - user1152532

@user1152532: It does work. You see that [<Element img at 0x10c85b890>]? That's the output that I copied and pasted from running the last three lines of your code, with the SKU fixed, against your data. It's a list with one img Element object in it. - abarnert

@user1152532: If you're getting back an empty list, you must have some other typo. Either that, or you somehow screwed up the HTML when copying it from the real page. - abarnert
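One way to check for that kind of mismatch is to ask the parsed tree which ids it actually contains before writing the exact-match query. This is only a sketch in Python 3; the inline snippet is a made-up stand-in for the fetched page, and the "sku-" prefix is assumed from the HTML in the question.

```python
from lxml import html

# Hypothetical stand-in for the HTML that was actually downloaded.
snippet = (
    '<html><body>'
    '<div id="sku-8103"><img src="a.gif" /></div>'
    '<div id="other"></div>'
    '</body></html>'
)

tree = html.fromstring(snippet)

# List the id of every div whose id starts with "sku-".
sku_ids = tree.xpath('//div[starts-with(@id, "sku-")]/@id')
print(sku_ids)
```

If the id you are searching for is not in that list, the exact-match XPath has nothing to match and returns [].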

Right... I guess I'm curious as to why print product didn't result in the full contents of the Element object. I'm used to working with JSON, and when I print the entire object, the child objects are printed as well. - user1152532

@user1152532: Meanwhile, when you parse a JSON string, you don't get a "JSON object tree" or anything; you just get a Python dict full of Python dicts and lists and strs and so on. That's because JSON is specifically designed to only hold simple types that are common to all scripting languages. That's not true for HTML or XML. - abarnert
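To make that difference concrete, here is a rough sketch in Python 3 with a shortened, made-up copy of the img tag. Parsed JSON prints in full because it is plain dicts and strings, while an lxml Element prints only its short repr. If you want the full contents, you can serialize the element with lxml.html.tostring or copy its attributes into a dict.

```python
import json
from lxml import html

# Parsed JSON is ordinary Python data, so print shows everything.
parsed = json.loads('{"img": {"src": "btn.gif", "alt": "Not Orderable"}}')
print(parsed)

# Shortened, made-up version of the markup from the question.
snippet = '<div id="sku-8103"><img src="btn.gif" alt="Not Orderable" id="notorderable" /></div>'
tree = html.fromstring(snippet)
img = tree.xpath('//div[@id="sku-8103"]/img')[0]

# Printing the Element only shows its repr, e.g. <Element img at 0x...>.
print(img)

# Serialize the element back to markup to see its full contents.
markup = html.tostring(img, encoding='unicode')
print(markup)

# Or copy the attributes into a plain dict, which prints like parsed JSON.
attrs = dict(img.attrib)
print(attrs)
```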
