Problems with the xpath syntax for retrieving the img src attribute using Python
I've been trying to figure out the xpath syntax to parse this HTML, but I haven't been getting the same results as others. I've been modeling my work after http://docs.python-guide.org/en/latest/scenarios/scrape/#web-scraping, but I can't get it to work for my HTML.
<div id="sku-8103">
<!-- B:649 -->
<input type="hidden" id="productIdPDP" value="1218866963585"/>
<input type="hidden" id="skuIdPDP" value="8240103" />
<input type="hidden" id="enableLightbox" value="" />
<!-- B:780 -->
<img src="http://images.bestbuy.com/BestBuy_US/en_US/images/global/buttons/btn_notorderable_pdp.gif" alt="Not Orderable" border="0" id="notorderable" />
<input name="8240103" type="hidden" value="1">
<!-- E:780 -->
<!-- E:649 -->
</div>
My code:
import pycurl
import sys
import cStringIO
from lxml import etree
from lxml import html
buf = cStringIO.StringIO()
c = pycurl.Curl()
c.setopt(c.URL, 'http://www.bestbuy.com/site/sony-playstation-4-500gb/8240103.p?id=1218866963585&skuId=8240103')
c.setopt(c.WRITEFUNCTION, buf.write)
c.perform()
data = buf.getvalue()
buf.close()
tree = html.fromstring(data)
product = tree.xpath('//div[@id="sku-8240103"]/img[@src]')
print product
The result is: [] instead of the src value of the image. I also tried:
product = tree.xpath('//div[@id="sku-8240103"]/img[@src]/text()')
but that didn't seem to work either.
1 Answer
Your HTML has this:
<div id="sku-8103">
You're searching with this:
product = tree.xpath('//div[@id="sku-8240103"]/img[@src]')
Notice the different SKU number? There are no matching nodes, and therefore you get back the empty list, [].
If you change it like this:
product = tree.xpath('//div[@id="sku-8103"]/img[@src]')
You now get a single-element list, like this:
[<Element img at 0x10c85b890>]
And if you do this:
print product[0].attrib['src']
… you get this:
http://images.bestbuy.com/BestBuy_US/en_US/images/global/buttons/btn_notorderable_pdp.gif
Really, you don't need the [@src] part there; if you're attempting to restrict it to imgs that have a src attribute… what other imgs do you expect to see?
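The queries discussed above can be compared side by side. What follows is a minimal sketch, not the asker's actual script: it assumes Python 3 with lxml installed, and it embeds a trimmed copy of the question's markup in a string (the name SNIPPET is made up for illustration) instead of fetching the page. It also shows why the /text() attempt returns nothing: an img element has no text content, and its src lives in an attribute, which XPath selects with @src.

```python
from lxml import html

# Trimmed copy of the markup from the question, wrapped in <html><body>
# so it parses as a full document. SNIPPET is an illustrative name.
SNIPPET = """
<html><body>
<div id="sku-8103">
<input type="hidden" id="skuIdPDP" value="8240103" />
<img src="http://images.bestbuy.com/BestBuy_US/en_US/images/global/buttons/btn_notorderable_pdp.gif" alt="Not Orderable" border="0" id="notorderable" />
<input name="8240103" type="hidden" value="1">
</div>
</body></html>
"""

tree = html.fromstring(SNIPPET)

# 1. The id in the query does not match the id in the markup: no nodes match.
wrong_id = tree.xpath('//div[@id="sku-8240103"]/img[@src]')

# 2. Matching id: the result is a list containing one img Element.
elements = tree.xpath('//div[@id="sku-8103"]/img')

# 3. text() selects text nodes inside img; an img has none, so this is empty.
text_nodes = tree.xpath('//div[@id="sku-8103"]/img/text()')

# 4. @src selects the attribute itself, so the result is a list of strings.
src_values = tree.xpath('//div[@id="sku-8103"]/img/@src')

print(wrong_id)
print(elements)
print(text_nodes)
print(src_values)

# Equivalent to product[0].attrib['src'] from the answer.
print(elements[0].get('src'))
```

If the id really is identical in the fetched page and the query still returns an empty list, the markup that lxml received probably differs from what the browser shows, so it is worth inspecting the downloaded data itself.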
answered Nov 27, 13:01
Good catch! ...but that was just a typo in my post. However, it looks like print(product[0].attrib['src']) does work, instead of just print product. Do you know why print product doesn't work? Isn't it just a list? - user1152532
@user1152532: It does work. You see that [<Element img at 0x10c85b890>]? That's the output that I copied and pasted from running the last three lines of your code, with the SKU fixed, against your data. It's a list with one img Element object in it. - abarnert
@user1152532: If you're getting back an empty list, you must have some other typo. Either that, or you somehow screwed up the HTML when copying it from the real page. - abarnert
Right... I guess I'm curious as to why print product didn't result in the full contents of the Element object. I'm used to working with JSON, and when I print the entire object, the child objects are printed as well. - user1152532
@user1152532: Meanwhile, when you parse a JSON string, you don't get a "JSON object tree" or anything; you just get a Python dict full of Python dicts and lists and strs and so on. That's because JSON is specifically designed to only hold simple types that are common to all scripting languages. That's not true for HTML or XML. - abarnert
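The difference described in that last comment can be seen in a short, hedged sketch. It assumes Python 3 and uses only the standard library: xml.etree.ElementTree stands in for lxml here, since lxml's elements follow the same ElementTree-style API, and the shortened btn.gif value is a placeholder rather than the real URL. Parsed JSON prints in full because it is made of plain dicts and strs, whereas printing a list of Elements shows only each object's repr; to see the markup you have to serialize the element explicitly.

```python
import json
import xml.etree.ElementTree as ET

# Parsed JSON is ordinary Python data, so print() shows every nested value.
parsed_json = json.loads('{"img": {"src": "btn.gif", "alt": "Not Orderable"}}')
print(parsed_json)

# Parsed XML/HTML is a tree of Element objects. "btn.gif" is a placeholder
# value, not the real image URL from the question.
div = ET.fromstring('<div id="sku-8103"><img src="btn.gif" alt="Not Orderable" /></div>')
images = div.findall('img')

# Printing the list shows each Element's repr (tag and memory address),
# not its attributes or children.
print(images)

# To see the contents, serialize the element back to markup...
markup = ET.tostring(images[0], encoding='unicode')
print(markup)

# ...or copy the attributes into a plain dict, which prints like JSON data.
attrs = dict(images[0].attrib)
print(attrs)
```

With lxml itself, lxml.html.tostring(element) plays the same role as ET.tostring here.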