Using BeautifulSoup to extract an li element based on a string contained within it
Viewed 2,154 times
2
I have been attempting to use BeautifulSoup to retrieve any <li> element that contains any form of the word Ottawa. The problem is that Ottawa is never within a tag of its own, such as <p>, so I want to print only the li elements that contain Ottawa.
The HTML formatting is like this:
<html>
<body>
<blockquote>
<ul><li><a href="http://link.com"><b>name</b></a>
(National: Ottawa, ON)
<blockquote> some description </blockquote></li>
<li><a href="http://link2.com"><b>name</b></a>
(National: Vancouver, BC)
<blockquote> some description </blockquote></li>
<li><a href="http://link3.com"><b>name</b></a>
(Local: Ottawa, ON)
<blockquote> some description </blockquote></li>
</ul>
</blockquote>
</body>
</html>
My code is as follows:
from bs4 import BeautifulSoup
import re
import urllib2,sys
url = "http://www.charityvillage.ca/cv/nonpr/nonpr1.html"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
re1='.*?'
re2='(Ottawa)'
ottawa = soup.findAll(text=re.compile(re1+re2,re.IGNORECASE|re.DOTALL))
search = soup.findAll('li')
The code above finds Ottawa correctly, and the second call does find the li elements, but it gives me every single one on the page.
I understand that the two searches are not currently combined; trying search = soup.findAll('li', text=re.compile(re1+re2,re.IGNORECASE|re.DOTALL)) results in [].
My end goal is basically to get every <li> element that contains any mention of Ottawa, and to get the entire <li> element with the name, description, link, etc.
2 Answers
3
Use the text attribute to filter the results of findAll:
elems = [elem for elem in soup.findAll('li') if 'Ottawa' in str(elem.text)]
answered May 4 '12 at 1:05
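A minimal, self-contained sketch of this approach, run against the sample HTML from the question rather than the live page. `SAMPLE_HTML` and the helper `li_elements_mentioning` are names made up for illustration, and the case-insensitive comparison is an assumption about what "any form of the word" should cover.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (not fetched from the real site).
SAMPLE_HTML = """
<html><body><blockquote>
<ul><li><a href="http://link.com"><b>name</b></a>
(National: Ottawa, ON)
<blockquote> some description </blockquote></li>
<li><a href="http://link2.com"><b>name</b></a>
(National: Vancouver, BC)
<blockquote> some description </blockquote></li>
<li><a href="http://link3.com"><b>name</b></a>
(Local: Ottawa, ON)
<blockquote> some description </blockquote></li>
</ul>
</blockquote></body></html>
"""


def li_elements_mentioning(soup, word):
    """Return every <li> whose combined text contains `word`, ignoring case."""
    needle = word.lower()
    return [li for li in soup.find_all("li") if needle in li.get_text().lower()]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matches = li_elements_mentioning(soup, "ottawa")
for li in matches:
    # Each match is the whole <li> tag, so the link and description are still reachable.
    print(li.a["href"])
```

Because each match is the complete li tag, the name, link and description asked for in the question can all be read from it.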
2
from bs4 import BeautifulSoup
import re
import urllib2,sys
url = "http://www.charityvillage.ca/cv/nonpr/nonpr1.html"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
# Match the text nodes that look like "(<scope>: Ottawa ..."
for item in soup.find_all(text=re.compile(r'\(.+: Ottawa', re.IGNORECASE)):
    # The link is the nearest preceding sibling tag that has an href
    link = item.find_previous_sibling(lambda tag: tag.has_attr('href'))
    if link is None:
        continue  # skip entries that don't follow the expected structure
    print(u'{} [{}]: {}'.format(link.text,
                                item.strip(),
                                link['href']).encode('utf8'))
answered May 4 '12 at 13:05
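If the whole li is wanted rather than only the link, one possible variation (a sketch, not this answer's exact method) is to start from the matching text node and walk up with find_parent. The inline `html` string is a cut-down copy of the question's sample, and `entries` is just an illustrative name.

```python
import re

from bs4 import BeautifulSoup

# Cut-down copy of the question's sample markup, used here instead of the live page.
html = """<ul>
<li><a href="http://link.com"><b>name</b></a> (National: Ottawa, ON)
<blockquote> some description </blockquote></li>
<li><a href="http://link2.com"><b>name</b></a> (National: Vancouver, BC)
<blockquote> some description </blockquote></li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"\(.+: Ottawa", re.IGNORECASE)

entries = []
for text_node in soup.find_all(string=pattern):
    # Climb from the matching text node to the <li> that encloses it.
    li = text_node.find_parent("li")
    if li is None:
        continue  # the text was not inside a list item
    entries.append((li.a.get_text(), text_node.strip(), li.a["href"]))

for name, location, href in entries:
    print(name, location, href)
```

This assumes every matching li has an a tag as its first link; entries without one would need an extra check before `li.a` is used.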
@thiago-m I'm not sure exactly which pattern you want to match; tell me if you need help with that. Maybe '(Regional|Local|National): Ottawa' instead of just 'Local: Ottawa'? Or maybe '\(.*: Ottawa\)'? - kurzedmetal
@thiago-m I tried to improve the regex and found that some nodes don't follow the same structure, and some have anchors, so I added some checks to match only real links and skip entries that don't follow the structure. GL. - kurzedmetal
I removed the Local portion and it retrieved most of the ones I needed. The lists are huge, as I was scraping about 20 different pages, so I retrieved enough. - paradd0x
Unfortunately I got this:
AttributeError: 'list' object has no attribute 'text'
after using this: elems = [elem for elem in search if ottawa in str(search.text)]
- paradd0x
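That traceback is consistent with reading .text from the list that findAll returns (search) rather than from each element (elem). The snippet also compares against the variable ottawa, not the string 'Ottawa'. A small sketch of the difference, using a made-up two-item list instead of the real page:

```python
from bs4 import BeautifulSoup

# Tiny stand-in document; the real page has many more <li> entries.
soup = BeautifulSoup(
    "<ul><li>(Local: Ottawa, ON)</li><li>(Local: Vancouver, BC)</li></ul>",
    "html.parser",
)
search = soup.find_all("li")

# The result of find_all is a list of tags, so it has no .text of its own.
try:
    search.text
    raised = False
except AttributeError:
    raised = True

# Read the text of each element instead, and compare against a string literal.
elems = [elem for elem in search if "Ottawa" in elem.get_text()]
print(raised, len(elems))
```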