Using BeautifulSoup to extract an li element based on a string contained within it

I have been attempting to use BeautifulSoup to retrieve any <li> element that contains any form of the word Ottawa. The problem is that Ottawa is never within a tag of its own, such as a <p>. So I want to print only the li elements that contain Ottawa.

The HTML formatting is like this:

<html>
<body>
<blockquote>
<ul><li><a href="http://link.com"><b>name</b></a>
(National: Ottawa, ON)
<blockquote> some description </blockquote></li>
<li><a href="http://link2.com"><b>name</b></a>
(National: Vancouver, BC)
<blockquote> some description </blockquote></li>
<li><a href="http://link3.com"><b>name</b></a>
(Local: Ottawa, ON)
<blockquote> some description </blockquote></li>
</ul>
</blockquote>
</body>
</html>

My code is as follows:

from bs4 import BeautifulSoup
import re
import urllib2,sys

url = "http://www.charityvillage.ca/cv/nonpr/nonpr1.html"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)

re1='.*?'
re2='(Ottawa)'
ottawa = soup.findAll(text=re.compile(re1+re2,re.IGNORECASE|re.DOTALL))
search = soup.findAll('li')

The code above finds the Ottawa text correctly, and the second call does find li elements, but it gives me every single one on the page.

I understand that the two searches are currently not combined; trying search = soup.findAll('li', text=re.compile(re1+re2, re.IGNORECASE|re.DOTALL)) results in [].

My end goal is basically to get every <li> element that contains any mention of Ottawa and give me the entire <li> element with the name, description, link, etc.
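
For reference, the combined call returns [] because the text filter is compared against each tag's .string, which is None for an <li> with mixed children. A minimal sketch of one workaround is to filter on the tag's full text instead (the variable name and the lowercase match are just illustrative):

matches = soup.findAll(lambda tag: tag.name == 'li' and
                       'ottawa' in tag.get_text().lower())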

asked May 3 '12 at 21:05

2 Answers

Use the text attribute to filter the results of the findAll:

elems = [elem for elem in soup.findAll('li') if 'Ottawa' in str(elem.text)]
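
From there, a rough sketch of pulling the name, link and description back out of each matched <li> (assuming the structure shown in the question; the local names and separator are just illustrative):

for li in elems:
    a = li.find('a')               # the first anchor holds the name and the URL
    desc = li.find('blockquote')   # the nested blockquote holds the description
    if a is None:
        continue
    print('{} | {} | {}'.format(a.get_text(), a.get('href'),
                                desc.get_text(strip=True) if desc else ''))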

answered May 4 '12 at 01:05

Unfortunately I got this: AttributeError: 'list' object has no attribute 'text' after using this: elems = [elem for elem in search if ottawa in str(search.text)] - paradd0x
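
The AttributeError in that comment comes from calling .text on the list itself (search) rather than on each element, and from testing membership of the ottawa result list instead of the plain string; with the same variable names, the corrected form of that snippet would be:

elems = [elem for elem in search if 'Ottawa' in elem.text]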

from bs4 import BeautifulSoup
import re
import urllib2,sys

url = "http://www.charityvillage.ca/cv/nonpr/nonpr1.html"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)

for item in soup.find_all(text=re.compile(r'\(.+: Ottawa', re.IGNORECASE)):
    # Each match is a text node; step back to the preceding <a href=...> sibling.
    link = item.find_previous_sibling(lambda tag: tag.has_key('href'))
    if link is None:
        continue
    print(u'{} [{}]: {}'.format(link.text,
                                item.strip(),
                                link['href']).encode('utf8'))
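
If running this under Python 3 with a current BeautifulSoup, a roughly equivalent sketch would swap urllib2 for urllib.request and the deprecated has_key() for has_attr(); this is not verified against the live page:

from bs4 import BeautifulSoup
import re
from urllib.request import urlopen

url = "http://www.charityvillage.ca/cv/nonpr/nonpr1.html"
soup = BeautifulSoup(urlopen(url).read(), "html.parser")

for item in soup.find_all(string=re.compile(r'\(.+: Ottawa', re.IGNORECASE)):
    # Same idea: find the matching text node, then its preceding <a href=...>.
    link = item.find_previous_sibling(lambda tag: tag.has_attr('href'))
    if link is None:
        continue
    print('{} [{}]: {}'.format(link.text, item.strip(), link['href']))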

answered May 4 '12 at 13:05

@thiago-m I'm not sure exactly which pattern you want to match; tell me if you need help with that. Maybe '(Regional|Local|National): Ottawa' instead of simply 'Local: Ottawa'? Or maybe '\(.*: Ottawa\)'? - kurzedmetal

@thiago-m I tried to improve the regex and found that some nodes don't follow the same structure, and some have anchors, so I added some checks to match only real links and skip entries that don't follow the structure. GL. - kurzedmetal

I removed the Local portion and it retrieved most of the ones I needed. The lists are huge as I was scraping about 20 different pages, so I retrieved enough. - paradd0x
