How to use Beautiful Soup to extract a link under a tree structure

Suppose the HTML web page looks like this:

<html>
    <div id="a">
        <div class="aa">
            <p>
                <a id="ff" href="#">ff</a>
                <a id="gg" href="#">gg</a>
            </p>
        </div>
        <div class="bb">
            <p>
                <a id="ff" href="#">ff</a>
            </p>
        </div>
    </div>
    <div id="b">
    </div>
</html>

After using

soup = BeautifulSoup(webpage.read())

I have the HTML web page, and I would like to get the link that is under the tree structure: <html> -> <div id="a"> -> <div class="aa">.

How can I write the Python code for this using Beautiful Soup?

asked Aug 28 '12 at 09:08

It would be useful if you mentioned what you've tried so far. -

3 Answers

Without more info about your data, it is difficult to give you a concise solution that will cover all possible inputs. To help you on your way, here's a walkthrough which will hopefully lead you to a solution that suits your needs.

The following will give us <div id="a"> (there should only be one element with a specific id):

top_div = soup.find('div', {'id':'a'})

We can then proceed to retrieve all inner divs with class='aa' (there may be more than one):

aa_div = top_div.findAll('div', {'class':'aa'})

From there, we can return all links for each div found:

links = [div.findAll('a') for div in aa_div]

Note that links contains a nested list, since div.findAll('a') will return a list of the a nodes found. There are various ways to flatten such a list.

Here's an example which iterates through the list and prints out the individual links:

>>> from itertools import chain
>>> for a in chain.from_iterable(links):
...   print a
... 
<a id="ff" href="#">ff</a>
<a id="gg" href="#">gg</a>

The solution presented above is rather long-winded. However, with more understanding of the input data, a much more compact solution is possible. For example, if the data is exactly as you've shown and there will always be just that one div with class='aa', then the solution could simply be:

>>> soup.find('div', {'class':'aa'}).findAll('a')
[<a id="ff" href="#">ff</a>, <a id="gg" href="#">gg</a>]

Using CSS selectors with BeautifulSoup4

If you're using a newer version of BeautifulSoup (version 4), you can also use the .select() method, which provides CSS selector support. The elaborate solution I provided at the beginning of this answer could be re-written as:

soup.select("div#a div.aa a")
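
If it's the href values you're ultimately after rather than the tags themselves, a small follow-up sketch (using the same soup object) could look like:

links = soup.select("div#a div.aa a")
hrefs = [a.get("href") for a in links]   # ['#', '#'] for the sample HTML in the question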

For BeautifulSoup v3, you can add this functionality using soupselect.
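
As a rough sketch of what that might look like (assuming soupselect's select() helper; check the library's own docs for the exact API):

from BeautifulSoup import BeautifulSoup   # BeautifulSoup v3
from soupselect import select             # assumed soupselect entry point

soup = BeautifulSoup(html_text)           # html_text: the page source as a string
links = select(soup, 'div#a div.aa a')    # same selector as the BS4 example above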

However, do note the following statement from the docs (emphasis mine):

This is a convenience for users who know the CSS selector syntax. You can do all this stuff with the Beautiful Soup API. And if CSS selectors are all you need, you might as well use lxml directly, because it’s faster. But this lets you combine simple CSS selectors with the Beautiful Soup API.
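
If you do go the lxml route mentioned in that quote, a minimal sketch (assuming the lxml and cssselect packages are installed, and html_text holds the page source) might look like:

import lxml.html

tree = lxml.html.fromstring(html_text)
links = tree.cssselect("div#a div.aa a")   # same CSS selector as above
hrefs = [a.get("href") for a in links]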

answered May 23 '17 at 12:05

Why not the CSS-like selectors? soup("div#a div.aa a") - Kos

@Kos I'm under the impression that you'd need something like soupselect. Have CSS-style selectors been added to BeautifulSoup? Would be awesome if they have. - Shawn Chin

Yup, at least partially. crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors The syntax is soup.select(...) - Kos

Thanks. I've updated the question with a quick mention of .select() (BS4 only I believe, introduced in this revision). - Shawn Chin

I would do it this way:

from BeautifulSoup import BeautifulSoup
import urllib

url = 'http://www.website.com'
file_pointer = urllib.urlopen(url)
html_object = BeautifulSoup(file_pointer)

link_list = []
links = html_object('div',{'class':'aa'})[0]('a')
for href in links:
    link_list.append(href['href'])

This returns a list of 'links' that can be accessed by index:

link_1 = link_list[0]
link_2 = link_list[1]

Alternatively, if you want the text associated with the links (i.e. 'Click Here' vs '/Product/Store/Whatever.html'), you can change this same code very slightly and produce the desired results:

link_list = []
links = html_object('div',{'class':'aa'})[0]('a')
for text in links:
    link_list.append(text.contents[0])

Again, this will return a list, so you will have to access it by index:

link_1_text = link_list[0]
link_2_text = link_list[1]

answered Aug 30 '12 at 16:08

I found this info in the official Beautiful Soup documentation:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie

You can read more about Beautiful Soup here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Regards

answered Dec 26 '14 at 14:12
