Matching a subset of DOM nodes based on the preceding disconnected HTML element

A page I am trying to scrape into a CSV database/Ruby array lists 470 records in unevenly sized groups, each group preceded by a date (22 unique dates in total).

I am not sure how to do this: the groups aren't organized into HTML tables, and there is no hierarchy in the DOM where a "parent" could lead to each group's date. There is only a flat list of <div class="line"> record divs, occasionally preceded by a <span class="date">Thursday, May 24, 2012</span> holding the date that applies to the next X records, until a new date is printed.

In irb it correctly shows:

$page = $agent.get(pageurl) # gets page with Mechanize
doc = $page.parser # returns a Nokogiri::HTML::Document

(records = doc.search('html body div#wrapper div#innerwrapper div#content div.line')).size 
=> 470
(dates = doc.search('html body div#wrapper div#innerwrapper div#content span.date')).size 
=> 22

Show the first date for example:

doc.search('html body div#wrapper div#innerwrapper div#content span.date')[0].text
=> "Wednesday, May 23, 2012"

My goal is to append the correct date as a field to each of the 470 records doc.search found above, before saving into a CSV file.

Can Nokogiri (or Mechanize) help me retrieve these records in groups based on their position in the DOM, i.e. immediately following dates[N].text but before the next <span class="date">?

I could iterate N from 0 to 21, appending each group to a master array/CSV object for all 470 records and adding the appropriate date field for each group.

asked May 22, 2012 at 11:05

2 Answers

First, you can simplify your search a bit. Since content is an id, and it by definition uniquely identifies that particular div, you don't need any of the preceding path information.

records = doc.search('div#content div.line')

From each record, you can pull the date using xpath's preceding-sibling axis. Altogether:

doc.search('div#content div.line').each do |record|
  date = record.xpath('preceding-sibling::span[@class="date"][1]').text
  #append to CSV
end

The XPath says: find the preceding spans at the same level (preceding-sibling::span) that have a class of "date" ([@class="date"]), and take the first such one ([1]) to ensure you get the nearest date span. Because preceding-sibling is a reverse axis, position 1 is the sibling closest to the record.

answered May 22, 2012 at 13:05

Yep thanks! Currently I am futzing with the exact xpath since it is coming up blank for date. Looking for possible uncle/parent hops etc. Maybe I'll post a snip of the HTML if I can't figure it out. - Marcos

Interestingly, in an earlier revision I had the date as an "uncle" node in my sample html. I used ../preceding-sibling::span to get to it. (.. means parent) - Mark Thomas

The version that gave me trouble was prior to your edit. This works for me: records[0].xpath('preceding-sibling::span[@class="date"][1]').text Thanks again! - Marcos

This is another good time to use traverse:

date = nil # most recent date seen so far
doc.traverse do |node|
  date = node.text if 'span' == node.name && 'date' == node[:class]
  puts [date, node.text].join(', ') if 'div' == node.name && 'line' == node[:class]
end

answered May 22, 2012 at 15:05

Also great! I see what's going on there. This traverse will be useful--reminds me what I was madly getting ready to do with just raw text processors like sed and awk, outside of any xml/xsltproc or nokogiri aids. - Marcos
