Test whether an HTML table is used for layout vs. data?

This is more of a web scraping question. What are the recognized approaches to automatically determining whether a <table> is used for layout or for data in an HTML document you've never seen before?

I'd like to be able to pass any HTML file as a string into some function that spits out all of the data tables in the page but ignores tables used purely for layout. But sites like http://news.ycombinator.com/newcomments use HTML tables for layout, which makes it tricky.

This function shouldn't be tailored to any specific website's DOM structure, so it should work with any HTML string (or have as high a success rate as possible).

Are there any algorithms/checks people have figured out over the years that can distinguish between layout and data tables? It should be possible, it's just a matter of writing down all the variables and trial/error - which I imagine many people have already mapped out somewhere.

I don't necessarily need the function (that would be awesome though, but I imagine it would require a lot of fine-tuning). Just looking for some tried strategies.

Update

Here's a good start (thanks @JaredFarrish):

asked Jul 2 '12 at 18:07

"Sites like" is probably going to be a bit more extensive than a few (unfortunately). This sounds like a research paper topic; maybe someone has done one already, recently? -

In fact, Jakob Nielsen might have something on it at his website; he seems the sort to develop these types of identity heuristics. -

Yeah, I guess I'm looking for a research paper on the topic. I haven't been able to find any b/c I don't really know what the field/topic is exactly. If anyone knows of a good paper to start with that's all I am asking for - not a generic paper on web scraping though, found a lot of those :). -

I would have thought that tables containing useful data would generally be nested within other elements, and layout tables would generally be stuck straight in the body tag. Interesting question! -

And here you go: A Machine Learning Based Approach for Table Detection on The Web. I'd still probably vacuum it all up and develop hooks and push with it as a contingent and not an altered group. Keep in mind you don't have to review it all; do a sample test of a statistically valid number and derive from there. Maybe 300 pages/sites, review 50 tables, and apply until you're satisfied? Then go fishing with the certainty you've nailed it well enough to be wrong 3-5% of the time at best. ;) -

1 Answer

Tables used for layout will generally

  • have few rows and few cells per row
  • have content in cells that is wildly inconsistent in length
  • have much HTML within cells
  • often use colspan / rowspan
  • exist near the top of the DOM
  • not make use of <th> or <thead>
  • contain other tables

Tables used for data will generally

  • have more rows and more cells per row
  • have content in cells that is reasonably consistent in length
  • lack structuring HTML within cells (like <div>, <p>; seeing <b>, <strong>, etc does not preclude data)
  • probably not use colspan and very probably not use rowspan
  • not contain other tables
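As a rough illustration, the signals in the two lists above could be collected with Python's standard html.parser module. This is only a sketch: the names TableFeatureParser, extract_table_features, and STRUCTURAL_TAGS are made up for this example, DOM depth is not measured, and badly malformed markup may be miscounted.

```python
from html.parser import HTMLParser
from statistics import mean, pstdev

# Assumption: tags treated as "structuring HTML" when found inside a cell.
STRUCTURAL_TAGS = {"div", "p", "ul", "ol", "form", "img"}


class TableFeatureParser(HTMLParser):
    """Collects raw per-table counts for every <table>, in document order."""

    def __init__(self):
        super().__init__()
        self.tables = []  # one record per <table>, in opening order
        self.stack = []   # currently open tables, innermost last

    def _close_cell(self, table):
        # Record the text length of the open cell, if there is one.
        if table["text"] is not None:
            table["cell_lengths"].append(len("".join(table["text"]).strip()))
            table["text"] = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.stack:
                self.stack[-1]["nested_tables"] += 1
            record = {"rows": 0, "cells_per_row": [], "cell_lengths": [],
                      "has_header": False, "spans": 0, "structural_tags": 0,
                      "nested_tables": 0, "text": None}
            self.tables.append(record)
            self.stack.append(record)
            return
        if not self.stack:
            return
        table = self.stack[-1]
        if tag == "tr":
            self._close_cell(table)
            table["rows"] += 1
            table["cells_per_row"].append(0)
        elif tag in ("td", "th"):
            self._close_cell(table)  # cells may be closed implicitly
            if not table["cells_per_row"]:
                table["cells_per_row"].append(0)
            table["cells_per_row"][-1] += 1
            if tag == "th":
                table["has_header"] = True
            if any(name in ("colspan", "rowspan") for name, _ in attrs):
                table["spans"] += 1
            table["text"] = []  # start collecting this cell's text
        elif tag == "thead":
            table["has_header"] = True
        elif tag in STRUCTURAL_TAGS and table["text"] is not None:
            table["structural_tags"] += 1

    def handle_endtag(self, tag):
        if not self.stack:
            return
        if tag in ("td", "th"):
            self._close_cell(self.stack[-1])
        elif tag == "table":
            self._close_cell(self.stack[-1])
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and self.stack[-1]["text"] is not None:
            self.stack[-1]["text"].append(data)


def extract_table_features(html):
    """Return one summary dict per <table> found in the HTML string."""
    parser = TableFeatureParser()
    parser.feed(html)
    parser.close()
    summaries = []
    for t in parser.tables:
        parser._close_cell(t)  # flush a cell left open at end of input
        lengths = t["cell_lengths"]
        avg = mean(lengths) if lengths else 0
        summaries.append({
            "rows": t["rows"],
            "max_cells_per_row": max(t["cells_per_row"], default=0),
            # Coefficient of variation: higher means less consistent lengths.
            "length_cv": pstdev(lengths) / avg if avg else 0.0,
            "has_header": t["has_header"],
            "spans": t["spans"],
            "structural_tags": t["structural_tags"],
            "nested_tables": t["nested_tables"],
        })
    return summaries
```

Calling extract_table_features(html_string) returns the summaries in the order the tables open, so a layout table that wraps a data table comes first with nested_tables set to 1.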

When you scrape a table, assess and score it for these criteria, apply scores and weights to them and use the final score to decide whether it is layout or data.
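The weighting step might look something like the sketch below. Everything in it is an assumption made for illustration: the feature keys, the cut-offs, the weights, and the threshold are untuned guesses, not values taken from any published source, and they would need calibrating against a hand-labelled sample of tables.

```python
# Each signal is (label, test that is True when the table looks like data,
# weight). All cut-offs and weights here are hypothetical and untuned.
SIGNALS = [
    ("many rows", lambda f: f["rows"] >= 3, 2.0),
    ("several cells per row", lambda f: f["max_cells_per_row"] >= 2, 1.0),
    ("consistent cell length", lambda f: f["length_cv"] <= 1.0, 2.0),
    ("no structuring HTML in cells", lambda f: f["structural_tags"] == 0, 1.5),
    ("no colspan/rowspan", lambda f: f["spans"] == 0, 0.5),
    ("uses th/thead", lambda f: f["has_header"], 1.5),
    ("no nested tables", lambda f: f["nested_tables"] == 0, 2.0),
]


def score_table(features):
    """Add the weight when a signal looks like data, subtract it otherwise."""
    score = 0.0
    for _label, looks_like_data, weight in SIGNALS:
        score += weight if looks_like_data(features) else -weight
    return score


def classify_table(features, threshold=0.0):
    """Return "data" when the score exceeds the threshold, else "layout"."""
    return "data" if score_table(features) > threshold else "layout"


# A data table that breaks two of the rules (uses spans, has no th/thead)
# can still be classified as data because the other signals outweigh them.
spanning_data_table = {
    "rows": 20, "max_cells_per_row": 3, "length_cv": 0.5,
    "structural_tags": 0, "spans": 4, "has_header": False,
    "nested_tables": 0,
}
print(classify_table(spanning_data_table))
```

Giving the span and header checks low weights is one way to keep a single broken rule from flipping the result; whether those are the right signals to down-weight is something only testing on real pages can settle.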

answered Jul 2 '12 at 18:07

Is this personal/anecdotal, or is it derived from something observed and published? Just wondering; it seems like it could go both ways, but the specificity is a touch too granular. I'm sure a source would be valuable for the OP. (And this is probably what I would guess, so I'm not being critical, just teasing details out of their hiding place.) - Jared Farrish

This table, for example, is a data table but uses colspan/rowspan, does not make use of th/thead, etc. en.wikipedia.org/wiki/Timeline_of_Chinese_history#PRC.2FROC, but what you're saying is a good start. - Lance

This answer is entirely anecdotal. @LancePollard That table does complicate some of the rules, but others, such as content consistency and row/cell count still land it squarely in data-table. - Paraguas
