This is more of a web scraping question. What are the recognized approaches to automatically determining whether a
<table> is used for layout or for data in an HTML document you've never seen before?
I'd like to be able to pass any HTML file as a string into some function that spits out all of the data tables in the page but ignores tables used purely for layout. But sites like http://news.ycombinator.com/newcomments use HTML tables for layout, which makes it tricky.
This function shouldn't be tailored to any specific website's DOM structure, so it should work with any HTML string (or have as high a success rate as possible).
Are there any algorithms/checks people have figured out over the years that can distinguish between layout and data tables? It should be possible; it's just a matter of writing down all the variables and trial and error, which I imagine many people have already mapped out somewhere.
I don't necessarily need the function (that would be awesome though, but I imagine it would require a lot of fine-tuning). Just looking for some tried strategies.
Here's a good start (thanks @JaredFarrish):
- A Machine Learning Based Approach for Table Detection on The Web
- Keywords: Table Detection, Layout Analysis, Machine Learning, Decision tree, Support Vector Machine, Information Retrieval
asked Jul 2 '12 at 18:07
Tables used for layout will generally
- have few rows and few cells per row.
- have content in cells that is wildly inconsistent in length
- have much HTML within cells
- possibly use colspan / rowspan
- exist near the top of the DOM
- not make use of
- contain other tables
Tables used for data will generally
- have more rows and more cells per row
- have content in cells that is reasonably consistent in length
- lack structuring HTML within cells (though inline markup such as <strong>, etc. does not preclude data)
- probably not use colspan and very probably not use rowspan
- not contain other tables
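Several of the criteria above can be measured directly from the markup. Here is a rough sketch using only Python's standard-library `html.parser`; the class name `TableStats`, the function `table_features`, and the feature names are made up for illustration, and it only covers row/cell counts, colspan/rowspan use, nesting, and depth (not cell content length).

```python
from html.parser import HTMLParser


class TableStats(HTMLParser):
    """Collect a few structural features per <table> (illustrative feature set)."""

    def __init__(self):
        super().__init__()
        self.tables = []  # one feature dict per table, in document order
        self._stack = []  # indices into self.tables for currently open tables

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._stack:
                # The enclosing table contains another table.
                self.tables[self._stack[-1]]["nested_tables"] += 1
            self.tables.append({
                "rows": 0,
                "cells": 0,
                "spans": 0,
                "nested_tables": 0,
                "depth": len(self._stack),  # 0 = not inside another table
            })
            self._stack.append(len(self.tables) - 1)
        elif self._stack:
            current = self.tables[self._stack[-1]]
            if tag == "tr":
                current["rows"] += 1
            elif tag in ("td", "th"):
                current["cells"] += 1
                attr_names = {name for name, _ in attrs}
                if attr_names & {"colspan", "rowspan"}:
                    current["spans"] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            self._stack.pop()


def table_features(html):
    """Return a list of feature dicts, one per table found in the HTML string."""
    parser = TableStats()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Rows and cells are attributed to the innermost open table, so a layout table wrapping a data table shows up as a separate entry with `nested_tables` greater than zero.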
When you scrape a table, assess it against these criteria, give each criterion a weighted score, and use the total to decide whether the table is layout or data.
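A minimal sketch of that weighted scoring, assuming the criteria have already been reduced to true/false signals per table. The signal names, weights, and threshold below are placeholders I picked for illustration, not tuned values; you would have to adjust them by trial and error against real pages.

```python
# Hypothetical weights: positive values push toward "data", negative toward "layout".
WEIGHTS = {
    "many_rows": 2.0,
    "many_cells_per_row": 1.5,
    "consistent_cell_length": 1.5,
    "uses_spans": -1.0,
    "heavy_cell_markup": -1.5,
    "contains_tables": -3.0,
}


def score_table(signals, weights=WEIGHTS):
    """Sum the weights of every signal that is true for this table."""
    return sum(weight for name, weight in weights.items() if signals.get(name))


def is_data_table(signals, threshold=1.0):
    """Classify as data when the total score reaches the (arbitrary) threshold."""
    return score_table(signals) >= threshold
```

For example, a table with many rows, many cells per row, and consistent cell lengths scores 5.0 with these weights and is classified as data, while one that contains other tables and uses spans scores -4.0 and is treated as layout.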