Scrapy approach to scraping multiple URLs

I have a project which requires a great deal of data scraping to be done.

I've been looking at Scrapy which so far I am very impressed with but I am looking for the best approach to do the following:

1) I want to scrape multiple URLs and pass the same variable to each URL being scraped. For example, let's assume I want to return the top result for the keyword "python" from Bing, Google and Yahoo.

I would want to scrape http://www.google.co.uk/q=python, http://www.yahoo.com?q=python and http://www.bing.com/?q=python (not the actual URLs but you get the idea)

I can't find a way to specify dynamic URLs using the keyword; the only option I can think of is to generate a file in PHP (or similar) which builds the URLs, and then tell Scrapy to crawl the links in that file.

2) Obviously each search engine has its own markup, so I would need to differentiate between the results to find the corresponding XPath to extract the relevant data from.

3) Lastly, I would like to write the results of each scraped Item to a database (probably Redis), but only once all three URLs have finished scraping. Essentially I want to build up a "profile" from the three search engines and save the combined result in one transaction.

If anyone has any thoughts on any of these points I would be very grateful.


asked Aug 28 '12 at 14:08

you might want to look into scrapyd doc.scrapy.org/en/latest/topics/scrapyd.html -

Well, besides my thoughts on #1, I was looking to handle #2 and #3 in the parse method, using a combination of case statements on the URL and building up a generic Item to grab XPath values. I'm sure there is an easier way, however. -

3 Answers

1) BaseSpider has an __init__ method that can be overridden in subclasses. This is where the start_urls and allowed_domains variables are set. If you have a list of URLs in mind before running the spider, you can insert them dynamically here.
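For instance, a minimal sketch of building start_urls for one keyword before the spider runs (the search URL templates are placeholders, not the engines' real query formats):

```python
from urllib.parse import quote_plus

def build_start_urls(keyword):
    """Return one results-page URL per search engine for the given keyword.

    The templates below are illustrative; substitute each engine's real
    query format for your own crawl.
    """
    templates = [
        'https://www.google.co.uk/search?q={}',
        'https://search.yahoo.com/search?p={}',
        'https://www.bing.com/search?q={}',
    ]
    return [t.format(quote_plus(keyword)) for t in templates]
```

In the spider's __init__ you would then set `self.start_urls = build_start_urls('python')`.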

For example, in a few of the spiders I have built, I pull in preformatted groups of URLs from MongoDB and insert them into the start_urls list in one bulk insert.
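A sketch of that pull-from-MongoDB step, assuming each stored document carries a 'url' field (the collection layout here is a guess, not the answerer's actual schema):

```python
def load_start_urls(url_docs):
    """Extract the 'url' field from each stored document into a flat list,
    ready to be assigned to the spider's start_urls in one go."""
    return [doc['url'] for doc in url_docs]

# With pymongo this might look like (database/collection names assumed):
#     from pymongo import MongoClient
#     docs = MongoClient().scraper.urls.find({}, {'url': 1})
#     self.start_urls = load_start_urls(docs)
```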

2) This might be a little trickier, but you can easily see which URL was crawled by looking at the response object (response.url). You should be able to check whether the URL contains 'google', 'bing', or 'yahoo', and then use the pre-specified selectors for a URL of that type.
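One way to sketch that dispatch is a lookup table keyed on the engine name found in response.url. The XPaths below are placeholders; each engine's real markup would need to be inspected:

```python
# Map a domain fragment to the selector for that engine's result markup.
# These XPaths are illustrative assumptions, not the engines' real markup.
SELECTORS = {
    'google': '//h3/a/@href',
    'yahoo': "//div[@class='res']//a/@href",
    'bing': "//li[@class='b_algo']//h2/a/@href",
}

def selectors_for(url):
    """Pick the selector whose engine name appears in the crawled URL."""
    for engine, xpath in SELECTORS.items():
        if engine in url:
            return engine, xpath
    raise ValueError('unrecognised search engine: %s' % url)
```

Inside parse() you would call `selectors_for(response.url)` and apply the returned XPath to the response.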

3) I am not so sure that #3 is possible, or at least not without some difficulty. As far as I know, the URLs in the start_urls list are not crawled in order, and each arrives in the pipeline independently. Without some serious core hacking, I am not sure you will be able to collect a group of response objects and pass them into a pipeline together.

However, you might consider serializing the data to disk temporarily and bulk-saving it to your database later. One of the crawlers I built receives groups of around 10,000 URLs. Rather than making 10,000 single-item database insertions, I store the URLs (and collected data) in BSON and then insert them into MongoDB later.
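A minimal sketch of the serialize-then-bulk-insert idea, using a JSON-lines file in place of BSON: append each item as it is scraped, then load the whole file for a single bulk insert (e.g. pymongo's insert_many) once the crawl finishes.

```python
import json

def append_item(path, item):
    """Append one scraped item to a JSON-lines file as it arrives."""
    with open(path, 'a') as f:
        f.write(json.dumps(item) + '\n')

def load_items(path):
    """Load every buffered item back for one bulk database insert."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```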

answered Aug 28 '12 at 15:08

How exactly did you manage to pull the URLs from MongoDB and insert them into the bulk list? An example implementation would be really helpful. - peter

I would use mechanize for this.

import mechanize
br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
response = br.open('https://www.google.ca/search?q=python')
links = list(br.links())

which gives you all of the links. Or you can filter them by class:

links = [aLink for aLink in br.links() if ('class', 'l') in aLink.attrs]

answered Aug 28 '12 at 15:08

You could use the '-a' switch to pass a key-value pair to the spider, which could indicate a particular search word:

scrapy crawl <spider_name> -a search_word=python
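Scrapy forwards each '-a key=value' pair to the spider's __init__ as a keyword argument, so the spider can build its start_urls from it. A hedged sketch of that pattern (shown here with a plain class; a real spider would subclass scrapy.Spider, and the search URL is an assumed placeholder):

```python
class SearchSpider:  # would subclass scrapy.Spider in a real project
    name = 'search'

    def __init__(self, search_word='python', **kwargs):
        # '-a search_word=python' on the command line arrives here
        # as the search_word keyword argument.
        self.search_word = search_word
        self.start_urls = [
            'https://www.bing.com/search?q=%s' % search_word,
            'https://www.google.co.uk/search?q=%s' % search_word,
        ]
```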

answered Jul 6 '13 at 1:07
