I have a project which requires a great deal of data scraping to be done.
I've been looking at Scrapy, which I am very impressed with so far, but I am looking for the best approach to do the following:
1) I want to scrape multiple URLs and pass in the same variable for each URL to be scraped. For example, let's assume I want to return the top result for the keyword "python" from Bing, Google and Yahoo.
I would want to scrape
http://www.bing.com/?q=python (not the actual URLs but you get the idea)
I can't find a way to specify dynamic URLs using the keyword; the only option I can think of is to generate a file in PHP or similar which builds the URLs, and then tell Scrapy to crawl the links in that file.
2) Obviously each search engine has its own markup, so I would need to differentiate between the results and find the corresponding XPath for each engine to extract the relevant data.
3) Lastly, I would like to write the results of the scraped Item to a database (probably Redis), but only when all 3 URLs have finished scraping; essentially I want to build up a "profile" from the 3 search engines and save the resulting output in one transaction.
If anyone has any thoughts on any of these points I would be very grateful.
asked 28 August 2012 at 14:08
1) In BaseSpider, there is an __init__ method that can be overridden in subclasses. This is where the start_urls and allowed_domains variables are declared. If you have a list of URLs in mind prior to running the spider, then you can insert them dynamically here.
For example, in a few of the spiders I have built, I pull in preformatted groups of URLs from MongoDB and insert them into the start_urls list in one bulk insert.
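As a minimal sketch of that idea (the spider name, keyword, and query URLs below are illustrative placeholders, and I'm assuming the older BaseSpider-style interface):

from scrapy.spider import BaseSpider

class SearchSpider(BaseSpider):
    name = "search"
    allowed_domains = ["bing.com", "google.com", "yahoo.com"]

    def __init__(self, search_word="python", *args, **kwargs):
        super(SearchSpider, self).__init__(*args, **kwargs)
        # Build start_urls dynamically from the keyword; the query
        # endpoints below are placeholders, not the engines' real URLs.
        self.start_urls = [
            "http://www.bing.com/search?q=%s" % search_word,
            "http://www.google.com/search?q=%s" % search_word,
            "http://search.yahoo.com/search?p=%s" % search_word,
        ]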
2) This might be a little trickier, but you can easily see the crawled URL by looking at the response object (response.url). You should be able to check whether the URL contains 'google', 'bing', or 'yahoo', and then use the prespecified selectors for a URL of that type.
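For example, inside the spider's parse method (the XPath expressions below are placeholders, since the real ones depend on each engine's current markup, and I'm assuming the HtmlXPathSelector API):

from scrapy.selector import HtmlXPathSelector

# method of the spider class
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # Choose selectors based on which engine this response came from
    if 'bing' in response.url:
        results = hxs.select('//li[@class="sa_wr"]//h3/a/text()').extract()
    elif 'google' in response.url:
        results = hxs.select('//h3[@class="r"]/a/text()').extract()
    elif 'yahoo' in response.url:
        results = hxs.select('//div[@class="res"]//h3/a/text()').extract()
    # ... build and yield your Item from `results` here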
3) I am not so sure that #3 is possible, or at least not without some difficulty. As far as I know, the URLs in the start_urls list are not crawled in order, and they each arrive in the pipeline independently. Without some serious core hacking, I am not sure you will be able to collect a group of response objects and pass them into a pipeline together.
However, you might consider serializing the data to disk temporarily and then bulk-saving it to your database later. One of the crawlers I built receives groups of around 10000 URLs. Rather than making 10000 single-item database insertions, I store the URLs (and collected data) in BSON and then insert it into MongoDB later.
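If you do want everything saved in one go, a pipeline can buffer the items in memory and write them all out when the spider closes. A rough sketch, where save_profile() is a hypothetical placeholder for whatever bulk insert you use against Redis or MongoDB:

class BulkSavePipeline(object):
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        # Buffer each scraped item instead of writing it immediately
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # One bulk write when the crawl finishes; save_profile() is
        # a placeholder for your actual Redis/MongoDB insert
        save_profile(self.items)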
I would use mechanize for this.
import mechanize

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.set_handle_robots(False)  # ignore robots.txt
response = br.open('https://www.google.ca/search?q=python')
links = list(br.links())
which gives you all of the links, or you can filter them by class:
links = [aLink for aLink in br.links() if ('class', 'l') in aLink.attrs]
You could use the '-a' switch to specify a key-value pair to the spider, which could indicate a particular search word:
scrapy crawl <spider_name> -a search_word=python
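On the spider side, the value passed with -a arrives as a constructor keyword argument, which ties in with the __init__ approach from the first answer (SearchSpider and the Bing URL are the illustrative names from the sketch above):

def __init__(self, search_word=None, *args, **kwargs):
    super(SearchSpider, self).__init__(*args, **kwargs)
    # search_word comes from the -a option on the command line
    self.start_urls = ["http://www.bing.com/search?q=%s" % search_word]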