My spider is not following links

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from myproject.items import KextraItem  # adjust to your project's items module


class KextraSpider(CrawlSpider):
    name = "kextra"
    allowed_domains = ["k-extra.fi"]
    start_urls = ["http://www.k-extra.fi/Tarjoukset/"]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'\?id=5&amp;epslanguage=fi&amp;sivu=\d',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        sel = Selector(response)
        kxitems = []
        sites = sel.xpath('//div[@class="offerListItem"]')
        for site in sites:
            item = KextraItem()
            item["image"] = site.xpath('div[@class="offerListLeftColumn"]/img/@src').extract()
            item["product_primary"] = site.xpath('div[@class="offerListRightColumn"]/h4/text()').extract()
            item["product_secondary"] = site.xpath('div[@class="offerListRightColumn"]/h3/text()').extract()
            item["discount"] = site.xpath('div[@class="offerListRightColumn"]/div[@class="plussaDiscount"]/div[@class="plussaAmount"]/text()').extract()
            item["priceEuros"] = site.xpath('div[@class="offerListPriceContainer"]/div[@class="price"]/p[@class="euros"]/text()').extract()
            item["priceCents"] = site.xpath('div[@class="offerListPriceContainer"]/div[@class="price"]/p[@class="euros"]/span[@class="cents"]/text()').extract()
            kxitems.append(item)
        return kxitems

The problem is that the links matched by the specified allow pattern are not followed. If I leave allow blank, every link is followed. What could be wrong with the regular expression in allow?

asked Feb 8 '14 at 12:02

Try changing "&amp;" to "&" –
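
That hint is the whole story: in the page's HTML source the ampersands are escaped as &amp;, but the link extractor tests the allow patterns against the entity-decoded URLs, where they are plain &. A minimal sketch of the mismatch outside Scrapy (the href value is an assumption, reconstructed from the pagination links in the answer below):

import re

# In the raw HTML the link is escaped as
#   ?id=5&amp;epslanguage=fi&amp;sivu=1
# but the link extractor sees the entity-decoded URL:
href = 'http://www.k-extra.fi/Tarjoukset/?id=5&epslanguage=fi&sivu=1'

print(re.search(r'\?id=5&amp;epslanguage=fi&amp;sivu=\d', href))  # None -> nothing followed
print(re.search(r'\?id=5&epslanguage=fi&sivu=\d', href))          # match -> link followed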

1 Answer

Try this link extractor instead:

SgmlLinkExtractor(allow = ('\?id=5&epslanguage=fi&sivu=\d'))
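
Dropped into the spider, the rule would then read as follows (a sketch assuming nothing changes except removing the HTML entities from the pattern):

rules = (
    Rule(SgmlLinkExtractor(allow=(r'\?id=5&epslanguage=fi&sivu=\d',)),
         callback="parse_items", follow=True),
)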

Example scrapy shell session:

paul@wheezy:~/$ scrapy shell http://www.k-extra.fi/Tarjoukset/
2014-02-08 17:13:07+0100 [scrapy] INFO: Scrapy 0.23.0 started (bot: scrunchscraper)
2014-02-08 17:13:07+0100 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2014-02-08 17:13:07+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrunchscraper.spiders', 'CONCURRENT_REQUESTS_PER_DOMAIN': 16, 'SPIDER_MODULES': ['scrunchscraper.spiders'], 'TEMPLATES_DIR': 'templates/', 'BOT_NAME': 'scrunchscraper', 'WEBSERVICE_ENABLED': False, 'LOGSTATS_INTERVAL': 0, 'TELNETCONSOLE_ENABLED': False}
2014-02-08 17:13:07+0100 [scrapy] INFO: Enabled extensions: CloseSpider, CoreStats, SpiderState, StatsNotifier
2014-02-08 17:13:07+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-02-08 17:13:07+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-02-08 17:13:07+0100 [scrapy] INFO: Enabled item pipelines: ConsistencyPipeline
2014-02-08 17:13:07+0100 [default] INFO: Spider opened
2014-02-08 17:13:08+0100 [default] DEBUG: Crawled (200) <GET http://www.k-extra.fi/Tarjoukset/> (referer: None)
[s] Available Scrapy objects:
[s]   item       {}
[s]   request    <GET http://www.k-extra.fi/Tarjoukset/>
[s]   response   <200 http://www.k-extra.fi/Tarjoukset/>
[s]   sel        <Selector xpath=None data=u'<html xmlns="http://www.w3.org/1999/xhtm'>
[s]   settings   <CrawlerSettings module=<module 'scrunchscraper.settings' from '/home/paul/scrapinghub/scrunch/retailspiders/scrunchscraper/settings.pyc'>>
[s]   spider     <Spider 'default' at 0x387f950>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor                         

In [2]: SgmlLinkExtractor(allow = ('\?id=5&amp;epslanguage=fi&amp;sivu=\d')).extract_links(response)
Out[2]: []

In [3]: SgmlLinkExtractor(allow = ('\?id=5&epslanguage=fi&sivu=\d')).extract_links(response)
Out[3]: 
[Link(url='http://www.k-extra.fi/Tarjoukset/?epslanguage=fi&id=5&sivu=1', text=u'2', fragment='', nofollow=False),
 Link(url='http://www.k-extra.fi/Tarjoukset/?epslanguage=fi&id=5&sivu=2', text=u'3', fragment='', nofollow=False),
 Link(url='http://www.k-extra.fi/Tarjoukset/?epslanguage=fi&id=5&sivu=6', text=u'Viimeinen', fragment='', nofollow=False)]

In [4]: 
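Note, in passing, that the returned Link objects have their query parameters reordered (epslanguage before id): SgmlLinkExtractor canonicalizes the URLs it returns by default (canonicalize=True), while the session above suggests the allow pattern is still matched against the parameter order found in the page itself.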

answered Feb 8 '14 at 16:02
