PhantomJS and pjscrape: failing on some of multiple URLs

Overview

I am trying to create a very basic scraper with PhantomJS and the pjscrape framework.

My code

pjs.config({
    timeoutInterval: 6000,
    timeoutLimit: 10000,
    format: 'csv',
    csvFields: ['productTitle', 'price'],
    writer: 'file',
    outFile: 'D:\\prod_details.csv'
});

pjs.addSuite({
    title: 'ChainReactionCycles Scraper',
    url: productURLs, // an array of URLs; two example arrays are defined below
    scrapers: [
        function() {
            var results [];
            var linkTitle = _pjs.getText('#ModelsDisplayStyle4_LblTitle');
            var linkPrice = _pjs.getText('#ModelsDisplayStyle4_LblMinPrice');
            results.push([linkTitle[0], linkPrice[0]]);
            return results;
        }
    ]
});

URL arrays used

This first array DOES NOT WORK; it fails after the 3rd or 4th URL.

var productURLs = ["8649","17374","7327","7325","14892","8650","8651","14893","18090","51318"];
for(var i=0;i<productURLs.length;++i){
  productURLs[i] = 'http://www.chainreactioncycles.com/Models.aspx?ModelID=' + productURLs[i];
}

This second array WORKS and does not fail, even though it is from the same site.

var categoriesURLs = ["304","2420","965","518","514","1667","521","1302","1138","510"];
for(var i=0;i<categoriesURLs.length;++i){
  categoriesURLs[i] = 'http://www.chainreactioncycles.com/Categories.aspx?CategoryID=' + categoriesURLs[i];
}

Problem

When iterating through productURLs, the optional callback of PhantomJS's page.open reports failure, even when the page hasn't finished loading.

I know this because I ran the script alongside an HTTP debugger, and the HTTP requests were still in progress even after PhantomJS had reported a page load failure.

However, the code works fine when running with categoriesURLs.
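
One way to narrow down where such a failure comes from is to record the resource errors PhantomJS sees before the page.open callback fires. The following is only a minimal sketch, not part of pjscrape: the helper name openWithDiagnostics is made up for illustration, and it assumes a PhantomJS build that supports page.onResourceError (1.9 or later). The page object is passed in, so the function itself does not depend on how the page was created.

```javascript
// Sketch: attach the resource errors seen so far to the status that
// page.open hands to its callback. `page` is expected to be a PhantomJS
// WebPage, e.g. require('webpage').create(). The helper name is hypothetical.
function openWithDiagnostics(page, url, done) {
    var resourceErrors = [];

    // onResourceError (PhantomJS 1.9+) fires for each resource that fails to load.
    page.onResourceError = function (err) {
        resourceErrors.push({
            url: err.url,
            code: err.errorCode,
            reason: err.errorString
        });
    };

    // page.open passes only 'success' or 'fail' to its callback.
    page.open(url, function (status) {
        done({ url: url, status: status, resourceErrors: resourceErrors });
    });
}
```

Inside a PhantomJS script this could be called as openWithDiagnostics(require('webpage').create(), productURLs[0], function (report) { console.log(JSON.stringify(report)); phantom.exit(); }); to show whether a 'fail' status is accompanied by a specific network error.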

Assumptions

  1. All the URLs listed above are VALID
  2. I have the latest versions of both PhantomJS and pjscrape

Possible solutions

These are solutions I have tried thus far.

  1. Disabling image loading: page.options.loadImages = false
  2. Setting a larger timeoutInterval in pjs.config. This did not help, apparently because the error generated was a page.open failure and NOT a timeout failure.

Any ideas?

asked 10 Mar 12, 14:03

As I just noted on GitHub, I can't reproduce the issue - I was able to retrieve the productUrls list without a problem. I don't think it's a Pjscrape problem - it sounds like a PhantomJS issue. -

1 Answer

The problem was caused by PhantomJS. This has now been resolved.

I now use PhantomJS v2.0.

answered 15 Oct 15, 19:10

Fixed how? Which version of PhantomJS did you use? (I'm having the same problem with 1.6 and with the latest version compiled from git sources.) - Mehdi Lahmam B.

@Hzmy This would be more worthy of being called an answer if you also said what solution you used instead of PhantomJS. - rinez

@rineez I just upgraded the PhantomJS binary to v2.0 and my code worked. - Hzmy

Oh! For me, PhantomJS 1.9 is working well with pjscrape. 2.0 was showing too many compatibility problems with the current version of pjscrape. - rinez

In the code you posted with the question, this line looks wrong to me: var results []; I have never seen such syntax in JS. Is that valid syntax? - rinez
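
On the syntax point: var results []; is not valid JavaScript, because the declaration needs an = before the array literal. Below is a minimal sketch of the scraper function from the question with only that changed. The _pjs object here is a hypothetical stand-in with made-up values so the snippet runs on its own; inside pjscrape the real _pjs helper is provided in the page context.

```javascript
// Hypothetical stand-in for pjscrape's in-page `_pjs` helper, so that the
// sketch can run outside PhantomJS. The returned values are made up.
var _pjs = {
    getText: function (selector) {
        var fake = {
            '#ModelsDisplayStyle4_LblTitle': ['Example product'],
            '#ModelsDisplayStyle4_LblMinPrice': ['9.99']
        };
        return fake[selector] || [];
    }
};

// The scraper from the question, with the array declaration written
// as `var results = [];`.
function scrapeProduct() {
    var results = [];
    var linkTitle = _pjs.getText('#ModelsDisplayStyle4_LblTitle');
    var linkPrice = _pjs.getText('#ModelsDisplayStyle4_LblMinPrice');
    results.push([linkTitle[0], linkPrice[0]]);
    return results;
}
```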
