Java HttpClient seems to be caching content

I'm building a simple web scraper and I need to fetch the same page a few hundred times. The page contains a dynamic attribute that should change on every request. I've built a multithreaded HttpClient-based class to process the requests, and I'm using an ExecutorService to create a thread pool and run the threads. The problem is that the dynamic attribute sometimes doesn't change between requests, and I end up getting the same value on 3 or 4 subsequent threads. I've read a lot about HttpClient and I really can't find where this problem comes from. Could it be caused by caching, or something similar?

Update: here is the code executed in each thread:

HttpContext localContext = new BasicHttpContext();

HttpParams params = new BasicHttpParams();
HttpProtocolParams.setVersion(params, HttpVersion.HTTP_1_1);
HttpProtocolParams.setContentCharset(params,
        HTTP.DEFAULT_CONTENT_CHARSET);
HttpProtocolParams.setUseExpectContinue(params, true);

ClientConnectionManager connman = new ThreadSafeClientConnManager();

DefaultHttpClient httpclient = new DefaultHttpClient(connman, params);

HttpHost proxy = new HttpHost(inc_proxy, Integer.valueOf(inc_port));
httpclient.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY,
        proxy);

HttpGet httpGet = new HttpGet(url);
httpGet.setHeader("User-Agent",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");

String iden = null;
int timeoutConnection = 10000;
HttpConnectionParams.setConnectionTimeout(httpGet.getParams(),
        timeoutConnection);

try {

    HttpResponse response = httpclient.execute(httpGet, localContext);

    HttpEntity entity = response.getEntity();

    if (entity != null) {

        InputStream instream = entity.getContent();
        String result = convertStreamToString(instream);
        // System.out.printf("Resultado\n %s",result +"\n");
        instream.close();

        iden = StringUtils
                .substringBetween(result,
                        "<input name=\"iden\" value=\"",
                        "\" type=\"hidden\"/>");
        System.out.printf("IDEN:%s\n", iden);
        EntityUtils.consume(entity);
    }

}

catch (ClientProtocolException e) {
    // TODO Auto-generated catch block
    System.out.println("Excepção CP");

} catch (IOException e) {
    // TODO Auto-generated catch block
    System.out.println("Excepção IO");
}

asked Mar 9 '12 at 22:03

It could be cached on the server side.

You could be writing thread-unsafe code, so that whenever you download the data the old results get overwritten by new results. It's hard to tell without code.

I've updated the question with code.

3 Answers

HttpClient does not cache by default (that is, when you use only the DefaultHttpClient class). It caches only if you use CachingHttpClient, which is an HttpClient interface decorator that enables caching:

HttpClient client = new CachingHttpClient(new DefaultHttpClient(), cacheConfiguration);

It then analyzes the If-Modified-Since and If-None-Match headers to decide whether to send the request to the remote server or to return the result from the cache.

I suspect that your issue is caused by the proxy server standing between your application and the remote server.

You can test this easily with curl. First, execute a number of requests that bypass the proxy:

#!/bin/bash

for i in {1..50}
do
  echo "*** Performing request number $i"
  curl -D - http://yourserveraddress.com -o $i -s
done

Then run diff across all the downloaded files. All of them should show the differences you mentioned. Next, add the -x/--proxy <host[:port]> option to curl, run the script again, and compare the files once more. If some responses are identical to others, then you can be sure that this is a proxy server issue.
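If you would rather do the comparison in Java than with diff, here is a minimal sketch. It assumes the response bodies have already been collected as strings and that the hidden field looks exactly like the markup in the question; the class and method names (IdenCheck, extractIden, duplicates) are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdenCheck {
    // Assumes the exact markup from the question:
    // <input name="iden" value="..." type="hidden"/>
    private static final Pattern IDEN = Pattern.compile(
            "<input name=\"iden\" value=\"([^\"]*)\" type=\"hidden\"/>");

    /** Returns the iden value found in the page, or null if the field is missing. */
    static String extractIden(String html) {
        Matcher m = IDEN.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /** Returns every iden value that occurs in more than one body, with its count. */
    static Map<String, Integer> duplicates(List<String> bodies) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String body : bodies) {
            String iden = extractIden(body);
            if (iden != null) {
                counts.merge(iden, 1, Integer::sum);
            }
        }
        // Keep only the values that were seen at least twice.
        counts.values().removeIf(n -> n < 2);
        return counts;
    }
}
```

Run it once on the bodies fetched without the proxy and once on the bodies fetched through it. An empty map in the first run and a non-empty map in the second would point at the proxy.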

answered Mar 10 '12 at 14:03

I guess this is about Apache httpcomponents-client: hc.apache.org/httpcomponents-client-4.5.x/current/… Note: it now says "Deprecated. (4.3) use CachingHttpClientBuilder or CachingHttpClients." - Benjamín Peter

Generally speaking, to test whether or not HTTP requests are actually being made over the wire, you can use a "sniffing" tool that analyzes network traffic.

I highly doubt HttpClient is performing caching of any sort (this would imply it needs to store pages in memory or on disk - not one of its capabilities).

While this is not an answer, it's a point to ponder: is it possible that the server (or some proxy in between) is returning cached content to you? If you are performing many requests (simultaneously or nearly simultaneously) for the same content, the server may be returning cached content because it has decided that the information has not "expired" yet. In fact, the HTTP protocol provides caching directives for exactly this purpose. Here is a site that gives a high-level overview of the different HTTP caching mechanisms:

http://betterexplained.com/articles/how-to-optimize-your-site-with-http-caching/
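One way to look for signs of an intermediate cache is to inspect the response headers. The sketch below is only a heuristic under stated assumptions: Age is the standard header a cache adds when it serves a stored response, Via only shows that the response passed through a proxy or gateway, and X-Cache is a non-standard header that some proxies and CDNs add. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class CacheHints {
    /**
     * Returns the headers that suggest an intermediary handled the response.
     * An empty list does not prove that nothing was cached.
     */
    static List<String> hints(Map<String, List<String>> headers) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            String name = e.getKey();
            if (name == null) {
                continue; // HttpURLConnection stores the status line under a null key
            }
            // Header names are case-insensitive, so compare in lower case.
            String lower = name.toLowerCase(Locale.ROOT);
            if (lower.equals("age") || lower.equals("via") || lower.equals("x-cache")) {
                found.add(lower + ": " + String.join(", ", e.getValue()));
            }
        }
        return found;
    }
}
```

With java.net.HttpURLConnection you could pass connection.getHeaderFields() straight in. With the HttpClient code from the question you would first have to copy response.getAllHeaders() into a map.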

I hope this gives you a starting point. If you have already considered these avenues then that's great.

answered Mar 9 '12 at 22:03

You could try appending some unique dummy parameter to the URL on every request to try to defeat any URL-based caching (in the server, or somewhere along the way). It won't work if caching isn't the problem, or if the server is smart enough to reject requests with unknown parameters, or if the server is caching but only based on parameters it cares about, or if your chosen parameter name collides with a parameter the site actually uses.

If this is the URL you're using, http://www.example.org/index.html, try using http://www.example.org/index.html?dummy=1

Set dummy to a different value for each request.
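A minimal sketch of how that could look in Java, assuming the parameter name dummy is not one the site actually uses. The CacheBuster class and its methods are made up for illustration; the shared AtomicLong makes sure every thread in the pool gets a different value.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CacheBuster {
    // Shared across threads so that no two requests get the same value.
    private static final AtomicLong COUNTER = new AtomicLong();

    /** Appends dummy=<value>, using '&' if the URL already has a query string. */
    static String withDummy(String url, long value) {
        // Keep any fragment (#...) at the very end of the URL.
        int hash = url.indexOf('#');
        String base = hash >= 0 ? url.substring(0, hash) : url;
        String fragment = hash >= 0 ? url.substring(hash) : "";
        char separator = base.indexOf('?') >= 0 ? '&' : '?';
        return base + separator + "dummy=" + value + fragment;
    }

    /** Returns the URL with a dummy value that is different on every call. */
    static String next(String url) {
        return withDummy(url, COUNTER.incrementAndGet());
    }
}
```

In the thread code from the question you would then create the request with new HttpGet(CacheBuster.next(url)) instead of new HttpGet(url).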

answered Mar 9 '12 at 23:03

I'm also using a FixedThreadPool to execute the threads: ExecutorService pool = Executors.newFixedThreadPool(10); for(int i=0;i<count;i++) pool.submit(new GetThread(i)); pool.submit; - Trucha
