I have a file with approximately 200,000 document URLs. I want to sum up the sizes of these URLs. I've written something in Java using HttpURLConnection, but it takes a very long time to run, and that is of course understandable - it opens an HTTP connection for each one.
Is there a faster way to do this? Maybe the same thing in another language would take less time (if processing a single HTTP connection in Java takes a bit longer than in another language, then with my number of connections it becomes noticeable)? Or another approach?
asked January 31, 2012 at 08:01
Changing the language won't make a difference here; opening 200,000 HTTP connections, however you look at it, takes a long time!
You can use a thread pool and execute the tasks concurrently, which might speed things up quite a bit, but something like this is never going to run in a second or two.
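A minimal sketch of the thread-pool idea with `ExecutorService` (Java 8+ for the lambda). The file name `urls.txt`, the pool size of 50, and the `sizeOf()` helper are all assumptions for illustration; `sizeOf()` stands in for whatever per-URL fetch you end up using:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class SizeSummer {
    public static void main(String[] args) throws Exception {
        // "urls.txt" is a placeholder for your file of ~200,000 URLs, one per line.
        List<String> urls = Files.readAllLines(Paths.get("urls.txt"));

        // Fixed pool; 50 is a guess, tune it to your bandwidth and what the servers tolerate.
        ExecutorService pool = Executors.newFixedThreadPool(50);
        AtomicLong total = new AtomicLong();

        for (String url : urls) {
            pool.submit(() -> {
                try {
                    // sizeOf() is a hypothetical helper that returns the size of one
                    // document, e.g. via a HEAD request (see the sketch further down).
                    total.addAndGet(sizeOf(url));
                } catch (Exception e) {
                    System.err.println("Failed: " + url);
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS); // wait for all submitted tasks to finish
        System.out.println("Total bytes: " + total.get());
    }

    static long sizeOf(String url) throws Exception {
        // Placeholder: fetch the size of a single document here.
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}
```

With a pool of a few dozen threads the total time is roughly divided by the pool size, since each request spends most of its time waiting on the network rather than on the CPU.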
You should also use HTTP HEAD requests to retrieve only the Content-Length header, not the content itself, to speed up the process. Using threads can also help, especially if a single request does not come close to saturating your connection, which it most likely does not. The last and probably most efficient option is to run the process physically near the servers, e.g. in the same subnet.
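A minimal sketch of the HEAD approach with `HttpURLConnection`, assuming Java 7+ for `getContentLengthLong()`. Note that some servers ignore HEAD or omit Content-Length (e.g. with chunked responses), so a -1 result needs a fallback such as a full GET:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class HeadSize {
    // Returns the Content-Length the server reports for the URL, or -1 if it
    // does not send one (some servers omit it, especially for HEAD responses).
    static long contentLength(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            conn.setRequestMethod("HEAD");      // headers only, no body is transferred
            conn.setConnectTimeout(5_000);
            conn.setReadTimeout(5_000);
            return conn.getContentLengthLong(); // -1 when the header is missing
        } finally {
            conn.disconnect();
        }
    }
}
```

Something like this could serve as the per-URL step (the `sizeOf()` placeholder) inside the thread-pool sketch above.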
It seems like you are approaching the problem the wrong way. Your bottleneck isn't summing the sizes, but efficiently accessing 200,000 URLs to determine the size of each file. Luckily there are web services that can help you overcome this bottleneck; maybe try a service like 80 legs to run a cheap web crawler and then run analysis on the result set...
Also, just a point of clarification - you are hoping to find the size of the files described by the URLs... not the size of the URL strings themselves, right?