Map/Reduce on a single server

Does it make sense to do map/reduce on a non-sharded architecture?

Or, in other words, is it effective to do it on a single server?

asked Nov 8 '11 at 19:11

2 Answers

Overall I disagree with Praveen.

Yes, I agree that when running on a single system you lose the platform's fault-tolerance properties. However, the platform still has properties that are useful for specific purposes.

In many situations, using the Hadoop toolkit has advantages over doing the same work without it.

  1. You do not need to worry about the size of the input file. If your input data is many GiB, you can still process it on a system with only 512 MiB of RAM available.
  2. With the platform, your data-processing application effectively runs multithreaded without you having to write any threading code. You simply deploy your application on a different instance of the platform.
  3. You keep the door open to scaling out over multiple systems. When your application reaches that level, the step towards real horizontal scalability is a very small one.
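To make point 1 concrete: even plain streaming code keeps memory bounded regardless of input size, which is the same principle Hadoop applies when it feeds a job its input split record by record. A minimal sketch in Python rather than Java (Hadoop itself is not involved; the function name is my own):

```python
from collections import Counter

def streaming_word_count(path):
    """Count words while holding only one line (plus the counters) in memory.

    This mirrors how a MapReduce job streams its input record by record,
    so the input file can be far larger than available RAM.
    """
    counts = Counter()
    with open(path) as f:
        for line in f:          # one record at a time, never the whole file
            counts.update(line.split())
    return counts
```

The point is only that per-record processing decouples input size from memory use; Hadoop adds the same idea plus splitting, scheduling, and fault tolerance on top.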

When you have written your processing application using Hadoop you have several options for running it:

  1. Single-threaded on a single box using the local filesystem. This way it is simply a command-line Java application that transforms input into output.
  2. With just a jobtracker/tasktracker setup on a single box using the local filesystem. See this Stack Overflow question for more info: Is it possible to run Hadoop in pseudo-distributed operation without HDFS?
  3. Full-blown on a single system (the pseudo-distributed mode).
  4. Full-blown multi-system setup.

answered May 23 '17 at 15:05

Which point do you disagree on? The original query was "Does it make sense to do map/reduce on a non-sharded architecture? ... on a single server", and there was no question of scaling it further. Straight to the point: Hadoop is overkill on a single box because of the overheads; it is like a power drill for a single box with a non-sharded architecture. On my laptop a simple grep is 5-10 times faster than Hadoop in pseudo-distributed mode for the same functionality, and I don't think tuning it will make it much better. Hadoop also has a steep learning curve. - Praveen Sripati

Hadoop has overhead when running on a single system, yet it makes handling larger datasets easier even on that single system. So I disagree with your conclusion not to use Hadoop when running on a single server; I've posted several arguments to support this opinion. - Niels Basjes

With large datasets, disk read speed will be the bottleneck for the job (multiple spindles on the server will help to some extent). That's one of the reasons Hadoop spreads the data across multiple nodes. Also, there is a lot of IPC (JobTracker <-> TaskTracker and NameNode <-> DataNode). It's a topic for a lengthy discussion, but I still disagree that Hadoop on a single server is worth the effort. - Praveen Sripati

By MapReduce, I think you mean Hadoop. There are other languages and frameworks that support the MapReduce paradigm. Here is my opinion on Hadoop.
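To illustrate that the paradigm is indeed independent of any one framework, here is a rough plain-Python sketch of word count written as explicit map, shuffle, and reduce phases (no Hadoop involved; all names are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key,
    # which is what a MapReduce framework does between the two phases.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Reducer: sum the counts for each word.
    for word, ones in grouped:
        yield word, sum(ones)

lines = ["to be or not to be"]
result = dict(reduce_phase(shuffle(map_phase(lines))))
# result == {"to": 2, "be": 2, "or": 1, "not": 1}
```

A framework like Hadoop adds distribution, spilling to disk, and fault tolerance around these three steps, but the programming model is just this.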

Hadoop on a single server is suitable for testing purposes (standalone and pseudo-distributed modes).

When Hadoop runs on a single server, inherent features like fault tolerance are lost: if the server goes down, all the data on it is lost. Also, when the data is small and the computation is light, Hadoop has a lot of overhead compared to the actual processing.

If you are going to stay on a single server, it's better not to go for Hadoop, which is designed for distributed computing.

answered Nov 9 '11 at 11:06

A document from Microsoft in the same context. - Praveen Sripati
