Does it make sense to do map/reduce on a non-sharded architecture?
Or, in other words, is it effective to do it on a single server?
asked Nov 8 '11 at 19:11
Overall I disagree with Praveen.
Yes, I agree that when running on a single system you lose the fault-tolerance properties of the platform. However, there are many situations where using the Hadoop toolkit still has advantages over doing the same work without Hadoop:
- You do not need to worry about the size of the input file. If your input data is many GiB, you can still process it on a system that has only 512 MiB of RAM available.
- With the platform you can make your data processing application run multithreaded without having to create and manage threads yourself. You simply deploy your application on a different instance of the platform.
- You keep the door open to scaling out over multiple systems. When your application reaches that level, the step towards real horizontal scalability is a very simple one.
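To illustrate the first point, here is a minimal sketch in plain Java (not the Hadoop API) of why a map/reduce-style job does not need the whole input in memory: the map step only ever sees one record at a time. The class name `LocalWordCount` and its methods are made up for this example. Note that this sketch keeps the running totals in memory, whereas Hadoop, as far as I know, can also spill intermediate data to disk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop.
public class LocalWordCount {

    // "Map" step: split a single input line into words.
    // Only this one line has to be in memory at this point.
    static String[] map(String line) {
        String trimmed = line.trim().toLowerCase();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // "Reduce" step: fold one more occurrence of a word into its running total.
    static void reduce(Map<String, Long> totals, String word) {
        totals.merge(word, 1L, Long::sum);
    }

    // Streams the input line by line, so memory use depends on the number
    // of distinct words, not on the size of the input.
    static Map<String, Long> run(Reader input) {
        Map<String, Long> totals = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : map(line)) {
                    reduce(totals, word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return totals;
    }

    public static void main(String[] args) {
        // A StringReader stands in for a large file here; a FileReader
        // would be processed the same way.
        System.out.println(run(new StringReader("a b a\nb c\n")));
        // prints {a=2, b=2, c=1}
    }
}
```

With Hadoop you would write only the map and reduce logic and let the platform take care of reading, splitting and grouping the data.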
When you have written your processing application using Hadoop, you have several options for running it:
- Single-threaded on a single box using the local filesystem. This way it is simply a command-line Java application that transforms input into output.
- With just a jobtracker/tasktracker setup on a single box using the local filesystem. See this Stack Overflow question for more info: Is it possible to run Hadoop in pseudo-distributed operation without HDFS?
- Full-blown on a single system (the pseudo-distributed mode).
- Full-blown multi-system setup.
By MapReduce, I think you mean Hadoop. There are other languages and frameworks which support the MapReduce paradigm. Here is my opinion on Hadoop.
When Hadoop is run on a single server, inherent features like fault tolerance are lost, because if the server goes down then all the data associated with the server is lost. Also, when the data is small and the computation is light, Hadoop has a lot of overhead compared to the actual processing.
When going for a single server, it's better not to go for Hadoop (which is designed for distributed computing).