Start a map-reduce job from my Java/MySQL webapp

I need a bit of architecture advice. I have a Java-based webapp with a JPA-based ORM backed by a MySQL relational database. As part of the application I have a batch job that compares thousands of database records with each other. This job has become too time-consuming and needs to be parallelized. I'm looking at using MapReduce and Hadoop to do this, but I'm not sure how to integrate it into my current architecture. I think the easiest initial solution is to find a way to push data from MySQL into Hadoop jobs. I have done some initial research and found the following relevant information and possibilities:

1) https://issues.apache.org/jira/browse/HADOOP-2536 gives an interesting overview of Hadoop's built-in JDBC support.

2) This article, http://architects.dzone.com/articles/tools-moving-sql-database, describes some third-party tools for moving data from MySQL to Hadoop.
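For reference, here is a minimal sketch of what the HADOOP-2536 route looks like with the old org.apache.hadoop.mapred API (DBInputFormat/DBConfiguration). The table name, column names, and connection settings are placeholder assumptions, not anything from my actual schema:

    // Sketch: reading MySQL rows straight into a Hadoop job via the JDBC
    // support from HADOOP-2536 (old org.apache.hadoop.mapred API).
    // Table "record" and columns "id"/"payload" are hypothetical.
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    public class RecordRow implements Writable, DBWritable {
        private long id;
        private String payload;

        // DBWritable: map a JDBC row to/from this object.
        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong(1);
            payload = rs.getString(2);
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setLong(1, id);
            ps.setString(2, payload);
        }

        // Writable: serialization between map-reduce stages.
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            payload = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeUTF(payload);
        }

        public static void configure(JobConf job) {
            job.setInputFormat(DBInputFormat.class);
            DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
                    "jdbc:mysql://dbhost/mydb", "user", "password");
            // Rows are read ORDER BY "id" so the input can be split.
            DBInputFormat.setInput(job, RecordRow.class,
                    "record", null /* WHERE conditions */, "id" /* ORDER BY */,
                    "id", "payload");
        }
    }

With this wiring, Hadoop generates the SELECT against the table itself and hands each row to the mappers as a RecordRow, so no separate export step would be needed.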

To be honest, I'm just starting out learning about HBase and Hadoop, and I really don't know how to integrate this into my webapp.

Any advice is greatly appreciated. Cheers, Brian

asked Jan 8 '11 at 22:01

threads and a couple of stored procs should do you :P -

2 Answers

DataNucleus supports JPA persistence to HBase. Obviously JPA is designed for RDBMS, so support for full JPA will never be possible, but you can do basic persistence and querying.
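For illustration, a minimal sketch of what that could look like, assuming the DataNucleus HBase store plugin is on the classpath; the persistence-unit name and the connection-URL property value are assumptions to check against the DataNucleus docs for your version:

    // Sketch: basic JPA persistence to HBase through DataNucleus.
    // The entity is plain JPA; only the persistence-unit wiring changes.
    // In persistence.xml (property name per the DataNucleus docs; verify
    // against the version you use):
    //   <property name="datanucleus.ConnectionURL" value="hbase:localhost:2181"/>
    import javax.persistence.Entity;
    import javax.persistence.EntityManager;
    import javax.persistence.EntityManagerFactory;
    import javax.persistence.Id;
    import javax.persistence.Persistence;

    @Entity
    public class Account {
        @Id
        private long id;
        private String owner;

        public Account() {}
        public Account(long id, String owner) { this.id = id; this.owner = owner; }
    }

    class HBasePersistenceDemo {
        public static void main(String[] args) {
            // "hbasePU" is a hypothetical persistence-unit name.
            EntityManagerFactory emf = Persistence.createEntityManagerFactory("hbasePU");
            EntityManager em = emf.createEntityManager();
            em.getTransaction().begin();
            em.persist(new Account(1L, "brian"));
            em.getTransaction().commit();

            // Basic lookups and queries work; the full breadth of JPQL,
            // designed for an RDBMS, does not.
            Account a = em.find(Account.class, 1L);
            em.close();
            emf.close();
        }
    }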

answered Jan 9 '11 at 8:01

Thanks for that! DataNucleus looks cool! I'll investigate more! - Brian

Brian, in this case you can use HBase, Hive, or just raw map-reduce jobs.

1. HBase is a column-oriented database, best suited to column-based computations, for example average employee salary (assuming salary is a column). With its powerful scalability, nodes can be added on the fly.

2. Hive is like a traditional database in that it supports SQL-like queries; internally, the queries are converted into map-reduce jobs. Use this for row-based computations.

3. The final option is to write our own map-reduce functionality. Using Sqoop, we can migrate data from a relational database into HDFS (the Hadoop Distributed File System), then write map-reduce jobs that deal directly with the underlying flat files (see the sketch after this list).

Those are some of the possible options. Let me know if you need more detail about any of them.
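To make option 3 concrete, here is a rough sketch of such a job, not a drop-in solution: it assumes Sqoop's default comma-separated output, a hypothetical id,groupKey,value row layout, and that records only ever need comparing against other records sharing the same grouping key:

    // Sketch: raw map-reduce over the flat files Sqoop writes to HDFS
    // (one CSV line per MySQL row, by default). The map step keys each
    // row by a grouping field; the reduce step compares the rows that
    // share a key, instead of comparing everything inside the webapp.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompareRecordsJob {

        public static class KeyByGroupMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                // Assumed layout of each Sqoop-exported line: id,groupKey,value
                String[] fields = line.toString().split(",");
                ctx.write(new Text(fields[1]), line);
            }
        }

        public static class PairwiseCompareReducer
                extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text groupKey, Iterable<Text> rows, Context ctx)
                    throws IOException, InterruptedException {
                // Buffer the group, then compare each pair. This only scales
                // while groups stay small; huge groups need a smarter keying.
                List<String> buffered = new ArrayList<String>();
                for (Text row : rows) {
                    buffered.add(row.toString());
                }
                for (int i = 0; i < buffered.size(); i++) {
                    for (int j = i + 1; j < buffered.size(); j++) {
                        if (matches(buffered.get(i), buffered.get(j))) {
                            ctx.write(groupKey,
                                    new Text(buffered.get(i) + " ~ " + buffered.get(j)));
                        }
                    }
                }
            }

            private boolean matches(String a, String b) {
                return a.equals(b); // placeholder for the real comparison logic
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "compare-records");
            job.setJarByClass(CompareRecordsJob.class);
            job.setMapperClass(KeyByGroupMapper.class);
            job.setReducerClass(PairwiseCompareReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // Sqoop output dir
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }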

answered Jan 11 '11 at 20:01

Hi Krishna, thanks for the response. My calculations are row-based, so it looks like option 2 (Hive) or option 3 (Sqoop to HDFS) are the better ways to go. I understand that Sqoop can get the data from the DB into HDFS, but can you tell me whether Sqoop would also be the right way to get the data from a DB into Hive? Finally, what are the pros and cons of Hive vs. raw map-reduce over HDFS? Thanks so much, Brian - Brian

Seems like I found the answer to the first part: archive.cloudera.com/cdh/3/sqoop/…. So perhaps you could just give me your experience/opinion on the advantages of Hive vs. raw map-reduce over HDFS? - Brian

Both HBase and Hive boil down to map-reduce jobs. The only advantage of the raw map-reduce approach is that we can customize the breakdown to our requirements, though even with HBase and Hive some customization is possible. It purely depends on your requirement. If you could explain your problem in more detail, I will try to help. - Krishna
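For a feel of what the Hive route can look like from a Java webapp, a hedged sketch using the HiveServer JDBC driver of that era; the driver class, connection URL, and the employees table are assumptions to verify against your Hive install:

    // Sketch: issuing a row-oriented computation to Hive over JDBC from
    // the webapp instead of hand-writing a map-reduce job. Hive compiles
    // the query into map-reduce jobs behind the scenes.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryDemo {
        public static void main(String[] args) throws Exception {
            // Driver class and URL follow the original HiveServer
            // conventions; adjust host/port/database for your setup.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();
            // A row-based aggregation Hive handles with no mapper/reducer code.
            ResultSet rs = stmt.executeQuery(
                    "SELECT dept, AVG(salary) FROM employees GROUP BY dept");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
            con.close();
        }
    }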
