How should I design an HBase schema for a sensor network?

I'm new to the big data world. As a course project, I'm working on sensor networks and want to store sensor data in HBase. The data currently live in a MySQL database, but they are growing so fast that queries are getting very slow, so I'm trying to load them into HBase. Here is the MySQL table schema: SensorLog(sensorID, userID, time, date). This table stores the sensor firing logs. Each user (45 users in total) has 25 motion sensors in his apartment. Every time a user moves in his apartment, a sensor fires, and the event is logged in this table. The main query is: which sensors fired for a specific user in a specific time interval on a specific day?
I came up with three HBase schemas and would like your opinion on them. In these schemas, I represent time as seconds in a day, i.e. an integer in the range 0-86400.

Schema1: Rowkey: Date; Column-Family: Time { cq:(t0-t86400); cv:(userID,sensorID)}

Schema2: Rowkey: (Date,userID); Column-Family: Time { cq:(t0-t86400); cv:(sensorID)}

Schema3: Rowkey: (Date,userID); Column-Family: Time { cq:(s1-s25); cv:(time)}
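
To make the comparison more concrete, here is a minimal sketch of how one SensorLog row could map to a cell under Schema2. The `|` separator, the zero-padding widths, and the helper names `seconds_of_day`, `schema2_cell` and `qualifier_range` are assumptions for illustration, not part of any HBase API. Zero-padding is used so that string order of the qualifiers matches numeric order of the seconds.

```python
from datetime import date, time


def seconds_of_day(t: time) -> int:
    """Convert a wall-clock time to seconds since midnight."""
    return t.hour * 3600 + t.minute * 60 + t.second


def schema2_cell(day: date, user_id: int, t: time, sensor_id: int):
    """Map one SensorLog row to (row_key, qualifier, value) under Schema2.

    The row key is (Date, userID); the qualifier is the second of the day;
    the value is the sensor that fired. Separator and padding are assumed.
    """
    row_key = f"{day.isoformat()}|{user_id:03d}"
    qualifier = f"t{seconds_of_day(t):05d}"
    return row_key, qualifier, str(sensor_id)


def qualifier_range(start: time, stop: time):
    """Qualifier bounds for a time interval within a single (Date, userID) row."""
    return f"t{seconds_of_day(start):05d}", f"t{seconds_of_day(stop):05d}"
```

Under this layout, the main query (one user, one day, one time interval) touches a single row and a contiguous range of qualifiers, for example `qualifier_range(time(8, 0), time(8, 3))` on the row `2013-09-23|007`.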

Would you please let me know which schema is better and more efficient? I appreciate any help in advance.

asked Sep 23 '13 at 02:09

What kind of queries are you doing? Gets? MapReduce? -

How are you querying? Do you know which specific event you want? -

We are basically looking for motion patterns. Currently the MySQL queries find the sensors that fired for a specific user in a specific time interval and day. I'm actually not sure which querying strategy is more efficient in HBase (MapReduce or gets). Any recommendation? -

1 Answer

45 people and 25 sensors hardly seems like something you'd want to store in HBase.

If you're keen on using HBase anyway, then the key design should be driven by your read and write patterns. For instance, assuming each user only gets a few measurements a second and the number of users is what drives the load, a composite row key of user ID, timestamp and sensor ID seems to make sense, with the reading as the value.
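
As a rough illustration of such a composite key, the sketch below packs the three fields into fixed-width big-endian bytes. HBase sorts rows lexicographically by key bytes, so big-endian unsigned packing keeps one user's events contiguous and ordered by time. The field widths and the helper names `make_row_key`, `parse_row_key` and `scan_bounds` are assumptions for illustration, not an HBase API.

```python
import struct

# Assumed layout: 4-byte user ID, 8-byte epoch seconds, 2-byte sensor ID.
# Big-endian unsigned fields make byte order agree with numeric order.
KEY_FORMAT = ">IQH"


def make_row_key(user_id: int, epoch_seconds: int, sensor_id: int) -> bytes:
    """Build the composite row key (userId, timestamp, sensorId)."""
    return struct.pack(KEY_FORMAT, user_id, epoch_seconds, sensor_id)


def parse_row_key(key: bytes):
    """Split a row key back into (user_id, epoch_seconds, sensor_id)."""
    return struct.unpack(KEY_FORMAT, key)


def scan_bounds(user_id: int, start_seconds: int, stop_seconds: int):
    """Return (start_row, stop_row) for one user's events in
    [start_seconds, stop_seconds), matching the usual scan convention of an
    inclusive start row and an exclusive stop row."""
    start_row = struct.pack(">IQ", user_id, start_seconds)
    stop_row = struct.pack(">IQ", user_id, stop_seconds)
    return start_row, stop_row
```

With keys built this way, "which sensors fired for user 7 between two instants" becomes a single contiguous range scan between the two bounds, and the sensor ID can be read back from each returned key with `parse_row_key`.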

Lastly, you may want to look at OpenTSDB, which is open source, builds on HBase, and was built to store time series measurements at scale. You can see its schema here.

answered Sep 23 '13 at 20:09

Thanks for your comments. The system records an event every 5 seconds on average for each user, and we've been capturing data since 2005! So you can imagine the data gets really huge. However, queries are always user-specific; I mean all of the queries have the userID as a condition. - Eli

How much data does an event generate? - Arnon Rotem-Gal-Oz

The key I mentioned would retrieve a user's data efficiently, but if you're looking for motion patterns you may want to look at storing the data in a graph database like Titan (github.com/thinkaurelius/titan/wiki) or Neo4j (neo4j.org), or store the data in Hadoop (not HBase) and use Apache Giraph (giraph.apache.org). - Arnon Rotem-Gal-Oz

Here is an example. Suppose there are 4 sensors (A, B, C, D) in the living room capturing the motions in the four angles of the room. Today from 8:00 am to 8:03 am, the motion pattern in terms of sensor sequence is: {(A,8:00),(A,8:01),(B,8:02),(C,8:03)}. Now the question is: which day in the past has the most similar motion pattern compared with today, at roughly the same time interval? The first thing is to have a distance measure, and we came up with one. The second question is how to efficiently scan the history of sensor time series to get the answer. Thanks for your help :) - Eli
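
The search described in this comment can be sketched as a scan over past days that keeps the one with the smallest distance to today's sequence. The distance measure mentioned above is not given in the thread, so plain edit distance over sensor IDs is used here purely as a stand-in; `edit_distance` and `most_similar_day` are hypothetical helper names, and the history is assumed to be already loaded as one sensor sequence per day for the same time window.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sensor sequences (stand-in measure)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # drop x from a
                            curr[j - 1] + 1,      # insert y into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def most_similar_day(today, history):
    """Return the day label whose sequence is closest to today's.

    history maps a day label to that day's sensor sequence for the
    same time window as today's sequence.
    """
    return min(history, key=lambda day: edit_distance(today, history[day]))
```

For the sequence in the comment, `most_similar_day(["A", "A", "B", "C"], history)` compares today's firings against every stored day; swapping in the actual distance measure only requires replacing `edit_distance`.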

I don't get how a graph can be helpful here. Can you please explain it more? I appreciate your help. - Eli
