
Apache Hive vs. Apache HBase

It's the battle of big data tech. Check out the pros and cons of Apache Hive and Apache HBase, and learn which questions to ask yourself before making a choice.


People are always asking me at meetups whether they should use Apache Hive, Apache HBase, Apache SparkSQL, or some buzzword data engine.

My answer is yes: use them all for the appropriate use case and data.

Ask yourself some questions first:

My next question is: How are you ingesting it? In most cases, Apache NiFi makes sense whether the destination is Apache Hive or Apache HBase. Sometimes, Apache Sqoop makes sense as well. What is the source format? Do you need to store it in the original format? Is it already JSON or CSV?

Apache HBase has some very interesting updates coming in version 2.0 that make it great for a lot of use cases.

Apache Hive is great for its full SQL, in-memory caching, sorting, joins, ACID support, and integration with BI tools, Druid, and Spark SQL.

With Apache Phoenix, HBase has a good set of SQL to start with — but it's nowhere near as mature or rich as Apache Hive's SQL.

Apache HBase pros:

Apache Hive pros:

So, who wins? There was a time I tried to use Apache Phoenix for everything, since its JDBC driver is really solid, makes it easy to load lots of data quickly, and delivers fast queries. It's also great for use cases where I used to reach for something like MongoDB, such as varying JSON data.
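To make the "put lots of data in quickly" point a little more concrete, here is a minimal sketch of how a flat record could be turned into a parameterized Phoenix-style UPSERT statement. The helper `build_upsert`, the `EVENTS` table, and its columns are hypothetical names made up for illustration; the sketch assumes those columns already exist in the Phoenix table and only builds the SQL text and parameter list, without connecting to a cluster.

```python
import re

# Hypothetical helper (not part of Phoenix): builds the SQL text only.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_upsert(table, record):
    """Return (sql, params) for a Phoenix-style UPSERT of one flat record."""
    if not record:
        raise ValueError("record must contain at least one column")
    columns = sorted(record)  # stable column order for repeatable statements
    for name in [table, *columns]:
        # Identifiers cannot be bound as parameters, so validate them instead.
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    placeholders = ", ".join("?" for _ in columns)
    sql = f"UPSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    params = [record[c] for c in columns]
    return sql, params


# Assumed example record; values travel separately as bind parameters.
sql, params = build_upsert("EVENTS", {"ID": 1, "SOURCE": "sensor-7"})
print(sql)
print(params)
```

The statement and parameter list would then be handed to whatever JDBC bridge or Phoenix client you already use; that part is deliberately left out here.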

Apache Hive has the Apache Spark SQL integration and rich SQL that make it great for tabular data, and its Apache ORC format is amazing.
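As a rough illustration of what an ORC-backed Hive table looks like, the sketch below generates a CREATE TABLE statement with `STORED AS ORC` and, optionally, the `transactional` table property used for ACID tables. The function `hive_orc_ddl`, the `web_logs` table, and its columns are assumptions for the example. Depending on your Hive version, ACID tables may also need to be bucketed, so treat the output as a starting point rather than a finished definition.

```python
def hive_orc_ddl(table, columns, transactional=True):
    """Build a CREATE TABLE statement for an ORC-backed Hive table.

    columns is a list of (name, hive_type) pairs, e.g. ("url", "STRING").
    """
    if not columns:
        raise ValueError("at least one column is required")
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    ddl = f"CREATE TABLE {table} (\n  {cols}\n)\nSTORED AS ORC"
    if transactional:
        # Marks the table as ACID; older Hive versions may also require bucketing.
        ddl += "\nTBLPROPERTIES ('transactional'='true')"
    return ddl


# Hypothetical table and columns, for illustration only.
ddl = hive_orc_ddl(
    "web_logs",
    [("ts", "TIMESTAMP"), ("url", "STRING"), ("status", "INT")],
)
print(ddl)
```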

In most use cases, Apache Hive wins. For NoSQL, sparse data, and really high-end requirements, Apache HBase wins. The good news is that they work well together on the same Hadoop cluster and share your massive HDFS store. I rarely see places that don't use both. Use them both: if one doesn't work, use the other. The two together have solved every query and storage requirement that I have had for 100 different use cases in dozens of different enterprises.
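The "sparse data" case is easier to see with a toy model. The sketch below is plain in-memory Python, not HBase client code: the `SparseTable` class and the device readings are invented for illustration. It mimics the wide-column idea that only cells that actually exist are stored, and compares that with the number of slots a fixed-schema table would have to account for across the same rows and columns.

```python
from collections import defaultdict


class SparseTable:
    """Toy stand-in for a wide-column store: only existing cells are kept."""

    def __init__(self):
        # row key -> {"family:qualifier": value}
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self.rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # .get() avoids creating an empty row as a side effect of a read
        return self.rows.get(row_key, {}).get(column, default)

    def cell_count(self):
        return sum(len(cells) for cells in self.rows.values())

    def distinct_columns(self):
        return {col for cells in self.rows.values() for col in cells}


# Hypothetical readings: each device reports a different set of metrics.
table = SparseTable()
table.put("device-1", "m:temp", 21.5)
table.put("device-1", "m:humidity", 40)
table.put("device-2", "m:pressure", 1013)

stored = table.cell_count()
# A fixed schema needs one slot per (row, column) pair, filled or not.
dense_slots = len(table.rows) * len(table.distinct_columns())
print(stored, dense_slots)
```

With thousands of possible columns and only a handful present per row, that gap grows quickly, which is the kind of workload where the wide-column model tends to pay off.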

