NoSQL - HBase vs Cassandra vs MongoDB

What is NoSQL?

NoSQL covers a family of data management technologies designed to handle the increasing volume, velocity, and variety of data. These systems store and retrieve data that is modeled by means other than the tabular relations used in relational databases. NoSQL is also read as “Not only SQL” to emphasize that such systems may also support SQL-like query languages.

Why do I need NoSQL?

Relational databases face the following challenges:

Not well suited to large volumes (petabytes) of data with a variety of data types (e.g. images, videos, text)
Hard to scale for large data volumes
Scale-up is limited by the memory and CPU capacity of a single machine
Scale-out is limited by cache-dependent read and write operations
Sharding (breaking the database into pieces stored on different nodes) causes operational problems (e.g. managing a shard failure)
Complex RDBMS model
Consistency guarantees limit scalability in an RDBMS
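
To make the sharding item above concrete, here is a minimal sketch of hash-based shard routing in Python. The names `shard_for` and `ShardedStore` are hypothetical and chosen for illustration; each shard is just an in-memory dict standing in for a separate database node. It shows why a single failed shard becomes an operational problem: every key routed to that shard is unreachable while the rest of the data stays available.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy sharded store: one dict per shard, plus a set of failed shards."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]
        self.down = set()  # indexes of shards that are currently unavailable

    def _shard(self, key: str) -> dict:
        idx = shard_for(key, len(self.shards))
        if idx in self.down:
            raise ConnectionError(f"shard {idx} is unavailable")
        return self.shards[idx]

    def put(self, key: str, value) -> None:
        self._shard(key)[key] = value

    def get(self, key: str):
        return self._shard(key).get(key)
```

With this routing scheme, changing the number of shards also changes where most keys live, which is one reason resharding a live database takes careful planning.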

Compared to relational databases, NoSQL databases are generally easier to scale out and can deliver better performance for the workloads they target. They address challenges that the relational model does not by providing the following:

A scale-out, shared-nothing architecture, capable of running on a large number of nodes
A non-locking concurrency control mechanism so that real-time reads do not conflict with writes
Scalable replication and distribution – thousands of machines with distributed data
An architecture providing higher performance per node than RDBMS
Schema-less data model
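
One common way to get the scale-out behaviour listed above is consistent hashing, where each node owns several positions ("tokens") on a hash ring and a key belongs to the next token clockwise. The following is a minimal sketch under that assumption, not the implementation of any particular database; `HashRing`, `vnodes`, and the node names are hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable hash so that placement is the same on every run."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 64):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (token, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node is placed at several points on the ring to spread load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        # The owner is the first token clockwise from the key's hash.
        tokens = [token for token, _ in self._ring]
        idx = bisect.bisect_right(tokens, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

When a node is added to such a ring, the only keys that change owner are the ones the new node takes over; all other keys stay where they were, which is what makes incremental scaling practical.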

CAP Theorem and NoSQL databases

The CAP theorem describes three properties that a distributed system may aim to provide:

Consistency (all nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it was successful or failed)
Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

According to the theorem, a distributed system cannot guarantee all three properties at the same time. Current NoSQL databases therefore choose different combinations of C, A, and P.

CA – Single-site cluster, so all nodes are always in contact. When a partition occurs, the system blocks.

CP – Some data may not be accessible, but the rest is still consistent/accurate.

AP – System is still available under partitioning, but some of the data returned may be inaccurate.
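
The difference between CP and AP can be sketched with a toy two-replica cluster. This is a deliberately simplified model (the class `TwoNodeCluster` and its `partitioned` flag are hypothetical): in CP mode the sketch refuses every request during a partition, whereas a real CP system would typically keep serving the side that still holds a majority.

```python
class TwoNodeCluster:
    """Toy cluster with replicas 'a' and 'b' and a simulated network partition."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = {"a": {}, "b": {}}
        self.partitioned = False

    def write(self, replica: str, key: str, value) -> None:
        if not self.partitioned:
            # Healthy network: the write reaches both replicas.
            for data in self.replicas.values():
                data[key] = value
            return
        if self.mode == "CP":
            # CP: give up availability rather than let replicas diverge.
            raise RuntimeError("unavailable: replicas cannot coordinate")
        # AP: accept the write locally; the other replica is now stale.
        self.replicas[replica][key] = value

    def read(self, replica: str, key: str):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: replicas cannot coordinate")
        return self.replicas[replica].get(key)
```

In AP mode, a write sent to replica `a` during a partition succeeds, but a read from replica `b` still returns the old value until the partition heals and the replicas are reconciled.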

The following graph shows where RDBMS and different NoSQL databases fit into the CAP theorem.

NoSQL is a BASE, not ACID, system

NoSQL systems generally follow the BASE model, which relaxes strict consistency in favor of availability. A BASE system has the following characteristics:

Basically Available indicates that the system does guarantee availability, in terms of the CAP theorem.
Soft State indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
Eventual Consistency indicates that the system will become consistent over time, given that the system does not receive input during that time.
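
Eventual consistency can be illustrated with a small sketch in which replicas accept writes independently and later reconcile. The conflict rule used here, last write wins by timestamp, is an assumption made for the example (real systems use a variety of strategies), and `LwwReplica` and `anti_entropy` are hypothetical names.

```python
class LwwReplica:
    """Replica that keeps, for each key, the value with the newest timestamp."""

    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp) -> None:
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def anti_entropy(replicas) -> None:
    """Background sync: push the newest version of every key to all replicas."""
    merged = {}
    for replica in replicas:
        for key, (timestamp, value) in replica.store.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    for replica in replicas:
        for key, (timestamp, value) in merged.items():
            replica.write(key, value, timestamp)
```

Between writes and the next sync the replicas may disagree (the "soft state"); once writes stop and `anti_entropy` has run, every replica returns the same value.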

NoSQL Classification

NoSQL Type	Document Store	Key-Value Store	Column Store	Graph Store
Data Model	Collection of key-value collections (documents)	Collection of key-value pairs	Column families	“Property Graph” – Nodes
Strength	Tolerant of incomplete data	Fast look-ups	Fast look-ups	Graph algorithms – shortest path, etc.
Weakness	Query performance; no standard query syntax	Stored data has no schema	Very low-level API	Not easy to cluster; may need to traverse the whole graph to get an answer
Example	MongoDB, CouchDB	Amazon SimpleDB, Redis	HBase, Cassandra	InfoGrid, InfiniteGraph
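
To give a feel for how the four data models in the table differ, here is a rough sketch that stores similar user data in each shape using plain Python structures. The sample data and all function names are made up for illustration; no real database API is involved.

```python
import json

# Key-value: the value is an opaque blob that only the application can interpret.
key_value_store = {"user:42": json.dumps({"name": "Ada", "city": "Toronto"})}

# Document: a collection of self-describing, possibly nested documents.
document_store = {
    "users": [
        {"_id": 42, "name": "Ada", "address": {"city": "Toronto"}},
        {"_id": 7, "name": "Grace"},  # missing fields are tolerated
    ]
}

# Column family: row key -> column family -> column -> value.
column_family_store = {
    "42": {"profile": {"name": "Ada", "city": "Toronto"}},
    "7": {"profile": {"name": "Grace"}},
}

# Property graph: nodes carry properties, edges carry a relationship type.
graph_nodes = {42: {"name": "Ada"}, 7: {"name": "Grace"}}
graph_edges = [(42, "FOLLOWS", 7)]


def city_from_key_value(user_id):
    return json.loads(key_value_store[f"user:{user_id}"]).get("city")


def city_from_document(user_id):
    for doc in document_store["users"]:
        if doc["_id"] == user_id:
            return doc.get("address", {}).get("city")
    return None


def city_from_column_family(user_id):
    return column_family_store[str(user_id)]["profile"].get("city")


def followed_names(user_id):
    # Relationship queries are where the graph model is most natural.
    return [graph_nodes[dst]["name"] for src, rel, dst in graph_edges
            if src == user_id and rel == "FOLLOWS"]
```

The key-value version has to deserialize the whole value to read one field, while the document and column-family versions can address the field directly and tolerate rows that lack it.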

Read/Write speed: column > document > key-value > graph

Query/Navigation speed: graph > key-value > column > document

HBase vs Cassandra vs MongoDB

NoSQL Database	HBase	Cassandra	MongoDB
Key characteristics	Distributed and scalable big data store; strong consistency; built on top of Hadoop HDFS; CP on CAP	High availability; incremental scalability; eventually consistent; trade-offs between consistency and latency; minimal administration; no SPOF (single point of failure), all nodes are the same in Cassandra; AP on CAP	Schemas can change as applications evolve (schema-free); full index support for high performance; replication and failover for high availability; auto-sharding for easy scalability; rich document-based queries for easy readability; master-slave model; CP on CAP
Good for	Optimized for reads; well suited for range-based scans; strict consistency; fast reads and writes with scalability	Simple setup, maintenance code; fast random reads/writes; flexible parsing/wide column requirements; cases where multiple secondary indexes are not needed	RDBMS replacement for web applications; semi-structured content management; real-time analytics and high-speed logging, caching and high scalability; Web 2.0, media, SaaS, gaming
Not good for	Classic transactional applications or even relational analytics; applications that need full table scans; data to be aggregated, rolled up, or analyzed across rows	Secondary indexes; relational data; transactional operations (rollback, commit); primary and financial records; stringent security and authorization requirements on data; dynamic queries/searching on column data; low latency	Highly transactional systems; applications with traditional database requirements such as foreign key constraints
Use Case	Facebook Messages	Twitter, travel portal	Craigslist, Foursquare
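
The "trade-offs between consistency and latency" entry for Cassandra can be made concrete with the usual quorum arithmetic for replicated stores: with N replicas, a write acknowledged by W replicas and a read that consults R replicas are guaranteed to overlap on at least one replica when R + W > N. The helper names below are hypothetical; the rule itself is the standard one behind consistency levels such as ONE, QUORUM, and ALL.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(n: int, w: int, r: int) -> bool:
    """True if every read set must overlap every write set (R + W > N)."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return r + w > n
```

For example, with three replicas, reading and writing at a quorum of two gives overlapping sets, while reading and writing a single replica each is faster but may return stale data.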

In general, Cassandra tends to perform better than the other two when the data volume is very large.

https://scalegrid.io Dharshan

Great comparison. I also put together a more strategic comparison between MongoDB and Cassandra – https://scalegrid.io/blog/cassandra-vs-mongodb/
Marty Jones

Why do I need NoSQL?

The Relational Databases have the following challenges:

• Not good for large volume (Petabytes) of data with variety of data types (eg. images, videos, text) – Who says so?

• Cannot scale for large data volume – Seriously?

• Cannot scale-up, limited by memory and CPU capabilities – And this doesn’t happen to NoSQL when they exceed their need for resources?

• Cannot scale-out, limited by cache dependent Read and Write operations – relational databases don’t need to read through cache, especially when they are in read-only mode.

• Sharding (break database into pieces and store in different nodes) causes operational problems (e.g. managing a shared failure) – some relational databases have supported this since the eighties, i.e. for more than two decades

• Complex RDBMS model – Huh? This makes no sense whatsoever.

• Consistency limits the scalability in RDBMS – Again, it depends how and for what you are using the relational database.

That’s a pile of erroneous information going on there.

Please consider investigating the facts a little more and revising your views.

Regards,

Martyn
- http://jennyxiaozhang.com Jenny Zhang
  
  Hi Martyn,
  
  Thank you so much for the comment. Nice summary and great picture.

Jenny (Xiao) Zhang

Technology Professional and Enthusiast
