Jenny (Xiao) Zhang

NoSQL – HBase vs Cassandra vs MongoDB


What is NoSQL?

NoSQL refers to a family of data management technologies designed to handle the increasing volume, velocity, and variety of data. These systems store and retrieve data that is modeled in ways other than the tabular relations used in relational databases. NoSQL is also read as “Not only SQL” to emphasize that such systems may also support SQL-like query languages.

Why do I need NoSQL?

Relational databases face the following challenges:

  • Not well suited to large volumes (petabytes) of data with a variety of data types (e.g. images, videos, text)
  • Cannot scale for large data volumes
  • Cannot scale up, limited by memory and CPU capabilities
  • Cannot scale out, limited by cache-dependent read and write operations
  • Sharding (breaking the database into pieces stored on different nodes) causes operational problems (e.g. managing a shard failure)
  • Complex RDBMS model
  • Consistency limits the scalability of an RDBMS

Compared to relational databases, NoSQL databases are more scalable and can deliver better performance at scale. They address the challenges that the relational model does not by providing the following:

  • A scale-out, shared-nothing architecture, capable of running on a large number of nodes
  • A non-locking concurrency control mechanism so that real-time reads do not conflict with writes
  • Scalable replication and distribution – thousands of machines with distributed data
  • An architecture providing higher performance per node than RDBMS
  • Schema-less data model
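
To make the scale-out, shared-nothing idea more concrete, here is a minimal sketch of hash-based data placement. The node names and the helper functions `node_for_key` and `place_keys` are hypothetical, and the simple modulo scheme is an assumption chosen for brevity; production systems typically use more elaborate schemes such as consistent hashing.

```python
import hashlib
from collections import defaultdict


def node_for_key(key: str, nodes: list[str]) -> str:
    """Pick the node that owns `key` by hashing it (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def place_keys(keys: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Group keys by owning node; each node stores only its own share."""
    placement = defaultdict(list)
    for key in keys:
        placement[node_for_key(key, nodes)].append(key)
    return dict(placement)


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
    keys = [f"user:{i}" for i in range(10)]
    for node, owned in sorted(place_keys(keys, nodes).items()):
        print(node, owned)
```

Because each key maps to exactly one node and nodes share no state, adding capacity means adding nodes rather than buying a bigger machine.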

CAP Theorem and NoSQL databases

The CAP theorem describes three requirements that a distributed system may try to meet:

  • Consistency (all nodes see the same data at the same time)
  • Availability (a guarantee that every request receives a response about whether it was successful or failed)
  • Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

In theory, a distributed system cannot fulfill all three requirements at the same time. Current NoSQL databases therefore choose different combinations of C, A, and P from the CAP theorem.

CA – Single-site cluster, so all nodes are always in contact. When a partition occurs, the system blocks.

CP – Some data may not be accessible, but the rest is still consistent/accurate.

AP – The system is still available under partitioning, but some of the data returned may be inaccurate.
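
The difference between CP and AP behavior can be illustrated with a toy two-replica store. This is only a sketch under simplifying assumptions: the class names `Replica` and `TwoNodeStore` are hypothetical, and the CP mode here refuses every request during a partition, whereas a real CP system would usually keep serving from the side that still holds a majority.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.value = None


class TwoNodeStore:
    """Toy two-replica store; `mode` is "CP" or "AP" (illustrative only)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica("a"), Replica("b")
        self.partitioned = False

    def write(self, replica, value):
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write rather than let the replicas diverge.
                raise RuntimeError("unavailable: cannot reach the other replica")
            replica.value = value  # AP: accept locally, replicas may diverge
            return
        self.a.value = self.b.value = value  # healthy: update both replicas

    def read(self, replica):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot confirm the latest value")
        return replica.value  # AP may return a stale value here


if __name__ == "__main__":
    ap = TwoNodeStore("AP")
    ap.write(ap.a, "v1")
    ap.partitioned = True
    ap.write(ap.a, "v2")
    print(ap.read(ap.a), ap.read(ap.b))  # replica b still holds the old value
```

In AP mode the store keeps answering but the two replicas disagree; in CP mode the same sequence raises an error instead of returning possibly inaccurate data.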

The following graph shows where RDBMS and different NoSQL databases fit into the CAP theorem.

[Figure: CAP theorem diagram]


NoSQL is a BASE, not an ACID, system

NoSQL is a BASE system that relaxes strict consistency. A BASE system has the following characteristics:

  • Basically Available: the system guarantees availability, in the sense of the CAP theorem.
  • Soft State: the state of the system may change over time, even without input, because of the eventual consistency model.
  • Eventual Consistency: the system will become consistent over time, provided it receives no new input during that time.
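
Eventual consistency can be sketched with replicas that accept writes independently and later reconcile. The example below assumes a last-write-wins rule based on a caller-supplied timestamp; the names `LwwReplica` and `anti_entropy` are hypothetical, and real databases use more careful conflict resolution.

```python
class LwwReplica:
    """Replica that keeps, per key, the value with the highest timestamp."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def put(self, key, value, timestamp):
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def get(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def anti_entropy(replicas):
    """Exchange entries so every replica ends with the newest version of each key."""
    for source in replicas:
        for target in replicas:
            if source is not target:
                for key, (timestamp, value) in list(source.data.items()):
                    target.put(key, value, timestamp)


if __name__ == "__main__":
    r1, r2, r3 = LwwReplica(), LwwReplica(), LwwReplica()
    r1.put("cart", "book", timestamp=1)
    r2.put("cart", "book+pen", timestamp=2)
    print([r.get("cart") for r in (r1, r2, r3)])  # replicas disagree
    anti_entropy([r1, r2, r3])
    print([r.get("cart") for r in (r1, r2, r3)])  # replicas agree
```

Between the two prints no new writes arrive, which is the condition under which the replicas converge on the same value.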

NoSQL Classification

| NoSQL Type | Document Data Store | Key-Value | Column | Graph |
| --- | --- | --- | --- | --- |
| Data Model | Collection of key-value collections | Collection of key-value pairs | Column families | “Property Graph” – Nodes |
| Strength | Tolerant of incomplete data | Fast look-ups | Fast look-ups | Graph algorithms – shortest path, etc. |
| Weakness | Query performance, no standard query syntax | Stored data has no schema | Very low-level API | Not easy to cluster, need to traverse whole graph to get answer |
| Example | MongoDB, CouchDB | Amazon SimpleDB, Redis | HBase, Cassandra | InfoGrid, Infinite Graph |
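
The four data models can be compared by writing the same small record in each shape. The following sketch uses plain Python structures only; the field names, the sample data, and the helper `followed_by` are made up for illustration and do not reflect any particular database's API.

```python
import json

# Document store: one nested, self-contained document per entity.
document = {
    "_id": "u1",
    "name": "Ada",
    "emails": ["ada@example.com"],
    "address": {"city": "London"},
}

# Key-value store: an opaque value that can only be fetched by its key.
key_value = {"user:u1": json.dumps({"name": "Ada", "city": "London"})}

# Column-family store: row key -> column family -> column -> value.
column_family = {
    "u1": {
        "profile": {"name": "Ada", "city": "London"},
        "contact": {"email": "ada@example.com"},
    }
}

# Property graph: nodes and edges both carry properties.
graph_nodes = {
    "u1": {"label": "User", "name": "Ada"},
    "u2": {"label": "User", "name": "Bob"},
}
graph_edges = [("u1", "FOLLOWS", "u2", {"since": 2014})]


def followed_by(user_id):
    """Return ids of users that `user_id` follows by walking outgoing edges."""
    return [dst for src, rel, dst, _props in graph_edges
            if src == user_id and rel == "FOLLOWS"]


if __name__ == "__main__":
    print(document["address"]["city"])
    print(json.loads(key_value["user:u1"])["city"])
    print(column_family["u1"]["profile"]["city"])
    print(followed_by("u1"))
```

Note how the key-value form must be decoded by the application before any field can be read, while the graph form makes relationship traversal the natural operation.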

Read/Write speed: column > document > key-value > graph

Query/Navigation speed: graph > key-value > column > document

HBase vs Cassandra vs MongoDB

Key characteristics

HBase:
  • Distributed and scalable big data store
  • Strong consistency
  • Built on top of Hadoop HDFS
  • CP on CAP

Cassandra:
  • High availability
  • Incremental scalability
  • Eventually consistent
  • Trade-offs between consistency and latency
  • Minimal administration
  • No SPOF (single point of failure) – all nodes are the same in Cassandra
  • AP on CAP

MongoDB:
  • Schemas can change as applications evolve (schema-free)
  • Full index support for high performance
  • Replication and failover for high availability
  • Auto-sharding for easy scalability
  • Rich document-based queries for easy readability
  • Master-slave model
  • CP on CAP

Good for

HBase:
  • Optimized for read
  • Well suited for range-based scans
  • Strict consistency
  • Fast read and write with scalability

Cassandra:
  • Simple setup and maintenance
  • Fast random read/write
  • Flexible parsing/wide-column requirements
  • No multiple secondary indexes needed

MongoDB:
  • RDBMS replacement for web applications
  • Semi-structured content management
  • Real-time analytics and high-speed logging, caching and high scalability
  • Web 2.0, media, SaaS, gaming

Not good for

HBase:
  • Classic transactional applications or even relational analytics
  • Applications that need full table scans
  • Data to be aggregated, rolled up, or analyzed across rows

Cassandra:
  • Secondary indexes
  • Relational data
  • Transactional operations (rollback, commit)
  • Primary and financial records
  • Stringent authorization requirements on data
  • Dynamic queries/searching on column data
  • Low latency

MongoDB:
  • Highly transactional systems
  • Applications with traditional database requirements such as foreign key constraints

Use cases

HBase: Facebook Messages
Cassandra: Twitter, travel portals
MongoDB: Craigslist, Foursquare
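
The "trade-offs between consistency and latency" listed for Cassandra can be illustrated with the quorum-overlap rule used by Dynamo-style stores: with N replicas, a read that waits for R replicas and a write that waits for W replicas are guaranteed to overlap when R + W > N. The sketch below only encodes that arithmetic; the function names are hypothetical and are not part of any Cassandra driver.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when every read set overlaps every write set (R + W > N)."""
    for acks in (write_acks, read_acks):
        if not 1 <= acks <= replication_factor:
            raise ValueError("acks must be between 1 and the replication factor")
    return read_acks + write_acks > replication_factor


if __name__ == "__main__":
    n = 3
    # Waiting for a single replica is fastest but reads may be stale.
    print(reads_see_latest_write(n, write_acks=1, read_acks=1))
    # Quorum reads and writes overlap, at the cost of waiting for more replicas.
    print(reads_see_latest_write(n, write_acks=quorum(n), read_acks=quorum(n)))
```

Waiting for fewer acknowledgements lowers latency but weakens the guarantee, which is the trade-off the table refers to.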

As a general rule, Cassandra tends to perform better than the other two when the data volume is very large, although actual results depend on the workload.

References:

Choose the right NoSQL Databases https://www.youtube.com/watch?v=gJFG04Sy6NY

NoSQL Databases Explained http://www.mongodb.com/nosql-explained

Why NoSQL? http://www.couchbase.com/why-nosql/nosql-database

  • https://scalegrid.io Dharshan

    Great comparison. I also put together a more strategic comparison between MongoDB and Cassandra – https://scalegrid.io/blog/cassandra-vs-mongodb/

  • Marty Jones

    Why do I need NoSQL?

    The Relational Databases have the following challenges:

    • Not good for large volume (Petabytes) of data with variety of data types (eg. images, videos, text) – Who says so?

    • Cannot scale for large data volume – Seriously?

    • Cannot scale-up, limited by memory and CPU capabilities – And this doesn’t happen to NoSQL when they exceed their need for resources?

    • Cannot scale-out, limited by cache dependent Read and Write operations – relational databases don’t need to read through cache, especially when they are in read-only mode.

    • Sharding (break database into pieces and store in different nodes) causes operational problems (e.g. managing a shared failure) – some relational databases have supported this since the eighties, i.e. for more than two decades

    • Complex RDBMS model – Huh? This makes no sense whatsoever.

    • Consistency limits the scalability in RDBMS – Again, it depends how and for what you are using the relational database.

    That’s a pile of erroneous information going on there.

    Please consider investigating the facts a little more and revising your views.

    Regards,

    Martyn

    • http://jennyxiaozhang.com Jenny Zhang

      Hi Martyn,

      Thank you so much for the comment. Nice summary and great picture. :)
