Why migrating to Cassandra from MySQL makes sense?

Consider this scenario

There is a real time application server monitoring the following attributes of several access points.
1.   Health - they are up or down
2.   load condition - they are under or over or appropriately loaded
3.   how many users are connected to access points
4.   location of access point
5.   their interference
----

the access points connect to this front end application and report their information.  the application then works with a database abstraction layer to put the data onto database node.

Now some new requirement came up - Customer wanted the database to be a NoSQL one like Cassandra. but why does he really want a NoSQL database?

Then we began to think - why NoSQL now and what are the disadvantages of using MySQL database?  or rather what we cannot achieve with MySQL that we want to and how we can achieve it using NoSQL

As we continued to analyse, we figured out following for MySQL
1.  the data model in MySQL is fixed and is based on the schema
2.  In MySQL data needs to be entered for all columns or NULL if no data should be entered - no option to
     leave a particular column blank
3.  MySQL nodes work in a Master / Slave node with Master and slave nodes syncing data among them
     periodically.
4. if nodes go down, data connectivity is gone and gets restored only when nodes are up again.

What we actually need / envision
1.  Ability to replicate data across multiple nodes so that even if 1 or 2 nodes go down, data is still available.
2.  Ability to have a flexible data model / schema to store different kinds of data.
3.  Ability for the database nodes to sync up with each other on database data and operations
     automatically
4.  Ability to easily move database nodes to cloud
5.  Ability to launch and configure database nodes as demand arise and shut them down as demand
      decreases.
6.  Ability for the application to automatically switch to other available nodes when the main coordinator
     node goes down.

Some questions we are thinking to ourselves
1.  Do we really need joins and secondary indexes - well depends on the data model we come up with
2.  How many tables are we looking at?
3.  what is the scale of data we are looking at and how many database nodes do we think we may need?

Given the above requirements and needs, we came up with following conclusions and options

1.  MySQL has no option to replicate across multiple nodes - at the max it supports master / slave
     configuration and if either of the nodes go down, other takes up.  but if both go down, then we are in
     disaster and application needs to wait until either of the nodes comes up.  Also there is no guarantee
     the data is synchronized between master and slave .
2.  MySQL data model / schema is fixed and cannot change
3.  MySQL supports secondary indexes, joins and mathematical functions like sum avg, etc

Then enter the options of Cassandra and MongoDB and several more to consider.

Well both Cassandra and MongoDB are No SQL meaning Non SQL databases.  this means they are not actually SQL databases and do not support the ACID properties which are specific to SQL databases. Instead NoSQL databases support BASE meaning Basically Available Softstate Eventually consistent.  Also NoSQL databases support only 2 properties of the famous CAP - Consistency Availability and Partition Tolerance theorem.

Again we were at crossroads thinking which one to go with and again we went back to our table and listed down what we need
1. Ability to design data model around queries
2. Support a SQL like query language
3. Support a C/C++ driver for integrating with the front end application
4. Highly available and fault tolerant
5. Tuneable Read after write consistency
6. All nodes should be consistent with data

We felt that Cassandra Database would fit our requirements more better than MongoDB or other NoSQL databases - though we did not do an extensive research on other NoSQL databases to justify this conclusion.

Now that we decided to go ahead with Cassandra as our future database, we began to work on the migration strategy to migrate MySQL database to Cassandra DB.  Some of the points our strategy was based on are:
1.   Think of database queries first
2.   the queries should be such that the database read performance is best and a particular read operation
      does not span more than 1 table.  This will ensure that Read operation is fast.
3.   In Cassandra, writes are cheap, so take advantage of it and write as many tables as possible in order to
      increase read performance
4.   Cassandra uses a protocol called "Gossip" as a means of inter node communication for data replication.
      between seed nodes and other data replicas
5.   Ensure that there are not too many nodes configured as seed nodes - else boot up time would increase

Given the above points, we re designed our Cassandra data model such a way that every read operation spans only 1 table and also the data replication is easier.  So even if a particular node goes down, retrieving data from other nodes ensures we get the right data.

So at this stage we have successfully migrated our entire MySQL database to multi node Cassandra database and also moved Cassandra nodes to AWS cloud.

Some of the Cassandra terms that would be useful are

Component / Property
Description
Anti entropy
the synchronization of replica data on nodes to ensure that data is fresh
Back pressure
Pausing or blocking the buffering of incoming requests after reaching the threshold until the internal processing of buffered requests catches up.
Bootstrap
The process by which new nodes join cluster transparently gathering the data from existing nodes
Node
Machine that stores data – a virtual machine or physical machine storing cassandra data.
Cluster
2 or more cassandra nodes that exchange data using Gossip protocol
Compaction
The process of consolidating SSTables, discarding tombstones, and regenerating the SSTable index
Clustering
The storage engine process that creates an index and keeps data in order based on the index
Co-ordinator node
The node that determines which nodes in the ring should get the request based on the cluster configured snitch.
Clustering column
In the table definition, a clustering column is a column that is part of the compound primary key definition, but not the first column, which is position reserved for the partition key. Columns are clustered in multiple rows within a single partition. The clustering order is determined by the position of columns in the compound primary key definition.
Data Center
Collection of related nodes
Row
Collection of one or more columns
Column family
Collection of one or more rows. Represent structure of the data.
Keys_cached
A property of column family that represents number of locations to keep cached per SSTable
rows_cached
A property of column family that represents whose entire contents will be cached in memory
preload_row_cache
A property of column family that specifies whether you want to pre-populate the row cachel
Keyspace
Outer most container for data in Cassandra. Contains one or more column families
Replication factor
It is the number of machines in the cluster that will receive copies of same data.
Relica Placement Strategy
Strategy to place replicas in the ring -  simple stragey (rack aware strategy), network topology strategy (data center aware)
Cluster
Collection of one or more data centers
Gossip
A peer-to-peer communication protocol for exchanging location and state information between nodes.
Commitlog
Crash recovery mechanism in Cassandra. Every write operation is written to the commit log
Cross data center forwarding
A technique for optimizing replication across data centers that sends data from one data center to a node in another data center, and that node forwards the data to other nodes in its data center.
Mem-table
It is memory resident data store. After writing to commitlog, data is written to the mem-table. There will be 1 mem-table for every row or multiple mem-tables for a single column-family.
SSTable
It is disk file.  Data from mem-table is flushed to SSTable when a particular threshold is reached.
Bloom filter
It is a quick, nondeterministic, algorithms for testing whether an element is a member of a set
Faceted search
Faceted search is the dynamic clustering of items or search results into categories that uses any value in any field to drill into search results, or even skip searching entirely.
Denormalization
Denormalization refers to the process of optimizing the read performance of a database by adding redundant data or by grouping data.
Snitch
A snitch determines which data centers and racks belong to. Snitches inform cassandra about network topology so that requests are routed efficiently. It allows cassandra to distribute replicas by grouping nodes into data centers and racks
a)      Dynamic snitching – monitors the performance of reads from various replicas and chooses the best replica based on this history
b)      Simple snitch – used for single data center deployment
c)       RackInferringSnitch – Determines the location of nodes by rack and data center corresponding to the IP address
d)      PropertyFileSnitch – Determines the location of nodes by rack and data center
e)      GossipingPropertyFileSnitch – Automatically updates all nodes when adding new nodes.  Recommended for production
f)       Ec2Snitch – Use with AWS in a single region
g)      EC2MultiRegionSnitch – Use with AWS in multi region deployment
h)      GoogleCloudSnitch – For Cassandra deployments in Google cloud across one or more regions
i)        CloudStackSnitch – for Cassandra deployments in cloudstack environment
Composite partition key
A partition key consisting of many columns
Compound primary key
A primary key consisting of the partition key, which determines on which node data is stored, and one or more additional columns that determine clustering
Consistency
The synchronization of data on replicas in a cluster. Consistency is categorized as weak or strong.
Consistency level
A setting that defines a successful write or read by the number of cluster replicas that acknowledge the write or respond to the read request, respectively
Read consistency level
The value that determines the desired consistency for reads. It is recommended to have read consistency level as “all” meaning the last updated value (based on timestamp) is returned
Write consistency level
The value that determines the
Read repair
A mechanism by which the most recent version of the data is pushed to any out of date replica during a read operation.
Seed node
Some nodes in a Cassandra cluster are designated as seed nodes in the cluster. These are the first nodes to be started in the cluster and facilitate bootstrapping for the new nodes. When new node comes up, it will talk to seed node to obtain info about other nodes.

Comments

Post a Comment

Popular posts from this blog

Amazon Scenario Questions

Using ssh keys to login to VMs

Accessing NATted VMs from Host machine using Port Forwarding