Why migrating to Cassandra from MySQL makes sense?
Consider this scenario
There is a real time application server monitoring the following attributes of several access points.
1. Health - they are up or down
2. load condition - they are under or over or appropriately loaded
3. how many users are connected to access points
4. location of access point
5. their interference
----
the access points connect to this front end application and report their information. the application then works with a database abstraction layer to put the data onto database node.
Now some new requirement came up - Customer wanted the database to be a NoSQL one like Cassandra. but why does he really want a NoSQL database?
Then we began to think - why NoSQL now and what are the disadvantages of using MySQL database? or rather what we cannot achieve with MySQL that we want to and how we can achieve it using NoSQL
As we continued to analyse, we figured out following for MySQL
1. the data model in MySQL is fixed and is based on the schema
2. In MySQL data needs to be entered for all columns or NULL if no data should be entered - no option to
leave a particular column blank
3. MySQL nodes work in a Master / Slave node with Master and slave nodes syncing data among them
periodically.
4. if nodes go down, data connectivity is gone and gets restored only when nodes are up again.
What we actually need / envision
1. Ability to replicate data across multiple nodes so that even if 1 or 2 nodes go down, data is still available.
2. Ability to have a flexible data model / schema to store different kinds of data.
3. Ability for the database nodes to sync up with each other on database data and operations
automatically
4. Ability to easily move database nodes to cloud
5. Ability to launch and configure database nodes as demand arise and shut them down as demand
decreases.
6. Ability for the application to automatically switch to other available nodes when the main coordinator
node goes down.
Some questions we are thinking to ourselves
1. Do we really need joins and secondary indexes - well depends on the data model we come up with
2. How many tables are we looking at?
3. what is the scale of data we are looking at and how many database nodes do we think we may need?
Given the above requirements and needs, we came up with following conclusions and options
1. MySQL has no option to replicate across multiple nodes - at the max it supports master / slave
configuration and if either of the nodes go down, other takes up. but if both go down, then we are in
disaster and application needs to wait until either of the nodes comes up. Also there is no guarantee
the data is synchronized between master and slave .
2. MySQL data model / schema is fixed and cannot change
3. MySQL supports secondary indexes, joins and mathematical functions like sum avg, etc
Then enter the options of Cassandra and MongoDB and several more to consider.
Well both Cassandra and MongoDB are No SQL meaning Non SQL databases. this means they are not actually SQL databases and do not support the ACID properties which are specific to SQL databases. Instead NoSQL databases support BASE meaning Basically Available Softstate Eventually consistent. Also NoSQL databases support only 2 properties of the famous CAP - Consistency Availability and Partition Tolerance theorem.
Again we were at crossroads thinking which one to go with and again we went back to our table and listed down what we need
1. Ability to design data model around queries
2. Support a SQL like query language
3. Support a C/C++ driver for integrating with the front end application
4. Highly available and fault tolerant
5. Tuneable Read after write consistency
6. All nodes should be consistent with data
We felt that Cassandra Database would fit our requirements more better than MongoDB or other NoSQL databases - though we did not do an extensive research on other NoSQL databases to justify this conclusion.
Now that we decided to go ahead with Cassandra as our future database, we began to work on the migration strategy to migrate MySQL database to Cassandra DB. Some of the points our strategy was based on are:
1. Think of database queries first
2. the queries should be such that the database read performance is best and a particular read operation
does not span more than 1 table. This will ensure that Read operation is fast.
3. In Cassandra, writes are cheap, so take advantage of it and write as many tables as possible in order to
increase read performance
4. Cassandra uses a protocol called "Gossip" as a means of inter node communication for data replication.
between seed nodes and other data replicas
5. Ensure that there are not too many nodes configured as seed nodes - else boot up time would increase
Given the above points, we re designed our Cassandra data model such a way that every read operation spans only 1 table and also the data replication is easier. So even if a particular node goes down, retrieving data from other nodes ensures we get the right data.
So at this stage we have successfully migrated our entire MySQL database to multi node Cassandra database and also moved Cassandra nodes to AWS cloud.
Some of the Cassandra terms that would be useful are
There is a real time application server monitoring the following attributes of several access points.
1. Health - they are up or down
2. load condition - they are under or over or appropriately loaded
3. how many users are connected to access points
4. location of access point
5. their interference
----
the access points connect to this front end application and report their information. the application then works with a database abstraction layer to put the data onto database node.
Now some new requirement came up - Customer wanted the database to be a NoSQL one like Cassandra. but why does he really want a NoSQL database?
Then we began to think - why NoSQL now and what are the disadvantages of using MySQL database? or rather what we cannot achieve with MySQL that we want to and how we can achieve it using NoSQL
As we continued to analyse, we figured out following for MySQL
1. the data model in MySQL is fixed and is based on the schema
2. In MySQL data needs to be entered for all columns or NULL if no data should be entered - no option to
leave a particular column blank
3. MySQL nodes work in a Master / Slave node with Master and slave nodes syncing data among them
periodically.
4. if nodes go down, data connectivity is gone and gets restored only when nodes are up again.
What we actually need / envision
1. Ability to replicate data across multiple nodes so that even if 1 or 2 nodes go down, data is still available.
2. Ability to have a flexible data model / schema to store different kinds of data.
3. Ability for the database nodes to sync up with each other on database data and operations
automatically
4. Ability to easily move database nodes to cloud
5. Ability to launch and configure database nodes as demand arise and shut them down as demand
decreases.
6. Ability for the application to automatically switch to other available nodes when the main coordinator
node goes down.
Some questions we are thinking to ourselves
1. Do we really need joins and secondary indexes - well depends on the data model we come up with
2. How many tables are we looking at?
3. what is the scale of data we are looking at and how many database nodes do we think we may need?
Given the above requirements and needs, we came up with following conclusions and options
1. MySQL has no option to replicate across multiple nodes - at the max it supports master / slave
configuration and if either of the nodes go down, other takes up. but if both go down, then we are in
disaster and application needs to wait until either of the nodes comes up. Also there is no guarantee
the data is synchronized between master and slave .
2. MySQL data model / schema is fixed and cannot change
3. MySQL supports secondary indexes, joins and mathematical functions like sum avg, etc
Then enter the options of Cassandra and MongoDB and several more to consider.
Well both Cassandra and MongoDB are No SQL meaning Non SQL databases. this means they are not actually SQL databases and do not support the ACID properties which are specific to SQL databases. Instead NoSQL databases support BASE meaning Basically Available Softstate Eventually consistent. Also NoSQL databases support only 2 properties of the famous CAP - Consistency Availability and Partition Tolerance theorem.
Again we were at crossroads thinking which one to go with and again we went back to our table and listed down what we need
1. Ability to design data model around queries
2. Support a SQL like query language
3. Support a C/C++ driver for integrating with the front end application
4. Highly available and fault tolerant
5. Tuneable Read after write consistency
6. All nodes should be consistent with data
We felt that Cassandra Database would fit our requirements more better than MongoDB or other NoSQL databases - though we did not do an extensive research on other NoSQL databases to justify this conclusion.
Now that we decided to go ahead with Cassandra as our future database, we began to work on the migration strategy to migrate MySQL database to Cassandra DB. Some of the points our strategy was based on are:
1. Think of database queries first
2. the queries should be such that the database read performance is best and a particular read operation
does not span more than 1 table. This will ensure that Read operation is fast.
3. In Cassandra, writes are cheap, so take advantage of it and write as many tables as possible in order to
increase read performance
4. Cassandra uses a protocol called "Gossip" as a means of inter node communication for data replication.
between seed nodes and other data replicas
5. Ensure that there are not too many nodes configured as seed nodes - else boot up time would increase
Given the above points, we re designed our Cassandra data model such a way that every read operation spans only 1 table and also the data replication is easier. So even if a particular node goes down, retrieving data from other nodes ensures we get the right data.
So at this stage we have successfully migrated our entire MySQL database to multi node Cassandra database and also moved Cassandra nodes to AWS cloud.
Some of the Cassandra terms that would be useful are
Component / Property
|
Description
|
Anti entropy
|
the
synchronization of replica data on nodes to ensure that data is fresh
|
Back pressure
|
Pausing or
blocking the buffering of incoming requests after reaching the threshold
until the internal processing of buffered requests catches up.
|
Bootstrap
|
The process by
which new nodes join cluster transparently gathering the data from existing
nodes
|
Node
|
Machine that
stores data – a virtual machine or physical machine storing cassandra data.
|
Cluster
|
2 or more
cassandra nodes that exchange data using Gossip protocol
|
Compaction
|
|
Clustering
|
The
storage engine process that creates an index and keeps data in order based on
the index
|
Co-ordinator node
|
The
node that determines which nodes in the ring should get the request based on
the cluster configured snitch.
|
Clustering column
|
In
the table definition, a clustering column is a column that is part of the
compound primary key definition, but not the first column, which is position
reserved for the partition
key. Columns are clustered in multiple rows within a single
partition. The clustering order is determined by the position of columns in
the compound primary key definition.
|
Data Center
|
Collection of
related nodes
|
Row
|
Collection of one
or more columns
|
Column family
|
Collection of one
or more rows. Represent structure of the data.
|
Keys_cached
|
A property of
column family that represents number of locations to keep cached per SSTable
|
rows_cached
|
A property of
column family that represents whose entire contents will be cached in memory
|
preload_row_cache
|
A property of
column family that specifies whether you want to pre-populate the row cachel
|
Keyspace
|
Outer most
container for data in Cassandra. Contains one or more column families
|
Replication factor
|
It is the number
of machines in the cluster that will receive copies of same data.
|
Relica Placement
Strategy
|
Strategy to place
replicas in the ring - simple stragey
(rack aware strategy), network topology strategy (data center aware)
|
Cluster
|
Collection of one
or more data centers
|
Gossip
|
A
peer-to-peer communication protocol for exchanging location and state
information between nodes.
|
Commitlog
|
Crash recovery
mechanism in Cassandra. Every write operation is written to the commit log
|
Cross data center
forwarding
|
A technique for optimizing replication across data centers that
sends data from one data center to a node in another data center, and that
node forwards the data to other nodes in its data center.
|
Mem-table
|
It is memory
resident data store. After writing to commitlog, data is written to the
mem-table. There will be 1 mem-table for every row or multiple mem-tables for
a single column-family.
|
SSTable
|
It is disk
file. Data from mem-table is flushed
to SSTable when a particular threshold is reached.
|
Bloom filter
|
It is a quick,
nondeterministic, algorithms for testing whether an element is a member of a
set
|
Faceted search
|
Faceted
search is the dynamic clustering of items or search results into categories
that uses any value in any field to drill into search results, or
even skip searching entirely.
|
Denormalization
|
Denormalization
refers to the process of optimizing the read performance of a database by
adding redundant data or by grouping data.
|
Snitch
|
A snitch
determines which data centers and racks belong to. Snitches inform cassandra
about network topology so that requests are routed efficiently. It allows
cassandra to distribute replicas by grouping nodes into data centers and
racks
a)
Dynamic snitching – monitors the
performance of reads from various replicas and chooses the best replica based
on this history
b)
Simple snitch – used for single data
center deployment
c)
RackInferringSnitch – Determines the
location of nodes by rack and data center corresponding to the IP address
d)
PropertyFileSnitch – Determines the
location of nodes by rack and data center
e)
GossipingPropertyFileSnitch –
Automatically updates all nodes when adding new nodes. Recommended for production
f)
Ec2Snitch – Use with AWS in a single
region
g)
EC2MultiRegionSnitch – Use with AWS in
multi region deployment
h)
GoogleCloudSnitch – For Cassandra
deployments in Google cloud across one or more regions
i)
CloudStackSnitch – for Cassandra deployments
in cloudstack environment
|
Composite
partition key
|
A partition key
consisting of many columns
|
Compound primary
key
|
A primary key
consisting of the partition key, which determines on which node data is
stored, and one or more additional columns that determine
clustering
|
Consistency
|
The synchronization of data on
replicas in a cluster. Consistency is categorized as weak or strong.
|
Consistency level
|
A setting that
defines a successful write or read by the number of cluster replicas that
acknowledge the write or respond to the read request, respectively
|
Read consistency
level
|
The value that
determines the desired consistency for reads. It is recommended to have read
consistency level as “all” meaning the last updated value (based on
timestamp) is returned
|
Write consistency
level
|
The value that
determines the
|
Read repair
|
A mechanism by
which the most recent version of the data is pushed to any out of date replica
during a read operation.
|
Seed node
|
Some nodes in a
Cassandra cluster are designated as seed nodes in the cluster. These are the
first nodes to be started in the cluster and facilitate bootstrapping for the
new nodes. When new node comes up, it will talk to seed node to obtain info
about other nodes.
|
Moving from MySQL to Cassandra is worth it but it's too difficult.
ReplyDeleteWe provide mobile app development services with cassandra consulting.
Delete