Apache Cassandra 1.1 - Part 1: basic architecture

Posted by Sergey Enin 19 August 2012 at 09:40

Apache Cassandra 1.1

Preface

I've been working with Cassandra for the last ~7 months, mostly writing Ruby 'drivers' for easier and more consistent access to Cassandra data. In my opinion, many software engineers lack a clear understanding of what the Cassandra data model provides and how it can be used.

To help close that gap, I decided to start a series of articles about Cassandra: how data is stored and processed within Cassandra, and how we can work with Cassandra from Ruby specifically. I hope it will help developers who are new to the world of BigData and NoSQL.

Key improvements

Cassandra Query Language (CQL) Enhancements

One of the main objectives of Cassandra 1.1 was to bring CQL up to parity with the legacy API and command line interface (CLI) that has shipped with Cassandra for several years. This release achieves that goal. CQL is now the primary interface into the DBMS.

Composite Primary Key Columns

The most significant enhancement of CQL is support for composite primary key columns and wide rows. The first component of a composite key acts as the partition key and distributes column family data among the nodes; the remaining components order columns within the resulting wide row. New querying capabilities are a beneficial side effect of wide-row support: for example, you can use an ORDER BY clause to sort the result set.
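
As a quick sketch of what that looks like from Ruby (the client object and its execute method here are just placeholders for whatever CQL-capable driver you use; only the CQL 3 statements themselves matter):

    # Hypothetical CQL-capable client; substitute your own driver here.
    # The first PRIMARY KEY component (user_id) is the partition key and
    # determines which nodes store the data; event_time becomes part of
    # the column names inside the resulting wide row.
    client.execute(<<-CQL)
      CREATE TABLE user_events (
        user_id    varchar,
        event_time timestamp,
        payload    varchar,
        PRIMARY KEY (user_id, event_time)
      );
    CQL

    # ORDER BY is allowed on the clustering component of the key.
    rows = client.execute(
      "SELECT * FROM user_events WHERE user_id = 'jsmith' ORDER BY event_time DESC"
    )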

Global Row and Key Caches

Memory caches for column families are now managed globally instead of at the individual column family level, simplifying configuration and tuning. Cassandra automatically distributes memory for various column families based on the overall workload and specific column family usage. Administrators can choose to include or exclude column families from being cached via the caching parameter that is used when creating or modifying column families.

Row-Level Isolation

Full row-level isolation is now in place so that writes to a row are isolated to the client performing the write and are not visible to any other user until they are complete. From a transactional ACID (atomic, consistent, isolated, durable) standpoint, this enhancement now gives Cassandra transactional AID support. Consistency in the ACID sense typically involves referential integrity with foreign keys among related tables, which Cassandra does not have. Cassandra offers tunable consistency not in the ACID sense, but in the CAP theorem sense where data is made consistent across all the nodes in a distributed database cluster. A user can pick and choose on a per operation basis how many nodes must receive a DML command or respond to a SELECT query.

Hadoop Integration

The following low-level features have been added to Cassandra’s support for Hadoop:

  • Secondary index support for the column family input format. Hadoop jobs can now make use of Cassandra secondary indexes.
  • Wide row support. Previously, wide rows that had, for example, millions of columns could not be accessed, but now they can be read and paged through in Hadoop.
  • The bulk output format provides a more efficient way to load data into Cassandra from a Hadoop job.

Basic architecture

A Cassandra instance is a collection of independent nodes that are configured together into a cluster. In a Cassandra cluster, all nodes are peers, meaning there is no master node or centralized management process. A node joins a Cassandra cluster based on certain aspects of its configuration. This section explains those aspects of the Cassandra cluster architecture.

Cassandra uses a protocol called gossip to discover location and state information about the other nodes participating in a Cassandra cluster. Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about.

In Cassandra, the gossip process runs every second and exchanges state messages with up to three other nodes in the cluster. The nodes exchange information about themselves and about the other nodes that they have gossiped about, so all nodes quickly learn about all other nodes in the cluster. A gossip message has a version associated with it, so that during a gossip exchange, older information is overwritten with the most current state for a particular node.
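
As a toy illustration (nothing like the real gossip implementation), the version rule boils down to keeping, per peer, whichever state entry carries the higher version:

    # Toy illustration of gossip state merging, NOT the real implementation.
    # Each node keeps a map of peer => { version:, status: }; on exchange,
    # the entry with the higher version wins.
    def merge_gossip(local, remote)
      merged = local.dup
      remote.each do |peer, info|
        if merged[peer].nil? || info[:version] > merged[peer][:version]
          merged[peer] = info
        end
      end
      merged
    end

    local  = { "10.0.0.1" => { version: 7, status: "UP" } }
    remote = { "10.0.0.1" => { version: 9, status: "DOWN" },
               "10.0.0.2" => { version: 3, status: "UP" } }
    merge_gossip(local, remote)
    # => the newer (version 9) view of 10.0.0.1 plus the previously unknown 10.0.0.2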

When a node first starts up, it looks at its configuration file to determine the name of the Cassandra cluster it belongs to and which node(s), called seeds, to contact to obtain information about the other nodes in the cluster. These cluster contact points are configured in the cassandra.yaml configuration file for a node.

Failure detection is a method for locally determining, from gossip state, if another node in the system is up or down. Failure detection information is also used by Cassandra to avoid routing client requests to unreachable nodes whenever possible.

Data partitioning

In Cassandra, the total data managed by the cluster is represented as a circular space or ring. The ring is divided up  into ranges equal to the number of nodes, with each node being responsible for one or more ranges of the overall data. Before a node can join the ring, it must be assigned a token. The token determines the node's position on the ring and the range of data it is responsible for.

Column family data is partitioned across the nodes based on the row key. To determine the node where the first replica of a row will live, the ring is walked clockwise until it locates the node with a token value greater than that of the row key. Each node is responsible for the region of the ring between itself (inclusive) and its predecessor (exclusive). With the nodes sorted in token order, the last node is considered the predecessor of the first node; hence the ring representation.
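
Here is a toy Ruby sketch of that walk, assuming RandomPartitioner-style tokens derived from the MD5 hash of the row key (the node names and token values are made up):

    require 'digest/md5'

    # Toy model of the token ring, not Cassandra internals.
    # RandomPartitioner derives a token from the MD5 hash of the row key.
    def token_for(row_key)
      Digest::MD5.hexdigest(row_key).to_i(16)
    end

    # nodes is a hash of node name => token assigned at bootstrap.
    def replica_for(row_key, nodes)
      t = token_for(row_key)
      sorted = nodes.sort_by { |_, token| token }
      # Walk "clockwise": first node whose token is >= the row's token...
      owner = sorted.find { |_, token| token >= t }
      # ...wrapping around to the lowest-token node if we fall off the end.
      (owner || sorted.first).first
    end

    nodes = { "node1" => 0, "node2" => 2**126, "node3" => 2**127 }
    replica_for("jsmith", nodes)   # => whichever node's range covers the key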

Unlike almost every other configuration choice in Cassandra, the partitioner may not be changed without reloading all of your data. It is important to choose and configure the correct partitioner before initializing your cluster. Cassandra offers a number of partitioners out-of-the-box, but the random partitioner is the best choice for most Cassandra deployments.

RandomPartitioner

The RandomPartitioner is the default partitioning strategy for a Cassandra cluster, and in almost all cases is the right choice.

Random partitioning uses consistent hashing to determine which node will store a particular row. Unlike naive modulus-by-node-count, consistent hashing ensures that when nodes are added to the cluster, the minimum possible set of data is affected.

About Ordered Partitioners

Ordered partitioners are not recommended: they tend to create hot spots and make it difficult to balance load evenly across the cluster.

Using an ordered partitioner allows range scans over rows, meaning you can scan rows as though you were moving a cursor through a traditional index. For example, if your application has user names as the row key, you can scan rows for users whose names fall between Jake and Joe. This type of query would not be possible with randomly partitioned row keys, since the keys are stored in the order of their MD5 hash (not sequentially).

Although having the ability to do range scans on rows sounds like a desirable feature of ordered partitioners, there are ways to achieve the same functionality using column family indexes. Most applications can be designed with a data model that supports ordered queries as slices over a set of columns rather than range scans over a set of rows.

Replication

When you create a keyspace in Cassandra, you must decide the replica placement strategy, that is, the number of replicas and how those replicas are distributed across nodes in the cluster. The replication strategy relies on the cluster-configured snitch to help it determine the physical location of nodes and their proximity to each other. The total number of replicas across the cluster is often referred to as the replication factor. All replicas are equally important; there is no primary or master replica in terms of how read and write requests are handled.

Replica placement strategy

The replica placement strategy determines how replicas for a keyspace are distributed across the cluster. The replica placement strategy is set when you create a keyspace.

SimpleStrategy

SimpleStrategy is the default replica placement strategy when creating a keyspace using the Cassandra CLI. Other interfaces, such as the CQL utility, require you to explicitly specify a strategy. SimpleStrategy places the first replica on a node determined by the partitioner. Additional replicas are placed on the next nodes clockwise in the ring without considering rack or data center location.

NetworkTopologyStrategy

NetworkTopologyStrategy is the preferred replication placement strategy when you have information about how nodes are grouped in your data center, or when you have (or plan to have) your cluster deployed across multiple data centers. This strategy allows you to specify how many replicas you want in each data center.
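
Assuming the CQL syntax used elsewhere in this article (the strategy_options form), creating such a keyspace might look roughly like this; client is again a hypothetical driver object, and the data center names are placeholders that must match what your snitch reports:

    # Hypothetical driver call; data center names (DC1, DC2) are placeholders
    # and must match the names your snitch assigns to the nodes.
    client.execute(<<-CQL)
      CREATE KEYSPACE multi_dc_ks
        WITH strategy_class = 'NetworkTopologyStrategy'
        AND strategy_options:DC1 = '2'
        AND strategy_options:DC2 = '1';
    CQL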

Snitches

A snitch maps IPs to racks and data centers. It is a configurable component of a Cassandra cluster that defines how the nodes are grouped together within the overall network topology. Cassandra uses this information to route inter-node requests as efficiently as possible within the confines of the replica placement strategy. The snitch does not affect requests between the client application and Cassandra, and it does not control which node a client connects to.

Accessing & managing data

All nodes in Cassandra are peers. A client read or write request can go to any node in the cluster. When a client connects to a node and issues a read or write request, that node serves as the coordinator for that particular client operation.

The job of the coordinator is to act as a proxy between the client application and the nodes (or replicas) that own the data being requested. The coordinator determines which nodes in the ring should get the request based on the cluster configured partitioner and replica placement strategy.

Writes

For writes, the coordinator sends the write to all replicas that own the row being written. As long as all replica nodes are up and available, they will get the write regardless of the consistency level specified by the client. The write consistency level determines how many replica nodes must respond with a success acknowledgement in order for the write to be considered successful. In multi data center deployments, Cassandra optimizes write performance by choosing one coordinator node in each remote data center to handle the requests to replicas within that data center.

Cassandra is optimized for writes. Cassandra writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable. A write is successful once it is written to the commit log and to memory, so there is very minimal disk I/O at the time of write. Writes are batched in memory and periodically written to disk to a persistent table structure called an SSTable (sorted string table). Memtables and SSTables are maintained per column family. Memtables are organized in sorted order by row key and flushed to SSTables sequentially (no random seeking as in relational databases). SSTables are immutable (they are not written to again after they have been flushed).
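
A drastically simplified Ruby sketch of that write path (commit log append, memtable insert, flush to an immutable SSTable once the memtable gets big enough); the real storage engine is, of course, far more involved:

    # Drastically simplified write path, for illustration only.
    class ToyStore
      MEMTABLE_LIMIT = 4            # flush threshold, tiny for the example

      def initialize
        @commit_log = []            # append-only durability log
        @memtable   = {}            # in-memory, sorted by row key on flush
        @sstables   = []            # immutable, sorted on-disk tables
      end

      def write(row_key, columns)
        @commit_log << [row_key, columns]           # 1. durability first
        (@memtable[row_key] ||= {}).merge!(columns) # 2. then memory
        flush! if @memtable.size >= MEMTABLE_LIMIT  # 3. periodic flush
        :ack                                        # success = log + memory
      end

      def flush!
        # The memtable is written out in row-key order; the result is never
        # modified again (it would only be merged away by compaction).
        @sstables << @memtable.sort.to_h.freeze
        @memtable = {}
      end
    end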

Write consistency

When you do a write in Cassandra, the consistency level specifies on how many replicas the write must succeed before returning an acknowledgement to the client application.

Different consistency levels are available, with ANY being the lowest consistency (but highest availability), and ALL  being the highest consistency (but lowest availability). QUORUM is a good middle-ground ensuring strong consistency,  yet still tolerating some level of failure.

QUORUM = floor(replication_factor / 2) + 1

For example, with a replication factor of 3, a quorum is 2 (can tolerate 1 replica down). With a replication factor of 6, a quorum is 4 (can tolerate 2 replicas down).
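
In Ruby the calculation is just integer division:

    # QUORUM = floor(replication_factor / 2) + 1
    def quorum(replication_factor)
      replication_factor / 2 + 1    # integer division already rounds down
    end

    quorum(3)  # => 2, so one replica may be down
    quorum(6)  # => 4, so two replicas may be down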

Reads

For reads, there are two types of read requests that a coordinator can send to a replica; a direct read request and a background read repair request. The number of replicas contacted by a direct read request is determined by the consistency level specified by the client. Background read repair requests are sent to any additional replicas that did not receive a direct request. Read repair requests ensure that the requested row is made consistent on all replicas.

When a read request for a row comes in to a node, the row must be combined from all SSTables on that node that  contain columns from the row in question, as well as from any unflushed memtables, to produce the requested data. To optimize this piecing-together process, Cassandra uses an in-memory structure called a bloom filter: each SSTable has a bloom filter associated with it that is used to check if any data for the requested row exists in the SSTable before doing any disk I/O. As a result, Cassandra is very performant on reads when compared to other storage systems, even for read-heavy workloads.
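
The bloom filter idea fits in a few lines of Ruby (a toy version, nothing like Cassandra's tuned implementation): each key is hashed to several bit positions, and if any of those bits is unset at read time the SSTable definitely does not contain the row, so the disk I/O is skipped:

    require 'digest/md5'

    # Toy bloom filter: false positives possible, false negatives impossible.
    class ToyBloomFilter
      def initialize(bits = 1024, hashes = 3)
        @bits, @hashes = Array.new(bits, false), hashes
      end

      def positions(key)
        (0...@hashes).map { |i| Digest::MD5.hexdigest("#{i}:#{key}").to_i(16) % @bits.size }
      end

      def add(key)
        positions(key).each { |p| @bits[p] = true }
      end

      # true  => the key *might* be in this SSTable (go read the index/disk)
      # false => the key is definitely absent (skip the disk I/O entirely)
      def maybe_contains?(key)
        positions(key).all? { |p| @bits[p] }
      end
    end

    filter = ToyBloomFilter.new
    filter.add("jsmith")
    filter.maybe_contains?("jsmith")   # => true
    filter.maybe_contains?("nobody")   # => almost certainly false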

Read consistency 

When you do a read in Cassandra, the consistency level specifies how many replicas must respond before a result is returned to the client application. Cassandra checks the specified number of replicas for the most recent data to satisfy the read request (based on the timestamp).

The quorum is calculated the same as in write consistency.

Cassandra has a number of built-in repair features to ensure that data remains consistent across replicas.

Deletes

  1. Deleted data is not immediately removed from disk. Data that is inserted into Cassandra is persisted to SSTables on disk. Once an SSTable is written, it is immutable (the file is not updated by further DML operations). This means that a deleted column is not removed immediately. Instead a marker called a tombstone is written to indicate the new column status.
  2. A deleted column can reappear if routine node repair is not run. Marking a deleted column with a tombstone  ensures that a replica that was down at the time of delete will eventually receive the delete when it comes back up again.
  3. The row key for a deleted row may still appear in range query results. When you delete a row in Cassandra, it marks all columns for that row key with a tombstone. Until those tombstones are cleared by compaction, you have an empty row key (a row that contains no columns). These deleted keys can show up in the results of get_range_slices() calls. If your client application performs range queries on rows, you may want to have it filter out row keys that return empty column lists (see the sketch after this list).
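
A minimal sketch of such a filter, assuming the driver hands back range-query results as a hash of row key => columns (the exact shape depends on your driver):

    # Hypothetical shape of a range-query result: row key => columns hash.
    # Rows whose columns were all deleted come back with an empty column list
    # until compaction removes the tombstoned key, so we drop them here.
    def live_rows(range_result)
      range_result.reject { |_row_key, columns| columns.nil? || columns.empty? }
    end

    live_rows("jsmith" => { "state" => "TX" }, "deleted_user" => {})
    # => { "jsmith" => { "state" => "TX" } }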

ACID

Cassandra does not offer fully ACID-compliant transactions, the standard for transactional behavior in relational database systems:

    • Atomic. Everything in a transaction succeeds or the entire transaction is rolled back.

In Cassandra, a write is atomic at the row-level, meaning inserting or updating columns for a given row key will be  treated as one write operation. Cassandra does not support transactions in the sense of bundling multiple row updates into one all-or-nothing operation. Nor does it roll back when a write succeeds on one replica, but fails on other replicas. It is possible in Cassandra to have a write operation report a failure to the client, but still actually persist the write to a replica.

    • Consistent. A transaction cannot leave the database in an inconsistent state.

There are no locking or transactional dependencies when concurrently updating multiple rows or column families. Cassandra supports tuning between availability and consistency, and always gives you partition tolerance. Cassandra can be tuned to give you strong consistency in the CAP sense where data is made consistent across all the nodes in a distributed database cluster. A user can pick and choose on a per operation basis how many nodes should respond to a SELECT query.

    • Isolated. Transactions cannot interfere with each other.

Full row-level isolation is now in place so that writes to a row are isolated to the client performing the write and are not visible to any other user until they are complete. From a transactional ACID (atomic, consistent, isolated, durable) standpoint, this enhancement now gives Cassandra transactional AID support. A write is isolated at the row-level in the storage engine.

    • Durable. Completed transactions persist in the event of crashes or server failure.

Writes in Cassandra are durable. All writes to a replica node are recorded both in memory and in a commit log before they are acknowledged as a success. If a crash or server failure occurs before the memory tables are flushed to disk, the commit log is replayed on restart to recover any lost writes.

As a non-relational database, Cassandra does not support joins or foreign keys, and consequently does not offer consistency in the ACID sense. For example, when moving money from account A to B the total in the accounts does not change. Cassandra supports atomicity and isolation at the row-level, but trades transactional isolation and atomicity for high availability and fast write performance. Cassandra writes are durable.

Cassandra Data Model

Keyspaces

In Cassandra, the keyspace is the container for your application data, similar to a schema in a relational database. Keyspaces are used to group column families together. Typically, a cluster has one keyspace per application. Replication is controlled on a per-keyspace basis, so data that has different replication requirements should reside in different keyspaces. Keyspaces are not designed to be used as a significant map layer within the data model, only as a way to control data replication for a set of column families.

ColumnFamilies

In Cassandra, you define column families. Column families can (and should) define metadata about the columns, but the actual columns that make up a row are determined by the client application. Each row can have a different set of columns. There are two types of column families:

    • Static column family -- The typical Cassandra column family design

    • Dynamic column family -- For use with a custom data type

Column families consist of these kinds of columns:

    • Standard -- Has one primary key.

    • Composite -- Has more than one primary key, recommended for managing wide rows  or when you want to create columns that you can query to return sorted results.

    • Expiring -- Has a time-to-live (TTL); expired columns are removed during compaction.

    • Counter -- Counts occurrences of an event.

    • Super -- Used to manage wide rows, inferior to using composite columns.

Although column families are very flexible, in practice a column family is not entirely schema-less.

Data compression can be configured on a per-column family basis. Compression maximizes the storage capacity of your Cassandra nodes by reducing the volume of data on disk.

Besides reducing data size, compression typically improves both read and write performance. Cassandra is able to quickly find the location of rows in the SSTable index, and only decompresses the relevant row chunks. This means compression improves read performance not just by allowing a larger data set to fit in memory, but it also benefits workloads where the hot data set does not fit into memory.

Cassandra now writes column families to disk using this directory and file naming format:

/var/lib/cassandra/data/ks1/cf1/ks1-cf1-hc-1-Data.db

Cassandra creates a subdirectory for each column family, which allows a developer or admin to symlink a column family to a chosen physical drive or data volume.

In the background, Cassandra periodically merges SSTables together into larger SSTables using a process called compaction. Compaction merges row fragments together, removes expired tombstones (deleted columns), and rebuilds primary and secondary indexes. Since the SSTables are sorted by row key, this merge is efficient (no random disk I/O). Once a newly merged SSTable is complete, the input SSTables are marked as obsolete and eventually deleted by the JVM garbage collection (GC) process.
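
Conceptually the merge is the same as merging any set of sorted maps; here is a toy Ruby sketch that ignores timestamps, gc_grace_seconds and index rebuilding:

    # Toy compaction: merge row fragments from several sorted SSTables and
    # drop columns that were tombstoned. Real compaction also honours column
    # timestamps and gc_grace_seconds, and rebuilds indexes.
    TOMBSTONE = :tombstone

    def compact(sstables)
      merged = Hash.new { |h, k| h[k] = {} }
      sstables.each do |sstable|                 # later tables win on conflict
        sstable.each { |row_key, cols| merged[row_key].merge!(cols) }
      end
      merged.each_value { |cols| cols.reject! { |_, v| v == TOMBSTONE } }
      merged.reject { |_, cols| cols.empty? }.sort.to_h   # keep row-key order
    end

    older = { "jsmith" => { "state" => "CA" } }
    newer = { "jsmith" => { "state" => TOMBSTONE, "password" => "ch@ngem3a" } }
    compact([older, newer])
    # => { "jsmith" => { "password" => "ch@ngem3a" } }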

Data types

You can define data types when you create your column family schemas (which is recommended), but Cassandra does not require it. Internally, Cassandra stores column names and values as hex byte arrays (BytesType).

In Cassandra, the data type for a column (or row key) value is called a validator.

The data type for a column name is called a comparator.

Within a row, columns are always stored in sorted order by their column name. The comparator specifies the data type for the column name, as well as the sort order in which columns are stored within a row. Unlike validators, the comparator may not be changed after the column family is defined, so this is an important consideration when defining a column family in Cassandra.

Indexes

Primary indexes

In Cassandra, the primary index for a column family is the index of its row keys. Each node maintains this index for the data it manages.

Rows are assigned to nodes by the cluster-configured partitioner and the keyspace-configured replica placement strategy. The primary index in Cassandra allows looking up of rows by their row key. Since each node knows what ranges of keys each node manages, requested rows can be efficiently located by scanning the row indexes only on the relevant replicas.

With randomly partitioned row keys (the default in Cassandra), row keys are partitioned by their MD5 hash and cannot be scanned in order like traditional b-tree indexes. Using an ordered partitioner does allow for range queries over rows, but is not recommended because of the difficulty in maintaining even data distribution across nodes.

Secondary indexes

Secondary indexes in Cassandra refer to indexes on column values (to distinguish them from the primary row key index for a column family). Cassandra supports secondary indexes of the type KEYS (similar to a hash index). Secondary indexes allow for efficient querying by specific values using equality predicates (where column x = value y). Also, queries on indexed values can apply additional filters to the result set for values of other columns. Cassandra's built-in secondary indexes are best for cases when many rows contain the indexed value. The more unique values that exist in a particular column, the more overhead you will have, on average, to query and maintain the index.
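
For example, an index on the users column family from the quick start below could be created and queried roughly like this (client is a hypothetical driver object; the CQL statements are the relevant part):

    # Hypothetical driver object; the CQL statements are what matters here.
    client.execute("CREATE INDEX users_state_idx ON users (state)")

    # Equality predicate on the indexed column; additional filters on other
    # columns can then be applied to the result set.
    texans = client.execute("SELECT * FROM users WHERE state = 'TX'")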

CQL3

I am planning to review CQL3 more deeply in the next article, so only basic information is given below.

Quick start

CQL 2 is the default query language and CQL 3 is the new, recommended version. After starting the Cassandra server, follow these steps to get started with CQLsh 3 quickly:

  1. Start CQLsh 3 on Linux or Mac:

     From the bin directory of the Cassandra installation, run the cqlsh script.

     cd <install_location>/bin
     ./cqlsh -3

  2. Create a keyspace to work with:

     cqlsh> CREATE KEYSPACE TagService
              WITH strategy_class = 'org.apache.cassandra.locator.SimpleStrategy'
              AND strategy_options:replication_factor='1';
     cqlsh> USE TagService;

  3. Create a column family (the counterpart to a table in the relational database world):

     cqlsh> CREATE TABLE users (
              user_name varchar,
              password varchar,
              state varchar,
              PRIMARY KEY (user_name)
            );

  4. Enter and read data from Cassandra:

     cqlsh> INSERT INTO users (user_name, password)
            VALUES ('jsmith', 'ch@ngem3a');
     cqlsh> SELECT * FROM users WHERE user_name='jsmith';

     The output is:

      user_name | password  | state
     -----------+-----------+-------
         jsmith | ch@ngem3a |  null

  5. Exit CQLsh:

    cqlsh> exit

 

Links

[1] DataStax: Apache Cassandra 1.1 Documentation - http://www.datastax.com/doc-source/pdf/cassandra11.pdf

[2] Cassandra storage engine: Log structured merge tree - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.44.2782

[3] Schema in Cassandra 1.1 - http://www.datastax.com/dev/blog/schema-in-cassandra-1-1

