HBase architecture

Components

HBase architecture is based on the following components:

  • Region Server. It serves one or more Regions — ranges of rows stored together. Each Region is served by exactly one Region Server. Region Servers are also called HRegionServers. A Region Server contains multiple components, some of which work on top of HDFS, using it as persistent data storage.

  • Master server. It is the main server responsible for managing an HBase cluster, similar to the NameNode in HDFS. A Master server manages the distribution of Regions between Region Servers, maintains the Region registry, and so on. It is also called HMaster. You can deploy several Master servers in a cluster: one active and one or more standby.

  • ZooKeeper. It is a service designed to manage configuration and synchronization of distributed services. It is used to coordinate actions between HBase components.
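
To make this picture concrete, the sketch below shows how a client application connects to an HBase cluster with the standard Java client. Note that the client is configured only with the ZooKeeper quorum address: the locations of the Master and Region Servers are discovered through ZooKeeper. The host names are placeholders for your environment.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseConnect {
    public static void main(String[] args) throws Exception {
        // The client only needs the ZooKeeper quorum address;
        // Master and Region Server locations are resolved via ZooKeeper
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");

        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            System.out.println("Connected: " + !connection.isClosed());
        }
    }
}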

A high-level view of the HBase architecture is shown below.

[Figure: HBase architecture]

Region Server

A Region Server maintains several Regions, running on top of HDFS. Each Region contains the rows whose row keys fall into the range assigned to that Region. Rows are sorted lexicographically by their keys. The maximum Region size is configurable (10 GB by default in recent HBase versions), and a Region Server can serve up to approximately 1000 Regions.

NOTE
Each Region is defined by its range of row keys. The first Region of a table covers the range that starts with an empty value, and the last Region covers the range that ends with an empty value.
[Figure: An example of row key distribution between Regions]

A Region Server is responsible for managing its set of Regions and executing read/write operations on them. Each Region Server consists of the following components:

  • HFiles — the main persistent data storage in HBase. These files are stored in HDFS in a special format. The data in each HFile is sorted by Row key. One <Region, Column family> pair corresponds to at least one HFile.

  • MemStore — a write buffer stored in memory. Since the data in HFiles is kept in sorted order, rewriting the files on every write operation would be very expensive. Instead, new data is written to a special memory area called MemStore, where it accumulates for some time. When the MemStore size reaches a certain limit, the data is flushed to a new HFile (a flush can also be requested manually; see the sketch after this list). The data is sorted in lexicographical order before being saved to the HFile.

    One MemStore corresponds to one Column family. That is why one Region can have multiple MemStores, since each Region can contain multiple Column families. The combination of the MemStore and all HFiles that contain data values of the same Column family is called a Store.

  • BlockCache — the read cache that stores frequently used data in memory and reduces the time of read operations. Data that has not been used for a long time is evicted from the BlockCache.

  • Write Ahead Log (WAL) — a special log file attached to every Region Server and stored in HDFS. It is used to prevent data loss in case of MemStore or other failures. All incoming operations are written to this log before they are actually applied. Thus, the WAL stores new data values that have not been saved to the permanent storage yet, which allows recovering data after a failure. WAL is also called HLog.
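
Flushes normally happen automatically when a MemStore reaches its configured threshold (the hbase.hregion.memstore.flush.size property, 128 MB by default), but a flush can also be requested explicitly through the Java Admin API. A minimal sketch, assuming an open Connection and a hypothetical table named my_table:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

public class FlushExample {
    // Asks the Region Servers to flush the MemStores of the given table
    // into new HFiles without waiting for the size threshold
    public static void flushTable(Connection connection) throws Exception {
        try (Admin admin = connection.getAdmin()) {
            admin.flush(TableName.valueOf("my_table")); // "my_table" is a placeholder
        }
    }
}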

[Figure: Region Server components]

Master server

A Master server is responsible for monitoring all Region Servers and maintaining the interface for all metadata changes in the HBase cluster. Its main functions are listed below:

  • Assigns Regions to Region Servers on startup and re-assigns them during data recovery and load balancing processes.

  • Coordinates Region Servers, similarly to how the NameNode manages DataNodes in HDFS.

  • Performs DDL operations, providing the interface for creating, deleting, and updating tables (see the sketch after this list).

  • Monitors all Region Server instances in the cluster and, with the help of ZooKeeper, performs data recovery after a crash (see Data recovery).
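
For example, table creation is a DDL operation that goes through the Master server. Below is a minimal sketch using the Java Admin API; it assumes an open Connection, and the table and Column family names are placeholders.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableExample {
    // Sends a DDL request to the Master server, which creates the table
    // and assigns its initial Region to one of the Region Servers
    public static void createTable(Connection connection) throws Exception {
        try (Admin admin = connection.getAdmin()) {
            admin.createTable(TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("my_table"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                    .build());
        }
    }
}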

ZooKeeper

ZooKeeper acts as a coordinator for the other components of an HBase cluster, performing the following functions:

  • Checks the availability of Region Servers by receiving regular heartbeats from them. If a Region Server fails to send a heartbeat, the Master server gets a notification and runs the automatic data recovery procedure.

  • Provides standby Master servers with information about the status of the active Master server. If the active Master server fails to send a regular heartbeat to ZooKeeper, a standby Master gets the corresponding notification and becomes active.

  • Stores the location of the catalog table hbase:meta (previously called .META.), which keeps a list of all Regions in the system. This table stores data about Regions as key/value pairs, where the key is the combination <table name, Region start row key, Region ID>, and the value is the full information about the Region, including the address of the Region Server that serves it (host name and port number).

Operations with data

Preprocessing

Whenever a client application makes a write or read request to HBase, it first needs the address of the Region Server that contains the Region with the necessary data. That is why any data operation in HBase begins with the following steps:

  1. The client retrieves the location of the catalog table hbase:meta from ZooKeeper and stores it in its own cache.

  2. From the hbase:meta table, the client requests the location of the Region Server that serves the data with the required row keys. The client also keeps this information in its cache.

  3. The client requests the necessary data from the detected Region Server.

For subsequent requests, the client uses its own cache to retrieve the location of the hbase:meta table and the addresses of the Region Servers corresponding to previously read row keys. The client does not refer to the hbase:meta table again unless there is a miss caused by a Region being split or moved. In that case, the client requests the information from the hbase:meta table again and updates its cache.
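
This lookup can be observed from client code through the RegionLocator interface of the Java client, which uses the same hbase:meta-backed cache. A sketch under the same assumptions (an open Connection; the table name and row key are placeholders):

import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class LocateRowExample {
    // Resolves the Region that contains the given row key and prints
    // the address of the Region Server that serves it
    public static void locateRow(Connection connection) throws Exception {
        try (RegionLocator locator = connection.getRegionLocator(TableName.valueOf("my_table"))) {
            HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("row100"));
            System.out.println("Region: " + location.getRegion().getRegionNameAsString());
            System.out.println("Served by: " + location.getServerName()); // host name and port
        }
    }
}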

Write

The typical steps of a write operation in HBase, performed after the right Region Server has been detected, are described below:

  1. HBase writes the data to the WAL (Write Ahead Log), which provides fault tolerance.

  2. The data is written to the MemStore — the in-memory write buffer of the Region Server. The target MemStore is chosen based on the Column family present in the request.

  3. Once the data is placed in the WAL and the MemStore, the client receives an acknowledgment (ACK) — a confirmation that the write has completed.

  4. Later, when the MemStore fills up, it flushes its contents to a new HFile in sorted order.
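
From the client's point of view, steps 1-3 happen inside a single put() call. A minimal sketch with the Java client, assuming an open Connection; the table, Column family, and key names are placeholders.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteExample {
    public static void writeRow(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("my_table"))) {
            Put put = new Put(Bytes.toBytes("row100"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
            // The default durability already writes to the WAL before the MemStore;
            // it is set explicitly here only for illustration
            put.setDurability(Durability.SYNC_WAL);
            table.put(put); // returns after the WAL and MemStore writes (the ACK in step 3)
        }
    }
}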

NOTE
Client applications do not interact directly with underlying HFiles stored in HDFS.
[Figure: Typical write sequence]

Read

The typical steps of a read operation in HBase, performed after the right Region Server has been detected, are described below:

  1. The required data is searched for in the BlockCache — the area where recently read key/value pairs are stored.

  2. If the required data is not found in the BlockCache, the search moves to the MemStore, which keeps the most recently written data that has not been flushed to HFiles yet.

  3. If the required data is not found in the MemStore, the search moves to the HFiles stored in HDFS.

    NOTE
    As each HFile contains a snapshot of the MemStore taken at the moment it was flushed, the data for a requested row can be spread across multiple HFiles. So, to compose the complete row, HBase reads all HFiles that can contain information for that row. Using Bloom filters can speed up this process.
  4. After the search is completed (no matter at which stage it returned the result), the required data is written to the BlockCache and returned to the client along with an acknowledgment (ACK) — the task completion confirmation.
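
The corresponding single-row read with the Java client is sketched below, under the same assumptions (an open Connection and placeholder names). The BlockCache, MemStore, and HFile lookups happen on the Region Server inside the get() call.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadExample {
    public static void readRow(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("my_table"))) {
            Get get = new Get(Bytes.toBytes("row100"));
            get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"));
            Result result = table.get(get); // BlockCache -> MemStore -> HFiles
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}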

[Figure: Typical read sequence]

Performance improvements

Region split

When the amount of data in a Region reaches a certain critical size, HBase launches a Region split — a special operation that divides the source Region into two child Regions. The split is then reported to the Master server.

Initially, both new Regions are handled by the same Region Server, until the Master re-allocates them to other Region Servers for load balancing.

TIP
To avoid frequent Region splits in your database, you can pre-split tables (set Region boundaries in advance) and increase the maximum Region size.
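
A sketch of both techniques with the Java Admin API: the table is pre-split at creation time, and its maximum Region size is raised. The split keys assume four-character numeric row keys; all names and values are placeholders.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void createPreSplitTable(Connection connection) throws Exception {
        // Three initial Regions: (-inf, "3000"), ["3000", "6000"), ["6000", +inf)
        byte[][] splitKeys = {Bytes.toBytes("3000"), Bytes.toBytes("6000")};
        try (Admin admin = connection.getAdmin()) {
            admin.createTable(TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("my_table"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                    .setMaxFileSize(20L * 1024 * 1024 * 1024) // raise the split threshold to 20 GB
                    .build(),
                splitKeys);
        }
    }
}
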
[Figure: Region split]

HFiles compaction

As the data of one Region can be stored in several HFiles, HBase periodically merges them to reduce the number of disk seeks needed for a read operation and to speed up reads. This operation is called compaction. There are two types of compactions:

  • Minor compaction — picks several smaller HFiles and rewrites them into fewer bigger HFiles. This helps optimize storage space. It starts automatically and runs in the background with a low priority compared to other HBase operations. The files to compact are selected based on a heuristic.

    [Figure: Minor compaction]
  • Major compaction — rewrites all HFiles of a Store into one bigger HFile. It also physically deletes expired data and data previously marked with a tombstone. This improves read performance. Major compaction starts manually or when certain conditions are met (for example, by a timer). It has a high priority and can significantly slow down the cluster, since disk I/O and network traffic can become congested, so this type of compaction should be scheduled for periods of low load. Both compaction types can also be requested via the Admin API; see the sketch after this list.

    [Figure: Major compaction]
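
A minimal sketch of requesting both compaction types through the Java Admin API, assuming an open Connection and a hypothetical table named my_table. Both calls are asynchronous: they ask the Region Servers to schedule the compaction rather than wait for it to finish.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

public class CompactionExample {
    public static void compactTable(Connection connection) throws Exception {
        try (Admin admin = connection.getAdmin()) {
            TableName table = TableName.valueOf("my_table"); // placeholder name
            admin.compact(table);      // request a minor compaction
            admin.majorCompact(table); // request a major compaction
        }
    }
}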

Data recovery

HBase supports automatic data recovery after different failures. The typical recovery process includes the following steps:

  1. If a Region Server fails, ZooKeeper notifies the Master server about it.

  2. The Master server distributes the Regions of the crashed Region Server among the active Region Servers. It also distributes the WAL of the crashed Region Server, stored in HDFS, so that they can recover the data of the lost MemStores.

  3. Each Region Server replays its part of the WAL to rebuild the MemStores of the Column families received from the failed Region Server. Since all data is written to the WAL in chronological order, replaying its records re-applies all changes that were stored in the lost MemStores.

  4. Regions continue to work on active Region Servers.
