HDFS architecture

HDFS (Hadoop Distributed File System) is a highly fault-tolerant distributed file system designed for deployment on low-cost hardware. It provides high-throughput access to application data and is best suited for applications with large datasets.

All files and directories in HDFS are represented on the NameNode by inodes that contain various attributes such as permissions, modification timestamp, disk space quota, namespace quota, and access time.

Components

HDFS consists of the following components:

  • NameNode acts as the master of the system. It maintains the HDFS namespace, that is, directories and files, and manages the blocks located on the DataNodes.

  • DataNodes are subordinates of the NameNode, deployed on most of the cluster machines to store application data. They serve read and write requests from HDFS clients.

  • Secondary NameNode periodically creates checkpoints of the HDFS namespace by applying the HDFS transaction log to the namespace image, and then truncates that log. It makes the checkpoints available to the NameNode so that the latter can update its namespace image as needed. In the event of a NameNode failure, you can restart the NameNode from the latest saved checkpoint and transaction log.

Architecture

The NameNode is the primary server that manages the HDFS namespace and controls client access to files. The DataNodes manage the storage attached to the nodes they run on.

HDFS components

HDFS exposes a file system namespace and enables user data to be stored in files. An HDFS file consists of one or more blocks stored on multiple DataNodes. The NameNode performs file system namespace operations, including opening, closing, and renaming files and directories. The NameNode also maintains the mapping of file blocks to DataNodes.

The DataNodes process read and write requests from HDFS clients. In addition, they perform block creation, deletion, and replication when the NameNode instructs them to do so.

HDFS supports a traditional hierarchical file organization. An application or a user can create directories and then store files in them. The HDFS namespace hierarchy is similar to most other file systems, meaning a user can create, remove, rename, or move files.
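
As an illustration, the following minimal Java sketch performs such namespace operations through the standard org.apache.hadoop.fs.FileSystem API. The paths are hypothetical, and the cluster address is taken from the client configuration files.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class NamespaceOps {
      public static void main(String[] args) throws Exception {
          // Cluster settings are read from core-site.xml/hdfs-site.xml on the classpath
          Configuration conf = new Configuration();
          try (FileSystem fs = FileSystem.get(conf)) {
              Path dir = new Path("/user/demo/reports");           // hypothetical directory
              fs.mkdirs(dir);                                      // create a directory
              fs.rename(dir, new Path("/user/demo/reports-old"));  // rename it
              fs.delete(new Path("/user/demo/reports-old"), true); // remove it recursively
          }
      }
  }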

The NameNode records any change to the file system namespace or its properties. An application can request the number of replicas of a file that HDFS should maintain; this number is called the replication factor of that file. The NameNode ensures that the required number of copies of the file’s blocks exists.
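
For example, an application can request a specific replication factor when creating a file. A minimal sketch, assuming the FileSystem handle fs from the previous example and a hypothetical path (FSDataOutputStream also comes from org.apache.hadoop.fs):

  // Ask HDFS to maintain 2 replicas of every block of this file
  Path file = new Path("/user/demo/scratch.dat");  // hypothetical file
  try (FSDataOutputStream out = fs.create(
          file,
          true,                            // overwrite if the file exists
          4096,                            // client-side buffer size, bytes
          (short) 2,                       // requested replication factor
          fs.getDefaultBlockSize(file))) { // keep the default block size
      out.writeBytes("sample data\n");
  }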

Operations with files

Unlike local file systems, HDFS does not allow you to modify a file in place. Files in HDFS are written once, and only one process can write to a file at a time. Since HDFS is designed for big data, it is optimized for large files (>10 GB), which consist of much larger blocks than in conventional file systems.

An HDFS block is a BLOB on the underlying local file system with a default size of 128 MB. The block size is configurable, and larger values, such as 256 MB, are often used for large files.
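
The same create() overload also accepts a per-file block size, as in the sketch below (again assuming the fs handle and a hypothetical path); the cluster-wide default is set by the dfs.blocksize property in hdfs-site.xml.

  long blockSize = 256L * 1024 * 1024;  // 256 MB
  try (FSDataOutputStream out = fs.create(
          new Path("/user/demo/big.dat"),  // hypothetical file
          true, 4096, (short) 3, blockSize)) {
      out.writeBytes("the file's data goes here\n");
  }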

IMPORTANT

Files compressed with certain codecs (for example, gzip or zlib) are non-splittable: although HDFS still stores them as blocks, they cannot be decompressed starting from an arbitrary block, so each such file is processed by a single Mapper. If you want a compressed file to be processed by several Mappers in parallel, use a splittable format, for example, bzip2.
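
A quick way to check how Hadoop classifies a codec is shown in this sketch; it infers the codec from the file extension (the path is hypothetical, and conf is a Configuration object as in the earlier examples).

  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.CompressionCodecFactory;
  import org.apache.hadoop.io.compress.SplittableCompressionCodec;

  CompressionCodecFactory factory = new CompressionCodecFactory(conf);
  // The codec is inferred from the file extension; getCodec() returns null for unknown extensions
  CompressionCodec codec = factory.getCodec(new Path("/data/logs/app.log.bz2"));
  boolean splittable = codec instanceof SplittableCompressionCodec;  // true for bzip2, false for gzip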

In Hadoop, you can perform the following operations on files:

  • write

  • read

  • delete

  • replicate

The entire process of creating (writing) an HDFS file is as follows:

  1. The client splits its data into chunks of the block size.

  2. The client connects to the NameNode and requests a write operation with a specific replication factor.

  3. In response, the NameNode returns a list of DataNodes that will store the replicas of the first block.

  4. The client goes to the first DataNode in the list and writes the first block. If the connection fails, the client takes the second DataNode from the list and so on.

  5. The first DataNode writes the block and passes the block replica to the second DataNode in the list. That node writes the block and sends its replica to the next node in the list. This continues until the entire list is covered.

  6. After the block is written, the DataNodes send acknowledgments back to the client in the reverse order of the pipeline.

  7. As soon as the client receives the first confirmation that the block was written successfully, it notifies the NameNode, receives a list of DataNodes for the second block, and so on.

The client proceeds to write the next block as soon as it manages to write the previous block to at least one node. Replication then continues in the background, eventually spreading the block replicas across DataNodes until the required replication factor is reached.
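
From the client’s point of view, this whole procedure is hidden behind a few calls. Below is a minimal write sketch with a hypothetical path:

  import java.nio.charset.StandardCharsets;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class WriteExample {
      public static void main(String[] args) throws Exception {
          try (FileSystem fs = FileSystem.get(new Configuration());
               // create() asks the NameNode for target DataNodes
               FSDataOutputStream out = fs.create(new Path("/user/demo/events.txt"))) {
              // write() streams the data through the DataNode pipeline;
              // close() (implicit here) completes the last block and waits for acknowledgments
              out.write("first portion of data\n".getBytes(StandardCharsets.UTF_8));
          }
      }
  }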

Replication

HDFS provides a reliable way to store large amounts of data in a distributed environment. It replicates file blocks to provide fault tolerance at the block level. Each block has multiple copies in HDFS.

Principles

HDFS splits a large file into several blocks and stores each block on 3 different DataNodes (the default replication factor is 3). For example, if you store a 128 MB file in HDFS, it occupies 384 MB (3 * 128 MB), because the file consists of exactly one block whose three replicas reside on different DataNodes.

Note that a DataNode cannot store more than one replica of the same block. Typically, two replicas are placed on the same rack and one replica on a different rack. It is advised to keep the replication factor at 3 or higher so that even if an entire rack fails, at least one copy remains safe.

NOTE
  • The topology (racking) is configured manually. For more information, see the Rack awareness article.

  • You can change the replication factor for any file in HDFS.

Splitting HDFS files into blocks has the following advantages:

  • Files larger than any single disk can still be stored, because their blocks are spread across multiple disks.

  • Less disk space is wasted, because many fixed-size 128 MB blocks pack onto a disk more evenly than whole files of arbitrary size.

  • It optimizes file transfer by distributing the load across multiple machines. For example, if a file consists of 10 blocks stored on 10 DataNodes and a client reads the file, the load is distributed across 10 machines (see the read sketch after this list).
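
A minimal read sketch, assuming the fs handle from the write example and a hypothetical path: open() asks the NameNode for the block locations, after which the client streams each block directly from a DataNode holding a replica (FSDataInputStream comes from org.apache.hadoop.fs).

  try (FSDataInputStream in = fs.open(new Path("/user/demo/events.txt"))) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) > 0) {
          System.out.write(buf, 0, n);  // copy the file's contents to stdout
      }
  }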

How it works

The default replication factor is 3 and the block size is 128 MB. When a client saves a file to HDFS, it first splits the file into 128 MB chunks and then writes them to HDFS blocks. Except for the last block, all other blocks will be 128 MB in size. The last block can be less than or equal to 128 MB depending on the size of the file. The default block size is configurable.

For example, suppose we want to store 500 MB of data in HDFS. This data gets split into four blocks: three blocks of 128 MB each and a last block of 116 MB (3 × 128 MB + 116 MB = 500 MB).

Distribution of HDFS blocks

All blocks are distributed among DataNodes in accordance with the replication factor. Each block replica is stored on a separate DataNode; the same DataNode never holds two replicas of one block.
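
This distribution is visible from a client. The sketch below (fs as before; the path is hypothetical; FileStatus and BlockLocation come from org.apache.hadoop.fs) prints, for each block of a file, the DataNodes that hold its replicas.

  FileStatus status = fs.getFileStatus(new Path("/user/demo/big.dat"));
  BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
  for (BlockLocation block : blocks) {
      // Each block reports the hosts (DataNodes) that store its replicas
      System.out.println("offset " + block.getOffset()
              + ", length " + block.getLength()
              + ", hosts " + String.join(",", block.getHosts()));
  }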

Reading an HDFS file is similar to downloading a movie through a torrent: the file is broken into multiple pieces, and these pieces are downloaded from multiple machines in parallel, which speeds up the transfer.

You can set the default replication factor for the entire file system as well as for each file and directory individually. For unimportant files, you can reduce the replication factor, and for very important files, set a higher replication factor.
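
For instance, the replication factor of an existing file can be changed with setReplication(), as in this sketch (fs and the path as in the earlier examples):

  // Lower the replication factor of an unimportant file to 2.
  // The change is applied asynchronously: HDFS adds or removes replicas in the background.
  boolean scheduled = fs.setReplication(new Path("/user/demo/scratch.dat"), (short) 2);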

Whenever a DataNode fails or goes offline, the NameNode instructs the DataNodes that hold the remaining replicas of the affected blocks to replicate them to other DataNodes, so that each file again reaches its assigned replication factor.
