Storage policies

Overview

Storage policies are rules that improve application performance by distributing data across storage types according to how often the data is accessed and how much computing power it requires. For example, you can move frequently used data to SSD for faster access and move rarely used data to archive storage to save resources.

If a storage policy is specified for a directory, HDFS stores at least one replica of each file in that directory on the designated storage type. If that storage type runs out of space, HDFS uses the fallback storage types defined by the policy: the fallback storages for creation apply to newly written data, and the fallback storages for replication apply to replicas added during replication.

Storage types

Storage policies place data on the following storage types:

  • ARCHIVE: storage with high capacity but little computing power;

  • DISK: the default storage type, with average computing power and capacity;

  • SSD: storage hosted on SSD drives;

  • RAM_DISK: storage in DataNode memory;

  • PROVIDED: storage outside of the HDFS system. For more information on Provided storage, see the HDFS Provided Storage article.

Before setting a storage policy, make sure that the required storage type exists. For details on creating a specific storage type, see the Add HDFS data directories article.

Storage policies

In HDFS, the following policies exist:

  • Hot. This policy suits data that is regularly used for processing. All replicas are stored on DISK. If DISK runs out of space, replicas created during replication are stored in ARCHIVE.

  • Cold. This policy suits rarely used data. All replicas are stored in ARCHIVE.

  • Warm. This policy suits data that is used only occasionally. One replica is stored on DISK and the remaining replicas are stored in ARCHIVE.

  • All_SSD. This policy stores all replicas on SSD. If there is no space on SSD, the data is stored on DISK.

  • One_SSD. This policy stores one replica on SSD. The remaining replicas are stored on DISK.

  • Lazy_Persist. This policy writes a single replica to memory (RAM_DISK) first and then lazily persists it to DISK. It is useful only for single-replica blocks. For blocks with more than one replica, all replicas are written to DISK.

  • Provided. This policy stores one replica in the PROVIDED storage type and the remaining replicas on DISK.

| Policy name | ID | Block placement | Fallback storages for creation | Fallback storages for replication |
|---|---|---|---|---|
| Hot | 7 | All replicas are stored in DISK | none | ARCHIVE |
| Cold | 2 | All replicas are stored in ARCHIVE | none | none |
| Warm | 5 | One replica is stored in DISK and the rest in ARCHIVE | ARCHIVE, DISK | ARCHIVE, DISK |
| All_SSD | 12 | All replicas are stored in SSD | DISK | DISK |
| One_SSD | 10 | One replica is stored in SSD and the rest in DISK | SSD, DISK | SSD, DISK |
| Lazy_Persist | 15 | One replica is stored in RAM_DISK. If there’s more than one replica, all of the replicas are stored in DISK | DISK | DISK |
| Provided | 1 | One replica is stored in PROVIDED and the rest in DISK | PROVIDED, DISK | PROVIDED, DISK |
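
The fallback columns can be read as an ordered list of alternatives that are tried when the preferred storage type has no free space. The following is a minimal sketch of that reading, not the HDFS implementation: the function name `choose_storage` and its arguments are hypothetical, and the sketch only reports when no listed type is available without modeling what HDFS does in that case.

```shell
#!/bin/sh
# choose_storage PREFERRED "FALLBACK ..." "AVAILABLE ..."
# Hypothetical helper: prints the first storage type, preferred first and
# then the fallbacks in order, that appears in the list of available types.
choose_storage() {
  cs_preferred=$1
  cs_fallbacks=$2
  cs_available=$3
  for cs_candidate in $cs_preferred $cs_fallbacks; do
    case " $cs_available " in
      *" $cs_candidate "*)
        echo "$cs_candidate"
        return 0
        ;;
    esac
  done
  # None of the listed storage types has space.
  return 1
}

# All_SSD: SSD is preferred and DISK is the fallback; SSD is full here.
choose_storage SSD "DISK" "DISK ARCHIVE"

# Hot, new data: DISK is preferred and there is no creation fallback.
choose_storage DISK "" "ARCHIVE" || echo "no suitable storage"

# Hot, replication: ARCHIVE is the replication fallback.
choose_storage DISK "ARCHIVE" "ARCHIVE"
```

Under these assumptions, the three calls print DISK, "no suitable storage", and ARCHIVE, which matches the Hot and All_SSD rows of the table.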

IMPORTANT
For striped erasure-coded files, the only suitable storage policies are All_SSD, Hot, and Cold. Any other policy set for such files does not take effect.
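
A script that assigns policies could guard against this restriction before touching erasure-coded directories. The sketch below is only an illustration: `ec_policy_ok` is a hypothetical helper, not an HDFS command, and it encodes nothing beyond the three policy names listed in the note.

```shell
#!/bin/sh
# ec_policy_ok POLICY
# Hypothetical helper: succeeds if POLICY is one of the policies that take
# effect on striped erasure-coded files (All_SSD, Hot, Cold).
ec_policy_ok() {
  # Normalize case so that "Hot", "hot", and "HOT" are treated alike.
  ec_name=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  case "$ec_name" in
    ALL_SSD|HOT|COLD) return 0 ;;
    *) return 1 ;;
  esac
}

for policy in Hot Warm One_SSD Cold; do
  if ec_policy_ok "$policy"; then
    echo "$policy: suitable for striped EC files"
  else
    echo "$policy: would not take effect on striped EC files"
  fi
done
```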

Commands

To manage storage policies, use the hdfs storagepolicies command.

For example, to check which storage policy is assigned to a directory:

$ hdfs storagepolicies -getStoragePolicy -path /tmp

The output:

The storage policy of /tmp:
BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}

A new policy is not applied to existing blocks automatically. To make sure that block placement complies with the assigned storage policies, use the Mover action.
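
Assigning a policy and then migrating existing blocks might look like the sketch below. It assumes the stock Apache Hadoop CLI is available (the -setStoragePolicy subcommand and hdfs mover -p); the directory /data/archive and the helpers `run`, `apply_policy`, and `policy_name` are hypothetical. By default the script only prints the commands, so it can be reviewed before being run against a cluster.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints the commands instead of running them.
# Set DRY_RUN=0 on a host with an HDFS client to execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# apply_policy PATH POLICY
# Sets the policy, shows what is now assigned, and relocates existing blocks.
apply_policy() {
  ap_target=$1
  ap_policy=$2
  run hdfs storagepolicies -setStoragePolicy -path "$ap_target" -policy "$ap_policy"
  run hdfs storagepolicies -getStoragePolicy -path "$ap_target"
  # The new policy affects only new blocks; the mover migrates existing ones.
  run hdfs mover -p "$ap_target"
}

# policy_name
# Reads -getStoragePolicy output on stdin and prints the policy name only.
policy_name() {
  sed -n 's/^BlockStoragePolicy{\([A-Z_]*\):.*/\1/p'
}

# /data/archive is a placeholder directory used for illustration.
apply_policy /data/archive COLD
```

With DRY_RUN=0, piping the -getStoragePolicy output through `policy_name` gives a bare name such as HOT that a script can compare against the expected policy.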