Data compression

Daria Barysheva

Contents

Overview
Recommendations on data compression usage

Overview

To improve I/O operations and decrease data size in ADB clusters, you can use data compression. It is available only for append-optimized tables. There are two types of data compression:

Table-level compression — can be applied to an entire table. It is possible both for row-oriented and column-oriented append-optimized tables.
Column-level compression — can be applied to a specific column. You can choose different column-level compression algorithms for different table columns. This data compression type is possible only for column-oriented append-optimized tables.

Data orientation Compression level Algorithms Example

Row

Table

ZLIB — deflate algorithm.
ZSTD — Zstandard algorithm.

Table level only:

CREATE TABLE ao_row_compressed (a int, b text)
WITH (appendoptimized=true,
           compresstype=zlib,
           compresslevel=5)
DISTRIBUTED BY (a);

Column

Table and column

ZLIB — deflate algorithm.
ZSTD — Zstandard algorithm.
RLE_TYPE — run-length encoding.

Table level:

CREATE TABLE ao_col_compressed_table (a int, b text)
WITH (appendoptimized=true,
           orientation=column,
           compresstype=zlib,
           compresslevel=5)
DISTRIBUTED BY (a);

Column level:

CREATE TABLE ao_col_compressed_column
          (a int,
           b text ENCODING (compresstype=RLE_TYPE,
                                           compresslevel=1))
WITH (appendoptimized=true,
           orientation=column)
DISTRIBUTED BY (a);

Regardless of the level where you define data compression (table or column), you can use the following storage directives. They should be specified in the WITH or ENCODING clause, accordingly:

compresstype — type of data compression. Possible values: ZLIB, ZSTD, and RLE_TYPE. Values are not case-sensitive. The default value is none — no compression.
compresslevel — compression level. Possible values for different compression algorithms are summarized in the table below. Levels with small numbers are the fastest, but they perform the least compression.

IMPORTANT

When an ENCODING clause conflicts with a WITH clause, the ENCODING clause has a higher precedence.

Storage directives for compression
Directive	Deflate	Zstandard	Run-length encoding
compresstype	ZLIB	ZSTD	RLE_TYPE
compresslevel	Possible values: `1` — `9`. `1` is the fastest method with the least compression. It is used by default. `9` is the slowest method with the most compression.	Possible values: `1` — `19`. `1` is the fastest method with the least compression. It is used by default. `19` is the slowest method with the most compression.	Possible values: `1` — apply RLE_TYPE only. It is the fastest method with the least compression. It is used by default. `2` — apply RLE_TYPE, then apply the ZLIB compression with level 1. `3` — apply RLE_TYPE, then apply the ZLIB compression with level 5. `4` — apply RLE_TYPE, then apply the ZLIB compression with level 9. It is the slowest method with the most compression.

Recommendations on data compression usage

There are some recommendations on how to get the maximum performance when using data compression:

Choose the most suitable compression algorithm:
- ZSTD — provides both speed and a good compression ratio that can be tuned with the compresslevel option.
- ZLIB — is usually used for backward compatibility. ZSTD outperforms this compression type on usual workloads.
- RLE_TYPE — compresses data better than other algorithms when the same data values occur in many consecutive rows. But this compression type is not useful with tables that do not have large sets of repeated data.
When specifying the compression level, remember that the bigger that value is — the more (yet slower) the compression is. In fact, data compression allows you to "buy" I/O and disk space for CPU. So you should always balance between the available CPU resources and the required speed of compression. Try to find the optimal settings — to avoid long compression times or slow scan rates.
The final performance of compressed append-optimized tables depends on many factors, such as hardware, query tuning parameters, etc. You should perform comparison testing to determine whether data compression is useful in your environment.
It is not recommended to create compressed append-optimized tables on file systems that use compression.

TIP

Use the RLE_TYPE compression type for large sets of repeated data.
For most other tasks, the ZSTD compression with level 1 is the best option (comparable with the ZLIB compression with level 5).

Found a mistake? Seleсt text and press Ctrl+Enter to report it