Multi-table inserts in Hive

Hive supports multi-table inserts, which write data to several Hive tables or HDFS directories in a single query. This technique is efficient because the source table is read only once and the retrieved data is reused to populate multiple targets.

Test database

Throughout this article, the following Hive table (transactions) is used to demonstrate multi-table inserts.

SELECT * FROM transactions;
+----------------------+----------------------+--------------------------+------------------------+
| transactions.txn_id  | transactions.acc_id  | transactions.txn_amount  | transactions.txn_date  |
+----------------------+----------------------+--------------------------+------------------------+
| 1                    | 1002                 | 10.00                    | 2023-01-01             |
| 8                    | 1003                 | 50.00                    | 2023-01-01             |
| 3                    | 1002                 | 30.00                    | 2023-01-02             |
| 4                    | 1001                 | 100.50                   | 2023-01-02             |
| 2                    | 1002                 | 20.00                    | 2023-01-03             |
| 6                    | 1001                 | 200.50                   | 2023-01-03             |
| 7                    | 1003                 | 50.00                    | 2023-01-03             |
| 5                    | 1001                 | 150.50                   | 2023-01-04             |
| 9                    | 1003                 | 75.00                    | 2023-01-04             |
+----------------------+----------------------+--------------------------+------------------------+
SQL for creating the test table

The following SQL can be used to create and populate the test Hive table using /bin/beeline.

CREATE DATABASE IF NOT EXISTS test_multi_inserts;
USE test_multi_inserts;

DROP TABLE IF EXISTS transactions;

CREATE TABLE transactions(
    txn_id int,
    acc_id int,
    txn_amount decimal(10,2),
    txn_date date)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

INSERT INTO transactions VALUES
(1, 1002, 10.00, '2023-01-01'),
(2, 1002, 20.00, '2023-01-03'),
(3, 1002, 30.00, '2023-01-02'),
(4, 1001, 100.50, '2023-01-02'),
(5, 1001, 150.50, '2023-01-04'),
(6, 1001, 200.50, '2023-01-03'),
(7, 1003, 50.00, '2023-01-03'),
(8, 1003, 50.00, '2023-01-01'),
(9, 1003, 75.00, '2023-01-04');

Syntax

The multi-insert syntax has two forms, presented below (for table targets, INSERT INTO can be used instead of INSERT OVERWRITE to append data rather than replace it). A short example of the second form follows the parameter descriptions.

FROM <src_tbl>
INSERT OVERWRITE
    (TABLE <target_tbl> [PARTITION(<partition_spec>)] | [LOCAL] DIRECTORY '<path/to/hdfs>')
<SELECT ... >
[INSERT OVERWRITE ... <SELECT ... >] ...


INSERT OVERWRITE
    (TABLE <target_tbl> [PARTITION(<partition_spec>)] | [LOCAL] DIRECTORY '<path/to/hdfs>')
<SELECT ... FROM <src_tbl> ... >

Where:

  • <src_tbl> — the source Hive table to read data from.

  • <SELECT ... > — the SELECT clause to read data from <src_tbl>.

  • <target_tbl> — the target table to write data to. Target tables are not created automatically and have to be pre-created manually.

  • <partition_spec> — specifies one or more table partitions to write data into.

  • <path/to/hdfs> — the path to an HDFS directory to write data to. If a relative path is provided, Hive stores the files under the current user's HDFS home directory (/user/hive in this article's examples). Adding the LOCAL keyword writes the data to the local file system instead.
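
As a quick illustration of the second form, the statement below reads the transactions table created above and writes per-day totals into a hypothetical txn_totals_per_day table, introduced only for this sketch; like any other target, it has to be created first.

CREATE TABLE txn_totals_per_day (txn_date date, total_amount decimal(20,2));

INSERT OVERWRITE TABLE txn_totals_per_day
SELECT txn_date, SUM(txn_amount)
FROM transactions
GROUP BY txn_date;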

Examples

Insert into multiple Hive tables

In the following example, the query calculates the minimum and maximum transaction amounts per account and writes the results into two separate Hive tables.

NOTE
The target tables must be created before running INSERT.

CREATE TABLE max_txn_per_acc (account integer, max_txn_value decimal(10,2));
CREATE TABLE min_txn_per_acc (account integer, min_txn_value decimal(10,2));

FROM transactions
    INSERT INTO TABLE max_txn_per_acc
        SELECT acc_id, MAX(txn_amount)
        GROUP BY acc_id
    INSERT INTO TABLE min_txn_per_acc
        SELECT acc_id, MIN(txn_amount)
        GROUP BY acc_id;

To verify the insertion results, query the tables:

SELECT * FROM max_txn_per_acc;
SELECT * FROM min_txn_per_acc;

The output:

SELECT * FROM max_txn_per_acc;
+--------------------------+--------------------------------+
| max_txn_per_acc.account  | max_txn_per_acc.max_txn_value  |
+--------------------------+--------------------------------+
| 1001                     | 200.50                         |
| 1002                     | 30.00                          |
| 1003                     | 75.00                          |
+--------------------------+--------------------------------+

SELECT * FROM min_txn_per_acc;
+--------------------------+--------------------------------+
| min_txn_per_acc.account  | min_txn_per_acc.min_txn_value  |
+--------------------------+--------------------------------+
| 1001                     | 100.50                         |
| 1002                     | 10.00                          |
| 1003                     | 50.00                          |
+--------------------------+--------------------------------+
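
A multi-insert is not limited to two targets, and the insert modes do not have to match: INSERT OVERWRITE and INSERT INTO clauses can generally be mixed as long as they address different targets. The sketch below reruns the aggregation, this time overwriting the two tables above and additionally appending counts to a hypothetical txn_count_per_acc table (which, like the other targets, has to be created first).

CREATE TABLE txn_count_per_acc (account int, txn_count bigint);

FROM transactions
    INSERT OVERWRITE TABLE max_txn_per_acc
        SELECT acc_id, MAX(txn_amount)
        GROUP BY acc_id
    INSERT OVERWRITE TABLE min_txn_per_acc
        SELECT acc_id, MIN(txn_amount)
        GROUP BY acc_id
    INSERT INTO TABLE txn_count_per_acc
        SELECT acc_id, COUNT(*)
        GROUP BY acc_id;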

Insert into table partitions

Hive allows you to insert data into specific table partitions using multi-insert operations. The following sample query writes selected source columns into a single, statically specified partition.

CREATE TABLE txns_partitioned_acc_date(txn_id int, txn_amount decimal(10, 2))
PARTITIONED BY (account_id int, txn_date date);

FROM transactions txns
    INSERT OVERWRITE TABLE txns_partitioned_acc_date
    PARTITION(account_id='1001', txn_date='2023-01-02')
SELECT txn_id, txn_amount
WHERE txns.acc_id=1001 AND txns.txn_date='2023-01-02';

SELECT * FROM test_multi_inserts.txns_partitioned_acc_date;

The result:

+-----------------------------------+---------------------------------------+---------------------------------------+-------------------------------------+
| txns_partitioned_acc_date.txn_id  | txns_partitioned_acc_date.txn_amount  | txns_partitioned_acc_date.account_id  | txns_partitioned_acc_date.txn_date  |
+-----------------------------------+---------------------------------------+---------------------------------------+-------------------------------------+
| 4                                 | 100.50                                | 1001                                  | 2023-01-02                          |
+-----------------------------------+---------------------------------------+---------------------------------------+-------------------------------------+
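
Several static partitions of the same table can also be filled in one statement by repeating the INSERT ... PARTITION clause. The following sketch appends the 2023-01-03 rows of accounts 1001 and 1002 into their respective partitions of the table created above.

FROM transactions txns
    INSERT INTO TABLE txns_partitioned_acc_date
    PARTITION(account_id='1001', txn_date='2023-01-03')
        SELECT txn_id, txn_amount
        WHERE txns.acc_id=1001 AND txns.txn_date='2023-01-03'
    INSERT INTO TABLE txns_partitioned_acc_date
    PARTITION(account_id='1002', txn_date='2023-01-03')
        SELECT txn_id, txn_amount
        WHERE txns.acc_id=1002 AND txns.txn_date='2023-01-03';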

By default, the PARTITION(<spec>) clause must include at least one static partition value, like PARTITION(account_id='1001') in the example above. To use dynamic partitions with multi-inserts (e.g. PARTITION(account_id)) and thereby let Hive decide the destination partition automatically, set the following properties:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

Then, you can specify dynamic partitions as shown in the following example.

CREATE TABLE txns_partitioned_acc(txn_amount decimal(10, 2))
PARTITIONED BY (account_id int);

SET hive.exec.dynamic.partition.mode=nonstrict;

FROM transactions txns
    INSERT OVERWRITE TABLE txns_partitioned_acc
    PARTITION(account_id)
SELECT txn_amount, acc_id;

SELECT * FROM txns_partitioned_acc;

The output shows that Hive has automatically routed the rows to the corresponding partitions.

+----------------------------------+----------------------------------+
| txns_partitioned_acc.txn_amount  | txns_partitioned_acc.account_id  |
+----------------------------------+----------------------------------+
| 100.50                           | 1001                             |
| 150.50                           | 1001                             |
| 200.50                           | 1001                             |
| 10.00                            | 1002                             |
| 20.00                            | 1002                             |
| 30.00                            | 1002                             |
| 50.00                            | 1003                             |
| 50.00                            | 1003                             |
| 75.00                            | 1003                             |
+----------------------------------+----------------------------------+
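
Static and dynamic partition columns can be combined in a single PARTITION clause: the static values must come first, and the dynamic columns must be the last expressions of the SELECT list, in the same order as in the PARTITION clause. Below is a sketch reusing the txns_partitioned_acc_date table created earlier (with hive.exec.dynamic.partition=true set as above; the static account_id value satisfies even the strict mode requirement).

FROM transactions txns
    INSERT OVERWRITE TABLE txns_partitioned_acc_date
    PARTITION(account_id='1001', txn_date)
        SELECT txn_id, txn_amount, txn_date
        WHERE txns.acc_id=1001;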

Write to multiple files

Hive allows writing data to multiple HDFS or local file system directories at a time. The query below instructs Hive to write data from the source table into two HDFS directories.

FROM transactions
    INSERT OVERWRITE DIRECTORY 'multi-insert-demo/transactions_20230101'
    SELECT acc_id, txn_amount WHERE txn_date='2023-01-01'
    INSERT OVERWRITE DIRECTORY 'multi-insert-demo/transactions_20230102'
    SELECT acc_id, txn_amount WHERE txn_date='2023-01-02';

To verify the new directories created in HDFS, use the following command:

$ hdfs dfs -ls /user/hive/multi-insert-demo

The output indicates two HDFS directories that store files with the corresponding selection results.

...
drwxr-xr-x   - hive hadoop          0 2023-11-29 10:51 /user/hive/multi-insert-demo/transactions_20230101
drwxr-xr-x   - hive hadoop          0 2023-11-29 10:51 /user/hive/multi-insert-demo/transactions_20230102

To write to files on a local file system rather than HDFS, use the LOCAL keyword as shown below.

FROM transactions
    INSERT OVERWRITE LOCAL DIRECTORY '/tmp/test.out'
SELECT transactions.*;
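
Data written this way is serialized as plain text, and by default Hive separates the columns with its ^A (\001) delimiter, which is inconvenient for downstream tools. In newer Hive versions, a row format can be specified for the directory target; a minimal sketch that writes comma-separated output into a hypothetical /tmp/transactions_csv directory:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/transactions_csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT acc_id, txn_amount, txn_date
FROM transactions;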