Hive on Spark parameters

This section describes the parameters that control the interaction between Hive and the Spark execution engine. For more information on using these parameters, see the Spark and Hive section.
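These parameters can be set per session with the standard HiveQL SET command in Beeline (or another Hive client), or globally in hive-site.xml. A minimal sketch, assuming Hive on Spark is already installed and configured:

```sql
-- Show the current value of a parameter
SET hive.spark.job.monitor.timeout;

-- Override it for the current session only (here: 120 seconds)
SET hive.spark.job.monitor.timeout=120;
```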

| Parameter | Description | Default value |
|---|---|---|
| hive.spark.job.monitor.timeout | The timeout (in seconds) for the job monitor to get a Spark job state | 60 |
| hive.spark.dynamic.partition.pruning | When set to true, enables dynamic partition pruning (DPP) for the Spark engine: joins on partition keys are processed by writing the pruning information to a temporary HDFS file, which is read later to remove unnecessary partitions | false |
| hive.spark.dynamic.partition.pruning.map.join.only | Similar to hive.spark.dynamic.partition.pruning, but enables DPP only if the join on the partitioned table can be converted to a MapJoin | false |
| hive.spark.dynamic.partition.pruning.max.data.size | The maximum data size (in megabytes) for the dimension table that generates partition pruning information. If a table reaches this limit, the optimization is disabled | 100 |
| hive.spark.exec.inplace.progress | Updates the Spark job execution progress in-place in the terminal | true |
| hive.spark.use.ts.stats.for.mapjoin | If set to true, the MapJoin optimization in Hive on Spark uses statistics from the TableScan operators at the root of the operator tree instead of the parent ReduceSink operators of the Join operator. Setting this to true is useful when the operator statistics used for a common join-to-MapJoin conversion are inaccurate | false |
| hive.spark.explain.user | Defines whether to show the EXPLAIN result at the user level for Hive on Spark queries. When enabled, the user-level EXPLAIN output is also logged for each query | false |
| hive.prewarm.spark.timeout | The time (in milliseconds) to wait for pre-warming of Spark executors to finish when hive.prewarm.enabled=true | 5000 |
| hive.spark.optimize.shuffle.serde | If set to true, Hive on Spark registers custom serializers for data types in shuffle, which should result in less shuffled data | false |
| hive.merge.sparkfiles | Merges small files at the end of a Spark DAG transformation | false |
| hive.spark.use.op.stats | Defines whether to use operator statistics to determine reducer parallelism for Hive on Spark. If set to false, Hive uses source table statistics to determine reducer parallelism for all first-level reduce tasks, and the maximum reducer parallelism from all parents for the remaining (second-level and onward) reduce tasks. Setting this to false triggers an alternative algorithm for calculating the number of partitions per Spark shuffle, which typically results in more partitions per shuffle | true |
| hive.spark.use.groupby.shuffle | When set to true, Hive uses Spark's RDD#groupByKey() to perform groupings; when set to false, it uses Spark's RDD#repartitionAndSortWithinPartitions() instead. While groupByKey() performs better when running groupings, it can use an excessive amount of memory. Setting this parameter to false may reduce memory usage at the cost of performance | true |
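As an illustration, dynamic partition pruning could be enabled for a single session as follows, using the parameters from the table above. The values are examples, not recommendations:

```sql
-- Run this session's queries on the Spark engine
SET hive.execution.engine=spark;

-- Enable DPP, but only for joins that can be converted to a MapJoin
SET hive.spark.dynamic.partition.pruning=true;
SET hive.spark.dynamic.partition.pruning.map.join.only=true;

-- Skip the optimization for dimension tables larger than 50 MB
SET hive.spark.dynamic.partition.pruning.max.data.size=50;
```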

Remote Spark Driver

The remote Spark driver is the application launched in the Spark cluster that submits the actual Spark jobs. It is a long-lived application that is initialized upon the first query of the current user and keeps running until the user session is closed.

The following properties control the communication between the remote Spark driver and the Hive client that spawns it.

| Parameter | Description | Default value |
|---|---|---|
| hive.spark.client.future.timeout | The timeout (in seconds) for requests from the Hive client to the remote Spark driver | 60 |
| hive.spark.client.connect.timeout | The timeout (in milliseconds) for the remote Spark driver to connect back to the Hive client | 1000 |
| hive.spark.client.server.connect.timeout | The timeout (in milliseconds) for the handshake between the Hive client and the remote Spark driver. Checked by both processes | 90000 |
| hive.spark.client.secret.bits | The number of bits of randomness in the generated secret for communication between the Hive client and the remote Spark driver. Rounded down to the nearest multiple of 8 | 256 |
| hive.spark.client.rpc.server.address | The server address of the HiveServer2 host to be used for communication between the Hive client and the remote Spark driver | The HiveServer2 host address; localhost if unavailable |
| hive.spark.client.rpc.threads | The maximum number of threads for the remote Spark driver's RPC event loop | 8 |
| hive.spark.client.rpc.max.size | The maximum message size (in bytes) for communication between the Hive client and the remote Spark driver | 52428800 (50 MB) |
| hive.spark.client.channel.log.level | The channel logging level for the remote Spark driver. Possible values: DEBUG, ERROR, INFO, TRACE, WARN. If unset, TRACE is used | Not set |
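For example, the RPC timeouts above could be raised when the remote Spark driver needs more time to start on a busy cluster. These properties are typically set globally in hive-site.xml; if set per session, this must happen before the Spark session is created. The values below are illustrative only, not recommendations:

```sql
SET hive.spark.client.connect.timeout=5000;          -- ms, default 1000
SET hive.spark.client.server.connect.timeout=120000; -- ms, default 90000
SET hive.spark.client.future.timeout=120;            -- seconds, default 60
```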
