spark-submit

The spark-submit script, located in Spark's bin directory, is used to launch applications on a cluster.

The general form of a spark-submit invocation is shown below:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Where:

  • --class — the entry point for your Spark application (e.g. org.apache.spark.examples.SparkPi).

  • --master — the master URL for the cluster (e.g. spark://23.195.26.187:7077).

  • --deploy-mode — indicates whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client); the default is client.

  • --conf — an arbitrary Spark configuration property in the key=value format. For values that contain spaces, wrap the whole "key=value" pair in quotes, as shown in the example after this list. Multiple configurations should be passed as separate arguments (for example, --conf <key>=<value> --conf <key2>=<value2>).

  • <application-jar> — the path to a bundled JAR containing your application and all its dependencies. The URL must be globally visible inside your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.

  • [application-arguments] — arguments to be passed to the main method of your main class, if any.
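
For example, the following command submits the bundled SparkPi example to a standalone cluster in cluster deploy mode. The master address and JAR path here are placeholders for your own values:

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://203.0.113.10:7077 \
  --deploy-mode cluster \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --executor-memory 2G \
  /path/to/spark-examples.jar \
  1000

Note that the --conf value is quoted as a whole because it contains a space, and the final argument (1000) is passed through to the main method of SparkPi.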

Master URL

The master URL passed to Spark can be in one of the following formats:

  • local — runs Spark locally with one worker thread (no parallelism).

  • local[K] — runs Spark locally with K worker threads (ideally, set this to the number of cores on your machine).

  • local[K,F] — runs Spark locally with K worker threads and F maxFailures (for more information, see spark.task.maxFailures).

  • local[*] — runs Spark locally with as many worker threads as there are logical cores on your machine.

  • local[*,F] — runs Spark locally with as many worker threads as there are logical cores on your machine and F maxFailures.

  • local-cluster[N,C,M] — used only for unit tests. It emulates a distributed cluster in a single JVM with N workers, C cores per worker, and M MiB of memory per worker.

  • spark://HOST:PORT — connects to the given Spark cluster in the Spark standalone mode. The port must be whichever one your master is configured to use, which is 7077 by default.

  • spark://HOST1:PORT1,HOST2:PORT2 — connects to the given Spark cluster in the Spark standalone mode with standby masters set up with ZooKeeper. The list must include all the master hosts in the high-availability cluster. The port must be whichever one each master is configured to use, which is 7077 by default.

  • mesos://HOST:PORT — connects to the given Mesos cluster. The port must be whichever one your Mesos master is configured to use, which is 5050 by default. For a Mesos cluster using ZooKeeper, use mesos://zk://… instead. To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.

  • yarn — connects to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location is found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.

  • k8s://HOST:PORT — connects to a Kubernetes cluster in client or cluster mode depending on the value of --deploy-mode. HOST and PORT refer to the Kubernetes API server. It connects using TLS by default; to force an unsecured connection, use k8s://http://HOST:PORT.
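
For instance, the same application can be pointed at different cluster managers simply by changing --master. In the sketches below, the JAR path is a placeholder:

# Run locally, using as many worker threads as logical cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master "local[*]" \
  /path/to/spark-examples.jar \
  100

# Submit to a YARN cluster in cluster mode; HADOOP_CONF_DIR (or
# YARN_CONF_DIR) must point at the cluster's Hadoop configuration
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  /path/to/spark-examples.jar \
  100

Quoting local[*] prevents the shell from treating the brackets as a glob pattern.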
