Spark Connect

Overview

Spark Connect is a component of the Spark3 ADH service that enables thin clients to connect to Spark clusters from remote environments. Using Spark Connect, you can work with your Spark cluster remotely, for example, from your favorite IDE on a user-grade laptop.

The following diagram illustrates the high-level Spark Connect architecture.

Spark Connect architecture

How Spark Connect works

Spark Connect implements a decoupled client-server architecture. The server side is the Spark Connect ADH service, which listens for commands from clients. A client can be a Spark application or a Spark shell instance that uses the Spark Connect client library to communicate with the server.

The distinguishing feature of Spark applications that use Spark Connect is the creation of a remote Spark session object in the application code. A remote session object establishes a connection to the Spark Connect server and automatically maintains communication with it: it sends DataFrame operations to the server and receives the results of their execution. Once created, a remote Spark session is used in the same way as a regular Spark session object.
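For example, a minimal PySpark client may look as follows (a sketch; the host name is a placeholder, and 15002 is the default Spark Connect port):

    from pyspark.sql import SparkSession

    # Create a remote Spark session that connects to the Spark Connect server.
    # "spark-connect-host" is a placeholder host name.
    spark = SparkSession.builder.remote("sc://spark-connect-host:15002").getOrCreate()

    # From here on, the remote session is used like a regular SparkSession
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.show()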

Spark Connect uses the DataFrame API and unresolved logical plans as the protocol for communicating with the Spark cluster. The Spark Connect client library translates DataFrame operations into unresolved logical plan requests, encodes them, and sends them to the Spark Connect server over gRPC. The Spark Connect server converts the received unresolved logical plans into Spark logical plan operators. From this point, the standard Spark execution process kicks in, including all of Spark's optimizations and features.
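The snippet below illustrates this flow (a minimal sketch; the connection string is a placeholder). DataFrame transformations only accumulate an unresolved logical plan on the client; the plan is encoded and shipped to the server over gRPC when an action is called:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.remote("sc://spark-connect-host:15002").getOrCreate()

    # Transformations build up an unresolved logical plan on the client side;
    # no request is sent to the server yet
    df = (spark.range(1_000_000)
          .withColumn("squared", F.col("id") * F.col("id"))
          .filter(F.col("id") % 2 == 0))

    # The action triggers a gRPC request: the encoded plan is sent to the
    # Spark Connect server, analyzed, optimized, and executed there, and the
    # resulting rows are returned to the client
    print(df.limit(5).collect())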

For more information on how to work with Spark via Spark Connect, see Spark Connect Usage.

Spark Connect benefits

The major operational benefits of using Spark Connect are as follows:

  • Lightweight client environment. You can work with your massive Spark cluster from your favorite IDE or Jupyter notebook on an ordinary user-grade laptop. The main requirement is to use a remote Spark session instance that communicates with the Spark Connect server via the client library (see the setup sketch after this list).

  • Better resource isolation. Working with Spark via Spark Connect creates isolated sandboxes on the ADH hosts with resources allocated in advance. Spark executors that belong to different applications coexist within these isolated sandboxes without competing for resources. This helps to avoid scenarios of excessive resource consumption, for example, when a heavy Spark computation consumes all available RAM on a cluster node due to misconfigured startup parameters.

  • Upgradability. The Spark driver can be updated independently of the client side. For example, you can update the Spark core to pick up the latest security fixes without modifying your client applications. The only requirement is that the gRPC interfaces of the Spark Connect client and server remain compatible after the update.
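As an illustration of the lightweight client setup, a working environment can be as small as a Python virtual environment with the Spark Connect client package installed; no local JVM or full Spark distribution is required. The package name, host, and port below are assumptions for PySpark 3.4+:

    # Install the client library first, for example:
    #   pip install "pyspark[connect]"
    import os

    # Point the client at the Spark Connect server (placeholder host/port)
    os.environ["SPARK_REMOTE"] = "sc://spark-connect-host:15002"

    from pyspark.sql import SparkSession

    # With SPARK_REMOTE set, getOrCreate() returns a remote session instead of
    # starting a local JVM-backed Spark driver
    spark = SparkSession.builder.getOrCreate()
    print(spark.range(3).collect())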
