The Spark provisioning plugin allows the creation of Spark standalone clusters in Sahara. Sahara EDP should support running Spark applications on those clusters.
The Spark plugin uses the standalone deployment mode for Spark (as opposed to Spark on YARN or Mesos). An EDP implementation must be created that supports the basic EDP functions using the facilities provided by the operating system and the Spark standalone deployment.
The basic EDP functions are:

* launch a job (run_job())
* check the status of a running job
* cancel a running job
The Sahara job manager has recently been refactored to allow provisioning plugins to select or provide an EDP engine based on the cluster and job_type. The plugin may return a custom EDP job engine object or choose a default engine provided by Sahara.
A default job engine for Spark standalone clusters can be added to Sahara that implements the basic EDP functions.
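To make the selection mechanism concrete, here is a minimal sketch of how the Spark plugin might plug into the refactored job manager, assuming a hook of the form get_edp_engine() on the plugin. SparkJobEngine and the "Java" job type check are placeholders, not the final implementation::

    class SparkJobEngine(object):
        """Placeholder for the default Spark standalone EDP engine."""
        def __init__(self, cluster):
            self.cluster = cluster

    class SparkProvider(object):
        def get_edp_engine(self, cluster, job_type):
            # Return an engine for job types this plugin can run on this
            # cluster; returning None tells the job manager that no
            # engine is available.
            if job_type == "Java":
                return SparkJobEngine(cluster)
            return None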
Note that there are no public APIs in Spark for job status or cancellation beyond the facilities available through a SparkContext object instantiated in a Scala program. However, basic functionality can be provided without developing Scala programs.
Engine selection criteria
The Spark provisioning plugin must determine whether the default Spark EDP engine can be used to run a particular job on a particular cluster. The following conditions must be true for the default engine to be used:
Remote commands via ssh
All operations should be implemented by running remote commands over ssh, as opposed to writing a custom agent and client. Furthermore, any long-running commands should be run asynchronously so that Sahara does not block waiting for them to complete.
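As an illustration, the following sketch shows one way to run a long-running command asynchronously over ssh. The run_remote() helper stands in for whatever remote-execution facilities Sahara already provides; the nohup wrapper detaches the process and echoes its PID so the caller can poll or signal it later::

    import shlex
    import subprocess

    def run_remote(host, command):
        # Illustrative only: run a command on a cluster node via ssh and
        # return its stdout. Sahara would use its own remote utilities.
        out = subprocess.check_output(["ssh", host, command])
        return out.decode().strip()

    def run_async(host, command):
        # nohup keeps the command alive after the ssh session closes;
        # '&' backgrounds it and 'echo $!' returns the wrapper's PID.
        wrapped = ("nohup sh -c %s > /dev/null 2>&1 & echo $!"
                   % shlex.quote(command))
        return int(run_remote(host, wrapped))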
The run_job() function will be implemented using the spark-submit script provided by Spark.
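A sketch of the command that run_job() might issue, building on run_async() above. The wrapper records spark-submit's own PID in a 'pid' file and its exit status in a 'result' file inside a per-job working directory; all names, paths, and options here are illustrative::

    def build_launch_command(wf_dir, main_class, master_url, app_jar):
        submit = ("spark-submit --class %s --master %s %s"
                  % (main_class, master_url, app_jar))
        # Save spark-submit's own PID for cancellation, then wait for it
        # and record the exit status where the status check can find it.
        return ("cd %s; %s & echo $! > pid; wait $!; echo $? > result"
                % (wf_dir, submit))

run_job() would then pass this command to run_async() and record the returned PID for later status checks.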
Job status can be determined by monitoring the PID returned by run_job() via ps and by reading the file containing the exit status.
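For example, under the launch convention sketched above, a status check might look like the following; the status strings are illustrative::

    def get_job_status(host, wf_dir, pid):
        # 'ps -p' succeeds while the wrapper (and spark-submit) is alive.
        alive = run_remote(host, "ps -p %d > /dev/null; echo $?" % pid)
        if alive == "0":
            return "RUNNING"
        # Once the process has exited, the wrapper has written the exit
        # status of spark-submit to the 'result' file.
        exit_status = run_remote(host, "cat %s/result" % wf_dir)
        return "SUCCEEDED" if exit_status == "0" else "DONEWITHERROR"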
A Spark application may be canceled by sending SIGINT to the process running spark-submit.
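Cancellation, again assuming the 'pid' file written by the launch wrapper above, could be as simple as::

    def cancel_job(host, wf_dir):
        # 'kill -INT' sends SIGINT, giving spark-submit a chance to shut
        # the application down cleanly; the PID was recorded at launch.
        run_remote(host, "kill -INT $(cat %s/pid)" % wf_dir)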
The Ooyala job server is an alternative for implementing Spark EDP, but it is a separate project outside of OpenStack and introduces another dependency. It would have to be installed by the Spark provisioning plugin, and Sahara contributors would need to understand it thoroughly.
Other than Ooyala, there does not seem to be any existing client or API for handling job submission, monitoring, and cancellation.
There is no data model impact, but a few fields will be reused.
The oozie_job_id field will store an id that allows the running application to be operated on; the name of this field should be generalized at some point in the future.
The job_execution.extra field may be used to store any additional information needed to operate on the running application.
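As an illustration, the reused fields might be populated along these lines; the "pid@instance-id" encoding and the contents of the extra field are assumptions, not a settled format::

    def record_job_handle(job_execution, pid, instance, wf_dir):
        # Store enough information to locate and signal the process
        # later; the encoding here is just one possibility.
        job_execution.oozie_job_id = "%s@%s" % (pid, instance.id)
        job_execution.extra = {"workflow-dir": wf_dir}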
None. Initially, Spark jobs (JARs) can be run using the Java job type. At some point a Spark-specific job type will be added (this will be covered in a separate specification).
Unit tests will be added for the changes in Sahara.
Integration tests for Spark standalone clusters will be added in another blueprint and specification.
The Elastic Data Processing section of the User Guide should describe the ability to run Spark jobs and any associated restrictions.