[EDP] Add a Spark job type (instead of overloading Java)¶
Spark EDP was initially implemented using the Java job type. However, Spark semantics are slightly different, and Spark jobs will probably continue to diverge from Java jobs during future development. Additionally, a dedicated job type will help users distinguish Spark applications from Java MapReduce jobs in the Sahara database.
The work involves adding a Spark job type to the job type enumeration in Sahara and extending the dashboard to allow the creation and submission of Spark jobs. The Sahara client must be able to create and submit Spark jobs as well (there may not be any new work in the client to support this).
Existing unit tests and integration tests must be repaired if the addition of a new job type causes them to fail. Unit tests analogous to the tests for current job types should be added for the Spark job type.
Integration tests for Spark clusters/jobs will be added as a separate effort.
Changes in the Sahara-api code:

- Add the Spark job type to the enumeration (see the sketch after this list)
- Add validation methods for job creation and job execution creation
- Add unit tests for the Spark job type
- Add the Spark job type to the job types supported by the Spark plugin
  - Leave the Java type supported for Spark until the dashboard is changed
- Add config hints for the Spark type
  - These may be empty initially
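A minimal sketch of the enumeration change, assuming the constants live in sahara.utils.edp as the implementation notes below suggest (the non-Spark constants are shown only for context):

    # sahara/utils/edp.py (sketch): JOB_TYPE_SPARK and JOB_TYPES_ALL are
    # named in this spec; the other constants are assumed for context.
    JOB_TYPE_HIVE = 'Hive'
    JOB_TYPE_JAVA = 'Java'
    JOB_TYPE_MAPREDUCE = 'MapReduce'
    JOB_TYPE_PIG = 'Pig'
    JOB_TYPE_SPARK = 'Spark'  # the new job type

    # JOB_TYPES_ALL drives validation and config-hint lookups, so the
    # new type must be listed here as well.
    JOB_TYPES_ALL = [
        JOB_TYPE_HIVE,
        JOB_TYPE_JAVA,
        JOB_TYPE_MAPREDUCE,
        JOB_TYPE_PIG,
        JOB_TYPE_SPARK,
    ]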
Changes in the Sahara dashboard:

- Add the Spark job type as a selectable type on the job creation form (a hypothetical sketch follows this list)
  - Include the “Choose a main binary” input on the Create Job tab
  - Supporting libraries are optional, so the form should include the Libs tab
- Add a “Launch job” form for the Spark job type
  - The form should include the “Main class” input
    - Spark jobs will share the edp.java.main_class configuration with Java jobs; alternatively, Spark jobs could use an edp.spark.main_class config
  - No data sources, as with Java jobs
  - There may be additional configuration parameters in the future, but none are supported at present; the Configuration button may be included or left out
  - The Arguments button should be present
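As a rough illustration of the form change, the type selector on the job creation form simply gains a Spark entry. The class and field names below are hypothetical and do not reflect the actual dashboard code:

    # Hypothetical Horizon form sketch; real class/field names may differ.
    from django import forms

    JOB_TYPE_CHOICES = [
        ("Pig", "Pig"),
        ("Hive", "Hive"),
        ("MapReduce", "MapReduce"),
        ("Java", "Java"),
        ("Spark", "Spark"),  # new selectable type
    ]

    class JobCreateForm(forms.Form):
        job_type = forms.ChoiceField(label="Job Type",
                                     choices=JOB_TYPE_CHOICES)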
Alternatives¶
Overload the existing Java job type. It is similar enough to work as a proof of concept, but in the long term this is probably not clear, desirable, or maintainable.
Data model impact¶
None. Job type is stored as a string in the database, so there should be no impact there.
REST API impact¶
None. The JSON schema will list “Spark” as a valid job type, but the API calls themselves should not be affected.
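For illustration, a Spark job would be created and launched with the same payload shapes already used for other job types; only the type value is new. The ids, class name, and paths below are placeholders, and the field layout is an assumption based on the existing EDP API:

    POST /v1.1/{tenant_id}/jobs
    {
        "name": "spark-wordcount",
        "type": "Spark",
        "mains": ["<binary-id-of-main-application-jar>"],
        "libs": []
    }

    POST /v1.1/{tenant_id}/jobs/{job_id}/execute
    {
        "cluster_id": "<spark-cluster-id>",
        "job_configs": {
            "configs": {"edp.java.main_class": "org.example.WordCount"},
            "args": ["hdfs://input/path", "hdfs://output/path"]
        }
    }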
Other end user impact¶
Sahara-dashboard / Horizon impact¶
Described under Proposed Change.
Assignee(s)¶
- Primary assignee:
  Trevor McKay (sahara-api)
- Other contributors:
  Chad Roberts (dashboard)
Additional notes on implementation of items described under Proposed Change:
- The simple addition of JOB_TYPE_SPARK to sahara.utils.edp.JOB_TYPES_ALL did not cause existing unit tests to fail in an experiment
- Existing unit tests should be surveyed, and analogous tests for the Spark job type should be added as appropriate
- sahara.service.edp.job_manager.get_job_config_hints(job_type) needs to handle the Spark job type. Currently all config hints are retrieved from the Oozie job engine; this will not be correct for Spark.
- Relatedly, the use of JOB_TYPES_ALL in workflow_creator.workflow_factory.get_possible_job_config() should probably be modified, just to be safe
- sahara.service.edp.job_utils.get_data_sources() needs to treat Spark jobs like Java jobs (there are no data sources, only arguments)
- service.validations.edp.job.check_main_libs() needs to require a main application jar for Spark jobs and allow supporting libs as optional (see the sketch after this list)
- Spark requires edp.java.main_class (or edp.spark.main_class)
- check_edp_job_support() is called during validation and should be fine. The default implementation is an empty body, and the Spark plugin does not override this method, since the Spark standalone deployment is part of a Spark image generated from DIB.
- Use of EDP job types in sahara.service.edp.oozie.workflow_creator should not be impacted, since Spark jobs shouldn't be targeted to an Oozie engine by the job manager (but see the note on get_job_config_hints() and JOB_TYPES_ALL above)
- The Sahara client does not appear to reference specific job type values, so there is likely no work to do in the client
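A rough sketch of the check_main_libs() and get_data_sources() changes described above, with simplified signatures and a generic exception standing in for Sahara's own exception types:

    # Sketch only: the real signatures and exception types differ.
    from sahara.utils import edp

    def check_main_libs(job_type, mains, libs):
        # Spark requires a main application jar; supporting libs
        # remain optional, so no check is needed for them.
        if job_type == edp.JOB_TYPE_SPARK and not mains:
            raise RuntimeError("Spark jobs require a main application jar")
        # ... existing checks for the other job types ...

    def get_data_sources(job_execution, job):
        # Treat Spark like Java: no data sources, only arguments.
        if job.type in (edp.JOB_TYPE_JAVA, edp.JOB_TYPE_SPARK):
            return None, None
        # ... existing data source lookup for the other job types ...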
New unit tests will be added for the Spark job type, analogous to existing tests for other job types. Existing unit and integration tests will ensure that other job types have not been broken by the addition of a Spark type.
Integration tests for Spark clusters should be added in a follow-on blueprint, including tests for EDP with Spark job types.
The User Guide calls out details of the different job types for EDP. Details of the Spark type will need to be added to this section.