Support for S3-compatible object stores

https://blueprints.launchpad.net/sahara/+spec/sahara-support-s3

Following the efforts to make data sources and job binaries “pluggable”, it should be feasible to introduce support for S3-compatible object stores. This will be an additional alternative to the existing HDFS, Swift, MapR-FS, and Manila storage options.

A previous spec regarding this topic existed around the time of the Icehouse release, but the work has been stagnant since then: https://blueprints.launchpad.net/sahara/+spec/edp-data-source-from-s3

Problem description

Hadoop already offers filesystem libraries with support for s3a://bucket/path URIs, so supporting S3-compatible object stores in Sahara is a reasonable feature to add.

Within the world of OpenStack, many cloud operators choose Ceph RadosGW instead of Swift. The RadosGW object store supports access through either the Swift or the S3 API, and with some extra configuration a “native” installation of Swift can support the S3 API as well. For some users the Hadoop S3 library may be preferable to the Hadoop Swift library, as it has recently received several enhancements, including support for larger objects and other performance improvements.

Additionally, some cloud users may wish to use other S3-compatible object stores, including:

  • Amazon S3 (including AWS Public Datasets)

  • LeoFS

  • Riak Cloud Storage

  • Cloudian HyperStore

  • Minio

  • SwiftStack

  • Eucalyptus

Clearly, adding support for S3 data sources will open up a wide range of new Sahara use cases.

Proposed change

An “s3” data source type will be added via new code in sahara.service.edp.data_sources. We will need utilities to validate S3 URIs, as well as to handle job configs (access key, secret key, endpoint, bucket URI).
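
As a rough illustration, the new type plugin might take the following shape; the hook names, credential layout, and exception handling below are assumptions made for the sake of the sketch, not a final interface:

    # Illustrative sketch only: hook names and the credential layout are
    # assumptions about the pluggable data source interface.
    from sahara.service.edp.data_sources import base


    class S3Type(base.DataSourceType):
        required_creds = ('accesskey', 'secretkey', 'endpoint')

        def validate(self, data):
            # Reject anything that is not an s3://bucket/path URI
            url = data.get('url', '')
            if not url.startswith('s3://') or '/' not in url[len('s3://'):]:
                raise ValueError('URL must be of the form s3://bucket/path')
            creds = data.get('credentials', {})
            missing = [c for c in self.required_creds if c not in creds]
            if missing:
                raise ValueError('Missing S3 config(s): %s' % ', '.join(missing))

        def get_runtime_url(self, url, cluster):
            # Hadoop's S3 filesystem library expects the s3a:// scheme
            return url.replace('s3://', 's3a://', 1)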

Regarding EDP, there should not be much work to do beyond defining the new data source type, since the Hadoop S3 library allows jobs to run against S3 seamlessly.
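
For example, the credentials attached to the data source would ultimately surface as the standard hadoop-aws configuration keys at job launch. The fs.s3a.* keys below come from the hadoop-aws documentation; the helper itself is a hypothetical illustration of the mapping:

    # The fs.s3a.* keys are standard hadoop-aws properties; how and where
    # Sahara injects them at job launch is an implementation detail.
    def s3_job_configs(access_key, secret_key, endpoint, path_style=True):
        return {
            'fs.s3a.access.key': access_key,
            'fs.s3a.secret.key': secret_key,
            'fs.s3a.endpoint': endpoint,
            # Most non-AWS object stores require path-style addressing
            'fs.s3a.path.style.access': str(path_style).lower(),
        }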

Similar work will be done to enable an “s3” job binary type, including writing the “job binary retriever” code.
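
One plausible shape for that retriever, using boto3 against an arbitrary endpoint (the function name and the job binary's credential layout are illustrative assumptions):

    import boto3


    def get_raw_job_binary_data(job_binary):
        # Hypothetical credential layout; where the access key, secret key,
        # and endpoint are actually stored is up to the implementation.
        creds = job_binary['extra']
        client = boto3.client(
            's3',
            aws_access_key_id=creds['accesskey'],
            aws_secret_access_key=creds['secretkey'],
            endpoint_url=creds['endpoint'],
        )
        # URL form: s3://bucket/path/to/binary
        bucket, key = job_binary['url'][len('s3://'):].split('/', 1)
        return client.get_object(Bucket=bucket, Key=key)['Body'].read()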

While implementing the abstraction itself is simple, much of the work lies in the dashboard, saharaclient, documentation, and testing.

Alternatives

Do not add support for S3 as a data source for EDP. Since the Hadoop S3 libraries are already included on the image regardless of this change, users can run data processing jobs against S3 manually. We may still wish to add the relevant JARs to the classpath as a courtesy to users.

Data model impact

None

REST API impact

None (other than permitting “s3” as a valid data source type and job binary type in the validation schema)
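
In other words, the only API-visible change is the enlarged enum; assuming the current layout of the data source validation schema, it would look roughly like:

    # Assumed layout of the existing validation schema; the only change
    # is the new enum member.
    DATA_SOURCE_SCHEMA = {
        'type': 'object',
        'properties': {
            'type': {
                'type': 'string',
                'enum': ['swift', 'hdfs', 'maprfs', 'manila', 's3'],  # 's3' added
            },
            # ... other properties unchanged ...
        },
    }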

Other end user impact

None

Deployer impact

None

Developer impact

None

Sahara-image-elements impact

On most images, hadoop-aws.jar needs to be added to the Hadoop classpath; images with Hadoop (or a related component) installed generally already contain the JAR itself. Since this work will probably take place during the transition from sahara-image-elements to the new image packing approach, it will likely need to be done in both places.

Sahara-dashboard / Horizon impact

The Data Source and Job Binary forms should support the s3 type, with fields for access key, secret key, S3 URI, and S3 endpoint. Note that this is a lot of fields, in fact more than the Swift forms require; for the same reason, there will probably be some saharaclient impact as well.

Implementation

Assignee(s)

Primary assignee:

Jeremy Freudberg

Other contributors:

None

Work Items

  • S3 as a data source

  • S3 as a job binary

  • Ensure presence of AWS JAR on images

  • Dashboard and saharaclient work

  • Scenario tests

  • Documentation

Dependencies

None

Testing

We will probably want scenario tests (although we don’t have them for Manila).

Documentation Impact

Nothing out of the ordinary, but it is important to keep both the user and developer perspectives in mind.

References

None