This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode

Improve Erasure Coding Efficiency for Global Cluster

This spec describes an efficiency improvement for Global Cluster with Erasure Coding. It proposes a way to improve PUT/GET performance when Erasure Coding is used across more than one region, while ensuring the original data can still be recovered even if a region is lost.

Problem description

Swift now supports Erasure Codes (EC), which ensure higher durability and lower disk cost than replication for a one-region cluster. However, if Swift currently runs EC over 2 regions with < 2x data redundancy (e.g. ec_k=10, ec_m=4) and one of the regions is lost to a disaster (e.g. huge earthquake, fire, tsunami), there is a chance data will be lost. That is because, assuming each region has an equal volume of available disks, each region holds around 7 of the 14 fragments, fewer than ec_k, which is not enough for the EC scheme to rebuild the original data.
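The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration under the same even-split assumption as above; the function names are hypothetical and are not part of Swift.

```python
def fragments_per_region(ec_k, ec_m, regions):
    """Split ec_k + ec_m fragments as evenly as possible across regions."""
    base, extra = divmod(ec_k + ec_m, regions)
    # The first `extra` regions each hold one fragment more than the rest.
    return [base + 1 if i < extra else base for i in range(regions)]


def survives_any_region_loss(ec_k, ec_m, regions):
    """True if at least ec_k fragments remain after losing the fullest region."""
    per_region = fragments_per_region(ec_k, ec_m, regions)
    remaining = (ec_k + ec_m) - max(per_region)
    return remaining >= ec_k


print(fragments_per_region(10, 4, 2))       # [7, 7]
print(survives_any_region_loss(10, 4, 2))   # False
print(survives_any_region_loss(10, 14, 2))  # True
```

With ec_k=10 and ec_m=4, the surviving region holds only 7 fragments, so the object cannot be rebuilt. A larger ec_m such as 14 leaves 12 fragments, which is enough.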

To protect stored data and to ensure higher durability, Swift has to keep at least 1x the data size in each region (i.e. >= 2x in total for 2 regions) by employing a larger ec_m, such as ec_m=14 for ec_k=10. However, this increase sacrifices encode performance. In my measurements running PyECLib encode/decode on an Intel Xeon E5-2630v3 [1], the benchmark results were as follows:

scheme            ec_k  ec_m  encode     decode
jerasure_rs_vand  10    4     7.6Gbps    12.21Gbps
jerasure_rs_vand  10    14    2.67Gbps   12.27Gbps
jerasure_rs_vand  20    4     7.6Gbps    12.87Gbps
jerasure_rs_vand  20    24    1.6Gbps    12.37Gbps
isa_l_rs_vand     10    4     14.27Gbps  18.4Gbps
isa_l_rs_vand     10    14    6.53Gbps   18.46Gbps
isa_l_rs_vand     20    4     15.33Gbps  18.12Gbps
isa_l_rs_vand     20    24    4.8Gbps    18.66Gbps

Note that “decode” uses (ec_k + ec_m) - 2 fragments, so its performance decreases less than encode performance does, as the results above show.

In the results above, comparing ec_k=10, ec_m=4 with ec_k=10, ec_m=14, the encode performance drops to about 1/3 (7.6Gbps to 2.67Gbps for jerasure_rs_vand), and the other configurations follow a similar trend. This demonstrates that there is a problem when building a 2+ region EC cluster.
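To make the trend concrete, the ratios can be computed directly from the encode column of the table. The sketch below only re-uses the measured numbers above; the dictionary layout and the helper name are hypothetical.

```python
# Encode throughput in Gbps, copied from the benchmark table above.
ENCODE_GBPS = {
    ("jerasure_rs_vand", 10, 4): 7.6,
    ("jerasure_rs_vand", 10, 14): 2.67,
    ("jerasure_rs_vand", 20, 4): 7.6,
    ("jerasure_rs_vand", 20, 24): 1.6,
    ("isa_l_rs_vand", 10, 4): 14.27,
    ("isa_l_rs_vand", 10, 14): 6.53,
    ("isa_l_rs_vand", 20, 4): 15.33,
    ("isa_l_rs_vand", 20, 24): 4.8,
}


def encode_ratio(scheme, ec_k, small_m, large_m):
    """Fraction of encode throughput kept when ec_m grows from small_m to large_m."""
    return ENCODE_GBPS[(scheme, ec_k, large_m)] / ENCODE_GBPS[(scheme, ec_k, small_m)]


for scheme in ("jerasure_rs_vand", "isa_l_rs_vand"):
    for ec_k, small_m, large_m in ((10, 4, 14), (20, 4, 24)):
        ratio = encode_ratio(scheme, ec_k, small_m, large_m)
        print(f"{scheme} ec_k={ec_k}: ec_m={small_m} -> {large_m} "
              f"keeps {ratio:.0%} of encode throughput")
```

For these measurements the retained encode throughput ranges from roughly 21% to 46%, depending on the scheme and ec_k.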

1: http://ark.intel.com/ja/products/83356/Intel-Xeon-Processor-E5-2630-v3-20M-Cache-2_40-GHz

Proposed change

Add an option like “duplication_factor”, which will create duplicated (copied) fragments instead of employing a larger ec_m.

For example, with a duplication_factor=2, Swift will encode ec_k=10, ec_m=4 and store 28 fragments (10x2 data fragments and 4x2 parity fragments) in Swift.

This requires a change to the PUT/GET and reconstruct sequences to map from the fragment index in Swift to the actual fragment index for PyECLib, but IMHO we don’t need to modify much of the conversation among proxy-server <-> object-server <-> disks.
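A minimal sketch of the fragment count and the index mapping follows. It assumes a simple modulo mapping, which is consistent with the placement example in this spec (for k=4, m=2, indices 0 and 6 denote the same fragment). The function names are hypothetical, and the spec deliberately leaves the real implementation open.

```python
def total_fragments(ec_k, ec_m, duplication_factor):
    """Number of fragments stored in Swift for one object."""
    return (ec_k + ec_m) * duplication_factor


def backend_index(swift_index, ec_k, ec_m):
    """Map a fragment index stored in Swift to the index PyECLib would see.

    Assumption: duplicates are numbered after the originals, so the
    mapping is a plain modulo over the number of unique fragments.
    """
    return swift_index % (ec_k + ec_m)


n = total_fragments(10, 4, 2)
print(n)  # 28

# Swift indices 0..13 are the originals, 14..27 are their copies.
print([backend_index(i, 10, 4) for i in range(n)])
```

Under this assumption PyECLib itself still only sees a plain ec_k=10, ec_m=4 scheme, so encode cost does not grow with duplication_factor.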

I don’t want to describe the implementation in detail in the first patch of the spec, because it should first be an idea to improve Swift. More discussion on the implementation side will follow in subsequent patches.

Considerations of actual placement

Placement of these duplicated fragments is important. If the same fragments, original and copy, are placed in the same region and the second region fails, then we would be in the same situation as in the smaller parity fragments case, where we couldn’t rebuild the original object.

e.g:

  • duplication_factor=2, k=4, m=2
  • 1st Region: [0, 1, 2, 6, 7, 8]
  • 2nd Region: [3, 4, 5, 9, 10, 11]
  • (Assuming actual indices to rebuild are mapped as index % (k+m))

In this case, the 1st region has only fragments with actual fragment index 0, 1, 2 and the 2nd has only 3, 4, 5. Therefore, the original object cannot be rebuilt from the fragments in only one region, because the number of unique fragments in each region (3) is less than k (4). A worst-case scenario like this will cause significant data loss, just as would happen with no duplication factor.

In other words, data durability will be ordered as

  • “no duplication” < “with duplication” < “more unique parities”
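The worst-case example above can be checked mechanically. The sketch below assumes that a Swift fragment index maps to an actual fragment index by modulo (k+m), which is what makes indices 0, 1, 2, 6, 7, 8 reduce to 0, 1, 2. The helper names and the "better" layout are hypothetical illustrations, not a proposed placement algorithm.

```python
def unique_backend_indices(swift_indices, k, m):
    """Set of distinct actual fragment indices held by one region."""
    return {i % (k + m) for i in swift_indices}


def region_can_rebuild(swift_indices, k, m):
    """A region can rebuild alone only if it holds >= k unique fragments."""
    return len(unique_backend_indices(swift_indices, k, m)) >= k


K, M = 4, 2

layouts = {
    # Placement from the example above: each copy sits beside its original.
    "worst": {"1st Region": [0, 1, 2, 6, 7, 8],
              "2nd Region": [3, 4, 5, 9, 10, 11]},
    # Hypothetical alternative: one full set of fragments per region.
    "better": {"1st Region": [0, 1, 2, 3, 4, 5],
               "2nd Region": [6, 7, 8, 9, 10, 11]},
}

for name, layout in layouts.items():
    for region, indices in layout.items():
        unique = sorted(unique_backend_indices(indices, K, M))
        ok = region_can_rebuild(indices, K, M)
        print(f"{name:6} {region}: unique={unique} can_rebuild={ok}")
```

In the worst layout each region holds only 3 unique fragments, below k=4, while the alternative layout gives every region all 6 unique fragments.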

In future work, we can find a way to tie a fragment index to a region, something like “1st subset should be in 1st Region and 2nd subset should be ...”, but for now this is beyond the scope of this spec.

Alternatives

We could use container-sync as a solution to the problem rather than employing my proposed change. This section describes the pros and cons of my “proposed change” and of “container-sync”.

Proposed Change

Pros:

  • Higher performance way to spread objects across regions (No need to re-decode/encode for transferring across regions)
  • No extra configuration other than storage policy is needed for users to turn on the global replication. (strictly global erasure coding?)
  • Able to use other global cluster efficiency improvements (affinity control)

Cons:

  • Need to employ more complex handling around ECObjectController

Container-Sync

Pros:

  • Simple and able to reuse existing swift mechanisms
  • Less data transfer between regions

Cons:

  • Re-decode/encode is required when transferring objects to another region
  • Need to set the sync option for each container
  • Impossible to retrieve/reconstruct an object when > ec_m disks are unavailable (including when an IP is unreachable)

Implementation

  • Proxy-Server PUT/GET path
  • Object-Reconstructor
  • (Optional) Ring placement strategy

Questions and Answers

  • TBD

Assignee(s)

Primary assignee:
kota_ (Kota Tsuyuzaki)

Work Items

Develop code around proxy-server and object-reconstructor

Repositories

None

Servers

None

DNS Entries

None

Dependencies

None