This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
This spec describes an efficiency improvement for Global Cluster with Erasure Coding. It proposes a way to improve PUT/GET performance when Erasure Coding is used across more than one region, while ensuring the original data can still be rebuilt even if a region is lost.
Swift now supports Erasure Codes (EC), which ensure higher durability and lower disk cost than replication in a one-region cluster. However, if Swift currently runs EC over 2 regions with less than 2x data redundancy (e.g. ec_k=10, ec_m=4) and one of the regions is lost to a disaster (e.g. a huge earthquake, fire, or tsunami), data may be lost. Assuming each region has an equal share of the available disk capacity, each region holds around 7 of the 14 fragments, which is fewer than ec_k and therefore not enough for the EC scheme to rebuild the original data.
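The arithmetic above can be checked with a minimal sketch. The function names `fragments_per_region` and `survives_region_loss` are hypothetical helpers written for illustration, not part of Swift, and the sketch assumes fragments are spread as evenly as possible across regions.

```python
def fragments_per_region(ec_k, ec_m, regions):
    """Return the fragment count per region, assuming an even spread."""
    total = ec_k + ec_m
    base, extra = divmod(total, regions)
    # The first `extra` regions hold one more fragment than the rest.
    return [base + 1 if i < extra else base for i in range(regions)]


def survives_region_loss(ec_k, ec_m, regions):
    """True if ec_k fragments remain after losing the fullest region."""
    per_region = fragments_per_region(ec_k, ec_m, regions)
    remaining = sum(per_region) - max(per_region)
    return remaining >= ec_k


if __name__ == "__main__":
    for ec_k, ec_m in [(10, 4), (10, 14)]:
        print(ec_k, ec_m,
              fragments_per_region(ec_k, ec_m, 2),
              survives_region_loss(ec_k, ec_m, 2))
```

With ec_k=10 and ec_m=4 over 2 regions, each region holds 7 fragments, so losing either region leaves fewer than ec_k fragments.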
To protect stored data and to ensure higher durability, Swift has to keep at least 1x the data size in each region (i.e. >= 2x for 2 regions) by employing a larger ec_m, such as ec_m=14 for ec_k=10. However, this increase sacrifices encode performance. In my measurements running PyECLib encode/decode on an Intel Xeon E5-2630v3 [1], the benchmark results were as follows:
scheme | ec_k | ec_m | encode | decode |
---|---|---|---|---|
jerasure_rs_vand | 10 | 4 | 7.6Gbps | 12.21Gbps |
jerasure_rs_vand | 10 | 14 | 2.67Gbps | 12.27Gbps |
jerasure_rs_vand | 20 | 4 | 7.6Gbps | 12.87Gbps |
jerasure_rs_vand | 20 | 24 | 1.6Gbps | 12.37Gbps |
isa_lrs_vand | 10 | 4 | 14.27Gbps | 18.4Gbps |
isa_lrs_vand | 10 | 14 | 6.53Gbps | 18.46Gbps |
isa_lrs_vand | 20 | 4 | 15.33Gbps | 18.12Gbps |
isa_lrs_vand | 20 | 24 | 4.8Gbps | 18.66Gbps |
Note that the “decode” benchmark uses (ec_k + ec_m) - 2 fragments, so decode performance decreases far less than encode performance as ec_m grows, as the results above show.
In the results above, comparing ec_k=10, ec_m=4 with ec_k=10, ec_m=14 for jerasure_rs_vand, the encode performance falls to about 1/3 (7.6Gbps to 2.67Gbps), and the other configurations follow a similar trend. This demonstrates that there is a problem when building a 2+ region EC cluster.
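As a quick check on that comparison, the following sketch computes the encode slowdown for each pair of rows in the benchmark table. The dictionary layout and the name `encode_ratio` are assumptions made for illustration; the numbers are copied from the table, with each unlabeled row attributed to the scheme named above it.

```python
# Encode throughput in Gbps, keyed by (scheme, ec_k) and then by ec_m.
ENCODE_GBPS = {
    ("jerasure_rs_vand", 10): {4: 7.6, 14: 2.67},
    ("jerasure_rs_vand", 20): {4: 7.6, 24: 1.6},
    ("isa_lrs_vand", 10): {4: 14.27, 14: 6.53},
    ("isa_lrs_vand", 20): {4: 15.33, 24: 4.8},
}


def encode_ratio(scheme, ec_k):
    """Throughput with the larger ec_m divided by that with the smaller."""
    by_m = ENCODE_GBPS[(scheme, ec_k)]
    small_m, large_m = min(by_m), max(by_m)
    return by_m[large_m] / by_m[small_m]


if __name__ == "__main__":
    for scheme, ec_k in ENCODE_GBPS:
        print(f"{scheme} ec_k={ec_k}: {encode_ratio(scheme, ec_k):.2f}")
```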
1: http://ark.intel.com/ja/products/83356/Intel-Xeon-Processor-E5-2630-v3-20M-Cache-2_40-GHz
Add an option like “duplication_factor”, which will create duplicated (copied) fragments instead of employing a larger ec_m.
For example, with a duplication_factor=2, Swift will encode ec_k=10, ec_m=4 and store 28 fragments (10x2 data fragments and 4x2 parity fragments) in Swift.
This requires a change to PUT/GET and the reconstruct sequence to map from the fragment index in Swift to the actual fragment index for PyECLib, but IMHO we don’t need to make many modifications to the communication among proxy-server <-> object-server <-> disks.
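One way to picture the index mapping is the sketch below. This spec does not define the mapping; taking the Swift index modulo (ec_k + ec_m) is an assumption chosen for illustration, and `backend_index` and `swift_indexes` are hypothetical names rather than Swift or PyECLib APIs.

```python
from collections import Counter


def swift_indexes(ec_k, ec_m, duplication_factor):
    """All fragment indexes Swift would store for one object."""
    return range((ec_k + ec_m) * duplication_factor)


def backend_index(swift_index, ec_k, ec_m):
    """Map a Swift fragment index to a PyECLib fragment index.

    Assumption: duplicates are laid out as repeated blocks, so the
    mapping is a plain modulo over the number of unique fragments.
    """
    return swift_index % (ec_k + ec_m)


if __name__ == "__main__":
    ec_k, ec_m, dup = 10, 4, 2
    stored = list(swift_indexes(ec_k, ec_m, dup))
    copies = Counter(backend_index(i, ec_k, ec_m) for i in stored)
    print(len(stored), "fragments stored")
    print("copies per unique fragment:", set(copies.values()))
```

Under this assumed layout, Swift indexes 0 and 14 both refer to PyECLib fragment 0, so each of the 14 unique fragments is stored exactly twice.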
I don’t want to describe the implementation in detail in the first patch of the spec, because this patch should present the idea for improving Swift. More discussion of the implementation will follow in subsequent patches.
Placement of these doubled fragments is important. If the same fragments, original and copied, appear in the same region and the second region fails, then we would be in the same situation as the smaller parity fragments case, where we couldn’t rebuild the original object.
e.g:
In this case, the 1st region has only fragments with fragment index 0, 1, 2 and the 2nd has only 3, 4, 5. Therefore, the original object cannot be rebuilt from the fragments in only one region, because the number of unique fragments in the region is less than k. A worst-case scenario like this causes the same significant data loss as would happen with no duplication factor.
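The placement problem can be illustrated with a small sketch. The parameters ec_k=4, ec_m=2, duplication_factor=2 and the concrete region layouts are assumptions picked to match the fragment indexes mentioned above, and the modulo mapping from Swift index to unique fragment index is likewise an assumption; `region_can_rebuild` is a hypothetical helper, not Swift code.

```python
def unique_fragments(swift_indexes, ec_k, ec_m):
    """Unique fragment indexes held, assuming a modulo index mapping."""
    return {i % (ec_k + ec_m) for i in swift_indexes}


def region_can_rebuild(swift_indexes, ec_k, ec_m):
    """True if a region alone holds at least ec_k unique fragments."""
    return len(unique_fragments(swift_indexes, ec_k, ec_m)) >= ec_k


if __name__ == "__main__":
    ec_k, ec_m = 4, 2  # assumed values, 12 stored fragments in total

    # Bad placement: each region holds a fragment and its own copy.
    bad = {"region1": [0, 1, 2, 6, 7, 8], "region2": [3, 4, 5, 9, 10, 11]}
    # Good placement: each region holds one full set of unique fragments.
    good = {"region1": [0, 1, 2, 3, 4, 5], "region2": [6, 7, 8, 9, 10, 11]}

    for name, layout in (("bad", bad), ("good", good)):
        for region, held in layout.items():
            print(name, region, sorted(unique_fragments(held, ec_k, ec_m)),
                  region_can_rebuild(held, ec_k, ec_m))
```

In the bad layout each region stores 6 fragments but only 3 unique ones, so neither region can rebuild the object alone even though the total storage cost is the same as in the good layout.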
i.e. In fact, data durability will be
In future work, we can find a way to tie a fragment index to a region, something like “1st subset should be in 1st Region and 2nd subset should be ...”, but for now this is beyond the scope of this spec.
We could use container-sync as a solution to the problem rather than employing my proposed change. This section describes the pros and cons of my “proposed change” and of “container-sync”.
Pros:
Cons:
Pros:
Cons:
Develop code around the proxy-server and object-reconstructor
None
None
None
None