Leopard-RS : O(N Log N) MDS Reed-Solomon Block Erasure Code for Large Data

License: BSD 3-Clause "New" or "Revised" License

C++ 94.35% C 4.03% CMake 1.62%

leopard's Introduction

Leopard-RS

MDS Reed-Solomon Erasure Correction Codes for Large Data in C

Leopard-RS is a fast library for Erasure Correction Coding. From a block of equally sized original data pieces, it generates recovery symbols that can be used to recover lost original data.

Motivation:

Its running time grows as O(N Log N) in the input data size, and its inner loops are vectorized using the best approaches available on modern processors, using the fastest finite fields (8-bit or 16-bit Galois fields with Cantor basis {2}).

It sets new speed records for MDS encoding and decoding of large data, achieving over 1.2 GB/s to encode with the AVX2 instruction set on a single core.

Example applications are data recovery software and data center replication.

Encoder API:

Preconditions:

  • The original and recovery data must not exceed 65536 pieces.
  • The recovery_count <= original_count.
  • The buffer_bytes must be a multiple of 64.
  • Each buffer should have the same number of bytes.
  • Even the last piece must be rounded up to the block size.

#include "leopard.h"

For full documentation please read leopard.h.

  • leo_init() : Initialize library.
  • leo_encode_work_count() : Calculate the number of work_data buffers to provide to leo_encode().
  • leo_encode(): Generate recovery data.
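
The preconditions listed above can be checked before calling into the library. The following is a minimal sketch, not code from leopard.h: the helper names round_up_to_64, piece_bytes and encoder_params_ok are hypothetical, and the 65536 limit is read here as a limit on the combined number of original and recovery pieces, which is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: round a byte count up to the next multiple of 64. */
static uint64_t round_up_to_64(uint64_t bytes)
{
    return (bytes + 63u) & ~(uint64_t)63u;
}

/* Hypothetical helper: buffer size per piece when total_bytes of input are
   split into original_count equally sized pieces. The last piece is padded
   by the caller so that every buffer has this size. */
static uint64_t piece_bytes(uint64_t total_bytes, unsigned original_count)
{
    assert(original_count > 0);
    uint64_t per_piece = (total_bytes + original_count - 1) / original_count;
    return round_up_to_64(per_piece);
}

/* Hypothetical check that mirrors the listed preconditions. */
static bool encoder_params_ok(uint64_t buffer_bytes,
                              unsigned original_count,
                              unsigned recovery_count)
{
    if (original_count == 0 || recovery_count == 0) return false;
    if (recovery_count > original_count) return false;
    /* Assumption: the 65536 limit applies to the combined piece count. */
    if ((uint64_t)original_count + recovery_count > 65536u) return false;
    if (buffer_bytes == 0 || buffer_bytes % 64 != 0) return false;
    return true;
}
```

For example, 1000 bytes split into 3 pieces needs 334 bytes per piece, which rounds up to a buffer_bytes of 384.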

Decoder API:

For full documentation please read leopard.h.

  • leo_init() : Initialize library.
  • leo_decode_work_count() : Calculate the number of work_data buffers to provide to leo_decode().
  • leo_decode() : Recover original data.

Benchmarks:

On the MacBook Pro 15-Inch (Mid-2015) featuring a 22 nm "Haswell/Crystalwell" 2.8 GHz Intel Core i7-4980HQ processor, compiled with Visual Studio 2017 RC:

Leopard Encoder(8.192 MB in 128 pieces, 128 losses): Input=2102.67 MB/s, Output=2102.67 MB/s
Leopard Decoder(8.192 MB in 128 pieces, 128 losses): Input=686.212 MB/s, Output=686.212 MB/s

Leopard Encoder(64 MB in 1000 pieces, 200 losses): Input=2194.94 MB/s, Output=438.988 MB/s
Leopard Decoder(64 MB in 1000 pieces, 200 losses): Input=455.633 MB/s, Output=91.1265 MB/s

Leopard Encoder(2097.15 MB in 32768 pieces, 32768 losses): Input=451.168 MB/s, Output=451.168 MB/s
Leopard Decoder(2097.15 MB in 32768 pieces, 32768 losses): Input=190.471 MB/s, Output=190.471 MB/s

2 GB of 64 KB pieces encoded in 4.6 seconds, and worst-case recovery in 11 seconds.
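
The figures above can be cross-checked with simple arithmetic. The sketch below is not part of the library and its helper names are made up. It assumes that the Output rate counts only the pieces that are generated or recovered, which is an inference from the numbers rather than something stated here.

```c
#include <assert.h>

/* Seconds needed to process `megabytes` of data at `mb_per_s`. */
static double seconds_for(double megabytes, double mb_per_s)
{
    assert(mb_per_s > 0.0);
    return megabytes / mb_per_s;
}

/* Output rate when `losses` pieces are produced for every `pieces`
   equally sized input pieces (assumed interpretation of "Output"). */
static double output_rate(double input_mb_per_s, unsigned pieces, unsigned losses)
{
    assert(pieces > 0);
    return input_mb_per_s * (double)losses / (double)pieces;
}
```

With the 32768-piece run, seconds_for(2097.15, 451.168) is about 4.65 and seconds_for(2097.15, 190.471) is about 11.0, matching the summary line. For the 1000-piece run, output_rate(2194.94, 1000, 200) gives 438.988.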

More benchmark results are available here: https://github.com/catid/leopard/blob/master/Benchmarks.md

Comparisons:

There is another library FastECC by Bulat-Ziganshin that should have similar performance: https://github.com/Bulat-Ziganshin/FastECC. Both libraries implement the same high-level algorithm in {3}, while Leopard implements the newer polynomial basis GF(2^r) approach outlined in {1}, and FastECC uses complex finite fields modulo special primes. There are trade-offs that may make either approach preferable based on the application:

  • Older processors do not support SSSE3 and FastECC supports these processors better.
  • FastECC supports data sets above 65,536 pieces as it uses 32-bit finite field math.
  • Leopard does not require expanding the input or output data to make it fit in the field, so it can be more space efficient.

FFT Data Layout:

We pack the data into memory in this order:

[Recovery Data (Power of Two = M)] [Original Data] [Zero Padding out to 65536]

For encoding, the placement is implied instead of actual memory layout. For decoding, the layout is explicitly used.

Encoder algorithm:

The encoder is described in {3}. It takes O(K Log M) operations, where K is the original data size and M is up to twice the size of the recovery set.

In brief:

Recovery = FFT( IFFT(Data_0) xor IFFT(Data_1) xor ... )

It walks the original data M chunks at a time performing the IFFT. Each IFFT intermediate result is XORed together into the first M chunks of the data layout. Finally the FFT is performed.
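
The chunk walk described above can be pictured as a sequence of groups of M pieces, with the last group padded by implied zeros. This is only a sketch of the bookkeeping, not of the transforms themselves; the ifft_group type and both function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical description of one step of the chunk walk. */
typedef struct {
    unsigned first_piece;  /* index of the first original piece in the group */
    unsigned real_pieces;  /* pieces in the group that hold actual data */
    unsigned zero_pieces;  /* implied zero padding up to m pieces */
} ifft_group;

/* Number of groups of m pieces needed to cover k original pieces. */
static unsigned group_count(unsigned k, unsigned m)
{
    assert(m > 0 && (m & (m - 1)) == 0);  /* m must be a power of two */
    return (k + m - 1) / m;
}

/* Describe group g (0-based) of the walk over k pieces in steps of m. */
static ifft_group group_at(unsigned k, unsigned m, unsigned g)
{
    assert(g < group_count(k, m));
    ifft_group out;
    out.first_piece = g * m;
    unsigned remaining = k - out.first_piece;
    out.real_pieces = remaining < m ? remaining : m;
    out.zero_pieces = m - out.real_pieces;
    return out;
}
```

For K = 200 and M = 32 this gives 7 groups, and the last one holds 8 real pieces followed by 24 pieces of implied zero padding.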

Encoder optimizations:

  • The first IFFT can be performed directly in the first M chunks.
  • The zero padding can be skipped while performing the final IFFT. Unrolling is used in the code to accomplish both these optimizations.
  • The final FFT can also be truncated if the recovery set is not a power of 2. It is easy to truncate the FFT by ending the inner loop early.
  • The decimation-in-time (DIT) FFT is employed to calculate two layers at a time, rather than writing each layer out and reading it back in for the next layer of the FFT.

Decoder algorithm:

The decoder is described in {1}. It takes O(N Log N) operations, where N is up to twice the size of the original data as described below.

In brief:

Original = -ErrLocator * FFT( Derivative( IFFT( ErrLocator * ReceivedData ) ) )

Precalculations:

At startup initialization, FFTInitialize() precalculates FWT(L) as described by equation (92) in {1}, where L = Log[i] for i = 0..Order, Order = 256 or 65536 for FF8/16. This is stored in the LogWalsh vector.

It also precalculates the FFT skew factors (s_i) as described by equation (28). This is stored in the FFTSkew vector.

The workspace needs N data chunks, where N is the smallest power of two at or above M + K. K is the original data size and M is the smallest power of two at or above the recovery data size. For example, for K = 200 pieces of data and 10% redundancy, there are 20 redundant pieces, which rounds up to M = 32. M + K = 232 pieces, so N rounds up to 256.
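
The workspace sizing can be written out as a short sketch. The function names are hypothetical, and "next power of two" is taken to mean the smallest power of two at or above the value, which is an assumption.

```c
#include <assert.h>

/* Smallest power of two that is at or above n. */
static unsigned next_pow2(unsigned n)
{
    assert(n >= 1 && n <= 65536u);
    unsigned p = 1;
    while (p < n) p <<= 1;
    return p;
}

/* Workspace size N, in chunks, for k original and r recovery pieces. */
static unsigned workspace_chunks(unsigned k, unsigned r)
{
    unsigned m = next_pow2(r);   /* recovery region, rounded up */
    return next_pow2(m + k);     /* whole workspace, rounded up */
}
```

With K = 200 and 20 recovery pieces, next_pow2(20) is 32 and workspace_chunks(200, 20) is 256, as in the example above.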

Online calculations:

At runtime, the error locator polynomial is evaluated using the Fast Walsh-Hadamard transform as described in {1} equation (92).
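
For readers unfamiliar with the transform, here is the textbook in-place Fast Walsh-Hadamard transform over plain integers. It is a sketch for orientation only: the library applies its transform to field logarithms, and those arithmetic details are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* In-place Fast Walsh-Hadamard transform of n integers, n a power of two.
   Each pass combines elements that are h apart into their sum and
   difference, and h doubles from 1 up to n/2. */
static void fwht(int32_t *data, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    for (size_t h = 1; h < n; h <<= 1) {
        for (size_t i = 0; i < n; i += 2 * h) {
            for (size_t j = i; j < i + h; ++j) {
                int32_t a = data[j];
                int32_t b = data[j + h];
                data[j] = a + b;
                data[j + h] = a - b;
            }
        }
    }
}
```

The transform takes n log2(n) additions and subtractions. Applying it twice returns the input scaled by n, so the same routine serves as its own inverse up to that factor.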

At runtime the data is explicitly laid out in workspace memory like this:

[Recovery Data (Power of Two = M)] [Original Data (K)] [Zero Padding out to N]

Data that was lost is replaced with zeroes. Data that was received, including recovery data, is multiplied by the error locator polynomial as it is copied into the workspace.
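
The layout can be expressed as a mapping from workspace index to region. This is a sketch with hypothetical names. It only describes positions and does not perform the multiplication by the error locator polynomial. How the library fills recovery slots between the actual recovery count and M is not covered here.

```c
#include <assert.h>

/* Hypothetical labels for the regions of the decoder workspace. */
typedef enum {
    SLOT_RECOVERY,         /* one of the r recovery pieces */
    SLOT_RECOVERY_UNUSED,  /* recovery region beyond r, up to m */
    SLOT_ORIGINAL,         /* one of the k original pieces */
    SLOT_PADDING           /* zero padding out to n */
} slot_kind;

/* Region of workspace slot `index` for r recovery pieces, a recovery
   region of m slots, k original pieces and n slots in total. */
static slot_kind classify_slot(unsigned index, unsigned r, unsigned m,
                               unsigned k, unsigned n)
{
    assert(r <= m && m + k <= n && index < n);
    if (index < r) return SLOT_RECOVERY;
    if (index < m) return SLOT_RECOVERY_UNUSED;
    if (index < m + k) return SLOT_ORIGINAL;
    return SLOT_PADDING;
}

/* Workspace slot that holds original piece j. */
static unsigned original_slot(unsigned j, unsigned m, unsigned k)
{
    assert(j < k);
    return m + j;
}
```

For K = 200, 20 recovery pieces, M = 32 and N = 256, original piece 0 sits in slot 32, original piece 199 in slot 231, and slots 232 to 255 are padding.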

The IFFT is applied to the entire workspace of N chunks. Since the IFFT starts with pairs of inputs and doubles in width at each iteration, the IFFT is optimized by skipping zero padding at the end until it starts mixing with non-zero data.

The formal derivative is applied to the entire workspace of N chunks.

The FFT is applied to the entire workspace of N chunks. The FFT is optimized by only performing intermediate calculations required to recover lost data. Since it starts wide and ends up working on adjacent pairs, at some point the intermediate results are not needed for data that will not be read by the application. This optimization is implemented by the ErrorBitfield class.
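
The idea behind skipping unneeded FFT work can be illustrated with a simplified stand-in. This is not the library's ErrorBitfield class, and the function names are hypothetical. It only shows the question being asked at each layer: does this block of the workspace contain any lost position?

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if any lost position lies in [start, start + width). */
static bool block_needed(const bool *lost, size_t n, size_t start, size_t width)
{
    assert(start + width <= n);
    for (size_t i = start; i < start + width; ++i) {
        if (lost[i]) return true;
    }
    return false;
}

/* Number of width-sized blocks that still need work at one FFT layer. */
static size_t blocks_needed(const bool *lost, size_t n, size_t width)
{
    assert(width > 0 && n % width == 0);
    size_t count = 0;
    for (size_t start = 0; start < n; start += width) {
        if (block_needed(lost, n, start, width)) ++count;
    }
    return count;
}
```

With a single lost position in a workspace of 8, only one block needs work at every width, so the fraction of blocks that can be skipped grows as the blocks get narrower. This sketch scans the flags each time, which a real implementation would avoid by precomputing the answers.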

Finally, only recovered data is multiplied by the negative of the error locator polynomial as it is copied into the front of the workspace for the application to retrieve.

Finite field arithmetic optimizations:

For faster finite field multiplication, large tables are precomputed and applied during encoding/decoding on 64 bytes of data at a time using SSSE3 or AVX2 vector instructions and the ALTMAP approach from Jerasure.
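
The table approach from {4} can be emulated in scalar C for an 8-bit field. Multiplication by a constant is linear over XOR, so a byte can be split into two 4-bit halves that are each looked up in a 16-entry table. PSHUFB performs these lookups on 16 bytes at a time with SSSE3 or 32 bytes with AVX2. The sketch below is not the library's code: the names are hypothetical, the reduction polynomial 0x11D is an assumption chosen for illustration, and the ALTMAP layout for 16-bit fields is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise multiply in GF(2^8). The polynomial 0x11D is an assumption. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    unsigned acc = 0, x = a;
    for (unsigned bit = 0; bit < 8; ++bit) {
        if (b & (1u << bit)) acc ^= x;   /* add a * x^bit */
        x <<= 1;                         /* multiply by x */
        if (x & 0x100u) x ^= 0x11Du;     /* reduce back to 8 bits */
    }
    return (uint8_t)acc;
}

/* Products of a constant with every low and every high 4-bit half. */
typedef struct {
    uint8_t lo[16];
    uint8_t hi[16];
} mul_tables;

static mul_tables make_tables(uint8_t c)
{
    mul_tables t;
    for (unsigned n = 0; n < 16; ++n) {
        t.lo[n] = gf256_mul(c, (uint8_t)n);
        t.hi[n] = gf256_mul(c, (uint8_t)(n << 4));
    }
    return t;
}

/* dst[i] = c * src[i] using two table lookups and one XOR per byte. */
static void mul_mem(uint8_t *dst, const uint8_t *src, size_t bytes,
                    const mul_tables *t)
{
    assert(bytes % 64 == 0);
    for (size_t i = 0; i < bytes; ++i) {
        dst[i] = t->lo[src[i] & 0x0F] ^ t->hi[src[i] >> 4];
    }
}
```

Building the tables costs 32 field multiplications per constant, after which every byte of data needs only lookups and an XOR.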

Addition in this finite field is XOR, and a vectorized memory XOR routine is also used.
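
A portable version of the memory XOR might look like the sketch below. It is not the library's routine and the name xor_mem is hypothetical here. It works on 64-bit words and leaves it to the compiler to vectorize the loop further.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* dst ^= src over `bytes` bytes, one 64-bit word at a time.
   memcpy keeps the loads and stores valid for any alignment. */
static void xor_mem(void *dst, const void *src, size_t bytes)
{
    assert(bytes % 64 == 0);
    uint8_t *d = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;
    for (size_t i = 0; i < bytes; i += sizeof(uint64_t)) {
        uint64_t a, b;
        memcpy(&a, d + i, sizeof a);
        memcpy(&b, s + i, sizeof b);
        a ^= b;
        memcpy(d + i, &a, sizeof a);
    }
}
```

Because XOR is its own inverse, applying xor_mem twice with the same source restores the destination, which is why addition and subtraction coincide in this field.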

References:

This library implements an MDS erasure code introduced in this paper:

    {1} S.-J. Lin, T. Y. Al-Naffouri, Y. S. Han, and W.-H. Chung,
    "Novel Polynomial Basis with Fast Fourier Transform
	and Its Application to Reed-Solomon Erasure Codes"
    IEEE Trans. on Information Theory, pp. 6284-6299, November, 2016.
    {2} D. G. Cantor, "On arithmetical algorithms over finite fields",
    Journal of Combinatorial Theory, Series A, vol. 50, no. 2, pp. 285-300, 1989.
    {3} Sian-Jheng Lin, Wei-Ho Chung, "An Efficient (n, k) Information
    Dispersal Algorithm for High Code Rate System over Fermat Fields,"
    IEEE Commun. Lett., vol.16, no.12, pp. 2036-2039, Dec. 2012.
    {4} Plank, J. S., Greenan, K. M., Miller, E. L., "Screaming fast Galois Field
    arithmetic using Intel SIMD instructions."  In: FAST-2013: 11th Usenix
    Conference on File and Storage Technologies, San Jose, 2013

Some papers are mirrored in the /docs/ folder.

Credits

Inspired by discussion with:

Software by Christopher A. Taylor

Please reach out if you need support or would like to collaborate on a project.

leopard's People

Contributors

catid, cskiraly, gbletr42, musalbas, nemequ, thebluematt

leopard's Issues

New benchmark for ECC libraries

https://github.com/Bulat-Ziganshin/ECC-Benchmark

I made an ECC benchmark that mainly includes your libraries :) The reason is that I don't know of better alternatives to them, so if anyone has better ideas, I'm all ears.

Currently it lacks a better build system (just compile.cmd works for me), as well as an automatic data gathering/publishing system (I am working on it).

Nevertheless, to me it looks better than your own comparison of your own libraries, so you may like to link to it from your readme and/or publish any data gathered this way yourself.

BTW, the benchmark can be built with -fopenmp too.

GCC ≤ 4.6 build fails

See https://travis-ci.org/nemequ/leopard/jobs/395895113

/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1480:41: error: a brace-enclosed initializer is not allowed here before ‘{’ token
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1480:42: sorry, unimplemented: non-static data member initializers
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1480:42: error: ‘constexpr’ needed for in-class initialization of static data member ‘Words’ of non-integral type
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1484:46: error: a brace-enclosed initializer is not allowed here before ‘{’ token
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1484:47: sorry, unimplemented: non-static data member initializers
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1484:47: error: ‘constexpr’ needed for in-class initialization of static data member ‘BigWords’ of non-integral type
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1487:43: error: a brace-enclosed initializer is not allowed here before ‘{’ token
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1487:44: sorry, unimplemented: non-static data member initializers
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1487:44: error: ‘constexpr’ needed for in-class initialization of static data member ‘BiggestWords’ of non-integral type
/home/travis/build/nemequ/leopard/LeopardFF16.cpp: In member function ‘void leopard::ff16::ErrorBitfield::Set(unsigned int)’:
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1492:9: error: ‘Words’ was not declared in this scope
/home/travis/build/nemequ/leopard/LeopardFF16.cpp: In member function ‘bool leopard::ff16::ErrorBitfield::IsNeeded(unsigned int, unsigned int) const’:
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1504:26: error: ‘BiggestWords’ was not declared in this scope
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1509:26: error: ‘BigWords’ was not declared in this scope
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1511:22: error: ‘Words’ was not declared in this scope
/home/travis/build/nemequ/leopard/LeopardFF16.cpp: In member function ‘void leopard::ff16::ErrorBitfield::Prepare()’:
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1528:24: error: ‘Words’ was not declared in this scope
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1545:32: error: ‘Words’ was not declared in this scope
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1551:9: error: ‘BigWords’ was not declared in this scope
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1563:28: error: ‘BigWords’ was not declared in this scope
/home/travis/build/nemequ/leopard/LeopardFF16.cpp:1569:5: error: ‘BiggestWords’ was not declared in this scope
make[2]: *** [CMakeFiles/libleopard.dir/LeopardFF16.cpp.o] Error 1
make[2]: Leaving directory `/home/travis/build/nemequ/leopard/build'
make[1]: *** [CMakeFiles/libleopard.dir/all] Error 2
make[1]: Leaving directory `/home/travis/build/nemequ/leopard/build'
make: *** [all] Error 2

4.7+ works.

recovery_count > original_count

As the title says, I have a small original count and a large recovery count, and need to pad up to 32k/32k. The complexity is not pretty (though still the best one can get for an MDS code).
Would it be possible to arrange the field so as to avoid the overhead?

I can see some attempts in encodeL(), but indeed, as the prototype code says, it doesn't seem to work, or would need a specialized decoder. Is this an inherent limitation of the algorithm, or is it just not implemented?

No description how to restore missed original data after decoding

I've looked into benchmark.cpp and the only place decode_work_data is used is at

if (!CheckPacket(decode_work_data[i], params.buffer_bytes))

Thus I conjecture that decode_work_data[i] has a CRC32 in the first 4 bytes, then the length of the data in the next 4 bytes, and then the data itself (which corresponds to the missed original_data[i]) in the remaining (length) bytes(?).

But if this hypothesis is correct then we should have extra 8 bytes room in each decode_work_data[i], but the code uses buffer_bytes only (exactly as for any other buffer).

If the hypothesis is wrong then where should we extract missed original_data[i] from?

Progressive/incremental encoding ?

Hi, is it possible to do progressive/incremental encoding with this? It would significantly reduce the memory usage if we could feed in one block of the input file at a time instead of having to keep the whole file in memory.

Allow more parity symbols aka losses than data

We've run into some hiccups making this work with more parity symbols than data symbols.

There is a check at leopard.cpp:135 to prevent this, and if it is removed it segfaults

#0  leopard::SIMDSafeFree (ptr=0xf8913455fd9fcc4a) at /tmp/leopard/tests/../LeopardCommon.h:480
#1  Benchmark (params=...) at /tmp/leopard/tests/benchmark.cpp:452

due to the benchmark/test code assuming that all lost symbols come from the data symbols at benchmark.cpp:452. I suppose test cases should cover losing only some of the original data, but it should presumably be straightforward to (randomly?) replace fewer than "losses" data symbols with parity symbols.

We also run into some strange slowdown in leo_encode when adjusting the code to produce more parity symbols than data symbols. At first blush, there is seemingly just some runaway code that fills up many gigabytes of memory, which sounds strange since allocations seemingly occur outside leo_encode.

I've not yet fully understood the algorithm so I'd love your input on whether you think either the encode or decode algorithm should run into any performance or other hiccups when handling more parity chunks than data chunks. Thoughts?

Typo in the library name

Traditionally libraries are named lib[name], but in your CMakeLists.txt you have the library name as libleopard, which cmake turns into liblibleopard. If possible, I feel it would be good to change it to leopard, so it follows the standard naming.

Decoder to correct both erasure and error

I found a new paper about these FFT-based Reed-Solomon codes. It presents erasure-and-error decoding of an (n, k) RS code. The decoding complexity is only O(n log n + (n − k) log^2 (n − k)). For reference, the complexity of the current erasure decoding is O(n log n).

"On fast Fourier transform-based decoding of Reed-Solomon codes"
by Yunghsiang S. Han, Chao Chen, Sian-Jheng Lin, Baoming Bai
http://ct.ee.ntust.edu.tw/IJAHUC2021.pdf

I am posting the link here as a notice; if you already know the paper, just ignore this issue. I cannot follow the theory in the paper myself. Error correction would rarely be needed, since erasure correction is enough in most cases, but it would be a welcome addition to the Leopard-RS library.

Question regarding the API

I'm writing a go-wrapper for this implementation; BTW the C wrapper makes that really convenient, thanks for that!

I struggle to completely understand the API though. What I'm trying to do is quite basic: given n original data buffers or symbols, I want to generate n "parity symbols" such that I can lose any n (of the now 2*n total) symbols and still recover the data.

I mainly experimented with the API in benchmark.cpp: OK, sounds like I would need to call leo_encode with original_count == recovery_count for that, right? leo_encode_work_count returns 2*original_count for these params, and calling leo_encode yields encode_work_data with the size 2*original_count as expected. So far so good. What is surprising though: I would have expected encode_work_data to contain the original data too; instead it seems to contain 2*original_count recovery or parity symbols. So is there no way I can achieve what I described above using the public C API, or am I using recovery_count wrong?

Thanks in advance :-)

Speeding up mulE

The mulE description says it returns a*exp[b] over GF(2^r), so it is a simple multiplication in GF(2^r) that can be optimized with PSHUFB. They just precompute logarithms of the twiddle factors to speed up their implementation.

Port to ARM NEON

I had one contact who may have found this useful on cellphones so I should port it to ARM NEON

Test(s) which terminate

AFAICT both bench_leopard and experiment_leopard just go on forever, or at least for a very long time. It would be nice to have something which could be run on CI.
