bulat-ziganshin / fastecc

Reed-Solomon coder computing one million parity blocks at 1 GB/s. O(N*log(N)) algo employing FFT.

License: Apache License 2.0

Batchfile 4.43% C++ 90.95% C 0.60% Shell 1.80% Makefile 2.21%
reed-solomon galois-field number-theoretic-transform error-correcting-codes erasure-codes forward-error-correction

FastECC's Introduction

FastECC implements an FFT-based O(N*log(N)) Reed-Solomon coder, running at 1.2 GB/s on an i7-4770 for (n,k)=(2^20,2^19), i.e. calculating 524288 parity blocks from 524288 data blocks.

It's also pretty fast for small orders, outperforming the previously fastest Reed-Solomon library, Intel ISA-L, once there are 64+ parity blocks.

The encoding and decoding algorithms that FastECC implements are described in the paper An Efficient (n,k) Information Dispersal Algorithm based on Fermat Number Transforms by Sian-Jheng Lin and Wei-Ho Chung. FastECC v0.1 implements only encoding, so it isn't yet ready for real use.

Almost all existing Reed-Solomon ECC implementations employ matrix multiplication and thus have O(N^2) speed behavior, i.e. they produce N parity blocks in O(N^2) time, spending O(N) time per block. For example, the fastest implementation I know, MultiPar, computes 1000 parity blocks at ~50 MB/s, but only ~2 MB/s in its maximum configuration of 32000 parity blocks. And computations in GF(2^32), implemented the same way, would build one million parity blocks at just 50 KB/s.

One of the few exceptions is the closed-source RSC32 by persicum with O(N*log(N)) speed, i.e. it spends O(log(N)) time per parity block. Its speed with a million parity blocks is 100 MB/s, i.e. it computes one million 4 KB parity blocks from one million data blocks (processing 8 GB overall) in just 80 seconds. Note that all speeds mentioned here are measured on an i7-4770, employing all features available in a particular program - including multi-threading, SIMD and x64 support.

FastECC is an open-source library implementing an O(N*log(N)) encoding algorithm. It computes a million parity blocks at 1.2 GB/s. Future versions will implement decoding, which is also O(N*log(N)), although 1.5-3 times slower than encoding. The current implementation is limited to 2^20 blocks; removing this limit is the main priority for future work, aside from the decoder implementation. And if you are interested in smaller configs, look at the small NTT benchmarks - FastECC outperforms quadratic algorithms (ISA-L, CM256 and MultiPar) starting from 32-64 parity blocks.

Leopard is a newer library, faster than FastECC, especially for small orders. It implements a similar algorithm, described in a newer paper by the same authors: Lin, Han and Chung, "Novel Polynomial Basis and Its Application to Reed-Solomon Erasure Codes".

You can also find a few research-grade libraries with O(N*log(N)) speed.

For comparison - Wirehair, the best open-source LDPC codec I know, is O(N) and already as fast as FastECC, but could be made several times faster using SSE1. It's limited to 64000 source blocks, but the number of parity blocks can be arbitrary. Being an LDPC codec, it's not MDS, but the chance that it needs even a single extra block to recover is as low as 0.1%. Moreover, it works with binary data, so there's no need for recoding and no extra space to store "overflow" bits.

All O(N*log(N)) Reed-Solomon implementations I'm aware of use fast transforms like FFT or FWT. FastECC employs a fast Number-Theoretic Transform (NTT), which is just an FFT over an integer field or ring. Let's see how it works. Note that below, by a length-N polynomial I mean any polynomial with order < N.
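
As a concrete reference for the NTT used below, here is a minimal sketch over GF(0xFFF00001), FastECC's default field. It is illustrative only - FastECC's actual kernels are heavily optimized - but since p-1 = 2^20 * 3^2 * 5 * 7 * 13, this field really does support power-of-two transforms up to size 2^20, matching the current block-count limit.

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

using u32 = uint32_t;
using u64 = uint64_t;
const u32 P = 0xFFF00001;   // prime: 2^32 - 2^20 + 1

u32 mulmod(u32 a, u32 b) { return u32((u64)a * b % P); }
u32 addmod(u32 a, u32 b) { u64 s = (u64)a + b; return u32(s >= P ? s - P : s); }
u32 submod(u32 a, u32 b) { return a >= b ? a - b : u32((u64)a + P - b); }
u32 powmod(u32 a, u64 e) {
    u32 r = 1;
    for (; e; e >>= 1, a = mulmod(a, a))
        if (e & 1) r = mulmod(r, a);
    return r;
}

// g generates GF(p)* iff g^((p-1)/q) != 1 for every prime q dividing
// p-1 = 2^20 * 3^2 * 5 * 7 * 13 (illustrative; a real coder hard-codes a root)
u32 find_generator() {
    const u32 qs[] = {2, 3, 5, 7, 13};
    for (u32 g = 2;; g++) {
        bool ok = true;
        for (u32 q : qs)
            if (powmod(g, (P - 1) / q) == 1) { ok = false; break; }
        if (ok) return g;
    }
}

// In-place iterative NTT: computes y[k] = sum_j a[j] * w^(j*k) mod p,
// where w is a primitive n-th root of unity and n is a power of two
void ntt(std::vector<u32>& a, u32 w) {
    size_t n = a.size();
    for (size_t i = 1, j = 0; i < n; i++) {        // bit-reversal permutation
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    for (size_t len = 2; len <= n; len <<= 1) {    // butterfly passes
        u32 wlen = powmod(w, n / len);             // primitive len-th root
        for (size_t i = 0; i < n; i += len) {
            u32 wk = 1;
            for (size_t j = 0; j < len / 2; j++) {
                u32 u = a[i + j], v = mulmod(a[i + j + len / 2], wk);
                a[i + j]           = addmod(u, v);
                a[i + j + len / 2] = submod(u, v);
                wk = mulmod(wk, wlen);
            }
        }
    }
}

int main() {
    const size_t n = 8;
    u32 g    = find_generator();
    u32 w    = powmod(g, (P - 1) / n);     // primitive n-th root of unity
    u32 winv = powmod(w, (u64)n - 1);      // w^-1, for the inverse transform
    u32 ninv = powmod((u32)n, P - 2);      // 1/n mod p (Fermat inverse)

    std::vector<u32> a = {1, 2, 3, 4, 5, 6, 7, 8}, b = a;
    ntt(b, w);                             // forward NTT
    ntt(b, winv);                          // inverse NTT = NTT with w^-1 ...
    for (auto& x : b) x = mulmod(x, ninv); // ... plus scaling by 1/n
    puts(a == b ? "NTT roundtrip OK" : "mismatch!");
}
```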

For any given set of N points (with distinct x-coordinates), exactly one length-N polynomial goes through all of them. So we can consider the N input words as values of some length-N polynomial at N fixed points; exactly one such polynomial exists.

A typical Reed-Solomon encoder computes the coefficients of this unique polynomial (so-called polynomial interpolation), evaluates the polynomial at M other fixed points (polynomial evaluation) and outputs these M words as the resulting parity data.

At the decoding stage, we may receive any subset of N values out of those N source data words and M computed parity words. But since they all belong to the original length-N polynomial, we can recover this polynomial from the N known points and then compute its values at other points, in particular at those N points assigned to the original data, thus restoring them.

Usually, Reed-Solomon libraries implement encoding as multiplication by a Vandermonde matrix (an O(N^2) algorithm) and decoding as multiplication by the matrix inverse.
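
For scale, here is what the quadratic approach amounts to - a sketch only (it takes the data directly as polynomial coefficients and evaluates at the arbitrary extra points x = N, N+1, ..., one of several possible conventions), reusing the types and helpers from the NTT sketch above:

```cpp
// O(N*M) baseline: evaluate the data polynomial at M extra points with
// Horner's rule - one O(N) pass per parity word, exactly the cost of one
// row of a Vandermonde matrix-vector product
std::vector<u32> encode_quadratic(const std::vector<u32>& data, size_t M) {
    std::vector<u32> parity(M);
    for (size_t j = 0; j < M; j++) {
        u32 x = u32(data.size() + j);            // j-th extra evaluation point
        u32 acc = 0;
        for (size_t i = data.size(); i-- > 0;)   // Horner: acc = acc*x + c[i]
            acc = addmod(mulmod(acc, x), data[i]);
        parity[j] = acc;
    }
    return parity;
}
```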

But with a special choice of fixed points we can perform polynomial interpolation and evaluation at these points in O(N*log(N)) time, using NTT for evaluation and inverse NTT for interpolation. So the fast encoding (sketched in code after the list below) is as simple as:

  • consider the N input words as values of a length-N polynomial at N special points
  • compute the polynomial coefficients in O(N*log(N)) time using inverse NTT
  • evaluate the polynomial at M other special points in O(M*log(M)) time using NTT
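
Putting the pieces together, a sketch of systematic (n,k) = (2N,N) encoding, again reusing the helpers from the NTT sketch above. The point layout - data at even-indexed 2N-th roots of unity, parity at odd-indexed ones - is an illustrative choice, not necessarily FastECC's exact one:

```cpp
// (2N,N) encoding sketch: interpret the N data words as values of a
// length-N polynomial at the N-th roots of unity, interpolate with an
// inverse NTT, then evaluate at all 2N-th roots with a zero-padded NTT
std::vector<u32> encode_fast(const std::vector<u32>& data) {
    size_t N = data.size();                    // must be a power of two
    u32 g = find_generator();
    u32 v = powmod(g, (P - 1) / (2 * N));      // primitive 2N-th root
    u32 w = mulmod(v, v);                      // primitive N-th root (v^2)

    std::vector<u32> coef = data;              // values f(w^0)..f(w^(N-1))
    ntt(coef, powmod(w, (u64)N - 1));          // inverse NTT (root w^-1) ...
    u32 ninv = powmod((u32)N, P - 2);
    for (auto& x : coef) x = mulmod(x, ninv);  // ... scaled by 1/N

    coef.resize(2 * N, 0);                     // zero-pad to degree < 2N
    ntt(coef, v);                              // evaluate at v^0..v^(2N-1)

    std::vector<u32> parity(N);                // even indexes reproduce the
    for (size_t i = 0; i < N; i++)             // data (systematic code); odd
        parity[i] = coef[2 * i + 1];           // indexes are the parity
    return parity;
}
```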

Decoding is more involved. We have N words representing the values of a length-N polynomial at some N points. Since we can't choose these points, we can't simply use iNTT to compute the polynomial coefficients. So it's a generic polynomial interpolation problem, solvable in O(N*log(N)^2) time.

But this specific polynomial interpolation problem has a faster solution. Indeed, the decoder knows the values of the length-N polynomial f(x) at N points a[i], but lost its values at M erasure points e[i]. Let's build the "erasure locator" polynomial l(x) = (x-e[1])*...*(x-e[M]) and consider the product p(x) = f(x)*l(x). We have order(p) = order(f)+order(l) < N+M and l(e[i]) = 0, so by computing the values l(a[i]) and multiplying f(x) and l(x) pointwise in the value space, we obtain the polynomial p(x) of order < N+M from its values at N+M points: p(a[i]) = f(a[i])*l(a[i]) and p(e[i]) = f(e[i])*l(e[i]) = 0. I.e., p(x) is fully defined.

Now we just need to perform the polynomial long division f(x) = p(x)/l(x), which is an O(N*log(N)) operation.

But there is an even faster algorithm! Let's take the formal derivative p'(x) = f'(x)*l(x) + f(x)*l'(x) and evaluate it at each e[i]: p'(e[i]) = f'(e[i])*l(e[i]) + f(e[i])*l'(e[i]) = f(e[i])*l'(e[i]), since l(e[i]) = 0. Moreover, l'(e[i]) != 0, so we can recover all the lost f(e[i]) values by simple scalar divisions: f(e[i]) = p'(e[i]) / l'(e[i])!

So the entire algorithm, with its two nontrivial steps sketched in code after this list, is:

  • transform the polynomials p and l into the coefficient space
  • compute their formal derivatives p' and l'
  • transform p' and l' back into the value space
  • evaluate each f(e[i]) = p'(e[i]) / l'(e[i])
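
A sketch of those two steps, with the same helpers as in the NTT sketch above:

```cpp
// Formal derivative in the coefficient space: for c(x) = sum c[i]*x^i,
// c'(x) = sum i*c[i]*x^(i-1). One O(N) pass.
std::vector<u32> formal_derivative(const std::vector<u32>& c) {
    std::vector<u32> d(c.empty() ? 0 : c.size() - 1);
    for (size_t i = 1; i < c.size(); i++)
        d[i - 1] = mulmod(c[i], u32(i % P));
    return d;
}

// Final recovery step, once p' and l' are back in the value space:
// each lost word costs one modular division (inverse via Fermat)
u32 recover_value(u32 dp_at_e, u32 dl_at_e) {
    return mulmod(dp_at_e, powmod(dl_at_e, P - 2));  // p'(e[i]) / l'(e[i])
}
```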

When M <= N, the first operation on p is an iNTT(2N) and the third operation on p' is an NTT(N), since we need the values of p'(x) only at the N points corresponding to the original data; the rest is either O(N) or operations on l(x) that are performed only once. So overall, decoding is 1.5-3 times slower than the iNTT(N)+NTT(M) required for encoding.

FastECC is absolutely useless for RAID storage (such as Hadoop). With RAID, when one sector is overwritten, the RAID software has to read all other data sectors in the same shard in order to recompute the shard's parity sectors, and then overwrite all of them. So RAID software developers are looking for solutions that let them read/write fewer sectors on each operation (such as pyramid codes), rather than the opposite.

FastECC is probably useless for any hardware controllers (SSD, Ethernet, LTE, DVB...). These controllers work with analog signals and tend to use soft decoders to extract as much data as possible (a soft decoder understands that data received as 7 has a better chance of decoding to 8 than to 2, and can even deal with something like 7.4, which is more probably 8 than 6). Soft decoders are absolutely beyond my math skills, so if someone builds a soft decoder, it will be some other great library, not FastECC (and most probably not free).

Also, FastECC implements only erasure decoding. I have no idea how to implement error decoding, nor do I plan to learn.

Still, some applications remain. PAR3 is one of them - it's still interesting for some people, although not many. Various communication applications and P2P data storage are also frequently mentioned in discussions.


I made a quick speed comparison and found that FastECC is faster than the 16-bit RS codec in MultiPar starting from ~32 parity blocks. 8-bit RS codecs such as ISA-L should be even faster than 16-bit ones. And with 20% redundancy, 32 parity blocks means 160 data blocks, close to the maximum possible for 8-bit RS. So it seems that FastECC territory starts right where 8-bit codec territory ends: if you need more than 256 data+parity blocks, FastECC should be faster than any 16-bit RS coder; otherwise, 8-bit ISA-L or CM256 is preferable.

Moreover, FastECC is free from the patent restrictions that affect any fast RS codec employing PSHUFB (i.e. SSSE3), including Leopard. And the codecs that avoid PSHUFB are several times slower than MultiPar, so they stand even less of a chance.

There is a great alternative to FastECC - the Wirehair library - but AFAIR it may also be covered by patents. It's already as fast as FastECC, but could be made several times faster using SSE1. It's limited to 64000 source blocks, but the number of parity blocks can be arbitrary. Being an LDPC codec, it's not MDS, but the chance that it needs even a single extra block to recover is as low as 0.1%.

Unlike Wirehair and MultiPar, FastECC doesn't work directly with arbitrary binary data - it works in GF(p) or GF(p^2); the default is GF(0xFFF00001). This means that input data have to be converted into the 0..0xFFF00000 range, which in turn means that parity sectors are slightly longer than input ones (e.g. 4096-byte data sectors and 4100-byte parity sectors). This also limits its applications.
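
To make that concrete, here is one simple conversion scheme - an illustration only, not FastECC's actual packing (which achieves lower overhead): reduce each 32-bit word mod p and keep a sidecar bit marking which words overflowed.

```cpp
// Since p = 0xFFF00001 > 2^32 - 2^20, a 32-bit word w either is already a
// valid field element (w < p) or satisfies w - p < 2^20, so one extra bit
// per word suffices for a lossless round trip
#include <cstdint>
#include <vector>

const uint32_t P = 0xFFF00001;

struct GfData {
    std::vector<uint32_t> words;     // field elements, each < P
    std::vector<bool>     overflow;  // per-word flag: original was >= P
};

GfData to_field(const std::vector<uint32_t>& raw) {
    GfData out;
    for (uint32_t w : raw) {
        out.overflow.push_back(w >= P);
        out.words.push_back(w >= P ? w - P : w);
    }
    return out;
}

std::vector<uint32_t> from_field(const GfData& d) {
    std::vector<uint32_t> raw(d.words.size());
    for (size_t i = 0; i < d.words.size(); i++)
        raw[i] = d.overflow[i] ? d.words[i] + P : d.words[i];
    return raw;
}
```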

So, overall, FastECC should replace any use of 16-bit RS codecs, while LDPC and 8-bit RS codecs will keep their niches.

Roadmap

  • Encoder (version 0.1)
  • Decoder (version 0.2)
  • Public API (see issue #1)
  • SSE2/AVX2/NEON-intrinsics with runtime selection of scalar/sse2/avx2/neon code path
  • NTT of sizes!=2^n: NTT5/NTT7/NTT13, PFA and MFA+PFA combined algo
  • Optimizations for asymmetric cases (n!=2k)
  • Optimized code for GF(2^31-1), GF(2^61-1) and GF(p^2)

FastECC's People

Contributors

bulat-ziganshin, spikebike


FastECC's Issues

Alternatives to FastECC

This is a standing topic for discussion and comparison of other ECC libraries.

Fast GF(256) libraries with O(N^2) speed:

  • ISA-L: heavily hand-optimized asm code for SSSE3, AVX2 and AVX512. No support for non-x86 or pre-SSSE3 architectures
  • CM256: fast SSSE3 and plain C implementations

Fast O(N) LDPC algorithm:

  • Wirehair: the best open-source LDPC codec I know - as fast as FastECC, limited to 64000 source blocks

I will add more libraries and comparison later...

v0.2 decoder

Hi Bulat,

I am interested in this project and would love to use it for a database shard application. I was wondering: what is the status of the decoder?

If this project is inactive, I can read up on the papers and attempt to write the code myself.

Collaboration for a Python library

Hey there,

I came from this post you posted on StackOverflow (thank you very much!). Your project is very interesting and I would be very interested in including it in my pyFileFixity project (a sort of MultiPar, but aimed at maximum robustness and MIT licensed), and potentially porting it to Python (you will be co-author ofc), as your readme explanations and code comments are very clear.

Of course, the Python port would be slower, but I think it can still retain most of the speed by using specialized linear algebra libraries.

For the port, I will probably need to ask you some math questions when I get to implementing it. Would you be interested in this collaboration? I would start after you finish implementing the decoding and the public API.

In any case, I'm really looking forward to the development of this project, it is VERY interesting!

No make/cmake/configure file? Only Visual Studio supported?

Is this code supposed to be Microsoft Windows specific? I was expecting to see a cmake, make, configure or something similar to allow compiling on other platforms. Currently it looks like the compile script assumes Microsoft Visual Studio.

Public API development

This topic is dedicated to the construction of the public API that the library should provide.

I think it should meet the following criteria, in order of importance:

  • minimize memory usage
  • maximize speed
  • provide a C-compatible, exception-free and alloc-free API that can be delivered as a dll/so
  • flexibility
  • ease of use

My initial proposal (see the header sketch after the list):

  • RS codec
    • CreateEncoderContext/CreateDecoderContext(SrcBlocks,EccBlocks,BlockSize)
    • AllocMemory(ctx), GetMemorySize(ctx), SetMemoryBuffer(ctx,buf), GetMemoryBuffer(ctx)
    • Encode(ctx,srcdata,eccdata)
    • Decode(ctx,alive_data,alive_indexes,wanted_data,wanted_indexes)
  • GF(p) <-> binary conversion
    • int encode(size,srcdata,outdata), decode(size,srcdata,int,outdata)
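
To visualize the proposal, here is how it could look as a C-compatible header. All names and signatures below merely spell out the bullets above; none of this is an existing FastECC interface, and the types are illustrative.

```cpp
#include <stddef.h>

extern "C" {

typedef struct RSContext RSContext;   // opaque codec context

// context creation: (SrcBlocks, EccBlocks, BlockSize)
RSContext* CreateEncoderContext(size_t src_blocks, size_t ecc_blocks, size_t block_size);
RSContext* CreateDecoderContext(size_t src_blocks, size_t ecc_blocks, size_t block_size);

// caller-managed memory, keeping the library itself alloc-free
int    AllocMemory(RSContext* ctx);             // optional convenience
size_t GetMemorySize(const RSContext* ctx);
int    SetMemoryBuffer(RSContext* ctx, void* buf);
void*  GetMemoryBuffer(const RSContext* ctx);

// bulk operations; return 0 on success, nonzero error code otherwise
int Encode(RSContext* ctx, const void* const* srcdata, void* const* eccdata);
int Decode(RSContext* ctx, const void* const* alive_data, const size_t* alive_indexes,
           void* const* wanted_data, const size_t* wanted_indexes);

// GF(p) <-> binary conversion
int GfEncode(size_t size, const void* srcdata, void* outdata);
int GfDecode(size_t size, const void* srcdata, void* outdata);

}  // extern "C"
```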

Progressive/incremental encoding?

Hi, can FastECC support progressive encoding to minimize memory usage? ISA-L can do this with the ec_encode_data_update function.
