
AVX about sse2 HOT 34 OPEN

mischasan avatar mischasan commented on August 16, 2024
AVX

from sse2.

Comments (34)

mischasan avatar mischasan commented on August 16, 2024


markNZed avatar markNZed commented on August 16, 2024

If I understood you correctly, you have moved some of the code base to AVX2 but are not planning to publish that source code. However, you may make an AVX version of the bmx procedure available to the public. Is that right?

"We" is me and a dev who I've asked to help me because he has some SIMD experience. We have been using boost.simd to do some benchmarking.

The GPL could be a problem as I want to develop a commercial application (for engineering). There is no problem sharing changes that we might make to the bmx proc but the GPL would require releasing all the code it is linked with and that is problematic.

The app would run on industry server farms, so managing different SIMD implementations and generations is an issue. We were thinking of using gcc intrinsics for this; one idea would be to map bmx to intrinsics. Did you not want to use intrinsics?
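
As a rough illustration of handling several SIMD generations from one binary, the sketch below picks a level at run time with the GCC/clang builtin `__builtin_cpu_supports`. The enum and function names are hypothetical and do not come from the sse2 repo; on other compilers or non-x86 targets the sketch simply reports the scalar level.

```c
#include <assert.h>

/* Hypothetical names for illustration; none of these come from the sse2 repo. */
typedef enum { SIMD_SCALAR = 0, SIMD_SSE2, SIMD_AVX2, SIMD_AVX512F } simd_level;

/* Ask the running CPU which instruction sets it supports.
 * Uses GCC/clang builtins on x86; anything else falls back to scalar. */
static simd_level detect_simd_level(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) return SIMD_AVX512F;
    if (__builtin_cpu_supports("avx2"))    return SIMD_AVX2;
    if (__builtin_cpu_supports("sse2"))    return SIMD_SSE2;
#endif
    return SIMD_SCALAR;
}

/* Human-readable name, e.g. for logging which kernel was selected. */
static const char *simd_level_name(simd_level level)
{
    static const char *const names[] = { "scalar", "sse2", "avx2", "avx512f" };
    assert((int)level >= 0 && (int)level <= (int)SIMD_AVX512F);
    return names[level];
}
```

A program could call `detect_simd_level()` once at startup and store a function pointer to the matching kernel, with each kernel compiled in its own translation unit using the appropriate `-m` flag.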


mischasan avatar mischasan commented on August 16, 2024


markNZed avatar markNZed commented on August 16, 2024

Only targeting x86 at this stage.

I tried compiling on Ubuntu 16.04:

cc -g -MMD -fPIC -pthread -fdiagnostics-show-option -fno-strict-aliasing -fstack-protector --param ssp-buffer-size=4 -Wall -Werror -Wextra -Wcast-align -Wcast-qual -Wformat=2 -Wformat-security -Wmissing-prototypes -Wnested-externs -Wpointer-arith -Wshadow -Wstrict-prototypes -Wunused -Wwrite-strings -Wno-attributes -Wno-cast-qual -Wno-error -Wno-unknown-pragmas -Wno-unused-parameter -O3 -I/usr/local/include -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I. -c -o sseutil.o sseutil.c
sseutil.c:1:18: fatal error: plat.h: No such file or directory

Is that file missing from the repo?


mischasan avatar mischasan commented on August 16, 2024


mischasan avatar mischasan commented on August 16, 2024


markNZed avatar markNZed commented on August 16, 2024

msutil.h and sock.h are also missing.

I ran make and then make test, which gives:

make test
cc   -pthread  -L/usr/local/lib        ssebmx_t.o libsse.a tap.o bitmat.o     -lstdc++  -lm    -o ssebmx_t
bitmat.o: In function `bitmat_trans':
/home/propacov/shared/proto/i8051-07/src/tests/primitives/sse2/bitmat.c:80: undefined reference to `ssebmx'
/home/propacov/shared/proto/i8051-07/src/tests/primitives/sse2/bitmat.c:80: undefined reference to `ssebmx_m'
collect2: error: ld returned 1 exit status
<builtin>: recipe for target 'ssebmx_t' failed
make: *** [ssebmx_t] Error 1


mischasan avatar mischasan commented on August 16, 2024


markNZed avatar markNZed commented on August 16, 2024

No problem, it is worth the effort if we can use the code. I resolved the missing files (I downloaded the 3 headers from your utils package) but ran into the link error reported in my previous message. Can you get the bmx test running? The GNUMakefile and rules are new to me, so it is not easy to quickly understand where the issue is. Thanks.


mischasan avatar mischasan commented on August 16, 2024


mischasan avatar mischasan commented on August 16, 2024


mischasan avatar mischasan commented on August 16, 2024


markNZed avatar markNZed commented on August 16, 2024

Hi,

I don't see any updates to the repo. Are you using attachments with these messages? I don't think I can access those.

I'm in France. I imagine the user of our software will have AVX512 boxes. But I don't have a server farm. I plan to do testing on cloud infrastructure e.g. AWS.


markNZed avatar markNZed commented on August 16, 2024

With 256-bit or 512-bit registers, does the optimal size of the bit matrix for transposition change?


mischasan avatar mischasan commented on August 16, 2024


mischasan avatar mischasan commented on August 16, 2024


markNZed avatar markNZed commented on August 16, 2024

If you like, you could upload it to https://expirebox.com/ which is very simple: no login, and it provides a link to the file (which gets deleted after 48 hours).


markNZed avatar markNZed commented on August 16, 2024

For bmx, does AVX provide improved instructions, or is the only benefit larger registers?


mischasan avatar mischasan commented on August 16, 2024


mischasan avatar mischasan commented on August 16, 2024


markNZed avatar markNZed commented on August 16, 2024

I have a hard time understanding why CPUs don't provide native support for a bitwise transpose; it seems such a fundamental building block. Do you see why that hasn't happened?

The zip ran fine on my machine, I only tried ssebmx_t (I'm using an Intel Core i5 on my laptop). Thanks!

Have you tried benchmarking clang against gcc? I was surprised to see how much better clang-3.8 was than gcc-6.2 on some auto-vectorization test cases; it seemed to make better use of the ymm/xmm registers.

Yeah, lucky to be in France; so much comes down to luck...


mischasan avatar mischasan commented on August 16, 2024


markNZed avatar markNZed commented on August 16, 2024

No, a New Zealander; very lucky there too!


markNZed avatar markNZed commented on August 16, 2024

This is a bit of a diverging thread, but I hesitate to create new issues for questions. The bmx is 16x8, and I am wondering: if we are targeting a size of 256 x W (where W is typically less than 512), are there changes to the algorithm that could match up with the initial row count of 256 and improve performance? Or is it best to just break that up into 16x8 chunks? Thanks.
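
For reference, the "break it up into 16x8 chunks" option can be sketched as below. This is a minimal illustration, not the repo's ssebmx code: the function names, the bit order (bit 7 of each byte is the leftmost column) and the output layout (16-bit words) are all assumptions. The tile step uses the usual SSE2 movemask idiom, with a scalar fallback when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Transpose one 16-row x 8-column bit tile (hypothetical helper).
 * rows[r] is row r, with bit 7 as column 0 (assumed convention).
 * cols[c] receives column c as 16 bits; bit r comes from row r. */
static void tile16x8_transpose(const uint8_t rows[16], uint16_t cols[8])
{
#if defined(__SSE2__)
    __m128i x = _mm_loadu_si128((const __m128i *)rows);
    for (int c = 0; c < 8; ++c) {
        /* movemask collects the top bit of each of the 16 bytes */
        cols[c] = (uint16_t)_mm_movemask_epi8(x);
        /* move the next column into the top bit of every byte */
        x = _mm_slli_epi64(x, 1);
    }
#else
    for (int c = 0; c < 8; ++c) {
        uint16_t v = 0;
        for (int r = 0; r < 16; ++r)
            v |= (uint16_t)(((rows[r] >> (7 - c)) & 1u) << r);
        cols[c] = v;
    }
#endif
}

/* Transpose a 256 x W bit matrix (row-major, W a multiple of 8) by
 * walking it in 16x8 tiles. Output row c is 16 words of 16 bits;
 * word rb of that row holds input rows 16*rb .. 16*rb+15. */
static void transpose_256xW(const uint8_t *inp, uint16_t *out, size_t width_bits)
{
    assert(width_bits % 8 == 0);
    size_t stride = width_bits / 8;           /* bytes per input row */
    for (size_t rb = 0; rb < 16; ++rb) {      /* 16 blocks of 16 rows */
        for (size_t cb = 0; cb < stride; ++cb) {
            uint8_t rows[16];
            uint16_t cols[8];
            for (size_t r = 0; r < 16; ++r)   /* gather one tile */
                rows[r] = inp[(rb * 16 + r) * stride + cb];
            tile16x8_transpose(rows, cols);
            for (size_t c = 0; c < 8; ++c)    /* scatter the result */
                out[(cb * 8 + c) * 16 + rb] = cols[c];
        }
    }
}
```

With 256 rows, each output row is exactly 32 bytes, so a wider-register variant could in principle produce more of an output row per step; whether that is faster would need measuring.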


mischasan avatar mischasan commented on August 16, 2024


markNZed avatar markNZed commented on August 16, 2024

Nice idea with INP and OUT. I would hope that the hardware could prefetch, but in any case memory will be the bottleneck. It is premature to optimise now. It will be late next week before I can do profiling, and the current bmx may well be fast enough.

The application analyses decompressed trace files from digital circuit simulation. One dimension of the matrix is time (cycles) and the other is inputs. The matrix can be quite big (e.g. GBs).


mischasan avatar mischasan commented on August 16, 2024


markNZed avatar markNZed commented on August 16, 2024

Could __builtin_prefetch be a big help with that? If the gather/scatter works on a block that fits in L1...

I should probably mention that we are looking to transpose blocks (kBs) not the entire matrix (potentially GBs). So the scatter can be limited.
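
To make the question concrete, here is a minimal sketch of a strided gather that issues `__builtin_prefetch` a few rows ahead. The function name, the `ahead` distance and the macro are assumptions for illustration only. Hardware prefetchers often handle constant strides well on their own, so any benefit would have to be confirmed by profiling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
/* rw = 0 (read), locality = 3 (keep in all cache levels) */
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* Gather one byte column of a row-major matrix into a contiguous
 * buffer, hinting the cache about the row that is `ahead` rows away.
 * Hypothetical helper; not part of the sse2 repo. */
static void gather_column(const uint8_t *inp, size_t nrows, size_t stride,
                          size_t col, size_t ahead, uint8_t *dst)
{
    assert(col < stride);
    for (size_t r = 0; r < nrows; ++r) {
        if (r + ahead < nrows)
            PREFETCH_READ(&inp[(r + ahead) * stride + col]);
        dst[r] = inp[r * stride + col];
    }
}
```

When the block being transposed is only a few kB, the whole gather target already fits in L1, so the prefetch distance mostly matters for the first touch of each input row.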


mischasan avatar mischasan commented on August 16, 2024


mischasan avatar mischasan commented on August 16, 2024


markNZed avatar markNZed commented on August 16, 2024

Hi, thanks! Could you please upload it? Either GitHub or https://expirebox.com would work.


mischasan avatar mischasan commented on August 16, 2024


markNZed avatar markNZed commented on August 16, 2024

Hi, we ran some benchmarks and got slightly better results with code based on http://stackoverflow.com/questions/41778362/how-to-efficiently-transpose-a-2d-bit-matrix targeting a 64x64 matrix, which was surprising: 940.423 MB/s vs 747.659 MB/s. AVX2 was actually slower, at 400.961 MB/s. Thanks for your support!
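
For readers unfamiliar with the 64x64 approach, a common portable formulation is the recursive block swap (in the style of Hacker's Delight) sketched below. It is not necessarily the code from the linked question, nor the code that was benchmarked here. The function name and the bit convention (bit c of `a[r]` is the element at row r, column c) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* In-place transpose of a 64x64 bit matrix stored as 64 words.
 * Assumed convention: bit c of a[r] is the element at (row r, column c).
 * Each pass swaps the off-diagonal j x j blocks, for j = 32, 16, ..., 1. */
static void transpose64x64(uint64_t a[64])
{
    assert(a != NULL);
    uint64_t m = 0x00000000FFFFFFFFULL;   /* selects columns whose bit j is 0 */
    for (int j = 32; j != 0; j >>= 1, m ^= m << j) {
        /* k visits every row index whose bit j is 0 */
        for (int k = 0; k < 64; k = (k + j + 1) & ~j) {
            /* swap columns [j..] of row k with columns [0..] of row k+j */
            uint64_t t = ((a[k] >> j) ^ a[k + j]) & m;
            a[k]     ^= t << j;
            a[k + j] ^= t;
        }
    }
}
```

The routine uses only 64-bit integer operations, six passes of 32 swaps each, which may be part of why it can compete with SIMD versions on small matrices.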


mischasan avatar mischasan commented on August 16, 2024

