memcpy_sse's People

Contributors

level1wendell

memcpy_sse's Issues

Wrong loop counters

I'm sure you'll catch this one, but the copy loop is unrolled 8 times so you need to step the pointers by 1024 bytes and not 128. Should be a nice speedup!

With this change, memcpy_sse is almost twice as fast as memcpy (gcc 5.4 on Linux/Nehalem). FWIW, this CPU doesn't exhibit any performance change between x32 and x64.

This is not bug free.

// TODO: I am not sure this is bug free... or the right way to do this.

When the low 4 bits of the destination and source addresses differ, the function uses _mm_load_si128, which requires a 16-byte-aligned address, to read unaligned memory in the later loop at line 33.
When the destination address is unaligned and size is < 16 bytes, the function still copies 16 bytes, and if the pad is > size, size - pad wraps around to 2^32 - (pad - size) on 32-bit systems and to 2^64 - (pad - size) on 64-bit systems.
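
One possible way to avoid both problems is sketched below. It is not the repository's code: copy_sketch and pad_to_16 are hypothetical names made up for illustration, and the sketch assumes the pad is the number of bytes needed to bring the destination up to a 16-byte boundary. The pad is clamped to size so the subtraction cannot wrap, and the source is read with _mm_loadu_si128, which has no alignment requirement.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_store_si128 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: bytes needed to bring p up to a 16-byte boundary (0..15). */
static size_t pad_to_16(const void *p)
{
    return (size_t)(-(uintptr_t)p & 15u);
}

/* Hypothetical stand-in for the copy routine, not the repository's code. */
static void *copy_sketch(void *dst, const void *src, size_t size)
{
    uint8_t *d = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;

    /* Clamp the pad so that size - pad can never wrap around. */
    size_t pad = pad_to_16(d);
    if (pad > size)
        pad = size;
    memcpy(d, s, pad);
    d += pad;
    s += pad;
    size -= pad;

    /* d is 16-byte aligned whenever a full block remains; s may not be. */
    while (size >= 16) {
        assert(((uintptr_t)d & 15u) == 0);
        __m128i v = _mm_loadu_si128((const __m128i *)s); /* unaligned-safe load */
        _mm_store_si128((__m128i *)d, v);                /* aligned store */
        d += 16;
        s += 16;
        size -= 16;
    }

    memcpy(d, s, size); /* tail of fewer than 16 bytes */
    return dst;
}
```

With this shape, a call such as copy_sketch(dst + 1, src + 3, 5) copies exactly 5 bytes, and a source whose low 4 bits differ from the destination's is still read safely.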

Can't compile with the commands provided in the readme

I tried to compile the code with the following command, as provided in the readme:

gcc -march=sse3 -O3 -m32 testmem_modified.c -o tm32

but I get this error:

cc1: error: bad value (‘sse3’) for ‘-march=’ switch
cc1: note: valid arguments to ‘-march=’ switch are: i386 i486 i586 pentium lakemont pentium-mmx winchip-c6 winchip2 c3 samuel-2 c3-2 nehemiah c7 esther i686 pentiumpro pentium2 pentium3 pentium3m pentium-m pentium4 pentium4m prescott nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client icelake-server bonnell atom silvermont slm knl knm geode k6 k6-2 k6-3 athlon athlon-tbird athlon-4 athlon-xp athlon-mp x86-64 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 btver1 btver2 native
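
A likely cause, assuming the readme meant to enable the SSE3 instruction set rather than select a CPU model: in GCC, -march= takes a CPU name such as those in the note above, while the instruction set is enabled with -msse3. The sketch below shows the adjusted command in a comment, plus a hypothetical helper (built_with_sse3 is not part of the repository) that reports whether the compiler actually had SSE3 enabled.

```c
/* Adjusted build command (an assumption about what the readme intended):
 *   gcc -msse3 -O3 -m32 testmem_modified.c -o tm32
 * -msse3 enables the SSE3 instruction set; -march= expects a CPU name
 * from the list in the error note (for example prescott, core2, native).
 */

/* Hypothetical helper: 1 if the compiler had SSE3 enabled, else 0.
 * GCC defines __SSE3__ when SSE3 code generation is on. */
static int built_with_sse3(void)
{
#ifdef __SSE3__
    return 1;
#else
    return 0;
#endif
}
```

Calling built_with_sse3() from the test program is a quick way to confirm that the flag took effect for a given build.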
