libdivide.h by ridiculousfish (fastdoom issue, closed)

viti95 commented on June 26, 2024
libdivide.h by ridiculousfish

from fastdoom.

Comments (8)

viti95 commented on June 26, 2024

Good idea. I tried adding it to the project, but OpenWatcom can't compile it correctly. We need to port it to OpenWatcom first to be able to use it 😢

RamonUnch commented on June 26, 2024

It seems to be mostly due to the C99 feature that allows mixing declarations with statements; I am working on a fix.
Also, there is no 128-bit integer support in OpenWatcom.
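
To illustrate both points, here is a minimal sketch (not taken from the attached port) of the one wide multiply the signed 32-bit path needs. It assumes the compiler offers `long long` as an extension, keeps all declarations at the top of the block as C89 requires, and needs no 128-bit type. The name `mullhi_s32` is only illustrative.

```c
/* High 32 bits of a signed 32x32 -> 64-bit product.
   Inputs are assumed to fit in 32 bits even where long is wider. */
static long mullhi_s32(long x, long y)
{
    /* C89 style: every declaration comes before the first statement. */
    long long wide;
    long high;

    wide = (long long)x * (long long)y;
    /* Right shift of a negative value is assumed to be arithmetic,
       which holds for the usual x86 compilers. */
    high = (long)(wide >> 32);
    return high;
}
```

Only the 64-bit divider variants need a 128-bit intermediate, which is why they can simply be left out of a 32-bit port.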

RamonUnch commented on June 26, 2024

Well, this one works (still a work in progress).
libdivide-wc.zip
When I use gcc, the performance difference between plain division and fastdiv is HUGE; however, it is only around 20% faster on OpenWatcom.
I guess this is because of inlining...
I guess we can extract only the signed int32/int32 functions, build them with gcc, and then inline the generated assembly code.
Maybe we also need unsigned divisions and 16-bit divisions?
Unsigned divisions are faster...
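
For comparison, a hedged sketch of what an unsigned 32-bit divide step looks like, modelled on my understanding of libdivide's unsigned path rather than copied from the attached zip. The names `u32_divider` and `u32_divide` are hypothetical, and `<stdint.h>` is used for brevity. There are no sign fix-ups at all, which is one plausible reason the unsigned variant is cheaper.

```c
#include <stdint.h>

enum { U32_SHIFT_MASK = 0x1F, U32_ADD_MARKER = 0x40 };

/* Hypothetical name; same idea as a precomputed libdivide divider. */
struct u32_divider {
    uint32_t magic;   /* 0 means the divisor is a power of two */
    uint8_t more;     /* shift amount, optionally with U32_ADD_MARKER */
};

static uint32_t u32_divide(uint32_t numer, const struct u32_divider *d)
{
    uint32_t q, t;

    if (d->magic == 0)
        return numer >> d->more;              /* power of two: plain shift */

    /* High half of the 32x32 -> 64-bit product. */
    q = (uint32_t)(((uint64_t)d->magic * numer) >> 32);
    if (d->more & U32_ADD_MARKER) {
        /* The magic needed 33 bits: add numer back without overflowing. */
        t = ((numer - q) >> 1) + q;
        return t >> (d->more & U32_SHIFT_MASK);
    }
    return q >> d->more;
}
```

As a usage example, the divider for 7 worked out by hand is `{ 613566757u, 0x42 }`, and `u32_divide(100u, &d7)` then yields 14.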

RamonUnch commented on June 26, 2024

A more minimalist libdiv.h:
libdiv.zip
Without SSE, AVX, etc., and without 128-bit integers.

RamonUnch commented on June 26, 2024

Well, here is the minimal version with inlined asm for OpenWatcom, C89 compatible.
There is still room for improvement in the generator.
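
One candidate improvement is the leading-zero count, which the listing marks with a TODO about BSR. The following is only a sketch and is untested on OpenWatcom: the `bsr32` helper name is hypothetical, the `#pragma aux` form copies the style of the listing, and the `__WATCOMC__`/`__386__` check is an assumption about the target. Other compilers get a portable binary-search fallback.

```c
#if defined(__WATCOMC__) && defined(__386__)
/* Index of the highest set bit; the result is undefined for v == 0. */
unsigned long bsr32(unsigned long v);
#pragma aux bsr32 = "bsr eax, eax" parm [eax] value [eax] modify exact [eax]
#else
/* Portable fallback: binary search for the highest set bit. */
static unsigned long bsr32(unsigned long v)
{
    unsigned long idx = 0;
    if (v & 0xFFFF0000UL) { v >>= 16; idx += 16; }
    if (v & 0xFF00UL)     { v >>= 8;  idx += 8; }
    if (v & 0xF0UL)       { v >>= 4;  idx += 4; }
    if (v & 0xCUL)        { v >>= 2;  idx += 2; }
    if (v & 0x2UL)        { idx += 1; }
    return idx;
}
#endif

/* Same contract as libdivide_count_leading_zeros32: returns 32 for 0. */
static long count_leading_zeros32(unsigned long val)
{
    val &= 0xFFFFFFFFUL;             /* keep 32 bits where long is wider */
    if (val == 0) return 32;
    return 31 - (long)bsr32(val);
}
```

The zero check stays in C because BSR leaves its destination undefined when the source is 0.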


#define int32_t long
#define uint32_t unsigned long
#define uint8_t unsigned char
#define uint64_t unsigned long long

enum {
    LIBDIVIDE_16_SHIFT_MASK = 0x1F,
    LIBDIVIDE_32_SHIFT_MASK = 0x1F,
    LIBDIVIDE_64_SHIFT_MASK = 0x3F,
    LIBDIVIDE_ADD_MARKER = 0x40,
    LIBDIVIDE_NEGATIVE_DIVISOR = 0x80
};

struct libdivide_s32_t {
    int32_t magic;
    uint8_t more;
};

// TODO: use BSR inline asm for WATCOM
static int32_t libdivide_count_leading_zeros32(uint32_t val)
{
    int32_t result = 8;
    uint32_t hi = 0xFFU << 24;
    if (val == 0) return 32;
    while ((val & hi) == 0) {
        hi >>= 8;
        result += 8;
    }
    while (val & hi) {
        result -= 1;
        hi <<= 1;
    }
    return result;
}


// libdivide_64_div_32_to_32: divides a 64-bit uint {u1, u0} by a 32-bit
// uint {v}. The result must fit in 32 bits.
// Returns the quotient directly and the remainder in *r
static uint32_t libdivide_64_div_32_to_32(uint32_t u1, uint32_t u0, uint32_t v, uint32_t *r)
{
#if (defined(LIBDIVIDE_i386) || defined(LIBDIVIDE_X86_64)) && defined(LIBDIVIDE_GCC_STYLE_ASM)
    uint32_t result;
    __asm__("divl %[v]" : "=a"(result), "=d"(*r) : [v] "r"(v), "a"(u0), "d"(u1));
    return result;
#else
    uint64_t n = ((uint64_t)u1 << 32) | u0;
    uint32_t result = (uint32_t)(n / v);
    *r = (uint32_t)(n - result * (uint64_t)v);
    return result;
#endif
}

// generate the pseudo-inverse to pass to libdiv_s32_do(x, div)
struct libdivide_s32_t libdiv_s32_gen(int32_t d)
{
    struct libdivide_s32_t result;

    // If d is a power of 2, or the negative of a power of 2, we have to use a shift.
    // This is especially important because the magic algorithm fails for -1.
    // To check if d is a power of 2 or its negation, it suffices to check
    // whether its absolute value has exactly one bit set. This works even for
    // INT_MIN, because abs(INT_MIN) == INT_MIN, and INT_MIN has one bit set
    // and is a power of 2.
    uint32_t ud = (uint32_t)d;
    uint32_t absD = (d < 0) ? -ud : ud;
    uint32_t floor_log_2_d = 31 - libdivide_count_leading_zeros32(absD);
    // check if exactly one bit is set,
    // don't care if absD is 0 since that's divide by zero
    if ((absD & (absD - 1)) == 0) {
        result.magic = 0;
        result.more = (uint8_t)(floor_log_2_d | (d < 0 ? LIBDIVIDE_NEGATIVE_DIVISOR : 0));
    } else {
        // LIBDIVIDE_ASSERT(floor_log_2_d >= 1);

        uint8_t more;
        int32_t magic;
        // the dividend here is 2**(floor_log_2_d + 31), so the low 32 bit word
        // is 0 and the high word is 2**(floor_log_2_d - 1)
        uint32_t e;
        uint32_t rem, proposed_m;
        proposed_m = libdivide_64_div_32_to_32((uint32_t)1 << (floor_log_2_d - 1), 0, absD, &rem);
        e = absD - rem;

        // We are going to start with a power of floor_log_2_d - 1.
        // This works if e < 2**floor_log_2_d.
        if (e < ((uint32_t)1 << floor_log_2_d)) {
            // This power works
            more = (uint8_t)(floor_log_2_d - 1);
        } else {
            // We need to go one higher. This should not make proposed_m
            // overflow, but it will make it negative when interpreted as an
            // int32_t.
            const uint32_t twice_rem = rem + rem;
            proposed_m += proposed_m;
            if (twice_rem >= absD || twice_rem < rem) proposed_m += 1;
            more = (uint8_t)(floor_log_2_d | LIBDIVIDE_ADD_MARKER);
        }

        proposed_m += 1;
        magic = (int32_t)proposed_m;

        // Mark if we are negative.
        if (d < 0) {
            more |= LIBDIVIDE_NEGATIVE_DIVISOR;
            magic = -magic;
        }

        result.more = more;
        result.magic = magic;
    }
    return result;
}
#undef int32_t
#undef uint32_t
#undef uint8_t
#undef uint64_t

// Build from
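// Shift path (magic == 0): bias negative dividends by (1 << shift) - 1 before
// the arithmetic shift so the quotient rounds toward zero, like idiv.
int libdiv_s32_do(int x, struct libdivide_s32_t *div);
#pragma aux libdiv_s32_do = \
    "mov    esi, ecx",             \
    "mov    bl, BYTE PTR [edx+4]", \
    "mov    eax, DWORD PTR [edx]", \
    "mov    cl, bl",               \
    "and    ecx, 31",              \
    "test   eax, eax",             \
    "jne    L2",                   \
    "sar    bl, 7",                \
    "movsx  ebx, bl",              \
    "mov    eax, 1",               \
    "shl    eax, cl",              \
    "dec    eax",                  \
    "mov    edx, esi",             \
    "sar    edx, 31",              \
    "and    eax, edx",             \
    "add    eax, esi",             \
    "sar    eax, cl",              \
    "xor    eax, ebx",             \
    "sub    eax, ebx",             \
    "jmp    ENDPROC",              \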
    "L2:",                         \
    "imul   esi",                  \
    "mov    eax, edx",             \
    "test   bl, 64",               \
    "je     L4",                   \
    "sar    bl, 7",                \
    "movsx  ebx, bl",              \
    "xor    esi, ebx",             \
    "add    eax, esi",             \
    "sub    eax, ebx",             \
    "L4:",                         \
    "mov    edx, eax",             \
    "sar    edx, cl",              \
    "shr    eax, 31",              \
    "add    eax, edx",             \
    "ENDPROC:" parm [ecx] [edx] value [eax] modify exact [ecx edx eax ebx esi]

I tried it in a loop in OpenWatcom and I get ~5 times faster divisions than with idiv.
I am sure this could help FastDoom's FixedDiv, though extra bit shifting would be needed.
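
For cross-checking the inline asm against the compiler's own `/`, a portable C reference of the same computation can help. This is a sketch based on my reading of libdivide's signed 32-bit path, not code from the attachment. The names `s32_divider` and `s32_divide_ref` are hypothetical, `<stdint.h>` is used for brevity, and arithmetic right shift of negative values is assumed.

```c
#include <stdint.h>

enum { S32_SHIFT_MASK = 0x1F, S32_ADD_MARKER = 0x40 };

/* Hypothetical name; same field layout as libdivide_s32_t in the listing. */
struct s32_divider {
    int32_t magic;
    uint8_t more;
};

static int32_t s32_divide_ref(int32_t x, const struct s32_divider *d)
{
    uint8_t more = d->more;
    uint32_t shift = more & S32_SHIFT_MASK;
    int32_t sign = (int8_t)more >> 7;    /* -1 for a negative divisor, else 0 */
    uint32_t uq, mask;
    int32_t q;

    if (d->magic == 0) {
        /* Power of two: bias negative x so the shift rounds toward zero. */
        mask = ((uint32_t)1 << shift) - 1;
        uq = (uint32_t)x + ((uint32_t)(x >> 31) & mask);
        q = (int32_t)uq >> shift;
        return (q ^ sign) - sign;        /* negate if the divisor is negative */
    }

    /* High half of magic * x, then the optional "add numerator" step. */
    uq = (uint32_t)(((int64_t)d->magic * x) >> 32);
    if (more & S32_ADD_MARKER)
        uq += ((uint32_t)x ^ (uint32_t)sign) - (uint32_t)sign;
    q = (int32_t)uq >> shift;
    return q + (q < 0);                  /* round toward zero */
}
```

With the divider for 7 worked out by hand, `{ -1840700269, 0x42 }`, this returns 14 for 100 and -14 for -100. Looping over a range of dividends and comparing with `x / d` is a cheap way to validate any asm variant.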

RamonUnch commented on June 26, 2024

Also, OpenWatcom can handle C99 with the -za99 and -aa flags, but that is not required here; we only need long long.

viti95 commented on June 26, 2024

Oh, I forgot to reply to this commit in the past! I did try using this division method. It works, but I didn't find any loop where DIV or IDIV is called every time with the same divisor; id Software had already tried to remove all possible divisions. I'll look again to see if something can be optimized.

RamonUnch commented on June 26, 2024

You are right. I thought I could optimize R_ClearPlanes with your idea of iprojection, but I get even more rounding problems and I am unable to gain any measurable overall performance.
