GithubHelp home page GithubHelp logo

Non string (composite) keys about hat-trie HOT 10 CLOSED

tessil avatar tessil commented on July 16, 2024
Non string (composite) keys

from hat-trie.

Comments (10)

Tessil avatar Tessil commented on July 16, 2024

For now the HAT-trie only supports a contiguous sequence of chars (could be used to store binary data as a '\0' char inside the sequence is supported).

I'm not that familiar with the composite keys of boost::multi_index_container. Do you have a simple example of what you want to achieve so I can wrap my head around?

If you want to store something like:

struct employee {
    std::string first_name;
    std::string last_name;
};

You could store in the HAT-trie the concatenation of the first name and the last name and provide a string_view into this concatenation in your interface.

struct employee {
    std::string_view first_name() const;
    std::string_view last_name() const;
};

from hat-trie.

 avatar commented on July 16, 2024

I'm more thinking of binary types like:

struct order {
    int64_t price;
    int32_t id;
};

And then I'd need to find an order by it's price + id combination and iterate over orders at the same price but having lower id. Would you recon it could be done by putting both to std::string maybe in big-endian?

from hat-trie.

Tessil avatar Tessil commented on July 16, 2024

You can do something like this.

#include <iostream>
#include "htrie_set.h"

struct order {
    int64_t price;
    int32_t id;
};

int main() {
    order o1{10, 1};
    order o2{5, 2};
    order o3{10, 3};
    order o4{16, 4};
    
    static_assert(std::is_pod<order>::value, "");
    
    tsl::htrie_set<char> test;
    test.insert_ks(reinterpret_cast<char*>(&o1), sizeof(o1));
    test.insert_ks(reinterpret_cast<char*>(&o2), sizeof(o2));
    test.insert_ks(reinterpret_cast<char*>(&o3), sizeof(o3));
    test.insert_ks(reinterpret_cast<char*>(&o4), sizeof(o4));
    


    
    int64_t price = 10;
    auto its_prefix = test.equal_prefix_range_ks(
                                 reinterpret_cast<char*>(&price), sizeof(price));

    order o;
    std::string key_buffer;

    // 3: 10    1: 10
    for(auto it = its_prefix.first; it != its_prefix.second; ++it) {
        it.key(key_buffer);
        
        assert(key_buffer.size() == sizeof(o));
        std::memcpy(&o, key_buffer.data(), sizeof(o));
        
        std::cout << o.id << ": " << o.price << std::endl;
    }
}

If I'm not mistaken, there should be no undefined behaviour if order is a POD and you don't transfer the trie between big and little endian but it's quite "hackish".

You also can't really find objects with a id lower than 'x' (you could check for id bigger than 'x' if you use an unsigned int by knowing the binary representation, but well...).

What is the problem with boost::multi_index_container for this usecase? Using too much memory?

from hat-trie.

 avatar commented on July 16, 2024

"hakish" and slight "undefined behavior" is perfectly fine as long as gcc compiled x64 is doing good.

boost::multi_index_container has pretty good hash set but I don't like relying on red-black tree to maintain an order as it's not cache friendly and once tree's growing deep quite a bit of different cache lines could be required.

the hash set is heavily modified too i.e. a lot of deletes, inserts going on. I'll try to benchmark trie vs boost for my cases, thanks so much for your help!

if you'd also need to know the sum of all order quantities (extra uint32_t qty field in the struct) for each price is there a good way to store a bit of extra data in the trie node? maybe I can just have some values in the order itself I could use for extra storage though.

from hat-trie.

Tessil avatar Tessil commented on July 16, 2024

Yes I think it should be good to do a benchmark as I'm not sure that the HAT-trie offers a significant advantage compared to boost::multi_index_container for your use case considering the complexity that it add.

For the sum, I'm not sure to understand. You can do:

std::size_t sum = 0;
for(auto it = its_prefix.first; it != its_prefix.second; ++it) {
    it.key(key_buffer);
    std::memcpy(&o, key_buffer.data(), sizeof(o));
        
    sum += o.qty;
}

You want to have an aggregate field for each price to avoid this calculation?

from hat-trie.

 avatar commented on July 16, 2024

Exactly, an aggregate and some way to get N highest & lowest aggregates by price.

Now have a separate std::map collection of aggregates which is constantly updated and each order have a pointer to map node. But not happy with the setup, kind of too many tree structures everywhere.

from hat-trie.

Tessil avatar Tessil commented on July 16, 2024

I think it's better to have a separate structure for the aggregates.

As the trie stores binary data, you could store multiple data types in the trie.

Example you will have a char buffer that looks like this for an order:
| char (type) | int64_t (id) | int64_t (price) | uint32 (qty) |

And for an aggregate:
| char (type) | char (aggregate type min, max, avg, ...) | int64_t (price) | uint32 (total_qty) |

The first byte differentiate an order from an aggregate. When you retrieve your data you would do:

if(key_buffer[0] == ORDER_TYPE) {
    std::memcpy(&ord, key_buffer.data() + 1, sizeof(order));
}
else {
    assert(key_buffer[0] == AGGREGATE_TYPE);
    std::memcpy(&aggr, key_buffer.data() + 1, sizeof(aggragate));
}

But I don't see any advantage compared to a separate map while it really increases the complexity of the code.

from hat-trie.

 avatar commented on July 16, 2024
#define NONIUS_RUNNER

#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/mem_fun.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/pool/pool.hpp>

#include "../../hat-trie/src/htrie_set.h"
#include "../../nonius/nonius.h++"
#include <unordered_set>

struct order
{
    int64_t price;
    int32_t id;
};

order ht_o1 {10, 1};
order us_o1 {10, 1};
order mi_o1 {10, 1};

struct order_hash
{
    size_t operator()(const order &o) const
    {
        return std::hash<int64_t>()(o.price) + std::hash<int32_t>()(o.id);
    }
};

struct order_equal
{
    bool operator()(const order &x, const order &y) const
    {
        return x.price == y.price && x.id == y.id;
    }
};

using namespace boost::multi_index;

using mindex_set = multi_index_container<
    order,
    indexed_by<
        hashed_unique<
            member<order, const int32_t, &order::id>>,
        ordered_non_unique<
            composite_key<
                order,
                member<order, int64_t, &order::price>,
                member<order, int32_t, &order::id>>>>>;

tsl::htrie_set<char> ht_test;
std::unordered_set<order, order_hash, order_equal> us_test;
mindex_set mi_test;

NONIUS_BENCHMARK("htrie-basic", [](nonius::chronometer meter)
{
    ht_test.clear();

    meter.measure([&](int i)
    {
        ++ht_o1.id;

        if (ht_o1.id % 10 == 0)
            ++ht_o1.price;

        ht_test.insert_ks(reinterpret_cast<char*>(&ht_o1), sizeof(ht_o1));
    });
})

NONIUS_BENCHMARK("unordered_set-basic", [](nonius::chronometer meter)
{
    us_test.clear();

    meter.measure([&](int i)
    {
        ++us_o1.id;

        if (us_o1.id % 10 == 0)
            ++us_o1.price;

        us_test.insert(us_o1);
    });
})

NONIUS_BENCHMARK("multi_index-basic", [](nonius::chronometer meter)
{
    mi_test.clear();

    meter.measure([&](int i)
    {
        ++mi_o1.id;

        if (mi_o1.id % 10 == 0)
            ++mi_o1.price;

        mi_test.insert(mi_o1);
    });
})
clock resolution: mean is 20.725 ns (20480002 iterations)

benchmarking htrie-basic
collecting 100 samples, 155 iterations each, in estimated 2.077 ms
mean: 87.9874 ns, lb 86.7012 ns, ub 89.3417 ns, ci 0.95
std dev: 6.71555 ns, lb 5.84097 ns, ub 8.5427 ns, ci 0.95
found 1 outliers among 100 samples (1%)
variance is severely inflated by outliers

benchmarking unordered_set-basic
collecting 100 samples, 189 iterations each, in estimated 2.0601 ms
mean: 39.9197 ns, lb 39.7768 ns, ub 40.1902 ns, ci 0.95
std dev: 0.976376 ns, lb 0.597635 ns, ub 1.42363 ns, ci 0.95
found 15 outliers among 100 samples (15%)
variance is moderately inflated by outliers

benchmarking multi_index-basic
collecting 100 samples, 80 iterations each, in estimated 2.072 ms
mean: 62.8324 ns, lb 62.3842 ns, ub 64.4419 ns, ci 0.95
std dev: 3.71753 ns, lb 0.54332 ns, ub 8.42814 ns, ci 0.95
found 5 outliers among 100 samples (5%)
variance is severely inflated by outliers

from hat-trie.

 avatar commented on July 16, 2024

htrie seem to be quite complicated indeed so basic inserts are getting slower. probably got to stick to simple data structures for now.

from hat-trie.

Tessil avatar Tessil commented on July 16, 2024

Yes, I think the HAT-trie is not well-suited for your usecase. It could probably be speed-up to take advantage of the fixed size of the entries (the HAT-trie here must support strings of variable length) but it's not the main purpose of the structure, there are probably more suited datastructures for this.

from hat-trie.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.