
Comments (10)

sanikolaev commented on August 12, 2024

Hi. Which config file exactly do you mean?

from columnar.

sangensong commented on August 12, 2024

Thanks. I mean manticore.conf.


sanikolaev commented on August 12, 2024

For which benchmark exactly? They are slightly different.


sangensong commented on August 12, 2024

Hmm, the dataset about nginx. The conclusion "Elasticsearch vs Manticore with 1500MB RAM limit - Elasticsearch is 5.68x slower than Manticore Columnar Library":


sanikolaev commented on August 12, 2024

Here it is:

source logs116m
{
        type = csvpipe
        csvpipe_command = cat /input/data.csv

        csvpipe_field = remote_addr
        csvpipe_field = remote_user
        csvpipe_attr_uint = runtime
        csvpipe_attr_timestamp = time_local
        csvpipe_field = request_type
        csvpipe_field_string = request_path
        csvpipe_field = request_protocol
        csvpipe_attr_uint = status
        csvpipe_attr_uint = size
        csvpipe_field = referer
        csvpipe_field = usearagent
}

index logs116m
{
        path = /var/lib/manticore/logs116m.0.9.9
        source = logs116m
        min_infix_len = 2
        columnar_attrs = id, remote_addr, remote_user, request_type, request_protocol, referer, runtime, status, size, usearagent, request_path
        stored_fields = remote_addr, remote_user, request_type, referer, usearagent, request_protocol
}


searchd
{
        listen = 9306:mysql
        listen = 9308:http
        log = /var/log/manticore/searchd.log
        pid_file = /var/run/manticore/searchd.pid
        binlog_path =
        qcache_max_bytes = 0

        access_plain_attrs = mmap
        access_blob_attrs = mmap

}
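As a quick sanity check on the schema above: csvpipe expects the document id as the first CSV column, followed by the fields and attributes in their declaration order, so each row of /input/data.csv must carry 12 columns for this source. A minimal sketch with hypothetical values:

```shell
# Hypothetical row for the logs116m source, column order as declared:
# id, remote_addr, remote_user, runtime, time_local (unix timestamp),
# request_type, request_path, request_protocol, status, size, referer, usearagent
printf '1,127.0.0.1,-,3,1609459200,GET,/index.html,HTTP/1.1,200,512,-,curl/7.68.0\n' \
    | awk -F, '{print NF}'
# prints 12
```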


sangensong commented on August 12, 2024

Thank you very much.


sanikolaev commented on August 12, 2024

No problem. Let me know if I can help with anything else. If you get a chance to benchmark or test the columnar library too, I'd appreciate it if you shared your results here or at [email protected].


sangensong commented on August 12, 2024

I'm sorry to reopen the issue. Can you give me the config file for the "hacker_news_comments.csv" dataset? The conclusion there is
"Elasticsearch vs Manticore with 1024MB RAM limit - Elasticsearch is 6.51x slower".


sanikolaev commented on August 12, 2024

source hn_small
{
        type = csvpipe
        csvpipe_command = cat /input/data.csv
        csvpipe_attr_uint = story_id
        csvpipe_field = story_text
        csvpipe_field_string = story_author
        csvpipe_attr_uint = comment_id
        csvpipe_field = comment_text
        csvpipe_field_string = comment_author
        csvpipe_attr_uint = comment_ranking
        csvpipe_attr_uint = author_comment_count
        csvpipe_attr_uint = story_comment_count
}

index hn_small
{
        path = /var/lib/manticore/hn_small.0.9.9
        source = hn_small
        min_infix_len = 2

        columnar_attrs = id, story_id, comment_id, comment_ranking, author_comment_count, story_comment_count, story_author, comment_author

        stored_fields = story_text, comment_text
}

index fake
{
        type = rt
        path = /var/lib/manticore/fake
        rt_field = f
}


searchd
{
        listen = 9306:mysql41
        listen = 9308:http
        log = /var/log/manticore/searchd.log
        pid_file = /var/run/manticore/searchd.pid
        binlog_path =
        qcache_max_bytes = 0

        access_plain_attrs = mmap
        access_blob_attrs = mmap
}

If you want to reproduce the benchmark, here are more details:

Elasticsearch init:

/usr/share/logstash/bin/logstash -f $PWD/../$test/logstash.conf --pipeline.batch.size=2000 --pipeline.workers=4
curl -XPOST "localhost:9200/$test/_forcemerge?max_num_segments=1"
curl -X PUT "localhost:9200/$test/_settings?pretty" -H 'Content-Type: application/json' -d' { "index" : { "number_of_replicas" : 0 }}'

Logstash config:

input {
    file {
        codec => multiline {
                pattern => "^\"\d+\",\"\d+\","
                negate => "true"
                what => "previous"
        }
        path => ["${PWD}/data/data.csv"]
        start_position => "beginning"
        sincedb_path => "/dev/null"
        mode => "read"
        exit_after_read => "true"
        file_completed_action => "log"
        file_completed_log_path => "/dev/null"
    }
}

filter {
    csv {
        separator => ","
        skip_header => "true"
        columns => [
                "id",
                "story_id",
                "story_text",
                "story_author",
                "comment_id",
                "comment_text",
                "comment_author",
                "comment_ranking",
                "author_comment_count",
                "story_comment_count"
        ]
    }
    mutate {
        remove_field => ["path", "host", "message", "@version", "@timestamp", "id"]
    }

}

output {
    elasticsearch {
        template => "${PWD}/template.json"
        template_overwrite => true
        hosts => ["127.0.0.1:9200"]
        index => "${test}"
    }
}
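A note on the multiline codec in the input block above: Hacker News comments can contain embedded newlines, so the codec glues any physical line that does not start with two quoted numeric columns (the `^"\d+","\d+",` pattern) onto the previous record. A rough illustration of the pattern with grep, using [0-9] in place of \d (the sample lines are made up):

```shell
# Lines 1 and 3 start a new record; line 2 does not match the pattern
# and would be folded into the previous event by the multiline codec.
printf '"1","42","a story\nthat spans two lines"\n"2","43","next story"\n' \
    | grep -cE '^"[0-9]+","[0-9]+",'
# prints 2
```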

logstash template:

{
    "index_patterns" : "hn_small",
    "settings": {
      "number_of_replicas": 0,
      "number_of_shards": 1,
      "analysis": {
        "analyzer": "simple"
      },
      "index.max_result_window" : "100000",
      "index.queries.cache.enabled": false
    },
    "mappings": {
        "_source": {
          "enabled": true
        },
        "properties": {
           "story_id": {"type": "integer"},
           "story_text": {"type": "text"},
           "story_author": {"type": "text", "fields": {"raw": {"type":"keyword"}}},
           "comment_id": {"type": "integer"},
           "comment_text": {"type": "text"},
           "comment_author": {"type": "text", "fields": {"raw": {"type":"keyword"}}},
           "comment_ranking": {"type": "integer"},
           "author_comment_count": {"type": "integer"},
           "story_comment_count": {"type": "integer"}
        }
   }
}

csv preparation:

[ ! -f "/data/downloaded.csv" ] && wget https://zenodo.org/record/45901/files/hacker_news_comments.csv?download=1 -O /data/downloaded.csv
echo "Cleaning";
cat /data/downloaded.csv | tr -cd '\11\12\15\40-\176' > /data/cleaned.csv
echo "Multiplying"
for n in `seq 1 1`; do echo $n; cat /data/cleaned.csv >> /data/multiplied.csv; done;
rm /data/cleaned.csv
echo "Preparing"
rm /data/data.csv 2>/dev/null
csvcut -e utf-8 -l -c 0,3,4,5,6,7,8,9,10 -z 1073741824 /data/multiplied.csv|grep -v author_comment_count|csvformat -U1 -z 1073741824 > /data/data.csv
rm /data/multiplied.csv
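The tr step in the cleaning stage keeps only tab (\11), LF (\12), CR (\15) and printable ASCII (\40-\176) and deletes every other byte, which strips control characters and non-ASCII bytes that could otherwise break the CSV tooling. A small demonstration on a made-up line:

```shell
# \001 is a control byte; \303\251 and \303\266 are the UTF-8 bytes of é and ö.
# All of them fall outside \11\12\15\40-\176, so tr -cd deletes them.
printf 'id,"h\303\251llo\001 w\303\266rld"\n' | tr -cd '\11\12\15\40-\176'
# prints: id,"hllo wrld"
```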

Indexing in Manticore is much simpler:

indexer -c /path/to/manticore.conf --all


sangensong commented on August 12, 2024

OK. Thanks for your reply.

