
kirby's Introduction

kirby

Kirby slurps up the firehose of logs from Fastly and calculates daily counts for various Ruby ecosystem statistics, pretty quickly.

How fast is pretty quickly?

For an 80MB gzipped log file containing 915,427 JSON event objects (which is 1.02GB uncompressed):

  • 2.7 seconds total to read the entire file line by line
  • 5.0 seconds total to also parse every JSON object into a Rust struct
  • 7.8 seconds total to further parse every User Agent field for Bundler, RubyGems, and Ruby versions and other metrics

This is... very good. For comparison, a Python script that used AWS Glue to do something similar took about 30 minutes. My first approach of writing a nom parser-combinator to parse the User Agent field, instead of using a regex, took 18.7 seconds. Processing a gigabyte of almost a million JSON objects into useful histograms in less than 8 seconds just blows my mind. But then I figured out how to use Rayon, and now it can parse 8 gzipped log files in parallel on an 8-core MacBook Pro, and that's super fast.
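
Here's a rough sketch of that pipeline, for illustration only: it is not kirby's actual source, and the crate choices (flate2 for gzip, serde_json for parsing, regex for the User Agent, rayon for one file per core) plus every name in it are assumptions.

// Illustrative sketch only -- not kirby's code. Reads a gzipped log file line
// by line, parses each line as JSON, and pulls the Bundler version out of the
// User Agent with a regex; rayon runs one file per core.
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

use flate2::read::GzDecoder;
use rayon::prelude::*;
use regex::Regex;
use serde::Deserialize;

#[derive(Deserialize)]
struct Request {
    user_agent: String,
}

fn bundler_versions(path: &str) -> HashMap<String, u64> {
    let bundler = Regex::new(r"bundler/(\S+)").unwrap();
    let reader = BufReader::new(GzDecoder::new(File::open(path).expect("log file")));
    let mut counts = HashMap::new();

    for line in reader.lines() {
        let line = line.expect("line");
        // Each line is one JSON event object.
        if let Ok(req) = serde_json::from_str::<Request>(&line) {
            if let Some(caps) = bundler.captures(&req.user_agent) {
                *counts.entry(caps[1].to_string()).or_insert(0) += 1;
            }
        }
    }
    counts
}

fn main() {
    let files = vec!["2018-04-01-00.log.gz", "2018-04-01-01.log.gz"];
    // One gzipped file per core, courtesy of rayon.
    let per_file: Vec<_> = files.par_iter().map(|f| bundler_versions(f)).collect();
    println!("{:?}", per_file);
}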

Then Rust got more optimized and Apple released the M1, and it got still faster. Finally, I found the profile-guided optimization docs, and it improved even more than I thought was possible.

Wait, how fast?

    ~525 records/second/cpu in Python on Apache Spark in AWS Glue
 ~14,000 records/second/cpu in Ruby on a 2018 Intel MacBook Pro
~353,000 records/second/cpu in Rust on a 2018 Intel MacBook Pro
~550,000 records/second/cpu in Rust on a 2021 M1 MacBook Pro
~638,000 records/second/cpu in Rust on M1 with profile-guided optimization

Are you kidding me?

No. The latest version (which I am now benchmarking without also running cargo build 🤦🏻‍♂️) can parse records really, really fast.

    ~4,200 records/second in Python with 8 worker instances on AWS Glue
~1,085,000 records/second in Rust with rayon on an 8-core Intel MacBook Pro
~3,195,000 records/second in Rust with rayon on a 10-core M1 MacBook Pro
~3,583,000 records/second in Rust with rayon on M1 with profile-guided optimization

What does it calculate?

It counts Bundler, RubyGems, and Ruby versions, in hourly buckets, and prints those out as nested JSON to stdout.
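
Just to illustrate the shape of that output (this is not kirby's exact schema; the bucket and field names here are made up), the counters are basically nested maps serialized with serde_json:

// Illustrative only: hour bucket -> field -> value -> count, printed as JSON.
use std::collections::HashMap;

fn main() {
    let mut counts: HashMap<String, HashMap<String, HashMap<String, u64>>> = HashMap::new();

    // Pretend we just saw a request from Bundler 1.16.1 in the 15:00 UTC bucket.
    *counts
        .entry("2018-04-01T15".to_string())
        .or_default()
        .entry("bundler".to_string())
        .or_default()
        .entry("1.16.1".to_string())
        .or_default() += 1;

    // {"2018-04-01T15":{"bundler":{"1.16.1":1}}}
    println!("{}", serde_json::to_string(&counts).unwrap());
}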

Tell me more about how this happened.

Okay, I wrote a blog post with details about creating this library, and a follow-up about more optimizations.

kirby's People

Contributors

byroot, indirect, killercup, rubenrua, tkaitchuck, wezm


kirby's Issues

Clarify I/O in benchmark numbers

Hi, I would like to test kirby with a different format: a 3 GB text file consisting of JSON objects (one per line).

Am I right that the impressive benchmark numbers were achieved by streaming the data in, rather than reading files from disk (SSD/NVMe)?

If you did indeed achieve them with files read from disk, could you clarify how you got around the bottleneck of read speeds?

Maybe this question doesn't make sense, but I feel the speed of serde would be wasted in our setup because we are dealing with files as input.

Question about S3 logs

Hi!
Thanks for the great article. Just wondering, how did you keep this thing fed with S3 logs?

Thanks!
J

Impressive

That's quite a speedup from the original!

Although this is not a complete solution, I was curious how plain old JavaScript would perform. I did a basic test and it performed reasonably well: it processed a one-million-line log file in 1.5 seconds on my Windows machine (Intel Core i7-4770 CPU @ 3.40 GHz, 4 cores, 8 logical processors) using the SpiderMonkey JavaScript engine (https://archive.mozilla.org/pub/firefox/nightly/latest-mozilla-central/).

One can accomplish a lot in a few lines of JavaScript code :-)

C:\spidermonkey>cat sample_1000000.log.txt | js parseJSONFile.js
Processed 1000000 records in 1562ms

//parseJSONFile.js
var start = Date.now();
var records = [];
var x;

// Pull "<key>/<value>" out of the user agent string, or "" if the key is absent.
function getVal(key, str) {
  return str.split(key + "/")[1] ? str.split(key + "/")[1].split(" ")[0] : "";
}

// readline() returns null at EOF, which parses to null and ends the loop.
while (x = JSON.parse(readline())) {
  var r = x.user_agent.toLowerCase();
  x._user_agent = {
    "bundler": getVal("bundler", r),
    "rubygems": getVal("rubygems", r),
    "ruby": getVal("ruby", r),
    "platform": r.split("(")[1] ? r.split("(")[1].split(")")[0] : "",
    "command": getVal("command", r),
    "options": getVal("options", r),
    "jruby": getVal("jruby", r),
    "truffleruby": getVal("truffleruby", r),
    "ci": getVal("ci", r),
    "gemstash": getVal("gemstash", r)
  };
  records.push(x);
}
print("Processed " + records.length + " records in " + (Date.now() - start) + "ms");

Performance suggestion: Use zstd compression with a dictionary for the logs

Hi,

I've noticed that you're open to more performance suggestions.
To make this one work, you'll have to change the logger that dumps the log files to S3 so that it compresses the logs using zstd and a pretrained dictionary.

It works wonders for repetitive data, which is what logs are. I went from a ~10 MB zip file (~50 MB uncompressed, repetitive HTML files) to ~1 MB compressed with a 100 KB dictionary (if you try it, experiment with different dictionary sizes to find the sweet spot), all using the Rust zstd bindings. The sense of wonder hasn't left me since.

This may not work as well for your case, but I assume it will. Basically, you train a dictionary on one of your 1 GB logs, use it for compression/decompression from then on, and save the dictionary in S3 as well, associating each archive with its dictionary (maybe by file name: log_dict1_stamp.bin) in case you want to train more dictionaries later.
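
For what it's worth, here's a minimal sketch of the dictionary idea using the Rust zstd crate; the sample data, dictionary size, and compression level are made-up placeholders, and real training would use lines sampled from an actual log file.

// Sketch of dictionary training plus dictionary-based (de)compression with the
// zstd crate. Sample data, dictionary size, and level are placeholders.
use std::io;

fn main() -> io::Result<()> {
    // Stand-in for JSON log lines; in practice, sample lines from a real log.
    let samples: Vec<Vec<u8>> = (0..20_000)
        .map(|i| {
            format!(
                r#"{{"timestamp":"2018-04-01T00:{:02}:{:02}Z","user_agent":"bundler/1.16.{} rubygems/2.7.{} ruby/2.5.0"}}"#,
                (i / 60) % 60, i % 60, i % 10, i % 10
            )
            .into_bytes()
        })
        .collect();

    // Train a dictionary; the size is the knob worth tuning.
    let dict = zstd::dict::from_samples(&samples, 16 * 1024)?;

    // Compress one record with the shared dictionary...
    let mut compressor = zstd::bulk::Compressor::with_dictionary(3, &dict)?;
    let compressed = compressor.compress(&samples[0])?;

    // ...and decompress it again; anything holding the same dictionary can.
    let mut decompressor = zstd::bulk::Decompressor::with_dictionary(&dict)?;
    let restored = decompressor.decompress(&compressed, samples[0].len())?;
    assert_eq!(restored, samples[0]);

    println!("{} bytes -> {} bytes", samples[0].len(), compressed.len());
    Ok(())
}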

I can't provide any size/performance comparisons since I don't have access to your data, and this would require more changes to the infrastructure.

Decompression speed should be well over 1 GB/s on modern hardware. It might even be worth decompressing as a stream and searching the data over the stream (though you'd have to copy the found strings); obviously the tool would change a fair amount with that architecture.

Anyway, please feel free to ignore this if you've already thought about it, or it's obvious, or it's not practical.

Performance improvements

Looking at your code, there are a few areas where it could be made even faster; a rough sketch combining these ideas follows the list.

  • file.rs is doing BufReader::new without specifying a size, which results in a buffer of just 8 KB. You might see some improvement by increasing this size.
  • In lib.rs it's calling lines() on the stream. This parses the data from bytes into a UTF-8 String, which is then handed to serde to deserialize. If you use split instead, you get the raw bytes without that conversion, and serde can deserialize directly from bytes, so you can cut out the whole UTF-8 parsing step.
  • The counters are going into a HashMap that uses the default DoS-resistant hash function, which is not the fastest. You can plug in an alternative hasher such as https://crates.io/crates/fnv .
  • In the case of the inner map, even that is not needed, because all the keys are just string constants. It would be both more type-safe and faster to replace the strings with an enum and use an enum-map.
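
For illustration, here is a rough sketch combining the first three suggestions; it is not kirby's code, and the struct, field, and file names are placeholders.

// Sketch only: larger read buffer, byte-wise line splitting handed directly to
// serde_json, and an FNV-hashed HashMap for the counters.
use std::fs::File;
use std::io::{BufRead, BufReader};

use fnv::FnvHashMap;    // fnv = "1"
use serde::Deserialize; // serde = { version = "1", features = ["derive"] }

#[derive(Deserialize)]
struct Request {
    user_agent: String,
}

fn main() -> std::io::Result<()> {
    // Suggestion 1: a 1 MB buffer instead of the 8 KB default.
    let file = File::open("sample.log")?;
    let reader = BufReader::with_capacity(1024 * 1024, file);

    // Suggestion 3: FNV instead of the default SipHash-based hasher.
    let mut counts: FnvHashMap<String, u64> = FnvHashMap::default();

    // Suggestion 2: split() yields raw byte lines, and serde_json can
    // deserialize straight from bytes, skipping the UTF-8 String step.
    for line in reader.split(b'\n') {
        let line = line?;
        if let Ok(req) = serde_json::from_slice::<Request>(&line) {
            *counts.entry(req.user_agent).or_insert(0) += 1;
        }
    }

    println!("{} distinct user agents", counts.len());
    Ok(())
}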

Kirby imports not resolving

Hi, I am building the Dockerfile, and after fixing some of the paths inside it, the build gets to step 20, where it says:
---> 861b440a6508
Step 20/34 : RUN cargo build --target $BUILD_TARGET --release --bin kirby-s3
---> Running in 1253d5f404d1
Compiling kirby v0.1.0 (/build)
error[E0432]: unresolved import kirby::Options
--> src/bin/kirby-s3.rs:13:5
|
13 | use kirby::Options;
| ^^^^^^^^^^^^^^ no Options in the root

error[E0432]: unresolved import kirby::stream_stats
--> src/bin/kirby-s3.rs:20:5
|
20 | use kirby::stream_stats;
| ^^^^^^^^^^^^^^^^^^^ no stream_stats in the root

warning: trait objects without an explicit dyn are deprecated
--> src/bin/kirby-s3.rs:22:53
|
22 | fn read_object(bucket_name: &str, key: &str) -> Box {
| ^^^^^^^ help: use dyn: dyn BufRead
|
= note: #[warn(bare_trait_objects)] on by default

error: aborting due to 2 previous errors

For more information about this error, try rustc --explain E0432.
error: could not compile kirby.

To learn more, run the command again with --verbose.

Can you help with this?
