
kirby's Introduction

kirby

Kirby slurps up the firehose of logs from Fastly and calculates daily counts for various Ruby ecosystem statistics, pretty quickly.

How fast is pretty quickly?

For an 80MB gzipped log file containing 915,427 JSON event objects (which is 1.02GB uncompressed):

  • 2.7 seconds total to read the entire file line by line
  • 5.0 seconds total to also parse every JSON object into a Rust struct
  • 7.8 seconds total to further parse every User Agent field for Bundler, RubyGems, and Ruby versions and other metrics

This is... very good. For comparison, a Python script that used AWS Glue to do something similar took about 30 minutes. My first approach of writing a nom parser-combinator to parse the User Agent field, instead of using a regex, took 18.7 seconds. Processing a gigabyte of almost a million JSON objects into useful histograms in less than 8 seconds just blows my mind. But then I figured out how to use Rayon, and now it can parse 8 gzipped log files in parallel on an 8-core MacBook Pro, and that's super fast.
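
Here's a rough sketch of that pipeline, for illustration only: it is not kirby's actual source, and the crate choices (flate2 for gzip, serde_json for parsing, regex for the User Agent, rayon for one file per core) plus every name in it are assumptions.

// Illustrative sketch only -- not kirby's code. Reads a gzipped log file line
// by line, parses each line as JSON, and pulls the Bundler version out of the
// User Agent with a regex; rayon runs one file per core.
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

use flate2::read::GzDecoder;
use rayon::prelude::*;
use regex::Regex;
use serde::Deserialize;

#[derive(Deserialize)]
struct Request {
    user_agent: String,
}

fn bundler_versions(path: &str) -> HashMap<String, u64> {
    let bundler = Regex::new(r"bundler/(\S+)").unwrap();
    let reader = BufReader::new(GzDecoder::new(File::open(path).expect("log file")));
    let mut counts = HashMap::new();

    for line in reader.lines() {
        let line = line.expect("line");
        // Each line is one JSON event object.
        if let Ok(req) = serde_json::from_str::<Request>(&line) {
            if let Some(caps) = bundler.captures(&req.user_agent) {
                *counts.entry(caps[1].to_string()).or_insert(0) += 1;
            }
        }
    }
    counts
}

fn main() {
    let files = vec!["2018-04-01-00.log.gz", "2018-04-01-01.log.gz"];
    // One gzipped file per core, courtesy of rayon.
    let per_file: Vec<_> = files.par_iter().map(|f| bundler_versions(f)).collect();
    println!("{:?}", per_file);
}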

Then Rust got more optimized and Apple released the M1, and it got still faster. Finally, I found the profile-guided optimization docs, and it improved even more than I thought was possible.

Wait, how fast?

    ~525 records/second/cpu in Python on Apache Spark in AWS Glue
 ~14,000 records/second/cpu in Ruby on a 2018 Intel MacBook Pro
~353,000 records/second/cpu in Rust on a 2018 Intel MacBook Pro
~550,000 records/second/cpu in Rust on a 2021 M1 MacBook Pro
~638,000 records/second/cpu in Rust on M1 with profile-guided optimization

Are you kidding me?

No. The latest version (which I am now benchmarking without also running cargo build 🤦🏻‍♂️) can parse records really, really fast.

    ~4,200 records/second in Python with 8 worker instances on AWS Glue
~1,085,000 records/second in Rust with rayon on an 8-core Intel MacBook Pro
~3,195,000 records/second in Rust with rayon on a 10-core M1 MacBook Pro
~3,583,000 records/second in Rust with rayon on M1 with profile-guided optimization

What does it calculate?

It counts Bundler, RubyGems, and Ruby versions, in hourly buckets, and prints those out as nested JSON to stdout.
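
Just to illustrate the shape of that output (this is not kirby's exact schema; the bucket and field names here are made up), the counters are basically nested maps serialized with serde_json:

// Illustrative only: hour bucket -> field -> value -> count, printed as JSON.
use std::collections::HashMap;

fn main() {
    let mut counts: HashMap<String, HashMap<String, HashMap<String, u64>>> = HashMap::new();

    // Pretend we just saw a request from Bundler 1.16.1 in the 15:00 UTC bucket.
    *counts
        .entry("2018-04-01T15".to_string())
        .or_default()
        .entry("bundler".to_string())
        .or_default()
        .entry("1.16.1".to_string())
        .or_default() += 1;

    // {"2018-04-01T15":{"bundler":{"1.16.1":1}}}
    println!("{}", serde_json::to_string(&counts).unwrap());
}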

Tell me more about how this happened.

Okay, I wrote a blog post with details about creating this library, and a follow-up about more optimizations.

kirby's People

Contributors

byroot, indirect, killercup, rubenrua, tkaitchuck, wezm


kirby's Issues

Clarify I/O in benchmark numbers

Hi, I would like to test kirby with a different format: a 3 GB text file consisting of JSON objects (one per line).

Am I right that the impressive benchmark numbers were achieved by streaming the data in, rather than reading files from disk (SSD/NVMe)?

If you did indeed achieve them with files read from disk, could you clarify how you got around the bottleneck of read speeds?

Maybe this question doesn't make sense, but I feel the speed of serde would be wasted in our setup because we are dealing with files as input.

Question about S3 logs

Hi!
Thanks for the great article. Just wondering, how did you keep this thing fed with S3 logs?

Thanks!
J

Impressive

That's quite a speedup from the original!

Although this is not a complete solution, I was curious how plain old JavaScript would perform. I did a basic test and it performed reasonably well: it processed a one-million-line log file in 1.5 seconds on my Windows machine (Intel Core i7-4770 CPU @ 3.40 GHz, 4 cores, 8 logical processors) using the SpiderMonkey JavaScript engine (https://archive.mozilla.org/pub/firefox/nightly/latest-mozilla-central/).

One can accomplish a lot in a few lines of JavaScript code :-)

C:\spidermonkey>cat sample_1000000.log.txt | js parseJSONFile.js
Processed 1000000 records in 1562ms

//parseJSONFile.js
var start = Date.now();
var records = [];
var x;

// Pull "<key>/<value>" out of the user agent string, or "" if the key is absent.
function getVal(key, str) {
  return str.split(key + "/")[1] ? str.split(key + "/")[1].split(" ")[0] : "";
}

// readline() returns null at EOF, which parses to null and ends the loop.
while (x = JSON.parse(readline())) {
  var r = x.user_agent.toLowerCase();
  x._user_agent = {
    "bundler": getVal("bundler", r),
    "rubygems": getVal("rubygems", r),
    "ruby": getVal("ruby", r),
    "platform": r.split("(")[1] ? r.split("(")[1].split(")")[0] : "",
    "command": getVal("command", r),
    "options": getVal("options", r),
    "jruby": getVal("jruby", r),
    "truffleruby": getVal("truffleruby", r),
    "ci": getVal("ci", r),
    "gemstash": getVal("gemstash", r)
  };
  records.push(x);
}
print("Processed " + records.length + " records in " + (Date.now() - start) + "ms");

Performance suggestion: Use zstd compression with a dictionary for the logs

Hi,

I've noticed that you're open to more performance suggestions.
To make this one work, you'll have to change the logger that dumps the log files to S3 so that it compresses the logs using zstd and a pretrained dictionary.

It works wonders for repetitive data, which is what logs are. I went from a ~10 MB zip file (~50 MB uncompressed, repetitive HTML files) to ~1 MB compressed with a 100 KB dictionary (if you try it, experiment with different dictionary sizes to find the sweet spot), all using the Rust zstd bindings. The sense of wonder hasn't left me since.

This may not work as well for your case, but I assume it will. Basically, you train a dictionary on one of your 1 GB logs, use it for compression/decompression from then on, and save the dictionary in S3 as well, associating each archive with its dictionary (maybe by file name: log_dict1_stamp.bin) in case you want to train more dictionaries later.
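
For what it's worth, here's a minimal sketch of the dictionary idea using the Rust zstd crate; the sample data, dictionary size, and compression level are made-up placeholders, and real training would use lines sampled from an actual log file.

// Sketch of dictionary training plus dictionary-based (de)compression with the
// zstd crate. Sample data, dictionary size, and level are placeholders.
use std::io;

fn main() -> io::Result<()> {
    // Stand-in for JSON log lines; in practice, sample lines from a real log.
    let samples: Vec<Vec<u8>> = (0..20_000)
        .map(|i| {
            format!(
                r#"{{"timestamp":"2018-04-01T00:{:02}:{:02}Z","user_agent":"bundler/1.16.{} rubygems/2.7.{} ruby/2.5.0"}}"#,
                (i / 60) % 60, i % 60, i % 10, i % 10
            )
            .into_bytes()
        })
        .collect();

    // Train a dictionary; the size is the knob worth tuning.
    let dict = zstd::dict::from_samples(&samples, 16 * 1024)?;

    // Compress one record with the shared dictionary...
    let mut compressor = zstd::bulk::Compressor::with_dictionary(3, &dict)?;
    let compressed = compressor.compress(&samples[0])?;

    // ...and decompress it again; anything holding the same dictionary can.
    let mut decompressor = zstd::bulk::Decompressor::with_dictionary(&dict)?;
    let restored = decompressor.decompress(&compressed, samples[0].len())?;
    assert_eq!(restored, samples[0]);

    println!("{} bytes -> {} bytes", samples[0].len(), compressed.len());
    Ok(())
}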

I can't provide any size/performance comparisons since I don't have access to your data, and this would require more changes to the infrastructure.

Decompression speed should be well over 1 GB/s on modern hardware. It might even be worth decompressing as a stream and searching the data over the stream (though you'd have to copy the found strings); obviously the tool would change a fair amount with that architecture.

Anyway, please feel free to ignore this if you've already thought about it, or it's obvious, or it's not practical.

Performance improvements

Looking at your code, there are a few areas where it could be made even faster; a rough sketch combining these ideas follows the list.

  • file.rs is doing BufReader::new without specifying a size, which results in a buffer of just 8 KB. You might see some improvement by increasing this size.
  • In lib.rs it's calling lines() on the stream. This parses the data from bytes into a UTF-8 String, which is then handed to serde to deserialize. If you use split instead, you get the raw bytes without that conversion, and serde can deserialize directly from bytes, so you can cut out the whole UTF-8 parsing step.
  • The counters are going into a HashMap that uses the default DoS-resistant hash function, which is not the fastest. You can plug in an alternative hasher such as https://crates.io/crates/fnv .
  • In the case of the inner map, even that is not needed, because all the keys are just string constants. It would be both more type-safe and faster to replace the strings with an enum and use an enum-map.
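
For illustration, here is a rough sketch combining the first three suggestions; it is not kirby's code, and the struct, field, and file names are placeholders.

// Sketch only: larger read buffer, byte-wise line splitting handed directly to
// serde_json, and an FNV-hashed HashMap for the counters.
use std::fs::File;
use std::io::{BufRead, BufReader};

use fnv::FnvHashMap;    // fnv = "1"
use serde::Deserialize; // serde = { version = "1", features = ["derive"] }

#[derive(Deserialize)]
struct Request {
    user_agent: String,
}

fn main() -> std::io::Result<()> {
    // Suggestion 1: a 1 MB buffer instead of the 8 KB default.
    let file = File::open("sample.log")?;
    let reader = BufReader::with_capacity(1024 * 1024, file);

    // Suggestion 3: FNV instead of the default SipHash-based hasher.
    let mut counts: FnvHashMap<String, u64> = FnvHashMap::default();

    // Suggestion 2: split() yields raw byte lines, and serde_json can
    // deserialize straight from bytes, skipping the UTF-8 String step.
    for line in reader.split(b'\n') {
        let line = line?;
        if let Ok(req) = serde_json::from_slice::<Request>(&line) {
            *counts.entry(req.user_agent).or_insert(0) += 1;
        }
    }

    println!("{} distinct user agents", counts.len());
    Ok(())
}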

Kirby imports not resolving

Hi, I am building the Dockerfile, and after fixing some of the paths inside it, the build gets to step 20, where it says:
---> 861b440a6508
Step 20/34 : RUN cargo build --target $BUILD_TARGET --release --bin kirby-s3
---> Running in 1253d5f404d1
Compiling kirby v0.1.0 (/build)
error[E0432]: unresolved import kirby::Options
--> src/bin/kirby-s3.rs:13:5
|
13 | use kirby::Options;
| ^^^^^^^^^^^^^^ no Options in the root

error[E0432]: unresolved import kirby::stream_stats
--> src/bin/kirby-s3.rs:20:5
|
20 | use kirby::stream_stats;
| ^^^^^^^^^^^^^^^^^^^ no stream_stats in the root

warning: trait objects without an explicit dyn are deprecated
--> src/bin/kirby-s3.rs:22:53
|
22 | fn read_object(bucket_name: &str, key: &str) -> Box {
| ^^^^^^^ help: use dyn: dyn BufRead
|
= note: #[warn(bare_trait_objects)] on by default

error: aborting due to 2 previous errors

For more information about this error, try rustc --explain E0432.
error: could not compile kirby.

To learn more, run the command again with --verbose.

Can you help with this?
