
Comments (2)

kirkrodrigues commented on July 23, 2024

Hi @emschwartz,

So sorry for the delayed response.

We don't currently have a detailed write-up, but perhaps I can provide some info here.

For some frame of reference, to store (unstructured) logs in Parquet files, the simplest approach would be to store timestamps (as an offset from the UNIX epoch), message content, and other fields (e.g., log-level) in separate columns. This should give you some compression as a result of Parquet's built-in encodings. For instance, you could use delta-encoding for the timestamps and dictionary-encoding for the other fields.
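As a rough illustration of why those two encodings help, here is a minimal pure-Python sketch of the ideas behind them. It is conceptual only: Parquet's real delta and dictionary encodings also bit-pack their output and work on pages, and the function names and sample data below are hypothetical.

```python
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store each value's difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    return list(accumulate(deltas))


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary, index_of, ids = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index_of[v])
    return dictionary, ids


# Hypothetical sample data: epoch-millisecond timestamps and log levels.
timestamps = [1700000000000, 1700000000012, 1700000000015, 1700000000040]
levels = ["INFO", "INFO", "WARN", "INFO"]

ts_deltas = delta_encode(timestamps)               # one large value, then small deltas
level_dict, level_ids = dictionary_encode(levels)  # few distinct strings, small integer ids
```

Timestamps in a log are usually close together, so the deltas are small numbers that need few bits, and a low-cardinality field like log-level collapses into a tiny dictionary plus a column of small integers.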

Similarities to CLP:

  • Both Parquet and CLP store logs in a columnar format where the columns are split into row-groups (such that you don't end up with extremely large columns).
  • Both apply specialized encodings per data-type to improve compression ratio.
  • On a similar note, both can use dictionaries for data deduplication.

Differences with CLP:

  • With Parquet's dictionary-encoding, the message content likely wouldn't compress very well (at least not as well as with CLP) since the content contains variable values, so most messages are unique. In contrast, CLP's parsing separates the variable values from the rest of the message, which benefits compression since each column is less random. That said, if you use CLP's parsing, you could store the parsed log messages in Parquet.

  • Parquet's dictionary is limited to a certain size in each column chunk, so the level of deduplication is more localized compared to CLP where we maintain separate dictionary files per archive.

  • Parquet's on-disk layout seems optimized for sequential writes but not necessarily sequential reads. For instance, the metadata is stored at the end of the file, so using it requires seeking to the end of the file, then seeking backwards to read the metadata and eventually the desired columns. In contrast, CLP's layout is optimized for both sequential writes and reads (so that we can take advantage of low-cost storage like hard drives and block storage). First, CLP stores its metadata in files separate from the actual data. In addition, CLP avoids seeks when writing and reading the data. I must admit, though, that I'm not too familiar with the latest innovations in the Parquet space; I would imagine they have some solutions for this problem.
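To make the first difference above more concrete, here is a toy Python sketch of separating variable values from the static text of a message. This is not CLP's actual parser or encoding: the rule "any token containing a digit is a variable", the `<VAR>` placeholder, and the sample messages are all assumptions made for illustration.

```python
import re

# Assumption: any whitespace-delimited token that contains a digit is a variable.
VAR_TOKEN = re.compile(r"\S*\d\S*")
PLACEHOLDER = "<VAR>"  # hypothetical marker, not CLP's real placeholder


def split_message(message):
    """Return (static template, list of variable values) for one log message."""
    variables = VAR_TOKEN.findall(message)
    template = VAR_TOKEN.sub(PLACEHOLDER, message)
    return template, variables


# Hypothetical log messages that differ only in their variable values.
messages = [
    "Task 17 finished in 230 ms",
    "Task 18 finished in 198 ms",
    "Task 19 finished in 305 ms",
]

# Dictionary-encoding whole messages: every message is distinct.
whole_message_dict = set(messages)

# Dictionary-encoding after parsing: one shared template, variables stored separately.
parsed = [split_message(m) for m in messages]
template_dict = {template for template, _ in parsed}
variable_columns = [variables for _, variables in parsed]
```

With whole messages, the dictionary grows with every new message, whereas after parsing the three messages share a single template and the variable values land in their own, more uniform columns.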

Overall, I would say Parquet and CLP have similar storage formats (at least at the level of storing log events), which is part of what motivated using it at Uber. That said, we anticipate some potential compression and performance bottlenecks (including for search) with using Parquet since we wouldn't have as much flexibility to choose the precise layout of the data.

emschwartz commented on July 23, 2024

Thanks @kirkrodrigues, that's exactly what I was looking for!
