(Sorry if this isn't the right place for this question or if the question doesn't real


Difference between CLP and Parquet? about clp HOT 2 CLOSED

y-scope commented on July 23, 2024

Difference between CLP and Parquet?

Comments (2)

kirkrodrigues commented on July 23, 2024 1

Hi @emschwartz,

So sorry for the delayed response.

We don't currently have a detailed write-up, but perhaps I can provide some info here.

For some frame of reference, to store (unstructured) logs in Parquet files, the simplest approach would be to store timestamps (as an offset from the UNIX epoch), message content, and other fields (e.g., log-level) in separate columns. This should give you some compression as a result of Parquet's built-in encodings. For instance, you could use delta-encoding for the timestamps and dictionary-encoding for the other fields.
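As a rough illustration of those two encodings, here is a pure-Python sketch. It is not Parquet's actual implementation (its delta encoding also packs the deltas into bit-packed blocks, for example); the function names and sample values are made up for illustration.

```python
def delta_encode(values):
    """Keep the first value, then store each value's difference from its predecessor."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values by accumulating the deltas."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []
    index_of = {}
    ids = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index_of[v])
    return dictionary, ids


# Hypothetical sample columns: millisecond timestamps and log levels.
timestamps = [1721700000000, 1721700000012, 1721700000015, 1721700000040]
levels = ["INFO", "INFO", "WARN", "INFO"]

ts_deltas = delta_encode(timestamps)
level_dict, level_ids = dictionary_encode(levels)
```

After the first entry, the timestamp column becomes small integers, and the log-level column becomes a two-entry dictionary plus small indexes, which is where the compression comes from.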

Similarities to CLP:

Both Parquet and CLP store logs in a columnar format where the columns are split into row-groups (such that you don't end up with extremely large columns).
Both apply specialized encodings per data-type to improve compression ratio.
On a similar note, both can use dictionaries for data deduplication.

Differences with CLP:

With Parquet's dictionary-encoding, the message content likely wouldn't compress very well (at least not as well as CLP) since the content contains variable values. In contrast, CLP's parsing separates the variable values from the rest of the message which benefits compression since each column is less random. That said, if you use CLP's parsing, you could store the parsed log messages in Parquet.
Parquet's dictionary is limited to a certain size in each column chunk, so the level of deduplication is more localized compared to CLP where we maintain separate dictionary files per archive.
Parquet's on-disk layout seems optimized for sequential writes but not necessarily sequential reads. For instance, the metadata is stored at the end of the file, so using it requires seeking to the end of the file, then seeking backwards to read the metadata and eventually the desired columns. In contrast, CLP's layout is optimized for both sequential writes and reads (so that we can take advantage of low-cost storage like hard drives and block storage). First, CLP stores its metadata in files separate from the actual data. In addition, CLP avoids seeks when writing and reading the data. That said, I must admit I'm not too familiar with the latest innovations in the Parquet space, and I would imagine they have some solutions for this problem.
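To make the first difference more concrete, here is a heavily simplified Python sketch of the idea of separating variable values from the static text. It is not CLP's actual parser: the rule "any token containing a digit is a variable" and the `<VAR>` placeholder are assumptions chosen for illustration, and the sample messages are made up.

```python
import re

PLACEHOLDER = "<VAR>"  # hypothetical marker, not CLP's real placeholder encoding
TOKEN_WITH_DIGIT = re.compile(r"\S*\d\S*")  # assumed rule: tokens with a digit are variables


def split_message(message):
    """Return (static template, list of variable values) for one log message."""
    variables = TOKEN_WITH_DIGIT.findall(message)
    template = TOKEN_WITH_DIGIT.sub(PLACEHOLDER, message)
    return template, variables


messages = [
    "Task 17 finished in 350 ms",
    "Task 18 finished in 412 ms",
    "Task 19 finished in 298 ms",
    "Connection to 10.0.0.5 lost",
    "Connection to 10.0.0.9 lost",
]

# Dictionary-encoding whole messages: every message is distinct, so nothing dedupes.
whole_message_dictionary = set(messages)

# Dictionary-encoding templates: the static text repeats, so the dictionary stays small.
templates = set()
variable_rows = []
for m in messages:
    template, variables = split_message(m)
    templates.add(template)
    variable_rows.append(variables)
```

In this toy data the whole-message dictionary has five entries while the template dictionary has two, and the variable values end up in their own, more uniform columns.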

Overall, I would say Parquet and CLP have similar storage formats (at least at the level of storing log events), which is part of what motivated using it at Uber. That said, we anticipate some potential compression and performance bottlenecks (including for search) when using Parquet since we wouldn't have as much flexibility to choose the precise layout of the data.
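As a small illustration of the layout point about metadata at the end of the file, here is a Python sketch of the read pattern. It uses a toy file that borrows Parquet's trailer convention (the metadata, then a 4-byte little-endian length, then the `PAR1` magic); the JSON metadata and the function names are stand-ins invented for this example, not Parquet's real metadata format.

```python
import io
import json
import struct

MAGIC = b"PAR1"


def write_toy_file(column_bytes):
    """Write magic, column data, metadata, metadata length, magic (one sequential pass)."""
    buf = io.BytesIO()
    buf.write(MAGIC)
    offset = buf.tell()
    buf.write(column_bytes)
    # Toy metadata: where the column lives. Real Parquet metadata is far richer.
    footer = json.dumps({"offset": offset, "length": len(column_bytes)}).encode("utf-8")
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))
    buf.write(MAGIC)
    return buf


def read_footer(f):
    """Seek to the end for the length, then seek backwards to the metadata itself."""
    f.seek(-8, io.SEEK_END)  # seek 1: trailer (length + magic)
    (footer_len,) = struct.unpack("<I", f.read(4))
    if f.read(4) != MAGIC:
        raise ValueError("missing trailing magic")
    f.seek(-(8 + footer_len), io.SEEK_END)  # seek 2: start of the metadata
    return json.loads(f.read(footer_len))


def read_column(f):
    """Use the metadata to locate the column, which needs a third seek."""
    meta = read_footer(f)
    f.seek(meta["offset"])  # seek 3: back towards the start for the data
    return f.read(meta["length"])


toy = write_toy_file(b"column-data")
column = read_column(toy)
```

Writing is a single forward pass, but reading one column takes three seeks before any column bytes arrive, which is the pattern that hurts on seek-expensive storage.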

emschwartz commented on July 23, 2024

Thanks @kirkrodrigues, that's exactly what I was looking for!
