Comments (2)
Hi @emschwartz,
So sorry for the delayed response.
We don't currently have a detailed write-up, but perhaps I can provide some info here.
For some frame of reference, to store (unstructured) logs in Parquet files, the simplest approach would be to store timestamps (as an offset from the UNIX epoch), message content, and other fields (e.g., log-level) in separate columns. This should give you some compression as a result of Parquet's built-in encodings. For instance, you could use delta-encoding for the timestamps and dictionary-encoding for the other fields.
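To make the two encodings concrete, here is a minimal pure-Python sketch of the ideas behind delta-encoding and dictionary-encoding. It is not Parquet's actual implementation (Parquet's real encodings also bit-pack the results, among other things); the function names and the sample data are made up for illustration.

```python
def delta_encode(timestamps):
    """Keep the first value, then store each value as its difference from the previous one."""
    if not timestamps:
        return []
    deltas = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        deltas.append(cur - prev)
    return deltas


def dictionary_encode(values):
    """Replace each value with an integer ID that indexes a list of distinct values."""
    dictionary = []    # distinct values, in order of first appearance
    ids_by_value = {}  # value -> ID
    encoded = []       # one ID per input value
    for value in values:
        if value not in ids_by_value:
            ids_by_value[value] = len(dictionary)
            dictionary.append(value)
        encoded.append(ids_by_value[value])
    return dictionary, encoded


# Hypothetical sample data: millisecond offsets from the UNIX epoch and log levels.
timestamps_ms = [1700000000000, 1700000000012, 1700000000015, 1700000000040]
levels = ["INFO", "INFO", "WARN", "INFO"]

ts_deltas = delta_encode(timestamps_ms)
level_dict, level_ids = dictionary_encode(levels)
```

After the first timestamp, the deltas are small integers, and the log-level column shrinks to a short dictionary plus small IDs, which is why both columns compress well.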
Similarities to CLP:
- Both Parquet and CLP store logs in a columnar format where the columns are split into row-groups (such that you don't end up with extremely large columns).
- Both apply specialized encodings per data-type to improve compression ratio.
- On a similar note, both can use dictionaries for data deduplication.
Differences with CLP:
- With Parquet's dictionary-encoding, the message content likely wouldn't compress very well (at least not as well as CLP) since the content contains variable values. In contrast, CLP's parsing separates the variable values from the rest of the message which benefits compression since each column is less random. That said, if you use CLP's parsing, you could store the parsed log messages in Parquet.
- Parquet's dictionary is limited to a certain size in each column chunk, so the level of deduplication is more localized compared to CLP where we maintain separate dictionary files per archive.
- Parquet's on-disk layout seems optimized for sequential writes but not necessarily sequential reads. For instance, the metadata is stored at the end of the file, so using it requires seeking to the end of the file, then seeking backwards to read the metadata and eventually the desired columns. In contrast, CLP's layout is optimized for both sequential writes and reads (so that we can take advantage of low-cost storage like hard drives and block storage). First, CLP stores its metadata in files separate from the actual data. In addition, CLP avoids seeks when writing and reading the data. That said, I must admit I'm not too familiar with the latest innovations in the Parquet space; I would imagine they have some solutions for this problem.
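To illustrate the first difference, here is a rough Python sketch of why separating variable values from the static text helps a dictionary. It is a deliberately simplified stand-in, not CLP's actual parsing: the rule "any whitespace-delimited token containing a digit is a variable", the `<var>` placeholder, and the sample messages are all assumptions made for this example.

```python
import re

# Assumption for this sketch only: a token that contains a digit is a variable value.
VAR_PATTERN = re.compile(r"\S*\d\S*")
PLACEHOLDER = "<var>"  # hypothetical placeholder, not CLP's real encoding


def split_message(message):
    """Return (static template, list of variable values) for one log message."""
    variables = VAR_PATTERN.findall(message)
    template = VAR_PATTERN.sub(PLACEHOLDER, message)
    return template, variables


messages = [
    "Task 17 finished in 230 ms",
    "Task 18 finished in 198 ms",
    "Task 19 finished in 305 ms",
]

# Dictionary-encoding the raw messages: every message is distinct, so nothing is deduplicated.
raw_dictionary = set(messages)

# Dictionary-encoding the templates: all three messages share a single entry.
parsed = [split_message(m) for m in messages]
template_dictionary = {template for template, _ in parsed}
```

Here the raw-message dictionary needs one entry per message, while the template dictionary needs only one entry in total; the variable values can then go into their own columns, where type-specific encodings apply.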
Overall, I would say Parquet and CLP have similar storage formats (at least at the level of storing log events), which is part of what motivated using it at Uber. That said, we anticipate some potential compression and performance bottlenecks (including for search) with using Parquet since we wouldn't have as much flexibility to choose the precise layout of the data.
from clp.
Thanks @kirkrodrigues, that's exactly what I was looking for!