
y-scope / clp

Compressed Log Processor (CLP) is a free log management tool capable of compressing text logs and searching the compressed logs without decompression.

Home Page: https://yscope.com

License: Apache License 2.0

CMake 3.24% C++ 84.50% Dockerfile 0.11% Shell 0.93% Python 7.08% ANTLR 0.05% CSS 0.01% HTML 0.02% JavaScript 3.91% SCSS 0.15%
logging compression analytics search log-parser log-management

clp's Introduction

CLP

Open bug reports Open feature requests CLP on Zulip

YScope's Compressed Log Processor (CLP) compresses your logs, and allows you to search the compressed logs without decompression. CLP supports both JSON logs and unstructured (i.e., free text) logs. It also supports real-time log compression within several logging libraries. CLP also includes purpose-built web interfaces for searching and viewing the compressed logs. To learn more about it, you can read our paper.

Benchmarks

CLP Benchmark on JSON Logs CLP Benchmark on Unstructured Logs

The figures above show CLP's compression and search performance compared to other tools. We separate the experiments between JSON and unstructured logs because (1) some tools can only handle one type of log, and (2) tools that can handle both types (such as CLP) often have a different design for each type.

Compression ratio is measured as the average across a variety of log datasets. Some of these datasets can be found here. Search performance is measured using queries on the MongoDB logs (for JSON) and the Hadoop logs (for unstructured logs). Note that CLP uses an index-less design, so for a fair comparison, we disabled MongoDB and PostgreSQL's indexes; if we left them enabled, their compression ratios would be worse. We didn't disable indexing for Elasticsearch or Splunk since these tools are fundamentally index-based (i.e., logs cannot be searched without indexes). More details about our experimental methodology can be found in the CLP paper.

System Overview

CLP systems overview

CLP provides an end-to-end log management pipeline consisting of compression, search, analytics, and viewing. The figure above shows the CLP ecosystem architecture, which consists of the following components:

  • Compression and Search: CLP compresses logs into archives, which can be searched and analyzed in a web UI. The input can either be raw logs or CLP's compressed IR (intermediate representation) produced by CLP's logging libraries.

  • Real-time Compression with CLP Logging Libraries: CLP provides logging libraries for Python and Java (Log4j and Logback). The logging libraries compress logs in real time, so only compressed logs are written to disk or transmitted over the network. The compressed logs use CLP's intermediate representation (IR) format, which achieves a higher compression ratio than general-purpose compressors like Zstandard. Compressing IR into archives can further double the compression ratio and enable global search, but this requires more memory since enough logs must be buffered first. More details on IR versus archives can be found in this Uber Engineering Blog. (A minimal usage sketch of the Python logging library appears after this list.)

  • Log Viewer: the compressed IR can be viewed in a web-based log viewer. Compared to viewing the logs in an editor, CLP's log viewer supports advanced features like filtering logs by log level (e.g., only displaying logs with a log level equal to or higher than ERROR). These features are possible because CLP's logging libraries parse the logs before compressing them into IR.

  • IR Analytics Libraries: we also provide a Python library and a Go library that can analyze compressed IR.

  • Log Parser: CLP also includes a custom pushdown-automata-based log parser that is 3x faster than state-of-the-art regular expression engines like RE2. The log parser is available as a library that can be used by other applications.
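
As a quick illustration of the logging-library workflow above, here is a minimal, hedged sketch using CLP's Python logging library (clp-loglib-py). The module and handler names (clp_logging.handlers.CLPFileHandler) are assumptions based on that project; check its README for the exact API.

import logging
from pathlib import Path

from clp_logging.handlers import CLPFileHandler  # assumed name; pip install clp-logging

# Logs are streamed straight into a compressed CLP IR file on disk,
# so no plain-text copy of the log is ever written.
clp_handler = CLPFileHandler(Path("example.clp.zst"))

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(clp_handler)

logger.info("job 1234 finished in 42 ms")  # written as compressed IR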

Getting Started

You can download a release package which includes support for distributed compression and search. Or, to quickly try CLP's core compression and search, you can use a prebuilt container.

We also have guides for building the package and CLP core from source.

For some logs you can use to test CLP, check out our open-source datasets.

Docs

You can find our docs online or view the source in docs/src.

Providing Feedback

You can use GitHub issues to report a bug or request a feature.

Join us on Zulip to chat with developers and other community members.

Next Steps

This is our open-source release which we will be constantly updating with bug fixes, features, etc. If you would like a feature or want to report a bug, please file an issue and we'll be happy to engage.

clp's People

Contributors

abvarun226, all-less, davemarco, davidlion, diy1, gibber9809, haiqi96, henry8192, jackluo923, junhaoliao, kirkrodrigues, linzhihao-723, oliversm95, sharafmohamed, thepegasos, wraymo

clp's Issues

Unable to set up CLP on Debian

Hi Team,
I am trying to follow the steps mentioned in package-template, but when I try to start CLP, it fails with the error below. The system has enough memory. Can you please update the readme.md to list all the required software?

Traceback (most recent call last):
  File "/root/clp/tools/packager/out/clp-package-ubuntu-focal-x86_64-v0.0.1/etc/../sbin/start-clp", line 462, in main
    start_queue(instance_id, clp_config)
  File "/root/clp/tools/packager/out/clp-package-ubuntu-focal-x86_64-v0.0.1/etc/../sbin/start-clp", line 271, in start_queue
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
  File "/opt/conda/default/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['docker', 'exec', '-it', 'clp-queue-209e', 'rabbitmqctl', 'wait', '/tmp/rabbitmq.pid']' returned non-zero exit status 137.

Search job logs and compression job logs overwrite each other

Bug

Logs from compression jobs and search jobs are stored in the same directory and use the same file name format, so they overwrite each other.

CLP version

f17be1d

Environment

  • Ubuntu 20.04
  • Docker 20.10.8, build 3967b7d

Reproduction steps

  • Clear all data and logs from your package
  • Run a compression job
  • Write some text in job-1-task-1.log if it is empty.
  • Run a search job and you'll see job-1-task-1.log was overwritten with the search job's logs.

Questions about intermediate results

Request

Great work! I have compressed and decompressed lots of logs successfully, but I have some problems understanding the code in depth.

  1. After a file is compressed, how do I obtain the intermediate results generated during compression, such as the extracted templates?

  2. I read logtype.dict and var.dict myself, but I found that they were all numbers, and I didn't know what they meant. In my understanding, logtypes and templates should be in string form, but I can't get these string templates at the moment.

  3. I'm very interested in figuring out what all these staging files (logtype.dict, logtype.segindex, metadata, metadata.db, var.dict, var.segindex) mean, and what the files 0, 1, 2, 3, and so on in the s folder are.

Possible implementation

Can you provide some scripts to read the aforementioned files?
Also, some scripts to generate intermediate results.

Working directory of some Docker images prevents container startup when run as non-root user

Bug

Some of the Docker images (e.g., clp/clp-execution-x86-ubuntu-focal) have their working directory set to /root which will prevent starting up the container as a non-root user. This only occurs with some versions of Docker (e.g., 19.03.2).

The error message is:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "chdir to cwd (\"/root\") set in config.json failed: permission denied": unknown.

CLP version

5f80e1e

Environment

  • Docker 19.03.2
  • Ubuntu 18.04

Reproduction steps

docker run -it --rm -u$(id -u):$(id -g) ghcr.io/y-scope/clp/clp-execution-x86-ubuntu-focal:main /bin/bash

CentOS 7.4 build failed

All dependencies have been compiled and installed successfully, but an error is reported when make is executed. The error message is as follows:
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_read_support_filter_lz4.c.o): in function `lz4_filter_read_legacy_stream':
archive_read_support_filter_lz4.c:(.text+0x1f9): undefined reference to `LZ4_decompress_safe'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_read_support_filter_lz4.c.o): in function `lz4_filter_read_default_stream':
archive_read_support_filter_lz4.c:(.text+0x575): undefined reference to `LZ4_decompress_safe_usingDict'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: archive_read_support_filter_lz4.c:(.text+0x79e): undefined reference to `LZ4_decompress_safe'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_cryptor.c.o): in function `aes_ctr_release':
archive_cryptor.c:(.text+0x18): undefined reference to `EVP_CIPHER_CTX_free'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_cryptor.c.o): in function `aes_ctr_init':
archive_cryptor.c:(.text+0x4e): undefined reference to `EVP_CIPHER_CTX_new'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: archive_cryptor.c:(.text+0x89): undefined reference to `EVP_aes_192_ecb'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: archive_cryptor.c:(.text+0xb7): undefined reference to `EVP_CIPHER_CTX_init'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: archive_cryptor.c:(.text+0xc9): undefined reference to `EVP_aes_128_ecb'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: archive_cryptor.c:(.text+0xd9): undefined reference to `EVP_aes_256_ecb'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_cryptor.c.o): in function `aes_ctr_update':
archive_cryptor.c:(.text+0x2f9): undefined reference to `EVP_EncryptInit_ex'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: archive_cryptor.c:(.text+0x318): undefined reference to `EVP_EncryptUpdate'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_hmac.c.o): in function `__hmac_sha1_cleanup':
archive_hmac.c:(.text+0x10): undefined reference to `HMAC_CTX_cleanup'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_hmac.c.o): in function `__hmac_sha1_final':
archive_hmac.c:(.text+0x48): undefined reference to `HMAC_Final'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_hmac.c.o): in function `__hmac_sha1_init':
archive_hmac.c:(.text+0x95): undefined reference to `EVP_sha1'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: archive_hmac.c:(.text+0xa9): undefined reference to `HMAC_Init_ex'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_hmac.c.o): in function `__hmac_sha1_update':
archive_hmac.c:(.text+0x64): undefined reference to `HMAC_Update'
collect2: error: ld returned 1 exit status
make[2]: *** [clp] Error 1
make[1]: *** [CMakeFiles/clp.dir/all] Error 2
make: *** [all] Error 2

Archive uncompressed size is incorrect (higher) in the global metadata database

Symptom

The archive's uncompressed size is higher in the global metadata database than the amount of data that was compressed.

Reproduction steps

  • clp c archives <log file>
  • Open the global metadata database
  • The archive's uncompressed size is higher than the size of the uncompressed file

Component

core

Version

0.0.0

Does CLP have a command-line tool? How do I use it?

Request

Hello,
Recently, I saw that Uber uses the CLP tool for log data compression, and the compression ratio is quite impressive. Our company also wants to try it, but upon checking GitHub and even the official paper, I haven't found any specific instructions on how to use the CLP tool.
Is there a Linux version of the tool? Or a demo in Java code? How do I use it?

Possible implementation

I am looking for a case or document that can be used for compression testing.

Adding a library to support reading from/writing to memory addresses

Request

Currently, CLP/CLG only supports writing to/reading from files under a specific directory structure in a POSIX file system. However, some filesystem-less applications may wish to store/read data to/from a memory address. A FUSE layer could achieve this, but it may cause significant overhead. It would be great to be able to use CLP/CLG as a library providing memory interfaces.

Possible implementation

Provide a C++ class that includes pointers to memory buffers which would otherwise be stored in the archive directory in the current implementation. This class would also allow users to call compress, decompress, and search directly instead of going through the command line.

Compress command from Package-template gets stuck (process runs indefinitely)

Bug

  • After installing everything and running the compress command via package-template/src/sbin, it gives the following output and hangs indefinitely.
  • The ../var directory does have source files, BUT it never produces any archived files or directories
  • Feel free to let me know if any other information is required
user@ubuntu1:~/code/CLP/clp/components/package-template/src/sbin$ ./compress /home/indra/TestLogs
2022-11-30 05:15:39,689 [INFO] [/opt/clp/sbin/native/compress] Compression job submitted to compression-job-handler.
2022-11-30 05:15:39,689 [INFO] [compression-job-handler] compression-job-handler started.
2022-11-30 05:15:39,695 [INFO] [job-1] Iterating and partitioning files into tasks.
2022-11-30 05:15:39,702 [INFO] [job-1] Waiting for 1 task(s) to finish.

Nothing happens after this. Here is the listing of the /TestLogs directory.

user@ubuntu1:~/code/CLP/clp/components/package-template/src/sbin$ ls -l ~/TestLogs
total 12
-rw-rw-r-- 1 indra indra 4493 Nov 30 03:57 ibm_dummy_logs.log
-rw-rw-r-- 1 indra indra   70 Nov 30 02:01 my_log_file.log

CLP version

0.0.1

Environment

Ubuntu 20.04 LTS:

user@ubuntu1:~/code/CLP/clp/components/package-template/src/sbin$ cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
user@ubuntu1:~ /code/CLP/clp/components/package-template/src/sbin$ docker -v
Docker version 20.10.12, build 20.10.12-0ubuntu2~20.04.1

Reproduction steps

  • Installed core per the directions on the repository – directly on Ubuntu environment. Did not go with docker image route.
  • Installed packages clp_py_utils, compression_job_handler, and job_orchestration in /components/package-template/src/lib/python3/site-packages/
  • Request: I want to build the "package" from scratch rather than playing with package-template and see if that works
  • From the docs, I am unsure how to do that. Any pointers on doing that would help to troubleshoot this.

Schema unit tests fail if run outside build directory.

Bug

If the unit tests are run outside the build directory, e.g.:
./build/unitTest instead of cd build && ./unitTest
you will get 11 failures related to missing schema files.

This is caused by relative paths being used in the schema unit tests.

An error example:

/home/lion/yscope/clp-ffi-go/cpp/clp/components/core/tests/test-Grep.cpp:21: FAILED:
due to unexpected exception with message:
ReaderInterface operation failed

[2023-05-20 23:25:59.016] [info] [test-ParserWithUserSchema.cpp:72] File not found: /home/lion/yscope/clp-ffi-go/cpp/clp/components/tests/test_schema_files/missing_schema.txt

CLP version

460d377

Environment

ghcr.io/y-scope/clp/clp-core-dependencies-x86-ubuntu-focal:main

Reproduction steps

  1. build the latest CLP
  2. run unitTest outside of your build directory (i.e. run unitTest from any directory other than the file's parent)

mariadb connector 3.2.3 no longer available

Bug

Running

./tools/scripts/lib_install/mariadb-connector-c.sh 3.2.3

Returns

dpkg-query: no packages found matching libmariadb-dev
Checking for elevated privileges...
curl: (22) The requested URL returned error: 404

Running with 3.3.3 works for now.

CLP version

54497a0

Environment

Ubuntu 22.04.1 LTS

Docker v4.14.1

uname -a: Linux 485fbd483ae2 5.15.49-linuxkit #1 SMP PREEMPT Tue Sep 13 07:51:32 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

Reproduction steps

Run

./tools/scripts/lib_install/mariadb-connector-c.sh 3.2.3

Zstd-compressed files cannot be compressed using clp-package

Bug

The current CLP uses libarchive to decompress zstd-compressed input. However, libarchive internally relies on the zstd CLI, which is currently not installed on the clp-execution image. As a result, files compressed with zstd in the package flow will trigger the error "Cannot compress {} - not UTF-8 encoded."

CLP version

4fc3b61

Environment

Ubuntu 18.04.1 LTS

Reproduction steps

python3 build-clp-package.py --config ../../config/build-clp-package.yaml
./sbin/start-clp
./sbin/compress file.zst

Out-of-order logs for large split files with timestamped and un-timestamped sections

Bug

Timestamped and un-timestamped segments are built in parallel, and clp relies on their sequential order for decompression. As a result, large split files with timestamped and un-timestamped sections may have their logs out of order relative to the original after being compressed and decompressed.

CLP version

776fc3a

Environment

Ubuntu 22.04

Reproduction steps

  1. Copy HDFS_1 from Loghub-dataset (log file without timestamps)
  2. Append logs with timestamps to the end of HDFS.log
    • For example take Hadoop logs from Loghub-dataset
    • cat Hadoop/application_1445062781478_0011/container_1445062781478_0011_01_000001.log >> HDFS.log
  3. Copy HDFS_1 and Hadoop directories into a new directory
    • Copy HDFS_1 first, then Hadoop, so Hadoop gets compressed first
    • Purpose of Hadoop directory is to create a segment with timestamps first in sequential order
  4. Compress/decompress directory
  5. Beginning of compressed/decompressed HDFS.log is out of order
    • head HDFS.log, should return 081111... instead of 081109...

Note: This will only reproduce the bug after PR #231 (which fixes issue #167) is merged. Before that, issue #167 will produce similar output.

clp-s: Nested timestamps result in incorrect mst node names

Bug

Ingesting records with nested timestamps using the --timestamp-key argument turns the key name in the last level of the MST into the fully qualified name, e.g., records like {"a": {"ts": ...}} will look like {"a": {"a.ts": ...}} when decompressed.

CLP version

e80cee0

Environment

Ubuntu focal clp image

Reproduction steps

  1. Ingest a dataset with nested timestamps while using the --timestamp-key argument
  2. Observe incorrect data upon decompression

CLP triggers segmentation fault when compressing tarred openstack log

Bug

When compressing the openstack.tar.gz logs (https://zenodo.org/records/7094972), CLP fails with a segmentation fault. Note that CLP can compress uncompressed openstack logs without any issue.

I tried to create a tar.gz from the plain log that CLP can compress, and the new tar.gz still triggers the segmentation fault.

Verified that commit 4f9ed0a doesn't have this issue.

CLP version

5d6ff54

Environment

Ubuntu 22.04

Reproduction steps

clp c output openstack-24hr.tar.gz

Large files split across archives not always decompressed correctly

Bug

A large file that is split across archives is sometimes decompressed with the splits recombined in the wrong order.

CLP version

0.0.1

Reproduction steps

  • Start the CLP package
  • Compress one or more large files
  • Decompress
  • Compare the original files with the decompressed files

This may not always reproduce the bug since it depends on the order in which archives are stored in the database.

Schema matching fails on query with negated varstring and clpstring filters on same key

Bug

For the following data

{"a": "clp string"}
{"a": "string clp"}

The query NOT a: "clp string" will return the second record, but the query NOT (a: "clp string" OR a: "b") will fail schema matching and return no records.

The AST for the failing case is AndExpr(!FilterExpr(EQ, ColumnDescriptor<clpstring,array>("a"), "clp string"), !FilterExpr(EQ, ColumnDescriptor<varstring,array>("a"), "b"))

The problem is that since the dataset contains no varstring type column the second filter fails column matching, gets replaced with EmptyExpr, and the entire AndExpr gets constant propagated away. This bug is a bit annoying, because while the varstring column "a" doesn't exist the JSON string column "a" does exist.

The expected behaviour is to treat the filter as False on the clpstring column "a" when it does exist (and the inverted filter as true).

Actually, for every string that matches either strictly varstring or clpstring there should be an implicit negated condition on the existence of the other type.

For example a<clpstring> NEQ "a b" -> a<clpstring> NEQ "a b" OR EXISTS a<varstring>
and similarly a<varstring> NEQ "b" -> a<varstring> NEQ "b" OR EXISTS a<clpstring>.

This edge case only applies for negated conditions -- something that does not match a particular clpstring will not match every varstring, but something that does match a particular clpstring will never match a varstring.

The easiest way to handle this edge case is probably either by augmenting the AST, during ConvertToExists or in another pass, or by very careful treatment during Schema Matching.
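
To make the proposed rewrite concrete, here is a small Python sketch with toy node classes (these are not clp-s's actual AST types) showing the suggested augmentation of a negated, strictly-typed string filter with an existence check on the other string type:

from dataclasses import dataclass

@dataclass
class Filter:
    op: str        # e.g., "NEQ"
    column: str    # e.g., "a"
    col_type: str  # "clpstring" or "varstring"
    value: str

@dataclass
class Exists:
    column: str
    col_type: str

@dataclass
class Or:
    left: object
    right: object

OTHER_TYPE = {"clpstring": "varstring", "varstring": "clpstring"}

def augment_negated_filter(f: Filter):
    # a<clpstring> NEQ "a b"  ->  a<clpstring> NEQ "a b" OR EXISTS a<varstring>
    if f.op == "NEQ":
        return Or(f, Exists(f.column, OTHER_TYPE[f.col_type]))
    return f

print(augment_negated_filter(Filter("NEQ", "a", "clpstring", "a b")))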

CLP version

9e6b755

Environment

Ubuntu focal docker image

Reproduction steps

  1. Ingest example data above
  2. Run example query

sbin/search.sh doesn't always print all search results

Bug

When searching using the package's sbin/search.sh script, sometimes, not all results are printed (but they exist in the results cache). The scheduler/worker logs don't show any errors.

It's possible that the search tasks / job finishes before the results cache has made the search results available for reading.

CLP version

d524ead

Environment

Ubuntu 22.04

Reproduction steps

  • sbin/start.sh
  • sbin/compress.sh
  • sbin/search.sh "DESERIALIZE_ERRORS" > results.txt

Count the lines of output. It should be 6945. Running the query a few times should result in one instance where not all 6945 results are written to the file.

Android OS library

Request

I want to try using CLP on Android, for compressing logs on Android.

Possible implementation

Provide a library (e.g., a JAR) or a program for 64-bit Android OS. Thanks.

clp-s: Timestamp Dictionary doesn't work correctly for nested timestamps

Bug

Nested timestamp keys aren't stored correctly in the timestamp dictionary, which makes them unusable at search time.

For records like the following ingested with --timestamp-key a.ts

{"a" : {"ts": 1234}}

The time range of a.ts is correctly recorded in the timestamp dictionary, but the key name is recorded as ts. This means that at search time filtering on a.ts will not successfully take advantage of the timestamp dictionary. (On the other hand filtering on ts will use the timestamp dictionary, but will fail on schema matching afterwards).

CLP version

4f74dd2

Environment

Ubuntu focal clp image

Reproduction steps

  1. Ingest any dataset with a nested timestamp key
  2. Observe incorrect timestamp dictionary

Decompression fails on one specific large file with long lines

Bug

Using the heuristic to compress and decompress Loghub-datasets/Loghub-datasets/HDFS_1/HDFS.log results in a decompressed file that does not match the original.

CLP version

c2e5823

Environment

Ubuntu 20.04

Reproduction steps

  • Compress using the heuristic
  • Decompress the resulting archive
  • Diff the decompressed file with the original file

Search taking lot of time using clg binary

Bug

We are using CLP for compressing logs generated by our Kubernetes cluster which are in JSON format. A sample log is given below:

{
"log_time": "2023-08-29T13:55:09.477456Z",
"stream": "stdout",
"time": "2023-08-29T13:55:09.477456564Z",
"@timestamp": "2023-08-29T19:25:09.477+05:30",
"@Version": "1",
"message": " Method: POST;Root=1-64edf8bd-5c762a676349ee71616bb687 , Request Body : {"orders":[{"order_type":"normal","external_reference_id":"69426","items":[{"offset_in_minutes":"721","quantity":"1","external_product_id":"225090"}]}]}",
"level": "INFO",
"level_value": 20000,
"request_id": "6f3f3651-a22b-42a0-b5fe-412d2167c5ca",
"kubernetes_docker_id": "caa9102a169a1495e5790cb2c17cb21d0a279ffc50d802d413938870ba59c7c0",
}
When we use the clg binary to search through the generated archive with various search queries, it takes a lot of time to process each query (around 25-30s on average). We only search for the request_id and the namespace name, as mentioned below. It is not feasible for us if the search takes this long per archive; ideally, we expect a search time of 4-5s per archive.
For example, to search for the request_id in the above log, we generally use the following queries:

  1. info6f3f3651-a22b-42a0-b5fe-412d2167c5ca*
  2. 6f3f3651-a22b-42a0-b5fe-412d2167c5ca

Our archives are sized 40MB on average, and the sizes of the internal files and folders of an archive are as follows:

  • var.dict: 18.2MB
  • /s: 16.8MB
  • var.segindex: 2.6MB
  • metadata.db: 1.9MB
  • logtype.dict: 736KB
  • logtype.segindex: 36KB
  • metadata: 4KB

Is this the correct way to write search queries (correct in the sense that it will use the logtype and other dictionaries to search through the archives efficiently)? As mentioned earlier, searching through a single archive already takes more than 30s, which is not feasible for us. We expect the search time to be around 4-5s per archive, not more. Please guide me if I am doing anything wrong, such as writing inefficient search queries.

CLP version

3a20c0d

Environment

UBUNTU 20.04
EC2 instance type: m5.8xlarge

Reproduction steps

NA

Escaped and wildcard "?" not properly handled during search

Bug

Currently CLG replaces ? with * in the search query because we don't handle the ? wildcard and instead fall back to decompress-and-match.
However, this doesn't consider the case where the ? is escaped.

Note: the queries discussed below are the queries as seen by CLP. If you enter a query on the command line via bash, you may need to add extra escapes for bash to interpret the query properly.

For example, consider the query INFO \? TEXT, which should match a literal ? in the log. The current CLG will replace the ? with * and end up searching for the wrong results.

While I tried a quick fix (not submitted) for the issue above, there were other bugs that I noticed.

For example, consider the query INFO \\? TEXT. The query should match a literal \ in the log, after which the ? serves as a wildcard matching any character. However, CLG does not return the correct results.
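
For illustration, here is a small Python sketch (not CLP's actual implementation) of an escape-aware substitution that replaces only unescaped ? wildcards with * and leaves escape sequences such as \? and \\ untouched, which is the behaviour the first example above expects:

def replace_unescaped_question_marks(query: str) -> str:
    out = []
    i = 0
    while i < len(query):
        ch = query[i]
        if ch == "\\" and i + 1 < len(query):
            out.append(query[i:i + 2])  # keep the escape sequence ('\?', '\\', ...) as-is
            i += 2
        elif ch == "?":
            out.append("*")  # unescaped '?' falls back to the '*' wildcard
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

# 'INFO \? TEXT' keeps its escaped '?', while an unescaped '?' becomes '*'.
assert replace_unescaped_question_marks(r"INFO \? TEXT") == r"INFO \? TEXT"
assert replace_unescaped_question_marks("INFO ? TEXT") == "INFO * TEXT"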

CLP version

5d6ff54

Environment

22.04 Ubuntu

Reproduction steps

Attaching log and queries that can be used for testing and reproducing the issue.

Currently running the query in the CLP won't generate the correctly matched results.

Note: the queries attached are what CLP should see. If you enter a query on the command line via bash, you may need to add extra escapes for bash to interpret the query properly.

query.txt
log.txt

Floating-point numbers are not correctly encoded when using IR four-byte encoding

Bug

When using CLP IR four-byte encoding methods to encode a message that contains a floating point number, the decoded message differs from the original message:
Message before encoding (raw text): fps=60.000004
Message after decoding (from the encoded IR): fps=.0
The error occurs with multiple different floating-point numbers; the decoded results are all .0.

CLP version

cb6058c

Environment

Ubuntu 18.04
macOS 12.5

Reproduction steps

  1. Encodes the following log message using CLP IR four-byte encoding method:
    fps=60.000004
  2. Decodes the encoded message back to the raw text using CLP IR four-byte decoding methods.

Fail to compile CLP

I have run into another compilation problem when executing the "make" command. It reports a missing "xmlReaderForIO" in libarchive.a, and I wonder why. I have "libxml2.a" installed in my lib directory.

Compression fails on MacOS and WSL 1

Bug

On MacOS, when running CLP through docker, this error is logged after attempting to compress a log file:

2021-12-28 10:42:48,593 [critical] Failed to write archive file metadata collection in file: /root/clp/var/data/clp-mini-cluster/archives/26d4c5e4-0d50-43df-9df7-1d210ba13a4b/metadata
2021-12-28 10:42:48,595 [error] Compression failed: src/FileWriter.cpp:116 FileWriter operation failed, errno=2

On WSL 1, when compressing with the core component natively, this error is printed:

2021-12-26 20:47:23,795 [critical] Failed to write archive file metadata collection in file: archives/5b6a7367-c69a-49f7-8689-9778626d25fb/metadata
2021-12-26 20:47:23,795 [error] Compression failed: src/FileWriter.cpp:116 FileWriter operation failed, errno=1

CLP version

0.0.1

Environment

  • MacOS version: OS X v10.15.7 build 19H15

  • Windows version: 10.0.17763.1935

  • WSL version: 1

  • Linux distro under WSL: Ubuntu 18.04

Reproduction steps

For MacOS:

  • sbin/start-clp --uncompressed-logs-dir <dir>
  • sbin/compress <file path>

For WSL 1:

  • ./clp c archives <file path>

Compression error on var-logs

Bug

The CLP compressor runs into an error when compressing our internal var-log. Some example error messages are attached below:

2024-01-24 00:57:45,022 [error] Cannot compress hostd-10 - not an IR stream or UTF-8 encoded
2024-01-24 00:57:45,125 [error] Cannot compress hostd-12 - not an IR stream or UTF-8 encoded

Note that 4f9ed0a doesn't have this issue, but we don't know from which commit the issue started.

CLP version

5d6ff54

Environment

ubuntu 22.04

Reproduction steps

./clp c varlog_compressed ///var-logs

ffi: The IR encoder doesn't escape escapes ('\'), causing existing decoders to gobble up backslashes

Bug

The existing IR encoding in the ffi namespace only escapes variable placeholders, but not the escape characters themselves. Although this could be handled in the decoder, our existing decoders already expect escape characters to be escaped.

Thanks to @haiqi96 for spotting the bug!
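
As a conceptual sketch only (this is not the ffi implementation), the following Python snippet shows why escaping the escape character along with the placeholders makes the round trip lossless; the placeholder value used here is purely hypothetical:

ESCAPE = "\\"

def escape_for_ir(text, placeholders):
    out = []
    for ch in text:
        if ch == ESCAPE or ch in placeholders:
            out.append(ESCAPE)  # escape the escape character itself, too
        out.append(ch)
    return "".join(out)

def unescape_from_ir(text):
    out, i = [], 0
    while i < len(text):
        if text[i] == ESCAPE and i + 1 < len(text):
            out.append(text[i + 1])  # drop the escape, keep the escaped character
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

# A literal backslash survives the round trip only because it was escaped.
msg = "path C:\\temp and placeholder \x11 here"
assert unescape_from_ir(escape_for_ir(msg, {"\x11"})) == msg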

CLP version

3a20c0d

Environment

N/A

Reproduction steps

  • Log some text with backslashes using the ClpIrFileAppender from here.
  • Open the output file in the log viewer
  • You should see that the backslashes were gobbled up

Schema matching fails on negated precise array search matching wildcard

Bug

On a log file consisting of the following example data
{"a": [{"b": "c"}]}
Search will work correctly for queries like a.b: c and a.b: *, but will fail on schema matching for the query NOT a.b: *.

The expected behaviour is for schema matching to succeed on the array column "a", followed by the record not being matched because an object with the key "b" exists within the array "a". This is likely caused by an interaction between the Convert To Exists pass and Schema Matching.

CLP version

284a558

Environment

Ubuntu focal clp image.

Reproduction steps

  1. Compress the example data {"a": [{"b": "c"}]}
  2. Perform the search NOT a.b: *
  3. Observe schema matching failure

clg reports SQLitePreparedStatement operation failure after printing one log message

Bug

The current version of clg can produce the error "SQLitePreparedStatement: Failed to bind int64 to statement - column index out of range". The cause is an implicit conversion from segment_id to bool when calling clp::streaming_archive::reader::Archive::get_file_iterator in the false == is_superseding_query case.

The fix is pretty simple. We just need to add false as the fifth parameter of get_file_iterator so that it can match the correct method.

CLP version

59b2c22

Environment

MacOS 13.2.1

Reproduction steps

  • Compress components/core/README.md using clp
  • Run a search on the compressed archive using clg

Message not properly filtered by timestamp in CLG

Bug

In CLG, messages are not properly filtered by timestamp when the timestamp range becomes small.

For example, on hadoop-24hrs-log, the following search command (1) returns several messages, but neither (2) nor (3) returns any matched search results, even though (2) and (3) together cover exactly the same time range as (1), so they should also return matches.

  1. clg --tge 1427099450000 --tle 1427099490000 hadoop_output/ "*"
  2. clg --tge 1427099460000 --tle 1427099490000 hadoop_output/ "*"
  3. clg --tge 1427099450000 --tle 1427099460000 hadoop_output/ "*"

Suspect that this is caused by some issues in the database queries.

CLP version

5d135cf

Environment

baker

Reproduction steps

  • compress hadoop-24hrs-log to hadoop-out
  1. clg --tge 1427099450000 --tle 1427099490000 hadoop_output/ "*"
  2. clg --tge 1427099460000 --tle 1427099490000 hadoop_output/ "*"
  3. clg --tge 1427099450000 --tle 1427099460000 hadoop_output/ "*"

The 2nd and 3rd commands will not return any output, but you should see that there are matching messages from the first command.

Difference between CLP and Parquet?

(Sorry if this isn't the right place for this question or if the question doesn't really make sense.) I read about CLP in the Uber blog post and I'm curious how CLP differs from storing logs in Parquet files. Is there a write up of comparisons anywhere?

Thank you!

[ERROR] [clp] Unable to connect to the database with the provided credentials

When I run ./sbin/start-clp --uncompressed-logs-dir <directory containing your uncompressed logs>, it reports an error like the one in the title. The detailed execution process is as follows:

$ ./sbin/start-clp --uncompressed-logs-dir ./myTestData/input/
2021-10-24 21:25:28,315 [INFO] [clp] Using default config file at etc/clp-config.yaml
2021-10-24 21:25:28,320 [INFO] [clp] Provision docker network bridge
2021-10-24 21:25:28,634 [INFO] [clp] Starting CLP scheduler
2021-10-24 21:25:28,634 [INFO] [clp] Starting scheduler mariadb database
2021-10-24 21:25:34,037 [INFO] [clp] Starting scheduler queue
2021-10-24 21:25:39,313 [INFO] [clp] Initializing scheduler queue
2021-10-24 21:25:40,831 [INFO] [clp] Initializing scheduler database tables
2021-10-24 21:26:05,825 [ERROR] [clp] Unable to connect to the database with the provided credentials
2021-10-24 21:26:05,826 [ERROR] [clp] 
2021-10-24 21:26:05,826 [ERROR] [clp] Failed to provision "clp-mini-cluster"

CLP Setup issue

I'm able to execute the start-clp and compress scripts, and I have imported the sample Hive dataset. When I try to run the following search query, sbin % ./search "Task *", I get the error below:

My understanding: as per https://github.com/y-scope/clp/releases, clp-package-ubuntu-focal-x86_64-v0.0.2.tar.gz is built for ubuntu-focal. Do I need an EC2 instance with ubuntu-focal only? Can you guide me to resolve the error below?

WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
Traceback (most recent call last):
File "/opt/clp/sbin/native/search", line 261, in
sys.exit(main(sys.argv))
File "/opt/clp/sbin/native/search", line 248, in main
for ip in set(socket.gethostbyname_ex(socket.gethostname())[2]):
socket.gaierror: [Errno -2] Name or service not known
Traceback (most recent call last):
File "./search", line 135, in
sys.exit(main(sys.argv))
File "./search", line 126, in main
subprocess.run(cmd, check=True)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['docker', 'run', '-i', '--rm', '--network', 'host', '-w', '/opt/clp', '-u', '504:20', '--name', 'clp-search-61c9', '--mount', 'type=bind,src=/Users/rerajesh/Downloads/clp-package-ubuntu-focal-x86_64-v0.0.2,dst=/opt/clp', '--mount', 'type=bind,src=/Users/rerajesh/Downloads/clp-package-ubuntu-focal-x86_64-v0.0.2/var/log/bcd07464dac2,dst=/opt/clp/var/log', '--mount', 'type=bind,src=/Users/rerajesh/Downloads/clp-package-ubuntu-focal-x86_64-v0.0.2/var/data/archives,dst=/mnt/archive-output', 'ghcr.io/y-scope/clp/clp-execution-x86-ubuntu-focal:main', '/opt/clp/sbin/native/search', '--config', '/opt/clp/var/log/.clp-search-61c9-config.yml', 'Task * deprecation']' returned non-zero exit status 1.

Compression: unable to mount uncompressed logs directory in the container

Hello,
I am unable to compress the log file as the path for uncompressed logs is not mounted in the container. Please see the logs below for further inspection.
CLP starts successfully. uncomp_logs is the <uncompressed_logs-dir>

$ ./clp-package-ubuntu-focal-x86_64-v0.0.1/sbin/start-clp --uncompressed-logs-dir uncomp_logs/
2021-11-02 11:25:29,892 [INFO] [clp] Using default config file at clp-package-ubuntu-focal-x86_64-v0.0.1/etc/clp-config.yaml
2021-11-02 11:25:29,898 [INFO] [clp] Provision docker network bridge
2021-11-02 11:25:30,102 [INFO] [clp] Starting CLP scheduler
2021-11-02 11:25:30,103 [INFO] [clp] Starting scheduler mariadb database
2021-11-02 11:25:33,941 [INFO] [clp] Starting scheduler queue
2021-11-02 11:25:39,687 [INFO] [clp] Initializing scheduler queue
2021-11-02 11:25:41,374 [INFO] [clp] Initializing scheduler database tables
2021-11-02 15:25:41,608 [INFO] Successfully created clp metadata tables for compression and search
2021-11-02 15:25:41,835 [INFO] Successfully created compression_jobs and compression_tasks orchestration tables
2021-11-02 11:25:41,851 [INFO] [clp] Starting scheduler service
2021-11-02 11:25:41,960 [INFO] [clp] Starting CLP worker

Upon compressing, it gives the following error:

$ ./clp-package-ubuntu-focal-x86_64-v0.0.1/sbin/compress uncomp_logs/auth.log 
2021-11-02 11:26:23,597 [INFO] [clp] Using default config file at clp-package-ubuntu-focal-x86_64-v0.0.1/etc/clp-config.yaml
2021-11-02 15:26:23,842 [INFO] [compress] Compression job submitted to compression-job-handler.
2021-11-02 15:26:23,842 [INFO] [compression-job-handler] compression-job-handler started.
2021-11-02 15:26:23,863 [INFO] [job-8] Iterating and partitioning files into tasks.
2021-11-02 15:26:23,863 [ERROR] [job-8] "/opt/clp/uncomp_logs/auth.log" does not exist.
2021-11-02 15:26:23,871 [INFO] [job-8] Waiting for 0 task(s) to finish.

Indeed, there is no such directory in the container.

root@clp-mini-cluster:/opt/clp# ls   
LICENSE  README.md  bin  etc  lib  requirements-pre-3.7.txt  sbin  var

I checked the permissions of the uncompressed-logs directory and it is user:docker. Is there anything else I need to check?

Files with messages which swap a variable for a variable placeholder aren't compressed correctly

Bug

If a file contains two messages which are identical except one message contains a variable and the other contains its corresponding variable placeholder, the file will not be compressed correctly such that decompressing it either stops prematurely or crashes.

For example, these log messages (where ^Q is an integer-variable placeholder)...

2015-03-23 07:29:48,942 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties ^Q from
2015-03-23 07:29:48,942 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties 123 from
2015-03-23 07:29:48,942 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties 123 from

... will cause the following exception:

2023-09-17 21:36:24,750 [error] FileWriter not closed before being destroyed - may cause data loss
2023-09-17 21:36:24,750 [error] Decompression failed: src/DictionaryReader.hpp:208 DictionaryReader operation failed, error_code=3

If instead we reorder the messages...

2015-03-23 07:29:48,942 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties 123 from
2015-03-23 07:29:48,942 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties ^Q from
2015-03-23 07:29:48,942 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties 123 from

... decompression will stop after the 2nd message, where the 2nd message is also incorrect.

CLP version

5d0f7b3

Environment

Ubuntu 20.04

Reproduction steps

  • Compress the file described in the bug report: ./clp c archives log.txt
  • Decompress the file described in the bug report: ./clp x archives decomp
  • Observe a crash or incorrect decompressed output: diff log.txt decomp/log.txt

Unable to get CLP running on Mac

Bug

I am trying to get CLP up and running by following the instructions provided in https://github.com/y-scope/clp/blob/main/tools/packager/README.md, but I see the error below: ModuleNotFoundError: No module named 'zstandard.backend_c'

CLP version

6d35126

Environment

OS: MacOS Monterey
Python: 3.10.8
Docker version 20.10.17, build 100c701

Reproduction steps

Steps:

  1. Followed steps as per building package
  2. cd out/clp-package-ubuntu-focal-x86_64-v0.0.1
  3. Start clp - /sbin/start-clp --uncompressed-logs-dir <full path>

Source Compile & Test

I have a question here; I don't know if you can answer it. I found that CLP uses Docker to run the program, and it uses the MySQL database, but the core source code is implemented in C++. I want to know how to compile and test the CLP algorithm using only the C++ code. Thank you! @kirkrodrigues @jackluo923

Queries with a variable followed by a period and a wildcard do not return results

Bug

If a variable exists in a log message like so: "... job_1427088391284_0004. ...". The query " *job_1427088391284_0004.* " does not return this message, but it should.

CLP version

0.0.1

Reproduction steps

  • Compress a log message containing a variable followed by a period like "... job_1427088391284_0004. ..."
  • Search for " *job_1427088391284_0004.* "

Empty logtypes incorrectly cause an error to be thrown

Bug

When trying to write an empty logtype to the dictionary, an exception is thrown. Empty logtypes should be allowed.

CLP version

e4b7635

Environment

Ubuntu 18.04

Reproduction steps

Call bool LogTypeDictionaryWriter::add_entry (LogTypeDictionaryEntry& logtype_entry, logtype_dictionary_id_t& logtype_id)
with an empty string for the value in logtype_entry.

Cannot build on RedHat 7

I am trying to build CLP on a Linux server (RedHat 7), but I encountered a CMake (version 3.21.1) build failure. The debug messages say it could not find Boost (missing iostreams, program_options, filesystem, system). But I have libboost-iostreams.a, libboost-program_options.a, libboost-filesystem.a, and libboost-system.a (all version 1.59.0) under /usr/include/lib/, so I wonder why the CMake program could not find them.
I am looking forward to a solution, thanks.

Searching for a variable with a '/' in an archive composed of IR files returns nothing

Bug

If we compress an IR file containing a variable with one or more forward slashes (e.g., "/usr/bin/python3") into an archive and then try to search for that variable using clg (e.g., ./clg archives "/usr/bin/python3"), the search returns nothing.

CLP version

084efa3

Environment

Ubuntu 20.04

Reproduction steps

  • Create an IR file (using the four-byte encoding) containing a log message: 2023-09-16T00:00:00.000 INFO Python's path is /usr/bin/python3?
  • Compress that IR file using clp: ./clp c archives test.clp.zst
  • Search for the path string: ./clg archives "/usr/bin/python3"

Integers less than -2^63 aren't compressed losslessly

Bug

Compressing a log with an integer less than -2^63 (e.g., -2^63 - 1) results in the integer being set to -2^63.

CLP version

583fdde

Environment

  • Ubuntu 20.04

Reproduction steps

  • Create a log with an integer smaller than -2^63, e.g., "This number is very negative: -9223372036854775810"
  • Compress the log
  • Decompress it and you'll see that the message has been decompressed as: "This number is very negative: -9223372036854775808"
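
For context, the clamped value is exactly the minimum of a 64-bit signed integer; a quick check in Python:

# 64-bit signed integers span [-2**63, 2**63 - 1]. -9223372036854775810 lies
# below that range, so an int64-only encoder clamps it to the minimum instead
# of preserving it losslessly.
INT64_MIN = -2**63
print(INT64_MIN)                          # -9223372036854775808
print(-9223372036854775810 < INT64_MIN)   # True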
