
y-scope / clp

Compressed Log Processor (CLP) is a free log management tool capable of compressing text logs and searching the compressed logs without decompression.

Home Page: https://yscope.com

License: Apache License 2.0

CMake 3.24% C++ 84.50% Dockerfile 0.11% Shell 0.93% Python 7.08% ANTLR 0.05% CSS 0.01% HTML 0.02% JavaScript 3.91% SCSS 0.15%
logging compression analytics search log-parser log-management

clp's Introduction

CLP

Open bug reports Open feature requests CLP on Zulip

YScope's Compressed Log Processor (CLP) compresses your logs, and allows you to search the compressed logs without decompression. CLP supports both JSON logs and unstructured (i.e., free text) logs. It also supports real-time log compression within several logging libraries. CLP also includes purpose-built web interfaces for searching and viewing the compressed logs. To learn more about it, you can read our paper.

Benchmarks

CLP Benchmark on JSON Logs CLP Benchmark on Unstructured Logs

The figures above show CLP's compression and search performance compared to other tools. We separate the experiments between JSON and unstructured logs because (1) some tools can only handle one type of log, and (2) tools that can handle both types (such as CLP) often have a different design for each type.

Compression ratio is measured as the average across a variety of log datasets. Some of these datasets can be found here. Search performance is measured using queries on the MongoDB logs (for JSON) and the Hadoop logs (for unstructured logs). Note that CLP uses an index-less design, so for a fair comparison, we disabled MongoDB and PostgreSQL's indexes; if we left them enabled, their compression ratios would be worse. We didn't disable indexing for Elasticsearch or Splunk since these tools are fundamentally index-based (i.e., logs cannot be searched without indexes). More details about our experimental methodology can be found in the CLP paper.

System Overview

CLP systems overview

CLP provides an end-to-end log management pipeline consisting of compression, search, analytics, and viewing. The figure above shows the CLP ecosystem architecture, which consists of the following components:

  • Compression and Search: CLP compresses logs into archives, which can be searched and analyzed in a web UI. The input can either be raw logs or CLP's compressed IR (intermediate representation) produced by CLP's logging libraries.

  • Real-time Compression with CLP Logging Libraries: CLP provides logging libraries for Python and Java (Log4j and Logback). The logging libraries compress logs in real time, so only compressed logs are written to disk or transmitted over the network. The compressed logs use CLP's intermediate representation (IR) format, which achieves a higher compression ratio than general-purpose compressors like Zstandard. Compressing IR into archives can further double the compression ratio and enable global search, but this requires more memory since enough logs must be buffered first. More details on IR versus archives can be found in this Uber Engineering Blog. (A minimal usage sketch of the Python logging library appears after this list.)

  • Log Viewer: the compressed IR can be viewed in a web-based log viewer. Compared to viewing the logs in an editor, CLP's log viewer supports advanced features like filtering logs by log level (e.g., only displaying logs with a log level equal to or higher than ERROR). These features are possible because CLP's logging libraries parse the logs before compressing them into IR.

  • IR Analytics Libraries: we also provide a Python library and a Go library that can analyze compressed IR.

  • Log Parser: CLP also includes a custom pushdown-automata-based log parser that is 3x faster than state-of-the-art regular expression engines like RE2. The log parser is available as a library that can be used by other applications.
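
As a quick illustration of the logging-library workflow above, here is a minimal, hedged sketch using CLP's Python logging library (clp-loglib-py). The module and handler names (clp_logging.handlers.CLPFileHandler) are assumptions based on that project; check its README for the exact API.

import logging
from pathlib import Path

from clp_logging.handlers import CLPFileHandler  # assumed name; pip install clp-logging

# Logs are streamed straight into a compressed CLP IR file on disk,
# so no plain-text copy of the log is ever written.
clp_handler = CLPFileHandler(Path("example.clp.zst"))

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(clp_handler)

logger.info("job 1234 finished in 42 ms")  # written as compressed IR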

Getting Started

You can download a release package which includes support for distributed compression and search. Or, to quickly try CLP's core compression and search, you can use a prebuilt container.

We also have guides for building the package and CLP core from source.

For some logs you can use to test CLP, check out our open-source datasets.

Docs

You can find our docs online or view the source in docs/src.

Providing Feedback

You can use GitHub issues to report a bug or request a feature.

Join us on Zulip to chat with developers and other community members.

Next Steps

This is our open-source release which we will be constantly updating with bug fixes, features, etc. If you would like a feature or want to report a bug, please file an issue and we'll be happy to engage.

clp's People

Contributors

abvarun226, all-less, davemarco, davidlion, diy1, gibber9809, haiqi96, henry8192, jackluo923, junhaoliao, kirkrodrigues, linzhihao-723, oliversm95, sharafmohamed, thepegasos, wraymo

clp's Issues

Unable to set up CLP on Debian

Hi Team,
I am trying to follow the steps mentioned in package-template, but when I try to start CLP, it fails with the error below. The system has enough memory. Can you please update the readme.md to list all the required software?

Traceback (most recent call last):
  File "/root/clp/tools/packager/out/clp-package-ubuntu-focal-x86_64-v0.0.1/etc/../sbin/start-clp", line 462, in main
    start_queue(instance_id, clp_config)
  File "/root/clp/tools/packager/out/clp-package-ubuntu-focal-x86_64-v0.0.1/etc/../sbin/start-clp", line 271, in start_queue
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
  File "/opt/conda/default/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['docker', 'exec', '-it', 'clp-queue-209e', 'rabbitmqctl', 'wait', '/tmp/rabbitmq.pid']' returned non-zero exit status 137.

Search job logs and compression job logs overwrite each other

Bug

Logs from compression jobs and search jobs are stored in the same directory and use the same file name format, so they overwrite each other.

CLP version

f17be1d

Environment

  • Ubuntu 20.04
  • Docker 20.10.8, build 3967b7d

Reproduction steps

  • Clear all data and logs from your package
  • Run a compression job
  • Write some text in job-1-task-1.log if it is empty.
  • Run a search job and you'll see job-1-task-1.log was overwritten with the search job's logs.

Questions about intermediate results

Request

Great work! I have compressed and decompressed lots of logs successfully, but I have some problems understanding the code in depth.

  1. After a file is compressed, how do I obtain the intermediate results generated during compression, such as the extracted templates?

  2. I read logtype.dict and var.dict myself, but I found that they were all numbers, and I didn't know what they meant. In my understanding, logtypes and templates should be in string form, but I can't get these string templates at the moment.

  3. I'm very interested in figuring out what all these staging files (logtype.dict, logtype.segindex, metadata, metadata.db, var.dict, var.segindex) mean, and what the files 0, 1, 2, 3, and so on in the s folder are.

Possible implementation

Can you provide some scripts to read the aforementioned files?
Also, some scripts to generate intermediate results.

Working directory of some Docker images prevents container startup when run as non-root user

Bug

Some of the Docker images (e.g., clp/clp-execution-x86-ubuntu-focal) have their working directory set to /root which will prevent starting up the container as a non-root user. This only occurs with some versions of Docker (e.g., 19.03.2).

The error message is:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "chdir to cwd (\"/root\") set in config.json failed: permission denied": unknown.

CLP version

5f80e1e

Environment

  • Docker 19.03.2
  • Ubuntu 18.04

Reproduction steps

docker run -it --rm -u$(id -u):$(id -g) ghcr.io/y-scope/clp/clp-execution-x86-ubuntu-focal:main /bin/bash

CentOS 7.4 build failed

All dependencies have been compiled and installed successfully, but an error is reported when make is executed. The error message is as follows:
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_read_support_filter_lz4.c.o): in function `lz4_filter_read_legacy_stream':
archive_read_support_filter_lz4.c:(.text+0x1f9): undefined reference to `LZ4_decompress_safe'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_read_support_filter_lz4.c.o): in function `lz4_filter_read_default_stream':
archive_read_support_filter_lz4.c:(.text+0x575): undefined reference to `LZ4_decompress_safe_usingDict'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: archive_read_support_filter_lz4.c:(.text+0x79e): undefined reference to `LZ4_decompress_safe'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_cryptor.c.o): in function `aes_ctr_release':
archive_cryptor.c:(.text+0x18): undefined reference to `EVP_CIPHER_CTX_free'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_cryptor.c.o): in function `aes_ctr_init':
archive_cryptor.c:(.text+0x4e): undefined reference to `EVP_CIPHER_CTX_new'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: archive_cryptor.c:(.text+0x89): undefined reference to `EVP_aes_192_ecb'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: archive_cryptor.c:(.text+0xb7): undefined reference to `EVP_CIPHER_CTX_init'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: archive_cryptor.c:(.text+0xc9): undefined reference to `EVP_aes_128_ecb'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: archive_cryptor.c:(.text+0xd9): undefined reference to `EVP_aes_256_ecb'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_cryptor.c.o): in function `aes_ctr_update':
archive_cryptor.c:(.text+0x2f9): undefined reference to `EVP_EncryptInit_ex'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: archive_cryptor.c:(.text+0x318): undefined reference to `EVP_EncryptUpdate'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_hmac.c.o): in function `__hmac_sha1_cleanup':
archive_hmac.c:(.text+0x10): undefined reference to `HMAC_CTX_cleanup'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_hmac.c.o): in function `__hmac_sha1_final':
archive_hmac.c:(.text+0x48): undefined reference to `HMAC_Final'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_hmac.c.o): in function `__hmac_sha1_init':
archive_hmac.c:(.text+0x95): undefined reference to `EVP_sha1'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: archive_hmac.c:(.text+0xa9): undefined reference to `HMAC_Init_ex'
/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /usr/local/lib/libarchive.a(archive_hmac.c.o): in function `__hmac_sha1_update':
archive_hmac.c:(.text+0x64): undefined reference to `HMAC_Update'
collect2: error: ld returned 1 exit status
make[2]: *** [clp] Error 1
make[1]: *** [CMakeFiles/clp.dir/all] Error 2
make: *** [all] Error 2

Archive uncompressed size is incorrect (higher) in the global metadata database

Symptom

The archive's uncompressed size is higher in the global metadata database than the amount of data that was compressed.

Reproduction steps

  • clp c archives <log file>
  • Open the global metadata database
  • The archive's uncompressed size is higher than the size of the uncompressed file

Component

core

Version

0.0.0

Does CLP have a command-line tool? How do I use it?

Request

Hello,
Recently, I saw that Uber uses the CLP tool for log data compression, and the compression ratio is quite impressive. Our company also wants to try it, but upon checking GitHub and even the official paper, I haven't found any specific instructions on how to use the CLP tool.
Is there a Linux version of the tool? Or a demo in Java code? How do I use it?

Possible implementation

I am looking for a case or document that can be used for compression testing.

Adding a library to support reading from/writing to memory addresses

Request

Currently, CLP/CLG only supports writing to/reading from files under a specific directory structure in a POSIX file system. However, some filesystem-less applications may wish to store/read data to/from a memory address. A FUSE layer could achieve this, but it may cause significant overhead. It would be great to be able to use CLP/CLG as a library providing memory interfaces.

Possible implementation

Provide a C++ class that includes pointers to memory buffers which would otherwise be stored in the archive directory in the current implementation. This class would also allow users to call compress, decompress, and search directly instead of going through the command line.

Compress command from Package-template gets stuck (process runs indefinitely)

Bug

  • After installing everything and running the compress command via package-template/src/sbin, it gives the following output and hangs indefinitely.
  • The ../var directory does have source files, BUT it never produces any archived files or directories
  • Feel free to let me know if any other information is required
user@ubuntu1:~/code/CLP/clp/components/package-template/src/sbin$ ./compress /home/indra/TestLogs
2022-11-30 05:15:39,689 [INFO] [/opt/clp/sbin/native/compress] Compression job submitted to compression-job-handler.
2022-11-30 05:15:39,689 [INFO] [compression-job-handler] compression-job-handler started.
2022-11-30 05:15:39,695 [INFO] [job-1] Iterating and partitioning files into tasks.
2022-11-30 05:15:39,702 [INFO] [job-1] Waiting for 1 task(s) to finish.

Nothing happens after this. Here is the listing of the /TestLogs directory.

user@ubuntu1:~/code/CLP/clp/components/package-template/src/sbin$ ls -l ~/TestLogs
total 12
-rw-rw-r-- 1 indra indra 4493 Nov 30 03:57 ibm_dummy_logs.log
-rw-rw-r-- 1 indra indra   70 Nov 30 02:01 my_log_file.log

CLP version

0.0.1

Environment

Ubuntu 20.04 LTS:

user@ubuntu1:~/code/CLP/clp/components/package-template/src/sbin$ cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
user@ubuntu1:~ /code/CLP/clp/components/package-template/src/sbin$ docker -v
Docker version 20.10.12, build 20.10.12-0ubuntu2~20.04.1

Reproduction steps

  • Installed core per the directions on the repository – directly on Ubuntu environment. Did not go with docker image route.
  • Installed packages clp_py_utils, compression_job_handler, and job_orchestration in /components/package-template/src/lib/python3/site-packages/
  • Request: I want to build the "package" from scratch rather than playing with package-template and see if that works
  • From the docs, I am unsure how to do that. Any pointers on doing that would help to troubleshoot this.

Schema unit tests fail if run outside build directory.

Bug

If the unit tests are run outside the build directory, e.g.:
./build/unitTest instead of cd build && ./unitTest
you will get 11 failures related to missing schema files.

This is caused by relative paths being used in the schema unit tests.

An error example:

/home/lion/yscope/clp-ffi-go/cpp/clp/components/core/tests/test-Grep.cpp:21: FAILED:
due to unexpected exception with message:
ReaderInterface operation failed

[2023-05-20 23:25:59.016] [info] [test-ParserWithUserSchema.cpp:72] File not found: /home/lion/yscope/clp-ffi-go/cpp/clp/components/tests/test_schema_files/missing_schema.txt

CLP version

460d377

Environment

ghcr.io/y-scope/clp/clp-core-dependencies-x86-ubuntu-focal:main

Reproduction steps

  1. build the latest CLP
  2. run unitTest outside of your build directory (i.e. run unitTest from any directory other than the file's parent)

mariadb connector 3.2.3 no longer available

Bug

Running

./tools/scripts/lib_install/mariadb-connector-c.sh 3.2.3

Returns

dpkg-query: no packages found matching libmariadb-dev
Checking for elevated privileges...
curl: (22) The requested URL returned error: 404

Running with 3.3.3 works for now.

CLP version

54497a0

Environment

Ubuntu 22.04.1 LTS

Docker v4.14.1

uname -a: Linux 485fbd483ae2 5.15.49-linuxkit #1 SMP PREEMPT Tue Sep 13 07:51:32 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

Reproduction steps

Run

./tools/scripts/lib_install/mariadb-connector-c.sh 3.2.3

Zstd-compressed files cannot be compressed using clp-package

Bug

The current CLP uses libarchive to decompress zstd-compressed input. However, libarchive internally relies on the zstd CLI, which is currently not installed on the clp-execution image. As a result, files compressed with zstd in the package flow will trigger the error "Cannot compress {} - not UTF-8 encoded."

CLP version

4fc3b61

Environment

Ubuntu 18.04.1 LTS

Reproduction steps

python3 build-clp-package.py --config ../../config/build-clp-package.yaml
./sbin/start-clp
./sbin/compress file.zst

Out-of-order logs for large split files with timestamped and un-timestamped sections

Bug

Timestamped and un-timestamped segments are built in parallel, and clp relies on their sequential order for decompression. As a result, large split files with timestamped and un-timestamped sections may have their logs out of order relative to the original after being compressed and decompressed.

CLP version

776fc3a

Environment

Ubuntu 22.04

Reproduction steps

  1. Copy HDFS_1 from Loghub-dataset (log file without timestamps)
  2. Append logs with timestamps to the end of HDFS.log
    • For example take Hadoop logs from Loghub-dataset
    • cat Hadoop/application_1445062781478_0011/container_1445062781478_0011_01_000001.log >> HDFS.log
  3. Copy HDFS_1 and Hadoop directories into a new directory
    • Copy HDFS_1 first, then Hadoop, so Hadoop gets compressed first
    • Purpose of Hadoop directory is to create a segment with timestamps first in sequential order
  4. Compress/decompress directory
  5. Beginning of compressed/decompressed HDFS.log is out of order
    • head HDFS.log, should return 081111... instead of 081109...

Note: This will only reproduce the bug after PR #231 (which fixes issue #167) is merged. Before that, issue #167 will produce similar output.

clp-s: Nested timestamps result in incorrect mst node names

Bug

Ingesting records with nested timestamps using the --timestamp-key argument turns the key name in the last level of the MST into the fully qualified name, e.g., records like {"a": {"ts": ...}} will look like {"a": {"a.ts": ...}} when decompressed.

CLP version

e80cee0

Environment

Ubuntu focal clp image

Reproduction steps

  1. Ingest a dataset with nested timestamps while using the --timestamp-key argument
  2. Observe incorrect data upon decompression

CLP triggers segmentation fault when compressing tarred openstack log

Bug

When compressing the openstack.tar.gz logs (https://zenodo.org/records/7094972), CLP fails with a segmentation fault. Note that CLP can compress uncompressed openstack logs without any issue.

I tried to create a tar.gz from the plain log that CLP can compress, and the new tar.gz still triggers the segmentation fault.

Verified that commit 4f9ed0a doesn't have this issue.

CLP version

5d6ff54

Environment

Ubuntu 22.04

Reproduction steps

clp c output openstack-24hr.tar.gz

Large files split across archives not always decompressed correctly

Bug

A large file that is split across archives is sometimes decompressed with the splits recombined in the wrong order.

CLP version

0.0.1

Reproduction steps

  • Start the CLP package
  • Compress one or more large files
  • Decompress
  • Compare the original files with the decompressed files

This may not always reproduce the bug since it depends on the order in which archives are stored in the database.

Schema matching fails on query with negated varstring and clpstring filters on same key

Bug

For the following data

{"a": "clp string"}
{"a": "string clp"}

The query NOT a: "clp string" will return the second record, but the query NOT (a: "clp string" OR a: "b") will fail schema matching and return no records.

The AST for the failing case is AndExpr(!FilterExpr(EQ, ColumnDescriptor<clpstring,array>("a"), "clp string"), !FilterExpr(EQ, ColumnDescriptor<varstring,array>("a"), "b"))

The problem is that since the dataset contains no varstring type column the second filter fails column matching, gets replaced with EmptyExpr, and the entire AndExpr gets constant propagated away. This bug is a bit annoying, because while the varstring column "a" doesn't exist the JSON string column "a" does exist.

The expected behaviour is to treat the filter as False on the clpstring column "a" when it does exist (and the inverted filter as true).

Actually, for every string that matches either strictly varstring or clpstring there should be an implicit negated condition on the existence of the other type.

For example a<clpstring> NEQ "a b" -> a<clpstring> NEQ "a b" OR EXISTS a<varstring>
and similarly a<varstring> NEQ "b" -> a<varstring> NEQ "b" OR EXISTS a<clpstring>.

This edge case only applies for negated conditions -- something that does not match a particular clpstring will not match every varstring, but something that does match a particular clpstring will never match a varstring.

The easiest way to handle this edge case is probably either by augmenting the AST, during ConvertToExists or in another pass, or by very careful treatment during Schema Matching.
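
To make the proposed rewrite concrete, here is a small Python sketch with toy node classes (these are not clp-s's actual AST types) showing the suggested augmentation of a negated, strictly-typed string filter with an existence check on the other string type:

from dataclasses import dataclass

@dataclass
class Filter:
    op: str        # e.g., "NEQ"
    column: str    # e.g., "a"
    col_type: str  # "clpstring" or "varstring"
    value: str

@dataclass
class Exists:
    column: str
    col_type: str

@dataclass
class Or:
    left: object
    right: object

OTHER_TYPE = {"clpstring": "varstring", "varstring": "clpstring"}

def augment_negated_filter(f: Filter):
    # a<clpstring> NEQ "a b"  ->  a<clpstring> NEQ "a b" OR EXISTS a<varstring>
    if f.op == "NEQ":
        return Or(f, Exists(f.column, OTHER_TYPE[f.col_type]))
    return f

print(augment_negated_filter(Filter("NEQ", "a", "clpstring", "a b")))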

CLP version

9e6b755

Environment

Ubuntu focal docker image

Reproduction steps

  1. Ingest example data above
  2. Run example query

sbin/search.sh doesn't always print all search results

Bug

When searching using the package's sbin/search.sh script, sometimes, not all results are printed (but they exist in the results cache). The scheduler/worker logs don't show any errors.

It's possible that the search tasks / job finishes before the results cache has made the search results available for reading.

CLP version

d524ead

Environment

Ubuntu 22.04

Reproduction steps

  • sbin/start.sh
  • sbin/compress.sh
  • sbin/search.sh "DESERIALIZE_ERRORS" > results.txt

Count the lines of output. It should be 6945. Running the query a few times should result in one instance where not all 6945 results are written to the file.

Android OS library

Request

I want to try using CLP on Android, for compressing logs on Android.

Possible implementation

Provide a library (e.g., a JAR) or a program for 64-bit Android OS. Thanks.

clp-s: Timestamp Dictionary doesn't work correctly for nested timestamps

Bug

Nested timestamp keys aren't stored correctly in the timestamp dictionary, which makes them unusable at search time.

For records like the following ingested with --timestamp-key a.ts

{"a" : {"ts": 1234}}

The time range of a.ts is correctly recorded in the timestamp dictionary, but the key name is recorded as ts. This means that at search time filtering on a.ts will not successfully take advantage of the timestamp dictionary. (On the other hand filtering on ts will use the timestamp dictionary, but will fail on schema matching afterwards).

CLP version

4f74dd2

Environment

Ubuntu focal clp image

Reproduction steps

  1. Ingest any dataset with a nested timestamp key
  2. Observe incorrect timestamp dictionary

Decompression fails on one specific large file with long lines

Bug

Using the heuristic to compress and decompress Loghub-datasets/Loghub-datasets/HDFS_1/HDFS.log results in a decompressed file that does not match the original.

CLP version

c2e5823

Environment

Ubuntu 20.04

Reproduction steps

  • Compress using the heuristic
  • Decompress the resulting archive
  • Diff the decompressed file with the original file

Search taking lot of time using clg binary

Bug

We are using CLP for compressing logs generated by our Kubernetes cluster which are in JSON format. A sample log is given below:

{
"log_time": "2023-08-29T13:55:09.477456Z",
"stream": "stdout",
"time": "2023-08-29T13:55:09.477456564Z",
"@timestamp": "2023-08-29T19:25:09.477+05:30",
"@Version": "1",
"message": " Method: POST;Root=1-64edf8bd-5c762a676349ee71616bb687 , Request Body : {"orders":[{"order_type":"normal","external_reference_id":"69426","items":[{"offset_in_minutes":"721","quantity":"1","external_product_id":"225090"}]}]}",
"level": "INFO",
"level_value": 20000,
"request_id": "6f3f3651-a22b-42a0-b5fe-412d2167c5ca",
"kubernetes_docker_id": "caa9102a169a1495e5790cb2c17cb21d0a279ffc50d802d413938870ba59c7c0",
}
When we use the clg binary to search through the generated archive with various search queries, it takes a lot of time to process each query (around 25-30s on average). We only search for the request_id and the namespace name, as mentioned below. It is not feasible for us if the search takes this long per archive; ideally, we expect a search time of 4-5s per archive.
For example, to search for the request_id in the above log, we generally use the following queries:

  1. info6f3f3651-a22b-42a0-b5fe-412d2167c5ca*
  2. 6f3f3651-a22b-42a0-b5fe-412d2167c5ca

Our archives are sized 40MB on average, and the sizes of the internal files and folders of an archive are as follows:

  • var.dict: 18.2MB
  • /s: 16.8MB
  • var.segindex: 2.6MB
  • metadata.db: 1.9MB
  • logtype.dict: 736KB
  • logtype.segindex: 36KB
  • metadata: 4KB

Is this the correct way to write search queries (correct in the sense that it will use the logtype and other dictionaries to search through the archives efficiently)? As mentioned earlier, searching through a single archive already takes more than 30s, which is not feasible for us. We expect the search time to be around 4-5s per archive, not more. Please guide me if I am doing anything wrong, such as writing inefficient search queries.

CLP version

3a20c0d

Environment

UBUNTU 20.04
EC2 instance type: m5.8xlarge

Reproduction steps

NA

Escaped and wildcard "?" not properly handled during search

Bug

Currently CLG replaces ? with * in the search query because we don't handle the ? wildcard and instead fall back to decompress-and-match.
However, this doesn't consider the case where the ? is escaped.

Note: the queries discussed below are the queries as seen by CLP. If you enter a query on the command line via bash, you may need to add extra escapes for bash to interpret the query properly.

For example, consider the query INFO \? TEXT, which should match a literal ? in the log. The current CLG will replace the ? with * and end up searching for the wrong results.

While I tried a quick fix (not submitted) for the issue above, there were other bugs that I noticed.

For example, consider the query INFO \\? TEXT. The query should match a literal \ in the log, after which the ? serves as a wildcard matching any character. However, CLG does not return the correct results.
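
For illustration, here is a small Python sketch (not CLP's actual implementation) of an escape-aware substitution that replaces only unescaped ? wildcards with * and leaves escape sequences such as \? and \\ untouched, which is the behaviour the first example above expects:

def replace_unescaped_question_marks(query: str) -> str:
    out = []
    i = 0
    while i < len(query):
        ch = query[i]
        if ch == "\\" and i + 1 < len(query):
            out.append(query[i:i + 2])  # keep the escape sequence ('\?', '\\', ...) as-is
            i += 2
        elif ch == "?":
            out.append("*")  # unescaped '?' falls back to the '*' wildcard
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

# 'INFO \? TEXT' keeps its escaped '?', while an unescaped '?' becomes '*'.
assert replace_unescaped_question_marks(r"INFO \? TEXT") == r"INFO \? TEXT"
assert replace_unescaped_question_marks("INFO ? TEXT") == "INFO * TEXT"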

CLP version

5d6ff54

Environment

22.04 Ubuntu

Reproduction steps

Attaching log and queries that can be used for testing and reproducing the issue.

Currently running the query in the CLP won't generate the correctly matched results.

Note: the queries attached are what CLP should see. If you enter a query on the command line via bash, you may need to add extra escapes for bash to interpret the query properly.

query.txt
log.txt

Floating-point numbers are not correctly encoded when using IR four-byte encoding

Bug

When using CLP IR four-byte encoding methods to encode a message that contains a floating point number, the decoded message differs from the original message:
Message before encoding (raw text): fps=60.000004
Message after decoding (from the encoded IR): fps=.0
The error occurs with multiple different floating-point numbers; the decoded results are all .0.

CLP version

cb6058c

Environment

Ubuntu 18.04
macOS 12.5

Reproduction steps

  1. Encodes the following log message using CLP IR four-byte encoding method:
    fps=60.000004
  2. Decodes the encoded message back to the raw text using CLP IR four-byte decoding methods.

Fail to compile CLP

I have run into another compilation problem when executing the "make" command. It reports a missing "xmlReaderForIO" in libarchive.a, and I wonder why. I have "libxml2.a" installed in my lib directory.

Compression fails on MacOS and WSL 1

Bug

On MacOS, when running CLP through docker, this error is logged after attempting to compress a log file:

2021-12-28 10:42:48,593 [critical] Failed to write archive file metadata collection in file: /root/clp/var/data/clp-mini-cluster/archives/26d4c5e4-0d50-43df-9df7-1d210ba13a4b/metadata
2021-12-28 10:42:48,595 [error] Compression failed: src/FileWriter.cpp:116 FileWriter operation failed, errno=2

On WSL 1, when compressing with the core component natively, this error is printed:

2021-12-26 20:47:23,795 [critical] Failed to write archive file metadata collection in file: archives/5b6a7367-c69a-49f7-8689-9778626d25fb/metadata
2021-12-26 20:47:23,795 [error] Compression failed: src/FileWriter.cpp:116 FileWriter operation failed, errno=1

CLP version

0.0.1

Environment

  • MacOS version: OS X v10.15.7 build 19H15

  • Windows version: 10.0.17763.1935

  • WSL version: 1

  • Linux distro under WSL: Ubuntu 18.04

Reproduction steps

For MacOS:

  • sbin/start-clp --uncompressed-logs-dir <dir>
  • sbin/compress <file path>

For WSL 1:

  • ./clp c archives <file path>

Compression error on var-logs

Bug

The CLP compressor runs into an error when compressing our internal var-log. Some example error messages are attached below:

2024-01-24 00:57:45,022 [error] Cannot compress hostd-10 - not an IR stream or UTF-8 encoded
2024-01-24 00:57:45,125 [error] Cannot compress hostd-12 - not an IR stream or UTF-8 encoded

Note that 4f9ed0a doesn't have this issue, but we don't know from which commit the issue started.

CLP version

5d6ff54

Environment

ubuntu 22.04

Reproduction steps

./clp c varlog_compressed ///var-logs

ffi: The IR encoder doesn't escape escapes ('\'), causing existing decoders to gobble up backslashes

Bug

The existing IR encoding in the ffi namespace only escapes variable placeholders, but not the escape characters themselves. Although this could be handled in the decoder, our existing decoders already expect escape characters to be escaped.

Thanks to @haiqi96 for spotting the bug!
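
As a conceptual sketch only (this is not the ffi implementation), the following Python snippet shows why escaping the escape character along with the placeholders makes the round trip lossless; the placeholder value used here is purely hypothetical:

ESCAPE = "\\"

def escape_for_ir(text, placeholders):
    out = []
    for ch in text:
        if ch == ESCAPE or ch in placeholders:
            out.append(ESCAPE)  # escape the escape character itself, too
        out.append(ch)
    return "".join(out)

def unescape_from_ir(text):
    out, i = [], 0
    while i < len(text):
        if text[i] == ESCAPE and i + 1 < len(text):
            out.append(text[i + 1])  # drop the escape, keep the escaped character
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

# A literal backslash survives the round trip only because it was escaped.
msg = "path C:\\temp and placeholder \x11 here"
assert unescape_from_ir(escape_for_ir(msg, {"\x11"})) == msg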

CLP version

3a20c0d

Environment

N/A

Reproduction steps

  • Log some text with backslashes using the ClpIrFileAppender from here.
  • Open the output file in the log viewer
  • You should see that the backslashes were gobbled up

Schema matching fails on negated precise array search matching wildcard

Bug

On a log file consisting of the following example data
{"a": [{"b": "c"}]}
Search will work correctly for queries like a.b: c and a.b: *, but will fail on schema matching for the query NOT a.b: *.

The expected behaviour is for schema matching to succeed on the array column "a", followed by the record not being matched because an object with the key "b" exists within the array "a". This is likely caused by an interaction between the Convert To Exists pass and Schema Matching.

CLP version

284a558

Environment

Ubuntu focal clp image.

Reproduction steps

  1. Compress the example data {"a": [{"b": "c"}]}
  2. Perform the search NOT a.b: *
  3. Observe schema matching failure

clg reports SQLitePreparedStatement operation failure after printing one log message

Bug

The current version of clg can produce the error "SQLitePreparedStatement: Failed to bind int64 to statement - column index out of range". The cause is an implicit conversion from segment_id to bool when calling clp::streaming_archive::reader::Archive::get_file_iterator in the false == is_superseding_query case.

The fix is pretty simple. We just need to add false as the fifth parameter of get_file_iterator so that it can match the correct method.

CLP version

59b2c22

Environment

MacOS 13.2.1

Reproduction steps

  • Compress components/core/README.md using clp
  • Run a search on the compressed archive using clg

Message not properly filtered by timestamp in CLG

Bug

In CLG, messages are not properly filtered by timestamp when the timestamp range becomes small.

For example, on hadoop-24hrs-log, the following search command (1) returns several messages, but neither (2) nor (3) returns any matched search results, even though (2) and (3) together cover exactly the same time range as (1), so they should also return matches.

  1. clg --tge 1427099450000 --tle 1427099490000 hadoop_output/ "*"
  2. clg --tge 1427099460000 --tle 1427099490000 hadoop_output/ "*"
  3. clg --tge 1427099450000 --tle 1427099460000 hadoop_output/ "*"

Suspect that this is caused by some issues in the database queries.

CLP version

5d135cf

Environment

baker

Reproduction steps

  • compress hadoop-24hrs-log to hadoop-out
  1. clg --tge 1427099450000 --tle 1427099490000 hadoop_output/ "*"
  2. clg --tge 1427099460000 --tle 1427099490000 hadoop_output/ "*"
  3. clg --tge 1427099450000 --tle 1427099460000 hadoop_output/ "*"

The 2nd and 3rd commands will not return any output, but you should see that there are matching messages from the first command.

Difference between CLP and Parquet?

(Sorry if this isn't the right place for this question or if the question doesn't really make sense.) I read about CLP in the Uber blog post and I'm curious how CLP differs from storing logs in Parquet files. Is there a write up of comparisons anywhere?

Thank you!

[ERROR] [clp] Unable to connect to the database with the provided credentials

When I run ./sbin/start-clp --uncompressed-logs-dir <directory containing your uncompressed logs>, it reports an error like the one in the title. The detailed execution process is as follows:

$ ./sbin/start-clp --uncompressed-logs-dir ./myTestData/input/
2021-10-24 21:25:28,315 [INFO] [clp] Using default config file at etc/clp-config.yaml
2021-10-24 21:25:28,320 [INFO] [clp] Provision docker network bridge
2021-10-24 21:25:28,634 [INFO] [clp] Starting CLP scheduler
2021-10-24 21:25:28,634 [INFO] [clp] Starting scheduler mariadb database
2021-10-24 21:25:34,037 [INFO] [clp] Starting scheduler queue
2021-10-24 21:25:39,313 [INFO] [clp] Initializing scheduler queue
2021-10-24 21:25:40,831 [INFO] [clp] Initializing scheduler database tables
2021-10-24 21:26:05,825 [ERROR] [clp] Unable to connect to the database with the provided credentials
2021-10-24 21:26:05,826 [ERROR] [clp] 
2021-10-24 21:26:05,826 [ERROR] [clp] Failed to provision "clp-mini-cluster"

CLP Setup issue

I'm able to execute the start-clp and compress scripts, and I have imported the sample Hive dataset. When I try to run the following search query, sbin % ./search "Task *", I get the error below:

My understanding: as per https://github.com/y-scope/clp/releases, clp-package-ubuntu-focal-x86_64-v0.0.2.tar.gz is built for ubuntu-focal. Do I need an EC2 instance with ubuntu-focal only? Can you guide me to resolve the error below?

WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
Traceback (most recent call last):
File "/opt/clp/sbin/native/search", line 261, in
sys.exit(main(sys.argv))
File "/opt/clp/sbin/native/search", line 248, in main
for ip in set(socket.gethostbyname_ex(socket.gethostname())[2]):
socket.gaierror: [Errno -2] Name or service not known
Traceback (most recent call last):
File "./search", line 135, in
sys.exit(main(sys.argv))
File "./search", line 126, in main
subprocess.run(cmd, check=True)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['docker', 'run', '-i', '--rm', '--network', 'host', '-w', '/opt/clp', '-u', '504:20', '--name', 'clp-search-61c9', '--mount', 'type=bind,src=/Users/rerajesh/Downloads/clp-package-ubuntu-focal-x86_64-v0.0.2,dst=/opt/clp', '--mount', 'type=bind,src=/Users/rerajesh/Downloads/clp-package-ubuntu-focal-x86_64-v0.0.2/var/log/bcd07464dac2,dst=/opt/clp/var/log', '--mount', 'type=bind,src=/Users/rerajesh/Downloads/clp-package-ubuntu-focal-x86_64-v0.0.2/var/data/archives,dst=/mnt/archive-output', 'ghcr.io/y-scope/clp/clp-execution-x86-ubuntu-focal:main', '/opt/clp/sbin/native/search', '--config', '/opt/clp/var/log/.clp-search-61c9-config.yml', 'Task * deprecation']' returned non-zero exit status 1.

Compression: unable to mount uncompressed logs directory in the container

Hello,
I am unable to compress the log file as the path for uncompressed logs is not mounted in the container. Please see the logs below for further inspection.
CLP starts successfully. uncomp_logs is the <uncompressed_logs-dir>

$ ./clp-package-ubuntu-focal-x86_64-v0.0.1/sbin/start-clp --uncompressed-logs-dir uncomp_logs/
2021-11-02 11:25:29,892 [INFO] [clp] Using default config file at clp-package-ubuntu-focal-x86_64-v0.0.1/etc/clp-config.yaml
2021-11-02 11:25:29,898 [INFO] [clp] Provision docker network bridge
2021-11-02 11:25:30,102 [INFO] [clp] Starting CLP scheduler
2021-11-02 11:25:30,103 [INFO] [clp] Starting scheduler mariadb database
2021-11-02 11:25:33,941 [INFO] [clp] Starting scheduler queue
2021-11-02 11:25:39,687 [INFO] [clp] Initializing scheduler queue
2021-11-02 11:25:41,374 [INFO] [clp] Initializing scheduler database tables
2021-11-02 15:25:41,608 [INFO] Successfully created clp metadata tables for compression and search
2021-11-02 15:25:41,835 [INFO] Successfully created compression_jobs and compression_tasks orchestration tables
2021-11-02 11:25:41,851 [INFO] [clp] Starting scheduler service
2021-11-02 11:25:41,960 [INFO] [clp] Starting CLP worker

Upon compressing, it gives the following error:

$ ./clp-package-ubuntu-focal-x86_64-v0.0.1/sbin/compress uncomp_logs/auth.log 
2021-11-02 11:26:23,597 [INFO] [clp] Using default config file at clp-package-ubuntu-focal-x86_64-v0.0.1/etc/clp-config.yaml
2021-11-02 15:26:23,842 [INFO] [compress] Compression job submitted to compression-job-handler.
2021-11-02 15:26:23,842 [INFO] [compression-job-handler] compression-job-handler started.
2021-11-02 15:26:23,863 [INFO] [job-8] Iterating and partitioning files into tasks.
2021-11-02 15:26:23,863 [ERROR] [job-8] "/opt/clp/uncomp_logs/auth.log" does not exist.
2021-11-02 15:26:23,871 [INFO] [job-8] Waiting for 0 task(s) to finish.

Indeed, there is no such directory in the container.

root@clp-mini-cluster:/opt/clp# ls   
LICENSE  README.md  bin  etc  lib  requirements-pre-3.7.txt  sbin  var

I checked the permissions of the uncompressed-logs directory and it is user:docker. Is there anything else I need to check?

Files with messages which swap a variable for a variable placeholder aren't compressed correctly

Bug

If a file contains two messages which are identical except one message contains a variable and the other contains its corresponding variable placeholder, the file will not be compressed correctly such that decompressing it either stops prematurely or crashes.

For example, these log messages (where ^Q is an integer-variable placeholder)...

2015-03-23 07:29:48,942 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties ^Q from
2015-03-23 07:29:48,942 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties 123 from
2015-03-23 07:29:48,942 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties 123 from

... will cause the following exception:

2023-09-17 21:36:24,750 [error] FileWriter not closed before being destroyed - may cause data loss
2023-09-17 21:36:24,750 [error] Decompression failed: src/DictionaryReader.hpp:208 DictionaryReader operation failed, error_code=3

If instead we reorder the messages...

2015-03-23 07:29:48,942 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties 123 from
2015-03-23 07:29:48,942 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties ^Q from
2015-03-23 07:29:48,942 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties 123 from

... decompression will stop after the 2nd message, where the 2nd message is also incorrect.

CLP version

5d0f7b3

Environment

Ubuntu 20.04

Reproduction steps

  • Compress the file described in the bug report: ./clp c archives log.txt
  • Decompress the file described in the bug report: ./clp x archives decomp
  • Observe a crash or incorrect decompressed output: diff log.txt decomp/log.txt

Unable to get CLP running on Mac

Bug

I am trying to get CLP up and running by following the instructions provided in https://github.com/y-scope/clp/blob/main/tools/packager/README.md, but I see the error below: ModuleNotFoundError: No module named 'zstandard.backend_c'

CLP version

6d35126

Environment

OS: MacOS Monterey
Python: 3.10.8
Docker version 20.10.17, build 100c701

Reproduction steps

Steps:

  1. Followed steps as per building package
  2. cd out/clp-package-ubuntu-focal-x86_64-v0.0.1
  3. Start clp - /sbin/start-clp --uncompressed-logs-dir <full path>

Source Compile & Test

I have a question here; I don't know if you can answer it. I found that CLP uses Docker to run the program, and it uses the MySQL database, but the core source code is implemented in C++. I want to know how to compile and test the CLP algorithm using only the C++ code. Thank you! @kirkrodrigues @jackluo923

Queries with a variable followed by a period and a wildcard do not return results

Bug

If a variable exists in a log message like so: "... job_1427088391284_0004. ...". The query " *job_1427088391284_0004.* " does not return this message, but it should.

CLP version

0.0.1

Reproduction steps

  • Compress a log message containing a variable followed by a period like "... job_1427088391284_0004. ..."
  • Search for " *job_1427088391284_0004.* "

Empty logtypes incorrectly cause an error to be thrown

Bug

When trying to write an empty logtype to the dictionary, an exception is thrown. Empty logtypes should be allowed.

CLP version

e4b7635

Environment

Ubuntu 18.04

Reproduction steps

Call bool LogTypeDictionaryWriter::add_entry (LogTypeDictionaryEntry& logtype_entry, logtype_dictionary_id_t& logtype_id)
with an empty string for the value in logtype_entry.

Cannot build on RedHat 7

I am trying to build CLP on a Linux server (RedHat 7), but I encountered a CMake (version 3.21.1) build failure. The debug messages say it could not find Boost (missing iostreams, program_options, filesystem, system). But I have libboost-iostreams.a, libboost-program_options.a, libboost-filesystem.a, and libboost-system.a (all version 1.59.0) under /usr/include/lib/, so I wonder why the CMake program could not find them.
I am looking forward to a solution, thanks.

Searching for a variable with a '/' in an archive composed of IR files returns nothing

Bug

If we compress an IR file containing a variable with one or more forward slashes (e.g., "/usr/bin/python3") into an archive and then try to search for that variable using clg (e.g., ./clg archives "/usr/bin/python3"), the search returns nothing.

CLP version

084efa3

Environment

Ubuntu 20.04

Reproduction steps

  • Create an IR file (using the four-byte encoding) containing a log message: 2023-09-16T00:00:00.000 INFO Python's path is /usr/bin/python3?
  • Compress that IR file using clp: ./clp c archives test.clp.zst
  • Search for the path string: ./clg archives "/usr/bin/python3"

Integers less than -2^63 aren't compressed losslessly

Bug

Compressing a log with an integer less than -2^63 (e.g., -2^63 - 1) results in the integer being set to -2^63.

CLP version

583fdde

Environment

  • Ubuntu 20.04

Reproduction steps

  • Create a log with an integer smaller than -2^63, e.g., "This number is very negative: -9223372036854775810"
  • Compress the log
  • Decompress it and you'll see that the message has been decompressed as: "This number is very negative: -9223372036854775808"
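
For context, the clamped value is exactly the minimum of a 64-bit signed integer; a quick check in Python:

# 64-bit signed integers span [-2**63, 2**63 - 1]. -9223372036854775810 lies
# below that range, so an int64-only encoder clamps it to the minimum instead
# of preserving it losslessly.
INT64_MIN = -2**63
print(INT64_MIN)                          # -9223372036854775808
print(-9223372036854775810 < INT64_MIN)   # True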
