cloudwise-opensource / gaia-dataset

GAIA (Generic AIOps Atlas) is a general-purpose dataset for analyzing operations problems such as anomaly detection, log analysis, and fault localization.

License: GNU General Public License v2.0

ai aiops analysis dataset devops metrics ops

gaia-dataset's Issues

Duplicate timestamps in metrics

There are duplicate timestamps in many metrics. Some of these duplicates have the same value, but often the same timestamp appears in multiple rows with different values. Usually in such cases, one of the rows has a valid value and the remaining rows are 0. Can I just take the non-zero row as the correct one for that timestamp? Is this expected given how the data was collected and compiled? Thanks.
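
A minimal sketch of the heuristic described above, assuming each metric file can be read as (timestamp, value) rows. The function name `dedupe_prefer_nonzero` and the sample values are hypothetical, and whether the zero rows are really invalid has not been confirmed by the maintainers. The sketch keeps the non-zero value when a timestamp is duplicated and reports timestamps that carry more than one distinct non-zero value, since the heuristic cannot resolve those.

```python
from collections import defaultdict


def dedupe_prefer_nonzero(rows):
    """Collapse duplicate timestamps, preferring a non-zero value.

    rows: iterable of (timestamp, value) pairs.
    Returns (clean, conflicts):
      clean     maps timestamp -> chosen value
      conflicts maps timestamp -> sorted distinct non-zero values, only
                when more than one exists (the first seen is kept in clean)
    """
    by_ts = defaultdict(list)
    for ts, value in rows:
        by_ts[ts].append(float(value))

    clean, conflicts = {}, {}
    for ts, values in by_ts.items():
        nonzero = [v for v in values if v != 0.0]
        if not nonzero:
            clean[ts] = 0.0  # every row for this timestamp was zero
            continue
        clean[ts] = nonzero[0]
        distinct = sorted(set(nonzero))
        if len(distinct) > 1:
            conflicts[ts] = distinct
    return clean, conflicts


# Hypothetical sample rows; real rows would come from a metric CSV.
sample = [
    (1628883540000, 0.0),
    (1628883540000, 41.5),
    (1628883600000, 42.0),
    (1628883600000, 42.0),
    (1628883660000, 0.0),
    (1628883720000, 40.0),
    (1628883720000, 43.0),
]
clean, conflicts = dedupe_prefer_nonzero(sample)
```

Timestamps listed in `conflicts` would still need a manual decision, because two different non-zero readings cannot be ranked by this rule.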

Causes and measures of failure

Could you explain the causes of, and the corresponding measures for, the several types of failures in the dataset?
Thanks a lot.

Some injected memory anomalies do not have impact on the memory metrics

For example, I am checking the impact of the following injected anomaly:

2021-08-14 03:49:04,212 | WARNING | 0.0.0.4 | 172.17.0.3 | dbservice1 | [memory_anomalies] trigger a high memory program, start at 2021-08-14 03:39:03.575551 and lasts 600 seconds and use 1g memory

Below is a plot of the memory-related metrics on this node and service around that time. The red region marks the duration of the high-memory program.

There is no change in any metric before and after the anomaly is injected.
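
One way to check this numerically rather than by eye is to compare a metric's mean inside the injection window with its mean just before the window. The sketch below takes the window from the run log quoted above. The function name `window_shift`, the 600-second baseline, and the flat sample series are assumptions made for illustration; real samples would come from the metric CSV.

```python
from datetime import datetime, timedelta
from statistics import mean


def window_shift(series, start, duration_s, baseline_s=600):
    """Compare a metric's mean during an injection window with its
    mean over the baseline_s seconds immediately before the window.

    series: iterable of (datetime, float) samples.
    Returns (baseline_mean, window_mean), or None if either side is empty.
    """
    end = start + timedelta(seconds=duration_s)
    base_start = start - timedelta(seconds=baseline_s)
    before = [v for t, v in series if base_start <= t < start]
    during = [v for t, v in series if start <= t < end]
    if not before or not during:
        return None
    return mean(before), mean(during)


# Start time and duration taken from the run log quoted above.
start = datetime.strptime("2021-08-14 03:39:03.575551", "%Y-%m-%d %H:%M:%S.%f")

# Hypothetical flat series sampled once a minute around the window.
series = [(start + timedelta(minutes=m), 55.0) for m in range(-10, 10)]
baseline, window = window_shift(series, start, 600)
```

Equal baseline and window means, as in this flat sample, would match the observation that the injection left the metric unchanged.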

Question regarding [cpu anomalies] in the MicroSS dataset

First of all, thank you for establishing the Atlas dataset, which includes metrics, logs, and trace data with anomaly labels!

I have looked into the MicroSS dataset and tried to understand the injected faults, which are recorded in the file MicroSS/run/run.zip.
The run logs suggest that four fault types may have been injected:

memory anomalies: 2021-07-26 20:38:42,890 | WARNING | 0.0.0.2 | 172.17.0.2 | dbservice2 | [memory_anomalies] trigger a high memory program, start at 2021-07-26 20:28:42.025044 and lasts 600 seconds and use 1g memory
permission denied: 2021-07-27 00:01:00,853 | WARNING | 0.0.0.2 | 172.17.0.2 | dbservice2 | trigger an access permission denied exception, will lasts an hour
file missing: 2021-07-28 16:10:01,076 | WARNING | 0.0.0.3 | 172.17.0.4 | webservice2 | trigger the file moving program, start with 2021-07-28 16:00:00.976817, last for 600 seconds
cpu anomalies: 2021-07-28 06:40:20,943 | WARNING | 0.0.0.4 | 172.17.0.2 | mobservice2 | [cpu_anomalies] trigger a parallel fast sorting program , start at 2021-07-28 06:40:20.936320 and lasts 3.0034542083740234 seconds

I am curious about the duration of the injected CPU anomalies. The other anomalies are injected for several hundred seconds, but the CPU anomalies are injected for only about 3 seconds. Can a 3-second CPU anomaly really affect the reliability and availability of the system?

When I searched the durations of all CPU anomalies, a stranger issue emerged: several run logs show CPU anomalies that were injected for more than 1 million seconds. For example:

2021-07-29 | logservice1 | 2021-07-29 22:09:57,277 | WARNING | 0.0.0.3 | 172.17.0.3 | logservice1 | [cpu_anomalies] trigger a parallel fast sorting program , start at 2021-07-29 22:09:57.274933 and lasts 1985016.0505759716 seconds
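
A duration of 1985016 seconds is about 23 days, so records like this one are easy to separate from the 3-second ones. The sketch below extracts the anomaly type, start time, and duration from run-log lines shaped like those quoted in this issue, then flags durations above a cut-off. The names `INJECTION_RE`, `parse_injection`, and `suspicious_durations` are hypothetical, and the one-day cut-off is an arbitrary assumption, not a value documented for the dataset.

```python
import re

# Matches the "[kind] ... start at <timestamp> and lasts <n> seconds" part
# of the run-log messages quoted in this issue.
INJECTION_RE = re.compile(
    r"\[(?P<kind>\w+)\].*?start at (?P<start>\d{4}-\d{2}-\d{2} [\d:.]+)"
    r" and lasts (?P<seconds>\d+(?:\.\d+)?) seconds"
)

# Arbitrary cut-off of one day; an assumption, not a documented limit.
MAX_PLAUSIBLE_SECONDS = 24 * 60 * 60


def parse_injection(line):
    """Return (kind, start, seconds) for an injection record, else None."""
    m = INJECTION_RE.search(line)
    if m is None:
        return None
    return m.group("kind"), m.group("start"), float(m.group("seconds"))


def suspicious_durations(lines, limit=MAX_PLAUSIBLE_SECONDS):
    """Collect parsed records whose duration exceeds the limit."""
    flagged = []
    for line in lines:
        parsed = parse_injection(line)
        if parsed is not None and parsed[2] > limit:
            flagged.append(parsed)
    return flagged


# Lines copied from the run logs quoted in this issue.
lines = [
    "2021-07-28 06:40:20,943 | WARNING | 0.0.0.4 | 172.17.0.2 | mobservice2 | [cpu_anomalies] trigger a parallel fast sorting program , start at 2021-07-28 06:40:20.936320 and lasts 3.0034542083740234 seconds",
    "2021-07-29 | logservice1 | 2021-07-29 22:09:57,277 | WARNING | 0.0.0.3 | 172.17.0.3 | logservice1 | [cpu_anomalies] trigger a parallel fast sorting program , start at 2021-07-29 22:09:57.274933 and lasts 1985016.0505759716 seconds",
    "2021-07-26 20:38:42,890 | WARNING | 0.0.0.2 | 172.17.0.2 | dbservice2 | [memory_anomalies] trigger a high memory program, start at 2021-07-26 20:28:42.025044 and lasts 600 seconds and use 1g memory",
]
flagged = suspicious_durations(lines)
```

Only the logservice1 record exceeds the cut-off here; the 3-second and 600-second records pass through unflagged.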

Complete log data

First of all, thank you for providing the dataset. The logs in the business folder only cover July 1st to July 6th. Could you provide log data for a whole month? Thanks a lot.

Few records of services in July in the 'business' folder of MicroSS

After unzipping all the folders, I found that the 'business' folder has no July records for any service except webservice1. There is also no August data in the trace folder. Could you please provide complete data, at least for one full month?

Question about fault types in the MicroSS dataset

First of all, thank you for your MicroSS dataset!
After extracting the templates from run_table_2021-07.csv, I found that there are 16 templates. Besides the two fault types [cpu_anomalies] and [memory_anomalies], I am curious whether the following are also considered fault types:

"<>-<>-<> <>:<>:<>,<> | WARNING | <> | <> | <> | [normal memory freed label] lasts ten minutes
"<>-<>-<> <>:<>:<>,<> | WARNING | <> | <> | <> | <> | wait for <> seconds for follow-up operations to simulate the login failure of the QR code expired
"<>-<>-<> <>:<>:<>,<> | ERROR | <> | <> | <> | upload run_logs logs on <>-<>-<*> failed: 'str' object does not support item assignment
"<>-<>-<> <>:<>:<>,<> | WARNING | <> | <> | <> | trigger the file moving program, start with <>-<>-<> <>:<>:<>.<>, last for <> seconds
"<>-<>-<> <>:<>:<>,<> | ERROR | <> | <> | <> | upload business logs on <>-<>-<> failed: (pymysql.err.OperationalError) (<>, 'ny connections')
"<>-<>-<> <>:<>:<>,<> | WARNING | <> | <> | <> | trigger an access permission denied exception, will lasts an hour
"<>-<>-<> <>:<>:<>,<> | ERROR | <> | <> | <> | upload <> logs on <>-<>-<> failed: (pymysql.err.OperationalError) (<>, ""Can't connect to MySQL server on '<>' ([Errno <*>] Connection refused)"")

Missing Files in July and August

Thank you for providing this valuable dataset. However, when I unzipped the business and trace zip files, I only found trace data for July, and the log data for almost every service except webservice1 covers only August. Could you please provide the complete business and trace data so that we can align these multi-source data by timestamp?
Looking forward to your reply.

About the data for July

Hello, I only saw July data for the webservice1 node. Is there July data for the other nodes, as there is for August?

What are the meanings of the fields in a "business" log message?

For example, in 'business_table_2021-08.csv',

What are the meanings of the 4th and 5th fields in a message?

Regarding the inconsistent metric types

A preliminary analysis of the metrics shows that metric names differ for the same service across periods, and across services within the same period. Below are some screenshots of memory-related information.

I would like to ask whether the metrics of different services are collected in different ways. Could you give a detailed description of these metrics? This would greatly help us analyze service performance and detect abnormalities.

Question about log message

First, thank you for sharing this dataset.
I have some questions about the message field in the logs.

What does the trace id "c124e30fb40651dc" mean? Is there any relationship between this log and this trace?
And what does "permission_operate.py -> permission_operation -> 35" mean?

Thanks a lot.

About the labels of the MicroSS data

First of all, thanks for sharing your data. I could not find any labels in the MicroSS\metric dataset. Could you please provide complete labels, such as 0 or 1, for the MicroSS metrics? Thanks a lot.

Question about MicroSS Data

In the "metric" folder, each CSV filename contains the node to which the file belongs. In the "business" folder, each file contains the business log of a node. But in the "trace" folder, "service_name" is the name of a service or host.
What is the difference between a service and a node here?
And why do some nodes have corresponding metrics but no associated logs or traces?
I would also like to know the relationship between nodes and containers here. For example, is each node deployed in a container?

Thanks a lot.

DataSet is Empty

Hi, when accessing the GAIA-DataSet, it shows "This repository is empty." Can you please tell me how to access the data? Thank you!

Question about the run file

Hello! Thank you for your recently uploaded GAIA dataset! I would like to ask whether each line in the CSV file in the run folder corresponds to an injected exception.
I ask because some lines contain log messages such as "upload business logs on 2021-07-31 successfully". Is this also an exception? If so, what type of exception? Looking forward to your reply, thank you very much!
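
Until the maintainers answer, one possible way to separate the rows is by keyword. The sketch below rests on a guess drawn only from the messages quoted in these issues: messages containing "trigger" are treated as injection records, and messages about uploading logs are treated as housekeeping. The function name `classify_run_message` and the three category labels are hypothetical.

```python
def classify_run_message(message):
    """Roughly bucket a run-log message by keyword.

    The buckets are a guess based on messages quoted in the issues,
    not a scheme documented for the dataset.
    """
    text = message.lower()
    if "trigger" in text:
        return "injection"
    if "upload" in text and "logs" in text:
        return "upload"
    return "other"


# Messages copied from the issues in this list.
messages = [
    "[memory_anomalies] trigger a high memory program, start at 2021-07-26 20:28:42.025044 and lasts 600 seconds and use 1g memory",
    "trigger an access permission denied exception, will lasts an hour",
    "upload business logs on 2021-07-31 successfully",
    "[normal memory freed label] lasts ten minutes",
]
labels = [classify_run_message(m) for m in messages]
```

Rows that land in "other" are the ones whose meaning most needs confirmation from the maintainers.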

Dataset repository not accessible

Hello,

This repository does not seem to be accessible on GitHub except through the mobile app. Please take a look if you know what's going on.

Injection Schedule

Hello,

I'm trying to extract the time windows between a failure injection and the occurrence of failure messages in the log data.

Where is the record providing this information?

Thank you
