leonyang95 / plelog

Implementation of PLELog from the ICSE 2021 accepted paper: Semi-supervised Log-based Anomaly Detection via Probabilistic Label Estimation.

License: Apache License 2.0

Python 100.00%

plelog's Introduction

PLELog

DOI

This is the basic implementation of our ICSE 2021 paper: Semi-supervised Log-based Anomaly Detection via Probabilistic Label Estimation.

Description

PLELog is a novel approach for log-based anomaly detection via probabilistic label estimation. It is designed to effectively detect anomalies in unlabeled logs while avoiding the manual labeling effort needed to generate training data. We represent the semantic information within log events as fixed-length vectors and apply HDBSCAN to automatically cluster log sequences. We then propose a probabilistic label estimation approach to reduce the noise introduced by labeling errors, and feed the "labeled" instances into an attention-based GRU network for training. We conducted an empirical study to evaluate the effectiveness of PLELog on two open-source log datasets (i.e., HDFS and BGL). The results demonstrate the effectiveness of PLELog. In particular, PLELog has been applied to two real-world systems from a university and a large corporation, further demonstrating its practicability.
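As a minimal sketch of the clustering-and-labeling stage (illustrative only, not the repository's code; the vector shapes and the hyperparameters n_components=20, min_cluster_size=100, and min_samples=100 are assumptions based on the configurations discussed in the paper and issues):

# Minimal, illustrative sketch of PLELog's clustering/labeling stage; not the
# repository's actual code. Shapes and hyperparameters here are assumptions.
import numpy as np
from sklearn.decomposition import FastICA
from hdbscan import HDBSCAN

# Assume each log sequence is already encoded as a fixed-length semantic vector.
sequence_vectors = np.random.rand(1000, 300)

# Reduce dimensionality before clustering (PLELog uses FastICA).
reduced = FastICA(n_components=20, random_state=0).fit_transform(sequence_vectors)

# Cluster the reduced log sequences with HDBSCAN; label -1 marks outliers.
clusterer = HDBSCAN(min_cluster_size=100, min_samples=100)
cluster_labels = clusterer.fit_predict(reduced)

# HDBSCAN's soft membership scores are the basis for the probabilistic labels
# that are later fed to the attention-based GRU classifier.
membership = clusterer.probabilities_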

Project Structure

├─approaches  # PLELog main entry point.
├─config      # Configuration for Drain
├─entities    # Instances for log data and DL model.
├─utils
├─logs        
├─datasets    
├─models      # Attention-based GRU and HDBSCAN Clustering.
├─module      # Anomaly detection modules, including classifier, Attention, etc.
├─outputs           
├─parsers     # Drain parser.
├─preprocessing # preprocessing code, data loaders and cutters.
├─representations # Log template and sequence representation.
└─util        # Vocab for DL model and some other common utils.

Datasets

We used two open-source log datasets, HDFS and BGL. In the future, we plan to test PLELog on more log data.

Software System   Description                          Time Span    # Messages   Data Size   Link
HDFS              Hadoop distributed file system log   38.7 hours   11,175,629   1.47 GB     LogHub
BGL               Blue Gene/L supercomputer log        214.7 days   4,747,963    708.76 MB   Usenix-CFDR Data

Reproducibility

We have published a full version of PLELog (including the HDFS log dataset, GloVe word embeddings, as well as a trained model) on Zenodo; please find the project via the Zenodo (DOI) badge at the beginning of this README.

Environment

Note:

  • We attach great importance to the reproducibility of PLELog. Here we list some of the key packages needed to reproduce our results. However, as discussed in issue#14, please refer to the requirements.txt file for package installation.

  • According to issue#16, there seems to be a problem with the suggested hdbscan version; if your environment hits such an error, please refer to the issue for support. Many thanks for this valuable issue!

  • According to issue#19, we removed the numpy version requirement from the requirements.txt file. Many thanks for this suggestion!

Key Packages:

PyTorch v1.10.1

Python v3.8.3

hdbscan v0.8.27

overrides v6.1.0

scikit-learn v0.24

tqdm

regex

Drain3

hdbscan and overrides are not available through the default Anaconda channels; try pip, or conda install -c conda-forge pkg==ver, where pkg is the target package and ver is the suggested version (e.g., conda install -c conda-forge hdbscan==0.8.27).

Please note: due to a known issue involving joblib, scikit-learn > 0.24 is not supported here. We'll keep watching.

Preparation

You need to follow these steps to run PLELog end to end.

  • Step 1: To run PLELog on different log data, create a directory under the datasets folder using a unique and memorable name (e.g., HDFS or BGL). PLELog will try to find the related files and create logs and results according to this name.
  • Step 2: Move the target log file (plain text, one log message per line) into the folder from step 1.
  • Step 3: Download glove.6B.300d.txt from the Stanford NLP word embeddings and put it under the datasets folder (see the example layout after this list).
  • Step 4: Run approaches/PLELog.py (make sure it has proper parameters). You can find the details about the Drain parser from IBM.
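For concreteness, after steps 1-3 the datasets folder might look as follows (HDFS.log and BGL.log are illustrative file names; use your own log files):

datasets
├─HDFS
│  └─HDFS.log           # one log message per line
├─BGL
│  └─BGL.log
└─glove.6B.300d.txt     # Stanford GloVe word embeddings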

Note: Since logs can be very different, this repository only provides the processing approaches for HDFS and BGL w.r.t. our experimental setting.

Anomaly Detection

If you are interested in applying PLELog to your own log data, please refer to the BasicLoader abstract class in preprocessing/BasicLoader.py for more instructions.

  • Step 1: To run PLELog on different log data, create a directory under the datasets folder using a unique and memorable name (e.g., HDFS or BGL). PLELog will try to find the related files and create logs and results according to this name.
  • Step 2: Move the target log file (plain text, one log message per line) into the folder from step 1.
  • Step 3: Create a new dataloader class implementing BasicLoader (see the sketch after this list).
  • Step 4: Go to preprocessing/Preprocess.py and add your new log data to the acceptable variables.
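As a rough illustration only (the actual abstract methods are defined in preprocessing/BasicLoader.py and may differ), a new loader mirrors the existing HDFSLoader, which subclasses BasicDataLoader; MyLoader and _preprocess_line below are hypothetical names:

# Hypothetical sketch, not the repository's API: check preprocessing/BasicLoader.py
# for the real abstract methods you must implement.
from preprocessing.BasicLoader import BasicDataLoader

class MyLoader(BasicDataLoader):
    # Loader for a new log type, mirroring the shape of HDFSLoader.

    def _preprocess_line(self, line):
        # Hypothetical hook: strip timestamps/ids and keep the raw message text.
        return line.strip()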

Contact

We are happy to see PLELog applied in the real world and are willing to contribute to the community. Feel free to contact us if you have any questions! Author information:

Name Email Address
Lin Yang [email protected]
Junjie Chen * [email protected]
Weijing Wang [email protected]

* corresponding author


plelog's Issues

Some problems with the requirements.txt

Hello! Thanks for sharing such great work.
I configured my virtual env based on the provided requirements.txt using pip, but some errors occurred.

  1. After setting up the virtual env, I started to run the project. The first error was: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject.
    I searched for solutions, and finally this one worked for me: scikit-learn-contrib/hdbscan#457

I did some experiments and found that this seems to be a problem with the hdbscan package itself, and has nothing to do with the version of numpy.
If you used the command pip install hdbscan to install the hdbscan package in your virtual environment, please uninstall it, and then try the command conda install -c conda-forge hdbscan to reinstall hdbscan.

  2. I used the above command to install hdbscan==0.8.27, but then I got a second error: TypeError: __init__() got an unexpected keyword argument 'cachedir'. This solution helped me: https://stackoverflow.com/questions/73830225/init-got-an-unexpected-keyword-argument-cachedir-when-importing-top2vec
    I installed hdbscan==0.8.29 using conda, and then everything worked fine.

I opened this issue just to help anyone who wants to set up the env for your code.
Thanks again for your great work.

How to reproduce the result of other methods?

Hello, Dr. Yang.
Your paper reports results for other approaches (DeepLog, LogAnomaly, LogRobust, and so on) that are not consistent with their original papers.
However, I find that their authors do not provide source code publicly.
Can you tell me how to reproduce the results of the other methods?
I would appreciate it if you could offer some source code or other effective methods.

Parameter settings for running Drain.py to extract templates

Hello, I want to use Drain.py to extract templates for the BGL dataset, but the result is not so good. Could you tell me the specific parameter settings? Thanks very much.

my settings are as follows:

removeCol = [0, 1, 2, 3, 4, 5, 6, 7, 8]  # columns to strip before parsing; [0,1,2,3,4] for HDFS
st = 0.5                                 # similarity threshold
depth = 4                                # depth of the parse tree
rex = [r'core\.\d+']                     # preprocessing regexes for variable parts

Training process error

The training process encountered the following problems.
(screenshot 1)
With bidirectional set to False, another problem arose.
(screenshot 2)
I removed the second factor of the multiplication shown in the image below; training then runs, but the result is wrong.
(screenshot 3)
Below are the contents of my data file.
(screenshot 4)
I can't solve the problem yet. I hope I can get some help to solve it. Thank you very much!

Running PLELog.py problem

Hello, Dr. Yang. I want to reproduce the PLELog algorithm on HDFS. According to the Preparation section of the README, I just need to place the HDFS dataset and glove.6B.300d.txt under the datasets folder and run PLELog.py.
I downloaded the HDFS dataset and the anomaly label file, placed the files as instructed, and ran PLELog.py in PyCharm. The program ran for 4-5 hours without any results. Am I running the wrong steps? If so, please point out my mistake.

The following three pictures are screenshots of the files I configured according to the requirements and of the running program. It has been running for 4-5 hours, and no errors are reported in this interface.
(screenshots: datasets, datasets1, PLELog)

A package conflict seems to occur

Dear Dr. Yang,

Many thanks for your code. However, I met an issue when running from hdbscan import HDBSCAN as dbscan. The error message is

ImportError: numpy.core.multiarray failed to import

I think it may stem from a conflict between NumPy and hdbscan. Could you please provide more details about your environment? A requirements.txt listing all packages and their versions would be very helpful.

Thanks!

Error running pipeline.py

Hi,

I am facing trouble running pipeline.py; it shows the error logs below:

python pipeline.py --dataset HDFS
/home/dsi/shezhang/.local/lib/python3.8/site-packages/sklearn/decomposition/_fastica.py:116: ConvergenceWarning: FastICA did not converge. Consider increasing tolerance or the maximum number of iterations.
warnings.warn(
pipeline.py:194: DeprecationWarning: np.float is a deprecated alias for the builtin float. To silence this warning, use float by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.float64 here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
predicts = estimator.fit_predict(np.asarray(trainReprs, dtype=np.float)).tolist()
Traceback (most recent call last):
File "/home/dsi/shezhang/.local/lib/python3.8/site-packages/joblib/parallel.py", line 822, in dispatch_one_batch
tasks = self._ready_batches.get(block=False)
File "/home/dsi/shezhang/Download/Python-3.8.3/Lib/queue.py", line 167, in get
raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "pipeline.py", line 373, in
result = main_process(save_path=save_path, pre_train=pre_train, pre_dev=pre_dev, pre_test=pre_test, ratios=ratios,
File "pipeline.py", line 248, in main_process
train, dev, test, precision, recall, f, num_of_neg1, num_outlier0 = PULearn(pre_train, pre_dev, pre_test,
File "pipeline.py", line 194, in PULearn
predicts = estimator.fit_predict(np.asarray(trainReprs, dtype=np.float)).tolist()
File "/home/dsi/shezhang/.local/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 941, in fit_predict
self.fit(X)
File "/home/dsi/shezhang/.local/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 919, in fit
self.min_spanning_tree) = hdbscan(X, **kwargs)
File "/home/dsi/shezhang/.local/lib/python3.8/site-packages/hdbscan/hdbscan
.py", line 610, in hdbscan
(single_linkage_tree, result_min_span_tree) = memory.cache(
File "/home/dsi/shezhang/.local/lib/python3.8/site-packages/joblib/memory.py", line 349, in call
return self.func(*args, **kwargs)
File "/home/dsi/shezhang/.local/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 275, in _hdbscan_boruvka_kdtree
alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric,
File "hdbscan/_hdbscan_boruvka.pyx", line 375, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.init
File "hdbscan/_hdbscan_boruvka.pyx", line 411, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
File "/home/dsi/shezhang/.local/lib/python3.8/site-packages/joblib/parallel.py", line 1043, in call
if self.dispatch_one_batch(iterator):
File "/home/dsi/shezhang/.local/lib/python3.8/site-packages/joblib/parallel.py", line 833, in dispatch_one_batch
islice = list(itertools.islice(iterator, big_batch_size))
File "hdbscan/_hdbscan_boruvka.pyx", line 412, in genexpr
TypeError: delayed() got an unexpected keyword argument 'check_pickle'

Dependency conflict

Hello, Dr. Yang.
When I installed from the requirements.txt file, pip reported a dependency conflict. I checked the closed issues and nobody had encountered this problem before, so I followed the prompts and installed numpy==1.22.3, after which the installation went normally. What I wonder is: was the version written wrongly in the file, or did something change afterwards?

ERROR: Cannot install -r requirements.txt (line 1), -r requirements.txt (line 5), -r requirements.txt (line 6), -r requirements.txt (line 9) and numpy==1.21.2 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested numpy==1.21.2
bottleneck 1.3.2 depends on numpy
fasttext 0.9.2 depends on numpy
hdbscan 0.8.27 depends on numpy>=1.16
mkl-fft 1.3.1 depends on numpy<1.23.0 and >=1.22.3

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

Error when training

When I was training, the following error occurred, and I don't know how to correct it.

Traceback (most recent call last):
File "/Users/heyingying/PycharmProjects/PLELog/approaches/PLELog.py", line 255, in
loss = plelog.forward(tinst.inputs, tinst.targets)
File "/Users/heyingying/PycharmProjects/PLELog/approaches/PLELog.py", line 56, in forward
tag_logits = self.model(inputs)
File "/Users/heyingying/opt/anaconda3/envs/pytorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/heyingying/PycharmProjects/PLELog/models/gru.py", line 71, in forward
represents = hiddens * sent_probs
RuntimeError: The size of tensor a (200) must match the size of tensor b (100) at non-singleton dimension 2
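For context (an illustrative note, not the repository's code): a bidirectional GRU concatenates forward and backward hidden states, so its output feature size is 2 * hidden_size, which matches the 200-vs-100 mismatch above:

import torch
import torch.nn as nn

# The output feature dim of a bidirectional GRU is 2 * hidden_size (200), not 100.
gru = nn.GRU(input_size=300, hidden_size=100, batch_first=True, bidirectional=True)
out, _ = gru(torch.randn(4, 50, 300))  # (batch, seq_len, input_size)
print(out.shape)  # torch.Size([4, 50, 200])

Any tensor multiplied element-wise with out must therefore also have feature size 200 when bidirectional=True.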

How to get templates in parse_by_Official() function of class HDFSLoader

Hi Dr. Yang,

Can you guide me on how to get templates like these? Thank you so much.

class HDFSLoader(BasicDataLoader):

    ...

    def parse_by_Official(self):
        self._restore()
        templates = [
            "Adding an already existing block (.*)",
            "(.*)Verification succeeded for (.*)",
            "(.*) Served block (.*) to (.*)",
            "(.*):Got exception while serving (.*) to (.*):(.*)",
            "Receiving block (.*) src: (.*) dest: (.*)",
            "Received block (.*) src: (.*) dest: (.*) of size ([-]?[0-9]+)",
            "writeBlock (.*) received exception (.*)",
            "PacketResponder ([-]?[0-9]+) for block (.*) Interrupted\.",
            "Received block (.*) of size ([-]?[0-9]+) from (.*)",
            "PacketResponder (.*) ([-]?[0-9]+) Exception (.*)",
            "PacketResponder ([-]?[0-9]+) for block (.*) terminating",
            "(.*):Exception writing block (.*) to mirror (.*)(.*)",
            "Receiving empty packet for block (.*)",
            "Exception in receiveBlock for block (.*) (.*)",
            "Changing block file offset of block (.*) from ([-]?[0-9]+) to ([-]?[0-9]+) meta file offset to ([-]?[0-9]+)",
            "(.*):Transmitted block (.*) to (.*)",
            "(.*):Failed to transfer (.*) to (.*) got (.*)",
            "(.*) Starting thread to transfer block (.*) to (.*)",
            "Reopen Block (.*)",
            "Unexpected error trying to delete block (.*)\. BlockInfo not found in volumeMap\.",
            "Deleting block (.*) file (.*)",
            "BLOCK\* NameSystem\.allocateBlock: (.*)\. (.*)",
            "BLOCK\* NameSystem\.delete: (.*) is added to invalidSet of (.*)",
            "BLOCK\* Removing block (.*) from neededReplications as it does not belong to any file\.",
            "BLOCK\* ask (.*) to replicate (.*) to (.*)",
            "BLOCK\* NameSystem\.addStoredBlock: blockMap updated: (.*) is added to (.*) size ([-]?[0-9]+)",
            "BLOCK\* NameSystem\.addStoredBlock: Redundant addStoredBlock request received for (.*) on (.*) size ([-]?[0-9]+)",
            "BLOCK\* NameSystem\.addStoredBlock: addStoredBlock request received for (.*) on (.*) size ([-]?[0-9]+) But it does not belong to any file\.",
            "PendingReplicationMonitor timed out block (.*)"
        ]

Question about Preparation Step 4

Step 4: Run approaches/PLELog.py (make sure it has proper parameters).
(screenshot)
I got this error in the fourth step; can you help me take a look?

Confusion about the probability labeling in Common.py

Hello, thanks for your sharing. It really helps me learn about this field.
I am a little confused about the probabilistic labeling part in Common.py. Specifically, the following two lines:

tinst.tags[b, vocab.tag2id(inst.predicted)] = 1 - confidence
tinst.tags[b, 1 - vocab.tag2id(inst.predicted)] = confidence

In the original paper, you said: P(anomalous) = 1 - score/2, P(normal) = score/2.
confidence in the code seems to be score/2 in the above equation. My question is: how do you distinguish the different tags?
inst.predicted can be either anomalous or normal, but the calculation of the probabilistic label is the same.

I would appreciate it very much if you could resolve my confusion.
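For what it's worth, assuming the two tag ids are 0 and 1 (an assumption; vocab is not shown in the issue), the two quoted lines assign the same pair of values regardless of which tag was predicted; only the index they land on flips:

# Toy illustration of the two quoted lines; vocab.tag2id is replaced by a plain id.
confidence = 0.3  # e.g. score / 2 in the paper's notation
for predicted_id in (0, 1):                # whichever tag was predicted
    tags = [0.0, 0.0]
    tags[predicted_id] = 1 - confidence    # predicted tag gets 1 - confidence
    tags[1 - predicted_id] = confidence    # the other tag gets confidence
    print(predicted_id, tags)
# 0 [0.7, 0.3]
# 1 [0.3, 0.7]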

PLELog.py running problem (numpy version)

Hello, Dr. Yang. I mentioned the problem of library conflicts in issue#19, and you removed the required numpy version. I tested it today and found that it leads to an endless cycle.
numpy==1.21.2 is compatible with mkl_fft==1.3.1 and mkl_random==1.2.2, provided that numpy, mkl_fft and mkl_random are not installed from requirements.txt. In my test, I installed numpy==1.21.2 first and then installed mkl_fft and mkl_random manually, which solved this problem.
When I thought everything was OK, I found that PLELog still could not run; the program gave me an error: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject.
I checked, and this problem arises because the numpy version is too low, but when you upgrade numpy you get another error: RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf.
I found the answer to this question in other GitHub code: lower the version of numpy; in tensorflow/tensorflow#57106 the numpy version was reduced from 1.22.4 to 1.21.
In this situation, I don't know how to run the code, and I hope you can take the time to look at this problem. Thanks a lot.

(1) The problem that occurs when I change the numpy version from 1.21.2 to 1.22.4:

Traceback (most recent call last):
File "F:\code\PLELog\approaches\PLELog.py", line 9, in <module>
from preprocessing.AutoLabeling import Probabilistic_Labeling
File "F:\code\PLELog\preprocessing\AutoLabeling.py", line 3, in <module>
from models.clustering import Solitary_HDBSCAN
File "F:\code\PLELog\models\clustering.py", line 7, in <module>
from hdbscan import HDBSCAN as dbscan
File "E:\app\anaconda\envs\PLElog_py\lib\site-packages\hdbscan\__init__.py", line 1, in <module>
from .hdbscan_ import HDBSCAN, hdbscan
File "E:\app\anaconda\envs\PLElog_py\lib\site-packages\hdbscan\hdbscan_.py", line 21, in <module>
from ._hdbscan_linkage import (single_linkage,
File "hdbscan/_hdbscan_linkage.pyx", line 1, in init hdbscan._hdbscan_linkage
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

(2) The problem that occurs when I change the numpy version from 1.22.4 to 1.21.2:

2023-05-18 23:45:45,332 - Statistics_Template_Encoder - SESSION_fbd73c53bc62b07b79da9e657917a461 - INFO: Construct Statistics Template Encoder success, current working directory: F:\code\PLELog\approaches, logs will be written in F:\code\PLELog\logs
2023-05-18 23:45:45,354 - StatisticsRepresentation. - SESSION_fbd73c53bc62b07b79da9e657917a461 - INFO: Construct StatisticsLogger success, current working directory: F:\code\PLELog\approaches, logs will be written in F:\code\PLELog\logs
Traceback (most recent call last):
File "__init__.pxd", line 882, in numpy.import_array
RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf. Check the section C-API incompatibility at the Troubleshooting ImportError section at https://numpy.org/devdocs/user/troubleshooting-importerror.html#c-api-incompatibility for indications on how to solve this problem.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "F:\code\PLELog\approaches\PLELog.py", line 9, in <module>
from preprocessing.AutoLabeling import Probabilistic_Labeling
File "F:\code\PLELog\preprocessing\AutoLabeling.py", line 3, in <module>
from models.clustering import Solitary_HDBSCAN
File "F:\code\PLELog\models\clustering.py", line 7, in <module>
from hdbscan import HDBSCAN as dbscan
File "E:\app\anaconda\envs\PLElog_py\lib\site-packages\hdbscan\__init__.py", line 1, in <module>
from .hdbscan_ import HDBSCAN, hdbscan
File "E:\app\anaconda\envs\PLElog_py\lib\site-packages\hdbscan\hdbscan_.py", line 21, in <module>
from ._hdbscan_linkage import (single_linkage,
File "hdbscan/_hdbscan_linkage.pyx", line 1, in init hdbscan._hdbscan_linkage
File "hdbscan/dist_metrics.pyx", line 13, in init hdbscan.dist_metrics
File "__init__.pxd", line 884, in numpy.import_array
ImportError: numpy.core.multiarray failed to import

Process finished with exit code 1

How to improve accuracy?

Hi Dr. @YangLin-George,

PLELog is applied to reduce the cost of labeling data. But suppose that after using PLELog for a long time, I have more data for which I know whether it is normal or abnormal, and there are logs that PLELog detects wrongly. How can I improve it?

Thank you so much.

Why use FastICA for dimensionality reduction?

Hi Dr. @YangLin-George,

Why did you use FastICA for dimensionality reduction rather than PCA? In practice, I see that PCA is more popular for dimensionality reduction, while FastICA is used for separating superimposed signals.

Thank you.
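For readers weighing the two choices: both are drop-in transformers in scikit-learn, so comparing them empirically is straightforward (a sketch with made-up shapes, not the repository's code):

import numpy as np
from sklearn.decomposition import FastICA, PCA

X = np.random.rand(500, 300)  # e.g. semantic sequence vectors (made-up shape)
reduced_ica = FastICA(n_components=20, random_state=0).fit_transform(X)
reduced_pca = PCA(n_components=20, random_state=0).fit_transform(X)
print(reduced_ica.shape, reduced_pca.shape)  # (500, 20) (500, 20)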

How to select hyperparameters

Hi Dr. @YangLin-George,

Sorry for bothering you again.

In the paper, you tested PLELog with different configurations (the number of ICA components, the number of GRU layers, and the size of the GRU layers). I also tried them with min cluster size = 100 and min samples = 100, and I also observed stable results, as in the paper.

But I think min cluster size, min samples, window size, and window step are also very important hyperparameters.
Note: the window step I mentioned is line 562 in BGLLoader.py:
i += self.window_size
Here, window step = window size.

I tried the BGL dataset. For min cluster size and min samples, I tried 10, 50, 100, and 200, and the results of PLELog are not stable. Especially when I set the window size smaller and used window step = 10, the results differ significantly between sets of hyperparameters (some are very good, and some are very bad). I noticed that the clustering performance of HDBSCAN is stable across different min cluster size, min samples, window size, and window step values.

Note: in fact, I would like to try window step = 1, because I think in a real system we would like to detect anomalies after receiving each new log message.

Do you have any ideas on how to select the hyperparameters I mentioned?

Thank you so much
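For reference (an illustrative sketch, not the repository's code), the relationship between window size and window step in sequence cutting: step == size gives the non-overlapping windows used in BGLLoader, while step == 1 yields one window per new log message:

def cut_into_windows(events, window_size, window_step):
    # Generic sliding-window cutter over a list of log events.
    windows, i = [], 0
    while i + window_size <= len(events):
        windows.append(events[i:i + window_size])
        i += window_step  # step == size -> non-overlapping (as in BGLLoader)
    return windows

print(len(cut_into_windows(list(range(100)), 10, 10)))  # 10 non-overlapping windows
print(len(cut_into_windows(list(range(100)), 10, 1)))   # 91 overlapping windows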
