
khundman / telemanom

978 stars, 38 watchers, 241 forks, 9.53 MB

A framework for using LSTMs to detect anomalies in multivariate time series data. Includes spacecraft anomaly data and experiments from the Mars Science Laboratory and SMAP missions.

Home Page: https://arxiv.org/abs/1802.04431

License: Other

Jupyter Notebook 97.47% Python 2.49% Dockerfile 0.03%
deep-learning lstm time-series keras tensorflow kdd2018 kdd rnn anomaly-detection

telemanom's People

Contributors

dependabot[bot], diogodcarvalho, haisamido, khundman, peteryschneider, tylerbamford


telemanom's Issues

Potential bug in evaluation?

I found that if the model simply predicts the whole sequence as anomalous (E_seq = [(0, 999999999)]), it gets perfect scores. Is this expected?

Help to solve errors

Hi, I am running your project and I am facing the following error. Can you help me? I would be very thankful. Thanks!

Traceback (most recent call last):
  File "example.py", line 10, in <module>
    detector.run()
  File "C:\Users\Muhammad Talmeez\telemanom\telemanom\detector.py", line 210, in run
    errors.process_batches(channel)
  File "C:\Users\Muhammad Talmeez\telemanom\telemanom\errors.py", line 141, in process_batches
    window.prune_anoms()
  File "C:\Users\Muhammad Talmeez\telemanom\telemanom\errors.py", line 418, in prune_anoms
    E_seq = np.delete(E_seq, i_to_remove, axis=0)
  File "<__array_function__ internals>", line 5, in delete
  File "C:\Users\Muhammad Talmeez\AppData\Local\Programs\Python\Python38\lib\site-packages\numpy\lib\function_base.py", line 4406, in delete
    keep[obj,] = False
IndexError: arrays used as indices must be of integer (or boolean) type
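
For reference, this IndexError can be reproduced outside telemanom whenever np.delete receives a float-typed index array; a plain np.array([]) defaults to float64, which is one plausible way prune_anoms could hit this (an assumption about the value of i_to_remove, not a confirmed diagnosis):

    import numpy as np

    E_seq = np.array([[10, 20], [35, 60]])
    i_to_remove = np.array([])      # empty arrays default to dtype float64

    try:
        np.delete(E_seq, i_to_remove, axis=0)
    except IndexError as e:
        print(e)                    # arrays used as indices must be of integer (or boolean) type

    # A defensive fix is to force an integer dtype before deleting:
    np.delete(E_seq, i_to_remove.astype(int), axis=0)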

Can't rename data .npy files to anything other than the given channel names

Hi,

I cannot name my data files anything other than the channel names that are already present. Also, can you shed some light on how the .npy files should be structured for univariate and multivariate time series data, ideally with a toy example: should we include timestamps, or just put the values one after another, with each channel in a separate file and the value as the first dimension?
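
For what it's worth, here is a minimal sketch of the layout that the dataset Q&A further down this page implies: one .npy file per channel, rows are timesteps (no timestamp column), the first column is the telemetry value, and any remaining columns are encoded covariates. The channel name and feature count below are hypothetical:

    import numpy as np

    n_steps, n_commands = 1000, 4
    telemetry = np.sin(np.linspace(0, 20, n_steps)).reshape(-1, 1)   # column 0: the value to predict
    commands = np.zeros((n_steps, n_commands))                       # e.g. one-hot command/covariate columns
    data = np.hstack([telemetry, commands]).astype(np.float32)

    np.save("data/train/MY-CHANNEL-1.npy", data)   # shape (1000, 5); univariate data would be (1000, 1)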

Normalisation of test and training data

First, many thanks for your insightful work ...

I however have an issue when loading the training and test data sets: in most cases they seem to be normalised to [-1, 1] independently, and I was wondering whether this might make the trained model inaccurate.

E.g. a density plot for channel E-4:
[figure: distplot-E-4, density plot of the E-4 channel]

Or did I miss something else?
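
A quick way to check the observation (a sketch; the paths assume the standard data/ layout and the telemetry value in column 0):

    import numpy as np

    train = np.load("data/train/E-4.npy")
    test = np.load("data/test/E-4.npy")

    # If both splits print ranges of roughly [-1, 1], they were scaled independently.
    print(train[:, 0].min(), train[:, 0].max())
    print(test[:, 0].min(), test[:, 0].max())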

Does "channel" mean the first column in the dataset?

In the last paragraph of Section 3.1, I read:

“A prediction length lp then determines the number of steps ahead to predict, where the number of dimensions d being predicted is 1 ≤ d ≤ m. Since our aim is to predict telemetry values for a single channel we consider the ...“

I think the word "channel" is overloaded. If a channel is a group of sensors, then the highlighted phrase probably means "single sensor", because predicting a channel would mean predicting all of its dimensions.
OR

Should I assume the first column is the channel and the remaining columns are command inputs or status outputs, so that "predicting the channel" means predicting the first column? I think it is the latter, but I wanted to confirm.

Sliding window dynamic thresholding behaviour

[figure: thumbnail_image001]
In this figure, the x axis is time, the y axis is the error (abs(y - yhat)), and the blue line is the threshold drawn by the dynamic threshold algorithm. Ignore the first 1000 data points, since I simply apply 6 sigma (z = 6) until enough data has accumulated.

I am applying dynamic thresholding without pruning to streaming data, sliding the window for each new data point that arrives. My window size is 1000, so the threshold is always calculated from the last 1000 errors. I have two questions, one of which is the issue.

1- Whenever a very high error (anomaly) is introduced, a peak in the threshold is observed for a single timestep and then it drops. In the end the threshold is increased, which is the expected behavior, but is it expected to see a peak that lasts only one timestep? See around timestep 1200, for example.

2- This is the reason I am posting: whenever an anomaly leaves the window, i.e. falls 1000 timesteps behind, a similar peak occurs again. See around timestep 2200, and any point 1000 timesteps after an anomaly. This is the real oddity. Is it normal? If so, what is the justification? I could not figure it out.

Having witnessed this odd behavior, what is the best practice for using dynamic thresholding? Should I simply not recalculate it at every timestep? But that would defeat the purpose.
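
For what it's worth, the second effect can be reproduced with a much simpler mean + z*sigma threshold than the paper's argmax criterion, so it seems inherent to any statistic computed over a hard sliding window rather than specific to this implementation (a simplified sketch, not the repository's code):

    import numpy as np

    rng = np.random.default_rng(0)
    errors = np.abs(rng.normal(0, 1, 3000))
    errors[1200] = 50.0                        # one injected anomalous error

    window, z = 1000, 6
    thresholds = np.full(len(errors), np.nan)
    for t in range(window, len(errors)):
        w = errors[t - window:t]
        thresholds[t] = w.mean() + z * w.std()

    # The threshold jumps at t ~ 1200 (the spike enters the window) and
    # changes sharply again at t ~ 2200 (the spike slides out of the window).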

A question regarding your publication

Hello Kyle,

I read your paper recently "Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding"
Very nice paper. I just have a question regarding your training dataset vs. test dataset, which seem to be two different ones. Is there any reason for setting it up that way? It would be great if you could clarify and help me understand.
For example:

In the train dataset:
There are two major categories (1 and -1) in E-5.

In the test dataset:
There are three categories (1, 0, -1) in E-5, and category 1 is a point anomaly.

Thank you very much and hope to hear from you soon

start and end of anomalies

Hello, thank you for sharing this project. I have some questions regarding the indices of the outliers. When an anomaly is a point, isn't it supposed to occur at a single point? So why is there a start and an end index for point anomalies in anomaly_sequences as well?

And for collective anomalies, you only consider the start and end indices as outliers. I wanted to know why you didn't consider taking all the indices between the start and the end as outliers, especially for short anomalies that represent a temporary change.
Thank you in advance.

window_size and batch_size in config.yaml

Hello @khundman,

To my understanding, window_size and batch_size together determine the number of historical error values (window_size * batch_size = h).

telemanom/config.yaml (lines 7 to 11 at 26831a0):

    # number of values to evaluate in each batch
    batch_size: 70
    # number of trailing batches to use in error calculation
    window_size: 30

I also know that values will be aggregated in windows of one minute and processed in batches of 70 minutes, as stated in your paper:

Telemetry values are aggregated into one minute windows and evaluated in batches of 70 minutes mimicking the downlink schedule for SMAP and our current system implementation.

I assume that means that one minute contains 30 values (so 1 value per 2 seconds). Is that correct?
The parameter h is then used to calculate the dynamic threshold and evaluate each batch.

Could you explain the reason for h being divided into two separate parameters? Why can't there be a single h parameter of 2100 instead of 30 * 70 (window_size * batch_size) to define each batch? Is there a way to efficiently configure these two parameters for a use case not dealing with SMAP?
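
For concreteness, this is my current understanding of how the two parameters combine (a sketch; please correct me if it is wrong):

    batch_size = 70                  # points evaluated per batch (one 70-minute downlink)
    window_size = 30                 # trailing batches kept in the error history
    h = window_size * batch_size     # 2100 historical smoothed errors

    # Each new 70-point batch is scored against a threshold computed over the
    # previous h errors, then the window slides forward by one batch.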

Thank you in advance!

Number of values count in labeled_anomalies.csv is off

There is a slight difference between the number of records in the num_values column and the actual number of records present. For instance, P-1 is shown to have 8321 records in labeled_anomalies.csv, but the data itself has 8505 records.

Correlation

How is the multivariate correlation between channels achieved? Thanks.

[Question] Changes required to run LSTM on GPU (CUDA)

I'm trying to run the LSTM on a GPU. Since it is an RNN, it does not perform much better there than on a CPU.
Can you please suggest the changes that need to be made to the Keras LSTM layers?

What I Tried:
I used keras.layers.CuDNNLSTM. There was a significant improvement in the time taken, but it was running only one LSTM network at a time with 30% GPU usage (NVIDIA Tesla K80 dual-GPU).

Question :

  1. Using multiple GPUs, will I be able to run multiple neural networks at a time? (See the sketch after these questions.)

  2. Is there a way to allocate half of the cores of a GPU to one neural network and the other half to another, and so on?
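
What I have in mind is something like the following (a TF 2.x sketch under the assumption that each model fits on a single device; the repository itself targets an older Keras API, so this may need adapting):

    import tensorflow as tf

    # Let each model grow GPU memory as needed instead of reserving it all
    for gpu in tf.config.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)

    def make_model(device):
        with tf.device(device):  # pin this model's variables/ops to one GPU
            return tf.keras.Sequential([
                tf.keras.layers.LSTM(80, input_shape=(250, 25)),
                tf.keras.layers.Dense(10),
            ])

    model_a = make_model('/GPU:0')
    model_b = make_model('/GPU:1')
    # Running one channel per process (one process per GPU) is often simpler
    # than time-sharing a single GPU between two networks.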

Asking for parameter-setting tricks

Hi Hundman,

I am trying to apply 'telemanom' to my own data. After a few experiments, I have some questions about 'telemanom'; could you give me some intuition about tuning the parameters?

  1. Does 'telemanom' fit better on seasonal streaming data (the series we are going to predict)? And should I delete the known anomalies from the training data, or denoise the data?

  2. Do I need to reset the anomaly labels when I use a different 'l_s' (num previous timesteps provided to model to predict future values)?
    I actually did this in my experiments; otherwise the results were not as expected.

  3. How can I find a set of parameters that is widely usable across multiple different time series? Would you consider the "score" derived from the unsupervised anomaly detection part?

  4. Will you add the code for supervised anomaly detection using the labels (which you mentioned in the paper) to the 'telemanom' open source code?

Thanks a lot.

Issues with run.py

Hi,

I'm having a few issues with this repo (I cloned the no-labels branch for unsupervised learning).

  • I tried to run run.py in my IDE (Spyder, downloaded via Anaconda) but I am given the error ModuleNotFoundError: No module named 'telemanom._globals', yet I can run the script in my Anaconda Prompt.

  • When I do run run.py in the prompt, only 76/85 of the channels seem to be processed. For instance, the error I get with T-11 is (see the sketch after this list):

    Chan: T-11 (76 of 85)
    Traceback (most recent call last):
      File "run.py", line 90, in <module>
        run(config, _id, logger)
      File "run.py", line 51, in run
        X_train, y_train, X_test, y_test = helpers.load_data(anom)
      File "C:\...\telemanom\telemanom\helpers.py", line 83, in load_data
        X_test, y_test = shape_data(test, train=False)
      File "C:\...\telemanom\telemanom\helpers.py", line 110, in shape_data
        data = data[:, :]
    IndexError: too many indices for array

  • Regarding the above line of code (data = data[:, :]) in helpers.py, I was wondering what its purpose is? It seems to just reassign data to itself and ensure it is two-dimensional.

  • Lastly, after executing run.py each channel seems to take on average about 45 s. This leads to an accumulated processing time of around 45 min every time I try to make a change to the script to get it working. Is this normal?
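
A minimal reproduction of the T-11 failure (an assumption about the cause: that particular channel file appears to load as a 1-D array, so two-axis slicing fails):

    import numpy as np

    test = np.arange(5)          # a 1-D array, shape (5,)
    try:
        test = test[:, :]        # IndexError: too many indices for array
    except IndexError as e:
        print(e)

    # Forcing two dimensions before slicing avoids the crash:
    test = np.atleast_2d(np.arange(5)).T     # shape (5, 1)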

Thank you in advance!

About reproducing

Thank you for sharing this wonderful work!
In fact, I have succeeded in reproducing the results, but with higher scores :) This makes me wonder why that happens. Was there a project or data update, or am I doing something wrong?

I followed the "To run with local or virtual environment" instructions in README.py without changing config.yaml since it says "to recreate the experiment from paper, leave as is". I obtain an absolute 3% more recall.

I obtain 0.87 precision and 0.83 recall, whereas in the paper, it is 87.5% precision and 80.0% recall.
[screenshot: reproduced precision/recall results]

Running example

I am trying to run the example in the README:

python example.py -l labeled_anomalies.csv

I get the error:

Traceback (most recent call last):
  File "example.py", line 1, in <module>
    from telemanom.detector import Detector
  File "/Users/nofelyaseen/mount/vincentserver/telemanom/telemanom/detector.py", line 224
    result_row = {**result_row,
                 ^
SyntaxError: invalid syntax
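
The ^ points at the ** inside a dict literal, which is PEP 448 unpacking and requires Python 3.5 or newer, so this SyntaxError is what appears when example.py is run under Python 2 (an assumption about the interpreter being used):

    # Valid on Python 3.5+ only; a Python 2 interpreter stops at the ** with
    # "SyntaxError: invalid syntax".
    result_row = {"chan_id": "P-1"}
    result_row = {**result_row, "n_preds": 10}
    print(result_row)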

Problem about the dataset dimensions

Dear Author,

As you said, the data in column (dimension) 0 is the real telemetry data. What role does the data from the other dimensions play in this project, and does it influence the results (performance)? Can I keep only the 0th dimension of real telemetry data for my training?

Dataset questions

Answers to questions received via email:

Q1. The anomaly_sequences column in labeled_anomalies.csv gives the start and end indices of true anomalies in a stream, but I don't know whether the indices begin at 0 or 1. For example, for the [[6000, 8127]] entry for channel id "D-2", does the start index "6000" refer to the 6000th row of the file (if counting begins at 1) or the 6001st row (if counting begins at 0) of "test/D-2.txt"?

The indices begin at 0.

Q2. For the anomaly_sequences and num_values columns in labeled_anomalies.csv, I found that some end indices are larger than num_values: A-8.txt, A-9.txt, D-9.txt, F-2.txt. Is there a mistake?

This was an error and has been cleaned up. The anomalies go to the end of the sequence and the end of the range should equal num_values - 1.

Q3. In both your test and train files, I found that most of the values are 0. I would like some more background on the data to understand why most of the values are 0.

The “Raw experiment data” section of the readme explains this: “Model input data also includes one-hot encoded information about commands that were sent or received by specific spacecraft modules in a given time window. No identifying information related to the timing or nature of commands is included in the data.” So you see lots of zeroes where commands weren't sent/received by a specific spacecraft module in a time window. At most timesteps, for most of the spacecraft submodules, there is no command activity. The first dimension is the prior telemetry values for that channel (the -1.000s in the example you screenshotted) and will be primarily nonzero.
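
In code, the two answers above translate to something like this (a sketch that assumes the .npy files shipped with the repo; the labeled ranges are 0-based row numbers into the test file, with an inclusive end index):

    import numpy as np

    data = np.load("data/test/D-2.npy")

    anomalous = data[6000:8128]      # rows 6000..8127; the end index is inclusive, hence the +1
    telemetry = data[:, 0]           # first column: prior telemetry values for the channel
    commands = data[:, 1:]           # remaining columns: one-hot encoded command information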

Q4. What is the time interval between the adjacent rows?

For the anomalies from the SMAP spacecraft, values are aggregated into 1 minute buckets. For MSL, the time bucket size is variable as data rates are inconsistent and no interpolation between values was performed to fill missing buckets. This is one factor in the poorer performance seen for MSL anomalies and something we will be addressing in future iterations.

Q5. I found that the anomaly for channel id P-2 is described twice, and differently (in rows 19 and 53); however, there is no description of an anomaly for channel id T-10.

P-2 is the same channel with two anomalies occurring at different points in time, which is why you see two separate entries for that channel. These are entirely separate events that happen to occur on the same channel. The full ranges of values are non-overlapping, and the fact that the anomalous sequences have overlapping indices is coincidental.

T-10 didn't have enough values to include, so it was intentionally removed from the dataset; in the interest of time we didn't rename all the remaining channels.

Unsupervised Anomaly Detection

There was a branch for unsupervised anomaly detection a while ago when I last checked. Is that branch removed, or does telemanom still support unsupervised anomaly detection?

Some problems with the test and train data

I cannot find the test and train data among these files. My graduation project will use your model, but I can't find your training set and test set. I want to use them to understand the input format of the code.
I look forward to hearing back.

Multivariate time series data - not telemetry

Hi. I am trying to use this approach on multivariate time series data to find anomalies in payment gateway attributes like timeouts, latency, etc. The paper mentions that the first dimension of each channel should be the telemetry values. Would it matter if I don't have any telemetry data to work with?

overall f0.5score calculation

Hello @khundman,

Thanks for the paper and for making the code public. I have a question whose answer I couldn't find in the repo.

Table 2 in the paper shows precision, recall and f0.5 score for the overall dataset. How are the results from multiple channels combined to give a single precision and recall for each dataset? Is the f0.5 score here calculated based on the reported precision and recall?

Am I correct in understanding that F0.5 refers to the F-beta score with beta = 0.5, meaning that more weight is given to precision? In some cases the F0.5 score is lower than both precision and recall, e.g. Non-Parametric w/ Pruning (p = 0.13), SMAP. For precision = recall = 85.5%, shouldn't the F-beta score also be 85.5% for any value of beta?
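
For reference, this is the check I am doing (the plain F-beta definition, nothing repo-specific):

    def f_beta(p, r, beta=0.5):
        return (1 + beta**2) * p * r / (beta**2 * p + r)

    print(f_beta(0.855, 0.855))   # 0.855: when precision == recall, F_beta equals both, for any beta
    print(f_beta(0.875, 0.800))   # ~0.859, which also differs from the 0.71 reported for "Total"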

Question about n_predictions

Hello,

I was wondering about the purpose of the n_predictions config value. First, in your paper you mention that you could predict one timestep ahead, or more than one. Currently, n_predictions in your config file is set to 10, and your model predicts 10 steps ahead. However, in the predict_in_batches method it looks like you ignore all of those predictions except for the first one, effectively only predicting one step ahead again:

final_y_hat.append(y_hat_t[0])  # first prediction

I was wondering if I am understanding this right or if I am missing something.

The anomaly detection stage is so long when I use my own model and dataset for train

Hi,
I am confused about whether I have misunderstood the paper and the code. I used my own model to predict the time series of the SWaT dataset. After getting the predicted time series and computing the error as the distance between the true and predicted data, I wanted to use your idea for calculating the threshold.
From reading the code, I think you calculate the threshold based only on the error fragment, which has h = window_size * batch_size elements. I integrated your detection code with my error-calculation code to get the threshold, and when I ran it I found that number_windows = 3132 in my case; finishing only ten window_e_s took about 8 hours (i from 1 to 10), so I stopped the run. I think this may be because the code calculates a different threshold for each window_e_s. Maybe I have misunderstood your code and used it incorrectly.
Thank you for reading my question, and for this amazing paper and work!

Problem with the simple example

Could anyone help with the following error in the simple example:

ValueError: Run ID 2018-05-19_15.00.10 is not valid. If loading prior models or predictions, must provide valid ID.

Thanks for your time.

Use for Univariate Time Series

Hi, before I dive into trying to use this implementation, I am wondering whether these methods can be used for univariate numerical time series data?

Should the number of smoothed errors be equal to the total number of observations? (i.e. an anomaly score for each observation)

    # for values at beginning < sequence length, just use avg
    if not channel.id == 'C-2':  # anomaly occurs early in window
        self.e_s[:self.config.l_s] = \
            [np.mean(self.e_s[:self.config.l_s * 2])] * self.config.l_s

I'm wondering whether the length of self.e_s should be equal to the total number of observations in the sequence. Every self.e_s is missing 260 observations.

The above code snippet does not add 250 observations to the front of the array, but simply overwrites the existing first 250 elements. Something like this would add them:

    self.e_s = np.insert(self.e_s, 0,
                         [np.mean(self.e_s[:self.config.l_s * 2])] * self.config.l_s)

However, there are still 10 observations missing from each self.e_s, perhaps due to the n_predictions option?
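
The arithmetic is consistent with that guess (assuming the default config values):

    l_s = 250            # input timesteps consumed before the first prediction exists
    n_predictions = 10   # prediction horizon; the final timesteps have no aligned error

    print(l_s + n_predictions)   # 260, matching the number of missing observations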

Does the data only have categorical inputs?

I have noticed that most of the input entries are zero, and I am wondering whether the attached dataset has any numerical inputs (sensor readings, etc.)? If there are, are you using them directly alongside the one-hot encodings?

Telemanom availability for notebook users

Hello,

Great tool, I must say. Still, I'm trying to make this available to multiple notebook users but failing.
How can I link this tool to a JupyterHub session?
Kindly offer some notes on the matter.

Best!

point anomalies and `error_buffer`

Thanks for your brilliant work. I am confused about a couple of things.

  • It seems that all the anomalies in the SMAP dataset occur over consecutive timesteps. I would like to know how you define point anomalies.
  • I would also like to know the purpose of the error_buffer parameter:
    i_anom = np.sort(np.concatenate((i_anom,
                                     np.array([i + buffer for i in i_anom]).flatten(),
                                     np.array([i - buffer for i in i_anom]).flatten())))

Table 2 F_Beta Scores don't add up

Looking at Table 2, there's something amiss. Under the first heading, "Non-Parametric w/ Pruning (p = 0.13)", the SMAP precision and recall scores are equal, but the F_0.5 score is 0.71. If precision and recall are equal, the F_beta score should equal them as well, no matter what beta is.

Thresholding Approach                  Precision  Recall  F0.5 score
Non-Parametric w/ Pruning (p = 0.13)
  MSL                                  92.6%      69.4%   0.69
  SMAP                                 85.5%      85.5%   0.71
  Total                                87.5%      80.0%   0.71

I calculated the F_0.5 scores for a few other rows in the table based on the precision and recall, and was unable to get the same resulting score as the paper. Am I interpreting this table correctly?

Bug in loading historic models

There is an issue with loading pretrained models when train: False is set in config.yaml.

Even after specifying train: False in config.yaml, the program retrains the model.
This seems to stem from the check in modeling.py (line 28):

    if not train and os.path.exists(os.path.join("data", "models", anom["chan_id"] + ".h5")):

This always evaluates to False because os.path.join("data", "models", anom["chan_id"] + ".h5") doesn't exist.

Reason: in modeling.py (line 61) models are saved to the path os.path.join("data", anom['run_id'], "models", anom["chan_id"] + ".h5").

Prerequisite: config.use_id should contain the id of the old model to be loaded. I'm using use_id to store the old run_id that I want to use,
e.g. use_id: "2018-09-12_15.00.10"

Correction needed in modeling.py (line 28):

    if not train and os.path.exists(os.path.join("data", config.use_id, "models", anom["chan_id"] + ".h5")):

Asking for overfitting issue

I'm trying to apply this model to my own data (minute-resolution data).

My data shows a "one day" (1440 minutes) pattern, and I trained with 21 days of training data before running the test. "l_s" is set to 1440 and "n_predictions" is set to 10. However, the y_hat graph lags the signal, simply following the immediately preceding value.

Please advise how I can solve this problem.

PS. I tried various parameter settings and added LSTM layers to the model. However, I haven't found the right model yet.

Question relating to 'num_values' and rows for each dataset

I'm confused about the relationship between the number of rows in a given data.h5 file and the number of values in the labeled_anomalies row for that file.

e.g. channel P-1 has 2872 rows, but 'num_values' for this data is 8502, 'num_train_values' is 2612, 'num_test_values' is 8245, and a TP sequence may be located, for example, in the range (4520, 4589).

Does anyone know how Telemanom converts the rows into these sequences and how that would work exactly? Where would a sequence like '(4520, 4589)' occur in the P-1.npy file?

labeled_anomalies.csv - num_values does not match data

Hi,

Thank you for the code and publication. I have a few questions; I post them separately for tracking.

In README.md, the description for labeled_anomalies.csv is supposed to be:
channel id, … num_values

When I inspect A-1.npy I get the following shape:

    d = np.load("../data/train/A-1.npy")
    d.shape, d.shape[0] * d.shape[1]

    Out: ((2880, 25), 72000)
However, the corresponding entry for A-1 in the labeled_anomalies.csv is:
A-1, SMAP, [[4690, 4774]], [point], 8640

I was wondering why num_values is 8640, which is not what I see in the data (2880 rows).

What does num_values mean, and am I reading it right? Please clarify.
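
To make the comparison concrete, this is what I am running (the test-split path is an assumption on my part):

    import numpy as np

    train = np.load("../data/train/A-1.npy")
    test = np.load("../data/test/A-1.npy")

    # 2880 train rows vs. the 8640 listed under num_values: does num_values
    # refer to the test split, the train split, or both combined?
    print(train.shape[0], test.shape[0], train.shape[0] + test.shape[0])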

Duplicate P-2 and non-existent T-10

I found that there are two P-2 entries and no T-10 in labeled_anomalies.csv. Can you publish an updated labeled_anomalies.csv?

Error during operation

Hello Kyle,

I have a question for you: an error pops up when I run python example.py -l labeled_anomalies.csv. The error occurs after a certain number of epochs; I executed the code twice and got the error after the 16th and 20th epochs respectively. The error content is as follows:

Epoch 19/35
2096/2096 [==============================] - 22s 11ms/step - loss: 0.0106 - val_loss: 1.6608e-04
Epoch 20/35
2096/2096 [==============================] - 21s 10ms/step - loss: 0.0101 - val_loss: 2.0771e-04
normalized prediction error: 0.01
Traceback (most recent call last):
  File "example.py", line 10, in <module>
    detector.run()
  File "E:\Zotero\Paper\storage\37LDVKWR\telemanom\telemanom\detector.py", line 210, in run
    errors.process_batches(channel)
  File "E:\Zotero\Paper\storage\37LDVKWR\telemanom\telemanom\errors.py", line 141, in process_batches
    window.prune_anoms()
  File "E:\Zotero\Paper\storage\37LDVKWR\telemanom\telemanom\errors.py", line 418, in prune_anoms
    E_seq = np.delete(E_seq, i_to_remove, axis=0)
  File "<__array_function__ internals>", line 6, in delete
  File "D:\Anaconda3\envs\tf2\lib\site-packages\numpy\lib\function_base.py", line 4406, in delete
    keep[obj,] = False
IndexError: arrays used as indices must be of integer (or boolean) type

Thank you very much and hope to hear from you soon.

Question about selection of the telemetry variable

First of all, thank you for your excellent paper and source code!

I notice that in your algorithm design and implementation, only one feature is selected as the telemetry variable. In the paper, this choice is explained as:

Since our aim is to predict telemetry values for a single channel we consider the situation where d = 1.

In the source code, it seems that the first feature is selected as the telemetry variable for all channels:

    y = data[:, -config.n_predictions:, 0]  # telemetry value is at position 0

So I wonder: was the same feature selected as the telemetry variable for all channels, or do the feature orders differ between channels, with the telemetry variable always placed first?

What's more, could you give some general advice on how to choose a good telemetry variable?
I would appreciate it if you could reply.
