target / matrixprofile-ts
A Python library for detecting patterns and anomalies in massive datasets using the Matrix Profile
Home Page: https://opensource.target.com
License: Apache License 2.0
In order to provide a better understanding of the algorithm runtime, it would be beneficial to add a plot showing the runtimes with varying m and n. This is similar to how many of the papers present runtime comparisons using a perfplot.
Right now a bug in SCRIMP++ is blocking this from completion - #55
Any plans to support multi-dimensionality?
Great library. Thanks!
The logic for PR #41 causes a bug in stomp when trying to perform a sub-query search. When tsA and tsB are not the same length, numpy automatically returns a boolean and the ".all()" method fails. Using np.array_equal(tsA, tsB) resolves the issue.
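A minimal sketch of the comparison suggested above. The helper name same_series is hypothetical and not part of the library; it only illustrates that np.array_equal returns a single boolean even when the two series differ in length, so no ".all()" call is needed.

```python
import numpy as np


def same_series(ts_a, ts_b):
    """Return True only when both arrays have the same shape and values."""
    # np.array_equal always returns one bool, even when the lengths differ.
    return np.array_equal(ts_a, ts_b)


ts_a = np.arange(10.0)
self_join = same_series(ts_a, ts_a.copy())   # same length, same values
sub_query = same_series(ts_a, ts_a[:5])      # shorter second series
print(self_join, sub_query)                  # True False
```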
In mass() and massStomp() in utils.py, the quotient in the calculation of the squared distance can go slightly above 1, leading to a negative difference (see line 177 and line 200).
Is there any objection to just wrapping the calculation of the squared distance in np.clip(., 0.0, None) and then taking the square root of that?
This issue is addressed in distanceProfile.py by allowing complex values: line 66, line 118, line 126. scrimp.py seems to implement its own version of MASS, and takes the absolute value before taking the square root (see #63): line 71, line 162, line 206, line 257.
SCRIMP++ aside, it seems like it might be cleaner to just clip negative values to 0 directly in mass() and massStomp().
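The clipping idea could look roughly like the sketch below. It uses the distance formula that appears as a comment in utils.py (2*m*(1-(dot-m*mean*q_mean)/(m*std*q_std))), but the function name clipped_distance and its argument list are assumptions made for illustration, not the library's actual signature.

```python
import numpy as np


def clipped_distance(m, dot, mean, std, q_mean, q_std):
    """Hypothetical helper: z-normalized distance with negatives clipped to 0."""
    quotient = (dot - m * mean * q_mean) / (m * std * q_std)
    squared = 2 * m * (1 - quotient)
    # Rounding can push the quotient slightly above 1, making `squared`
    # a tiny negative number; clip it before taking the square root.
    return np.sqrt(np.clip(squared, 0.0, None))


# The first dot product is a hair too large, so its quotient exceeds 1.
dot = np.array([4.0000000001, 0.0])
dist = clipped_distance(m=4, dot=dot, mean=0.0, std=1.0, q_mean=0.0, q_std=1.0)
print(dist)
```

Without the clip, the first entry would be the square root of a negative number and come back as NaN.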
On some occasions, a time series subsequence can be flat/constant for a while, which may lead to a zero standard deviation for that subsequence. However, the current code does not seem to handle this case yet. Is there a way to fix this?
Thank you for creating matrixprofile-ts!
I'm trying to cluster a set of time series while using a set of "shared motifs" as "features." Specifically, I have a set of same length time series, where each time series represents a different observation of the same phenomenon. I don't know in advance what type of motifs or behavior can be seen as correct.
Therefore, I'd like to i) extract a set of top-k motifs that are "shared" across the different samples, ii) use these motifs to pull out "features" from the underlying time series, and iii) perform clustering (I was thinking k-means) on the set.
I was wondering if there was an intuitive way of thinking about this problem. I can find the motifs for a given time series independently by first computing the matrixProfile for each time series, and then finding the motifs (stomp & motif.motifs). However, I'm not sure how to get the common set of motifs to then do clustering. I think having a common set is important because it mitigates the effects of having a "bad observation."
Clustering over the matrix profile isn't particularly useful and I'm not sure if clustering over the individual "motifs" is as well.
I'd appreciate any insight into the matter! I'm very new to time series analysis and the matrixProfile work. Thank you!
(Aside: I believe the discord invite link has expired.)
Can anyone provide definitions of the parameters used? The questions are:
(1) How does one determine the samples to be excluded?
"""
Computes the top k motifs from a matrix profile

Parameters
----------
ts: time series used to calculate mp
mp: tuple, (matrix profile numpy array, matrix profile indices)
max_motifs: the maximum number of motifs to discover
ex_zone: the number of samples to exclude and set to Inf on either side of a found motif
    defaults to m/2

Returns tuple (motifs, distances)
motifs: a list of lists of indexes representing the motif starting locations.
distances: list of minimum distances for each motif
"""
Hello,
how can someone use your code to find top k motifs? Please enlighten. Thanks in advance!
Currently, the implementation of SCRIMP++ only supports self-similarity search. It would be ideal to support sub-query searching.
In some cases SCRIMP++ will error out when computing the dot product. The failure appears to be inconsistent, as the call completes fine in some cases but not others. The window size makes the difference. I believe it has to do with the logic that computes the "idx_nn" or "end_idx". See the attached screenshots for more information.
Here is the code to reproduce the issue:
from matrixprofile import *
import numpy as np
matrixProfile.scrimp_plus_plus(np.random.uniform(size=2**10), 2**5)
It does not error out every time, just in some cases.
In your blog post you mention:
Astonishingly, we can process 20 years’ worth of data, sampled every five minutes, in less than 20 seconds. This speed is critical, as we expect that our data streams will expand in size by several orders of magnitude.
I would expect 20 years of data points to be around 2M (12 * 24 * 365 * 20), or around 400K if you only measured workdays/hours. However, on my laptop it already takes 20 seconds to process 16K data points.
Did you use special libraries or hardware to achieve your timings?
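For reference, the estimate above can be worked out explicitly. The full-coverage figure follows directly from the numbers in the question; the working-hours figure assumes 8 hours a day, 5 days a week, 52 weeks a year, which is only one possible reading of "workdays/hours".

```python
SAMPLES_PER_HOUR = 60 // 5   # one sample every five minutes
YEARS = 20

# Around-the-clock sampling, as in 12 * 24 * 365 * 20.
full_coverage = SAMPLES_PER_HOUR * 24 * 365 * YEARS

# Assumed working schedule: 8 h/day, 5 days/week, 52 weeks/year.
working_hours = SAMPLES_PER_HOUR * 8 * 5 * 52 * YEARS

print(full_coverage)   # 2102400
print(working_hours)   # 499200
```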
Add implementation for MPDist. It should be fairly easy with all of the SCRIMP++ code that was implemented.
In the random sampling algorithms we do not allow users to set a seed. This makes the results different every time the algorithm runs. In some cases that may be great, however it does not allow for reproducible results.
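One possible shape for this, sketched with a hypothetical helper: sample_indices and its random_state parameter are assumptions for illustration and do not exist in the library. The idea is simply to route all randomness through a generator that the caller can seed.

```python
import numpy as np


def sample_indices(n, sampling_rate, random_state=None):
    """Hypothetical helper: pick a random subset of the indices 0..n-1."""
    # random_state=None keeps the current behaviour (a fresh seed per call);
    # passing an integer makes the sampled order reproducible.
    rng = np.random.RandomState(random_state)
    k = max(1, int(n * sampling_rate))
    return rng.permutation(n)[:k]


first = sample_indices(1000, 0.2, random_state=42)
second = sample_indices(1000, 0.2, random_state=42)
print(np.array_equal(first, second))   # True
```

Defaulting to None would leave existing callers unaffected, while anyone who needs repeatable runs can pass a seed.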
Per @peterdhansen :
According to Slide 81 an annotation vector is applied as:
CMP_i = MP_i + (1-AV_i)*max(MP)
The current implementation is CMP_i = MP_i * AV_i
(note that the annotation vector still works in the current code, just not as effectively).
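The formula from the slide can be sketched as follows. The function name corrected_matrix_profile is hypothetical; the sketch assumes the annotation vector has the same length as the matrix profile and takes values in [0, 1].

```python
import numpy as np


def corrected_matrix_profile(mp, av):
    """CMP_i = MP_i + (1 - AV_i) * max(MP), as given on the slide."""
    mp = np.asarray(mp, dtype=float)
    av = np.asarray(av, dtype=float)
    return mp + (1.0 - av) * np.max(mp)


mp = np.array([1.0, 2.0, 4.0])
av = np.array([1.0, 0.5, 0.0])
cmp_values = corrected_matrix_profile(mp, av)
print(cmp_values)   # [1. 4. 8.]
```

Where AV_i is 1 the profile is unchanged, and where AV_i is 0 it is raised by max(MP), so suppressed regions can no longer hold the lowest value.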
2nd Cell of [ matrixprofile-ts/docs/examples/Matrix_Profile_Tutorial.ipynb ]
rawdata.csv is missing
Overview
This notebook demonstrates how to calculate and update the Matrix Profile for a sample dataset
(this is the same example signal used in https://github.com/aouyang1/go-matrixprofile)
Could you provide more detailed steps for generating rawdata.csv, or add the file to the repository?
It makes no sense that I have to spend time cloning, building, and running go-matrixprofile just to get the rawdata.csv file.
I think matrixprofile-ts is independent of go-matrixprofile at the code level.
So there is no rule that it must use the same input data as go-matrixprofile. That is up to the users.
It is just a suggestion of yours.
And because of that suggestion, the 2nd cell of [ matrixprofile-ts/docs/examples/Matrix_Profile_Tutorial.ipynb ] fails with FileNotFoundError.
Do you agree that this situation makes no sense?
I have uploaded rawdata.csv:
rawdata.txt
(GitHub doesn't support csv attachments. Download it and then read it as rawdata.txt, not rawdata.csv.)
The complexity measure for the annotation vector is defined as:
sqrt(sum(diff(subsequence).^2))
CID (Batista 2013)
https://doi.org/10.1007/s10618-013-0312-3
Matrix Profile V (Dau 2017)
http://dx.doi.org/10.1145/3097983.3097993
However, the implementation uses the mean instead of the sum.
A possible fix is to use a moving sum.
The moving sum is already contained in movmeanstd
matrixprofile-ts/matrixprofile/utils.py
Line 47 in 207aa94
See MASS by Mueen:
https://www.cs.unm.edu/~mueen/findNN.html
Fun Fact:
mean(x) = sum(x) / len(x)
In the sliding window setting, len(x) is a constant integer.
The min-max normalization of the Annotation Vector removes the constant factor.
Debugging Implementation (probably contains off-by-one error):
import numpy as np

def CE(x: np.ndarray) -> float:
    assert x.ndim == 1, x.shape
    _ce = np.sqrt(np.sum(np.ediff1d(x)**2))
    assert _ce >= 0, (_ce, x)
    return _ce
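The moving-sum fix and the "fun fact" can be illustrated together. This is a sketch under assumptions: sliding_complexity and min_max are hypothetical names, a window of m points is taken to contain m - 1 differences, and the cumulative-sum trick stands in for whatever movmeanstd does internally.

```python
import numpy as np


def sliding_complexity(ts, m, use_mean=False):
    """Complexity estimate sqrt(sum(diff(window)**2)) for every window."""
    sq_diff = np.diff(np.asarray(ts, dtype=float)) ** 2
    csum = np.concatenate(([0.0], np.cumsum(sq_diff)))
    # A window of m points contains m - 1 consecutive differences.
    window = csum[m - 1:] - csum[:-(m - 1)]
    if use_mean:
        window = window / (m - 1)   # mean instead of sum
    return np.sqrt(np.clip(window, 0.0, None))


def min_max(x):
    """Scale a vector to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())


rng = np.random.RandomState(0)
ts = rng.rand(60)
by_sum = sliding_complexity(ts, 8)
by_mean = sliding_complexity(ts, 8, use_mean=True)
print(np.allclose(min_max(by_sum), min_max(by_mean)))   # True
```

The raw values differ by the constant factor sqrt(m - 1), which min-max normalization removes.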
Hi,
I'm currently experimenting with Matrix Profiles; however, I have some numerical issues, e.g. when using STOMP. matrixProfile.stomp calls _matrixProfile_stomp, which then calls movmeanstd in my case.
In this method, however, negative values can occur, which are then passed to np.sqrt, resulting in NaN values in the standard deviation.
A similar issue was also observed here: ensozos/Matrix-Profile#17
What would you suggest as a fix? My first thought was to pass absolute values or to change the NaNs to zero afterwards, but that might be too pragmatic.
Thank you,
codax
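One alternative to taking absolute values is to clip the variance at zero before the square root. The sketch below is not the library's movmeanstd; rolling_mean_std is a hypothetical cumulative-sum formulation used only to show where small negative variances come from and how clipping removes them.

```python
import numpy as np


def rolling_mean_std(ts, m):
    """Rolling mean and population std over windows of length m."""
    ts = np.asarray(ts, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(ts)))
    csum2 = np.concatenate(([0.0], np.cumsum(ts ** 2)))
    mean = (csum[m:] - csum[:-m]) / m
    # E[x^2] - E[x]^2 can come out slightly negative through rounding,
    # especially on flat segments, so clip before the square root.
    var = np.clip((csum2[m:] - csum2[:-m]) / m - mean ** 2, 0.0, None)
    return mean, np.sqrt(var)


flat = np.full(64, 1e6 + 0.1)          # constant series with a large offset
_, flat_std = rolling_mean_std(flat, 8)
print(np.isnan(flat_std).any())         # False
```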
Per Zhu et al., we should change the exclusion zone length from m/2 to m/4.
The initial implementation should be rather basic. This should go through and pick the highest value in the matrix profile and log that index. An exclusion zone will be applied around this index on the matrix profile, such that subsequent peaks can be found.
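The procedure described above could be sketched like this. The name top_k_discords and its parameters are illustrative rather than the library's final API; the exclusion zone is passed in explicitly so that m/2 or m/4 can be chosen by the caller.

```python
import numpy as np


def top_k_discords(matrix_profile, ex_zone, k=3):
    """Return the indices of the k highest peaks, ex_zone samples apart."""
    mp = np.array(matrix_profile, dtype=float)   # work on a copy
    mp[~np.isfinite(mp)] = -np.inf               # ignore NaN/inf entries
    discords = []
    for _ in range(k):
        idx = int(np.argmax(mp))
        if mp[idx] == -np.inf:                   # nothing left to pick
            break
        discords.append(idx)
        # Mask the exclusion zone so the next peak is found elsewhere.
        start = max(0, idx - ex_zone)
        stop = min(len(mp), idx + ex_zone + 1)
        mp[start:stop] = -np.inf
    return discords


profile = [0, 1, 5, 4, 0, 0, 3, 0, 0, 2]
print(top_k_discords(profile, ex_zone=1, k=3))   # [2, 6, 9]
```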
Are you affiliated with https://github.com/TDAmeritrade/stumpy? Has anyone compared the performance of stumpy with matrixprofile-ts?
The runtime comparison for short time series is useless and misleading.
For short timeseries, the startup-cost and initialization dominates the runtime-complexity.
STOMP reduces the runtime complexity over STAMP, not the startup speed.
Replace table with plot, made by https://github.com/nschloe/perfplot
Vary the time-series length from 10^1 to 10^5. The larger the better.
Include brute-force euclidean distance computation as baseline.
scipy.spatial.distance.euclidean()
Circular self-reference.
SCRIMP++ merges the concepts of STAMP and SCRIMP++
I wanted to reopen 78, as it is left open ended. See comments there.
Hi, thank you for sharing the Matrix Profile code. It is very helpful.
I have time series sample data with a frequency of one point per hour, and it works perfectly with a period of 24, as you can see in the image below.
The code used was this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matrixprofile import matrixProfile

df_Sample = pd.read_csv('Sample.csv', sep=',')
df_Sample= df_Sample.drop(columns=['Index'])
new_df= df_Sample[:1156]
a = new_df.values.squeeze()
m=24
profile = matrixProfile.stomp(a,m)
new_df['profile'] = np.append(profile[0],np.zeros(m-1)+np.nan)
new_df['profile_index'] = np.append(profile[1], np.zeros(m - 1) + np.nan)
fig, (ax1, ax2) = plt.subplots(2,1,sharex=True,figsize=(15,10))
new_df['Value'].plot(ax=ax1, style='.', title='Sample')
new_df['profile'].plot(ax=ax2, c='r', title='Matrix Profile')
plt.show()
But when I add additional time series data (that is constant) this happens:
The code is the same but instead of "new_df= df_Sample[:1156]" I defined "new_df= df_Sample".
I was expecting a different result. It seems that the constant values are destroying all the previous analyses. Is this a bug?
Thanks in advance,
David
There seems to be a lot of inconsistency in coding style. For example, the time series variable is named tsA instead of the snake-cased ts_a. Python conventions are not being met within the code. This is fairly low priority, but something to consider, especially if many people contribute. I am not saying that this code base should follow Python standards, just a standard.
The following code below gives nan and inf values; am I using this incorrectly?
import numpy as np
from matrixprofile import matrixProfile

seconds = np.arange(30)
traffic_light = np.array([0]*15 + [1]*5 + [2]*10)
brake_0 = np.array([0]*15 + [1, 2, 3, 4, 5] + [8, 10, 10, 10, 8, 6, 4, 2, 0, 0])
matrixProfile.stomp(traffic_light, 3)
matrixProfile.stomp(brake_0, 3)
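A plausible contributor to the nan and inf values is that z-normalization divides each window by its standard deviation, which is zero for a constant window. The sketch below does not reproduce the library's internals; constant_windows is a hypothetical diagnostic that simply counts how many windows of the traffic_light series are flat.

```python
import numpy as np


def constant_windows(ts, m):
    """Boolean mask: True where the window ts[i:i+m] holds a single value."""
    ts = np.asarray(ts, dtype=float)
    n_windows = len(ts) - m + 1
    return np.array([ts[i:i + m].max() == ts[i:i + m].min()
                     for i in range(n_windows)])


traffic_light = np.array([0] * 15 + [1] * 5 + [2] * 10)
mask = constant_windows(traffic_light, 3)
print(int(mask.sum()), "of", len(mask), "windows are constant")
```

For windows flagged by the mask, the z-normalized distance is undefined, so any result there needs special handling.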
The current implementation of SCRIMP++ incorrectly computes the matrix profile indices:
import numpy as np
from matrixprofile import matrixProfile as mp
def artificial_time_series(size):
    piece_length = size // 3
    piece1 = np.sin(np.arange(0, piece_length))
    piece2 = 3 + np.sin(np.arange(0, piece_length))
    piece3 = np.sin(np.arange(0, piece_length + size % 3))
    timeseries = np.hstack((piece1, piece2, piece3))[:size] + np.random.randn(size) * 0.2
    return timeseries
np.random.seed(0)
timeseries = artificial_time_series(1000)
subseq_size = 10
mp_indices = mp.scrimp_plus_plus(timeseries, subseq_size)[1]
assert np.all(mp_indices < len(timeseries) - subseq_size + 1)
This raises an AssertionError.
If we replace SCRIMP++ by STOMP, we see that the AssertionError vanishes, as it should.
mp_indices = mp.stomp(timeseries, subseq_size)[1]
assert np.all(mp_indices < len(timeseries) - subseq_size + 1)
Adding the AV (Matrix Profile V) will significantly increase usability. This'll probably be the next feature we add.
The SCRIMP++ module is a little difficult to follow. It should be refactored once we establish some code style guidelines. #26
I have a timeseries (pattern) of length 44,640 (1 month) and a segment (m) length of 60
Exception thrown on this array but not on subset with 10080 points (1 week)
mp = matrixProfile.stomp(pattern,m)
I am getting the following error:
OverflowError Traceback (most recent call last)
in
52 ax.legend()
53
---> 54 mtfs ,motif_d = motifs.motifs(pattern, mp, max_motifs=10)
55 print('top motifs: \n',mtfs)
56 print('top distances: \n',motif_d)
~/.local/lib/python3.6/site-packages/matrixprofile/motifs.py in motifs(ts, mp, max_motifs, radius, n_neighbors, ex_zone)
57 motif_set = set()
58 initial_motif = [min_idx]
---> 59 pair_idx = int(mp[1][min_idx])
60 if mp_current[pair_idx] != np.inf:
61 initial_motif += [pair_idx]
OverflowError: cannot convert float infinity to integer
When I run the same on 10080 points it runs fine and is very fast.
The same data runs fine with MASS2, MASS3 and Stumped over much larger time series > 1.6M rows
Attached is a ZIP file with sample data
MP Debug 1.zip
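The traceback shows int() being applied to an infinite entry of the matrix profile index array. Why that entry is infinite is not established here; the sketch below only shows a defensive guard a caller could use. safe_pair_index is a hypothetical helper, not part of motifs.py.

```python
import numpy as np


def safe_pair_index(mp_indices, idx):
    """Return the neighbour index as an int, or None if it is inf/NaN."""
    value = mp_indices[idx]
    if not np.isfinite(value):
        return None          # no valid nearest neighbour recorded
    return int(value)


indices = np.array([3.0, np.inf, np.nan])
print([safe_pair_index(indices, i) for i in range(3)])   # [3, None, None]
```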
Hi,
Very interesting concept thanks for coding it and sharing it.
In the readme, you mention:
We can introduce an anomaly to the end of the time series and use STAMPI to detect it
And then you conclude
The Matrix Profile has spiked in value, highlighting the (potential) presence of a new behavior
I am a bit puzzled by this. Yes, the matrix profile has spiked, but so did the data. I do not see the additional value of STAMPI in this example.
Overall, I am a bit confused about how to interpret the data, especially if you'd like to do it automatically for anomaly detection.
Here is a picture of the example with some questions.
In the first one, the number seems pretty high, which means that the z-normalized Euclidean distance is high (correct me if I am wrong). However, I could argue that there was just a changepoint and this is the normal "new behavior" (i.e. there was an earthquake at the beginning of the data and then it went away).
Following this logic, I would interpret the black square as outliers rather than "normal" behavior, but there the MP has values close to 0, which would mean (if I understand correctly) that it should not be seen as outliers.
Finally, at the end, I am unsure why I see an initial spike or the data going back to 0.
I am sure this comes from my lack of understanding of the MP, but it might be nice to add a more detailed description of this chart, as well as a function or a heuristic in the readme to automatically detect outliers using the MP.
Best,
Great to see a python implementation of MP, I've always wanted to play around in Python but never got to it.
I would appreciate it if you could include some minimal function descriptions, maybe in the form of docstrings, e.g.
def stomp(tsA,m,tsB=None):
    """
    :param tsA: ...
    :param m: length of...
    """
At the moment it's not really clear to me what each parameter does without re-watching the introduction video or inferring this from some paper.
I observed different results when calculating the distance profile using the brute force search algorithm and the function matrixprofile.MASS.distance_profile(). Here is a sample of my code:
query_ = (query - query.mean()) / query.std(ddof=0)
len_m = query.shape[0]
dist = []
for index in range(0, serie.shape[0] - len_m + 1):
    sub = serie[index:index + len_m]
    sub = (sub - sub.mean()) / sub.std(ddof=0)
    dist.append(np.sqrt(np.sum(np.power(sub - query_, 2))))
dist = np.array(dist)
This is less of an issue and more me trying to validate the work of a colleague who has already implemented a version of matrix profile. Given it has already been implemented by this group I thought it would be best to open up dialogue. Thanks!
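One way to narrow down such a discrepancy is to check the brute-force loop against an independent FFT-based computation of the same z-normalized distance. The sketch below is not the library's distance_profile; fft_profile is a hypothetical MASS-style implementation written only for cross-checking, and both functions assume a population standard deviation (ddof=0).

```python
import numpy as np


def brute_force_profile(query, series):
    """Z-normalized Euclidean distance of query to every window of series."""
    m = len(query)
    q = (query - query.mean()) / query.std()
    out = np.empty(len(series) - m + 1)
    for i in range(len(out)):
        sub = series[i:i + m]
        out[i] = np.sqrt(np.sum(((sub - sub.mean()) / sub.std() - q) ** 2))
    return out


def fft_profile(query, series):
    """Same distances via a sliding dot product computed with the FFT."""
    n, m = len(series), len(query)
    q_rev = np.zeros(n)
    q_rev[:m] = query[::-1]                     # reversed, zero-padded query
    dot = np.fft.irfft(np.fft.rfft(series) * np.fft.rfft(q_rev), n)[m - 1:]
    csum = np.concatenate(([0.0], np.cumsum(series)))
    csum2 = np.concatenate(([0.0], np.cumsum(series ** 2)))
    mean = (csum[m:] - csum[:-m]) / m
    std = np.sqrt(np.clip((csum2[m:] - csum2[:-m]) / m - mean ** 2, 0.0, None))
    corr = (dot - m * mean * query.mean()) / (m * std * query.std())
    return np.sqrt(np.clip(2 * m * (1 - corr), 0.0, None))


rng = np.random.RandomState(0)
series, query = rng.rand(200), rng.rand(16)
print(np.allclose(brute_force_profile(query, series),
                  fft_profile(query, series), atol=1e-6))
```

If the two agree on your data but the library's output does not, the difference is more likely to lie in conventions (normalization, ddof, argument order) than in the brute-force loop.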
Hello, I downloaded your code, but it does not run on my machine and many problems come up. I would appreciate your help.
The function _stamp_parallel() computes the distance-profiles via a pool of workers.
Each worker returns a list of distance-profiles.
The reduction of the distance-matrix occurs after all workers return their lists-of-distance-profiles.
The memory requirement for all distance-profiles is O( (sampling_rate*n) * n )
reduce(map(mass_distance_profile_parallel(indices)))
To reduce the memory requirements, each worker must reduce the list-of-distances to an intermediate matrix-profile.
Afterwards the pool-spawner must reduce the intermediate matrix-profiles to the final matrix-profile.
reduce(map(reduce(mass_distance_profile_parallel(indices))))
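The two-level reduction can be sketched without an actual worker pool. The names reduce_profiles and merge_partials are hypothetical; in a real pool each worker would run reduce_profiles on its own chunk and return only one intermediate matrix profile, which keeps the returned data at O(n) per worker.

```python
from functools import reduce

import numpy as np


def reduce_profiles(indexed_profiles):
    """Worker-side step: fold (query_index, distance_profile) pairs into
    one intermediate matrix profile and its index vector."""
    mp, mpi = None, None
    for q_idx, dp in indexed_profiles:
        dp = np.asarray(dp, dtype=float)
        if mp is None:
            mp = np.full(dp.shape, np.inf)
            mpi = np.full(dp.shape, -1)
        better = dp < mp                 # element-wise minimum
        mp[better] = dp[better]
        mpi[better] = q_idx
    return mp, mpi


def merge_partials(a, b):
    """Pool-side step: merge two intermediate matrix profiles."""
    take_b = b[0] < a[0]
    return np.where(take_b, b[0], a[0]), np.where(take_b, b[1], a[1])


rng = np.random.RandomState(0)
profiles = [(i, rng.rand(20)) for i in range(6)]
chunks = [profiles[:3], profiles[3:]]            # one chunk per worker
final_mp, final_idx = reduce(merge_partials, map(reduce_profiles, chunks))
print(final_mp.shape, final_idx.shape)
```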
Hello guys,
It is not an issue as much as it is a suggestion of a new important feature.
I noticed that your R package contains the source code for incremental STOMP (the online version), so I am curious to know when you are going to implement this missing algorithm in this Python package.
Regards,
Would you be willing to restructure this module so that all algorithms have their own modules? I feel that many of these algorithms make adjustments to the core algorithm "MASS" for optimization. If I wanted to commit MASS2 by itself for subquery searching, it would cause conflicts with other functions in the distanceProfile module.
I feel that a restructure of having a module for stomp, scrimp and stamp could clear things up. The core functions to call stomp, scrimp or stamp could be accessible through an import in the matrixProfile module. See the current scrimp module for an example.
Using exactly the tutorial code under Python 2.7 in a miniconda environment, I got an error when running the following section. Is this just me? Can anyone help, please?
Calculate the Matrix Profile
m = 32
mp = matrixProfile.stomp(pattern,m)
TypeError Traceback (most recent call last)
<ipython-input-3-d3196b066bd3> in <module>()
1 m = 32
----> 2 mp = matrixProfile.stomp(pattern,m)
/Users/dev/miniconda2/envs/dsf/lib/python2.7/site-packages/matrixprofile/matrixProfile.pyc in stomp(tsA, m, tsB)
270 tsB: Time series to compare the query against. Note that, if no value is provided, tsB = tsA by default.
271 """
--> 272 return _matrixProfile_stomp(tsA,m,order.linearOrder,distanceProfile.STOMPDistanceProfile,tsB)
273
274
/Users/dev/miniconda2/envs/dsf/lib/python2.7/site-packages/matrixprofile/matrixProfile.pyc in _matrixProfile_stomp(tsA, m, orderClass, distanceProfileFunction, tsB)
166
167 #Need to pass in the previous sliding dot product for subsequent distance profile calculations
--> 168 (distanceProfile,querySegmentsID),dot_prev = distanceProfileFunction(tsA,idx,m,tsB,dot_first,dp,mean,std)
169
170 if idx == 0:
/Users/dev/miniconda2/envs/dsf/lib/python2.7/site-packages/matrixprofile/distanceProfile.pyc in STOMPDistanceProfile(tsA, idx, m, tsB, dot_first, dp, mean, std)
116 #Calculate the first distance profile via MASS
117 if idx == 0:
--> 118 distanceProfile = np.real(np.sqrt(mass(query,tsB).astype(complex)))
119
120 #Currently re-calculating the dot product separately as opposed to updating all of the mass function...
/Users/dev/miniconda2/envs/dsf/lib/python2.7/site-packages/matrixprofile/utils.pyc in mass(query, ts)
172 q_std = np.std(query)
173 mean, std = movmeanstd(ts,m)
--> 174 dot = slidingDotProduct(query,ts)
175
176 #res = np.sqrt(2*m*(1-(dot-m*mean*q_mean)/(m*std*q_std)))
/Users/dev/miniconda2/envs/dsf/lib/python2.7/site-packages/matrixprofile/utils.pyc in slidingDotProduct(query, ts)
122
123
--> 124 query = np.pad(query,(0,n-m+ts_add-q_add),'constant')
125
126 #Determine trim length for dot product. Note that zero-padding of the query has no effect on array length, which is solely determined by the longest vector
/Users/dev/miniconda2/envs/dsf/lib/python2.7/site-packages/numpy/lib/arraypad.pyc in pad(array, pad_width, mode, **kwargs)
1383 pad_width[iaxis],
1384 iaxis,
-> 1385 kwargs)
1386 return newmat
1387
/Users/dev/miniconda2/envs/dsf/lib/python2.7/site-packages/numpy/lib/shape_base.pyc in apply_along_axis(func1d, axis, arr, *args, **kwargs)
89 outshape = asarray(arr.shape).take(indlist)
90 i.put(indlist, ind)
---> 91 res = func1d(arr[tuple(i.tolist())], *args, **kwargs)
92 # if res is a number, then we have a smaller output array
93 if isscalar(res):
TypeError: 'unicode' object is not callable
We've seen interest in making the library compatible with Python 2. We will explore possible avenues, but if you have any thoughts on the subject please reach out.
Hi,
I was looking for motif discovery on time series and found this library. It seems very good, but my data has several variables; it's a multivariate time series.
This library doesn't support that, right?
Would it be easy to change the code to support it, or is it just not viable?
Thanks,
MArcelo
I tried to use the matrix profile to analyse data with missing values; unfortunately, I get an empty graph. Is it possible to analyse data with missing values with this implementation? The paper states that the matrix profile should still produce some analysis even with missing data.
As of now, we only have the top K discords algorithm implemented. There is a need to have the top k motifs algorithm implemented as well per our discord channel.
Looks like there are both a LICENSE as well as a License.md file in the repo. The LICENSE file is the authoritative full-length Apache license, while the License.md is the short-form version, suitable for a top-of-file comment header, for example, but not a substitute for the full license.
It would be good to remove License.md since it's the short-form version, and it's currently keeping the repo from getting the automatic "Apache 2.0" license badge on the overview page courtesy of GitHub's automatic license detector, due to the presence of 2 license files.
Thanks!
The CONTRIBUTING.md file refers to a GitHub Code of Conduct with a URL that currently returns a 404.
Can we switch this to the Contributor Covenant?
Hi,
I was wondering if there are any plans to implement the MPdist (https://sites.google.com/site/mpdistinfo/) algorithm?
Best regards
Ole
In:
mp = matrixProfile.stomp(pattern,m)
mtfs ,motif_d = motifs.motifs(pattern, mp, max_motifs=10)
self._set_item(key, value)
value = self._sanitize_column(key, value)
value = sanitize_index(value, self.index, copy=False)
Any idea how to solve this?
I wrote a blog post about this library and how to use it on a NAB dataset to detect discords. I can rewrite the notebook and contribute it to the examples section if you want. You can read the post here. http://tylermarrs.com/posts/anomaly-detection-matrix-profile-discords/
Reproduction:
from matrixprofile import *
import numpy as np
a = np.random.rand(500)
b = np.random.rand(500)
mp_a_1 = matrixProfile.stomp(a,10,a)[0]
mp_a_2 = matrixProfile.stomp(a,10)[0]
mp_a_b = matrixProfile.stomp(a,10,b)[0]
assert np.max(np.abs(mp_a_b[0])) > 0, 'stomp returns 0-filled vectors when tsB != tsA'
assert (mp_a_1[0] == mp_a_2[0]).all(), 'stomp returns different vectors when tsB = tsA and when tsB = None'
Apparently the annotation_vector.py file isn't present when using pip to install matrixprofile-ts. Is the version on PyPI outdated? I would like to use the annotation vector feature if possible.