target / matrixprofile-ts

A Python library for detecting patterns and anomalies in massive datasets using the Matrix Profile

Home Page: https://opensource.target.com

License: Apache License 2.0

Python 100.00%
python time-series data-science matrix-profile python3 motif-discovery motif pip pip3 pypi-packages

matrixprofile-ts's Introduction

matrixprofile-ts

matrixprofile-ts is a Python 2 and 3 library for evaluating time series data using the Matrix Profile algorithms developed by the Keogh and Mueen research groups at UC-Riverside and the University of New Mexico. Current implementations include MASS, STMP, STAMP, STAMPI, STOMP, SCRIMP++, and FLUSS.

Read the Target blog post here.

Further academic description can be found here.

The PyPI page for matrixprofile-ts is here.

Installation

Major releases of matrixprofile-ts are available on the Python Package Index:

pip install matrixprofile-ts

Details about each release can be found here.

Quick start

>>> from matrixprofile import *
>>> import numpy as np
>>> a = np.array([0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0])
>>> matrixProfile.stomp(a,4)
(array([0., 0., 0., 0., 0., 0., 0., 0., 0.]), array([4., 5., 6., 7., 0., 1., 2., 3., 0.]))
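
For intuition about what the two returned arrays contain, here is a minimal brute-force sketch. It is not the library's implementation. It assumes z-normalized Euclidean distance and an exclusion zone of m // 2 around each subsequence to avoid trivial self-matches; the function name naive_matrix_profile and the exclusion default are illustrative assumptions.

```python
import numpy as np

def naive_matrix_profile(ts, m, exclusion=None):
    """Brute-force matrix profile: nearest-neighbor distance and index per subsequence."""
    ts = np.asarray(ts, dtype=float)
    n = len(ts) - m + 1
    if exclusion is None:
        exclusion = m // 2  # assumed exclusion zone, not taken from the library
    # Stack all subsequences of length m and z-normalize each one.
    subs = np.array([ts[i:i + m] for i in range(n)])
    sigma = subs.std(axis=1, keepdims=True)
    if np.any(sigma == 0):
        raise ValueError("constant subsequence: z-normalization is undefined")
    z = (subs - subs.mean(axis=1, keepdims=True)) / sigma
    profile = np.full(n, np.inf)
    index = np.full(n, -1)
    for i in range(n):
        d = np.sqrt(np.sum((z - z[i]) ** 2, axis=1))
        # Ignore matches that overlap the query itself.
        d[max(0, i - exclusion):min(n, i + exclusion + 1)] = np.inf
        j = int(np.argmin(d))
        profile[i], index[i] = d[j], j
    return profile, index

a = np.array([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
profile, index = naive_matrix_profile(a, 4)
```

On the quick-start array the profile is all zeros, because every length-4 subsequence recurs elsewhere in the signal. The cost is quadratic in the series length, so this is only suitable for small inputs.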

Note that SCRIMP++ is highly recommended for calculating the Matrix Profile due to its speed and anytime ability.

Examples

Jupyter notebooks containing various examples of how to use matrixprofile-ts can be found under docs/examples.

As a basic introduction, we can take a synthetic signal and use STOMP to calculate the corresponding Matrix Profile (this is the same synthetic signal as in the Golang Matrix Profile library). Code for this example can be found here

[Figure: synthetic signal and its Matrix Profile computed with STOMP]

There are several items of note:

  • The Matrix Profile value jumps at each phase change. High Matrix Profile values are associated with "discords": time series behavior that hasn't been observed before.

  • Repeated patterns in the data (or "motifs") lead to low Matrix Profile values.

We can introduce an anomaly to the end of the time series and use STAMPI to detect it

[Figure: signal with an appended anomaly and its Matrix Profile updated with STAMPI]

The Matrix Profile has spiked in value, highlighting the (potential) presence of a new behavior. Note that Matrix Profile anomaly detection capabilities will depend on the nature of the data, as well as the selected subquery length parameter. As with any algorithm, it's important to try out different parameter values.

Algorithm Comparison

This section shows the matrix profile algorithms and the time it takes to compute them. It also discusses use cases on when to use one versus another. The timing comparison is based on the synthetic sample data set to show run time speed.

For a more comprehensive runtime comparison, please review the notebook docs/examples/Algorithm Comparison.ipynb.

All time comparisons were run on a 4-core 2.8 GHz processor with 16 GB of memory. The operating system used was Ubuntu 18.04 LTS, 64-bit.

Algorithm | Time to Complete | Description
STAMP | 310 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) | STAMP is an anytime algorithm that lets you sample the data set to get an approximate solution. Our implementation provides you with the option to specify the sampling size in percent format.
STOMP | 79.8 ms ± 473 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) | STOMP computes an exact solution in a very efficient manner. When you have a historic time series that you would like to examine, STOMP is typically the quickest at giving an exact solution.
SCRIMP++ | 59 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) | SCRIMP++ merges the concepts of STAMP and STOMP together to provide an anytime algorithm that enables "interactive analysis speed". Essentially, it provides an exact or approximate solution in a very timely manner. Our implementation allows you to specify the max number of seconds you are willing to wait for a solution to obtain an approximate solution. If you are wanting the exact solution, it is able to provide that as well. The original authors of this algorithm suggest that SCRIMP++ can be used in all use cases.

Matrix Profile in Other Languages

Contact

Citations

  1. Chin-Chia Michael Yeh, Yan Zhu, Liudmila Ulanova, Nurjahan Begum, Yifei Ding, Hoang Anh Dau, Diego Furtado Silva, Abdullah Mueen, Eamonn Keogh (2016). Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords and Shapelets. IEEE ICDM 2016

  2. Matrix Profile II: Exploiting a Novel Algorithm and GPUs to break the one Hundred Million Barrier for Time Series Motifs and Joins. Yan Zhu, Zachary Zimmerman, Nader Shakibay Senobari, Chin-Chia Michael Yeh, Gareth Funning, Abdullah Mueen, Philip Brisk and Eamonn Keogh (2016). IEEE ICDM 2016

  3. Matrix Profile V: A Generic Technique to Incorporate Domain Knowledge into Motif Discovery. Hoang Anh Dau and Eamonn Keogh. KDD'17, Halifax, Canada.

  4. Matrix Profile XI: SCRIMP++: Time Series Motif Discovery at Interactive Speed. Yan Zhu, Chin-Chia Michael Yeh, Zachary Zimmerman, Kaveh Kamgar and Eamonn Keogh, ICDM 2018.

  5. Matrix Profile VIII: Domain Agnostic Online Semantic Segmentation at Superhuman Performance Levels. Shaghayegh Gharghabi, Yifei Ding, Chin-Chia Michael Yeh, Kaveh Kamgar, Liudmila Ulanova, and Eamonn Keogh. ICDM 2017.

matrixprofile-ts's People

Contributors

aouyang1, frankiecancino, mbarkhau, mpieters93, nikita-smyrnov, ofer-idan, peterdhansen, tylerwmarrs, vanbenschoten

matrixprofile-ts's Issues

Handling constant subsequence

On some occasions, a time series subsequence can be flat/constant for a while, which leads to a zero standard deviation for that subsequence. However, the current code does not seem to handle this case yet. Is there a way to fix this?

O(n²) Memory Requirements for _stamp_parallel()

The function _stamp_parallel() computes the distance-profiles via a pool of workers.
Each worker returns a list of distance-profiles.

The reduction of the distance-matrix occurs after all workers return their lists-of-distance-profiles.
The memory requirements of all distance-profiles are O( (sampling_rate*n) * n )

reduce(map(mass_distance_profile_parallel(indices)))

def _stamp_parallel(tsA, m, tsB=None, sampling=0.2, n_threads=-1, random_state=None):

def mass_distance_profile_parallel(indices, tsA=None, tsB=None, m=None):

To reduce the memory requirements, each worker must reduce the list-of-distances to an intermediate matrix-profile.
Afterwards the pool-spawner must reduce the intermediate matrix-profiles to the final matrix-profile.

reduce(map(reduce(mass_distance_profile_parallel(indices))))
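
The two-level reduction proposed above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: znorm_distance_profile is a brute-force stand-in for MASS, the exclusion zone of m // 2 is assumed, and partial_profile and merge_profiles are hypothetical names. Each worker keeps only one running minimum of length n - m + 1 instead of a list of distance profiles.

```python
import numpy as np

def znorm_distance_profile(ts, m, i):
    """Brute-force stand-in for MASS: distances from subsequence i to all subsequences."""
    n = len(ts) - m + 1
    windows = np.array([ts[j:j + m] for j in range(n)])
    z = (windows - windows.mean(axis=1, keepdims=True)) / windows.std(axis=1, keepdims=True)
    d = np.sqrt(np.sum((z - z[i]) ** 2, axis=1))
    ex = m // 2  # assumed exclusion zone around the query
    d[max(0, i - ex):i + ex + 1] = np.inf
    return d

def partial_profile(ts, m, indices):
    """Worker-side reduction: fold each distance profile into a running minimum."""
    n = len(ts) - m + 1
    best = np.full(n, np.inf)
    best_idx = np.full(n, -1)
    for i in indices:
        d = znorm_distance_profile(ts, m, i)
        better = d < best
        best[better] = d[better]
        best_idx[better] = i
    return best, best_idx

def merge_profiles(partials):
    """Spawner-side reduction: element-wise minimum over the workers' partial profiles."""
    best, best_idx = partials[0][0].copy(), partials[0][1].copy()
    for dist, idx in partials[1:]:
        better = dist < best
        best[better] = dist[better]
        best_idx[better] = idx[better]
    return best, best_idx
```

With a worker pool, partial_profile would be the function mapped over chunks of indices and merge_profiles would run once on the returned pairs, so peak memory per worker stays linear in the series length.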

Holistic Algorithm Comparison

In order to provide a better understanding of the algorithm runtime, it would be beneficial to add a plot showing the runtimes with varying m and n. This is similar to how many of the papers present runtime comparisons using a perfplot.

Right now a bug in SCRIMP++ is blocking this from completion - #55

SCRIMP++ Invalid Value With np.sqrt

In some cases the distance value is negative. It can be resolved by taking the absolute value and then the square root instead of the square root first.

Generally this gives a warning from numpy. See attached.
[image: screenshot of the numpy warning]

Python 2 compatibility

We've seen interest in making the library compatible with Python 2. We will explore possible avenues, but if you have any thoughts on the subject please reach out.

Code Style Guidelines

There seems to be a lot of inconsistency in coding style. For example, the time series variable is named tsA instead of the snake-cased ts_a. Python conventions are not being met within the code. This is fairly low priority, but something to consider, especially if many people contribute. I am not saying that this code base should follow Python standards - just a standard.

Strange behaviour when testing with constant values

Hi, thank you for sharing the Matrix Profile code. It is very helpful.

I have time series sample data with a frequency of one point per hour, and it works perfectly with a period of 24, as you can see in the image below.

Figure_3

the code used was this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matrixprofile import matrixProfile

df_Sample = pd.read_csv('Sample.csv', sep=',')
df_Sample = df_Sample.drop(columns=['Index'])
new_df= df_Sample[:1156]
a = new_df.values.squeeze()
m = 24
profile = matrixProfile.stomp(a, m)
# The profile has len(a) - m + 1 entries, so pad with NaN to match the series length
new_df['profile'] = np.append(profile[0], np.zeros(m - 1) + np.nan)
new_df['profile_index'] = np.append(profile[1], np.zeros(m - 1) + np.nan)
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(15, 10))
new_df['Value'].plot(ax=ax1, style='.', title='Sample')
new_df['profile'].plot(ax=ax2, c='r', title='Matrix Profile')
plt.show()

But when I add additional time series data (that is constant) this happens:
Figure_1

The code is the same but instead of "new_df= df_Sample[:1156]" I defined "new_df= df_Sample".

I was expecting a different result. It seems that the constant values are destroying all the previous analyses. Is this a bug?

Thanks in advance,

David

Sample.zip

Question: Finding Shared Motifs for Clustering

Thank you for creating matrixprofile-ts!

I'm trying to cluster a set of time series while using a set of "shared motifs" as "features." Specifically, I have a set of same length time series, where each time series represents a different observation of the same phenomenon. I don't know in advance what type of motifs or behavior can be seen as correct.

Therefore, I'd like to i) extract a set of k-top motifs that are "shared" across the different samples ii) use these motifs to pull out "features" from the underlying time series and iii) perform clustering (i was thinking k-means) on the set.

I was wondering if there was an intuitive way of thinking about this problem. I can find the motifs for a given time series independently by first computing the matrixProfile for each time series, and then finding the motifs (stomp & motif.motifs). However, I'm not sure how to get the common set of motifs to then do clustering. I think having a common set is important b/c it mitigates the effects of having a "bad observation."

Clustering over the matrix profile isn't particularly useful and I'm not sure if clustering over the individual "motifs" is as well.

I'd appreciate any insight into the matter! I'm very new to time series analysis and the matrixProfile work. Thank you!

(Aside: I believe the discord invite link has expired.)

SCRIMP++ Matrix Profile Indices Incorrect

The current implementation of SCRIMP++ incorrectly computes the matrix profile indices:

    import numpy as np
    from matrixprofile import matrixProfile as mp

    def artificial_time_series(size):
         piece_length = size // 3
         piece1 = np.sin(np.arange(0, piece_length))
         piece2 = 3 + np.sin(np.arange(0, piece_length))
         piece3 = np.sin(np.arange(0, piece_length + size % 3))
         timeseries = np.hstack((piece1, piece2, piece3))[:size] + np.random.randn(size) * 0.2
         return timeseries

    np.random.seed(0)
    timeseries = artificial_time_series(1000)
    subseq_size = 10
    mp_indices = mp.scrimp_plus_plus(timeseries, subseq_size)[1]
    assert np.all(mp_indices < len(timeseries) - subseq_size + 1)

This raises an AssertionError.

If we replace SCRIMP++ by STOMP, we see that the AssertionError vanishes, as it should.

    mp_indices = mp.stomp(timeseries, subseq_size)[1]
    assert np.all(mp_indices < len(timeseries) - subseq_size + 1)

Return actual distances from MASS instead of squared distances

In mass() and massStomp() in utils.py, the quotient in the calculation of the squared distance can go slightly above 1, leading to a negative difference (see line 177 and line 200).

Is there any objection to just wrapping the calculation of the squared distance in np.clip(., 0.0, None) and then taking the square root of that?

This issue is addressed in distanceProfile.py by allowing complex values: line 66, line 118, line 126. scrimp.py seems to implement its own version of MASS, and takes the absolute value before taking the square root (see #63): line 71, line 162, line 206, line 257.

SCRIMP++ aside, it seems like it might be cleaner to just clip negative values to 0 directly in mass() and massStomp().
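
A minimal sketch of the clipping proposal, assuming the squared z-normalized distance is computed from a sliding dot product as 2*m*(1 - corr). The function name znorm_distance and its argument names are illustrative and are not the signatures used in utils.py.

```python
import numpy as np

def znorm_distance(dot, m, q_mean, q_std, t_mean, t_std):
    """z-normalized Euclidean distance from a dot product, clipped before the square root."""
    # Pearson correlation between the query and the subsequence(s).
    corr = (dot - m * q_mean * t_mean) / (m * q_std * t_std)
    squared = 2.0 * m * (1.0 - corr)
    # Rounding can push corr slightly above 1, making `squared` slightly negative.
    return np.sqrt(np.clip(squared, 0.0, None))
```

Clipping treats a tiny negative value as an exact match (distance 0), whereas taking the absolute value first would report a tiny positive distance. Either choice avoids the NaN.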

[ rawdata.csv in Matrix_Profile_Tutorial.ipynb ] is missing

2nd Cell of [ matrixprofile-ts/docs/examples/Matrix_Profile_Tutorial.ipynb ]
rawdata.csv is missing

Overview
This notebook demonstrates how to calculate and update the Matrix Profile for a sample dataset 
(this is the same example signal used in https://github.com/aouyang1/go-matrixprofile)

Could you provide more detailed steps for generating it, or add rawdata.csv to the repository?

It is unreasonable that I have to spend time cloning, building, and running go-matrixprofile just to obtain the rawdata.csv file.

I think matrixprofile-ts is independent of go-matrixprofile at the code level, so there is no requirement to use the same input data as go-matrixprofile. That is up to users; using the same data is only your suggestion.
Because of that suggestion, the 2nd cell of [ matrixprofile-ts/docs/examples/Matrix_Profile_Tutorial.ipynb ] raises a FileNotFoundError.
Do you agree that this is an unreasonable state of affairs?


I am uploading rawdata.csv as rawdata.txt.
(GitHub doesn't support csv attachments. Download it and then read it as rawdata.txt, not rawdata.csv.)

Dealing with missing values

I tried to use the matrix profile to analyse data with missing values, but unfortunately I get an empty graph. Is it possible to analyse data with missing values with this implementation? The paper states that the matrix profile should still yield some analysis even with missing data.

Readme is a bit misleading

Hi,
Very interesting concept thanks for coding it and sharing it.
In the readme, you mention:

We can introduce an anomaly to the end of the time series and use STAMPI to detect it
And then you conclude
The Matrix Profile has spiked in value, highlighting the (potential) presence of a new behavior

I am a bit puzzled by this. Yes, the matrix profile has spiked, but so did the data. I do not see in this example the additional value of STAMPI.
Overall I am a bit confused about how to interpret the data, especially if you'd like to do it automatically for anomaly detection.
Here is a picture of the example with some questions:
[image: annotated screenshot of the README example, 2019-01-11]
For the first one, the number seems pretty high, which means that the z-normalized Euclidean distance is high (correct me if I am wrong). However, I could argue that there was just a changepoint and this is the normal "new behavior" (i.e. there was an earthquake at the beginning of the data and then it went off).
Following this logic, I would interpret the black square as outliers rather than "normal", but there the MP has values close to 0, which would mean (if I understand correctly) that it should not be seen as outliers.
Finally, at the end I am unsure why I see an initial spike, or the data going back to 0.
I am sure this comes from my lack of understanding of the MP, but it might be nice to add a more detailed description of this chart, as well as a function or a heuristic in the readme to automatically detect outliers using MP.
Best,

Complexity Measure confuses mean and sum

The complexity measure for the annotation vector is defined as:
sqrt(sum(diff(subsequence).^2))

CID (Batista 2013)
https://doi.org/10.1007/s10618-013-0312-3
Matrix Profile V (Dau 2017)
http://dx.doi.org/10.1145/3097983.3097993

However, the implementation uses the mean instead of the sum.

def make_complexity_AV(ts, m):

A possible fix is to use a moving sum.
The moving sum is already contained in movmeanstd

def movmeanstd(ts,m):

See MASS by Mueen:
https://www.cs.unm.edu/~mueen/findNN.html

Fun Fact:
mean(x) = sum(x) / len(x)
In the sliding window setting, len(x) is a constant integer.
The min-max normalization of the Annotation Vector removes the constant factor.
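
The fun fact above can be checked numerically. The sketch below assumes a plain sliding window and min-max scaling to [0, 1]; the names complexity and min_max are illustrative and do not refer to the library's make_complexity_AV.

```python
import numpy as np

def complexity(ts, m, reduce=np.sum):
    """Complexity estimate per window: sqrt(reduce(diff(window)**2))."""
    ts = np.asarray(ts, dtype=float)
    n = len(ts) - m + 1
    return np.array([np.sqrt(reduce(np.diff(ts[i:i + m]) ** 2)) for i in range(n)])

def min_max(x):
    """Scale values to [0, 1]; assumes x is not constant."""
    return (x - x.min()) / (x.max() - x.min())

rng = np.random.RandomState(0)
ts = rng.randn(40)
m = 8
c_sum = complexity(ts, m, np.sum)    # definition from the cited papers
c_mean = complexity(ts, m, np.mean)  # variant that uses the mean instead
```

The two raw vectors differ by the constant factor sqrt(m - 1), since each window has m - 1 differences, and the min-max normalized vectors coincide.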

Debugging Implementation (probably contains off-by-one error):

def CE(x: np.ndarray) -> float:
    assert x.ndim == 1, x.shape
    _ce = np.sqrt(np.sum(np.ediff1d(x)**2))
    assert _ce >= 0, (_ce, x)
    return _ce

@peterdhansen

Duplicate license files: remove License.md?

Looks like there are both a LICENSE as well as a License.md file in the repo. The LICENSE file is the authoritative full-length Apache license, while the License.md is the short-form version, suitable for top-of-file comment header, for example, but not a substitute for the full license.

It would be good to remove License.md since it's the short-form version, and it's currently keeping the repo from getting the automatic "Apache 2.0" license badge on the overview page courtesy of GitHub's automatic license detector, due to the presence of 2 license files.

Thanks!

Implement Annotation Vector

Adding the AV (Matrix Profile V) will significantly increase usability. This'll probably be the next feature we add.

Stomp calculates wrong MP vectors for two time series comparison

Reproduction:

import numpy as np
from matrixprofile import *

a = np.random.rand(500)
b = np.random.rand(500)
mp_a_1 = matrixProfile.stomp(a,10,a)[0]
mp_a_2 = matrixProfile.stomp(a,10)[0]
mp_a_b = matrixProfile.stomp(a,10,b)[0]

assert np.max(np.abs(mp_a_b[0])) > 0, 'stomp returns 0-filled vectors when tsB != tsA'
assert (mp_a_1[0] == mp_a_2[0]).all(), 'stomp returns different vectors when tsB = tsA and when tsB = None'

Allow user to set random seed for sampling algorithms

In the random sampling algorithms we do not allow users to set a seed. This makes the results different every time the algorithm runs. In some cases that may be great, however it does not allow for reproducible results.
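
One way this could look, as a sketch only: the helper name sample_indices is hypothetical, and the random_state parameter name is borrowed from the _stamp_parallel signature quoted elsewhere on this page. Passing None keeps the current non-deterministic behavior.

```python
import numpy as np

def sample_indices(n, sampling=0.2, random_state=None):
    """Pick a reproducible random subset of subsequence indices."""
    # RandomState(None) seeds itself from the OS, so results vary between runs.
    rng = np.random.RandomState(random_state)
    k = max(1, int(np.ceil(sampling * n)))
    return rng.permutation(n)[:k]

first = sample_indices(100, sampling=0.25, random_state=42)
second = sample_indices(100, sampling=0.25, random_state=42)
```

Using a local RandomState instead of np.random.seed avoids changing the global random state of the caller.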

matrixProfile.stomp() gives nan and inf values

The following code below gives nan and inf values; am I using this incorrectly?

seconds = np.arange(30)
traffic_light = np.array([0]*15 + [1]*5 + [2]*10)
brake_0 = np.array([0]*15 + [1, 2, 3, 4, 5] + [8, 10, 10, 10, 8, 6, 4, 2, 0, 0])

matrixProfile.stomp(traffic_light, 3)
matrixProfile.stomp(brake_0, 3)
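
A likely cause, assuming the library z-normalizes every subsequence, is that many windows in these series are constant and therefore have zero standard deviation. The sketch below only diagnoses that condition; constant_windows is a hypothetical helper, not part of matrixprofile-ts.

```python
import numpy as np

def constant_windows(ts, m):
    """Boolean mask marking the length-m windows whose values are all equal."""
    ts = np.asarray(ts, dtype=float)
    n = len(ts) - m + 1
    return np.array([ts[i:i + m].max() == ts[i:i + m].min() for i in range(n)])

traffic_light = np.array([0] * 15 + [1] * 5 + [2] * 10)
brake_0 = np.array([0] * 15 + [1, 2, 3, 4, 5] + [8, 10, 10, 10, 8, 6, 4, 2, 0, 0])

flat_traffic = constant_windows(traffic_light, 3)
flat_brake = constant_windows(brake_0, 3)
```

For traffic_light, 24 of the 28 windows of length 3 are constant, so dividing by the window standard deviation yields nan or inf for most of the profile.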

Inconsistent Bug with SCRIMP++

In some cases SCRIMP++ will error out when computing the dot product. The failure appears to be inconsistent, as it completes fine in some cases but not others. The window size makes the difference. I believe it has to do with the logic that computes the "idx_nn" or "end_idx". See the attached screenshots for more information.

Here is the code to reproduce the issue:

from matrixprofile import *
import numpy as np

matrixProfile.scrimp_plus_plus(np.random.uniform(size=2**10), 2**5)

It does not error out every time, just in some cases.

[images: two screenshots of the error]

MASS yields different results than brute force search

I observed different results when calculating the distance profile using the brute force search algorithm and the function matrixprofile.MASS.distance_profile(). Here is a sample of my code:

# calculate by brute force

query_ = (query - query.mean()) / query.std(ddof=0)
len_m = query.shape[0]
dist = []
for index in range(0, serie.shape[0] - len_m + 1):
    # z-normalize each subsequence before comparing it with the query
    sub = serie[index:index + len_m]
    sub = (sub - sub.mean()) / sub.std(ddof=0)
    dist.append(np.sqrt(np.sum(np.power(sub - query_, 2))))
dist = np.array(dist)

This is less of an issue and more me trying to validate the work of a colleague who has already implemented a version of matrix profile. Given it has already been implemented by this group I thought it would be best to open up dialogue. Thanks!
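
For cross-checking, the following sketch compares an FFT-based distance profile in the style of MASS against a brute-force loop. It is written independently of the library: the function names are illustrative, and it assumes population standard deviation (ddof=0) and no constant windows.

```python
import numpy as np

def sliding_dot_product(query, ts):
    """Dot product of the query with every window of ts, via FFT convolution."""
    m, n = len(query), len(ts)
    size = n + m - 1  # length of the full linear convolution
    prod = np.fft.rfft(ts, size) * np.fft.rfft(query[::-1], size)
    return np.fft.irfft(prod, size)[m - 1:n]

def distance_profile_fft(query, ts):
    query, ts = np.asarray(query, dtype=float), np.asarray(ts, dtype=float)
    m = len(query)
    windows = np.array([ts[i:i + m] for i in range(len(ts) - m + 1)])
    mu, sigma = windows.mean(axis=1), windows.std(axis=1)
    dot = sliding_dot_product(query, ts)
    squared = 2.0 * m * (1.0 - (dot - m * mu * query.mean()) / (m * sigma * query.std()))
    return np.sqrt(np.maximum(squared, 0.0))  # guard against tiny negative values

def distance_profile_brute(query, ts):
    query, ts = np.asarray(query, dtype=float), np.asarray(ts, dtype=float)
    m = len(query)
    zq = (query - query.mean()) / query.std()
    out = []
    for i in range(len(ts) - m + 1):
        sub = ts[i:i + m]
        out.append(np.linalg.norm((sub - sub.mean()) / sub.std() - zq))
    return np.array(out)
```

If the two profiles agree here but the library's output differs, common things to compare are whether squared or unsquared distances are returned and which ddof is used for the standard deviation.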

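Problem running the code

Hello, I downloaded your code, but it does not run on my local machine and many problems occur. I hope you can help.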

Exclusion Zone

I wanted to reopen 78, as it is left open ended. See comments there.

SCRIMP++ Sub-query Support

Currently, the implementation of SCRIMP++ only supports self-similarity search. It would be ideal to support sub-query searching.

Add Discord discovery

The initial implementation should be rather basic. This should go through and pick the highest value in the matrix profile and log that index. An exclusion zone will be applied around this index on the matrix profile, such that subsequent peaks can be found.
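
The procedure described above can be sketched as follows. This is an illustration of the idea, not the eventual implementation; the name top_k_discords and the handling of non-finite entries are assumptions.

```python
import numpy as np

def top_k_discords(profile, k, ex_zone):
    """Return up to k indices of the largest profile values, separated by an exclusion zone."""
    mp = np.array(profile, dtype=float)  # work on a copy
    mp[~np.isfinite(mp)] = -np.inf       # never select nan/inf entries
    found = []
    for _ in range(k):
        idx = int(np.argmax(mp))
        if mp[idx] == -np.inf:
            break  # nothing left outside the exclusion zones
        found.append(idx)
        # Mask the neighborhood so the next peak comes from elsewhere.
        mp[max(0, idx - ex_zone):min(len(mp), idx + ex_zone + 1)] = -np.inf
    return found

discords = top_k_discords([0, 1, 5, 4, 0, 0, 3, 0, 2, 0], k=3, ex_zone=1)
```

In the toy profile the value 4 at index 3 is skipped because it falls inside the exclusion zone of the first discord at index 2.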

Top K Motif

Hello,
how can someone use your code to find top k motifs? Please enlighten. Thanks in advance!

Write minimal documentation

Great to see a python implementation of MP, I've always wanted to play around in Python but never got to it.

I would appreciate to include some minimal function description, maybe in form of docstrings e.g.

def stomp(tsA,m,tsB=None):
    """
    :param tsA: ...
    :param m: length of...
    """

At this moment it's not really clear to me what each parameter does without re-watching the introduction video or inferring this from some paper.

Incorrect Annotation Vector implementation

Per @peterdhansen :

According to Slide 81 an annotation vector is applied as:

CMP_i = MP_i + (1-AV_i)*max(MP)

The current implementation is CMP_i = MP_i * AV_i

(note that the annotation vector still works in the current code, just not as effectively).
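
The corrected formula can be written directly with numpy. The function name apply_annotation_vector is illustrative, and the check that AV values lie in [0, 1] is an added assumption.

```python
import numpy as np

def apply_annotation_vector(mp, av):
    """Corrected matrix profile: CMP_i = MP_i + (1 - AV_i) * max(MP)."""
    mp = np.asarray(mp, dtype=float)
    av = np.asarray(av, dtype=float)
    if mp.shape != av.shape:
        raise ValueError("mp and av must have the same shape")
    if np.any((av < 0) | (av > 1)):
        raise ValueError("annotation vector values must lie in [0, 1]")
    return mp + (1.0 - av) * np.max(mp)

cmp_values = apply_annotation_vector([1.0, 4.0, 2.0], [1.0, 0.0, 0.5])
```

Positions with AV = 1 keep their original value, while positions with AV = 0 are raised by max(MP), which makes them unlikely to be picked as motifs.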

Bug from PR #41

The logic for PR #41 causes a bug in stomp when trying to perform a sub-query search. When tsA and tsB are not the same length, numpy automatically returns a boolean and the ".all()" method fails. Using np.array_equal(tsA, tsB) resolves the issue.

image
image
image
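
A small sketch of the suggested check. The helper name is_self_join is hypothetical; np.array_equal is the numpy function named in the issue, and it returns False for arrays of different shapes instead of attempting an element-wise comparison.

```python
import numpy as np

def is_self_join(tsA, tsB=None):
    """True when tsB is absent or holds exactly the same values as tsA."""
    return tsB is None or bool(np.array_equal(tsA, tsB))

series = np.array([1.0, 2.0, 3.0, 4.0])
query = np.array([2.0, 3.0])
```

With this check, is_self_join(series, query) is simply False, so a sub-query search with a shorter tsB no longer trips over the comparison.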

Definition and explanation of parameters

Can anyone provide definitions of the parameters used? The questions are:

(1) How does one determine the samples to be excluded?

Computes the top k motifs from a matrix profile
Parameters
----------
ts: time series to used to calculate mp
mp: tuple, (matrix profile numpy array, matrix profile indices)
max_motifs: the maximum number of motifs to discover
ex_zone: the number of samples to exclude and set to Inf on either side of a found motifs
defaults to m/2
Returns tuple (motifs, distances)
motifs: a list of lists of indexes representing the motif starting locations.
distances: list of minimum distances for each motif
"""

Stomp increment is missing

Hello guys,
It is not an issue as much as it is a suggestion of a new important feature.

I noticed that your R package contains the source code for stomp increment (the online version), so I am curious to know when you are going to implement this missing algorithm in this Python package.

Regards,

Support for multivariate time series

Hi,
I was looking for motif discovery on time series and found this library. It seems very good, but my data has several variables; it's a multivariate time series.
This library doesn't support that, right?
Would it be easy to change the code to support it, or is it just not viable?

Thanks,
MArcelo

Tutorial code running into issue: 'unicode' object is not callable

Using exactly the Tutorial code under Python 2.7 in a miniconda environment, I got an error when running the following section. Is this just me? Can anyone help, please?
Calculate the Matrix Profile

m = 32
mp = matrixProfile.stomp(pattern,m)

TypeError                                 Traceback (most recent call last)
<ipython-input-3-d3196b066bd3> in <module>()
      1 m = 32
----> 2 mp = matrixProfile.stomp(pattern,m)

/Users/dev/miniconda2/envs/dsf/lib/python2.7/site-packages/matrixprofile/matrixProfile.pyc in stomp(tsA, m, tsB)
    270     tsB: Time series to compare the query against. Note that, if no value is provided, tsB = tsA by default.
    271     """
--> 272     return _matrixProfile_stomp(tsA,m,order.linearOrder,distanceProfile.STOMPDistanceProfile,tsB)
    273 
    274 

/Users/dev/miniconda2/envs/dsf/lib/python2.7/site-packages/matrixprofile/matrixProfile.pyc in _matrixProfile_stomp(tsA, m, orderClass, distanceProfileFunction, tsB)
    166 
    167         #Need to pass in the previous sliding dot product for subsequent distance profile calculations
--> 168         (distanceProfile,querySegmentsID),dot_prev = distanceProfileFunction(tsA,idx,m,tsB,dot_first,dp,mean,std)
    169 
    170         if idx == 0:

/Users/dev/miniconda2/envs/dsf/lib/python2.7/site-packages/matrixprofile/distanceProfile.pyc in STOMPDistanceProfile(tsA, idx, m, tsB, dot_first, dp, mean, std)
    116     #Calculate the first distance profile via MASS
    117     if idx == 0:
--> 118         distanceProfile = np.real(np.sqrt(mass(query,tsB).astype(complex)))
    119 
    120         #Currently re-calculating the dot product separately as opposed to updating all of the mass function...

/Users/dev/miniconda2/envs/dsf/lib/python2.7/site-packages/matrixprofile/utils.pyc in mass(query, ts)
    172     q_std = np.std(query)
    173     mean, std = movmeanstd(ts,m)
--> 174     dot = slidingDotProduct(query,ts)
    175 
    176     #res = np.sqrt(2*m*(1-(dot-m*mean*q_mean)/(m*std*q_std)))

/Users/dev/miniconda2/envs/dsf/lib/python2.7/site-packages/matrixprofile/utils.pyc in slidingDotProduct(query, ts)
    122 
    123 
--> 124     query = np.pad(query,(0,n-m+ts_add-q_add),'constant')
    125 
    126     #Determine trim length for dot product. Note that zero-padding of the query has no effect on array length, which is solely determined by the longest vector

/Users/dev/miniconda2/envs/dsf/lib/python2.7/site-packages/numpy/lib/arraypad.pyc in pad(array, pad_width, mode, **kwargs)
   1383                                 pad_width[iaxis],
   1384                                 iaxis,
-> 1385                                 kwargs)
   1386         return newmat
   1387 

/Users/dev/miniconda2/envs/dsf/lib/python2.7/site-packages/numpy/lib/shape_base.pyc in apply_along_axis(func1d, axis, arr, *args, **kwargs)
     89     outshape = asarray(arr.shape).take(indlist)
     90     i.put(indlist, ind)
---> 91     res = func1d(arr[tuple(i.tolist())], *args, **kwargs)
     92     #  if res is a number, then we have a smaller output array
     93     if isscalar(res):

TypeError: 'unicode' object is not callable

Feature - Top K Motifs

As of now, we only have the top K discords algorithm implemented. There is a need to have the top k motifs algorithm implemented as well per our discord channel.

Refactor SCRIMP++

The SCRIMP++ module is a little difficult to follow. It should be refactored once we establish some code style guidelines. #26

Readme Section 'Algorithm Comparison' is misleading

Problem 1:

The runtime comparison for short timeseries is useless and misleading.
For short timeseries, the startup cost and initialization dominate the runtime.
STOMP reduces the runtime complexity over STAMP, not the startup speed.

Solution:

Replace table with plot, made by https://github.com/nschloe/perfplot
Vary the time-series length from 10^1 to 10^5. The larger the better.

Bonus:

Include brute-force euclidean distance computation as baseline.
scipy.spatial.distance.euclidean()

Problem2:

Circular self-reference.

SCRIMP++ merges the concepts of STAMP and SCRIMP++

motifs.motifs throws exceptions on larger arrays

I have a timeseries (pattern) of length 44,640 (1 month) and a segment (m) length of 60

Exception thrown on this array but not on subset with 10080 points (1 week)

mp = matrixProfile.stomp(pattern,m)

I am getting the following error:

OverflowError Traceback (most recent call last)
in
52 ax.legend()
53
---> 54 mtfs ,motif_d = motifs.motifs(pattern, mp, max_motifs=10)
55 print('top motifs: \n',mtfs)
56 print('top distances: \n',motif_d)

~/.local/lib/python3.6/site-packages/matrixprofile/motifs.py in motifs(ts, mp, max_motifs, radius, n_neighbors, ex_zone)
57 motif_set = set()
58 initial_motif = [min_idx]
---> 59 pair_idx = int(mp[1][min_idx])
60 if mp_current[pair_idx] != np.inf:
61 initial_motif += [pair_idx]

OverflowError: cannot convert float infinity to integer

When I run the same on 10080 points it runs fine and is very fast.

The same data runs fine with MASS2, MASS3 and Stumped over much larger time series > 1.6M rows

Attached in a ZIP file with sample data
MP Debug 1.zip

Library restructuring

Would you be willing to restructure this module so that all algorithms have their own modules? I feel that many of these algorithms make adjustments to the core algorithm "MASS" for optimization. If I wanted to commit MASS2 by itself for subquery searching, it would cause conflicts with other functions in the distanceProfile module.

I feel that a restructure of having a module for stomp, scrimp and stamp could clear things up. The core functions to call stomp, scrimp or stamp could be accessible through an import in the matrixProfile module. See the current scrimp module for an example.

Speed claims

In your blog post you mention:

Astonishingly, we can process 20 years’ worth of data, sampled every five minutes, in less than 20 seconds. This speed is critical, as we expect that our data streams will expand in size by several orders of magnitude.

I would expect 20 years of data points to be around 2M (12 * 24 * 365 * 20), or around 400K if you only measured workdays/hours. However, on my laptop it already takes 20 seconds to process 16K data points.

Did you use special libraries or hardware to achieve your timings?

How to handle NaN and INF values in utils.py movmeanstd and movstd

Hi,

I'm currently experimenting with matrix profiles, but I have some numerical issues, e.g. when using STOMP. matrixProfile.stomp calls _matrixProfile_stomp, which then calls movmeanstd in my case.
In this method, however, negative values can occur, which are then passed to np.sqrt, resulting in NaN values in the resulting standard deviation.

A similar issue was also observed here: ensozos/Matrix-Profile#17

What would you suggest as a fix? My first thought was to pass absolute values or to change the NaNs to zero afterwards, but that might be too pragmatic.
Thank you,
codax
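
One possible approach, sketched under the assumption that the moving variance is computed from cumulative sums as E[x^2] - E[x]^2: rounding can make that difference slightly negative on flat stretches, so it can be clipped to zero before the square root. moving_mean_std is an illustrative name, not the library's movmeanstd.

```python
import numpy as np

def moving_mean_std(ts, m):
    """Mean and population std of every length-m window, safe on flat segments."""
    ts = np.asarray(ts, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(ts)))
    csum2 = np.concatenate(([0.0], np.cumsum(ts ** 2)))
    mean = (csum[m:] - csum[:-m]) / m
    var = (csum2[m:] - csum2[:-m]) / m - mean ** 2
    # Rounding can leave var slightly below zero; a true variance never is.
    var = np.clip(var, 0.0, None)
    return mean, np.sqrt(var)
```

Clipping only removes negative rounding noise. Windows that end up with a standard deviation of zero still need separate handling before any division by the standard deviation.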

ValueError: Length of values does not match length of index

In:

mp = matrixProfile.stomp(pattern,m)
mtfs ,motif_d = motifs.motifs(pattern, mp, max_motifs=10)

self._set_item(key, value)
value = self._sanitize_column(key, value)
value = sanitize_index(value, self.index, copy=False)

Any idea how to solve this?
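
One plausible cause, assuming the error is raised when a matrix profile is assigned to a DataFrame column: the profile has len(ts) - m + 1 entries, which is shorter than the series. An earlier issue on this page pads the profile with m - 1 NaN values; a sketch of that step, with the hypothetical helper name pad_profile:

```python
import numpy as np

def pad_profile(profile, m):
    """Append m - 1 NaN values so the profile matches the length of the time series."""
    profile = np.asarray(profile, dtype=float)
    return np.append(profile, np.full(m - 1, np.nan))

padded = pad_profile(np.zeros(9), 4)  # a length-12 series with m = 4 gives 9 profile values
```

The padded array can then be assigned to a column of a DataFrame that has one row per time step.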
