jmotif / jmotif-r

SAX, HOT-SAX, VSM, SAX-VSM, RePair and RRA in R (Rcpp)

Home Page: http://jmotif.github.io/sax-vsm_site/

Languages: R 28.89%, C++ 70.94%, TeX 0.17%

Topics: sax-vsm, sax, r, timeseries, discretization, kdd, anomalydiscovery, discord

jmotif-r's Introduction

The R package "jmotif" provides an implementation of:

  • z-normalization of time series data
  • PAA, i.e., Piecewise Aggregate Approximation
  • SAX, i.e., Symbolic Aggregate approXimation
  • HOT-SAX, an algorithm for exact time series discord discovery
  • VSM, i.e., Vector Space Model
  • SAX-VSM, an algorithm for interpretable time series classification (and its parameter optimization)
  • RePair, an algorithm for grammatical inference
  • Rule Density Curve, an efficient technique based on grammatical compression (i.e., Kolmogorov complexity) for variable-length approximate time series anomaly discovery
  • RRA (Rare Rule Anomaly), an algorithm based on grammatical compression (i.e., Kolmogorov complexity) for variable-length exact time series anomaly discovery


Most of this functionality is also implemented in Java, and some of it in Python as well.

Citing this work:

While RRA was proposed in [8], the code was ported to R to support our newer work on SAX parameter optimization in GrammarViz 3.0; please cite it: Senin, P., Lin, J., Wang, X., Oates, T., Gandhi, S., Boedihardjo, A.P., Chen, C., Frankenstein, S., GrammarViz 3.0: Interactive Discovery of Variable-Length Time Series Patterns, ACM Trans. Knowl. Discov. Data, February 2018.

Notes:

To process a set of time series of uneven length, pad the shorter ones with NA within the input data frame (or list). The window-based SAX discretization procedure (a sliding window moving left to right) detects NA within the right side of the sliding window, abandons further processing of the current time series, and continues to the next one.
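For example, a minimal padding sketch (illustrative only, not part of the package):

# pad three series of uneven length with trailing NA so they fit into a
# single matrix suitable for, e.g., manyseries_to_wordbag
series_list <- list(rnorm(100), rnorm(80), rnorm(120))
max_len <- max(sapply(series_list, length))
padded <- t(sapply(series_list, function(s) c(s, rep(NA, max_len - length(s)))))
dim(padded)  # 3 rows of length 120; the trailing NAs mark where each series ends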

References:

[1] Goldin, D., Kanellakis, P., On similarity queries for time-series data: Constraint specification and implementation, In Principles and Practice of Constraint Programming – CP '95, pp. 137–153 (1995)

[2] Keogh, E., Chakrabarti, K., Pazzani, M., & Mehrotra, S., Dimensionality reduction for fast similarity search in large time series databases, Knowledge and Information Systems, 3(3), 263–286 (2001)

[3] Lin, J., Keogh, E., Lonardi, S., & Patel, P., Finding motifs in time series, In Proc. of the 2nd Workshop on Temporal Data Mining, pp. 53–68 (2002)

[4] Salton, G., Wong, A., Yang, C., A vector space model for automatic indexing, Commun. ACM 18(11), 613–620 (1975)

[5] Senin, P., Malinchik, S., SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model, In Proc. 2013 IEEE 13th International Conference on Data Mining (ICDM), pp. 1175–1180, 7–10 Dec. 2013

[6] Keogh, E., Lin, J., Fu, A., HOT SAX: Efficiently finding the most unusual time series subsequence, In Proc. ICDM (2005)

[7] Larsson, N.J., Moffat, A., Offline dictionary-based compression, In Proc. Data Compression Conference (1999)

[8] Senin, P., Lin, J., Wang, X., Oates, T., Gandhi, S., Boedihardjo, A.P., Chen, C., Frankenstein, S., Time series anomaly discovery with grammar-based compression, In Proc. of the International Conference on Extending Database Technology, EDBT'15 (2015)

0.0 Installation from latest sources

install.packages("devtools")
library(devtools)
install_github('jMotif/jmotif-R')

To start using the library, simply load it into the R environment:

library(jmotif)

1.0 Time series z-Normalization

z-normalization (znorm(ts, threshold)) is a preprocessing step common in time series pattern mining, proposed by Goldin & Kanellakis [1], which helps downstream analyses focus on the structural features of the time series.

x = seq(0, pi*4, 0.02)
y = sin(x) * 5 + rnorm(length(x))

plot(x, y, type="l", col="blue", main="A scaled sine wave with random noise and its z-normalization")

lines(x, znorm(y, 0.01), type="l", col="red")
abline(h=c(1,-1), lty=2, col="gray50")
legend(0, -4, c("scaled sine wave","z-normalized wave"), lty=c(1,1), lwd=c(1,1), 
                                                                col=c("blue","red"), cex=0.8)

z-normalization of a scaled sine wave

2.0 Piecewise Aggregate Approximation (i.e., PAA)

PAA (paa(ts, paa_num)) reduces the input time series dimensionality by splitting it into equally-sized segments (the PAA size) and averaging the values of the points within each segment. Typically, PAA is applied to a z-normalized time series. In the following example, an 8-point time series is reduced to 3 points.

y = c(-1, -2, -1, 0, 2, 1, 1, 0)
plot(y, type="l", col="blue", main="An 8-point time series and its PAA transform into 3 points")

points(y, pch=16, lwd=5, col="blue")

abline(v=c(1,1+7/3,1+7/3*2,8), lty=3, lwd=2, col="gray50")

y_paa3 = paa(y, 3)

segments(1,y_paa3[1],1+7/3,y_paa3[1],lwd=1,col="red")
points(x=1+7/3/2,y=y_paa3[1],col="red",pch=23,lwd=5)

segments(1+7/3,y_paa3[2],1+7/3*2,y_paa3[2],lwd=1,col="red")
points(x=1+7/3+7/3/2,y=y_paa3[2],col="red",pch=23,lwd=5)

segments(1+7/3*2,y_paa3[3],8,y_paa3[3],lwd=1,col="red")
points(x=1+7/3*2+7/3/2,y=y_paa3[3],col="red",pch=23,lwd=5)

PAA transform of an 8-point time series into 3 points

3.0 SAX transform

The SAX transform (series_to_string(ts, alphabet_size)) is a discretization algorithm which transforms a sequence of real values (time series points) into a sequence of discrete values: symbols taken from a finite alphabet. This procedure enables the application of numerous discrete-data analysis algorithms to continuous time series data.

Typically, SAX is applied to a time series whose dimensionality has been reduced with PAA, which effectively yields a low-dimensional, discrete representation of the input time series that preserves (to some extent) its structural characteristics. With this representation, it is possible to design efficient algorithms for common time series pattern mining tasks, since one can rely on indexing the data in symbolic space. Note that before processing with PAA and SAX, time series are z-normalized.

The figure below illustrates the PAA+SAX procedure: first, the 8-point time series is converted into its 3-point PAA representation; second, the PAA values are converted into letters using a 3-letter alphabet.

# overlay the SAX breakpoints and letters onto the PAA plot from section 2.0
y <- seq(-2,2, length=100)
x <- dnorm(y, mean=0, sd=1)
lines(x,y, type="l", lwd=5, col="magenta")
abline(h = alphabet_to_cuts(3)[2:3], lty=2, lwd=2, col="magenta")
text(0.7,-1,"a",cex=2,col="magenta")
text(0.7, 0,"b",cex=2,col="magenta")
text(0.7, 1,"c",cex=2,col="magenta")

> series_to_string(y_paa3, 3)
[1] "acc"

> series_to_chars(y_paa3, 3)
[1] "a" "c" "c"

an application of the SAX transform (word size 3, alphabet size 3) to an 8-point time series

4.0 Time series SAX transform via sliding window

Another common way to use SAX is to apply the procedure to subseries extracted via a sliding window (sax_via_window(ts, win_size, paa_size, alp_size, nr_strategy, n_threshold)). This technique is used in SAX-VSM, where it enables the conversion of a time series into a bag of words. Note the use of a numerosity reduction strategy.
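A minimal usage sketch (the parameter values here are arbitrary, chosen only for illustration):

# discretize a random walk via a sliding window of 60 points, PAA size 6,
# and alphabet size 6; the "exact" numerosity reduction strategy drops a
# word whenever it is identical to the previously emitted one
set.seed(42)
ts <- cumsum(rnorm(300))
words <- sax_via_window(ts, 60, 6, 6, "exact", 0.01)

# the result is a list of SAX words keyed by the window offset
head(unlist(words))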

5.0 SAX-VSM classifier

I use one of the standard UCR time series datasets to illustrate the implemented approach. The Cylinder-Bell-Funnel dataset (Saito, N.: Local feature extraction and its application using a library of bases. PhD thesis, Yale University (1994)) consists of three time series classes. The dataset is embedded into the jmotif library:

# load Cylinder-Bell-Funnel data
data("CBF")

where it is wrapped into a list of four elements: train and test sets and their labels:

> str(CBF)
List of 4
$ labels_train: num [1:30] 1 1 1 3 2 2 1 3 2 1 ...
$ data_train  : num [1:30, 1:128] -0.464 -0.897 -0.465 -0.187 -1.136 ...
$ labels_test : num [1:900] 2 2 1 2 2 3 1 3 2 3 ...
$ data_test   : num [1:900, 1:128] -1.517 -0.703 -1.412 -0.955 -1.449 ...
5.1 Pre-processing and bags of words construction

At the first step, each class of the training data needs to be transformed into a bag of words using the manyseries_to_wordbag function, which z-normalizes each time series and converts it into a set of words that are added to the resulting bag:

# set the discretization parameters
#
w <- 60 # the sliding window size
p <- 6  # the PAA size
a <- 6  # the SAX alphabet size

# convert the train classes to wordbags (the dataset has three labels: 1, 2, 3)
#
cylinder <- manyseries_to_wordbag(CBF[["data_train"]][CBF[["labels_train"]] == 1,], w, p, a, "exact", 0.01)
bell <- manyseries_to_wordbag(CBF[["data_train"]][CBF[["labels_train"]] == 2,], w, p, a, "exact", 0.01)
funnel <- manyseries_to_wordbag(CBF[["data_train"]][CBF[["labels_train"]] == 3,], w, p, a, "exact", 0.01)

Each of these bags is a two-column data frame:

> head(cylinder)
   words counts
1 aabeee      2
2 aabeef      1
3 aaceee      7
4 aacfee      1
5 aadeee      7
6 aaedde      1
5.2 TF*IDF weighting

TF*IDF weights are computed at the second step with the bags_to_tfidf function, which accepts a single argument: a list of word bags named by class label:

# compute tf*idf weights for three bags
#
tfidf = bags_to_tfidf( list("cylinder" = cylinder, "bell" = bell, "funnel" = funnel) )

this yields a data frame of four variables: the words that are "important" in TF*IDF terms (i.e., absent from at least one of the bags) and their class-corresponding weights:

> tail(tfidf)
     words  cylinder     bell funnel
640 ffcbbb 0.6525709 0.445449 0.0000
641 ffdbab 0.0000000 0.000000 0.7615
642 ffdbbb 1.7681483 0.000000 0.0000
643 ffdcaa 0.0000000 0.000000 0.7615
644 ffdcba 0.0000000 0.000000 0.7615
645 ffebbb 1.5230000 0.000000 0.0000

which makes it easy to find exactly which patterns contribute the most to each class:

> library(dplyr)
> head(arrange(tfidf, desc(cylinder)))
   words cylinder bell funnel
1 aaeeee 2.413898    0      0
2 aaceee 2.284500    0      0
3 aadeee 2.284500    0      0

> head(arrange(tfidf, desc(funnel)))
   words cylinder bell   funnel
1 fedcba        0    0 2.975097
2 fedbba        0    0 2.284500
3 adfecb        0    0 1.968449

and to visualize them on the data:

# make up a sample time-series
#
sample = (CBF[["data_train"]][CBF[["labels_train"]] == 3,])[1,]
sample_bag = sax_via_window(sample, w, p, a, "exact", 0.01)
df = data.frame(index = as.numeric(names(sample_bag)), words = unlist(sample_bag))
               
# weight the found patterns
#
weighted_patterns = merge(df, tfidf)
specificity = rep(0, length(sample))
for(i in 1:length(weighted_patterns$words)){
  pattern = weighted_patterns[i,]
  for(j in 1:w){
    specificity[pattern$index+j] = specificity[pattern$index+j] +
                                        pattern$funnel - pattern$bell - pattern$cylinder
  }
}

# plot the weighted patterns
#
library(ggplot2)
library(scales)
ggplot(data=data.frame(x=c(1:length(sample)), y=sample, col=rescale(specificity)),
 aes(x=x,y=y,color=col)) + geom_line(size=1.2) + theme_bw() +
 ggtitle("The funnel class-characteristic pattern example") +
 scale_colour_gradientn(name = "Class specificity:  ",limits=c(0,1),
    colours=c("red","yellow","green","lightblue","darkblue"),
    breaks=c(0,0.5,1),labels=c("negative","neutral","high"),
    guide = guide_colorbar(title.theme=element_text(size=14, angle=0),title.vjust=1,
    barheight=0.6, barwidth=6, label.theme=element_text(size=10, angle=0))) +
 theme(legend.position="bottom",plot.title=element_text(size=18),
    axis.title.x=element_blank(), axis.title.y=element_blank(),
    axis.text.x=element_text(size=12),axis.text.y=element_blank(),
    panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank(),
    axis.ticks.y = element_blank())

interpretable time series representation

5.3 SAX-VSM classification

Using the weighted patterns obtained at the previous step and the cosine similarity measure, it is easy to classify unlabeled data with the cosine_sim function, which accepts a list of two elements: the bag-of-words representation of the input time series (constructed with the series_to_wordbag function) and the TF*IDF weights table obtained at the previous step:

# classify the test data
#
labels_predicted = rep(-1, length(CBF[["labels_test"]]))
labels_test = CBF[["labels_test"]]
data_test = CBF[["data_test"]]
for (i in c(1:length(data_test[,1]))) {
    series = data_test[i,]
    bag = series_to_wordbag(series, w, p, a, "exact", 0.01)
    cosines = cosine_sim(list("bag"=bag, "tfidf" = tfidf))
    labels_predicted[i] = which(cosines$cosines == max(cosines$cosines))
}

# compute the classification error
#
error = length(which((labels_test != labels_predicted))) / length(labels_test)
error

# find out which time series were misclassified
#
which((labels_test != labels_predicted))

6.0 SAX-VSM discretization parameters optimization

Here I show how the discretization parameters for the classification task can be optimized with third-party libraries: nloptr, which implements DIRECT, and cvTools, which facilitates the cross-validation process. And do not forget the magic of plyr! Here is the code:

library(plyr)
library(cvTools)
library(nloptr)

# the cross-validation error function
# uses the following global variables
#   1) nfolds -- specifies folds for the cross-validation
#                if equal to the number of instances, then it is
#                LOOCV
#
cverror <- function(x) {

  # the vector x is supposed to contain real values for the
  # discretization parameters
  #
  w = round(x[1], digits = 0)
  p = round(x[2], digits = 0)
  a = round(x[3], digits = 0)

  # a few local vars to simplify the process
  m <- length(train_labels)
  c <- length(unique(train_labels))
  folds <- cvFolds(m, K = nfolds, type = "random")

  # saving the error for each fold in this list
  errors <- list()

  # cross-validation business
  for (i in c(1:nfolds)) {

    # define data sets
    set_test <- which(folds$which == i)
    set_train <- setdiff(1:m, set_test)

    # compute the TF-IDF vectors
    bags <- alply(unique(train_labels),1,function(x){x})
    for (j in 1:c) {
      ll <- which(train_labels[set_train] == unique(train_labels)[j])
      bags[[unique(train_labels)[j]]] <-
        manyseries_to_wordbag( (train_data[set_train, ])[ll,], w, p, a, "exact", 0.01)
    }
    tfidf = bags_to_tfidf(bags)

    # compute the error
    labels_predicted <- rep(-1, length(set_test))
    labels_test <- train_labels[set_test]
    data_test <- train_data[set_test,]

    for (j in c(1:length(labels_predicted))) {
      bag=NA
      if (length(labels_predicted)>1) {
        bag = series_to_wordbag(data_test[j,], w, p, a, "exact", 0.01)
      } else {
        bag = series_to_wordbag(data_test, w, p, a, "exact", 0.01)
      }
      cosines = cosine_sim(list("bag" = bag, "tfidf" = tfidf))
      if (!any(is.na(cosines$cosines))) {
        labels_predicted[j] = which(cosines$cosines == max(cosines$cosines))
      }
    }

    # the actual error value
    error = length(which((labels_test != labels_predicted))) / length(labels_test)
    errors[i] <- error
  }

  # output the mean cross-validation error as the result
  err = mean(laply(errors,function(x){x}))
  print(paste(w,p,a, " -> ", err))
  err
}

# define the data for CV
train_data <- CBF[["data_train"]]
train_labels <- CBF[["labels_train"]]
nfolds = 15

# perform the parameters optimization
S <- directL(cverror, c(10,2,2), c(120,60,12),
             nl.info = TRUE, control = list(xtol_rel = 1e-8, maxeval = 10))

The optimization process goes as follows:

[1] "65 31 7  ->  1"
[1] "65 31 7  ->  1"
[1] "65 31 7  ->  1"
[1] "28 31 7  ->  1"
[1] "102 31 7  ->  1"
[1] "65 12 7  ->  0.366666666666667"
[1] "65 50 7  ->  1"
[1] "65 31 4  ->  1"
[1] "65 31 10  ->  1"
[1] "28 12 7  ->  0.833333333333333"
[1] "102 12 7  ->  0.666666666666667"
[1] "65 12 4  ->  0"

Call:
nloptr(x0 = x0, eval_f = fn, lb = lower, ub = upper, opts = opts)
Minimization using NLopt version 2.4.2 

NLopt solver status: 5 ( NLOPT_MAXEVAL_REACHED: Optimization stopped because maxeval (above) was 
reached. )

Number of Iterations....: 10 
Termination conditions:  stopval: -Inf xtol_rel: 1e-08 maxeval: 10 ftol_rel: 0 ftol_abs: 0 
Number of inequality constraints:  0 
Number of equality constraints:    0 
Current value of objective function:  0 
Current value of controls: 65 11.66667 3.666667

At this point, S contains the best SAX parameters found within 10 DIRECT iterations, which we can use to classify the test data:

w = round(S$par[1], digits = 0)
p = round(S$par[2], digits = 0)
a = round(S$par[3], digits = 0)

# compute the TF-IDF vectors
#
bags <- alply(unique(train_labels),1,function(x){x})
for (j in 1:length(unique(train_labels))) {
  ll <- which(train_labels == unique(train_labels)[j])
  bags[[unique(train_labels)[j]]] <-
    manyseries_to_wordbag( train_data[ll,], w, p, a, "exact", 0.01)
}
tfidf = bags_to_tfidf(bags)

# classify the test data
#
labels_predicted = rep(-1, length(CBF[["labels_test"]]))
labels_test = CBF[["labels_test"]]
data_test = CBF[["data_test"]]
for (i in c(1:length(data_test[,1]))) {
  print(paste(i))
  series = data_test[i,]
  bag = series_to_wordbag(series, w, p, a, "exact", 0.01)
  cosines = cosine_sim(list("bag"=bag, "tfidf" = tfidf))
  if (!any(is.na(cosines$cosines))) {
    labels_predicted[i] = which(cosines$cosines == max(cosines$cosines))
  }
}

# compute the classification error
#
error = length(which((labels_test != labels_predicted))) / length(labels_test)
error

# find out which time series were misclassified
#
which((labels_test != labels_predicted))
par(mfrow=c(3,1))
plot(data_test[316,], type="l")
plot(data_test[589,], type="l")
plot(data_test[860,], type="l")

which shows that one instance of each of the classes was misclassified.

7.0 HOT-SAX algorithm for time series discord discovery

Given a time series T, its subsequence C is called a discord if it has the largest Euclidean distance to its nearest non-self match. Thus, a time series discord is a subsequence that is maximally different from all the rest of the subsequences in the time series, and it therefore naturally captures the most unusual subsequence within the time series.
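To make the definition concrete, here is a naive quadratic-time sketch of it (illustrative only; it omits the early-abandoning optimization mentioned below and any subsequence normalization the library may apply):

# for every subsequence of length w, find the Euclidean distance to its
# nearest non-self match (a subsequence that does not overlap with it);
# the discord is the subsequence maximizing that distance
naive_discord <- function(ts, w) {
  n <- length(ts) - w + 1
  best_dist <- -Inf
  best_pos <- NA
  for (i in 1:n) {
    nn <- Inf
    for (j in 1:n) {
      if (abs(i - j) >= w) { # skip overlapping (self) matches
        d <- sqrt(sum((ts[i:(i + w - 1)] - ts[j:(j + w - 1)])^2))
        nn <- min(nn, d)
      }
    }
    if (nn > best_dist) {
      best_dist <- nn
      best_pos <- i
    }
  }
  c(nn_distance = best_dist, position = best_pos)
}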

The library embeds the ECG0606 dataset taken from the PhysioNet FTP server. The raw data was transformed with the rdsamp utility:

rdsamp -r sele0606 -f 120.000 -l 60.000 -p -c | sed -n '701,3000p' >0606.csv

and consists of 15 heartbeats:

ECG0606 data

We know that the third heartbeat of this dataset contains the true anomaly, as discussed in the HOT SAX paper by Eamonn Keogh, Jessica Lin, and Ada Fu [6]. Note that the authors were specifically interested in finding anomalies shorter than a regular heartbeat, following a suggestion given by the domain expert: ''We conferred with cardiologist, Dr. Helga Van Herle M.D., who informed us that heart irregularities can sometimes manifest themselves at scales significantly shorter than a single heartbeat.'' Figure 13 of the paper further explains the nature of this true anomaly:

ECG0606 clusters

Two implementations of discord discovery are provided in the code: brute-force discord discovery and HOT-SAX.

The brute force takes about 14 seconds to discover 5 discords in the data (even with the early-abandoning distance):

> lineprof( find_discords_brute_force(ecg0606, 100, 5) )
Reducing depth to 2 (from 18)
    time    alloc  release dups                    ref            src
1 13.951 6211.306 6209.214    2                ".Call" .Call         
2  0.001    0.000    0.000   44 c(".Call", "tryCatch") .Call/tryCatch

whereas HOT-SAX finishes in a fraction of a second:

> lineprof( find_discords_hotsax(ecg0606, 100, 4, 4, 0.01, 5) )
   time alloc release dups     ref   src
1 0.191 0.245       0   56 ".Call" .Call

The discords are returned as a data frame sorted by position:

> discords = find_discords_hotsax(ecg0606, 100, 4, 4, 0.01, 5)
> discords
  nn_distance position
1   0.4787745       37
2   0.4177020      188
3   1.5045847      411
4   0.4437060      539
5   0.4437060     1566

The best discord is the third one at 411:

discords = find_discords_hotsax(ecg0606, 100, 4, 4, 0.01, 5)
plot(ecg0606, type = "l", col = "cornflowerblue", main = "ECG 0606")
lines(x=c(discords[3,2]:(discords[3,2]+100)),
    y=ecg0606[discords[3,2]:(discords[3,2]+100)], col="red")

ECG0606 clusters

It is easy to sort the discords by the nearest-neighbor distance:

> library(dplyr)
> arrange(discords,desc(nn_distance))
  nn_distance position
1   1.5045847      411
2   0.4787745       37
3   0.4437060      539
4   0.4437060     1566
5   0.4177020      188

8.0 Grammatical inference with RePair

RePair is a dictionary-based compression method proposed in 1999 by Larsson and Moffat [7]. In contrast with Sequitur, RePair is an offline algorithm that requires the whole input sequence to be accessible before building a grammar. Like Sequitur, RePair can also be utilized as a grammar-based compressor able to discover a compact grammar that generates the text. It is a remarkably simple algorithm known for its very fast decompression.

In short, RePair performs a recursive pairing step -- finding the most frequent pair of symbols in the input sequence and replacing it with a new symbol -- until every pair appears only once.
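A minimal sketch of this pairing loop in plain R (illustrative only; the package's actual implementation is in C++, also records the rule intervals, and may break ties between equally frequent pairs differently):

repair_sketch <- function(tokens) {
  rules <- list()
  rule_id <- 0
  repeat {
    if (length(tokens) < 2) break
    # count all adjacent pairs of tokens
    pairs <- paste(tokens[-length(tokens)], tokens[-1])
    counts <- table(pairs)
    if (max(counts) < 2) break  # every pair is unique, we are done
    top <- names(counts)[which.max(counts)]
    rule_id <- rule_id + 1
    nt <- paste0("R", rule_id)
    rules[[nt]] <- top
    # replace non-overlapping occurrences of the pair, left to right
    out <- character(0)
    i <- 1
    while (i <= length(tokens)) {
      if (i < length(tokens) && paste(tokens[i], tokens[i + 1]) == top) {
        out <- c(out, nt)
        i <- i + 2
      } else {
        out <- c(out, tokens[i])
        i <- i + 1
      }
    }
    tokens <- out
  }
  list(R0 = tokens, rules = rules)
}

repair_sketch(strsplit("abc abc cba cba bac xxx abc abc cba cba bac", " ")[[1]])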

As noted by the authors, when compared with online compression algorithms, the disadvantage of RePair having to keep the whole message in memory is illusory: the incremental dictionary-based algorithms maintain an equally large message in memory as part of their growing dictionary.

Here is an example of a RePair grammar for an input string containing an anomaly (xxx). Note that none of the grammar rules includes the anomalous terminal symbol.

Grammar rule        Expanded grammar rule                        Occurrence in R0
R0 -> R4 xxx R4     abc abc cba cba bac xxx abc abc cba cba bac
R1 -> abc abc       abc abc                                      0-1, 6-7
R2 -> cba cba       cba cba                                      2-3, 8-9
R3 -> R1 R2         abc abc cba cba                              0-3, 6-9
R4 -> R3 bac        abc abc cba cba bac                          0-4, 6-10

Calling the RePair implementation in jmotif-R

grammar <- str_to_repair_grammar("abc abc cba cba bac xxx abc abc cba cba bac")

produces a list whose elements describe the RePair grammar rules. For example, here is the first rule of the grammar (the second list element, since the first element holds R0):

> str(grammar[[2]])
List of 5
 $ rule_name           : chr "R1"
 $ rule_string         : chr "cba cba"
 $ expanded_rule_string: chr "cba cba"
 $ rule_interval_starts: num [1:2] 2 8
 $ rule_interval_ends  : num [1:2] 3 9

9.0 Rule density curve

As we have discussed in our work, SAX opens the door to applying many high-level string algorithms to the problem of pattern mining in time series. Specifically, in [8] we have shown useful properties of grammatical compression (i.e., algorithmic complexity) when applied to the problem of recurrent and anomalous pattern discovery.

jmotif-R implements the RePair [7] algorithm for grammar induction, which can be used to build a rule density curve that enables highly efficient approximate time series anomaly discovery.

I use the same ECG0606 dataset in this example:

ecg <- ecg0606

require(ggplot2)
df=data.frame(time=c(1:length(ecg)),value=ecg)
p1 <- ggplot(df, aes(time, value)) + geom_line(lwd=1.1,color="blue1") + theme_classic() +
  ggtitle("Dataset ECG qtdb 0606 [701-3000]") +
  theme(plot.title = element_text(size = rel(1.5)), 
    axis.title.x = element_blank(),axis.title.y=element_blank(),
    axis.ticks.y=element_blank(),axis.text.y=element_blank())
p1

and use the RePair implementation to build the rule density curve:

# discretization parameters
w=100
p=8
a=8

# discretize the data 
ecg_sax <- sax_via_window(ecg, w, p, a, "none", 0.01)

# get the string representation of time series
ecg_str <- paste(ecg_sax, collapse=" ")

# infer the grammar
ecg_grammar <- str_to_repair_grammar(ecg_str)   

# initialize the density curve
density_curve = rep(0,length(ecg))

# account for all the rule intervals
for(i in 2:length(ecg_grammar)){
    rule = ecg_grammar[[i]]
    for(j in 1:length(rule$rule_interval_starts)){
        xs = rule$rule_interval_starts[j]
        xe = rule$rule_interval_ends[j] + w
        density_curve[xs:xe] <- density_curve[xs:xe] + 1;
    }
}

# see the global minima
which(density_curve==min(density_curve))
# [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24
# [25]  25  26  27  28  29  30 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461
# [49] 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476
min_values = data.frame(x=c(444:476),y=rep(0,(476-443)))

# plot the curve
density_df=data.frame(time=c(1:length(density_curve)),value=density_curve)
shade <- rbind(c(0,0), density_df, c(2229,0))
names(shade)<-c("x","y")
p2 <- ggplot(density_df, aes(x=time,y=value)) +
    geom_line(col="cyan2") + theme_classic() +
    geom_polygon(data = shade, aes(x, y), fill="cyan", alpha=0.5) +
    ggtitle("RePair rules density for (w=100,p=8,a=8)") +
    theme(plot.title = element_text(size = rel(1.5)), axis.title.x = element_blank(),
    axis.title.y=element_blank(), axis.ticks.y=element_blank(),axis.text.y=element_blank())+
    geom_line(data=min_values,aes(x,y),lwd=2,col="red")
p2

library(gridExtra)  # provides grid.arrange
grid.arrange(p1, p2, ncol=1)

RePair rules density

10.0 Rare Rule Anomaly (RRA) algorithm

The RRA (i.e., Rare Rule Anomaly) algorithm extends HOT-SAX by leveraging grammatical compression properties (i.e., algorithmic, or Kolmogorov, complexity). In contrast with the original algorithm, whose input is the set of time series subsequences extracted from the input time series via a sliding window, RRA operates on the set of subsequences that correspond to the grammar's rules. The grammar whose rules are used in RRA is built by a grammar inference algorithm run on the set of tokens obtained by time series discretization with SAX and a sliding window (i.e., a HOT-SAX input). jmotif-R uses the RePair algorithm for grammatical inference.

Since each of the grammar rules consists of terminal and non-terminal tokens, the subsequences corresponding to the rules naturally vary in length. Moreover, due to the compression properties of the grammatical inference algorithm, which operates on digrams (i.e., pairs of tokens), the number of input subsequences for RRA is usually significantly lower than the number extracted via a sliding window for HOT-SAX, which improves the efficiency of the HOT-SAX inner and outer loops by reducing the number of calls to the distance function.

In addition to the above, RRA uses a different heuristic for the ordering in the HOT-SAX outer loop: instead of ordering subsequences by their occurrence frequency, the rule-corresponding subsequences are ordered according to the "rule coverage" values discussed above, a value that reflects the compressibility of a subsequence. Naturally, we expect that incompressible subsequences correspond to potential anomalies.
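A hedged usage sketch, assuming the package exports find_discords_rra() with a signature mirroring the find_discords_hotsax() call above plus a numerosity-reduction strategy argument:

# assumed signature: series, window size, PAA size, alphabet size,
# numerosity reduction strategy, threshold, number of discords
discords <- find_discords_rra(ecg0606, 100, 4, 4, "exact", 0.01, 5)
discords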

jmotif-r's People

Contributors

seninp

jmotif-r's Issues

paa example

Is it just me, or does the example provided in the documentation for the paa function not actually use the paa function at all?

Error still in loading library 'scales'

Hi Senin,

Sorry, I also tried installing the "scales" package as shown below and then loading the library, but it still fails:

install.packages("scales")
Installing package into ‘C:/Users/Balaji/Documents/R/win-library/3.2’
(as ‘lib’ is unspecified)
trying URL 'https://cran.uni-muenster.de/bin/windows/contrib/3.2/scales_0.3.0.zip'
Content type 'application/zip' length 605174 bytes (590 KB)
downloaded 590 KB

package ‘scales’ successfully unpacked and MD5 sums checked
Warning: cannot remove prior installation of package ‘scales’

The downloaded binary packages are in
C:\Users\Balaji\AppData\Local\Temp\RtmpKged6V\downloaded_packages

library(scales)
Error in library(scales) : there is no package called ‘scales’

Any clue....

Regards,
Bala

NaN generated from the function cosine_sim

While working through the code in the README.md file, I found that if I run the code below, cosine_sim generates NaN for that sample in the test data set. Do you know why?

library(jmotif)
library(plyr)  # for alply, used below
data("CBF")

train_data <- CBF[["data_train"]]
train_labels <- CBF[["labels_train"]]
labels_test = CBF[["labels_test"]]
data_test = CBF[["data_test"]]

w = 65
p = 12
a = 4
nfolds = 10
bags <- alply(unique(train_labels),1,function(x){x})
for (j in 1:length(unique(train_labels))) {
  ll <- which(train_labels == unique(train_labels)[j])
  bags[[unique(train_labels)[j]]] <-
    manyseries_to_wordbag( train_data[ll,], w, p, a, "exact", 0.01)
}
tfidf = bags_to_tfidf(bags)

i = 860
series = data_test[i,]
bag = series_to_wordbag(series, w, p, a, "exact", 0.01)
cosines = cosine_sim(list("bag"=bag, "tfidf" = tfidf))

cosines # NaN generated

Problems while installing the package

While installing the package ('jMotif/jmotif-R'), I got the following error:

C:/PROGRA1/R/R-321.2/etc/i386/Makeconf:189: recipe for target 'RcppExports.o' failed
make: *** [RcppExports.o] Error 1
Warning: running command 'make -f "C:/PROGRA1/R/R-321.2/etc/i386/Makeconf" -f "C:/PROGRA1/R/R-321.2/share/make/winshlib.mk" SHLIB_LDFLAGS='$(SHLIB_CXXLDFLAGS)' SHLIB_LD='$(SHLIB_CXXLD)' SHLIB="jmotif.dll" OBJECTS="RcppExports.o jmotif.o"' had status 2
ERROR: compilation failed for package 'jmotif'

  • removing 'C:/Users/bdgmasu/Documents/R/win-library/3.2/jmotif'

and the installation terminates.

Please Help

How to prepare .rda file (reopen)

Thanks for that piece of information. That was very useful.

I quickly tried to read the source data files for both "Gun_Point" and "CBF" that I downloaded from UCR repo.

To understand how to convert the above target datasets into .rda data files, I used your reference scripts in /makedata.R. But I receive an error while creating the matrix:

data_train = matrix(unlist(dtrain[,-1]), nrow = length(labels_train))
Error in matrix(unlist(dtrain[, -1]), nrow = length(labels_train)) :
'data' must be of a vector type, was 'NULL'
data_test = matrix(unlist(dtest[,-1]), nrow = length(labels_test))
Error in matrix(unlist(dtest[, -1]), nrow = length(labels_test)) :
'data' must be of a vector type, was 'NULL'

Please advise.

-Bala

PS: My friend, may I also request that you don't immediately close the issue thread; closing it prevents me from raising the issue again on the same thread.

about the sliding window length in testthat

I read the paper but am confused about the sliding window size: there is always a +1. When I ran the testthat tests for sax_via_window.R, I got an error.

Is the +1 correct? With length(t(dat)) = 60 and a sliding window size of 6, the test expects 60 - 6 + 1 = 55 windows, but length(sax1) = 54.

sax1 <- sax_via_window(t(dat), 6, 3, 3, "none", 0.01)
expect_equal(length(sax1), length(t(dat)) - 6 + 1)

Error: Test failed: 'SAX test #1'
Not expected: length(sax1) not equal to length(t(dat)) - 6 + 1
55 - 54 == 1.

Thanks.
Roger

How to prepare .rda file (reopen)

I would like to prepare my dataset in the same format as CBF.rda for quick evaluation. Could you suggest any links on fields or steps etc.?

Thanks,
Bala

Error in executing znorm function

Hi Senin,

I'm getting some errors when running the jMotif-R source code step by step.
The error happens when I try to run the following line in my R console on Windows.

lines(x, znorm(y, 0.01), type="l", col="red")
Error in xy.coords(x, y) : could not find function "znorm"

I have successfully installed both devtools & jMotif-R package using commands you listed prior to this.

Any idea???

Thanks,
Bala

Help in resolving "SAX-VSM discretization parameters optimization" out of bounds error

Hello my friend Senin,

A quick clarification and help sought.

I ran into an "index out of bounds" error here:

Error: index out of bounds
6: stop(structure(list(message = "index out of bounds", call = NULL, cppstack = NULL), .Names = c("message", "call", "cppstack"), class = c("Rcpp::index_out_of_bounds", "C++Error", "error", "condition")))
5: bags_to_tfidf(bags)
4: fun(x, ...)
3: eval_f(x0, ...)
2: nloptr(x0, eval_f = fn, lb = lower, ub = upper, opts = opts)
1: directL(cverror, c(15, 3, 3), c(60, 10, 8), nl.info = TRUE, control = list(xtol_rel = 1e-08, maxeval = 25))

The same is NOT observed when I run Step 6.0) SAX-VSM discretization parameters optimization with a different .rda source file.

Any clue and quick diagnostics would be deeply appreciated.

Many Thanks.

Balaji

handling NAs in znorm needed

The current version produces the following:

znorm(c(1,2,3,-1))
[1] -0.146385 0.439155 1.024695 -1.317465
znorm(c(1,2,NA,-1))
[1] NA NA NA NA

We need to add an na.rm action.

Installation issue - error: 'unordered_map' in namespace 'std' does not name a type

Hi - I am trying to install on Windows 7 under R 3.2.3. I get the following error:

library(devtools)
install_github('jMotif/jmotif-R')
Downloading GitHub repo jMotif/jmotif-R@master
from URL https://api.github.com/repos/jMotif/jmotif-R/zipball/master
Installing jmotif
"C:/PROGRA1/R/R-321.3/bin/x64/R" --no-site-file --no-environ --no-save
--no-restore CMD INSTALL
"C:/Users/T897551/AppData/Local/Temp/Rtmpofu5I3/devtools2c0c144c27e0/jMotif-jmotif-R-c948d79"
--library="C:/Program Files/R/R-3.2.3/library" --install-tests

  • installing source package 'jmotif' ...
    ** libs

*** arch - i386
g++ -m32 -std=c++0x -I"C:/PROGRA1/R/R-321.3/include" -DNDEBUG -I"../inst/include/" -I"C:/Program Files/R/R-3.2.3/library/Rcpp/include" -I"C:/Program Files/R/R-3.2.3/library/RcppArmadillo/include" -I"d:/RCompile/r-compiling/local/local323/include" -O2 -Wall -mtune=core2 -c RcppExports.cpp -o RcppExports.o
In file included from C:/Program Files/R/R-3.2.3/library/RcppArmadillo/include/armadillo:50:0,
from C:/Program Files/R/R-3.2.3/library/RcppArmadillo/include/RcppArmadilloForward.h:46,
from C:/Program Files/R/R-3.2.3/library/RcppArmadillo/include/RcppArmadillo.h:31,
from ../inst/include/jmotif.h:4,
from RcppExports.cpp:4:
C:/Program Files/R/R-3.2.3/library/RcppArmadillo/include/armadillo_bits/compiler_setup.hpp:209:111: note: #pragma message: WARNING: compiler is in C++11 mode, but it has incomplete support for C++11 features;
C:/Program Files/R/R-3.2.3/library/RcppArmadillo/include/armadillo_bits/compiler_setup.hpp:210:87: note: #pragma message: WARNING: if something breaks, you get to keep all the pieces.
C:/Program Files/R/R-3.2.3/library/RcppArmadillo/include/armadillo_bits/compiler_setup.hpp:211:93: note: #pragma message: WARNING: to forcefully prevent Armadillo from using C++11 features,
C:/Program Files/R/R-3.2.3/library/RcppArmadillo/include/armadillo_bits/compiler_setup.hpp:212:90: note: #pragma message: WARNING: #define ARMA_DONT_USE_CXX11 before #include
In file included from RcppExports.cpp:4:0:
../inst/include/jmotif.h:213:1: error: 'unordered_map' in namespace 'std' does not name a type
make: *** [RcppExports.o] Error 1
Warning: running command 'make -f "Makevars" -f "C:/PROGRA1/R/R-321.3/etc/i386/Makeconf" -f "C:/PROGRA1/R/R-321.3/share/make/winshlib.mk" CXX='$(CXX1X) $(CXX1XSTD)' CXXFLAGS='$(CXX1XFLAGS)' CXXPICFLAGS='$(CXX1XPICFLAGS)' SHLIB_LDFLAGS='$(SHLIB_CXX1XLDFLAGS)' SHLIB_LD='$(SHLIB_CXX1XLD)' SHLIB="jmotif.dll" OBJECTS="RcppExports.o discord.o distance.o hot-sax.o jmotif.o paa.o repair.o rra.o sax-vsm.o sax.o string.o utils.o visit_registry.o znorm.o"' had status 2
ERROR: compilation failed for package 'jmotif'

  • removing 'C:/Program Files/R/R-3.2.3/library/jmotif'
    Error: Command failed (1)

It seems like a namespace error. Any pointers would be appreciated.

Thanks.

Installation Issues

Hi seninp,

I'm going through your ReadMe and I'm getting an error when grabbing the GitHub repository:

install_github('jMotif/jmotif-R')
Downloading GitHub repo jMotif/jmotif-R@master
Installing jmotif
"C:/PROGRA1/R/R-321.2/bin/x64/R" --no-site-file --no-environ --no-save --no-restore CMD INSTALL \ "C:/Users/NotUrPC/AppData/Local/Temp/RtmpamNSTe/devtoolsbf46db12781/jMotif-jmotif-R-f2ecb5c" --library="C:/Program \ Files/R/R-3.2.2/library" --install-tests
installing *source package 'jmotif' ...
** libs

*** arch - i386
g++ -m32 -I"C:/PROGRA1/R/R-321.2/include" -DNDEBUG -I"C:/Program Files/R/R-3.2.2/library/Rcpp/include" -I"d:/RCompile/r-compiling/local/local320/include" -O2 -Wall -mtune=core2 -c RcppExports.cpp -o RcppExports.o

g++ -m32 -I"C:/PROGRA1/R/R-321.2/include" -DNDEBUG -I"C:/Program Files/R/R-3.2.2/library/Rcpp/include" -I"d:/RCompile/r-compiling/local/local320/include" -O2 -Wall -mtune=core2 -c jmotif.cpp -o jmotif.o
jmotif.cpp: In function 'bool is_equal_mindist(Rcpp::CharacterVector, Rcpp::CharacterVector)':
jmotif.cpp:295:30: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
jmotif.cpp: In function 'std::map<int, Rcpp::Vector<16> > sax_by_chunking(Rcpp::NumericVector, int, int, double)':
jmotif.cpp:396:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
jmotif.cpp: In function 'Rcpp::DataFrame bags_to_tfidf(Rcpp::List)':
jmotif.cpp:583:38: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
jmotif.cpp:594:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
jmotif.cpp:627:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
jmotif.cpp:632:40: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
jmotif.cpp:646:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
jmotif.cpp:663:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
jmotif.cpp:668:40: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
jmotif.cpp:679:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
jmotif.cpp:699:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
jmotif.cpp:706:34: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
jmotif.cpp: In function 'Rcpp::DataFrame cosine_sim(Rcpp::List)':
jmotif.cpp:742:33: warning: comparison between signed and unsigned integer expressions [-Wsign-
compare]
jmotif.cpp:766:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
jmotif.cpp:782:37: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
jmotif.cpp: In function 'discord_record find_best_discord_brute_force(const NumericVector&, int, VisitRegistry_)':
jmotif.cpp:960:26: error: 'isnan' was not declared in this scope
jmotif.cpp:960:26: note: suggested alternative:
c:\rbuildtools\3.2\gcc-4.6.3\bin../lib/gcc/i686-w64-mingw32/4.6.3/../../../../include/c++/4.6.3/cmath:768:5: note: 'std::isnan'
make: *_* [jmotif.o] Error 1
Warning: running command 'make -f "C:/PROGRA1/R/R-321.2/etc/i386/Makeconf" -f "C:/PROGRA1/R/R-321.2/share/make/winshlib.mk" SHLIB_LDFLAGS='$(SHLIB_CXXLDFLAGS)' SHLIB_LD='$(SHLIB_CXXLD)' SHLIB="jmotif.dll" OBJECTS="RcppExports.o jmotif.o"' had status 2
ERROR: compilation failed for package 'jmotif'
*removing 'C:/Program Files/R/R-3.2.2/library/jmotif'
Error: Command failed (1)

Do you know where the error could be? I've followed your guide step by step and got no other problems. I'm running R on Windows 7 x64. If you need further information, feel free to ask.

Best regards
BoJanisch

Pls help selecting optimal sliding window (w), PAA size (p) & alphabet size (a) in jmotif-R

Hello my friend,

On your SAX-VSM classic (Java) implementation tutorial page, you explain "Running the parameter sampler (Optimizer)". I would like to know whether the same has also been implemented in the R equivalent of the jMotif package.

If yes, a pointer demonstrating how to run this with the sliding window range [10-150], PAA size [5-75], and alphabet [2-18], and what to expect as the output of the function, would be very helpful.

My energy dataset also has multiple signatures, whose subsequences are of varied lengths within the time series. The goal is to use the DIRECT sampler that you employed to find the optimal set of parameters prior to classification.

Thanks in anticipation.

Best regards,
Bala

SAX-VSM suboptimal optimization

Hello
In the webpage example 5.0 SAX-VSM classifier, the un-optimized parameters:

w <- 60 # the sliding window size
p <- 6 # the PAA size
a <- 6 # the SAX alphabet size
are optimal, since the error term is 0 (a zero misclassification rate).

However, in the example 6.0 SAX-VSM parameters optimization, the optimal parameters are "65 12 4 -> 0":

w <- 65 # the sliding window size
p <- 12 # the PAA size
a <- 4 # the SAX alphabet size
and the misclassification error is [1] 0.004444444, with 4 misclassified series ([1] 187 589 766 860)

What am I missing?

Also, I tried the optimization with my own data set and I get the following error

Error in bags[[unique(train_labels)[j]]] <- manyseries_to_wordbag((train_data[set_train, :
attempt to select less than one element in OneIndex

Thank you in advance for your help

Error reproducing heatmap i.e. plot the weighted patterns

Hi Senin,

Thanks. You pointed me to the missing link to load jmotif package before.

I continued on with the demo and everything was fine until I reached a point to plot weighted patterns on the time series.

I get the following error when I try to load library(scales), for whatever reason, though library(ggplot2) loaded well.

library(scales)
Error in library(scales) : there is no package called ‘scales’

Any idea???

Regards,
Bala
