jmotif / sax Goto Github PK

View Code? Open in Web Editor NEW

80.0 17.0 23.0 7.93 MB

Java implementation of SAX, HOT-SAX, and EMMA

Home Page: http://jmotif.github.io/sax-vsm_site/

License: GNU General Public License v2.0

Java 94.89% HTML 0.17% R 4.58% Shell 0.21% TeX 0.15%

sax discretization java timeseries bitmap discord anomalydetection anomalydiscovery

sax's People

Contributors

Stargazers

Watchers

sax's Issues

SAXProcessor doesn't process series of length = window size

In SAXProcessor, the following loop prevents from processing the last windowSize elements of a series.
E.g. if the series is the same size of the window, no record is generated.

for (int i = 0; i < ts.length - windowSize; i++)

I guess this should be

for (int i = 0; i <= ts.length - windowSize; i++)

Java API implementing test time series string conversion by individual time series(rows) from CSV

@seninp

Have you got any equivalent implementation (java API) for the function that read test time series (CSV) dataset row-by-row and then converts them into SAX wordbag with each bag corresponding to individual time series?

I'm looking for the equivalence to the following R-code...

for (i in c(1:length(data_test[,1]))) {
print(paste(i))
series = data_test[i,]

bag = series_to_wordbag(series, w, p, a, "exact", 0.01)

Appreciate your quick help !

Thanks,
KMB

HOTSAX fails if each subsequence of time series has a NN with a distance 0

Maven repo availability

Hi Pavel, are there any plans to turn this lib available in any maven repository?

Thanks.

Performance improvement in TSProcessor

Impressive API. Thanks :)

Please let me suggest small performance improvements that I have verified to be effective with JMH.

net.seninp.jmotif.sax.TSProcessor

public double mean(double[] series) {
double res = 0D;
int count = 0;// series.length has it
for (double tp : series) {
res += tp;
count += 1;
}
if (count > 0) {
return res / ((Integer) count).doubleValue();// (double)count is faster as it does not need boxing
}
return Double.NaN;
}

So we get:

public double mean(double[] series) {
double res = 0D;
for (double tp : series) {
res += tp;

}
    if (series.length > 0) {
        return res / (double)series.length;
}
return Double.NaN;

}

and

public double mean(int[] series) {
double res = 0D;
for (int tp : series) {
res += (double) tp;
}
if (series.length > 0) {
return res / (double)series.length;
}
return Double.NaN;
}

and

public double var(double[] series) {
double res = 0D;
double mean = mean(series);
for (double tp : series) {
res += (tp - mean) * (tp - mean);
}
if (series.length > 0) {
return res / (double)series.length;
}
return Double.NaN;
}

and

public double stDev(double[] series) {
double num0 = 0D;
double sum = 0D;
for (double tp : series) {
num0 = num0 + tp * tp;
sum = sum + tp;
}
double len = (double)series.length;
return Math.sqrt((len * num0 - sum * sum) / (len * (len - 1)));
}

Tests shouldn't print much output.

But they do.

any python version？

Generalization of Alphabet implementation

A while ago I read a bit about SAX, and played around with different implementation options. The results have not been published. This was just some recreational thing. I now reviewed this code, and thought about bringing it into a publishable shape, but a quick websearch about java sax time series led me to your implementation - and I'll probably not put any efforts into my implementation, because it most likely won't be able to compete with your library anyhow.

However, I was curious about how you implemented the breakpoint- and distance matrix computations. And ... admittedly, when I saw the giant arrays and switches in https://github.com/jMotif/SAX/blob/master/src/main/java/net/seninp/jmotif/sax/alphabet/NormalAlphabet.java , I twitched a little ;-)

The following is an implementation of your Alphabet class that does the computation of the cuts and the distance matrix generically, on the fly, for arbitrary alphabet sizes.

package net.seninp.jmotif.sax.alphabet;

import net.seninp.jmotif.sax.SAXException;

public class GeneralAlphabet extends Alphabet
{
    @Override
    public Integer getMaxSize()
    {
        return Integer.MAX_VALUE;
    }

    @Override
    public double[] getCuts(Integer size) throws SAXException
    {
        return computeEquiprobableRegions(size);
    }

    @Override
    public double[][] getDistanceMatrix(Integer size) throws SAXException
    {
        return computeDistanceMatrix(getCuts(size));
    }

    /**
     * The square root of two
     */
    private static final double SQRT_TWO = Math.sqrt(2);

    /**
     * Create an array that describes the breakpoints for dividing a
     * Gaussian distribution into equiprobable regions.  
     *  
     * @param n The number of bisection points
     * @return The array
     */
    private static double[] computeEquiprobableRegions(int n)
    {
        double b[] = new double[n - 1];
        for (int s = 1; s <= n - 1; s++)
        {
            double x = ((double) s / n);
            b[s - 1] = probit(x);
        }
        return b;
    }

    /**
     * The probit function.<br> 
     * <br>
     * (See http://en.wikipedia.org/wiki/Probit)
     * 
     * @param x The argument
     * @return The result
     */
    private static double probit(double x)
    {
        return SQRT_TWO * erfInv(2 * x - 1);
    }

    /**
     * The inverse error function.<br>
     * <br> 
     * (See http://en.wikipedia.org/wiki/Error_function)
     * 
     * @param x The argument
     * @return The result
     */
    private static double erfInv(double x)
    {
        // From https://people.maths.ox.ac.uk/gilesm/files/gems_erfinv.pdf
        double w = -Math.log((1.0 - x) * (1.0 + x));
        double p = 0;
        if (w < 5.0)
        {
            w = w - 2.5;
            p = 2.81022636e-08;
            p = 3.43273939e-07 + p * w;
            p = -3.5233877e-06 + p * w;
            p = -4.39150654e-06 + p * w;
            p = 0.00021858087 + p * w;
            p = -0.00125372503 + p * w;
            p = -0.00417768164 + p * w;
            p = 0.246640727 + p * w;
            p = 1.50140941 + p * w;
        }
        else
        {
            w = Math.sqrt(w) - 3.0;
            p = -0.000200214257;
            p = 0.000100950558 + p * w;
            p = 0.00134934322 + p * w;
            p = -0.00367342844 + p * w;
            p = 0.00573950773 + p * w;
            p = -0.0076224613 + p * w;
            p = 0.00943887047 + p * w;
            p = 1.00167406 + p * w;
            p = 2.83297682 + p * w;
        }
        return p * x;
    }

    /**
     * Computes the MINDIST lookup matrix, as described in the SAX paper
     * 
     * @param b The breakpoints, as computed with 
     * {@link #computeEquiprobableRegions(int)}
     * @return The distance matrix
     */
    private static double[][] computeDistanceMatrix(double b[])
    {
        int n = b.length + 1;
        double result[][] = new double[n][n];
        for (int r = 0; r < n; r++)
        {
            for (int c = r; c < n; c++)
            {
                double value = 0;
                if (Math.abs(r - c) > 1)
                {
                    value = b[Math.max(r, c) - 1] - b[Math.min(r, c)];
                }
                result[r][c] = value;
                result[c][r] = value;
            }
        }
        return result;
    }    
}

I'm not so sure abut the usage pattern of instances of this class. If the methods are called repeatedly and are performance critical, one could consider "caching" the results, so that the overhead for the computation will vanish, and the performance will eventually be the same as for the current, switch-based implementation - roughly like that:

package net.seninp.jmotif.sax.alphabet;

import java.util.LinkedHashMap;
import java.util.Map;

import net.seninp.jmotif.sax.SAXException;

public class CachingAlphabet extends Alphabet
{
    private final Alphabet delegate;
    private final Map<Integer, double[]> cuts;
    private final Map<Integer, double[][]> distanceMatrices;

    CachingAlphabet(Alphabet delegate)
    {
        this.delegate = delegate;
        this.cuts = new LinkedHashMap<Integer, double[]>();
        this.distanceMatrices = new LinkedHashMap<Integer, double[][]>();
    }

    @Override
    public Integer getMaxSize()
    {
        return delegate.getMaxSize();
    }

    @Override
    public double[] getCuts(Integer size) throws SAXException
    {
        double result[] = cuts.get(size);
        if (result == null)
        {
            result = delegate.getCuts(size);
            cuts.put(size, result);
        }
        return result;
    }

    @Override
    public double[][] getDistanceMatrix(Integer size) throws SAXException
    {
        double result[][] = distanceMatrices.get(size);
        if (result == null)
        {
            result = delegate.getDistanceMatrix(size);
            distanceMatrices.put(size, result);
        }
        return result;
    }
}

To be used as

Alphabet alphabet = new CachingAlphabet(new GenericAlphabet());

Regardless of the possible caching, here's a quick test showing that the results delivered by this implementation are epsilon-equal to the results from your current implementation.

package net.seninp.jmotif.sax.alphabet;

import java.util.Locale;

import net.seninp.jmotif.sax.SAXException;

public class AlphabetTest
{
    public static void main(String[] args) throws SAXException
    {
        NormalAlphabet a0 = new NormalAlphabet();
        GeneralAlphabet a1 = new GeneralAlphabet();

        int max = Math.min(a0.getMaxSize(), a1.getMaxSize());
        for (int i = 2; i < max; i++)
        {
            double[] c0 = a0.getCuts(i);
            double[] c1 = a1.getCuts(i);

            System.out.println("For " + i);
            //System.out.println("c1: "+Arrays.toString(c0));
            //System.out.println("c0: "+Arrays.toString(c1));

            double d0[][] = a0.getDistanceMatrix(i);
            double d1[][] = a1.getDistanceMatrix(i);

            //System.out.println("d0:\n" + createString(d0));
            //System.out.println("d1:\n" + createString(d1));

            System.out.println("maxAbsoluteDifference for cuts           : " + 
                maxAbsoluteDifference(c0, c1));
            System.out.println("maxAbsoluteDifference for distance matrix: " + 
                maxAbsoluteDifference(d0, d1));
        }
    }

    private static double maxAbsoluteDifference(double a0[], double a1[])
    {
        double maxAbsoluteDifference = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < a0.length; i++)
        {
            double d = Math.abs(a0[i] - a1[i]);
            maxAbsoluteDifference = Math.max(maxAbsoluteDifference, d);
        }
        return maxAbsoluteDifference;
    }

    private static double maxAbsoluteDifference(double a0[][], double a1[][])
    {
        double maxAbsoluteDifference = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < a0.length; i++)
        {
            maxAbsoluteDifference = Math.max(maxAbsoluteDifference, 
                maxAbsoluteDifference(a0[i], a1[i]));
        }
        return maxAbsoluteDifference;
    }

    private static String createString(double a[][])
    {
        StringBuilder sb = new StringBuilder();
        for (int r=0; r<a.length; r++)
        {
            if (r > 0)
            {
                sb.append("\n");
            }
            for (int c=0; c<a[r].length; c++)
            {
                if (c > 0)
                {
                    sb.append("  ");
                }
                sb.append(String.format(Locale.ENGLISH, "%5.5f", a[r][c]));
            }
        }
        return sb.toString();
    }
}

I considered opening this as a pull request, but think that it might be better to post this here, because there are several degrees of freedom for a possible integration. (You could even use this class to generate the lookup tables for your implementation, so that at least the 26 (instead of 20) letters can be covered). But of course, whether you use it or not is up to you.

If I use and publish this code in one of my libraries, it will be under the MIT/X11 license. But, from my perspective, the above code can be considered as "public domain" (one function refers to a paper, the link is in the code), so everybody may do with this whatever he wants.

Approx distance should account for the alphabet.

I think that using the cut intervals central lines may help....

When using this approach, this is how approximation error behavior looks like:

Using sax on gps points

Hi, I have a gps points dataset. It includes timestamp, position, velocity and other attributes.
I want to symbolize the velocity using the sax.
If the time interval between two consecutive point is not fixed, does sax support this case?

Rule thrown away makes it back after the check loop.

do you have any examples of using shingles in time series classification?

Hi Senin,

Recently, I came across approach (cosine similarity) that tries to use k-shingles to help classify text documents based on similarity in subsequent string patterns?
My quick search found that you actually have implemented public classes in Java (in SAXProcessor.java that indeed represents timeseries into Shingles using:

TS2Shingles - for single time series
manySeriesToShingles - for multiple time series datasets

However, my question to you is have you attempted classifying time series based on k-shingles ? If yes, does it fair well than SAX-VSM? Any other findings from such..pls advice.

Thanks,
KMB

Feature Request: String to PAA

My request is motivated by the need to calculate the slope of a pattern.

So as to not let this request be too specific, it might be more generally useful to convert a pattern to a PAA (double[]) which can then be used to do calculations on it.

PAA conversion like SAXProcessor#ts2string() but in reverse.

TSProcessor and SAXProcessor exception handling improvement

Ideally, create and throw SAXRuntimeException.

In the following case, a checked exception is thrown on a condition similar to where typically java.lang.IndexOutOfBoundsException is thrown.

However IndexOutOfBoundsException extends RuntimeException.

There is little chance in the following case to meaningfully catch the checked exception and recover from it. Once it is thrown, it is game over because of a programming error.

SAXProcessor
public char[] ts2string(double[] ts, int paaSize, double[] cuts, double nThreshold)
throws SAXException {

throws it because it calls

TSProcessor
public double[] paa(double[] ts, int paaSize) throws SAXException {
if (len < paaSize) {
throw new SAXException("PAA size can't be greater than the timeseries size.");
}

What one ends up doing as a developer using this library is to ring fence the exception in a way like this:

final char[] ts2string;
try {
ts2string = sp.ts2string(windowSubseries, paaSize, na.getCuts(alphabetSize), normalizationThreshold);
} catch (SAXException ex) {
throw new RuntimeException(ex);
}

which disrupts the work flow when writing software.

PAA doesn't behave as expected for series whose length is not multiple of paa size

I was starting from your code to create a SAX processor that operates reactively.
While coding PAA to operates on enumerables/observable collections I found something that I guess it's an error in TSProcessor PAA implementation
.
Take for example an array [1,2,3,4,5,6,7] and PAA size = 4.

The following R code (taken from this repo) generates what seems to me the correct output

x <- array(1:7, dim=c(7))
l <- length(x)
PAA_number = 4
PAA <- array(0, PAA_number)
for (i in 1:PAA_number) {
  PAA[i] <- mean(x[round((i - 1) * l/PAA_number + 1):round(i * l/PAA_number)])
  print(paste("segment",i,"[",      # a debug output
          round((i - 1) * l/PAA_number + 1),
          ":",round(i * l/PAA_number),"] ->",
          PAA[i]))
}

Output is:

[1] "segment 1 [ 1 : 2 ] -> 1.5"
[1] "segment 2 [ 3 : 4 ] -> 3.5"
[1] "segment 3 [ 4 : 5 ] -> 4.5"
[1] "segment 4 [ 6 : 7 ] -> 6.5"

To the contrary, the TSProcessor paa function generates:

[0] = 1.4285714285714286
[1] = 3.1428571428571428
[2] = 4.8571428571428568
[3] = 6.5714285714285712

Can you check?

Approx distance computation needs to be fixed.

This is how it works right now. Seems to be OK...

When using this approach, this is how approximation error behavior looks like:

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.