Immutable timeseries data structures built with Typescript

Home Page: http://software.es.net/pond

License: Other

JavaScript 12.18% HTML 0.33% CSS 0.65% TypeScript 86.84%
javascript pond timeseries immutable esnet typescript

pond's Introduction

Pond.js

Version 1.0 (alpha) of Pond.js is written in Typescript and has a brand new, fully typed API.

Version 0.9.x of Pond.js is the current version and the last version with the old API. (Documentation is still available) Note that v0.8.x/v0.9.x is the only version currently aligned with react-timeseries-charts.


Pond.js is a library built on top of immutable.js and Typescript to provide time-based data structures, serialization and processing.

For data structures it unifies the use of times, time ranges, events, collections and time series. For processing it provides a chained pipeline interface to aggregate, collect and process batches or streams of events.

We are still developing Pond.js as it integrates further into our code, so it may change or be incomplete in parts. That said, it has a growing collection of tests and we will strive not to break those without careful consideration.

See the CHANGES.md.

Rationale

ESnet runs a large research network for the US Department of Energy. Our tools consume events and time series data throughout our network visualization applications and data processing chains. As our tool set grew, so did our need to build a Javascript library to work with this type of data that was consistent and dependable. The alternative for us has been to pass ad-hoc data structures between the server and the client, making all elements of the system much more complicated. Not only do we need to deal with different formats at all layers of the system, we also repeat our processing code over and over. Pond.js was built to address these pain points.

The result might be as simple as comparing two time ranges:

const timerange = timerange1.intersection(timerange2);
timerange.asRelativeString();  // "a few seconds ago to a month ago"

Or simply getting the average value in a timeseries:

timeseries.avg("sensor");

Or quickly performing aggregations on a timeseries:

const dailyAvg = timeseries.fixedWindowRollup({
    window: everyDay,
    aggregation: { value: ["value", avg()] }
});

Or much higher level batch or stream processing using the chained API:

const source = stream()
    .groupByWindow({
        window: everyThirtyMinutes,
        trigger: Trigger.onDiscardedWindow
    })
    .aggregate({
        in_avg: ["in", avg()],
        out_avg: ["out", avg()]
    })
    .output(evt => { /* result */ });

How to install

Pond can be installed from npm.

The current version of the Typescript rewrite of Pond is pre-release 1.0 alpha, so you need to install it explicitly:

npm install pondjs@alpha

The older Javascript version (v0.8.x), which is the only one currently compatible with the companion visualization library react-timeseries-charts, is still the default version:

npm install pondjs

What does it do?

Pond has three main goals:

  1. Data Structures - Provide a robust set of time-related data structures, built on Immutable.js
  2. Serialization - Provide serialization of these structures for transmission across the wire
  3. Processing - Provide processing operations to work with those structures

Here is the high level overview of the data structures provided:

  • Time - a timestamp

  • TimeRange - a begin and end time, packaged together

  • Index - A time range denoted by a string, for example "5m-1234" is a specific 5 minute time range, or "2014-09" is September 2014

  • Duration - A length of time, with no particular anchor

  • Period - A reoccurring time, for example "hourly"

  • Window - A reoccurring duration of time, such as a one hour window, incrementing forward in time every 5 min

  • Event<K> - A key of type K, which could be Time, TimeRange or Index, and a data object packaged together

  • Collection<K> - A bag of events Event<K>, with a comprehensive set of methods for operating on those events

  • TimeSeries<K> - A sorted Collection<K> of events Event<K> and associated meta data

And then high level processing can be achieved either by chaining together Collection or TimeSeries operations, or with the experimental Stream API:

  • Stream - Stream style processing of events to build more complex processing operations on incoming realtime data. Supports remapping, filtering, windowing and aggregation.

Typescript

This library, as of 1.0 alpha, is now written entirely in Typescript. As a result, we recommend that it is used in a Typescript application. However, that is not a requirement.

The documentation website is generated from the Typescript definitions and so will provide type information. While especially useful when building a Typescript application, it is also a guide for Javascript users, as it tells you the expected types and shows how generics are used consistently. See How to read these docs for a quick guide to reading Typescript definitions.

v0.8.9 of Pond ships with basic Typescript declarations that were contributed to the project.

Contributing

Read the contribution guidelines.

The library is written in Typescript and has a large and growing Jest test suite. To run the tests interactively, use:

npm test

Publishing

We are currently publishing Alpha versions as we find bugs or need to make other changes.

lerna publish --npm-tag alpha

License

This code is distributed under a BSD style license, see the LICENSE file for complete information.

Copyright

ESnet Timeseries Library ("Pond.js"), Copyright (c) 2015-2017, The Regents of the University of California, through Lawrence Berkeley National Laboratory (subject to receipt of any required approvals from the U.S. Dept. of Energy). All rights reserved.

If you have questions about your rights to use or distribute this software, please contact Berkeley Lab's Innovation & Partnerships Office at [email protected].

NOTICE. This software is owned by the U.S. Department of Energy. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, prepare derivative works, and perform publicly and display publicly. Beginning five (5) years after the date permission to assert copyright is obtained from the U.S. Department of Energy, and subject to any subsequent five (5) year renewals, the U.S. Government is granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, prepare derivative works, distribute copies to the public, perform publicly and display publicly, and to permit others to do so.

pond's People

Contributors

eric-arellano, iabw, jdugan1024, miracle2k, pjm17971, sartaj10, siavelis, timmensch

pond's Issues

Chronological error in TimeSeries constructor

We started to enforce that events passed to the TimeSeries constructor be chronological. This is important as operations on a TimeSeries sometimes depend on the events being in order. However, it seems there are ways for out-of-order events to be produced internally. The operations that build a TimeSeries should always make sure that their collections are in order.

A possible alternative is to just check if the events are in order and if not, sort them. This would be the least intrusive approach, but might lead to slow user code.
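A minimal sketch of that check-then-sort alternative. The TimedEvent shape and both function names are hypothetical stand-ins, not Pond's Event class or API:

```typescript
// Hypothetical minimal event shape; Pond's real Event class is richer.
interface TimedEvent {
    time: number; // ms since epoch
    data: Record<string, number>;
}

// True if every event is at or after the one before it. Runs in O(n).
function isChronological(events: readonly TimedEvent[]): boolean {
    for (let i = 1; i < events.length; i++) {
        if (events[i].time < events[i - 1].time) return false;
    }
    return true;
}

// Returns the input untouched when it is already ordered, otherwise a
// sorted copy (O(n log n)). The input array is never mutated.
function ensureChronological(events: readonly TimedEvent[]): readonly TimedEvent[] {
    if (isChronological(events)) return events;
    return [...events].sort((a, b) => a.time - b.time);
}
```

The cost concern from the issue shows up here: ordered input only pays for the linear check, while unordered input silently pays for a full sort on every construction.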

Add merging for a TimeSeries

You should be able to take a TimeSeries with say "in" traffic, and one with "out" traffic and merge them to get a TimeSeries with both "in" and "out" columns.
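One way the requested merge could behave, sketched on plain rows rather than on Pond's TimeSeries. The Row type and mergeColumns are assumptions made for illustration:

```typescript
// Hypothetical row shape: a timestamp plus named numeric columns.
type Row = { time: number; data: Record<string, number> };

// Merge two series on timestamp. Rows that share a timestamp have their
// columns combined; rows present in only one series are kept unchanged.
// Timestamps themselves are only ever used as keys, never rewritten.
function mergeColumns(a: readonly Row[], b: readonly Row[]): Row[] {
    const byTime = new Map<number, Record<string, number>>();
    for (const row of [...a, ...b]) {
        byTime.set(row.time, { ...(byTime.get(row.time) ?? {}), ...row.data });
    }
    return Array.from(byTime.entries())
        .sort(([t1], [t2]) => t1 - t2)
        .map(([time, data]) => ({ time, data }));
}

const inSeries: Row[] = [{ time: 0, data: { in: 1 } }, { time: 30, data: { in: 3 } }];
const outSeries: Row[] = [{ time: 0, data: { out: 2 } }, { time: 30, data: { out: 4 } }];
const merged = mergeColumns(inSeries, outSeries);
// merged[0] is { time: 0, data: { in: 1, out: 2 } }
```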

Degenerate events from `atTime`

I haven't looked carefully at Pond's tests for this, so apologies if I'm mistaken.

Currently, it seems that atTime will only return the first Event returned from bisect, even if there are multiple events at the same time. In the example from #45, if you write two Events with exactly matching timestamps, only the first one will be returned even though both are in the index. Would it be more correct to return an [Event] instead? In my use case, I ended up merging them this way...

// Context: Typescript object using `this.index: pond.TimeSeries`.

atTime(time: Date): Event | null {
    if (!this.hasTime(time)) {
      return null;
    }
    const result: Event[] = [];
    // Start at the first occurrence of the time and accumulate every
    // consecutive event that shares the same timestamp.
    let pos = this.index.bisect(time);
    while (pos >= 0
           && pos < this.index.size()
           && this.index.at(pos).timestamp().getTime() === time.getTime()) {
      result.push(this.index.at(pos));
      ++pos;
    }
    // mergeEvents folds events such that the last non-null value is used for the Event.
    // e.g., {a: 1, b: 3}, {a: null, b: 2} => {a: 1, b: 2}
    // Could alternatively return an `[Event]` here.
    return mergeEvents(result);
}

(Thanks for all the help!)

Merge corrupts timestamps

If you merge two timeseries with overlapping timestamps, those timestamps become corrupted in the resulting timeseries.

avro support fails in webpack builds

When you import pond and bundle with webpack, the avro support fails with:

Error in ./~/avsc/lib/files.js
Module not found: Error: Cannot resolve module 'fs' ...

Abstraction for pipeline dependency graph

Integrate morbius graphs into pond, separating the graph from the execution of that graph. This task is just to make that abstraction, but to retain existing functionality. In the future we could add alternative executors, depending on environment or target.

Cannot create time series

Hi,
I'm currently evaluating react-timeseries-charts and struggling with the creation of the time series. I'm using the example from the Pond page:

const events = [];
events.push(new Event(new Date(2015, 7, 1), {value: 27}));
events.push(new Event(new Date(2015, 8, 1), {value: 29}));
const series = new TimeSeries({
    name: "avg temps",
    events: events
});

I always get an error when I try to run the app:
TypeError: null is not a constructor (evaluating 'new this_type(this._eventList.get(pos))')
This error is thrown when timeseries.timerange() is called.

Going through the code it looks like there is a problem in the _check() function of pipeline-in.js: The _type variable is not set, i.e. none of the following checks evaluates to true:

if (!this._type) {
  if (e instanceof _event2.default) {
    this._type = _event2.default;
  } else if (e instanceof _timerangeevent2.default) {
    this._type = _timerangeevent2.default;
  } else if (e instanceof _indexedevent2.default) {
    this._type = _indexedevent2.default;
  }
}

Even if I set this._type manually to _event2.default, the other if-block throws an error:

if (!(e instanceof this._type)) {
  throw new Error("Homogeneous events expected.");
}

In the console I can see that e is an Event and this._type is something like function ....

Moving library to Typescript

Rewrite the library in Typescript. 😺

The main goal here would be type safety, which I think would be a great benefit in the case of Pond. We can also look into using generics for things like events and collections. As we try to narrow in on what 1.0 would look like, this should be part of that conversation.

We've done some initial investigation into this, and it wouldn't be especially easy to convert, and very likely would cause more serious API changes. But also the benefits seem huge, and productivity downstream of this code base would be greatly enhanced by proper code completion and type checking.

Code structures:

  • Time type to replace 14127847483 or Date
  • Period type to replace "30s" strings
  • Indexes
  • Utility functions
  • Aggregation functions
  • EventKey type, Time, TimeRange and Index base
  • Events
  • Collection<Event>
  • TimeSeries

Collection processing

  • Batch processing framework
  • Aggregation
  • Alignment and rates
  • Filling
  • Collapsing
  • Selector
  • Take n

Stream processing

  • Should this be supported going forward?
  • Stream processing node graph framework
  • Event-wise processing
  • Grouping
  • Windowing

Project:

  • Transpile setup
  • Generate website API docs from Typescript (@sartaj10)
  • Figure out Jest testing in Typescript
  • Rebuild website

Field path/spec consistency in `TimeSeries`

While untangling the same issue in pypond and comparing against the JS, I noticed that the various aggregation methods in TimeSeries that call the corollary methods in Collection all have field_path in the docstring with the appropriate "this is one column" verbiage, but the methods themselves still use field_spec for the incoming and outgoing args.

A minor consistency issue.

User friendly API for using pipelines on TimeSeries objects

Here is an example of trying to use the current API (as it's exposed on TimeSeries.collapse())

The use case here is to take a TimeSeries, which is to be stacked, and find the appropriate scale. This has three parts to it:

  • For a subset of columns (those which will be charted in the up direction) we element by element collapse it to a single column which is the sum of the columns being collapsed. i.e. up = ["oscars", "lhcone", "other"]. The output will be a single column ("value") which is the sum of oscars, lhcone and other, for each time.
  • Then for each of these, find the max down the column. This gives us our chart scale in the up direction. We repeat 1. and 2. for the down direction.
  • Since our chart will be scaled the same in both the up and down direction, find the max of the two max values. This becomes our scale.

To do this in code, within a React component, looks like this:

Firstly, we are going to keep the scale as state on the component. When the TimeSeries changes we recalculate the scale. This is so that we don't have to calculate the scale every time the thing renders:

componentWillReceiveProps(nextProps) {
    const { series, up, down } = nextProps;
    if (!TimeSeries.is(this.props.series, series)) {
        this.updateChartScale(series, up, down);
    }
}

updateChartScale() does the actual code to get the new state:

updateChartScale(timeseries, up, down) {
    // A function to collapse columns to their sum, then find the max
    // value across the series
    const collapser = (columns, cb) => {
        timeseries.collapse(columns, "value", sum, false, sums =>
            cb(null, sums.max())
        );
    }

    // Use async to collect together the two callback results: the max
    // in the up and down direction, then do a standard React setState
    // to the max of those

    async.series([
        cb => collapser(up, cb),
        cb => collapser(down, cb)
    ], (err, results) => {
         this.setState({max: _.max(results)});
    });
}

Setting the state causes a re-render, which can then just use this.state.max in the YAxis code.

Pretty much a pure reactive approach. Sure.

Too hard? Yeah, probably.

An alternative I've been looking into is to have batch pipelines be optionally sync. Well to be clear they currently aren't async now, strictly speaking, but their API allows them to return results as they are built, and those results, in the future, might not happen in the same tick as the output call (to()).

The proposal is to add two additional functions to Pipeline, to complement to(): toEventList() and toCollectionList(). If you invoke these with a streaming source, it will throw. If you invoke them with a bounded source (e.g. a TimeSeries), it will run all of the source through the pipeline, then send a flush signal. When each Event (or Collection) is received by the Out, and that Out doesn't have an observer set (which will be the case when using the two new output methods), it will instead add the event (or collection) to a results accumulator on the Pipeline itself. When the flush is received by the Out, it will tell the Pipeline that the results are complete.

To the user this would look like this:

const timeseries = new TimeSeries(data);
const events = Pipeline()
    .from(timeseries)
    .emitOn("flush")
    .collapse(["in", "out"], "total", sum)
    .aggregate({total: max})
    .toEventList();
console.log("RESULT:", events);

Or for collections:

const timeseries = new TimeSeries(data);
const collections = Pipeline()
    .from(timeseries)
    .emitOn("flush")
    .groupBy(e => e.value() > 65 ? "high" : "low")
    .take(10)
    .toCollectionList();
console.log(collections);  // 2 collections (high and low)

The tricky thing here is what if this becomes async in the future.

This could happen in two ways:

  • The from() source could be async. For instance it could be a source requested off the network or from a file. If this was the case the toEventList() would return with an empty result set. Sometime later the source would have data and it would start to feed that data into the pipeline.
  • A part of the pipeline is async because it farms out work to workers. The first time this happens the stack will unwind and the pipeline will return, again with no results.

To deal with these I think you'd have to say that if you use the two new methods to return data immediately, the pipeline becomes sync (e.g. a flag is passed down the pipeline, async=false, and everything should honor that). I don't know of any way to make a network call and actually block waiting for the response (and that would be a bad idea in a browser anyway), so likely such a source, if present, would throw based on the async flag. For file loading on node.js it could block with a *sync method, but again, not such a good idea. For farming out work, it could just not do that. No parallel for you.


On balance the pipeline async API is the right one for a) a reactive approach, b) streaming, c) async sources that aren't a TimeSeries you already have and d) future parallel operations. But the real point here is that if all you have is a TimeSeries, and you just want to calculate something on it, it should be simpler.

Here's the first example again, assuming an implementation of collapse that uses the above changes:

updateChartScale(timeseries, up, down) {
    // collapse() now returns the collapsed series directly; take its max.
    const maxUp = timeseries.collapse(up, "value", sum).max();
    const maxDown = timeseries.collapse(down, "value", sum).max();
    this.setState({max: Math.max(maxUp, maxDown)});
}

lastDay and friends are incorrect

begin and end times are reversed in the static methods of Event. TimeRange should throw an exception if it gets a begin time after an end time.
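A small sketch of the validation being asked for. CheckedTimeRange is a hypothetical stand-in, not Pond's TimeRange, and its lastDay() only illustrates the intended begin/end order:

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical stand-in for a time range; not Pond's TimeRange class.
class CheckedTimeRange {
    readonly begin: Date;
    readonly end: Date;

    constructor(begin: Date, end: Date) {
        // Reject reversed ranges up front so the mistake surfaces where it is made.
        if (begin.getTime() > end.getTime()) {
            throw new RangeError("TimeRange begin must not be after end");
        }
        this.begin = begin;
        this.end = end;
    }

    // The 24 hours leading up to `now`: begin is the earlier instant.
    static lastDay(now: Date = new Date()): CheckedTimeRange {
        return new CheckedTimeRange(new Date(now.getTime() - DAY_MS), now);
    }
}
```

With the check in the constructor, a helper that swaps its arguments fails immediately instead of producing a range that only misbehaves later.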

TimeSeries has duplicate methods

In series.js:

    /**
     * Returns the number of rows in the series.
     */
    size() {
        return this._collection.size();
    }

    /**
     * Returns the number of rows in the series.
     */
    size() {
        return this._collection.size();
    }

Fenceposting with `atTime`?

This is mildly related to #44. I was writing some of my own tests for my use of TimeSeries when I noticed that I was missing some events when using atTime and bisect. Here's an example:

      let eventSource = new UnboundedIn();
      let collection = new Collection();
      let timeseries = new TimeSeries({name: "test",
                                       collection: collection});
      let pipeline = new Pipeline()
          .from(eventSource)
          .to(EventOut,
              event =>
              { collection = collection.addEvent(event);
                timeseries = new TimeSeries({name: "test",
                                             collection: collection});});
      assert(collection.size() == 0);
      assert(timeseries.size() == 0);
      eventSource.addEvent(new Event(time1, {value: 2}));
      assert(collection.size() == 1);
      assert(timeseries.size() == 1);
      console.log("timeseries" + timeseries.toString()); 
        // => {"name":"test","utc":true,"columns":["time","value"],"points":[[1465084800000,2]]}
      console.log("time1 index " + timeseries.bisect(time1)); 
        // => 0
      console.log("index 0 " + timeseries.at(0)); 
        // => {"time":1465084800000,"data":{"value":2}}
      console.log("using timeAt " + timeseries.atTime(time1)); 
        // => undefined :(

I think this is because at https://github.com/esnet/pond/blob/master/src/pond/lib/collection.js#L192, instead of if (pos && ....) you actually want if (pos >= 0 && ...).
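The suspected cause can be reproduced in isolation. In this sketch, bisect and both atTime variants are simplified stand-ins written for illustration, not the code in collection.js:

```typescript
// Hypothetical stand-in: `times` is sorted, and bisect returns the index of
// the last entry at or before `t`, or -1 if there is none.
function bisect(times: readonly number[], t: number): number {
    let lo = 0;
    let hi = times.length - 1;
    let pos = -1;
    while (lo <= hi) {
        const mid = (lo + hi) >> 1;
        if (times[mid] <= t) {
            pos = mid;
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return pos;
}

// Buggy guard: index 0 is falsy, so a match in the first slot is lost.
function atTimeBuggy(times: readonly number[], t: number): number | undefined {
    const pos = bisect(times, t);
    if (pos && pos < times.length) return times[pos];
    return undefined;
}

// Fixed guard: compare against 0 explicitly.
function atTimeFixed(times: readonly number[], t: number): number | undefined {
    const pos = bisect(times, t);
    if (pos >= 0 && pos < times.length) return times[pos];
    return undefined;
}
```

With a single-element series, as in the example above, the buggy variant returns undefined while the fixed one returns the entry at index 0.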

Align pipeline function

Would take incoming events, with arbitrary timestamps, and align those to time boundaries. For example, to 1:35:00, 1:40:00, 1:45:00 using a duration string such as "5m".
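The boundary arithmetic behind such an align step might look like the sketch below. It only computes the boundaries; a real align processor would also have to decide how values are carried over or interpolated onto them. All three function names are hypothetical:

```typescript
// Parse a duration string such as "30s", "5m", "1h" or "1d" into milliseconds.
// Only these four units are handled in this sketch.
function durationToMs(duration: string): number {
    const match = /^(\d+)([smhd])$/.exec(duration);
    if (!match) throw new Error(`Unsupported duration: ${duration}`);
    const unitMs: Record<string, number> = {
        s: 1000,
        m: 60 * 1000,
        h: 60 * 60 * 1000,
        d: 24 * 60 * 60 * 1000
    };
    return Number(match[1]) * unitMs[match[2]];
}

// The boundary at or before `timestamp` (ms since epoch, UTC aligned).
function floorToBoundary(timestamp: number, duration: string): number {
    const ms = durationToMs(duration);
    return Math.floor(timestamp / ms) * ms;
}

// Every boundary that falls inside [begin, end].
function boundariesBetween(begin: number, end: number, duration: string): number[] {
    const ms = durationToMs(duration);
    const result: number[] = [];
    for (let t = Math.ceil(begin / ms) * ms; t <= end; t += ms) {
        result.push(t);
    }
    return result;
}
```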

Validation on rollup functions is incorrect

In dailyRollup https://github.com/esnet/pond/blob/rename-column-fix/src/pond/lib/timeseries.js#L1057 there is a check to see if aggregation is a function, but I believe it should be an object. I think this is also a problem in hourlyRollup, monthlyRollup and yearlyRollup. This would bring things into alignment with how fixedWindowRollup works and how the docs describe the other functions.

Also, fixedWindowRollup refers to a function rather than to an aggregation map:

https://github.com/esnet/pond/blob/rename-column-fix/src/pond/lib/timeseries.js#L1057

Past-Dependent Functions

Hey all,

I was wondering if there existed a mechanism for reductions (past-dependent functions) on pipelines.

Is it the case that:

  • The functionality exists
  • The functionality doesn't exist, but the feature is desired
  • The functionality doesn't exist, and is not desired

Fill not filling in correctly

I'm super excited to have run across this library--I think it could potentially help me with a big issue I've been working on lately.

I have a large time series dataset that has lots of missing values, so I found your "fill" linear function to be particularly interesting. However, I'm having an issue where, when it goes to fill in null values, I get results that look like this:

[Screenshot: postman 2017-02-16 11-00-03]

Have you seen this before? If so, any thoughts on what may be causing this? Or is this the desired behavior, and I'm missing something?

Thanks!

Resampling

The timeseries charts code really doesn't like having 5000 points it has to render over and over (during pan and zoom). Since we don't display a 5000 pixel wide chart, we should downsample the data before it is rendered. There are several possibilities here:

A basic resample(n, avg) would divide the series into n buckets, where n might be approximately 1:1 with the resolution of the screen. We'd then apply one of the functions to that data, like avg or max.

Another approach is to use the processing code to do a rollup("5m", avg) of the data. This would bin samples in the timeseries into time based buckets such as 5 min.

A more sophisticated approach is for each bucket calculate a representative point to use instead, such that the resulting path is visually the most similar to the original path. An algorithm that attempts to do this is called "largest triangle three buckets". Here's the paper:
http://skemman.is/stream/get/1946/15343/37285/3/SS_MSthesis.pdf

tl;dr: You can just read the algorithm on page 21 ("largest triangle three buckets"). It was the one people liked the most.

Too hard to go look up page 21? Here's the gist:
You are dividing the samples into n buckets. For each bucket you pick a point to represent that bucket. Buckets to consider in making this decision are A (the previous bucket), B (the next bucket) and C (the current bucket). You'll need the picked point from the previous* bucket a, the average point from the next* bucket b and the samples in the current bucket C. You loop over the samples in the current bucket, each sample c, and calculate the area of the triangle formed by a, b and c. The largest triangle wins the prize of having its c selected as the representative point for the current bucket. Move on to the next bucket. Repeat.

* For the edges, basically use the first and last points. They use an extra bucket at the beginning with just the first point in it, and the same at the end.
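The description above translates fairly directly into code. This is a sketch written from that description, not taken from Pond; the Point type and function name are assumptions, and inputs asking for fewer than three points are returned without downsampling:

```typescript
type Point = { x: number; y: number };

// Downsample `data` to `threshold` points with largest-triangle-three-buckets.
function largestTriangleThreeBuckets(data: readonly Point[], threshold: number): Point[] {
    if (threshold >= data.length || threshold < 3) return [...data];

    const sampled: Point[] = [data[0]]; // first point is always kept
    const bucketSize = (data.length - 2) / (threshold - 2);
    let a = 0; // index of the point picked for the previous bucket

    for (let i = 0; i < threshold - 2; i++) {
        // Average point of the next bucket ("b" in the description).
        const nextStart = Math.floor((i + 1) * bucketSize) + 1;
        const nextEnd = Math.min(Math.floor((i + 2) * bucketSize) + 1, data.length);
        let avgX = 0;
        let avgY = 0;
        for (let j = nextStart; j < nextEnd; j++) {
            avgX += data[j].x;
            avgY += data[j].y;
        }
        avgX /= nextEnd - nextStart;
        avgY /= nextEnd - nextStart;

        // Pick the sample "c" in the current bucket that forms the largest
        // triangle with "a" and "b".
        const start = Math.floor(i * bucketSize) + 1;
        const end = Math.floor((i + 1) * bucketSize) + 1;
        let maxArea = -1;
        let picked = start;
        for (let j = start; j < end; j++) {
            const area = Math.abs(
                (data[a].x - avgX) * (data[j].y - data[a].y) -
                (data[a].x - data[j].x) * (avgY - data[a].y)
            ) / 2;
            if (area > maxArea) {
                maxArea = area;
                picked = j;
            }
        }
        sampled.push(data[picked]);
        a = picked;
    }

    sampled.push(data[data.length - 1]); // last point is always kept
    return sampled;
}
```

Because the choice is driven by triangle area, an isolated spike tends to survive downsampling, which is what makes the result look similar to the original path.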

pondjs not recognised after installing

I'm using the WebStorm IDE and have created an app with create-react-app. I installed pondjs using:
npm install pondjs --save ; it was installed, and I can see it in node_modules.
But in my App.js when I try to import it using: import { TimeSeries, TimeRange } from "pondjs";
the package is not recognised. Please help. I've been trying to solve this for a long time.

Last n minutes of a TimeSeries

Assuming I have a TimeSeries already created with the following range for example:

  1. Mon Nov 14 2016 06:30:00 GMT-0500 (Eastern Standard Time)
  2. Mon Nov 14 2016 08:00:00 GMT-0500 (Eastern Standard Time)

Is it possible to get a range of the last 15 minutes using pond? So the new range would be:

  1. Mon Nov 14 2016 07:45:00 GMT-0500 (Eastern Standard Time)
  2. Mon Nov 14 2016 08:00:00 GMT-0500 (Eastern Standard Time)
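The range itself is plain date arithmetic. A sketch, using a hypothetical lastMinutes helper on Date values rather than any Pond method:

```typescript
// Return the range covering the last `minutes` minutes of [begin, end].
// The result is clamped so it never starts before the original begin time.
function lastMinutes(begin: Date, end: Date, minutes: number): { begin: Date; end: Date } {
    const start = Math.max(begin.getTime(), end.getTime() - minutes * 60 * 1000);
    return { begin: new Date(start), end: new Date(end.getTime()) };
}

const last15 = lastMinutes(
    new Date("2016-11-14T06:30:00-05:00"),
    new Date("2016-11-14T08:00:00-05:00"),
    15
);
// last15.begin is 07:45:00 GMT-0500, last15.end is 08:00:00 GMT-0500
```

How the resulting begin and end are then applied to the series depends on the Pond version in use.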

Fill function

This is the case where you have a raw TimeSeries and you want to heal missing values. I think you want to be able to do a couple of things:

  • Filter out events with missing values in a fieldSpec - this is just a specialized filter operation
  • Replace missing values with a new value

In the case of the second function, probably fill(), there's several ways to replace missing values. We need to decide what the minimum set is to be useful to us, initially. Possibilities are:

  • fill with 0
  • fill with previous non-missing value (pad)
  • interpolate between non-missing values, based on timestamp

For each of these it would be useful to supply a limit. A limit of 2 would fill two missing values only. This is because we often want to patch small holes in TimeSeries, but if there's a substantial hole it makes sense sometimes (e.g. visualization), to keep it as missing data. If we missed an hour of data it should show that.

Note: this is being implemented in PyPond first.
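To make the three strategies and the limit concrete, here is a sketch on a single column. The names and the exact limit semantics are assumptions: zero and pad fill the first `limit` values of each gap, while linear only fills gaps that are no longer than `limit`. Timestamps are assumed to be strictly increasing.

```typescript
type Sample = { time: number; value: number | null };
type FillMethod = "zero" | "pad" | "linear";

function fill(samples: readonly Sample[], method: FillMethod, limit = Infinity): Sample[] {
    const out = samples.map(s => ({ ...s })); // never mutate the input
    let i = 0;
    while (i < out.length) {
        if (out[i].value !== null) {
            i++;
            continue;
        }
        // Find the run of missing values [i, j).
        let j = i;
        while (j < out.length && out[j].value === null) j++;
        const prev = i > 0 ? out[i - 1] : undefined;
        const next = j < out.length ? out[j] : undefined;
        const runLength = j - i;

        for (let k = i; k < j; k++) {
            const filledSoFar = k - i;
            if (method === "zero" && filledSoFar < limit) {
                out[k].value = 0;
            } else if (method === "pad" && prev && filledSoFar < limit) {
                out[k].value = prev.value;
            } else if (
                method === "linear" && prev && next &&
                prev.value !== null && next.value !== null &&
                runLength <= limit
            ) {
                // Interpolate on the timestamp, not on the position.
                const ratio = (out[k].time - prev.time) / (next.time - prev.time);
                out[k].value = prev.value + ratio * (next.value - prev.value);
            }
        }
        i = j;
    }
    return out;
}
```

Leading gaps have nothing to pad from and gaps at either edge have nothing to interpolate between, so in this sketch they stay missing for those two methods.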

Exposing more aggregate functions for merging timeseries

We already have a way to reduce timeseries on a reducer function like this

 avgSeries = TimeSeries.timeseriesListReduce({
                        name: 'avg',
                        seriesList: allTimeSeries,
                        fieldSpec: ['value'],
                        reducer: Event.avg
                    });

But from what I can see in the documentation and source code, only two aggregate functions are available in Event.js:

  1. Sum
  2. Avg

It would be a good idea to expose more aggregate functions for merging TimeSeries. In my use case I need min, max and other aggregate functions.

If this can't be done in the current implementation, I am also willing to contribute to this lib.
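The request amounts to letting the reducer be any function over the values that share a timestamp. A sketch on plain data, where SeriesPoint, Reducer and reduceSeriesList are hypothetical names and not the Pond API:

```typescript
type SeriesPoint = { time: number; data: Record<string, number> };
type Reducer = (values: readonly number[]) => number;

const sum: Reducer = values => values.reduce((acc, v) => acc + v, 0);
const avg: Reducer = values => sum(values) / values.length;
const min: Reducer = values => Math.min(...values);
const max: Reducer = values => Math.max(...values);

// Combine several series into one by reducing `field` across all points
// that share a timestamp.
function reduceSeriesList(
    seriesList: readonly (readonly SeriesPoint[])[],
    field: string,
    reducer: Reducer
): SeriesPoint[] {
    const grouped = new Map<number, number[]>();
    for (const series of seriesList) {
        for (const point of series) {
            const value = point.data[field];
            if (value === undefined) continue; // column missing on this point
            const bucket = grouped.get(point.time) ?? [];
            bucket.push(value);
            grouped.set(point.time, bucket);
        }
    }
    return Array.from(grouped.entries())
        .sort(([t1], [t2]) => t1 - t2)
        .map(([time, values]) => ({ time, data: { [field]: reducer(values) } }));
}
```

Any function with the Reducer signature can be passed in, so min, max or a percentile need no special support from the merge itself.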

Clarify Pipeline.windowBy() documentation

The docstrings for windowBy() don't match the code regarding the intervals, and are slightly vague in that a window OR a duration is passed in as the same arg. Needs some cleanup since this is a pretty crucial bit of functionality.

Real time timeseries?

Hello! Thanks for all your work on this great project. Coming from Python, I've been looking for a columnar timeseries library in Javascript and was super-pleased to find your project. Do you have any plans to add a timeseries implementation specialized for real-time datastreams? I ask because some of the aspects of the realtime example seem like they could be abstracted out and made into an efficient, immutable OnlineTimeSeries extending the existing API. I'm imagining this would also include:

  • Sliding window with a fixed capacity (based on Immutable ... somehow ... instead of a circular buffer), dropping oldest elements.
  • Support out-of-order writes to a TimeSeries. I think the current constructor errors if the events have out-of-order timestamps? This is tangentially related to #41. It would be nice to be able to relax assumptions about Events timestamps for this use case.
  • ....

In my own project (alas, currently private) I originally implemented something based on Pipeline and UnboundedIn (see below), but the implementation I ultimately used just creates a new TimeSeries from an immutable Collection update.

 let eventSource = new UnboundedIn();
 let collection = new Collection();
 let timeseries = new TimeSeries({name: "test", collection: collection});
 let pipeline = new Pipeline()
  .from(eventSource)
  .to(EventOut,
       event =>
               { collection = collection.addEvent(event);
                 timeseries = new TimeSeries({name: "test",
                                              collection: collection});});

Support groupings in processing code

The processing code currently supports aggregation, binning and collections. An event stream is passed between these and is processed irrespective of the event content/source. It's up to the code feeding the event stream to split a larger collection of incoming events across separate aggregators etc.

A solution to this is to implement a Grouper. The grouper would take a stream of events and classify them based on column name, a list of column names, or a function. Classification would be a string, essentially a key, that would be stored on the event itself. (Since we're talking about immutable objects here, an incoming Event would result in a different Event with the key added (though the internal data reference would be shared)). At any rate that Event could then be fed into an Aggregator and be bucketed with Events that had the same key, and so on down the processing pipeline.
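A rough sketch of that Grouper idea. GroupedEvent, classify, withKey and bucketByKey are hypothetical names, and the "::" separator for multi-column keys is an arbitrary choice:

```typescript
type GroupedEvent = {
    time: number;
    data: Readonly<Record<string, string | number>>;
    key?: string;
};
type GroupBy = string | readonly string[] | ((event: GroupedEvent) => string);

// Derive the grouping key from a column name, a list of column names,
// or a user supplied function.
function classify(event: GroupedEvent, groupBy: GroupBy): string {
    if (typeof groupBy === "function") return groupBy(event);
    if (typeof groupBy === "string") return String(event.data[groupBy]);
    return groupBy.map(column => String(event.data[column])).join("::");
}

// Returns a new event carrying the key. The original is left untouched and
// the `data` reference is shared rather than copied.
function withKey(event: GroupedEvent, groupBy: GroupBy): GroupedEvent {
    return { ...event, key: classify(event, groupBy) };
}

// Downstream stage: bucket keyed events so each group can be aggregated
// independently.
function bucketByKey(events: readonly GroupedEvent[], groupBy: GroupBy): Map<string, GroupedEvent[]> {
    const buckets = new Map<string, GroupedEvent[]>();
    for (const event of events) {
        const keyed = withKey(event, groupBy);
        const key = keyed.key ?? "";
        const bucket = buckets.get(key) ?? [];
        bucket.push(keyed);
        buckets.set(key, bucket);
    }
    return buckets;
}
```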

Build quantiles from a Collection

We'd like to at least support getting the quartiles back, since we can use those for visualization, but a general function would look like this:
Collection.quantile(4) // array of 3 values that divide the collection's values into 4 subsets

A Collection.percentile(95) function would also be useful. It could use quantile internally.

We need this for doing the react-timeseries-chart horizontal barcharts properly. We could also build box plot style visualizations with this.
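A sketch of the two functions on a plain array of numbers. It assumes linear interpolation between the closest ranks, which is one of several common quantile definitions; the issue does not say which one is wanted:

```typescript
// Value at a fractional position (0..1) in an already sorted array, using
// linear interpolation between the two closest ranks.
function valueAtFraction(sorted: readonly number[], fraction: number): number {
    const position = (sorted.length - 1) * fraction;
    const lower = Math.floor(position);
    const upper = Math.ceil(position);
    const weight = position - lower;
    return sorted[lower] * (1 - weight) + sorted[upper] * weight;
}

// quantile(values, 4) returns the 3 cut points that divide the values into
// 4 subsets (the quartiles).
function quantile(values: readonly number[], n: number): number[] {
    if (values.length === 0 || n < 2) return [];
    const sorted = [...values].sort((a, b) => a - b);
    const cuts: number[] = [];
    for (let i = 1; i < n; i++) {
        cuts.push(valueAtFraction(sorted, i / n));
    }
    return cuts;
}

// percentile(values, 95) returns the value below which roughly 95% of the
// values fall. It shares the interpolation helper with quantile().
function percentile(values: readonly number[], q: number): number {
    if (values.length === 0) throw new Error("percentile of an empty collection");
    const sorted = [...values].sort((a, b) => a - b);
    return valueAtFraction(sorted, q / 100);
}
```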

Sliding windows

We have fixed windows, and the pipeline code was refactored several months ago to be able to do sliding window in theory. If we had that we could do rolling averages etc.

There are two types of sliding windows:

  1. Sliding window: stores n points, as in a ring buffer, adding new points at the front and discarding the same number of old points out the back, while emitting an aggregation as it goes. There's no concept of discarding the window here, it just keeps rolling, but you might want to emit only every so many events. Basically something like "I want an n=30 event window, and every 5 events I want to emit a result".

  2. Sliding time window: the other type of window would have a duration like "1h" and an emit rate like every "30s". New events are added to the front, and old events are discarded once they are older than 1h. As each event crosses a 30s boundary (at 1:30, 2:00, 2:30, etc.) the window would do its aggregation.
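The two window types can be sketched as follows. This is not pipeline code: it assumes plain numeric values, millisecond timestamps arriving in time order, and hypothetical names (slidingCountWindow, slidingTimeWindow). It also assumes that the event which crosses an emit boundary is included in that emission, which the description above leaves open:

```javascript
const average = (values) => values.reduce((a, b) => a + b, 0) / values.length;

// Type 1: keep the last `size` values, emit an aggregate every `emitEvery` events.
function slidingCountWindow(size, emitEvery, aggregate, emit) {
  const buffer = [];
  let seen = 0;
  return function add(value) {
    buffer.push(value);
    if (buffer.length > size) buffer.shift(); // discard the oldest point
    seen += 1;
    if (seen % emitEvery === 0) emit(aggregate(buffer));
  };
}

// Type 2: keep values newer than `durationMs`, emit when an incoming event
// falls into a later `emitEveryMs` slot than the previous event did.
function slidingTimeWindow(durationMs, emitEveryMs, aggregate, emit) {
  let buffer = [];
  let lastSlot = null;
  return function add(time, value) {
    buffer.push({ time, value });
    buffer = buffer.filter((e) => e.time > time - durationMs);
    const slot = Math.floor(time / emitEveryMs);
    if (lastSlot !== null && slot > lastSlot) {
      emit(aggregate(buffer.map((e) => e.value)));
    }
    lastSlot = slot;
  };
}

// An n=3 event window that emits every 2 events.
const countResults = [];
const addValue = slidingCountWindow(3, 2, average, (r) => countResults.push(r));
[1, 2, 3, 4, 5, 6].forEach(addValue); // countResults is [1.5, 3, 5]
```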

Derivative pipeline function

Output new events which are the difference of each pair of events. The output would likely be a TimeRangeEvent.
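A small sketch of that idea over plain { time, value } objects, with a hypothetical derivative function. Each output spans the two input timestamps, standing in for a TimeRangeEvent. It emits the raw difference as described above. Dividing by the elapsed time to get a rate would be a separate design decision:

```javascript
// Hypothetical sketch: one output per consecutive pair of input events.
function derivative(events, column = "value") {
  const out = [];
  for (let i = 1; i < events.length; i++) {
    const prev = events[i - 1];
    const next = events[i];
    out.push({
      begin: prev.time, // start of the range covered by this pair
      end: next.time,   // end of the range covered by this pair
      [column]: next[column] - prev[column],
    });
  }
  return out;
}

const deltas = derivative([
  { time: 0, value: 1 },
  { time: 1000, value: 4 },
  { time: 2000, value: 2 },
]);
// deltas[0] is { begin: 0, end: 1000, value: 3 }
```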

Bug in Taker processor

The count initialization logic in the Taker causes the first event in the TimeSeries to be skipped. To wit: if these were the first 11 values in a set:

0 1 2 3 4 5 6 7 8 9 10

then .take(10) would yield 1-10 rather than 0-9.
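For reference, the expected behaviour can be written as a tiny stand-alone processor. This is not the library's Taker code, only a hypothetical sketch in which the counter starts at zero and is incremented after each emit, so the first event is passed through:

```javascript
// Hypothetical sketch of a take(limit) processor, not Pond's Taker.
function makeTaker(limit, emit) {
  let count = 0; // number of events emitted so far
  return function addEvent(event) {
    if (count < limit) {
      emit(event);
      count += 1;
    }
  };
}

const taken = [];
const take10 = makeTaker(10, (e) => taken.push(e));
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10].forEach(take10); // taken is 0 through 9
```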

Support de-duplication of data

A collection can have duplicates of events, at the same timerange. It would be nice if we could remove those. As part of this, remove bisect etc from the Collection and move it up to TimeSeries. Then provide a method on Collection to get a list of values at a given time.
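One possible shape for both pieces, sketched over plain { time, value } objects with hypothetical function names. Keeping the last event seen for each timestamp is an assumption. Keeping the first, or merging the duplicates, would be equally valid choices:

```javascript
// Keep one event per timestamp. Later duplicates replace earlier ones, while
// the position of the first occurrence is preserved (Map insertion order).
function dedup(events) {
  const latest = new Map();
  for (const e of events) latest.set(e.time, e);
  return [...latest.values()];
}

// All events recorded at exactly the given time.
function atTime(events, time) {
  return events.filter((e) => e.time === time);
}

const withDuplicates = [
  { time: 1, value: "a" },
  { time: 2, value: "b" },
  { time: 1, value: "c" },
];
const unique = dedup(withDuplicates); // two events, at times 1 and 2
```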

See also #44, #46

Bug in index TimeRange generation?

Case in util.js appears to not be toggling UTC vs. local:

            // A year e.g. 2015
            case 1:
                const year = parts[0];
                beginTime = isUTC ? moment.utc([year]) :
                                    moment.utc([year]);
                endTime = isUTC ? moment.utc(beginTime).endOf("year") :
                                  moment(beginTime).endOf("year");
                break;

Column renaming

Add ability to rename a column on a TimeSeries. This would apply to TimeSeries, Collections and would also be a pipeline renamer() processor.

Note: This is being implemented on the PyPond side. The task here will be to translate it over and add appropriate tests.

Reference: esnet/pypond#3

Can't use avg() aggregate functions

When I try to use an aggregate function to reduce a list of TimeSeries with TimeSeries.timeseriesListReduce, it fails at run time with an error inside the library:
TypeError: this._type is not a constructor

I'm trying to reduce multiple TimeSeries with the same time range and different values:

const data = {
    name: label,
    columns: ['time', 'value'],
    events: [['timestamp', 'value'],
             ['timestamp', 'value'], ....]
};
allTimeSeries.push(new TimeSeries(data));
const sums = TimeSeries.timeseriesListReduce({
    seriesList: allTimeSeries,
    reducer: avg(),
    fieldSpec: ['value']
});

Development Environment : @angular 2.4.3
IDE: Visual Studio code

Add crop to TimeSeries

You should be able to directly crop a TimeSeries given a TimeRange. Right now you have to bisect and then slice.
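A sketch of what a crop could do internally, assuming a time-ordered array of plain { time, value } objects. The helper firstIndexWhere is a hypothetical binary search and does not reproduce the semantics of the existing bisect method. The crop here keeps events with begin <= time <= end:

```javascript
// Binary search for the first index whose event satisfies `pred`.
// `pred` must be false for a prefix of the array and true for the rest.
function firstIndexWhere(events, pred) {
  let lo = 0;
  let hi = events.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (pred(events[mid])) hi = mid;
    else lo = mid + 1;
  }
  return lo;
}

// Find both edges with a bisect-style search, then slice once.
function crop(events, begin, end) {
  const from = firstIndexWhere(events, (e) => e.time >= begin);
  const to = firstIndexWhere(events, (e) => e.time > end);
  return events.slice(from, to);
}

const series = [0, 10, 20, 30, 40].map((time) => ({ time, value: time * 2 }));
const cropped = crop(series, 10, 30); // events at 10, 20 and 30
```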

Collection constructor incomplete logic

In the case of an Immutable.List being passed as an argument to a Collection object, _check() is not triggered which can lead to this._type never being defined. In the other cases, _check() is run or _type is copied manually.

        if (!arg1) {
            this._eventList = new Immutable.List();
        } else if (arg1 instanceof Collection) {
            const other = arg1;
            // copyEvents is whether to copy events from other, default is true
            if (_.isUndefined(copyEvents) || copyEvents === true) {
                this._eventList = other._eventList;
                this._type = other._type;
            } else {
                this._eventList = new Immutable.List();
            }
        } else if (_.isArray(arg1)) {
            const events = [];
            arg1.forEach(e => {
                this._check(e);
                events.push(e._d);
            });
            this._eventList = new Immutable.List(events);
        } else if (Immutable.List.isList(arg1)) {
            this._eventList = arg1;
        }
    }

Add accumulator to pipeline

I've stumbled across this project, and I'm doing a bunch of testing with it as I think it could help us solve a couple technical challenges.

Two questions:

  1. Do you have the math referenced somewhere for the linear fill function? Sometimes it behaves like I would expect it to, and other times it does not. Likewise for the align method?
  2. Is there a way to handle running totals? For instance, suppose you had the following timeseries:
var data = {
	name: 'production',
	columns: ['time','units'],
	points: [
		[1400425947000, 5],
		[1400425948000, 3],
		[1400425949000, 4],
		[1400425950000, 1],
		[1400425950000, 9]
	]
};

And I wanted to find the running total for each time point. I'm envisioning something like this:

var ts = new TimeSeries(data);
var rt = ts.runningTotal(['units']);

/** console.log(rt) */
rt = [
	{time: 1400425947000, units: 5, total: 5},
	{time: 1400425948000, units: 3, total: 8},
	{time: 1400425949000, units: 4, total: 12},
	{time: 1400425950000, units: 1, total: 13},
	{time: 1400425950000, units: 9, total: 22},
];

It would be simple to write a function for this outside of the library, but it would be ideal to be able to put it into the pipeline of functions.
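For comparison, the stand-alone version is only a few lines. This sketch works on plain objects rather than a TimeSeries, and the name runningTotal and the "total" output column simply mirror the example above:

```javascript
// Adds a cumulative sum of `column` to each point, returning new objects.
function runningTotal(points, column, outColumn = "total") {
  let total = 0;
  return points.map((p) => {
    total += p[column];
    return { ...p, [outColumn]: total };
  });
}

const production = [
  { time: 1400425947000, units: 5 },
  { time: 1400425948000, units: 3 },
  { time: 1400425949000, units: 4 },
  { time: 1400425950000, units: 1 },
  { time: 1400425950000, units: 9 },
];
const totals = runningTotal(production, "units").map((p) => p.total);
// totals is [5, 8, 12, 13, 22]
```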

Thanks, keep up the great work!

Support for sub-millisecond precision?

Are there any plans to support sub-millisecond precision? The time series datasets I work with are stamped at microsecond and nanosecond levels. Unfortunately, millis are too coarse for the rollups and charting I would like to perform.

Thanks!

TimeSeries.timerange() is slow

While not actually super slow, it does show up in some of the charts profiling. This is unnecessarily slow because this._collection doesn't have any guarantees of order, so it will check all the contained events to determine the timerange of the timeseries, whereas at the timeseries level we know this can just be obtained directly from the first and last event.

    timerange() {
        return this._collection.range();
    }
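The difference between the two approaches, sketched over plain { time } objects with hypothetical function names. The ordered version assumes timestamped events sorted by time. For events that carry their own time ranges, the last event does not necessarily have the latest end, so that case would need more care:

```javascript
// What an unordered collection has to do: visit every event.
function scanRange(events) {
  if (events.length === 0) return undefined;
  let begin = Infinity;
  let end = -Infinity;
  for (const e of events) {
    begin = Math.min(begin, e.time);
    end = Math.max(end, e.time);
  }
  return { begin, end };
}

// What an ordered series can do: read the first and last event only.
function orderedRange(events) {
  if (events.length === 0) return undefined;
  return { begin: events[0].time, end: events[events.length - 1].time };
}

const ordered = [{ time: 3 }, { time: 7 }, { time: 12 }];
const fastRange = orderedRange(ordered); // { begin: 3, end: 12 }
```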

Consistent API for fieldSpecs

This task is to unify the use of fieldSpecs and fieldSpecLists across the API. Some high level objectives:

  1. A fieldSpec is a reference to a value within an Event. It is used to get a value out of an Event, using the get(fieldSpec) function.
  2. A field can be a deep value, with path segments separated by a period (.), e.g. e.get("my.deep.value").
  3. Some API methods, especially on the Pipelines, may expect a list of fieldSpecs. These should be called a fieldSpecList.
  4. Low level access also provides access to deep values with an array. This is because putting string parsing (the dot notation) in the inner loop of getting values out of events is slow. We probably need a name for this, e.g. fieldPathArray or something to separate it from fieldSpecList.
  5. There should be one function that takes either an array or string and returns the array. i.e. sanitizes. This can be used by higher level functions to parse the string once, then pass down the fieldPathArray (which the immutable.js code can use to get the deep value efficiently).
  6. Default for a fieldSpec is "value".
  7. Default for a fieldSpecList is ["value"] unless such a default makes no sense. For example if the function requires two fields, then this default won't work.
  8. If the method is on a TimeSeries, then columns is an alias for fields. It might make sense to call them columnSpecs and columnSpecLists? Internal to the TimeSeries code references to fieldSpecs are fine, and may come from a column name originally.
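Points 4 to 6 could look roughly like this. The names toFieldPathArray and getDeep are placeholders, since the list above has not settled on a name, and plain nested objects stand in for the immutable.js data:

```javascript
// Accepts a dotted string or an array and always returns the path array.
// The default fieldSpec is "value", as in point 6.
function toFieldPathArray(fieldSpec = "value") {
  if (Array.isArray(fieldSpec)) return fieldSpec;
  if (typeof fieldSpec === "string") return fieldSpec.split(".");
  throw new TypeError("fieldSpec must be a string or an array of strings");
}

// Inner-loop lookup that takes the already parsed path, so no string
// parsing happens per event.
function getDeep(data, fieldPathArray) {
  let current = data;
  for (const key of fieldPathArray) {
    if (current === null || current === undefined) return undefined;
    current = current[key];
  }
  return current;
}

// Parse once, then reuse the path for every event.
const path = toFieldPathArray("my.deep.value");
const deepValues = [{ my: { deep: { value: 3 } } }, { my: { deep: { value: 4 } } }]
  .map((data) => getDeep(data, path)); // [3, 4]
```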

Notes:

TODO items:

  • Catalog inconsistencies based on the above notes in the Notes section
  • Assemble a list of TODO items to complete this task

Abstract out the running of batch pipelines

The process of creating the execution graph from the pipeline is handled by the runner. The intention here is to extend that so that:

  1. You can control what runner you want, but the current default would remain
  2. The runner will generate the actual execution nodes, not just clone the pipeline dag
  3. The runner can be imported by the user, e.g. import runner from "pond-node-runner"

The idea here is to allow different types of execution depending on environment. For instance a Node runner could take advantage of parallel.js or something similar.

We can use https://github.com/esnet/morbius for this now.

Verify column/point ordering in TimeSeries

There might be a situation where the ordering of the event columns from .columns() does not align with the order of the values in the points produced by .to_point(). This was solved in the Python code by allowing .to_point() to take an optional list of columns, so that the point list is returned in a specific order.
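A JavaScript equivalent of that approach might look like the following sketch. The function name toPoint and the { time, data } event shape are assumptions made for illustration:

```javascript
// Returns [time, ...values]. When `columns` is given, values follow that
// order. Otherwise they follow the key order of the event's data object.
function toPoint(event, columns) {
  const cols = columns || Object.keys(event.data);
  return [event.time, ...cols.map((c) => event.data[c])];
}

const trafficEvent = { time: 1, data: { out: 2, in: 1 } };
const implicitOrder = toPoint(trafficEvent);                // [1, 2, 1]
const explicitOrder = toPoint(trafficEvent, ["in", "out"]); // [1, 1, 2]
```

Passing the same column list that .columns() reports removes any dependence on how the event data happens to be ordered.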
