data-forge / data-forge-ts Goto Github PK

View Code? Open in Web Editor NEW

1.3K 25.0 76.0 3.35 MB

The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.

Home Page: http://www.data-forge-js.com/

License: MIT License

TypeScript 99.58% Batchfile 0.01% JavaScript 0.42%

data-wrangling data-forge data data-analysis javascript nodejs linq pandas visualization data-visualization

data-forge-ts's Introduction

Data-Forge

The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.

Implemented in TypeScript.
Used in JavaScript ES5+ or TypeScript.

To learn more about Data-Forge visit the home page.

Read about Data-Forge for data science in the book JavaScript for Data Science.

Love this? Please star this repo and click here to support my work

Please note that this TypeScript repository replaces the previous JavaScript version of Data-Forge.

BREAKING CHANGES

As of v1.6.9 the dependencies Sugar, Lodash and Moment have been factored out (or replaced with smaller dependencies). This more than halves the bundle size. Hopefully this won't cause any problems - but please log an issue if something changes that you weren't expecting.

As of v1.3.0 file system support has been removed from the Data-Forge core API. This is after repeated issues from users trying to get Data-Forge working in the browser, especially under AngularJS 6.

Functions for reading and writing files have been moved to the separate code library Data-Forge FS.

If you are using the file read and write functions prior to 1.3.0 then your code will no longer work when you upgrade to 1.3.0. The fix is simple though, where usually you would just require in Data-Forge as follows:

const dataForge = require('data-forge');

Now you must also require in the new library as well:

const dataForge = require('data-forge');
require('data-forge-fs');

Data-Forge FS augments Data-Forge core so that you can use the readFile/writeFile functions as in previous versions and as is shown in this readme and the guide.

If you still have problems with AngularJS 6 please see this workaround: #3 (comment)

Install

To install for Node.js and the browser:

npm install --save data-forge

If working in Node.js and you want the functions to read and write data files:

npm install --save data-forge-fs

Quick start

Data-Forge can load CSV, JSON or arbitrary data sets.

Parse the data, filter it, transform it, aggregate it, sort it and much more.

Use the data however you want or export it to CSV or JSON.

Here's an example:

const dataForge = require('data-forge');
require('data-forge-fs'); // For readFile/writeFile.

dataForge.readFileSync('./input-data-file.csv') // Read CSV file (or JSON!)
    .parseCSV()
    .parseDates(["Column B"]) // Parse date columns.
    .parseInts(["Column B", "Column C"]) // Parse integer columns.
    .parseFloats(["Column D", "Column E"]) // Parse float columns.
    .dropSeries(["Column F"]) // Drop certain columns.
    .where(row => predicate(row)) // Filter rows.
    .select(row => transform(row)) // Transform the data.
    .asCSV() 
    .writeFileSync("./output-data-file.csv"); // Write to output CSV file (or JSON!)

From the browser

Data-Forge has been tested with Browserify and Webpack. Please see links to examples below.

If you aren't using Browserify or Webpack, the npm package includes a pre-packed browser distribution that you can install and included in your HTML as follows:

<script language="javascript" type="text/javascript" src="node_modules/data-forge/dist/web/index.js"></script>

This gives you the data-forge package mounted under the global variable dataForge.

Please remember that you can't use data-forge-fs or the file system functions in the browser.

Features

Import and export CSV and JSON data and text files (when using Data-Forge FS).
Or work with arbitrary JavaScript data.
Many options for working with your data:
- Filtering
- Transformation
- Extracting subsets
- Grouping, aggregation and summarization
- Sorting
- And much more
Great for slicing and dicing tabular data:
- Add, remove, transform and generate named columns (series) of data.
Great for working with time series data.
Your data is indexed so you have the ability to merge and aggregate.
Your data is immutable! Transformations and modifications produce a new dataset.
Build data pipeline that are evaluated lazily.
Inspired by Pandas and LINQ, so it might feel familiar!

Contributions

Want a bug fixed or maybe to improve performance?

Don't see your favourite feature?

Need to add your favourite Pandas or LINQ feature?

Please contribute and help improve this library for everyone!

Fork it, make a change, submit a pull request. Want to chat? See my contact details at the end or reach out on Gitter.

Platforms

Node.js (npm install --save data-forge data-forge-fs) (see example here)
Browser
- Via bower (bower install --save data-forge) (see example here)
- Via Browserify (see example here)
- Via Webpack (see example here)

Documentation

Resources

Contact

Please reach and tell me what you are doing with Data-Forge or how you'd like to see it improved.

Twitter: @codecapers
Email: [email protected]
Linkedin: www.linkedin.com/in/ashleydavis75
Web: www.codecapers.com.au

Support the developer

Click here to support the developer.

data-forge-ts's People

Contributors

Stargazers

Watchers

data-forge-ts's Issues

Coerce series data types?

Apologies in advance as I'm pretty inexperienced with JavaScript!

I had an issue reading a CSV file where columns of floats were being read in as strings. This proved frustrating when trying to sum() a Series, as it just returned one big concatenated string.

It might be useful to add a method to the Series class to change the data type, similar to pandas.

Maybe something akin this, which I just wrote for a project, though I'd assume you want to lazily evaluate this as well:

/*
 * Returns a new Series whose values are coerced to the specified data type
*/
function coerce_data_type(series,data_type) {
    var coerced_series;
    switch (data_type) {
        case 'number':
            coerced_series = series.toArray().map(value => Number(value));
            break;
        case 'boolean':
            coerced_series = series.toArray().map(value => Boolean(value));
            break;
        case 'string':
            coerced_series = series.toArray().map(value => String(value));
            break;
        default:
            // TODO - error handling
            coerced_series = series.toArray();
            break;
    }
    return new dataForge.Series(coerced_series);
}

df.summarize is not a function

I'm looking at df.pivot(), which I have working, and as recommended by the docs, I tried looking at df.summarize() but when I run the code I get TypeError df.summarize is not a function

Renaming columns

Is there a better way to rename columns in a DataFrame than the following code? I could not find any information for renaming columns in the Guide or API doc. In this code, I'm trying to rename columns "Like This" to "like_this".

slugify = (s) => s.toLowerCase().replace(' ','_').replace('/','_or_')

rawData = 
    DataForge.readFileSync('data.csv')
    .parseCSV({ dynamicTyping: true })

for(let name of rawData.getColumnNames()) {
    let s = rawData.getSeries(name)
    rawData = rawData.dropSeries(name).withSeries(slugify(name), s)
}

Transform Wide-to-Long and Long-to-Wide

A feature request for consideration ... including transforms for wide-to-long and long-to-wide data structures. For example: https://sejdemyr.github.io/r-tutorials/basics/wide-and-long/

Thanks for considering.

Cumulative sum

Panda has cumsum for its series. What's the equivalent in data forge? It doesn't seem like it exists upon first look, what's the recommended workaround?

Pandas value_counts() equivalent

Can you tell me what's the equivalent of Pandas value_counts() in data-forge?

toCSV after joinOuter running very slow

Ashley, thanks for your awesome work on everything. I'm new to JavaScript and I'm not sure if my issue is related to something I'm doing or if there's an issue.

I'm having an issue with writing to CSV after performing an outer join. I've been able to verify that my data frames are being created. When I display head as pictured below the process is a little slow, but writing to CSV takes minutes to complete. I originally thought it wasn't writing, but it does seem to write after some period of time. Additionally, the script seems to hang and I don't return to the command line.

`const exceptDF = cleanRev.joinOuter(cleanSF,
    cleanRev => cleanRev.sfKey,
    cleanSF => cleanSF.sfKey,
    (cleanRev, cleanSF) => {
        return {
            index: cleanRev ? cleanRev.sfKey : cleanSF.sfkey,
            swanKey: cleanRev ? cleanRev.sfKey : undefined,
            sfKey: cleanSF ? cleanSF.sfKey : undefined
        };
    }
);

console.log(exceptDF.head(3).toString());  //this works, but it's slow

exceptDF.asCSV().writeFile('exceptDF.csv');  //this writes the file, but it takes several minutes and I don't return to the command line in the terminal`

For reference, I'm loading two csv files to different data frames, doing some manipulation, and then performing the outer join. Each file has around 4,000 rows and 20 columns.

Thanks for any input.

Write DataFrame as CSV file without headers.

Readme suggests the following code to write a DataFrame out as a CSV file.

  dataFrame
    .asCSV()
    .writeFileSync("output.csv");

But this method doesn't allow any option to set whether we want to include the headers or not.

Looking at the source code I see there's a DataFrame.toCSV(options?: ICSVOutputOptions) which takes an options argument which can be used to set header: false and exclude the headers.

It's not clear why there are 2 different methods, toCSV() and asCSV() (which is actually in data-forge-fs). My guess is that the first one is used to just get a string representation and the former is optimized to be written out as a file.

Assuming my guess is correct, shouldn't asCSV() also accept an ICSVOutputOptions parameter?

Streaming on rows as they are parsed

Hello,

thanks for this awesome project that hopefully will help me to dump python.

I'm using it to read a csv file, cleaning the data and inserting into postgres.
To deal with huge csv file, I'd like to pipe the rows as they are parsed into postgres without having to keep the full dataframe into memory.
How would you do that ? Using through ?
Thanks

Project setup failed

Just installed DFN, tried to run the intro notebook and a big red dialog announced:

Project setup failed, see log file for full details.
Error: spawn[1]: npm ERR! code E403
npm ERR! 403 Forbidden: @types/request-promise@^4.1.44
npm ERR! A complete log of this run can be found in:
npm ERR! /Users/rwilliams/Library/Application Support/data-forge-notebook/npm-cache/_logs/2019-11-15T05_51_22_276Z-debug.log
Error code: 1

Relevant log entry:
24 verbose stack Error: 403 Forbidden: @types/request-promise@^4.1.44
24 verbose stack at fetch.then.res (/Applications/Data-Forge Notebook.app/Contents/nodejs/v10.15.3/lib/node_modules/npm/node_modules/pacote/lib/fetchers/registry/fetch.js:42:19)
24 verbose stack at tryCatcher (/Applications/Data-Forge Notebook.app/Contents/nodejs/v10.15.3/lib/node_modules/npm/node_modules/bluebird/js/release/util.js:16:23)
24 verbose stack at Promise._settlePromiseFromHandler (/Applications/Data-Forge Notebook.app/Contents/nodejs/v10.15.3/lib/node_modules/npm/node_modules/bluebird/js/release/promise.js:512:31)
24 verbose stack at Promise._settlePromise (/Applications/Data-Forge Notebook.app/Contents/nodejs/v10.15.3/lib/node_modules/npm/node_modules/bluebird/js/release/promise.js:569:18)
24 verbose stack at Promise._settlePromise0 (/Applications/Data-Forge Notebook.app/Contents/nodejs/v10.15.3/lib/node_modules/npm/node_modules/bluebird/js/release/promise.js:614:10)
24 verbose stack at Promise._settlePromises (/Applications/Data-Forge Notebook.app/Contents/nodejs/v10.15.3/lib/node_modules/npm/node_modules/bluebird/js/release/promise.js:693:18)
24 verbose stack at Async._drainQueue (/Applications/Data-Forge Notebook.app/Contents/nodejs/v10.15.3/lib/node_modules/npm/node_modules/bluebird/js/release/async.js:133:16)
24 verbose stack at Async._drainQueues (/Applications/Data-Forge Notebook.app/Contents/nodejs/v10.15.3/lib/node_modules/npm/node_modules/bluebird/js/release/async.js:143:10)
24 verbose stack at Immediate.Async.drainQueues [as _onImmediate] (/Applications/Data-Forge Notebook.app/Contents/nodejs/v10.15.3/lib/node_modules/npm/node_modules/bluebird/js/release/async.js:17:14)
24 verbose stack at runCallback (timers.js:705:18)
24 verbose stack at tryOnImmediate (timers.js:676:5)
24 verbose stack at processImmediate (timers.js:658:5)
25 verbose cwd /Applications/Data-Forge Notebook.app/Contents/examples
26 verbose Darwin 18.7.0
27 verbose argv "/Applications/Data-Forge Notebook.app/Contents/nodejs/v10.15.3/bin/node" "/Applications/Data-Forge Notebook.app/Contents/nodejs/v10.15.3/lib/node_modules/npm/bin/npm-cli.js" "install" "--bin-links" "false" "--scripts-prepend-node-path" "true" "--cache" "/Users/rwilliams/Library/Application Support/data-forge-notebook/npm-cache" "--prefer-offline" "--no-audit"
28 verbose node v10.15.3
29 verbose npm v6.4.1
30 error code E403
31 error 403 Forbidden: @types/request-promise@^4.1.44
32 verbose exit [ 1, true ]

DataFrame distinct with multiple column does not work

I am trying to use distinct with multiple column select on DataFrame. But I am not getting distinct rows in the result. Distinct with single column select on DataFrame works fine. Example
df.distinct(row => [row.columnA, row.columnB]).toArray()
even tried this
df.distinct(row => ({ columnA: row.columnA, columnB: row.columnB }).toArray()
Both returns same number of rows as original dataframe, was expecting distinct rows from the original dataframe

Single column select
df.distinct(row => row.columnA).toArray() works perfectly fine

I know I am missing something here on the multi-column selector. Could someone help. Thanks in advance.

Pandas melt() alternative?

I'm trying to find how to do something like the pandas melt() method.

Basically I need to go from this:

|Date|A|B|C|D|E |
|----|-|-|-|-|--|
|2005|1|2|3|4|50|
|2006|6|7|8|9|10|

Doing something like:
df.melt("Date", { varName: "X", valueName: "Val" })

And get this:

|Date|X|Val|
|----|-|---|
|2005|A|1  |
|2005|B|2  |
|2005|C|3  |
etc 
|2006|A|6  |
etc
|2006|E|10 |

JSON5 file / JSON object support

Would it be possible for this package to support .json5 files?
I'm trying to use DF on a JSON5 file but even tho I already have a module that turns that file to a JSON object, there's no way for me to use this package to read either the file or read from the object.
Sticking to JSON isn't an option (as comments are needed).

A workaround would be to save the JSON object to a JSON file and read it again but I'm trying to minimise I/O operations and resources used, so it's not a good option.

Getting data out of a dataframe can be slow (toArray, toPairs, etc)

Hi, this is related to #11. I'm opening a new issue since I don't have permission to reopen the previous one and I'm not sure if the issue is in toPairs or the use of fillGaps and rollingWindow. The issue is very slow performance with from toPairs. Copying from my comment in the other issue.

I just pushed up a change to our test repo. The changes are:

Update package.json to use the most recent version of data-forge
Slightly change the reported timings to make it more clear where the performance issue happens. Specifically, the slow down looks like it's coming out of the call to toPairs().

The tests I'm looking at are method-1.js and method-2.js. The only difference between them is:

$ diff method-1.js method-2.js
76c76
< const mySeries = dfWithoutGaps.getSeries('value');
---
> const mySeries = new dataForge.Series(dfWithoutGaps.getSeries('value').toArray());

Output from running the tests:

cberthiaume@slow-lane:~/data-forge-performance-test-issue-11$ node method-1.js
Time to require: 980.9740000000002
Time to create DataFrame and getSeries: 8.874000000000024
Time for rolling window: 0.08599999999978536
Time for toPairs: 2067.8320000000003
cberthiaume@slow-lane:~/data-forge-performance-test-issue-11$ node method-2.js
Time to require: 975.4
Time to create DataFrame and getSeries: 63.06100000000015
Time for rolling window: 0.10500000000001819
Time for toPairs: 17.16599999999994
cberthiaume@slow-lane:~/data-forge-performance-test-issue-11$

The key difference is the huge difference in time to call toPairs(). Our use case requires us to call toPairs() and the only way to get acceptable performance when doing that is to recreate the series as you see in the diff above. However, our needs have changed that the slow down required to implement this workaround is becoming a bottleneck. Is there a a better way to get good performance from toPairs() without using this workaround? Should I open a separate ticket to track this?

Thanks again for all your help.

TypeError: data_forge__WEBPACK_IMPORTED_MODULE_8__.readFileSync is not a function

When using const df=dataForge.readFile('SalesJan2009.csv').parseCSV(); in the render method of a ReactJS class the browser shows an error

TypeError: data_forge__WEBPACK_IMPORTED_MODULE_8__.readFileSync is not a function

Window functions do not preserve the original index

Unlike Pandas, which would always preserve the original index, data-forge seems to always create a new one, unless a few boilerplate transformations get applied. Based on a couple of years of work with Pandas, I personally find this a lot of unnecessary work. Almost all the time, when creating a new series out of a windowed version of an existing one, I would attach it as a new column to the existing data frame. This makes it easy to inspect the entire data frame down the line, comparing the values in every row.

In my opinion, preserving the original index is essential for the ease of work with data-forge. Of course, this would also require handling of NaNs, their cleaning, etc.

I would be interested to know what you think about this.

[newbie question] Modus-operandi of a simple filtering case of DF needs some education

Hello,

I have a very basic problem. As I am seasoned Pandas (python) programmer, I wanted to try out data-forge DF the moment I saw this one. Thank you so much for bringing this wonderful piece, and it had a lot of potential being part of JS echo system...

But I had a basic starting thing to over come, and do not find that been talked in any of the blog/doc that I went through. Surprising. here is my problem -

I constructed a DF using an array of objects.
var df = new DataFrame( deposit_history );

Then I tried to filter like this:
var df2 = df.where( row => row.status === 'PG_FAILED' );

Then I check what the content of df2 is... it is only null.
DataFrame {configFn: , content: null}

Later (after spending lot of time), I tried .toArray(), then I check df2 again,
DataFrame {configFn: , content: Object}

This behavior looks so odd to me, and maybe I am thinking like the 'where()' alone does not construct the resultant DF, and it looks like there is a method (probably like 'execute') is to be invoked after where. That I am missing. In the absence of that hypothetical method 'execute', when one calls toArray() that really executes the query-plan. Is that what us really going on. Would greatly appreciate if the right way is explained. Thanks in advance.

PS: here is one of the ways Python works:
df[~df.order_type.str.contains('fb|foc|nfr|other')]

df[ df['std_product_class'].str.contains('TB', case=False) ]

Regards,
Tharma

TransformSeries by condition

Hi Ashley,
I'd like to propose you to implement a transformSeries by condition for DataFrame, to locally change column values in rows respecting the specific condition.

I'd like to contribute, but I still don't know TS, anyway here is an example I wrote for personal use:
transformSeriesByCondition
I hope it will be useful as it's been to me!
😊

Include browser version in distribution

Would be helpful if you could include browser version (for include in script tag) for convenience. Thanks for your consideration. Nice Work!!!

dropSeries is not a function.

` .dropSeries(['id', 'open', 'high', 'low', 'volume', 'startTime']) // Drop certain columns.
^

TypeError: dataForge.readFile(...).parseCSV(...).dropSeries is not a function`

My code looks like:
`const dataForge = require('data-forge');
require('data-forge-fs');
require('data-forge-plot');

const df = dataForge.readFile('binance_btc_usdt_time_5m.csv')
.parseCSV()
.dropSeries(['id', 'open', 'high', 'low', 'volume', 'startTime']) // Drop certain columns.
.then(df => {
console.log(df.toString())
})
.catch(err => {
console.log(err)
});`

Is there a seperator option?

Hey.

I'm currently struggling to aggregate my data right. I got the following example data in my CSV file (abstracted file):

10 | 20190404141501 | 2,display_name:server@coolbox,vcpus:1,ram:8096 |

Now when reading and working with the Data from that file, the third column gets seperated aswell, since data-forge thinks, that , is also a seperator. Is there an option, where I can set the seperator, that's used in the file? I haven't found anything in the Guide or the API Documentation.

Type discrepancy between null and undefined

Hi,

I am loading a CSV file with the fromCSV function. When I run detectTypes() on the dataframe at that time, I see something like:
0 string 8.1 phone
1 object 91.9 phone

DataForge is treating empty values / nulls as an object.

After running a transform such as below to uppercase a value, there are some type differences.
dataFrame.transformSeries({ [sourceColumn]: value => value && value.toLocaleUpperCase() })

Running detectTypes() again generates:
0 string 8.1 phone
1 undefined 91.9 phone

Even explicitly returning a null such as:
dataFrame.transformSeries({ [sourceColumn]: value => value ? value.toLocaleUpperCase() : null })

returns an undefined.

It seems there is an inconsistency between fromCSV dynamic typing parsing and transformSeries handling nulls and/or undefined.

How to get equivalent of pandas.DataFrame.pivot_table

How to get equivalent of pandas.DataFrame.pivot_table like here

table = pivot_table(df, values=['D', 'E'], index=['A', 'C'], aggfunc={'D': np.mean, 'E': [min, max, np.mean]})

Output:

                  D   E
               mean max median min
A   C
bar large  5.500000  16   14.5  13
    small  5.500000  15   14.5  14
foo large  2.000000  10    9.5   9
    small  2.333333  12   11.0   8

Rollup support

Rollup generates a lot of Circular dependency warnings and produces a broken bundle with undefined variables.
Single-file bundle could probably resolve this issue.

How many columns can the parseCSV handle?

Hi,

I'm trying to read in a csv file that has about 4500 columns. On the .parseCSV method I am getting this error (I've parsed other CSV files I have and it works fine - suspecting there may be an issue with the number of columns??):

testDataForge Error Error: "toString()" failed
at stringSlice (buffer.js:558:43)
at Buffer.toString (buffer.js:631:10)
at Object.fs.readFileSync (fs.js:601:41)
at SyncFileReader.parseCSV (D:\Applications\ModernAnalytics\TestDataForge\node_modules\data-forge-fs\build\index.js:324:40)
at __dirname (D:\Applications\ModernAnalytics\TestDataForge\testDataForge.js:7:76)
at Object. (D:\Applications\ModernAnalytics\TestDataForge\testDataForge.js:26:3)
at Module._compile (module.js:635:30)
at Object.Module._extensions..js (module.js:646:10)
at Module.load (module.js:554:32)
at tryModuleLoad (module.js:497:12)

Thanks!

Rob

content

Can we add and multiply in the series？

want the math function of add and multiply and other operations of the data frame like python pandas

Cannot access the implementation examples in http://blogs.geniuscode.net

Hi,
I cannot access the implementation examples referred in the API docs. Is it down?

Data Forge Notebooks "Cannot find module 'data-forge'"

I recently updated notebook versions and now when I try to load the data-forge library using

const dataForge = require('data-forge')

I get the Error:

Cannot find module 'data-forge'
at wrapperFn (1:14)
at <anonymous> (18:10)

I also tried using data-forge-fs.... am I missing something?

Thanks!

How to do bulk transformation?

Hi,

Is there a way to group multiple transformation steps together so that I can share it between multiple places? For example, if I want to get the n-th row of a data frame, I would use the following 3 steps: .skip(n-1).take(1).first() (by the way, is it the best approach?).

To reuse it, I can make a function like: getAt = (df: IDataFrame, n: number) => df.skip(n-1).take(1).first() but need to use a temporary variable, like:

const df = new DataFrame(...)
const row = getAt(df, n)

Would be nice if can apply a transformation function to a data frame directly, like:

const row = new DataFrame(...)
  .apply((df) => getAt(df, n))

The apply method can be implemented as simple as:

apply (transformer) {
  return transformer(this)
}

I'm not familiar with Pandas to know whether it has something similar.

Regards

Incorrect documentation for Group and Aggregate

Thank you for the library

In the documentation: https://github.com/data-forge/data-forge-ts/blob/master/docs/guide.md#group-and-aggregate I found several issues

First thing is that the example that should be looked at is not #13 but #12. Secondly the code in the example is not actually working with data-forge v1.0.10. Running the example gives me the following error

$ > node examples/12.\ Group\ and\ aggregate\ sales\ data/index.js
-- Output dataframe:
/Users/btara.truhandarien/Code/data-forge-examples-and-tests/examples/12. Group and aggregate sales data/index.js:47
        Sales: group.select(row => row.Sales).sum(),
                                              ^

TypeError: group.select(...).sum is not a function
    at SelectIterator.salesData.groupBy.select.group [as selector] (/Users/btara.truhandarien/Code/data-forge-examples-and-tests/examples/12. Group and aggregate sales data/index.js:47:47)
    at SelectIterator.next (/Users/btara.truhandarien/Code/data-forge-examples-and-tests/node_modules/data-forge/build/lib/iterators/select-iterator.js:20:25)
    at ColumnNamesIterator.next (/Users/btara.truhandarien/Code/data-forge-examples-and-tests/node_modules/data-forge/build/lib/iterators/column-names-iterator.js:58:50)
    at Function.from (native)
    at DataFrame.getColumnNames (/Users/btara.truhandarien/Code/data-forge-examples-and-tests/node_modules/data-forge/build/lib/dataframe.js:337:22)
    at DataFrame.toString (/Users/btara.truhandarien/Code/data-forge-examples-and-tests/node_modules/data-forge/build/lib/dataframe.js:1917:32)
    at Object.<anonymous> (/Users/btara.truhandarien/Code/data-forge-examples-and-tests/examples/12. Group and aggregate sales data/index.js:53:24)
    at Module._compile (module.js:652:30)
    at Object.Module._extensions..js (module.js:663:10)
    at Module.load (module.js:565:32)

I find that modifying the aggregation to be

var summarized = salesData
    .groupBy(row => row.ClientName)
    .select(group => ({
        ClientName: group.first().ClientName,

        // Sum sales per client.
        Sales: group.select(row => row.Sales).deflate().sum(),
    }))
    .inflate() // Series -> DataFrame.
    ;

Works as expected

Directly editing cell values in data frames

Is there no way to edit cell values directly, similarly to how it's done in Pandas?

Example: df['name1']['name2'] = 10

I haven't found anything relevant in the documentation, so I assume there is no function for this. So then the question is, is there any way to modify data frames without having to rebuild whole frames/series every time a modification is done?

getColummns() is not a function

Here is my code:
var transformedData = await new df.DataFrame(data) //data is an array of JSON objects: [{a:1}, {a:3}...]

console.log(transformedData.getColumnNames()) //print out all column Names

for (const column in transformedData.getColummns()) { //not a function
console.log("Column name: ");
console.log(column.name);

console.log("Data:");
console.log(column.series.toArray());

}

This is what i get:
"errorMessage": "transformedData.getColummns is not a function",
"errorType": "TypeError"

Why getColumnNames() works and getColummns() doesn't?

API Docs down. Trying to figure out groupping and .max

Hello,

Just noticed the API docs are down and I haven't been able to figure out how to do two things... Hoping you can provide some guidance. Happy to contribute some examples and documentation once I understand this better.

First, is it possible to group by more than one column? I am trying this:

  // ...
  .groupBy(row => [row.coin,row.type])
  // ...

And it seems to be working. Is that right?

Then, I want to grab the latest of potentially several values per day so I was trying .max like so...

var latest = indexed
  .orderBy(row => row.timestamp)
  .groupBy(row => [row.coin,row.type])
  .select(group => ({
    coin: group.first().coin, // I'm assuming this returns the first of the series
    type: group.first().type,
    timestamp: group.select(row => row.timestamp).max(), // <-- this
    value: group.first().value,
  }))
  .withIndex(row => moment.unix(row.timestamp/1000).format('YYYY-MM-DD'))
  .inflate();

Sorry to bother with this... I keep trying the docs but :s

Thank you so much!

include a ES Module build

Please include a ES Module version of the package so users using bundlers like webpack can benefit from it and get features like tree-shaking, etc.
Right now the package size is quite high

IE 11 Compatibility?

I was wondering if the data-forge 1.2.4 is compatible with IE 11? I've tried using polyfill io and importing the default, es5 and es5 to cover the Symbol usage but for some reason it hangs in iterable.js or iterator.js. Seems to never hit the iterator.prototype[Symbol.iterator] section. Using the polyfill io, I had no issues with putting the data into Dataframe but trying to run the subset function hangs.

fillGaps followed by rollingWindow can perform poorly

Hi, I recently upgraded from data-forge-js to data-forge-ts and I'm noticing the following issue (not sure if it was the same in data-forge-js). I have a DataFrame that I want to call fillGaps on. However, the performance was surprisingly slow. After much debugging having believed it was due to something else in our application I believe I narrowed it down to data-forge-ts. Here is an example that should demonstrate the problem and my current workaround:

    const data = [
      { date: moment().toDate(), value: 1 },
      { date: moment().add(1, 'days').toDate(), value: 3 },
      { date: moment().add(5, 'days').toDate(), value: 2 },
      { date: moment().add(6, 'days').toDate(), value: 6 },
      { date: moment().add(7, 'days').toDate(), value: 5 },
      { date: moment().add(12, 'days').toDate(), value: 2 },
      { date: moment().add(15, 'days').toDate(), value: 9 },
    ];
    const df = new dataForge.DataFrame(data).setIndex('date');

    const gapExists = (pairA, pairB) => {
      // Return true if there is a gap longer than a day.
      // Log when we enter this function
      console.log('gapExists [pairA, pairB]', [pairA, pairB]);
      const startDate = pairA[1].date;
      const endDate = pairB[1].date;
      const gapSize = moment(endDate).startOf('day').diff(moment(startDate).startOf('day'), 'days');
      return gapSize > 1;
    };

    const gapFiller = (pairA, pairB) => {
      const startDate = pairA[1].date;
      const endDate = pairB[1].date;
      const gapSize = moment(endDate).startOf('day').diff(moment(startDate).startOf('day'), 'days');
      const numEntries = gapSize - 1;

      const newEntries = [];

      for (let entryIndex = 0; entryIndex < numEntries; entryIndex += 1) {
        const newValue = { date: pairA[1].date, value: pairA[1].value };
        newValue.date = moment(pairA[1].date).add(entryIndex + 1, 'days').toDate();

        newEntries.push([
          moment(pairA[0]).add(entryIndex + 1, 'days').toDate(), // New index
          newValue, // New value
        ]);
      }

      return newEntries;
    };

    // In this case, the final value of dfWithoutGaps is
    // what you'd expect and the number of entries into
    // gapExists is about what I'd expect.
    //
    // I say "about" what I'd expect because there are
    // two seemingly identical comparisons made between the
    // first two items of the data set for a total
    // execution count of 7 when I'd expect 6.
    const dfWithoutGaps = df.fillGaps(gapExists, gapFiller);
    console.log('dfWithoutGaps', dfWithoutGaps.toArray());

    // With the following line uncommented the rollingWindow
    // call below will trigger a huge number of calls to
    // gapExists (I'm not sure how many but enough to max out the
    // buffer in my browser's debug console). It appears as if
    // fillGaps is being run for each rolling window.
    //
    // The speed of generating the rolling window in this
    // scenario is very slow.
    // const mySeries = dfWithoutGaps.getSeries('value');


    // If we force the lazy evaluation via the line below and then
    // run rollingWindow we'll get the expected number of calls (6) to
    // gapExists.
    //
    // The speed of generating the rolling window in this scenario
    // is very fast.
    // https://github.com/data-forge/data-forge-ts/blob/master/docs/guide.md#lazy-evaluation-through-iterators
    const mySeries = new dataForge.Series(dfWithoutGaps.getSeries('value').toArray());

    const smaPeriod = 3;
    const smaSeries = mySeries
      .rollingWindow(smaPeriod)
      .select(window => window.sum() / smaPeriod);

    console.log('smaSeries', smaSeries.toArray());

It's late here so I could be off the rails somewhere. Does this look like an issue with data-forge-ts? If so, is there a better workaround than what I have here?

Thanks a lot in advance. This is a very helpful tool and I appreciate you creating it!

Getting Value by date index doesn't work for newly created date.

When a date index is set, whether using .setIndex() or explicitly when setting up a dataframe, the index cannot be used to get a value using that index.
e.g. dataframe.test.ts
df.at(moment("2014-04-05").toDate()) will returnundefined while df.at(testDate) will pass the test correctly

Getting a row by index quite slow

Thanks for this awesome DataFrame library for nodejs, much wanted!

Indexing and using at function is quite slow with large DataFrames . at looks like very useful in terms of improving performance. Any workaround until at is fixed for its real intention? A fix will be much helpful if possible. Sorry that i am not aware of the internal working of a DataFrame for a PR

Cost of assertions

Can the library have a way of optimizing (or disabling) chai assertions throughout dataframe and series?

I did some browser (Chrome) profiling and the assertions seemed to be a major factor, so I did some timings.

I can't give you a standalone example, but for basic context my two cases were

[1] a 5000 row table that is aggregated down to about 200 rows: (30ms to parse, 200ms to aggregate),
and [2] a wider 2000 row table that is aggregated down twice: (200ms to parse, 600 ms to aggregate).

After adding the following code to my start up (only change):

  import _ from "lodash";
  import { assert } from "chai";

  const chaiIsFunction = assert.isFunction;
  const chaiIsArray = assert.isArray;
  const chaiIsNumber = assert.isNumber;
  const chaiIsString = assert.isString;
  const chaiIsObject = assert.isObject;

  assert.isFunction = (o, msg) => { if (!_.isFunction(o)) chaiIsFunction(o, msg); };
  assert.isArray = (o, msg) => { if (!_.isArray(o)) chaiIsArray(o, msg); };
  assert.isNumber = (o, msg) => { if (!_.isNumber(o)) chaiIsNumber(o, msg); };
  assert.isString = (o, msg) => { if (!_.isString(o)) chaiIsString(o, msg); };
  assert.isObject = (o, msg) => { if (!_.isObject(o)) chaiIsObject(o, msg); };

the timings went down to [1] (30ms to parse, 80ms to aggregate) and [2] (200ms to parse, 190ms to aggregate).

(By the way, these second timings didn't really improve if I completely disabled the assertions with

    assert.isFunction = () => undefined;
    assert.isArray = () => undefined;
    assert.isNumber = () => undefined;
    assert.isString = () => undefined;
    assert.isObject = () => undefined;

)

My tests were done in a development build, so maybe this isn't as much of a factor in production, but chai seems like a
library designed for testing and not necessarily for production performance (at least I can't see any mention of a
particular production mode in their docs).

Thanks.

Adding series matched by index

Hi,

Thank you for a great library. I was looking for something like this and read about it in the latest issue of Node Weekly. Started playing with it but haven't been able to get the result I'd like. I hope you don't mind if I ask...

I have the following dataframes:

__index__  A     |    __index__  B
---------  --    |    ---------  --
0          A1    |    2          B1
1          A2    |    3          B2
2          A3    |    4          B3
3          A4    |    5          B4
4          A5    |    6          B5

I need to end up with:

__index__  A   B
---------  --  --
0          A1
1          A2
2          A3  B3
3          A4  B4
4          A5  B5
5              B5
6              B5

The data comes from a bunch of files that contain one 2D array each structured like this:

// A.json     |    // B.json
[             |    [
  [0, A1],    |      [2, B1],
  [1, A2],    |      [3, B2],
  [2, A3],    |      [4, B3],
  [3, A4],    |      [5, B4],
  [4, A5]     |      [6, B5]
]             |    ]

Notice how I need the resulting DataFrame to use the file names as the column titles.

I tried using concat and joins but don't quite get this result. Would you mind pointing me in the right direction?

Thank you,

reindex the dataframe

i want to have a function that can reindex the dataframe
something like the pandas.DataFrame.reindex

RollingWindow on DataFrame

Hello again,

If you have a minute, would you mind helping me figure out why I can't get rollingWindow to do what I expect?

I have a bunch of data formatted like so:

__index__   BTC      ETH
----------  -------  -------
2017-06-15  2261.25  333.499
2017-06-16  2481.74  371.358
2017-06-17  2591.41  373.975
2017-06-18  2568.92  377.562
2017-06-19  2578.25  371.813
2017-06-20  2657.32  376.359
2017-06-21  2709.33  338.292
2017-06-22  2714.58  340.326
2017-06-23  2760.61  340.811
2017-06-24  2688.44  334.167
2017-06-25  2627.95  321.283
2017-06-26  2476.72  262.385
2017-06-27  null     null
2017-06-28  2539.78  301.789
2017-06-29  2536.94  315.543
2017-06-30  2547.95  305.518
2017-07-01  2507.89  282.877
2017-07-02  2445.84  269.758
2017-07-03  2519.06  288.076
2017-07-04  2618.16  284.342

And I'm running the following:

const performance = df.rollingWindow(2).select(window => {
  return ((window.last() - window.first()) / window.first())
})
.withIndex(pair => pair[0])
.select(pair => pair[1])
console.log(performance.toString());

But I keep getting...

__index__  __value__
---------  ---------
0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
5          NaN
6          NaN
7          NaN
8          NaN
9          NaN
10         NaN
// truncated for brevity
353        NaN
354        NaN
355        NaN
356        NaN
357        NaN
358        NaN
359        NaN
360        NaN
361        NaN
362        NaN
363        NaN

Not sure what I'm doing wrong. I've tried several variations based on the examples on the Guide.

Thank you!

data-forge not working nicely with collectionsjs

First and foremost, I do not think this is a fault in data-forge but I feel like I should raise the issue here so others and author(s) are aware of it

I have the following code:

const dataForge = require('data-forge');
const util = require('util');
//const c = require('collections/fast-map');

function test() {
  const timestamps = [ [ '2018-05-21 15:38:04' ],
    [ '2018-05-21 15:38:09' ],
    [ '2018-05-21 15:38:09' ],
    [ '2018-05-21 15:38:08' ] ];
  const tsDataFrame = new dataForge.DataFrame({
    columnNames: ['Timestamp'],
    rows: timestamps
  }).setIndex('Timestamp');
  let groupedDf = tsDataFrame.groupBy(row => row.Timestamp).select(tsGroup => ({
    Timestamp: tsGroup.first().Timestamp,
    QPS: tsGroup.count()
  })).inflate();
  console.log(groupedDf.toString());
}

test();

Running the above code will give me the following result

__index__  Timestamp            QPS
---------  -------------------  ---
0          2018-05-21 15:38:04  1
1          2018-05-21 15:38:09  2
2          2018-05-21 15:38:08  1

which is what I expected

However if //const c = require('collections/fast-map'); is uncommented, running it again I will get

__index__  [object Object]  false
---------  ---------------  -----
0
1
2

Clearly this is a mistake. After hours of debugging I can at least spot one possible reason for the error. In the build version of data-forge within DataFrame.prototype.toString function we have the following (cut down for brevity's sake)

    DataFrame.prototype.toString = function () {
        var columnNames = this.getColumnNames();
        var header = ["__index__"].concat(columnNames);
        // more things down below
    };

Doing console.log(columnNames) with collectionsjs required gives me the following:

[ SelectIterable {
    iterable: [ [Object], [Object], [Object] ],
    selector: [Function] },
  false ]

Without collectionsjs I will get the expected result: [ 'Timestamp', 'QPS' ]

Inspecting further the getColumnNames function tells me that Array.from is used which is overridden by collectionsjs implementation: https://github.com/montagejs/collections/blob/master/shim-array.js#L26

I managed to fix things on data-forge side by doing a seemingly unnecessary function call:

let groupedDf = tsDataFrame.groupBy(row => row.Timestamp).select(tsGroup => ({
  Timestamp: tsGroup.first().Timestamp,
  QPS: tsGroup.count()
})).inflate().resetIndex();

This will give me the correct result regardless if collectionsjs was used or not

There's an issue raised already in collectionsjs regarding Array.from montagejs/collections#169 and there is also a PR montagejs/collections#173. I'm not sure about the progress of either

initColumnNames function problem

I have an array of the same column names. (* actually different)
Example : city and CITY

DataFrame.initColumnNames function recognizes these two as the same. (reason toLowerCase() method )
Result : city.1 and CITY.2

When i replace return outputColumnNames; with return inputColumnNames; , it works.

It would be nice, if it was presented as an option .

Do you have any other suggestions for this ?

Thanks in advance.

Unable to build

I am unable to build my typescript web app with dataforge.

I get the following errors to do with the fs module:

ERROR in ./node_modules/data-forge/build/index.js
Module not found: Error: Can't resolve 'fs' in '/home/vagrant/kesm/apps/threescan-data-viewer/node_modules/data-forge/build'
 @ ./node_modules/data-forge/build/index.js 157:17-30 253:17-30 264:17-30
 @ ./node_modules/ts-loader!./node_modules/vue-loader/lib/selector.js?type=script&index=0!./src/ComplexRoiSelector.vue
 @ ./src/ComplexRoiSelector.vue
 @ ./src/index.ts

ERROR in ./node_modules/data-forge/build/lib/dataframe.js
Module not found: Error: Can't resolve 'fs' in '/home/vagrant/kesm/apps/threescan-data-viewer/node_modules/data-forge/build/lib'
 @ ./node_modules/data-forge/build/lib/dataframe.js 4377:21-34 4401:17-30 4433:21-34 4457:17-30
 @ ./node_modules/data-forge/build/index.js
 @ ./node_modules/ts-loader!./node_modules/vue-loader/lib/selector.js?type=script&index=0!./src/ComplexRoiSelector.vue
 @ ./src/ComplexRoiSelector.vue
 @ ./src/index.ts

My code literally just tries to import dataforge and instantiate a dataframe with no data.

Updating package.json dependencies (security)

Since yesterday it is a security incident (https://npmjs.com/advisories/1515). Can you please update the project dependencies of papaparse.

Key Concepts separate file URL broken

https://github.com/data-forge/data-forge-ts/blob/master/docs/docs/concepts.md

Transforming JS object to CSV

Hi.

With the split to data-forge and data-forge-fs, I'm not sure how to turn an array of JS Objects I have in memory into a CSV file.

fromObject() is defined on data-forge but asCSV() and writeFileSync() are defined in data-forge-fs.

Example data structure

const arr = [
  { first: 'micky', last: 'mouse' },
  { first: 'minnie', last: 'mouse' }
];
// How do I turn `arr` into a CSV file?
// Do i import `data-forge` or `data-forge-fs` or both?

Thank you.

using dataforge in angular

hi Ashley,
I'm building an angular 6 application recently and was looking for some data processing lib. So happy to find data-forge as I used pandas and it's a great lib on python.
Would you mind put some instruction on how to import data-forge into angular to be used there?

data-forge / data-forge-ts Goto Github PK

data-forge-ts's Introduction

Data-Forge

BREAKING CHANGES

Install

Quick start

From the browser

Features

Contributions

Platforms

Documentation

Resources

Contact

Support the developer

data-forge-ts's People

Contributors

Stargazers

Watchers

Forkers

data-forge-ts's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs