javascriptdata / scikit.js


JavaScript package for predictive data analysis and machine learning

License: MIT License

TypeScript 99.33% Shell 0.02% JavaScript 0.30% HTML 0.35%

scikit.js's Issues

train_test_split

Implement the model_selection train_test_split from sklearn.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split
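To make the request concrete, here is a minimal sketch of a shuffled train/test split over plain arrays. The name trainTestSplitSketch and the testSize and random parameters are hypothetical and only illustrate the idea; this is not the scikit.js implementation.

```javascript
// Hypothetical helper, not the scikit.js API: split paired arrays into train and test parts.
function trainTestSplitSketch(X, y, testSize = 0.25, random = Math.random) {
  if (X.length !== y.length) {
    throw new Error('X and y must have the same number of rows')
  }
  // Fisher-Yates shuffle of the row indices so X and y stay paired.
  const indices = X.map((_, i) => i)
  for (let i = indices.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1))
    const tmp = indices[i]
    indices[i] = indices[j]
    indices[j] = tmp
  }
  // The first nTest shuffled rows become the test set, the rest the training set.
  const nTest = Math.ceil(X.length * testSize)
  const testIdx = indices.slice(0, nTest)
  const trainIdx = indices.slice(nTest)
  return [
    trainIdx.map((i) => X[i]),
    testIdx.map((i) => X[i]),
    trainIdx.map((i) => y[i]),
    testIdx.map((i) => y[i])
  ]
}
```

The return order (xTrain, xTest, yTrain, yTest) follows the order scikit-learn uses for two input arrays.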

DecisionTreeClassifier

Build a DecisionTreeClassifier which matches the scikit-learn API.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

To start we don't have to mimic the entire API, just start somewhere, and we can help add features to the class.

Contribution guide

The contribution guide link returns a 404 error.

Why is there no SVC?

Hello!

Thanks for the excellent library. I noticed some SVC files but they are commented out. Is there any reason why SVC is not implemented?

Thank You!

How to set DecisionTreeClassifier reproducibility?

Hi, first of all, thanks for developing this library :)

I was writing code that uses the DecisionTreeClassifier. When I checked my results, I realised that they are not constant between runs.
I set a random number generator with Math.seedrandom(my_seed) to get the same results on each execution, but the results still differ.

This method worked for me on LogisticRegression, where the weights are initialized by TensorFlow.js (if I am not wrong, the user can fix this initialization by calling Math.seedrandom() before building the neural network). I know DecisionTreeClassifier is not a neural network, so its random number generation is not the same as in LogisticRegression.

In summary: how can I set a random number generator to get the same results from a DecisionTreeClassifier?
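As a general illustration of the pattern being asked about, the sketch below passes an explicit seeded generator into the code that needs randomness instead of relying on a patched Math.random. The generator is the widely used mulberry32 routine; shuffleWith is a hypothetical helper. Whether DecisionTreeClassifier accepts such a generator is not stated in this thread, so treat this only as a sketch of the idea.

```javascript
// mulberry32: a small seeded pseudo-random generator returning values in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0
  return function () {
    a = (a + 0x6d2b79f5) | 0
    let t = Math.imul(a ^ (a >>> 15), 1 | a)
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Hypothetical helper: shuffle a copy of `items` using the supplied generator.
function shuffleWith(items, random) {
  const out = items.slice()
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1))
    const tmp = out[i]
    out[i] = out[j]
    out[j] = tmp
  }
  return out
}

// Two runs with the same seed produce the same order.
const runA = shuffleWith([0, 1, 2, 3, 4, 5, 6, 7], mulberry32(42))
const runB = shuffleWith([0, 1, 2, 3, 4, 5, 6, 7], mulberry32(42))
console.log(runA.join() === runB.join()) // true
```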

Stratified sampling for dataframe splitting

Hello everyone,

I'm working on a project that needs stratified sampling of the dataset so it can have a more balanced test set.
More on the subject: https://en.wikipedia.org/wiki/Stratified_sampling

I implemented a solution using Danfo.js for that purpose and, if you think it is a good idea, I can open a PR with that as a splitting tool.
Its counterpart in scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html

If you think this would make sense for the project, just let me know. :)
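For readers unfamiliar with the idea, a minimal sketch of a stratified split over a plain label array follows. It is not the Danfo.js-based solution mentioned above (that code is not shown in this thread); stratifiedSplitSketch and its parameters are hypothetical names.

```javascript
// Hypothetical helper: split row indices so each label keeps roughly the same
// share in the test set as in the whole dataset.
function stratifiedSplitSketch(labels, testSize = 0.2, random = Math.random) {
  // Group row indices by label.
  const groups = new Map()
  labels.forEach((label, i) => {
    if (!groups.has(label)) groups.set(label, [])
    groups.get(label).push(i)
  })
  const trainIdx = []
  const testIdx = []
  for (const idx of groups.values()) {
    // Shuffle within the group (Fisher-Yates), then cut off the test share.
    for (let i = idx.length - 1; i > 0; i--) {
      const j = Math.floor(random() * (i + 1))
      const tmp = idx[i]
      idx[i] = idx[j]
      idx[j] = tmp
    }
    const nTest = Math.round(idx.length * testSize)
    testIdx.push(...idx.slice(0, nTest))
    trainIdx.push(...idx.slice(nTest))
  }
  return { trainIdx, testIdx }
}
```

Rounding per group means the overall test share can differ slightly from testSize when groups are small.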

reference Danfojs types as dev dependency

update build process to allow for TensorFlow versions between danfo and scikit to be out of sync

Overarching Plan (to MVP / version 1)

Hey Folks!
I thought it might be a bit easier if we had one issue that had the current "state of the world".
It would have a list of all completed Estimators/Functions and next to each it would have a person's name if someone was working on it or it'd be checked if it was complete and merged in dev.

Ping me in the comments beneath and I'll add you to whichever estimators you want to work on.

I went through the scikit-learn docs yesterday and broke out the Estimators that we would need for an MVP of scikit.js (let's call it version 1).

Version 1

The focus here is on simple models, plus all the preprocessing and metrics that you'd need to build high-quality models.

linear_model

cluster

KMeans

neighbors

dummy

DummyClassifier
DummyRegressor

impute

SimpleImputer

preprocessing

pipeline

Pipeline

compose

ColumnTransformer

tree

metrics

So pick whichever ya want, and ping me, and I'll update the issue and put your name next to the Estimator / Function.

Some great resources for contributors

ML from scratch in Python : https://github.com/eriklindernoren/ML-From-Scratch
Nick McClure's Book: https://github.com/nfmcclure/tensorflow_cookbook
Charlie Gerard's Book: Practical Machine Learning with Tensorflow.js
MachineLearnjs : https://github.com/machinelearnjs/machinelearnjs

Hello folks! Time flies when you're having fun :)
We are rounding the corner on the completion of the MVP / Version 1 list above. I thought it would be good to go through scikit-learn and make a list of the next most important things. That list is below, as well as some general todos (docs, tutorials). Feel free to ping me or comment below and grab whatever interests you in the following list.

Onward and Upward!

linear_model

Exact solution for linear_regression

datasets

Iris
Boston Housing
#44
#50

naive_bayes

svm

LinearSVC
LinearSVR
SVC
SVR

model_selection

GroupKFold
#46
ShuffleSplit
#45

decomposition

hyper_parameter

#187

ensemble

VotingRegressor
VotingClassifier
RandomForestClassifier
RandomForestRegressor

docs

Make Basic Docs site
Push the Basic Docs site to scikit.org. Have scikit.js redirect to scikit.org
Make Basic Docs site show api for all functions / classes that we export
Make it build browser and node versions
Make the tests run against browser and node environments

GridSearchCV

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
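One building block of a grid search is enumerating every combination of candidate parameter values. A minimal sketch, assuming the grid is given as an object of arrays; parameterGrid is a hypothetical name, not an existing scikit.js export.

```javascript
// Hypothetical helper: expand { name: [values...] } into every combination.
function parameterGrid(grid) {
  return Object.entries(grid).reduce(
    (combos, [name, values]) =>
      // Extend each partial combination with each value of the next parameter.
      combos.flatMap((combo) => values.map((value) => ({ ...combo, [name]: value }))),
    [{}]
  )
}

// Example: 2 x 3 = 6 candidate settings to fit and score.
const candidates = parameterGrid({ maxDepth: [2, 4], criterion: ['a', 'b', 'c'] })
console.log(candidates.length) // 6
```

A full search would fit an estimator per candidate on each cross-validation fold and keep the best mean score.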

Import LibSVM into the library

Is your feature request related to a problem? Please describe.
It would be amazing to have a fast LibSVM implementation for SVC, SVR estimators.

Describe the solution you'd like
Compile the LibSVM project to a wasm file, and use that with the familiar SVM sklearn api

Implement KFold

In the model_selection category, implement KFold.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold
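A minimal sketch of the index bookkeeping behind KFold without shuffling. The helper name kFoldIndices is hypothetical. Following the scikit-learn documentation, the first nSamples % nSplits folds receive one extra sample.

```javascript
// Hypothetical helper: return train/test index arrays for each of nSplits folds.
function kFoldIndices(nSamples, nSplits = 5) {
  if (nSplits < 2 || nSplits > nSamples) {
    throw new Error('nSplits must be between 2 and nSamples')
  }
  const base = Math.floor(nSamples / nSplits)
  const remainder = nSamples % nSplits
  const folds = []
  let start = 0
  for (let k = 0; k < nSplits; k++) {
    // The first `remainder` folds are one sample larger.
    const size = base + (k < remainder ? 1 : 0)
    const trainIdx = []
    const testIdx = []
    for (let i = 0; i < nSamples; i++) {
      if (i >= start && i < start + size) testIdx.push(i)
      else trainIdx.push(i)
    }
    folds.push({ trainIdx, testIdx })
    start += size
  }
  return folds
}
```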

Create the RidgeRegression Estimator

Is your feature request related to a problem? Please describe.
Like the scikit-learn library we should have a RidgeRegression Estimator.

Describe the solution you'd like
Similar to the LinearRegression issue, I think the best plan of action is to create an SGD solution that delivers the right answer (even if it is slower) and then eventually switch to a faster linear algebra library once we understand that landscape a bit more.
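To illustrate the gradient-based route, here is a minimal sketch in plain JavaScript rather than a TF model. It uses full-batch gradient descent (not stochastic) on mean squared error plus alpha times the squared weights. The name fitRidgeGD, its options, and this scaling of alpha are assumptions of the sketch and are not taken from scikit.js.

```javascript
// Hypothetical helper: ridge regression by full-batch gradient descent.
// Objective: (1/n) * sum((X w + b - y)^2) + alpha * sum(w^2). The intercept is not penalized.
function fitRidgeGD(X, y, { alpha = 0.1, learningRate = 0.05, epochs = 2000 } = {}) {
  const n = X.length
  const d = X[0].length
  const w = new Array(d).fill(0)
  let b = 0
  for (let epoch = 0; epoch < epochs; epoch++) {
    const gradW = new Array(d).fill(0)
    let gradB = 0
    for (let i = 0; i < n; i++) {
      let pred = b
      for (let j = 0; j < d; j++) pred += w[j] * X[i][j]
      const err = pred - y[i]
      for (let j = 0; j < d; j++) gradW[j] += (2 / n) * err * X[i][j]
      gradB += (2 / n) * err
    }
    // The L2 penalty adds 2 * alpha * w to the weight gradient.
    for (let j = 0; j < d; j++) w[j] -= learningRate * (gradW[j] + 2 * alpha * w[j])
    b -= learningRate * gradB
  }
  return { w, b }
}
```

The learning rate and epoch count are tuned for small, unscaled toy inputs; real data would usually need feature scaling or a different step size.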

Any input about WASM?

It seems based on Node.js currently, but there's WASM around the corner:

Any input on it w.r.t. this project?

Getting an error calling model.score

Hi!

Maybe I did something wrong but I think this is the simplest example to get a score out.

When I call model.score(xTest, yTest) I get the error Labels can't be converted to a 1D Tensor.

const btcusdtData = await ensureCryptoData('BTCUSDT', '1d')
const df = new dnf.DataFrame(btcusdtData)
// Couldn't use a list of column names here unfortunately so I used their index numbers. Would be a lot cooler if I could use 'open', 'high', etc.
const x = df.iloc({ columns: [1, 2, 3, 5, 6, 7, 8, 9, 10, 11] })
const y = df.iloc({ columns: [4] })
const [xTrain, xTest, yTrain, yTest] = sk.trainTestSplit(x.tensor, y.tensor)
const model = new sk.LinearRegression()
await model.fit(xTrain, yTrain)
const score = model.score(xTest, yTest)
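One possible reading of the error, not confirmed in this thread, is that the labels arrive as a single column of shape [n, 1] where a flat list of length n is expected. The sketch below shows flattening such a column and computing R², the value scikit-learn's regressor score reports. It uses plain arrays only; toFlatLabels and r2Score are hypothetical helpers, not scikit.js functions.

```javascript
// Hypothetical helper: accept either [1, 2, 3] or [[1], [2], [3]] and return a flat array.
function toFlatLabels(y) {
  return y.map((v) => {
    if (!Array.isArray(v)) return v
    if (v.length !== 1) throw new Error('Expected a single label column')
    return v[0]
  })
}

// Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
function r2Score(yTrue, yPred) {
  const t = toFlatLabels(yTrue)
  const p = toFlatLabels(yPred)
  const mean = t.reduce((sum, v) => sum + v, 0) / t.length
  let ssRes = 0
  let ssTot = 0
  for (let i = 0; i < t.length; i++) {
    ssRes += (t[i] - p[i]) ** 2
    ssTot += (t[i] - mean) ** 2
  }
  return 1 - ssRes / ssTot
}
```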

BernoulliNB

Implement BernoulliNB with the scikit-learn API.

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB

save and load

Hi,

How do you save and then re-open a trained network?
It doesn't seem to be explained anywhere in the doc.

Thanks.

Build out our Scikit Learn metrics

Is your feature request related to a problem? Please describe.
We should have a robust and well-tested set of metrics that mirror those on sklearn.

Describe the solution you'd like
Many of these will be thin wrappers around the ones that ship with tensorflow.js, but some we will simply have to write ourselves using basic tensor math.
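As a rough illustration of the "write it with basic math" case, here are three metrics over plain arrays. The camelCase names mirror scikit-learn's accuracy_score, mean_squared_error, and mean_absolute_error, but the signatures are assumptions of this sketch, and a real version would presumably operate on tensors.

```javascript
// Fraction of predictions that match the true labels exactly.
function accuracyScore(yTrue, yPred) {
  let correct = 0
  for (let i = 0; i < yTrue.length; i++) {
    if (yTrue[i] === yPred[i]) correct++
  }
  return correct / yTrue.length
}

// Mean of squared differences.
function meanSquaredError(yTrue, yPred) {
  let total = 0
  for (let i = 0; i < yTrue.length; i++) total += (yTrue[i] - yPred[i]) ** 2
  return total / yTrue.length
}

// Mean of absolute differences.
function meanAbsoluteError(yTrue, yPred) {
  let total = 0
  for (let i = 0; i < yTrue.length; i++) total += Math.abs(yTrue[i] - yPred[i])
  return total / yTrue.length
}
```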

Create the LinearRegression Estimator

Is your feature request related to a problem? Please describe.
In order to match the Estimators in scikit-learn, we should make a LinearRegression estimator with the same API.

Describe the solution you'd like
There are a couple of ways to do this. Scikit-Learn leans on linear algebra libraries from scipy. I'm honestly not sure if JS (either Node / Browser) has adapters for the same low-level BLAS and LAPACK libraries that scipy uses.
Without access to those libraries, the plan should be to just create a gradient descent solution using a TF model.

Once we know / can use low level libraries for solving a linear system of equations, we should trade out the SGD solution above in favor of that.

Additional context
There might be cases where we actually want to have both implementations (Linear Algebra solver, and SGD), because we could deploy different ones to different contexts, i.e. Lin Alg solver on Node, and SGD solver on web. This would get rid of the need for shipping an entire solver to the client. But that's something that we would need to test later.
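For comparison with the gradient descent route, here is a minimal sketch of the linear-algebra route in plain JavaScript: build the normal equations (XᵀX)β = Xᵀy and solve them with Gaussian elimination. The function names are hypothetical, and this small solver stands in for BLAS/LAPACK only for illustration; it is not numerically robust for ill-conditioned data.

```javascript
// Solve A x = b by Gaussian elimination with partial pivoting.
function solveLinearSystem(A, b) {
  const n = A.length
  const M = A.map((row, i) => [...row, b[i]]) // augmented matrix copy
  for (let col = 0; col < n; col++) {
    let pivot = col
    for (let r = col + 1; r < n; r++) {
      if (Math.abs(M[r][col]) > Math.abs(M[pivot][col])) pivot = r
    }
    if (Math.abs(M[pivot][col]) < 1e-12) throw new Error('Matrix is singular or nearly singular')
    const tmp = M[col]
    M[col] = M[pivot]
    M[pivot] = tmp
    for (let r = col + 1; r < n; r++) {
      const factor = M[r][col] / M[col][col]
      for (let c = col; c <= n; c++) M[r][c] -= factor * M[col][c]
    }
  }
  // Back substitution on the upper-triangular system.
  const x = new Array(n).fill(0)
  for (let r = n - 1; r >= 0; r--) {
    let sum = M[r][n]
    for (let c = r + 1; c < n; c++) sum -= M[r][c] * x[c]
    x[r] = sum / M[r][r]
  }
  return x
}

// Hypothetical helper: ordinary least squares via the normal equations.
function fitLinearRegressionExact(X, y) {
  const Xb = X.map((row) => [1, ...row]) // leading 1 models the intercept
  const d = Xb[0].length
  const XtX = Array.from({ length: d }, () => new Array(d).fill(0))
  const Xty = new Array(d).fill(0)
  for (let i = 0; i < Xb.length; i++) {
    for (let j = 0; j < d; j++) {
      Xty[j] += Xb[i][j] * y[i]
      for (let k = 0; k < d; k++) XtX[j][k] += Xb[i][j] * Xb[i][k]
    }
  }
  const beta = solveLinearSystem(XtX, Xty)
  return { intercept: beta[0], coef: beta.slice(1) }
}
```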

Better Github Actions setup

The goals of the better GitHub Actions setup are:

Every time a PR is pushed to main, there is an auto-deploy to npm for both scikitjs and scikitjs-node
A coverage report is created (one is right now, but it might not be as cool as you are thinking, @yawetse)
Any other goodies that you think we are missing @yawetse

DecisionTreeRegressor

Build a DecisionTreeRegressor which matches some of the API of the scikit-learn DecisionTreeRegressor.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor

No need to support the entire API to start. If you make the first pass, others can come in, and chip in other features as well.

Model saving and loading

Is your feature request related to a problem? Please describe.
How are models planned to be saved and loaded?

Describe the solution you'd like
I think we can have a fromJson and toJson to save and load model params and weights. If this is needed, I can start working on that.
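A minimal sketch of what such a toJson / fromJson pair could look like, using a made-up TinyLinearModel class. The class, its fields, and the JSON layout are all hypothetical and only illustrate the round trip; they are not the scikit.js design.

```javascript
// Hypothetical estimator used only to illustrate saving and loading.
class TinyLinearModel {
  constructor({ coef = [], intercept = 0 } = {}) {
    this.coef = coef
    this.intercept = intercept
  }

  predict(X) {
    return X.map((row) => row.reduce((sum, v, j) => sum + v * this.coef[j], this.intercept))
  }

  // Serialize the learned parameters together with a name tag.
  toJson() {
    return JSON.stringify({ name: 'TinyLinearModel', coef: this.coef, intercept: this.intercept })
  }

  // Rebuild a model from the string produced by toJson.
  static fromJson(json) {
    const parsed = JSON.parse(json)
    if (parsed.name !== 'TinyLinearModel') {
      throw new Error('JSON does not describe a TinyLinearModel')
    }
    return new TinyLinearModel({ coef: parsed.coef, intercept: parsed.intercept })
  }
}
```

Storing a name tag lets the loader reject JSON that was produced by a different estimator.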

yarn add scikitjs fail

Steps to reproduce the behavior:

Create dir \scikitjs
Init npm dir "npm init"
Add module: yarn add scikitjs
error:
.....
error D:\reactapp\personal\scikitjs\node_modules\scikitjs: Command failed.
Exit code: 1
Command: (cd docs && npm install && cd ..); (npx husky install);
....

Desktop:

OS: Windows 10
Node version: v14.18.1

GaussianNB

Implement GaussianNB with the scikit-learn API.

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB
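A minimal sketch of the computation behind Gaussian naive Bayes on plain arrays: per-class feature means, variances, and priors at fit time, then the largest log-posterior at predict time. The function names and the absolute varSmoothing constant are assumptions of this sketch, not the scikit-learn or scikit.js API.

```javascript
// Hypothetical helper: estimate per-class priors, means, and variances.
function fitGaussianNB(X, y, varSmoothing = 1e-9) {
  const byClass = new Map()
  X.forEach((row, i) => {
    if (!byClass.has(y[i])) byClass.set(y[i], [])
    byClass.get(y[i]).push(row)
  })
  const model = []
  for (const [label, rows] of byClass) {
    const d = rows[0].length
    const mean = new Array(d).fill(0)
    const variance = new Array(d).fill(0)
    for (const row of rows) row.forEach((v, j) => { mean[j] += v / rows.length })
    for (const row of rows) row.forEach((v, j) => { variance[j] += (v - mean[j]) ** 2 / rows.length })
    model.push({
      label,
      logPrior: Math.log(rows.length / X.length),
      mean,
      // A small constant keeps the variance away from zero.
      variance: variance.map((v) => v + varSmoothing)
    })
  }
  return model
}

// Pick the class with the highest log prior plus summed Gaussian log-likelihoods.
function predictGaussianNB(model, row) {
  let best = null
  let bestScore = -Infinity
  for (const c of model) {
    let score = c.logPrior
    row.forEach((v, j) => {
      score += -0.5 * Math.log(2 * Math.PI * c.variance[j]) - (v - c.mean[j]) ** 2 / (2 * c.variance[j])
    })
    if (score > bestScore) {
      bestScore = score
      best = c.label
    }
  }
  return best
}
```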

Update Module Structure

from @dcrescim:
My basic premise (and I think we are aligned here) is that less is more.
If we can build a library with fewer repositories (1 is better than 2), or if we can support the same use cases with fewer npm packages (like supporting node/cpu/wasm with only 1 package), then that is better.

The only "gotcha" which will force my hand into "more repo / more packages" territory is if we can't keep the user experience clean.
So what does the dream scenario look like? Here's some example code:

import { LinearRegression } from 'scikitjs' // uses tfjs library, and whichever (webgl, cpu) backend is better

import { LinearRegression } from 'scikitjs/node' // uses tfjs-node library

import { LinearRegression, tf } from 'scikitjs' // uses tfjs library, with wasm backend
tf.setBackend('wasm')

import { LinearRegression } from 'scikitjs/node-gpu' // uses tfjs-node-gpu library

I don't care too much about that last case in the short term, but it's there just to make sure our code structure could eventually support it one day

Is this related to scikit-learn-ts?

If not, then how do they differ?

sklearn crossValidate function

Is your feature request related to a problem? Please describe.
Make a great crossValidate function that matches the sklearn API.

Configure Semantic Release

When PRs are merged onto the main branch:

the changelog should get updated
scikitjs-node should be auto-published to NPM with the correct semver
github should tag each release
scikitjs-browser should be auto-published to NPM too

Up for debate:

Should we have a single node package with multiple build targets?
Or have a separate repo for browser that's just there for publishing to NPM (my preference)

Need a global variable/namespace for scikit.js: it's more flexible for interactive, ad-hoc testing in the browser console

PREFERRED in HTML (d3.js, tensorflow.js, and tfjs-vis expose the global variables d3, tf, and tfvis respectively in the browser console):

Then, when you type tfvis. , d3. , or tf. in the browser console, a drop-down list of available modules/functions appears right after the dot. This is faster for debugging and testing, and it needs only one script tag line in the HTML!

===========================
scikit.js in HTML from your website:

This way, sk is not recognized in the browser console. Can you provide a link/file for scikit.js that exposes a global variable such as "sk", if you have one?

If scikit.js does not have this feature, will it provide one in the future [time frame please]?

Is there a way to work around this [create a global variable "sk" and gain access to all submodules/functions in the browser console for interactive testing]?

Thank you, I'm looking forward to hearing from you soon.

Put Common datasets in scikitjs

Add Iris, Boston, Wine, etc... to the repo. Add them to the docs site, and write functions to "go and get them".

Build Coveralls and Coverage Reports into build process

update coveralls config to gate releases by test coverage

If I use a bundler, the project fails to build.

I use scikitjs (version: 1.24.0) in my project and bundle it with parcel-bundler (version: 1.12.5). When I start the project I get the following error:

If I import the bundle via a script tag (CDN), it works fine.

The error location: /node_modules/scikitjs/dist/es5/index.js (line: 20). If I comment out this line of code, it works fine.

KNeighborsRegressor

Build a KNeighborsRegressor which matches the sklearn API.

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor

Technically there are 3 different strategies: ‘ball_tree’, ‘kd_tree’, and ‘brute’. This issue only covers the "brute" method, which checks the distance between the query point and every point in the input.

The other two are optimizations which try to use trees (kd_tree) and spheres (ball_tree) to speed up the algorithm, but to start let's just do brute.
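The "brute" strategy can be sketched in a few lines over plain arrays: measure the Euclidean distance from the query to every training row, keep the k closest, and average their targets. knnRegressBrute is a hypothetical name and uses uniform weighting only.

```javascript
// Hypothetical helper: brute-force k-nearest-neighbors regression for one query point.
function knnRegressBrute(XTrain, yTrain, query, k = 5) {
  // Distance from the query to every training row.
  const neighbors = XTrain.map((row, i) => {
    let squared = 0
    for (let j = 0; j < row.length; j++) squared += (row[j] - query[j]) ** 2
    return { dist: Math.sqrt(squared), target: yTrain[i] }
  })
  // Sort by distance and keep the k closest rows.
  neighbors.sort((a, b) => a.dist - b.dist)
  const nearest = neighbors.slice(0, Math.min(k, neighbors.length))
  // Uniform weighting: the prediction is the mean of the neighbors' targets.
  return nearest.reduce((sum, n) => sum + n.target, 0) / nearest.length
}
```

Each prediction costs one pass over the training set plus a sort, which is the cost the tree-based strategies aim to reduce.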

`make_classification` / `make_regression` from scikit-learn

When scikit-learn tests the effectiveness of its model training, it usually constructs synthetic datasets where the underlying model (coefficients) is known. It has two functions for this:

make_regression (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html)
make_classification (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification)

We should do the same. This issue tracks our implementation of makeRegression and makeClassification, which will be helpful for testing the speed and efficacy of our models.
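A minimal sketch of the regression case: draw random features, pick known coefficients, and build targets from them so a fitted model can be checked against the truth. makeRegressionSketch and its options are hypothetical and simplified; they do not reproduce scikit-learn's make_regression parameters or random stream.

```javascript
// mulberry32: a small seeded generator so the dataset is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0
  return function () {
    a = (a + 0x6d2b79f5) | 0
    let t = Math.imul(a ^ (a >>> 15), 1 | a)
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Hypothetical helper: synthetic linear data with known coefficients.
function makeRegressionSketch({ nSamples = 100, nFeatures = 3, noise = 0, bias = 0, seed = 0 } = {}) {
  const random = mulberry32(seed)
  // Standard normal draw via the Box-Muller transform.
  const gaussian = () =>
    Math.sqrt(-2 * Math.log(1 - random())) * Math.cos(2 * Math.PI * random())
  const coef = Array.from({ length: nFeatures }, () => 100 * random())
  const X = []
  const y = []
  for (let i = 0; i < nSamples; i++) {
    const row = Array.from({ length: nFeatures }, () => gaussian())
    let target = bias
    for (let j = 0; j < nFeatures; j++) target += coef[j] * row[j]
    X.push(row)
    // `noise` is the standard deviation of the Gaussian noise added to each target.
    y.push(target + noise * gaussian())
  }
  return { X, y, coef }
}
```

Returning coef alongside X and y lets a test compare learned weights with the ones that generated the data.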