javascriptdata / scikit.js

JavaScript package for predictive data analysis and machine learning

License: MIT License

TypeScript 99.33% Shell 0.02% JavaScript 0.30% HTML 0.35%

scikit.js's People

Contributors

codeheart09, dcrescim, dependabot[bot], dirktoewe, lewuathe, luansilveirasouza, risenw, semantic-release-bot, steveoni, stonet2000, wenheli, yawetse

scikit.js's Issues

Why is there no SVC?

Hello!

Thanks for the excellent library. I noticed some SVC files but they are commented out. Is there any reason why SVC is not implemented?

Thank You!

How to set DecisionTreeClassifier reproducibility?

Hi, first of all, thanks for developing this library :)

I was writing code that uses the DecisionTreeClassifier. When I checked my results, I realised that they are not constant.
I seeded the random number generator with Math.seedrandom(my_seed) to get the same results on every run, but the results still differ.

This method worked for me on LogisticRegression, where the weights are initialized by TensorflowJS (if I am not wrong, the user can "fix" this initialization by calling Math.seedrandom() before building the neural network). I know DecisionTreeClassifier is not a neural network, so its random number generation presumably works differently from LogisticRegression's.

In summary: How can I set a random number generator for getting the same results on a DecisionTreeClassifier?
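To make the request concrete, here is a minimal sketch of the behaviour being asked for: a seedable generator that produces the same sequence on every run. It is not scikit.js code. `makeSeededRandom` and `shuffled` are hypothetical names, and whether DecisionTreeClassifier can be pointed at such a generator is exactly the open question.

```typescript
// Illustrative only: a tiny seedable generator (a linear congruential
// generator using the Numerical Recipes constants). Not part of scikit.js.
function makeSeededRandom(seed: number): () => number {
  let state = seed >>> 0; // force the seed into the unsigned 32-bit range
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // uniform in [0, 1)
  };
}

// Fisher-Yates shuffle driven by the supplied generator instead of Math.random.
function shuffled<T>(items: readonly T[], rand: () => number): T[] {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}

// Two runs with the same seed produce the same permutation.
const firstRun = shuffled([1, 2, 3, 4, 5], makeSeededRandom(42));
const secondRun = shuffled([1, 2, 3, 4, 5], makeSeededRandom(42));
console.log(firstRun, secondRun);
```

An estimator that accepted such a generator (or a seed) as an option would give repeatable splits and therefore repeatable trees.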

Stratified sampling for dataframe splitting

Hello everyone,

I'm working on a project that needs stratified sampling of the dataset so it can have a more balanced test set.
More on the subject: https://en.wikipedia.org/wiki/Stratified_sampling

I implemented a solution using Danfo.js for that purpose and, if you think it is a good idea, I can open a PR with that as a splitting tool.
Its counterpart in scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html

If you think this would make sense for the project, just let me know. :)
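For readers unfamiliar with the idea, a stratified split can be sketched as follows: group the row indices by class label, shuffle each group, and take the same fraction of every group for the test set. This is not the Danfo.js implementation mentioned above; `stratifiedSplit` and its parameters are hypothetical, and it works on plain label arrays rather than DataFrames.

```typescript
// Hypothetical helper: returns row indices for a stratified train/test split.
function stratifiedSplit(
  labels: readonly string[],
  testFraction: number,
  rand: () => number = Math.random
): { train: number[]; test: number[] } {
  // Group row indices by class label (null prototype avoids key clashes).
  const byClass: Record<string, number[]> = Object.create(null);
  labels.forEach((label, i) => {
    if (!byClass[label]) byClass[label] = [];
    byClass[label].push(i);
  });

  const train: number[] = [];
  const test: number[] = [];
  for (const label of Object.keys(byClass)) {
    const indices = byClass[label];
    // Fisher-Yates shuffle within the class.
    for (let i = indices.length - 1; i > 0; i--) {
      const j = Math.floor(rand() * (i + 1));
      const tmp = indices[i];
      indices[i] = indices[j];
      indices[j] = tmp;
    }
    // Take the same fraction of every class for the test set.
    const nTest = Math.round(indices.length * testFraction);
    test.push(...indices.slice(0, nTest));
    train.push(...indices.slice(nTest));
  }
  return { train, test };
}

const demoLabels = ["a", "a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b"];
const demoSplit = stratifiedSplit(demoLabels, 0.25);
console.log(demoSplit.test.map((i) => demoLabels[i])); // two "a" rows and one "b" row
```

Rounding per class means the overall test size can differ slightly from testFraction times the number of rows when classes are small.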

Overarching Plan (to MVP / version 1)

Hey Folks!
I thought it might be a bit easier if we had one issue that tracked the current "state of the world".
It lists all the Estimators/Functions: each one has a person's name next to it if someone is working on it, and is checked off once it is complete and merged into dev.

Ping me in the comments beneath and I'll add you to whichever estimators you want to work on.

I went through the scikit-learn docs yesterday and broke out the Estimators that we would need for an MVP of scikit.js (let's call it version 1).

Version 1

The focus here is on simple models, plus all the preprocessing and metrics that you'd need to build high-quality models.

linear_model

  • LinearRegression
  • LassoRegression
  • RidgeRegression
  • ElasticNet
  • LogisticRegression
  • SGDClassifier
  • SGDRegressor

cluster

  • KMeans

neighbors

dummy

  • DummyClassifier
  • DummyRegressor

impute

  • SimpleImputer

preprocessing

  • StandardScaler
  • MinMaxScaler
  • MaxAbsScaler
  • Normalizer
  • RobustScaler
  • LabelEncoder
  • OneHotEncoder
  • OrdinalEncoder

pipeline

  • Pipeline

compose

  • ColumnTransformer

tree

metrics

  • accuracyScore
  • confusionMatrix
  • hingeLoss
  • logLoss
  • precisionScore
  • recallScore
  • rocAucScore
  • zeroOneLoss
  • meanAbsoluteError
  • meanSquaredError
  • meanSquaredLogError
  • r2Score

So pick whichever ya want, and ping me, and I'll update the issue and put your name next to the Estimator / Function.

Some great resources for contributors

Hello folks! Time flies when you're having fun :)
We are rounding the corner on the completion of the MVP / Version 1 list above. I thought it would be good to go through scikit-learn and make a list of the next most important things. That list is below, along with some general todos (docs, tutorials). Feel free to ping me or comment below and grab whatever interests you from the following list.

Onward and Upward!

linear_model

  • Exact solution for linear_regression

datasets

naive_bayes

svm

  • LinearSVC
  • LinearSVR
  • SVC
  • SVR

model_selection

  • GroupKFold
  • #46
  • ShuffleSplit
  • #45

decomposition

  • PCA

hyper_parameter

ensemble

  • VotingRegressor
  • VotingClassifier
  • RandomForestClassifier
  • RandomForestRegressor

docs

  • Make Basic Docs site
  • Push the Basic Docs site to scikit.org. Have scikit.js redirect to scikit.org
  • Make Basic Docs site show api for all functions / classes that we export
  • Make it build browser and node versions
  • Make the tests run against browser and node environments

Import LibSVM into the library

Is your feature request related to a problem? Please describe.
It would be amazing to have a fast LibSVM implementation for SVC, SVR estimators.

Describe the solution you'd like
Compile the LibSVM project to a wasm file, and use that with the familiar SVM sklearn api

Create the RidgeRegression Estimator

Is your feature request related to a problem? Please describe.
Like the scikit-learn library we should have a RidgeRegression Estimator.

Describe the solution you'd like
Similar to the LinearRegression issue, I think the best plan of action is to create an SGD solution that delivers the right answer (even if it is slower) and then eventually switch to a faster linear algebra library once we understand that landscape a bit more.
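As a rough illustration of the gradient-based route, here is a sketch that fits ridge regression with plain full-batch gradient descent (not stochastic, and not using TensorFlow.js). The function name `fitRidgeGD`, the learning rate, and the epoch count are assumptions for the example, and the objective used is mean squared error plus alpha times the squared weight norm, which scales differently from scikit-learn's sum-of-squares formulation.

```typescript
// Hypothetical sketch: minimise (1/n) * sum((w.x + b - y)^2) + alpha * |w|^2
// with full-batch gradient descent. The bias term is not penalised.
function fitRidgeGD(
  X: number[][],
  y: number[],
  alpha: number,
  learningRate = 0.05,
  epochs = 2000
): { weights: number[]; bias: number } {
  const n = X.length;
  const d = X[0].length;
  const weights: number[] = [];
  for (let j = 0; j < d; j++) weights.push(0);
  let bias = 0;

  for (let epoch = 0; epoch < epochs; epoch++) {
    const gradW: number[] = weights.map(() => 0);
    let gradB = 0;
    for (let i = 0; i < n; i++) {
      let pred = bias;
      for (let j = 0; j < d; j++) pred += weights[j] * X[i][j];
      const err = pred - y[i];
      for (let j = 0; j < d; j++) gradW[j] += (2 * err * X[i][j]) / n;
      gradB += (2 * err) / n;
    }
    // The L2 penalty adds 2 * alpha * w to the weight gradient.
    for (let j = 0; j < d; j++) {
      weights[j] -= learningRate * (gradW[j] + 2 * alpha * weights[j]);
    }
    bias -= learningRate * gradB;
  }
  return { weights, bias };
}

// Data generated from y = 2x + 1; a larger alpha shrinks the slope towards 0.
const ridgeDemo = fitRidgeGD([[0], [1], [2], [3]], [1, 3, 5, 7], 1.0);
console.log(ridgeDemo.weights, ridgeDemo.bias);
```

The learning rate has to be small relative to the scale of the features, so a real estimator would likely standardise inputs or adapt the step size.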

Any input about WASM?

It seems based on Node.js currently, but there's WASM around the corner:

Any input on it w.r.t. this project?

Getting an error calling model.score

Hi!

Maybe I did something wrong but I think this is the simplest example to get a score out.

When I call model.score(xTest, yTest) I get the error "Labels can't be converted to a 1D Tensor".

const btcusdtData = await ensureCryptoData('BTCUSDT', '1d')
const df = new dnf.DataFrame(btcusdtData)
// Couldn't use a list of column names here unfortunately so I used their index numbers. Would be a lot cooler if I could use 'open', 'high', etc.
const x = df.iloc({ columns: [1, 2, 3, 5, 6, 7, 8, 9, 10, 11] })
const y = df.iloc({ columns: [4] })
const [xTrain, xTest, yTrain, yTest] = sk.trainTestSplit(x.tensor, y.tensor)
const model = new sk.LinearRegression()
await model.fit(xTrain, yTrain)
const score = model.score(xTest, yTest)

save and load

Hi,

How do you save and then re-open a trained network?
It doesn't seem to be explained anywhere in the docs.

Thanks.

Build out our Scikit Learn metrics

Is your feature request related to a problem? Please describe.
We should have a robust and well-tested set of metrics that mirror those on sklearn.

Describe the solution you'd like
Many of these will be thin wrappers around the ones that ship with tensorflow.js, but there will be some that we will just have to simply write using basic tensor math.
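To give a sense of how small the hand-written ones can be, here is a sketch of two regression metrics from the Version 1 list. It operates on plain number arrays rather than tensors, which is an assumption made to keep the example self-contained; the real implementations would presumably accept tensors.

```typescript
// Sketch on plain arrays (not tensors): mean of squared differences.
function meanSquaredError(yTrue: number[], yPred: number[]): number {
  if (yTrue.length === 0 || yTrue.length !== yPred.length) {
    throw new Error("yTrue and yPred must be non-empty and of equal length");
  }
  let sum = 0;
  for (let i = 0; i < yTrue.length; i++) {
    const diff = yTrue[i] - yPred[i];
    sum += diff * diff;
  }
  return sum / yTrue.length;
}

// Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
// A constant yTrue makes the denominator zero; this sketch does not handle that case.
function r2Score(yTrue: number[], yPred: number[]): number {
  if (yTrue.length === 0 || yTrue.length !== yPred.length) {
    throw new Error("yTrue and yPred must be non-empty and of equal length");
  }
  const mean = yTrue.reduce((acc, v) => acc + v, 0) / yTrue.length;
  let ssRes = 0;
  let ssTot = 0;
  for (let i = 0; i < yTrue.length; i++) {
    ssRes += (yTrue[i] - yPred[i]) * (yTrue[i] - yPred[i]);
    ssTot += (yTrue[i] - mean) * (yTrue[i] - mean);
  }
  return 1 - ssRes / ssTot;
}

console.log(meanSquaredError([1, 2, 3], [1, 2, 5]), r2Score([1, 2, 3], [2, 2, 2]));
```

Predicting the mean of yTrue for every row gives an R² of 0, and a perfect prediction gives 1, which makes both cases handy as unit tests.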

Create the LinearRegression Estimator

Is your feature request related to a problem? Please describe.
In order to match the Estimators in scikit-learn, we should make a LinearRegression estimator with the same API.

Describe the solution you'd like
There are a couple of ways to do this. Scikit-Learn leans on linear algebra libraries from scipy. I'm honestly not sure if JS (either Node / Browser) has adapters for the same low-level BLAS and LAPACK libraries that scipy uses.
In the absence of those libraries, the plan should be to just create a gradient descent solution using a TF model.

Once we know / can use low level libraries for solving a linear system of equations, we should trade out the SGD solution above in favor of that.

Additional context
There might be cases where we actually want to have both implementations (linear algebra solver and SGD), because we could deploy different ones to different contexts, e.g. the linear algebra solver on Node and the SGD solver on the web. This would get rid of the need for shipping an entire solver to the client. But that's something that we would need to test later.
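For small problems the linear-algebra route does not strictly need BLAS or LAPACK. The sketch below solves the normal equations (XᵀX)w = Xᵀy with Gaussian elimination and partial pivoting. The names `solveLinearSystem` and `fitLinearRegressionExact` are hypothetical, and the approach is a stand-in for illustration: it is numerically weaker than the decompositions scipy relies on and will not scale to large feature counts.

```typescript
// Solve A x = b by Gaussian elimination with partial pivoting.
function solveLinearSystem(A: number[][], b: number[]): number[] {
  const n = b.length;
  const M = A.map((row, i) => row.concat([b[i]])); // augmented matrix copy
  for (let col = 0; col < n; col++) {
    let pivot = col;
    for (let r = col + 1; r < n; r++) {
      if (Math.abs(M[r][col]) > Math.abs(M[pivot][col])) pivot = r;
    }
    if (Math.abs(M[pivot][col]) < 1e-12) throw new Error("matrix is singular");
    const tmp = M[col];
    M[col] = M[pivot];
    M[pivot] = tmp;
    for (let r = col + 1; r < n; r++) {
      const factor = M[r][col] / M[col][col];
      for (let c = col; c <= n; c++) M[r][c] -= factor * M[col][c];
    }
  }
  // Back substitution.
  const x: number[] = b.map(() => 0);
  for (let r = n - 1; r >= 0; r--) {
    let s = M[r][n];
    for (let c = r + 1; c < n; c++) s -= M[r][c] * x[c];
    x[r] = s / M[r][r];
  }
  return x;
}

// Least squares via the normal equations; a column of ones models the intercept.
function fitLinearRegressionExact(
  X: number[][],
  y: number[]
): { weights: number[]; bias: number } {
  const Xb = X.map((row) => row.concat([1]));
  const d = Xb[0].length;
  const XtX: number[][] = [];
  const Xty: number[] = [];
  for (let a = 0; a < d; a++) {
    const row: number[] = [];
    for (let c = 0; c < d; c++) {
      let s = 0;
      for (let i = 0; i < Xb.length; i++) s += Xb[i][a] * Xb[i][c];
      row.push(s);
    }
    XtX.push(row);
    let t = 0;
    for (let i = 0; i < Xb.length; i++) t += Xb[i][a] * y[i];
    Xty.push(t);
  }
  const coef = solveLinearSystem(XtX, Xty);
  return { weights: coef.slice(0, d - 1), bias: coef[d - 1] };
}

// Data generated from y = 1 + 2*x1 - 3*x2.
const exactDemo = fitLinearRegressionExact(
  [[0, 1], [1, 0], [2, 2], [3, 1]],
  [-2, 3, -1, 4]
);
console.log(exactDemo.weights, exactDemo.bias);
```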

Better Github Actions setup

The goals of the better GitHub Actions setup are:

  1. Every time a PR is pushed to main, there is an auto-deploy to npm for both scikitjs and scikitjs-node
  2. A coverage report is created (one is right now, but it might not be as cool as you are thinking, @yawetse)
  3. Any other goodies that you think we are missing, @yawetse

Model saving and loading

Is your feature request related to a problem? Please describe.
How are models planned to be saved and loaded?

Describe the solution you'd like
I think we can have fromJson and toJson methods to save and load model params and weights. If this is needed, I can start working on it.
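A minimal sketch of what such a pair could look like, using a toy model so the example is self-contained. `TinyLinearModel` and the JSON layout are hypothetical and do not describe scikit.js's actual classes or file format.

```typescript
// Hypothetical toy estimator used only to illustrate toJson / fromJson.
class TinyLinearModel {
  weights: number[];
  bias: number;

  constructor(weights: number[] = [], bias = 0) {
    this.weights = weights;
    this.bias = bias;
  }

  predict(row: number[]): number {
    return row.reduce((acc, v, j) => acc + v * this.weights[j], this.bias);
  }

  // Serialise the learned parameters together with the estimator name.
  toJson(): string {
    return JSON.stringify({
      name: "TinyLinearModel",
      weights: this.weights,
      bias: this.bias,
    });
  }

  // Rebuild an estimator from JSON, rejecting payloads for other estimators.
  static fromJson(json: string): TinyLinearModel {
    const parsed = JSON.parse(json);
    if (parsed.name !== "TinyLinearModel" || !Array.isArray(parsed.weights)) {
      throw new Error("JSON does not describe a TinyLinearModel");
    }
    return new TinyLinearModel(parsed.weights.map(Number), Number(parsed.bias));
  }
}

const savedModel = new TinyLinearModel([2, -3], 1).toJson();
const restoredModel = TinyLinearModel.fromJson(savedModel);
console.log(restoredModel.predict([2, 0]));
```

Storing the estimator name alongside the parameters lets a generic loader pick the right class, and estimators backed by tensors would need to convert weights to arrays before serialising.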

yarn add scikitjs fails

Steps to reproduce the behavior:

  1. Create dir \scikitjs
  2. Init npm dir "npm init"
  3. Add module: yarn add scikitjs
  4. error:
    .....
    error D:\reactapp\personal\scikitjs\node_modules\scikitjs: Command failed.
    Exit code: 1
    Command: (cd docs && npm install && cd ..); (npx husky install);
    ....

Desktop:

  • OS: Windows 10
  • Node version: v14.18.1

Update Module Structure

from @dcrescim:
My basic premise (and I think we are aligned here) is that less is more.
If we can build a library with fewer repositories (1 is better than 2), or if we can support the same use cases with fewer npm packages (like supporting node/cpu/wasm with only 1 package), then that is better.

The only "gotcha" which will force my hand into "more repo / more packages" territory is if we can't keep the user experience clean.
So what does the dream scenario look like? Here's some example code:

import { LinearRegression } from 'scikitjs' // uses tfjs library, and whichever (webgl, cpu) backend is better
import { LinearRegression } from 'scikitjs/node' // uses tfjs-node library
import { LinearRegression, tf } from 'scikitjs' // uses tfjs library, with wasm backend
tf.setBackend('wasm') 
import { LinearRegression } from 'scikitjs/node-gpu' // uses tfjs-node-gpu library

I don't care too much about that last case in the short term, but it's there just to make sure our code structure could eventually support it one day

Configure Semantic Release

When PRs are merged onto the main branch:

  • the changelog should get updated
  • scikitjs-node should be auto-published to NPM with the correct semver
  • github should tag each release
  • scikitjs-browser should be auto-published to NPM too

Up for debate:

  • Should we have a single node package with multiple build targets?
  • Or have a separate repo for browser that's just there for publishing to NPM (my preference)

Need a global variable/namespace for scikit.js. It's more flexible for interactive, ad-hoc testing in the browser console

Preferred usage in HTML (d3.js, tensorflow.js, and tfjs-vis expose the global variables d3, tf, and tfvis respectively in the browser console):

<script src='src/d3.min.js'></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/[email protected]/dist/tf.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-vis"></script>

Then, in the browser console, when you type tfvis. , d3. , or tf. a drop-down list of the available modules/functions
appears immediately after the dot. This makes debugging and testing faster, and it only takes one script tag per library in the HTML!

===========================
scikit.js in HTML from your website:

<script type="module">
  import * as tf from 'https://cdn.skypack.dev/@tensorflow/tfjs'
  import * as sk from 'https://cdn.skypack.dev/scikitjs'
  sk.setBackend(tf)
</script>

Loaded this way, sk is not visible in the browser console. Can you provide a link or file for scikit.js that exposes a global
variable such as "sk", if you have one?

If scikit.js does not have this feature, will it provide one in the future, and in what time frame?

Is there a way to work around this, i.e. to create a global variable "sk" and get access to all submodules
and functions in the browser console for interactive testing?
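One possible workaround, sketched under assumptions: a module script can attach the imported namespace to globalThis itself, after which the console can see it. `exposeGlobal` is a hypothetical helper, not a scikit.js function, and the example uses a stand-in object so it runs outside a browser.

```typescript
// Hypothetical helper: publish a module namespace under a global name.
function exposeGlobal(name: string, moduleNamespace: object): void {
  Object.defineProperty(globalThis, name, {
    value: moduleNamespace,
    writable: true,
    configurable: true,
    enumerable: true,
  });
}

// Stand-in object so the sketch runs anywhere. In the page above this would be
// the namespace from `import * as sk from 'https://cdn.skypack.dev/scikitjs'`.
const fakeSk = {
  setBackend: (backend: unknown): void => {
    void backend; // placeholder body, the real function lives in scikit.js
  },
};
exposeGlobal("sk", fakeSk);
```

In the module script itself, calling exposeGlobal('sk', sk) right after the import (or simply assigning globalThis.sk = sk) should make sk and its members available for autocompletion in the console.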

Thank you, I'm looking forward to hearing from you soon.

If I use a bundler, the project fails to build.

I use scikitjs (version 1.24.0) in my project and bundle it with parcel-bundler (version 1.12.5). When I start the project I get the following error:
image
If I import the bundle with a script tag (CDN), it works fine.

The error is located at /node_modules/scikitjs/dist/es5/index.js (line 20). If I comment out that line, it works fine.

KNeighborsRegressor

Build a KNeighborsRegressor which matches the sklearn API.

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor

Technically there are 3 different strategies: ‘ball_tree’, ‘kd_tree’, and ‘brute’. This issue is only to support the "brute" method, which computes the distance between the query point and every point in the training input.

The other two are optimizations which use trees (kd_tree) and spheres (ball_tree) to speed up the algorithm, but to start let's just do brute.
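A minimal sketch of the brute-force strategy for regression, assuming Euclidean distance and uniform weights (both are assumptions here, not decisions taken in this issue). `knnRegressBrute` is a hypothetical standalone function, not the estimator class itself.

```typescript
// Brute-force k-nearest-neighbours regression for a single query point.
function knnRegressBrute(
  XTrain: number[][],
  yTrain: number[],
  query: number[],
  k: number
): number {
  if (k < 1 || XTrain.length === 0) {
    throw new Error("need k >= 1 and at least one training row");
  }
  // Squared Euclidean distance to every training row (same ordering as the true distance).
  const candidates = XTrain.map((row, i) => {
    let dist = 0;
    for (let j = 0; j < row.length; j++) {
      const diff = row[j] - query[j];
      dist += diff * diff;
    }
    return { dist, target: yTrain[i] };
  });
  candidates.sort((a, b) => a.dist - b.dist);
  // Average the targets of the k closest rows.
  const nearest = candidates.slice(0, k);
  return nearest.reduce((acc, c) => acc + c.target, 0) / nearest.length;
}

console.log(knnRegressBrute([[0], [1], [2], [10]], [0, 1, 2, 10], [1.1], 3));
```

Each prediction costs one pass over the training set plus a sort, which is what the tree-based strategies are designed to avoid.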

`make_classification` / `make_regression` from scikit-learn

When scikit-learn tests the effectiveness of their model training, they usually construct fake datasets where they know the underlying model (coefficients). They have two functions for this:

make_regression (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html)
make_classification (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification)

We should do the same. This issue tracks our implementation of makeRegression and makeClassification, which will be helpful for testing the speed and efficacy of our models.
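To illustrate the idea for the regression case, here is a sketch that draws random features, picks known coefficients, and builds targets from them with optional Gaussian noise. `makeRegressionSketch`, its parameters, and the choice of generator are assumptions for the example and do not mirror scikit-learn's options or exact sampling scheme.

```typescript
interface SyntheticRegression {
  X: number[][];
  y: number[];
  coef: number[];
}

// Hypothetical sketch: y = X . coef + noise, with the true coef returned to the caller.
function makeRegressionSketch(
  nSamples: number,
  nFeatures: number,
  seed: number,
  noiseStd = 0
): SyntheticRegression {
  // Small seeded generator so the dataset is reproducible.
  let state = seed >>> 0;
  const uniform = (): number => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
  // Box-Muller transform: two uniform draws give one standard normal draw.
  const normal = (): number =>
    Math.sqrt(-2 * Math.log(1 - uniform())) * Math.cos(2 * Math.PI * uniform());

  const coef: number[] = [];
  for (let j = 0; j < nFeatures; j++) coef.push(100 * uniform());

  const X: number[][] = [];
  const y: number[] = [];
  for (let i = 0; i < nSamples; i++) {
    const row: number[] = [];
    let target = 0;
    for (let j = 0; j < nFeatures; j++) {
      const value = normal();
      row.push(value);
      target += coef[j] * value;
    }
    X.push(row);
    y.push(target + noiseStd * normal());
  }
  return { X, y, coef };
}

const synthetic = makeRegressionSketch(100, 3, 7, 0.1);
console.log(synthetic.coef);
```

A model test can then fit on X and y and assert that the learned coefficients land close to coef, with a tolerance that grows with noiseStd.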

KNeighborsClassifier

Build a KNeighborsClassifier which matches the scikit-learn API.

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

Technically there are 3 different strategies: ‘ball_tree’, ‘kd_tree’, and ‘brute’. This issue is only to support the "brute" method, which computes the distance between the query point and every point in the training input.

The other two are optimizations which use trees (kd_tree) and spheres (ball_tree) to speed up the algorithm, but to start let's just do brute.
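For classification, the brute-force search is followed by a majority vote among the k closest rows. The sketch below assumes Euclidean distance, uniform weights, and string labels; `knnClassifyBrute` is a hypothetical standalone function, and ties go to whichever label reaches the winning count first among the sorted neighbours.

```typescript
// Brute-force k-nearest-neighbours classification for a single query point.
function knnClassifyBrute(
  XTrain: number[][],
  yTrain: string[],
  query: number[],
  k: number
): string {
  if (k < 1 || XTrain.length === 0) {
    throw new Error("need k >= 1 and at least one training row");
  }
  const neighbours = XTrain.map((row, i) => {
    let dist = 0;
    for (let j = 0; j < row.length; j++) {
      const diff = row[j] - query[j];
      dist += diff * diff;
    }
    return { dist, label: yTrain[i] };
  })
    .sort((a, b) => a.dist - b.dist)
    .slice(0, k);

  // Majority vote among the k closest rows.
  const votes: Record<string, number> = Object.create(null);
  let best = neighbours[0].label;
  for (const n of neighbours) {
    votes[n.label] = (votes[n.label] || 0) + 1;
    if (votes[n.label] > votes[best]) best = n.label;
  }
  return best;
}

const knnPoints = [[0, 0], [0, 1], [5, 5], [6, 5], [5, 6]];
const knnLabels = ["a", "a", "b", "b", "b"];
console.log(knnClassifyBrute(knnPoints, knnLabels, [0.2, 0.2], 3));
```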
