Comments (22)

guygastineau commented on August 29, 2024

Oh, thank you. I thought tree might do the trick, but I wasn't sure if the module structure lined up so well with the algorithms. Thank you for the information. You all have organized this project very well.

BTW, I remembered to run my smartcore projects with --release, and the speedup is insane. That significantly reduces how urgently I want/need parallelization, but I will stay on this issue, since I think I should contribute in some way as a user to say thanks to all of you maintaining this excellent library.

VolodymyrOrlov commented on August 29, 2024

@daraghk absolutely! Let me know if you run into any issues while implementing this feature.

morenol commented on August 29, 2024

@guygastineau PR was here. I think it only needs the addition of a Rust feature flag, or we could just disable that code for WASM targets. It also probably needs conflicts resolved against the latest additions to the development branch.

morenol commented on August 29, 2024

I don't have a strong opinion about either option; maybe other contributors or active users of this crate do. @VolodymyrOrlov @Mec-iS @montanalow

On the other hand, I think we could change the internal implementation of fit as long as it stays compatible with the SupervisedEstimator trait:

fn fit(x: &X, y: &Y, parameters: P) -> Result<Self, Failed>

guygastineau commented on August 29, 2024

So, WASM doesn't support parallelization, or does default rayon just not work with WASM?

It is nice that so much of the work is already done, but I expect I'll run into various problems concerning integration and twiddling with feature flags (I primarily program in Haskell, so I am not fluent with all the Rust build system's features).

I will try to set some time aside each week at work for moving this forward. I would spend some evening time, but I am directing the music for a musical over the next few weeks, and all my after-work programming time is gone until that ends.

@morenol please feel free to poke me if I seem to fall off the project for too long.

morenol commented on August 29, 2024

A quick solution could be to add #[cfg(not(target_arch = "wasm32"))] to the multithreaded fit implementation and put #[cfg(target_arch = "wasm32")] on a wasm32 fit implementation, which could just be the current code.
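For illustration, here is a minimal, self-contained sketch of that attribute pattern. Tree, train_tree and fit_trees are hypothetical stand-ins, not smartcore's actual internals, and the non-wasm arm uses plain std::thread only to keep the sketch dependency-free; the real multithreaded path could instead be the rayon-based code discussed elsewhere in this thread.

// Hypothetical stand-ins for the real tree type and training routine.
struct Tree {
    id: usize,
}

fn train_tree(id: usize) -> Tree {
    Tree { id }
}

// Compiled everywhere except wasm32: a multithreaded path (std::thread here,
// but this is where the parallel implementation from the PR would live).
#[cfg(not(target_arch = "wasm32"))]
fn fit_trees(n_trees: usize) -> Vec<Tree> {
    std::thread::scope(|s| {
        let handles: Vec<_> = (0..n_trees)
            .map(|i| s.spawn(move || train_tree(i)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

// Compiled only on wasm32: the current sequential loop, unchanged.
#[cfg(target_arch = "wasm32")]
fn fit_trees(n_trees: usize) -> Vec<Tree> {
    (0..n_trees).map(train_tree).collect()
}

fn main() {
    let forest = fit_trees(8);
    println!("trained {} trees, last id {}", forest.len(), forest.last().unwrap().id);
}

Since both variants share the same signature, callers (and the SupervisedEstimator impl) would not notice which one was compiled in.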

Mec-iS commented on August 29, 2024

OK, but then rayon should be optional? How big is it as a library? Consider that the size of the binary is already 12 MB.

Mec-iS commented on August 29, 2024

There is a list of the algorithm groups on the documentation's main page, in the lib.rs file, or you can build the documentation with cargo doc --no-deps --open:

//! ### Supported algorithms
//! All machine learning algorithms are grouped into these broad categories:
//! * [Clustering](cluster/index.html), unsupervised clustering of unlabeled data.
//! * [Matrix Decomposition](decomposition/index.html), various methods for matrix decomposition.
//! * [Linear Models](linear/index.html), regression and classification methods where output is assumed to have linear relation to explanatory variables
//! * [Ensemble Models](ensemble/index.html), variety of regression and classification ensemble models
//! * [Tree-based Models](tree/index.html), classification and regression trees
//! * [Nearest Neighbors](neighbors/index.html), K Nearest Neighbors for classification and regression
//! * [Naive Bayes](naive_bayes/index.html), statistical classification technique based on Bayes Theorem
//! * [SVM](svm/index.html), support vector machines

The complete list is more or less the module names; for example, in algorithm/neighbour there are:

  • bbd_tree
  • cover_tree
  • fastpair
  • linear_search

Depending on your system you can print the directory structure; on Linux it is the tree command. If you print the directory structure of the src directory with all its subdirectories you will see the list of all the modules (I removed the ones not containing algorithms):

src/
├── algorithm
│   ├── mod.rs
│   ├── neighbour
│   │   ├── bbd_tree.rs
│   │   ├── cover_tree.rs
│   │   ├── fastpair.rs
│   │   ├── linear_search.rs
│   │   └── mod.rs
│   └── sort
│       ├── heap_select.rs
│       ├── mod.rs
│       └── quick_sort.rs
├── cluster
│   ├── dbscan.rs
│   ├── kmeans.rs
│   └── mod.rs
├── decomposition
│   ├── mod.rs
│   ├── pca.rs
│   └── svd.rs
├── ensemble
│   ├── mod.rs
│   ├── random_forest_classifier.rs
│   └── random_forest_regressor.rs
├── lib.rs
├── linalg
│   └── traits
│       ├── cholesky.rs
│       ├── evd.rs
│       ├── high_order.rs
│       ├── lu.rs
│       ├── mod.rs
│       ├── qr.rs
│       ├── stats.rs
│       └── svd.rs
├── linear
│   ├── bg_solver.rs
│   ├── elastic_net.rs
│   ├── lasso_optimizer.rs
│   ├── lasso.rs
│   ├── linear_regression.rs
│   ├── logistic_regression.rs
│   ├── mod.rs
│   └── ridge_regression.rs
├── metrics
│   ├── accuracy.rs
│   ├── auc.rs
│   ├── cluster_hcv.rs
│   ├── cluster_helpers.rs
│   ├── distance
│   │   ├── euclidian.rs
│   │   ├── hamming.rs
│   │   ├── mahalanobis.rs
│   │   ├── manhattan.rs
│   │   ├── minkowski.rs
│   │   └── mod.rs
│   ├── f1.rs
│   ├── mean_absolute_error.rs
│   ├── mean_squared_error.rs
│   ├── mod.rs
│   ├── precision.rs
│   ├── r2.rs
│   └── recall.rs
├── model_selection
│   ├── kfold.rs
│   └── mod.rs
├── naive_bayes
│   ├── bernoulli.rs
│   ├── categorical.rs
│   ├── gaussian.rs
│   ├── mod.rs
│   └── multinomial.rs
├── neighbors
│   ├── knn_classifier.rs
│   ├── knn_regressor.rs
│   └── mod.rs
├── optimization
│   ├── first_order
│   │   ├── gradient_descent.rs
│   │   ├── lbfgs.rs
│   │   └── mod.rs
│   ├── line_search.rs
│   └── mod.rs
├── preprocessing
│   ├── categorical.rs
│   ├── mod.rs
│   ├── numerical.rs
│   ├── series_encoder.rs
│   └── traits.rs
├── svm
│   ├── mod.rs
│   ├── search
│   │   ├── mod.rs
│   │   ├── svc_params.rs
│   │   └── svr_params.rs
│   ├── svc.rs
│   └── svr.rs
└── tree
    ├── decision_tree_classifier.rs
    ├── decision_tree_regressor.rs
    └── mod.rs

Mec-iS commented on August 29, 2024

@guygastineau we also published smartcore-benches if you want to run or contribute some performance tests.

Mec-iS commented on August 29, 2024

Yes, you can create a new Discussion at https://github.com/smartcorelib/smartcore/discussions with your general ideas for what you want to do. Whoever is interested can join the discussion and try to integrate their own ideas. Once the needs for the implementation are clear, we can move on and create issues.

daraghk commented on August 29, 2024

Hey @VolodymyrOrlov, is this still required? I'd be happy to look at it.

daraghk commented on August 29, 2024

Awesome! Are the 'datasets' actively used (e.g. digits.rs)? It doesn't seem to me like they are at the moment. I'm just wondering, as I plan to use 'digits' to benchmark the parallelization implementation here. I've had to make small changes to 'digits' regarding feature and target names.
(Edit: I just found the examples repo where they're used.)

VolodymyrOrlov commented on August 29, 2024

I don't think it is being actively used. Feel free to adjust it to your needs.

guygastineau commented on August 29, 2024

I couldn't find any indication of progress on this. I have started using smartcore for random forest classification. @daraghk do you have a branch anywhere where you have started to work on this? @VolodymyrOrlov do you have a good idea of how much work would be involved? I'm willing to throw some time at it, but I am generally busy. The training feels painfully slow compared to the other stages of my processing pipeline which are all parallelized.

guygastineau commented on August 29, 2024

I did a little spelunking. The method fit looks to have a clear path forward for parallelization in the loop that samples and calls the associated function fit_weak_learner, but parallelizing that loop will likely require trait bounds for Send and Sync to propagate outside of the function definition itself. I hope this won't be too far reaching.

There are some things I would like to know. Are the authors and maintainers of this project interested in the addition of the crate rayon as a dependency for such work, or would you all prefer hand-rolled parallelization using the primitives from std::thread? I can see advantages and disadvantages to each approach. If using rayon I would refactor fit in a "functional"/iterator style to fit its idioms. That might not be desired, and perhaps you all would prefer to avoid adding dependencies anyway. With std::thread I would probably keep the loops, so the function wouldn't look so different.

I should check whether DenseMatrix implements Sync and Send. That might be a requirement here, but I would need to play with the codebase to figure that out.
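To make the shape of that refactor concrete, here is a hedged sketch. Matrix, WeakLearner and the free function fit_weak_learner are hypothetical stand-ins mirroring the names above, not smartcore's real types, and whether the concrete matrix type actually satisfies the Sync bound is exactly the open question just mentioned. It assumes rayon as a dependency.

use rayon::prelude::*;

// Hypothetical stand-ins for the training matrix and a single fitted tree.
struct Matrix {
    rows: usize,
}

struct WeakLearner {
    rows_seen: usize,
}

// Stand-in for the associated function that samples `x` and grows one tree.
fn fit_weak_learner(x: &Matrix, sample: &[usize]) -> WeakLearner {
    WeakLearner {
        rows_seen: sample.len().min(x.rows),
    }
}

// Shape of the current sequential loop ...
fn fit_sequential(x: &Matrix, samples: &[Vec<usize>]) -> Vec<WeakLearner> {
    samples.iter().map(|s| fit_weak_learner(x, s)).collect()
}

// ... and the rayon version. `x` is only borrowed across threads, so the matrix
// type must be `Sync`; in a generic `fit`, that bound (plus `Send` on the owned
// results) is what would have to propagate into the surrounding trait bounds.
fn fit_parallel(x: &Matrix, samples: &[Vec<usize>]) -> Vec<WeakLearner> {
    samples.par_iter().map(|s| fit_weak_learner(x, s)).collect()
}

fn main() {
    let x = Matrix { rows: 1_000 };
    let samples: Vec<Vec<usize>> = (0..100).map(|_| (0..x.rows).collect()).collect();
    let forest = fit_parallel(&x, &samples);
    assert_eq!(forest.len(), fit_sequential(&x, &samples).len());
    println!("trained {} weak learners, first saw {} rows", forest.len(), forest[0].rows_seen);
}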

morenol commented on August 29, 2024

Regarding WASM and multithreading, I think it depends on the runtime engine. In the rayon repo they suggest a way to run it with wasm-bindgen: https://github.com/rayon-rs/rayon/blob/d3d46c31a49bd1ccea4582043bc69631d1fe176b/README.md#usage-with-webassembly.

For wasmtime it seems there is still work to be done to support multithreading: bytecodealliance/wasmtime#888

For that reason, we could make rayon, or any other dependency like that, optional through a feature flag.

guygastineau commented on August 29, 2024

@Mec-iS excellent point. I was thinking about that when I read the PR. I will see if it is possible to have a parallelization feature that is specifically disabled for the wasm target and that pulls in the rayon dependency only when it is turned on. Hopefully the build system makes that possible.
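For what it's worth, Cargo does seem to allow this combination. Below is a minimal sketch assuming a hypothetical feature named "parallel" that enables an optional rayon dependency (e.g. rayon = { version = "1", optional = true } plus [features] parallel = ["rayon"] in Cargo.toml; none of this exists in smartcore today). The parallel arm compiles only when the feature is on and the target is not wasm32; otherwise the sequential fallback compiles.

// Hypothetical helper: map a training closure over tree indices, in parallel
// when the assumed "parallel" feature is enabled on a non-wasm target.
#[cfg(all(feature = "parallel", not(target_arch = "wasm32")))]
fn map_trees<T, F>(n_trees: usize, train: F) -> Vec<T>
where
    T: Send,
    F: Fn(usize) -> T + Send + Sync,
{
    use rayon::prelude::*;
    (0..n_trees).into_par_iter().map(train).collect()
}

// Fallback: feature disabled, or building for wasm32.
#[cfg(any(not(feature = "parallel"), target_arch = "wasm32"))]
fn map_trees<T, F>(n_trees: usize, train: F) -> Vec<T>
where
    F: Fn(usize) -> T,
{
    (0..n_trees).map(train).collect()
}

fn main() {
    let depths: Vec<usize> = map_trees(10, |i| i % 4);
    println!("{:?}", depths);
}

If the goal is also to drop rayon from the dependency graph entirely on wasm builds, a target-specific dependency table ([target.'cfg(not(target_arch = "wasm32"))'.dependencies]) might be worth checking too, but the feature flag above is the simpler starting point.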

Mec-iS commented on August 29, 2024

OK. Also, it would be nice to have a codebase-wide list of methods that can be made parallel, so we can see how worthwhile it is to introduce rayon as a dependency. Adding a dependency is worth it if it can be reused in many parts of the codebase.

guygastineau commented on August 29, 2024

@Mec-iS is there a master list of algorithms in the smartcore library somewhere? I am willing to check which ones are parallelizable.

Mec-iS commented on August 29, 2024

The new version 0.3 is out today, with support for Rust 2021; see the improvements in the docs: https://docs.rs/smartcore/latest/smartcore/

Check the list of issues, and use GitHub Discussions if you want to propose larger improvements: https://github.com/smartcorelib/smartcore/discussions

guygastineau commented on August 29, 2024

> @guygastineau we also published smartcore-benches if you want to run or contribute some performance tests.

I will take a look. BTW, should the list of parallelizable algorithms start in this thread/issue, or should we begin it somewhere else, or in a new issue that references this one? I remember seeing a link somewhere in our conversation here the other day (I think; maybe it was somewhere else in smartcore's repo/GitHub) suggesting that improvement proposals should go somewhere for discussion, but I can't find it now. I am feeling a bit brain-dead. We have had rehearsals after work from 7 to 10 or 11 every night this week (for the musical I mentioned previously), and I'll have shows and rehearsals straight through next Saturday (so I will continue to be brain-dead for at least another week).

thegwan commented on August 29, 2024

Has there been any progress on random forest parallelization lately, or a branch that someone has started?
