rust-ml / linfa
A Rust machine learning framework.
License: Apache License 2.0
(Related to #22)
Hey,
I would like to start implementing sparse PCA, and I have a question and need feedback, as my ML experience is minimal.
Should it live behind a `sparse` flag? The hyperparameters would be:
- `n_components` (P)
- `l1_penalty` (H)
- `l2_penalties` (H)
- `max_n_iterations` (P)
- `tolerance` (P)
There are some broken links and other items that need updating in the readme before we reach out to a wider audience:
Some things which would also be at least nice to haves:
In terms of functionality, the mid-term end goal is to achieve an offering of ML algorithms and pre-processing routines comparable to what is currently available in Python's scikit-learn.
These algorithms can either be:
In no particular order, focusing on the main gaps:
Clustering:
Preprocessing:
Supervised Learning:
friedrich (tracking issue nestordemeure/friedrich#1)
The collection is purposely loose and non-exhaustive; it will evolve over time. If there is an ML algorithm that you find yourself using often day to day, please feel free to contribute it!
Feel free to use the HNSW crate I created, which ports HNSW to Rust: https://github.com/rust-photogrammetry/hnsw. If you are interested, I can add it as an ANN algorithm. I am not as familiar with clustering, but you should be able to build a very fast approximate clustering algorithm utilizing HNSW.
Let me know if this would be a good addition and what to add to linfa (and where). HNSW should perform quite well from a benchmark perspective, but I haven't run it neck-and-neck with existing implementations using the exact same data and bench harness yet, so I am interested in trying that if Python bindings are added.
At the moment, there's a duplicated k-means example: one in the `linfa-clustering` package, and one in the top-level crate's `examples/` directory.
With a limited number of algorithms, I suppose it's possible to have an example of each in the top-level crate, but I think it will be neater if examples are kept in their respective packages.
I'm doing a review of the `linfa-clustering` module at the moment, so I would be happy to take care of this when I'm ready to submit that PR.
Right now CI doesn't enable any Cargo features. I suggest that we run a separate `cargo check` on the algorithm crates with all non-conflicting features enabled, to avoid bugs like #128. At the very least we should check the crates with `serde` enabled.
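As a rough sketch, the extra CI step could be a handful of feature-enabled checks. This is a CI configuration fragment, not the actual workflow; the crate list and the assumption that each crate exposes a `serde` feature are illustrative:

```shell
# Baseline check with default features across the workspace.
cargo check --workspace --all-targets

# Hypothetical per-crate checks with extra features enabled
# (crate names and feature names are assumptions, not the real matrix).
cargo check -p linfa-clustering --features serde
cargo check -p linfa-linear --features serde
```

Keeping these as separate invocations (rather than `--all-features`) avoids enabling mutually conflicting features, such as multiple BLAS backends, in one build.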
It would be useful, especially to potential contributors, to have a unified description of how public interfaces should be structured. At first glance, I assumed we would be attempting to stay as close to sklearn's conventions as is reasonable with Rust's conventions and syntax. Looking at the KMeans implementation, however, shows significant departure from sklearn's parameter naming scheme, and introduces several additional types for Hyperparameter management.
I think it's important in the long run for the public interfaces to be both intuitive and consistent. With that in mind, I think we should:
Start a discussion about design choices for the public interfaces. I personally am not sure that the utility of introducing HyperParams structs and accompanying factory functions justifies the additional API complexity. But I could absolutely be wrong about that. I'd just like to see some rationale.
Write up the conclusions of that discussion in a design doc. This doesn't need to be that complex or in-depth, just a basic statement of conventions and design philosophy to make it easier for contributors.
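To make the trade-off concrete, here is a minimal, self-contained sketch of the two styles under discussion. All names (`FlatKMeans`, `KMeansHyperParams`, the builder) are hypothetical, not linfa's actual API:

```rust
// Style 1 (sklearn-like): a flat struct the caller fills in directly.
#[allow(dead_code)]
pub struct FlatKMeans {
    pub n_clusters: usize,
    pub tolerance: f64,
}

// Style 2 (current KMeans approach): a HyperParams struct constructed
// through a validating factory/builder.
#[derive(Debug)]
pub struct KMeansHyperParams {
    n_clusters: usize,
    tolerance: f64,
}

pub struct KMeansHyperParamsBuilder {
    n_clusters: usize,
    tolerance: f64,
}

impl KMeansHyperParams {
    // Factory function: the mandatory parameter up front, defaults elsewhere.
    pub fn new(n_clusters: usize) -> KMeansHyperParamsBuilder {
        KMeansHyperParamsBuilder { n_clusters, tolerance: 1e-4 }
    }
}

impl KMeansHyperParamsBuilder {
    pub fn tolerance(mut self, tolerance: f64) -> Self {
        self.tolerance = tolerance;
        self
    }

    // `build` is the single place where invariants are checked.
    pub fn build(self) -> Result<KMeansHyperParams, String> {
        if self.n_clusters == 0 {
            return Err("n_clusters must be > 0".into());
        }
        if self.tolerance <= 0.0 {
            return Err("tolerance must be positive".into());
        }
        Ok(KMeansHyperParams { n_clusters: self.n_clusters, tolerance: self.tolerance })
    }
}

fn main() {
    let params = KMeansHyperParams::new(3).tolerance(1e-5).build().unwrap();
    println!("{:?}", params);
    // Invalid settings are rejected in one place:
    assert!(KMeansHyperParams::new(0).build().is_err());
}
```

What the extra types buy is centralized validation at `build()` and the impossibility of constructing invalid hyperparameters; the cost is exactly the additional API surface this issue is questioning.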
There is a new Rust machine learning library with a nice website but, currently, relatively few algorithms: smartcorelib.org.
It might be worth contacting them to unify efforts.
Details in PR-39
We could add a version of `Predict::predict` which writes the output to a caller-provided buffer instead of allocating the output. Something like `PredictInplace::predict_inplace`, with a signature like `fn predict_inplace(&self, x: &R, y: &mut T)`.
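A minimal sketch of the proposed trait pair, with a toy model standing in for a real estimator. The generic bounds in linfa would be richer; everything here (trait shapes, the `Doubler` model) is illustrative:

```rust
// Allocating prediction, as it exists today.
pub trait Predict<R, T> {
    fn predict(&self, x: &R) -> T;
}

// Proposed in-place variant: the caller owns the output buffer.
pub trait PredictInplace<R, T> {
    fn predict_inplace(&self, x: &R, y: &mut T);
}

// Toy "model" that doubles every input value.
struct Doubler;

impl PredictInplace<Vec<f64>, Vec<f64>> for Doubler {
    fn predict_inplace(&self, x: &Vec<f64>, y: &mut Vec<f64>) {
        assert_eq!(x.len(), y.len(), "output buffer must match input length");
        for (yi, xi) in y.iter_mut().zip(x) {
            *yi = 2.0 * xi;
        }
    }
}

// The allocating version can be a thin wrapper over the in-place one,
// so implementors only write the buffer-filling logic once.
impl<M: PredictInplace<Vec<f64>, Vec<f64>>> Predict<Vec<f64>, Vec<f64>> for M {
    fn predict(&self, x: &Vec<f64>) -> Vec<f64> {
        let mut y = vec![0.0; x.len()];
        self.predict_inplace(x, &mut y);
        y
    }
}

fn main() {
    let model = Doubler;
    let x = vec![1.0, 2.0, 3.0];
    let y = model.predict(&x);
    assert_eq!(y, vec![2.0, 4.0, 6.0]);
    println!("{:?}", y);
}
```

The main payoff is in hot loops (e.g. repeated prediction during cross-validation), where the output allocation can be reused across calls.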
Currently, this framework only provides the implementation of a single algorithm (an unsupervised algorithm), but this framework (repository) has been around for 2 years, which makes the goal of this project very questionable. Therefore, I suggest you either archive this repo or change its name or goal. It's misleading to have a repo that suggests that this is an ML framework, while it's just an implementation of a single algorithm.
I tried to compile linfa without any C library dependencies, without success so far. It seems that ndarray depends on a C library (BLAS), but this can be deactivated / is an optional build dependency.
Is there a way to compile linfa in pure rust?
I saw your project on awesome-rust-mentors, and I'd like to be a mentee.
I didn't find much information on this, so I don't know how this works.
A brief bio about myself:
I'm a PhD candidate in economics, close to getting a Master's degree in Mathematics. I have some ML experience, mainly in DL.
Best
So the `OpticsResult` provides metrics about a dataset that can be used to drive other analysis. It's commonly mentioned that you can derive DBSCAN clusters from the output of OPTICS (however, OPTICS is slower than DBSCAN on its own, so this would be pointless).
One thing you can do with the output of OPTICS is compute the Local Outlier Factor for each element of the dataset and use that to detect anomalies. This issue suggests adding a trait impl on the `OpticsResult` type in my OPTICS PR to implement this functionality.
Code coverage was disabled when I re-enabled the CI, as I had some problems getting tarpaulin to work on our project. It should be enabled again.
Most of the time the dataset is implemented for `ArrayBase<D, Ix2>` for records and `ArrayBase<D, Ix1>` for targets. We could simplify a lot of `impl`s by:
- renaming `Dataset` to `DatasetBase`
- adding `Dataset` and `DatasetView` type aliases (and possibly a `Dataframe` alias)
- implementing for new types
- `iter_fold`
- improving ergonomics
There are some improvements to the linear regression / least squares code which would be nice to implement:
- `fit_intercept`, `fit_intercept_with_normalisation` to be consistent with linfa-logistic
This issue suggests some ideas which you might use to improve your testing. Normally, writing tests can be a very time-consuming task, and it is crucial to have a large number of tests for good coverage.
If you have any specific test ideas for any algorithm in the linfa ecosystem, please add a comment below!
```
= note: /usr/bin/ld: /home/m0riarty_d3v/WebstormProjects/linfa/target/debug/deps/libndarray_linalg-ab5e3b1dfd453661.rlib(ndarray_linalg-ab5e3b1dfd453661.ndarray_linalg.1huagtdm-cgu.7.rcgu.o): in function `lapacke::dgetrf':
/home/m0riarty_d3v/.cargo/registry/src/github.com-1ecc6299db9ec823/lapacke-0.2.0/src/lib.rs:6130: undefined reference to `LAPACKE_dgetrf'
/usr/bin/ld: /home/m0riarty_d3v/WebstormProjects/linfa/target/debug/deps/libndarray_linalg-ab5e3b1dfd453661.rlib(ndarray_linalg-ab5e3b1dfd453661.ndarray_linalg.1huagtdm-cgu.7.rcgu.o): in function `lapacke::dgetrs':
/home/m0riarty_d3v/.cargo/registry/src/github.com-1ecc6299db9ec823/lapacke-0.2.0/src/lib.rs:6302: undefined reference to `LAPACKE_dgetrs'
collect2: error: ld returned 1 exit status
```
Kernel: x86_64 Linux 5.4.0-26-generic
This error appears when I execute `cargo run --features openblas --example diabetes` (in the linfa-linear folder).
One suggestion to prevent bringing in more dependencies than a user needs would be to make serde an optional dependency and adjust the `Serialize` and `Deserialize` derives to be like `#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]`. This would be a breaking change, so it would need a minor version update.
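A minimal sketch of the pattern, assuming an optional `serde` dependency in Cargo.toml. The struct and field names are made up for illustration:

```rust
// Cargo.toml sketch (assumption: making the dependency optional
// auto-creates a `serde` feature of the same name):
//
//   [dependencies]
//   serde = { version = "1", features = ["derive"], optional = true }

// The derives are only applied when the `serde` feature is enabled,
// so plain builds pull in no serde code at all.
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
#[derive(Debug, Clone, PartialEq)]
pub struct HyperParams {
    pub n_iterations: u64,
    pub tolerance: f64,
}

fn main() {
    let p = HyperParams { n_iterations: 100, tolerance: 1e-6 };
    println!("{:?}", p);
}
```

When the feature is off, `cfg_attr` expands to nothing, so downstream users who never serialize models pay no compile-time or dependency cost.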
I'm opening this issue to keep track of the work required to bump to ndarray `0.15`. From #110 we get the following list:
One area where we are lacking right now is benchmarking coverage. I would like to improve that in the coming weeks.
Benchmarks are an essential part of linfa. They should give contributors feedback on their implementations, and give users confidence that we're doing good work. In order to automate the process we have to employ a CI system which creates a benchmark report on (a) PRs and (b) commits to the master branch. This is difficult with wall-clock benchmarks (i.e. criterion.rs) but possible with valgrind.
Many of our Criterion benchmarks (particularly the ones in `linfa-clustering`) call `plot_config`, which causes the runner to create plots at the end of the benchmarks. The issue is that the plot creation takes seemingly forever. I'm in favour of removing the plots since I don't really use them, but if someone objects to that then we can look into it.
I came across an issue compiling `linfa-clustering` as a dependency for the Rust-ML book today, which per the error message is tracked down to a line in `linfa-clustering/src/gaussian_mixture/algorithm.rs` in the `compute_precisions_cholesky_full()` function, which uses the Cholesky method imported from `ndarray-linalg`. As far as I can tell, when cloning linfa's `master` and trying to build the sub-crate, I run into the same compilation error. I've tinkered around with it, but haven't figured out a fix quite yet.
```
   Compiling linfa-clustering v0.3.1 (https://github.com/rust-ml/linfa?branch=master#b7c31c58)
error[E0599]: no method named `cholesky` found for struct `ArrayBase<ViewRepr<&<F as linfa::Float>::Lapack>, Dim<[usize; 2]>>` in the current scope
   --> /home/chrism/.cargo/git/checkouts/linfa-f557574ebcb8312b/b7c31c5/algorithms/linfa-clustering/src/gaussian_mixture/algorithm.rs:262:51
    |
262 |         let decomp = covariance.with_lapack().cholesky(UPLO::Lower)?;
    |                                               ^^^^^^^^ method not found in `ArrayBase<ViewRepr<&<F as linfa::Float>::Lapack>, Dim<[usize; 2]>>`

error: aborting due to previous error
```
The DBSCAN implementation in linfa-clustering only includes core points, i.e. points with at least minPts neighbours. Border points (points with fewer than minPts neighbours) connected to core points are counted as noise, which differs from how DBSCAN is typically described. Is this behaviour intentional or a bug?
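A tiny self-contained 1-D illustration of the distinction (the data, `eps`, and `min_pts` values are made up; this is not linfa's implementation):

```rust
// Count neighbours within eps of point i (excluding the point itself).
fn neighbours(points: &[f64], i: usize, eps: f64) -> usize {
    points
        .iter()
        .enumerate()
        .filter(|&(j, &p)| j != i && (p - points[i]).abs() <= eps)
        .count()
}

fn main() {
    let pts = [0.0, 0.5, 1.0, 1.5, 2.8];
    let (eps, min_pts) = (1.1, 3);
    for i in 0..pts.len() {
        let n = neighbours(&pts, i, eps);
        println!("point {i} ({}): {n} neighbours, core = {}", pts[i], n >= min_pts);
    }
    // Points 1 and 2 are core. Points 0 and 3 have only 2 neighbours each
    // but lie within eps of a core point: textbook DBSCAN assigns them to
    // the cluster as border points, while a "core points only" rule, as
    // described above, labels them noise.
}
```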
I've recently been getting into Rust, and I do a lot with machine learning (in Python with TensorFlow), so I was wondering if there is any OpenCL or CUDA support by any chance? I was looking at the readme and it didn't seem to mention anything, only BLAS implementations such as Intel MKL.
Calling `cargo tree -d --workspace` at the root of the repo reveals all of linfa's duplicated dependencies. These are dependencies that have the same name but different versions, leading to bloat. We want to eliminate as many duplicated dependencies as possible by ensuring that multiple instances of the same crate in the dependency tree have the same version. We should also prioritize the non-dev dependencies, since they affect end users.
Hopefully just a quick query. Looking in the Cargo.toml, most of the workspace members are added as dependencies of linfa, meaning they're pulled into the library at `src/lib.rs`; however, only clustering is publicly exposed, making the inclusion of the others currently useless.
So from this I guess the plan is to publish each workspace member separately (which is nice). But also have a main crate called linfa that pulls everything in just for convenience?
If that is the case I'd consider making the project root a virtual workspace, adding a library project just called linfa in that and then pulling in the others as dependencies there. I've done that before in some personal projects and it worked well.
Some form of tidy-up and some people making sample projects using it to get a feel for the ergonomics would be good before another release
Here is a collection of work that needs to be done regarding documentation and minor issues with the code itself. Picking an item from this list can be a good way to start getting acquainted with the codebase and provide a useful contribution.
In general, I would say that the pages for the individual algorithms in the `linfa-clustering` and `linfa-elasticnet` sub-crates can be good references for the structure of a documentation page.
It's suggested to look at the existing documentation in your local build rather than on docs.rs, since some pages may have already been updated.
- `linfa-bayes`: expand the sub-crate documentation to include a brief description of the algorithm used and add an example of the usage of the provided model, along with maybe some hints regarding when to choose a naive Bayes predictor. There are already some examples in the dedicated page for the params structure, but maybe it's better to add at least one to the main page for the sub-crate
- `linfa-hierarchical`: the documentation for this sub-crate needs improvements like the ones listed for `linfa-bayes`
- `linfa-ica`: also needs updates similar to `linfa-bayes`
- `linfa-kernel`: add a description of what kernel methods are useful for, similarly to what's written in the `linfa-svm` crate, and add descriptions to the `Kernel` and `KernelView` subtypes. There are some examples in the `KernelBase` struct documentation, but maybe it would be a good idea to have an example of a kernel transformation directly on the crate's main page
- `linfa-linear`: the crate's documentation is inside the `ols` module instead of being in the crate's root, and it is outdated since it says the crate only provides an implementation of the least squares algorithm
- `linfa-linear`: `glm::TweedieRegressor` provides its own `fit` method instead of implementing the `Fit` trait, and the same for `predict`. Moving the methods inside the traits would be a good way to align it with the rest of the algorithms provided and ensure compatibility
- `linfa-logistic`: could use some more documentation to explain the algorithm, and some examples on its main page
- `linfa-logistic`: implements its own `fit` and `predict` methods instead of implementing the `Fit` and `Predict` traits. As in the case of linear regression, it would be good to align the interface with the other sub-crates
- `linfa-reduction`: completely lacks any documentation. An explanation of why dimensionality reduction is useful in ML, plus descriptions of the individual algorithms, would really help with understanding the usefulness of the crate
- `linfa-svm`: this is another reference for what other crates' documentation should look like. Right now the `predict` method returns an `Array1` for classification and a `Vec` for regression; it would be less confusing if the regression case was modified to return an `Array1` too
- `linfa-trees`: needs changes similar to `linfa-bayes`, with the addition of documentation regarding the methods of the params structure
- `linfa`: the main page of the rustdocs still says that linfa only provides the K-Means algorithm, which may be very confusing to someone who sees the crate on docs.rs for the first time. Here a completely revised page, complete with project goals and links to the various sub-crates, would be useful, so that one can find the algorithm they need without manually searching the sub-crates
- `linfa`: the dataset module page could use a bit of explanation about the main differences between the four dataset types, and a brief recollection of what utility methods a dataset provides
- `linfa`: the metrics module page could use a brief list of the provided metrics, so that one does not have to go looking in the sub-pages just to know whether a metric is provided

But it is actually an unsupervised learning algorithm.
Just to mention that I've started looking at using ndarray 0.14 (and I intend to make a PR in the end).
For now, I am blocked by an issue with `sprs`, which is used by `linfa-kernel`: sparsemat/sprs#278
A changelog would be useful so users can see when new features are added and if anything was broken and has been fixed. https://keepachangelog.com/en/1.0.0/
In a recent PR (#97) multiple issues came up with the current KMeans implementation. We could improve it by:
Hey there, interesting project! I was just getting it installed and found that I couldn't compile the docs for `linfa-clustering` or `linfa-reduction` without turning off `default-features`, because of this error:

```
    Updating crates.io index
error: failed to select a version for `linfa-reduction`.
    ... required by package `nasa-gallery v0.1.0 (/home/zicklag/git/zicklag/nasa-gallery)`
versions that meet the requirements `*` are: 0.2.0

the package `nasa-gallery` depends on `linfa-reduction`, with features: `openblas-src` but `linfa-reduction` does not have these features.

failed to select a version for `linfa-reduction` which could resolve this conflict
```
I would like to add the linnerud dataset to linfa-dataset, but I've noticed that currently targets are only handled as a one-dimensional array. Should we introduce something like:

```rust
pub type MultiTargetDataset<D, T> =
    DatasetBase<ArrayBase<OwnedRepr<D>, Ix2>, ArrayBase<OwnedRepr<T>, Ix2>>;
```

or is there a better way?
Following my message to the ML WG Zulip chat:
As some of you might have noticed, my activity in the ML corner of the Rust ecosystem has quite diminished over the last few months.
Due to what amounts to a career change (ML -> Software Engineer) I have been focusing more on the web/networking ecosystem and I have found it more and more time-consuming to context-switch at the end of the day to work on ML topics.
Motivation plays a role there too, as I am myself less engaged in the future evolution of the ecosystem.
I'd thus be more than open to moving ownership of linfa over to the WG, or to a subset of volunteers with an interest in moving the project forward.
I moved the ownership of `linfa` to `rust-ml`, which is going to be a better steward going forward, with surely more helping hands to dedicate to the project.
My sincerest apologies to @paulkoerbitz, @mossbanay and @bytesnake. I left your PRs unanswered for a long time, but I really couldn't manage to juggle the number of projects I have been working on during the last few months.
Hello,
Is there a way to serialize the preprocessing steps like CountVectorizer?
Any help would be greatly appreciated!
Hey, I'm new to linfa and couldn't find methods to save and load trained models.
Any ideas how this could be done? Thanks!
Line 47: `let entropy_pred_y = gini_model.predict(&test);`
should be `let entropy_pred_y = entropy_model.predict(&test);`
Maybe this isn't the correct place to ask, or maybe I just need a fresh pair of eyes, but I'm having trouble using the ndarray-linalg `svd` function.
The following code throws an error at the `unwrap()`:

```rust
    X: &ArrayBase<impl Data<Elem = f64>, Ix2>,
) -> Array2<f64> {
    let temp: Array2<f64> = X.to_owned().clone();
    let (u, s, v) = temp.svd(true, true).unwrap();
```

while the following does not:

```rust
    x: &ArrayBase<impl Data<Elem = f64>, Ix2>,
    n_components: f64,
) -> Self {
    let (_n, _m) = x.dim();
    // calculate the array of columnar means
    let mean = x.mean_axis(Axis(0)).unwrap();
    // subtract means from X
    let h: Array2<f64> = Array2::ones((_n, _m));
    let temp: Array2<f64> = h * &mean;
    let b: Array2<f64> = x - &temp;
    // compute SVD
    let (u, sigma, v) = b.svd(true, true).unwrap();
```

Any help would be appreciated.
Nearest neighbour algorithms are an item on the roadmap. They're helpful in clustering algorithms, particularly DBSCAN. There are a number of existing implementations of advanced nearest neighbour algorithms in the Rust ecosystem, but they are not compatible with `ndarray`, so they'd need to be modified to be included in linfa.
Existing algorithms I found:
We should pick at least one of the O(log N) algorithms to integrate into linfa, along with linear search due to its simplicity. I'm not sure how many of these algorithms we really need, since I don't know the detailed trade-offs between them.
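For reference, the linear-search baseline mentioned above is only a few lines. This is a hedged sketch over plain slices of 2-D points (a linfa version would presumably operate on `ndarray` views and a generic distance metric instead):

```rust
// Brute-force nearest neighbour: O(n) per query, no index structure.
fn squared_dist(a: &[f64; 2], b: &[f64; 2]) -> f64 {
    (a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2)
}

/// Returns the index of the point closest to `query`, or None if `points` is empty.
fn nearest(points: &[[f64; 2]], query: &[f64; 2]) -> Option<usize> {
    points
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| {
            squared_dist(a, query)
                .partial_cmp(&squared_dist(b, query))
                .expect("distances must not be NaN")
        })
        .map(|(i, _)| i)
}

fn main() {
    let points = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]];
    assert_eq!(nearest(&points, &[0.9, 1.2]), Some(1));
    println!("nearest index: {:?}", nearest(&points, &[0.9, 1.2]));
}
```

Linear search is also a useful correctness oracle: any k-d tree or ball tree integration can be property-tested against it on random data.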
Make the CI build run again.
Discussion: Should we enable Clippy for the project?
How will versioning of linfa and its sub-crates work? Currently the version of linfa on crates.io is 0.1.2 (although in Cargo.toml it's still 0.1.1), and linfa-clustering is released on crates.io as 0.1.0 (consistent with its Cargo.toml).
Some questions that come to mind:
I was recently working on some documentation of the DBSCAN algorithm, and on several occasions was running a test on datasets of varying lengths. Even with some very small changes in the parameters, the execution time could jump from under one second to several minutes. While checking progress is not possible for all algorithms, there are at least a few where displaying a progress indicator could be helpful to users for estimating remaining execution time or at least understanding how far along their run is. Obviously, not all users might be interested in this, so making this feature optional with a feature flag might make sense as well.
Options for this could potentially be the indicatif or pb libraries.
I happened to land here through a post on Hacker News, and this looks very interesting. Unfortunately, the blog post was from December 2019 and the link was only posted there recently. I see that most of the original roadmap has already been implemented. Kudos for that!
I have two questions though:
Thanks again for all the awesome work!
A plain Principal Component Analysis algorithm was added in 7b6075e. The next steps should improve upon edge-cases and features.
Why is the project you wrote named linfa?