rust-ml / linfa
A Rust machine learning framework.
License: Apache License 2.0
(Related to #22)
Hey,
I would like to start implementing sparse PCA, and I have a question and need feedback, as my ML experience is minimal.
Should it live behind a `sparse` flag? The hyperparameters would be:
- `n_components` (P)
- `l1_penalty` (H)
- `l2_penalties` (H)
- `max_n_iterations` (P)
- `tolerance` (P)
There are some broken links and other items that need updating in the readme before we reach out to a wider audience:
Some things which would also be at least nice to haves:
In terms of functionality, the mid-term end goal is to achieve an offering of ML algorithms and pre-processing routines comparable to what is currently available in Python's scikit-learn.
These algorithms can either be:
In no particular order, focusing on the main gaps:
Clustering:
Preprocessing:
Supervised Learning:
friedrich (tracking issue nestordemeure/friedrich#1)
The collection is purposely loose and non-exhaustive; it will evolve over time. If there is an ML algorithm that you find yourself using often day to day, please feel free to contribute it!
Feel free to use the HNSW crate I created, which ports HNSW to Rust: https://github.com/rust-photogrammetry/hnsw. If you are interested, I can add it as an ANN algorithm. I am not as familiar with clustering, but you should be able to build a very fast approximate clustering algorithm utilizing HNSW.
Let me know if this would be a good addition and what to add to linfa (and where). HNSW should perform quite well from a benchmark perspective, but I haven't run it neck-and-neck with existing implementations using the exact same data and bench harness yet, so I am interested in trying that if Python bindings are added.
At the moment, there's a duplicated k-means example: one in the `linfa-clustering` package, and one in the top-level crate's `examples/` directory.
With a limited number of algorithms, I suppose it's possible to have an example of each in the top-level crate, but I think it will be neater if examples are kept in their respective packages.
I'm doing a review of the `linfa-clustering` module at the moment, so I would be happy to take care of this when I'm ready to submit that PR.
Right now CI doesn't enable any Cargo features. I suggest that we run a separate `cargo check` on the algorithm crates with all non-conflicting features enabled, to avoid bugs like #128. At the very least we should check the crates with `serde` enabled.
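As a rough sketch, the extra CI step could be a handful of feature-enabled checks. This is a CI configuration fragment, not the actual workflow; the crate list and the assumption that each crate exposes a `serde` feature are illustrative:

```shell
# Baseline check with default features across the workspace.
cargo check --workspace --all-targets

# Hypothetical per-crate checks with extra features enabled
# (crate names and feature names are assumptions, not the real matrix).
cargo check -p linfa-clustering --features serde
cargo check -p linfa-linear --features serde
```

Keeping these as separate invocations (rather than `--all-features`) avoids enabling mutually conflicting features, such as multiple BLAS backends, in one build.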
It would be useful, especially to potential contributors, to have a unified description of how public interfaces should be structured. At first glance, I assumed we would be attempting to stay as close to sklearn's conventions as is reasonable with Rust's conventions and syntax. Looking at the KMeans implementation, however, shows significant departure from sklearn's parameter naming scheme, and introduces several additional types for Hyperparameter management.
I think it's important in the long run for the public interfaces to be both intuitive and consistent. With that in mind, I think we should:
Start a discussion about design choices for the public interfaces. I personally am not sure that the utility of introducing HyperParams structs and accompanying factory functions justifies the additional API complexity. But I could absolutely be wrong about that. I'd just like to see some rationale.
Write up the conclusions of that discussion in a design doc. This doesn't need to be that complex or in-depth, just a basic statement of conventions and design philosophy to make it easier for contributors.
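To make the trade-off concrete, here is a minimal, self-contained sketch of the two styles under discussion. All names (`FlatKMeans`, `KMeansHyperParams`, the builder) are hypothetical, not linfa's actual API:

```rust
// Style 1 (sklearn-like): a flat struct the caller fills in directly.
#[allow(dead_code)]
pub struct FlatKMeans {
    pub n_clusters: usize,
    pub tolerance: f64,
}

// Style 2 (current KMeans approach): a HyperParams struct constructed
// through a validating factory/builder.
#[derive(Debug)]
pub struct KMeansHyperParams {
    n_clusters: usize,
    tolerance: f64,
}

pub struct KMeansHyperParamsBuilder {
    n_clusters: usize,
    tolerance: f64,
}

impl KMeansHyperParams {
    // Factory function: the mandatory parameter up front, defaults elsewhere.
    pub fn new(n_clusters: usize) -> KMeansHyperParamsBuilder {
        KMeansHyperParamsBuilder { n_clusters, tolerance: 1e-4 }
    }
}

impl KMeansHyperParamsBuilder {
    pub fn tolerance(mut self, tolerance: f64) -> Self {
        self.tolerance = tolerance;
        self
    }

    // `build` is the single place where invariants are checked.
    pub fn build(self) -> Result<KMeansHyperParams, String> {
        if self.n_clusters == 0 {
            return Err("n_clusters must be > 0".into());
        }
        if self.tolerance <= 0.0 {
            return Err("tolerance must be positive".into());
        }
        Ok(KMeansHyperParams { n_clusters: self.n_clusters, tolerance: self.tolerance })
    }
}

fn main() {
    let params = KMeansHyperParams::new(3).tolerance(1e-5).build().unwrap();
    println!("{:?}", params);
    // Invalid settings are rejected in one place:
    assert!(KMeansHyperParams::new(0).build().is_err());
}
```

What the extra types buy is centralized validation at `build()` and the impossibility of constructing invalid hyperparameters; the cost is exactly the additional API surface this issue is questioning.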
There is a new Rust machine learning library with a nice website but, currently, relatively few algorithms: smartcorelib.org.
It might be worth contacting them to unify efforts.
Details in PR-39
We could add a version of `Predict::predict` which writes the output to a caller-provided buffer instead of allocating the output. Something like `PredictInplace::predict_inplace`, with a signature like `fn predict_inplace(&self, x: &R, y: &mut T)`.
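A minimal sketch of the proposed trait pair, with a toy model standing in for a real estimator. The generic bounds in linfa would be richer; everything here (trait shapes, the `Doubler` model) is illustrative:

```rust
// Allocating prediction, as it exists today.
pub trait Predict<R, T> {
    fn predict(&self, x: &R) -> T;
}

// Proposed in-place variant: the caller owns the output buffer.
pub trait PredictInplace<R, T> {
    fn predict_inplace(&self, x: &R, y: &mut T);
}

// Toy "model" that doubles every input value.
struct Doubler;

impl PredictInplace<Vec<f64>, Vec<f64>> for Doubler {
    fn predict_inplace(&self, x: &Vec<f64>, y: &mut Vec<f64>) {
        assert_eq!(x.len(), y.len(), "output buffer must match input length");
        for (yi, xi) in y.iter_mut().zip(x) {
            *yi = 2.0 * xi;
        }
    }
}

// The allocating version can be a thin wrapper over the in-place one,
// so implementors only write the buffer-filling logic once.
impl<M: PredictInplace<Vec<f64>, Vec<f64>>> Predict<Vec<f64>, Vec<f64>> for M {
    fn predict(&self, x: &Vec<f64>) -> Vec<f64> {
        let mut y = vec![0.0; x.len()];
        self.predict_inplace(x, &mut y);
        y
    }
}

fn main() {
    let model = Doubler;
    let x = vec![1.0, 2.0, 3.0];
    let y = model.predict(&x);
    assert_eq!(y, vec![2.0, 4.0, 6.0]);
    println!("{:?}", y);
}
```

The main payoff is in hot loops (e.g. repeated prediction during cross-validation), where the output allocation can be reused across calls.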
Currently, this framework only provides the implementation of a single algorithm (an unsupervised algorithm), but this framework (repository) has been around for 2 years, which makes the goal of this project very questionable. Therefore, I suggest you either archive this repo or change its name or goal. It's misleading to have a repo that suggests that this is an ML framework, while it's just an implementation of a single algorithm.
I tried to compile linfa without any C library dependencies, without success so far. It seems that ndarray depends on a C library (BLAS), but this can be deactivated / is an optional build dependency.
Is there a way to compile linfa in pure rust?
I saw your project on awesome-rust-mentors, and I'd like to be a mentee.
I didn't find much information on this, so I don't know how this works.
A brief bio about myself:
I'm a PhD candidate in economics, close to getting a Master's degree in Mathematics. I have some ML experience, mainly in DL.
Best
So the `OpticsResult` provides metrics about a dataset that can be used to drive other analysis. It's commonly mentioned that you can derive DBSCAN clusters from the output of OPTICS (however, OPTICS is slower than DBSCAN on its own, so this would be pointless).
One thing you can do with the output of OPTICS is compute the Local Outlier Factor for each element of the dataset and use that to detect anomalies. This issue suggests adding a trait impl on the `OpticsResult` type in my OPTICS PR to implement this functionality.
Code coverage was disabled when I re-enabled the CI, as I had some problems getting tarpaulin to work on our project. It should be enabled again.
Most of the time the dataset is implemented for `ArrayBase<D, Ix2>` for records and `ArrayBase<D, Ix1>` for targets. We could simplify a lot of `impl`s by:
- renaming `Dataset` to `DatasetBase`
- adding `Dataset` and `DatasetView` type aliases (and possibly a `Dataframe` alias)
- implementing for new types
- `iter_fold`
- improving ergonomics
There are some improvements to the linear regression / least squares code which would be nice to implement:
- `fit_intercept`, `fit_intercept_with_normalisation` to be consistent with linfa-logistic
This issue suggests some ideas which you might use to improve your testing. Normally, writing tests can be a very time-consuming task, and it is crucial to have a large number of tests for good coverage.
If you have any specific test ideas for any algorithm in the linfa ecosystem, please add a comment below!
```
= note: /usr/bin/ld: /home/m0riarty_d3v/WebstormProjects/linfa/target/debug/deps/libndarray_linalg-ab5e3b1dfd453661.rlib(ndarray_linalg-ab5e3b1dfd453661.ndarray_linalg.1huagtdm-cgu.7.rcgu.o): in function `lapacke::dgetrf':
/home/m0riarty_d3v/.cargo/registry/src/github.com-1ecc6299db9ec823/lapacke-0.2.0/src/lib.rs:6130: undefined reference to `LAPACKE_dgetrf'
/usr/bin/ld: /home/m0riarty_d3v/WebstormProjects/linfa/target/debug/deps/libndarray_linalg-ab5e3b1dfd453661.rlib(ndarray_linalg-ab5e3b1dfd453661.ndarray_linalg.1huagtdm-cgu.7.rcgu.o): in function `lapacke::dgetrs':
/home/m0riarty_d3v/.cargo/registry/src/github.com-1ecc6299db9ec823/lapacke-0.2.0/src/lib.rs:6302: undefined reference to `LAPACKE_dgetrs'
collect2: error: ld returned 1 exit status
```
Kernel: x86_64 Linux 5.4.0-26-generic
This error appears when I execute `cargo run --features openblas --example diabetes` (in the linfa-linear folder).
One suggestion to prevent bringing in more dependencies than a user needs would be to make serde an optional dependency and adjust the `Serialize` and `Deserialize` derives to be like `#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]`. This would be a breaking change, so it would need a minor version update.
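A minimal sketch of the pattern, assuming an optional `serde` dependency in Cargo.toml. The struct and field names are made up for illustration:

```rust
// Cargo.toml sketch (assumption: making the dependency optional
// auto-creates a `serde` feature of the same name):
//
//   [dependencies]
//   serde = { version = "1", features = ["derive"], optional = true }

// The derives are only applied when the `serde` feature is enabled,
// so plain builds pull in no serde code at all.
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
#[derive(Debug, Clone, PartialEq)]
pub struct HyperParams {
    pub n_iterations: u64,
    pub tolerance: f64,
}

fn main() {
    let p = HyperParams { n_iterations: 100, tolerance: 1e-6 };
    println!("{:?}", p);
}
```

When the feature is off, `cfg_attr` expands to nothing, so downstream users who never serialize models pay no compile-time or dependency cost.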
I'm opening this issue to keep track of the work required to bump to ndarray `0.15`. From #110 we get the following list:
One area where we are lacking right now is benchmarking coverage. I would like to improve that in the coming weeks.
Benchmarks are an essential part of linfa. They should give contributors feedback on their implementations, and give users confidence that we're doing good work. In order to automate the process we have to employ a CI system which creates a benchmark report on (a) PRs and (b) commits to the master branch. This is difficult with wall-clock benchmarks (i.e. criterion.rs) but possible with valgrind.
Many of our Criterion benchmarks (particularly the ones in `linfa-clustering`) call `plot_config`, which causes the runner to create plots at the end of the benchmarks. The issue is that the plot creation takes seemingly forever. I'm in favour of removing the plots since I don't really use them, but if someone objects to that then we can look into it.
I came across an issue compiling `linfa-clustering` as a dependency for the Rust-ML book today, which per the error message is tracked down to a line in `linfa-clustering/src/gaussian_mixture/algorithm.rs` in the `compute_precisions_cholesky_full()` function, which uses the Cholesky method imported from `ndarray-linalg`. As far as I can tell, when cloning linfa's `master` and trying to build the sub-crate, I run into the same compilation error. I've tinkered around with it, but haven't figured out a fix quite yet.
```
   Compiling linfa-clustering v0.3.1 (https://github.com/rust-ml/linfa?branch=master#b7c31c58)
error[E0599]: no method named `cholesky` found for struct `ArrayBase<ViewRepr<&<F as linfa::Float>::Lapack>, Dim<[usize; 2]>>` in the current scope
   --> /home/chrism/.cargo/git/checkouts/linfa-f557574ebcb8312b/b7c31c5/algorithms/linfa-clustering/src/gaussian_mixture/algorithm.rs:262:51
    |
262 |         let decomp = covariance.with_lapack().cholesky(UPLO::Lower)?;
    |                                               ^^^^^^^^ method not found in `ArrayBase<ViewRepr<&<F as linfa::Float>::Lapack>, Dim<[usize; 2]>>`

error: aborting due to previous error
```
The DBSCAN implementation in linfa-clustering only includes core points, i.e. points with at least minPts neighbours. Border points (points with fewer than minPts neighbours) connected to core points are counted as noise, which differs from how DBSCAN is typically described. Is this behaviour intentional or a bug?
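A tiny self-contained 1-D illustration of the distinction (the data, `eps`, and `min_pts` values are made up; this is not linfa's implementation):

```rust
// Count neighbours within eps of point i (excluding the point itself).
fn neighbours(points: &[f64], i: usize, eps: f64) -> usize {
    points
        .iter()
        .enumerate()
        .filter(|&(j, &p)| j != i && (p - points[i]).abs() <= eps)
        .count()
}

fn main() {
    let pts = [0.0, 0.5, 1.0, 1.5, 2.8];
    let (eps, min_pts) = (1.1, 3);
    for i in 0..pts.len() {
        let n = neighbours(&pts, i, eps);
        println!("point {i} ({}): {n} neighbours, core = {}", pts[i], n >= min_pts);
    }
    // Points 1 and 2 are core. Points 0 and 3 have only 2 neighbours each
    // but lie within eps of a core point: textbook DBSCAN assigns them to
    // the cluster as border points, while a "core points only" rule, as
    // described above, labels them noise.
}
```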
I've recently been getting into Rust, and I do a lot with machine learning (in Python with TensorFlow), so I was wondering if there is any OpenCL or CUDA support by any chance? I was looking at the readme and it didn't seem to mention anything, only BLAS implementations such as Intel MKL.
Calling `cargo tree -d --workspace` at the root of the repo reveals all of linfa's duplicated dependencies. These are dependencies that have the same name but different versions, leading to bloat. We want to eliminate as many duplicated dependencies as possible by ensuring that multiple instances of the same crate in the dependency tree have the same version. We should also prioritize the non-dev dependencies, since they affect end users.
Hopefully just a quick query. Looking in the Cargo.toml, most of the workspace members are added as dependencies of linfa, meaning they're pulled into the library at `src/lib.rs`; however, only clustering is publicly exposed, making the inclusion of the others currently useless.
So from this I guess the plan is to publish each workspace member separately (which is nice). But also have a main crate called linfa that pulls everything in just for convenience?
If that is the case I'd consider making the project root a virtual workspace, adding a library project just called linfa in that and then pulling in the others as dependencies there. I've done that before in some personal projects and it worked well.
Some form of tidy-up and some people making sample projects using it to get a feel for the ergonomics would be good before another release
Here is a collection of work that needs to be done regarding documentation and minor issues with the code itself. Picking an item from this list can be a good way to start getting acquainted with the codebase and provide a useful contribution.
In general, I would say that the pages for the individual algorithms in the `linfa-clustering` and `linfa-elasticnet` sub-crates can be good references for the structure of a documentation page.
It's suggested to look at the existing documentation in your local build rather than on docs.rs, since some pages may have already been updated.
- `linfa-bayes`: expand the sub-crate documentation to include a brief description of the algorithm used and add an example of the usage of the provided model, along with maybe some hints regarding when to choose a naive Bayes predictor. There are already some examples in the dedicated page for the params structure, but maybe it's better to add at least one to the main page for the sub-crate
- `linfa-hierarchical`: the documentation for this sub-crate needs improvements like the ones listed for `linfa-bayes`
- `linfa-ica`: also needs updates similar to `linfa-bayes`
- `linfa-kernel`: add a description of what kernel methods are useful for, similarly to what's written in the `linfa-svm` crate, and add descriptions to the `Kernel` and `KernelView` subtypes. There are some examples in the `KernelBase` struct documentation, but maybe it would be a good idea to have an example of a kernel transformation directly on the crate's main page
- `linfa-linear`: the crate's documentation is inside the `ols` module instead of being in the crate's root, and it is outdated since it says the crate only provides an implementation of the least squares algorithm
- `linfa-linear`: `glm::TweedieRegressor` provides its own `fit` method instead of implementing the `Fit` trait, and the same for `predict`. Moving the methods inside the traits would be a good way to align it with the rest of the algorithms provided and ensure compatibility
- `linfa-logistic`: could use some more documentation to explain the algorithm, and some examples on its main page
- `linfa-logistic`: implements its own `fit` and `predict` methods instead of implementing the `Fit` and `Predict` traits. As in the case of linear regression, it would be good to align the interface with the other sub-crates
- `linfa-reduction`: completely lacks any documentation. An explanation of why dimensionality reduction is useful in ML, plus descriptions of the individual algorithms, would really help with understanding the usefulness of the crate
- `linfa-svm`: this is another reference for what other crates' documentation should look like. Right now the `predict` method returns an `Array1` for classification and a `Vec` for regression; it would be less confusing if the regression case was modified to return an `Array1` too
- `linfa-trees`: needs changes similar to `linfa-bayes`, with the addition of documentation regarding the methods of the params structure
- `linfa`: the main page of the rustdocs still says that linfa only provides the K-Means algorithm, which may be very confusing to someone who sees the crate on docs.rs for the first time. Here a completely revised page, complete with project goals and links to the various sub-crates, would be useful, so that one can find the algorithm they need without manually searching the sub-crates
- `linfa`: the dataset module page could use a bit of explanation about the main differences between the four dataset types, and a brief recollection of what utility methods a dataset provides
- `linfa`: the metrics module page could use a brief list of the provided metrics, so that one does not have to go looking in the sub-pages just to know whether a metric is provided

But it is actually an unsupervised learning algorithm.
Just to mention that I've started looking at using ndarray 0.14 (and I intend to make a PR in the end).
For now, I am blocked by an issue with `sprs`, which is used by `linfa-kernel`: sparsemat/sprs#278
A changelog would be useful so users can see when new features are added and if anything was broken and has been fixed. https://keepachangelog.com/en/1.0.0/
In a recent PR (#97) multiple issues came up with the current KMeans implementation. We could improve it by:
Hey there, interesting project! I was just getting it installed and found that I couldn't compile the docs for `linfa-clustering` or `linfa-reduction` without turning off `default-features`, because of this error:

```
    Updating crates.io index
error: failed to select a version for `linfa-reduction`.
    ... required by package `nasa-gallery v0.1.0 (/home/zicklag/git/zicklag/nasa-gallery)`
versions that meet the requirements `*` are: 0.2.0

the package `nasa-gallery` depends on `linfa-reduction`, with features: `openblas-src` but `linfa-reduction` does not have these features.

failed to select a version for `linfa-reduction` which could resolve this conflict
```
I would like to add the linnerud dataset to linfa-dataset, but I've noticed that currently targets are only handled as a one-dimensional array. Should we introduce something like:

```rust
pub type MultiTargetDataset<D, T> =
    DatasetBase<ArrayBase<OwnedRepr<D>, Ix2>, ArrayBase<OwnedRepr<T>, Ix2>>;
```

or is there a better way?
Following my message to the ML WG Zulip chat:
As some of you might have noticed, my activity in the ML corner of the Rust ecosystem has quite diminished over the last few months.
Due to what amounts to a career change (ML -> Software Engineer) I have been focusing more on the web/networking ecosystem and I have found it more and more time-consuming to context-switch at the end of the day to work on ML topics.
Motivation plays a role there too, as I am myself less engaged in the future evolution of the ecosystem.
I'd thus be more than open to moving ownership of linfa over to the WG, or to a subset of volunteers with an interest in moving the project forward.
I moved the ownership of `linfa` to `rust-ml`, which is going to be a better steward going forward, with surely more helping hands to dedicate to the project.
My sincerest apologies to @paulkoerbitz, @mossbanay and @bytesnake. I left your PRs unanswered for a long time, but I really couldn't manage to juggle the number of projects I have been working on during the last few months.
Hello,
Is there a way to serialize the preprocessing steps like CountVectorizer?
Any help would be greatly appreciated!
Hey, I'm new to linfa and couldn't find methods to save and load trained models.
Any ideas how this could be done? Thanks!
Line 47: `let entropy_pred_y = gini_model.predict(&test);`
should be `let entropy_pred_y = entropy_model.predict(&test);`
Maybe this isn't the correct place to ask, or maybe I just need a fresh pair of eyes, but I'm having trouble using the ndarray-linalg `svd` function.
The following code throws an error at the `unwrap()`:

```rust
    X: &ArrayBase<impl Data<Elem = f64>, Ix2>,
) -> Array2<f64> {
    let temp: Array2<f64> = X.to_owned().clone();
    let (u, s, v) = temp.svd(true, true).unwrap();
```

while the following does not:

```rust
    x: &ArrayBase<impl Data<Elem = f64>, Ix2>,
    n_components: f64,
) -> Self {
    let (_n, _m) = x.dim();
    // calculate the array of columnar means
    let mean = x.mean_axis(Axis(0)).unwrap();
    // subtract means from X
    let h: Array2<f64> = Array2::ones((_n, _m));
    let temp: Array2<f64> = h * &mean;
    let b: Array2<f64> = x - &temp;
    // compute SVD
    let (u, sigma, v) = b.svd(true, true).unwrap();
```

Any help would be appreciated.
Nearest neighbour algorithms are an item on the roadmap. They're helpful in clustering algorithms, particularly DBSCAN. There are a number of existing implementations of advanced nearest neighbour algorithms in the Rust ecosystem, but they are not compatible with `ndarray`, so they'd need to be modified to be included in linfa.
Existing algorithms I found:
We should pick at least one of the O(log N) algorithms to integrate into linfa, along with linear search due to its simplicity. I'm not sure how many of these algorithms we really need, since I don't know the detailed trade-offs between them.
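For reference, the linear-search baseline mentioned above is only a few lines. This is a hedged sketch over plain slices of 2-D points (a linfa version would presumably operate on `ndarray` views and a generic distance metric instead):

```rust
// Brute-force nearest neighbour: O(n) per query, no index structure.
fn squared_dist(a: &[f64; 2], b: &[f64; 2]) -> f64 {
    (a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2)
}

/// Returns the index of the point closest to `query`, or None if `points` is empty.
fn nearest(points: &[[f64; 2]], query: &[f64; 2]) -> Option<usize> {
    points
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| {
            squared_dist(a, query)
                .partial_cmp(&squared_dist(b, query))
                .expect("distances must not be NaN")
        })
        .map(|(i, _)| i)
}

fn main() {
    let points = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]];
    assert_eq!(nearest(&points, &[0.9, 1.2]), Some(1));
    println!("nearest index: {:?}", nearest(&points, &[0.9, 1.2]));
}
```

Linear search is also a useful correctness oracle: any k-d tree or ball tree integration can be property-tested against it on random data.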
Make the CI build run again.
Discussion: Should we enable Clippy for the project?
How will versioning of linfa and its sub-crates work? Currently the version of linfa on crates.io is 0.1.2 (although in Cargo.toml it's still 0.1.1), and linfa-clustering is released on crates.io as 0.1.0 (consistent with its Cargo.toml).
Some questions that come to mind:
I was recently working on some documentation of the DBSCAN algorithm, and on several occasions was running a test on datasets of varying lengths. Even with some very small changes in the parameters, the execution time could jump from under one second to several minutes. While checking progress is not possible for all algorithms, there are at least a few where displaying a progress indicator could be helpful to users for estimating remaining execution time or at least understanding how far along their run is. Obviously, not all users might be interested in this, so making this feature optional with a feature flag might make sense as well.
Options for this could potentially be the indicatif or pb libraries.
I happened to land here through a post on Hacker News, and this looks very interesting. Unfortunately, the blog post was from December 2019 and the link was only posted there recently. I see that most of the original roadmap has already been implemented. Kudos for that!
I have two questions though:
Thanks again for all the awesome work!
A plain Principal Component Analysis algorithm was added in 7b6075e. The next steps should improve upon edge-cases and features.
Why is the project you wrote named linfa?