frjnn / bhtsne

Parallel Barnes-Hut t-SNE implementation written in Rust.

Home Page: https://docs.rs/bhtsne

License: MIT License

Rust 100.00%
data-science data-visualization machine-learning rust dimensionality-reduction similarity-measures bhtsne barnes-hut

bhtsne's Introduction

bhtsne

Parallel Barnes-Hut and exact implementations of the t-SNE algorithm written in Rust. The tree-accelerated version of the algorithm is described in detail in this paper by Laurens van der Maaten. The exact, original version of the algorithm is described in this other paper by G. Hinton and Laurens van der Maaten. Additional implementations of the algorithm, including this one, are listed on this page.

Installation

Add this line to your Cargo.toml:

[dependencies]
bhtsne = "0.5.2"

Documentation

The API documentation is available at https://docs.rs/bhtsne.

Example

The implementation supports custom data types and user-defined metrics. For instance, general vector data can be handled in the following way.

 use bhtsne;
 use std::error::Error;

 const N: usize = 150;         // Number of vectors to embed.
 const D: usize = 4;           // The dimensionality of the
                               // original space.
 const THETA: f32 = 0.5;       // Parameter used by the Barnes-Hut algorithm.
                               // Small values improve accuracy but increase complexity.

 const PERPLEXITY: f32 = 10.0; // Perplexity of the conditional distribution.
 const EPOCHS: usize = 2000;   // Number of fitting iterations.
 const NO_DIMS: u8 = 2;        // Dimensionality of the embedded space.

 // The `?` operator needs an enclosing function that returns a `Result`.
 fn main() -> Result<(), Box<dyn Error>> {
     // Loads the data from a csv file, skipping the first row
     // (treated as headers) and the 5th column (treated as a class label).
     // Note that you can also switch to f64s for higher precision.
     let data: Vec<f32> = bhtsne::load_csv("iris.csv", true, Some(&[4]), |float| {
         float.parse().unwrap()
     })?;
     // Each chunk of D consecutive values is one sample.
     let samples: Vec<&[f32]> = data.chunks(D).collect();
     // Executes the Barnes-Hut approximation of the algorithm and writes the embedding to the
     // specified csv file.
     bhtsne::tSNE::new(&samples)
         .embedding_dim(NO_DIMS)
         .perplexity(PERPLEXITY)
         .epochs(EPOCHS)
         // Euclidean distance between two samples.
         .barnes_hut(THETA, |sample_a, sample_b| {
             sample_a
                 .iter()
                 .zip(sample_b.iter())
                 .map(|(a, b)| (a - b).powi(2))
                 .sum::<f32>()
                 .sqrt()
         })
         .write_csv("iris_embedding.csv")?;
     Ok(())
 }

In the example, Euclidean distance is used, but any other distance metric on a data type of your choice, such as strings, can be defined and plugged in.
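
As an illustration of a metric on strings, here is a minimal sketch of the Levenshtein (edit) distance using only the standard library. The function name `levenshtein` is hypothetical and not part of the crate. It returns an `f32` to match the metric value used in the example above; how exactly it would be wrapped in the closure passed to `barnes_hut` depends on the crate's closure signature, so treat that part as an assumption to check against the API documentation.

```rust
/// Levenshtein (edit) distance between two strings, returned as `f32`
/// so that it can play the role of a metric value.
/// Hypothetical helper, not part of bhtsne.
fn levenshtein(a: &str, b: &str) -> f32 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // At the start of iteration `i`, `prev[j]` is the distance between
    // the first `i` chars of `a` and the first `j` chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()] as f32
}
```

With samples stored as string slices, a closure along the lines of `|a, b| levenshtein(a, b)` would take the place of the Euclidean closure, possibly with extra dereferencing depending on the sample type.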

Parallelism

Being built on rayon, the algorithm uses as many threads as there are CPUs available. Note that on systems with hyperthreading enabled this equals the number of logical cores, not the physical ones. See rayon's FAQ for additional information.
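
To see how many logical cores a machine reports, the standard library can be queried directly. This is a minimal sketch: the helper name `logical_cpus` is hypothetical, and the value returned by `std::thread::available_parallelism` may differ from the thread count rayon actually picks (for example when the `RAYON_NUM_THREADS` environment variable is set, which rayon documents as a way to override its default).

```rust
use std::thread;

/// Number of logical CPUs reported by the standard library,
/// falling back to 1 if the query fails.
/// Hypothetical helper, not part of bhtsne or rayon.
fn logical_cpus() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}
```

Calling `println!("{}", logical_cpus())` before running the embedding gives a rough idea of how many worker threads to expect.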

MNIST embedding

The following embedding was obtained after preprocessing the MNIST training set with PCA to reduce its dimensionality to 50. It took approximately 3 minutes and 6 seconds on a 2.0GHz quad-core 10th-generation i5 MacBook Pro.

[Image: mnist]

bhtsne's People

Contributors

emilhernvall, frjnn, yuhanliin

bhtsne's Issues

Integration into linfa

I just saw your post on Reddit, awesome work! I'm the maintainer of linfa and thought about implementing t-SNE as a transformative dimensionality reduction technique in the past, but never got around to it. This crate can save us a lot of work. We would implement a wrapper which adapts your algorithm by:

  • implementing builder style pattern for configuration
  • using datasets for input/output
  • implementing transform trait for the algorithm

Sounds good? I just quickly glanced at the source code and three things stood out which could be improved:

  • make csv dependency optional, sometimes it's not necessary to pull that in
  • make algorithm generic for num_traits::Float
  • what about error handling? Can any part of your algorithm fail? In particular, what happens if there are NaNs in your data or parameters are misconfigured (e.g. a negative perplexity)?
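
The error-handling question can be made concrete with a minimal sketch of up-front input validation. Everything here is hypothetical and only illustrates the idea: the error type `TsneInputError` and the function `validate_inputs` are not part of bhtsne, and the checks cover just the two cases named above (non-finite values in the data and a non-positive perplexity).

```rust
use std::fmt;

/// Hypothetical error type for misconfigured inputs.
#[derive(Debug, PartialEq)]
enum TsneInputError {
    /// Perplexity is NaN, infinite, zero or negative.
    InvalidPerplexity(f32),
    /// The value at `index` in the flat data buffer is NaN or infinite.
    NonFiniteValue { index: usize },
}

impl fmt::Display for TsneInputError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::InvalidPerplexity(p) => {
                write!(f, "perplexity must be finite and positive, got {}", p)
            }
            Self::NonFiniteValue { index } => {
                write!(f, "non-finite value in data at index {}", index)
            }
        }
    }
}

impl std::error::Error for TsneInputError {}

/// Checks the parameters before any fitting starts, so that bad input
/// surfaces as an `Err` instead of a panic or a silently broken embedding.
fn validate_inputs(data: &[f32], perplexity: f32) -> Result<(), TsneInputError> {
    if !perplexity.is_finite() || perplexity <= 0.0 {
        return Err(TsneInputError::InvalidPerplexity(perplexity));
    }
    if let Some(index) = data.iter().position(|v| !v.is_finite()) {
        return Err(TsneInputError::NonFiniteValue { index });
    }
    Ok(())
}
```

A wrapper could call such a check once before handing the data to the algorithm and propagate the error with `?`.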

c++ benchmarks

Dear Francesco,

Thanks for this implementation. I assume it is faster than the original C++ version, given rayon's parallel efficiency, right? I will run some tests later on, but I just wanted to know whether you have any benchmarks.

Thanks,

Jianshu

Missing file errors when running tests

When I run cargo test for the first time I get the following:

running 10 tests
test test::set_embedding_dim ... ok
test test::set_epochs ... ok
test test::exact_tsne ... FAILED
test test::set_final_momentum ... ok
test test::set_learning_rate ... ok
test test::set_momentum ... ok
test test::set_momentum_switch_epoch ... ok
test test::set_perplexity ... ok
test test::set_stop_lying_epoch ... ok
test test::barnes_hut_tsne ... FAILED

failures:

---- test::exact_tsne stdout ----
thread 'test::exact_tsne' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/test.rs:77:87

---- test::barnes_hut_tsne stdout ----
thread 'test::barnes_hut_tsne' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/test.rs:107:87
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    test::barnes_hut_tsne
    test::exact_tsne

It looks like the tests expect several CSV data files, but they're not there.
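
One way to make this kind of failure easier to diagnose is to check for the data file before loading it and name the missing path in the error. This is a minimal sketch with a hypothetical helper, `require_fixture`, which is not part of the crate; the actual file names the tests expect are not shown in the output above.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Returns the path if the file exists, otherwise a `NotFound` error
/// whose message names the missing file.
/// Hypothetical helper, not part of bhtsne.
fn require_fixture(path: &str) -> io::Result<PathBuf> {
    let p = Path::new(path);
    if p.is_file() {
        Ok(p.to_path_buf())
    } else {
        Err(io::Error::new(
            io::ErrorKind::NotFound,
            format!("missing test data file: {}", p.display()),
        ))
    }
}
```

A test that unwraps this result would then panic with the file name in the message rather than a bare "No such file or directory".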

Typo in function name

First, thanks so much for providing this crate! I've thought about implementing t-SNE in Rust before, but never took on the challenge. Kudos!

I think you have a typo in one of your function names; bhtsne::wite_csv should probably be write_csv. Just wanted to let you know.

Thanks, again!

rayon integration

In the near future we'll use rayon's parallel routines to perform the embarrassingly parallel parts of the computation.

Custom metrics

It would be nice to add the possibility to run the algorithm with custom user-defined metrics.
