
Wrangling with a Neural Partitioner

This repository contains the code for a series of attempted enhancements made to the original Neural Partitioner from the paper Unsupervised Space Partitioning for Nearest Neighbor Search, authored by Abrar Fahim, Mohammed Eunus Ali, and Muhammad Aamir Cheema.

For more detailed documentation and insights into how the various algorithms and models have been implemented, please refer to the implementation report.

Also included is an incomplete implementation of the Hierarchical Navigable Small World (HNSW) algorithm, originally intended as a potential enhancement for the ANN algorithm. The class developed as part of this attempt is included in the repository for reference and future development; you can find it in the attempted-improvements folder.


📚 Contents

Code Structure

Getting Started

Begin by taking a look at main.py, which serves as the entry point of the code. It relies on the paths.txt file to determine the locations of the datasets needed for training. Before proceeding, ensure that all the required dependencies mentioned in requirements.txt are installed.

You can, of course, skip all that by running a ready-to-go setup on Kaggle. More on that in a bit.

Setting Everything Up

Ensure that the file paths.txt specifies the absolute directory paths necessary for the code to function. It should include:

  • paths_to_mnist: Directory containing the MNIST dataset in hdf5 format.
  • path_to_sift: Similar to paths_to_mnist, but for the SIFT dataset.
  • path_to_knn_matrix: Directory where the generated k-NN matrix will be stored.
  • path_to_models: Directory for saving trained models.
  • path_to_tensors: Directory for storing processed tensors for faster re-runs.
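
The exact syntax depends on how main.py parses paths.txt; a plausible layout, with hypothetical paths, might look like this:

paths_to_mnist = /home/user/datasets/mnist
path_to_sift = /home/user/datasets/sift
path_to_knn_matrix = /home/user/neural-partitioner/knn
path_to_models = /home/user/neural-partitioner/models
path_to_tensors = /home/user/neural-partitioner/tensors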

Prerequisites

  • Python 3.5+
  • Compatible with Windows, Linux, macOS
  • (Optional, although ideal) GPU support for faster computation
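
Since the project is built on PyTorch, you can check whether a GPU is actually visible once the dependencies are installed:

python -c "import torch; print(torch.cuda.is_available())"

If this prints False, everything still runs, just on the CPU.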

Installation

For starters, clone the repository:

git clone git@github.com:mdarm/neural-partitioner.git
cd ./neural-partitioner/src

Before running the code, install the required dependencies:

pip install -r requirements.txt

Running the Code

The workflow can be run in two ways: locally or on Kaggle.

Running Locally

Before running the code, fill in paths.txt with the appropriate directories. Next, download the SIFT and/or MNIST datasets from ANN Benchmarks and place them in the respective folders specified in paths.txt.
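
ANN Benchmarks distributes each dataset as a single HDF5 file. Assuming the standard file names used on that site (verify them before downloading), fetching the data could look like:

wget http://ann-benchmarks.com/mnist-784-euclidean.hdf5 -P /path/to/mnist/
wget http://ann-benchmarks.com/sift-128-euclidean.hdf5 -P /path/to/sift/

where /path/to/mnist/ and /path/to/sift/ are the directories listed in paths.txt.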

To execute the code with the default configuration, merely type:

python main.py

For a custom configuration, here's an example command:

python main.py --n_bins 256 --dataset_name mnist --n_trees 3 --metric_distance mahalanobis --model_combine cnn neural linear

Running on Kaggle

We have automated the process to rerun all the experiments performed using an accelerated Kaggle notebook. This allows you to easily replicate and explore the results without the hassle of manual setup; for more details see here.

Parameters for running

Default parameter values are set in utils.py.
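
As an illustration of how such defaults are commonly wired up, the sketch below uses Python's argparse with a few of the parameters listed next; the flag names follow this README, but the default values are placeholders, not the ones actually set in utils.py:

# Illustrative sketch only -- the actual defaults live in utils.py
import argparse

def get_args():
    parser = argparse.ArgumentParser(description="Neural partitioner run")
    parser.add_argument("--dataset_name", default="mnist", choices=["mnist", "sift"])
    parser.add_argument("--n_bins", type=int, default=16)        # placeholder default
    parser.add_argument("--n_epochs", type=int, default=50)      # placeholder default
    parser.add_argument("--batch_size", type=int, default=64)    # placeholder default
    parser.add_argument("--lr", type=float, default=1e-3)        # placeholder default
    parser.add_argument("--model_type", default="neural",
                        choices=["neural", "linear", "cnn"])
    return parser.parse_args()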

Data Partitioning:

  • dataset_name: Choose the dataset to partition (mnist or sift).
  • n_bins: Define the number of bins for dataset partitioning.
  • k_train: Set the number of neighbors to build the k-NN matrix.
  • k_test: Set the number of neighbors for testing the model.
  • n_bins_to_search: Choose how many bins to search for nearest neighbors.

Model Training:

  • n_epochs: Specify the number of epochs for training.
  • batch_size: Set the batch size.
  • lr: Define the learning rate.
  • n_trees: Choose the number of trees for the ensemble.
  • n_levels: Define the number of levels in each tree.
  • tree_branching: Set the number of children per node.
  • model_type: Select the model type (neural, linear, or cnn).
  • eta_value: Balance parameter for the loss function.
  • distance_metric: Choose distance metric (euclidean or mahalanobis).
  • model_combination: Create an ensemble by combining models in the order provided.
  • pl: Run the vector quantisation pipeline (executes after model ensembling).

Storing Options:

  • load_knn: Load the k-NN matrix from file (if available).
  • continue_train: Continue training from the last checkpoint by loading models from file.
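
Assuming the command-line flags mirror these parameter names (as --n_bins and --dataset_name do in the examples above) and that the storing options are plain boolean switches, resuming an interrupted run might look like:

python main.py --dataset_name sift --n_bins 16 --load_knn --continue_train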

Output Example

Loss and Accuracy (RNN-Recall)

The program outputs the loss and accuracy (RNN-recall) metrics, which are highlighted in the plots. A summary of the plots is also printed to the command prompt.

Plots

[Plot images: light and dark theme variants]

Plots show loss per epoch and accuracy (RNN recall) per bin (partition) size for each neural ensemble, for a 16-bin space partition. The two most likely partitions to which the queries belong are indicated by the points on each line.

Run Summary

The program outputs information about test accuracy, mean candidate set size, average query time, and its standard deviation for various combinations of the number of models and bins.

Example:

-- READING PATHS -- 
preparing knn with k =  10
BUILDING TREE
...
----- CALCULATING K-NN RECALL FOR EACH POINT ------- 
1 models, 1 bins 
mean accuracy  tensor(0.9272)
mean candidate set size tensor(8918.8584)
average query time: 0.21, standard deviation: 0.03 (milliseconds)
...

All experiment runs are stored in the outputs folder, and files are named using the convention method-bin_number-dataset. For example, if the method was cnn, the bin number 16, and the dataset sift, the output file would be named cnn-16-sift.txt.
