
matakshay / nn-classifier-using-vptree


An efficient Nearest Neighbour Classifier for the MNIST dataset. It uses a VP Tree data structure for preprocessing, which improves query time complexity.


Nearest Neighbour Classifier using VP Tree

This is an efficient Nearest Neighbour Classifier for classifying images of handwritten digits from the MNIST dataset. It uses a VP Tree to pre-process the images, thus reducing query time complexity. This project was done as part of the Data Structures & Algorithms course at IIT Delhi.

TABLE OF CONTENTS

  1. Introduction
  2. k-NN Algorithm
  3. VP Tree
  4. Dataset
  5. Usage
  6. Acknowledgement

Introduction

MNIST Classification

Classification is a fundamental task in Machine Learning. Given a labelled dataset of points and their classes, classification involves using this dataset to identify the class of each query point. Classification tasks can, in general, involve more than two classes (multi-class classification). This project, for example, involves 10 classes (digits 0-9).

k-Nearest Neighbours Algorithm

The k-NN Algorithm is one of the classic baseline algorithms in Machine Learning and Pattern Recognition. Due to its simple structure and implementation, it is often the first approach tried on a classification problem, before more sophisticated techniques are considered.
The traditional algorithm has a test time complexity, per query, that is linear in the size of the training dataset (assuming k is much smaller than the size of the training set). For a large dataset, this approach can become computationally expensive. Here we consider an efficient k-NN algorithm which uses a VP Tree data structure to pre-process and store the training data so that, at test time, the nearest neighbour of a query can be found much more quickly. With this optimised approach, the test time complexity per query can approach logarithmic in the size of the training dataset, when the search is able to prune most subtrees.
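For reference, the traditional linear-scan approach described above can be sketched as follows. This is only a minimal illustration, not the repository's code: the class name `BruteForceKnn`, its method names, and the tiny 2-D dataset are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Minimal brute-force k-NN sketch; all names here are illustrative assumptions. */
final class BruteForceKnn {

    /** Euclidean (l2) distance between two equal-length feature vectors. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Predicts a label by computing the distance from the query to every
     * training point (one distance per training point), then taking a
     * majority vote among the k closest points.
     */
    static int classify(double[][] train, int[] labels, double[] query, int k, int numClasses) {
        Integer[] order = new Integer[train.length];
        double[] dist = new double[train.length];
        for (int i = 0; i < train.length; i++) {
            order[i] = i;
            dist[i] = euclidean(train[i], query);
        }
        // Sorting is used for brevity; a bounded heap would avoid the full sort.
        Arrays.sort(order, Comparator.comparingDouble(i -> dist[i]));

        int[] votes = new int[numClasses];
        for (int j = 0; j < Math.min(k, order.length); j++) {
            votes[labels[order[j]]]++;
        }
        int best = 0;
        for (int c = 1; c < numClasses; c++) {
            if (votes[c] > votes[best]) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical toy data: two clusters with labels 0 and 1.
        double[][] train = { {0, 0}, {0, 1}, {5, 5}, {6, 5} };
        int[] labels = { 0, 0, 1, 1 };
        System.out.println(classify(train, labels, new double[] {5.5, 5.0}, 3, 2));
    }
}
```

Every query touches every training point, which is the cost the VP Tree is meant to reduce.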

VP Tree

A Vantage-Point tree (or VP Tree) is an example of a metric tree. Metric trees are useful for storing data points defined in a metric space. At each level, a VP Tree divides the data points into two parts according to their similarity to (or distance from) a chosen vantage point. Points that are closer to the vantage point than a threshold are stored in the left subtree, while the remaining data points are stored in the right subtree. In this way, the entire dataset is stored in the tree by successively dividing it into two halves at each node. The leaf nodes essentially contain a single data point.
While searching for the nearest neighbour of a query point, the recursive process starts from the root node. At each level, the algorithm decides which of the subtrees to enter based on the threshold distance of that node, the distance of the query point from the vantage point of that node, and the distances of the points encountered so far. This recursive technique greatly reduces the number of distance comparisons needed and in turn improves the query time complexity.
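The construction and search described above can be sketched as below. This is a minimal illustration under stated assumptions, not the repository's implementation: the class `VpTreeSketch` and its members are hypothetical names, the first point of each subset is taken as the vantage point, the median distance is used as the threshold, and the Euclidean distance is hard-coded.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Minimal VP tree sketch for 1-nearest-neighbour search; names are illustrative assumptions. */
final class VpTreeSketch {

    /** A node holds one vantage point, a threshold distance, and two subtrees. */
    static final class Node {
        final double[] vantage;
        double threshold;
        Node inside;   // points no farther from the vantage point than the threshold
        Node outside;  // points at least as far as the threshold
        Node(double[] vantage) { this.vantage = vantage; }
    }

    /** Best candidate found so far during a search. */
    static final class Best {
        double[] point;
        double dist = Double.POSITIVE_INFINITY;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Builds the tree recursively, splitting the remaining points at the median distance. */
    static Node build(List<double[]> points) {
        if (points.isEmpty()) {
            return null;
        }
        Node node = new Node(points.get(0));
        List<double[]> rest = new ArrayList<>(points.subList(1, points.size()));
        if (rest.isEmpty()) {
            return node; // leaf: a single data point
        }
        rest.sort(Comparator.comparingDouble(p -> distance(node.vantage, p)));
        int mid = rest.size() / 2;
        node.threshold = distance(node.vantage, rest.get(mid));
        node.inside = build(rest.subList(0, mid));
        node.outside = build(rest.subList(mid, rest.size()));
        return node;
    }

    /** Visits the more promising subtree first and skips the other when it cannot hold a closer point. */
    static void search(Node node, double[] query, Best best) {
        if (node == null) {
            return;
        }
        double d = distance(query, node.vantage);
        if (d < best.dist) {
            best.dist = d;
            best.point = node.vantage;
        }
        if (d < node.threshold) {
            search(node.inside, query, best);
            if (d + best.dist >= node.threshold) {
                search(node.outside, query, best);
            }
        } else {
            search(node.outside, query, best);
            if (d - best.dist <= node.threshold) {
                search(node.inside, query, best);
            }
        }
    }

    static double[] nearest(Node root, double[] query) {
        Best best = new Best();
        search(root, query, best);
        return best.point;
    }

    public static void main(String[] args) {
        // Hypothetical 2-D points standing in for flattened images.
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {0, 0});
        points.add(new double[] {1, 1});
        points.add(new double[] {5, 5});
        points.add(new double[] {6, 5});
        points.add(new double[] {9, 9});
        Node root = build(points);
        double[] hit = nearest(root, new double[] {5.2, 4.9});
        System.out.println(hit[0] + ", " + hit[1]);
    }
}
```

The two pruning tests follow from the triangle inequality: a subtree is skipped only when every point in it must lie farther from the query than the best candidate found so far.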
A key hyperparameter here is the distance metric used to compare the similarity of two data points. I experimented with three different metrics and obtained the following results:

  • Using Manhattan distance as the metric gave an accuracy of 96.31%
  • Euclidean distance gave the highest accuracy of 96.91%
  • Using Chebyshev distance metric gave an accuracy of 79.62%
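For concreteness, the three metrics compared above could be written as follows. This is a hedged sketch rather than the project's own code: the class `Metrics` and its method names are assumptions, and the string keys "l1", "l2" and "linf" simply mirror the metric names this README uses in its Usage section.

```java
/** Sketch of the three distance metrics; class and method names are illustrative assumptions. */
final class Metrics {

    /** Manhattan (l1) distance: sum of absolute coordinate differences. */
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    /** Euclidean (l2) distance: square root of the sum of squared differences. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Chebyshev (linf) distance: largest absolute coordinate difference. */
    static double chebyshev(double[] a, double[] b) {
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    /** Selects a metric by name; the names are assumed to match the command-line values. */
    static double distance(String metric, double[] a, double[] b) {
        switch (metric) {
            case "l1":
                return manhattan(a, b);
            case "l2":
                return euclidean(a, b);
            case "linf":
                return chebyshev(a, b);
            default:
                throw new IllegalArgumentException("Unknown metric: " + metric);
        }
    }

    public static void main(String[] args) {
        double[] a = {0, 0};
        double[] b = {3, 4};
        System.out.println(distance("l1", a, b));   // 7.0
        System.out.println(distance("l2", a, b));   // 5.0
        System.out.println(distance("linf", a, b)); // 4.0
    }
}
```

Chebyshev distance keeps only the single largest pixel difference and ignores all the others, which may help explain its much lower accuracy on image data.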

Dataset

The MNIST dataset (http://yann.lecun.com/exdb/mnist/) has been a landmark dataset in Machine Learning and Pattern Recognition. It consists of 70,000 grayscale images of handwritten digits, divided into a training set (60,000 images) and a test set (10,000 images). Each image has a fixed dimension of 28x28 pixels and comes labelled with one of the 10 possible classes (0-9). Over the years this dataset has been used for testing many Convolutional Neural Networks and other algorithms in Machine Learning and Computer Vision.

Usage

  1. Clone the repository to your system and head over to it
    git clone https://github.com/matakshay/NN-Classifier-using-VPTree
    cd NN-Classifier-using-VPTree
  2. Before moving to the next step, ensure that JDK version 11.0.5 is installed on the system
  3. Compile the project
    javac -Xlint:unchecked NNClassifier/Main.java
  4. Execute the code with the following command. This reads the dataset, builds the classifier by constructing a VP Tree (using the 60,000 images from the training set), obtains predictions for the test set images, and finally computes the accuracy of the classifier over the test set.
    java NNClassifier/Main
    By default it uses the l2 metric (Euclidean distance) to compute the similarity between two images. One can pass "l1" or "linf" as a command-line argument (while executing the code) to set the metric to l1 (Manhattan distance) or linf (Chebyshev distance) respectively.

Acknowledgement

I referred to the following research papers, articles and course lectures while working on this project-
