Welcome to the Glaucoma detection repository.
Using Conda is the recommended approach. If you do not have it, you can install the requirements directly, although this is not the preferred method.
- Clone the repository.
git clone https://github.com/Sudonuma/MLCodingChallenge.git
- cd to MLCodingChallenge.
cd MLCodingChallenge
- Create a conda environment and activate it (skip this step if you are installing the requirements directly).
conda create --name glaucomaenv python=3.9
conda activate glaucomaenv
- Install the dependencies.
pip install -r requirements.txt
- Run the main script for training, validation, and inference (see the file for the available arguments; you can also run with the default args).
python main.py
Note: If you would like to only validate and run inference with a model, you can run this command:
python main.py --validate_only True
The glaucoma dataset contains images categorized into folders labeled from 0 to 5. These folders include photographs of patients, some of whom have glaucoma, while others do not. Notably, the number of images of individuals without glaucoma significantly surpasses those with the condition.
The dataset is accompanied by a train_labels.csv file, which maps each image's name to its corresponding label. To simplify the labels, we encoded them as follows: "rg" represents class 1, and "ngr" corresponds to class 0. The resulting dataset with these encoded labels is saved in a file called encoded_dataset.csv.
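The encoding step described above can be sketched in plain Python. This is a minimal illustration of the mapping "rg" → 1, "ngr" → 0; the function name and column layout are assumptions for the sketch, not the repository's actual code (which may use pandas):

```python
import csv

# Assumed label mapping from the README: "rg" -> class 1, "ngr" -> class 0.
LABEL_MAP = {"rg": 1, "ngr": 0}

def encode_labels(rows):
    """Map string labels to integer classes; rows are (image_name, label) pairs."""
    return [(name, LABEL_MAP[label]) for name, label in rows]

rows = [("img_001.jpg", "rg"), ("img_002.jpg", "ngr")]
encoded = encode_labels(rows)

# Save the encoded mapping, mirroring encoded_dataset.csv.
with open("encoded_dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "label"])
    writer.writerows(encoded)
```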
For our experimentation, we set aside a portion (10%) of the data as a test dataset; the information for these test samples is stored in a CSV file called test_data.csv.
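The 10% hold-out described above can be sketched as a simple shuffled split (an illustration only; the function name and seed are assumptions, not the repo's implementation):

```python
import random

def holdout_split(items, test_frac=0.10, seed=42):
    """Shuffle the data and set aside `test_frac` of it as a held-out test set."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(list(range(100)))
```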
In a separate experiment, we aimed to mitigate class imbalance by reducing the dataset size. We generated two distinct files: reduced_encoded_train_data.csv for training and reduced_encoded_test_data.csv for testing.
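The imbalance-reduction step can be sketched as downsampling every class to the size of the smallest one (a hypothetical sketch; the actual reduction used for the reduced_* files may differ):

```python
import random
from collections import defaultdict

def downsample_to_minority(rows, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for name, label in rows:
        by_label[label].append((name, label))
    n_min = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_min))
    return balanced

# 90 negatives vs 10 positives, mimicking the dataset's imbalance.
rows = [(f"img_{i}.jpg", 0) for i in range(90)] + \
       [(f"img_{i}.jpg", 1) for i in range(90, 100)]
balanced = downsample_to_minority(rows)
```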
Additionally, to facilitate code testing for other users, we included two CSV files, dummy_train_data.csv and dummy_test_data.csv. These files enable users to test the code without needing to download the entire 50+ GB dataset.
Should you wish to train on the full dataset, you can do so by copying the image folders (0 to 5) into the data/dataset directory. Make sure to adjust the --data_csv_path argument to point to ./data/dataset/encoded_train_dataset.csv and the --test_data_csv_path argument to ./data/dataset/encoded_test_dataset.csv.
If you want to train on the balanced data, you can use the reduced_encoded_train_data.csv and reduced_encoded_test_data.csv files.
The data is split 70% for training, 20% for validation, and 10% for model evaluation (testing).
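The 70/20/10 split above can be sketched as a per-class (stratified) split, so that each subset preserves the class ratios (an illustrative sketch under assumed names; the repo's split logic may differ):

```python
import random
from collections import defaultdict

def stratified_split(rows, fracs=(0.7, 0.2, 0.1), seed=0):
    """Split (name, label) rows into train/val/test while preserving class ratios."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = int(len(group) * fracs[0])
        n_val = int(len(group) * fracs[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test

rows = [(f"img_{i}", i % 2) for i in range(100)]
tr, va, te = stratified_split(rows)
```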
- To train the model on the full dataset, use train_data.csv and test_data.csv.
- To train the model on balanced data (exactly the same number of samples for each class), use reduced_encoded_train_data.csv and reduced_encoded_test_data.csv.
- To train the model on downsampled but not fully balanced data, use 13ktrain_data.csv and 13ktest_data.csv.
dummy_train_data.csv and dummy_test_data.csv are provided solely for checking that the code runs.
- CSV files should be tracked with DVC.
- Improve the EDA and the pre-processing step.
- Test the output of the model.
- Add more tests.
- Optimise the stratified sampling.
- Use Siamese Neural Networks, as they are robust to class imbalance.
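As a hedged sketch of the last point: Siamese networks are typically trained with a contrastive loss that pulls same-class pairs together and pushes different-class pairs at least a margin apart. The function below is a plain-Python illustration of that loss (not part of this repository):

```python
def contrastive_loss(distance, same_class, margin=1.0):
    """Contrastive loss on the embedding distance of an image pair:
    same-class pairs are penalized for being far apart; different-class
    pairs are penalized only when closer than `margin`."""
    if same_class:
        return 0.5 * distance ** 2
    return 0.5 * max(0.0, margin - distance) ** 2

# A same-class pair that is already close incurs almost no loss,
# while a different-class pair at the same distance is penalized.
loss_same = contrastive_loss(0.1, True)
loss_diff = contrastive_loss(0.1, False)
```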