Welcome to the Glaucoma detection repository.
Using Conda is the recommended approach. If you do not have it, you can install the requirements directly, although this is not the preferred method.
- Clone the repository.
git clone https://github.com/Sudonuma/MLCodingChallenge.git
- cd to MLCodingChallenge.
cd MLCodingChallenge
- Create a conda environment and activate it (skip this step if you are installing the requirements directly).
conda create --name glaucomaenv python=3.9
conda activate glaucomaenv
- Install the dependencies.
pip install -r requirements.txt
- Run the main script for training, validation, and inference (see the file for the available arguments; you can also run with the default args).
python main.py
Note: If you would like to only validate and run inference with a model, you can run this command:
python main.py --validate_only True
The glaucoma dataset contains images categorized into folders labeled from 0 to 5. These folders include photographs of patients, some of whom have glaucoma, while others do not. Notably, the number of images of individuals without glaucoma significantly surpasses those with the condition.
The dataset is accompanied by a train_labels.csv file, which maps each image's name to its corresponding label. To simplify the labels, we encoded them as follows: "rg" represents class 1, and "ngr" corresponds to class 0. The resulting dataset with these encoded labels is saved in a file called encoded_dataset.csv.
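The encoding step described above can be sketched in plain Python. This is a minimal illustration of the mapping "rg" → 1, "ngr" → 0; the function name and column layout are assumptions for the sketch, not the repository's actual code (which may use pandas):

```python
import csv

# Assumed label mapping from the README: "rg" -> class 1, "ngr" -> class 0.
LABEL_MAP = {"rg": 1, "ngr": 0}

def encode_labels(rows):
    """Map string labels to integer classes; rows are (image_name, label) pairs."""
    return [(name, LABEL_MAP[label]) for name, label in rows]

rows = [("img_001.jpg", "rg"), ("img_002.jpg", "ngr")]
encoded = encode_labels(rows)

# Save the encoded mapping, mirroring encoded_dataset.csv.
with open("encoded_dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "label"])
    writer.writerows(encoded)
```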
For our experimentation, we set aside a portion (10%) of the data as a test dataset; the information for these test samples is stored in a CSV file called test_data.csv.
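The 10% hold-out described above can be sketched as a simple shuffled split (an illustration only; the function name and seed are assumptions, not the repo's implementation):

```python
import random

def holdout_split(items, test_frac=0.10, seed=42):
    """Shuffle the data and set aside `test_frac` of it as a held-out test set."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(list(range(100)))
```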
In a separate experiment, we aimed to mitigate class imbalance by reducing the dataset size. We generated two distinct files: reduced_encoded_train_data.csv for training and reduced_encoded_test_data.csv for testing.
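The imbalance-reduction step can be sketched as downsampling every class to the size of the smallest one (a hypothetical sketch; the actual reduction used for the reduced_* files may differ):

```python
import random
from collections import defaultdict

def downsample_to_minority(rows, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for name, label in rows:
        by_label[label].append((name, label))
    n_min = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_min))
    return balanced

# 90 negatives vs 10 positives, mimicking the dataset's imbalance.
rows = [(f"img_{i}.jpg", 0) for i in range(90)] + \
       [(f"img_{i}.jpg", 1) for i in range(90, 100)]
balanced = downsample_to_minority(rows)
```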
Additionally, to facilitate code testing for other users, we included two CSV files, dummy_train_data.csv and dummy_test_data.csv. These files enable users to test the code without needing to download the entire 50+ GB dataset.
Should you wish to train on the full dataset, you can do so by copying the image folders (0 to 5) into the data/dataset directory. Make sure to adjust the --data_csv_path argument to point to ./data/dataset/encoded_train_dataset.csv and the --test_data_csv_path argument to ./data/dataset/encoded_test_dataset.csv.
If you want to train on the balanced data, you can use the reduced_encoded_train_data.csv and reduced_encoded_test_data.csv files.
The data is split 70% for training, 20% for validation, and 10% for model evaluation (testing).
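The 70/20/10 split above can be sketched as a per-class (stratified) split, so that each subset preserves the class ratios (an illustrative sketch under assumed names; the repo's split logic may differ):

```python
import random
from collections import defaultdict

def stratified_split(rows, fracs=(0.7, 0.2, 0.1), seed=0):
    """Split (name, label) rows into train/val/test while preserving class ratios."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = int(len(group) * fracs[0])
        n_val = int(len(group) * fracs[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test

rows = [(f"img_{i}", i % 2) for i in range(100)]
tr, va, te = stratified_split(rows)
```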
- To train the model on the full dataset, use train_data.csv and test_data.csv.
- To train the model on balanced data (exactly the same number of samples for each class), use reduced_encoded_train_data.csv and reduced_encoded_test_data.csv.
- To train the model on downsampled but not fully balanced data, use 13ktrain_data.csv and 13ktest_data.csv.
dummy_train_data.csv and dummy_test_data.csv are provided solely for checking that the code runs.
- CSV files should be tracked with DVC.
- Improve the EDA and the pre-processing step.
- Test the output of the model.
- Add more tests.
- Optimise the stratified sampling.
- Use Siamese Neural Networks, as they are robust to class imbalance.
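As a hedged sketch of the last point: Siamese networks are typically trained with a contrastive loss that pulls same-class pairs together and pushes different-class pairs at least a margin apart. The function below is a plain-Python illustration of that loss (not part of this repository):

```python
def contrastive_loss(distance, same_class, margin=1.0):
    """Contrastive loss on the embedding distance of an image pair:
    same-class pairs are penalized for being far apart; different-class
    pairs are penalized only when closer than `margin`."""
    if same_class:
        return 0.5 * distance ** 2
    return 0.5 * max(0.0, margin - distance) ** 2

# A same-class pair that is already close incurs almost no loss,
# while a different-class pair at the same distance is penalized.
loss_same = contrastive_loss(0.1, True)
loss_diff = contrastive_loss(0.1, False)
```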