The goal of this project is to develop a system that can accurately identify and classify fossils, such as dinosaur remains or trace fossils, from images.
Identifying fossils is a time-consuming process that relies on expert knowledge of fossil morphology, and it is made harder by the fragmented and degraded state of many specimens.
The main problem addressed in this project is the development of a machine learning model capable of accurately recognising and classifying fossils based on their images.
-
Sources and context of the dataset: The dataset is a collection of fossil images gathered by a web crawler that downloads fossil images from the Internet and automatically exports them into a structured dataset.
- reduced-FID dataset: I use the reduced-FID, which contains 60 thousand images across 50 categories of fossils and is published on zenodo.org. Links to download the reduced-FID dataset
- FID dataset: This dataset is used to fill the gaps in the reduced-FID. Links to download the FID dataset.
- fossil-vs-non-fossil dataset: I used this dataset to remove irrelevant images. fossil-vs-non-fossil.zip
-
Samples of the entries, features, values: The dataset is a reduced version of the Fossil Image Dataset (FID), which contains 415 thousand images in total.
-
Number of features and samples: The dataset contains 60 thousand RGB images, roughly 1,200 images for each of the 50 categories of fossils.
-
Encoding of the features: The images are stored in subfolders, with each subfolder named after the common ancestor of the fossils it contains.
-
Quality of the data: The data is of high quality, with no missing images. However, some images are not relevant or contain obstructions such as text or humans.
-
Image formats: The images come in the following formats: BMP, GIF, JPEG, PNG, TIFF.
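The folder layout and file formats described above can be checked with a short script. The following is a minimal sketch, not the code used in the notebooks: it assumes the dataset has been extracted to a local directory with one subfolder per category, and the function name `summarise_dataset` and the extension set are illustrative choices. It uses only the Python standard library.

```python
from collections import Counter
from pathlib import Path

# Extensions assumed to correspond to the listed formats (BMP, GIF, JPEG, PNG, TIFF).
IMAGE_EXTENSIONS = {".bmp", ".gif", ".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def summarise_dataset(root):
    """Count images per category subfolder and per (lower-cased) file extension."""
    per_category = Counter()
    per_extension = Counter()
    for category_dir in sorted(Path(root).iterdir()):
        if not category_dir.is_dir():
            continue  # ignore stray files at the top level
        for path in category_dir.rglob("*"):
            ext = path.suffix.lower()
            if path.is_file() and ext in IMAGE_EXTENSIONS:
                per_category[category_dir.name] += 1
                per_extension[ext] += 1
    return per_category, per_extension
```

Calling `summarise_dataset("path/to/reduced-FID")` (a placeholder path) returns two counters. Comparing the per-category counts against the expected figure of roughly 1,200 images per category is one way to spot categories that need to be topped up from the full FID.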
-
Below is a small sample of the images contained in the dataset.
imagehash is a package that needs to be installed in the environment:
- using conda:
conda install -c conda-forge imagehash
- using pip:
pip install ImageHash
scikit-image (version 0.21.0) is a package that needs to be installed in the environment:
- using pip:
pip install scikit-image
split-folders is a package that needs to be installed in the environment:
- using pip:
pip install split-folders
yellowbrick (version 1.5) is a package that needs to be installed in the environment:
- using pip:
pip install yellowbrick
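Before opening the notebooks, it can help to confirm that the packages listed above are importable. The sketch below is an illustrative helper (the name `missing_packages` is not part of the project). It relies on the import names `imagehash`, `skimage`, `splitfolders` and `yellowbrick`, which differ from the pip names in some cases, and it only checks that the modules can be found, not which versions are installed.

```python
from importlib.util import find_spec

# Mapping from pip package name to the module name used in `import` statements.
REQUIRED_MODULES = {
    "ImageHash": "imagehash",
    "scikit-image": "skimage",
    "split-folders": "splitfolders",
    "yellowbrick": "yellowbrick",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the pip names of packages whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in modules.items()
        if find_spec(module_name) is None  # None means the module is not installed
    ]
```

Running `print(missing_packages())` in the target environment prints an empty list when everything is installed, or the pip names that still need to be installed.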
Data exploration and cleaning.ipynb
Remove-irrelevant-images.ipynb
Fossil-classifier.ipynb