ubc-mds / abalone_age_prediction

License: MIT License

Abalone Age Prediction

Authors

Huanhuan Li, student from UBC MDS Program (2020 - 2021)
Chuang Wang, student from UBC MDS Program (2020 - 2021)
Charles Suresh, student from UBC MDS Program (2020 - 2021)

Project Information

Background
In this project, we estimate the age of abalone from physical measurements. Abalone is a kind of shellfish that lives in cold water. It is low in fat and high in protein, which gives it considerable health benefits. Both its nutritional value and its economic value vary with age, so determining the age of an abalone matters to scientists, fish farmers, and customers. The traditional method is to count the rings in the shell, which is tedious and time-consuming: the shell must be cut and stained, and the rings counted under a microscope. We therefore use other, more easily obtained measurements to predict the age.
Analysis
To predict abalone age from physical measurements, we build a regression model using Ridge, a popular type of regularized linear regression. The model uses the physical measurements (Sex, Length, Diameter, Height, Whole weight, etc.) to predict the age of an abalone. On an unseen test data set, our final Ridge model predicts age with decent accuracy, achieving an R-squared score of 0.52 and a mean absolute percentage error (MAPE) of 13.71. However, considering the potential economic losses to stakeholders, we recommend further improvement before the model is used in industry.
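
For readers unfamiliar with the two scores quoted above, here is a minimal pure-Python sketch of how R-squared and MAPE are commonly computed. It is not the project's own evaluation code; the function names and the sample values are made up for illustration, and it assumes the reported MAPE is expressed in percent.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (assumed convention)."""
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Made-up ages in years, for illustration only (not from the abalone data).
ages_true = [10.0, 20.0, 30.0]
ages_pred = [11.0, 18.0, 30.0]

print("R-squared:", r_squared(ages_true, ages_pred))
print("MAPE (%):", mape(ages_true, ages_pred))
```

An R-squared of 1 means the predictions explain all of the variance in the true ages, while a lower MAPE means the predictions are closer to the true ages in relative terms.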
Dataset
The dataset used in this project comes from the original study "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait" by Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994). It was sourced from the UCI Machine Learning Repository and can be found here. Each row in the data set represents one abalone and contains its physical measurements (Sex, Length, Diameter, Height, Whole weight, etc.) and its number of rings. Adding 1.5 to the number of rings gives the age in years. Missing values in the original study have been removed, and the ranges of the continuous values have been scaled. Please find the detailed information here.
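
The ring-to-age rule described above can be sketched as follows. This is an illustrative snippet only: the column header `Rings`, the helper name `age_from_rings`, and the two sample rows are assumptions, not the layout of the project's actual CSV files.

```python
import csv
import io

RING_TO_AGE_OFFSET = 1.5  # from the dataset description: age = rings + 1.5


def age_from_rings(rings):
    """Convert a ring count into an age in years."""
    return rings + RING_TO_AGE_OFFSET


# Hypothetical two-row sample standing in for a real data file.
sample_csv = io.StringIO("Sex,Length,Rings\nM,0.455,15\nI,0.33,7\n")

ages = [age_from_rings(int(row["Rings"])) for row in csv.DictReader(sample_csv)]
print(ages)
```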

Report

The final report can be found here.

Usage

To replicate the analysis,

Clone this GitHub repository.
Navigate to the root directory of this repository.
Make sure Docker is installed on your device. (You can install Docker here.)
Pull the Docker image, which contains the software and libraries/packages needed to run the abalone age prediction machine learning (ML) pipeline.

docker pull chuangw/abalone_age_prediction:latest
To run this analysis using Docker, type the following (filling in PATH_ON_YOUR_COMPUTER with the absolute path to the root of this project on your computer).

docker run --rm -v PATH_ON_YOUR_COMPUTER:/home/rstudio/Abalone_Age_Prediction chuangw/abalone_age_predictor make -C /home/rstudio/Abalone_Age_Prediction all
To clean up the analysis, type:

docker run --rm -v PATH_ON_YOUR_COMPUTER:/home/rstudio/Abalone_Age_Prediction chuangw/abalone_age_predictor make -C /home/rstudio/Abalone_Age_Prediction clean

Flow Chart and Project Organization

Flow Chart

The whole analysis process, including running all scripts and rendering the R Markdown report, is automated by the pipeline defined in the Makefile, which can be run using Docker (see above).

Project Organization

.
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── Dockerfile
├── LICENSE
├── Makefile
├── README.md
├── data
│   ├── processed
│   │   ├── test.csv
│   │   └── training.csv
│   └── raw
│       └── abalone.csv
├── doc
│   ├── abalone_age_predict_report.Rmd
│   ├── abalone_age_predict_report.html
│   ├── abalone_age_predict_report.md
│   └── abalone_age_refs.bib
├── env-abalone.yml
├── img
│   ├── out.png
│   ├── output.dot
│   └── project_flow_chart.png
├── results
│   ├── eda
│   │   ├── all_vs_age_dist.png
│   │   ├── corr_plot.png
│   │   └── sex_vs_age_violin.png
│   └── ml_model
│       ├── best_model_quality.sav
│       ├── best_predict_model.sav
│       └── hyperparam_tuning.png
└── src
    ├── data_wrangling
    │   ├── download_data.py
    │   └── pre_process_abalone_data.py
    ├── eda
    │   ├── abalone_eda.ipynb
    │   └── abalone_eda.py
    └── ml
        ├── abalone_fit_predict_model.py
        └── abalone_test_result.py

Dependencies

Please refer to env-abalone.yml in the root directory of this project. Run the following command from the root of this repository to replicate the environment for this project.

conda env create --file env-abalone.yml

License

References

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Data comes from an original (non-machine-learning) study: Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994) "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288)


abalone_age_prediction's Issues

Date&Time:

Nov 19th 2020, 9pm - 11pm (PST)

Chair:

@chuangw46

Attendees:

@chuangw46 @huan-ds @charlessuresh

Agenda

ID	Item	Outcome
1	Identify the data set and research topic	Abalone Data Set: given a number of features for an abalone, predict its age
2	Draft a team work contract	All the items discussed in the group were documented in a Google Doc
3	Three other files that are important for collaboration	Code of Conduct file; Contributing file; License file

Actionable Items

ID	Assignee	Tasks	Delivery Time
1	@chuangw46	To set up the GitHub repository; to write a script for downloading the data and saving it in CSV format; to refine the team work contract	5am, Nov 19th 2020 (PST)
2	@huan-ds	To write the project background documentation in the `README.md` file; to write the `CONTRIBUTING.md` file	5am, Nov 19th 2020 (PST)
3	@charlessuresh	To write the `CODE_OF_CONDUCT.md` file; to start working on the EDA process	5am, Nov 19th 2020 (PST)

Peer Review of Version 0.2.0

Hi Group32,

Well done on the project! I had fun reading it.

The documentation is great. Plots and interpretations are clear, and easy to follow along.
Codes are readable and concise. Functions are written well with great docstrings.
Analysis and reasoning is correct and makes sense.
Communication is clear and concise with a smooth flow.
Suggestion: I only have a few minor suggestions. The first is to label the first "Results & Discussion" section as "EDA" and keep the second one as it is, instead of having two sections with the same title in your report. Also, the interpretation of Figure 3 describes four metrics (mean, median, 25% quartile and 75% quartile), but I only see three metrics on the plot. Although the table of these metrics can be found in the src folder, I would still suggest rephrasing this part to avoid unnecessary confusion. Lastly, regarding the outliers excluded from the training data, I am curious how they are defined, so maybe you could briefly describe the rule used to define an outlier (sorry if it is already defined in the project and I failed to find it).

Overall a great project! Hopefully my suggestions would help improve the project!

Figure(s)/table(s) that appear in your report should be generated programmatically for maximum reproducibility

Date&Time:

Nov 20th 2020, 11pm - 12pm (PST)

Attendees:

@chuangw46 @huan-ds

Actionable Items

ID	Assignee	Tasks	Delivery Time
1	@chuangw46	To add interpretation to the EDA notebook; to reformat the EDA code in the notebook; to rewrite the CODE_OF_CONDUCT.md file	20:00, Nov 21 2020 (PST)
2	@huan-ds	To refine the project proposal in the README.md file; to add interpretation to the EDA notebook	20:00, Nov 21 2020 (PST)

Increase size for figure 2 by lowering your column dimensions.

Spell out your y axis labels for figure 3

EDA Elements

For the EDA process, please add the items that you find necessary to include in EDA.

Milestone #4 - your repo/analysis review

Hi Team 32,

Very nice repo and analysis. First, some praise:

-Nice touch with listing your LinkedIn and that it's part of UBC MDS.
-Nice Background explanation about why your prediction question would be useful and preferable to how it is currently done (counting rings is expensive).
-great flow chart. I doubt everyone went this far (I know my team didn't!)
-good thing you used Ridge; your X variables have crazy collinearity (to be expected when these are similar body measurements).
-your narration is very clear and orderly.

My list of refinements is long, but most are really quick to do, and a number are purely to ponder over and implement if you think they make sense:

In the readme, I'd suggest dropping the bullet points used for "Background", "Analysis" and "Dataset". A purely aesthetic critique.
There are a couple of typos in your Readme and report that I think you will catch if you paste your text into a word processing document.
Your web link for this reference doesn't work:
Sam Waugh (1995) "Extending and benchmarking Cascade-Correlation", PhD thesis, Computer Science Department, University of Tasmania. [Web Link]
You wrote in the Analysis that age is a categorical variable. I think you meant to refer to Sex.
It looks like you haven't quite optimized for alpha because you haven't reached a maximum? Maybe you should expand the search space to include much smaller values of alpha? In our labs we've used alpha as low as 0.01.
After you show Table 1, you start talking about further refinements you could make. You should put a header for that section.

Food-for-thought suggestions. Your choice whether to ignore or address somehow:
6. It looks like there's really no difference between conditioning Age on Females or Males. Maybe you could transform this to a variable of Adult or Infant.
7. future refinements: find out how much you could simplify the model by removing features without losing much prediction accuracy. This seems relevant because a fisherman would prefer to only have to measure a couple of body parts, rather than 8, right?
8. future refinements: find out whether your model is improved if you allow gender to interact with your other explanatory variable measurements.
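
The Adult/Infant idea in item 6 could be sketched like this. The snippet assumes the raw Sex column uses the codes `M`, `F` and `I` (infant); the mapping name and helper are hypothetical, not part of the project's scripts.

```python
# Assumed raw codes: "M" (male), "F" (female), "I" (infant).
MATURITY_BY_SEX = {"M": "Adult", "F": "Adult", "I": "Infant"}


def to_maturity(sex_code):
    """Collapse the three Sex codes into Adult or Infant."""
    return MATURITY_BY_SEX[sex_code.strip().upper()]


sexes = ["M", "F", "I", "f"]
maturity = [to_maturity(code) for code in sexes]
print(maturity)
```

With a pandas data frame, the same dictionary could be passed to `Series.map` on the Sex column, assuming the column holds exactly these codes.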

EDA Plots Script

Amendments that can be made to the abalone_eda.py script to fine-tune the plots saved:

dpi : Adjust dpi as per requirement when incorporating the plots into the report
subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=None): I've set the parameter values for now to make them look reasonable. Feel free to adjust each of the parameters listed to tune the plot spacings
font_scale: set the font scale as per requirement

Let me know if there are any issues with the abalone_eda.py script. Also, feel free to suggest/make any changes

Peer Review of 0.2.0 version

Hi Group 32,

I enjoyed going through your work!

The flowchart in README.md is fantastic.
Awesome plots.
Code is well written and is easy to read.
Result communication was crisp and clear.

Suggestions:

Cal had already provided the suggestions that I wanted to give. I don't see any point violating the DRY principle :P

Just a small thing, snake cased names could be used for data frame's column names. Accessing columns gets easier (like -> df.column_name)
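
As a rough illustration of the snake case suggestion, the helper below converts labels such as the ones used in the EDA plots. The function name `to_snake_case` is made up for this sketch, and the rule it applies (collapse anything that is not a letter or digit into one underscore) is just one reasonable choice.

```python
import re


def to_snake_case(name):
    """Lowercase a column label and join its words with underscores."""
    # Replace every run of non-alphanumeric characters with one underscore,
    # then drop any underscore left at either end.
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


columns = ["Length (mm)", "Whole Weight (g)", "Shell Weight (g)"]
snake_columns = [to_snake_case(column) for column in columns]
print(snake_columns)
```

After renaming the columns this way, attribute access such as `df.whole_weight_g` becomes possible.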

Proposal

Avoid using abbreviations or function names like MSE and RandomForestRegressor in the proposal. Otherwise, you should define them.

Other than that, everything else looks fine and you have done a great job in this assignment in all of the tasks.

README.md file

Hi, I've completed most of the README.md file. Here are two issues I need some help with:

1. License: How do I check the license for a public dataset on the UCI website?
2. Dependencies: How do I check the minimal dependencies for this project?

EDA code quality

Try to avoid repetition in the code by using functions. An example is this chunk that you used:

sns.histplot(train_df1, x="Length (mm)", kde=True, bins=15, ax=axs[0, 0])
sns.histplot(train_df1, x="Diameter (mm)", kde=True, bins=15, ax=axs[0, 1])
sns.histplot(train_df1, x="Height (mm)", kde=True, bins=20, ax=axs[0, 2])
sns.histplot(train_df1, x="Whole Weight (g)", kde=True, bins=15, ax=axs[0, 3])
sns.histplot(train_df1, x="Shucked Weight (g)", kde=True, bins=15, ax=axs[1, 0])
sns.histplot(train_df1, x="Viscera Weight (g)", kde=True, bins=15, ax=axs[1, 1])
sns.histplot(train_df1, x="Shell Weight (g)", kde=True, bins=15, ax=axs[1, 2])

Check the link below for a better alternative to doing that.
https://seaborn.pydata.org/examples/faceted_histogram.html
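
Besides the faceted approach in the link, one simple way to remove the repetition is to loop over the column names. The sketch below keeps the same columns, bin counts and grid positions as the chunk above; the names `HIST_SPECS`, `grid_position` and `plot_histograms` are hypothetical, and `train_df1` and `axs` are assumed to be the data frame and the 2D array of axes from the original notebook.

```python
# (column name, number of bins) pairs copied from the repeated calls above.
HIST_SPECS = [
    ("Length (mm)", 15),
    ("Diameter (mm)", 15),
    ("Height (mm)", 20),
    ("Whole Weight (g)", 15),
    ("Shucked Weight (g)", 15),
    ("Viscera Weight (g)", 15),
    ("Shell Weight (g)", 15),
]
N_COLS = 4  # grid width implied by the axs[row, col] indices above


def grid_position(index, n_cols=N_COLS):
    """Map a running index onto (row, col) of the subplot grid."""
    return divmod(index, n_cols)


def plot_histograms(train_df1, axs):
    """Draw one histogram per entry in HIST_SPECS on the given axes."""
    import seaborn as sns  # imported here so the helpers above stay dependency-free

    for index, (column, bins) in enumerate(HIST_SPECS):
        row, col = grid_position(index)
        sns.histplot(train_df1, x=column, kde=True, bins=bins, ax=axs[row, col])
```

Adding or removing a plot then means editing one entry in the list rather than copying a whole call.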

Group Meeting 3

Date&Time:

Nov 27th 2020, 5:00pm - 8:30pm (Beijing)

Chair:

@chuangw46

Attendees:

@chuangw46 @huan-ds @charlessuresh

Actionable Items

Delivery Time: Nov 28th 2020, 5:00pm (Beijing)

Item	Task (To create)	Input	Output	Assignee
Download data	`src/data_wrangling/download_data.py`	UCI data source link	`data/raw/abalone.csv`	Finished
Preprocessing + Partition	`src/data_wrangling/pre_process_abalone_data.py`	`data/raw/abalone.csv`	`data/processed/training.csv` `data/processed/test.csv`	@charlessuresh
EDA	`src/eda/abalone_eda.py`	`data/processed/training.csv`	`/results/eda/sex_vs_age_violin.png` `/results/eda/all_vs_age_dist.png` `/results/eda/corr_plot.png`	@charlessuresh
ML: fit model	`src/ml/abalone_fit_predict_model.py`	`data/processed/training.csv`	`results/ml_model/best_predict_model.sav` `results/ml_model/hyperparam_tuning.png`	@chuangw46
ML: test model accuracy	`src/ml/abalone_test_result.py`	`data/processed/test.csv` `results/ml_model/best_predict_model.sav`	`results/ml_model/best_model_quality.sav`	@chuangw46
Report	`doc/abalone_age_predict_report.Rmd`	All images and results generated from above step	`.html` and `.pdf`	@huan-ds
Shell script	`/run_all.sh`	NA	final report	@huan-ds
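
The inputs and outputs in the table above imply an order in which the steps have to run. A small sketch of that ordering, using Python's standard `graphlib` module (Python 3.9 or later), is shown below. The step labels are shorthand invented for this example, and the dependencies are read off the Input and Output columns rather than taken from the project's Makefile.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it needs as input.
DEPENDS_ON = {
    "download_data": set(),
    "pre_process": {"download_data"},
    "eda": {"pre_process"},
    "fit_model": {"pre_process"},
    "test_model": {"pre_process", "fit_model"},
    "report": {"eda", "fit_model", "test_model"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
```

Steps with no dependency between them, such as the EDA and model fitting, may appear in either order.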
