abalone_age_prediction's Introduction

Abalone Age Prediction

Author


Project Information


  • Background
    In this project, we estimate the age of abalone from physical measurements. Abalone is a shellfish that lives in cold water. It is valued for its health benefits because it is low in fat and high in protein. Both its nutritional value and its economic value vary with age. Therefore, determining the age of abalone is an important question for scientists, fish farmers, and customers. The traditional way to determine the age is to count the number of rings. This is a time-consuming and tedious process that involves cutting the shell, staining it, and counting the rings under a microscope. Thus, we consider using other, more easily obtained measurements to predict the age.

  • Analysis
    To predict abalone age from physical measurements, we build a regression model using Ridge, a popular type of regularized linear regression. The model uses the physical measurements (Sex, Length, Diameter, Height, Whole weight, etc.) to predict the age of abalone. Our final Ridge model predicts age with moderate accuracy on an unseen test data set, with an R-squared score of 0.52 and a mean absolute percentage error (MAPE) of 13.71. However, considering the potential economic losses to stakeholders, we recommend further improvement before the model is used in industry.

  • Dataset
    The dataset used in this project comes from an original study, "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait", by Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994). It was sourced from the UCI Machine Learning Repository and can be found here. Each row in the data set represents one abalone and contains its physical measurements (Sex, Length, Diameter, Height, Whole weight, etc.) and its number of rings. Adding 1.5 to the number of rings gives the age in years. Examples with missing values in the original study have been removed, and the ranges of the continuous values have been scaled. Please find the detailed information here.
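
As a minimal sketch of the quantities mentioned above, the snippet below converts ring counts to ages (rings + 1.5) and computes R-squared and MAPE using their standard textbook definitions. The function names and the sample numbers are hypothetical and for illustration only; the project's own scripts may compute these metrics differently (for example, with a library), and reporting MAPE in percent is an assumption.

```python
from typing import Sequence


def rings_to_age(rings: int) -> float:
    """Convert a ring count to an age in years (rings + 1.5)."""
    return rings + 1.5


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, expressed in percent (assumed convention)."""
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Hypothetical ring counts and predicted ages, not taken from the data set.
rings = [7, 9, 10, 15]
ages = [rings_to_age(r) for r in rings]  # true ages in years
predicted = [9.0, 10.0, 12.0, 15.0]

print(f"R-squared: {r_squared(ages, predicted):.2f}")
print(f"MAPE: {mape(ages, predicted):.2f}")
```

An R-squared closer to 1 and a MAPE closer to 0 both indicate predictions that track the true ages more closely.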

Report


The final report can be found here.

Usage


To replicate the analysis,

  1. Clone this GitHub repository.
  2. Navigate to the root directory of this repository.
  3. Make sure Docker is installed on your device. (You can install Docker here.)
  4. Pull the Docker image, which contains the software and libraries/packages needed to run the abalone age prediction machine learning (ML) pipeline.

    docker pull chuangw/abalone_age_prediction:latest

  5. To run this analysis using Docker, type the following (filling in PATH_ON_YOUR_COMPUTER with the absolute path to the root of this project on your computer).

    docker run --rm -v PATH_ON_YOUR_COMPUTER:/home/rstudio/Abalone_Age_Prediction chuangw/abalone_age_predictor make -C /home/rstudio/Abalone_Age_Prediction all

  6. To clean up the analysis, type:

    docker run --rm -v PATH_ON_YOUR_COMPUTER:/home/rstudio/Abalone_Age_Prediction chuangw/abalone_age_predictor make -C /home/rstudio/Abalone_Age_Prediction clean
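
For readers who prefer not to type the absolute path by hand, the following is a small sketch that assembles the same `docker run` command from the current directory. The helper name `docker_make_command` is hypothetical; the image name, container directory, and Makefile targets are copied from the commands above. It only prints the command, so nothing is executed unless you pass the list to `subprocess.run` yourself.

```python
from pathlib import Path
from typing import List

# Values copied from the commands above.
IMAGE = "chuangw/abalone_age_predictor"
CONTAINER_DIR = "/home/rstudio/Abalone_Age_Prediction"


def docker_make_command(project_root: str, target: str = "all") -> List[str]:
    """Build the `docker run` argument list for a Makefile target (hypothetical helper)."""
    # The -v bind mount expects an absolute host path, so resolve it first.
    host_dir = Path(project_root).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{CONTAINER_DIR}",
        IMAGE,
        "make", "-C", CONTAINER_DIR, target,
    ]


if __name__ == "__main__":
    # Run from the repository root; prints the command instead of executing it.
    print(" ".join(docker_make_command(".", "all")))
```

Calling `docker_make_command(".", "clean")` yields the clean-up variant from step 6.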

Flow Chart and Project Organization


  • Flow Chart

The whole analysis process, including running all scripts and rendering the R Markdown report, is automated by the pipeline written in the Makefile, which can be run using Docker (see above).

  • Project Organization
.
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── Dockerfile
├── LICENSE
├── Makefile
├── README.md
├── data
│   ├── processed
│   │   ├── test.csv
│   │   └── training.csv
│   └── raw
│       └── abalone.csv
├── doc
│   ├── abalone_age_predict_report.Rmd
│   ├── abalone_age_predict_report.html
│   ├── abalone_age_predict_report.md
│   └── abalone_age_refs.bib
├── env-abalone.yml
├── img
│   ├── out.png
│   ├── output.dot
│   └── project_flow_chart.png
├── results
│   ├── eda
│   │   ├── all_vs_age_dist.png
│   │   ├── corr_plot.png
│   │   └── sex_vs_age_violin.png
│   └── ml_model
│       ├── best_model_quality.sav
│       ├── best_predict_model.sav
│       └── hyperparam_tuning.png
└── src
    ├── data_wrangling
    │   ├── download_data.py
    │   └── pre_process_abalone_data.py
    ├── eda
    │   ├── abalone_eda.ipynb
    │   └── abalone_eda.py
    └── ml
        ├── abalone_fit_predict_model.py
        └── abalone_test_result.py

Dependencies


Please refer to env-abalone.yml in the root directory of this project. Run the following command from the root of this repository to replicate the environment for this project.

conda env create --file env-abalone.yml

License


MIT license

References


  • Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

  • Data comes from an original (non-machine-learning) study: Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994) "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288)

abalone_age_prediction's People

Contributors

charlessuresh, chuangw6, huan-ds


abalone_age_prediction's Issues

Group Meeting 1

Date&Time:

Nov 19th 2020, 9pm - 11pm (PST)

Chair:

@chuangw46

Attendees:

@chuangw46 @huan-ds @charlessuresh

Agenda

| ID | Item | Outcome |
|----|------|---------|
| 1 | Identify the data set and research topic | Abalone Data Set; given a number of features for an abalone, predict its age |
| 2 | Draft a team work contract | All the items discussed in the group were documented in Google Doc |
| 3 | Other 3 files that are important for collaboration | Code of Conduct file; Contributing file; License file |

Actionable Items

| ID | Assignee | Tasks | Delivery Time |
|----|----------|-------|---------------|
| 1 | @chuangw46 | To set up the GitHub repository; to write a script for downloading data and saving it in csv format; to refine the team work contract | 5am, Nov 19th 2020 (PST) |
| 2 | @huan-ds | To write the project background documentation in the README.md file; to write the CONTRIBUTING.md file | 5am, Nov 19th 2020 (PST) |
| 3 | @charlessuresh | To write the CODE_OF_CONDUCT.md file; to start working on the EDA process | 5am, Nov 19th 2020 (PST) |
    Peer Review of Version 0.2.0

    Hi Group32,

    Well done on the project! I had fun reading it.

    • The documentation is great. Plots and interpretations are clear, and easy to follow along.
    • The code is readable and concise. Functions are written well with great docstrings.
    • Analysis and reasoning are correct and make sense.
    • Communication is clear and concise with a smooth flow.
    • Suggestion: I only have a few minor suggestions. The first is to maybe label the first "Results & Discussion" section as "EDA" and keep the second one as it is, instead of having two sections with the same title in your report. Also, the interpretation of Figure 3 describes four metrics (mean, median, 25% quartile and 75% quartile), but I only see three metrics on the plot. Although the table of these metrics can be found in the src folder, I would still suggest rephrasing this part to avoid unnecessary confusion. Lastly, regarding the outliers excluded from the training data, I am curious how they are defined, so maybe you could briefly describe the rule used to define an outlier (sorry if it is already defined in the project and I failed to find it).

    Overall a great project! Hopefully my suggestions would help improve the project!

    Group Meeting 2

    Date&Time:

    Nov 20th 2020, 11pm - 12pm (PST)

    Attendees:

    @chuangw46 @huan-ds

    Actionable Items

    | ID | Assignee | Tasks | Delivery Time |
    |----|----------|-------|---------------|
    | 1 | @chuangw46 | To add interpretation for EDA notebook; to reformat EDA code in the notebook; to rewrite the CODE_OF_CONDUCT.md file | 20:00, Nov 21 2020 (PST) |
    | 2 | @huan-ds | To refine the project proposal in the README.md file; to add interpretation for EDA notebook | 20:00, Nov 21 2020 (PST) |

    EDA Elements

    For the EDA process, please add the items that you find necessary to include in EDA.

    Milestone #4 - your repo/analysis review

    Hi Team 32,

    Very nice repo and analysis. First, some praise:

    -Nice touch with listing your LinkedIn and that it's part of UBC MDS.
    -Nice Background explanation about why your prediction question would be useful and preferable to how it is currently done (counting rings is expensive).
    -great flow chart. I doubt everyone went this far (I know my team didn't!)
    -good thing you used Ridge; your X variables have crazy collinearity (to be expected when these are similar body measurements).
    -your narration is very clear and orderly.

    My list of refinements is long, but most are really quick to do, and a number are purely to ponder over and implement if you think they make sense:

    1. In the readme, I'd suggest dropping the bullet points used for "Background", "Analysis" and "Dataset". A purely aesthetic critique.
    2. There are a couple of typos in your Readme and report that I think you will catch if you paste your text in a word processing document.
    3. Your web link for this reference doesn't work:
      Sam Waugh (1995) "Extending and benchmarking Cascade-Correlation", PhD thesis, Computer Science Department, University of Tasmania. [Web Link]
    4. You wrote in the Analysis that age is a categorical variable. I think you meant to refer to Sex.
    5. It looks like you haven't quite optimized for alpha because you haven't reached a maximum? Maybe you should expand the search space to include much smaller values of alpha? In our labs we've used alpha as low as 0.01.
    6. After you show Table 1, you start talking about further refinements you could make. You should put a header for that section.

    Food-for-thought suggestions. Your choice whether to ignore or address somehow:
    7. It looks like there's really no difference between conditioning Age on Females or Males. Maybe you could transform this to a variable of Adult or Infant.
    8. future refinements: find out how much you could simplify the model by removing features without losing much prediction accuracy. This seems relevant because a fisherman would prefer to only have to measure a couple of body parts, rather than 8, right?
    9. future refinements: find out whether your model is improved if you allow gender to interact with your other explanatory variable measurements.
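
Two of the suggestions in this review can be sketched in a few lines: widening the alpha search space to include smaller values (item 5), and collapsing Sex into an Adult/Infant variable. This is only an illustration under stated assumptions: the function names are hypothetical, the grid bounds are arbitrary, and the Sex codes are assumed to be `M`, `F` and `I` as in the UCI data set (the processed data in this project may encode them differently).

```python
from typing import List


def log_spaced_alphas(low_exp: int = -3, high_exp: int = 3) -> List[float]:
    """Return one alpha candidate per decade, from 10**low_exp to 10**high_exp."""
    return [10.0 ** k for k in range(low_exp, high_exp + 1)]


def to_maturity(sex: str) -> str:
    """Collapse the assumed Sex codes into Adult/Infant (M and F both become Adult)."""
    return "Infant" if sex == "I" else "Adult"


# A grid that reaches well below 0.01, as the reviewer suggests.
alphas = log_spaced_alphas()

# Hypothetical Sex values, for illustration only.
maturity = [to_maturity(s) for s in ["M", "F", "I"]]

print(alphas)
print(maturity)
```

A grid like `alphas` could then be handed to whatever hyperparameter search the fitting script already uses.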

    EDA Plots Script

    Amendments that can be made to the abalone_eda.py script to fine-tune the plots saved:

    • dpi : Adjust dpi as per requirement when incorporating the plots into the report
    • subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=None): I've set the parameter values for now to make them look reasonable. Feel free to adjust each of the parameters listed to tune the plot spacings
    • font_scale: set the font scale as per requirement

    Let me know if there are any issues with the abalone_eda.py script. Also, feel free to suggest/make any changes

    Peer Review of 0.2.0 version

    Hi Group 32,

    I enjoyed going through your work!

    • The flowchart in README.md is fantastic.
    • Awesome plots.
    • Code is well written and is easy to read.
    • Result communication was crisp and clear.

    Suggestions:

    Cal had already provided the suggestions that I wanted to give. I don't see any point violating the DRY principle :P

    Just a small thing: snake-cased names could be used for the data frame's column names. Accessing columns gets easier (e.g. df.column_name).
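
One possible way to apply the snake-case suggestion is sketched below. The helper `to_snake_case` is hypothetical and the sample labels are taken from the plotting calls quoted elsewhere in these issues; how units should be kept in the names is a design choice left to the team.

```python
import re
from typing import List


def to_snake_case(name: str) -> str:
    """Lowercase a column label and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


columns: List[str] = ["Length (mm)", "Whole Weight (g)", "Shell Weight (g)"]
snake = [to_snake_case(c) for c in columns]

print(snake)
```

With a pandas data frame, the same helper could be applied as `df.columns = [to_snake_case(c) for c in df.columns]`, after which a column is reachable as `df.whole_weight_g`.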

    Proposal

    • Avoid using abbreviations or function names like MSE and RandomForestRegressor in the proposal. Otherwise, you should define them.

    Other than that, everything else looks fine and you have done a great job in this assignment in all of the tasks.

    README.md file

    Hi, I've completed most of the README.md file. Here are two issues I need some help with:

      1. License: How do I check the license for a public dataset on the UCI website?
      2. Dependencies: How do I check the minimal dependencies for this project?

    EDA code quality

    • Try to avoid repetition in the code and make use of functions that avoid it. An example is this chunk that you used:
    sns.histplot(train_df1, x="Length (mm)", kde=True, bins=15, ax=axs[0, 0])
    sns.histplot(train_df1, x="Diameter (mm)", kde=True, bins=15, ax=axs[0, 1])
    sns.histplot(train_df1, x="Height (mm)", kde=True, bins=20, ax=axs[0, 2])
    sns.histplot(train_df1, x="Whole Weight (g)", kde=True, bins=15, ax=axs[0, 3])
    sns.histplot(train_df1, x="Shucked Weight (g)", kde=True, bins=15, ax=axs[1, 0])
    sns.histplot(train_df1, x="Viscera Weight (g)", kde=True, bins=15, ax=axs[1, 1])
    sns.histplot(train_df1, x="Shell Weight (g)", kde=True, bins=15, ax=axs[1, 2])
    

    Check the link below for a better alternative to doing that.
    https://seaborn.pydata.org/examples/faceted_histogram.html
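
Besides the faceted approach in the link, the repeated calls can also be folded into a loop. The sketch below is one way to do that, not the script's actual code: `HIST_BINS`, `grid_positions` and `plot_histograms` are hypothetical names, the column labels and bin counts are copied from the chunk quoted above, and a grid with 4 columns is assumed from the `axs[row, col]` indices used there.

```python
from typing import Dict, List, Tuple

# Column name -> number of bins, copied from the calls quoted above.
HIST_BINS: Dict[str, int] = {
    "Length (mm)": 15,
    "Diameter (mm)": 15,
    "Height (mm)": 20,
    "Whole Weight (g)": 15,
    "Shucked Weight (g)": 15,
    "Viscera Weight (g)": 15,
    "Shell Weight (g)": 15,
}


def grid_positions(n_plots: int, n_cols: int = 4) -> List[Tuple[int, int]]:
    """Row-major (row, col) index for each plot in a grid with n_cols columns."""
    return [divmod(i, n_cols) for i in range(n_plots)]


def plot_histograms(df, axs, bins_by_column=HIST_BINS, n_cols=4):
    """Draw one histogram per column on a 2-D array of matplotlib axes."""
    # Imported here so the layout helper above can be used without seaborn installed.
    import seaborn as sns

    positions = grid_positions(len(bins_by_column), n_cols)
    for (column, bins), (row, col) in zip(bins_by_column.items(), positions):
        sns.histplot(df, x=column, kde=True, bins=bins, ax=axs[row, col])
```

Assuming the axes come from something like `fig, axs = plt.subplots(2, 4)`, calling `plot_histograms(train_df1, axs)` would reproduce the seven calls above, and changing a bin count becomes a one-line edit in `HIST_BINS`.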

    Group Meeting 3

    Date&Time:

    Nov 27th 2020, 5:00pm - 8:30pm (Beijing)

    Chair:

    @chuangw46

    Attendees:

    @chuangw46 @huan-ds @charlessuresh

    Actionable Items

    Delivery Time: Nov 28th 2020, 5:00pm (Beijing)

    | Item | Task (To create) | Input | Output | Assignee |
    |------|------------------|-------|--------|----------|
    | Download data | src/data_wrangling/download_data.py | UCI data source link | data/raw/abalone.csv | Finished |
    | Preprocessing + Partition | src/data_wrangling/pre_process_abalone_data.py | data/raw/abalone.csv | data/processed/training.csv; data/processed/test.csv | @charlessuresh |
    | EDA | src/eda/abalone_eda.py | data/processed/training.csv | /results/eda/sex_vs_age_violin.png; /results/eda/all_vs_age_dist.png; /results/eda/corr_plot.png | @charlessuresh |
    | ML: fit model | src/ml/abalone_fit_predict_model.py | data/processed/training.csv | results/ml_model/best_predict_model.sav; results/ml_model/hyperparam_tuning.png | @chuangw46 |
    | ML: test model accuracy | src/ml/abalone_test_result.py | data/processed/test.csv; results/best_predict_model.sav | results/ml_model/best_model_quality.sav | @chuangw46 |
    | Report | doc/abalone_age_predict_report.Rmd | All images and results generated from above step | .html and .pdf | @huan-ds |
    | Shell script | /run_all.sh | NA | final report | @huan-ds |
