COMPREHENSIVE ANALYSIS AND PREDICTION OF OBESITY RISK LEVELS USING MACHINE LEARNING TECHNIQUES WITH 90.99% ACCURACY

Author: Anamika Kumari

Introduction:

Obesity is a pressing global health concern, with millions affected worldwide and significant implications for morbidity, mortality, and healthcare costs. The prevalence of obesity has tripled since 1975, now affecting approximately 30% of the global population. This escalating trend underscores the urgent need to address the multifaceted risks associated with excess weight. Obesity is a leading cause of various health complications, including diabetes, heart disease, osteoarthritis, sleep apnea, strokes, and high blood pressure, significantly reducing life expectancy and increasing mortality rates. Effective prediction of obesity risk is crucial for implementing targeted interventions and promoting public health.

Approach:

Data Collection and Preprocessing:
- We will gather comprehensive datasets containing information on demographics, lifestyle habits, dietary patterns, physical activity levels, and medical history.
- We will preprocess the data to handle missing values, normalize features, and encode categorical variables.
Exploratory Data Analysis (EDA):
- We will perform exploratory data analysis to gain insights into the distribution of variables, identify patterns, and explore correlations between features and obesity risk levels.
- Visualization techniques will be employed to present key findings effectively.
Feature Engineering:
- We will engineer new features and transformations to enhance the predictive power of our models.
- This may involve creating interaction terms, deriving new variables, or transforming existing features to improve model performance.
Model Development:
- We will employ advanced machine learning techniques, including ensemble methods such as Random Forest, Gradient Boosting (XGBoost, LightGBM), and possibly deep learning approaches, to develop predictive models for obesity risk classification.
- We will train and fine-tune these models using appropriate evaluation metrics and cross-validation techniques to ensure robustness and generalization.
Model Evaluation:
- We will evaluate the performance of our models using various metrics such as accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC).
- We will also conduct sensitivity analysis and interpretability assessments to understand the factors driving predictions and identify areas for improvement.
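
The steps above can be sketched end to end on a small scale. The following is a minimal illustration, not the project's actual notebook code: the data is synthetic, the target is a toy three-class label derived from BMI purely so the example runs, and Random Forest stands in for the gradient boosting models. The column names `Height`, `Weight`, `FAF`, and `MTRANS` follow the dataset description in this README; everything else (sample size, value ranges, hyperparameters) is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical values, not the project dataset).
rng = np.random.default_rng(0)
n_rows = 300
df = pd.DataFrame({
    "Height": rng.normal(1.70, 0.10, n_rows),   # metres
    "Weight": rng.normal(80.0, 20.0, n_rows),   # kilograms
    "FAF": rng.integers(0, 4, n_rows).astype(float),
    "MTRANS": rng.choice(["Walking", "Automobile", "Bike"], n_rows),
})
df.loc[::25, "FAF"] = np.nan  # inject a few missing values to impute

# Feature engineering: derive BMI = weight (kg) / height (m) squared.
df["BMI"] = df["Weight"] / df["Height"] ** 2

# Toy target for illustration only: three classes cut from BMI.
# Because the target is derived from a feature, the scores are not meaningful.
y = np.digitize(df["BMI"].to_numpy(), [25.0, 30.0])

numeric_cols = ["Height", "Weight", "FAF", "BMI"]
categorical_cols = ["MTRANS"]

# Preprocessing: impute and scale numeric columns, one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Stratified 5-fold cross-validation with two of the metrics named above.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    model, df[numeric_cols + categorical_cols], y,
    cv=cv, scoring=["accuracy", "f1_macro"],
)
mean_accuracy = scores["test_accuracy"].mean()
mean_f1 = scores["test_f1_macro"].mean()
print(f"accuracy={mean_accuracy:.3f}  macro-F1={mean_f1:.3f}")
```

Keeping the preprocessing inside the `Pipeline` means imputation and scaling are refit on each training fold, so no statistics leak from the validation fold.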

About Obesity Risk Level Prediction-Project:

Understanding Obesity and Risk Prediction:

Understanding Obesity:
- Obesity stems from excessive body fat accumulation, influenced by genetic, environmental, and behavioral factors.
- Risk prediction involves analyzing demographics, lifestyle habits, and physical activity to classify individuals into obesity risk categories.
Global Impact:
- Worldwide obesity rates have tripled since 1975, affecting 30% of the global population.
- Urgent action is needed to develop effective risk prediction and management strategies.
Factors Influencing Risk:
- Obesity risk is shaped by demographics, lifestyle habits, diet, physical activity, and medical history.
- Analyzing these factors reveals insights into obesity's mechanisms and identifies high-risk populations.
Data-Driven Approach:
- Advanced machine learning and large datasets enable the development of predictive models for stratifying obesity risk.
- These models empower healthcare professionals and policymakers to implement tailored interventions for improved public health outcomes.
Proactive Health Initiatives:
- Our proactive approach aims to combat obesity by leveraging data and technology for personalized prevention and management.
- By predicting obesity risk, we aspire to create a future where interventions are precise, impactful, and tailored to individual needs.

Source: World Health Organization. (2022). Obesity and overweight.

Dataset Overview:

The dataset contains comprehensive information encompassing eating habits, physical activity, and demographic variables, comprising a total of 17 attributes.

Key Attributes Related to Eating Habits:

Frequent Consumption of High-Caloric Food (FAVC): Indicates the frequency of consuming high-caloric food items.
Frequency of Consumption of Vegetables (FCVC): Measures the frequency of consuming vegetables.
Number of Main Meals (NCP): Represents the count of main meals consumed per day.
Consumption of Food Between Meals (CAEC): Describes the pattern of food consumption between main meals.
Consumption of Water Daily (CH2O): Quantifies the daily water intake.
Consumption of Alcohol (CALC): Indicates the frequency of alcohol consumption.

Attributes Related to Physical Condition:

Calories Consumption Monitoring (SCC): Reflects the extent to which individuals monitor their calorie intake.
Physical Activity Frequency (FAF): Measures the frequency of engaging in physical activities.
Time Using Technology Devices (TUE): Indicates the duration spent using technology devices.
Transportation Used (MTRANS): Describes the mode of transportation typically used.

Additionally, the dataset includes essential demographic variables such as gender, age, height, and weight, providing a comprehensive overview of individuals' characteristics.
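
As a rough illustration of how these attributes might be separated into categorical and numerical groups before analysis, here is a small sketch. The two-row frame is made up for the example, `split_columns` is a hypothetical helper (not part of the project code), and the target column name `NObeyesdad` is taken from the table of contents of this README.

```python
import pandas as pd


def split_columns(df: pd.DataFrame, target: str) -> tuple[list[str], list[str]]:
    """Return (categorical, numerical) feature names, excluding the target."""
    features = df.drop(columns=[target])
    numerical = features.select_dtypes(include="number").columns.tolist()
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    return categorical, numerical


# Tiny made-up sample using a few of the attribute names described above.
sample = pd.DataFrame({
    "Gender": ["Female", "Male"],
    "Age": [21.0, 23.0],
    "FAVC": ["yes", "no"],
    "FCVC": [2.0, 3.0],
    "MTRANS": ["Walking", "Automobile"],
    "NObeyesdad": ["Normal_Weight", "Overweight_Level_I"],
})

categorical_cols, numerical_cols = split_columns(sample, target="NObeyesdad")
print("categorical:", categorical_cols)
print("numerical:", numerical_cols)
```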

Target Variable:

The target variable, NObeyesdad, represents different obesity risk levels, categorized as:

Underweight (BMI < 18.5): 0
Normal (18.5 <= BMI < 20): 1
Overweight I (20 <= BMI < 25): 2
Overweight II (25 <= BMI < 30): 3
Obesity I (30 <= BMI < 35): 4
Obesity II (35 <= BMI < 40): 5
Obesity III (BMI >= 40): 6
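
One way to obtain the integer codes 0 to 6 in the order listed above is an explicit mapping. This is a sketch under assumptions: the label strings below are the ones commonly found in the public version of this dataset and are not confirmed by this README, and `encode_target` is a hypothetical helper name.

```python
import pandas as pd

# Assumed label strings, ordered from lowest to highest risk level.
OBESITY_LEVELS = [
    "Insufficient_Weight",   # Underweight   -> 0
    "Normal_Weight",         # Normal        -> 1
    "Overweight_Level_I",    # Overweight I  -> 2
    "Overweight_Level_II",   # Overweight II -> 3
    "Obesity_Type_I",        # Obesity I     -> 4
    "Obesity_Type_II",       # Obesity II    -> 5
    "Obesity_Type_III",      # Obesity III   -> 6
]
LEVEL_TO_CODE = {name: code for code, name in enumerate(OBESITY_LEVELS)}


def encode_target(labels: pd.Series) -> pd.Series:
    """Map label strings to ordinal codes, failing loudly on unknown labels."""
    codes = labels.map(LEVEL_TO_CODE)
    if codes.isna().any():
        unknown = sorted(set(labels[codes.isna()]))
        raise ValueError(f"Unknown obesity level(s): {unknown}")
    return codes.astype(int)


encoded = encode_target(pd.Series(["Obesity_Type_III", "Insufficient_Weight"]))
print(encoded.tolist())
```

An explicit mapping is used here because scikit-learn's `LabelEncoder` assigns codes in sorted label order, which would place the obesity types before the overweight levels rather than following the ordering listed above.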

Table of Contents:

No.	Topic
1.	What is Obesity?
2.	Understanding Obesity and Risk Prediction
3.	Dataset Overview

No.	Topic
1.	Importing Relevant Libraries
2.	Loading Datasets

No.	Topic
1.	Summary Statistics of the DataFrame
2.	The unique values present in the dataset
3.	The count of unique values in the NObeyesdad column
4.	Categorical and numerical Variables Analysis
	- a. Extracting column names for categorical, numerical, and categorical but cardinal variables
	- b. Summary Of All Categorical Variables
	- c. Summary Of All Numerical Variables

Section: 4. Data Preprocessing

No.	Topic
1.	Type conversion of the DataFrame
2.	Renaming the Columns
3.	Detecting Columns with Large or Infinite Values

Section: 5. Exploratory Data Analysis and Visualization-EDAV

1. Univariate Analysis

No.	Topic
a.	Countplots for all Variables
b.	Analyzing Individual Variables Using Histogram
c.	KDE Plots of Numerical Columns
d.	Pie Chart and Barplot for categorical variables
e.	Violin Plot and Box Plot for Numerical variables

2. Bivariate Analysis

No.	Topic
a.	Scatter plot: Age vs. Weight with Obesity Level
b.	Scatter plot: Age vs. Height with Obesity Level
c.	Scatter plot: Height vs. Weight with Obesity Level
d.	Scatter plot: Age vs. Weight with Family History of Overweight
e.	Scatter plot: Age vs. Height with Family History of Overweight
f.	Scatter plot: Height vs. Weight with Family History of Overweight
g.	Scatter plot: Age vs. Weight with Transport Used
h.	Scatter plot: Age vs. Height with Transport Used
i.	Scatter plot: Height vs. Weight with Transport Used

3. Multivariate Analysis

No.	Topic
a.	Pair Plot of Variables against Obesity Levels
b.	Correlation heatmap for Pearson's correlation coefficient
c.	Correlation heatmap for Kendall's tau correlation coefficient
d.	3D Scatter Plot of Numerical Columns against Obesity Level

e. Cluster Analysis

No.	Topic
I.	K-Means Clustering on Obesity level
II.	PCA Plot of numerical variables against obesity level

4. Outlier Analysis

a. Univariate Outlier Analysis

No.	Topic
I.	Boxplot Outlier Analysis
II.	Detecting outliers using Z-Score
III.	Detecting outliers using Interquartile Range (IQR)

b. Multivariate Outlier Analysis

No.	Topic
I.	Detecting Multivariate Outliers Using Mahalanobis Distance
II.	Detecting Multivariate Outliers Using Principal Component Analysis (PCA)
III.	Detecting Cluster-Based Outliers Using KMeans Clustering

5. Feature Engineering:

No.	Topic
a.	Encoding Categorical Variables as Numerical
b.	BMI (Body Mass Index) Calculation
c.	Total Meals Consumed
d.	Total Activity Frequency Calculation
e.	Ageing process analysis

Section: 6. Analysis & Prediction Using Machine Learning(ML) Model

No.	Topic
1.	Feature Importance Analysis and Visualization
	a. Feature Importance Analysis using Random Forest Classifier
	b. Feature Importance Analysis using XGBoost (XGB) Model
	c. Feature Importance Analysis using LightGBM Classifier Model
2.	Data visualization after Feature Engineering
	a. Bar plot of numerical variables
	b. PairPlot of Numerical Variables
	c. Correlation Heatmap of Numerical Variables

Section: 7. Prediction of Obesity Risk Level Using Machine learning(ML) Models

No.	Topic
1.	Machine Learning Model Creation: XGBoost and LightGBM - Powering The Predictions! 🚀
2.	Cutting-edge Machine Learning Model Evaluation: XGBoost and LightGBM 🤖
3.	Test Data Preprocessing for Prediction
4.	Showcase Predicted Encdd_Obesity_Level Values on Test Dataset 📊

Section: 8. Conclusion: 📝

No.	Topic
1.	Conclusion: 📝
2.	It's time to make Submission:

Links to access this project's ipynb file, in case you cannot view it in the GitHub repository, are here

🎯 Project Objectives:

Machine Learning Model Development: Develop a robust machine learning model leveraging advanced techniques to accurately predict obesity risk levels.
Data Analysis and Feature Engineering: Conduct thorough analysis of demographics, lifestyle habits, and physical activity data to identify key factors influencing obesity risk. Implement effective feature engineering strategies to enhance model performance.
Achieve 100% Accuracy: Strive to achieve a high level of accuracy, aiming for 100% precision in predicting obesity risk levels. Employ rigorous model evaluation techniques and optimize model parameters accordingly.
Actionable Insights: Provide actionable insights derived from the predictive model to facilitate targeted interventions and public health strategies. Enable healthcare professionals and policymakers to make informed decisions for obesity prevention and management.
Documentation and Presentation: Ensure comprehensive documentation of the model development process and findings. Prepare clear and concise presentations to communicate results effectively to stakeholders.

🚀 Prerequisites:

Machine Learning Basics: Understanding of supervised learning, model evaluation, and feature engineering.
Python Proficiency: Proficiency in Python, including libraries like NumPy, Pandas, and Scikit-learn.
Data Analysis Skills: Ability to perform EDA, preprocess datasets, and visualize data.
Jupyter Notebooks: Familiarity with Jupyter Notebooks for interactive coding and documentation.
Health Data Understanding: Basic knowledge of obesity, BMI calculation, and health-related datasets.
Computational Resources: Access to a computer with sufficient processing power and memory.
Environment Setup: Python environment setup with necessary libraries installed.
Version Control: Familiarity with Git and GitHub for collaboration and project management.
Documentation Skills: Ability to document methodologies and results effectively using markdown.
Passion for Data Science: Genuine interest in data science and public health projects.

Industry Relevance:

This project is highly relevant to the industry across several critical areas:

Healthcare Analytics: Leveraging advanced machine learning techniques, this project facilitates predictive analysis in healthcare, enabling personalized interventions and preventive strategies.
Precision Medicine: Accurately predicting obesity risk levels contributes to the advancement of precision medicine, allowing for tailored treatments and interventions based on individual health profiles.
Public Health Initiatives: By providing actionable insights derived from data analysis, this project assists in formulating targeted public health initiatives to reduce obesity rates and improve population health outcomes.
Data-driven Decision Making: Empowering healthcare professionals and policymakers with data-driven insights facilitates informed decision-making processes, optimizing resource allocation and intervention strategies.
Technology Integration: Integrating machine learning models into healthcare systems enhances diagnostic capabilities, risk assessment, and patient management, driving efficiency and improving healthcare delivery.
Preventive Healthcare: Emphasizing predictive analytics for obesity risk levels supports preventive healthcare initiatives, focusing on early detection and intervention to mitigate health risks and improve overall well-being.

Libraries and Packages Requirement

To execute this project, ensure the following libraries and packages are installed:

Python Standard Libraries:
- os: Operating system functionality
- pickle: Serialization protocol for Python objects
- warnings: Control over warning messages
- collections: Container datatypes
- csv: CSV file reading and writing
- sys: System-specific parameters and functions
Data Processing and Analysis:
- numpy: Numerical computing library
- pandas: Data manipulation and analysis library
Data Visualization:
- matplotlib.pyplot: Data visualization library
- seaborn: Statistical data visualization library
- altair: Declarative statistical visualization library
- mpl_toolkits.mplot3d: 3D plotting toolkit
- tabulate: Pretty-print tabular data
- colorama: Terminal text styling library
Machine Learning and Model Evaluation:
- scipy.stats: Statistical functions
- sklearn.cluster: Clustering algorithms
- sklearn.preprocessing: Data preprocessing techniques
- sklearn.decomposition: Dimensionality reduction techniques
- sklearn.ensemble: Ensemble learning algorithms
- xgboost: Extreme Gradient Boosting library
- lightgbm: Light Gradient Boosting Machine library
Miscellaneous:
- IPython.display.Image: Displaying images in IPython
- sklearn.metrics: Metrics for model evaluation
- sklearn.model_selection: Model selection and evaluation tools
- sklearn.preprocessing.LabelEncoder: Encode labels with a value between 0 and n_classes-1
- scipy.stats.pearsonr: Pearson correlation coefficient and p-value for testing non-correlation
- scipy.stats.chi2: Chi-square distribution

Make sure to have these libraries installed in your Python environment before running the code.
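
A quick way to confirm the third-party packages are importable is sketched below. The `REQUIRED` list uses the import names of the libraries listed above (for example `sklearn` for scikit-learn), and `missing_packages` is a hypothetical helper written for this example, not part of the project.

```python
from importlib.util import find_spec

# Top-level import names of the third-party libraries listed above.
REQUIRED = [
    "numpy", "pandas", "matplotlib", "seaborn", "altair", "tabulate",
    "colorama", "scipy", "sklearn", "xgboost", "lightgbm", "IPython",
]


def missing_packages(names: list[str]) -> list[str]:
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

`find_spec` only locates a package without importing it, so the check is fast and does not trigger side effects from heavy libraries.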

Tech Stack Used:

Programming Languages

Python: Used for data processing, analysis, machine learning model development, and scripting tasks.

Libraries and Frameworks

NumPy: For numerical computing and array operations.
Pandas: For data manipulation and analysis.
Matplotlib: For static, interactive, and animated visualizations.
Seaborn: For statistical data visualization.
Scikit-learn: For machine learning algorithms and model evaluation.
XGBoost: For gradient boosting algorithms.
LightGBM: For gradient boosting algorithms with faster training speed and higher efficiency.
Altair: For declarative statistical visualization.
IPython.display: For displaying images in IPython.
Tabulate: For pretty-printing tabular data.
Colorama: For terminal text styling.
SciPy: For scientific computing and statistical functions.

Tools and Utilities

Jupyter Notebook: For interactive computing and data exploration.
Git: For version control and collaboration.
GitHub: For hosting project repositories and collaboration.
Travis CI: For continuous integration and automated testing.
CircleCI: For continuous integration and automated testing.
GitHub Actions: For continuous integration and automated workflows directly within GitHub.

Data Storage and Processing

CSV Files: For storing structured data.
Pickle: For serializing and deserializing Python objects.

Development Environment

Operating System: Platform-independent (Windows, macOS, Linux).
Integrated Development Environment (IDE): Any Python-compatible IDE like PyCharm, VS Code, or Jupyter Lab.

Documentation and Collaboration

Markdown: For documenting project details, README files, and collaboration.
GitHub Wiki: For project documentation and knowledge sharing.
Google Docs: For collaborative documentation and note-taking.

Version Control Requirements

To manage code changes and collaboration effectively, the following version control tools and practices are recommended for this project:

Git Installation:
- Download and install Git from the official Git website.
- Ensure Git is properly configured on your system, including setting up your username and email address.
GitHub Repository:
- Create a GitHub account if you don't have one.
- Set up a new repository for the project on GitHub.
- Initialize the local project directory as a Git repository using the following commands:
```
git init
```
Collaboration Workflow:
- Follow a standard Git workflow, such as the feature branch workflow or Gitflow, for managing branches and code changes.
- Utilize pull requests for code review and collaboration between team members.
- Ensure consistent and descriptive commit messages to track changes effectively.
Continuous Integration (CI):
- Integrate a CI/CD pipeline with GitHub using platforms like Travis CI, CircleCI, or GitHub Actions.
- Configure automated tests to run on each push or pull request to ensure code quality and reliability.
Code Review:
- Conduct thorough code reviews for all pull requests to maintain code quality and ensure adherence to coding standards.
- Provide constructive feedback and suggestions for improvement during code reviews.

By following these version control practices, you can streamline collaboration, track changes effectively, and ensure the stability and reliability of the project codebase.

Installation Requirements:

To set up the environment for this project, follow these steps:

Python Installation: Ensure Python is installed on your system. You can download it from the official Python website.
Virtual Environment (Optional but Recommended):
- Install virtualenv: pip install virtualenv
- Create a virtual environment: virtualenv env
- Activate the virtual environment:
  - On Windows: .\env\Scripts\activate
  - On macOS and Linux: source env/bin/activate
Required Libraries:
- Install necessary libraries using pip:
```
pip install numpy pandas scipy scikit-learn matplotlib seaborn altair tabulate colorama jupyter xgboost lightgbm
```
- These libraries are essential for data analysis, visualization, and machine learning tasks; XGBoost and LightGBM are included for the gradient boosting models. The full list is given in the Libraries and Packages Requirement section above.
Jupyter Notebook Installation (Optional but Recommended):
- Install Jupyter Notebook: pip install notebook
- Launch Jupyter Notebook: jupyter notebook
Git Installation (Optional but Recommended):
- Download and install Git from the official Git website.
Project Repository:
- Clone the project repository from GitHub:
```
git clone https://github.com/yourname/Obesity-Risk-Level-Prediction--Project-using-ML
```
- Alternatively, download the project files directly from the repository.
Data Source:
- Ensure you have access to the dataset required for the project (as provided in this repository).
- Alternatively, you can visit this link to get the dataset for this project: See here
Environment Setup:
- Set up the project environment by installing all required dependencies listed in the project's requirements.txt file:
```
pip install -r requirements.txt
```
Run Jupyter Notebook:
- Navigate to the project directory containing the Jupyter Notebook file and launch Jupyter Notebook:
```
jupyter notebook
```
Project Configuration:
- Customize any project configurations or settings as necessary, such as file paths, model parameters, or data preprocessing steps.
Documentation and Notes:
- Keep documentation and notes handy for reference during the project, including datasets, code snippets, and research papers related to obesity prediction and machine learning techniques.

Outcome and Analysis:

Model Evaluation Matrix:

Best Model Performance for Obesity Risk-Level Prediction:

Result:

Based on the evaluation metrics, the models performed quite similarly, with minor differences in accuracy, precision, recall, and F1-score. LightGBM achieved the highest accuracy at approximately 90.99%, followed closely by XGBoost at approximately 90.87%. The ensemble model, which combines predictions from XGBoost and LightGBM, achieved an accuracy of approximately 90.80%, and CatBoost achieved approximately 90.56%.
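
This README does not state how the ensemble combines the two models' predictions. One common option is soft voting, which averages the class probabilities from each model and picks the most probable class. The sketch below assumes that approach; `soft_vote` is a hypothetical helper and the probability arrays are made-up numbers, not outputs of the project's models.

```python
import numpy as np


def soft_vote(*probabilities) -> np.ndarray:
    """Average per-class probabilities across models and return class indices.

    Each argument is an array-like of shape (n_samples, n_classes), such as
    the output of a classifier's predict_proba method.
    """
    stacked = np.stack([np.asarray(p, dtype=float) for p in probabilities])
    mean_proba = stacked.mean(axis=0)   # shape: (n_samples, n_classes)
    return mean_proba.argmax(axis=1)    # most probable class per sample


# Made-up probabilities for two samples and three classes.
proba_model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
proba_model_b = [[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]]

ensemble_pred = soft_vote(proba_model_a, proba_model_b)
print(ensemble_pred.tolist())
```

In the second sample the two models disagree (class 1 versus class 2), and the averaged probabilities (0.40 versus 0.45) decide in favor of class 2.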

Considering the performance metrics and confusion matrices, LightGBM appears to have a slight edge over the other models in terms of accuracy and F1-score, with similar performance in precision and recall. However, the differences in performance among the models are relatively small, indicating that they are all capable of producing reliable predictions.
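
For reference, the metrics and confusion matrix mentioned above can be computed with scikit-learn as in the following sketch. The labels are a tiny made-up example with three classes, not the project's predictions, and macro averaging is an assumption since the README does not say which averaging was used.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Made-up true and predicted labels for six samples and three classes.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# Macro averaging gives every class equal weight regardless of its size.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
print(matrix)
```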

Therefore, based on the evaluation results, LightGBM appears to be the best model for obesity risk level prediction.

Through our comprehensive analysis and predictive modeling efforts, we aim to achieve accurate classification of individuals into different obesity risk categories. This outcome will enable healthcare professionals to identify high-risk individuals, tailor interventions, and allocate resources effectively. Furthermore, our insights into the factors influencing obesity risk will inform public health policies and initiatives aimed at prevention and management. By leveraging data-driven approaches and advanced machine learning techniques, we aspire to make significant strides towards combating the global obesity epidemic and promoting healthier communities.

Enjoy the project!
