XGBoost will be used for this analysis.
XGBoost is a decision-tree-based ensemble machine learning algorithm built on a gradient boosting framework. For prediction problems involving unstructured data (images, text, etc.), artificial neural networks tend to outperform other approaches. For small-to-medium structured/tabular data, however, decision-tree-based algorithms are currently considered among the best-performing choices.
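To make the idea of gradient boosting with decision trees more concrete, here is a minimal sketch of plain gradient boosting for squared error, where each new tree is fit to the residuals of the current ensemble. It is not XGBoost's actual algorithm (XGBoost additionally uses second-order gradient information and regularization); the toy data and the names `X_toy`, `y_toy`, `learning_rate`, and `n_rounds` are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
n_rounds = 50

# Start from a constant prediction: the mean of the targets.
prediction = np.full_like(y_toy, y_toy.mean())
initial_mse = np.mean((y_toy - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For squared error, the negative gradient is simply the residual.
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)
    # Add a shrunken copy of the new tree's output to the ensemble.
    prediction += learning_rate * tree.predict(X_toy)
    trees.append(tree)

final_mse = np.mean((y_toy - prediction) ** 2)
print(f"MSE before boosting: {initial_mse:.4f}")
print(f"MSE after {n_rounds} rounds: {final_mse:.4f}")
```

Each round corrects part of the error left by the previous trees, which is why many shallow trees together can fit a pattern that no single shallow tree could.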
!pip install xgboost
Collecting xgboost
Downloading xgboost-1.2.1-py3-none-macosx_10_13_x86_64.macosx_10_14_x86_64.macosx_10_15_x86_64.whl (1.2 MB)
Requirement already satisfied: scipy in /opt/miniconda3/lib/python3.7/site-packages (from xgboost) (1.4.1)
Requirement already satisfied: numpy in /opt/miniconda3/lib/python3.7/site-packages (from xgboost) (1.18.1)
Installing collected packages: xgboost
Successfully installed xgboost-1.2.1
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
https://www.kaggle.com/uciml/pima-indians-diabetes-database
import pandas as pd
dataset = pd.read_csv('diabetes.csv')
dataset.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
- Split the dataset into train and test sets
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
features = dataset.iloc[:, 0:8].values
labels = dataset.iloc[:,8].values
features
array([[ 6. , 148. , 72. , ..., 33.6 , 0.627, 50. ],
[ 1. , 85. , 66. , ..., 26.6 , 0.351, 31. ],
[ 8. , 183. , 64. , ..., 23.3 , 0.672, 32. ],
...,
[ 5. , 121. , 72. , ..., 26.2 , 0.245, 30. ],
[ 1. , 126. , 60. , ..., 30.1 , 0.349, 47. ],
[ 1. , 93. , 70. , ..., 30.4 , 0.315, 23. ]])
labels[0:5]
array([1, 0, 1, 0, 1])
Plotting percentage of Outcome 0 (no disease) vs 1 (disease)
dataset.Outcome.value_counts().plot(kind='pie')
print(f'Percentage of No Disease: {100 * labels[labels==0].shape[0] / labels.shape[0]:0.2f}')
print(f'Percentage of Disease: {100 * labels[labels==1].shape[0] / labels.shape[0]:0.2f}')
Percentage of No Disease: 65.10
Percentage of Disease: 34.90
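Because the two classes are not balanced (about 65% vs. 35%), a random split can end up with slightly different class ratios in the train and test sets. One possible option, sketched below on toy labels rather than the real dataset, is the `stratify` argument of `train_test_split`, which preserves the class ratio in both splits. The arrays `labels_demo` and `features_demo` are placeholders invented for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with roughly the same 65/35 class ratio (illustrative only).
labels_demo = np.array([0] * 65 + [1] * 35)
features_demo = np.arange(100).reshape(-1, 1)  # placeholder feature column

x_tr, x_te, y_tr, y_te = train_test_split(
    features_demo,
    labels_demo,
    test_size=0.2,
    random_state=0,
    stratify=labels_demo,  # keep the class ratio in both splits
)

print(f"Positive share, train: {y_tr.mean():.2f}")
print(f"Positive share, test:  {y_te.mean():.2f}")
```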
from sklearn.preprocessing import MinMaxScaler
# Scale every feature to the range (-1, 1)
scaler = MinMaxScaler(feature_range=(-1, 1))
X = scaler.fit_transform(features)
#X = features
Y = labels
X
array([[-0.29411765, 0.48743719, 0.18032787, ..., 0.00149031,
-0.53116994, -0.03333333],
[-0.88235294, -0.14572864, 0.08196721, ..., -0.2071535 ,
-0.76686593, -0.66666667],
[-0.05882353, 0.83919598, 0.04918033, ..., -0.30551416,
-0.49274125, -0.63333333],
...,
[-0.41176471, 0.2160804 , 0.18032787, ..., -0.21907601,
-0.85738685, -0.7 ],
[-0.88235294, 0.26633166, -0.01639344, ..., -0.10283159,
-0.76857387, -0.13333333],
[-0.88235294, -0.06532663, 0.14754098, ..., -0.09388972,
-0.79760888, -0.93333333]])
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=0)
x_train.shape
(514, 8)
x_test.shape
(254, 8)
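Note that the scaler above is fit on all rows before the split, so the minimum and maximum of the test rows influence the scaling of the training data. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on random stand-in data with the same shape as the diabetes features; the names ending in `_demo`, `_raw`, and `_scaled` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Stand-in data with the same shape as the diabetes features (768 x 8);
# the values are random and purely illustrative.
rng = np.random.default_rng(0)
features_demo = rng.uniform(0, 200, size=(768, 8))
labels_demo = rng.integers(0, 2, size=768)

# Split first ...
x_train_raw, x_test_raw, y_train_demo, y_test_demo = train_test_split(
    features_demo, labels_demo, test_size=0.33, random_state=0
)

# ... then learn the min/max from the training rows only.
scaler_demo = MinMaxScaler(feature_range=(-1, 1))
x_train_scaled = scaler_demo.fit_transform(x_train_raw)
# The test rows are transformed with the training statistics.
x_test_scaled = scaler_demo.transform(x_test_raw)

print(x_train_scaled.shape, x_test_scaled.shape)
```

With this order, scaled test values may fall slightly outside (-1, 1), which is expected. Tree-based models are also largely insensitive to this kind of monotonic rescaling, so the step matters more for other model families.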
model = XGBClassifier()
model.fit(x_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
y_pred = model.predict(x_test)
y_pred
array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1,
1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0,
1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0,
0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
# accuracy
acc = accuracy_score(y_test, y_pred)
print(f'Accuracy score: {acc * 100:0.2f}')
Accuracy score: 75.98
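Since about 65% of the patients belong to the "no disease" class, accuracy is easier to interpret next to a majority-class baseline and a confusion matrix. The sketch below illustrates both on small hand-made arrays; `y_true_demo` and `y_pred_demo` are hypothetical values, not the predictions produced above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Tiny hand-made labels and predictions (hypothetical, not the notebook's results).
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Baseline: always predict the most frequent class.
majority_class = np.bincount(y_true_demo).argmax()
baseline_pred = np.full_like(y_true_demo, majority_class)

model_acc = accuracy_score(y_true_demo, y_pred_demo)
baseline_acc = accuracy_score(y_true_demo, baseline_pred)
print(f"Model accuracy:    {model_acc:.2f}")
print(f"Baseline accuracy: {baseline_acc:.2f}")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
```

In the real notebook, the same calls could be applied to `y_test` and `y_pred` to see how many diabetic patients the model misses, which a single accuracy number does not reveal.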
What we have learned:
- Quickly loading and visualizing a dataset
- Using MinMaxScaler
- Using XGBoost