Now that we've had some practice building models in Apache Spark, it's time to put your skills to the test!
Using the `datasets/Heart.csv` dataset (credit: ISLR), you will build a binary classification model to predict whether or not a patient has heart disease (`AHD`) given the following features:
- `Age`
- `Sex`
- `ChestPain`
- `RestBP`
- `Chol`
- `Fbs`
- `RestECG`
- `MaxHR`
- `ExAng`
- `Oldpeak`
- `Slope`
- `Ca`
- `Thal`

- Your target column (`AHD`) needs to be run through a `StringIndexer`. `ChestPain` and `Thal` need to be run through a `StringIndexer` and a `OneHotEncoder`.
  - NOTE: You only need one instance of `OneHotEncoder` for both columns.
- Split the data into an 80/20 train/test split. Use 42 as your seed for consistency.
- Paste your best accuracy score from the test set below
REPLACE THIS WITH YOUR ACCURACY SCORE
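
As a starting point, the requirements above could be wired together roughly as follows. This is a minimal sketch and not a reference solution: it assumes Spark 3.x (where `OneHotEncoder` accepts `inputCols`/`outputCols`), assumes missing values in the CSV are written as `NA` and simply drops those rows, and picks `LogisticRegression` only as a placeholder classifier. The helper names (`assembler_inputs`, `build_pipeline`, `evaluate_on_test_split`) and the `_idx`/`_vec` column suffixes are made up for illustration.

```python
# Column roles taken from the feature list above.
LABEL_COL = "AHD"
CATEGORICAL_COLS = ["ChestPain", "Thal"]
NUMERIC_COLS = ["Age", "Sex", "RestBP", "Chol", "Fbs", "RestECG",
                "MaxHR", "ExAng", "Oldpeak", "Slope", "Ca"]


def assembler_inputs():
    """Numeric columns plus the one-hot vectors of the categorical columns."""
    return NUMERIC_COLS + [c + "_vec" for c in CATEGORICAL_COLS]


def build_pipeline():
    """Chain indexing, encoding, assembly, and an (assumed) classifier."""
    # Imported here so the column bookkeeping above can be inspected without Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    # Target column: string labels -> numeric "label" column.
    label_indexer = StringIndexer(inputCol=LABEL_COL, outputCol="label")
    # Categorical features: one StringIndexer per column ...
    cat_indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx")
                    for c in CATEGORICAL_COLS]
    # ... followed by a single OneHotEncoder that handles both columns.
    encoder = OneHotEncoder(
        inputCols=[c + "_idx" for c in CATEGORICAL_COLS],
        outputCols=[c + "_vec" for c in CATEGORICAL_COLS],
    )
    assembler = VectorAssembler(inputCols=assembler_inputs(), outputCol="features")
    # Assumption: logistic regression is only one possible choice of model.
    classifier = LogisticRegression(featuresCol="features", labelCol="label")
    return Pipeline(stages=[label_indexer, *cat_indexers, encoder, assembler, classifier])


def evaluate_on_test_split(spark, path="datasets/Heart.csv"):
    """Fit on the 80% split and return accuracy on the held-out 20%."""
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # Assumption: "NA" marks missing values; rows containing nulls are dropped.
    df = spark.read.csv(path, header=True, inferSchema=True, nullValue="NA").dropna()
    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = build_pipeline().fit(train)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy")
    return evaluator.evaluate(model.transform(test))
```

In a Databricks notebook, where `spark` is already defined, `print(evaluate_on_test_split(spark))` would report the test-set accuracy. You may need to adjust the path to wherever the CSV lives in your workspace, and trying other classifiers or hyperparameters is the natural way to improve on this baseline.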
You're going to publish your notebook rather than submitting it in this repo. In Databricks, select File > Publish and paste your URL below:
REPLACE THIS WITH YOUR NOTEBOOK URL