Question 1
The code was written in a Jupyter Notebook.
This code implements the K-Nearest Neighbors (KNN) machine learning algorithm on a set of email data to classify emails as spam or not spam. It splits the process into four steps: data preparation, model training, prediction, and evaluation.
### Initial Setup and Data Loading
- Importing Libraries: The code starts by importing the necessary libraries. `pandas` is used for data manipulation, `numpy` for numerical computations, and `Counter` from `collections` for counting occurrences of elements, which helps when voting for the most common class in KNN predictions.
- Loading Data: Data is loaded from `spam_train.csv` and `spam_test.csv` using `pandas`. This dataset contains features extracted from emails along with a class label indicating whether an email is spam (`1`) or not spam (`0`).
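As a rough illustration of the setup step, the sketch below loads a small CSV with `pandas` and tallies the labels with `Counter`. The in-memory CSV text and its column names `f1` to `f3` are hypothetical stand-ins so the snippet runs without the real files; with the actual data the call would presumably be `pd.read_csv("spam_train.csv")`.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-in for spam_train.csv: three features plus a `class` label.
csv_text = """f1,f2,f3,class
0.10,0.00,1.20,1
0.00,0.30,0.00,0
0.25,0.00,0.90,1
0.00,0.10,0.05,0
0.05,0.20,0.00,0
"""

# With the real files this would be pd.read_csv("spam_train.csv").
spam_train = pd.read_csv(io.StringIO(csv_text))

# Counter tallies how often each label occurs (1 = spam, 0 = not spam).
label_counts = Counter(spam_train["class"].tolist())

# most_common(1) returns a list holding the single (label, count) pair with the highest count.
majority_label = label_counts.most_common(1)[0][0]

print(spam_train.shape, dict(label_counts), majority_label)
```

The same `most_common` call is what a KNN vote relies on: it picks the label that appears most often among a set of neighbors.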
### Data Preparation
- The features (X) and labels (y) are separated for both the training and test sets.
- Training features are extracted using `spam_train.iloc[:, 0:56]` (all rows and the first 56 columns, assuming the last column is the label).
- Labels are extracted from the `class` column.
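The split described above can be sketched on a synthetic frame. The random feature values and the column names `f0` to `f55` are assumptions made only so the example runs on its own; the `iloc` slice and the `class` column follow the description.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame: 56 feature columns followed by a `class` column.
rng = np.random.default_rng(0)
n_features = 56
columns = [f"f{i}" for i in range(n_features)] + ["class"]
values = np.column_stack([rng.random((4, n_features)), [1, 0, 1, 0]])
spam_train = pd.DataFrame(values, columns=columns)

# All rows, columns 0..55 (the slice end, 56, is exclusive).
X_train = spam_train.iloc[:, 0:56].to_numpy()

# Labels come from the `class` column, cast to integers.
y_train = spam_train["class"].astype(int).to_numpy()

print(X_train.shape, y_train.shape)
```

Converting to NumPy arrays at this point keeps the later distance computations simple, though working directly on the DataFrame would also be possible.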
The core of this code is a manual implementation of the KNN algorithm, which involves:
- Defining a training function: this simulates model training; in the context of KNN, that amounts to storing the training data, since the algorithm has no fitting step.
- `predict_knn` function: a crucial part of the code. It appears to compute the distance between a test sample and all training samples to find the nearest neighbors, but it lacks the final step of returning the most common label among the k nearest neighbors.

Q1
![Screenshot 2024-03-03 at 2 53 53 AM](https://private-user-images.githubusercontent.com/84993132/309595737-09bf9f90-8715-4ffd-be3c-3e71f5fdf6cf.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjMzODYxNjgsIm5iZiI6MTcyMzM4NTg2OCwicGF0aCI6Ii84NDk5MzEzMi8zMDk1OTU3MzctMDliZjlmOTAtODcxNS00ZmZkLWJlM2MtM2U3MWY1ZmRmNmNmLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA4MTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwODExVDE0MTc0OFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTJmMmFlYjNkODA2YTYwNGI4NmQ0OTU0ZWQwZGIyNzJjZGYxMzAxZWFmOTMzMmY1MmFkNmZmMjVmNzEwZTY4OTImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.X2b-cJScKut2_0wFGcPz4pvNN6P3Mee8CJ0YJIn9kPE)
Q3
![Screenshot 2024-03-03 at 4 02 59 PM](https://private-user-images.githubusercontent.com/84993132/309633136-ca52d943-ebd2-4ab7-8a29-eb90cdd38cf8.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjMzODYxNjgsIm5iZiI6MTcyMzM4NTg2OCwicGF0aCI6Ii84NDk5MzEzMi8zMDk2MzMxMzYtY2E1MmQ5NDMtZWJkMi00YWI3LThhMjktZWI5MGNkZDM4Y2Y4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA4MTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwODExVDE0MTc0OFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWRiYmMxMTFhM2QxMGQ5M2VjNDMzZWY2MmQ0YzVhYmI4YmVkMTcyNjg3MmUyMzBkMTdmYWViYTNjZjEwYWY3ZDYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.Zuun7poiq_G7u0kqMKBDiaLe6iqpwAnRFJ8LtP20Lpg)
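
For reference, the KNN steps described earlier, including the majority vote that the `predict_knn` description says is missing, might look like the sketch below. This is not the notebook's code: the function signatures, the `train_knn` and `accuracy` helpers, the choice of Euclidean distance, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def train_knn(X_train, y_train):
    """KNN has no fitting step; 'training' here just stores the data."""
    return np.asarray(X_train, dtype=float), np.asarray(y_train)


def predict_knn(model, x, k=3):
    """Return the majority label among the k nearest training samples."""
    X_train, y_train = model
    # Euclidean distance from x to every training row (assumed metric).
    diffs = X_train - np.asarray(x, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


def accuracy(model, X_test, y_test, k=3):
    """Fraction of test samples whose predicted label matches the true label."""
    preds = [predict_knn(model, x, k) for x in X_test]
    return float(np.mean(np.array(preds) == np.asarray(y_test)))


# Toy data: two well-separated clusters standing in for "not spam" (0) and "spam" (1).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y = [0, 0, 0, 1, 1, 1]
model = train_knn(X, y)

print(predict_knn(model, [0.05, 0.05]))
print(predict_knn(model, [1.0, 0.95]))
print(accuracy(model, [[0.0, 0.1], [1.0, 1.2]], [0, 1]))
```

Using an odd `k` avoids tied votes in a two-class problem like spam detection; with an even `k`, `Counter.most_common` would break ties by the order in which labels were first counted.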