The Higgs boson is an elementary particle that explains why other particles have mass. Measurements from high-speed proton collisions at CERN were made public with the aim of predicting whether the collision by-products are the signature of an actual boson or background noise.
The work was done in two main stages: data pre-processing, followed by logistic regression.
Preprocessing can combine any of the following methods: (1) replacing undefined data points with the median or mean, (2) performing a polynomial expansion, (3) standardizing.
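The three preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's own code: the function names `replace_undefined`, `polynomial_expansion` and `standardize` are hypothetical, and the marker value `-999.0` for undefined entries is an assumption about the dataset.

```python
import numpy as np


def replace_undefined(x, undefined=-999.0):
    """Replace undefined entries with the median of their column.

    The marker value -999.0 is an assumption, not taken from this README.
    """
    x = x.astype(float).copy()
    x[x == undefined] = np.nan
    medians = np.nanmedian(x, axis=0)      # per-column median, ignoring NaN
    rows, cols = np.where(np.isnan(x))
    x[rows, cols] = medians[cols]
    return x


def polynomial_expansion(x, degree):
    """Stack x, x**2, ..., x**degree column-wise (no bias column)."""
    return np.hstack([x ** d for d in range(1, degree + 1)])


def standardize(x):
    """Shift each column to zero mean and scale it to unit variance."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0                    # leave constant columns unscaled
    return (x - mean) / std


x = np.array([[1.0, -999.0], [3.0, 2.0], [5.0, 4.0]])
features = standardize(polynomial_expansion(replace_undefined(x), degree=2))
```

Standardizing after the expansion keeps the higher-degree columns on a scale comparable to the original features, which tends to help gradient-based training.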
Logistic regression models are then fitted and validated with 7-fold cross-validation.
The entire project uses only the Python libraries NumPy and Matplotlib (for visualisation).
Please place the files train.csv and test.csv in the same folder as run.py and all the other files.
This file produces the predictions. It is the same file that was used to obtain the best score of the team ("no_CS_members") on the AIcrowd platform. It is self-contained and only requires access to the data and files described below.
This file contains the required functions as stated in the project outline pdf file.
- mean_squared_error_gd, mean_squared_error_sgd, least_squares, ridge_regression
- logistic_regression, reg_logistic_regression
It also contains auxiliary functions supporting the ones listed above.
- compute_mse_loss, compute_mse_gradient, batch_iter, compute_stoch_mse_gradient, sigmoid, calculate_logistic_loss, calculate_logistic_gradient
- calculate_stoch_logistic_gradient, stoch_reg_logistic_regression
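For orientation, here is a minimal sketch of (optionally regularized) logistic regression trained by gradient descent. It is not the project's implementation: the names `logistic_sigmoid`, `logistic_loss`, `logistic_gradient` and `logistic_regression_gd` are hypothetical, labels are assumed to be in {0, 1}, and the penalty is assumed to be `lambda_ * ||w||^2`.

```python
import numpy as np


def logistic_sigmoid(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))


def logistic_loss(y, tx, w):
    """Mean negative log-likelihood for labels y in {0, 1}."""
    z = tx @ w
    # log(1 + exp(z)) computed stably with logaddexp
    return np.mean(np.logaddexp(0.0, z) - y * z)


def logistic_gradient(y, tx, w):
    """Gradient of the mean negative log-likelihood."""
    return tx.T @ (logistic_sigmoid(tx @ w) - y) / len(y)


def logistic_regression_gd(y, tx, initial_w, max_iters, gamma, lambda_=0.0):
    """Gradient descent; lambda_ > 0 adds the gradient of lambda_ * ||w||^2."""
    w = initial_w.copy()
    for _ in range(max_iters):
        grad = logistic_gradient(y, tx, w) + 2.0 * lambda_ * w
        w = w - gamma * grad
    return w, logistic_loss(y, tx, w)


# Toy data: first column is a constant offset term.
tx = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, loss = logistic_regression_gd(y, tx, np.zeros(2), max_iters=100, gamma=0.1)
```

With `lambda_=0.0` this reduces to plain logistic regression, so one routine covers both variants in the sketch.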
This file contains functions used to pre-process data.
- data_removed, data_replaced, split_data, add_w0
- normalize_log_gaussian, normalize_angles, normalize_gaussian, normalize_min_max, normalize
This file contains functions to optimize the hyperparameter lambda.
- build_k_indices, cross_validation, cross_validation_demo
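The idea of choosing lambda by k-fold cross-validation can be sketched as below. This is an illustration under assumptions rather than the project's code: `k_fold_indices`, `ridge_fit` and `cross_validate_lambda` are hypothetical names, ridge regression stands in for the model being tuned, and the `2 * n * lambda_` scaling of the penalty is one common convention.

```python
import numpy as np


def k_fold_indices(n_samples, k_fold, seed=1):
    """Shuffle sample indices and split them into k_fold equal-sized rows."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    fold_size = n_samples // k_fold
    # Samples that do not fit into equal folds are dropped.
    return indices[: fold_size * k_fold].reshape(k_fold, fold_size)


def ridge_fit(y, tx, lambda_):
    """Closed-form ridge regression weights."""
    n, d = tx.shape
    a = tx.T @ tx + 2.0 * n * lambda_ * np.eye(d)
    return np.linalg.solve(a, tx.T @ y)


def cross_validate_lambda(y, tx, lambdas, k_fold=7, seed=1):
    """Return the lambda with the lowest mean validation MSE, plus all means."""
    folds = k_fold_indices(len(y), k_fold, seed)
    mean_errors = []
    for lambda_ in lambdas:
        errors = []
        for k in range(k_fold):
            test_idx = folds[k]
            train_idx = np.delete(folds, k, axis=0).ravel()
            w = ridge_fit(y[train_idx], tx[train_idx], lambda_)
            residual = y[test_idx] - tx[test_idx] @ w
            errors.append(np.mean(residual ** 2) / 2.0)
        mean_errors.append(np.mean(errors))
    best = int(np.argmin(mean_errors))
    return lambdas[best], mean_errors


# Synthetic example with a known linear relationship plus small noise.
rng = np.random.default_rng(0)
tx = rng.normal(size=(70, 2))
y = tx @ np.array([1.0, -2.0]) + 0.01 * rng.normal(size=70)
best_lambda, errors = cross_validate_lambda(y, tx, [1e-6, 10.0], k_fold=7)
```

Each fold serves once as the validation set while the remaining six are used for fitting, so every sample contributes to the validation error exactly once.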
It also contains functions to select the best degree for the polynomial expansion of each feature, and to build the corresponding polynomial.
- build_poly, best_degree_selection, phi_optimized
This file contains functions used to classify the data, as well as some for computing evaluation metrics.
- simple_class, get_accuracy, get_only_accuracy, get_auc, roc_visualization
- get_Kneigbors, getKpredictions
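As a rough illustration of what such helpers do, the sketch below thresholds scores into class labels, computes accuracy, and predicts with a k-nearest-neighbours majority vote. The names `classify`, `accuracy` and `knn_predict` are hypothetical, and the label convention {-1, 1} is an assumption; the project's own functions may differ in signature and behaviour.

```python
import numpy as np


def classify(scores, threshold=0.5):
    """Map scores to labels: 1 at or above the threshold, -1 below it."""
    return np.where(scores >= threshold, 1, -1)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return np.mean(y_true == y_pred)


def knn_predict(x_train, y_train, x_query, k=3):
    """Majority vote among the k nearest training points (labels in {-1, 1})."""
    # Squared Euclidean distances, shape (n_query, n_train).
    diffs = x_query[:, None, :] - x_train[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest].sum(axis=1)
    # Ties (possible for even k) are resolved in favour of 1.
    return np.where(votes >= 0, 1, -1)


x_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([-1, -1, 1, 1])
x_query = np.array([[0.0, 0.5], [5.0, 5.5]])
y_pred = knn_predict(x_train, y_train, x_query, k=3)
```

The distance matrix is built by broadcasting, which is simple but memory-hungry; for a dataset with many rows the queries would need to be processed in batches.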
A notebook outlining the step-by-step progress of the model (each stage adds something on top of the previous version):
- logistic regression
- logistic regression + normalized
- logistic regression + normalized + w0
- logistic regression + normalized smart + w0
- logistic regression + normalized smart + w0 + high correlation features removed
This file runs multiple repetitions of each method described in "our progress", in order to compare their mean and standard deviation.
This file computes the accuracy of the seven regression and classification methods implemented for this project.
- A. Gradient Descent with MSE
- B. Stochastic Gradient Descent with MSE
- C. Least Squares
- D. Ridge Regression with cross validation to find best lambda
- E. Logistic Regression with cross validation to find best lambda
- F. Regularized Logistic Regression
- G. K-nearest neighbors classification
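To make the relationship between methods A and C concrete, the sketch below fits the same toy problem with gradient descent on the MSE and with the least-squares normal equations. The names `mse_loss`, `mse_gd` and `least_squares_fit` are hypothetical and are not the project's required function signatures.

```python
import numpy as np


def mse_loss(y, tx, w):
    """Mean squared error with the conventional 1/2 factor."""
    e = y - tx @ w
    return e @ e / (2.0 * len(y))


def mse_gd(y, tx, initial_w, max_iters, gamma):
    """Plain gradient descent on the MSE loss."""
    w = initial_w.copy()
    for _ in range(max_iters):
        grad = -tx.T @ (y - tx @ w) / len(y)
        w = w - gamma * grad
    return w, mse_loss(y, tx, w)


def least_squares_fit(y, tx):
    """Solve the normal equations tx.T @ tx @ w = tx.T @ y."""
    w = np.linalg.solve(tx.T @ tx, tx.T @ y)
    return w, mse_loss(y, tx, w)


# Toy data generated from y = 1 + 2 * x, with a constant first column.
tx = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
w_gd, loss_gd = mse_gd(y, tx, np.zeros(2), max_iters=2000, gamma=0.1)
w_ls, loss_ls = least_squares_fit(y, tx)
```

On a well-conditioned problem like this one, gradient descent with a small enough step size approaches the closed-form solution; the closed form is exact but requires solving a linear system in the number of features.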
This file computes the accuracy of each method on random train sets, in order to build their box plots.
Helper functions used to load the data and create the CSV submission.
The exact file that was used for the group's best submission (#204331)
- Mathilde Morelli
- Iris Toye
- Alexei Ermochkine