This is a small exploratory study for the Home Depot Product Relevance Model Kaggle task. The final score is not tuned to reach the top of the leaderboard, but the model performs reasonably well given that it uses only 6 features.
Prerequisites
Python 3.6
Jupyter notebook
Running the tests
Download and run the version of your choice (Python script or Jupyter notebook).
Features
Only 6 features are extracted from the dataset, which is enough for the goal of this study. They are:
Levenshtein distance between search term and product title
Levenshtein distance between search term and product description
Cosine similarity between search term and product title (TF-IDF)
Cosine similarity between search term and product description (TF-IDF)
Cosine similarity between search term and product title (Word2Vec)
Cosine similarity between search term and product description (Word2Vec)
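The three feature families above can be illustrated with a minimal, standard-library-only sketch. It is not the code used in this study: the function names (`levenshtein`, `cosine`, `tfidf_vectors`, `mean_vector`), the whitespace tokenization, the smoothed IDF formula, and the idea of averaging word vectors for the Word2Vec variant are all assumptions made for illustration. In practice the TF-IDF weights and Word2Vec embeddings would more likely come from a library.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: an assumption, the weighting used in the study is not stated.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in Counter(tokens).items()}
            for tokens in tokenized]


def mean_vector(tokens, embeddings):
    """Average the embeddings of known tokens; returns {} if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return {}
    dim = len(known[0])
    return {i: sum(vec[i] for vec in known) / len(known) for i in range(dim)}


# Hypothetical example values, not taken from the dataset.
search_term = "angle bracket"
product_title = "steel angle bracket 4 inch"

edit_distance = levenshtein(search_term, product_title)
query_vec, title_vec = tfidf_vectors([search_term, product_title])
tfidf_similarity = cosine(query_vec, title_vec)

# Toy 2-dimensional embeddings standing in for a trained Word2Vec model.
toy_embeddings = {"angle": [1.0, 0.0], "bracket": [0.0, 1.0], "steel": [1.0, 1.0]}
w2v_similarity = cosine(mean_vector(search_term.split(), toy_embeddings),
                        mean_vector(product_title.split(), toy_embeddings))
```

The same `cosine` function serves both the TF-IDF and the Word2Vec features; only the way the vectors are built differs. Fitting the IDF on just two strings, as in the example, is only for demonstration, and a real run would fit it on the full corpus.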
Pipeline
Fill all missing data with an empty string or a default value
Merge the training and test data sets with the description data set
Extract and parse different features
  * Apply the following steps to each description: lowercase -> split -> apply Snowball stemmer -> aggregate
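The pipeline steps above can be sketched roughly as follows, using only the standard library. The helper names (`fill_and_merge`, `preprocess`, `toy_stem`) and the column names `product_uid`, `search_term`, and `product_description` are assumptions for illustration. The toy stemmer is only a stand-in: if NLTK is installed, `SnowballStemmer("english").stem` from `nltk.stem.snowball` could be passed as `stem` instead. "Aggregate" is interpreted here as joining the stemmed tokens back into one string, which is also an assumption.

```python
def fill_and_merge(rows, descriptions, key="product_uid"):
    """Replace missing values with "" and attach each product's description."""
    merged = []
    for row in rows:
        record = {k: ("" if v is None else v) for k, v in row.items()}
        # A missing description falls back to the empty string as well.
        record["product_description"] = descriptions.get(row.get(key)) or ""
        merged.append(record)
    return merged


def preprocess(text, stem=lambda word: word):
    """Lowercase, split on whitespace, stem each token, and re-join."""
    tokens = (text or "").lower().split()
    return " ".join(stem(token) for token in tokens)


def toy_stem(word):
    """Stand-in stemmer (NOT Snowball): drops a trailing 's' on longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


# Hypothetical rows, not taken from the dataset.
train_rows = [
    {"product_uid": 1, "search_term": "Angle Brackets"},
    {"product_uid": 2, "search_term": None},
]
descriptions = {1: "Heavy Duty Steel Brackets"}

merged = fill_and_merge(train_rows, descriptions)
for record in merged:
    record["product_description"] = preprocess(record["product_description"], toy_stem)
```

Passing the stemmer as a parameter keeps the sketch runnable without extra dependencies while leaving room to plug in the Snowball stemmer named in the pipeline.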