Material (Jupyter notebook) for a talk about Pipelines and Gridsearch with scikit-learn.
This talk was given on May 16, 2018 at a PyData Munich Meetup hosted at the JetBrains Event Space.
Authors: Florent Martin and Koen van Woerden
Building a data science model usually involves many steps: cleaning, preprocessing, vectorizing, predicting, etc. In an interactive notebook especially, it is easy to lose track of the various intermediate data outputs, and changing an intermediate processing step quickly becomes cumbersome. On top of that, optimizing the hyperparameters takes a lot of work. We will show a solution to these problems using Pipelines and Gridsearch with scikit-learn. These techniques will be demonstrated on an NLP classification problem. This talk will also serve as an introduction to scikit-learn.
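As a rough illustration of the idea (not the code from the notebook), the sketch below chains a vectorizer and a classifier in a Pipeline and tunes both with a grid search. The toy sentences, the labels, the choice of TfidfVectorizer and LogisticRegression, and the parameter grid are all assumptions made for this example; the talk itself works on the Kaggle data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration only (assumption, not the talk's data).
texts = [
    "the raven cried in the dark night",
    "a dark raven sat upon the door",
    "the night was dark and the raven spoke",
    "the raven watched the dark chamber",
    "the monster fled across the frozen ice",
    "the creature wandered over ice and snow",
    "the monster spoke upon the frozen sea",
    "snow and ice surrounded the creature",
]
labels = ["poe"] * 4 + ["shelley"] * 4

# One object holds every step, so no intermediate outputs to keep track of.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a step are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# Try all 2 x 3 combinations with 2-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["the raven in the night"]))
```

Swapping a preprocessing step or a classifier then only means changing one entry in the list of steps, and the grid search re-runs the whole chain for each parameter combination.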
The Jupyter notebook to run is
./notebooks/tutorial.ipynb
It should be run from the root directory of the git repository.
To run the notebook, the directory ./data/talk/ must contain the two files data.csv and val.csv.
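Before opening the notebook, a quick check along the following lines can confirm that both files are in place. This is a minimal sketch: the helper name missing_data_files is hypothetical and not part of the repository, and the script is assumed to be run from the root directory.

```python
from pathlib import Path

# File names taken from the text above.
REQUIRED_FILES = ("data.csv", "val.csv")


def missing_data_files(root="."):
    """Return the required files that are absent from <root>/data/talk/."""
    data_dir = Path(root) / "data" / "talk"
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]


missing = missing_data_files()
if missing:
    print("Missing in ./data/talk/:", ", ".join(missing))
else:
    print("All data files found.")
```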
There are two ways to do so:
- The first way. If the kaggle API is installed on your computer (and you have generated an API token), and if you can use make, then simply run
  make
  in the root directory.
- The second way. Otherwise, you will need to download and prepare the data by hand. This means:
  1. Download the data from the kaggle competition spooky author classification into the directory ./data/raw/ (at that point, if you can run make, then run make and you don't need to run any other step).
  2. Unzip the file train.zip located in ./data/raw/. Concretely, from the root directory run
     unzip ./data/raw/train.zip -d ./data/talk/
     This should create a file train.csv inside the directory ./data/talk/
  3. Finally, from the root directory of the repo, run
     python3 ./src/trainvalsplit.py
     which will create a training set ./data/talk/data.csv and a validation set ./data/talk/val.csv
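The contents of ./src/trainvalsplit.py are not reproduced here. For readers curious about what such a step involves, the following is a hypothetical sketch using only the standard library; the function name train_val_split, the 20% validation fraction, and the fixed seed are assumptions, not the repository's actual choices.

```python
import csv
import random


def train_val_split(src, train_path, val_path, val_fraction=0.2, seed=0):
    """Shuffle the rows of a CSV file and write a training and a validation file.

    Returns the number of rows written to each file as (n_train, n_val).
    """
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # keep the header for both output files
        rows = list(reader)

    # A seeded shuffle makes the split reproducible.
    random.Random(seed).shuffle(rows)
    n_val = int(len(rows) * val_fraction)
    parts = {val_path: rows[:n_val], train_path: rows[n_val:]}

    for path, part in parts.items():
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)
    return len(rows) - n_val, n_val
```

With the paths from the steps above, a call would look like train_val_split("./data/talk/train.csv", "./data/talk/data.csv", "./data/talk/val.csv").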
This notebook was designed to be shown on a projector during a presentation. For that we use RISE, which displays the notebook as a slideshow.
- Florent Martin
- Koen van Woerden
- Many thanks to Nick Del Grosso for helpful suggestions.