The utilizing-the-kaggle-python-docker-container-image from danielschulz

Getting started with data science and applying machine learning has never been as simple as it is now. There are many free and paid online tutorials and courses out there to help you to get started. I’ve recently started to learn, play, and work on Data Science & Machine Learning on Kaggle.com. In this brief post, I’d like to share my experience with the Kaggle Python Docker image, which simplifies the Data Scientist’s life.

Outline:

Why did I start to use containers for Data Science & Machine Learning with Python?
How did I start?
Experiment with Kaggle notebooks in your local sandbox
Any alternatives?
References

Why did I start to use containers for Data Science & Machine Learning with Python?

The short answer is simple: I had to push long-running, compute-intensive machine learning jobs from my laptop to an old but powerful desktop workstation.

Setting up a second Python development and test machine isn’t that hard, but it can still be a hassle. pip install is not my best friend because it fails far too often, especially behind enterprise HTTP proxies, so I prefer Anaconda anyway. The Kaggle Python Docker image looked interesting for further reasons:

The image contains the same software libraries as the Kaggle online runtime environment. This would allow me to play and develop in a local, private environment and upload only significant changes to the online version of my Kaggle notebooks and code.
As a Newbie Kaggler, you might want to learn from, modify, and try out the solutions of advanced users without forking too many scripts and notebooks. Downloading scripts and notebooks and running them in your own environment offers you a lot of freedom and flexibility.
Using the Kaggle Python Docker image made it very simple for me to create a minimal, shared Jupyter Notebook environment for a side project with a co-worker.

How did I start?

My desktop workstation with Ubuntu 16.04 already had a reasonably recent Docker version (17.05.0-ce) installed, so I just had to pull the Kaggle Python Docker image. For details on installing Docker, see this blog post or https://docs.docker.com.

$ docker pull kaggle/python

The image is large and therefore it can take some time to download all layers. Once it’s downloaded, you can list the image including the size info:

$ docker image ls kaggle/python
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
kaggle/python       latest              09a349977ca7        2 weeks ago         12.5GB

Let’s do a quick check before starting a Docker Jupyter Notebook environment:

As described on Kaggle.com, it is helpful to create a shell function and also add it to your .bash_profile:

$ kpython(){ docker run -v $PWD:/tmp/working \
    -w=/tmp/working --rm -it kaggle/python python "$@" ; }

Now we can use kpython instead of python. For example, to print the Python and Keras version of the Docker image:

$ kpython -c 'import sys; print("Python version ", sys.version); \
          import keras; print("Keras version ", keras.__version__)'

Python version  3.6.3 |Anaconda custom (64-bit)| (default, Nov 20 2017, 20:41:42)
[GCC 7.2.0]
Using TensorFlow backend.
Keras version  2.1.2

With that check done, a Jupyter Notebook environment can be started with a single docker run command. First, create a working directory on the workstation for your notebooks and files, then start the container:

$ mkdir kaggle && cd kaggle
$ docker run -v $PWD:/tmp/working -w=/tmp/working -p 8888:8888 \
     --rm -it kaggle/python jupyter notebook --no-browser \
     --ip="0.0.0.0" --notebook-dir=/tmp/working --allow-root

Watch for the output with the token:

[C 16:36:23.778 NotebookApp]
    Copy/paste this URL into your browser when you connect for the first time, 
    to login with a token:
        http://0.0.0.0:8888/?token=2c997056b24406afdfd7e0e1d10861999656e1ef5e22e812

Since I want to run a Jupyter Notebook service, I usually start the container as detached (-d) instead of in foreground mode (-it):


$ docker run -v $PWD:/tmp/working -w=/tmp/working -p 8888:8888 \
     --rm -d kaggle/python jupyter notebook --no-browser \
     --ip="0.0.0.0" --notebook-dir=/tmp/working --allow-root
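If you start the container from a script rather than from the shell, the same command can be assembled as an argument list. The following is a minimal sketch: the helper name jupyter_run_args and its parameters are hypothetical, and it only uses the flags shown in the commands above.

```python
def jupyter_run_args(workdir, port=8888, detached=True):
    """Build the `docker run` argument list used in this post.

    Hypothetical helper: it only reuses the flags from the commands above.
    """
    # -d runs the container in the background, -it keeps it in the foreground.
    mode = "-d" if detached else "-it"
    return [
        "docker", "run",
        "-v", f"{workdir}:/tmp/working",   # mount the host working directory
        "-w=/tmp/working",
        "-p", f"{port}:8888",              # host port -> notebook port
        "--rm", mode,
        "kaggle/python",
        "jupyter", "notebook", "--no-browser",
        "--ip=0.0.0.0",
        "--notebook-dir=/tmp/working",
        "--allow-root",
    ]
```

The resulting list could then be passed to subprocess.run, for example subprocess.run(jupyter_run_args(os.getcwd()), check=True).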

You can display the logs of the running container to see the token:

$ docker ps
CONTAINER ID        IMAGE                       COMMAND                   
158fec0f6eaa        kaggle/python               "/usr/bin/tini -- ..."   

$ docker logs 158fec0f6eaa
[I 21:04:14.464 NotebookApp] The Jupyter Notebook is running at:
http://0.0.0.0:8888/?token=2c997056b24406afdfd7e0e1d10861999656e1ef5e22e812
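If you would rather not copy the token by hand, it can be pulled out of the log text with a regular expression. This is only a sketch: the function name find_token is hypothetical, and the pattern assumes the token is a hexadecimal string, as in the sample output above.

```python
import re

# Assumption: the token is a hex string in a "token=" query parameter,
# as in the sample log output above.
TOKEN_PATTERN = re.compile(r"[?&]token=([0-9a-f]+)")


def find_token(log_text):
    """Return the first token found in the log text, or None."""
    match = TOKEN_PATTERN.search(log_text)
    return match.group(1) if match else None


sample_log = """\
[I 21:04:14.464 NotebookApp] The Jupyter Notebook is running at:
http://0.0.0.0:8888/?token=2c997056b24406afdfd7e0e1d10861999656e1ef5e22e812
"""
print(find_token(sample_log))
```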

I am connecting from the browser on my laptop computer to the desktop workstation where the Kaggle container is running. Therefore, I will replace 0.0.0.0 with the IP address or FQDN of the workstation.

http://mylinuxbox:8888/?token=2c997056b24406afdfd7e0e1d10861999656e1ef5e22e812
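The host substitution can also be done programmatically with the standard library. A minimal sketch; the function name with_host is hypothetical, and mylinuxbox is the example hostname from above.

```python
from urllib.parse import urlsplit, urlunsplit


def with_host(url, host):
    """Return url with its hostname replaced by host, keeping the port."""
    parts = urlsplit(url)
    # Rebuild the network location, preserving an explicit port if present.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


logged_url = (
    "http://0.0.0.0:8888/"
    "?token=2c997056b24406afdfd7e0e1d10861999656e1ef5e22e812"
)
print(with_host(logged_url, "mylinuxbox"))
```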

Here we go, a fresh Jupyter Notebook ...

Experiment with Kaggle notebooks in your local sandbox

In this section I’d like to show how you can download Kaggle notebooks and play with a notebook in your own local environment.

Note, please join the Kaggle community and don’t go private only. Share your work, questions, and experiences with other users.

Overview

Create a directory structure.
Download a notebook.
Download the input data.
Run the notebook.

Create a directory structure

Kaggle notebooks usually read their input data files from the directory ../input/.

E.g. ../input/train.csv and ../input/test.csv.

A similar directory structure is required on your local system so that notebooks run without any modification:

working dir:
 - code
   - notebook.ipynb
 - input
   - train.csv
   - test.csv

You can either create the directories on the Linux system manually or via the Jupyter web UI.

Afterwards, the working directory should contain a code and an input subdirectory, as in the listing above.
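The two directories can also be created with a few lines of Python, which is handy if you set up several projects. A minimal sketch; the function name create_layout is hypothetical, and the directory names are the ones from the listing above.

```python
from pathlib import Path


def create_layout(working_dir):
    """Create the code/ and input/ subdirectories under working_dir."""
    root = Path(working_dir)
    for name in ("code", "input"):
        # exist_ok=True makes the call safe to repeat.
        (root / name).mkdir(parents=True, exist_ok=True)
    return root
```

Calling create_layout(".") from inside the kaggle working directory prepares it for the steps below.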

Download a notebook

Next, download the notebook that you would like to try out. I am illustrating this with one of my public Kaggle notebooks.

On Kaggle.com, navigate to the notebook and its code tab, then download the .ipynb file to your local computer.

After that, upload the notebook into the code directory of your local environment: navigate to the code directory in the Jupyter web UI and upload the notebook (.ipynb) file. The notebook should then appear in the code directory listing.

Download the input data

The notebook needs some input data. On Kaggle.com, navigate to the notebook and its data tab, then download the train.csv and test.csv files to your local computer.

Next, upload the input data into the input directory of your local environment. Both files should then appear in the input directory listing.
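Before running the notebook, it can help to confirm that the files are where a notebook in the code directory will look for them. A minimal sketch; the function name missing_inputs is hypothetical, and the file names are the train.csv and test.csv examples used above.

```python
from pathlib import Path


def missing_inputs(notebook_dir, names=("train.csv", "test.csv")):
    """List expected input files that are absent from ../input/.

    notebook_dir is the directory holding the notebook (here: code/).
    """
    input_dir = Path(notebook_dir) / ".." / "input"
    return [name for name in names if not (input_dir / name).is_file()]
```

Run from a notebook cell as missing_inputs("."), an empty list means both files were found.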

Run the notebook

Now it's time to run the notebook in your local Kaggle environment. Navigate to the code directory and open the notebook.

The notebook opens in a new browser tab and from here you can run, modify, and try out anything.

Any alternatives?

There are plenty of alternatives, including:

Kaggle.com makes it easy to fork kernels (notebooks, scripts) online. In case you don't see any need to develop in your private environment, Kaggle.com is a good place for practicing data science.
Amazon Web Services offers AWS Deep Learning AMIs with a lot of pre-installed software.
Running and securing your own notebook server is another option.
Nothing beats your personalized development and test environment that you constructed over years.
