Kubernetes Data Science for EKS/GKE

This project is an interactive data science environment running on Kubernetes. Containers such as Jupyter and Apache Livy are deployed, as well as a Spark cluster.

At the moment, Kubernetes only supports spark-submit in cluster mode, which works well for production jobs but not for interactive analysis. This is why a Spark cluster runs inside the environment.
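For reference, a batch job submitted in cluster mode looks roughly like the following; the API server address, container image, and jar path are placeholders, not values from this project:

$ spark-submit \
    --master k8s://https://<api-server-host>:<port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///opt/spark/examples/jars/spark-examples.jar

The job runs to completion inside the cluster with no interactive session, which is exactly what a notebook workflow cannot use.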

This project deploys the containers in a dedicated namespace; you can change the namespace name as you like (see the section below).

Run the services in Kubernetes

The images are pushed to my personal repository and are publicly available. Make sure kubectl is properly configured to point to your Kubernetes cluster.
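For example, you can check which cluster kubectl currently points to and that it can reach the nodes:

$ kubectl config current-context
$ kubectl get nodes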

The Makefile generates the deployment files for Kubernetes; do not use the kubernetes/* files directly, use them via the Makefile.

Deploy the environment (the variables are optional; see the Makefile for the default values):

# By default the Jupyter password is admin
$ make deploy NAMESPACE=my-original-namespace-name STACK_NAME=toto JUPYTER_PASSWORD=admin
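Once the deployment completes, a quick way to check that everything came up is to list the pods in the namespace you chose (using the example namespace above):

$ kubectl get pods -n my-original-namespace-name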

How to build the Docker images

In the root folder, just run make build. If you need to customise the images, retag them and push them to your Docker Hub account. If you are using Minikube, you don't need to push the images to a custom repository; the local Docker images are shared with Minikube.
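As a sketch, retagging and pushing might look like this (the image names and account below are placeholders, not the actual tags produced by make build); with Minikube you can instead build straight against its Docker daemon:

# Placeholder names: replace with the tags produced by make build and your own account
$ docker tag kube-datascience-jupyter:latest <your-dockerhub-account>/kube-datascience-jupyter:latest
$ docker push <your-dockerhub-account>/kube-datascience-jupyter:latest

# With Minikube, reuse its internal Docker daemon instead of pushing
$ eval $(minikube docker-env)
$ make build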

Access to the services

There are currently two services deployed behind a LoadBalancer (externally accessible):

  • Jupyter
  • Spark web UI

To get the URLs of these services, just run make output (a kubectl alternative is shown below).
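If you prefer kubectl, the external endpoints can also be read directly from the LoadBalancer services (EXTERNAL-IP column), replacing the namespace with your own:

$ kubectl get svc -n my-original-namespace-name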

Good to know

Makefiles are used to build and deploy everything, so the project can easily be integrated with a Jenkins server. Jinja is used to generate the deployment files. You don't need Jinja installed locally, only Docker: a Jinja Docker image is used to render the files.
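As a rough illustration only (not one of the actual templates shipped in this repository), a Jinja-templated manifest parameterises values such as the namespace, which the Makefile then fills in before applying the rendered file:

# Hypothetical excerpt of a Jinja-templated Kubernetes manifest
apiVersion: v1
kind: Namespace
metadata:
  name: {{ NAMESPACE }}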

The generated files will be located in .tmp/

Improvements to be done

  • Store the Jupyter home directory in an S3 bucket (or compatible object store)
  • Find a way not to expose the Spark web UI directly on the internet while keeping it accessible to data scientists
  • Pass AWS environment variables to the containers dynamically
  • Add TensorFlow capabilities to this environment
