spark-kubernetes's Introduction

Deploying Spark on Kubernetes

Want to learn how to build this?

Check out the post.

Want to use this project?

Minikube Setup

Install and run Minikube:

  1. Install a Hypervisor (like VirtualBox or HyperKit) to manage virtual machines
  2. Install and Set Up kubectl to deploy and manage apps on Kubernetes
  3. Install Minikube

Start the cluster:

$ minikube start --memory 8192 --cpus 4
$ minikube dashboard
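
If you want to confirm the cluster is healthy before moving on, a quick check (assuming kubectl is using the Minikube context, which minikube start configures by default) might look like:

$ minikube status
$ kubectl get nodes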

Build the Docker image:

$ eval $(minikube docker-env)
$ docker build -t spark-hadoop:2.2.1 -f ./docker/Dockerfile ./docker
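
To confirm that the image landed in Minikube's Docker daemon rather than the host's (it only will if the docker-env shell above is still active), list it from the same shell:

$ docker images | grep spark-hadoop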

Create the deployments and services:

$ kubectl create -f ./kubernetes/spark-master-deployment.yaml
$ kubectl create -f ./kubernetes/spark-master-service.yaml
$ kubectl create -f ./kubernetes/spark-worker-deployment.yaml
$ minikube addons enable ingress
$ kubectl apply -f ./kubernetes/minikube-ingress.yaml
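
Before wiring up the hostname, it can help to verify that everything came up; a quick check might look like:

$ kubectl get deployments
$ kubectl get pods -o wide
$ kubectl get services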

Add an entry to /etc/hosts:

$ echo "$(minikube ip) spark-kubernetes" | sudo tee -a /etc/hosts

Test it out in the browser at http://spark-kubernetes/.
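
If the browser check fails, a curl against the Minikube IP with the Host header set (which is what the /etc/hosts entry does for the browser) can help narrow down whether the ingress is the problem:

$ curl -H "Host: spark-kubernetes" http://$(minikube ip)/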


spark-kubernetes's Issues

1 node(s) had taints that the pod didn't tolerate

Pod creation fails with minikube v1.1.0 on Ubuntu 16.04

$ kubectl describe po spark-master-7fcc5d88cc-24t57
...
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  84s (x20 over 28m)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

$ kubectl get po
NAME                            READY   STATUS    RESTARTS   AGE
spark-master-7fcc5d88cc-24t57   0/1     Pending   0          30m
spark-worker-77fc45f84f-g9lkl   0/1     Pending   0          30m
spark-worker-77fc45f84f-l8h8l   0/1     Pending   0          30m
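
A first diagnostic step for this symptom is usually to look at the node's taints directly (assuming the single Minikube node has the default name minikube); something like:

$ kubectl describe node minikube | grep -i taints
$ kubectl taint nodes minikube <taint-key>-    # removes the offending taint if it should not be there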

I keep getting ImagePullBackOff

When I create the deployments and services, the master pod is Running, but the workers fail with an "ImagePullBackOff" error.


The Docker image used is: spark-hadoop:3.0.0

When I inspect the worker logs, I get this: "Error from server (BadRequest): container "spark-worker" in pod "spark-worker-655b68b77c-8cvrz" is waiting to start: trying and failing to pull image"
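
The usual things to check for this symptom are whether the image actually exists inside Minikube's Docker daemon (a locally built image is only visible to the cluster if it was built against that daemon, as in the README's eval $(minikube docker-env) step) and whether the deployment forces a pull from a remote registry (imagePullPolicy: Always will never use a local-only image). A rough sketch, assuming the deployment is named spark-worker as in the repo's YAML:

$ eval $(minikube docker-env)
$ docker images | grep spark-hadoop
$ kubectl get deployment spark-worker -o jsonpath='{.spec.template.spec.containers[0].imagePullPolicy}'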

Solution For Connectivity Problem When Submitting Work From Master Node To Worker(s)

I was inspired by this repository and have continued to build on it.

However, I also ran into the issue reported here: #1

I was getting:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I banged my head against this for two nights; not being an expert in Kubernetes, I first thought something was wrong with how I had started it up.

Either way, this is how I reproduced the problem:

1)

I checked my resources, and I made the following config:

spark-defaults.conf

spark.master spark://sparkmaster:7077
spark.driver.host sparkmaster
spark.driver.bindAddress sparkmaster
spark.executor.cores 1
spark.executor.memory 512m
spark.driver.extraLibraryPath /opt/hadoop/lib/native
spark.app.id KubernetesSpark

2)

And I ran minikube with:

minikube start --memory 8192 --cpus 4 --vm=true

3)

These were my spark-master and spark-worker scripts:

spark-worker.sh

#!/bin/bash

. /common.sh

getent hosts sparkmaster

if ! getent hosts sparkmaster; then
  sleep 5
  exit 0
fi

/usr/local/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://sparkmaster:7077 --webui-port 8081 --memory 2g

### Note: I put 2g here just to be 100% confident I was not using too many resources.

spark-master.sh

#!/bin/bash

. /common.sh

echo "$(hostname -i) sparkmaster" >> /etc/hosts

/usr/local/spark/bin/spark-class org.apache.spark.deploy.master.Master --host sparkmaster --port 7077 --webui-port 8080

4)

I then ran:

kubectl exec <master-pod-name> -it -- pyspark
>>>
sc.parallelize([1,2,3,4]).collect()
>>>
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

And the error occurred!

I made sure to get access to ports 8081 and 4040 to investigate the logs further:

kubectl port-forward <spark-worker-pod> 8081:8081
kubectl port-forward <spark-master-pod> 4040:4040

I then went in and navigated:

http://localhost:8081/ --> found my executor --> stderr (`http://localhost:8081/logPage/?appId=<APP-ID>&executorId=<EXECUTOR-ID>&logType=stderr`)

5)

I scratched my head; I knew I had enough resources, so why did this not work?

And I could see:

Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: sparkmaster/10.101.97.213:41607
Caused by: java.net.ConnectException: Connection timed out

I then thought: well, I had done this correctly:

spark.driver.host sparkmaster
spark.driver.bindAddress sparkmaster

The docs mention that it can be either a host name or an IP, so I thought I was good. I then saw this possible solution:

sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy

Well, this was not a problem for me; in fact, there was no iptables issue to resolve at all.

So I then verified the master IP with:

kubectl get pods -o wide

I then took the MASTER-IP and added it directly:

pyspark --conf spark.driver.bindAddress=<MASTER-POD-IP> --conf spark.driver.host=<MASTER-POD-IP>
>>> ....

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/

Using Python version 3.7.9 (default, Sep 10 2020 17:42:58)
SparkSession available as 'spark'.
>>> sc.parallelize([1,2,3,4,5,6]).collect()
[1, 2, 3, 4, 5, 6] <---- BOOOOOOM!!!!!!!!!!!!!!!

6)

SOLUTION:

spark-defaults.conf

spark.master spark://sparkmaster:7077
spark.executor.cores 1
spark.executor.memory 512m
spark.driver.extraLibraryPath /opt/hadoop/lib/native
spark.app.id KubernetesSpark

And add the IPs correctly:

spark-master.sh

#!/bin/bash

. /common.sh

echo "$(hostname -i) sparkmaster" >> /etc/hosts

# We must give the executors the IP address of the master pod, otherwise we will get this error
# inside the worker when it tries to connect to the master:
#
# 20/09/12 15:56:55 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your
# cluster UI to ensure that workers are registered and have sufficient resources
#
# When investigating the worker we can see:
# Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out:
#   sparkmaster/10.101.97.213:34881
# Caused by: java.net.ConnectException: Connection timed out
#
# This means that when spark-class ran, it was able to create the connection at the init stage, but
# when running spark-submit, it failed.
echo "spark.driver.host $(hostname -i)" >> /usr/local/spark/conf/spark-defaults.conf
echo "spark.driver.bindAddress $(hostname -i)" >> /usr/local/spark/conf/spark-defaults.conf

/usr/local/spark/bin/spark-class org.apache.spark.deploy.master.Master --host sparkmaster --port 7077 --webui-port 8080

In this case my SPARK_HOME is /usr/local/spark
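
For reference, after this master script has run, the spark-defaults.conf inside the master pod should end up looking roughly like this, where <MASTER-POD-IP> is whatever hostname -i returned:

$ cat /usr/local/spark/conf/spark-defaults.conf
spark.master spark://sparkmaster:7077
spark.executor.cores 1
spark.executor.memory 512m
spark.driver.extraLibraryPath /opt/hadoop/lib/native
spark.app.id KubernetesSpark
spark.driver.host <MASTER-POD-IP>
spark.driver.bindAddress <MASTER-POD-IP>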

My Dockerfile

FROM python:3.7-slim-stretch

# PATH
ENV PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# Spark
ENV SPARK_VERSION 3.0.0
ENV SPARK_HOME /usr/local/spark
ENV SPARK_LOG_DIR /var/log/spark
ENV SPARK_PID_DIR /var/run/spark
ENV PYSPARK_PYTHON /usr/local/bin/python
ENV PYSPARK_DRIVER_PYTHON /usr/local/bin/python
ENV PYTHONUNBUFFERED 1
ENV HADOOP_COMMON org.apache.hadoop:hadoop-common:2.7.7
ENV HADOOP_AWS org.apache.hadoop:hadoop-aws:2.7.7
ENV SPARK_MASTER_HOST sparkmaster

# Java
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/

# Install curl
RUN apt-get update && apt-get install -y curl

# Install procps
RUN apt-get install -y procps

# Install coreutils
RUN apt-get install -y coreutils

# https://github.com/geerlingguy/ansible-role-java/issues/64
RUN apt-get update && mkdir -p /usr/share/man/man1 && apt-get install -y openjdk-8-jdk && \
    apt-get install -y ant && apt-get clean && rm -rf /var/lib/apt/lists/ && \
    rm -rf /var/cache/oracle-jdk8-installer;

# Download Spark, enables full functionality for spark-submit against docker container
RUN curl http://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz | \
        tar -zx -C /usr/local/ && \
        ln -s spark-${SPARK_VERSION}-bin-hadoop2.7 ${SPARK_HOME}

# add scripts and update spark default config
ADD tools/docker/spark/common.sh tools/docker/spark/spark-master.sh tools/docker/spark/spark-worker.sh /
ADD tools/docker/spark/example_spark.py /

RUN chmod +x /common.sh /spark-master.sh /spark-worker.sh

ADD tools/docker/spark/spark-defaults.conf ${SPARK_HOME}/conf/spark-defaults.conf
ENV PATH $PATH:${SPARK_HOME}/bin
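
A build command for this image might look like the following; the tag and the Dockerfile path are assumptions (the ADD lines suggest the build context is the repository root):

docker build -t spark-hadoop:3.0.0 -f tools/docker/spark/Dockerfile .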

I am currently building a streaming platform in this repo:

https://github.com/Thelin90/deiteo

Initial job has not accepted

I'm trying to create my own image using your Dockerfile. It builds without problems, I create the master and worker deployments, and the pods are running, but when I execute the example with pyspark I get this error:

$ kubectl exec spark-master-2-7dd86dc9d7-tftnr -it pyspark

Python 2.7.9 (default, Jun 29 2016, 13:08:31)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 2.7.9 (default, Jun 29 2016 13:08:31)
SparkSession available as 'spark'.
>>> words = 'the quick brown fox jumps over the lazy dog the quick brown fox jumps over the lazy dog'
>>> seq = words.split()
>>> data = sc.parallelize(seq)
>>> counts = data.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).collect()
[Stage 0:>                                                          (0 + 0) / 2]2019-04-25 15:53:10 WARN  TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Any idea what the problem is?
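
A first check for this symptom is whether any workers actually registered with the master. Assuming the master web UI is on its default port 8080, it can be reached with a port-forward against the master pod from the output above and then opened at http://localhost:8080:

kubectl port-forward spark-master-2-7dd86dc9d7-tftnr 8080:8080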

[Question] Connecting to spark in Client Mode on k8s

Hello,
Thank you for the excellent work.
Is there a way to use this architecture for executing PySpark scripts in client mode, such that one can import pyspark in a Jupyter notebook and connect to the Spark cluster running on Kubernetes?

pyspark-notebook
$ docker run --rm -p 10000:8888 -e JUPYTER_ENABLE_LAB=yes -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook

Something like this?

# my-notebook.ipynb
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# Build Spark session
sparkConf = SparkConf()
sparkConf.setMaster("k8s://https://localhost:6445")
sparkConf.set("spark.kubernetes.container.image", "spark-hadoop:2.2.1")
sparkConf.set("spark.kubernetes.namespace", "default")
sparkConf.set('spark.submit.deployMode', 'client') # Only client mode is possible 
sparkConf.set('spark.executor.instances', '2') # Set the number of executer pods
sparkConf.setAppName('pyspark-shell')
os.environ['PYSPARK_PYTHON'] = 'python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = 'python3'

spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
sc = spark.sparkContext

# Test
filePath = os.path.join('../Test1.csv')
df = spark.read.format('csv').options(
    header='true', inferSchema=True).load(filePath)
df.show()
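
One caveat with client mode is that the executor pods must be able to connect back to the driver (here, the notebook), so spark.driver.host needs to be an address reachable from inside the cluster, and the driver and block manager ports need to be open. These are standard Spark properties (they could equally be set via sparkConf.set in the notebook); the values below are placeholders, not something verified against this setup:

pyspark \
  --conf spark.driver.host=<ADDRESS-REACHABLE-FROM-PODS> \
  --conf spark.driver.port=29413 \
  --conf spark.blockManager.port=29414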
