spark-kubernetes's Introduction

Deploying Spark on Kubernetes

Want to learn how to build this?

Check out the post.

Want to use this project?

Minikube Setup

Install and run Minikube:

  1. Install a Hypervisor (like VirtualBox or HyperKit) to manage virtual machines
  2. Install and Set Up kubectl to deploy and manage apps on Kubernetes
  3. Install Minikube

Start the cluster:

$ minikube start --memory 8192 --cpus 4
$ minikube dashboard
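
If you want to confirm the cluster is healthy before moving on, a quick check (assuming kubectl is using the Minikube context, which minikube start configures by default) might look like:

$ minikube status
$ kubectl get nodes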

Build the Docker image:

$ eval $(minikube docker-env)
$ docker build -t spark-hadoop:2.2.1 -f ./docker/Dockerfile ./docker
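
To confirm that the image landed in Minikube's Docker daemon rather than the host's (it only will if the docker-env shell above is still active), list it from the same shell:

$ docker images | grep spark-hadoop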

Create the deployments and services:

$ kubectl create -f ./kubernetes/spark-master-deployment.yaml
$ kubectl create -f ./kubernetes/spark-master-service.yaml
$ kubectl create -f ./kubernetes/spark-worker-deployment.yaml
$ minikube addons enable ingress
$ kubectl apply -f ./kubernetes/minikube-ingress.yaml
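
Before wiring up the hostname, it can help to verify that everything came up; a quick check might look like:

$ kubectl get deployments
$ kubectl get pods -o wide
$ kubectl get services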

Add an entry to /etc/hosts:

$ echo "$(minikube ip) spark-kubernetes" | sudo tee -a /etc/hosts

Test it out in the browser at http://spark-kubernetes/.
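
If the browser check fails, a curl against the Minikube IP with the Host header set (which is what the /etc/hosts entry does for the browser) can help narrow down whether the ingress is the problem:

$ curl -H "Host: spark-kubernetes" http://$(minikube ip)/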


spark-kubernetes's Issues

1 node(s) had taints that the pod didn't tolerate

Pod creation fails with minikube v1.1.0 on Ubuntu 16.04

$ kubectl describe po spark-master-7fcc5d88cc-24t57
...
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  84s (x20 over 28m)  default-scheduler  0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.

$ kubectl get po
NAME                            READY   STATUS    RESTARTS   AGE
spark-master-7fcc5d88cc-24t57   0/1     Pending   0          30m
spark-worker-77fc45f84f-g9lkl   0/1     Pending   0          30m
spark-worker-77fc45f84f-l8h8l   0/1     Pending   0          30m
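
A first diagnostic step for this symptom is usually to look at the node's taints directly (assuming the single Minikube node has the default name minikube); something like:

$ kubectl describe node minikube | grep -i taints
$ kubectl taint nodes minikube <taint-key>-    # removes the offending taint if it should not be there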

I keep getting ImagePullBackOff

When I create the deployments and services, the master pod is Running, but the workers fail with an "ImagePullBackOff" error.


The Docker image used is: spark-hadoop:3.0.0

When I inspect the worker logs, I get this: "Error from server (BadRequest): container "spark-worker" in pod "spark-worker-655b68b77c-8cvrz" is waiting to start: trying and failing to pull image"
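
The usual things to check for this symptom are whether the image actually exists inside Minikube's Docker daemon (a locally built image is only visible to the cluster if it was built against that daemon, as in the README's eval $(minikube docker-env) step) and whether the deployment forces a pull from a remote registry (imagePullPolicy: Always will never use a local-only image). A rough sketch, assuming the deployment is named spark-worker as in the repo's YAML:

$ eval $(minikube docker-env)
$ docker images | grep spark-hadoop
$ kubectl get deployment spark-worker -o jsonpath='{.spec.template.spec.containers[0].imagePullPolicy}'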

Solution For Connectivity Problem When Submitting Work From Master Node To Worker(s)

I was inspired by this repository and have continued to build on it.

However, I also ran into the issue reported here: #1

I was getting:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I banged my head against this for two nights; not being an expert in Kubernetes, I first thought something was wrong with how I had started it up.

Either way, this is how I reproduced the problem:

1)

I checked my resources, and I made the following config:

spark-defaults.conf

spark.master spark://sparkmaster:7077
spark.driver.host sparkmaster
spark.driver.bindAddress sparkmaster
spark.executor.cores 1
spark.executor.memory 512m
spark.driver.extraLibraryPath /opt/hadoop/lib/native
spark.app.id KubernetesSpark

2)

And I ran minikube with:

minikube start --memory 8192 --cpus 4 --vm=true

3)

These were my spark-master and spark-worker scripts:

spark-worker.sh

#!/bin/bash

. /common.sh

getent hosts sparkmaster

if ! getent hosts sparkmaster; then
  sleep 5
  exit 0
fi

/usr/local/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://sparkmaster:7077 --webui-port 8081 --memory 2g

### Note: I put 2g here just to be 100% confident I was not using too many resources.

spark-master.sh

#!/bin/bash

. /common.sh

echo "$(hostname -i) sparkmaster" >> /etc/hosts

/usr/local/spark/bin/spark-class org.apache.spark.deploy.master.Master --host sparkmaster --port 7077 --webui-port 8080

4)

I then ran:

kubectl exec <master-pod-name> -it -- pyspark
>>>
sc.parallelize([1,2,3,4]).collect()
>>>
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

And the error occurred!

I made sure to get access to ports 8081 and 4040 to investigate the logs further:

kubectl port-forward <spark-worker-pod> 8081:8081
kubectl port-forward <spark-master-pod> 4040:4040

I then went in and navigated:

http://localhost:8081/ --> found my executor --> stderr (`http://localhost:8081/logPage/?appId=<APP-ID>&executorId=<EXECUTOR-ID>&logType=stderr`)

5)

I scratched my head; I knew I had enough resources, so why did this not work?

And I could see:

Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: sparkmaster/10.101.97.213:41607
Caused by: java.net.ConnectException: Connection timed out

I then thought: well, I had done this correctly:

spark.driver.host sparkmaster
spark.driver.bindAddress sparkmaster

The docs mention that it can be either a host name or an IP, so I thought I was good. I then saw this possible solution:

sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy

Well, this was not a problem for me; in fact, there was no iptables issue to resolve at all.

So I then verified the master IP with:

kubectl get pods -o wide

I then took the MASTER-IP and added it directly:

pyspark --conf spark.driver.bindAddress=<MASTER-POD-IP> --conf spark.driver.host=<MASTER-POD-IP>
>>> ....

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/

Using Python version 3.7.9 (default, Sep 10 2020 17:42:58)
SparkSession available as 'spark'.
>>> sc.parallelize([1,2,3,4,5,6]).collect()
[1, 2, 3, 4, 5, 6] <---- BOOOOOOM!!!!!!!!!!!!!!!

6)

SOLUTION:

spark-defaults.conf

spark.master spark://sparkmaster:7077
spark.executor.cores 1
spark.executor.memory 512m
spark.driver.extraLibraryPath /opt/hadoop/lib/native
spark.app.id KubernetesSpark

And add the IPs correctly:

spark-master.sh

#!/bin/bash

. /common.sh

echo "$(hostname -i) sparkmaster" >> /etc/hosts

# We must give the executors the IP address of the master pod, otherwise we will get this error
# inside the worker when it tries to connect to the master:
#
# 20/09/12 15:56:55 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your
# cluster UI to ensure that workers are registered and have sufficient resources
#
# When investigating the worker we can see:
# Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out:
#   sparkmaster/10.101.97.213:34881
# Caused by: java.net.ConnectException: Connection timed out
#
# This means that when spark-class ran, it was able to create the connection at the init stage, but
# when running spark-submit, it failed.
echo "spark.driver.host $(hostname -i)" >> /usr/local/spark/conf/spark-defaults.conf
echo "spark.driver.bindAddress $(hostname -i)" >> /usr/local/spark/conf/spark-defaults.conf

/usr/local/spark/bin/spark-class org.apache.spark.deploy.master.Master --host sparkmaster --port 7077 --webui-port 8080

In this case my SPARK_HOME is /usr/local/spark
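
For reference, after this master script has run, the spark-defaults.conf inside the master pod should end up looking roughly like this, where <MASTER-POD-IP> is whatever hostname -i returned:

$ cat /usr/local/spark/conf/spark-defaults.conf
spark.master spark://sparkmaster:7077
spark.executor.cores 1
spark.executor.memory 512m
spark.driver.extraLibraryPath /opt/hadoop/lib/native
spark.app.id KubernetesSpark
spark.driver.host <MASTER-POD-IP>
spark.driver.bindAddress <MASTER-POD-IP>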

My Dockerfile

FROM python:3.7-slim-stretch

# PATH
ENV PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# Spark
ENV SPARK_VERSION 3.0.0
ENV SPARK_HOME /usr/local/spark
ENV SPARK_LOG_DIR /var/log/spark
ENV SPARK_PID_DIR /var/run/spark
ENV PYSPARK_PYTHON /usr/local/bin/python
ENV PYSPARK_DRIVER_PYTHON /usr/local/bin/python
ENV PYTHONUNBUFFERED 1
ENV HADOOP_COMMON org.apache.hadoop:hadoop-common:2.7.7
ENV HADOOP_AWS org.apache.hadoop:hadoop-aws:2.7.7
ENV SPARK_MASTER_HOST sparkmaster

# Java
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/

# Install curl
RUN apt-get update && apt-get install -y curl

# Install procps
RUN apt-get install -y procps

# Install coreutils
RUN apt-get install -y coreutils

# https://github.com/geerlingguy/ansible-role-java/issues/64
RUN apt-get update && mkdir -p /usr/share/man/man1 && apt-get install -y openjdk-8-jdk && \
    apt-get install -y ant && apt-get clean && rm -rf /var/lib/apt/lists/ && \
    rm -rf /var/cache/oracle-jdk8-installer;

# Download Spark, enables full functionality for spark-submit against docker container
RUN curl http://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz | \
        tar -zx -C /usr/local/ && \
        ln -s spark-${SPARK_VERSION}-bin-hadoop2.7 ${SPARK_HOME}

# add scripts and update spark default config
ADD tools/docker/spark/common.sh tools/docker/spark/spark-master.sh tools/docker/spark/spark-worker.sh /
ADD tools/docker/spark/example_spark.py /

RUN chmod +x /common.sh /spark-master.sh /spark-worker.sh

ADD tools/docker/spark/spark-defaults.conf ${SPARK_HOME}/conf/spark-defaults.conf
ENV PATH $PATH:${SPARK_HOME}/bin
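
A build command for this image might look like the following; the tag and the Dockerfile path are assumptions (the ADD lines suggest the build context is the repository root):

docker build -t spark-hadoop:3.0.0 -f tools/docker/spark/Dockerfile .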

I am currently building a streaming platform in this repo:

https://github.com/Thelin90/deiteo

Initial job has not accepted

I'm trying to create my own image using your Dockerfile. It builds without problems, I create the master and worker deployments, and the pods are running, but when I execute the example with pyspark I get this error:

$ kubectl exec spark-master-2-7dd86dc9d7-tftnr -it pyspark

Python 2.7.9 (default, Jun 29 2016, 13:08:31)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 2.7.9 (default, Jun 29 2016 13:08:31)
SparkSession available as 'spark'.
>>> words = 'the quick brown fox jumps over the lazy dog the quick brown fox jumps over the lazy dog'
>>> seq = words.split()
>>> data = sc.parallelize(seq)
>>> counts = data.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).collect()
[Stage 0:>                                                          (0 + 0) / 2]2019-04-25 15:53:10 WARN  TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Any idea what the problem is?
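
A first check for this symptom is whether any workers actually registered with the master. Assuming the master web UI is on its default port 8080, it can be reached with a port-forward against the master pod from the output above and then opened at http://localhost:8080:

kubectl port-forward spark-master-2-7dd86dc9d7-tftnr 8080:8080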

[Question] Connecting to spark in Client Mode on k8s

Hello,
Thank you for the excellent work.
Is there a way to use this architecture for executing PySpark scripts in client mode, such that one can import pyspark in a Jupyter notebook and connect to the Spark cluster running on Kubernetes?

pyspark-notebook
$ docker run --rm -p 10000:8888 -e JUPYTER_ENABLE_LAB=yes -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook

Something like this?

# my-notebook.ipynb
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# Build Spark session
sparkConf = SparkConf()
sparkConf.setMaster("k8s://https://localhost:6445")
sparkConf.set("spark.kubernetes.container.image", "spark-hadoop:2.2.1")
sparkConf.set("spark.kubernetes.namespace", "default")
sparkConf.set('spark.submit.deployMode', 'client') # Only client mode is possible 
sparkConf.set('spark.executor.instances', '2') # Set the number of executer pods
sparkConf.setAppName('pyspark-shell')
os.environ['PYSPARK_PYTHON'] = 'python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = 'python3'

spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
sc = spark.sparkContext

# Test
filePath = os.path.join('../Test1.csv')
df = spark.read.format('csv').options(
    header='true', inferSchema=True).load(filePath)
df.show()
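
One caveat with client mode is that the executor pods must be able to connect back to the driver (here, the notebook), so spark.driver.host needs to be an address reachable from inside the cluster, and the driver and block manager ports need to be open. These are standard Spark properties (they could equally be set via sparkConf.set in the notebook); the values below are placeholders, not something verified against this setup:

pyspark \
  --conf spark.driver.host=<ADDRESS-REACHABLE-FROM-PODS> \
  --conf spark.driver.port=29413 \
  --conf spark.blockManager.port=29414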
