singularitiescr / spark-docker

Apache Spark Docker Image

Home Page: https://hub.docker.com/r/singularities/spark/

License: MIT License

Languages: Shell 100.00%

spark-docker's People

Contributors

fjhoelsg, gregavrbancic, jomarinb

spark-docker's Issues

Different avro library version

I got the exception below when running a Spark app using this Docker image.

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
...
...
Caused by: java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
at org.apache.spark.serializer.GenericAvroSerializer$$anonfun$serializeDatum$1.apply(GenericAvroSerializer.scala:123)
at org.apache.spark.serializer.GenericAvroSerializer$$anonfun$serializeDatum$1.apply(GenericAvroSerializer.scala:123)
...
...

The same app works fine on my local cluster, which uses a Spark distribution that bundles Hadoop (spark-2.0.2-bin-hadoop2.7). On checking, I noticed that org.apache.avro.io.DatumWriter was being pulled from Avro 1.7.7 there: /opt/spark-2.0.2-bin-hadoop2.7/jars/avro-1.7.7.jar!/org/apache/avro/io/DatumWriter.class

This Docker image uses version 1.7.4 instead: /usr/hadoop-2.7.3/share/hadoop/common/lib/avro-1.7.4.jar!/org/apache/avro/io/DatumWriter.class

I am not sure why these differ, given that both use Hadoop 2.7. Is there any difference in how this Docker image is built compared to the Spark distribution? From the stack trace it does appear that Spark needs Avro 1.7.7.
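
A possible workaround (not verified against this image) is to ship a newer Avro jar with the application and put it at the front of the driver and executor classpaths; the jar location and the application class/jar names below are placeholders:

    # Hypothetical workaround: supply Avro 1.7.7 with the job and load it first.
    # /tmp/avro-1.7.7.jar, com.example.MyApp, and my-app.jar are placeholders.
    spark-submit \
      --jars /tmp/avro-1.7.7.jar \
      --conf spark.driver.extraClassPath=/tmp/avro-1.7.7.jar \
      --conf spark.executor.extraClassPath=/tmp/avro-1.7.7.jar \
      --class com.example.MyApp my-app.jar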

start-spark does not start Spark

Just running start-spark master does not start Spark:

> docker run -ti --rm singularities/spark bash
root@742c0b1c2bf3:/# start-spark master
Adding user `spark' ...
Adding new group `spark' (1000) ...
Adding new user `spark' (1000) with group `spark' ...
Not creating home directory `/home/spark'.
spark@742c0b1c2bf3:/$ 

But if I exit the shell, HDFS and Spark start:

spark@742c0b1c2bf3:/$ exit
16/12/20 22:18:56 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = 742c0b1c2bf3/172.17.0.3
STARTUP_MSG:   args = [-format, -force]
STARTUP_MSG:   version = 2.7.3
...
16/12/20 22:19:17 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@23a44d8a{/metrics/applications/json,null,AVAILABLE}
16/12/20 22:19:17 INFO master.Master: I have been elected leader! New state: ALIVE
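
For what it's worth, the output above suggests the services only come up once the interactive shell exits, so running start-spark as the container's main process instead of from an interactive bash session may be the intended usage. A minimal sketch, not verified against this image:

    # Sketch, not verified: run start-spark as the container's main process
    # rather than invoking it from an interactive bash session.
    docker run -d --name spark-master singularities/spark start-spark master
    docker logs -f spark-master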

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream

Hi,
Thank you for your contribution. When I tried to test my cluster with run-example SparkPi 10, it said:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
    at org.apache.spark.deploy.SparkSubmitArguments.handleUnknown(SparkSubmitArguments.scala:451)
    at org.apache.spark.launcher.SparkSubmitOptionParser.parse(SparkSubmitOptionParser.java:178)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:97)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:114)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 5 more

It looks like Spark could not find the Hadoop classpath. How can I fix this? Many thanks.
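
For a Spark build that ships without Hadoop, the usual fix is to export SPARK_DIST_CLASSPATH from hadoop classpath before launching anything. A sketch, assuming the /usr/hadoop-2.7.3 layout mentioned in another issue here; the exact paths in this image are not verified:

    # Sketch: expose the Hadoop jars to Spark (paths are assumptions).
    export HADOOP_HOME=/usr/hadoop-2.7.3
    export SPARK_DIST_CLASSPATH=$("$HADOOP_HOME"/bin/hadoop classpath)
    run-example SparkPi 10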

pyspark and spark-submit do not work inside containers

Hey,

I am trying to run pyspark and spark-submit, but something is wrong with Python.

This is what I get:

root@sparkmaster:/tmp/test_spark# spark-submit count.py
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:82)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:248)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 11 more

When I try to run python on its own, I get this error:

root@sparkmaster:/tmp/test_spark# python
bash: python: command not found


root@sparkmaster:/tmp/test_spark# pyspark
env: python: No such file or directory

Do you have any idea?

Thank you
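
The errors above suggest the image ships without a Python interpreter. One option is to install one inside the container (or in a derived image) and point Spark at it; whether apt is available depends on the base image and is an assumption here:

    # Sketch, assuming a Debian/Ubuntu base image with apt available.
    apt-get update && apt-get install -y python
    export PYSPARK_PYTHON=python
    spark-submit count.py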

worker web UI not accessible

Hi,
First, this is great code to use for Spark and Hadoop. Congratulations and thanks for a good, dependable Spark Docker package.
I was trying to access the worker UI. Since we cannot add a port to expose the worker UI and still be able to scale, how can we reach the web UI? Using the container's IP address directly, for example workerIP:8081, is also not working.
Please help.
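
One common way to reach a single worker's web UI is to publish port 8081 for that specific worker container at run time; the start-spark worker command form below is an assumption, since only start-spark master appears in these issues:

    # Sketch: publish the worker web UI (8081) for one worker container.
    # The "start-spark worker" command form is an assumption.
    docker run -d -p 8081:8081 singularities/spark start-spark worker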

Exception when using --files

Hi,

I'm trying a sample app where a config file has to be shared using the --files option. I see an exception when I use this:

./bin/spark-submit --files some-properties-file.conf --class org.apache.spark.examples.SparkPi --master local[2] ./examples/jars/spark-examples_2.11-2.0.1.jar 10

The full exception stack is below. I am not sure what is causing this.

Exception in thread "main" java.lang.IllegalArgumentException: Malformed IPv6 address at index 8: hdfs://[NAMENODE_HOST]:8020
at java.net.URI.create(URI.java:852)
at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:180)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:361)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1428)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1401)
at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:458)
at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:458)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2275)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Malformed IPv6 address at index 8: hdfs://[NAMENODE_HOST]:8020
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.parseIPv6Reference(URI.java:3469)
at java.net.URI$Parser.parseServer(URI.java:3219)
at java.net.URI$Parser.parseAuthority(URI.java:3155)
at java.net.URI$Parser.parseHierarchical(URI.java:3097)
at java.net.URI$Parser.parse(URI.java:3053)
at java.net.URI.<init>(URI.java:588)
at java.net.URI.create(URI.java:850)
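
The URI hdfs://[NAMENODE_HOST]:8020 looks like an unreplaced placeholder in the Hadoop configuration (probably fs.defaultFS), and the literal square brackets are then parsed as an IPv6 address. A possible workaround is to override the setting at submit time; the namenode hostname used below is a placeholder:

    # Sketch: override fs.defaultFS so the bracketed placeholder is not parsed.
    # "spark-master" stands in for the real namenode hostname.
    ./bin/spark-submit \
      --conf spark.hadoop.fs.defaultFS=hdfs://spark-master:8020 \
      --files some-properties-file.conf \
      --class org.apache.spark.examples.SparkPi --master local[2] \
      ./examples/jars/spark-examples_2.11-2.0.1.jar 10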

Jobs don't show in the Spark UI

I'm launching a Spark cluster using the provided Dockerfile and it's working perfectly (thanks a lot for that!), but I can't see any running or completed job in the Spark UI.

I'd like to see them because they show what Spark is doing under the hood in a more legible way.

Can anyone help me with this?
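
If the applications have already finished, the standalone master UI only lists them as completed applications; per-job detail normally requires event logging plus the history server. A sketch, assuming an HDFS log directory that may not exist in this setup:

    # Sketch: enable event logging so finished jobs remain inspectable.
    # The hdfs:///spark-logs directory is an assumption.
    hdfs dfs -mkdir -p /spark-logs
    spark-submit \
      --conf spark.eventLog.enabled=true \
      --conf spark.eventLog.dir=hdfs:///spark-logs \
      --class org.apache.spark.examples.SparkPi \
      ./examples/jars/spark-examples_2.11-2.0.1.jar 10
    # View them afterwards in the history server:
    SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///spark-logs" \
      "$SPARK_HOME"/sbin/start-history-server.sh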

CPU Allocation

Hi, how does spark-docker handle CPU allocation? When we create 3 workers, do the three workers run on different CPUs, or do they share the available CPUs?

I've noticed that the more workers I use, the slower the computation is. I'm using Docker 17 with 5 CPUs and 5 GB of RAM.
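
Docker containers share all host CPUs by default, and each standalone worker advertises every core it can see, so several workers on one 5-CPU host end up oversubscribing the same cores. A sketch of constraining them explicitly; whether this image forwards SPARK_WORKER_CORES, and the start-spark worker command form, are assumptions:

    # Sketch: cap one worker container at 1 CPU and advertise 1 core / 1g to Spark.
    # SPARK_WORKER_CORES handling and "start-spark worker" are assumptions.
    docker run -d --cpus=1 \
      -e SPARK_WORKER_CORES=1 -e SPARK_WORKER_MEMORY=1g \
      singularities/spark start-spark worker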
