singularitiescr / spark-docker
Apache Spark Docker Image
Home Page: https://hub.docker.com/r/singularities/spark/
License: MIT License
I got the exception below when running a Spark app with this Docker image.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
...
...
Caused by: java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
at org.apache.spark.serializer.GenericAvroSerializer$$anonfun$serializeDatum$1.apply(GenericAvroSerializer.scala:123)
at org.apache.spark.serializer.GenericAvroSerializer$$anonfun$serializeDatum$1.apply(GenericAvroSerializer.scala:123)
...
...
The same app works fine on my local cluster, which uses a Spark build that bundles Hadoop (spark-2.0.2-bin-hadoop2.7). On checking, I noticed that org.apache.avro.io.DatumWriter was being pulled from Avro 1.7.7 there: /opt/spark-2.0.2-bin-hadoop2.7/jars/avro-1.7.7.jar!/org/apache/avro/io/DatumWriter.class
This Docker image uses version 1.7.4: /usr/hadoop-2.7.3/share/hadoop/common/lib/avro-1.7.4.jar!/org/apache/avro/io/DatumWriter.class
I'm not sure why these differ, given that both use Hadoop 2.7. Is this image built differently from the official Spark distribution? From the stack trace it does appear that Spark needs Avro 1.7.7.
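A possible workaround, assuming the avro-1.7.7.jar from a Spark distribution is available in the container at the path shown above (the application class and jar name below are hypothetical), is to put the newer jar ahead of Hadoop's older copy:

```shell
# Check which Avro jars are present in the container
find /opt /usr -name 'avro-*.jar' 2>/dev/null

# Force Avro 1.7.7 ahead of Hadoop's avro-1.7.4.jar
# (spark.*.extraClassPath entries are prepended to the classpath)
spark-submit \
  --conf spark.driver.extraClassPath=/opt/spark-2.0.2-bin-hadoop2.7/jars/avro-1.7.7.jar \
  --conf spark.executor.extraClassPath=/opt/spark-2.0.2-bin-hadoop2.7/jars/avro-1.7.7.jar \
  --class com.example.MyApp my-app.jar
```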
Just running start-spark master does not start Spark:
> docker run -ti --rm singularities/spark bash
root@742c0b1c2bf3:/# start-spark master
Adding user `spark' ...
Adding new group `spark' (1000) ...
Adding new user `spark' (1000) with group `spark' ...
Not creating home directory `/home/spark'.
spark@742c0b1c2bf3:/$
But if I exit the shell, HDFS and Spark start:
spark@742c0b1c2bf3:/$ exit
16/12/20 22:18:56 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = 742c0b1c2bf3/172.17.0.3
STARTUP_MSG: args = [-format, -force]
STARTUP_MSG: version = 2.7.3
...
16/12/20 22:19:17 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@23a44d8a{/metrics/applications/json,null,AVAILABLE}
16/12/20 22:19:17 INFO master.Master: I have been elected leader! New state: ALIVE
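Given the behavior above (services only start once the interactive shell exits), one sketch of a workaround is to make start-spark the container's main process instead of launching it from an inner bash (the container name here is arbitrary, and the entrypoint behavior is assumed from this image):

```shell
# Run the master as the container's main process, detached
docker run -d --name spark-master -h spark-master singularities/spark start-spark master

# Follow its output instead of attaching a shell
docker logs -f spark-master
```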
Hi,
Thank you for your contribution. When I tried to test my cluster with run-example SparkPi 10, it said:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
at org.apache.spark.deploy.SparkSubmitArguments.handleUnknown(SparkSubmitArguments.scala:451)
at org.apache.spark.launcher.SparkSubmitOptionParser.parse(SparkSubmitOptionParser.java:178)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:97)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:114)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 5 more
It looks like Spark couldn't find the Hadoop classpath. How can I fix this? Many thanks.
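For what it's worth, NoClassDefFoundError on FSDataInputStream usually means a "without Hadoop" Spark build cannot see the Hadoop jars. A possible fix, assuming the hadoop command is on the PATH inside this image:

```shell
# SPARK_DIST_CLASSPATH is the hook Spark's launch scripts read to pick up Hadoop jars
export SPARK_DIST_CLASSPATH="$(hadoop classpath)"
run-example SparkPi 10
```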
Hey,
I'm trying to run pyspark and spark-submit, but something is wrong with Python. This is what I get:
root@sparkmaster:/tmp/test_spark# spark-submit count.py
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:82)
at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:248)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 11 more
When I try to run python directly, I get this error:
root@sparkmaster:/tmp/test_spark# python
bash: python: command not found
root@sparkmaster:/tmp/test_spark# pyspark
env: python: No such file or directory
Do you have any idea?
Thank you
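A likely fix, assuming the image is Debian/Ubuntu based (the adduser output earlier in this thread suggests it is), is to install Python inside the container or point Spark at an interpreter that exists:

```shell
# Install Python inside the container (the package name may differ by base image)
apt-get update && apt-get install -y python

# Alternatively, tell Spark which interpreter to use
export PYSPARK_PYTHON=python3
```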
Hi,
First, this is a great setup for Spark and Hadoop; congratulations, and thanks for a dependable Spark Docker package.
I was trying to access the worker UI. Since we cannot publish a fixed port for the worker API when scaling, how can we reach the web UI? Using the container's IP address directly, like workerIP:8081, is also not working.
Please help
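A sketch of one way to reach the worker web UI, assuming each worker serves it on 8081 (the container names and the start-spark worker arguments below are illustrative): start each worker container yourself and map its UI port to a distinct host port.

```shell
# Publish each worker's UI on a different host port, since two containers
# cannot share the same published port on one host
docker run -d --name spark-worker-1 -p 8081:8081 singularities/spark start-spark worker spark-master
docker run -d --name spark-worker-2 -p 8082:8081 singularities/spark start-spark worker spark-master
```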
Hi,
I'm trying a sample app where a config file has to be shared using the --files option. I see an exception when I run:
./bin/spark-submit --files some-properties-file.conf --class org.apache.spark.examples.SparkPi --master local[2] ./examples/jars/spark-examples_2.11-2.0.1.jar 10
The full exception stack is below; I'm not sure what is causing this.
Exception in thread "main" java.lang.IllegalArgumentException: Malformed IPv6 address at index 8: hdfs://[NAMENODE_HOST]:8020
at java.net.URI.create(URI.java:852)
at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:180)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:361)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1428)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1401)
at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:458)
at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:458)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2275)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Malformed IPv6 address at index 8: hdfs://[NAMENODE_HOST]:8020
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.parseIPv6Reference(URI.java:3469)
at java.net.URI$Parser.parseServer(URI.java:3219)
at java.net.URI$Parser.parseAuthority(URI.java:3155)
at java.net.URI$Parser.parseHierarchical(URI.java:3097)
at java.net.URI$Parser.parse(URI.java:3053)
at java.net.URI.<init>(URI.java:588)
at java.net.URI.create(URI.java:850)
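The literal [NAMENODE_HOST] in hdfs://[NAMENODE_HOST]:8020 is the giveaway: java.net.URI treats anything in square brackets as an IPv6 literal, so a placeholder in core-site.xml that was never substituted produces exactly this error. A sketch of a likely fix, assuming the image fills that placeholder from an environment variable (the variable name is taken from the error message and is an assumption):

```shell
# Provide the real namenode hostname so the placeholder gets substituted
docker run -d -e NAMENODE_HOST=spark-master singularities/spark start-spark master
```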
I'm launching a Spark cluster using the provided Dockerfile and it's working perfectly (thanks a lot for that!), but I can't see any running/completed jobs in the Spark UI.
I'd like to see them because they show what Spark is doing under the hood in a more legible way.
Can anyone help me with this?
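If the issue is that applications vanish from the UI once they finish, one standard approach (the log directory path here is an assumption) is to enable Spark's event log and browse completed jobs through the history server:

```shell
# Record application events so completed jobs remain inspectable
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs:///spark-logs \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.11-2.0.1.jar 10

# Serve the recorded events on port 18080
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///spark-logs"
"$SPARK_HOME"/sbin/start-history-server.sh
```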
Hi, how does spark-docker handle CPU allocation? When we create 3 workers, do the three workers run on different CPUs, or do they share the CPUs?
I've noticed that the more workers I use, the slower the computation gets. I'm using Docker 17 with 5 CPUs and 5 GB of RAM.
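By default Docker containers share all of the host's CPUs, so three workers on one 5-CPU host contend for the same cores (plus the master and HDFS daemons), which would explain the slowdown. A sketch of pinning resources per worker (--cpus exists in Docker 17; whether this image forwards SPARK_WORKER_CORES to the worker is an assumption):

```shell
# Cap each worker at one CPU and make the Spark worker advertise only one core
docker run -d --cpus=1 \
  -e SPARK_WORKER_CORES=1 -e SPARK_WORKER_MEMORY=1g \
  singularities/spark start-spark worker spark-master
```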