singularitiescr / spark-docker

Apache Spark Docker Image

Home Page: https://hub.docker.com/r/singularities/spark/

License: MIT License

Languages: Shell 100.00%

spark-docker's People

Contributors

fjhoelsg, gregavrbancic, jomarinb

spark-docker's Issues

Different avro library version

I got the exception below when running a Spark app using this Docker image.

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
...
...
Caused by: java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
at org.apache.spark.serializer.GenericAvroSerializer$$anonfun$serializeDatum$1.apply(GenericAvroSerializer.scala:123)
at org.apache.spark.serializer.GenericAvroSerializer$$anonfun$serializeDatum$1.apply(GenericAvroSerializer.scala:123)
...
...

The same app works fine on my local cluster, which uses a Spark distribution that bundles Hadoop (spark-2.0.2-bin-hadoop2.7). On checking, I noticed that org.apache.avro.io.DatumWriter was being pulled from Avro 1.7.7 there: /opt/spark-2.0.2-bin-hadoop2.7/jars/avro-1.7.7.jar!/org/apache/avro/io/DatumWriter.class

This Docker image uses version 1.7.4 instead: /usr/hadoop-2.7.3/share/hadoop/common/lib/avro-1.7.4.jar!/org/apache/avro/io/DatumWriter.class

I am not sure why these differ, given that both use Hadoop 2.7. Is there any difference in how this Docker image is built compared to the Spark distribution? From the stack trace it does appear that Spark needs Avro 1.7.7.
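
A possible workaround (not verified against this image) is to ship a newer Avro jar with the application and put it at the front of the driver and executor classpaths; the jar location and the application class/jar names below are placeholders:

    # Hypothetical workaround: supply Avro 1.7.7 with the job and load it first.
    # /tmp/avro-1.7.7.jar, com.example.MyApp, and my-app.jar are placeholders.
    spark-submit \
      --jars /tmp/avro-1.7.7.jar \
      --conf spark.driver.extraClassPath=/tmp/avro-1.7.7.jar \
      --conf spark.executor.extraClassPath=/tmp/avro-1.7.7.jar \
      --class com.example.MyApp my-app.jar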

start-spark does not start Spark

Just running start-spark master does not start Spark:

> docker run -ti --rm singularities/spark bash
root@742c0b1c2bf3:/# start-spark master
Adding user `spark' ...
Adding new group `spark' (1000) ...
Adding new user `spark' (1000) with group `spark' ...
Not creating home directory `/home/spark'.
spark@742c0b1c2bf3:/$ 

But if I exit the shell, HDFS and Spark start:

spark@742c0b1c2bf3:/$ exit
16/12/20 22:18:56 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = 742c0b1c2bf3/172.17.0.3
STARTUP_MSG:   args = [-format, -force]
STARTUP_MSG:   version = 2.7.3
...
16/12/20 22:19:17 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@23a44d8a{/metrics/applications/json,null,AVAILABLE}
16/12/20 22:19:17 INFO master.Master: I have been elected leader! New state: ALIVE
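
For what it's worth, the output above suggests the services only come up once the interactive shell exits, so running start-spark as the container's main process instead of from an interactive bash session may be the intended usage. A minimal sketch, not verified against this image:

    # Sketch, not verified: run start-spark as the container's main process
    # rather than invoking it from an interactive bash session.
    docker run -d --name spark-master singularities/spark start-spark master
    docker logs -f spark-master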

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream

Hi,
Thank you for your contribution. When I tried to test my cluster with run-example SparkPi 10, it said:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
    at org.apache.spark.deploy.SparkSubmitArguments.handleUnknown(SparkSubmitArguments.scala:451)
    at org.apache.spark.launcher.SparkSubmitOptionParser.parse(SparkSubmitOptionParser.java:178)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:97)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:114)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 5 more

It looks like Spark could not find the Hadoop classpath. How can I fix this? Many thanks.
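
For a Spark build that ships without Hadoop, the usual fix is to export SPARK_DIST_CLASSPATH from hadoop classpath before launching anything. A sketch, assuming the /usr/hadoop-2.7.3 layout mentioned in another issue here; the exact paths in this image are not verified:

    # Sketch: expose the Hadoop jars to Spark (paths are assumptions).
    export HADOOP_HOME=/usr/hadoop-2.7.3
    export SPARK_DIST_CLASSPATH=$("$HADOOP_HOME"/bin/hadoop classpath)
    run-example SparkPi 10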

pyspark and spark-submit do not work inside containers

Hey,

I am trying to run pyspark and spark-submit, but something is wrong with Python.

This is what I get:

root@sparkmaster:/tmp/test_spark# spark-submit count.py
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:82)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:248)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 11 more

When I try to run python on its own, I get this error:

root@sparkmaster:/tmp/test_spark# python
bash: python: command not found


root@sparkmaster:/tmp/test_spark# pyspark
env: python: No such file or directory

Do you have any idea?

Thank you
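
The errors above suggest the image ships without a Python interpreter. One option is to install one inside the container (or in a derived image) and point Spark at it; whether apt is available depends on the base image and is an assumption here:

    # Sketch, assuming a Debian/Ubuntu base image with apt available.
    apt-get update && apt-get install -y python
    export PYSPARK_PYTHON=python
    spark-submit count.py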

worker web UI not accessible

Hi,
First, this is great code to use for Spark and Hadoop. Congratulations and thanks for a good, dependable Spark Docker package.
I was trying to access the worker UI. Since we cannot add a port to expose the worker UI and still be able to scale, how can we reach the web UI? Using the container's IP address directly, for example workerIP:8081, is also not working.
Please help.
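
One common way to reach a single worker's web UI is to publish port 8081 for that specific worker container at run time; the start-spark worker command form below is an assumption, since only start-spark master appears in these issues:

    # Sketch: publish the worker web UI (8081) for one worker container.
    # The "start-spark worker" command form is an assumption.
    docker run -d -p 8081:8081 singularities/spark start-spark worker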

Exception when using --files

Hi,

I'm trying a sample app where a config file has to be shared using the --files option. I see an exception when I use this:

./bin/spark-submit --files some-properties-file.conf --class org.apache.spark.examples.SparkPi --master local[2] ./examples/jars/spark-examples_2.11-2.0.1.jar 10

The full exception stack is below. I am not sure what is causing this.

Exception in thread "main" java.lang.IllegalArgumentException: Malformed IPv6 address at index 8: hdfs://[NAMENODE_HOST]:8020
at java.net.URI.create(URI.java:852)
at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:180)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:361)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1428)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:1401)
at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:458)
at org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:458)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2275)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Malformed IPv6 address at index 8: hdfs://[NAMENODE_HOST]:8020
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.parseIPv6Reference(URI.java:3469)
at java.net.URI$Parser.parseServer(URI.java:3219)
at java.net.URI$Parser.parseAuthority(URI.java:3155)
at java.net.URI$Parser.parseHierarchical(URI.java:3097)
at java.net.URI$Parser.parse(URI.java:3053)
at java.net.URI.<init>(URI.java:588)
at java.net.URI.create(URI.java:850)
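
The URI hdfs://[NAMENODE_HOST]:8020 looks like an unreplaced placeholder in the Hadoop configuration (probably fs.defaultFS), and the literal square brackets are then parsed as an IPv6 address. A possible workaround is to override the setting at submit time; the namenode hostname used below is a placeholder:

    # Sketch: override fs.defaultFS so the bracketed placeholder is not parsed.
    # "spark-master" stands in for the real namenode hostname.
    ./bin/spark-submit \
      --conf spark.hadoop.fs.defaultFS=hdfs://spark-master:8020 \
      --files some-properties-file.conf \
      --class org.apache.spark.examples.SparkPi --master local[2] \
      ./examples/jars/spark-examples_2.11-2.0.1.jar 10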

Jobs don't show in the Spark UI

I'm launching a Spark cluster using the provided Dockerfile and it's working perfectly (thanks a lot for that!), but I can't see any running or completed job in the Spark UI.

I'd like to see them because they show what Spark is doing under the hood in a more legible way.

Can anyone help me with this?
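
If the applications have already finished, the standalone master UI only lists them as completed applications; per-job detail normally requires event logging plus the history server. A sketch, assuming an HDFS log directory that may not exist in this setup:

    # Sketch: enable event logging so finished jobs remain inspectable.
    # The hdfs:///spark-logs directory is an assumption.
    hdfs dfs -mkdir -p /spark-logs
    spark-submit \
      --conf spark.eventLog.enabled=true \
      --conf spark.eventLog.dir=hdfs:///spark-logs \
      --class org.apache.spark.examples.SparkPi \
      ./examples/jars/spark-examples_2.11-2.0.1.jar 10
    # View them afterwards in the history server:
    SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///spark-logs" \
      "$SPARK_HOME"/sbin/start-history-server.sh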

CPU Allocation

Hi, how does spark-docker handle CPU allocation? When we create 3 workers, do the three workers run on different CPUs, or do they share the available CPUs?

I've noticed that the more workers I use, the slower the computation is. I'm using Docker 17 with 5 CPUs and 5 GB of RAM.
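
Docker containers share all host CPUs by default, and each standalone worker advertises every core it can see, so several workers on one 5-CPU host end up oversubscribing the same cores. A sketch of constraining them explicitly; whether this image forwards SPARK_WORKER_CORES, and the start-spark worker command form, are assumptions:

    # Sketch: cap one worker container at 1 CPU and advertise 1 core / 1g to Spark.
    # SPARK_WORKER_CORES handling and "start-spark worker" are assumptions.
    docker run -d --cpus=1 \
      -e SPARK_WORKER_CORES=1 -e SPARK_WORKER_MEMORY=1g \
      singularities/spark start-spark worker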
