piercingdan / spark-jupyter-aws
A guide on how to set up Jupyter with Pyspark painlessly on AWS EC2 clusters, with S3 I/O support
Hey @PiercingDan! Thanks for writing up this guide and featuring Flintrock in it prominently. I have a couple of suggestions that may help simplify the guide.
From Using your own AMI:
Note that it is important that since flintrock is designed to install and configure Spark every time, it is important that you delete the Spark folder and other files before saving your AMI or you will encounter errors with flintrock.
Actually, you can tell Flintrock not to install Spark as follows:
flintrock launch my-cluster --no-install-spark
You can do the same for HDFS, though Flintrock does not install HDFS by default, so you'd only do that to override a setting in Flintrock's config.yaml.
Most, if not all, of the setup code can be captured in a script and deployed automatically using a combination of Flintrock's run-command and copy-file commands. Have you considered using them?
For example, you can capture your setup code in a script called piercingdan-quickstart.sh and then deploy it to the cluster as follows:
flintrock copy-file my-cluster ./piercingdan-quickstart.sh /tmp/
flintrock run-command my-cluster 'chmod u+x /tmp/piercingdan-quickstart.sh'
flintrock run-command my-cluster '/tmp/piercingdan-quickstart.sh'
If you host the script on GitHub, you can even do away with copy-file and download the script directly from GitHub onto the cluster with run-command. You can also use the --master-only option if what you're doing doesn't need to hit the whole cluster.
Another alternative is to use --ec2-user-data, but I recommend using that only if you're comfortable with EC2.
By capturing this work in a script that can easily be deployed, you can save your readers from having to create and maintain their own AMIs which, in my view, is a big pain, especially if people are changing things from time to time or working in multiple regions.
Finally, you can also just flintrock stop your cluster if the cost of the EBS root drives is acceptable. That will eliminate the cost of the running instances and leave behind a cluster that's ready to use with a quick flintrock start.
It looks like you meant to put a code block around this section:
[ec2-user@privateipaddress]$ lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda    202:0    0  10G  0 disk
└─xvda1 202:1    0  10G  0 part /
xvdf    202:80   0  30G  0 disk
└─xvdf1 202:81   0  30G  0 part
then close the code block for the rest of the list:
sudo mkdir /oldvol, then mount the attached volume sudo mount /dev/xvdf1 /oldvol.
Or perhaps the code block just didn't close properly. I don't have a good environment where I can pull the repo and submit a PR, so hopefully you can give it a quick fix.
When I run this code in my Jupyter notebook:
iris_raw_RDD = sc.textFile('s3n://BucketName/iris_data.csv')
iris_raw_RDD.take(5)
the following error occurs:
An error occurred while calling o71.partitions.
: java.io.IOException: No FileSystem for scheme: s3n
Do you know how I can fix this problem?
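One possible cause, offered as a guess rather than a confirmed diagnosis: "No FileSystem for scheme: s3n" usually means the hadoop-aws module is not on Spark's classpath. A minimal sketch of pulling it in through --packages follows. The helper name hadoop_aws_package is hypothetical, and the version passed to it is an assumption that has to match the Hadoop version your Spark build uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Maven coordinate of hadoop-aws for a
# given Hadoop version. The version must match the Hadoop build of the
# Spark installation; 2.7.3 below is only an example value.
hadoop_aws_package() {
  local hadoop_version="$1"
  printf 'org.apache.hadoop:hadoop-aws:%s' "$hadoop_version"
}

# Example usage (not executed here):
#   pyspark --packages "$(hadoop_aws_package 2.7.3)"
```

Note that the s3n connector was removed in Hadoop 3.x, so on newer builds the path would need the s3a:// scheme instead.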