The spark-jupyter-aws from piercingdan

Potential simplifications to the guide

Hey @PiercingDan! Thanks for writing up this guide and featuring Flintrock in it prominently. I have a couple of suggestions that may help simplify the guide.

From Using your own AMI:

"Note that it is important that since flintrock is designed to install and configure Spark every time, it is important that you delete the Spark folder and other files before saving your AMI or you will encounter errors with flintrock."

Actually, you can tell Flintrock not to install Spark as follows:
```
flintrock launch my-cluster --no-install-spark
```
You can do the same for HDFS, though Flintrock does not install HDFS by default, so you'd only need to do that to override a setting in Flintrock's config.yaml.

Most, if not all, of the setup code can be captured in a script and deployed automatically using a combination of Flintrock's run-command and copy-file commands. Have you considered using them?

For example, you can capture your setup code in a script called piercingdan-quickstart.sh and then deploy it to the cluster as follows:
```
flintrock copy-file my-cluster ./piercingdan-quickstart.sh /tmp/
flintrock run-command my-cluster 'chmod u+x /tmp/piercingdan-quickstart.sh'
flintrock run-command my-cluster '/tmp/piercingdan-quickstart.sh'
```
If you host the script on GitHub, you can even do away with copy-file and download the script directly from GitHub onto the cluster with run-command. You can also use the --master-only option if what you're doing doesn't need to hit the whole cluster.
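As a rough sketch of that GitHub-hosted variant, the helper below downloads a script onto the nodes and runs it, using only run-command. The function name fetch_and_run, the /tmp/quickstart.sh path, and the URL in the usage comment are placeholders of mine, not part of Flintrock or the guide, and it assumes curl is available on the cluster's AMI.

```shell
# Hypothetical helper: download a setup script on the cluster and run it there.
# Usage: fetch_and_run <cluster-name> <script-url> [extra run-command options]
# e.g.:  fetch_and_run my-cluster https://example.com/piercingdan-quickstart.sh --master-only
#        (placeholder URL; substitute the raw URL of your own hosted script)
fetch_and_run() {
    cluster="$1"
    url="$2"
    shift 2
    remote_path=/tmp/quickstart.sh   # assumed scratch location on each node

    # "$@" forwards any extra options (for example --master-only) to run-command.
    # The second step only runs if the download succeeded.
    flintrock run-command "$@" "$cluster" "curl -fsSL -o $remote_path '$url'" &&
    flintrock run-command "$@" "$cluster" "bash $remote_path"
}
```

Downloading to a file first, rather than piping curl into bash, keeps each step a plain command and mirrors the copy-file then run-command sequence above.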

Another alternative is to use --ec2-user-data, but I recommend using that only if you're comfortable with EC2.

By capturing this work in a script that can easily be deployed, you can save your readers from having to create and maintain their own AMIs, which in my view is a big pain, especially if people are changing things from time to time or working in multiple regions.

Finally, you can also just flintrock stop your cluster if the cost of the EBS root drives is acceptable. That will eliminate the cost of the running instances and leave behind a cluster that's ready to use with a quick flintrock start.
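To make the stop/start workflow concrete, here is a small sketch of a wrapper that starts a stopped cluster, runs one command on it, and stops it again. The function name run_then_stop is my own placeholder; the only Flintrock commands it relies on are start, run-command, and stop. Note that stop may ask for confirmation before proceeding.

```shell
# Hypothetical wrapper: start a stopped cluster, run one command, then stop it again.
# Usage: run_then_stop <cluster-name> <command>
run_then_stop() {
    cluster="$1"
    remote_command="$2"

    # Bring the stopped instances back up; bail out if that fails.
    flintrock start "$cluster" || return 1

    # Run the work and remember whether it succeeded.
    flintrock run-command "$cluster" "$remote_command"
    status=$?

    # Stop the instances again so that only the EBS root drives keep incurring cost.
    flintrock stop "$cluster"
    return "$status"
}
```

The wrapper returns the exit status of the remote command, so a failed job is still visible to the caller even though the cluster gets stopped either way.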

Misformatted text

It looks like you meant to put a code block around this section:

[ec2-user@privateipaddress]$ lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda    202:0    0  10G  0 disk
└─xvda1 202:1    0  10G  0 part /
xvdf    202:80   0  30G  0 disk
└─xvdf1 202:81   0  30G  0 part

then close the code block for the rest of the list:

Make a mount point sudo mkdir /oldvol, then mount the attached volume sudo mount /dev/xvdf1 /oldvol.

or it didn't close properly. I don't have a good environment where I can pull and submit a PR, so hopefully you can give it a quick fix.

s3 access problem

When I run this code in my Jupyter notebook:
iris_raw_RDD = sc.textFile('s3n://BucketName/iris_data.csv')
iris_raw_RDD.take(5)

I get the following error:
An error occurred while calling o71.partitions.
: java.io.IOException: No FileSystem for scheme: s3n

Do you know how I can fix this problem?
