geodocker / geodocker-jupyter-geopyspark
Jupyter+GeoNotebook + GeoPySpark Docker Container

License: Apache License 2.0


geodocker-jupyter-geopyspark's Introduction

GeoDocker Cluster

GeoDocker is a collection of Docker images encapsulating a distributed geo-processing platform based on GeoTrellis, GeoMesa, and GeoWave. The emphasis is on providing integration between these projects and exposing geo-processing functionality in the Hadoop ecosystem.

Project Status

This project is in active development. The layout and composition of GeoDocker may change as we explore our use-case further. Despite that, we're committed to maintaining sanity by providing publicly published, versioned, and tested images. Your feedback and contributions are always welcome.

Goals

  • Integrate GeoTrellis, GeoWave, and GeoMesa as a unified platform
  • Provide a realistic and convenient environment for distributed integration testing
  • Support deployment of GeoDocker to Amazon EMR
  • Explore and support other deployment options like DC/OS and ECS

Environment

Images

Build and Publish

It is not necessary to build and publish these containers in order to use them as-is; pre-built images are available on quay.io. Building is only necessary in order to customize and develop GeoDocker.

All images contain a Makefile which provides the following targets:

  • build: Builds the container with the latest tag
  • test: Runs the container tests
  • publish: Publishes the container with the latest tag and with the tag provided by the $TAG environment variable (ex: make publish TAG=ABC123)
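For reference, such a Makefile typically boils down to something like the following. This is a hedged sketch, not the actual file: the image name, test script, and tagging details vary per repository.

```makefile
# Hypothetical Makefile sketch for a GeoDocker image repository.
# IMG and test.sh are illustrative names, not the actual values.
IMG := quay.io/geodocker/example

build:
	docker build -t $(IMG):latest .

test: build
	./test.sh $(IMG):latest

publish: build
	docker push $(IMG):latest
	docker tag $(IMG):latest $(IMG):$(TAG)
	docker push $(IMG):$(TAG)

.PHONY: build test publish
```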

These targets are also used by Travis CI, as specified in .travis.yml.

Docker Compose: Local Cluster

Images which contain multiple container roles, or which depend on instances of other containers to function, also provide a docker-compose.yml file that makes it easy to bring up a local cluster. This cluster can be used for exploration, integration testing, and debugging.
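A minimal sketch of what such a docker-compose.yml might look like for the Accumulo image is shown below. The service names follow the docker-compose ps output in the session that follows, but the image tags, commands, and environment configuration here are assumptions, not the real file.

```yaml
# Hypothetical docker-compose.yml sketch; the actual file pins
# image versions and sets per-service environment variables.
version: "2"
services:
  zookeeper:
    image: quay.io/geodocker/zookeeper:latest
    ports:
      - "2181:2181"
  hdfs-name:
    image: quay.io/geodocker/hdfs:latest
    command: name
    ports:
      - "50070:50070"
  hdfs-data:
    image: quay.io/geodocker/hdfs:latest
    command: data
  accumulo-master:
    image: quay.io/geodocker/accumulo:latest
    command: master
  accumulo-tserver:
    image: quay.io/geodocker/accumulo:latest
    command: tserver
  accumulo-monitor:
    image: quay.io/geodocker/accumulo:latest
    command: monitor
    ports:
      - "50095:50095"
```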

# Build the latest container
~/proj/geodocker-accumulo $ make build
docker build -t quay.io/geodocker/accumulo:latest	.
Sending build context to Docker daemon 117.2 kB
Step 1 : FROM quay.io/geodocker/hdfs:latest
...

# Start a local multi-container cluster; use the -d option to start in background mode
~/proj/geodocker-accumulo $ docker-compose up
Creating geodockeraccumulo_zookeeper_1
Creating geodockeraccumulo_hdfs-name_1
Creating geodockeraccumulo_hdfs-data_1
Creating geodockeraccumulo_accumulo-master_1
Creating geodockeraccumulo_accumulo-tserver_1
Creating geodockeraccumulo_accumulo-monitor_1
Attaching to geodockeraccumulo_hdfs-name_1, geodockeraccumulo_zookeeper_1, geodockeraccumulo_hdfs-data_1, geodockeraccumulo_accumulo-master_1, geodockeraccumulo_accumulo-monitor_1, geodockeraccumulo_accumulo-tserver_1
...

# Inspect running containers
~/proj/geodocker-accumulo $ docker-compose ps
                Name                              Command               State                     Ports
--------------------------------------------------------------------------------------------------------------------------
geodockeraccumulo_accumulo-master_1    /sbin/entrypoint.sh master ...   Up
geodockeraccumulo_accumulo-monitor_1   /sbin/entrypoint.sh monitor      Up      0.0.0.0:50095->50095/tcp
geodockeraccumulo_accumulo-tserver_1   /sbin/entrypoint.sh tserver      Up
geodockeraccumulo_hdfs-data_1          /sbin/entrypoint.sh data         Up
geodockeraccumulo_hdfs-name_1          /sbin/entrypoint.sh name         Up      0.0.0.0:50070->50070/tcp
geodockeraccumulo_zookeeper_1          /sbin/entrypoint.sh zkServ ...   Up      0.0.0.0:2181->2181/tcp, 2888/tcp, 3888/tcp

# Inspect logs from running container
~/proj/geodocker-accumulo $ docker-compose logs hdfs-name
hdfs-name_1         | Formatting namenode root fs in /data/hdfs/name...
hdfs-name_1         | 16/07/14 02:30:16 INFO namenode.NameNode: STARTUP_MSG:
hdfs-name_1         | /************************************************************
hdfs-name_1         | STARTUP_MSG: Starting NameNode
hdfs-name_1         | STARTUP_MSG:   host = 46c38f89156b/172.19.0.3
hdfs-name_1         | STARTUP_MSG:   args = [-format]
...

# Run a command inside the cluster container
~/proj/geodocker-accumulo $ docker-compose run --rm accumulo-master bash -c "set -e \
		&& source /sbin/hdfs-lib.sh \
		&& wait_until_hdfs_is_available \
		&& with_backoff hdfs dfs -test -d /accumulo \
		&& accumulo shell -p GisPwd -e 'createtable test_table'"
Safe mode is OFF
2016-07-14 02:49:25,809 [trace.DistributedTrace] INFO : SpanReceiver org.apache.accumulo.tracer.ZooTraceClient was loaded successfully.
2016-07-14 02:49:25,973 [shell.Shell] ERROR: org.apache.accumulo.core.client.TableExistsException: Table test_table exists
make: *** [test] Error 1
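The session above sources with_backoff from /sbin/hdfs-lib.sh to retry commands until HDFS is ready. The real implementation is not shown here, but the retry-with-backoff pattern it names can be sketched as follows (a hypothetical reimplementation, not the library's code):

```shell
# Minimal retry-with-exponential-backoff helper, similar in spirit to
# the `with_backoff` function from /sbin/hdfs-lib.sh (the real
# implementation may differ in limits and logging).
with_backoff() {
  local max=5 delay=1 attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))       # double the wait between attempts
    attempt=$((attempt + 1))
  done
}

# A command that succeeds immediately returns on the first attempt.
with_backoff true && echo "succeeded"
```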


geodocker-jupyter-geopyspark's People

Contributors

echeipesh, jamesmcclain, jpolchlo, lossyrob


geodocker-jupyter-geopyspark's Issues

Latest Version of docker image breaks existing notebooks

I've had some existing code for a while, and when I updated to the latest image I received the error below. The last known image I was able to get my code working on was quay.io/geodocker/jupyter-geopyspark:e900b5f.

The code was simply doing this:

queried_spatial_layer = gps.query(uri=catalog_uri,
                                  layer_name=layer_name,
                                  layer_zoom=0,
                                  query_geom=county,
                                  num_partitions=100)


Py4JJavaError Traceback (most recent call last)
/home/hadoop/.local/lib/python3.4/site-packages/geopyspark/geotrellis/catalog.py in __init__(self, uri)
299 try:
--> 300 self.wrapper = pysc._gateway.jvm.geopyspark.geotrellis.io.AttributeStoreWrapper(uri)
301 except Py4JJavaError as err:

/usr/local/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1400 return_value = get_return_value(
-> 1401 answer, self._gateway_client, None, self._fqn)
1402

/usr/local/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:

Py4JJavaError: An error occurred while calling None.geopyspark.geotrellis.io.AttributeStoreWrapper.
: java.lang.AbstractMethodError: geotrellis.spark.io.s3.S3AttributeStore.geotrellis$spark$io$AttributeCaching$setter$geotrellis$spark$io$AttributeCaching$$x$1_$eq(Lscala/Tuple2;)V
at geotrellis.spark.io.AttributeCaching$class.$init$(AttributeCaching.scala:29)
at geotrellis.spark.io.s3.S3AttributeStore.<init>(S3AttributeStore.scala:38)
at geotrellis.spark.io.s3.S3LayerProvider.attributeStore(S3LayerProvider.scala:41)
at geotrellis.spark.io.AttributeStore$.apply(AttributeStore.scala:70)
at geotrellis.spark.io.AttributeStore$.apply(AttributeStore.scala:73)
at geopyspark.geotrellis.io.AttributeStoreWrapper.<init>(AttributeStoreWrapper.scala:29)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
in ()
3 layer_zoom=0,
4 query_geom=county_wm,
----> 5 num_partitions=100)

/home/hadoop/.local/lib/python3.4/site-packages/geopyspark/geotrellis/catalog.py in query(uri, layer_name, layer_zoom, query_geom, time_intervals, query_proj, num_partitions, store)
185 store = AttributeStore.build(store)
186 else:
--> 187 store = AttributeStore.cached(uri)
188
189 pysc = get_spark_context()

/home/hadoop/.local/lib/python3.4/site-packages/geopyspark/geotrellis/catalog.py in cached(cls, uri)
326 return _cached_stores[uri]
327 else:
--> 328 store = cls(uri)
329 _cached_stores[uri] = store
330 return store

/home/hadoop/.local/lib/python3.4/site-packages/geopyspark/geotrellis/catalog.py in __init__(self, uri)
300 self.wrapper = pysc._gateway.jvm.geopyspark.geotrellis.io.AttributeStoreWrapper(uri)
301 except Py4JJavaError as err:
--> 302 raise ValueError(err.java_exception.getMessage())
303
304 @classmethod

ValueError: geotrellis.spark.io.s3.S3AttributeStore.geotrellis$spark$io$AttributeCaching$setter$geotrellis$spark$io$AttributeCaching$$x$1_$eq(Lscala/Tuple2;)V

Issue with visualizing GPS map outputs through default port structure.

After running a sample demo (for example, NLCD), I was unable to get the xyz tile server running in the GPS VM exposed through to my local machine.

@echeipesh came up with a nice workaround:

Expose a specific port mapping in the docker run initialization command, as per:

docker run -it --rm --name geopyspark \
  -p 8000:8000 -p 4040:4040 -p 7070:7070 \
  -v $HOME/.aws:/home/hadoop/.aws:ro \
  quay.io/geodocker/jupyter-geopyspark

And then include the 7070 port ID in the calls as per:

(screenshot of the calls using port 7070)

Not sure if there is a permanent solution for Macs (this is apparently not an issue on Linux machines), but this works for now.

Username Password?

Hi, I know this is not really an issue, but if I do

docker run -it --rm --name geopyspark \
  -p 8000:8000 -p 4040:4040 \
  quay.io/geodocker/jupyter-geopyspark

why is there no explanation of what the credentials are, or how to get in?

Decouple geopyspark and geopyspark-netcdf versions

The two do not iterate at the same rate, and while we have not yet imposed semver on these projects, decoupling them would make it much easier to iterate and test than bumping the geopyspark-netcdf version manually.

The deeper issue is that the way geopyspark-netcdf jars are discovered at runtime is coupled to the geopyspark version.

Cannot install pip packages - permission denied

Hi there

I'm using the docker image quay.io/geodocker/jupyter-geopyspark:blog (which also contains the fantastic geonotebook), and I'm trying to install geopandas with !pip install geopandas. Unfortunately I'm getting an error:

PermissionError: [Errno 13] Permission denied: '/usr/lib/python3.4/site-packages/descartes-1.1.0.dist-info'

Attempting to sudo gives /usr/bin/sh: sudo: command not found

Is there any way I can install additional pip packages (preferably through a notebook, so I get the correct Python environment)?
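One common workaround, offered here as an assumption rather than something verified against this image, is a per-user install: pip install --user writes into the per-user site-packages directory, which is writable without root. A quick way to see where that directory is:

```python
import site
import subprocess
import sys

# `pip install --user` targets the per-user site directory, which does
# not require root access (e.g. ~/.local/lib/pythonX.Y/site-packages).
user_site = site.getusersitepackages()
print(user_site.endswith("site-packages"))

# The notebook-cell equivalent of `!pip install --user geopandas`
# (left commented out; geopandas is only an example package):
# subprocess.check_call([sys.executable, "-m", "pip", "install", "--user", "geopandas"])
```

Whether the notebook's Python picks up the user site directory depends on how the image configures the environment, so this may still need testing inside the container.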

Many thanks for these fantastic docker containers!

Consider Dropping Support For Docker on EMR

The original plan was to use the Docker image on EMR, but now that we have RPMs, it might be worthwhile to consider dropping support for using Docker on EMR. Doing so would simplify the image and reduce its size.

In particular, the gdal-and-friends.tar.gz binary blob could be dropped from the base image, and the two Python tarballs could be dropped from the main image.

Recapture GDAL capabilities we lost with the move away from a native build

This set of code:

with rio.open('s3://mrgeo-source/srtm-v3-30/N00E006.hgt') as ds:
    bounds = ds.bounds
    height = ds.height
    width = ds.width
    crs = ds.get_crs()
    srs = osr.SpatialReference()
    srs.ImportFromWkt(crs.wkt)
    proj4 = srs.ExportToProj4()
    tile_cols = math.floor((width - 1) / 512) * 512
    tile_rows = math.floor((height - 1) / 512) * 512
    ws = [((x, x + 512), (y, y + 512)) for x in range(0,tile_cols, 512) \
                                          for y in range(0, tile_rows, 512)]
    print(bounds)
    print(height)
    print(width)
    print(crs)
    print(tile_cols)
    print(tile_rows)
    print(ws)

fails in the current container. In the rde/workshop-prep container, it succeeds.

I suspect this is due to the move away from a native GDAL build toward a pip-installed GDAL.

We should either figure out how to regain those capabilities with the pip-installed version, or move back to a native GDAL build.
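As a sanity check, the windowing arithmetic from the failing snippet can be reproduced without rasterio or GDAL at all; the failure is in opening the /vsis3-style path, not in this logic. A self-contained version:

```python
import math

def tile_windows(width, height, size=512):
    # Reproduce the windowing arithmetic from the snippet above:
    # clip the raster to whole `size`-pixel tiles, then enumerate
    # ((x0, x1), (y0, y1)) windows in column-major order.
    tile_cols = math.floor((width - 1) / size) * size
    tile_rows = math.floor((height - 1) / size) * size
    return [((x, x + size), (y, y + size))
            for x in range(0, tile_cols, size)
            for y in range(0, tile_rows, size)]

# A 1201x1201 SRTM tile (like N00E006.hgt) clips to 1024x1024,
# i.e. a 2x2 grid of 512-pixel windows.
print(len(tile_windows(1201, 1201)))  # → 4
```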

Support populating docker image with sample notebooks

I followed this command:

docker run -it --rm --name geopyspark \
   -p 8000:8000 \
   quay.io/geodocker/jupyter-geopyspark

And the notebooks do not appear in the home directory in jupyter.

It would be extremely helpful for a first time user of GPS to be able to do a docker run and get some sample notebooks to play with.

@jpolchlo mentioned that it looks like it's close to enabling this, but something must be missing in terms of populating the notebooks in /home/hadoop - or wherever they need to be.

Break-up Single Security Group in Terraform Setup

There is discussion in #42 and #45 about splitting the single security group used for the ECS instance, the EMR master, and the EMR worker into three separate security groups. I made two brief attempts to do that, but neither was successful.

Attempt 1

With direct dependencies between security groups shown with solid arrows and dependencies mediated by security group rules shown with dotted arrows, the following diagram shows one attempt.
The change is acceptable to Terraform and AWS at creation time. However, the lack of direct access between the ECS instance and the EMR workers causes Spark context creation to fail.

(diagram 1)

The actual change is here:

diff --git a/terraform/emr.tf b/terraform/emr.tf
index 95ed648..693eddd 100644
--- a/terraform/emr.tf
+++ b/terraform/emr.tf
@@ -10,8 +10,8 @@ resource "aws_emr_cluster" "emr-spark-cluster" {
     key_name         = "${var.key_name}"
     subnet_id        = "${var.subnet}"
 
-    emr_managed_master_security_group = "${aws_security_group.security-group.id}"
-    emr_managed_slave_security_group  = "${aws_security_group.security-group.id}"
+    emr_managed_master_security_group = "${aws_security_group.emr-master.id}"
+    emr_managed_slave_security_group  = "${aws_security_group.emr-worker.id}"
   }
 
   instance_group {
diff --git a/terraform/jupyterhub.tf b/terraform/jupyterhub.tf
index e46c0e6..9147ccf 100644
--- a/terraform/jupyterhub.tf
+++ b/terraform/jupyterhub.tf
@@ -3,7 +3,7 @@ resource "aws_spot_instance_request" "jupyterhub" {
   iam_instance_profile = "${var.ecs_instance_profile}"
   instance_type        = "m3.xlarge"
   key_name             = "${var.key_name}"
-  security_groups      = ["${aws_security_group.security-group.name}"]
+  security_groups      = ["${aws_security_group.ecs-instance.name}"]
   spot_price           = "0.05"
   wait_for_fulfillment = true
 
diff --git a/terraform/security-group.tf b/terraform/security-group.tf
index 727ffc3..5f993fc 100644
--- a/terraform/security-group.tf
+++ b/terraform/security-group.tf
@@ -1,9 +1,10 @@
-resource "aws_security_group" "security-group" {
+# ECS Instance
+resource "aws_security_group" "ecs-instance" {
   ingress {
     from_port = 0
     to_port   = 0
     protocol  = "-1"
-    self      = true
+    security_groups = ["${aws_security_group.emr-master.id}"]
   }
 
   ingress {
@@ -31,3 +32,78 @@ resource "aws_security_group" "security-group" {
     create_before_destroy = true
   }
 }
+
+# EMR Master
+resource "aws_security_group" "emr-master" {
+  lifecycle {
+    create_before_destroy = true
+  }
+}
+
+resource "aws_security_group_rule" "from-jupyterhub" {
+  type                     = "ingress"
+  from_port                = 0
+  to_port                  = 0
+  protocol                 = "-1"
+  source_security_group_id = "${aws_security_group.ecs-instance.id}"
+
+  security_group_id = "${aws_security_group.emr-master.id}"
+}
+
+resource "aws_security_group_rule" "from-workers" {
+  type                     = "ingress"
+  from_port                = 0
+  to_port                  = 0
+  protocol                 = "-1"
+  source_security_group_id = "${aws_security_group.emr-worker.id}"
+
+  security_group_id = "${aws_security_group.emr-master.id}"
+}
+
+resource "aws_security_group_rule" "ssh-all" {
+  type                     = "ingress"
+  from_port                = 22
+  to_port                  = 22
+  protocol                 = "tcp"
+  cidr_blocks              = ["0.0.0.0/0"]
+
+  security_group_id = "${aws_security_group.emr-master.id}"
+}
+
+resource "aws_security_group_rule" "outgoing-all" {
+  type                     = "egress"
+  from_port                = 0
+  to_port                  = 0
+  protocol                 = "-1"
+  cidr_blocks              = ["0.0.0.0/0"]
+
+  security_group_id = "${aws_security_group.emr-master.id}"
+}
+
+# EMR Worker
+resource "aws_security_group" "emr-worker" {
+  ingress {
+    from_port = 0
+    to_port   = 0
+    protocol  = "-1"
+    security_groups = ["${aws_security_group.emr-master.id}"]
+  }
+
+  ingress {
+    from_port = 0
+    to_port   = 0
+    protocol  = "-1"
+    self      = true
+  }
+
+  egress {
+    from_port   = 0
+    to_port     = 0
+    protocol    = "-1"
+    cidr_blocks = ["0.0.0.0/0"]
+  }
+
+  lifecycle {
+    create_before_destroy = true
+  }
+}

Attempt 2

Allowing the ECS instance and EMR workers to communicate produces a cyclic dependency.
Although mediated by security group rules, and therefore "grammatical" from Terraform's perspective, this fails when "terraform apply" is run. It gives a message which I recall being to the effect of "you have encountered a bug that used to exist in Terraform" (which is weird on a number of levels). This strategy produces mutually-interdependent security groups which Terraform cannot automatically remove and which must be removed by hand (which is why I have not pasted the error message into this issue verbatim -- I did not want to do the manual cleanup again).

(diagram 2)

diff --git a/terraform/emr.tf b/terraform/emr.tf
index 95ed648..693eddd 100644
--- a/terraform/emr.tf
+++ b/terraform/emr.tf
@@ -10,8 +10,8 @@ resource "aws_emr_cluster" "emr-spark-cluster" {
     key_name         = "${var.key_name}"
     subnet_id        = "${var.subnet}"
 
-    emr_managed_master_security_group = "${aws_security_group.security-group.id}"
-    emr_managed_slave_security_group  = "${aws_security_group.security-group.id}"
+    emr_managed_master_security_group = "${aws_security_group.emr-master.id}"
+    emr_managed_slave_security_group  = "${aws_security_group.emr-worker.id}"
   }
 
   instance_group {
diff --git a/terraform/jupyterhub.tf b/terraform/jupyterhub.tf
index e46c0e6..4bb5b28 100644
--- a/terraform/jupyterhub.tf
+++ b/terraform/jupyterhub.tf
@@ -3,7 +3,7 @@ resource "aws_spot_instance_request" "jupyterhub" {
   iam_instance_profile = "${var.ecs_instance_profile}"
   instance_type        = "m3.xlarge"
   key_name             = "${var.key_name}"
-  security_groups      = ["${aws_security_group.security-group.name}"]
+  security_groups      = ["${aws_security_group.jupyterhub.name}"]
   spot_price           = "0.05"
   wait_for_fulfillment = true
 
diff --git a/terraform/security-group.tf b/terraform/security-group.tf
index 727ffc3..010bb4f 100644
--- a/terraform/security-group.tf
+++ b/terraform/security-group.tf
@@ -1,9 +1,17 @@
-resource "aws_security_group" "security-group" {
+# ECS Instance
+resource "aws_security_group" "jupyterhub" {
   ingress {
     from_port = 0
     to_port   = 0
     protocol  = "-1"
-    self      = true
+    security_groups = ["${aws_security_group.emr-master.id}"]
+  }
+
+  ingress {
+    from_port = 0
+    to_port   = 0
+    protocol  = "-1"
+    security_groups = ["${aws_security_group.emr-worker.id}"]
   }
 
   ingress {
@@ -31,3 +39,125 @@ resource "aws_security_group" "security-group" {
     create_before_destroy = true
   }
 }
+
+# EMR Master
+resource "aws_security_group" "emr-master" {
+  lifecycle {
+    create_before_destroy = true
+  }
+}
+
+resource "aws_security_group_rule" "master-jupyterhub" {
+  type                     = "ingress"
+  from_port                = 0
+  to_port                  = 0
+  protocol                 = "-1"
+  source_security_group_id = "${aws_security_group.jupyterhub.id}"
+
+  security_group_id = "${aws_security_group.emr-master.id}"
+}
+
+resource "aws_security_group_rule" "master-workers" {
+  type                     = "ingress"
+  from_port                = 0
+  to_port                  = 0
+  protocol                 = "-1"
+  source_security_group_id = "${aws_security_group.emr-worker.id}"
+
+  security_group_id = "${aws_security_group.emr-master.id}"
+}
+
+resource "aws_security_group_rule" "master-ssh" {
+  type                     = "ingress"
+  from_port                = 22
+  to_port                  = 22
+  protocol                 = "tcp"
+  cidr_blocks              = ["0.0.0.0/0"]
+
+  security_group_id = "${aws_security_group.emr-master.id}"
+}
+
+resource "aws_security_group_rule" "master-outgoing" {
+  type                     = "egress"
+  from_port                = 0
+  to_port                  = 0
+  protocol                 = "-1"
+  cidr_blocks              = ["0.0.0.0/0"]
+
+  security_group_id = "${aws_security_group.emr-master.id}"
+}
+
+# EMR Worker
+resource "aws_security_group" "emr-worker" {
+  # ingress {
+  #   from_port = 0
+  #   to_port   = 0
+  #   protocol  = "-1"
+  #   security_groups = ["${aws_security_group.jupyterhub.id}"]
+  # }
+
+  # ingress {
+  #   from_port = 0
+  #   to_port   = 0
+  #   protocol  = "-1"
+  #   security_groups = ["${aws_security_group.emr-master.id}"]
+  # }
+
+  # ingress {
+  #   from_port = 0
+  #   to_port   = 0
+  #   protocol  = "-1"
+  #   self      = true
+  # }
+
+  # egress {
+  #   from_port   = 0
+  #   to_port     = 0
+  #   protocol    = "-1"
+  #   cidr_blocks = ["0.0.0.0/0"]
+  # }
+
+  lifecycle {
+    create_before_destroy = true
+  }
+}
+
+resource "aws_security_group_rule" "worker-jupyterhub" {
+  type                     = "ingress"
+  from_port                = 0
+  to_port                  = 0
+  protocol                 = "-1"
+  source_security_group_id = "${aws_security_group.jupyterhub.id}"
+
+  security_group_id = "${aws_security_group.emr-worker.id}"
+}
+
+resource "aws_security_group_rule" "worker-master" {
+  type                     = "ingress"
+  from_port                = 0
+  to_port                  = 0
+  protocol                 = "-1"
+  source_security_group_id = "${aws_security_group.emr-master.id}"
+
+  security_group_id = "${aws_security_group.emr-worker.id}"
+}
+
+resource "aws_security_group_rule" "workers-worker" {
+  type      = "ingress"
+  from_port = 0
+  to_port   = 0
+  protocol  = "-1"
+  self      = true
+
+  security_group_id = "${aws_security_group.emr-worker.id}"
+}
+
+resource "aws_security_group_rule" "worker-outgoing" {
+  type                     = "egress"
+  from_port                = 0
+  to_port                  = 0
+  protocol                 = "-1"
+  cidr_blocks              = ["0.0.0.0/0"]
+
+  security_group_id = "${aws_security_group.emr-worker.id}"
+}

Rename stage0 stage1 and others

Dockerfile.stage0 -> Dockerfile.build
Dockerfile.stage2 -> Dockerfile

blobs -> artifacts

scripts/blob-* -> scripts/artifact-<a name>

There should also be a write-up that explains the build process and the motivations for stage0 and stage2.

MIA: stage1, have you seen it?

libcurl Not Compiled Into GDAL

Evidently, libcurl is not compiled into GDAL. This prevents paths beginning with /vsicurl/ and /vsis3/ from being used.

Be Able to Run Script in the EMR Terminal

It would be nice if we could run scripts by sshing into EMR, uploading the desired script, and then running it. Right now, you need to go into JupyterHub and create a terminal there; once created, you'll need to export these variables before you can run the script.
