
Comments (6)

lewfish commented on August 20, 2024

We also need a way to define the set of experiments we want to run. I'm not convinced we need anything fancy for this, but we might want to look at https://github.com/keplr-io/picard.

from raster-vision.

lossyrob commented on August 20, 2024

I believe I got regular docker running with the GPU, which gets around the challenge you mentioned for running on AWS Batch.

For AWS Batch, because we'd need a custom AMI, we'd need to run in an unmanaged compute environment.

Steps I took to get the GPU running in an ECS-optimized AMI instance (p2.xlarge):

  • Launch the latest ECS agent AMI [amzn-ami-2016.09.g-amazon-ecs-optimized]
  • SSH in, run the following:
> sudo yum groupinstall -y "Development Tools"
> version=364.19
> arch=`uname -m`
> sudo yum install -y wget
> wget http://us.download.nvidia.com/XFree86/Linux-${arch}/${version}/NVIDIA-Linux-${arch}-${version}.run
> srcs=`ls /usr/src/kernels`
> sudo bash ./NVIDIA-Linux-${arch}-${version}.run -silent --kernel-source-path /usr/src/kernels/${srcs}
> sudo reboot

Log back in, clone this repository.
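After logging back in, it may be worth confirming the kernel module actually loaded before touching docker. A minimal sketch, where driver_loaded is a hypothetical helper and /proc/driver/nvidia/version is the version file TensorFlow's diagnostics also probe:

```shell
# Hedged sketch: check whether the NVIDIA kernel driver is loaded by
# reading the version file the kernel exposes when the module is active.
driver_loaded() {
  # $1: path to the driver version file (normally /proc/driver/nvidia/version)
  [ -r "$1" ] && grep -q "NVRM" "$1"
}

if driver_loaded /proc/driver/nvidia/version; then
  echo "NVIDIA kernel driver is loaded"
else
  echo "NVIDIA kernel driver is NOT loaded"
fi
```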

  • Run the following patch on this repo:
diff --git a/scripts/run b/scripts/run
index 08914c1..b4a8869 100755
--- a/scripts/run
+++ b/scripts/run
@@ -27,10 +27,10 @@ then
             keras-semantic-segmentation-cpu "${@:2}"
     elif [ "${1:-}" = "--gpu" ]
     then
-        sudo nvidia-docker run --rm -it \
+        sudo docker run --rm -it \
             -v ~/keras-semantic-segmentation/src:/opt/src \
             -v ~/data:/opt/data \
-            002496907356.dkr.ecr.us-east-1.amazonaws.com/keras-semantic-segmentation-gpu "${@:2}"
+            --privileged -v /usr:/hostusr -v /lib:/hostlib keras-semantic-segmentation-gpu "${@:2}"
     else
         usage
     fi
diff --git a/src/Dockerfile-gpu b/src/Dockerfile-gpu
index d0d4ca2..c022276 100644
--- a/src/Dockerfile-gpu
+++ b/src/Dockerfile-gpu
@@ -20,4 +20,7 @@ USER root
 RUN mkdir /opt/data
 RUN chown -R keras:root /opt/data
 
-CMD ["bash"]
+COPY startup.sh /usr/local/bin/
+
+CMD ["/usr/local/bin/startup.sh"]
+

startup.sh looks like:

#!/bin/bash

set -e

echo "Copy the NVidia drivers from the parent (because nvidia-docker-plugin doesn't work with ECS agent)"
find /hostusr -name "*nvidia*" -o -name "*cuda*" -o -name "*GL*" | while read path
do
  newpath="/usr${path#/hostusr}"
  mkdir -p `dirname $newpath` && \
    cp -a $path $newpath
done

cp -ar /hostlib/modules /lib

echo "/usr/lib64" > /etc/ld.so.conf.d/nvidia.conf
ldconfig

echo "Starting your essential task"
exec /bin/bash

Remember to run chmod a+x src/startup.sh before building the container.
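The copy loop in startup.sh relies on bash prefix-stripping to rewrite host-mounted paths into the container's own tree; a minimal illustration of the expansion it uses:

```shell
# Illustration of the prefix rewrite used in startup.sh:
# ${path#/hostusr} strips the leading /hostusr, and /usr is prepended,
# so a library found under the bind mount lands in the container's /usr.
path=/hostusr/lib64/libcuda.so.1    # example host-mounted path
newpath="/usr${path#/hostusr}"
echo "$newpath"                      # -> /usr/lib64/libcuda.so.1
```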

  • From inside the container, run a python shell with the following:
>>> from tensorflow.python.client import device_lib as _device_lib
>>> _device_lib.list_local_devices()
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 9470809447589491728
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 11386087015
incarnation: 568561566696502057
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0"
]


lossyrob commented on August 20, 2024

Removed the operations tag because this issue isn't work on the Azavea Ops team's plate, but tagging @azavea/operations in case there is interest in tracking.


lossyrob commented on August 20, 2024

I've updated the comment above to reflect the current process; I was able to run this successfully with the reboot as the last step. The next step is to pull the setup into a cloud-config user data for the spot request, and attempt to run a container that sees the GPU successfully directly from instance startup.


lossyrob commented on August 20, 2024

With UserData

#cloud-config

runcmd:
  - sudo yum groupinstall -y "Development Tools"
  - sudo yum install -y wget
  - curl -o driver-install.run http://us.download.nvidia.com/XFree86/Linux-`uname -m`/364.19/NVIDIA-Linux-`uname -m`-364.19.run
  - sudo bash ./driver-install.run -silent --kernel-source-path /usr/src/kernels/`ls /usr/src/kernels`
  - sudo reboot
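The backtick substitutions in the curl line expand to the same download URL used in the manual wget step above; a sketch of the expansion on the instance:

```shell
# Sketch of how the driver download URL in the runcmd curl line is
# assembled from the architecture and driver version (no network needed).
version=364.19
arch=$(uname -m)    # e.g. x86_64 on a p2.xlarge
url="http://us.download.nvidia.com/XFree86/Linux-${arch}/${version}/NVIDIA-Linux-${arch}-${version}.run"
echo "$url"
```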

I can pull a container, run python and get

root@f9b54b3edfc1:/opt/src# python
Python 3.5.1 |Continuum Analytics, Inc.| (default, Jun 15 2016, 15:32:45) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib as _device_lib
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
>>> _device_lib.list_local_devices()
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.25GiB
Free memory: 11.16GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 10438605519399576444
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 11387098727
incarnation: 11180310221239512191
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0"
]
>>> 

Just to check, a p2.xlarge instance without the cloud-init file gives:

root@2d2f8ad8e131:/opt/src# python
Python 3.5.1 |Continuum Analytics, Inc.| (default, Jun 15 2016, 15:32:45) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib as _device_lib
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: 2d2f8ad8e131
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1077] LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1078] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
>>> _device_lib.list_local_devices()
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:140] kernel driver does not appear to be running on this host (2d2f8ad8e131): /proc/driver/nvidia/version does not exist
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 4842678244661302863
]
>>> 

with the same container.


whatnick commented on August 20, 2024

I have been testing the Neptune experiment platform. It seems like a good fit for batching multiple experiments.

