
Comments (6)

lewfish commented on August 20, 2024

We also need a way to define the set of experiments we want to run. I'm not convinced we need anything fancy for this, but we might want to look at https://github.com/keplr-io/picard.

from raster-vision.

lossyrob commented on August 20, 2024

I believe I got regular docker running with the GPU, which gets around the challenge you mentioned for running on AWS Batch.

For AWS Batch, because we'd need a custom AMI, we'd need to run in an unmanaged compute environment.

Steps I took to get the GPU running in an ECS-optimized AMI instance (p2.xlarge):

  • Launch the latest ECS agent AMI [amzn-ami-2016.09.g-amazon-ecs-optimized]
  • SSH in, run the following:
> sudo yum groupinstall -y "Development Tools"
> version=364.19
> arch=`uname -m`
> sudo yum install -y wget
> wget http://us.download.nvidia.com/XFree86/Linux-${arch}/${version}/NVIDIA-Linux-${arch}-${version}.run
> srcs=`ls /usr/src/kernels`
> sudo bash ./NVIDIA-Linux-${arch}-${version}.run -silent --kernel-source-path /usr/src/kernels/${srcs}
> sudo reboot

Log back in, clone this repository.
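After logging back in, it may be worth confirming the kernel module actually loaded before touching docker. A minimal sketch, where driver_loaded is a hypothetical helper and /proc/driver/nvidia/version is the version file TensorFlow's diagnostics also probe:

```shell
# Hedged sketch: check whether the NVIDIA kernel driver is loaded by
# reading the version file the kernel exposes when the module is active.
driver_loaded() {
  # $1: path to the driver version file (normally /proc/driver/nvidia/version)
  [ -r "$1" ] && grep -q "NVRM" "$1"
}

if driver_loaded /proc/driver/nvidia/version; then
  echo "NVIDIA kernel driver is loaded"
else
  echo "NVIDIA kernel driver is NOT loaded"
fi
```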

  • Run the following patch on this repo:
diff --git a/scripts/run b/scripts/run
index 08914c1..b4a8869 100755
--- a/scripts/run
+++ b/scripts/run
@@ -27,10 +27,10 @@ then
             keras-semantic-segmentation-cpu "${@:2}"
     elif [ "${1:-}" = "--gpu" ]
     then
-        sudo nvidia-docker run --rm -it \
+        sudo docker run --rm -it \
             -v ~/keras-semantic-segmentation/src:/opt/src \
             -v ~/data:/opt/data \
-            002496907356.dkr.ecr.us-east-1.amazonaws.com/keras-semantic-segmentation-gpu "${@:2}"
+            --privileged -v /usr:/hostusr -v /lib:/hostlib keras-semantic-segmentation-gpu "${@:2}"
     else
         usage
     fi
diff --git a/src/Dockerfile-gpu b/src/Dockerfile-gpu
index d0d4ca2..c022276 100644
--- a/src/Dockerfile-gpu
+++ b/src/Dockerfile-gpu
@@ -20,4 +20,7 @@ USER root
 RUN mkdir /opt/data
 RUN chown -R keras:root /opt/data
 
-CMD ["bash"]
+COPY startup.sh /usr/local/bin/
+
+CMD ["/usr/local/bin/startup.sh"]
+

startup.sh looks like:

#!/bin/bash

set -e

echo "Copy the NVidia drivers from the parent (because nvidia-docker-plugin doesn't work with ECS agent)"
find /hostusr -name "*nvidia*" -o -name "*cuda*" -o -name "*GL*" | while read path
do
  newpath="/usr${path#/hostusr}"
  mkdir -p `dirname $newpath` && \
    cp -a $path $newpath
done

cp -ar /hostlib/modules /lib

echo "/usr/lib64" > /etc/ld.so.conf.d/nvidia.conf
ldconfig

echo "Starting your essential task"
exec /bin/bash

Remember to run chmod a+x src/startup.sh before building the container.
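The copy loop in startup.sh relies on bash prefix-stripping to rewrite host-mounted paths into the container's own tree; a minimal illustration of the expansion it uses:

```shell
# Illustration of the prefix rewrite used in startup.sh:
# ${path#/hostusr} strips the leading /hostusr, and /usr is prepended,
# so a library found under the bind mount lands in the container's /usr.
path=/hostusr/lib64/libcuda.so.1    # example host-mounted path
newpath="/usr${path#/hostusr}"
echo "$newpath"                      # -> /usr/lib64/libcuda.so.1
```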

  • From inside the container, run a python shell with the following:
>>> from tensorflow.python.client import device_lib as _device_lib
>>> _device_lib.list_local_devices()
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 9470809447589491728
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 11386087015
incarnation: 568561566696502057
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0"
]


lossyrob commented on August 20, 2024

Removed the operations tag because this issue isn't work on the Azavea Ops team's plate, but tagging @azavea/operations in case there is interest in tracking.


lossyrob commented on August 20, 2024

I've updated the comment above to reflect the current process; I was able to run this successfully with the reboot as the last step. The next step is to pull the setup into a cloud-config user data for the spot request, and attempt to run a container that sees the GPU successfully directly from instance startup.


lossyrob commented on August 20, 2024

With UserData

#cloud-config

runcmd:
  - sudo yum groupinstall -y "Development Tools"
  - sudo yum install -y wget
  - curl -o driver-install.run http://us.download.nvidia.com/XFree86/Linux-`uname -m`/364.19/NVIDIA-Linux-`uname -m`-364.19.run
  - sudo bash ./driver-install.run -silent --kernel-source-path /usr/src/kernels/`ls /usr/src/kernels`
  - sudo reboot
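The backtick substitutions in the curl line expand to the same download URL used in the manual wget step above; a sketch of the expansion on the instance:

```shell
# Sketch of how the driver download URL in the runcmd curl line is
# assembled from the architecture and driver version (no network needed).
version=364.19
arch=$(uname -m)    # e.g. x86_64 on a p2.xlarge
url="http://us.download.nvidia.com/XFree86/Linux-${arch}/${version}/NVIDIA-Linux-${arch}-${version}.run"
echo "$url"
```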

I can pull a container, run python and get

root@f9b54b3edfc1:/opt/src# python
Python 3.5.1 |Continuum Analytics, Inc.| (default, Jun 15 2016, 15:32:45) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib as _device_lib
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
>>> _device_lib.list_local_devices()
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.25GiB
Free memory: 11.16GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 10438605519399576444
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 11387098727
incarnation: 11180310221239512191
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0"
]
>>> 

Just to check, a p2.xlarge instance without the cloud-init file gives:

root@2d2f8ad8e131:/opt/src# python
Python 3.5.1 |Continuum Analytics, Inc.| (default, Jun 15 2016, 15:32:45) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib as _device_lib
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: 2d2f8ad8e131
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1077] LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1078] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
>>> _device_lib.list_local_devices()
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:140] kernel driver does not appear to be running on this host (2d2f8ad8e131): /proc/driver/nvidia/version does not exist
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 4842678244661302863
]
>>> 

with the same container.


whatnick commented on August 20, 2024

I have been testing the Neptune experiment platform. It seems like a good fit for batching multiple experiments.

