Comments (5)
- I'm not sure introducing all of the overhead around baking an AMI is going to be worth it just to bake in the `nvidia-docker` dependency. That archive is ~2MB and decompresses to a single binary. Downloading and decompressing that on an EC2 instance should take a few seconds end-to-end.
- I think it would be good to make the `aws ec2 wait` thing use tags vs. keypair names. That looks like an option supported by `--filter`.
- Terminating the instances from within an EC2 instance could be tricky because I think we'd actually have to terminate the Spot Fleet request (doing that also indirectly messes with Terraform's state of the world).
- Not entirely following the last solutions to processing work in parallel, but as a related note, `instanceID` is already available via instance metadata.
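
To make the tag suggestion concrete, here is a minimal sketch of building an `aws ec2 wait` invocation that selects instances by tag rather than by keypair name. The AWS CLI spells the option `--filters`, and `tag:<key>` is the filter name for a tag value. The helper name and the `Project=raster-vision` tag are hypothetical, chosen only for illustration.

```python
def build_wait_command(tag_key, tag_value, state="instance-running"):
    """Build an `aws ec2 wait` argument list that selects instances by tag."""
    # `tag:<key>` is the EC2 filter name that matches on a tag's value.
    tag_filter = f"Name=tag:{tag_key},Values={tag_value}"
    return ["aws", "ec2", "wait", state, "--filters", tag_filter]


# Hypothetical tag key and value, used only for illustration.
command = build_wait_command("Project", "raster-vision")
print(" ".join(command))
```

The list form can be handed to `subprocess.run(command, check=True)` on a machine with the AWS CLI configured. This only helps if the instances actually carry the tag.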
from raster-vision.
> I'm not sure introducing all of the overhead around baking an AMI is going to be worth it just to bake in the `nvidia-docker` dependency. That archive is ~2MB and decompresses to a single binary. Downloading and decompressing that on an EC2 instance should take a few seconds end-to-end.

I don't think we need to bake our own AMI with `nvidia-docker` on it. We can just use an existing AMI that has it. But if it only takes a few seconds to install it, then it doesn't matter.

> I think it would be good to make the `aws ec2 wait` thing use tags vs. keypair names. That looks like an option supported by `--filter`.

Tags make more sense, but Terraform doesn't let us associate tags with instances created using a spot fleet request. See hashicorp/terraform#3263.

> Not entirely following the last solutions to processing work in parallel, but as a related note, `instanceID` is already available via instance metadata.

True, but each worker needs to know its index (i.e., a number between 0 and n-1 for n workers) to figure out which batch jobs to run. I don't think we can turn the instance ID into a worker index.
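
As an illustration of why the index matters, here is a minimal sketch of how a worker could pick its share of the batch jobs from its index alone. The round-robin split and the names `jobs_for_worker` and `batch_jobs` are assumptions for the example, not the project's actual scheme.

```python
def jobs_for_worker(jobs, worker_index, num_workers):
    """Return the jobs assigned to one worker using a round-robin split."""
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index must be between 0 and num_workers - 1")
    # Worker i takes jobs i, i + n, i + 2n, ... so every job has exactly one owner.
    return jobs[worker_index::num_workers]


# Hypothetical job names, used only for illustration.
batch_jobs = [f"experiment-{i}" for i in range(7)]
assignments = [jobs_for_worker(batch_jobs, w, 3) for w in range(3)]
for worker, assigned in enumerate(assignments):
    print(worker, assigned)
```

Each worker needs only its own index and the total worker count, so no coordination between instances is required.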
from raster-vision.
Good point about #3263.
Regarding an index for workers, there is also `ami-launch-index` via instance metadata. Not entirely sure what value that returns when multiple instances get launched from a Spot Fleet, though.
I guess my high level concern is that we try to make use of what's already there (if it applies) via instance metadata vs. supplying and managing our own identifiers.
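
A minimal sketch of reading that value from the instance metadata service, assuming the standard `http://169.254.169.254/latest/meta-data/` endpoint and plain unauthenticated requests (instances that enforce IMDSv2 would also need a session token, which this omits). The function names and the `opener` parameter are hypothetical. As noted above, it is unclear whether `ami-launch-index` is unique across a Spot Fleet, so treat the result as unverified.

```python
from urllib.request import urlopen

# Link-local address of the EC2 instance metadata service.
METADATA_URL = "http://169.254.169.254/latest/meta-data/"


def fetch_metadata(key, opener=urlopen, timeout=2):
    """Fetch one metadata value as text; `opener` can be swapped out for testing."""
    with opener(METADATA_URL + key, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def launch_index(opener=urlopen):
    """Return this instance's ami-launch-index as an integer."""
    return int(fetch_metadata("ami-launch-index", opener=opener))


# On an EC2 instance:
#   launch_index()                 # launch index as an int
#   fetch_metadata("instance-id")  # the instance ID mentioned earlier
```

Nothing here runs at import time, so the module can be loaded off EC2 and exercised with a stub opener.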
from raster-vision.
I'm thinking about using AWS Batch to run lots of experiments in parallel. Does that sound ok? One issue is that Batch uses ECS, and ECS doesn't know about nvidia-docker. There's a workaround to be able to use the GPU even when running using regular docker in ECS: https://blog.cloudsight.ai/deep-learning-image-recognition-using-gpus-in-amazon-ecs-docker-containers-5bdb1956f30e#.mau60bfvo
from raster-vision.
I'm moving the conversation about parallelizing experiments to #10
from raster-vision.