Comments (7)
Will keep this issue open to make sure we write up a full example though!
from tune-sklearn.
Great question :) It is entirely possible to deploy tune-sklearn on spot instances as you would do in standard Ray code.
We can put together a full example later (or simply add it to the tune documentation). I think failure handling should work out of the box.
In your script:
# sklearn.py
import ray
from tune_sklearn import TuneSearchCV

ray.init(address="auto")  # connect to the already-running Ray cluster
clf = TuneSearchCV(...)
clf.fit()
and
ray up <yaml>
ray submit <yaml> sklearn.py
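If you would rather drive those two commands from Python, a minimal sketch could look like the following. The helper names `build_commands` and `launch` are made up for illustration; only the `ray up` and `ray submit` invocations come from the comment above.

```python
import subprocess
from typing import List


def build_commands(cluster_yaml: str, script: str) -> List[List[str]]:
    """Return the two CLI invocations from the comment above, in order."""
    return [
        ["ray", "up", cluster_yaml],              # start (or update) the cluster
        ["ray", "submit", cluster_yaml, script],  # upload the script and run it on the head node
    ]


def launch(cluster_yaml: str, script: str) -> None:
    """Run each command in turn, stopping at the first non-zero exit code."""
    for cmd in build_commands(cluster_yaml, script):
        subprocess.run(cmd, check=True)
```

Calling `launch("ray_cluster.yaml", "sklearn.py")` assumes the `ray` CLI is on your PATH and your cloud credentials are already configured.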
Ah amazing!
I've reopened the issue for the full example. I'm still not completely sure how checkpointing and fault tolerance would be implemented using the tune-sklearn API.
Ah, I think tune-sklearn should actually handle that for you automatically.
I just looked into this, and I think some Ray Tune parameters are not supported yet (e.g., tune.run(max_retries=X)), so I'll submit a PR to add them.
Hypothetically, you can do something like this:
- Start with a YAML
# Save as `ray_cluster.yaml`
cluster_name: ray-tune-cluster
provider: {type: aws, region: us-west-2}
auth: {ssh_user: ubuntu}
min_workers: 3
max_workers: 3
# Deep Learning AMI (Ubuntu) Version 21.0
head_node:
    InstanceType: c5.xlarge
    ImageId: ami-0b294f219d14e6a82
    InstanceMarketOptions:
        MarketType: spot
worker_nodes:
    InstanceType: c5.xlarge
    ImageId: ami-0b294f219d14e6a82
    InstanceMarketOptions:
        MarketType: spot
- In any Tune script, you should only need to change two lines:
+ray.init(address="auto")
-search = TuneSearchCV(...)
+search = TuneSearchCV(..., max_retries=3)
- Run this on a Ray cluster. You'll be able to leverage multiple nodes for your search:
$ ray submit ray_cluster.yaml <your_script.py>
The above `submit` command is equivalent to:
$ ray attach ray_cluster.yaml # ssh
ubuntu@ip-172-31-24-53:~$ python <your_script.py>
It will also 1. retry failed trials and 2. restart failed nodes. @rohan-gt is this what you're thinking about?
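To make "retry failed trials" concrete, here is a rough pure-Python sketch of what a `max_retries`-style setting means for a single trial. This is not how Ray Tune or tune-sklearn implements it; `run_with_retries`, `make_flaky_trial`, and `TrialFailed` are hypothetical names used only for illustration, and the flaky trial stands in for a trial whose spot node gets preempted.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TrialFailed(Exception):
    """Raised when a trial still fails after every allowed attempt."""


def run_with_retries(trial: Callable[[], T], max_retries: int = 3) -> T:
    """Run `trial` once, then re-run it up to `max_retries` more times on failure."""
    last_error: Optional[Exception] = None
    for _ in range(max_retries + 1):  # one initial attempt plus the retries
        try:
            return trial()
        except Exception as err:  # any failure counts, e.g. a lost node
            last_error = err
    raise TrialFailed(f"trial failed after {max_retries + 1} attempts") from last_error


def make_flaky_trial(failures_before_success: int) -> Callable[[], float]:
    """Build a stand-in trial that fails a fixed number of times before returning a score."""
    state = {"calls": 0}

    def trial() -> float:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("node preempted")
        return 0.9  # placeholder score

    return trial
```

With `max_retries=3`, a trial that is interrupted twice still returns its score, while one that is interrupted five times in a row ends up as a failed trial.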
@richardliaw yes this is fine