Comments (4)
Hi @CCranney, thanks for your interest!
Fortunately, we added a Submitit tutorial for submitting jobs to slurm just the other day: https://github.com/facebook/Ax/blob/main/tutorials/submitit.ipynb
However, I cannot see where those kinds of error messages are ever created, and therefore have difficulty debugging the process. How would I see what errors caused the trials to fail?
In the PyTorch MOO NAS tutorial, the jobs run locally via TorchX, each in its own process, which makes debugging quite hard. I think there should be a way to set things up in TorchX to pipe the logs back to the TorchXRunner, or at least save them to disk, but I'm not sure how to do that (if you figure it out, please let us know; I think this is a question for the TorchX folks).
What changes would need to occur in a Runner object to be submitted as a batch job?
I haven't tried the TorchXRunner with remote job submission yet (we use a different kind of remote backend), so unfortunately I don't know the answer here. I would recommend starting from TorchX and seeing whether you can use pure TorchX code to run a Slurm job on the cluster. That should tell you which settings would need to be piped through from the runner to the respective TorchX logic. This might require some minor updates to the runner, and we'd be happy to help with those.
But if you're fine with using Submitit, I recommend you give the tutorial I mentioned above a try.
Thank you @Balandat! This definitely set me off in the right direction. I'll be digging into the TorchXRunner question further, and if I get those answers I'll definitely reach out. Thank you for the tutorial and for the tips! I'll go ahead and close this issue.
Great. Please do report back and share your solution :)
Hi all,
While I have not settled on the best method for deploying jobs to the scheduler, I have landed on a debugging solution for investigating specific trial runs. My code is split into two files. The first uses Ax classes to set up experiments, trials, runners, and schedulers. It relies on a second file that uses pytorch-lightning to generate, train, and evaluate the models specified by the search space. All of this is outlined in the tutorial link in my first post.
Because the errors that never reach the screen/terminal come from the second script, I have found a way to log all of that script's output to a text file. It's a workaround, but it does show how and why specific trials failed.
I pasted the following code at the top of the second script. Note that I manually created a logs directory for it to save files to before running the code.
import logging
import sys
import io
from datetime import datetime


class StreamToLogger(io.TextIOBase):
    """File-like object that forwards each written line to a logger."""

    def __init__(self, logger, level=logging.INFO):
        self.logger = logger
        self.level = level

    def write(self, buf):
        # Log each line separately so multi-line output stays readable.
        for line in buf.rstrip().splitlines():
            self.logger.log(self.level, line.rstrip())
        # File-like write() methods return the number of characters written.
        return len(buf)


# Configure the logging module (the logs directory must already exist)
logging.basicConfig(filename=f'logs/output_{datetime.now().strftime("%Y%m%d-%H%M%S-%f")}.log', level=logging.INFO)

# Redirect stdout to the logger
stdout_logger = logging.getLogger('STDOUT')
sys.stdout = StreamToLogger(stdout_logger, logging.INFO)

# Redirect stderr to the logger
stderr_logger = logging.getLogger('STDERR')
sys.stderr = StreamToLogger(stderr_logger, logging.ERROR)
The saved files are marked by the date and time down to microseconds (to ensure that even rapidly-generated jobs will not accidentally write to the same file). This may be overkill, but better too much info than not enough in my opinion.
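One possible refinement, offered only as a sketch: the logs directory can be created automatically, and the process ID can be appended to the file name as a second guard against two trials writing to the same file. The helper names `make_trial_log_path` and `configure_trial_logging` are hypothetical and not part of Ax or TorchX. The `force=True` argument (Python 3.8+) is my assumption about what is wanted here: it replaces any handlers already attached to the root logger.

```python
import logging
import os
from datetime import datetime


def make_trial_log_path(log_dir="logs"):
    """Create log_dir if it is missing and return a per-process log file path.

    Hypothetical helper: the name and the file-name layout are illustrative.
    """
    os.makedirs(log_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    # The PID distinguishes processes that start within the same microsecond.
    return os.path.join(log_dir, f"output_{stamp}_pid{os.getpid()}.log")


def configure_trial_logging(log_dir="logs", level=logging.INFO):
    """Send root-logger output to a fresh file and return that file's path."""
    path = make_trial_log_path(log_dir)
    # force=True removes existing root handlers so this call always takes effect.
    logging.basicConfig(filename=path, level=level, force=True)
    return path
```

Calling `configure_trial_logging()` at the top of the second script would stand in for the `logging.basicConfig(...)` line above; the stdout and stderr redirection stays unchanged.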