
Tutorial Request: Deploying Runners on Clusters, Debugging Runners/Schedulers about ax HOT 4 CLOSED

CCranney commented on April 28, 2024

Tutorial Request: Deploying Runners on Clusters, Debugging Runners/Schedulers


Comments (4)

Balandat commented on April 28, 2024

Hi @CCranney, thanks for your interest!

Fortunately, we added a Submitit tutorial for submitting jobs to slurm just the other day: https://github.com/facebook/Ax/blob/main/tutorials/submitit.ipynb

> However, I cannot see where those kinds of error messages are ever created, and therefore have difficulty debugging the process. How would I see what errors caused the trials to fail?

In the PyTorch MOO NAS tutorial, the jobs run locally via TorchX, each in its own process, which makes debugging quite hard. I think there should be a way to set things up in TorchX to pipe the logs back to the TorchXRunner, or at least save them to disk, but I'm not sure how to do that (if you figure it out, please let us know; I think this is a question for the TorchX folks).
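
Independent of TorchX, one generic way to save a job's logs to disk is to launch the trial script as a child process with its stdout and stderr redirected to a file. The sketch below uses only the Python standard library and is not TorchX API; `run_with_log` and the log path are hypothetical names chosen for illustration.

```python
import subprocess
import sys
from pathlib import Path


def run_with_log(cmd, log_path):
    """Run cmd as a child process, writing its stdout and stderr to log_path.

    Hypothetical helper for illustration. Returns the child's exit code.
    """
    log_path = Path(log_path)
    # Create the log directory if it does not exist yet
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "w") as log_file:
        # stderr=subprocess.STDOUT merges both streams into the same file
        completed = subprocess.run(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    return completed.returncode
```

For example, `run_with_log([sys.executable, "train.py"], "logs/trial_0.log")` (with `train.py` standing in for your own trial script) would leave any traceback from a failed trial in `logs/trial_0.log`, and a non-zero return value signals that the trial failed.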

> What changes would need to occur in a Runner object to be submitted as a batch job?

I haven't tried the TorchXRunner with remote job submission yet (we use a different kind of remote backend), so unfortunately I don't know the answer here. I would recommend starting from TorchX and seeing whether you can use pure TorchX code to run a Slurm job on the cluster. That should tell you which settings need to be piped through from the runner to the respective TorchX logic (this might require some minor updates to the runner; we'd be happy to help with those).

But if you're fine with using Submitit, I recommend you give the tutorial I mentioned above a try.


CCranney commented on April 28, 2024

Thank you @Balandat! This definitely set me off in the right direction. I'll dig further into the TorchXRunner question, and if I find answers I'll reach out. Thank you for the tutorial and for the tips! I'll go ahead and close this issue.


Balandat commented on April 28, 2024

Great. Please do report back and share your solution :)


CCranney commented on April 28, 2024

Hi all,

While I have not settled on the best method for deploying jobs to the scheduler, I have landed on a debugging solution for investigating specific trial runs. My code is split into two files. The first uses Ax classes to set up the experiment, trials, runner, and scheduler. It relies on a second file that uses pytorch-lightning to generate, train, and evaluate the models specified by the search space. All of this is outlined in the tutorial linked in my first post.

Because errors raised in the second script are not printed to the screen/terminal, I found a way to log all of its output to a text file. It's a workaround, but it works for identifying how or why specific trials failed.

I pasted the following code at the top of the second script. Note that I manually created a log directory for it to save files to prior to running the code.

import logging
import sys
import io
from datetime import datetime

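class StreamToLogger(io.TextIOBase):
    """File-like object that forwards each written line to a logger."""

    def __init__(self, logger, level=logging.INFO):
        self.logger = logger
        self.level = level

    def write(self, buf):
        # Log each line of the written chunk at the configured level
        for line in buf.rstrip().splitlines():
            self.logger.log(self.level, line.rstrip())
        # A file-like write() should return the number of characters written
        return len(buf)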

# Configure the logging module
logging.basicConfig(filename=f'logs/output_{datetime.now().strftime("%Y%m%d-%H%M%S-%f")}.log', level=logging.INFO)

# Redirect stdout to the logger
stdout_logger = logging.getLogger('STDOUT')
sys.stdout = StreamToLogger(stdout_logger, logging.INFO)

# Redirect stderr to the logger
stderr_logger = logging.getLogger('STDERR')
sys.stderr = StreamToLogger(stderr_logger, logging.ERROR)

The saved files are marked by the date and time down to microseconds (to ensure that even rapidly-generated jobs will not accidentally write to the same file). This may be overkill, but better too much info than not enough in my opinion.
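
A variant of the same idea, sketched with only the standard library: a context manager that creates the log directory itself, sends stdout and stderr to a timestamped file, and restores both streams when the block exits. `capture_output` is a hypothetical name, and this is a sketch rather than a drop-in replacement for the snippet above. Note that `contextlib.redirect_stdout` only affects Python-level writes to `sys.stdout`, not output from child processes or C extensions.

```python
import contextlib
import os
import traceback
from datetime import datetime


@contextlib.contextmanager
def capture_output(log_dir="logs"):
    """Send stdout and stderr to a fresh timestamped file inside log_dir.

    Hypothetical helper for illustration. Yields the path of the log file.
    """
    os.makedirs(log_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    path = os.path.join(log_dir, f"output_{stamp}.log")
    with open(path, "w") as log_file:
        with contextlib.redirect_stdout(log_file), contextlib.redirect_stderr(log_file):
            try:
                yield path
            except Exception:
                # Write the traceback to the log, then re-raise so the trial still fails
                traceback.print_exc()
                raise
```

Wrapping the training entry point in `with capture_output("logs"):` keeps the timestamped file naming while leaving the streams untouched outside the block.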
