Comments (4)

Balandat commented on April 28, 2024

Hi @CCranney, thanks for your interest!

Fortunately, we added a Submitit tutorial for submitting jobs to Slurm just the other day: https://github.com/facebook/Ax/blob/main/tutorials/submitit.ipynb

> However, I cannot see where those kinds of error messages are ever created, and therefore have difficulty debugging the process. How would I see what errors caused the trials to fail?

In the PyTorch MOO NAS tutorial, the jobs run locally via TorchX, each in its own process, which makes debugging quite hard. I think there should be a way to set things up in TorchX to pipe the logs back to the TorchXRunner, or at least save them to disk, but I'm not sure how to do that. If you figure it out, please let us know; I think this is a question for the TorchX folks.
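The "save them to disk" idea can be illustrated in plain Python, independent of TorchX. The sketch below is only an assumption about what such a setup might look like: it uses the standard library's subprocess module, not any TorchX or TorchXRunner API, and the name run_and_save_logs is hypothetical.

```python
import subprocess
from pathlib import Path


def run_and_save_logs(cmd, log_file):
    """Run cmd in a child process, writing its stdout and stderr to log_file.

    Hypothetical helper for illustration only; this is not how TorchX
    launches jobs.
    """
    log_file = Path(log_file)
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with open(log_file, "w") as fh:
        # stderr=subprocess.STDOUT merges both streams into the same file,
        # so tracebacks end up next to the regular output.
        result = subprocess.run(cmd, stdout=fh, stderr=subprocess.STDOUT)
    return result.returncode
```

A call such as run_and_save_logs([sys.executable, "train.py"], "logs/trial_0.log"), with placeholder script and log names, would leave the child's traceback in the log file even though nothing is printed to the parent's terminal.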

> What changes would need to occur in a Runner object to be submitted as a batch job?

I haven't tried the TorchXRunner with remote job submission yet (we use a different kind of remote backend), so unfortunately I don't know the answer here. I would recommend starting from TorchX and checking whether you can run a Slurm job on the cluster using pure TorchX code. That should tell you which settings would need to be passed from the runner through to the respective TorchX logic. This might require some minor updates to the runner, and we'd be happy to help with those.

But if you're fine with using Submitit, I recommend you give the tutorial I mentioned above a try.

CCranney commented on April 28, 2024

Thank you @Balandat! This definitely set me off in the right direction. I'll be digging into the TorchXRunner question further, and if I get those answers I'll definitely reach out. Thank you for the tutorial and for the tips! I'll go ahead and close this issue.

Balandat commented on April 28, 2024

Great. Please do report back and share your solution :)

CCranney commented on April 28, 2024

Hi all,

While I have not settled on the best method for deploying jobs to the scheduler, I have landed on a debugging solution for investigating specific trial runs. My code is split into two files. The first uses Ax classes to set up experiments, trials, runners, and schedulers. It relies on a second file that uses pytorch-lightning to generate, train, and evaluate the models specified by the search space. All of this is outlined in the tutorial link in my first post.

Because the errors that never reach the screen/terminal come from the second script, I found a way to log all of that script's output to a text file. It's a workaround, but it works for identifying how or why specific trials failed.

I pasted the following code at the top of the second script. Note that I manually created the logs directory before running the code, since the script does not create it.

import io
import logging
import sys
from datetime import datetime


class StreamToLogger(io.TextIOBase):
    """File-like object that forwards each written line to a logger."""

    def __init__(self, logger, level=logging.INFO):
        self.logger = logger
        self.level = level

    def write(self, buf):
        # Log each line separately so multi-line output (e.g. tracebacks)
        # stays readable in the log file.
        for line in buf.rstrip().splitlines():
            self.logger.log(self.level, line.rstrip())
        # A file-like write() is expected to return the number of characters written.
        return len(buf)


# Configure the logging module; the logs/ directory must already exist
logging.basicConfig(filename=f'logs/output_{datetime.now().strftime("%Y%m%d-%H%M%S-%f")}.log', level=logging.INFO)

# Redirect stdout to the logger
stdout_logger = logging.getLogger('STDOUT')
sys.stdout = StreamToLogger(stdout_logger, logging.INFO)

# Redirect stderr to the logger
stderr_logger = logging.getLogger('STDERR')
sys.stderr = StreamToLogger(stderr_logger, logging.ERROR)

The saved files are named by date and time down to the microsecond, so that even rapidly generated jobs will not accidentally write to the same file. This may be overkill, but in my opinion too much info is better than not enough.
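A possible refinement of the setup above, offered as a sketch rather than as part of the original workaround: create the log directory automatically and add the process ID to the file name. The function name setup_trial_logging and the log format string are assumptions made for illustration, and the PID suffix assumes each trial runs in its own process.

```python
import logging
import os
from datetime import datetime
from pathlib import Path


def setup_trial_logging(log_dir="logs"):
    """Create log_dir if needed and send root-logger output to a unique file."""
    log_path = Path(log_dir)
    log_path.mkdir(parents=True, exist_ok=True)  # no manual mkdir needed

    # Timestamp plus process ID: trials started in the same microsecond
    # still get distinct names if they run in different processes.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    log_file = log_path / f"output_{stamp}_pid{os.getpid()}.log"

    # force=True (Python 3.8+) replaces any handlers already on the root logger.
    logging.basicConfig(
        filename=str(log_file),
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        force=True,
    )
    return log_file
```

Calling setup_trial_logging() in place of the logging.basicConfig line should leave the stdout and stderr redirection unchanged, since the STDOUT and STDERR loggers propagate to the root logger.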
