fostiropoulos / ablator

Model Ablation Tool-Kit for Deep Learning Model

Home Page: https://docs.ablator.org

License: GNU General Public License v3.0

Languages: Python 99.31%, Makefile 0.21%, Shell 0.25%, Dockerfile 0.09%, Jupyter Notebook 0.13%
Topics: ablation, deep-learning, machine-learning, torch

ablator's People

Contributors

apneetha, fostiropoulos, hieuchi911, hjzccc, mrlyg, seanxiaoby

ablator's Issues

Renaming "trial_id" to "trial_num" in /main/state/state.py

In state.py, there are some statements like:

  • trial_id = int(trial.trial_num)
  • stmt = select(Trial).where(Trial.trial_num == trial_id)

The variable naming is misleading, since trial_id is the entry id in the DB and trial_num is the number associated with the sampler.
Therefore, trial_id in the code should be renamed to trial_num.
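
A minimal before/after sketch of the suggested rename (the surrounding state.py code, including the Trial model and the fetched trial row, is assumed here, not copied from ablator):

from sqlalchemy import select

# `Trial` is ablator's ORM model and `trial` a row already fetched from the DB (both assumed).
trial_num = int(trial.trial_num)                           # was: trial_id = int(trial.trial_num)
stmt = select(Trial).where(Trial.trial_num == trial_num)   # was: ... == trial_id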

PermissionError when passing path to init_chkpt config param

What happened:

  • Run a parallel experiment that initializes models from a checkpoint via the init_chkpt config param
  • Error message:
2023-07-14 14:11:16: Initializing model weights ONLY from checkpoint. \tmp\experiments_\experiment_7046_9991\0ee8_9991\best_checkpoints
Traceback (most recent call last):
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\test_init_chkpt.py", line 79, in <module>
    metrics = ablator.launch(working_directory = os.getcwd(), ray_head_address="auto")
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\mp.py", line 542, in launch
    self._init_state(
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\mp.py", line 410, in _init_state
    super()._init_state()
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\proto.py", line 62, in _init_state
    mock_wrapper._init_state(run_config=copy.deepcopy(self.run_config), debug=True)
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\model\main.py", line 548, in _init_state
    self._init_model_state(resume, smoke_test)
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\model\main.py", line 498, in _init_model_state
    self._load_model(self.current_checkpoint, model_only=True)
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\model\main.py", line 623, in _load_model
    save_dict = torch.load(checkpoint_path, map_location="cpu")
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork\venv\lib\site-packages\torch\serialization.py", line 771, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork\venv\lib\site-packages\torch\serialization.py", line 270, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork\venv\lib\site-packages\torch\serialization.py", line 251, in __init__
    super(_open_file, self).__init__(open(name, mode))
PermissionError: [Errno 13] Permission denied: '\\tmp\\experiments_\\experiment_7046_9991\\0ee8_9991\\best_checkpoints'

Version/ dependencies:

  • python 3.10.6
  • ray 2.2.0
  • local OS: windows
  • ray cluster also on local machine

Reproduction script

test_init_chkpt.py

from ablator import ModelConfig, OptimizerConfig, TrainConfig, RunConfig, ParallelConfig
from ablator import ModelWrapper, ParallelTrainer
from ablator.main.configs import SearchSpace

import torch
import torch.nn as nn

import shutil
import os

optimizer_config = OptimizerConfig(name="sgd", arguments={"lr": 0.1})

train_config = TrainConfig(
    dataset="test",
    batch_size=128,
    epochs=2,
    optimizer_config=optimizer_config,
    scheduler_config=None,
)

search_space = {
    "train_config.optimizer_config.arguments.lr": SearchSpace(
        value_range=[0, 10], value_type="int"
    ),
}

run_config = ParallelConfig(
    train_config=train_config,
    model_config=ModelConfig(),
    metrics_n_batches = 800,
    experiment_dir = "/tmp/experiments/",
    device="cuda",
    amp=True,
    random_seed = 42,
    total_trials = 5,
    concurrent_trials = 3,
    search_space = search_space,
    optim_metrics = {"val_loss": "min"},
    gpu_mb_per_experiment = 1024,
    cpus_per_experiment = 1,
    init_chkpt="<path-to-checkpoint>"
)

class MyCustomModel(nn.Module):
    def __init__(self, config: ModelConfig) -> None:
        super().__init__()
        self.lr = 0.01
        self.param = nn.Parameter(torch.ones(100))
        self.itr = 0

    def forward(self, x: torch.Tensor):
        x = self.param + torch.rand_like(self.param) * 0.01
        self.itr += 1
        if self.itr > 10 and self.lr > 5:
            raise Exception("large lr.")
        return {"preds": x}, x.sum().abs()

class TestWrapper(ModelWrapper):
    def make_dataloader_train(self, run_config: RunConfig):
        dl = [torch.rand(100) for i in range(100)]
        return dl

    def make_dataloader_val(self, run_config: RunConfig):
        dl = [torch.rand(100) for i in range(100)]
        return dl

# Start from a clean slate: create the experiment directory if missing, then remove it.
if not os.path.exists(run_config.experiment_dir):
    shutil.os.mkdir(run_config.experiment_dir)

shutil.rmtree(run_config.experiment_dir)

wrapper = TestWrapper(MyCustomModel)

ablator = ParallelTrainer(
    wrapper=wrapper,
    run_config=run_config,
)

metrics = ablator.launch(working_directory = os.getcwd(), ray_head_address="auto")

To reproduce the error:

  1. ray start --head
  2. python test_init_chkpt.py

Default rand_weights_init

  1. In the TrainConfig class, rand_weights_init is set to True by default, which re-initializes the weights of all modules in create_model() in ModelWrapper. This is problematic when using a pre-trained model (a workaround sketch follows this list).
  2. In model.apply(butils.init_weights) in create_model(), layers are re-initialized with std=1, which produces outputs with very large absolute values.
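
If rand_weights_init is meant to be the toggle for this behavior, a minimal workaround sketch for point 1 (assuming it is a plain boolean field on TrainConfig, as described above; the other arguments mirror the reproduction script earlier on this page) is:

from ablator import OptimizerConfig, TrainConfig

optimizer_config = OptimizerConfig(name="sgd", arguments={"lr": 0.1})

# Sketch only: rand_weights_init=False is assumed to skip the re-initialization
# performed in ModelWrapper.create_model(), keeping pre-trained weights intact.
train_config = TrainConfig(
    dataset="test",
    batch_size=128,
    epochs=2,
    optimizer_config=optimizer_config,
    scheduler_config=None,
    rand_weights_init=False,
)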

Misleading naming of optim_direction in parse_metrics() in main.mp

The naming of the optim_direction parameter in the parse_metrics() function is misleading: at the call site metrics = parse_metrics(list(self.run_config.optim_metrics.keys()), metrics), the argument self.run_config.optim_metrics.keys() returns metric names, not optimization directions.
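
A hedged sketch of a clearer signature (the parameter name below is only a suggestion, not ablator's current API, and the body is elided):

# Suggested rename only; the actual parse_metrics implementation is unchanged.
def parse_metrics(optim_metric_names: list[str], metrics: dict) -> dict:
    """optim_metric_names receives run_config.optim_metrics.keys(),
    i.e. metric names rather than optimization directions."""
    ...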

Unused parameter in init_scheduler() of scheduler config classes

  1. The init_scheduler() method in all scheduler config classes declares model: nn.Module as a parameter, but it is never used when initializing the schedulers. Changing it to **kwargs or *args might be a reasonable solution (see the sketch below).
  2. The optimizer parameter should be of type Optimizer.
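
A hedged sketch of the suggested signature change (the class name and arguments field are illustrative assumptions, not ablator's actual scheduler config code):

from torch.optim import Optimizer, lr_scheduler

class StepLRConfig:
    arguments = {"step_size": 1, "gamma": 0.99}

    def init_scheduler(self, optimizer: Optimizer, *args, **kwargs):
        # The unused model: nn.Module parameter is dropped; extra arguments are
        # absorbed by *args/**kwargs, and optimizer is typed as
        # torch.optim.Optimizer as suggested in point 2.
        return lr_scheduler.StepLR(optimizer, **self.arguments)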

[Feature Request] Support for black-box functions

In principle, nothing prevents ablator from being used with any black-box function.

The two key components that would need to be addressed are:

  • Experiment Persistence: a mechanism for checkpointing an experiment and restoring its state in case of a failure.
  • Fault Tolerance: identifying common failure cases of a black-box function, such as CPU over-utilization or out-of-memory errors, and gracefully terminating such experiments.

We need to discuss methods to address the issues mentioned (and any others) in order to support black-box functions, and to evaluate the feasibility.

The goal is to provide at least partial support in the next release of ablator.
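
A minimal sketch of the experiment-persistence idea for an arbitrary black-box function (none of this is ablator API; the state layout and helper name are purely illustrative):

import pickle
from pathlib import Path

def run_black_box(fn, state_path: Path, max_steps: int = 100):
    # Restore state if a previous run of this trial was interrupted.
    if state_path.exists():
        state = pickle.loads(state_path.read_bytes())
    else:
        state = {"step": 0, "best": None}

    while state["step"] < max_steps:
        state["best"] = fn(state)  # one opaque evaluation of the black-box function
        state["step"] += 1
        # Persist after every step so a crash loses at most one evaluation.
        state_path.write_bytes(pickle.dumps(state))
    return state["best"]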

Bug with wrapper accessing SchedulerConfig

In wrapper._train_evaluation_step(), the condition

if (
            self.scheduler is not None
            and hasattr(self.train_config.scheduler_config, "step_when")
            and self.train_config.scheduler_config.step_when == "val"
        ):

is always False; it should instead be

if (
            self.scheduler is not None
            and hasattr(self.train_config.scheduler_config.arguments, "step_when")
            and self.train_config.scheduler_config.arguments.step_when == "val"
        ):

Need to fix: Append metrics to results.json and keep the right format forever.

The issue is that we append to results.json, so each line is supposed to be a JSON object. As the file grows dynamically, it is difficult to encode it efficiently as a single JSON document, because we would need to re-encode all of results.json at each step. An alternative is to wrap the results into an array, i.e. [ results ], and when appending to results.json, overwrite the last two rows to become ,results+1].

There are several considerations I found problematic in practice, such as a mid-stream interruption while writing to results.json (i.e. not writing the entire row). We would need to perform some sort of synchronized (atomic) write (not sure of the exact name off the top of my head). Can you please try to implement this and write tests for the above scenario and any additional ones? Please use a dedicated branch for this.
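
A minimal sketch of the array-wrapping idea, assuming each result is a JSON-serializable dict (this is not ablator code, and fsync only narrows the window for a partially written row rather than eliminating it):

import json
import os

def append_result(path: str, record: dict) -> None:
    encoded = json.dumps(record)
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        # First record: write a complete, valid JSON array in one shot.
        with open(path, "w", encoding="utf-8") as f:
            f.write("[" + encoded + "]")
        return
    with open(path, "rb+") as f:
        # Overwrite the trailing "]" with ",<record>]" so the file stays valid
        # JSON after every append instead of re-encoding the whole file.
        f.seek(-1, os.SEEK_END)
        f.write(b"," + encoded.encode("utf-8") + b"]")
        f.flush()
        os.fsync(f.fileno())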

Incorrect functional call run_config.gcp_config.rsync_down_node

In ablator.main.mp.py, the _rsync_nodes() method makes a call on the GCP config object: self.run_config.gcp_config.rsync_down_node(hostname, self.experiment_dir, self.logger), but the GcpConfig class in ablator.modules.storage.cloud.py only defines a method named rsync_down_nodes. Is this a bug?

Typo in modules/scheduler.py

In modules/scheduler.py, the config class for the ReduceLROnPlateau scheduler has a typo: it is named PlateuaConfig, while it should be PlateauConfig.
