fostiropoulos / ablator

Model Ablation Tool-Kit for Deep Learning Model

Home Page: https://docs.ablator.org

License: GNU General Public License v3.0

Languages: Python 99.31%, Makefile 0.21%, Shell 0.25%, Dockerfile 0.09%, Jupyter Notebook 0.13%
Topics: ablation, deep-learning, machine-learning, torch

ablator's People

Contributors

apneetha, fostiropoulos, hieuchi911, hjzccc, mrlyg, seanxiaoby

ablator's Issues

Renaming "trial_id" to "trial_num" in /main/state/state.py

In state.py, there are some statements like:

  • trial_id = int(trial.trial_num)
  • stmt = select(Trial).where(Trial.trial_num == trial_id)

The variable naming is misleading, since trial_id is the entry id in the DB and trial_num is the number associated with the sampler.
Therefore, trial_id in the code should be renamed to trial_num.
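
A minimal before/after sketch of the suggested rename (the surrounding state.py code, including the Trial model and the fetched trial row, is assumed here, not copied from ablator):

from sqlalchemy import select

# `Trial` is ablator's ORM model and `trial` a row already fetched from the DB (both assumed).
trial_num = int(trial.trial_num)                           # was: trial_id = int(trial.trial_num)
stmt = select(Trial).where(Trial.trial_num == trial_num)   # was: ... == trial_id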

PermissionError when passing path to init_chkpt config param

What happened:

  • Run a parallel experiment that initializes models from a checkpoint via the init_chkpt config param
  • Error message:
2023-07-14 14:11:16: Initializing model weights ONLY from checkpoint. \tmp\experiments_\experiment_7046_9991\0ee8_9991\best_checkpoints
Traceback (most recent call last):
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\test_init_chkpt.py", line 79, in <module>
    metrics = ablator.launch(working_directory = os.getcwd(), ray_head_address="auto")
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\mp.py", line 542, in launch
    self._init_state(
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\mp.py", line 410, in _init_state
    super()._init_state()
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\proto.py", line 62, in _init_state
    mock_wrapper._init_state(run_config=copy.deepcopy(self.run_config), debug=True)
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\model\main.py", line 548, in _init_state
    self._init_model_state(resume, smoke_test)
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\model\main.py", line 498, in _init_model_state
    self._load_model(self.current_checkpoint, model_only=True)
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\model\main.py", line 623, in _load_model
    save_dict = torch.load(checkpoint_path, map_location="cpu")
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork\venv\lib\site-packages\torch\serialization.py", line 771, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork\venv\lib\site-packages\torch\serialization.py", line 270, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork\venv\lib\site-packages\torch\serialization.py", line 251, in __init__
    super(_open_file, self).__init__(open(name, mode))
PermissionError: [Errno 13] Permission denied: '\\tmp\\experiments_\\experiment_7046_9991\\0ee8_9991\\best_checkpoints'

Version/ dependencies:

  • python 3.10.6
  • ray 2.2.0
  • local OS: windows
  • ray cluster also on local machine

Reproduction script

test_init_chkpt.py

from ablator import ModelConfig, OptimizerConfig, TrainConfig, RunConfig, ParallelConfig
from ablator import ModelWrapper, ParallelTrainer
from ablator.main.configs import SearchSpace

import torch
import torch.nn as nn

import shutil
import os

optimizer_config = OptimizerConfig(name="sgd", arguments={"lr": 0.1})

train_config = TrainConfig(
    dataset="test",
    batch_size=128,
    epochs=2,
    optimizer_config=optimizer_config,
    scheduler_config=None,
)

search_space = {
    "train_config.optimizer_config.arguments.lr": SearchSpace(
        value_range=[0, 10], value_type="int"
    ),
}

run_config = ParallelConfig(
    train_config=train_config,
    model_config=ModelConfig(),
    metrics_n_batches = 800,
    experiment_dir = "/tmp/experiments/",
    device="cuda",
    amp=True,
    random_seed = 42,
    total_trials = 5,
    concurrent_trials = 3,
    search_space = search_space,
    optim_metrics = {"val_loss": "min"},
    gpu_mb_per_experiment = 1024,
    cpus_per_experiment = 1,
    init_chkpt="<path-to-checkpoint>"
)

class MyCustomModel(nn.Module):
    def __init__(self, config: ModelConfig) -> None:
        super().__init__()
        self.lr = 0.01
        self.param = nn.Parameter(torch.ones(100))
        self.itr = 0

    def forward(self, x: torch.Tensor):
        x = self.param + torch.rand_like(self.param) * 0.01
        self.itr += 1
        if self.itr > 10 and self.lr > 5:
            raise Exception("large lr.")
        return {"preds": x}, x.sum().abs()

class TestWrapper(ModelWrapper):
    def make_dataloader_train(self, run_config: RunConfig):
        dl = [torch.rand(100) for i in range(100)]
        return dl

    def make_dataloader_val(self, run_config: RunConfig):
        dl = [torch.rand(100) for i in range(100)]
        return dl

# Start from a clean slate: create the experiment directory if missing, then remove it.
if not os.path.exists(run_config.experiment_dir):
    shutil.os.mkdir(run_config.experiment_dir)

shutil.rmtree(run_config.experiment_dir)

wrapper = TestWrapper(MyCustomModel)

ablator = ParallelTrainer(
    wrapper=wrapper,
    run_config=run_config,
)

metrics = ablator.launch(working_directory = os.getcwd(), ray_head_address="auto")

To reproduce the error:

  1. ray start --head
  2. python test_init_chkpt.py

Default rand_weights_init

  1. In the TrainConfig class, rand_weights_init is set to True by default, which re-initializes the weights of all modules in create_model() in ModelWrapper. This is problematic when using a pre-trained model (a workaround sketch follows this list).
  2. In model.apply(butils.init_weights) in create_model(), layers are re-initialized with std=1, which produces outputs with very large absolute values.
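
If rand_weights_init is meant to be the toggle for this behavior, a minimal workaround sketch for point 1 (assuming it is a plain boolean field on TrainConfig, as described above; the other arguments mirror the reproduction script earlier on this page) is:

from ablator import OptimizerConfig, TrainConfig

optimizer_config = OptimizerConfig(name="sgd", arguments={"lr": 0.1})

# Sketch only: rand_weights_init=False is assumed to skip the re-initialization
# performed in ModelWrapper.create_model(), keeping pre-trained weights intact.
train_config = TrainConfig(
    dataset="test",
    batch_size=128,
    epochs=2,
    optimizer_config=optimizer_config,
    scheduler_config=None,
    rand_weights_init=False,
)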

Misleading naming of optim_direction in parse_metrics() in main.mp

The naming of the optim_direction parameter in the parse_metrics() function is misleading: at the call site metrics = parse_metrics(list(self.run_config.optim_metrics.keys()), metrics), the argument self.run_config.optim_metrics.keys() returns metric names, not optimization directions.
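
A hedged sketch of a clearer signature (the parameter name below is only a suggestion, not ablator's current API, and the body is elided):

# Suggested rename only; the actual parse_metrics implementation is unchanged.
def parse_metrics(optim_metric_names: list[str], metrics: dict) -> dict:
    """optim_metric_names receives run_config.optim_metrics.keys(),
    i.e. metric names rather than optimization directions."""
    ...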

Unused parameter in init_scheduler() of scheduler config classes

  1. The init_scheduler() method in all scheduler config classes declares model: nn.Module as a parameter, but it is never used when initializing the schedulers. Changing it to **kwargs or *args might be a reasonable solution (see the sketch below).
  2. The optimizer parameter should be of type Optimizer.
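
A hedged sketch of the suggested signature change (the class name and arguments field are illustrative assumptions, not ablator's actual scheduler config code):

from torch.optim import Optimizer, lr_scheduler

class StepLRConfig:
    arguments = {"step_size": 1, "gamma": 0.99}

    def init_scheduler(self, optimizer: Optimizer, *args, **kwargs):
        # The unused model: nn.Module parameter is dropped; extra arguments are
        # absorbed by *args/**kwargs, and optimizer is typed as
        # torch.optim.Optimizer as suggested in point 2.
        return lr_scheduler.StepLR(optimizer, **self.arguments)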

[Feature Request] Support for black-box functions

In principle, nothing prevents ablator from being used with any black-box function.

The two key components that would need to be addressed are:

  • Experiment Persistence: a mechanism for checkpointing an experiment and restoring its state in case of a failure.
  • Fault Tolerance: identifying common failure cases of a black-box function, such as CPU over-utilization or out-of-memory errors, and gracefully terminating such experiments.

We need to discuss methods to address the issues mentioned (and any others) in order to support black-box functions, and to evaluate the feasibility.

The goal is to provide at least partial support in the next release of ablator.
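
A minimal sketch of the experiment-persistence idea for an arbitrary black-box function (none of this is ablator API; the state layout and helper name are purely illustrative):

import pickle
from pathlib import Path

def run_black_box(fn, state_path: Path, max_steps: int = 100):
    # Restore state if a previous run of this trial was interrupted.
    if state_path.exists():
        state = pickle.loads(state_path.read_bytes())
    else:
        state = {"step": 0, "best": None}

    while state["step"] < max_steps:
        state["best"] = fn(state)  # one opaque evaluation of the black-box function
        state["step"] += 1
        # Persist after every step so a crash loses at most one evaluation.
        state_path.write_bytes(pickle.dumps(state))
    return state["best"]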

Bug with wrapper accessing SchedulerConfig

In wrapper._train_evaluation_step(), the condition

if (
            self.scheduler is not None
            and hasattr(self.train_config.scheduler_config, "step_when")
            and self.train_config.scheduler_config.step_when == "val"
        ):

is always False; it should instead be

if (
            self.scheduler is not None
            and hasattr(self.train_config.scheduler_config.arguments, "step_when")
            and self.train_config.scheduler_config.arguments.step_when == "val"
        ):

Need to fix: Append metrics to results.json and keep the right format forever.

The issue is that we append to results.json, so each line is supposed to be a JSON object. As the file grows dynamically, it is difficult to encode it efficiently as a single JSON document, because we would need to re-encode all of results.json at each step. An alternative is to wrap the results into an array, i.e. [ results ], and when appending to results.json, overwrite the last two rows to become ,results+1].

There are several considerations I found problematic in practice, such as a mid-stream interruption while writing to results.json (i.e. not writing the entire row). We would need to perform some sort of synchronized (atomic) write (not sure of the exact name off the top of my head). Can you please try to implement this and write tests for the above scenario and any additional ones? Please use a dedicated branch for this.
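
A minimal sketch of the array-wrapping idea, assuming each result is a JSON-serializable dict (this is not ablator code, and fsync only narrows the window for a partially written row rather than eliminating it):

import json
import os

def append_result(path: str, record: dict) -> None:
    encoded = json.dumps(record)
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        # First record: write a complete, valid JSON array in one shot.
        with open(path, "w", encoding="utf-8") as f:
            f.write("[" + encoded + "]")
        return
    with open(path, "rb+") as f:
        # Overwrite the trailing "]" with ",<record>]" so the file stays valid
        # JSON after every append instead of re-encoding the whole file.
        f.seek(-1, os.SEEK_END)
        f.write(b"," + encoded.encode("utf-8") + b"]")
        f.flush()
        os.fsync(f.fileno())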

Incorrect functional call run_config.gcp_config.rsync_down_node

In ablator.main.mp.py, the _rsync_nodes() method makes a call on the GCP config object: self.run_config.gcp_config.rsync_down_node(hostname, self.experiment_dir, self.logger), but the GcpConfig class in ablator.modules.storage.cloud.py only defines a method named rsync_down_nodes. Is this a bug?

Typo in modules/scheduler.py

In modules/scheduler.py, the config class for the ReduceLROnPlateau scheduler has a typo: it is named PlateuaConfig, while it should be PlateauConfig.
