fostiropoulos / ablator
Model Ablation Tool-Kit for Deep Learning Model
Home Page: https://docs.ablator.org
License: GNU General Public License v3.0
In state.py, there are some statements like:
trial_id = int(trial.trial_num)
stmt = select(Trial).where(Trial.trial_num == trial_id)
The variable naming is misleading, since trial_id is the entry id in the DB and trial_num is the number associated with the sampler. Therefore, trial_id in the code above should be renamed to trial_num.
init_chkpt config param

2023-07-14 14:11:16: Initializing model weights ONLY from checkpoint. \tmp\experiments_\experiment_7046_9991\0ee8_9991\best_checkpoints
Traceback (most recent call last):
File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\test_init_chkpt.py", line 79, in <module>
metrics = ablator.launch(working_directory = os.getcwd(), ray_head_address="auto")
File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\mp.py", line 542, in launch
self._init_state(
File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\mp.py", line 410, in _init_state
super()._init_state()
File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\proto.py", line 62, in _init_state
mock_wrapper._init_state(run_config=copy.deepcopy(self.run_config), debug=True)
File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\model\main.py", line 548, in _init_state
self._init_model_state(resume, smoke_test)
File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\model\main.py", line 498, in _init_model_state
self._load_model(self.current_checkpoint, model_only=True)
File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork-hieu\ablator\main\model\main.py", line 623, in _load_model
save_dict = torch.load(checkpoint_path, map_location="cpu")
File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork\venv\lib\site-packages\torch\serialization.py", line 771, in load
with _open_file_like(f, 'rb') as opened_file:
File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork\venv\lib\site-packages\torch\serialization.py", line 270, in _open_file_like
return _open_file(name_or_buffer, mode)
File "C:\Users\hieuc\Documents\USC\Spring-2023\Research-Intern\ablator-fork\venv\lib\site-packages\torch\serialization.py", line 251, in __init__
super(_open_file, self).__init__(open(name, mode))
PermissionError: [Errno 13] Permission denied: '\\tmp\\experiments_\\experiment_7046_9991\\0ee8_9991\\best_checkpoints'
test_init_chkpt.py
from ablator import ModelConfig, OptimizerConfig, TrainConfig, RunConfig, ParallelConfig
from ablator import ModelWrapper, ParallelTrainer
from ablator.main.configs import SearchSpace

import torch
import torch.nn as nn
import shutil
import os


optimizer_config = OptimizerConfig(name="sgd", arguments={"lr": 0.1})
train_config = TrainConfig(
    dataset="test",
    batch_size=128,
    epochs=2,
    optimizer_config=optimizer_config,
    scheduler_config=None,
)

search_space = {
    "train_config.optimizer_config.arguments.lr": SearchSpace(
        value_range=[0, 10], value_type="int"
    ),
}

run_config = ParallelConfig(
    train_config=train_config,
    model_config=ModelConfig(),
    metrics_n_batches=800,
    experiment_dir="/tmp/experiments/",
    device="cuda",
    amp=True,
    random_seed=42,
    total_trials=5,
    concurrent_trials=3,
    search_space=search_space,
    optim_metrics={"val_loss": "min"},
    gpu_mb_per_experiment=1024,
    cpus_per_experiment=1,
    init_chkpt="<path-to-checkpoint>",
)


class MyCustomModel(nn.Module):
    def __init__(self, config: ModelConfig) -> None:
        super().__init__()
        self.lr = 0.01
        self.param = nn.Parameter(torch.ones(100))
        self.itr = 0

    def forward(self, x: torch.Tensor):
        x = self.param + torch.rand_like(self.param) * 0.01
        self.itr += 1
        if self.itr > 10 and self.lr > 5:
            raise Exception("large lr.")
        return {"preds": x}, x.sum().abs()


class TestWrapper(ModelWrapper):
    def make_dataloader_train(self, run_config: RunConfig):
        dl = [torch.rand(100) for i in range(100)]
        return dl

    def make_dataloader_val(self, run_config: RunConfig):
        dl = [torch.rand(100) for i in range(100)]
        return dl


if not os.path.exists(run_config.experiment_dir):
    shutil.os.mkdir(run_config.experiment_dir)
shutil.rmtree(run_config.experiment_dir)

wrapper = TestWrapper(MyCustomModel)
ablator = ParallelTrainer(
    wrapper=wrapper,
    run_config=run_config,
)
metrics = ablator.launch(working_directory=os.getcwd(), ray_head_address="auto")
To reproduce the error:
ray start --head
python test_init_chkpt.py
In TrainConfig, rand_weights_init is set to True by default. It resets the weights of all modules in create_model() in ModelWrapper, which is problematic when using a pre-trained model.

With model.apply(butils.init_weights) in create_model(), layers are re-initialized with std=1, causing very large output absolute values.

The naming of the optim_direction parameter in the parse_metrics() function is misleading, since when it is called as metrics = parse_metrics(list(self.run_config.optim_metrics.keys()), metrics), self.run_config.optim_metrics.keys() returns metric names, not optimization directions.
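The std=1 effect can be sketched with a stdlib-only calculation (no ablator or torch needed; the std=1 behavior of init_weights is taken from the issue description, not from reading butils): each output of a linear layer ends up roughly sqrt(fan_in) times larger than with a scale-preserving init.

```python
import math
import random

def linear_unit(x, std, rng):
    # One output unit of a linear layer: dot(x, w) with w ~ N(0, std^2).
    return sum(rng.gauss(0, std) * xi for xi in x)

rng = random.Random(0)
fan_in = 512
x = [rng.gauss(0, 1) for _ in range(fan_in)]

def output_std(std, n=200):
    # Empirical std of the layer's outputs over n independently drawn units.
    outs = [linear_unit(x, std, rng) for _ in range(n)]
    return math.sqrt(sum(o * o for o in outs) / n)

print(output_std(1.0))                    # on the order of sqrt(512) ~ 22: very large
print(output_std(1 / math.sqrt(fan_in)))  # on the order of 1: scale-preserving
```

With several stacked layers this blow-up compounds multiplicatively, which is why a pre-trained model's outputs become useless after the re-initialization.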
init_scheduler() in all scheduler config classes declares model: nn.Module as a parameter, but it is not used when initializing the schedulers. Changing it to **kwargs or *args might be a reasonable solution. Also, the optimizer parameter should be of type Optimizer.
In the __init__ function, the code tries to check the GPU memory, but on a machine without a GPU this raises an error.
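One way to guard the check is sketched below (get_gpu_mem_mb is a hypothetical helper, not ablator's API): it reports None on GPU-less machines instead of raising, by probing for the NVIDIA driver first.

```python
import shutil
import subprocess

def get_gpu_mem_mb():
    """Return total GPU memory in MB, or None when no NVIDIA driver/GPU
    is present, instead of raising on GPU-less machines."""
    if shutil.which("nvidia-smi") is None:
        return None  # no driver installed: report "no GPU" rather than error
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        return int(out.splitlines()[0])
    except (subprocess.CalledProcessError, ValueError, IndexError):
        return None  # driver present but query failed: treat as no GPU

print(get_gpu_mem_mb())
```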
There is an if condition for the case where device is an int, but int is not included in the parameter's type annotation.

There is nothing that would prevent ablator from being used with any black-box function.
The two key components that would need to be addressed are:
We need to discuss methods to address the issues mentioned, and any others, to be able to support black-box functions and evaluate the feasibility.
The goal is to provide at least partial support in the next release of ablator.
In wrapper._train_evaluation_step(), the condition

if (
    self.scheduler is not None
    and hasattr(self.train_config.scheduler_config, "step_when")
    and self.train_config.scheduler_config.step_when == "val"
):

is always False; it should be

if (
    self.scheduler is not None
    and hasattr(self.train_config.scheduler_config.arguments, "step_when")
    and self.train_config.scheduler_config.arguments.step_when == "val"
):
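Why the first form can never be True can be demonstrated with a stand-in config (SchedulerConfig and the SimpleNamespace here are hypothetical, mimicking the fact that step_when lives on the nested arguments object rather than on the config itself):

```python
from dataclasses import dataclass, field
from types import SimpleNamespace

@dataclass
class SchedulerConfig:
    # Hypothetical stand-in: step_when is an attribute of `arguments`,
    # not of the config object itself.
    name: str = "plateau"
    arguments: SimpleNamespace = field(
        default_factory=lambda: SimpleNamespace(step_when="val")
    )

cfg = SchedulerConfig()
print(hasattr(cfg, "step_when"))            # False: attribute is nested
print(hasattr(cfg.arguments, "step_when"))  # True: where it actually lives
```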
The issue is that we append to results.json, so each line is supposed to be a JSON object. As the file grows dynamically it is difficult to efficiently encode it as a single JSON document, because we would need to re-encode all of results.json at each step. An alternative I just thought of: we can wrap the results into an array, i.e. [ results ], and when appending to results.json we overwrite the last two rows to be ,results+1] .
There are several considerations I found problematic in practice, such as mid-stream interruption of writing to results.json (i.e. not writing the entire row). We would need to perform some sort of synced write (not sure of the exact name off the top of my head). Can you please try to implement this and write tests for the above scenario and any additional ones? Please use a dedicated branch for this.
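A minimal sketch of the array-wrapping idea (append_record is illustrative, not ablator's code, and it does not yet address the mid-stream-interruption concern, which would still need an atomic or synced write):

```python
import json
import os
import tempfile

def append_record(path, record):
    """Keep `path` a valid JSON array after every append by overwriting
    the trailing ']' with ',<record>]' instead of re-encoding the file."""
    line = json.dumps(record).encode()
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        with open(path, "wb") as f:
            f.write(b"[" + line + b"]")
        return
    with open(path, "rb+") as f:
        f.seek(-1, os.SEEK_END)  # position on the closing ']'
        f.write(b"," + line + b"]")

path = os.path.join(tempfile.mkdtemp(), "results.json")
append_record(path, {"step": 1, "loss": 0.5})
append_record(path, {"step": 2, "loss": 0.4})
with open(path) as f:
    print(json.load(f))  # [{'step': 1, 'loss': 0.5}, {'step': 2, 'loss': 0.4}]
```

Each append only touches the last byte plus the new record, so the cost stays constant as the file grows; the trade-off is that a partially written record corrupts the array, which is exactly the interruption scenario the tests should cover.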
In ablator.main.mp.py, the _rsync_nodes() method makes a call on the GCP config object: self.run_config.gcp_config.rsync_down_node(hostname, self.experiment_dir, self.logger), but I can only find in the GcpConfig class a method named rsync_down_nodes, in ablator.modules.storage.cloud.py. Is this a bug?
In modules/scheduler.py, the config class for the ReduceLROnPlateau scheduler has a typo: it is named PlateuaConfig, while it should be PlateauConfig.