eora-ai / torchok

Production-oriented Computer Vision models training pipeline for common tasks: classification, segmentation, detection and representation🥤

Home Page: https://torchok.readthedocs.org

License: Apache License 2.0

Dockerfile 0.57% Jupyter Notebook 0.77% Python 98.66%

computer-vision deep-learning image-classification image-retrieval image-segmentation representation-learning

torchok's People

vladvin romakoks alicegaz vladislavpatrushev mmalyutin eora-ai

torchok's Issues

Tests for BEiT

Is your feature request related to a problem? Please describe.
There are no tests for the BEiT backbone. The main reason is that BEiT can't be converted to a JIT CPU model, because BEiT has a SyncBatchNorm layer, which has no CPU implementation.

Code refactoring

Is your feature request related to a problem? Please describe.
The code needs to be consistent with the PEP 8 style guide.

Describe the solution you'd like
Use flake8 to check the project against the common code style.

Add PML losses

Is your feature request related to a problem? Please describe.
TorchOk has only a few losses for the retrieval task.

Describe the solution you'd like
Add losses from PyTorch Metric Learning (PML).

Describe alternatives you've considered

Additional context

Why not add a new abstraction for different network topologies

The suggestion is to rename the models package to nn_components (or something more appropriate) and add a new package named topologies. Topologies will store classes that assemble a backbone, pooling, neck, and head, with a separate class for each case whose structure differs from simply stacking these four components, or which adds new components. This may be useful for models trained with Adversarial Propagation, adversarial training, some cases of multi-headed learning, ensemble models, or composite constructs.

Add DataModules

I suggest using the Lightning DataModule https://pytorch-lightning.readthedocs.io/en/stable/extensions/datamodules.html to abstract all data-related logic.

Implement forward_onnx

Is your feature request related to a problem? Please describe.
A method is needed to create the model without the head (to get embeddings).

Describe the solution you'd like

Models must have inputs and outputs set in the config
Modify ModelCheckpointWithONNX to support the forward_onnx method and dictionary inputs/outputs

Scheduling of hyper parameters

Is your feature request related to a problem? Please describe.
In multi-stage training, a user may want to change hyperparameters of some modules between stages. For example, a user can change the margin from 0.1 to 0.5 after some epochs of training. Currently this can only be done manually via separate runs, which breaks reproducibility.

Describe the solution you'd like
The best way is to implement a hyperparameter scheduler that changes values according to an epoch timeline. For example, you can have a dict like this:

hparams_schedule:
  5:
    "losses['triplet_loss'].m": 0.1
  15:
    "losses['triplet_loss'].m": 0.5

The keys here are accessors that are resolved relative to the Task object.
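
A minimal sketch of how such a schedule could be applied, assuming accessors only use attribute access (`.name`) and quoted dictionary keys (`['key']`). The `Task` and `TripletLoss` classes below are hypothetical stand-ins, not TorchOk classes, and `set_by_accessor` and `apply_hparams_schedule` are illustrative names.

```python
import re

# One accessor step: a leading name or ".attr" (group 1), or "['key']" (group 2).
_STEP = re.compile(r"""\.?([A-Za-z_]\w*)|\[['"]([^'"]+)['"]\]""")


def set_by_accessor(root, accessor, value):
    """Set `value` at the location named by `accessor`, resolved from `root`."""
    steps = [(m.group(1), m.group(2)) for m in _STEP.finditer(accessor)]
    target = root
    # Walk every step except the last to reach the parent object.
    for attr, key in steps[:-1]:
        target = getattr(target, attr) if attr is not None else target[key]
    attr, key = steps[-1]
    if attr is not None:
        setattr(target, attr, value)
    else:
        target[key] = value


def apply_hparams_schedule(task, schedule, epoch):
    """Apply every override scheduled for exactly this epoch."""
    for accessor, value in schedule.get(epoch, {}).items():
        set_by_accessor(task, accessor, value)


class TripletLoss:  # hypothetical stand-in for a loss with a margin `m`
    def __init__(self, m):
        self.m = m


class Task:  # hypothetical stand-in for the Task object
    def __init__(self):
        self.losses = {"triplet_loss": TripletLoss(m=0.0)}


schedule = {
    5: {"losses['triplet_loss'].m": 0.1},
    15: {"losses['triplet_loss'].m": 0.5},
}
task = Task()
margins = []
for epoch in range(20):
    apply_hparams_schedule(task, schedule, epoch)
    margins.append(task.losses["triplet_loss"].m)
```

Because the schedule lives in the config, a single run reproduces all stages. In a Lightning setup, the same call could be made from a callback at the start of each training epoch.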

Describe alternatives you've considered

Additional context

Add testing

Support two types of testing:

Running with a separate script: add a Test DataModule and a Test TaskModule to support testing models from a separate script.
Restricting which dataset types and Lightning Module loops are executed when running train.py: add a --mode parameter to train.py to allow train or validate (and maybe test) options. A user may want to run validation alone for benchmarking, without training, while reusing the training configs, Data Modules, and Task Modules, so it would be good to have a parameter that skips training and executes validation only (or test only). It may look like this:

if args.mode == "train":
    trainer.fit(model, datamodule=data_module)
elif args.mode == "valid":
    trainer.validate(model, datamodule=data_module)

In this case, we require all train and valid (and maybe test) data loaders to be initialized.
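
The --mode idea above could be wired up roughly as follows. This is only a sketch: `RecordingTrainer` is a hypothetical stand-in that mimics the `fit`, `validate`, and `test` methods of a Lightning Trainer so the dispatch can be shown without a real training setup, and the `test` mode is an assumption the issue only hints at.

```python
import argparse

# Maps each --mode value to the name of the Trainer method that runs it.
MODE_TO_METHOD = {"train": "fit", "valid": "validate", "test": "test"}


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=sorted(MODE_TO_METHOD), default="train")
    return parser.parse_args(argv)


def run(trainer, model, data_module, mode):
    """Call the trainer method that corresponds to the requested mode."""
    method = getattr(trainer, MODE_TO_METHOD[mode])
    return method(model, datamodule=data_module)


class RecordingTrainer:
    """Hypothetical stand-in for a Trainer; records which method was called."""

    def __init__(self):
        self.calls = []

    def fit(self, model, datamodule=None):
        self.calls.append("fit")

    def validate(self, model, datamodule=None):
        self.calls.append("validate")

    def test(self, model, datamodule=None):
        self.calls.append("test")


args = parse_args(["--mode", "valid"])
trainer = RecordingTrainer()
run(trainer, model=None, data_module=None, mode=args.mode)
```

A lookup table keeps the dispatch in one place, so adding a mode later means adding one entry rather than another branch.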

Run Triplet and Pairwise task and check metrics.

Is your feature request related to a problem? Please describe.
Try to reach good results on the triplet and pairwise tasks.

SageMaker can't resume path from empty folder

Describe the bug
Running on SageMaker forces you to specify resume_path, but at epoch 0 there is no checkpoint yet.

To Reproduce
Run any config with resume_path pointing to an empty folder.

Expected behavior
The folder should be checked for a last checkpoint.

Actual behavior
Currently it fails, because resume_path is expected to be a checkpoint file, not a folder path.

Environment:
Like in environment.yml
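
One way the expected behavior could look, as a sketch: accept either a checkpoint file or a folder, and fall back to `None` when the folder holds no checkpoint. The `.ckpt` extension and the "most recently modified file wins" rule are assumptions, and `resolve_resume_path` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional


def resolve_resume_path(resume_path: str) -> Optional[str]:
    """Return a checkpoint file to resume from, or None if there is none."""
    path = Path(resume_path)
    if path.is_file():
        # Already a checkpoint file: use it as is.
        return str(path)
    if path.is_dir():
        # Assumption: checkpoints end with .ckpt and the newest one is the last.
        checkpoints = sorted(path.glob("*.ckpt"), key=lambda p: p.stat().st_mtime)
        if checkpoints:
            return str(checkpoints[-1])
    # Empty or missing folder: nothing to resume, so training starts from scratch.
    return None
```

The result can then be passed as `ckpt_path` to `trainer.fit`, where `None` means a fresh start.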

Write Action with test

Is your feature request related to a problem? Please describe.
Each time a commit is pushed to the repo, the tests must run.

Representation Metrics don't support bfloat16

During training with bfloat16, metrics that use numpy internally fail. Converting the tensors in metrics to torch.float32 would be a solution.
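
A short sketch of the proposed conversion. NumPy has no bfloat16 dtype, so the tensor is cast to float32 before it is handed over. `to_numpy` is a hypothetical helper name, not part of TorchOk.

```python
import numpy as np
import torch


def to_numpy(tensor: torch.Tensor) -> np.ndarray:
    """Cast to float32 first, because NumPy cannot represent bfloat16."""
    return tensor.detach().to(torch.float32).cpu().numpy()


# Example embeddings as they would arrive during bfloat16 training.
embeddings = torch.randn(4, 8).to(torch.bfloat16)
array = to_numpy(embeddings)
```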

Change forward method to update in MetricManager and MetricWithUtils

Is your feature request related to a problem? Please describe.
It would be better to rename the MetricWithUtils forward method to update, because it uses the TorchMetrics update method.

Describe the solution you'd like
Rename forward to update.

Describe alternatives you've considered

Additional context

Add support for instantiating objects with hydra

Currently, torchok does not support instantiating objects with hydra as described here: https://hydra.cc/docs/advanced/instantiate_objects/overview/. Two common use cases:

I need to pass an object to a layer. For example, generic blocks accept a pooling object as a parameter. Being able to configure those objects in the config, with hydra creating the instances and passing them to the block, is a must.
I simply need to instantiate a class automatically, passing the parameters from the config, without requiring the user to provide model_name and model_params, but just:

model:
  _target_: src.models.backbones.resnet.seresnet18
  pretrained: False
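
To illustrate the mechanism, here is a simplified standard-library sketch of what instantiation from a dotted path involves. It is not Hydra's implementation: Hydra's own `hydra.utils.instantiate` reads the `_target_` key and supports more features. `fractions.Fraction` stands in for a model factory such as seresnet18.

```python
import importlib


def instantiate(config: dict):
    """Import the dotted path in `_target_` and call it with the other keys."""
    params = dict(config)  # copy so the caller's config is left untouched
    module_path, _, attr_name = params.pop("_target_").rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    return target(**params)


# Stand-in config: any importable callable works as a target.
config = {"_target_": "fractions.Fraction", "numerator": 1, "denominator": 3}
obj = instantiate(config)
```

With Hydra itself, the equivalent call would be `hydra.utils.instantiate(cfg.model)` on the loaded config.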

Explore the possibility of adding Checkpoint Loader Mixins

There are cases where multiple nn.Module components are involved in a training process, some of which are never trained but are used in eval mode, for instance in preprocessing (detection, neural augmenters, batch preprocessor). In these cases, we would have to override constructors for those components. Why not instead abstract all the checkpoint loading logic inside Mixin classes, and then derive the components that need checkpoint loading from the corresponding Mixin? In this case, all the checkpoint loading parameters for the mixin are placed as module parameters in the config.

Incremental model unfreeze

Is your feature request related to a problem? Please describe.
A feature that lets you set which modules of the model to unfreeze at which epoch.

Describe the solution you'd like
It might be implemented as something that accepts a dict config of the following structure and unfreezes modules in the given keys when the related epoch starts:

unfreeze_schedule:
  5:
    - 'neck.module1'
    - 'neck.module2'
  15:
    - 'backbone'
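
A minimal sketch of how such a schedule could be applied, assuming the keys are dotted submodule names that `nn.Module.get_submodule` can resolve. The toy model and the function name `unfreeze_for_epoch` are hypothetical.

```python
from torch import nn


def unfreeze_for_epoch(model: nn.Module, schedule: dict, epoch: int) -> None:
    """Enable gradients for every module scheduled for exactly this epoch."""
    for module_name in schedule.get(epoch, []):
        model.get_submodule(module_name).requires_grad_(True)


# Toy model with the same component names as the config above.
model = nn.ModuleDict({
    "backbone": nn.Linear(4, 4),
    "neck": nn.ModuleDict({"module1": nn.Linear(4, 4), "module2": nn.Linear(4, 4)}),
})
model.requires_grad_(False)  # start fully frozen

schedule = {5: ["neck.module1", "neck.module2"], 15: ["backbone"]}
unfreeze_for_epoch(model, schedule, epoch=5)
```

In a Lightning callback, this could be called at the start of each training epoch with the trainer's current epoch.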

Describe alternatives you've considered

Unfreeze only high level parts of the model, e.g. backbone, neck, head, etc.

Additional context

Representation Metrics in DDP mode

Describe the bug
In DDP mode, representation metrics don't work. In the update function the GPU embeddings are converted to CPU embeddings, and afterwards the compute method in DDP mode calls all_gather(), which works only with GPU tensors.

Check read TIFF images

Is your feature request related to a problem? Please describe.
In the previous version of TorchOk, TIFF images were read with PIL, not OpenCV. Check that TIFF images are read correctly.

Describe the solution you'd like

Find an image that OpenCV can't read
If such an image is found, check which frameworks can read TIFF
Add TIFF reading with the framework found

Describe alternatives you've considered

Additional context

Weight decay per group from config

Is your feature request related to a problem? Please describe.
A user should be able to specify, in the training configuration, model layers with their own weight decay values.

Describe the solution you'd like
In the config file it might be a dict like this:

weight_decays:
  'head.layer1': 0.00001
  'head.layer2': 0.00005
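
A sketch of how such a dict could be turned into optimizer parameter groups, assuming each key is a name prefix matched against parameter names. `build_param_groups` is a hypothetical helper, and the string placeholders below stand in for real parameters.

```python
def build_param_groups(named_parameters, weight_decays, default_weight_decay=0.0):
    """Group parameters by the first matching name prefix in `weight_decays`."""
    groups = {prefix: [] for prefix in weight_decays}
    default_group = []
    for name, param in named_parameters:
        for prefix in weight_decays:
            # Match the layer itself or anything nested under it.
            if name == prefix or name.startswith(prefix + "."):
                groups[prefix].append(param)
                break
        else:
            default_group.append(param)
    param_groups = [
        {"params": params, "weight_decay": weight_decays[prefix]}
        for prefix, params in groups.items()
        if params
    ]
    if default_group:
        param_groups.append(
            {"params": default_group, "weight_decay": default_weight_decay}
        )
    return param_groups


# Placeholder (name, parameter) pairs; a real model yields tensors instead.
named = [
    ("head.layer1.weight", "w1"),
    ("head.layer2.weight", "w2"),
    ("backbone.conv.weight", "w0"),
]
param_groups = build_param_groups(
    named, {"head.layer1": 0.00001, "head.layer2": 0.00005}
)
```

With a real model, the result of `build_param_groups(model.named_parameters(), ...)` can be passed directly to a torch.optim optimizer, which accepts a list of such dicts.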

Describe alternatives you've considered

Additional context

Integrate loguru for most important parts of the code

Is your feature request related to a problem? Please describe.
We don't have logging for now. It should be added to the critical parts ASAP!

Describe the solution you'd like
Use the loguru library to simplify logging setup.

Describe alternatives you've considered

Additional context

Separation of embedding layer and pooling needed

There is a package pooling which stores everything that can be used independently to process output feature maps coming from backbone. However, embedding layers can utilize different poolings and I suggest the following hierarchy:

backbones
emb_layers
heads
modules
    blocks
    bricks
    layers
        activations
        normalizations
        poolings
    necks

The top level contains components that are used in a task; here embedding_layer should replace POOLINGS.

Freeze-Unfreeze callback doesn't freeze batch norm

Describe the bug
Freeze-Unfreeze callback doesn't freeze batch norms
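
The issue does not state the cause, but a common one is that setting requires_grad to False only stops gradient updates: batch norm layers in training mode still update their running statistics on every forward pass. A sketch of a freeze that also handles this, using a toy backbone and a hypothetical `freeze` helper:

```python
import torch
from torch import nn


def freeze(module: nn.Module) -> None:
    """Stop gradient updates and batch norm running-statistics updates."""
    module.requires_grad_(False)
    for submodule in module.modules():
        # eval() makes batch norm use its stored statistics without updating them.
        if isinstance(submodule, nn.modules.batchnorm._BatchNorm):
            submodule.eval()


backbone = nn.Sequential(nn.Conv2d(3, 4, kernel_size=1), nn.BatchNorm2d(4))
backbone.train()
freeze(backbone)

stats_before = backbone[1].running_mean.clone()
backbone(torch.randn(2, 3, 8, 8))
stats_unchanged = torch.equal(stats_before, backbone[1].running_mean)
```

Any later call to `.train()` on a parent module puts the batch norm layers back into training mode, so the freeze has to be reapplied after such calls.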

Does not find library paths when installing via pip install torchok

Describe the bug
The paths to the libraries are not found when installing via pip install torchok. After installing, I found that only the src directory is installed; the folder hierarchy above it is missing. For example, there is no train.py file, no test.py, no example/configs folder, and so on.

To Reproduce

Expected behavior
I expected that after running the !python train.py path.yaml command the learning process would begin

Actual behavior
Execution crashes with an error - python3: can't open file 'train.py': [Errno 2] No such file or directory

Set up poetry

It would be great to have dependencies managed by poetry.

Get rid of warnings

Describe the bug
Warnings show up when training or testing starts.

To Reproduce
For example:
python -m torchok -cp ../examples/configs -cn classification_cifar10

Expected behavior
No warnings.

Environment:

OS: [e.g. Ubuntu 22.04]
CUDA: [e.g. 11.6]
PyTorch: [e.g. 12.0]
PyTorch Lightning: [e.g. 1.6.5]

ModelCheckpointWithONNX does not remove old checkpoints properly

When launching training with the ModelCheckpointWithONNX callback and save_top_k: 1 set, the callback does not remove old checkpoints that are no longer in the top k.

Write a SOP Task Tutorial

Describe the solution you'd like
To format the tutorial as Sphinx documentation:

Selecting model parameters
Selection of augmentations
Choice of metrics

All external libraries should be linked to specific topics in the documentation

Add correct support for compute metrics of many validation datasets

Is your feature request related to a problem? Please describe.
TorchOk currently supports computing metrics for multiple validation datasets, but it computes a single metric over all of them, as if they were one big dataset. The ability to compute metrics separately for every dataset needs to be added.

Describe the solution you'd like
Add a dataset_index parameter to metric_parameters, and create a metric whose name contains this dataset_index.

Additional context
This feature also requires changes to MetricManager.
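
The idea can be sketched with a plain-Python accuracy metric that keeps separate state per dataset index and puts the index into the metric name. The class name and the `accuracy_<index>` naming scheme are assumptions, not TorchOk's API.

```python
from collections import defaultdict


class PerDatasetAccuracy:
    """Accumulates accuracy separately for each validation dataset index."""

    def __init__(self):
        self._correct = defaultdict(int)
        self._total = defaultdict(int)

    def update(self, predictions, targets, dataset_index):
        matches = sum(p == t for p, t in zip(predictions, targets))
        self._correct[dataset_index] += matches
        self._total[dataset_index] += len(targets)

    def compute(self):
        # One named result per dataset instead of one pooled value.
        return {
            f"accuracy_{index}": self._correct[index] / self._total[index]
            for index in sorted(self._total)
        }


metric = PerDatasetAccuracy()
metric.update([1, 0, 1], [1, 1, 1], dataset_index=0)
metric.update([2, 2], [2, 2], dataset_index=1)
results = metric.compute()
```

Lightning passes a dataloader index to the validation step when several validation dataloaders are used, which is a natural source for `dataset_index`.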

Add action add-to-project

https://github.com/actions/add-to-project

Dependencies installation didn't happen in CI/CD on a multi-commit PR

Describe the bug
Dependencies in pyproject.toml were changed in a multi-commit PR, but the CI/CD process didn't catch the change, resulting in broken dependencies in the main branch.

To Reproduce

Expected behavior
Dependency installation needs to be run in CI/CD when one of the commits in a PR changes pyproject.toml.

Actual behavior
The dependencies workflow in main is broken.

Environment:
See CI/CD workflow

Add requirements for each component type in src/models

It is not obvious how to distinguish between nn blocks in different subdirectories of src/models. It would be better to add a README and base classes for each component (block, brick, module, etc.) listing all the requirements (what it accepts, what it returns, what it is allowed and supposed to do with the inputs).

No examples configs inside TorchOk package

Describe the bug
There's a command in README for running basic example training but it doesn't work:

python -m torchok -cn classification_cifar10 trainer.accelerator='cpu'

To Reproduce

pip install --upgrade torchok
python -m torchok -cn classification_cifar10 trainer.accelerator='cpu'

Expected behavior
The example should run

Actual behavior
Error is raised:

/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm
Primary config directory not found.
Check that the config directory '/usr/local/lib/python3.7/dist-packages/torchok/examples/configs' exists and readable

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Environment:

Additional context

Welcome update to OpenMMLab 2.0

I am Vansin, the technical operator of OpenMMLab. In September of last year, we announced the release of OpenMMLab 2.0 at the World Artificial Intelligence Conference in Shanghai. We invite you to upgrade your algorithm library to OpenMMLab 2.0 using MMEngine, which can be used for both research and commercial purposes. If you have any questions, please feel free to join us on the OpenMMLab Discord at https://discord.gg/amFNsyUBvm or add me on WeChat (van-sin) and I will invite you to the OpenMMLab WeChat group.

Here are the OpenMMLab 2.0 repo branches:

Repo              OpenMMLab 1.0 branch   OpenMMLab 2.0 branch
MMEngine                                 0.x
MMCV              1.x                    2.x
MMDetection       0.x, 1.x, 2.x          3.x
MMAction2         0.x                    1.x
MMClassification  0.x                    1.x
MMSegmentation    0.x                    1.x
MMDetection3D     0.x                    1.x
MMEditing         0.x                    1.x
MMPose            0.x                    1.x
MMDeploy          0.x                    1.x
MMTracking        0.x                    1.x
MMOCR             0.x                    1.x
MMRazor           0.x                    1.x
MMSelfSup         0.x                    1.x
MMRotate          1.x                    1.x
MMYOLO                                   0.x

Attention: please create a new virtual environment for OpenMMLab 2.0.

Remove ability to specify multiple training dataloaders

Is your feature request related to a problem? Please describe.
It is impossible to use multiple training datasets according to Lightning's interface (there is no dataloader_idx parameter for training_step).
TorchOk nevertheless accepts a list of training datasets, although it won't work: the user has to specify only one training dataloader to make it work.

Describe the solution you'd like
Remove the ability to pass multiple training dataloaders

Describe alternatives you've considered
We can also create our own wrapper around a list of datasets, but it will be confusing for a user

Additional context

Support command line params merging with .yaml

Specifying parameters through the command line works only if those parameters are already in the config file used to run train.py. So merging works only as an override, not as a union of the .yaml config parameters and the parameters from the command line.
To reproduce the issue:

python train.py -cp examples/configs -cn classification_cifar10 trainer.accelerator='cpu' task.params.freeze_backbone=true

Error Message:

Could not override 'task.params.freeze_backbone'.
To append to your config use +task.params.freeze_backbone=true
Key 'freeze_backbone' is not in struct
    full_key: task.params.freeze_backbone
    object_type=dict

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Write a Representation Task

Is your feature request related to a problem? Please describe.
A task for image representation is needed.

Describe the solution you'd like
Differences from the Classification Task:

A different forward_onnx.

Also add:

A dataset for image representation (pairwise data).
An example config with pairwise losses.

SwinV2 not export to onnx.

Describe the bug
When I set export_to_onnx to true in the yaml config, the process crashes at epoch end (in ModelCheckpointWithOnnx).

To Reproduce
Run python -m torchok -cp ../examples/configs -cn representation_arcface_sop (after first setting export_to_onnx to true).

Actual behavior
Traceback (most recent call last):
  File "/workdir/vpatrushev/torchOKsmall2/torchok/torchok/__main__.py", line 38, in entrypoint
    trainer.fit(model, ckpt_path=config.resume_path)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 205, in run
    self.on_advance_end()
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 297, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1636, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 309, in on_train_epoch_end
    self._save_last_checkpoint(trainer, monitor_candidates)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 644, in _save_last_checkpoint
    self._save_checkpoint(trainer, filepath)
  File "/workdir/vpatrushev/torchOKsmall2/torchok/torchok/callbacks/model_checkpoint_with_onnx.py", line 43, in _save_checkpoint
    model.to_onnx(filepath + self.ONNX_EXTENSION, (*input_tensors,), **self.onnx_params)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1888, in to_onnx
    torch.onnx.export(self, input_sample, file_path, **kwargs)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/onnx/__init__.py", line 350, in export
    return utils.export(
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/onnx/utils.py", line 163, in export
    _export(
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/onnx/utils.py", line 1074, in _export
    graph, params_dict, torch_out = _model_to_graph(
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/onnx/utils.py", line 727, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/onnx/utils.py", line 602, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/onnx/utils.py", line 517, in _trace_and_get_graph_from_model
    trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/jit/_trace.py", line 1175, in _get_trace_graph
    outs = ONNXTracedModule(f, strict, _force_outplace, return_inputs, _return_inputs_states)(*args, **kwargs)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/jit/_trace.py", line 127, in forward
    graph, out = torch._C._create_graph_by_tracing(
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/jit/_trace.py", line 118, in wrapper
    outs.append(self.inner(*trace_inputs))
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1118, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/workdir/vpatrushev/torchOKsmall2/torchok/torchok/tasks/classification.py", line 51, in forward
    x = self.backbone(x)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1118, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/workdir/vpatrushev/torchOKsmall2/torchok/torchok/models/backbones/swin.py", line 250, in forward
    x = self._forward_patch_emb(x)
  File "/workdir/vpatrushev/torchOKsmall2/torchok/torchok/models/backbones/swin.py", line 211, in _forward_patch_emb
    x = self.patch_embed(x)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1118, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/torchok/lib/python3.9/site-packages/timm/models/layers/patch_embed.py", line 32, in forward
    B, C, H, W = x.shape
ValueError: not enough values to unpack (expected 4, got 3)

Environment:

OS: [e.g. Ubuntu 22.04]
CUDA: [e.g. 11.6]
PyTorch: [e.g. 12.0]
PyTorch Lightning: [e.g. 1.6.5]

Test issue

Add seed everything by default

Is your feature request related to a problem? Please describe.
We currently cannot reproduce experiments because the seeds are not set.

Describe the solution you'd like
Lightning has a function that allows users to seed everything: https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#reproducibility. It should be enabled by default and be configured from the training configuration.

Describe alternatives you've considered

Additional context

Test issue 2

Setup initial CI/CD

Is your feature request related to a problem? Please describe.
Initially, CI/CD should run flake8, print errors and warnings for the user, and block merging while any errors remain unresolved.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Flake8 not pass if 3.7 python version

Describe the bug
When I run the GitHub Action with Python 3.7, flake8 raises a SyntaxError on line 179 of the following file:
https://github.com/eora-ai/torchok/blob/dev/torchok/models/backbones/beit.py

To Reproduce
Replace the contents of .github/workflows/flake8_checks.yaml with the code snippet below.


on:
  push:
    branches: [dev]
  pull_request: 
    branches: [dev]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Install python
        uses: actions/setup-python@v2
        with:
          python-version: 3.7
      - name: Install deps
        run: |
          python -m pip install --upgrade pip
          pip install flake8
      - name: Run flake8
        run: flake8 .
