
vamosc / clip4str


An implementation of "CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model".

License: Apache License 2.0

Python 97.11% Shell 2.89%

clip4str's People

Contributors

vamosc

clip4str's Issues

Error locating target for VL4STR

Thank you for your great work!
I tried to run train.py (not from a pretrained checkpoint) on Google Colab and got this error:

Error executing job with overrides: []
Error locating target 'strhub.models.vl_str.system.VL4STR', see chained exception above.
full_key: model
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I have tried using an absolute path for the "target" field in clip4str/configs/model/vl4str.yaml, but I still get the above error.
I'm using the hydra-core version pinned in requirements.txt (1.2.0).
Do you have any suggestions? Thank you!
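
A quick sanity check (not an official fix, and assuming you run it from the clip4str repository root): Hydra's "Error locating target" usually means the dotted path cannot be imported, either because the repo root is not on sys.path or because an import inside strhub/models/vl_str/system.py fails. Setting HYDRA_FULL_ERROR=1, as the message suggests, shows the chained exception; importing the class directly often reveals the same root cause:

```python
# Hypothetical diagnostic, run from the clip4str repository root.
import importlib

# If this import raises, the traceback points at the real problem
# (missing dependency, wrong working directory, Python version, ...).
mod = importlib.import_module("strhub.models.vl_str.system")
print(mod.VL4STR)
```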

Inference time on CPU/GPU

It would be nice if you could add latency results to the README as well. I am planning to use this for an industry application, and before experimenting it would help to know whether it is even feasible (I have an SLA of about 1 second per image).
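
For reference, a rough way to probe this locally (a minimal sketch, not an official benchmark; the checkpoint and CLIP paths are placeholders, and the 224x224 input size is an assumption):

```python
import time
import torch
from strhub.models.utils import load_from_checkpoint

device = "cuda" if torch.cuda.is_available() else "cpu"
# clip_pretrained may be required if the path stored in the checkpoint
# does not exist on the local machine.
model = load_from_checkpoint(
    "clip4str_b_plus.ckpt",
    clip_pretrained="pretrained/clip/ViT-B-16.pt",
).eval().to(device)

x = torch.randn(1, 3, 224, 224, device=device)  # single dummy crop
with torch.inference_mode():
    for _ in range(5):  # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(20):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
print(f"avg latency: {(time.perf_counter() - t0) / 20 * 1000:.1f} ms/image")
```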

The provided lr scheduler `OneCycleLR` doesn't follow PyTorch's LRScheduler API

Thank you for your great work!
I am trying to use Multilingual_CLIP to train clip4str for Vietnamese (with a charset of 229 tokens) on Google Colab.
I have changed the charset and the code in strhub/models/vl_str/system.py and other files so that I can use the text encoder from Multilingual_CLIP for Vietnamese.
Now I am getting the following error from the learning rate scheduler:

The dimension of the visual decoder is 768.
Len of Tokenizer 232
Done creating model!
| Name | Type | Params

0 | clip_model | CLIP | 427 M
1 | clip_model.visual | VisionTransformer | 303 M
2 | clip_model.transformer | Transformer | 85.1 M
3 | clip_model.token_embedding | Embedding | 37.9 M
4 | clip_model.ln_final | LayerNorm | 1.5 K
5 | M_clip_model | MultilingualCLIP | 560 M
6 | M_clip_model.transformer | XLMRobertaModel | 559 M
7 | M_clip_model.LinearTransformation | Linear | 787 K
8 | visual_decoder | Decoder | 9.8 M
9 | visual_decoder.layers | ModuleList | 9.5 M
10 | visual_decoder.text_embed | TokenEmbedding | 178 K
11 | visual_decoder.norm | LayerNorm | 1.5 K
12 | visual_decoder.dropout | Dropout | 0
13 | visual_decoder.head | Linear | 176 K
14 | cross_decoder | Decoder | 9.8 M
15 | cross_decoder.layers | ModuleList | 9.5 M
16 | cross_decoder.text_embed | TokenEmbedding | 178 K
17 | cross_decoder.norm | LayerNorm | 1.5 K
18 | cross_decoder.dropout | Dropout | 0
19 | cross_decoder.head | Linear | 176 K

675 M Trainable params
332 M Non-trainable params
1.0 B Total params
4,031.815 Total estimated model params size (MB)
[dataset] mean (0.48145466, 0.4578275, 0.40821073), std (0.26862954, 0.26130258, 0.27577711)
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:117: UserWarning: When using Trainer(accumulate_grad_batches != 1) and overriding LightningModule.optimizer_{step,zero_grad}, the hooks will not be called on every batch (rather, they are called on every optimization step).
rank_zero_warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[VL4STR] The length of encoder params with and without weight decay is 259 and 479, respectively.
[VL4STR] The length of decoder params with and without weight decay is 14 and 38, respectively.
Loading train_dataloader to estimate number of stepping batches.
dataset root: /content/drive/MyDrive/clip4str/dataset/str_dataset/train/real
lmdb: ArT num samples: 34984
lmdb: The number of training samples is 34984
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Error executing job with overrides: []
Traceback (most recent call last):
File "/content/drive/MyDrive/clip4str/code/clip4str/train.py", line 145, in
main()
File "/usr/local/lib/python3.10/dist-packages/hydra/main.py", line 90, in decorated_main
_run_hydra(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
_run_app(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
run_and_report(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 216, in run_and_report
raise ex
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
return func()
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 453, in
lambda: hydra.run(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/content/drive/MyDrive/clip4str/code/clip4str/train.py", line 104, in main
trainer.fit(model, datamodule=datamodule, ckpt_path=config.ckpt_path)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1217, in _run
self.strategy.setup(self)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/single_device.py", line 72, in setup
super().setup(trainer)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 139, in setup
self.setup_optimizers(trainer)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 128, in setup_optimizers
self.optimizers, self.lr_scheduler_configs, self.optimizer_frequencies = _init_optimizers_and_lr_schedulers(
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 195, in _init_optimizers_and_lr_schedulers
_validate_scheduler_api(lr_scheduler_configs, model)
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 350, in _validate_scheduler_api
raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: The provided lr scheduler OneCycleLR doesn't follow PyTorch's LRScheduler API. You should override the LightningModule.lr_scheduler_step hook with your own logic if you are using a custom LR scheduler.

I cannot see any problem with OneCycleLR. Do you have any suggestions on this matter? Is it a package version problem?
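
This is usually a package version problem: Lightning 1.x validates schedulers against the old torch.optim.lr_scheduler._LRScheduler base class, while OneCycleLR in torch >= 2.0 (the Colab default) derives from the newer LRScheduler, so the isinstance check fails. Pinning torch to the version in requirements.txt is the cleaner fix; otherwise the hook named in the error message can be overridden, roughly as in this hedged sketch:

```python
from strhub.models.vl_str.system import VL4STR


class VL4STRPatched(VL4STR):  # hypothetical subclass, for illustration only
    def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
        # Signature matches pytorch_lightning 1.x. OneCycleLR is stepped once
        # per optimization step and does not use a metric.
        scheduler.step()
```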

Issue with inference

Hi, I am trying to perform inference using the following script:
bash code/clip4str/scripts/read.sh 7 clip4str_b_plus.ckpt /home/shreyans/scratch/tata1mg/clip4str_og/code/clip4str/misc/test_image

The error I get is:

Additional keyword arguments: {}
args.checkpoint /home/shreyans/scratch/tata1mg/clip4str_og/output/clip4str_base16x16_d70bde1f2d.ckpt

config of VL4STR:
image_freeze_nlayer: 0, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False
use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0
use_share_dim: True, image_detach: True
cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False

config of VL4STR:
image_freeze_nlayer: -1, text_freeze_nlayer: 6, freeze_language_backbone: False, freeze_image_backbone: False
use_language_model: True, context_length: 16, cross_token_embeding: False, cross_loss_weight: 1.0
use_share_dim: True, image_detach: True
cross_gt_context: True, cross_cloze_mask: False, cross_fast_decode: False

loading checkpoint from /home/shreyans/scratch/tata1mg/clip4str_og/pretrained/clip/ViT-B-16.pt
The dimension of the visual decoder is 512.
Traceback (most recent call last):
File "/DATA/scratch/shreyans/tata1mg/clip4str_og/code/clip4str/strhub/models/utils.py", line 104, in load_from_checkpoint
model = ModelClass.load_from_checkpoint(checkpoint_path, **kwargs)
File "/home/shreyans/scratch/miniconda3/envs/clip4str/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 161, in load_from_checkpoint
model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
File "/home/shreyans/scratch/miniconda3/envs/clip4str/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 203, in _load_model_state
model = cls(**_cls_kwargs)
File "/DATA/scratch/shreyans/tata1mg/clip4str_og/code/clip4str/strhub/models/vl_str/system.py", line 70, in init
assert os.path.exists(kwargs["clip_pretrained"])
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/shreyans/scratch/tata1mg/clip4str_og/code/clip4str/read.py", line 54, in
main()
File "/home/shreyans/scratch/miniconda3/envs/clip4str/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/shreyans/scratch/tata1mg/clip4str_og/code/clip4str/read.py", line 37, in main
model = load_from_checkpoint(args.checkpoint, **kwargs).eval().to(args.device)
File "/DATA/scratch/shreyans/tata1mg/clip4str_og/code/clip4str/strhub/models/utils.py", line 113, in load_from_checkpoint
model.load_state_dict(checkpoint)
File "/home/shreyans/scratch/miniconda3/envs/clip4str/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for VL4STR:
Missing key(s) in state_dict: "clip_model.positional_embedding", "clip_model.text_projection", "clip_model.logit_scale", "clip_model.visual.class_embedding", "clip_model.visual.positional_embedding", "clip_model.visual.proj", "clip_model.visual.conv1.weight", "clip_model.visual.ln_pre.weight", "clip_model.visual.ln_pre.bias", "clip_model.visual.transformer.resblocks.0.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.0.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.0.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.0.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.0.ln_1.weight", "clip_model.visual.transformer.resblocks.0.ln_1.bias", "clip_model.visual.transformer.resblocks.0.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.0.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.0.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.0.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.0.ln_2.weight", "clip_model.visual.transformer.resblocks.0.ln_2.bias", "clip_model.visual.transformer.resblocks.1.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.1.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.1.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.1.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.1.ln_1.weight", "clip_model.visual.transformer.resblocks.1.ln_1.bias", "clip_model.visual.transformer.resblocks.1.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.1.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.1.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.1.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.1.ln_2.weight", "clip_model.visual.transformer.resblocks.1.ln_2.bias", "clip_model.visual.transformer.resblocks.2.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.2.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.2.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.2.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.2.ln_1.weight", "clip_model.visual.transformer.resblocks.2.ln_1.bias", "clip_model.visual.transformer.resblocks.2.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.2.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.2.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.2.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.2.ln_2.weight", "clip_model.visual.transformer.resblocks.2.ln_2.bias", "clip_model.visual.transformer.resblocks.3.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.3.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.3.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.3.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.3.ln_1.weight", "clip_model.visual.transformer.resblocks.3.ln_1.bias", "clip_model.visual.transformer.resblocks.3.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.3.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.3.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.3.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.3.ln_2.weight", "clip_model.visual.transformer.resblocks.3.ln_2.bias", "clip_model.visual.transformer.resblocks.4.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.4.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.4.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.4.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.4.ln_1.weight", 
"clip_model.visual.transformer.resblocks.4.ln_1.bias", "clip_model.visual.transformer.resblocks.4.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.4.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.4.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.4.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.4.ln_2.weight", "clip_model.visual.transformer.resblocks.4.ln_2.bias", "clip_model.visual.transformer.resblocks.5.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.5.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.5.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.5.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.5.ln_1.weight", "clip_model.visual.transformer.resblocks.5.ln_1.bias", "clip_model.visual.transformer.resblocks.5.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.5.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.5.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.5.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.5.ln_2.weight", "clip_model.visual.transformer.resblocks.5.ln_2.bias", "clip_model.visual.transformer.resblocks.6.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.6.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.6.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.6.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.6.ln_1.weight", "clip_model.visual.transformer.resblocks.6.ln_1.bias", "clip_model.visual.transformer.resblocks.6.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.6.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.6.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.6.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.6.ln_2.weight", "clip_model.visual.transformer.resblocks.6.ln_2.bias", "clip_model.visual.transformer.resblocks.7.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.7.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.7.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.7.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.7.ln_1.weight", "clip_model.visual.transformer.resblocks.7.ln_1.bias", "clip_model.visual.transformer.resblocks.7.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.7.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.7.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.7.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.7.ln_2.weight", "clip_model.visual.transformer.resblocks.7.ln_2.bias", "clip_model.visual.transformer.resblocks.8.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.8.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.8.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.8.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.8.ln_1.weight", "clip_model.visual.transformer.resblocks.8.ln_1.bias", "clip_model.visual.transformer.resblocks.8.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.8.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.8.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.8.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.8.ln_2.weight", "clip_model.visual.transformer.resblocks.8.ln_2.bias", "clip_model.visual.transformer.resblocks.9.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.9.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.9.attn.out_proj.weight", 
"clip_model.visual.transformer.resblocks.9.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.9.ln_1.weight", "clip_model.visual.transformer.resblocks.9.ln_1.bias", "clip_model.visual.transformer.resblocks.9.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.9.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.9.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.9.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.9.ln_2.weight", "clip_model.visual.transformer.resblocks.9.ln_2.bias", "clip_model.visual.transformer.resblocks.10.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.10.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.10.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.10.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.10.ln_1.weight", "clip_model.visual.transformer.resblocks.10.ln_1.bias", "clip_model.visual.transformer.resblocks.10.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.10.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.10.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.10.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.10.ln_2.weight", "clip_model.visual.transformer.resblocks.10.ln_2.bias", "clip_model.visual.transformer.resblocks.11.attn.in_proj_weight", "clip_model.visual.transformer.resblocks.11.attn.in_proj_bias", "clip_model.visual.transformer.resblocks.11.attn.out_proj.weight", "clip_model.visual.transformer.resblocks.11.attn.out_proj.bias", "clip_model.visual.transformer.resblocks.11.ln_1.weight", "clip_model.visual.transformer.resblocks.11.ln_1.bias", "clip_model.visual.transformer.resblocks.11.mlp.c_fc.weight", "clip_model.visual.transformer.resblocks.11.mlp.c_fc.bias", "clip_model.visual.transformer.resblocks.11.mlp.c_proj.weight", "clip_model.visual.transformer.resblocks.11.mlp.c_proj.bias", "clip_model.visual.transformer.resblocks.11.ln_2.weight", "clip_model.visual.transformer.resblocks.11.ln_2.bias", "clip_model.visual.ln_post.weight", "clip_model.visual.ln_post.bias", "clip_model.transformer.resblocks.0.attn.in_proj_weight", "clip_model.transformer.resblocks.0.attn.in_proj_bias", "clip_model.transformer.resblocks.0.attn.out_proj.weight", "clip_model.transformer.resblocks.0.attn.out_proj.bias", "clip_model.transformer.resblocks.0.ln_1.weight", "clip_model.transformer.resblocks.0.ln_1.bias", "clip_model.transformer.resblocks.0.mlp.c_fc.weight", "clip_model.transformer.resblocks.0.mlp.c_fc.bias", "clip_model.transformer.resblocks.0.mlp.c_proj.weight", "clip_model.transformer.resblocks.0.mlp.c_proj.bias", "clip_model.transformer.resblocks.0.ln_2.weight", "clip_model.transformer.resblocks.0.ln_2.bias", "clip_model.transformer.resblocks.1.attn.in_proj_weight", "clip_model.transformer.resblocks.1.attn.in_proj_bias", "clip_model.transformer.resblocks.1.attn.out_proj.weight", "clip_model.transformer.resblocks.1.attn.out_proj.bias", "clip_model.transformer.resblocks.1.ln_1.weight", "clip_model.transformer.resblocks.1.ln_1.bias", "clip_model.transformer.resblocks.1.mlp.c_fc.weight", "clip_model.transformer.resblocks.1.mlp.c_fc.bias", "clip_model.transformer.resblocks.1.mlp.c_proj.weight", "clip_model.transformer.resblocks.1.mlp.c_proj.bias", "clip_model.transformer.resblocks.1.ln_2.weight", "clip_model.transformer.resblocks.1.ln_2.bias", "clip_model.transformer.resblocks.2.attn.in_proj_weight", "clip_model.transformer.resblocks.2.attn.in_proj_bias", "clip_model.transformer.resblocks.2.attn.out_proj.weight", 
"clip_model.transformer.resblocks.2.attn.out_proj.bias", "clip_model.transformer.resblocks.2.ln_1.weight", "clip_model.transformer.resblocks.2.ln_1.bias", "clip_model.transformer.resblocks.2.mlp.c_fc.weight", "clip_model.transformer.resblocks.2.mlp.c_fc.bias", "clip_model.transformer.resblocks.2.mlp.c_proj.weight", "clip_model.transformer.resblocks.2.mlp.c_proj.bias", "clip_model.transformer.resblocks.2.ln_2.weight", "clip_model.transformer.resblocks.2.ln_2.bias", "clip_model.transformer.resblocks.3.attn.in_proj_weight", "clip_model.transformer.resblocks.3.attn.in_proj_bias", "clip_model.transformer.resblocks.3.attn.out_proj.weight", "clip_model.transformer.resblocks.3.attn.out_proj.bias", "clip_model.transformer.resblocks.3.ln_1.weight", "clip_model.transformer.resblocks.3.ln_1.bias", "clip_model.transformer.resblocks.3.mlp.c_fc.weight", "clip_model.transformer.resblocks.3.mlp.c_fc.bias", "clip_model.transformer.resblocks.3.mlp.c_proj.weight", "clip_model.transformer.resblocks.3.mlp.c_proj.bias", "clip_model.transformer.resblocks.3.ln_2.weight", "clip_model.transformer.resblocks.3.ln_2.bias", "clip_model.transformer.resblocks.4.attn.in_proj_weight", "clip_model.transformer.resblocks.4.attn.in_proj_bias", "clip_model.transformer.resblocks.4.attn.out_proj.weight", "clip_model.transformer.resblocks.4.attn.out_proj.bias", "clip_model.transformer.resblocks.4.ln_1.weight", "clip_model.transformer.resblocks.4.ln_1.bias", "clip_model.transformer.resblocks.4.mlp.c_fc.weight", "clip_model.transformer.resblocks.4.mlp.c_fc.bias", "clip_model.transformer.resblocks.4.mlp.c_proj.weight", "clip_model.transformer.resblocks.4.mlp.c_proj.bias", "clip_model.transformer.resblocks.4.ln_2.weight", "clip_model.transformer.resblocks.4.ln_2.bias", "clip_model.transformer.resblocks.5.attn.in_proj_weight", "clip_model.transformer.resblocks.5.attn.in_proj_bias", "clip_model.transformer.resblocks.5.attn.out_proj.weight", "clip_model.transformer.resblocks.5.attn.out_proj.bias", "clip_model.transformer.resblocks.5.ln_1.weight", "clip_model.transformer.resblocks.5.ln_1.bias", "clip_model.transformer.resblocks.5.mlp.c_fc.weight", "clip_model.transformer.resblocks.5.mlp.c_fc.bias", "clip_model.transformer.resblocks.5.mlp.c_proj.weight", "clip_model.transformer.resblocks.5.mlp.c_proj.bias", "clip_model.transformer.resblocks.5.ln_2.weight", "clip_model.transformer.resblocks.5.ln_2.bias", "clip_model.transformer.resblocks.6.attn.in_proj_weight", "clip_model.transformer.resblocks.6.attn.in_proj_bias", "clip_model.transformer.resblocks.6.attn.out_proj.weight", "clip_model.transformer.resblocks.6.attn.out_proj.bias", "clip_model.transformer.resblocks.6.ln_1.weight", "clip_model.transformer.resblocks.6.ln_1.bias", "clip_model.transformer.resblocks.6.mlp.c_fc.weight", "clip_model.transformer.resblocks.6.mlp.c_fc.bias", "clip_model.transformer.resblocks.6.mlp.c_proj.weight", "clip_model.transformer.resblocks.6.mlp.c_proj.bias", "clip_model.transformer.resblocks.6.ln_2.weight", "clip_model.transformer.resblocks.6.ln_2.bias", "clip_model.transformer.resblocks.7.attn.in_proj_weight", "clip_model.transformer.resblocks.7.attn.in_proj_bias", "clip_model.transformer.resblocks.7.attn.out_proj.weight", "clip_model.transformer.resblocks.7.attn.out_proj.bias", "clip_model.transformer.resblocks.7.ln_1.weight", "clip_model.transformer.resblocks.7.ln_1.bias", "clip_model.transformer.resblocks.7.mlp.c_fc.weight", "clip_model.transformer.resblocks.7.mlp.c_fc.bias", "clip_model.transformer.resblocks.7.mlp.c_proj.weight", 
"clip_model.transformer.resblocks.7.mlp.c_proj.bias", "clip_model.transformer.resblocks.7.ln_2.weight", "clip_model.transformer.resblocks.7.ln_2.bias", "clip_model.transformer.resblocks.8.attn.in_proj_weight", "clip_model.transformer.resblocks.8.attn.in_proj_bias", "clip_model.transformer.resblocks.8.attn.out_proj.weight", "clip_model.transformer.resblocks.8.attn.out_proj.bias", "clip_model.transformer.resblocks.8.ln_1.weight", "clip_model.transformer.resblocks.8.ln_1.bias", "clip_model.transformer.resblocks.8.mlp.c_fc.weight", "clip_model.transformer.resblocks.8.mlp.c_fc.bias", "clip_model.transformer.resblocks.8.mlp.c_proj.weight", "clip_model.transformer.resblocks.8.mlp.c_proj.bias", "clip_model.transformer.resblocks.8.ln_2.weight", "clip_model.transformer.resblocks.8.ln_2.bias", "clip_model.transformer.resblocks.9.attn.in_proj_weight", "clip_model.transformer.resblocks.9.attn.in_proj_bias", "clip_model.transformer.resblocks.9.attn.out_proj.weight", "clip_model.transformer.resblocks.9.attn.out_proj.bias", "clip_model.transformer.resblocks.9.ln_1.weight", "clip_model.transformer.resblocks.9.ln_1.bias", "clip_model.transformer.resblocks.9.mlp.c_fc.weight", "clip_model.transformer.resblocks.9.mlp.c_fc.bias", "clip_model.transformer.resblocks.9.mlp.c_proj.weight", "clip_model.transformer.resblocks.9.mlp.c_proj.bias", "clip_model.transformer.resblocks.9.ln_2.weight", "clip_model.transformer.resblocks.9.ln_2.bias", "clip_model.transformer.resblocks.10.attn.in_proj_weight", "clip_model.transformer.resblocks.10.attn.in_proj_bias", "clip_model.transformer.resblocks.10.attn.out_proj.weight", "clip_model.transformer.resblocks.10.attn.out_proj.bias", "clip_model.transformer.resblocks.10.ln_1.weight", "clip_model.transformer.resblocks.10.ln_1.bias", "clip_model.transformer.resblocks.10.mlp.c_fc.weight", "clip_model.transformer.resblocks.10.mlp.c_fc.bias", "clip_model.transformer.resblocks.10.mlp.c_proj.weight", "clip_model.transformer.resblocks.10.mlp.c_proj.bias", "clip_model.transformer.resblocks.10.ln_2.weight", "clip_model.transformer.resblocks.10.ln_2.bias", "clip_model.transformer.resblocks.11.attn.in_proj_weight", "clip_model.transformer.resblocks.11.attn.in_proj_bias", "clip_model.transformer.resblocks.11.attn.out_proj.weight", "clip_model.transformer.resblocks.11.attn.out_proj.bias", "clip_model.transformer.resblocks.11.ln_1.weight", "clip_model.transformer.resblocks.11.ln_1.bias", "clip_model.transformer.resblocks.11.mlp.c_fc.weight", "clip_model.transformer.resblocks.11.mlp.c_fc.bias", "clip_model.transformer.resblocks.11.mlp.c_proj.weight", "clip_model.transformer.resblocks.11.mlp.c_proj.bias", "clip_model.transformer.resblocks.11.ln_2.weight", "clip_model.transformer.resblocks.11.ln_2.bias", "clip_model.token_embedding.weight", "clip_model.ln_final.weight", "clip_model.ln_final.bias", "visual_decoder.pos_queries", "visual_decoder.layers.0.self_attn.in_proj_weight", "visual_decoder.layers.0.self_attn.in_proj_bias", "visual_decoder.layers.0.self_attn.out_proj.weight", "visual_decoder.layers.0.self_attn.out_proj.bias", "visual_decoder.layers.0.cross_attn.in_proj_weight", "visual_decoder.layers.0.cross_attn.in_proj_bias", "visual_decoder.layers.0.cross_attn.out_proj.weight", "visual_decoder.layers.0.cross_attn.out_proj.bias", "visual_decoder.layers.0.linear1.weight", "visual_decoder.layers.0.linear1.bias", "visual_decoder.layers.0.linear2.weight", "visual_decoder.layers.0.linear2.bias", "visual_decoder.layers.0.norm1.weight", "visual_decoder.layers.0.norm1.bias", 
"visual_decoder.layers.0.norm2.weight", "visual_decoder.layers.0.norm2.bias", "visual_decoder.layers.0.norm_q.weight", "visual_decoder.layers.0.norm_q.bias", "visual_decoder.layers.0.norm_c.weight", "visual_decoder.layers.0.norm_c.bias", "visual_decoder.text_embed.embedding.weight", "visual_decoder.norm.weight", "visual_decoder.norm.bias", "visual_decoder.head.weight", "visual_decoder.head.bias", "cross_decoder.pos_queries", "cross_decoder.layers.0.self_attn.in_proj_weight", "cross_decoder.layers.0.self_attn.in_proj_bias", "cross_decoder.layers.0.self_attn.out_proj.weight", "cross_decoder.layers.0.self_attn.out_proj.bias", "cross_decoder.layers.0.cross_attn.in_proj_weight", "cross_decoder.layers.0.cross_attn.in_proj_bias", "cross_decoder.layers.0.cross_attn.out_proj.weight", "cross_decoder.layers.0.cross_attn.out_proj.bias", "cross_decoder.layers.0.linear1.weight", "cross_decoder.layers.0.linear1.bias", "cross_decoder.layers.0.linear2.weight", "cross_decoder.layers.0.linear2.bias", "cross_decoder.layers.0.norm1.weight", "cross_decoder.layers.0.norm1.bias", "cross_decoder.layers.0.norm2.weight", "cross_decoder.layers.0.norm2.bias", "cross_decoder.layers.0.norm_q.weight", "cross_decoder.layers.0.norm_q.bias", "cross_decoder.layers.0.norm_c.weight", "cross_decoder.layers.0.norm_c.bias", "cross_decoder.text_embed.embedding.weight", "cross_decoder.norm.weight", "cross_decoder.norm.bias", "cross_decoder.head.weight", "cross_decoder.head.bias".
Unexpected key(s) in state_dict: "epoch", "global_step", "pytorch-lightning_version", "state_dict", "loops", "callbacks", "optimizer_states", "lr_schedulers", "NativeMixedPrecisionPlugin", "hparams_name", "hyper_parameters"

I have done everything as described; what could be the reason?
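
What the traceback suggests (a hedged reading, with placeholder paths): the first failure is the assert on kwargs["clip_pretrained"], i.e. the CLIP ViT-B/16 weights were not found at the path passed by the script; the fallback then hands the raw Lightning checkpoint dict (which still wraps the weights under "state_dict" next to "epoch", "optimizer_states", etc.) to load_state_dict, which explains the missing and unexpected keys. Pointing clip_pretrained at an existing file should avoid the fallback entirely:

```python
from strhub.models.utils import load_from_checkpoint

# Paths are placeholders; clip_pretrained must point at the downloaded
# ViT-B-16.pt, otherwise VL4STR.__init__ hits the AssertionError above.
model = load_from_checkpoint(
    "output/clip4str_base16x16_d70bde1f2d.ckpt",
    clip_pretrained="pretrained/clip/ViT-B-16.pt",
).eval()
```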

Bug in run bash script

[screenshot of the error]
I ran this in Google Colab and got an error from the bash script. How can I run this model?

Convert to ONNX

How can I convert the model to ONNX for inference?
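
A rough starting point (a sketch under assumptions, not a verified recipe: the autoregressive decoder contains Python-level control flow, so tracing may unroll it for a fixed maximum length and the export may need further adjustment; paths and the 224x224 input size are placeholders):

```python
import torch
from strhub.models.utils import load_from_checkpoint

model = load_from_checkpoint(
    "clip4str_b_plus.ckpt",
    clip_pretrained="pretrained/clip/ViT-B-16.pt",  # may be required locally
).eval()

dummy = torch.randn(1, 3, 224, 224)  # assumed input resolution
torch.onnx.export(
    model, dummy, "clip4str.onnx",
    input_names=["image"], output_names=["logits"],
    opset_version=14,
    dynamic_axes={"image": {0: "batch"}},
)
```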

clip4str attention graph

This is excellent work, and I am very interested in the CLIP attention maps shown in it.

Could you share the code used to generate the CLIP attention visualizations in the paper?

Thank you very much.

Is there a way to detect spaces?

Thank you for the great work and the released models. I noticed the tokenizer does not include spaces. Was the model not trained on them, or is there a way to add them to the tokenizer?
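
One hedged way to check is to inspect the hyper-parameters stored in the released checkpoint (the key names below are assumptions based on the usual Lightning/strhub layout); if the space character is absent from the training charset, the released weights cannot predict it, and retraining with an extended charset would likely be needed:

```python
import torch

ckpt = torch.load("clip4str_b_plus.ckpt", map_location="cpu")  # placeholder path
# "hyper_parameters" is present in Lightning checkpoints; "charset_train" is
# the usual strhub hyper-parameter holding the training character set.
print(ckpt["hyper_parameters"].get("charset_train"))
```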
