
roatienza / deep-text-recognition-benchmark


PyTorch code of my ICDAR 2021 paper Vision Transformer for Fast and Efficient Scene Text Recognition (ViTSTR)

License: Apache License 2.0

Languages: Jupyter Notebook 82.31%, Python 17.69%
Topics: str, vision-transformer, ocr, vitstr

deep-text-recognition-benchmark's People

Contributors

akarazniewicz, clovaaiadmin, coallaoh, edwardpwtsoi, gwkrsrch, ku21fan, roatienza, sangkwun, soonge, tgalkovskyi, tjdevworks, varshaneya, wodeyuzhou, yacobby

deep-text-recognition-benchmark's Issues

ONNX

Hi, thank you for your great work.
Would you please add code for converting the .pth checkpoint to ONNX?
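For anyone looking for a starting point: below is a minimal export sketch, assuming the ViTSTR model has already been rebuilt with the same options used for training and the checkpoint loaded into it. The single-channel 224 x 224 input shape, the opset version, and the output file name are assumptions, and whether the forward pass traces cleanly to ONNX is not confirmed by the repository.

import torch

# Assumed: `model` is a ViTSTR model built with the training-time options,
# with the .pth checkpoint already loaded, in eval mode on the CPU.
model.eval()

# Assumed input: a single-channel 224 x 224 image with a dynamic batch dimension.
dummy_input = torch.randn(1, 1, 224, 224)

torch.onnx.export(
    model,
    (dummy_input,),
    "vitstr.onnx",
    input_names=["image"],
    output_names=["logits"],
    opset_version=14,
    dynamic_axes={"image": {0: "batch"}, "logits": {0: "batch"}},
)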

About the difference between the number of training iters in the paper and this Repo

Thanks for your great work and source code! Table 2 of the paper lists 300 training epochs, but the source code trains for 300,000 iterations. The data augmentations in the code are quite aggressive, so I suspect a longer training schedule is necessary. Which setting did you use in your experiments? I also wonder whether you have run experiments showing after how many iterations performance stabilizes under your strong augmentation setting.
I look forward to your reply!

about ACC

Hi, I ran test.py with this command:
CUDA_VISIBLE_DEVICES=0 python test.py --eval_data ../../data/data_lmdb_release/evaluation/ --benchmark_all_eval --Transformation None --FeatureExtraction None --SequenceModeling None --Prediction None --Transformer --sensitive --data_filtering_off --imgH 224 --imgW 224 --workers 0 --TransformerModel=vitstr_small_patch16_224 --saved_model ./pre_model/vitstr_small_patch16_224_aug.pth

I got the following result:
accuracy: IIIT5k_3000: 86.233 SVT: 87.172 IC03_860: 94.186 IC03_867: 93.887 IC13_857: 92.415 IC13_1015: 91.527 IC15_1811: 78.078 IC15_2077: 71.931 SVTP: 81.550 CUTE80: 77.083 total_accuracy: 84.130 averaged_infer_time: 0.410 # parameters: 21.506

This is a little different from what you show on GitHub; is this your best model?

why don't you normalize the images?

Thanks for your work. I found that you don't normalize the images before training. Does the Transformer work better this way? I look forward to your reply!

About Training time.

Hello. First of all, I would like to express my appreciation for such a great repo.
Unfortunately, I am facing a problem with training:
[image attachment]
Training takes almost three to four days on three GPUs, and I don't know why. The number of iterations is 300,000. How can I set the number of epochs?
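As a rough guide (not the repository's documented procedure): the trainer counts iterations rather than epochs, so a target epoch count has to be converted by hand. The --num_iter flag name follows the upstream benchmark code, and the dataset size and batch size below are placeholders to replace with your own values.

# Rough conversion from a target number of epochs to an iteration count.
num_train_samples = 5_000_000   # placeholder: total samples in your training set
batch_size = 192                # placeholder: effective (global) batch size

iters_per_epoch = num_train_samples // batch_size
target_epochs = 5               # placeholder: desired number of epochs
num_iter = target_epochs * iters_per_epoch
print(f"pass --num_iter {num_iter} to train for roughly {target_epochs} epochs")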

a question about ViTSTR

Hi, thank you for your work; it is very meaningful. I have a question regarding the algorithm design.
You first convert the input image into patches. If some characters are cut off at patch boundaries, or a patch contains multiple characters, does that have an impact?
Looking forward to your reply.

model state loading issue

I tried to rerun the model with the ViTSTR-tiny weights, but I got Missing and Unexpected key(s) in state_dict errors while loading the model state.
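A frequent cause of this is that the checkpoint was saved from a torch.nn.DataParallel wrapper, so every key carries a module. prefix; below is a minimal loading sketch under that assumption (the model itself must still be built with the same options used to train the checkpoint).

import torch

# Assumed: `model` is a ViTSTR-tiny model built with the same character set,
# image size, and other options used to train the checkpoint.
state_dict = torch.load("vitstr_tiny_patch16_224.pth", map_location="cpu")

# Checkpoints saved from nn.DataParallel prefix every key with "module.".
cleaned = {k.replace("module.", "", 1): v for k, v in state_dict.items()}

missing, unexpected = model.load_state_dict(cleaned, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)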

Trained model?

Thanks for your excellent work. Could you please share the weights?

About input size

Hi, thank you for your work. This is a very meaningful job.
I am curious if the input size is the same as TRBA (32 x 100).
Have you tried training with 32 x 100 input-sized images?

CTC error

Hi. I appreciate your contribution, but I have a problem when using CTC:

CUDA_VISIBLE_DEVICES=4 python3 train.py --batch_ratio 1 --Transformation None --FeatureExtraction None --SequenceModeling None --Prediction CTC --Transformer --TransformerModel=vitstr_tiny_patch16_224 --imgH 224 --imgW 224 --manualSeed=27720

error:
Traceback (most recent call last):
File "train.py", line 320, in <module>
train(opt)
File "train.py", line 175, in train
preds = model(image, text)
UnboundLocalError: local variable 'text' referenced before assignment

About the speed of the model in Table 4 of the paper

Hello, thank you very much for open-sourcing the code; it is very rewarding work.
How were the speeds of the different models in Table 4 calculated?
When I benchmark the vitstr-tiny model using the weights vitstr_tiny_patch16_224.pth provided in the repository, the output is averaged_infer_time: 0.116, which is quite different from the 9.3 msec/image reported in the paper. How should I measure the model's speed accurately? I look forward to your help, thank you very much! (PS: I am running on an NVIDIA GeForce RTX 2080 Ti.)
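For what it is worth, per-image GPU timings usually need a warm-up and torch.cuda.synchronize(), otherwise the measurement includes data loading and asynchronous kernel launches. A minimal timing sketch is below, assuming `model` and a preprocessed batch `images` are already on the same CUDA device; this is not necessarily how the paper's numbers were produced.

import time
import torch

model.eval()
with torch.no_grad():
    # Warm-up so kernel compilation and caching do not pollute the timing.
    for _ in range(10):
        model(images)

    torch.cuda.synchronize()
    start = time.time()
    runs = 100
    for _ in range(runs):
        model(images)
    torch.cuda.synchronize()
    elapsed = time.time() - start

msec_per_image = elapsed / (runs * images.size(0)) * 1000
print(f"{msec_per_image:.2f} msec/image")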

Training on Japanese data

Could you please tell us what changes one should make to train the network for Japanese or any other language?

Question about [GO] and [s]

Hi, thanks for your amazing work.
When you convert the label using the TokenLabelConverter class, you pad the label with [GO], which is ignored during loss calculation; however, Figure 4 of the paper shows the label padded with [s].
Does this make any difference in accuracy?

How to calculate Top-5 accuracy?

If I change
_, preds_index = preds.topk(1, dim=-1, largest=True, sorted=True)
to
_, preds_index = preds.topk(k=5, dim=-1, largest=True, sorted=True),
the program raises an error:

dataset_root:    data_lmdb_release/evaluation/CUTE80     dataset: /
sub-directory:  /.       num samples: 288
Traceback (most recent call last):
  File "test.py", line 318, in <module>
    test(opt)
  File "test.py", line 271, in test
    benchmark_all_eval(model, criterion, converter, opt)
  File "test.py", line 57, in benchmark_all_eval
    _, accuracy_by_best_model, norm_ED_by_best_model, _, _, _, infer_time, length_of_data = validation(
  File "test.py", line 151, in validation
    preds_str = converter.decode(preds_index[:,1], length_for_pred)
  File "/home/WeiHongxi/PengHusile/Server/ViTSTR/utils.py", line 197, in decode
    text = ''.join([self.character[i] for i in text_index[index, :]])
IndexError: too many indices for tensor of dimension 1

or the accuracy drops to 0%.
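One likely cause: converter.decode expects a 2-D [batch, seq_len] index tensor, and slicing the extra top-k dimension this way hands it the wrong shape. A minimal sketch that keeps the k dimension and decodes each rank separately is below; the preds shape [batch, seq_len, num_classes] is an assumption, and note that this yields per-position rank-k characters, not true top-5 sequence hypotheses (that would require beam search).

# Assumed: preds has shape [batch, seq_len, num_classes].
k = 5
_, preds_index = preds.topk(k, dim=-1, largest=True, sorted=True)  # [batch, seq_len, k]

# Decode each of the k per-position candidates as its own sequence,
# so the converter always receives a 2-D [batch, seq_len] tensor.
candidates = []
for rank in range(k):
    preds_str = converter.decode(preds_index[:, :, rank], length_for_pred)
    candidates.append(preds_str)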

about input channels of vitstr

I am reproducing the vit-tiny model but ran into a problem when testing on IIIT5k: it reports a mismatched number of channels. The input images are in RGB mode, but ViTSTR expects a single input channel. Why does this happen? I'm looking forward to your reply.
Thank you


RuntimeError Traceback (most recent call last)
/tmp/ipykernel_53850/3528011160.py in <module>
296 opt.num_gpu = torch.cuda.device_count()
297
--> 298 test(opt)

/tmp/ipykernel_53850/3528011160.py in test(opt)
253 with torch.no_grad():
254
--> 255 _, accuracy_by_best_model, norm_ED_by_best_model, _, _, _, infer_time, length_of_data = validation(
256 model, criterion, test_loader, converter, opt)
257 print(f'{accuracy_by_best_model:0.3f}')

/tmp/ipykernel_53850/3528011160.py in validation(model, criterion, evaluation_loader, converter, opt)
39 start_time = time.time()
40
---> 41 preds = model(image, seqlen=converter.batch_max_length,is_train=False)
42 _, preds_index = preds.topk(1, dim=-1, largest=True, sorted=True)
43 forward_time = time.time() - start_time

~/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),

~/model.py in forward(self, input, is_train, seqlen)
44
45 """ Prediction stage """
---> 46 prediction = self.vitstr(input, seqlen=seqlen)
47
48 return prediction

~/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),

~/modules/Vistr.py in forward(self, x, seqlen)
73
74 def forward(self, x, seqlen: int =25):
---> 75 x = self.forward_features(x)
76 x = x[:, :seqlen]
77

~/modules/Vistr.py in forward_features(self, x)
59 def forward_features(self, x):
60 B = x.shape[0]
---> 61 x = self.patch_embed(x)
62
63 cls_tokens = self.cls_token.expand(B, -1, -1) # stole cls_tokens impl from Phil Wang, thanks

~/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),

~/miniconda3/lib/python3.8/site-packages/timm/models/layers/patch_embed.py in forward(self, x)
33 _assert(H == self.img_size[0], f"Input image height ({H}) doesn't match model ({self.img_size[0]}).")
34 _assert(W == self.img_size[1], f"Input image width ({W}) doesn't match model ({self.img_size[1]}).")
---> 35 x = self.proj(x)
36 if self.flatten:
37 x = x.flatten(2).transpose(1, 2) # BCHW -> BNC

~/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),

~/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py in forward(self, input)
421
422 def forward(self, input: Tensor) -> Tensor:
--> 423 return self._conv_forward(input, self.weight)
424
425 class Conv3d(_ConvNd):

~/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py in _conv_forward(self, input, weight)
417 weight, self.bias, self.stride,
418 _pair(0), self.dilation, self.groups)
--> 419 return F.conv2d(input, weight, self.bias, self.stride,
420 self.padding, self.dilation, self.groups)
421

RuntimeError: Given groups=1, weight of size [192, 1, 16, 16], expected input[16, 3, 224, 224] to have 1 channels, but got 3 channels instead
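The released weights use a single-channel patch embedding, so evaluation images have to reach the model as grayscale. In the benchmark code this normally corresponds to not enabling the RGB option rather than converting by hand, but a minimal stand-alone preprocessing sketch (file name and transforms are illustrative only) looks like this:

from PIL import Image
import torchvision.transforms as T

# Assumed: the model expects a 1 x 224 x 224 grayscale input.
preprocess = T.Compose([
    T.Grayscale(num_output_channels=1),  # collapse RGB to a single channel
    T.Resize((224, 224)),
    T.ToTensor(),
])

img = Image.open("word_crop.png")      # hypothetical cropped word image
x = preprocess(img).unsqueeze(0)       # shape [1, 1, 224, 224]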

How to draw the attention map of ViTSTR?

Hello, thank you very much for open-sourcing the code; it is very rewarding work.
I am a graduate student and want to run some experiments of my own based on ViTSTR. I would like to draw an attention map similar to the one shown in Fig. 9 of your paper; could you give me some guidance? Thank you very much!

Training from scratch, w/o using Pretrained DeiT?

Thanks for sharing the source code!
I noticed that you used the pretrained DeiT weights instead of training from scratch.
However, I see that you emphasize the efficiency of your model.
I wonder whether there is some issue that makes training from scratch difficult.

Available Model weights.

Hi, thanks for the nice work. I'm trying to use the available vitstr_base_patch16_224_aug weights with the infer.py script. So far it is not working, because the model is not built properly. Could you please advise how to load the pretrained model from the given checkpoint? Thanks.

CUDA out of memory.

Hello, when reproducing the vit-tiny model, I use four 2080 Ti GPUs according to your configuration and it still does not work: it reports CUDA out of memory. What could be the reason? My configuration is as follows:

RANDOM=$$
GPU=0,1,2,3
CUDA_VISIBLE_DEVICES=${GPU} \
python3 train.py --train_data data_lmdb_release/training \
--valid_data data_lmdb_release/evaluation \
--select_data MJ-ST \
--batch_ratio 0.5-0.5 \
--Transformation None \
--FeatureExtraction None \
--SequenceModeling None \
--Prediction None \
--Transformer \
--TransformerModel vitstr_tiny_patch16_224 \
--imgH 224 \
--imgW 224 \
--manualSeed=$RANDOM \
--sensitive \
--valInterval 5000 \
--workers 6 \
--batch_size 48

pretrained-model loading with errors

Hello,
I used a single-GPU environment with python==3.8, torch==1.8.1 and torchvision==0.9.1.
I followed the GitHub instructions with the following command:

python3 infer.py --gpu --image demo_image/demo_2.jpg --model vitstr_small_patch16_224.pth

It returned an error:

AttributeError: 'collections.OrderedDict' object has no attribute 'to'

It seems that the call model = torch.load(checkpoint) in infer.py returns an ordered dict instead of the model object.
One way to work around the problem is:

ordered_dict = torch.load(checkpoint)
model.load_state_dict(ordered_dict)

But I do not know the hyperparameters that were used when vitstr_small_patch16_224.pth was trained, so it is very hard for me to initialize the model object with the correct hyperparameters.
Would it be possible to make the hyperparameters of the pretrained models public?

I also tried the JIT-scripted .pt model:

python3 infer.py --gpu --image demo_image/demo_2.jpg --model vitstr_small_patch16_jit.pt

it gives the following error:

  File "E:\ProgramFiles\anaconda3\envs\vitstr\lib\site-packages\spyder_kernels\py3compat.py", line 356, in compat_exec
    exec(code, globals, locals)

  File "e:\projects\deep-text-recognition-benchmark-master\infer.py", line 147, in <module>
    data = infer(args)

  File "e:\projects\deep-text-recognition-benchmark-master\infer.py", line 121, in infer
    model = torch.load(checkpoint)

  File "E:\ProgramFiles\anaconda3\envs\vitstr\lib\site-packages\torch\serialization.py", line 591, in load
    return torch.jit.load(opened_file)

  File "E:\ProgramFiles\anaconda3\envs\vitstr\lib\site-packages\torch\jit\_serialization.py", line 163, in load
    cpp_module = torch._C.import_ir_module_from_buffer(

RuntimeError: 
Unknown type name 'NoneType':
Serialized   File "code/__torch__/modules/vitstr.py", line 12
  embed_dim : int
  num_tokens : int
  dist_token : NoneType
               ~~~~~~~~ <--- HERE
  head_dist : NoneType
  patch_embed : __torch__.timm.models.layers.patch_embed.PatchEmbed

Is there any way to load the model correctly?
Many thanks.

Is there any performance comparison with clovaai/deep-text-recognition-benchmark

Hi, I trained two text recognition models (on my own data) using the following repos:
[1] clovaai/deep-text-recognition-benchmark
[2] roatienza/deep-text-recognition-benchmark

but [1] got better accuracy ([1]: 0.94, [2]: 0.85).
Is there any performance comparison with [1] on an open dataset?
Is there anything I should be aware of?
Thanks a lot.

Rand Aug

Hello @roatienza!

Thanks for this great repo!

I am trying to train using rand_aug, but I am facing some issues. I get an error in blur.py when cv2.cvtColor tries to convert the image from BGR; it seems the image has just one channel.

error: Caught error in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/fmobrj/anaconda3/envs/vitstr/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/fmobrj/anaconda3/envs/vitstr/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/media/hdd6tb/jupyter/notebooks/vitstr/deep-text-recognition-benchmark/dataset.py", line 500, in __call__
    image_tensors = [transform(image) for image in images]
  File "/media/hdd6tb/jupyter/notebooks/vitstr/deep-text-recognition-benchmark/dataset.py", line 500, in <listcomp>
    image_tensors = [transform(image) for image in images]
  File "/media/hdd6tb/jupyter/notebooks/vitstr/deep-text-recognition-benchmark/dataset.py", line 336, in __call__
    img = self.rand_aug(img)
  File "/media/hdd6tb/jupyter/notebooks/vitstr/deep-text-recognition-benchmark/dataset.py", line 357, in rand_aug
    img = op(img, mag=mag)
  File "/media/hdd6tb/jupyter/notebooks/vitstr/deep-text-recognition-benchmark/augmentation/blur.py", line 104, in __call__
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
cv2.error: OpenCV(3.4.18) /io/opencv/modules/imgproc/src/color.simd_helpers.hpp:88: error: (-2:Unspecified error) in function 'cv::impl::{anonymous}::CvtHelper<VScn, VDcn, VDepth, sizePolicy>::CvtHelper(cv::InputArray, cv::OutputArray, int) [with VScn = cv::impl::{anonymous}::Set<3, 4>; VDcn = cv::impl::{anonymous}::Set<3, 4>; VDepth = cv::impl::{anonymous}::Set<0, 2, 5>; cv::impl::{anonymous}::SizePolicy sizePolicy = cv::impl::<unnamed>::NONE; cv::InputArray = const cv::_InputArray&; cv::OutputArray = const cv::_OutputArray&]'
> Invalid number of channels in input image:
>     'VScn::contains(scn)'
> where
>     'scn' is 1
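The OpenCV converter only accepts 3- or 4-channel input, so a single-channel (grayscale) array trips it. A small defensive guard of the kind that could be placed before the conversion in blur.py is sketched below; this is a hypothetical patch, not the repository's own fix.

import cv2
import numpy as np

def to_rgb(img: np.ndarray) -> np.ndarray:
    """Return a 3-channel image regardless of the input's channel count."""
    if img.ndim == 2 or img.shape[-1] == 1:
        # Grayscale input: replicate it to 3 channels instead of calling BGR2RGB.
        return cv2.cvtColor(img, cv2.COLOR_GRAY2RGB)
    return cv2.cvtColor(img, cv2.COLOR_BGR2RGB)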

When I run train.sh: line 6: --SequenceModeling: command not found

python3.6/site-packages/wand/api.py", line 151, in <module>
libraries = load_library()
python3.6/site-packages/wand/api.py", line 140, in load_library
raise IOError('cannot find library; tried paths: ' + repr(tried_paths))
OSError: cannot find library; tried paths: []

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train.py", line 17, in <module>
from dataset import hierarchical_dataset, AlignCollate, Batch_Balanced_Dataset
python3.6/site-packages/wand/image.py", line 18, in <module>
from . import assertions
python3.6/site-packages/wand/assertions.py", line 155, in <module>
from .color import Color # noqa: E402
python3.6/site-packages/wand/color.py", line 10, in <module>
from .api import library
python3.6/site-packages/wand/api.py", line 177, in <module>
'Try to install:\n ' + msg)
ImportError: MagickWand shared library not found.
You probably had not installed ImageMagick library.
Try to install:
apt-get install libmagickwand-dev
train.sh: line 6: --SequenceModeling: command not found

I have a question

Can this code handle a different input size with your pretrained model? I see that you use the pretrained DeiT model and resize each image to 224 x 224. Can I set imgH and imgW to other values and still use the pretrained model?

Train loss is 0.0000 at every iteration

I am training the ViTSTR-tiny model on my custom text dataset and it gives the same training loss, 0.0000, at every iteration. I have not changed any parameters of the model. What might be the cause?

A question about the [GO] token

criterion = torch.nn.CrossEntropyLoss(ignore_index=0).to(device) # ignore [GO] token = ignore index 0

Why do you ignore the [GO] token when setting up the loss?

Thank you
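For intuition, ignore_index=0 simply masks the [GO]-padded positions out of the loss, so the model is never rewarded or penalized for predicting padding; a tiny illustration with made-up logits and labels:

import torch

criterion = torch.nn.CrossEntropyLoss(ignore_index=0)

# Made-up logits for 4 positions over 5 classes; class 0 is the [GO] pad token.
logits = torch.randn(4, 5)
target = torch.tensor([3, 1, 0, 0])  # last two positions are padding

# Only the first two positions contribute; the padded positions are skipped.
loss = criterion(logits, target)
print(loss)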

Is the network suited for long-text recognition?

Thanks for your work!
I read your paper and noticed that input images are resized to 224 x 224. In the case of long text lines, does this affect accuracy?
Looking forward to your reply!

Demo.py

I trained my model for the Japanese language. The validation accuracy goes up to 99.9%, but it fails badly when testing on real test images. Could you share a demo.py file so that I can check whether I am doing everything correctly?
Thank you

Quantization

Hello
Would you please show an example of quantizing a model? Thank you.
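If the question is about post-training quantization, a minimal dynamic-quantization sketch over the linear layers (which make up most of a ViT) is below; whether this preserves ViTSTR's accuracy is not documented in the repository, so treat it as a starting point only.

import torch

# Assumed: `model` is a loaded ViTSTR model in eval mode on the CPU.
quantized = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},   # quantize only the Linear layers
    dtype=torch.qint8,
)

torch.save(quantized.state_dict(), "vitstr_dynamic_int8.pth")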

Poor performance on some images

Thank you for the awesome research!

I ran the code on the demo images and it worked perfectly, but when I run it on a few of my own sample images, the model's output seems incoherent.

It would be great if you could answer a few of my questions:

  1. Does the model perform end-to-end STR, or does it require a cropped image (produced, for example, by an EAST or TextFuseNet text detector)? Example: the 1st and 2nd images below, where the 1st is a cropped version of the 2nd; the same applies to the 5th and 6th images.
  2. Does the model perform multi-line text recognition?
  3. The paper "Why You Should Try the Real Data for the Scene Text Recognition" mentions in Section 4.7 a possible improvement to this work using the OpenImages v5 dataset; have you tried this?

Examples:

I used vitstr_base_patch16_224_aug.pth model for prediction.

Image     Prediction
test6     middleborough
test6_1   midleerooogg
test4     qatm
img_11    aoe
test2     castlecampbell
test1     coaeeea

train error

CUDA_VISIBLE_DEVICES=0 python train.py --train_data mydata/mytrain --valid_data mydata/mytrain --select_data / --batch_ratio 1 --Transformation None --FeatureExtraction None --SequenceModeling None --Prediction None --Transformer --TransformerModel=vitstr_tiny_patch16_224 --imgH 224 --imgW 224 --manualSeed=$RANDOM --sensitive

Traceback (most recent call last):
File "train.py", line 310, in <module>
train(opt)
File "train.py", line 72, in train
model = Model(opt)
File "/media/passwd123/faba01fd-198e-4aa7-853f-bf64370f708c/home/passwd123/text_recognition/VITSTR/model.py", line 47, in __init__
self.vitstr = create_vitstr(num_tokens=opt.num_class, model=opt.TransformerModel)
File "/media/passwd123/faba01fd-198e-4aa7-853f-bf64370f708c/home/passwd123/text_recognition/VITSTR/modules/vitstr.py", line 42, in create_vitstr
checkpoint_path=checkpoint_path)
File "/home/passwd123/anaconda3/envs/pytorch_zls/lib/python3.7/site-packages/timm/models/factory.py", line 71, in create_model
model = create_fn(pretrained=pretrained, pretrained_cfg=pretrained_cfg, **kwargs)
File "/media/passwd123/faba01fd-198e-4aa7-853f-bf64370f708c/home/passwd123/text_recognition/VITSTR/modules/vitstr.py", line 159, in vitstr_tiny_patch16_224
patch_size=16, embed_dim=192, depth=12, num_heads=3, mlp_ratio=4, qkv_bias=True, **kwargs)
File "/media/passwd123/faba01fd-198e-4aa7-853f-bf64370f708c/home/passwd123/text_recognition/VITSTR/modules/vitstr.py", line 55, in __init__
super().__init__(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'pretrained_cfg'
