royalvice / docdiff

ACM Multimedia 2023: DocDiff: Document Enhancement via Residual Diffusion Models. Also contains 1597 red seals in Chinese scenes, along with their corresponding binary masks.

Home Page: https://www.aibupt.com/

License: MIT License

Python 100.00%

deblurring diffusion-models document-binarization image-translation img2img image-to-image low-level-vision super-resolution documentation-tool ocr

docdiff's Introduction


DocDiff

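This is the official repository for the paper DocDiff: Document Enhancement via Residual Diffusion Models. DocDiff is a document enhancement model (see the paper for details) that can be used for tasks such as document deblurring, document denoising, document binarization, and removal of watermarks and seals from documents. DocDiff is a lightweight diffusion model based on residual prediction; training at 128*128 resolution with Batchsize=64 requires only 12GB of GPU memory. Beyond document enhancement, DocDiff can also be applied to other img2img tasks, including low-level tasks such as natural-scene deblurring¹, denoising, deraining, super-resolution², and image inpainting, as well as high-level tasks such as semantic segmentation⁴.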

News

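Pinned: Introducing a multi-functional, multi-platform OCR application developed by our lab. It includes DocDiff's automatic watermark and seal removal (automatic watermark removal is coming soon), along with common OCR features such as PDF to Word, PDF to Excel, formula recognition, and table recognition. You are welcome to try it!
2023.09.14: Uploaded the watermark synthesis code utils/marker.py and the seal dataset. Seal dataset: Google Drive
2023.08.02: Document binarization results for H-DIBCO 2018 ⁶ and DIBCO 2019 ⁷ have been uploaded. Google Drive
2023.08.01: Congratulations! DocDiff has been accepted by ACM Multimedia 2023!
2023.06.13: For easier reproduction, the inference notebook demo/inference.ipynb and the pretrained models in checksave/ have been uploaded.
2023.05.08: The initial version of the code has been uploaded. See To-do lists for upcoming updates.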

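Usage Guide

For both training and inference, you only need to modify the configuration parameters in conf.yml and then run main.py. MODE=1 is training and MODE=0 is inference. Every parameter in conf.yml has a detailed comment, and you can adjust the parameters accordingly. The pretrained weights for document deblurring are in checksave/. Note that the default parameters in conf.yml perform best on document scenes. If you want to apply DocDiff to natural scenes, please read Notes!!! first. If you still have problems, feel free to open an issue.

Because the network downsamples three times, the resolution of the input image must be a multiple of 8. If your image's dimensions are not multiples of 8, use padding or cropping to adjust them. Do not resize the image directly, because that distorts it. In deblurring especially, distortion increases the amount of blur and the results become much worse. For example, the document deblurring dataset⁵ used by DocDiff has a resolution of 300*300, so images must be padded to 304*304 before inference.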
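
The padding rule above can be sketched in a few lines. This is only a minimal illustration, not code from the repository: the helper name pad_amounts and the choice to split the padding evenly between the two sides are assumptions.

```python
def pad_amounts(height, width, multiple=8):
    """Return (top, bottom, left, right) padding that rounds each side up to a multiple."""
    pad_h = (-height) % multiple   # rows missing to reach the next multiple
    pad_w = (-width) % multiple    # columns missing to reach the next multiple
    top, left = pad_h // 2, pad_w // 2
    return top, pad_h - top, left, pad_w - left


top, bottom, left, right = pad_amounts(300, 300)
print(300 + top + bottom, 300 + left + right)  # 304 304
```

The four values can then be handed to whatever padding routine you use. Check the argument order it expects: torch.nn.functional.pad, for instance, takes (left, right, top, bottom) for the last two dimensions.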

Environment

python >= 3.7
pytorch >= 1.7.0
torchvision >= 0.8.0

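Watermark Synthesis and Seal Dataset

We provide the watermark synthesis code utils/marker.py and the seal dataset (Seal dataset: Google Drive). Because the document background images we used are internal data, we do not provide them. To use the watermark synthesis code, you need to find some document background images yourself. The watermark synthesis code is implemented with OpenCV, so you need to install OpenCV.

Seal Dataset

The seal dataset is part of the DocDiff project. It contains 1597 red seals from Chinese scenes together with their corresponding binary masks, and can be used for tasks such as seal synthesis and seal removal. Because of limited manpower, and because extracting seals from document images is extremely difficult, some seal images contain noise. Most of the original seal images come from the ICDAR 2023 Competition on Reading the Seal Title (https://rrc.cvc.uab.es/?ch=20) dataset, and a small portion come from our own internal images. If you find this dataset helpful, please give our project a free star. Thank you!!!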

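Notes!!!

DocDiff's default configuration parameters and its training and inference strategies are designed for document images. To get better results on natural scenes, you need to adjust the parameters, for example by enlarging the model or adding Self-Attention (document images have relatively fixed patterns, whereas natural scenes have more diverse patterns and need more parameters), and modify the training and inference strategies.
Training strategy: As described in the paper, document scenes do not call for generative diversity, and we want inference to be as short as possible, so we set the number of diffusion steps T to 100 and predict $x_0$ rather than $\epsilon$. Given that the condition (the output of the Coarse Predictor) is introduced by channel-wise concatenation, this strategy allows a good $x_0$ to be recovered within the first few steps of reverse diffusion. In natural scenes, to reconstruct textures better and to pursue generative diversity, T should be as large as possible and the model should predict $\epsilon$. Set PRE_ORI="False" in conf.yml to predict $\epsilon$, and set TIMESTEPS=1000 in conf.yml to use more diffusion steps.
Inference strategy: In document scenes the generated image should not contain randomness (stochastic sampling with few steps distorts text edges), so DocDiff performs the deterministic sampling of DDIM³. In natural scenes, stochastic sampling is the key to generative diversity; setting PRE_ORI="False" in conf.yml enables stochastic sampling. In other words, predicting $\epsilon$ is tied to stochastic sampling, and predicting $x_0$ is tied to deterministic sampling. If you want to predict $x_0$ with stochastic sampling, or predict $\epsilon$ with deterministic sampling, you need to modify the code yourself. In DocDiff, deterministic sampling is the deterministic sampling of DDIM and stochastic sampling is the stochastic sampling of DDPM; you can modify the code to implement other sampling strategies.
Summary: For tasks that do not need generative diversity, such as semantic segmentation and document enhancement, predict $x_0$ and set T to 100, which already gives good results. For tasks that need generative diversity, such as natural-scene deblurring, super-resolution, and image inpainting, predict $\epsilon$ and set T to 1000.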
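
The two prediction targets discussed above are related through the standard DDPM forward process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, so either one can be recovered from the other once $x_t$ is known. The sketch below illustrates this on scalars. It is not taken from the DocDiff code, and the linear schedule endpoints (1e-4 and 0.02) and all function names are assumptions chosen for illustration.

```python
import math


def linear_alpha_bar(timesteps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed endpoints)."""
    alpha_bar, running = [], 1.0
    for i in range(timesteps):
        beta = beta_start + (beta_end - beta_start) * i / (timesteps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def q_sample(x0, eps, a_bar):
    """Forward process: mix the clean value x0 with noise eps."""
    return math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps


def x0_from_eps(x_t, eps, a_bar):
    """Recover x0 when the network predicted eps."""
    return (x_t - math.sqrt(1.0 - a_bar) * eps) / math.sqrt(a_bar)


def eps_from_x0(x_t, x0, a_bar):
    """Recover eps when the network predicted x0."""
    return (x_t - math.sqrt(a_bar) * x0) / math.sqrt(1.0 - a_bar)


a_bar = linear_alpha_bar(100)[50]      # T = 100, an intermediate step
x_t = q_sample(0.3, -1.2, a_bar)
print(round(x0_from_eps(x_t, -1.2, a_bar), 6))  # 0.3
print(round(eps_from_x0(x_t, 0.3, a_bar), 6))   # -1.2
```

The same conversion is what makes it possible to display a clean estimate when the network outputs $\epsilon$: the predicted noise has to be turned back into $x_0$ before it is viewed as an image.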

To-do lists

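Thanks

If you find DocDiff helpful, please give it a star. Thank you! 🤞😘
If you have any questions, feel free to open an issue and I will reply as soon as possible.
If you would like to get in touch, feel free to email me at [email protected] with the note: DocDiff.
If you would like to use DocDiff as a baseline for your project, please cite our paper.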

@inproceedings{yang2023docdiff,
  title={DocDiff: Document Enhancement via Residual Diffusion Models},
  author={Yang, Zongyuan and Liu, Baolin and Xiong, Yongping and Yi, Lan and Wu, Guibin and Tang, Xiaojun and Liu, Ziqi and Zhou, Junjie and Zhang, Xing},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={2795--2806},
  year={2023}
}

References

[1] Whang J, Delbracio M, Talebi H, et al. Deblurring via stochastic refinement[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 16293-16303.

[2] Shang S, Shan Z, Liu G, et al. ResDiff: Combining CNN and Diffusion Model for Image Super-Resolution[J]. arXiv preprint arXiv:2303.08714, 2023.

[3] Song J, Meng C, Ermon S. Denoising diffusion implicit models[J]. arXiv preprint arXiv:2010.02502, 2020.

[4] Wu J, Fang H, Zhang Y, et al. MedSegDiff: Medical Image Segmentation with Diffusion Probabilistic Model[J]. arXiv preprint arXiv:2211.00611, 2022.

[5] Michal Hradiš, Jan Kotera, Pavel Zemčík and Filip Šroubek. Convolutional Neural Networks for Direct Text Deblurring. In Xianghua Xie, Mark W. Jones, and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 6.1-6.13. BMVA Press, September 2015.

[6] I. Pratikakis, K. Zagoris, P. Kaddas and B. Gatos, "ICFHR 2018 Competition on Handwritten Document Image Binarization (H-DIBCO 2018)," 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), Niagara Falls, NY, USA, 2018, pp. 489-493, doi: 10.1109/ICFHR-2018.2018.00091.

[7] I. Pratikakis, K. Zagoris, X. Karagiannis, L. Tsochatzidis, T. Mondal and I. Marthot-Santaniello, "ICDAR 2019 Competition on Document Image Binarization (DIBCO 2019)," 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, NSW, Australia, 2019, pp. 1547-1556, doi: 10.1109/ICDAR.2019.00249.

docdiff's Issues

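Question about gradient backpropagation

Hello, I have a question I would like to ask you.
In the forward method of the DocDiff class there is x__ = self.denoiser(torch.cat((noisy_image, x_.clone().detach()), dim=1), t). From the code, my current understanding is that noisy_image still carries information from x_, so the diffusion loss is still backpropagated to the first Unet through the final prediction, yet the later x_ is detached. Is this a deliberate design? Thank you!

About the pretrained models

Are the released pretrained models only for the deblurring task? Will pretrained models for the other subtasks be released?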

RuntimeError: Sizes of tensors

I tried to test your model and run inference in Colab, but I got an error:
File "/content/DocDiff/model/DocDiff.py", line 315, in forward
    x = torch.cat((x, s), dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 440 but got size 439 for tensor number 1 in the list.

 315        print(x.shape, s.shape)
 316        x = torch.cat((x, s), dim=1)
------------------------------------------------------------------------------
torch.Size([1, 128, 220, 160]) torch.Size([1, 128, 220, 160])
torch.Size([1, 128, 220, 160]) torch.Size([1, 96, 220, 160])
torch.Size([1, 128, 440, 320]) torch.Size([1, 96, 439, 319])

My conf.yml

# model
IMAGE_SIZE : [128, 128]   # load image size, if it's train mode, it will be randomly cropped to IMAGE_SIZE. If it's test mode, it will be resized to IMAGE_SIZE.
CHANNEL_X : 3             # input channel
CHANNEL_Y : 3             # output channel
TIMESTEPS : 100           # diffusion steps
SCHEDULE : 'linear'       # linear or cosine
MODEL_CHANNELS : 32       # basic channels of Unet
NUM_RESBLOCKS : 1         # number of residual blocks
CHANNEL_MULT : [1,2,3,4]  # channel multiplier of each layer
NUM_HEADS : 1

MODE : 0                  # 1 Train, 0 Test
PRE_ORI : 'True'          # if True, predict $x_0$, else predict $\epsilon$.

# test
NATIVE_RESOLUTION : 'False'               # if True, test with native resolution
DPM_SOLVER : 'False'      # if True, test with DPM_solver
DPM_STEP : 20             # DPM_solver step
BATCH_SIZE_VAL : 1        # test batch size
TEST_PATH_GT : '/content/drive/MyDrive/wight/data/'         # path of ground truth
TEST_PATH_IMG : '/content/drive/MyDrive/wight/data/'        # path of input
TEST_INITIAL_PREDICTOR_WEIGHT_PATH : '/content/drive/MyDrive/wight/init_predictor_document_deblurring.pth'   # path of initial predictor
TEST_DENOISER_WEIGHT_PATH : '/content/drive/MyDrive/wight/denoiser_document_deblurring.pth'            # path of denoiser
TEST_IMG_SAVE_PATH : './results'

Input image has size [1654, 2339, 3]

Could you help me solve the problem?
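
The shapes printed above are consistent with a U-Net whose skip connections stop lining up when a side length is not a multiple of 8. The sketch below reproduces the 440 vs 439 and 320 vs 319 mismatches under two assumptions that the source does not confirm: each downsampling step maps a side of length n to ceil(n / 2), and each upsampling step doubles it. The function name is hypothetical.

```python
def first_skip_mismatch(size, levels=3):
    """Return (upsampled, skip) at the first decoder level where sizes differ, else None."""
    skips = [size]
    for _ in range(levels):
        skips.append((skips[-1] + 1) // 2)   # assumed: downsampling gives ceil(n / 2)
    current = skips.pop()                    # bottleneck size
    while skips:
        current *= 2                         # assumed: upsampling doubles the size
        skip = skips.pop()
        if current != skip:
            return current, skip
    return None


print(first_skip_mismatch(439))  # (440, 439)
print(first_skip_mismatch(319))  # (320, 319)
print(first_skip_mismatch(304))  # None
```

Under these assumptions, a side length that is a multiple of 8 divides evenly at every level, which matches the README's requirement that inputs be padded or cropped to a multiple of 8.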

About config.PRE_ORI

Hi Yang, thank you for your kind words about DocDiff! As mentioned in the readme under "Notes!", we have been working on applying DocDiff to natural scenes with pattern diversity. We modified conf.yml by setting PRE_ORI: 'False' and TIMESTEPS: 1000. However, we encountered some problems.

In the trainer.py file of the DocDiff code, specifically lines 189 to 194, we have the following code snippet:

if self.pre_ori == 'True':
    if self.high_low_freq == 'True':
        residual_high = self.high_filter(gt.to(self.device) - init_predict)
        ddpm_loss = 2*self.loss(self.high_filter(noise_pred), residual_high) + self.loss(noise_pred, gt.to(self.device) - init_predict)
    else:
        ddpm_loss = self.loss(noise_pred, gt.to(self.device) - init_predict)
else:
    ddpm_loss = self.loss(noise_pred, noise_ref.to(self.device))

When self.pre_ori is set to 'False', the ddpm_loss causes noise_pred to learn noise_ref. However, noise_ref represents the added noise. During the training stage, the visualization of 'noise_pred.cpu() + init_predict.cpu()' will result in a noisy init_prediction!

It seems that the issue lies in the visualization step, where the noisy init_prediction is being displayed.

Hello, the redirect_uri parameter of the 火眼OCR app is wrong

Inference time

Can you please add inference time information for different configurations and video cards?

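Inference time?

Have you measured the inference time, and have you compared it with GANs?

Question about training time

I am training with the conf.yml you provided on a single V100 (the multi-GPU code does not seem to have been uploaded yet). The current speed is about 8 hours per 10000 iterations. Is this speed normal? At this rate, would finishing 1000000 iterations take several dozen days?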

# model
IMAGE_SIZE : [304, 304]   # load image size, if it's train mode, it will be randomly cropped to IMAGE_SIZE. If it's test mode, it will be resized to IMAGE_SIZE.
CHANNEL_X : 3             # input channel
CHANNEL_Y : 3             # output channel
TIMESTEPS : 100           # diffusion steps
SCHEDULE : 'linear'       # linear or cosine
MODEL_CHANNELS : 32       # basic channels of Unet
NUM_RESBLOCKS : 1         # number of residual blocks
CHANNEL_MULT : [1,2,3,4]  # channel multiplier of each layer
NUM_HEADS : 1

MODE : 1                  # 1 Train, 0 Test
PRE_ORI : 'True'          # if True, predict $x_0$, else predict $\epsilon$.


# train
PATH_GT :        # path of ground truth
PATH_IMG :     # path of input
BATCH_SIZE : 32           # training batch size
NUM_WORKERS : 8           # number of workers
ITERATION_MAX : 1000000   # max training iteration
LR : 0.0001               # learning rate
LOSS : 'L2'               # L1 or L2
EMA_EVERY : 100           # update EMA every EMA_EVERY iterations
START_EMA : 2000          # start EMA after START_EMA iterations
SAVE_MODEL_EVERY : 10000  # save model every SAVE_MODEL_EVERY iterations
EMA: 'True'               # if True, use EMA
CONTINUE_TRAINING : 'False'               # if True, continue training
CONTINUE_TRAINING_STEPS : 10000           # continue training from CONTINUE_TRAINING_STEPS
PRETRAINED_PATH_INITIAL_PREDICTOR : '/home/lcw/DocDiff/checksave/init.pth'    # path of pretrained initial predictor
PRETRAINED_PATH_DENOISER : '/home/lcw/DocDiff/checksave/denoiser.pth'             # path of pretrained denoiser
WEIGHT_SAVE_PATH : './checksave_LCW'          # path to save model
TRAINING_PATH : './Training'              # path of training data
BETA_LOSS : 50            # hyperparameter to balance the pixel loss and the diffusion loss
HIGH_LOW_FREQ : 'True'    # if True, training with frequency separation
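
The estimate in the training-time question above follows from simple arithmetic, assuming the reported rate of 8 hours per 10000 iterations stays constant for the whole run. The function name is only illustrative.

```python
def estimated_days(total_iterations, iterations_per_chunk, hours_per_chunk):
    """Extrapolate total training time in days from a measured rate."""
    hours = total_iterations / iterations_per_chunk * hours_per_chunk
    return hours / 24


print(round(estimated_days(1_000_000, 10_000, 8), 1))  # 33.3
```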

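About the input image size

Must images be 304*304 at inference time? If I have a 1024*768 image, do I need to cut it into several 304*304 patches, run them through the model, and then stitch them back together? If I want to train on my own dataset, do I first need to cut all the images into 304*304 patches?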
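
As an illustration only, and not an answer from the authors: the usage guide asks for side lengths that are multiples of 8, which 1024 and 768 already are. If tiling is still wanted, for example to limit memory, patch offsets could be computed as sketched below. The 304 tile size, the overlapping last tile, and the function name are all assumptions.

```python
def tile_origins(length, tile=304):
    """Start offsets of tiles covering [0, length); the last tile is shifted back to fit."""
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile, tile))
    origins.append(length - tile)   # final tile overlaps its neighbour instead of running past the edge
    return origins


print(1024 % 8, 768 % 8)    # 0 0
print(tile_origins(1024))   # [0, 304, 608, 720]
print(tile_origins(768))    # [0, 304, 464]
```

How overlapping regions are merged when stitching (averaging, or keeping one tile's pixels) is a separate design choice that this sketch leaves open.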

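Training set size and learning rate

Hello, I would like to ask roughly how large the dataset was when training for seal removal. Also, is the learning rate related to the batch size? If I reduce the batch size, do I need to scale the learning rate manually in conf.yml? Thank you!

Image denoising problem

Hello, I ran denoising on this image and the result seems even blurrier than before.
Original image:

Result: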

Model deployment issue

How can we convert these models to ONNX format and deploy them? Could you please provide code for ONNX inference?

Inference on variable image size

Hi,
First off, thanks for sharing your work. My question is: does DocDiff work on different document image sizes, e.g. an image of an entire document page, or does it only work on small square patches?

Thanks!

SYNTHETIC DATASETS

Thanks a lot for your work! Will you release the synthetic datasets you used in the paper?

Questions about data

Hi, while training the model on my own dataset I ran into a question.
I wonder what the difference is between the input and the training data.
As I understand it, the input is a blurred image and the ground truth is a sharp one.
So what is the training data? Is it the same as the ground truth?

Code for the synthetic dataset

Hi,
I wonder when you will release the synthesis code for the watermark masks and the mask images for the seals?

Thanks!

Inference Issue

This error occurs for every image during the inference phase of this model:

File "/home/mepluser1/rahul_hanot/try_new/DocDiff/model/DocDiff.py", line 315, in forward
x = torch.cat((x, s), dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 76 but got size 75 for tensor number 1 in the list.

watermark removal

Hi,
I tested the denoiser model on a watermarked document image(using the demo notebook) and the results show the watermark is not removed:
Left to right: img, init_predict.cpu(), min_max(sampledImgs), and finalImgs

Native:

Non-native:

Is there a different model for watermark removal task?