monniert / docextractor
(ICFHR 2020 oral) Code for the "docExtractor: An off-the-shelf historical document element extraction" paper
Home Page: https://www.tmonnier.com/docExtractor
License: MIT License
A good suggestion would be an option to directly input a vgg.json file for training from scratch or fine-tuning,
perhaps even an option to input multiple vgg.json files at once:
--input_json ./input/*.json
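A rough sketch of how multiple-file input could work (the --input_json flag and the merge behavior are hypothetical suggestions, not part of docExtractor):

```python
import json
from glob import glob

def merge_via_files(paths):
    """Merge several VIA project files into one annotation dict.

    VIA keys each image as "<filename><filesize>", so a plain dict update
    works as long as the files do not annotate the same image twice.
    """
    merged = {}
    for path in paths:
        with open(path) as f:
            data = json.load(f)
        # some VIA exports wrap the annotations in "_via_img_metadata"
        merged.update(data.get("_via_img_metadata", data))
    return merged

# the hypothetical flag --input_json ./input/*.json could then resolve to:
# annotations = merge_via_files(glob("./input/*.json"))
```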
This project seems interesting, and in order to ensure the continuity of its development and improvement I would suggest adding a FUNDING.yml
so that you can accept donations.
My main interest in this project:
Do these suggestions align with the goal of this project?
Do you accept donations?
(docExtractor) home@home-lnx:~/programs/docExtractor$ python src/via_converter.py --input_dir input/ --output_dir output/ --file ./test.json
Traceback (most recent call last):
File "src/via_converter.py", line 80, in <module>
conv.run()
File "src/via_converter.py", line 36, in run
img = self.convert(annot)
File "src/via_converter.py", line 41, in convert
name = annot['filename']
KeyError: 'filename'
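For what it's worth, this KeyError typically appears when the top-level JSON entries are not VIA image records -- for example when a full VIA project save (which wraps its annotations in "_via_img_metadata" alongside "_via_settings") is passed instead of a plain annotation export. A defensive loader could look like this (a sketch assuming the VIA 2.x JSON layout; not part of via_converter.py):

```python
import json

def load_via_annotations(path):
    """Return only the entries that look like VIA image records.

    A VIA project save wraps annotations in "_via_img_metadata", while a
    plain annotation export is the dict itself; via_converter.py expects
    the latter, hence the KeyError on 'filename' given a project file.
    """
    with open(path) as f:
        data = json.load(f)
    records = data.get("_via_img_metadata", data)
    # drop anything without a 'filename' key (settings, attributes, ...)
    return {k: v for k, v in records.items()
            if isinstance(v, dict) and "filename" in v}
```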
@monniert The error:
(docExtractor) home@home-lnx:~/programs/docExtractor$ python src/syndoc_generator.py -d testing -n 100 --merged_labels
[2020-11-29 00:49:37] Creating train set...
[2020-11-29 00:49:37] Generating random document with seed 0...
Traceback (most recent call last):
File "src/syndoc_generator.py", line 62, in <module>
gen.run(args.nb_train)
File "src/syndoc_generator.py", line 46, in run
d = SyntheticDocument(**kwargs)
File "/home/home/programs/docExtractor/src/utils/__init__.py", line 74, in wrapper
return f(*args, **kw)
File "/home/home/programs/docExtractor/src/synthetic/document.py", line 126, in __init__
self.elements, self.positions = self._generate_random_layout()
File "/home/home/programs/docExtractor/src/utils/__init__.py", line 74, in wrapper
return f(*args, **kw)
File "/home/home/programs/docExtractor/src/synthetic/document.py", line 238, in _generate_random_layout
element = choice(self.available_elements, p=weights)(width, height, **element_kwargs)
File "/home/home/programs/docExtractor/src/synthetic/element.py", line 151, in __init__
self.generate_content(seed=seed)
File "/home/home/programs/docExtractor/src/utils/__init__.py", line 74, in wrapper
return f(*args, **kw)
File "/home/home/programs/docExtractor/src/synthetic/element.py", line 597, in generate_content
self.text, content_width, content_height = self.format_text(text)
File "/home/home/programs/docExtractor/src/synthetic/element.py", line 624, in format_text
text = google(text, src='en', dst='ar')
File "/home/home/anaconda3/envs/docExtractor/lib/python3.6/site-packages/translation/__init__.py", line 19, in google
dst = dst, proxies = proxies)
File "/home/home/anaconda3/envs/docExtractor/lib/python3.6/site-packages/translation/main.py", line 33, in get
if r == '': raise TranslateError('No translation get, you may retry')
translation.exception.TranslateError: No translation get, you may retry
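The failure comes from the third-party translation package, which raises TranslateError whenever the service returns an empty response. A small retry wrapper could keep one flaky request from aborting a whole generation run (a sketch; the fallback-to-source-text behavior is my own choice, not docExtractor's):

```python
import time

def translate_with_retry(text, src="en", dst="ar", tries=3, delay=2.0,
                         translate=None):
    """Retry a flaky translation call; fall back to the source text.

    `translate` defaults to translation.google, which raises TranslateError
    on an empty response from the service.
    """
    if translate is None:
        from translation import google as translate  # optional dependency
    for attempt in range(tries):
        try:
            return translate(text, src=src, dst=dst)
        except Exception:
            if attempt == tries - 1:
                return text  # last resort: keep the untranslated text
            time.sleep(delay * (attempt + 1))  # simple linear backoff
```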
Listing the folder's tree:
(docExtractor) home@home-lnx:~/programs/docExtractor$ tree -d
.
├── configs
├── demo
├── models
│ └── default
├── raw_data
│ └── illuhisdoc
│ ├── msd
│ ├── msi
│ ├── mss
│ ├── p
│ └── via_json
├── scripts
├── src
│ ├── datasets
│ ├── loss
│ ├── models
│ ├── optimizers
│ ├── schedulers
│ ├── synthetic
│ │ └── __pycache__
│ └── utils
│ └── __pycache__
└── synthetic_resource
├── background
│ ├── 0
│ ├── 10
│ ├── 100
│ ├── 110
│ ├── 120
│ ├── 20
│ ├── 30
│ ├── 40
│ ├── 50
│ ├── 60
│ ├── 70
│ ├── 80
│ └── 90
├── context_background
├── drawing
├── drawing_background
├── font
│ ├── arabic
│ │ ├── Amiri
│ │ ├── Arial
│ │ ├── Cairo
│ │ ├── dejavu_dejavu-sans
│ │ ├── El_Messiri
│ │ ├── gnu-freefont_freeserif
│ │ └── st-gigafont-typefaces_code2003
│ ├── chinese
│ │ ├── Liu_Jian_Mao_Cao
│ │ ├── Long_Cang
│ │ ├── Ma_Shan_Zheng
│ │ ├── Noto_Sans_SC
│ │ ├── Noto_Serif_SC
│ │ ├── ZCOOL_KuaiLe
│ │ ├── ZCOOL_QingKe_HuangYou
│ │ ├── ZCOOL_XiaoWei
│ │ └── Zhi_Mang_Xing
│ ├── foreign_like
│ │ ├── alhambra
│ │ ├── barmee_afarat-ibn-blady
│ │ ├── bizancia
│ │ ├── catharsis_bedouin
│ │ ├── catharsis_catharsis-bedouin
│ │ ├── k22_timbuctu
│ │ ├── kingthings_conundrum
│ │ ├── meifen
│ │ ├── ming_imperial
│ │ ├── running_smobble
│ │ ├── samarkan
│ │ ├── selamet_lebaran
│ │ ├── uddi-uddi_running-smobble
│ │ ├── yozakura
│ │ └── zilap_oriental
│ ├── handwritten
│ │ ├── Alako
│ │ ├── Angelina
│ │ ├── anke-print
│ │ ├── atlandsketches-bb
│ │ ├── bathilda
│ │ ├── BlackJack_Regular
│ │ ├── blzee
│ │ ├── bromello
│ │ ├── calligravity
│ │ ├── Carefree
│ │ ├── conformity
│ │ ├── Cursive_standard
│ │ ├── Damion
│ │ ├── Elegant
│ │ ├── emizfont
│ │ ├── hoffmanhand
│ │ ├── honey_script
│ │ ├── hurryup
│ │ ├── irezumi
│ │ ├── james-tan-dinawanao
│ │ ├── JaneAusten
│ │ ├── Jellyka_-_Love_and_Passion
│ │ ├── jr-hand
│ │ ├── Juergen
│ │ ├── khand
│ │ ├── kosal-says-hy
│ │ ├── Learning_Curve
│ │ ├── Learning_Curve_Pro
│ │ ├── maddison_signature
│ │ ├── may-queen
│ │ ├── mistis-fonts_october-twilight
│ │ ├── mistis-fonts_stylish-calligraphy-demo
│ │ ├── mistis-fonts_watermelon-script-demo
│ │ ├── Monika
│ │ ├── mumsies
│ │ ├── nymphont_xiomara
│ │ ├── otto
│ │ │ └── Otto
│ │ ├── Pacifico
│ │ ├── paul_signature
│ │ ├── pecita
│ │ ├── popsies
│ │ ├── quigleywiggly
│ │ ├── rabiohead
│ │ ├── roddy
│ │ ├── Saginaw
│ │ ├── Saginaw 2
│ │ ├── santos-dumont
│ │ ├── scribble
│ │ ├── scriptina
│ │ ├── sf-burlington-script
│ │ │ └── TrueType
│ │ ├── sf-foxboro-script
│ │ │ └── TrueType
│ │ ├── shadows-into-light
│ │ ├── shartoll-light
│ │ ├── shelter-me
│ │ │ └── kimberly-geswein_shelter-me
│ │ ├── shorelines_script
│ │ ├── signerica
│ │ ├── sild
│ │ ├── silent-fighter
│ │ │ └── Silent Fighter
│ │ ├── sillii_willinn
│ │ ├── silverline-script-demo
│ │ ├── simple-signature
│ │ ├── snake
│ │ │ └── Snake
│ │ ├── somes-style
│ │ ├── sophia-bella-demo
│ │ ├── spitter
│ │ │ └── Spitter
│ │ ├── stalemate
│ │ ├── standard-pilot-demo
│ │ │ └── standard pilot demo
│ │ ├── stingray
│ │ │ └── Stingray
│ │ ├── stylish-marker
│ │ ├── Sudestada
│ │ ├── sunshine-in-my-soul
│ │ │ └── kimberly-geswein_sunshine-in-my-soul
│ │ ├── sweet-lady
│ │ ├── Tabitha
│ │ ├── the-girl-next-door
│ │ ├── the-great-escape
│ │ │ └── kimberly-geswein_the-great-escape
│ │ ├── the-illusion-of-beauty
│ │ ├── theodista-decally
│ │ ├── the-only-exception
│ │ │ └── kimberly-geswein_the-only-exception
│ │ ├── the-queenthine
│ │ │ └── The Queenthine demo
│ │ ├── the_wave
│ │ ├── think-dreams
│ │ ├── toubibdemo
│ │ ├── turkeyface
│ │ ├── typhoon-type-suthi-srisopha_sweet-hipster
│ │ ├── undercut
│ │ ├── variane-script
│ │ ├── velocity-demo
│ │ ├── vengeance
│ │ ├── victorisa
│ │ ├── waiting-for-the-sunrise
│ │ ├── watasyina
│ │ ├── westbury-signature-demo-version
│ │ │ └── Westbury-Signature-Demo-Version
│ │ ├── white_angelica
│ │ ├── wiegel-kurrent
│ │ ├── wiegel-latein
│ │ ├── Windsong
│ │ ├── winkdeep
│ │ ├── wolgast-two
│ │ ├── wonder_bay
│ │ ├── written-on-his-hands
│ │ │ └── kimberly-geswein_written-on-his-hands
│ │ ├── you-wont-bring-me-down
│ │ └── zeyada
│ └── normal
│ ├── alexey-kryukov_theano
│ ├── daniel-johnson_didact-gothic
│ ├── david-perry_cardo
│ ├── dejavu_dejavu-sans
│ ├── dejavu_dejavu-serif
│ ├── ek-type_ek-mukta
│ ├── georg-duffner_eb-garamond
│ ├── gnu-freefont_freemono
│ ├── gnu-freefont_freesans
│ ├── gnu-freefont_freeserif
│ ├── google_noto-sans
│ ├── google_noto-serif
│ ├── google_roboto
│ ├── gust-e-foundry_texgyreschola
│ ├── gust-e-foundry_texgyretermes
│ ├── james-kass_code2000
│ ├── kineticplasma-fonts_din-kursivschrift
│ ├── kineticplasma-fonts_falling-sky
│ ├── kineticplasma-fonts_mechanical
│ ├── kineticplasma-fonts_trueno
│ ├── linux-libertine_linux-libertine
│ ├── m-fonts_m-2p
│ ├── nymphont_aver
│ ├── red-hat-inc_liberation-sans
│ ├── sil-international_charis-sil
│ ├── sil-international_doulos-sil
│ ├── sil-international_doulos-sil-compact
│ ├── sil-international_gentium-book-basic
│ ├── sil-international_gentium-plus
│ └── st-gigafont-typefaces_code2003
├── glyph_font
│ ├── ababil-script-demo
│ │ └── MJ Ababil Demo
│ ├── aldus_regal
│ ├── aldus_romant
│ ├── aldus_royal
│ ├── anglo-text
│ ├── art-designs-by-sue_fairies-gone-wild
│ ├── art-designs-by-sue_fairies-gone-wild-plus
│ ├── camelotcaps
│ ├── cameoappearance
│ ├── character_cherubic-initials
│ ├── character_masselleam
│ ├── character_romantique-initials
│ ├── cheap-stealer
│ │ └── cheap stealer
│ ├── cheshire-initials
│ ├── chung-deh-tien-chase-zen_chase-zen-jingletruck-karachi
│ ├── cloutierfontes_british-museum-1490
│ ├── colchester
│ ├── dan-roseman_chaucher
│ ├── decorated-roman-initials
│ ├── digital-type-foundry_burton
│ ├── dominatrix
│ ├── ds-romantiques
│ ├── egyptienne-zierinitialien
│ ├── ehmcke-fraktur-initialen
│ ├── ehmcke-schwabacher-initialen
│ ├── elzevier-caps
│ ├── eva-barabasne-olasz_kahirpersonaluse
│ ├── extraornamentalno2
│ ├── fleurcornercaps
│ ├── flowers-initials
│ ├── gate-and-lock-co_metalover
│ ├── gemfonts_gothic-illuminate
│ ├── genzsch-initials
│ ├── george-williams_andrade
│ ├── george-williams_floral-caps-nouveau
│ ├── george-williams_morris
│ ├── george-williams_square-caps
│ ├── germanika-personal-use
│ │ └── Germanika Personal Use
│ ├── griffintwo
│ ├── house-of-lime_fleurcornercaps
│ ├── house-of-lime_german-caps
│ ├── house-of-lime_gothic-flourish
│ ├── house-of-lime_lime-blossom-caps
│ ├── house-of-lime_limeglorycaps
│ ├── intellecta-design_centennialscriptfancy-three
│ ├── intellecta-design_hard-to-read-monograms
│ ├── intellecta-design_holbeinchildrens
│ ├── intellecta-design_intellecta-monograms-random-eight
│ ├── intellecta-design_intellecta-monograms-random-sam
│ ├── intellecta-design_intellecta-monograms-random-six
│ ├── intellecta-design_intellecta-monograms-random-two
│ ├── intellecta-design_jaggard-two
│ ├── intellecta-design_nardis
│ ├── jlh-fonts_apex-lake
│ ├── kaiserzeitgotisch
│ ├── kanzler
│ ├── kr-keltic-one
│ ├── lime-blossom-caps
│ ├── lord-kyl-mackay_floral-majuscules-11th-c
│ ├── lord-kyl-mackay_gothic-leaf
│ ├── lorvad_spatz
│ ├── manfred-klein_delitschinitialen
│ ├── manfred-klein_lombardi-caps
│ ├── manfred-klein_vespasiancaps
│ ├── manfred-klein_vespasiansflorials
│ ├── medici-text
│ ├── medievalalphabet
│ ├── morris-initialen
│ ├── napoli-initialen
│ ├── neugotische-initialen
│ ├── nouveau-drop-caps
│ ├── paisleycaps
│ ├── pamela
│ ├── panhead
│ ├── paulus-franck-initialen
│ ├── pau-the-1st
│ ├── precious
│ ├── rediviva
│ ├── rothenburg-decorative
│ ├── royal-initialen
│ ├── rudelsberg
│ ├── sentinel
│ ├── sniper
│ ├── spring
│ ├── the-black-box_seven-waves-sighs-salome
│ ├── tulips
│ ├── typographerwoodcutinitialsone
│ ├── unger-fraktur-zierbuchstaben
│ ├── victorian-initials-one
│ ├── vtks-deja-vu
│ ├── vtks-focus
│ ├── vtks-mercearia
│ ├── vtks-simplex-beauty-2
│ ├── vtks-sonho
│ ├── vtks-velhos-tempos
│ ├── waste-of-paint
│ ├── west-wind-fonts_exotica
│ ├── west-wind-fonts_leafy
│ ├── zallman-caps
│ └── zamolxis_zamolxisornament
├── noise_pattern
│ ├── border_hole
│ ├── center_hole
│ ├── corner_hole
│ └── phantom_character
├── text
└── wikiart
├── Abstract_Expressionism
├── Action_painting
├── Analytical_Cubism
├── Art_Nouveau_Modern
├── Baroque
├── Color_Field_Painting
├── Contemporary_Realism
├── Cubism
├── Early_Renaissance
├── Expressionism
├── Fauvism
├── High_Renaissance
├── Impressionism
├── Mannerism_Late_Renaissance
├── Minimalism
├── Naive_Art_Primitivism
├── New_Realism
├── Northern_Renaissance
├── Pointillism
├── Pop_Art
├── Post_Impressionism
├── Realism
├── Rococo
├── Romanticism
├── Symbolism
├── Synthetic_Cubism
└── Ukiyo_e
362 directories
@monniert
I generated the ground-truth masks using the via_converter.py
script that you included, but they are created without borders, since when I used the VIA annotator I was boxing the text lines.
I think it would be better to add an option to generate masks with borders when using via_converter.py.
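One way such a border option could rasterize each annotation (a sketch with PIL; the label values are placeholders for the converter's actual label map):

```python
from PIL import Image, ImageDraw

def polygon_mask_with_border(size, polygon, fill_label=1, border_label=2,
                             border_width=4):
    """Rasterize one annotation polygon with an explicit border class.

    size: (width, height); polygon: list of (x, y) points; the label
    integers are hypothetical segmentation classes -- match them to
    your own label map.
    """
    mask = Image.new("L", size, 0)
    draw = ImageDraw.Draw(mask)
    draw.polygon(polygon, fill=fill_label)
    # draw the outline on top so the border class overwrites the fill edge
    draw.line(polygon + [polygon[0]], fill=border_label, width=border_width)
    return mask
```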
Hi. Thanks for the great tool!
I can't get the wikiart.zip to download. At first I ran the script included in the package, but it always timed out. Then I went directly to , but I get a message (in Chrome) that "this site can't be reached."
Is there any other way to obtain this resource?
Thanks in advance.
Hi. docExtractor is doing a great job! Any suggestions for getting from the line level down to the individual word level? Does this capability perhaps exist already? Or could you make any recommendations -- either for augmenting docExtractor or perhaps something that already exists in Python elsewhere?
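One common heuristic for going from a line crop to words is a vertical projection profile: count ink pixels per column of the binarized line and cut at wide empty runs. A rough sketch with numpy (a hypothetical helper, not part of docExtractor):

```python
import numpy as np

def split_line_into_words(line_img, min_gap=5):
    """Return (start, end) column spans of words in a binarized line image.

    line_img: 2-D numpy array where ink pixels are > 0.  A run of at least
    `min_gap` empty columns is treated as an inter-word gap; tune min_gap
    to the line height and script of your material.
    """
    ink_per_col = (line_img > 0).sum(axis=0)
    words, start, empty = [], None, 0
    for x, ink in enumerate(ink_per_col):
        if ink:
            if start is None:
                start = x      # first inked column opens a word
            empty = 0
        elif start is not None:
            empty += 1
            if empty >= min_gap:
                # close the word at the first column of the gap
                words.append((start, x - empty + 1))
                start, empty = None, 0
    if start is not None:
        words.append((start, len(ink_per_col)))
    return words
```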
Hi guys,
I really like your work and want to use it in my project. But a big problem I have right now is the conda environment... I tried to set up the environment with your environment.yml,
but I got a huge number of package conflicts.
Is there anything I'm missing? Or is the environment indeed inconsistent, and can I just install with pip
while ignoring all the conflicts?
I'm using Anaconda 4.7.10
Thanks for your time!
Greetings from Germany,
Nicole
Hi!
I read your paper and viewed your video with interest, and I would like to explore using your code for my application - getting layout segmentation from ~100-year-old newspapers. So I downloaded the repo, but while trying to set up the Anaconda environment, I discovered that you are using a number of dependencies that are Linux-specific and not available for Windows. If there are no Windows versions available, I can set up Windows Subsystem for Linux (WSL) and use it that way. But I really would like to see how your code handles some example images of newspaper pages before I go to the trouble of setting up WSL. So I went to your demo website - https://enherit.paris.inria.fr/
- to see if I could use it for this evaluation - but it is down. Could you please establish a new demo website so I can evaluate your repo?
Thanks!
Thank you very much for sharing the code with us. Recently, I inspected and tested all the code that you provided. I noticed that your custom PolynomialLR inside the schedulers package returns the same result as ConstantLR does. Please let me know if it is a bug. Thank you in advance.
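For reference, a polynomial schedule is usually defined as lr = base_lr * (1 - step/max_steps)^power, which is strictly decreasing -- so a scheduler that returns base_lr at every step is indeed behaving like a constant one. A minimal reference implementation (my own sketch, not docExtractor's code):

```python
def polynomial_lr(base_lr, step, max_steps, power=0.9):
    """Reference polynomial decay: lr falls from base_lr to 0 over max_steps.

    power=0.9 is a common choice for segmentation networks; power=1.0
    gives plain linear decay.
    """
    return base_lr * (1.0 - min(step, max_steps) / max_steps) ** power
```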
First of all, thank you very much for your work! docExtractor extracts text lines very well out of the box.
But I want to fine-tune the model with custom data and I think my question is related to the process of creating the GT.
As you stated in #10, you recommend adding some border around the annotated text lines. As I went through the examples on https://enherit.paris.inria.fr/ it seems that borders are not annotated explicitly.
Moreover, I'm not quite sure which labels were used when the model was trained. On https://enherit.paris.inria.fr/ the text lines are labeled as text, but the paper states labels like paragraph or table.
In my case I work with tabular data. Should text inside the cells therefore be labeled as table?
Thank you in advance!
A good suggestion would be an option to load the data on the fly, meaning that instead of loading all the images for training/prediction into memory at once, we only load the data in portions.
Even though this option might increase the time needed to process the data, it is surely beneficial when dealing with a huge number of images for training/prediction, ensuring the ability to handle huge datasets while keeping RAM usage in check.
Examples:
--train_on_fly_images 100
: 100 images to be loaded into RAM at a single time
--train_on_fly_json 10
: 10 complete json files to be loaded into RAM at a single time
Note:
--train_on_fly_json would be used when having multiple .json files for training;
a single .json file can contain multiple images.
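A minimal sketch of the idea (the class name and loader hook are hypothetical; it mirrors the torch.utils.data.Dataset protocol so a DataLoader could batch it):

```python
class LazyImageDataset:
    """Load images on demand instead of all at once.

    Follows the __len__/__getitem__ protocol of torch.utils.data.Dataset;
    only the file paths live in RAM, and each image is read when requested
    rather than at construction time.
    """
    def __init__(self, paths, loader=None):
        self.paths = list(paths)
        if loader is None:
            from PIL import Image
            loader = lambda p: Image.open(p).convert("RGB")
        self.loader = loader

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # the image is opened here, at access time, not at construction
        return self.loader(self.paths[idx])
```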
It would be great if you could store generated models and datasets that you collected in an archive like Zenodo. Zenodo would give the dataset (which requires a bit of metadata) a DOI for easier citation and it provides stronger long-term promises than Google Drive or Dropbox. (IMHO providing Google Drive links for important data looks phishy, although I know many other projects do the same.)
Thanks for your consideration!
Hello again.
The images that result from running docExtractor (as found in the "text" and "illustration" folders) -- should I expect these to be full resolution with respect to the original? Or is some reduction or loss involved?
(I did my own little (if crude) test. Here is a clip resulting from docExtractor:
The file size is 33.7 KB.
Here is a crop I made from the original:
The file size is 70.4 KB.
I confess I don't know enough about digital imagery, how files are saved, etc. to know whether this proves anything or not. :) Regardless, my goal is to have crops made using docExtractor that are lossless.)
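File size alone doesn't measure loss -- it mostly reflects encoder settings. What does guarantee a lossless crop is cutting in pixel space and saving to a lossless format such as PNG; a sketch (the helper name is mine, not docExtractor's):

```python
from PIL import Image

def lossless_crop(src_path, box, dst_path):
    """Crop without re-encoding loss.

    box is (left, upper, right, lower) in original pixel coordinates.
    Saving as PNG keeps the cropped pixels bit-exact; saving as JPEG
    would re-encode and lose information even at high quality settings.
    """
    with Image.open(src_path) as img:
        img.crop(box).save(dst_path, format="PNG")
```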
A good suggestion would be an option to save the predicted regions as a VGG Image Annotator .json file, even if the order of the predicted regions is wrong.
--output_json ./output/detected.json
--ignore_readingorder
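A sketch of what writing such a file could look like (assuming the VIA 2.x region layout; the helper and the "label" attribute are hypothetical):

```python
import os

def regions_to_via(image_path, polygons, label="text"):
    """Build one VIA 2.x record for an image from predicted polygons.

    polygons: list of [(x, y), ...] point lists.  The "<filename><filesize>"
    key format and the polygon shape_attributes follow the VIA 2.x JSON
    layout; check against your VIA version before relying on it.
    """
    filename = os.path.basename(image_path)
    size = os.path.getsize(image_path)
    regions = [{
        "shape_attributes": {
            "name": "polygon",
            "all_points_x": [int(x) for x, _ in poly],
            "all_points_y": [int(y) for _, y in poly],
        },
        "region_attributes": {"label": label},
    } for poly in polygons]
    return {f"{filename}{size}": {
        "filename": filename,
        "size": size,
        "regions": regions,
        "file_attributes": {},
    }}
```

Dumping the merged records of all pages with json.dump would then give a file that VIA can open directly.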
Hi @monniert,
Good day :)
I'm using docExtractor in my project to extract text lines. The problem is, I have a LOT of pages. Therefore I'm planning to use multiprocessing to do it in parallel. Is this already an option in this project? Or do you have any suggestions for improving the efficiency of such extraction?
Thanks a lot!
Best,
Nicole
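One way the parallelism could be sketched (the page-level function is a placeholder for the actual model call; note that several processes sharing one GPU will contend for memory, so CPU inference or one GPU per worker is assumed here):

```python
from multiprocessing import Pool

def extract_page(path):
    """Stand-in for one page's extraction; real code would run the model.

    Keeping the model in a per-process global (loaded on first use) avoids
    re-loading the weights for every single page.
    """
    return path, f"regions-of-{path}"  # hypothetical result

def extract_all(paths, workers=4, pool_cls=Pool):
    """Map extract_page over many pages in parallel.

    pool_cls is swappable, e.g. multiprocessing.dummy.Pool for a thread
    pool when the per-page work already releases the GIL.
    """
    with pool_cls(workers) as pool:
        # imap_unordered returns results as workers finish, in any order
        return dict(pool.imap_unordered(extract_page, paths))
```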
I'm going through the code and I can't figure out how to run the UI.
I saw that the demo is down and that you are not able to bring it back up (issue #18 ). So I tried to do it myself but I only see the code to segment the regions. In the video it seems that there is some sort of UI to navigate the book and the annotated regions. How can I start that program/server?
@monniert Hi there,
I have trained a new model to detect text regions/paragraphs, but the results were bad even though the training and validation accuracy was high. A sample dataset:
https://drive.google.com/drive/folders/1bCuI9SYXOuRUeP4MXY0gfcaKu6O3_WlM?usp=sharing
Hi @monniert
Thank you very much for sharing the code with us.
When running the "tester.py" script, there is the following bug related to Image.blend (I have checked that pred_img.size == img.size). Is it related to the resize function?
(926, 1280)
(926, 1280)
Traceback (most recent call last):
File "tester.py", line 122, in <module>
tester.run()
File "tester.py", line 66, in run
self.save_prob_and_seg_maps()
File "tester.py", line 101, in save_prob_and_seg_maps
blend_img = Image.blend(img, pred_img, alpha=0.4)
File "/home/pejc/anaconda2/envs/layout/lib/python3.8/site-packages/PIL/Image.py", line 3011, in blend
return im1._new(core.blend(im1.im, im2.im, alpha))
ValueError: images do not match
Thank you in advance
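Since the sizes match, the usual remaining cause of "images do not match" is a mode mismatch (for example an "L" or "P" prediction map blended against an "RGB" page) -- Image.blend requires both size and mode to be equal. A possible workaround (a sketch, not a confirmed fix for tester.py):

```python
from PIL import Image

def safe_blend(img, pred_img, alpha=0.4):
    """Blend two images whose sizes match but whose modes may differ.

    Converting both to RGB satisfies Image.blend's requirement that
    size AND mode be identical.
    """
    return Image.blend(img.convert("RGB"), pred_img.convert("RGB"), alpha)
```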
Hi @monniert
Thank you very much for sharing the code with us.
When running the "tester.py" script, are the results obtained using only the trained model, or the trained model followed by the post-processing step?
Thank you in advance
@monniert
I trained a text-line detector model; the accuracy seemed high, but when I tested it, the results were very bad. I even tried training at different image sizes, but the results were still not good.
My guess is that the ground truth should not all be in the same color ("cyan"); you might need to use two colors, for example: first line "cyan", second line "red", third line "cyan", fourth line "red", and so on. This might help in separating close regions.
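The alternating-label idea could be sketched like this (the label names are placeholders for whatever colors/classes the ground-truth masks actually use):

```python
def alternate_line_labels(line_polygons, labels=("cyan", "red")):
    """Pair each text-line polygon with an alternating label.

    Consecutive lines get different labels so the network has to separate
    touching lines instead of merging them into one blob.
    """
    return [(poly, labels[i % len(labels)])
            for i, poly in enumerate(line_polygons)]
```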