Hi, Are there plans to add a call to pdfimages (from xpdf/poppler) t

Add pdfimages support for image extraction? about cermine HOT 12 CLOSED

ceon commented on June 14, 2024

Add pdfimages support for image extraction?

from cermine.

Comments (12)

dtkaczyk commented on June 14, 2024

Hi @axfelix CERMINE and GROBID are two different and separate projects, please use GROBID's issues for this.

axfelix commented on June 14, 2024

Oh, sorry, I actually meant to say Cermine. That's embarrassing -- I was thinking of Grobid earlier this morning before I opened the issue.

dtkaczyk commented on June 14, 2024

Ah, ok then. We actually use the iText library to parse the PDF stream, not Poppler. I believe, however, that iText also supports extracting images, so this might be possible.

Could you describe the use case in more detail? In particular, what would you like to get in the output: just a set of images extracted from a PDF file, or more information about them?

axfelix commented on June 14, 2024

Sure -- the use case is getting <fig> elements in the output, providing a relative link to a .png file that is produced in the same output directory as the XML. Right now, for us to do this, we need to run pdfimages on top of Cermine or Grobid and add all of the <fig> elements to the end of the article body just to get them in there at all.
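The workaround described above (appending one <fig> per extracted image to the end of the article body) can be sketched roughly as follows. This is only an illustration, not the code either project uses: the function name `append_figures` is hypothetical, and the choice of a `<graphic>` child carrying an `xlink:href` attribute is an assumption based on common JATS practice rather than anything stated in this thread.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# Assumption: figures are linked through xlink:href on a <graphic> child,
# as is common in JATS. The thread does not specify the exact markup.
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)


def append_figures(jats_xml: str, image_names: Iterable[str]) -> str:
    """Append one <fig> per image to the end of <body> (hypothetical helper).

    image_names are paths relative to the directory that holds the XML.
    """
    root = ET.fromstring(jats_xml)
    body = root.find("body")
    if body is None:
        # No body yet: create one so the figures have somewhere to go.
        body = ET.SubElement(root, "body")
    for name in image_names:
        fig = ET.SubElement(body, "fig")
        graphic = ET.SubElement(fig, "graphic")
        graphic.set(f"{{{XLINK_NS}}}href", name)
    return ET.tostring(root, encoding="unicode")
```

Because the figures are simply appended, they lose their original position in the text, which is the limitation the request is trying to remove.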

dtkaczyk commented on June 14, 2024

Ok, I will take a closer look at this and I'll get back to you when I know more.

dtkaczyk commented on June 14, 2024

@axfelix I finally found time to look at this :) It seems fairly easy to extract the images and add relative links at the end of the article body, as you described. Extracting the right captions, however, is not as trivial and would require more work and time. Do you need the captions? Would the images alone, without captions, still be helpful?

axfelix commented on June 14, 2024

Hi Dominika,

The images without the captions would still be very useful -- the captions are of interest, but not needing to call an external library to hack the JATS afterward in order to preserve the images is a priority.

dtkaczyk commented on June 14, 2024

@axfelix I implemented image extraction in the "images_extraction" branch. Would you be interested in testing it and providing feedback?

It requires building the code from the branch, but this should be straightforward (more information in the main README). Images are extracted by the ContentExtractor class by default, so it should suffice to provide the path to the PDFs using the -path option (again, the extraction command from the README should work as is).

One thing I noticed: for some PDFs, a lot of 1x1 pixel images (dots) are extracted, and in some cases I also saw horizontal lines (images with 1-pixel height). Do you think the code should filter those out?

axfelix commented on June 14, 2024

Wow, thanks for the quick implementation, Dominika! Built and tested and appears to be working great -- this saves us an additional library call and is hugely appreciated.

I'd be in favour of filtering out images that have a 1-pixel height or width; we can do this with an additional imagemagick pass but I don't see much reason not to do it by default.
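For anyone who wants the same rule as a post-processing step, a minimal sketch follows. It assumes the extracted files are PNGs, as in the use case described earlier, and reads the dimensions straight from the PNG header so that no imaging library is needed. The names `png_size`, `is_degenerate`, and `keep_image` are made up for this example, and this is not CERMINE's own filter.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple:
    """Return (width, height) from the IHDR chunk at the start of a PNG."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR",
    # then width and height as big-endian unsigned 32-bit integers.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    return struct.unpack(">II", data[16:24])


def is_degenerate(width: int, height: int) -> bool:
    """True for dots and hairlines: images 1 pixel wide or 1 pixel high."""
    return width <= 1 or height <= 1


def keep_image(data: bytes) -> bool:
    """Decide whether an extracted PNG is worth keeping (hypothetical rule)."""
    width, height = png_size(data)
    return not is_degenerate(width, height)
```

In practice one might call `keep_image(path.read_bytes())` for each extracted file and delete those that return False.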

dtkaczyk commented on June 14, 2024

Great. I've added filtering of images with 1-pixel height or width and merged everything into master. I am closing this issue for now; if you find any problems or bugs, it can be reopened.

axfelix commented on June 14, 2024

No need to reopen, but curious: when is your next release scheduled?

dtkaczyk commented on June 14, 2024

There is no exact date yet, but most likely in a few weeks.
