Comments (12)

dtkaczyk avatar dtkaczyk commented on June 14, 2024

Hi @axfelix, CERMINE and GROBID are two separate projects; please use GROBID's issue tracker for this.

from cermine.

axfelix avatar axfelix commented on June 14, 2024

Oh, sorry, I actually meant to say Cermine. That's embarrassing -- I was thinking of Grobid earlier this morning before I opened the issue.

dtkaczyk avatar dtkaczyk commented on June 14, 2024

Ah, ok then. We actually use the iText library to parse the PDF stream, not Poppler. I believe, however, that iText also supports extracting images, so this might be possible.

Could you describe the use case in more detail? In particular, what would you like to obtain in the output: just a set of images extracted from a PDF file, or more information about them?

axfelix avatar axfelix commented on June 14, 2024

Sure -- the use case is getting <fig> elements in the output, providing a relative link to a .png file that is produced in the same output directory as the XML. Right now, for us to do this, we need to run pdfimages on top of Cermine or Grobid and add all of the <fig> elements to the end of the article body just to get them in there at all.
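
To make the requested output concrete, here is a minimal sketch of appending a `<fig>` element with a relative image link to an article body, using only the JDK's DOM API. The class name `FigAppender`, the figure id, and the file path are hypothetical, and the `<fig>`/`<graphic xlink:href>` shape follows common JATS usage rather than anything confirmed about Cermine's output.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of Cermine or Grobid.
final class FigAppender {
    // Standard XLink namespace used by JATS for href attributes.
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    // Builds an empty <article><body/></article> document to work with.
    static Document newArticle() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            Element article = doc.createElement("article");
            article.appendChild(doc.createElement("body"));
            doc.appendChild(article);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Appends <fig id="..."><graphic xlink:href="..."/></fig> at the end of body.
    static Element appendFig(Document doc, Element body, String id, String relativePath) {
        Element fig = doc.createElement("fig");
        fig.setAttribute("id", id);
        Element graphic = doc.createElement("graphic");
        graphic.setAttributeNS(XLINK_NS, "xlink:href", relativePath);
        fig.appendChild(graphic);
        body.appendChild(fig);
        return fig;
    }
}
```

Calling `appendFig(doc, body, "fig1", "images/img_1.png")` once per extracted image would reproduce the "all figures at the end of the body" workaround described above; the path is relative to the directory holding the XML.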

dtkaczyk avatar dtkaczyk commented on June 14, 2024

Ok, I will take a closer look at this and I'll get back to you when I know more.

dtkaczyk avatar dtkaczyk commented on June 14, 2024

@axfelix I finally found time to look at this :) It seems fairly easy to extract the images and add relative links at the end of the article body, as you described. Extracting the right captions, however, is not as trivial and would require more work and time. Do you need the captions? Would the images alone, without captions, still be helpful?

axfelix avatar axfelix commented on June 14, 2024

Hi Dominika,

The images without the captions would still be very useful -- the captions are of interest, but not needing to call an external library to hack the JATS afterward in order to preserve the images is a priority.

dtkaczyk avatar dtkaczyk commented on June 14, 2024

@axfelix I implemented image extraction in the "images_extraction" branch. Would you be interested in testing it and providing feedback?

It requires building the code from the branch, but this should be straightforward (more information in the main README). Images are extracted by the ContentExtractor class by default, so it should suffice to provide the path to the PDFs using the -path option (again, the extraction command from the README should work).

One thing I noticed: some PDFs yield a lot of 1x1 pixel images (dots), and in some cases horizontal lines as well (images with a height of 1 pixel). Do you think the code should filter those out?

axfelix avatar axfelix commented on June 14, 2024

Wow, thanks for the quick implementation, Dominika! Built and tested and appears to be working great -- this saves us an additional library call and is hugely appreciated.

I'd be in favour of filtering out images that have a 1-pixel height or width; we can do this with an additional imagemagick pass but I don't see much reason not to do it by default.

dtkaczyk avatar dtkaczyk commented on June 14, 2024

Great. I've added filtering of images with a 1-pixel height or width and merged everything into master. I am closing this issue for now; if you find any problems or bugs, it can be reopened.
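
For anyone who wants the same behaviour outside the library, the rule discussed here (drop any image whose width or height is 1 pixel) can be sketched as follows. This is an illustration under assumptions, not the code that was merged: the class `TinyImageFilter`, the `MIN_SIDE` constant, and the use of `BufferedImage` as the image type are all hypothetical.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Cermine's actual implementation.
final class TinyImageFilter {
    // Assumed threshold: both sides must be at least 2 pixels,
    // which rejects 1x1 dots and 1-pixel-high or 1-pixel-wide lines.
    static final int MIN_SIDE = 2;

    static boolean isMeaningful(BufferedImage image) {
        return image.getWidth() >= MIN_SIDE && image.getHeight() >= MIN_SIDE;
    }

    // Returns a new list with only the images that pass the size check,
    // preserving their original order.
    static List<BufferedImage> filter(List<BufferedImage> images) {
        List<BufferedImage> kept = new ArrayList<>();
        for (BufferedImage image : images) {
            if (isMeaningful(image)) {
                kept.add(image);
            }
        }
        return kept;
    }
}
```

Filtering on dimensions alone is cheap because it needs no pixel data; a stricter filter (for example, a larger minimum side) would only require changing the assumed threshold.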

axfelix avatar axfelix commented on June 14, 2024

No need to reopen, but curious: when is your next release scheduled?

dtkaczyk avatar dtkaczyk commented on June 14, 2024

There is no exact date yet; most likely in a few weeks.
