Comments (12)
Hi @axfelix, CERMINE and GROBID are two separate projects; please use GROBID's issue tracker for this.
from cermine.
Oh, sorry, I actually meant to say Cermine. That's embarrassing -- I was thinking of Grobid earlier this morning before I opened the issue.
Ah, ok then. We actually use the iText library to parse the PDF stream, not Poppler. I believe, however, that iText also supports extracting images, so this might be possible.
Could you describe the use case in more detail? In particular, what would you like to get in the output: just a set of images extracted from a PDF file, or more information about them?
Sure -- the use case is getting <fig> elements in the output, providing a relative link to a .png file that is produced in the same output directory as the XML. Right now, for us to do this, we need to run pdfimages on top of Cermine or Grobid and add all of the <fig> elements to the end of the article body just to get them in there at all.
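For illustration, the post-processing step described above (appending one <fig> per extracted image to the end of the article body) might look roughly like the sketch below. It is not code from Cermine, Grobid, or pdfimages; the function name `append_figures`, the `fig1`, `fig2`, ... id scheme, and the sample file names are all assumptions made for the example, and it assumes a JATS-like document whose root element has a direct <body> child.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# JATS references image files through the XLink href attribute on <graphic>.
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)


def append_figures(jats_xml: str, image_paths: Iterable[str]) -> str:
    """Append one <fig><graphic/></fig> per image to the end of <body>.

    Hypothetical helper: ids and layout are illustrative assumptions.
    """
    root = ET.fromstring(jats_xml)
    body = root.find("body")
    if body is None:
        raise ValueError("no <body> element found under the root")
    for index, path in enumerate(image_paths, start=1):
        fig = ET.SubElement(body, "fig", {"id": f"fig{index}"})
        # The relative path points at a .png next to the XML output.
        ET.SubElement(fig, "graphic", {f"{{{XLINK_NS}}}href": path})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "<article><body><p>Some text.</p></body></article>"
    print(append_figures(sample, ["images/img_1.png", "images/img_2.png"]))
```

The figures land after all existing body content, with no captions, which matches the "just to get them in there at all" workaround rather than placing each figure where it is referenced.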
Ok, I will take a closer look at this and I'll get back to you when I know more.
@axfelix I finally found time to look at this :) It seems fairly easy to extract the images and add relative links at the end of the article body, as you described. Extracting the right captions, however, is not as trivial and would require more work and time. Do you need the captions? Would the images alone, without captions, still be helpful?
Hi Dominika,
The images without the captions would still be very useful -- the captions are of interest, but our priority is not having to call an external library to patch the JATS afterward just to preserve the images.
@axfelix I implemented image extraction in the "images_extraction" branch. Would you be interested in testing it and providing feedback?
It requires building the code from the branch, but this should be straightforward (more information in the main README). Images are extracted by the ContentExtractor class by default, so it should suffice to provide the path to the PDFs using the -path option (again, the extraction command from the README should suffice).
One thing I noticed: from some PDFs, a lot of 1x1 pixel images (dots) are extracted, and in some cases I also saw horizontal lines (images with 1-pixel height). Do you think the code should filter those out?
Wow, thanks for the quick implementation, Dominika! Built and tested and appears to be working great -- this saves us an additional library call and is hugely appreciated.
I'd be in favour of filtering out images that have a 1-pixel height or width; we can do this with an additional ImageMagick pass, but I don't see much reason not to do it by default.
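As a rough sketch of the rule being discussed (drop any image whose width or height is 1 pixel), the filter could be expressed as below. This is not the code that went into Cermine; `ExtractedImage`, `is_meaningful`, `filter_images`, and the `min_side` parameter are hypothetical names chosen for the example.

```python
from typing import Iterable, List, NamedTuple


class ExtractedImage(NamedTuple):
    """Hypothetical record for an image pulled out of a PDF."""
    name: str
    width: int   # pixels
    height: int  # pixels


def is_meaningful(image: ExtractedImage, min_side: int = 2) -> bool:
    """Return False for dots (1x1) and for lines that are 1 pixel high or wide."""
    return image.width >= min_side and image.height >= min_side


def filter_images(images: Iterable[ExtractedImage], min_side: int = 2) -> List[ExtractedImage]:
    """Keep only images whose width and height are both at least min_side."""
    return [image for image in images if is_meaningful(image, min_side)]


if __name__ == "__main__":
    extracted = [
        ExtractedImage("dot.png", 1, 1),
        ExtractedImage("rule.png", 400, 1),
        ExtractedImage("figure.png", 640, 480),
    ]
    for image in filter_images(extracted):
        print(image.name)
```

Checking both dimensions covers dots, horizontal rules, and vertical rules with a single condition; the threshold is kept as a parameter only so the sketch is easy to adapt.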
Great. I've added filtering of images with a 1-pixel height or width and merged everything into master. I am closing this issue for now; if you find any problems or bugs, it can be reopened.
No need to reopen, but curious: when is your next release scheduled?
There is no exact date yet; most likely in a few weeks.