Comments (7)
OK, the following works.
First, convert the PDF to HTML (assuming you have mutool installed):
import subprocess
# run() blocks until mutool finishes, so the HTML file exists before parsing starts
subprocess.run(["mutool", "draw", "-o", "ICUS_Rules_Clean_up.html", "ICUS_Rules_Clean_up.pdf"], check=True)
Then parse the document:
corpus_parser = Parser(structural=True, lingual=True, visual=True, pdf_path=pdf_path)
%time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)
This is very slow for a single document, though:
[INFO] fonduer.utils.udf - Clearing existing...
[INFO] fonduer.utils.udf - Running UDF...
CPU times: user 1.16 s, sys: 484 ms, total: 1.64 s
Wall time: 4min 44s
Then check the counts:
from fonduer import Document, Sentence
print("Documents:", session.query(Document).count())
print("Sentences:", session.query(Sentence).count())
Documents: 1
Sentences: 6858
mutool is open source, so it seems worth recommending as the conversion tool.
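For more than one document, the conversion step above could be wrapped in a small helper. This is only a sketch under assumptions: the function names `build_mutool_command` and `convert_directory` are made up for illustration, and it relies on mutool picking the HTML output format from the `.html` extension, as in the command above.

```python
import shutil
import subprocess
from pathlib import Path


def build_mutool_command(pdf_path):
    """Return the argument list for `mutool draw -o <name>.html <name>.pdf`."""
    pdf_path = Path(pdf_path)
    # Hypothetical naming choice: write the HTML next to the PDF, same stem.
    html_path = pdf_path.with_suffix(".html")
    return ["mutool", "draw", "-o", str(html_path), str(pdf_path)]


def convert_directory(pdf_dir):
    """Convert every *.pdf in pdf_dir and return the HTML paths written."""
    if shutil.which("mutool") is None:
        raise RuntimeError("mutool was not found on PATH")
    written = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        command = build_mutool_command(pdf_path)
        # check=True raises CalledProcessError if mutool exits with an error.
        subprocess.run(command, check=True)
        written.append(Path(command[3]))
    return written
```

Passing the arguments as a list avoids `shell=True` and the quoting problems that come with file names containing spaces.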
from fonduer.
Is this related to HazyResearch/pdftotree#27 (comment)?
When I try to convert the PDF linked above with pdftotree, I get the same error reported in that issue.
from fonduer.
@clayms, can you try converting the PDF to HTML with Adobe Acrobat to see whether that succeeds?
from fonduer.
I don't have access to Acrobat, only the Reader, which appears to convert only to .docx, .doc, .rtf, .xlsx, or .pptx.
As suggested in HazyResearch/pdftotree#38 (comment), I will convert to HTML through MS Word. I will let you know the results.
from fonduer.
Converting with MS Word leads to this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 240614: invalid start byte
I replaced the HTML file generated by MS Word with the one generated by Poppler's pdftohtml and ran the same code again. It worked, but again it output only 19 sentences, all from the first page.
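One possible workaround for the UnicodeDecodeError, sketched under an assumption: byte 0x95 is the bullet character in Windows-1252, so the Word export may have been saved in that encoding rather than UTF-8. This was not confirmed in the thread. The helper name `reencode_to_utf8` and the file names in the usage note are hypothetical.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="cp1252"):
    """Decode src with source_encoding and write it back out as UTF-8."""
    raw = Path(src).read_bytes()
    # errors="replace" keeps going on the few byte values cp1252 leaves undefined.
    text = raw.decode(source_encoding, errors="replace")
    # Write bytes directly so no platform newline translation is applied.
    Path(dst).write_bytes(text.encode("utf-8"))
    return text
```

For example, `reencode_to_utf8("word_export.html", "word_export_utf8.html")` would produce a file that can be read as UTF-8. The HTML may still carry a `<meta>` charset declaration naming the old encoding, which might need to be edited by hand.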
from fonduer.
A possible alternative?
https://github.com/coolwanglu/pdf2htmlEX
from fonduer.
This looks promising:
https://mupdf.com/docs/manual-mutool-draw.html
from fonduer.