Comments (6)
@JorjMcKie, yes there should be. I'll add the API for extracting image/text from a page.
from pymupdf.
Hi @rk700,
it is good to see you could find some time in the middle of your big change of life! I hope everything is turning out well for you.
In the meantime I have been trying to continue some of the work.
- some stuff in the documentation
- a solution for a table of contents
- added a close function for PDFs, so that the running process can delete / rename the input PDF. I have been guessing here: is it sufficient to just call pdf_close_document? What about other file types?
I am afraid I might have destroyed your change concerning text extraction. I will try it out ...
@JorjMcKie, thank you for your work on the project! Especially when I could not devote much time to it.
For closing the doc, I think we can open a new issue. Currently, fz_drop_document, which closes the doc, is called in the destructor of the Document object. That is, the document will not be closed as long as there are still Python references to it. And usually the destructor is not called as soon as there are no more references. So we should add an explicit closing function and maybe then remove the Document destructor. BTW, fz_drop_document can deal with all the supported doc types.
And don't worry about the text extraction code, it can be recovered easily from the commit history:)
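To illustrate the difference between closing in the destructor and closing explicitly, here is a minimal pure-Python sketch. The Document class below is a hypothetical stand-in, not the actual binding code: a real wrapper would call the native drop function where the comment says so, and the is_closed flag and context-manager methods are assumptions added for illustration.

```python
class Document:
    """Hypothetical stand-in for a wrapper that owns a native document handle."""

    def __init__(self, name):
        self.name = name
        self.is_closed = False

    def close(self):
        # Explicit close: release the resource now, regardless of how many
        # Python references still point at this object. Safe to call twice.
        if not self.is_closed:
            # A real binding would drop the native document here.
            self.is_closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Leaving a "with" block always closes the document.
        self.close()
        return False

    def __del__(self):
        # Fallback only: runs when the object is finalized, which cannot
        # happen while any other reference is still alive.
        self.close()


doc = Document("input.pdf")
alias = doc   # a second reference to the same object
del doc       # the destructor does not run: alias keeps the object alive
print(alias.is_closed)

alias.close() # explicit close releases the resource immediately
print(alias.is_closed)
```

With an explicit close(), the calling process decides when the file is released, so it can delete or rename the input file afterwards without waiting for the object to be finalized.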
@rk700, I have done a "repair" to extractText, so it works again.
For the close() issue: I did add a call to pdf_close_document in fitz.i. And it worked (see this large example for editing table of contents)! Of course I do not know all the consequences.
@JorjMcKie, the global context lives as long as the module, so it is freed when the program exits. And we can use the general function fz_drop_document for other types of files, since it calls a different closing function for each type.
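The idea of one general drop function that forwards to a type-specific closing function can be sketched roughly as follows. This is plain Python for illustration only, not MuPDF's C implementation; Handle, close_pdf, close_xps and drop_document are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Handle:
    """Hypothetical document handle carrying its own closing function."""
    kind: str
    close_impl: Callable[["Handle"], None]
    log: List[str] = field(default_factory=list)


def close_pdf(handle):
    # Type-specific cleanup for a PDF handle (illustrative).
    handle.log.append("pdf closed")


def close_xps(handle):
    # Type-specific cleanup for an XPS handle (illustrative).
    handle.log.append("xps closed")


def drop_document(handle):
    # General entry point: callers never need to know the document type,
    # because each handle stores the closing function that matches it.
    handle.close_impl(handle)


pdf = Handle("pdf", close_pdf)
xps = Handle("xps", close_xps)
for handle in (pdf, xps):
    drop_document(handle)
    print(handle.kind, handle.log)
```

Because the caller only ever uses the general function, a single close() in the binding would cover every supported file type.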
Text extraction tested - closing the issue.