Comments (3)
Here is an example of how the function fz_print_text_page prints a PDF page. The dotted rectangles indicate which boxes are (correctly) kept together. The red number in each box shows the printing sequence. As one can see, this sequence is totally illogical:
from pymupdf.
Feedback on MuPDF's Bugzilla page indicates, in essence, that they consider adjusting the output of fz_print_text_page to be out of scope, mainly because
- the output sequence of this function comes from "the orders that the PDF file performs the marking operations".
- fz_print_text_page_xml contains all information required to achieve a better result.
So we will consider creating our own functionality for improving extractText().
I have modified the TextPage.extractText() method:
It now includes an optional integer parameter basic which indicates whether fz_print_text_page should be used (basic = True) or not.
If basic = False is specified or defaulted, then fz_print_text_page_xml is invoked, and its return value is transformed to create a better ordered plain text string, which is then passed back to the caller. The above example then looks like so:
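As a rough illustration of the kind of transformation described here, the following minimal sketch reorders text blocks by position. It is not the code used in extractText(). It assumes a simplified, hypothetical XML layout (a block element with a bbox attribute of the form "x0 y0 x1 y1" and the text as element content), which is not the actual output format of fz_print_text_page_xml. It also assumes that y grows downward and that sorting top-to-bottom, then left-to-right, is an acceptable reading order.

```python
import xml.etree.ElementTree as ET


def reorder_blocks(xml_text):
    """Return block texts sorted top-to-bottom, then left-to-right.

    Assumes a simplified, hypothetical layout: <block bbox="x0 y0 x1 y1">text</block>.
    """
    root = ET.fromstring(xml_text)
    blocks = []
    for block in root.iter("block"):
        # bbox is assumed to hold four space-separated numbers
        x0, y0, x1, y1 = (float(v) for v in block.get("bbox").split())
        text = "".join(block.itertext()).strip()
        blocks.append((y0, x0, text))
    # Sort by top edge first, then by left edge
    blocks.sort(key=lambda b: (b[0], b[1]))
    return "\n".join(text for _, _, text in blocks)


# Hypothetical sample: blocks listed in the order they were drawn
SAMPLE = """<page>
<block bbox="50 300 200 320">footer</block>
<block bbox="300 100 500 120">right column</block>
<block bbox="50 100 250 120">left column</block>
<block bbox="50 40 500 60">title</block>
</page>"""

if __name__ == "__main__":
    print(reorder_blocks(SAMPLE))
```

A plain positional sort like this ignores multi-column layouts and overlapping boxes, so a real implementation would likely need more than this.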