The PDF text is written in Helvetica, a Base-14 font, with an explicitly given array of character widths in which the width of the space character is not given ... and is therefore 0!
So arguably, neither of the versions 'Hello World\n' nor 'Hello \nWorld\n' is correct. The formally correct output might be "HelloWorld", which is how the PDF viewers display it.
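To make the width argument concrete, here is a minimal sketch of how the horizontal advance works out when a character has no width entry. The width table below holds the standard Helvetica metrics (in 1/1000 em) for just the glyphs involved. The file's actual width array is not shown in this post, so treat the table, the fallback value of 0 and the helper name `advance` as assumptions for illustration.

```python
# Standard Helvetica advance widths in 1/1000 em, only for the glyphs used here.
# Assumption: the PDF's own width array carries these values and has no entry
# for the space character.
HELVETICA_WIDTHS = {"H": 722, "e": 556, "l": 222, "o": 556,
                    "W": 944, "r": 333, "d": 556}
MISSING_WIDTH = 0  # width used for any character without an entry


def advance(text, font_size, widths=HELVETICA_WIDTHS, missing=MISSING_WIDTH):
    """Horizontal advance of `text` in points (hypothetical helper)."""
    return sum(widths.get(ch, missing) for ch in text) * font_size / 1000


x_start = 100.0      # x of the text origin, as in the extraction output
font_size = 24.0

x_after_hello = x_start + advance("Hello", font_size)
x_after_space = x_after_hello + advance(" ", font_size)   # space adds nothing
x_after_world = x_after_space + advance("World", font_size)

print(x_after_hello, x_after_space, x_after_world)
```

With these assumed values, "World" starts exactly where "Hello" ends, and the computed positions agree with the x-coordinates of the extracted words to within floating point precision.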
If we extract by words using page.get_text("words"), we get:
In [8]: page.get_text("words")
Out[8]:
[(100.0,
270.20001220703125,
154.6719970703125, # x1 of "Hello"
303.1759948730469,
'Hello',
0,
0,
0),
(154.6719970703125, # x0 of "World"
270.20001220703125,
217.33599853515625,
303.1759948730469,
'World',
0,
1,
0)]
... we see that the end coordinate (x1) of "Hello" equals the start coordinate (x0) of "World", which is correct for a space of width 0.
Extracting with option "dict", the versions 1.23 and 1.24 indeed behave differently. Based on blocks=page.get_text("dict")["blocks"], version 1.23 gives us one span:
In [4]: blocks=page.get_text("dict")["blocks"]
In [5]: [s for b in blocks for l in b["lines"] for s in l["spans"]]
Out[5]:
[{'size': 24.0,
'flags': 0,
'font': 'Helvetica',
'color': 0,
'ascender': 1.0750000476837158,
'descender': -0.29899999499320984,
'text': 'Hello World',
'origin': (100.0, 296.0),
'bbox': (100.0, 270.20001220703125, 217.33599853515625, 303.1759948730469)}]
Whereas version 1.24 gives us 2 spans:
In [9]: blocks=page.get_text("dict")["blocks"]
In [10]: [s for b in blocks for l in b["lines"] for s in l["spans"]]
Out[10]:
[{'size': 24.0,
'flags': 0,
'font': 'Helvetica',
'color': 0,
'ascender': 1.0750000476837158,
'descender': -0.29899999499320984,
'text': 'Hello ',
'origin': (100.0, 296.0),
'bbox': (100.0, 270.20001220703125, 161.343994140625, 303.1759948730469)},
{'size': 24.0,
'flags': 0,
'font': 'Helvetica',
'color': 0,
'ascender': 1.0750000476837158,
'descender': -0.29899999499320984,
'text': 'World',
'origin': (154.6719970703125, 296.0),
'bbox': (154.6719970703125,
270.20001220703125,
217.33599853515625,
303.1759948730469)}]
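Note that the two 1.24 spans overlap horizontally: the first bbox ends at 161.34..., while the second starts at 154.67.... The sketch below measures that overlap and joins the span texts back into one line string. It works on the span values copied from the output above (reduced to the keys that matter here). The helper name `span_overlaps` is made up for this example.

```python
# The two spans reported by version 1.24, reduced to the relevant keys.
spans = [
    {"text": "Hello ",
     "origin": (100.0, 296.0),
     "bbox": (100.0, 270.20001220703125, 161.343994140625, 303.1759948730469)},
    {"text": "World",
     "origin": (154.6719970703125, 296.0),
     "bbox": (154.6719970703125, 270.20001220703125,
              217.33599853515625, 303.1759948730469)},
]


def span_overlaps(spans):
    """Horizontal overlap (in points) between each span and its successor."""
    return [round(a["bbox"][2] - b["bbox"][0], 3)
            for a, b in zip(spans, spans[1:])]


overlaps = span_overlaps(spans)
line_text = "".join(s["text"] for s in spans)

print(overlaps)   # [6.672]
print(line_text)  # Hello World
```

The overlap of 6.672 points equals 278/1000 of the 24 pt font size, and 278 is Helvetica's standard space width. So my reading of the numbers is that the bbox of 'Hello ' includes a space sized from the font's built-in metrics, while the origin of 'World' follows the width of 0 from the PDF. That is an interpretation of the output, not something confirmed by the MuPDF team.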
But however you view it, this is based on a design decision taken in MuPDF, not in PyMuPDF. MuPDF's CLI tool also produces the following when executing mutool draw -o test.txt "Simple PDF 2.0 file.pdf":
I suggest you join the MuPDF channel on Discord to discuss this with the team there.
In the meantime, I am taking the liberty of converting this post to a Discussions item.