For complex documents, the sequence of text extracted with e

extractText(): non-natural ordering of text output (pymupdf, closed)

pymupdf commented on May 19, 2024

extractText(): non-natural ordering of text output


Comments (3)

JorjMcKie commented on May 19, 2024

Here is an example of how the function fz_print_text_page prints out a PDF page. The dotted rectangles indicate which boxes are (correctly) kept together. The red number in each box shows the printing sequence. As one can see, this sequence is totally illogical:


JorjMcKie commented on May 19, 2024

Feedback on MuPDF's Bugzilla page indicates, in essence, that they consider adjusting the output of fz_print_text_page to be out of scope, mainly because:

the output sequence of this function comes from "the orders that the PDF file performs the marking operations";
fz_print_text_page_xml contains all the information required to achieve a better result.

So we will consider creating our own functionality for improving extractText().
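
As a rough illustration of what such reordering could involve (a minimal sketch, not the approach that was actually implemented here), text blocks can be sorted by their position on the page instead of by the order in which the PDF marks them. The `Block` tuple layout, the `reading_order` helper, and the `line_tol` tolerance are all hypothetical names chosen for this example; coordinates are assumed to have a top-left origin.

```python
from typing import List, Tuple

# Hypothetical block layout: (x0, y0, x1, y1, text), top-left origin.
Block = Tuple[float, float, float, float, str]


def reading_order(blocks: List[Block], line_tol: float = 3.0) -> str:
    """Join block texts top-to-bottom, then left-to-right within a row."""
    # Sort by top edge (then left edge) so blocks can be grouped into rows.
    ordered = sorted(blocks, key=lambda b: (b[1], b[0]))

    rows: List[List[Block]] = []
    for block in ordered:
        # A block joins the current row if its top edge lies within
        # line_tol of the top edge of the row's first block.
        if rows and abs(block[1] - rows[-1][0][1]) <= line_tol:
            rows[-1].append(block)
        else:
            rows.append([block])

    lines: List[str] = []
    for row in rows:
        row.sort(key=lambda b: b[0])  # left-to-right inside each row
        lines.extend(b[4] for b in row)
    return "\n".join(lines)


# Blocks listed in an arbitrary "marking" order, as a PDF might emit them.
blocks = [
    (300.0, 100.0, 500.0, 120.0, "right column, row 1"),
    (50.0, 200.0, 250.0, 220.0, "left column, row 2"),
    (50.0, 101.0, 250.0, 121.0, "left column, row 1"),
    (50.0, 20.0, 500.0, 40.0, "title"),
]
print(reading_order(blocks))
```

This row-wise ordering is only one possible heuristic: it suits table-like layouts, while a page with true multi-column text would need the columns detected first and each column read top to bottom.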


JorjMcKie commented on May 19, 2024

I have modified the TextPage.extractText() method:
It now includes an optional integer parameter basic, which indicates whether fz_print_text_page should be used (basic = True) or not.

If basic = False is specified or defaulted, then fz_print_text_page_xml is invoked, and its result is transformed into a better-ordered plain text string, which is then returned to the caller. The above example then looks like so:

