weird text order (ocrmypdf issue, closed)

OCRmyPDF-issuebot commented on July 19, 2024
weird text order

from ocrmypdf.

Comments (5)

OCRmyPDF-issuebot commented on July 19, 2024

Comment by fritz-hh
Fri Jan 31 19:33:54 2014


I could not reproduce the issue:

  • neither with the PDF viewer embedded in Firefox v26.0
  • nor with Adobe Reader XI.

Is it possible that the issue only occurs with some PDF readers? Which reader do you use?


OCRmyPDF-issuebot commented on July 19, 2024

Comment by femifrak
Fri Jan 31 21:05:11 2014


I use the "Document Viewer 1.2.0" which can be launched with "atril".
When using Foxit Reader the text order is correct.
That is very good news. Thanks a lot for testing.


OCRmyPDF-issuebot commented on July 19, 2024

Comment by femifrak
Sat Feb 1 05:39:14 2014


As the output format is PDF/A, I am worried that some PDF readers cannot handle the text correctly. I see three possible reasons: 1) PDF/A is not well enough defined 2) The reader is really not acceptable 3) The output is not 100% PDF/A
As I want to use OCRmyPDF for archiving files, I am concerned that the text output could become unreadable in the future.


OCRmyPDF-issuebot commented on July 19, 2024

Comment by fritz-hh
Tue Feb 4 21:23:01 2014


Reasons 1 and 3 are not the cause.
OCRmyPDF uses very mature tools (e.g. ghostscript) to generate the PDF/A file. Furthermore, OCRmyPDF validates the generated file against the PDF/A standard using jhove. Therefore the risk is very low that the file won't be readable in the future. (Although I use OCRmyPDF for archiving files too, I cannot / won't provide any warranty...)
As described in #66, because the text layer is made of disconnected words and lines, basic PDF readers can have difficulty selecting the text accurately and reliably. Advanced PDF readers should not have any problem. Searching for words in the PDF should be possible even with basic PDF readers.

If I have time, I will analyze how to improve selection accuracy in some basic PDF readers. Therefore, I am reopening the ticket.
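
To illustrate why disconnected words can come out in a weird order, here is a minimal sketch of how a viewer might rebuild reading order from word positions. It is only an assumption for illustration: it is not how OCRmyPDF, Tesseract, or any particular PDF reader works, and the names `Word`, `reading_order`, and `line_tolerance` are hypothetical.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float       # left edge of the word box
    y: float       # top edge of the word box (larger means lower on the page)
    height: float  # height of the word box


def reading_order(words, line_tolerance=0.5):
    """Group word boxes into lines by vertical position, then sort each line left to right."""
    lines = []
    for word in sorted(words, key=lambda w: w.y):
        # Stay on the current line if the word is vertically close to its first word.
        if lines and abs(word.y - lines[-1][0].y) <= line_tolerance * word.height:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


# Words stored in an arbitrary order, as they might appear in a text layer.
words = [
    Word("world", 60, 10, 12),
    Word("second", 10, 30, 12),
    Word("Hello", 10, 11, 12),
    Word("line", 70, 29, 12),
]
print(reading_order(words))
```

A reader that simply emits words in storage order would give "world second Hello line" here, while one that groups by position recovers the two lines. Real pages (columns, skewed scans, mixed font sizes) need more careful heuristics than this.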


jbarlow83 commented on July 19, 2024

Tesseract's PDF renderer should do a better job of this in Tesseract 3.04.01.
