
Comments (8)

sambitdash commented on June 20, 2024

The font-encoding-to-Unicode mapping used in the PDF has issues. Please share the file so I can investigate.

The encoding CMaps define ranges as lo:hi pairs. For some reason, the mapping in your file has a high value that is less than the low value, hence this assertion error.

https://github.com/sambitdash/Rectangle.jl/blob/54f36a07257b17b8bc8e1f4698aef20df90d632f/src/interval.jl#L5
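
To make the failure mode concrete, here is a minimal Python sketch of an interval type that rejects `lo > hi`. It is only an illustration of the kind of invariant described above, not the Rectangle.jl source; the `Interval` class and the byte values used are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi]; hypothetical stand-in for an interval type."""
    lo: int
    hi: int

    def __post_init__(self) -> None:
        # The invariant: the low bound must not exceed the high bound.
        if self.lo > self.hi:
            raise ValueError(
                f"invalid interval: lo={self.lo:#04x} > hi={self.hi:#04x}"
            )


# A well-formed range is accepted.
ok = Interval(0x00, 0x01)

# A range whose high value is below its low value is rejected.
try:
    Interval(0xFB, 0x08)  # illustrative values only
except ValueError as err:
    print(err)
```

Any range read from a CMap that violates this ordering would trip a check of this kind before extraction can proceed.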

from pdfio.jl.

bdeonovic commented on June 20, 2024

A few comments:

  1. I printed the one page of the PDF that was causing the error (Print to PDF on Windows) so I could post it here as an example. However, when I ran the extract on this one-page example, it worked.
  2. The extract doesn't correctly extract the text. The first sentence should be:

U ovoj je knjizi riječ pretežito o hobitima i iz nje će čitatelj doznati štošta o njihovu
značaju i nešto malo o njihovoj povijesti.

but the extract function seems to have a problem with the accent marks. I get this:

U ovoj je knjizi rijee
zna
Crvene knjige o Zapadnoj pokrajini koji su ve objelodanjeni pod naslovom Hobit.

which doesn't have accent marks and skips a bunch of text.

Thoughts?

test.pdf


sambitdash commented on June 20, 2024

I believe the file has issues related to the font encoding. If I open the file in Adobe Reader, select the text, and copy it, I get exactly the output below, which is close to what you are observing. This happens when the font's ToUnicode CMaps are not carried over properly. Text extraction works on the same principle as copying and pasting text from a PDF file.

1 O hobitima
U ovoj je knjizi rije e
zna
Crvene knjige o Zapadnoj pokrajini koji su ve objelodanjeni pod naslovom Hobit.
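
As a rough illustration of how a ToUnicode CMap turns character codes into text, the Python sketch below applies `bfchar` and `bfrange` entries to two-byte codes. The table values are the ones from the CMap quoted further down in this thread; the function names `to_unicode` and `decode` are hypothetical, and this is not how PDFIO.jl or Adobe Reader is implemented.

```python
# bfchar: single code -> single Unicode code point.
BFCHAR = {0x00FF: 0x0111, 0x0108: 0x0110}

# bfrange: (first code, last code, Unicode code point of the first code).
BFRANGE = [(0x00FB, 0x00FC, 0x0106), (0x00FD, 0x00FE, 0x010C)]


def to_unicode(code: int) -> str:
    """Map one character code to text, or U+FFFD when no entry matches."""
    if code in BFCHAR:
        return chr(BFCHAR[code])
    for first, last, dst in BFRANGE:
        if first <= code <= last:
            # Codes inside a range map to consecutive code points.
            return chr(dst + (code - first))
    return "\ufffd"


def decode(data: bytes) -> str:
    """Decode a string of two-byte, big-endian character codes."""
    codes = (int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    return "".join(to_unicode(c) for c in codes)


print(decode(bytes.fromhex("00fe00fc")))  # prints "čć"
```

When a reader cannot use these entries, the accented letters have no Unicode value to map to, which is consistent with them dropping out of the copied text.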


sambitdash commented on June 20, 2024

I will need to investigate the original file and its CMap to understand why the mapping is not carried over properly. Please share it here, if possible. If there are security concerns, you can mail me at: sambitdash at gmail


bdeonovic commented on June 20, 2024

email sent


sambitdash commented on June 20, 2024

@bdeonovic Sorry for my delay in looking into the file. The CMap in the PDF does not conform to the spec; see Figure 6 in the attached spec.

5014.CIDFont_Spec.pdf

That's the reason some readers behave differently. While I will try to repair the CMap as a special case, this is not the correct approach. Codespace ranges are rectangular regions in the byte plane, not numeric intervals.

/Registry (BKABIP+TT5+0) /Ordering (T42UV) /Supplement 0 >> def
/CMapName /BKABIP+TT5+0 def
/CMapType 2 def
1 begincodespacerange <00fb> <0108> endcodespacerange

2 beginbfchar
<00ff> <0111>
<0108> <0110>
endbfchar
2 beginbfrange
<00fb> <00fc> <0106>
<00fd> <00fe> <010C>
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end

The above is the CMap from the file. Per the CMap spec, the codespace range should have two entries:

2 begincodespacerange 
   <00fb> <00ff>
   <0100> <0108> 
endcodespacerange
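
The point that codespace ranges are rectangles in the byte plane can be illustrated with a small Python sketch. It is not PDFIO.jl code and not necessarily what the workaround does; the names `is_rectangular` and `split_numeric_range` are hypothetical, and the split only handles two-byte codes. It checks each byte position independently and, when a range was evidently written as a plain numeric interval, splits it into rectangular pieces.

```python
from typing import List, Tuple

Range = Tuple[bytes, bytes]


def is_rectangular(lo: bytes, hi: bytes) -> bool:
    """True when every byte position satisfies lo[i] <= hi[i]."""
    return len(lo) == len(hi) and all(l <= h for l, h in zip(lo, hi))


def split_numeric_range(lo: bytes, hi: bytes) -> List[Range]:
    """Split a two-byte range written as a numeric interval into rectangles."""
    if len(lo) != 2 or len(hi) != 2:
        raise ValueError("this sketch only handles two-byte codes")
    if lo > hi:
        raise ValueError("low code is numerically above high code")
    if is_rectangular(lo, hi):
        return [(lo, hi)]  # already valid, leave untouched
    a, b = lo
    c, d = hi
    # First row: from the low code to the end of its first-byte row.
    pieces = [(bytes([a, b]), bytes([a, 0xFF]))]
    # Middle rows, if any, cover the full second-byte span.
    if a + 1 <= c - 1:
        pieces.append((bytes([a + 1, 0x00]), bytes([c - 1, 0xFF])))
    # Last row: from the start of the row up to the high code.
    pieces.append((bytes([c, 0x00]), bytes([c, d])))
    return pieces


lo, hi = bytes.fromhex("00fb"), bytes.fromhex("0108")
print(is_rectangular(lo, hi))  # prints False: second byte fb > 08
for p_lo, p_hi in split_numeric_range(lo, hi):
    print(f"<{p_lo.hex()}> <{p_hi.hex()}>")
```

For the range `<00fb> <0108>`, the second byte runs from `fb` down to `08`, so the per-byte check fails, and the split yields `<00fb> <00ff>` and `<0100> <0108>`, the same two entries shown above.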


sambitdash commented on June 20, 2024

On page 6 of the document you shared, I get:

     U ovoj je knjizi riječ pretežito o hobitima i iz nje će čitatelj doznati štošta o njihovu 
     značaju i nešto malo o njihovoj povijesti. 

This is the output you were expecting. I have introduced a workaround in the code, although the workaround is not behavior defined by the spec.


sambitdash commented on June 20, 2024

9ed161f fixes this now.

