sambitdash / pdfio.jl

PDF Reader Library for Native Julia.

License: Other

Julia 99.51% HTML 0.31% TeX 0.18%

pdf pdf-library pdf-development julia pdf-files text-extraction pdf-document pdf-specification

pdfio.jl's Issues

pdDocGetInfo() crash (PDF without properties)

pdDocGetInfo() crashes when used against a PDF without any properties:

(v1.0) julia> doc = pdDocOpen(filename);
(v1.0) julia> info = pdDocGetInfo(doc)
ERROR: type CosNullType has no field val
Stacktrace:
 [1] getproperty(::Any, ::Symbol) at ./sysimg.jl:18
 [2] get(::PDFIO.Cos.CosNullType) at /home/grzegorz-ubu/Dokumenty/Projekty/Julia/PDFIO.jl/src/CosObject.jl:39
 [3] pdDocGetInfo(::PDFIO.PD.PDDocImpl) at /home/grzegorz-ubu/Dokumenty/Projekty/Julia/PDFIO.jl/src/PDDoc.jl:133
 [4] top-level scope at none:0

The fix seems to be easy. I will send a PR.

Not able to execute any functions on a basic PDF. Error: Found ' (32)' Expected '<' here

Hi there.
I am getting an error when I try to execute getPDFText() or pdDocOpen() or any other function. This is the error:
Found ' (32)' Expected '<' here

And here are the first few lines of the stack trace:
[1] error(::String) at ./error.jl:33
[2] skipv at /Users/tuckercahillchambers/.julia/packages/PDFIO/Miu63/src/BufferParser.jl:25 [inlined]
[3] read_trailer(::IOStream, ::Int64) at /Users/tuckercahillchambers/.julia/packages/PDFIO/Miu63/src/CosDoc.jl:382

I have searched for this error and come up with nothing. Any ideas on where to go from here?
Thank you.

Subsequent Tj or TJ operators without Td operators do not update text matrix

When there are consecutive TJ or Tj operators without any Td or TD operators in between (which update the text matrix from the text line matrix), the computed bounding box for the text run can be wrong.

The Tm text matrix should be updated after every show-text operation.
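
For reference, ISO 32000-1 (section 9.4.4) updates the text matrix after each shown glyph by pre-multiplying it with a translation by the glyph displacement. Below is a minimal sketch of that update, assuming horizontal writing mode and 3x3 matrices in the PDF row-vector convention. The names `translation` and `advance_text_matrix` are hypothetical and not part of PDFIO.

```julia
# Translation matrix in the PDF row-vector convention: [x y 1] * M.
translation(tx, ty) = [1.0 0.0 0.0;
                       0.0 1.0 0.0;
                       tx  ty  1.0]

# After showing text with total horizontal displacement `tx` (text space units),
# the text matrix becomes T(tx, 0) * Tm. The text line matrix is not touched.
advance_text_matrix(Tm, tx) = translation(tx, 0.0) * Tm

# Example text matrix: scale by 2, origin at (10, 20).
Tm = [2.0  0.0  0.0;
      0.0  2.0  0.0;
      10.0 20.0 1.0]

Tm = advance_text_matrix(Tm, 5.0)   # first Tj
Tm = advance_text_matrix(Tm, 3.0)   # second Tj starts where the first one ended
```

With this update in place, the bounding box of the second run would start from the advanced origin rather than from the text line matrix.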

Feature request: add support for reading attachments

This may be low on your priority list, but being able to read PDF attachments would be great. I deal with a lot of PDFs that have xml or excel attachments with the source data used to generate the PDF. There just aren't many tools for dealing with attachments - it seems most people use command line tools.

CDDate(test) == CDDate(test) returns false

While working on a unit test for PR #26, I noticed the following unexpected behavior:

(v1.0) julia> test = "D:20090807192622";

(v1.0) julia> CDDate(test) == CDDate(test)
false

Looking at the code, I see the following 2 problems:

1. The regex for the date:

r"D\s*:\s*(?<dt>\d{12})\s*(?<ut>[+-Z])\s*((?<tzh>\d{2})'\s*(?<tzm>\d{2}))?"

does not strictly conform to the Adobe PDF date spec.
A more correct version would be:

r"D\s*:\s*(?<YYYY>\d{4})(?<MM>\d{2})?(?<DD>\d{2})?(?<HH>\d{2})?(?<mm>\d{2})?(?<SS>\d{2})?\s*((?<ut>[-+Z])\s*(?<tzh>\d{2}))?(\s*'\s*(?<tzm>\d{2}))?\s*"

2. == for CDDate falls back to identity comparison. I would like to implement:

(==)(x::T, y::T) where {T<: CDDate}

I'm working on a fix. Let me know if you welcome a PR, and whether it is better to do a separate PR or to combine it with the corrected PR #26.
Thanks, Best Regards, GW
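
Both points can be illustrated with a small sketch. `DemoDate` is a hypothetical stand-in for CDDate and is assumed to be a mutable struct, which is the case where Julia's default `==` falls back to `===` and two separately constructed objects compare unequal. The regex is the one proposed in this issue, applied to the same test string.

```julia
# Hypothetical stand-in for CDDate; assumed mutable, so `==` falls back to `===`.
mutable struct DemoDate
    text::String
end

# Default behavior: identity comparison of two distinct objects.
before = DemoDate("D:20090807192622") == DemoDate("D:20090807192622")

# Field-wise equality; `hash` is kept consistent so the type also works as a Dict key.
Base.:(==)(x::DemoDate, y::DemoDate) = x.text == y.text
Base.hash(x::DemoDate, h::UInt) = hash(x.text, h)

after = DemoDate("D:20090807192622") == DemoDate("D:20090807192622")

# The regex proposed in this issue: every field after the year is optional.
const DATE_RE = r"D\s*:\s*(?<YYYY>\d{4})(?<MM>\d{2})?(?<DD>\d{2})?(?<HH>\d{2})?(?<mm>\d{2})?(?<SS>\d{2})?\s*((?<ut>[-+Z])\s*(?<tzh>\d{2}))?(\s*'\s*(?<tzm>\d{2}))?\s*"

m = match(DATE_RE, "D:20090807192622")
year, second, offset = m["YYYY"], m["SS"], m["ut"]   # unmatched groups are `nothing`
```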

ASCIIHexDecode should be vectorized

The conversion is relatively simple. Hence, it should be made a vector operation rather than a byte-by-byte read.
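
A minimal sketch of what a whole-buffer decode could look like, assuming the encoded stream data is already in memory as a byte vector. `ascii_hex_decode` and `PDF_WHITESPACE` are hypothetical names, not PDFIO's implementation. The rules followed are those of the ASCIIHexDecode filter: whitespace is ignored, `>` marks end of data, and a trailing odd digit is treated as if followed by `0`.

```julia
# PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE.
const PDF_WHITESPACE = (0x00, 0x09, 0x0a, 0x0c, 0x0d, 0x20)

function ascii_hex_decode(data::AbstractVector{UInt8})
    # Everything from the end-of-data marker '>' onward is ignored.
    eod = findfirst(isequal(UInt8('>')), data)
    stop = eod === nothing ? length(data) : eod - 1
    # Drop whitespace in one pass over the buffer.
    hex = UInt8[b for b in view(data, 1:stop) if !(b in PDF_WHITESPACE)]
    # An odd number of digits means the last digit is followed by an implicit 0.
    isodd(length(hex)) && push!(hex, UInt8('0'))
    # Base.hex2bytes converts the whole buffer at once.
    return hex2bytes(hex)
end

decoded = ascii_hex_decode(Vector{UInt8}("48 65 6C\n6c 6F>ignored"))
```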

Improve the performance `pdPageExtractText`

pdPageExtractText is one of the core APIs of PDFIO. However, a large number of small allocations makes it a bit slow. This code needs to be refactored to improve text extraction speed further.

Any inputs, proposals and PRs in this direction will be highly appreciated.

Unable to read in PDF

I am attempting to read in this PDF. Unfortunately, the code seems stuck on the first page. Any thoughts on why this is? I was able to run this code on other PDFs.

using PDFIO

fname = "16-969_o7jp.pdf"
doc = pdDocOpen(fname)

open("tmp.txt", "w") do io
    page = pdDocGetPage(doc, 1)
    pdPageExtractText(io, page)
end

pdDocClose(doc)

I've also tried other pages of the PDF and see similar results: Julia keeps running (indefinitely), but I see no error messages, and nothing is written to the file.

Validate the document for tagged PDF.

Tagged PDF has important properties that can help in good text and graphics extraction for usage elsewhere. Hence, it's important to extract such information from PDFs.

`pdDocGetInfo` not handling `CosName` properly.

The text width of text runs may be wrongly computed.

PDFIO.jl/src/PDFonts.jl

Line 356 in 6bde9ae

w = (w - tj)*tfs / 1000.0 + ((c == SPACE_CODE(widths)) ? tc : tw)

The tw and tc must be switched.

PDFIO.jl/src/PDFonts.jl

Line 380 in 6bde9ae

totalw += get_string_width(barr, pdfont.widths, prev_char, tfs, tj, tc, tw)

tj = 0 should be there after this line to ensure a text shift if there is a number in the TJ block.
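
For comparison, ISO 32000-1 (section 9.4.4) gives the horizontal glyph displacement as tx = ((w0 − Tj/1000) × Tfs + Tc + Tw) × Th, where Tc applies to every glyph and Tw only to the single-byte space code 32. The sketch below follows that formula with widths in thousandths, as in the line quoted above. `glyph_advance` and `tj_run_width` are hypothetical names, and resetting `tj` after the first glyph is one reading of the fix suggested here, not the actual PDFIO code.

```julia
# Displacement of one glyph; `w` and `tj` are in thousandths of text space units.
function glyph_advance(w, tj, tfs, tc, tw, is_space::Bool; th=1.0)
    spacing = tc + (is_space ? tw : 0.0)   # Tc always, Tw only for the space code
    return ((w - tj) * tfs / 1000.0 + spacing) * th
end

# Total width of a TJ array whose items are strings or numeric adjustments.
function tj_run_width(items, widths, tfs, tc, tw)
    total = 0.0
    tj = 0.0
    for item in items
        if item isa Real
            tj = item                      # applies to the next glyph only
        else
            for c in codeunits(item)
                total += glyph_advance(widths[c], tj, tfs, tc, tw, c == 0x20)
                tj = 0.0                   # consumed: do not shift later glyphs
            end
        end
    end
    return total
end

widths = Dict(UInt8('A') => 500, 0x20 => 250)
total = tj_run_width(Any["A", -100, "A A"], widths, 10.0, 0.0, 1.0)
```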

Outlines from PDF documents should be extracted

PDF document outlines can be extracted from 3 distinct sources:

1. PDF bookmarks, which show up in Adobe Reader as the TOC.
2. PDF structure from marked content in tagged PDFs.
3. Document structure analysis by learning or heuristics.

The scope of PDFIO is only 1 and 2. Item 3 can be created as a separate module over PDFIO to address knowledge-oriented problems. Eventually, text extraction APIs should move into the new module.

Implement Filespec properly to address the EFF attribute of the security handler

The crypto code decrypts the streams by recursively accessing the indirect objects. For external files it may not be easy to determine whether a file stream is an embedded file from the attributes of the extent dictionary of the stream, as all the keys are more or less optional. So Filespecs should be implemented properly to identify the cases where the EFF flag has to be used.

Move all the test files to the PDFTest repository.

PDFIO has MIT licensing. However, some of the test files may carry other licenses that are not safe to ship with PDFIO. The test files will be kept separate from PDFIO, to be downloaded on demand for test purposes only.

Ensure sanity on PDF 2.0 file samples

https://github.com/pdf-association/pdf20examples has the files.

Full PostScript parser / execution engine for font files and CMap reading

This may also be picked up from a PostScript renderer like the Cairo project. Currently, Cairo.jl does not expose such low-level APIs.

Make CosXRefStream a special object type

Currently they are merely CosStreams with a type.

Tests fail

(v1.1) pkg> test PDFIO
   Testing PDFIO
 Resolving package versions...
    Status `/var/folders/t6/ddh10c6j5r54sg19jlc59n580000gn/T/tmpCjMIM2/Manifest.toml`
  [1520ce14] AbstractTrees v0.2.1
  [715cd884] AdobeGlyphList v0.1.1
  [9e28174c] BinDeps v0.8.10
  [b99e7846] BinaryProvider v0.5.4
  [e1450e63] BufferedStreams v1.0.0
  [34da2185] Compat v2.1.0
  [ffbed154] DocStringExtensions v0.7.0
  [e30172f5] Documenter v0.22.4
  [0862f596] HTTPClient v0.2.1
  [682c06a0] JSON v0.20.0
  [2e475f56] LabelNumerals v0.1.0
  [b27032c2] LibCURL v0.5.0
  [522f3ed2] LibExpat v0.5.0
  [2ec943e9] Libz v1.0.0
  [4d0d745f] PDFIO v0.1.3
  [27ebfcd6] Primes v0.4.0
  [9a9db56c] Rectangle v0.1.1
  [37834d88] RomanNumerals v0.3.1
  [30578b45] URIParser v0.4.0
  [c17dfb99] WinRPM v0.4.2
  [a5390f91] ZipFile v0.8.1
  [2a0f44e3] Base64  [`@stdlib/Base64`]
  [ade2ca70] Dates  [`@stdlib/Dates`]
  [8bb1440f] DelimitedFiles  [`@stdlib/DelimitedFiles`]
  [8ba89e20] Distributed  [`@stdlib/Distributed`]
  [b77e0a4c] InteractiveUtils  [`@stdlib/InteractiveUtils`]
  [76f85450] LibGit2  [`@stdlib/LibGit2`]
  [8f399da3] Libdl  [`@stdlib/Libdl`]
  [37e2e46d] LinearAlgebra  [`@stdlib/LinearAlgebra`]
  [56ddb016] Logging  [`@stdlib/Logging`]
  [d6f4376e] Markdown  [`@stdlib/Markdown`]
  [a63ad114] Mmap  [`@stdlib/Mmap`]
  [44cfe95a] Pkg  [`@stdlib/Pkg`]
  [de0858da] Printf  [`@stdlib/Printf`]
  [3fa0cd96] REPL  [`@stdlib/REPL`]
  [9a3f8284] Random  [`@stdlib/Random`]
  [ea8e919c] SHA  [`@stdlib/SHA`]
  [9e88b42a] Serialization  [`@stdlib/Serialization`]
  [1a1011a3] SharedArrays  [`@stdlib/SharedArrays`]
  [6462fe0b] Sockets  [`@stdlib/Sockets`]
  [2f01184e] SparseArrays  [`@stdlib/SparseArrays`]
  [10745b16] Statistics  [`@stdlib/Statistics`]
  [8dfed614] Test  [`@stdlib/Test`]
  [cf7118a7] UUIDs  [`@stdlib/UUIDs`]
  [4ec0a83e] Unicode  [`@stdlib/Unicode`]
ERROR: LoadError: LoadError: could not open file /Users/malmaud/.julia/packages/ZipFile/YHTbb/deps/deps.jl
Stacktrace:
 [1] include at ./boot.jl:326 [inlined]
 [2] include_relative(::Module, ::String) at ./loading.jl:1038
 [3] include at ./sysimg.jl:29 [inlined]
 [4] include(::String) at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/Zlib.jl:26
 [5] top-level scope at none:0
 [6] include at ./boot.jl:326 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1038
 [8] include at ./sysimg.jl:29 [inlined]
 [9] include(::String) at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/ZipFile.jl:36
 [10] top-level scope at none:0
 [11] include at ./boot.jl:326 [inlined]
 [12] include_relative(::Module, ::String) at ./loading.jl:1038
 [13] include(::Module, ::String) at ./sysimg.jl:29
 [14] top-level scope at none:2
 [15] eval at ./boot.jl:328 [inlined]
 [16] eval(::Expr) at ./client.jl:404
 [17] top-level scope at ./none:3
in expression starting at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/Zlib.jl:50
in expression starting at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/ZipFile.jl:43
ERROR: LoadError: Failed to precompile ZipFile [a5390f91-8eb1-5f08-bee0-b1d1ffed6cea] to /Users/malmaud/.julia/compiled/v1.1/ZipFile/cOum2.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1197
 [3] _require(::Base.PkgId) at ./loading.jl:960
 [4] require(::Base.PkgId) at ./loading.jl:858
 [5] require(::Module, ::Symbol) at ./loading.jl:853
 [6] include at ./boot.jl:326 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1038
 [8] include(::Module, ::String) at ./sysimg.jl:29
 [9] include(::String) at ./client.jl:403
 [10] top-level scope at none:0
in expression starting at /Users/malmaud/.julia/packages/PDFIO/28lLV/test/runtests.jl:6
ERROR: Package PDFIO errored during testing

Extracting text with a specific font with a rectangular region as selection area

It is common to use different fonts to denote semantic meaning (e.g., italics for emphasis or a larger font size for section titles). Is it possible to extract text that is in a specific font and size? Also, is it possible to specify a region to extract from? I would like to be able to, for example, extract all the italic text inside a region.

CCITT and JBIG2 decoding

These filters are supported in PDF, so they may be needed by some niche markets.

Package Registration

@JuliaRegistrator register()

Extract text content from the PDF

Extract text content from PDF. Here are some of the high level use cases.

Pure text-based documents should be easily converted to standard text formats like plain text, Word documents, etc.
Document structure existing in a PDF document should be preserved as much as possible. Being a PDL, PDF text rendering does not depend on the content order. Hence, any marked information in the document should be preserved.
Ideally, text for reading vs. text used for clip paths or artwork should be distinguishable.

Validation of digital signatures in the PDF documents

As PDFs are part of many workflows, digital signatures are becoming the norm for signing those workflow transactions. Validation of such signatures will definitely benefit the workflows.

pdDocGetInfo() crash (PDF with empty properties)

pdDocGetInfo() crashes when used against PDF with empty properties:

(v1.0) julia> doc = pdDocOpen(filename);
(v1.0) julia> info = pdDocGetInfo(doc)
ERROR: BoundsError: attempt to access 0-element Array{UInt8,1} at index [1:4]
Stacktrace:
 [1] throw_boundserror(::Array{UInt8,1}, ::Tuple{UnitRange{Int64}}) at ./abstractarray.jl:484
 [2] checkbounds at ./abstractarray.jl:449 [inlined]
 [3] getindex at ./array.jl:737 [inlined]
 [4] convert(::Type{String}, ::PDFIO.Cos.CosXString) at /home/grzegorz-neo/.julia/packages/PDFIO/DyeYY/src/CosObjectHelpers.jl:11
 [5] String(::PDFIO.Cos.CosXString) at /home/grzegorz-neo/.julia/packages/PDFIO/DyeYY/src/CosObjectHelpers.jl:34
 [6] pdDocGetInfo(::PDFIO.PD.PDDocImpl) at /home/grzegorz-neo/.julia/packages/PDFIO/DyeYY/src/PDDoc.jl:135
 [7] top-level scope at none:0

Attached affected pdf file (it is no longer available on-line).
ALM-2009-Aug.pdf

Error tagging new release

The REQUIRE file could not be found.
cc: @sambitdash

Table picker for PDF

Natural tabular objects in a PDF document should ideally be picked up for extraction.

The intent of the project is API development, hence it will be headless for the most part. Unlike a reader, there may not be a WYSIWYG picker available. A heuristic table picker should scan the document for table-like structures and dump them in tabular HTML/CSS format or as extracted image objects. In case document tagging is enabled, the table picker can use the tagged text.

pdPageExtractText() crash on file created using LaTeX

pdPageExtractText() raises the following error when used on a file created by LaTeX:

"/home/grzegorz-neo/Dokumenty/Projekty/MatFiz/pdfio-test/outline.pdf"
(v1.0) julia> doc = pdDocOpen(filename);
(v1.0) julia> item_pg = pdDocGetPage(doc, 3);
(v1.0) julia> buf = IOBuffer();
(v1.0) julia> pdPageExtractText(buf, item_pg)
ERROR: InexactError: Int64(Int64, 312.5)
Stacktrace:
 [1] Type at ./float.jl:700 [inlined]
 [2] convert at ./number.jl:7 [inlined]
 [3] setindex!(::Array{Int64,1}, ::Float32, ::Int64) at ./array.jl:769
 [4] get_font_widths(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/grzegorz-ubu/Dokumenty/Projekty/Julia/PDFIO.jl/src/PDFontMetrics.jl:164
....

Changing
d[i+1] = widths[ix] into d[i+1] = round(Int,widths[ix]) in PDFontMetrics.jl fixes this issue.
The fix is included in the PR with the implementation for Outlines.
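
The failure mode and the fix can be reproduced outside PDFIO with plain arrays. This is only an illustration of the conversion rule; the variable names mirror the snippet above, but the data is made up.

```julia
widths = Float32[250.0, 312.5]     # made-up font widths, one of them fractional
d = zeros(Int, length(widths))

d[1] = widths[1]                   # fine: 250.0 converts to Int exactly

# Assigning a fractional Float32 into an Int array calls convert(Int, x),
# which throws InexactError instead of silently truncating.
threw = try
    d[2] = widths[2]
    false
catch err
    err isa InexactError
end

d[2] = round(Int, widths[2])       # explicit rounding (nearest, ties to even)
```

Note that `round(Int, x)` uses round-half-to-even by default, so 312.5 becomes 312. If a different rule is wanted, a rounding mode such as `RoundNearestTiesAway` can be passed.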

Extract document structure from the PDF document

This may not be very accurate but a good way to start understanding the document. The creators do not always provide the final reader intent of the document.

Google Docs PDF fails at pdPageExtractText

Trying to extract text from a simple Google Docs PDF,

julia> pdPageExtractText(stdout, pdDocGetPage(pdDocOpen("Downloads/GoogleDocs.pdf"), 1))

fails with:

ERROR: MethodError: no method matching setindex!(::Rectangle.RBTree{Rectangle.IntervalKey{UInt16},Int64}, ::Float32, ::Rectangle.Interval{UInt16})
Closest candidates are:
  setindex!(::Rectangle.RBTree{Rectangle.IntervalKey{K},V}, ::V, ::Rectangle.Interval{K}) where {K, V} at /home/jarvist/.julia/packages/Rectangle/SnGUM/src/interval.jl:117
Stacktrace:
 [1] get_cid_font_widths(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDFontMetrics.jl:204
 [2] get_font_widths(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDFontMetrics.jl:164
 [3] PDFont at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDFonts.jl:391 [inlined]
 [4] get_pd_font!(::PDFIO.PD.PDDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDDocImpl.jl:112
 [5] get_font(::PDFIO.PD.PDPageImpl, ::CosName) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:320
 [6] evalContent!(::PDPageElement{:Tf}, ::PDFIO.PD.GState{:PDFIO}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:735
 [7] evalContent! at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:637 [inlined]
 [8] evalContent!(::PDPageTextObject, ::PDFIO.PD.GState{:PDFIO}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:680
 [9] evalContent! at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:637 [inlined]
 [10] pdPageEvalContent(::PDFIO.PD.PDPageImpl, ::PDFIO.PD.GState{:PDFIO}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:146
 [11] pdPageEvalContent at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:145 [inlined]
 [12] pdPageExtractText(::Base.TTY, ::PDFIO.PD.PDPageImpl) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:179
 [13] top-level scope at REPL[30]:1

Better support for T3 fonts

Some PDF files use T3 fonts that do not have an embedded ToUnicode mapping. Text in such fonts cannot be extracted from the document effectively. In such cases, OCR might be useful: an OCR library like Tesseract could help with extracting the text. It has to be made sure that any library used does not violate the MIT Expat License of PDFIO.

Normalize with SASLPrep for PDF passwords

SASLPrep can be implemented using the Unicode consortium supplied libraries: http://site.icu-project.org/ but I guess this may be an unnecessary added dependency.

Enhancement request has been raised to include the feature in Julia: JuliaLang/julia#32503

Expose FontDescription flags from PDFonts

It is often necessary to know whether a font is bold, italic, fixed-width, all-caps, small-caps, etc. Ideally, these should be captured in TextLayout for subsequent processing.
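
As a rough sketch of what decoding could look like: the font descriptor's Flags entry is an integer bit field, and the bit positions below are the ones listed in ISO 32000-1 (font descriptor flags table). `FONT_FLAGS`, `has_flag` and `decode_font_flags` are hypothetical names, not an existing PDFIO API.

```julia
# Bit positions (1-based, lowest bit first) of the font descriptor flags.
const FONT_FLAGS = (FixedPitch=1, Serif=2, Symbolic=3, Script=4,
                    Nonsymbolic=6, Italic=7, AllCap=17, SmallCap=18, ForceBold=19)

# True when the 1-based bit `bit` is set in `flags`.
has_flag(flags::Integer, bit::Integer) = (flags >> (bit - 1)) & 1 == 1

# Names of all flags that are set, in table order.
decode_font_flags(flags::Integer) =
    [name for (name, bit) in pairs(FONT_FLAGS) if has_flag(flags, bit)]

flags = decode_font_flags(98)   # 98 = 2 + 32 + 64
```

Here 98 decodes to a serif, nonsymbolic, italic font.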

Writing/modifying pdfs

Is it possible to modify the parsed pdf and write it to a file? Specifically I'm interested in the ideas from here: open-source-ideas/ideas#46. Julia has excellent support for neural networks, so it would be interesting to experiment with something like this.

`pdPageExtractText` should support multi-column documents

This implementation may need to be reviewed along with #2. Although there may not be an exact overlap, in some cases the implementation logic can be similar.

Extracting boxes

I have a pdf with important points drawn in boxes using path commands. For example:

Q
538.02 6098.07 3316.68 4.14063 re
f
538.02 5395.17 4.14063 705.059 re
f
3850.56 5395.17 4.14063 705.059 re
f
538.02 5393.19 3316.68 4.13672 re
f
q

How can I extract these?
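
One possible starting point, sketched under the assumption that the page's decoded content stream is already available as a string (how to obtain it from PDFIO is not covered here). The sketch scans for `x y w h re` operators with a regular expression. `Rect` and `find_rectangles` are hypothetical names, and the coordinates are returned as written, without applying the current transformation matrix.

```julia
struct Rect
    x::Float64
    y::Float64
    w::Float64
    h::Float64
end

function find_rectangles(content::AbstractString)
    num = raw"[-+]?(?:\d+\.?\d*|\.\d+)"          # a PDF real or integer
    pat = Regex("($num)\\s+($num)\\s+($num)\\s+($num)\\s+re\\b")
    rects = Rect[]
    for m in eachmatch(pat, content)
        push!(rects, Rect(parse.(Float64, m.captures)...))
    end
    return rects
end

content = """
Q
538.02 6098.07 3316.68 4.14063 re
f
538.02 5395.17 4.14063 705.059 re
f
3850.56 5395.17 4.14063 705.059 re
f
538.02 5393.19 3316.68 4.13672 re
f
q
"""

rects = find_rectangles(content)
```

The four thin filled rectangles in this example are the top, left, right and bottom edges of one box, so a further step would be to group rectangles whose edges meet.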

Honor text spacing btw text objects in TJ operator

The numeric spacing adjustments in TJ operators can be used to simulate word spacing. Such spacing should be supported in the text extractor.
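
A minimal sketch of one heuristic, assuming the TJ array has already been parsed into strings and numbers. In a TJ array a number is subtracted from the current horizontal position (in thousandths of text space units), so a large negative value opens a gap. `tj_to_text` and the `threshold` default are hypothetical; a real extractor would likely derive the threshold from the font's space width.

```julia
function tj_to_text(items; threshold=200)
    io = IOBuffer()
    for item in items
        if item isa AbstractString
            print(io, item)
        elseif item isa Real && -item > threshold
            # A large negative adjustment moves the next glyph right: treat as a space.
            print(io, ' ')
        end
        # Small adjustments are kerning and are ignored.
    end
    return String(take!(io))
end

text = tj_to_text(Any["Hello", -300, "Wor", 20, "ld", -15, "!"])
```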

Should support citations where provided as superscript characters in text extraction

The current algorithm may treat a citation superscript as a separate line appearing above the line where the superscript is used. This may change the layout and the placement of the characters extracted from the PDF document.

Request for examples in the documentation

I am struggling to figure out how to use this library to read a pdf as text for the purpose of Natural Language Processing as an alternative to

using Taro
Taro.init()
meta, txtdata = Taro.extract(files[1]);

as shown in
https://github.com/aviks/nlp-workshop/blob/master/NLP-in-julia.ipynb

Or can I not use this library instead of Taro (which I cannot compile on Julia 1.0.2)?

Text extraction for Type 0 fonts with ToUnicode CMaps should be supported.

Need a `pdDocHasPageLabels()` API

May need this additional API to enhance some tasks perceived in #39.

pdDocOpen() crash (PDF created by pdfLatex)

pdDocOpen() crashes with the following error:

ArgumentError: extra characters after whitespace in "1502\n6"

when called against a file created by LaTeX (attached).
LaTeX version: pdfTeX 3.14159265-2.6-1.40.16 (TeX Live 2015/Debian) (KDE Neon / Ubuntu 16.04 based)
I will submit a PR with a fix soon.
outline.pdf

Move node traversal to AbstractTree or similar interface

AbstractTree provides standard BFS and DFS interfaces. These can help later to apply more esoteric node traversal functions.
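
To make the two traversal orders concrete, here is a self-contained sketch in plain Julia that does not use the AbstractTrees package. `Node`, `dfs` and `bfs` are hypothetical names for illustration, and the tree stands in for something like a page tree or an outline.

```julia
struct Node
    name::String
    children::Vector{Node}
end
Node(name) = Node(name, Node[])    # convenience constructor for leaves

# Depth-first, pre-order: visit a node, then each subtree left to right.
function dfs(root::Node)
    out = String[]
    stack = [root]
    while !isempty(stack)
        n = pop!(stack)
        push!(out, n.name)
        append!(stack, reverse(n.children))   # leftmost child ends up on top
    end
    return out
end

# Breadth-first: visit all nodes of one level before the next.
function bfs(root::Node)
    out = String[]
    queue = [root]
    while !isempty(queue)
        n = popfirst!(queue)
        push!(out, n.name)
        append!(queue, n.children)
    end
    return out
end

tree = Node("a", [Node("b", [Node("d")]), Node("c")])
```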

build fails

Following the procedure for building the package on macOS 10.15, I get the following error:

(base)
in ~ vlad🅒 base
 mkdir test && cd test
(base)
in ~/test vlad🅒 base
 julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.2.0 (2019-08-20)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |

(v1.2) pkg> activate .
Activating new environment at `~/test/Project.toml`

(test) pkg> add PDFIO
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Resolving package versions...
  Updating `~/test/Project.toml`
  [4d0d745f] + PDFIO v0.1.8
  Updating `~/test/Manifest.toml`
  [1520ce14] + AbstractTrees v0.2.1
  [715cd884] + AdobeGlyphList v0.1.1
  [9e28174c] + BinDeps v0.8.10
  [34da2185] + Compat v2.2.0
  [2e475f56] + LabelNumerals v0.1.0
  [4d0d745f] + PDFIO v0.1.8
  [27ebfcd6] + Primes v0.4.0
  [9a9db56c] + Rectangle v0.1.2
  [37834d88] + RomanNumerals v0.3.1
  [30578b45] + URIParser v0.4.0
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [8bb1440f] + DelimitedFiles
  [8ba89e20] + Distributed
  [b77e0a4c] + InteractiveUtils
  [76f85450] + LibGit2
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [a63ad114] + Mmap
  [44cfe95a] + Pkg
  [de0858da] + Printf
  [3fa0cd96] + REPL
  [9a3f8284] + Random
  [ea8e919c] + SHA
  [9e88b42a] + Serialization
  [1a1011a3] + SharedArrays
  [6462fe0b] + Sockets
  [2f01184e] + SparseArrays
  [10745b16] + Statistics
  [8dfed614] + Test
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode

(test) pkg> test PDFIO
   Testing PDFIO
 Resolving package versions...
    Status `/var/folders/k3/hy22jxt17xd4hggsb2fqhs4m0000gn/T/jl_OC5Fsf/Manifest.toml`
  [1520ce14] AbstractTrees v0.2.1
  [715cd884] AdobeGlyphList v0.1.1
  [9e28174c] BinDeps v0.8.10
  [b99e7846] BinaryProvider v0.5.8
  [34da2185] Compat v2.2.0
  [2e475f56] LabelNumerals v0.1.0
  [4d0d745f] PDFIO v0.1.8
  [27ebfcd6] Primes v0.4.0
  [9a9db56c] Rectangle v0.1.2
  [37834d88] RomanNumerals v0.3.1
  [30578b45] URIParser v0.4.0
  [a5390f91] ZipFile v0.8.3
  [2a0f44e3] Base64  [`@stdlib/Base64`]
  [ade2ca70] Dates  [`@stdlib/Dates`]
  [8bb1440f] DelimitedFiles  [`@stdlib/DelimitedFiles`]
  [8ba89e20] Distributed  [`@stdlib/Distributed`]
  [b77e0a4c] InteractiveUtils  [`@stdlib/InteractiveUtils`]
  [76f85450] LibGit2  [`@stdlib/LibGit2`]
  [8f399da3] Libdl  [`@stdlib/Libdl`]
  [37e2e46d] LinearAlgebra  [`@stdlib/LinearAlgebra`]
  [56ddb016] Logging  [`@stdlib/Logging`]
  [d6f4376e] Markdown  [`@stdlib/Markdown`]
  [a63ad114] Mmap  [`@stdlib/Mmap`]
  [44cfe95a] Pkg  [`@stdlib/Pkg`]
  [de0858da] Printf  [`@stdlib/Printf`]
  [3fa0cd96] REPL  [`@stdlib/REPL`]
  [9a3f8284] Random  [`@stdlib/Random`]
  [ea8e919c] SHA  [`@stdlib/SHA`]
  [9e88b42a] Serialization  [`@stdlib/Serialization`]
  [1a1011a3] SharedArrays  [`@stdlib/SharedArrays`]
  [6462fe0b] Sockets  [`@stdlib/Sockets`]
  [2f01184e] SparseArrays  [`@stdlib/SparseArrays`]
  [10745b16] Statistics  [`@stdlib/Statistics`]
  [8dfed614] Test  [`@stdlib/Test`]
  [cf7118a7] UUIDs  [`@stdlib/UUIDs`]
  [4ec0a83e] Unicode  [`@stdlib/Unicode`]
ERROR: LoadError: LoadError: LoadError: PDFIO not properly installed. Please run Pkg.build("PDFIO")
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] top-level scope at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/LibCrypto.jl:1
 [3] include at ./boot.jl:328 [inlined]
 [4] include_relative(::Module, ::String) at ./loading.jl:1094
 [5] include at ./Base.jl:31 [inlined]
 [6] include(::String) at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/Common.jl:1
 [7] top-level scope at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/Common.jl:8
 [8] include at ./boot.jl:328 [inlined]
 [9] include_relative(::Module, ::String) at ./loading.jl:1094
 [10] include at ./Base.jl:31 [inlined]
 [11] include(::String) at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/PDFIO.jl:3
 [12] top-level scope at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/PDFIO.jl:5
 [13] include at ./boot.jl:328 [inlined]
 [14] include_relative(::Module, ::String) at ./loading.jl:1094
 [15] include(::Module, ::String) at ./Base.jl:31
 [16] top-level scope at none:2
 [17] eval at ./boot.jl:330 [inlined]
 [18] eval(::Expr) at ./client.jl:432
 [19] top-level scope at ./none:3
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/LibCrypto.jl:1
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/Common.jl:8
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/PDFIO.jl:5
ERROR: LoadError: Failed to precompile PDFIO [4d0d745f-9d9a-592e-8d18-1ad8a0f42b92] to /Users/vlad/.julia/compiled/v1.2/PDFIO/cmOJE.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1253
 [3] _require(::Base.PkgId) at ./loading.jl:1013
 [4] require(::Base.PkgId) at ./loading.jl:911
 [5] require(::Module, ::Symbol) at ./loading.jl:906
 [6] include at ./boot.jl:328 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1094
 [8] include(::Module, ::String) at ./Base.jl:31
 [9] include(::String) at ./client.jl:431
 [10] top-level scope at none:5
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/test/runtests.jl:2
ERROR: Package PDFIO errored during testing

(test) pkg>

and this is what happens when I am trying to build the package:

(test) pkg> build PDFIO
  Building PDFIO → `~/.julia/packages/PDFIO/LF83Q/deps/build.log`
┌ Error: Error building `PDFIO`:
│
│ signal (6): Abort trap: 6
│ in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/deps/build.jl:76
│ __pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
│ Allocations: 10469954 (Pool: 10467735; Big: 2219); GC: 21
└ @ Pkg.Operations ~/julia/usr/share/julia/stdlib/v1.2/Pkg/src/backwards_compatible_isolation.jl:647

(test) pkg>

Does this error look familiar to anyone?

Support text extraction for Symbol fonts

At least Symbol and ZapfDingbats should be supported

Support Forms XObjects

A Form XObject is PDF content embedded as a whole in the content of a PDF page. Such XObjects can also contain text and hence may be relevant to text extraction.

Support for JPEG filter

Content filter for JPEG and JPEG2000 should be supported.

Since these are special-purpose filters, it should be reviewed whether to decode them or to stream the data directly into the graphics channel for rendering.

Secured PDF document with X509 certificates

Ability to open and honor security enabled PDFs. The standard security handler is implemented already but PKI based handler needs to be implemented.

Move the Zlib and OpenSSL dependency to JuliaBinaryWrappers

JuliaBinaryWrappers has pre-built binaries of Zlib and OpenSSL. Instead of building them, it may be ideal to pick them up from the pre-built binaries. That way the unnecessary build time can be reduced, and using the same binaries everywhere gives a consistent test experience. However, the minimum Julia release then has to be 1.3.

Once Julia 1.3 is GA, this can be taken up.

`cosDocGetPageNumbers` crashes when there is no `PageLabels` in PDF catalog

Hi. While working on the Outline support implementation, I got the following error:

        @test begin
            filename="files/1.pdf"
            DEBUG && println(filename)
            doc = pdDocOpen(filename)
            @assert length(pdDocGetPageRange(doc, "1")) >= 1
            pdDocClose(doc)
            length(utilPrintOpenFiles()) == 0
        end

Fails with:
MethodError: no method matching get(::CosNullType, ::CosName)

I think pdDocGetPageRange should fall back to parsing the label as a number and return the appropriate page if there is no labels dictionary in the PDF, or at least return an empty vector.

I have a fix ready, so I would like to submit a PR.
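
The fallback I have in mind looks roughly like the sketch below. It is a standalone illustration, not the PDFIO code: `pages_for_label` is a hypothetical helper, and the labels dictionary is modeled as a plain `Dict` (or `nothing` when the catalog has no `PageLabels`).

```julia
# Returns the page numbers matching `label`, or an empty vector if none match.
function pages_for_label(label::AbstractString,
                         labels::Union{Nothing,Dict{String,Int}},
                         npages::Int)
    if labels !== nothing
        # Labels are defined: look the label up directly.
        return haskey(labels, label) ? [labels[label]] : Int[]
    end
    # No labels dictionary: interpret the label as a plain page number.
    n = tryparse(Int, label)
    return (n !== nothing && 1 <= n <= npages) ? [n] : Int[]
end

no_labels = pages_for_label("1", nothing, 10)
with_labels = pages_for_label("iv", Dict("iv" => 4), 10)
```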

tocPDF

I have created a repository whose goal is to auto-generate bookmarks from the table of contents already available at the beginning of PDF files.
https://github.com/aminya/tocPDF

For now, I plan to start by using available software (e.g., k2pdfoptdoes), and then later make the functionality Julia native (when you add PDF write capability).

Current algorithm plan: https://github.com/aminya/tocPDF#automated

I looked at the PDFIO docs. However, they are long and cover many functions. Could you help me get started with PDFIO?

If anyone is interested in participating, that would be awesome. (@kskyten @sambitdash )
