sambitdash / pdfio.jl

PDF Reader Library for Native Julia.

License: Other

Julia 99.51% HTML 0.31% TeX 0.18%
pdf pdf-library pdf-development julia pdf-files text-extraction pdf-document pdf-specification

pdfio.jl's People

Contributors

alexhanna, fredrikekre, gwierzchowski, juliatagbot, mkitti, sambitdash

pdfio.jl's Issues

pdDocGetInfo() crash (PDF without properties)

pdDocGetInfo() crashes when used against PDF without any properties:

(v1.0) julia> doc = pdDocOpen(filename);
(v1.0) julia> info = pdDocGetInfo(doc)
ERROR: type CosNullType has no field val
Stacktrace:
 [1] getproperty(::Any, ::Symbol) at ./sysimg.jl:18
 [2] get(::PDFIO.Cos.CosNullType) at /home/grzegorz-ubu/Dokumenty/Projekty/Julia/PDFIO.jl/src/CosObject.jl:39
 [3] pdDocGetInfo(::PDFIO.PD.PDDocImpl) at /home/grzegorz-ubu/Dokumenty/Projekty/Julia/PDFIO.jl/src/PDDoc.jl:133
 [4] top-level scope at none:0

Fix seems to be easy - I will send PR.

Not able to execute any functions on a basic PDF. Error: Found ' (32)' Expected '<' here

Hi there.
I am getting an error when I try to execute getPDFText() or pdDocOpen() or any other function. This is the error:
Found ' (32)' Expected '<' here

And here are the first few lines of the stack trace:
[1] error(::String) at ./error.jl:33
[2] skipv at /Users/tuckercahillchambers/.julia/packages/PDFIO/Miu63/src/BufferParser.jl:25 [inlined]
[3] read_trailer(::IOStream, ::Int64) at /Users/tuckercahillchambers/.julia/packages/PDFIO/Miu63/src/CosDoc.jl:382

I have searched for this error and come up with nothing. Any ideas on where to go from here?
Thank you.

Feature request: add support for reading attachments

This may be low on your priority list, but being able to read PDF attachments would be great. I deal with a lot of PDFs that have xml or excel attachments with the source data used to generate the PDF. There just aren't many tools for dealing with attachments - it seems most people use command line tools.

CDDate(test) == CDDate(test) returns false

While working on a unit test for PR #26, I noticed the following unexpected behavior:

(v1.0) julia> test = "D:20090807192622";

(v1.0) julia> CDDate(test) == CDDate(test)
false

Looking at the code, I see the following 2 problems:

  1. The regex for dates:
r"D\s*:\s*(?<dt>\d{12})\s*(?<ut>[+-Z])\s*((?<tzh>\d{2})'\s*(?<tzm>\d{2}))?"

does not strictly conform to the Adobe PDF date spec.
A more correct version would be:

r"D\s*:\s*(?<YYYY>\d{4})(?<MM>\d{2})?(?<DD>\d{2})?(?<HH>\d{2})?(?<mm>\d{2})?(?<SS>\d{2})?\s*((?<ut>[-+Z])\s*(?<tzh>\d{2}))?(\s*'\s*(?<tzm>\d{2}))?\s*"
  2. == for CDDate falls back to identity; I would like to implement:
(==)(x::T, y::T) where {T<: CDDate} 

I'm working on a fix. Let me know if you would welcome a PR, and whether it is better to submit a separate PR or to combine it with the corrected PR #26.
Thanks, Best Regards, GW
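
To make the two points concrete, here is a minimal sketch in plain Julia. SketchDate, its fields, and the field helper are hypothetical stand-ins, not PDFIO's CDDate; the type is mutable only so that the default == falls back to identity as reported. The pattern is the stricter one proposed above, copied verbatim. Note that in the original pattern, [+-Z] inside a character class is a range from + to Z, so it also accepts digits and uppercase letters.

```julia
using Dates

# Hypothetical stand-in for CDDate (not PDFIO's actual type). It is mutable only
# so that the default `==` falls back to identity, as in the report above.
mutable struct SketchDate
    dt::DateTime
    tz_minutes::Int   # signed offset from UT, in minutes
end

# The stricter pattern proposed in the issue, copied verbatim.
DATE_RE = r"D\s*:\s*(?<YYYY>\d{4})(?<MM>\d{2})?(?<DD>\d{2})?(?<HH>\d{2})?(?<mm>\d{2})?(?<SS>\d{2})?\s*((?<ut>[-+Z])\s*(?<tzh>\d{2}))?(\s*'\s*(?<tzm>\d{2}))?\s*"

# Optional groups that did not take part in the match come back as `nothing`.
field(m, name, default) = m[name] === nothing ? default : parse(Int, m[name])

function SketchDate(s::AbstractString)
    m = match(DATE_RE, s)
    m === nothing && throw(ArgumentError("not a PDF date: $s"))
    dt = DateTime(field(m, "YYYY", 1), field(m, "MM", 1), field(m, "DD", 1),
                  field(m, "HH", 0), field(m, "mm", 0), field(m, "SS", 0))
    sign = m["ut"] == "-" ? -1 : 1
    return SketchDate(dt, sign * (60 * field(m, "tzh", 0) + field(m, "tzm", 0)))
end

# Compare by value instead of by identity.
Base.:(==)(x::SketchDate, y::SketchDate) =
    x.dt == y.dt && x.tz_minutes == y.tz_minutes

test = "D:20090807192622"
println(SketchDate(test) == SketchDate(test))
```

Without the Base.:(==) method, the last line would print false for this mutable type, because two separately constructed objects are never identical.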

Improve the performance `pdPageExtractText`

The pdPageExtractText API is one of the core APIs of PDFIO. However, a large number of small allocations makes it slower than it should be. This code needs to be refactored to improve text extraction speed further.

Any inputs, proposals and PRs in this direction will be highly appreciated.

Unable to read in PDF

I am attempting to read in this PDF. Unfortunately, the code seems stuck on the first page. Any thoughts on why this is? I was able to run this code on other PDFs.

using PDFIO

fname = "16-969_o7jp.pdf"
doc = pdDocOpen(fname)

open("tmp.txt", "w") do io
    page = pdDocGetPage(doc, 1)
    pdPageExtractText(io, page)
end

pdDocClose(doc)

I've also tried other pages of the PDF and see similar results: Julia keeps running (indefinitely), but I see no error messages, and nothing is printed to the file.

Validate the document for tagged PDF.

Tagged PDF has important properties that can help in good text and graphics extraction for usage elsewhere. Hence, it's important to extract such information from PDFs.

Outlines from PDF documents should be extracted

PDF document outlines can be extracted from 3 distinct sources:

  1. PDF bookmarks, which show up in Adobe Reader as the TOC.
  2. PDF structure from marked content from tagged PDFs
  3. Document structure analysis by learning or heuristics.

The scope of PDFIO is only 1 and 2. 3 can be created as a separate module over PDFIO to address knowledge oriented problems. Eventually, text extraction APIs should move into the new module.

Implement Filespec properly to address the EFF attribute of the security handler

The crypto code decrypts the streams by recursively accessing the indirect objects. For external files, it may not be easy to determine whether a file stream is an embedded file from the attributes of the stream's extent dictionary, as all the keys are essentially optional. So Filespecs should be implemented properly to identify the cases where the EFF flag has to be applied.

Move all the test files to the PDFTest repository.

PDFIO is MIT licensed. However, some of the test files may carry other licenses that make them unsafe to ship with PDFIO. The test files will be kept separate from PDFIO and downloaded on demand for test purposes only.

Tests fail

(v1.1) pkg> test PDFIO
   Testing PDFIO
 Resolving package versions...
    Status `/var/folders/t6/ddh10c6j5r54sg19jlc59n580000gn/T/tmpCjMIM2/Manifest.toml`
  [1520ce14] AbstractTrees v0.2.1
  [715cd884] AdobeGlyphList v0.1.1
  [9e28174c] BinDeps v0.8.10
  [b99e7846] BinaryProvider v0.5.4
  [e1450e63] BufferedStreams v1.0.0
  [34da2185] Compat v2.1.0
  [ffbed154] DocStringExtensions v0.7.0
  [e30172f5] Documenter v0.22.4
  [0862f596] HTTPClient v0.2.1
  [682c06a0] JSON v0.20.0
  [2e475f56] LabelNumerals v0.1.0
  [b27032c2] LibCURL v0.5.0
  [522f3ed2] LibExpat v0.5.0
  [2ec943e9] Libz v1.0.0
  [4d0d745f] PDFIO v0.1.3
  [27ebfcd6] Primes v0.4.0
  [9a9db56c] Rectangle v0.1.1
  [37834d88] RomanNumerals v0.3.1
  [30578b45] URIParser v0.4.0
  [c17dfb99] WinRPM v0.4.2
  [a5390f91] ZipFile v0.8.1
  [2a0f44e3] Base64  [`@stdlib/Base64`]
  [ade2ca70] Dates  [`@stdlib/Dates`]
  [8bb1440f] DelimitedFiles  [`@stdlib/DelimitedFiles`]
  [8ba89e20] Distributed  [`@stdlib/Distributed`]
  [b77e0a4c] InteractiveUtils  [`@stdlib/InteractiveUtils`]
  [76f85450] LibGit2  [`@stdlib/LibGit2`]
  [8f399da3] Libdl  [`@stdlib/Libdl`]
  [37e2e46d] LinearAlgebra  [`@stdlib/LinearAlgebra`]
  [56ddb016] Logging  [`@stdlib/Logging`]
  [d6f4376e] Markdown  [`@stdlib/Markdown`]
  [a63ad114] Mmap  [`@stdlib/Mmap`]
  [44cfe95a] Pkg  [`@stdlib/Pkg`]
  [de0858da] Printf  [`@stdlib/Printf`]
  [3fa0cd96] REPL  [`@stdlib/REPL`]
  [9a3f8284] Random  [`@stdlib/Random`]
  [ea8e919c] SHA  [`@stdlib/SHA`]
  [9e88b42a] Serialization  [`@stdlib/Serialization`]
  [1a1011a3] SharedArrays  [`@stdlib/SharedArrays`]
  [6462fe0b] Sockets  [`@stdlib/Sockets`]
  [2f01184e] SparseArrays  [`@stdlib/SparseArrays`]
  [10745b16] Statistics  [`@stdlib/Statistics`]
  [8dfed614] Test  [`@stdlib/Test`]
  [cf7118a7] UUIDs  [`@stdlib/UUIDs`]
  [4ec0a83e] Unicode  [`@stdlib/Unicode`]
ERROR: LoadError: LoadError: could not open file /Users/malmaud/.julia/packages/ZipFile/YHTbb/deps/deps.jl
Stacktrace:
 [1] include at ./boot.jl:326 [inlined]
 [2] include_relative(::Module, ::String) at ./loading.jl:1038
 [3] include at ./sysimg.jl:29 [inlined]
 [4] include(::String) at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/Zlib.jl:26
 [5] top-level scope at none:0
 [6] include at ./boot.jl:326 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1038
 [8] include at ./sysimg.jl:29 [inlined]
 [9] include(::String) at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/ZipFile.jl:36
 [10] top-level scope at none:0
 [11] include at ./boot.jl:326 [inlined]
 [12] include_relative(::Module, ::String) at ./loading.jl:1038
 [13] include(::Module, ::String) at ./sysimg.jl:29
 [14] top-level scope at none:2
 [15] eval at ./boot.jl:328 [inlined]
 [16] eval(::Expr) at ./client.jl:404
 [17] top-level scope at ./none:3
in expression starting at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/Zlib.jl:50
in expression starting at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/ZipFile.jl:43
ERROR: LoadError: Failed to precompile ZipFile [a5390f91-8eb1-5f08-bee0-b1d1ffed6cea] to /Users/malmaud/.julia/compiled/v1.1/ZipFile/cOum2.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1197
 [3] _require(::Base.PkgId) at ./loading.jl:960
 [4] require(::Base.PkgId) at ./loading.jl:858
 [5] require(::Module, ::Symbol) at ./loading.jl:853
 [6] include at ./boot.jl:326 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1038
 [8] include(::Module, ::String) at ./sysimg.jl:29
 [9] include(::String) at ./client.jl:403
 [10] top-level scope at none:0
in expression starting at /Users/malmaud/.julia/packages/PDFIO/28lLV/test/runtests.jl:6
ERROR: Package PDFIO errored during testing

Extracting text with a specific font with a rectangular region as selection area

It is common to use different fonts to denote semantic meaning (e.g., italics for emphasis or a larger font size for section titles). Is it possible to extract text that is in a specific font and size? Also, is it possible to specify a region to extract from? I would like to be able to, for example, extract all the italic text inside a region.

Extract text content from the PDF

Extract text content from PDF. Here are some of the high level use cases.

  1. Pure text-based documents should be easily converted to standard text formats such as plain text, Word documents, etc.
  2. Document structure existing in a PDF document should be preserved as much as possible. Being a PDL, PDF text rendering does not depend on the content order. Hence, any marked information in the document should be preserved.
  3. Ideally, text for reading vs. text used for a clip path or artwork should be distinguishable.

pdDocGetInfo() crash (PDF with empty properties)

pdDocGetInfo() crashes when used against PDF with empty properties:

(v1.0) julia> doc = pdDocOpen(filename);
(v1.0) julia> info = pdDocGetInfo(doc)
ERROR: BoundsError: attempt to access 0-element Array{UInt8,1} at index [1:4]
Stacktrace:
 [1] throw_boundserror(::Array{UInt8,1}, ::Tuple{UnitRange{Int64}}) at ./abstractarray.jl:484
 [2] checkbounds at ./abstractarray.jl:449 [inlined]
 [3] getindex at ./array.jl:737 [inlined]
 [4] convert(::Type{String}, ::PDFIO.Cos.CosXString) at /home/grzegorz-neo/.julia/packages/PDFIO/DyeYY/src/CosObjectHelpers.jl:11
 [5] String(::PDFIO.Cos.CosXString) at /home/grzegorz-neo/.julia/packages/PDFIO/DyeYY/src/CosObjectHelpers.jl:34
 [6] pdDocGetInfo(::PDFIO.PD.PDDocImpl) at /home/grzegorz-neo/.julia/packages/PDFIO/DyeYY/src/PDDoc.jl:135
 [7] top-level scope at none:0

Attached affected pdf file (it is no longer available on-line).
ALM-2009-Aug.pdf

Table picker for PDF

Natural tabular objects in a PDF document should ideally be picked up for extraction.

The intent of the project is API development, hence it will be headless for the most part. Unlike in a reader, there may not be a WYSIWYG picker available. A heuristic table picker should scan the document for table-like structures and dump them in tabular HTML/CSS format or as extracted image objects. In case document tagging is enabled, the table picker can use the tagged text.

pdPageExtractText() crash on file created using LaTex

pdPageExtractText() raises the following error when used on a file created by LaTeX:

"/home/grzegorz-neo/Dokumenty/Projekty/MatFiz/pdfio-test/outline.pdf"
(v1.0) julia> doc = pdDocOpen(filename);
(v1.0) julia> item_pg = pdDocGetPage(doc, 3);
(v1.0) julia> buf = IOBuffer();
(v1.0) julia> pdPageExtractText(buf, item_pg)
ERROR: InexactError: Int64(Int64, 312.5)
Stacktrace:
 [1] Type at ./float.jl:700 [inlined]
 [2] convert at ./number.jl:7 [inlined]
 [3] setindex!(::Array{Int64,1}, ::Float32, ::Int64) at ./array.jl:769
 [4] get_font_widths(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/grzegorz-ubu/Dokumenty/Projekty/Julia/PDFIO.jl/src/PDFontMetrics.jl:164
....

Changing
d[i+1] = widths[ix] into d[i+1] = round(Int,widths[ix]) in PDFontMetrics.jl fixes this issue.
The fix is included in the PR with the implementation for Outlines.
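
The failure mode and the effect of the rounding change can be reproduced in plain Julia without PDFIO. The widths below are made-up values (only 312.5 comes from the error message), and the loop is a simplified stand-in for the assignment in PDFontMetrics.jl, not the actual code.

```julia
# Made-up glyph widths; 312.5 mirrors the value in the error message above.
widths = Float32[500.0, 312.5, 278.0]

# Storing a non-integral Float32 into an Int array throws InexactError.
d = zeros(Int, length(widths))
failed = try
    d[2] = widths[2]
    false
catch err
    err isa InexactError
end

# Rounding first, as in the proposed change, avoids the exception.
for i in eachindex(widths)
    d[i] = round(Int, widths[i])
end

println(failed, " ", d)
```

Julia's default rounding mode rounds ties to the nearest even integer, so 312.5 is stored as 312 here.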

Google Docs PDF fails at pdPageExtractText

Trying to extract text from a simple Google Docs PDF,

julia> pdPageExtractText(stdout, pdDocGetPage(pdDocOpen("Downloads/GoogleDocs.pdf"), 1))

fails with:

ERROR: MethodError: no method matching setindex!(::Rectangle.RBTree{Rectangle.IntervalKey{UInt16},Int64}, ::Float32, ::Rectangle.Interval{UInt16})
Closest candidates are:
  setindex!(::Rectangle.RBTree{Rectangle.IntervalKey{K},V}, ::V, ::Rectangle.Interval{K}) where {K, V} at /home/jarvist/.julia/packages/Rectangle/SnGUM/src/interval.jl:117
Stacktrace:
 [1] get_cid_font_widths(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDFontMetrics.jl:204
 [2] get_font_widths(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDFontMetrics.jl:164
 [3] PDFont at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDFonts.jl:391 [inlined]
 [4] get_pd_font!(::PDFIO.PD.PDDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDDocImpl.jl:112
 [5] get_font(::PDFIO.PD.PDPageImpl, ::CosName) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:320
 [6] evalContent!(::PDPageElement{:Tf}, ::PDFIO.PD.GState{:PDFIO}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:735
 [7] evalContent! at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:637 [inlined]
 [8] evalContent!(::PDPageTextObject, ::PDFIO.PD.GState{:PDFIO}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:680
 [9] evalContent! at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:637 [inlined]
 [10] pdPageEvalContent(::PDFIO.PD.PDPageImpl, ::PDFIO.PD.GState{:PDFIO}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:146
 [11] pdPageEvalContent at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:145 [inlined]
 [12] pdPageExtractText(::Base.TTY, ::PDFIO.PD.PDPageImpl) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:179
 [13] top-level scope at REPL[30]:1

Better support for T3 fonts

Some PDF files use T3 fonts that do not have an embedded ToUnicode mapping. Text in such fonts cannot be extracted from the document effectively. In such cases, OCR might be useful: an OCR library such as tesseract could help recover the font data. It has to be made sure that any library used does not violate the MIT Expat License of PDFIO.

Expose FontDescription flags from PDFonts

It is often necessary to know whether a font is bold, italic, fixed-width, all-caps, small-caps, etc. Ideally, these flags should be captured in TextLayout for subsequent processing.

Writing/modifying pdfs

Is it possible to modify the parsed pdf and write it to a file? Specifically I'm interested in the ideas from here: open-source-ideas/ideas#46. Julia has excellent support for neural networks, so it would be interesting to experiment with something like this.

Extracting boxes

I have a pdf with important points drawn in boxes using path commands. For example:

Q
538.02 6098.07 3316.68 4.14063 re
f
538.02 5395.17 4.14063 705.059 re
f
3850.56 5395.17 4.14063 705.059 re
f
538.02 5393.19 3316.68 4.13672 re
f
q

How can I extract these?
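
One possible starting point is sketched below in plain Julia. It assumes the decoded page content stream is already available as text (how to obtain it through PDFIO is not covered here), and it only recognizes the "x y w h re" operator followed by "f"; it ignores the current transformation matrix, string operands, and inline images. FilledRect, filled_rects, and bounding_box are hypothetical names, not part of PDFIO.

```julia
# A rectangle appended by `x y w h re` and then painted by `f`.
struct FilledRect
    x::Float64
    y::Float64
    w::Float64
    h::Float64
end

# Scan whitespace-separated tokens, keeping numeric operands on a stack
# until an operator consumes them.
function filled_rects(content::AbstractString)
    operands = Float64[]
    pending = FilledRect[]   # rectangles in the current, not yet painted path
    rects = FilledRect[]
    for tok in split(content)
        num = tryparse(Float64, tok)
        if num !== nothing
            push!(operands, num)
        elseif tok == "re" && length(operands) >= 4
            x, y, w, h = operands[end-3:end]
            push!(pending, FilledRect(x, y, w, h))
            empty!(operands)
        elseif tok == "f"
            append!(rects, pending)
            empty!(pending)
            empty!(operands)
        else
            empty!(operands)   # any other operator: discard its operands
        end
    end
    return rects
end

# Smallest axis-aligned box (x0, y0, x1, y1) covering all rectangles.
function bounding_box(rects)
    x0 = minimum(r.x for r in rects)
    y0 = minimum(r.y for r in rects)
    x1 = maximum(r.x + r.w for r in rects)
    y1 = maximum(r.y + r.h for r in rects)
    return (x0, y0, x1, y1)
end

content = """
Q
538.02 6098.07 3316.68 4.14063 re
f
538.02 5395.17 4.14063 705.059 re
f
3850.56 5395.17 4.14063 705.059 re
f
538.02 5393.19 3316.68 4.13672 re
f
q
"""

rects = filled_rects(content)
box = bounding_box(rects)
println(length(rects), " rectangles, bounding box ", box)
```

In the snippet from the question, the four thin rectangles are the top, left, right, and bottom edges of one frame, so their bounding box is the drawn box. Grouping edges into separate boxes when a page has several frames would need an additional heuristic.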

pdDocOpen() crash (PDF created by pdfLatex)

pdDocOpen() crashes with the following error:

ArgumentError: extra characters after whitespace in "1502\n6"

when called on a file created by LaTeX (attached).
Latex version: pdfTeX 3.14159265-2.6-1.40.16 (TeX Live 2015/Debian) (KDE Neon / Ubuntu 16.04 based)
I will submit PR to fix soon.
outline.pdf
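
The message itself is what Julia's parse(Int, ...) reports when a single string holds two integers separated by whitespace. The sketch below only reproduces that behaviour and shows whitespace tokenization as one way around it; where the string originates inside PDFIO and how the actual fix looks are not known from this report, and the names a and b are placeholders.

```julia
raw = "1502\n6"

# Parsing the whole string as one integer fails with an ArgumentError.
threw = try
    parse(Int, raw)
    false
catch err
    err isa ArgumentError
end

# Splitting on whitespace first yields two separate integers.
a, b = parse.(Int, split(raw))

println(threw, " ", a, " ", b)
```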

build fails

Following the procedure for building the package on macOS 10.15, I get the following error:

(base)
in ~ vlad🅒 base
 mkdir test && cd test
(base)
in ~/test vlad🅒 base
 julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.2.0 (2019-08-20)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |

(v1.2) pkg> activate .
Activating new environment at `~/test/Project.toml`

(test) pkg> add PDFIO
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Resolving package versions...
  Updating `~/test/Project.toml`
  [4d0d745f] + PDFIO v0.1.8
  Updating `~/test/Manifest.toml`
  [1520ce14] + AbstractTrees v0.2.1
  [715cd884] + AdobeGlyphList v0.1.1
  [9e28174c] + BinDeps v0.8.10
  [34da2185] + Compat v2.2.0
  [2e475f56] + LabelNumerals v0.1.0
  [4d0d745f] + PDFIO v0.1.8
  [27ebfcd6] + Primes v0.4.0
  [9a9db56c] + Rectangle v0.1.2
  [37834d88] + RomanNumerals v0.3.1
  [30578b45] + URIParser v0.4.0
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [8bb1440f] + DelimitedFiles
  [8ba89e20] + Distributed
  [b77e0a4c] + InteractiveUtils
  [76f85450] + LibGit2
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [a63ad114] + Mmap
  [44cfe95a] + Pkg
  [de0858da] + Printf
  [3fa0cd96] + REPL
  [9a3f8284] + Random
  [ea8e919c] + SHA
  [9e88b42a] + Serialization
  [1a1011a3] + SharedArrays
  [6462fe0b] + Sockets
  [2f01184e] + SparseArrays
  [10745b16] + Statistics
  [8dfed614] + Test
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode

(test) pkg> test PDFIO
   Testing PDFIO
 Resolving package versions...
    Status `/var/folders/k3/hy22jxt17xd4hggsb2fqhs4m0000gn/T/jl_OC5Fsf/Manifest.toml`
  [1520ce14] AbstractTrees v0.2.1
  [715cd884] AdobeGlyphList v0.1.1
  [9e28174c] BinDeps v0.8.10
  [b99e7846] BinaryProvider v0.5.8
  [34da2185] Compat v2.2.0
  [2e475f56] LabelNumerals v0.1.0
  [4d0d745f] PDFIO v0.1.8
  [27ebfcd6] Primes v0.4.0
  [9a9db56c] Rectangle v0.1.2
  [37834d88] RomanNumerals v0.3.1
  [30578b45] URIParser v0.4.0
  [a5390f91] ZipFile v0.8.3
  [2a0f44e3] Base64  [`@stdlib/Base64`]
  [ade2ca70] Dates  [`@stdlib/Dates`]
  [8bb1440f] DelimitedFiles  [`@stdlib/DelimitedFiles`]
  [8ba89e20] Distributed  [`@stdlib/Distributed`]
  [b77e0a4c] InteractiveUtils  [`@stdlib/InteractiveUtils`]
  [76f85450] LibGit2  [`@stdlib/LibGit2`]
  [8f399da3] Libdl  [`@stdlib/Libdl`]
  [37e2e46d] LinearAlgebra  [`@stdlib/LinearAlgebra`]
  [56ddb016] Logging  [`@stdlib/Logging`]
  [d6f4376e] Markdown  [`@stdlib/Markdown`]
  [a63ad114] Mmap  [`@stdlib/Mmap`]
  [44cfe95a] Pkg  [`@stdlib/Pkg`]
  [de0858da] Printf  [`@stdlib/Printf`]
  [3fa0cd96] REPL  [`@stdlib/REPL`]
  [9a3f8284] Random  [`@stdlib/Random`]
  [ea8e919c] SHA  [`@stdlib/SHA`]
  [9e88b42a] Serialization  [`@stdlib/Serialization`]
  [1a1011a3] SharedArrays  [`@stdlib/SharedArrays`]
  [6462fe0b] Sockets  [`@stdlib/Sockets`]
  [2f01184e] SparseArrays  [`@stdlib/SparseArrays`]
  [10745b16] Statistics  [`@stdlib/Statistics`]
  [8dfed614] Test  [`@stdlib/Test`]
  [cf7118a7] UUIDs  [`@stdlib/UUIDs`]
  [4ec0a83e] Unicode  [`@stdlib/Unicode`]
ERROR: LoadError: LoadError: LoadError: PDFIO not properly installed. Please run Pkg.build("PDFIO")
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] top-level scope at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/LibCrypto.jl:1
 [3] include at ./boot.jl:328 [inlined]
 [4] include_relative(::Module, ::String) at ./loading.jl:1094
 [5] include at ./Base.jl:31 [inlined]
 [6] include(::String) at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/Common.jl:1
 [7] top-level scope at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/Common.jl:8
 [8] include at ./boot.jl:328 [inlined]
 [9] include_relative(::Module, ::String) at ./loading.jl:1094
 [10] include at ./Base.jl:31 [inlined]
 [11] include(::String) at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/PDFIO.jl:3
 [12] top-level scope at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/PDFIO.jl:5
 [13] include at ./boot.jl:328 [inlined]
 [14] include_relative(::Module, ::String) at ./loading.jl:1094
 [15] include(::Module, ::String) at ./Base.jl:31
 [16] top-level scope at none:2
 [17] eval at ./boot.jl:330 [inlined]
 [18] eval(::Expr) at ./client.jl:432
 [19] top-level scope at ./none:3
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/LibCrypto.jl:1
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/Common.jl:8
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/PDFIO.jl:5
ERROR: LoadError: Failed to precompile PDFIO [4d0d745f-9d9a-592e-8d18-1ad8a0f42b92] to /Users/vlad/.julia/compiled/v1.2/PDFIO/cmOJE.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1253
 [3] _require(::Base.PkgId) at ./loading.jl:1013
 [4] require(::Base.PkgId) at ./loading.jl:911
 [5] require(::Module, ::Symbol) at ./loading.jl:906
 [6] include at ./boot.jl:328 [inlined]
 [7] include_relative(::Module, ::String) at ./loading.jl:1094
 [8] include(::Module, ::String) at ./Base.jl:31
 [9] include(::String) at ./client.jl:431
 [10] top-level scope at none:5
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/test/runtests.jl:2
ERROR: Package PDFIO errored during testing

(test) pkg>

and this is what happens when I am trying to build the package:

(test) pkg> build PDFIO
  Building PDFIO → `~/.julia/packages/PDFIO/LF83Q/deps/build.log`
┌ Error: Error building `PDFIO`:
│
│ signal (6): Abort trap: 6
│ in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/deps/build.jl:76
│ __pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
│ Allocations: 10469954 (Pool: 10467735; Big: 2219); GC: 21
└ @ Pkg.Operations ~/julia/usr/share/julia/stdlib/v1.2/Pkg/src/backwards_compatible_isolation.jl:647

(test) pkg>

Does this error look familiar to anyone?

Support Forms XObjects

A Form XObject is PDF content embedded as a whole in a PDF page's content. This kind of XObject can also contain text and hence may be relevant to text extraction.

Support for JPEG filter

Content filter for JPEG and JPEG2000 should be supported.

Since these are special-purpose filters, it should be reviewed whether to decode them or to stream the data directly into the graphics channel for rendering.

Move the Zlib and OpenSSL dependency to JuliaBinaryWrappers

JuliaBinaryWrappers provides prebuilt binaries of Zlib and OpenSSL. Instead of building them, it may be better to pick up the prebuilt binaries. That removes unnecessary build time and gives a consistent test experience across installations. However, the minimum Julia release then has to be 1.3.

Once Julia 1.3 is GA, this can be taken up.

`cosDocGetPageNumbers` crashes when there is no `PageLabels` in PDF catalog

Hi. While working on the Outline support implementation, I got the following error:

        @test begin
            filename="files/1.pdf"
            DEBUG && println(filename)
            doc = pdDocOpen(filename)
            @assert length(pdDocGetPageRange(doc, "1")) >= 1
            pdDocClose(doc)
            length(utilPrintOpenFiles()) == 0
        end

Fails with:
MethodError: no method matching get(::CosNullType, ::CosName)

I think pdDocGetPageRange should fall back to parsing the label as a number and return the appropriate page if there is no labels dictionary in the PDF, or at least return an empty vector.

I have a fix ready and would like to submit a PR.
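
The proposed fallback could look roughly like the sketch below. It is plain Julia with a hypothetical helper, page_range_or_number, and a plain Dict standing in for the labels dictionary; it is not PDFIO's API or the submitted fix.

```julia
# Hypothetical helper, not PDFIO's API. `labels` maps label text to a page
# number, or is `nothing` when the PDF catalog has no PageLabels entry.
function page_range_or_number(label::AbstractString,
                              labels::Union{Nothing,Dict{String,Int}},
                              npages::Int)
    if labels === nothing
        # No labels dictionary: treat the label as a plain page number.
        n = tryparse(Int, label)
        return (n !== nothing && 1 <= n <= npages) ? (n:n) : (1:0)
    end
    p = get(labels, label, 0)
    return p == 0 ? (1:0) : (p:p)   # 1:0 is an empty range
end

println(page_range_or_number("1", nothing, 10))
println(length(page_range_or_number("iv", nothing, 10)))
println(page_range_or_number("iv", Dict("iv" => 4), 10))
```

Returning an empty range instead of throwing keeps a check such as length(...) >= 1 meaningful for callers, as in the test above.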

tocPDF

I have created a repository whose goal is to auto-generate bookmarks from the table of contents already available at the beginning of PDF files.
https://github.com/aminya/tocPDF

For now, I plan to start by using available software (e.g., k2pdfopt), and later make the functionality Julia native (when you add PDF write capability).

Current algorithm plan: https://github.com/aminya/tocPDF#automated

I looked at the PDFIO documentation, but it is long and covers many functions. Could you help me get started with PDFIO?

If anyone is interested in participating, that would be awesome. (@kskyten @sambitdash )
