sambitdash / pdfio.jl
PDF Reader Library for Native Julia.
License: Other
pdDocGetInfo() crashes when called on a PDF without any properties:
(v1.0) julia> doc = pdDocOpen(filename);
(v1.0) julia> info = pdDocGetInfo(doc)
ERROR: type CosNullType has no field val
Stacktrace:
[1] getproperty(::Any, ::Symbol) at ./sysimg.jl:18
[2] get(::PDFIO.Cos.CosNullType) at /home/grzegorz-ubu/Dokumenty/Projekty/Julia/PDFIO.jl/src/CosObject.jl:39
[3] pdDocGetInfo(::PDFIO.PD.PDDocImpl) at /home/grzegorz-ubu/Dokumenty/Projekty/Julia/PDFIO.jl/src/PDDoc.jl:133
[4] top-level scope at none:0
The fix seems to be easy. I will send a PR.
Hi there.
I am getting an error when I try to execute getPDFText(), pdDocOpen(), or any other function. This is the error:
Found ' (32)' Expected '<' here
And here are the first few lines of the stack trace:
[1] error(::String) at ./error.jl:33
[2] skipv at /Users/tuckercahillchambers/.julia/packages/PDFIO/Miu63/src/BufferParser.jl:25 [inlined]
[3] read_trailer(::IOStream, ::Int64) at /Users/tuckercahillchambers/.julia/packages/PDFIO/Miu63/src/CosDoc.jl:382
I have searched for this error and come up with nothing. Any ideas on where to go from here?
Thank you.
When there are consecutive TJ or Tj operators, without any Td or TD operators that update the text matrix from the text line matrix, the computed bounding box for the text run can be wrong.
The Tm text matrix should be updated after every show-text operation.
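The update described here can be sketched as follows. This is a minimal illustration rather than PDFIO's code: the function names and the sample numbers are assumptions, and only the displacement formula comes from the PDF specification (ISO 32000-1, section 9.4.4). Matrices follow the PDF row-vector convention, so the translation sits in the third row.

```julia
# Sketch only: names are illustrative, not PDFIO's API.

# Horizontal displacement contributed by one glyph:
#   tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th
# w0  : glyph width in text space (font width / 1000)
# tj  : numeric adjustment from a TJ array, in thousandths (0 for a plain Tj)
# tfs : font size, tc : character spacing
# tw  : word spacing (only applies to the single-byte space character)
# th  : horizontal scaling (1.0 means 100%)
glyph_advance(w0, tj, tfs, tc, tw, th) = ((w0 - tj / 1000) * tfs + tc + tw) * th

# Tm' = [1 0 0; 0 1 0; tx 0 1] * Tm
function advance_text_matrix(tm::Matrix{Float64}, tx::Float64)
    shift = [1.0 0.0 0.0; 0.0 1.0 0.0; tx 0.0 1.0]
    return shift * tm
end

# Two consecutive show-text operations with no Td/TD in between:
# the second run has to start where the first one ended.
tm = [1.0 0.0 0.0; 0.0 1.0 0.0; 100.0 700.0 1.0]    # text positioned at (100, 700)
tx1 = glyph_advance(0.5, 0, 12.0, 0.0, 0.0, 1.0)     # one glyph, width 500/1000, 12 pt
tm = advance_text_matrix(tm, tx1)
start_of_second_run = (tm[3, 1], tm[3, 2])
```

If the matrix is not advanced like this, every run after the first is measured from the origin of the line, which is one way the bounding boxes can end up overlapping.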
This may be low on your priority list, but being able to read PDF attachments would be great. I deal with a lot of PDFs that have xml or excel attachments with the source data used to generate the PDF. There just aren't many tools for dealing with attachments - it seems most people use command line tools.
While working on a unit test for PR #26, I noticed the following unexpected behavior:
(v1.0) julia> test = "D:20090807192622";
(v1.0) julia> CDDate(test) == CDDate(test)
false
Looking at the code, I see the following 2 problems:
1. The regex
r"D\s*:\s*(?<dt>\d{12})\s*(?<ut>[+-Z])\s*((?<tzh>\d{2})'\s*(?<tzm>\d{2}))?"
does not strictly conform to the Adobe PDF date spec.
A more correct one would be:
r"D\s*:\s*(?<YYYY>\d{4})(?<MM>\d{2})?(?<DD>\d{2})?(?<HH>\d{2})?(?<mm>\d{2})?(?<SS>\d{2})?\s*((?<ut>[-+Z])\s*(?<tzh>\d{2}))?(\s*'\s*(?<tzm>\d{2}))?\s*"
2. (==)(x::T, y::T) where {T<: CDDate}
I'm working on a fix. Let me know if you would welcome a PR, and whether it is better to submit it separately or together with the corrected PR #26.
Thanks, Best Regards, GW
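To make the proposal concrete, here is a small sketch that applies the suggested pattern and compares two parses of the same string. The helper parse_pdf_date and the NamedTuple it returns are hypothetical and are not the CDDate type; the defaults for missing month and day (1) and for missing time fields (0) follow the PDF date convention. Note that [+-Z] in the current pattern is a character range from '+' to 'Z', while [-+Z] matches only those three characters.

```julia
# The pattern proposed in this issue, unchanged.
const PDF_DATE = r"D\s*:\s*(?<YYYY>\d{4})(?<MM>\d{2})?(?<DD>\d{2})?(?<HH>\d{2})?(?<mm>\d{2})?(?<SS>\d{2})?\s*((?<ut>[-+Z])\s*(?<tzh>\d{2}))?(\s*'\s*(?<tzm>\d{2}))?\s*"

# Hypothetical helper (not part of PDFIO): returns the date fields as a
# NamedTuple, or `nothing` when the string does not match at all.
function parse_pdf_date(s::AbstractString)
    m = match(PDF_DATE, s)
    m === nothing && return nothing
    # Optional groups that did not participate in the match are `nothing`.
    num(name, default) = m[name] === nothing ? default : parse(Int, m[name])
    return (year = num("YYYY", 0), month = num("MM", 1), day = num("DD", 1),
            hour = num("HH", 0), minute = num("mm", 0), second = num("SS", 0),
            ut = m["ut"], tzh = num("tzh", 0), tzm = num("tzm", 0))
end

a = parse_pdf_date("D:20090807192622")
b = parse_pdf_date("D:20090807192622")
same = (a == b)   # NamedTuples compare field by field
```

A NamedTuple gets field-wise equality for free. A custom type such as CDDate would need its own == method to behave the same way, which appears to be what the second point is about.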
The conversion is relatively simple, so it should be done as a vector operation rather than a byte-by-byte read.
The pdPageExtractText API is one of the core APIs of PDFIO. However, a large number of small allocations makes it slower than it needs to be. This code needs to be refactored to improve text extraction speed further.
Any inputs, proposals and PRs in this direction will be highly appreciated.
I am attempting to read in this PDF. Unfortunately, the code seems to get stuck on the first page. Any thoughts on why this is? I was able to run this code on other PDFs.
using PDFIO
fname = "16-969_o7jp.pdf"
doc = pdDocOpen(fname)
open("tmp.txt", "w") do io
    page = pdDocGetPage(doc, 1)
    pdPageExtractText(io, page)
end
pdDocClose(doc)
I've also tried on other pages of the PDF and see similar results - Julia works (indefinitely), but I see no error messages, and nothing is printed to the file.
Tagged PDF has important properties that can help in good text and graphics extraction for usage elsewhere. Hence, it's important to extract such information from PDFs.
PDF document outlines can be extracted from 3 distinct sources:
The scope of PDFIO is only 1 and 2. 3 can be created as a separate module over PDFIO to address knowledge-oriented problems. Eventually, text extraction APIs should move into the new module.
The crypto code decrypts the streams by recursively accessing the indirect objects. For external files, it may not be easy to determine whether a file stream is an embedded file from the attributes of the extent dictionary of the stream, as all the keys are more or less optional. So Filespecs should be implemented properly to identify the cases where the EFF flag has to be used.
PDFIO is MIT licensed. However, some of the test files may carry other licenses that make them unsafe to ship with PDFIO. The test files will therefore be kept separate from PDFIO and downloaded on demand for test purposes only.
https://github.com/pdf-association/pdf20examples has the files.
This may be picked up from a PostScript renderer like the Cairo project as well. Currently, Cairo.jl does not expose such low-level APIs.
Currently they are merely CosStreams with a type.
(v1.1) pkg> test PDFIO
Testing PDFIO
Resolving package versions...
Status `/var/folders/t6/ddh10c6j5r54sg19jlc59n580000gn/T/tmpCjMIM2/Manifest.toml`
[1520ce14] AbstractTrees v0.2.1
[715cd884] AdobeGlyphList v0.1.1
[9e28174c] BinDeps v0.8.10
[b99e7846] BinaryProvider v0.5.4
[e1450e63] BufferedStreams v1.0.0
[34da2185] Compat v2.1.0
[ffbed154] DocStringExtensions v0.7.0
[e30172f5] Documenter v0.22.4
[0862f596] HTTPClient v0.2.1
[682c06a0] JSON v0.20.0
[2e475f56] LabelNumerals v0.1.0
[b27032c2] LibCURL v0.5.0
[522f3ed2] LibExpat v0.5.0
[2ec943e9] Libz v1.0.0
[4d0d745f] PDFIO v0.1.3
[27ebfcd6] Primes v0.4.0
[9a9db56c] Rectangle v0.1.1
[37834d88] RomanNumerals v0.3.1
[30578b45] URIParser v0.4.0
[c17dfb99] WinRPM v0.4.2
[a5390f91] ZipFile v0.8.1
[2a0f44e3] Base64 [`@stdlib/Base64`]
[ade2ca70] Dates [`@stdlib/Dates`]
[8bb1440f] DelimitedFiles [`@stdlib/DelimitedFiles`]
[8ba89e20] Distributed [`@stdlib/Distributed`]
[b77e0a4c] InteractiveUtils [`@stdlib/InteractiveUtils`]
[76f85450] LibGit2 [`@stdlib/LibGit2`]
[8f399da3] Libdl [`@stdlib/Libdl`]
[37e2e46d] LinearAlgebra [`@stdlib/LinearAlgebra`]
[56ddb016] Logging [`@stdlib/Logging`]
[d6f4376e] Markdown [`@stdlib/Markdown`]
[a63ad114] Mmap [`@stdlib/Mmap`]
[44cfe95a] Pkg [`@stdlib/Pkg`]
[de0858da] Printf [`@stdlib/Printf`]
[3fa0cd96] REPL [`@stdlib/REPL`]
[9a3f8284] Random [`@stdlib/Random`]
[ea8e919c] SHA [`@stdlib/SHA`]
[9e88b42a] Serialization [`@stdlib/Serialization`]
[1a1011a3] SharedArrays [`@stdlib/SharedArrays`]
[6462fe0b] Sockets [`@stdlib/Sockets`]
[2f01184e] SparseArrays [`@stdlib/SparseArrays`]
[10745b16] Statistics [`@stdlib/Statistics`]
[8dfed614] Test [`@stdlib/Test`]
[cf7118a7] UUIDs [`@stdlib/UUIDs`]
[4ec0a83e] Unicode [`@stdlib/Unicode`]
ERROR: LoadError: LoadError: could not open file /Users/malmaud/.julia/packages/ZipFile/YHTbb/deps/deps.jl
Stacktrace:
[1] include at ./boot.jl:326 [inlined]
[2] include_relative(::Module, ::String) at ./loading.jl:1038
[3] include at ./sysimg.jl:29 [inlined]
[4] include(::String) at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/Zlib.jl:26
[5] top-level scope at none:0
[6] include at ./boot.jl:326 [inlined]
[7] include_relative(::Module, ::String) at ./loading.jl:1038
[8] include at ./sysimg.jl:29 [inlined]
[9] include(::String) at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/ZipFile.jl:36
[10] top-level scope at none:0
[11] include at ./boot.jl:326 [inlined]
[12] include_relative(::Module, ::String) at ./loading.jl:1038
[13] include(::Module, ::String) at ./sysimg.jl:29
[14] top-level scope at none:2
[15] eval at ./boot.jl:328 [inlined]
[16] eval(::Expr) at ./client.jl:404
[17] top-level scope at ./none:3
in expression starting at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/Zlib.jl:50
in expression starting at /Users/malmaud/.julia/packages/ZipFile/YHTbb/src/ZipFile.jl:43
ERROR: LoadError: Failed to precompile ZipFile [a5390f91-8eb1-5f08-bee0-b1d1ffed6cea] to /Users/malmaud/.julia/compiled/v1.1/ZipFile/cOum2.ji.
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1197
[3] _require(::Base.PkgId) at ./loading.jl:960
[4] require(::Base.PkgId) at ./loading.jl:858
[5] require(::Module, ::Symbol) at ./loading.jl:853
[6] include at ./boot.jl:326 [inlined]
[7] include_relative(::Module, ::String) at ./loading.jl:1038
[8] include(::Module, ::String) at ./sysimg.jl:29
[9] include(::String) at ./client.jl:403
[10] top-level scope at none:0
in expression starting at /Users/malmaud/.julia/packages/PDFIO/28lLV/test/runtests.jl:6
ERROR: Package PDFIO errored during testing
It is common to use different fonts to denote semantic meaning (e.g. italics for emphasis or a larger font size for section titles). Is it possible to extract text that is in a specific font and size? Also, is it possible to specify a region to extract from? I would like to be able to, for example, extract all the italic text inside a region.
The filter is supported in PDF, so it may be needed by some niche market.
@JuliaRegistrator register()
Extract text content from PDF. Here are some of the high level use cases.
As PDFs are part of many workflows, digital signatures are becoming the norm for signing those workflow transactions. Validating such signatures would clearly benefit these workflows.
pdDocGetInfo() crashes when called on a PDF with empty properties:
(v1.0) julia> doc = pdDocOpen(filename);
(v1.0) julia> info = pdDocGetInfo(doc)
ERROR: BoundsError: attempt to access 0-element Array{UInt8,1} at index [1:4]
Stacktrace:
[1] throw_boundserror(::Array{UInt8,1}, ::Tuple{UnitRange{Int64}}) at ./abstractarray.jl:484
[2] checkbounds at ./abstractarray.jl:449 [inlined]
[3] getindex at ./array.jl:737 [inlined]
[4] convert(::Type{String}, ::PDFIO.Cos.CosXString) at /home/grzegorz-neo/.julia/packages/PDFIO/DyeYY/src/CosObjectHelpers.jl:11
[5] String(::PDFIO.Cos.CosXString) at /home/grzegorz-neo/.julia/packages/PDFIO/DyeYY/src/CosObjectHelpers.jl:34
[6] pdDocGetInfo(::PDFIO.PD.PDDocImpl) at /home/grzegorz-neo/.julia/packages/PDFIO/DyeYY/src/PDDoc.jl:135
[7] top-level scope at none:0
The affected PDF file is attached (it is no longer available online).
ALM-2009-Aug.pdf
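The trace shows a slice [1:4] being taken from an empty byte array. A length check before the slice avoids that class of error. The sketch below is an assumption about the shape of the fix, not the actual code in CosObjectHelpers.jl: the helper name is hypothetical, and the UTF-16 byte order mark is used only as an example prefix.

```julia
# Hypothetical helper: true only when `data` is long enough and begins
# with `prefix`. The length test short-circuits, so no slice is taken
# from an array that is too short.
starts_with_bytes(data::AbstractVector{UInt8}, prefix::AbstractVector{UInt8}) =
    length(data) >= length(prefix) && view(data, 1:length(prefix)) == prefix

bom = UInt8[0xFE, 0xFF]                               # example prefix only
empty_ok = starts_with_bytes(UInt8[], bom)            # no BoundsError on empty input
with_bom = starts_with_bytes(UInt8[0xFE, 0xFF, 0x00, 0x41], bom)
```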
The REQUIRE file could not be found.
cc: @sambitdash
Natural tabular objects in a PDF document should ideally be picked up for extraction.
The intent of the project is API development, hence it will be headless for the most part. Unlike in a reader, there may not be a WYSIWYG picker available. A heuristic table picker should scan the document for table-like structures and dump them in tabular HTML/CSS format or as extracted image objects. If document tagging is enabled, the table picker can use the tagged text.
pdPageExtractText() raises the following error when used on a file created by LaTeX:
"/home/grzegorz-neo/Dokumenty/Projekty/MatFiz/pdfio-test/outline.pdf"
(v1.0) julia> doc = pdDocOpen(filename);
(v1.0) julia> item_pg = pdDocGetPage(doc, 3);
(v1.0) julia> buf = IOBuffer();
(v1.0) julia> pdPageExtractText(buf, item_pg)
ERROR: InexactError: Int64(Int64, 312.5)
Stacktrace:
[1] Type at ./float.jl:700 [inlined]
[2] convert at ./number.jl:7 [inlined]
[3] setindex!(::Array{Int64,1}, ::Float32, ::Int64) at ./array.jl:769
[4] get_font_widths(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/grzegorz-ubu/Dokumenty/Projekty/Julia/PDFIO.jl/src/PDFontMetrics.jl:164
....
Changing
d[i+1] = widths[ix]
into d[i+1] = round(Int,widths[ix])
in PDFontMetrics.jl fixes this issue.
The fix is included in the PR with the implementation for Outlines.
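The failure and the fix can be reproduced outside PDFIO with plain arrays. The variable names below mirror the snippet above, but the data is made up for illustration; only the value 312.5 comes from the error message.

```julia
# Example widths; 312.5 is the value reported in the InexactError.
widths = Float32[312.5, 500.0, 277.8]
d = zeros(Int, length(widths))

# Storing a non-integral Float32 into an Int array triggers an implicit
# convert(Int, ...) and therefore an InexactError.
failed = try
    d[1] = widths[1]
    false
catch err
    err isa InexactError
end

# Explicit rounding always produces an Int, so the assignment succeeds.
for (i, w) in enumerate(widths)
    d[i] = round(Int, w)
end
```

Rounding loses the fractional part of the width. If sub-unit precision matters for layout, storing the widths as floating point values would be the alternative design.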
This may not be very accurate but a good way to start understanding the document. The creators do not always provide the final reader intent of the document.
Trying to extract text from a simple Google Docs PDF,
julia> pdPageExtractText(stdout, pdDocGetPage(pdDocOpen("Downloads/GoogleDocs.pdf"), 1))
fails with:
ERROR: MethodError: no method matching setindex!(::Rectangle.RBTree{Rectangle.IntervalKey{UInt16},Int64}, ::Float32, ::Rectangle.Interval{UInt16})
Closest candidates are:
setindex!(::Rectangle.RBTree{Rectangle.IntervalKey{K},V}, ::V, ::Rectangle.Interval{K}) where {K, V} at /home/jarvist/.julia/packages/Rectangle/SnGUM/src/interval.jl:117
Stacktrace:
[1] get_cid_font_widths(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDFontMetrics.jl:204
[2] get_font_widths(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDFontMetrics.jl:164
[3] PDFont at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDFonts.jl:391 [inlined]
[4] get_pd_font!(::PDFIO.PD.PDDocImpl, ::PDFIO.Cos.CosIndirectObject{CosDict}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDDocImpl.jl:112
[5] get_font(::PDFIO.PD.PDPageImpl, ::CosName) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:320
[6] evalContent!(::PDPageElement{:Tf}, ::PDFIO.PD.GState{:PDFIO}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:735
[7] evalContent! at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:637 [inlined]
[8] evalContent!(::PDPageTextObject, ::PDFIO.PD.GState{:PDFIO}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:680
[9] evalContent! at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPageElement.jl:637 [inlined]
[10] pdPageEvalContent(::PDFIO.PD.PDPageImpl, ::PDFIO.PD.GState{:PDFIO}) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:146
[11] pdPageEvalContent at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:145 [inlined]
[12] pdPageExtractText(::Base.TTY, ::PDFIO.PD.PDPageImpl) at /home/jarvist/.julia/packages/PDFIO/Miu63/src/PDPage.jl:179
[13] top-level scope at REPL[30]:1
Some PDF files use Type 3 (T3) fonts that do not have an embedded ToUnicode mapping. Text in such fonts cannot be extracted from the document effectively. In such cases OCR might be useful: an OCR library like tesseract or similar could help with extracting the text. Any library used must not violate the MIT Expat License of PDFIO.
SASLPrep can be implemented using the libraries supplied by the Unicode consortium: http://site.icu-project.org/ but I guess this may be an unnecessary added dependency.
Enhancement request has been raised to include the feature in Julia: JuliaLang/julia#32503
Often it is necessary to know whether a font is bold, italic, fixed width, all caps, small caps, etc. Ideally, these should be captured in TextLayout for subsequent processing.
Is it possible to modify the parsed pdf and write it to a file? Specifically I'm interested in the ideas from here: open-source-ideas/ideas#46. Julia has excellent support for neural networks, so it would be interesting to experiment with something like this.
This implementation may need to be reviewed along with #2. Although the two may not overlap exactly, in some cases the implementation logic can be similar.
I have a pdf with important points drawn in boxes using path commands. For example:
Q
538.02 6098.07 3316.68 4.14063 re
f
538.02 5395.17 4.14063 705.059 re
f
3850.56 5395.17 4.14063 705.059 re
f
538.02 5393.19 3316.68 4.13672 re
f
q
How can I extract these?
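One possible starting point, assuming the decoded content stream is already available as text, is to scan it for the re operator and collect its four operands. The sketch below does not use any PDFIO API. The helper name is hypothetical, and a plain regular expression will also match inside string literals or inline image data, so this is only suitable for simple streams like the one quoted above.

```julia
# A PDF number: optional sign, digits with an optional fractional part.
const NUM = raw"[-+]?(?:\d+\.?\d*|\.\d+)"
# Four numbers followed by the `re` (append rectangle) operator.
const RE_OP = Regex("($NUM)\\s+($NUM)\\s+($NUM)\\s+($NUM)\\s+re\\b")

# Hypothetical helper: returns one (x, y, w, h) NamedTuple per `re` operator.
function extract_rects(content::AbstractString)
    rects = NamedTuple{(:x, :y, :w, :h), NTuple{4, Float64}}[]
    for m in eachmatch(RE_OP, content)
        x, y, w, h = parse.(Float64, m.captures)
        push!(rects, (x = x, y = y, w = w, h = h))
    end
    return rects
end

content = """
Q
538.02 6098.07 3316.68 4.14063 re
f
538.02 5395.17 4.14063 705.059 re
f
3850.56 5395.17 4.14063 705.059 re
f
538.02 5393.19 3316.68 4.13672 re
f
q
"""
rects = extract_rects(content)
```

The four thin rectangles in this sample are the top, left, right, and bottom edges of a box. The coordinates are in the user space that is current at that point in the stream, so any earlier cm operators still have to be applied to get page coordinates.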
The numeric spacing adjustments in TJ operators can be used to simulate word spacing. These should be supported in the text extractor.
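A simple heuristic for this is sketched below. The representation of a TJ array as a vector of strings and numbers, the function name, and the threshold of 200 are all assumptions made for illustration. In a TJ array the numbers are in thousandths of a text space unit and are subtracted from the current position, so a large negative number opens a gap.

```julia
# Hypothetical sketch: `elements` stands in for a decoded TJ array.
# A numeric adjustment whose gap exceeds `threshold` thousandths is
# treated as a word space; smaller values are ordinary kerning.
function tj_text(elements; threshold = 200)
    io = IOBuffer()
    for e in elements
        if e isa AbstractString
            print(io, e)
        elseif e isa Real && -e > threshold
            print(io, ' ')
        end
    end
    return String(take!(io))
end

line = tj_text(Any["Hello", -333, "W", 40, "orld"])
```

A fixed threshold is crude. Comparing the gap with the width of the font's space glyph would likely be more robust, at the cost of needing the font metrics.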
The current algorithm may treat a citation superscript as a separate line appearing above the line where the superscript is used. This can change the layout and the placement of the characters extracted from the PDF document.
I am struggling to figure out how to use this library to read a pdf as text for the purpose of Natural Language Processing as an alternative to
using Taro
Taro.init()
meta, txtdata = Taro.extract(files[1]);
as shown in
https://github.com/aviks/nlp-workshop/blob/master/NLP-in-julia.ipynb
Or can this library not be used instead of Taro (which I cannot compile on Julia 1.0.2)?
May need this additional API to enhance some tasks perceived in #39.
pdDocOpen() crashes with the following error:
ArgumentError: extra characters after whitespace in "1502\n6"
when called on a file created by LaTeX (attached).
LaTeX version: pdfTeX 3.14159265-2.6-1.40.16 (TeX Live 2015/Debian)
(KDE Neon / Ubuntu 16.04 based)
I will submit a PR to fix this soon.
outline.pdf
AbstractTrees provides standard BFS and DFS interfaces. These can help later in applying more esoteric node traversal functions.
Following the procedure for building the package on macOS 10.15, I get the following error:
(base) ~ $ mkdir test && cd test
(base) ~/test $ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.2.0 (2019-08-20)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |
(v1.2) pkg> activate .
Activating new environment at `~/test/Project.toml`
(test) pkg> add PDFIO
Updating registry at `~/.julia/registries/General`
Updating git-repo `https://github.com/JuliaRegistries/General.git`
Resolving package versions...
Updating `~/test/Project.toml`
[4d0d745f] + PDFIO v0.1.8
Updating `~/test/Manifest.toml`
[1520ce14] + AbstractTrees v0.2.1
[715cd884] + AdobeGlyphList v0.1.1
[9e28174c] + BinDeps v0.8.10
[34da2185] + Compat v2.2.0
[2e475f56] + LabelNumerals v0.1.0
[4d0d745f] + PDFIO v0.1.8
[27ebfcd6] + Primes v0.4.0
[9a9db56c] + Rectangle v0.1.2
[37834d88] + RomanNumerals v0.3.1
[30578b45] + URIParser v0.4.0
[2a0f44e3] + Base64
[ade2ca70] + Dates
[8bb1440f] + DelimitedFiles
[8ba89e20] + Distributed
[b77e0a4c] + InteractiveUtils
[76f85450] + LibGit2
[8f399da3] + Libdl
[37e2e46d] + LinearAlgebra
[56ddb016] + Logging
[d6f4376e] + Markdown
[a63ad114] + Mmap
[44cfe95a] + Pkg
[de0858da] + Printf
[3fa0cd96] + REPL
[9a3f8284] + Random
[ea8e919c] + SHA
[9e88b42a] + Serialization
[1a1011a3] + SharedArrays
[6462fe0b] + Sockets
[2f01184e] + SparseArrays
[10745b16] + Statistics
[8dfed614] + Test
[cf7118a7] + UUIDs
[4ec0a83e] + Unicode
(test) pkg> test PDFIO
Testing PDFIO
Resolving package versions...
Status `/var/folders/k3/hy22jxt17xd4hggsb2fqhs4m0000gn/T/jl_OC5Fsf/Manifest.toml`
[1520ce14] AbstractTrees v0.2.1
[715cd884] AdobeGlyphList v0.1.1
[9e28174c] BinDeps v0.8.10
[b99e7846] BinaryProvider v0.5.8
[34da2185] Compat v2.2.0
[2e475f56] LabelNumerals v0.1.0
[4d0d745f] PDFIO v0.1.8
[27ebfcd6] Primes v0.4.0
[9a9db56c] Rectangle v0.1.2
[37834d88] RomanNumerals v0.3.1
[30578b45] URIParser v0.4.0
[a5390f91] ZipFile v0.8.3
[2a0f44e3] Base64 [`@stdlib/Base64`]
[ade2ca70] Dates [`@stdlib/Dates`]
[8bb1440f] DelimitedFiles [`@stdlib/DelimitedFiles`]
[8ba89e20] Distributed [`@stdlib/Distributed`]
[b77e0a4c] InteractiveUtils [`@stdlib/InteractiveUtils`]
[76f85450] LibGit2 [`@stdlib/LibGit2`]
[8f399da3] Libdl [`@stdlib/Libdl`]
[37e2e46d] LinearAlgebra [`@stdlib/LinearAlgebra`]
[56ddb016] Logging [`@stdlib/Logging`]
[d6f4376e] Markdown [`@stdlib/Markdown`]
[a63ad114] Mmap [`@stdlib/Mmap`]
[44cfe95a] Pkg [`@stdlib/Pkg`]
[de0858da] Printf [`@stdlib/Printf`]
[3fa0cd96] REPL [`@stdlib/REPL`]
[9a3f8284] Random [`@stdlib/Random`]
[ea8e919c] SHA [`@stdlib/SHA`]
[9e88b42a] Serialization [`@stdlib/Serialization`]
[1a1011a3] SharedArrays [`@stdlib/SharedArrays`]
[6462fe0b] Sockets [`@stdlib/Sockets`]
[2f01184e] SparseArrays [`@stdlib/SparseArrays`]
[10745b16] Statistics [`@stdlib/Statistics`]
[8dfed614] Test [`@stdlib/Test`]
[cf7118a7] UUIDs [`@stdlib/UUIDs`]
[4ec0a83e] Unicode [`@stdlib/Unicode`]
ERROR: LoadError: LoadError: LoadError: PDFIO not properly installed. Please run Pkg.build("PDFIO")
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] top-level scope at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/LibCrypto.jl:1
[3] include at ./boot.jl:328 [inlined]
[4] include_relative(::Module, ::String) at ./loading.jl:1094
[5] include at ./Base.jl:31 [inlined]
[6] include(::String) at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/Common.jl:1
[7] top-level scope at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/Common.jl:8
[8] include at ./boot.jl:328 [inlined]
[9] include_relative(::Module, ::String) at ./loading.jl:1094
[10] include at ./Base.jl:31 [inlined]
[11] include(::String) at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/PDFIO.jl:3
[12] top-level scope at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/PDFIO.jl:5
[13] include at ./boot.jl:328 [inlined]
[14] include_relative(::Module, ::String) at ./loading.jl:1094
[15] include(::Module, ::String) at ./Base.jl:31
[16] top-level scope at none:2
[17] eval at ./boot.jl:330 [inlined]
[18] eval(::Expr) at ./client.jl:432
[19] top-level scope at ./none:3
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/LibCrypto.jl:1
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/Common.jl:8
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/src/PDFIO.jl:5
ERROR: LoadError: Failed to precompile PDFIO [4d0d745f-9d9a-592e-8d18-1ad8a0f42b92] to /Users/vlad/.julia/compiled/v1.2/PDFIO/cmOJE.ji.
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1253
[3] _require(::Base.PkgId) at ./loading.jl:1013
[4] require(::Base.PkgId) at ./loading.jl:911
[5] require(::Module, ::Symbol) at ./loading.jl:906
[6] include at ./boot.jl:328 [inlined]
[7] include_relative(::Module, ::String) at ./loading.jl:1094
[8] include(::Module, ::String) at ./Base.jl:31
[9] include(::String) at ./client.jl:431
[10] top-level scope at none:5
in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/test/runtests.jl:2
ERROR: Package PDFIO errored during testing
(test) pkg>
and this is what happens when I am trying to build the package:
(test) pkg> build PDFIO
  Building PDFIO → `~/.julia/packages/PDFIO/LF83Q/deps/build.log`
┌ Error: Error building `PDFIO`:
│
│ signal (6): Abort trap: 6
│ in expression starting at /Users/vlad/.julia/packages/PDFIO/LF83Q/deps/build.jl:76
│ __pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
│ Allocations: 10469954 (Pool: 10467735; Big: 2219); GC: 21
└ @ Pkg.Operations ~/julia/usr/share/julia/stdlib/v1.2/Pkg/src/backwards_compatible_isolation.jl:647
(test) pkg>
Does this error look familiar to anyone?
At least Symbol and ZapfDingbats should be supported
A Form XObject is PDF content embedded as a whole in the content of a PDF page. Such XObjects can also contain text and hence may be relevant to text extraction.
Content filter for JPEG and JPEG2000 should be supported.
Since these are special types of filters, it should be reviewed whether to decode them or to stream them directly into the graphics channel for rendering.
Ability to open and honor security enabled PDFs. The standard security handler is implemented already but PKI based handler needs to be implemented.
JuliaBinaryWrappers has binaries of Zlib and OpenSSL built-in. Instead of building them, it may be ideal to pick them up from pre-built binaries. That way the unnecessary build time can be reduced and it will be consistent with the pre-built binaries and thus consistent test experience. However, the minimal Julia release has to be 1.3.
Once Julia 1.3 is GA, this can be taken up.
Hi. While working on the Outline support implementation, I got the following error:
@test begin
    filename="files/1.pdf"
    DEBUG && println(filename)
    doc = pdDocOpen(filename)
    @assert length(pdDocGetPageRange(doc, "1")) >= 1
    pdDocClose(doc)
    length(utilPrintOpenFiles()) == 0
end
Fails with:
MethodError: no method matching get(::CosNullType, ::CosName)
I think pdDocGetPageRange should fall back to parsing the label as a number and return the appropriate page if there is no labels dictionary in the PDF, or at least return an empty vector.
I have a fix ready and would like to submit a PR.
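The fallback suggested here could look roughly like the sketch below. The function name and signature are hypothetical and do not reflect PDFIO's internals. It only covers the case where the document has no page labels, and it takes the page count as an argument instead of reading it from a document.

```julia
# Hypothetical fallback for documents without a page labels dictionary:
# treat the label as a plain page number, and return an empty vector
# when it is not a number or is out of range.
function page_range_without_labels(label::AbstractString, npages::Integer)
    n = tryparse(Int, strip(label))
    (n === nothing || n < 1 || n > npages) && return Int[]
    return [n]
end

first_page = page_range_without_labels("1", 10)
roman      = page_range_without_labels("iv", 10)
too_large  = page_range_without_labels("11", 10)
```

Returning an empty vector instead of throwing keeps the caller's length check meaningful, which matches the second option mentioned above.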
I have created a repository whose goal is to auto-generate bookmarks from the table of contents already available at the beginning of PDF files.
https://github.com/aminya/tocPDF
For now, I plan to start using available software (e.g k2pdfoptdoes), and then later make the functionality Julia native (when you add pdf write capability).
Current algorithm plan: https://github.com/aminya/tocPDF#automated
I looked at the PDFIO docs; however, they are long and cover many functions. Could you help me get started with PDFIO?
If anyone is interested in participating, that would be awesome. (@kskyten @sambitdash )