unidoc / unipdf
Golang PDF library for creating and processing PDF files (pure Go)
Home Page: https://unidoc.io
License: Other
Loading certain PDFs from a test corpus leads to the following error:
[TRACE] parser.go:1647 Trailer: Dict("Size": 53, "Info": Ref(29 0), "Encrypt": null, "Root": Ref(35 0), "Prev": 256729, "ID": [qXv�b, ��$6"h- %], )
[TRACE] parser.go:1681 Checking encryption dictionary!
[TRACE] parser.go:1686 Is encrypted!
[DEBUG] pdf_passthrough_bench.go:282 Reader create error unsupported type: *core.PdfObjectNull
/tmp/encrypt-dict-null/263071.pdf - fail unsupported type: *core.PdfObjectNull
263071.pdf 0.3 false 1.6 Error: unsupported type: *core.PdfObjectNull
Attachment: failing files
encrypt-dict-null.zip
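The failing files carry `/Encrypt null` in the trailer, which the reader currently rejects with `unsupported type: *core.PdfObjectNull`. A minimal sketch of the tolerant behavior, using hypothetical stand-in types rather than the actual unipdf parser internals: a null (or missing) Encrypt entry should simply mean "not encrypted".

```go
package main

import "fmt"

// Minimal stand-ins for the relevant core object kinds (hypothetical,
// not the real unipdf types).
type PdfObject interface{ isPdfObject() }

type PdfObjectNull struct{}
type PdfObjectDictionary struct{ m map[string]PdfObject }

func (PdfObjectNull) isPdfObject()        {}
func (*PdfObjectDictionary) isPdfObject() {}

// checkEncryption mirrors the trailer handling: a missing Encrypt entry
// and a null Encrypt entry both mean "not encrypted", rather than
// producing an "unsupported type" error.
func checkEncryption(trailer *PdfObjectDictionary) (bool, error) {
	obj, ok := trailer.m["Encrypt"]
	if !ok {
		return false, nil
	}
	switch obj.(type) {
	case PdfObjectNull:
		// Tolerate /Encrypt null (seen in the failing corpus files).
		return false, nil
	case *PdfObjectDictionary:
		return true, nil
	default:
		return false, fmt.Errorf("unsupported type: %T", obj)
	}
}

func main() {
	trailer := &PdfObjectDictionary{m: map[string]PdfObject{"Encrypt": PdfObjectNull{}}}
	enc, err := checkEncryption(trailer)
	fmt.Println(enc, err)
}
```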
Add support for decoding and encoding with the JBIG2 standard.
See section 7.4.7 JBIG2Decode Filter in the PDF reference (PDF32000_2008):
The JBIG2Decode filter decodes monochrome (1 bit per pixel) image data
that has been encoded using JBIG2 encoding.
The optional parameters for JBIG2Decode filter in PDF are:
See also Example 1 in the standard, which can be used as a test case.
I am currently not aware of any Go implementations of JBIG2. However, there are a few open-source implementations in other languages that might be a good reference.
PDF has a feature where streams can be encrypted with a crypt filter that is specified via DecodeParms, which refers to the /CF dictionary of the /Encrypt dictionary. If DecodeParms is missing, the Identity filter is used (raw data unchanged).
From 7.6.5 Crypt Filters (PDF32000_2008.PDF):
A stream filter type, the Crypt filter (see 7.4.10, "Crypt Filter") can be specified for any stream in the
document to override the default filter for streams.
For example, here is a case where /Crypt is specified and DecodeParms is missing (Identity filter), so the data is left intact. This seems to be used for metadata sometimes.
165 0 obj<</Length 3575/Filter[/Crypt]/Type/Metadata/Subtype/XML>>stream^M
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
Another example from PDF32000_2008.PDF (p. 77)
5 0 obj
<< /Title ($#*#%*$#^&##) >> % Info dictionary: encrypted text string
endobj
6 0 obj
<< /Type /Metadata
/Subtype /XML
/Length 15
/Filter [/Crypt] % Uses a crypt filter
/DecodeParms % with these parameters
<< /Type /CryptFilterDecodeParms
/Name /Identity % Indicates no encryption
>>
>>
stream
XML metadata % Unencrypted metadata
endstream
endobj
8 0 obj % Encryption dictionary
<< /Filter /MySecurityHandlerName
/V 4 % Version 4: allow crypt filters
/CF % List of crypt filters
<< /MyFilter0
<< /Type /CryptFilter
/CFM /V2 >> % Uses the standard algorithm
>>
/StrF /MyFilter0 % Strings are decrypted using /MyFilter0
/StmF /MyFilter0 % Streams are decrypted using /MyFilter0
... % Private data for /MySecurityHandlerName
/MyUnsecureKey (12345678)
/EncryptMetadata false
>>
endobj
Add support for the Crypt filter, handled in a similar fashion to the other stream filters in core/stream.go.
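For the /Identity case described above (no DecodeParms, or a crypt filter DecodeParms naming /Identity), the filter is a pure passthrough. A sketch of what such a handler could look like, with illustrative names rather than the final unipdf API:

```go
package main

import "fmt"

// CryptIdentityEncoder is a sketch of a Crypt filter handler for the
// /Identity case: per 7.6.5, the data passes through unchanged.
// Names are illustrative, not the actual unipdf API.
type CryptIdentityEncoder struct{}

func (CryptIdentityEncoder) GetFilterName() string { return "Crypt" }

// EncodeBytes leaves the data untouched for the Identity filter.
func (CryptIdentityEncoder) EncodeBytes(data []byte) ([]byte, error) {
	out := make([]byte, len(data))
	copy(out, data)
	return out, nil
}

// DecodeBytes is likewise a passthrough.
func (CryptIdentityEncoder) DecodeBytes(data []byte) ([]byte, error) {
	out := make([]byte, len(data))
	copy(out, data)
	return out, nil
}

func main() {
	enc := CryptIdentityEncoder{}
	in := []byte("<?xpacket begin='' ?>")
	out, _ := enc.DecodeBytes(in)
	fmt.Println(string(out) == string(in)) // passthrough leaves data intact
}
```

Non-Identity crypt filters would dispatch to the named entry of the /CF dictionary instead of passing through.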
Playing around with the compression, I found that in some cases the "optimized" images can get bigger than the original ones. In some cases this was, for example, because a CCITTFaxDecode-encoded image was converted to JPEG (DCT). That will be handled in a separate ticket. However, we also saw that a JPEG image with exactly the same parameters could become bigger after decoding and re-encoding.
In any case, it is always better to go back to the original object for the cases when the optimized image is bigger.
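The fallback is simple to express: compare the encoded sizes and keep the original whenever re-encoding did not shrink it. A sketch with a hypothetical helper, not existing unipdf API:

```go
package main

import "fmt"

// pickSmaller returns the original encoded image stream when the
// "optimized" version did not actually shrink it.
// Illustrative helper, not existing unipdf API.
func pickSmaller(original, optimized []byte) []byte {
	if len(optimized) >= len(original) {
		return original
	}
	return optimized
}

func main() {
	orig := make([]byte, 1000)
	opt := make([]byte, 1200) // re-encoding grew the stream
	fmt.Println(len(pickSmaller(orig, opt))) // falls back to the original
}
```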
Is there currently any support for embedded files?
From the looks of it, I would need to read in content streams not attached to a specific page.
Thanks
A simple way to build forms should be present in the creator package. Together with tables etc., it should provide an easy way to make PDF forms using the creator. Advanced support for forms exists in package model, although its use is pretty low level; it will serve as the basis for this.
Should support:
Basic API ideas:
c := creator.New()
form := c.NewForm() // Serves as the AcroForm for the document represented by `c`.
tf := form.NewTextField("field_name1") // occupies available width with some default height
tf.SetDefaultText("Enter your name")
// can force certain width with tf.SetWidth(100)
// tf.SetRelHeight(1.2) 1.2 X fontheight
c.Draw(tf)
cf := form.NextCheckboxField("checkbox_male")
cf.SetText("Male")
cf.SetChecked(false)
c.Draw(cf) // Draws at current position
cf = form.NextCheckboxField("checkbox_female")
cf.SetText("Female")
cf.SetChecked(false)
c.Draw(cf) // Draws at current position
The drawing does not involve adding anything to the page contentstream, but rather creating Fields in the AcroForms (that refer the page) and the actual content goes into widget annotations. The annotations should also be added to the page dictionaries Annots array.
For arranging fields it would in many cases make sense to place inside a Table to arrange nicely over the page.
The extractor ExtractText method doesn't seem to extract paragraphs or take titles into account. It would be great to have access to titles and paragraphs of text while being able to ignore headers and footers.
This JavaScript implementation of PDF to Markdown is actually quite good:
Is this possible using UniDoc?
Loading certain files leads to the following error:
[TRACE] reader.go:274 Outline root: Object stream 375: Dict("Filter": FlateDecode, "Length": Ref(376 0), )
[DEBUG] reader.go:246 ERROR: Failed to build outline tree (outline root should be an indirect object)
[DEBUG] pdf_passthrough_bench.go:282 Reader create error outline root should be an indirect object
/tmp/outline-root/005168.pdf - fail outline root should be an indirect object
005168.pdf 2.6 false 0.7 Error: outline root should be an indirect object
Attachments:
outline-root.zip
Reported by Peter Williams:
I get an invalid JPEG format: bad RST marker error on this
This looks like an error in the Go JPEG library.
The CCITTFaxDecode filter (and JBIG2, which is under development) is particularly well suited for encoding binary images (0/1). Images for optimization that are binary (DeviceGray and/or BitsPerComponent 1) should be encoded with the CCITTFaxDecode filter.
Problem file: 095121_v02.pdf
Options:
unioptimize.Options{
CombineDuplicateDirectObjects: true,
CombineIdenticalIndirectObjects: true,
ImageUpperPPI: 100.0,
CombineDuplicateStreams: true,
CompressStreams: true,
UseObjectStreams: true,
ImageQuality: 80,
}
Output PDF (1083 kB) is larger than the original PDF (672 kB) and looks more blurry. Needs some investigation. Could be related to BitsPerComponent, as this is a scanned image, probably with low bit depth.
Extracting 1-bit grayscale images from a PDF and using ToGoImage() results in 8-bit Go grayscale images with 0 for black (correct) but 1 for white (instead of 255). The rest of the extraction appears to be correct.
Code in ToGoImage():
if this.ColorComponents == 1 {
if this.BitsPerComponent == 16 {
val := uint16(samples[i])<<8 | uint16(samples[i+1])
c = gocolor.Gray16{val}
} else {
val := uint8(samples[i] & 0xff)
c = gocolor.Gray{val}
}
}
Switching val := uint8(samples[i] & 0xff) to val := uint8(samples[i] * uint32(256/this.BitsPerComponent - 1)) resolves the issue for 1-bit images, but this hasn't been tested for other cases.
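A general form of this fix: the scale factor should map the maximum sample value (2^bpc − 1) to 255, which also covers 2- and 4-bit samples, not just the 1-bit case. A sketch of the scaling, not the actual patch:

```go
package main

import "fmt"

// scaleTo8Bit maps a sample at the given bit depth onto the full
// 0-255 grayscale range: the maximum value (1<<bpc)-1 becomes 255.
func scaleTo8Bit(sample uint32, bpc uint) uint8 {
	maxVal := uint32(1<<bpc) - 1
	return uint8(sample * 255 / maxVal)
}

func main() {
	fmt.Println(scaleTo8Bit(0, 1), scaleTo8Bit(1, 1))  // 0 255
	fmt.Println(scaleTo8Bit(3, 2), scaleTo8Bit(15, 4)) // 255 255
}
```

For bpc = 8 the function is the identity (255·s/255 = s), so it would not change the currently working path.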
Make shallow/deep compare methods for dictionaries and indirect objects. This will help avoid unnecessary duplication of objects that are identical, for example when creating resource dictionaries, etc.
Images with only 2 values (min/max) are suitable for encoding with CCITTFaxDecode and JBIG2 (when encoding support is ready).
Add a flag BinaryImageOptimization = true by default, which looks at the image values and detects whether the image is suitable for encoding with CCITTFaxDecode. Note that this is slightly different from #428, as this involves looking at the values (color histogram); for example, a BitsPerComponent = 8 image using only color values 0/255 would be suitable for encoding as a binary image.
It might also make sense to define a threshold for the number of pixels allowed outside the min/max bins, so that a certain percentage of pixels could fall outside and would be interpolated to the closest bin. For example, if 99% of pixels fall into 0 or 255 and there are some pixels with values 1, 5, 9, 230, 150, 250, those would be interpolated to the closest values 0, 0, 0, 255, 255, 255 so that there are only 2 values prior to feeding the data to the CCITTFaxDecode algorithm. The threshold should be defined as BinaryImageOptimizationThreshold = 0.99 (float64).
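The histogram check described above could be sketched as follows; names are illustrative and the interpolation of outliers to the closest bin is left out:

```go
package main

import "fmt"

// isNearBinary reports whether at least a threshold fraction (e.g. 0.99)
// of the samples fall into the minimum or maximum histogram bins, i.e.
// whether the image is (almost) binary and suitable for CCITTFaxDecode.
func isNearBinary(samples []uint8, threshold float64) bool {
	if len(samples) == 0 {
		return false
	}
	minV, maxV := samples[0], samples[0]
	for _, s := range samples {
		if s < minV {
			minV = s
		}
		if s > maxV {
			maxV = s
		}
	}
	inBins := 0
	for _, s := range samples {
		if s == minV || s == maxV {
			inBins++
		}
	}
	return float64(inBins)/float64(len(samples)) >= threshold
}

func main() {
	samples := make([]uint8, 200)
	for i := range samples {
		if i%2 == 0 {
			samples[i] = 255
		}
	}
	samples[10], samples[20] = 100, 150 // a few stray gray pixels
	fmt.Println(isNearBinary(samples, 0.99))  // true: 198/200 in min/max bins
	fmt.Println(isNearBinary(samples, 0.999)) // false: outliers exceed budget
}
```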
This issue is a master issue/epic and can lead to subissues that will be referenced from here.
The extractor package will have the capability to extract vectorized text and objects (with position and dimensions).
Goal: Extract a list of graphics objects from each PDF page.
There are three types of graphics objects:
Each of these objects has a
This is not a rendering system but we hope to design it in a way that will allow it to be extended to become a renderer. Initial versions of the renderer could convert the lists of graphics objects to PDF or PostScript pages. This would provide closed-loop tests.
text: Text objects and operators. The text operators specify the glyphs to be painted, represented by string objects whose values shall be interpreted as sequences of character codes. A text object encloses a sequence of text operators and associated parameters. (page 237)
Paragraph fragments are the largest substrings in text paragraphs that are rendered contiguously on a PDF page. If a paragraph is split between pages or columns then the parts of the paragraph that appear at the end of the first page / column and the start of the second page / column are paragraph fragments. When a paragraph fits entirely within a single column and page, the entire paragraph is a paragraph fragment.
There are at least three levels of text objects, all of which are composed of lower level (lower numbered in the following list) objects.
Initially we will only concern ourselves with stroked and filled paths and ignore clipping paths.
// Path can define shapes, trajectories and regions of all sorts. Used to draw lines and define shapes of filled areas.
type Path struct {
segments []lineSegment
}
// Only export if deemed necessary for outside access.
// For connected subpaths (segments), the x1, y1 coordinate will start at x2, y2 coordinate of the previous segment.
type lineSegment struct {
isCurved bool // Bezier curve if true, otherwise line
x1, y1 float64
x2, y2 float64
cx, cy float64 // Control point (if curved)
isNoop bool // Path ended without filling/stroking.
isStroked bool
strokeColor model.PdfColor
isFilled bool
fillColor model.PdfColor
fillRule windingNumberRule
}
type windingNumberRule int
const (
nonZeroWindingNumberRule windingNumberRule = iota
evenOddWindingNumberRule
)
This should include inline images, XObject images, possibly some shadings etc. UniDoc already has a pretty good framework for this.
func (e *Extractor) GraphicsObjects() []GraphicsObject
type GraphicsObject interface {
// What do graphics objects have in common, or what common operations can be applied to them?
// Possibly make into a struct rather than an interface and convert to an interface if we think it makes sense.
}
func render(o GraphicsObject, gs GraphicsState)
The rendering would be over all graphics objects on a page in the order they occur. This would be driven by a single processor.AddHandler()
that could be configured to emit any combination of text, shape, and image objects.
func renderCore(doText, doShapes, doImages bool, render Renderer)
or rendering context/state rather than doX...
Potential use cases that should be possible to base on this implementation:
Going from the primitive contentstream operands to a higher level representation, there is a need to have a connection from the higher level representation to the lower level. For example if removing content, may need to filter on a higher level basis but have a connection down to the primitive operands to actually filter those out.
There may be a cascade/sequence of processing operations, initially on the primitive operands, for example grouping.
It should be clear whether those processes are lossy or lossless, where lossless would mean that they could reproduce the exact same operands as originally and same look. Lossy would mean that some aspect was lost, for example if grouping text together, character spacing/kerning info could be lost.
Preferably all processing would have the capability to be lossless, but it remains to be seen whether that is practical.
This should also include a test case that merges PDFs with forms and annotations, writes the result out, loads it, and checks that it is as expected.
Currently we have our core/...
packages that are truly core to unipdf: they can be imported anywhere and should not rely on any other package (except internal utility packages).
It defines all the primitive types:
core.PdfObject
core.PdfIndirectObject
core.PdfObjectDictionary
core.PdfObjectArray
Would it be nicer to have
core.Object
core.IndirectObject
core.Dictionary or pdfcore.Dict
core.Array
core.String
? or is that not specific enough, maybe
pdfcore.Object
pdfcore.Dictionary or pdfcore.Dict
pdfcore.Array
pdfcore.String
etc. ?
Similarly for model
package, there are some pretty lengthy names:
model.PdfPage
model.PdfPageResourcesColorspaces
model.PdfColorspaceDeviceNAttributes
Clearly the namespace in PDF models is pretty huge; however, it might be possible to improve here. What about
pdfmodel.Page
pdfmodel.ResourceColorspace
pdfmodel.ColorspaceDeviceNAttributes
or
pdf.Page
pdf.ResourceColorspace
pdf.ColorspaceDeviceNAttributes
Would be interesting to get some input on this. We are always looking for ways to improve the internals, although it can take time and changes would obviously not appear until a future major version.
Currently text extraction fails on some text using this font type. Need to add support for it to properly work for extraction.
To properly extract certain text in PDF, it may be necessary to detect/group lines and identify tables and equations. This may be done either after extraction of objects or before, depending on what is easier to implement and gives good results.
We also need to assemble a solid corpus for testing, as well as prototype an API. Tabular extraction may need a different approach than equations, and possibly a different API.
At this point we are collecting input so that we can define this issue better.
Create a model.PdfCatalog type
// PdfCatalog represents the root Catalog dictionary (section 7.7.2 p. 79)
type PdfCatalog struct {
Type *core.PdfObjectName
Version *core.PdfObjectName
etc...
}
Will make it easier to work with externally and simplify usage within the model package.
Currently the size of a fresh checkout of unidoc v3 is over 50 MB.
Go through each testdata file and determine if it needs to be in unidoc testing or can go into private testdata and be activated via environment variable.
The goal is to make the tests more clear and easier to read as well as improve coverage.
Steps:
Currently the unit test coverage is [v3]:
$ go test -cover .
ok github.com/unidoc/unidoc/pdf/core 0.214s coverage: 48.7% of statements
whereas with cross-package tests the coverage is ~63.85% according to codecov.io.
With the following code, I'm seeing about 40-45ms per "paragraph" addition:
p := creator.NewParagraph(content)
p.SetFont(fonts.NewFontCourier())
p.SetFontSize(fontSize)
p.SetPos(xPos, yPos)
err = c.Draw(p)
For building a page from small elements (50-100 elements) for multi-page documents, this can get very time-consuming. It looks as though the performance issue is the Draw() method:
Function | Execution Time |
---|---|
NewParagraph | 2.341µs |
SetFont/Pos | 455ns |
Draw | 38.043205ms |
Currently our support is limited to BitsPerComponent = 8 (BPC).
According to PDF32000_2008 the value for BitsPerComponent:
The number of bits used to represent each colour component in a sample. Valid values
are 1, 2, 4, 8, and (PDF 1.5) 16. Default value: 8.
In practice, BPC=8 is by far the most common. However, for completeness we need to support them all, and doing so will improve the code.
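Supporting all the allowed bit depths mostly comes down to unpacking the sample data. A sketch of the decoding side (illustrative only; real PDF image rows are padded to byte boundaries, which is ignored here):

```go
package main

import "fmt"

// unpackSamples expands packed image data into one value per sample for
// the bit depths PDF allows (1, 2, 4, 8, 16). Sub-byte samples are
// packed high-bit first within each byte; 16-bit samples are big-endian.
func unpackSamples(data []byte, bpc int) []uint32 {
	var out []uint32
	switch bpc {
	case 1, 2, 4:
		perByte := 8 / bpc
		mask := byte(1<<bpc) - 1
		for _, b := range data {
			for i := 0; i < perByte; i++ {
				shift := uint(8 - bpc*(i+1))
				out = append(out, uint32((b>>shift)&mask))
			}
		}
	case 8:
		for _, b := range data {
			out = append(out, uint32(b))
		}
	case 16:
		for i := 0; i+1 < len(data); i += 2 {
			out = append(out, uint32(data[i])<<8|uint32(data[i+1]))
		}
	}
	return out
}

func main() {
	fmt.Println(unpackSamples([]byte{0xB0}, 1))        // [1 0 1 1 0 0 0 0]
	fmt.Println(unpackSamples([]byte{0xAB, 0xCD}, 16)) // [43981]
}
```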
Add support for decoding and encoding with the JPEG2000 standard.
See section 7.4.9 JPXDecode Filter (PDF32000_2008):
The JPXDecode filter decodes data that has been encoded using the JPEG2000 compression
method, an ISO standard for the compression and packaging of image data.
I am currently not aware of any Go implementations of JPEG2000. However, there are a few open-source implementations in other languages that might be a good reference in addition to the standard.
The string type is really not the right one for the writer, and the eventual output is always []byte. It would make sense to change the signature to Write(io.Reader) instead of WriteString. That also makes the interface easy to use for any kind of testing, and may improve performance.
Should test:
Include cases with 1,3,4 color components.
Can assume BitsPerComponent = 8 for now. Support for more BPC will be included in a separate ticket which will include making tests.
No errors in ghostscript. Needs investigation.
Currently, when loading a page that has page.Rotate with an angle of 90, 180, or 270 (not 0), the page contents are loaded into the block and the rotation is not accounted for.
Thus, when drawing the block onto a page, things will normally appear incorrect, i.e. rotated.
Current behavior: when loading a page, NewBlockFromPage does not account for the page's Rotate flag.
Expected behavior: NewBlockFromPage accounts for the page's Rotate flag and rotates the contents accordingly. The page size is also adjusted based on rotation; for example, at 90 degrees the original width and height are swapped.
When the block is added to a new page, the contents appear in the right orientation (although the new page's Rotate flag is not set, corresponding to 0).
Proposal: change NewBlockFromPage to rotate the contents based on the Rotate flag and take the rotation into account in the block size.
The changes are non-breaking from a programmatic standpoint, but can break code that already performs this rotation manually; that is easy to account for, though one should be aware of it.
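The size adjustment part is straightforward to sketch (hypothetical helper, not the actual creator API): for 90 or 270 degrees the block's width and height swap, otherwise they stay as-is.

```go
package main

import "fmt"

// rotatedBlockSize returns the block dimensions NewBlockFromPage would
// need after honoring the page's /Rotate flag: for 90 or 270 degrees
// the width and height are swapped. Illustrative helper only; the
// content transformation matrix is a separate step.
func rotatedBlockSize(w, h float64, rotate int64) (float64, float64) {
	// Normalize to 0, 90, 180, or 270 (handles negative angles too).
	r := ((rotate % 360) + 360) % 360
	if r == 90 || r == 270 {
		return h, w
	}
	return w, h
}

func main() {
	w, h := rotatedBlockSize(612, 792, 90)
	fmt.Println(w, h) // 792 612: landscape after rotating a portrait page
}
```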
Issues with certain test files:
000008.pdf Black outline box appears around "Mar 27, 2017, 2:50 pm AEDT"
000011.pdf Light text columns on page 2 have become dark e.g. "Technical Support"
000023.pdf Black box appears at top left of page 1
000058.orig.pdf Black background appears below "A marketing email ... "
000040.pdf Blue text has same gray level as black text
000050.pdf Blend on left of page 1 is not printed
Based on discussion in #441. The proposal is to create a buffered reader type which encapsulates a ReadSeeker and a buffered Reader (bufio.Reader), rather than accessing and working with both separately.
The suggested type is something as follows:
// bufferedReadSeeker offers buffered read access to a seekable source (such as a file) with seek capabilities.
type bufferedReadSeeker struct {
rs io.ReadSeeker
reader *bufio.Reader
}
// Implement the Read and Seek methods.
The reason for using the bufio.Reader is purely performance during parsing, as it has a buffer. Any time we change the offset position of rs, a new reader must be constructed with a new buffer (using the buffered information to correct for offsets, as done in parser.GetPosition).
Currently text extraction fails on some text using this font type. Need to add support for it to properly work for extraction.
@peterwilliams97 Can you provide a specific example of something that does not work? Would be good to have a snippet from the content stream
Currently, font files are embedded in their entirety. This can be somewhat wasteful, as often only a small portion of the glyphs are used, and font files can be large, especially for Unicode fonts with large numbers of glyphs.
There are two use cases:
Those two cases may require slightly different approaches to be done efficiently. So it is probably best to keep them separate. Here we will focus on the first use case (for creating PDFs).
This requires:
fnt, _ := NewCompositePdfFontFromTTFFile("largefnt.ttf")
fnt.Subset(true) // Marks font for subsetting on write
// then use fnt as normally.
// Each call to the font's encoder Encode will record use of glyph to be used.
Significantly smaller generated PDF files using TTF fonts.
Section 9.6.4 Font Subsets (PDF32000_2008):
PDF documents may include subsets of Type 1 and TrueType fonts. The font and font descriptor
that describe a font subset are slightly different from those of ordinary fonts. These differences
allow a conforming reader to recognize font subsets and to merge documents containing different
subsets of the same font. (For more information on font descriptors, see 9.8, "Font Descriptors".)
For a font subset, the PostScript name of the font—the value of the font’s BaseFont entry and the
font descriptor’s FontName entry— shall begin with a tag followed by a plus sign (+). The tag shall
consist of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the
same PDF file shall have different tags.
EXAMPLE EOODIA+Poetica is the name of a subset of Poetica®, a Type 1 font
And in section 9.9 (Embedded Font Programs) it states:
A TrueType font program may be used as part of either a font or a CIDFont. Although the basic
font file format is the same in both cases, there are different requirements for what information
shall be present in the font program. These TrueType tables shall always be present if present in
the original TrueType font program:
“head”, “hhea”, “loca”, “maxp”, “cvt”, “prep”, “glyf”, “hmtx”, and “fpgm”.
If used with a simple font dictionary, the
font program shall additionally contain a cmap table defining one or more encodings,
as discussed in 9.6.6.4, "Encodings for TrueType Fonts". If used with a CIDFont dictionary,
the cmap table is not needed and shall not be present, since the mapping from character codes
to glyph descriptions is provided separately.
Section 9.6.6.4 (Encodings for TrueType fonts) additionally describes how TrueType cmaps and font dictionary's Encoding are used to map between character codes and glyph descriptions.
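The glyph-usage recording mentioned for the encoder above could be sketched as follows (illustrative names, not the actual unipdf encoder API): each Encode call records which glyphs were used, and at write time only those glyphs need to be embedded.

```go
package main

import "fmt"

// subsetRecorder records which glyphs (represented here as runes) the
// encoder has seen, so only those need to be embedded at write time.
// Illustrative of the proposed fnt.Subset(true) behavior.
type subsetRecorder struct {
	used map[rune]struct{}
}

func newSubsetRecorder() *subsetRecorder {
	return &subsetRecorder{used: map[rune]struct{}{}}
}

// Encode would normally produce the PDF string bytes; here it only
// records glyph usage as the side effect relevant to subsetting.
func (s *subsetRecorder) Encode(text string) {
	for _, r := range text {
		s.used[r] = struct{}{}
	}
}

func main() {
	rec := newSubsetRecorder()
	rec.Encode("Hello")
	fmt.Println(len(rec.used)) // 4 distinct runes: H e l o
}
```

The writer would then combine this set with the six-uppercase-letter subset tag (e.g. ABCDEF+FontName) required by 9.6.4 when emitting the font descriptor.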
Problem file: TheCaseStudyMethod.pdf
Options:
unioptimize.Options{
CombineDuplicateDirectObjects: true,
CombineIdenticalIndirectObjects: true,
ImageUpperPPI: 100.0,
CombineDuplicateStreams: true,
CompressStreams: true,
UseObjectStreams: true,
ImageQuality: 80,
}
When optimizing, we get:
[ERROR] encoding.go:1855 Unsupported filter CCITTFaxDecode
[ERROR] stream.go:42 Failed creating multi encoder: invalid filter in multi filter array
It seems the CCITTFaxDecode filter is not ignored during compression when it is inside a multi-filter encoder.
Hello to all, I am trying to extract certain text from a PDF page and write it back to a new PDF document.
I am experimenting with the v3 branch, which has a new API to extract vectorized text from the page.
In order to return all the text marks (which are private in the current version of the API), I created a new convenience struct returned from a new getter in the PageText struct.
The text extraction works well; the problem is that I am unable to set the font for a new paragraph element (created by iterating over the returned marks) to the same font as the original.
I also tried adding the font to the page first and then to the paragraph, but I only get errors, such as:
[DEBUG] simple.go:56 ERROR: NewSimpleTextEncoder. Unknown encoding "default"
[DEBUG] simple.go:56 ERROR: NewSimpleTextEncoder. Unknown encoding "custom"
error is: unsupported font encoding
or, with another PDF files:
[DEBUG] ttfparser.go:527 parseCmapVersion: format=0 length=262 language=0
[DEBUG] ttfparser.go:732 No PostScript name information is provided for the font.
This is the test code I am using: Gist
You can pull the library with changes here: Repo Link
I attach the PDF files I'm testing on.
Thank you in advance!
newspaper.pdf
See section 8.6.5.5 ICCBased Colour Spaces (PDF32000_2008):
ICCBased colour spaces shall be based on a cross-platform colour profile as defined by the
International Color Consortium (ICC)... an ICCBased colour space shall be characterized by
a sequence of bytes in a standard format. Details of the profile format can be found in the
ICC specification.
There are multiple versions of the ICC specification that are supported in PDF as
shown in Table 67 (PDF32000_2008). However, it also says that a conforming reader should support ICC.1:2004:10 as required by PDF 1.7, which will enable it to properly render all embedded ICC profiles regardless of PDF version.
Does the library support the primitives required to implement some kind of redaction function? I'm figuring it would require:
Implement a renderer for PDF pages which can be used to render pages to images.
Can be implemented in a few steps/milestones:
The final step will be the most challenging, however, we are already building a strong foundation for font and text support which makes it possible.
Prototype code for rendering images/shapes exists in: https://github.com/unidoc/unipdf-examples/blob/v3-render-support/pdf/render/pdf_render.go
The rendering should be implemented as a package (renderer) inside unipdf. The prototype code could be used as a base and refactored into a package.
For rendering text it might make sense to start by using fonts that are available on the system or fixed local fonts. Typically PDF viewers rely on the system fonts, as well as loading embedded fonts.
Hey guys, really enjoying using unidoc, just spotted something (and apologies if this is an oversight)
The pdfReader.Inspect method doesn't appear to pull the JavaScript out of the following file:
%PDF-1.7
4 0 obj
<<
/Length 0
>>
stream
endstream endobj
5 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources 3 0 R
/Contents 4 0 R
/MediaBox [ 0 0 612 792 ]
>>
endobj
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
/OpenAction [ 5 0 R /Fit ]
/Names << % the Javascript entry
/JavaScript <<
/Names [
(EmbeddedJS)
<<
/S /JavaScript
/JS (
app.alert('Hello, World!');
)
>>
]
>>
>> % end of the javascript entry
>>
endobj
2 0 obj
<<
/Type /Pages
/Count 1
/Kids [ 5 0 R ]
>>
endobj
3 0 obj
<<
>>
endobj
xref
0 6
0000000000 65535 f
0000000166 00000 n
0000000244 00000 n
0000000305 00000 n
0000000009 00000 n
0000000058 00000 n
trailer <<
/Size 6
/Root 1 0 R
>>
startxref
327
%%EOF
It correctly identifies the number of pages etc., it just doesn't pick up the JS around line 25.
Additional question: do you guys support stripping this kind of stuff from the file? The example does a great job of explaining how to find JS/Flash/Video in a PDF, but not how to remove it (if possible).
Thanks!
The current implementation of PDF compression has some issues, in particular with image handling.
We need to test on some actual PDF files and check whether the output is as expected. Checking errors, file size, and comparing pages in a PDF viewer.
For any bug that comes up we need to set the PDF aside and create a ticket (unless we can fix the problem easily right away).
For implementation: can use the identity/passthrough benchmark as a basis (pdf_passthrough_bench.go) and add a compression flag:
https://github.com/unidoc/unidoc-examples/blob/v3/pdf/testing/pdf_passthrough_bench.go
Something like:
if params.optimize {
optim := optimize.New(optimize.Options{
CombineDuplicateDirectObjects: true,
CombineIdenticalIndirectObjects: true,
ImageUpperPPI: 100.0,
UseObjectStreams: true,
ImageQuality: 50,
CombineDuplicateStreams: true,
})
writer.SetOptimizer(optim)
}
Clearly we need to try changing the parameters and see if we can find more bugs.
Currently the colorspace handling only supports DeviceGray and DeviceRGB, and the handling is simplistic: it only loops through the images in XObject and compresses all of them. If an image is never used in the content stream, it would still not be removed, for example.
This also means that inline images are not handled.
The handling should be made more generic and use the ContentStreamProcessor to process the contents. The colorspace handling should also be more generic and fall back to alternative colorspaces in cases where one is not properly supported. The handling should be similar to, for example:
https://github.com/unidoc/unidoc-examples/blob/v3/pdf/advanced/pdf_grayscale_transform.go
although we can ignore handling of patterns and shadings at the moment.
Take care to remove resources that are not used. Perhaps that should be its own optimization as it makes sense to remove images, fonts, colorspaces or any other resources that are not actually used.
Overview of long examples with potential enhancements:
Example | #Lines | Comment |
---|---|---|
pages/pdf_merge_advanced.go | 334 | Advanced merging - done in unicli/pdf |
page/pdf_list_images.go | 253 | Reimplement with extractor |
image/pdf_extract_images.go | 253 | Reimplement with extractor |
forms/pdf_form_list_fields.go | 198 | High level interface - fjson/extractor/form |
analysis/pdf_fonts.go | 189 | Extractor should have capability to extract fonts (page basis) |
metadata/pdf_metadata_get_xml.go | 181 | High level metadata interface |
barcode/pdf_add_barcode.go | 173 | Could be easier to add an image to a PDF page of existing document? |
forms/pdf_form_add.go | 163 | High level forms interface (creator/form/fjson) |
Currently it isn't possible to parse/edit/write PDFs if their size approaches the available memory on the computer. Unidoc parses the object tree and stores it in memory. Based on the architecture of a PDF file, we should be able to parse (and, for example, extract text from) arbitrarily large PDFs that greatly exceed the memory capacity of the server. #128 could resolve a lot of this but would require pages/objects to be able to be freed from memory when no longer needed.
Additionally when writing large PDFs there would need to be something implemented to either cache objects to disk before writing the completed PDF or stream the PDF writing while each page is completed and releasing the memory back. This would get tricky with shared objects.
Is there a plan to support converting PDF to images?
We have code to scale down PDFs to A4 documents and center them, as part of this we want to output a portrait PDF. If the input PDF was landscape this means they end up much smaller so we want to make sure they are 90 degrees rotate so they benefit from having as much space as possible.
In the current API it rotates from some sort of origin the library defaults to; we looked through the code and it looks like the only way to control this is to hack it by doing some translation before rotating.
I noticed you have a TODO to control the origin, which we would be interested in, as we want to be able to control it so we can simply rotate by 90 degrees and center it. (The centering code completely breaks at the moment as it assumes the origin is in the middle.)