uglytoad / pdfpig

Read and extract text and other content from PDFs in C# (port of PDFBox)

Home Page: https://github.com/UglyToad/PdfPig/wiki

License: Apache License 2.0

pdfpig's Introduction

PdfPig

This project allows users to read and extract text and other content from PDF files. In addition the library can be used to create simple PDF documents containing text and geometrical shapes.

This project aims to port PDFBox to C#.

Migrating to 0.1.6 from 0.1.x? Use this guide: migration to 0.1.6.

Wiki

Check out our wiki for more examples and detailed guides on the API.

Installation

The package is available via the releases tab or from NuGet:

https://www.nuget.org/packages/PdfPig/

Or from the package manager console:

> Install-Package PdfPig

While the version is below 1.0.0 minor versions will change the public API without warning (SemVer will not be followed until 1.0.0 is reached).

Get Started

See the wiki for more examples.

Read words in a page

The simplest usage at this stage is to open a document, reading the words from every page:

using (PdfDocument document = PdfDocument.Open(@"C:\Documents\document.pdf"))
{
	foreach (Page page in document.GetPages())
	{
		string pageText = page.Text;

		foreach (Word word in page.GetWords())
		{
			Console.WriteLine(word.Text);
		}
	}
}

An example of the output of this is shown below:

Image shows three words 'Write something in' in 2 sections, the top section is the normal PDF output, the bottom section is the same text with 3 word bounding boxes in pink and letter bounding boxes in blue-green

For the PDF text ("Write something in") shown at the top, the 3 words (in pink) are detected, and each word contains the individual letters with their glyph bounding boxes.

Create PDF Document

To create documents use the class PdfDocumentBuilder. The Standard 14 fonts provide a quick way to get started:

PdfDocumentBuilder builder = new PdfDocumentBuilder();

PdfPageBuilder page = builder.AddPage(PageSize.A4);

// Fonts must be registered with the document builder prior to use to prevent duplication.
PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);

page.AddText("Hello World!", 12, new PdfPoint(25, 700), font);

byte[] documentBytes = builder.Build();

File.WriteAllBytes(@"C:\git\newPdf.pdf", documentBytes);

The output is a 1 page PDF document with the text "Hello World!" in Helvetica near the top of the page:

Image shows a PDF document in Google Chrome's PDF viewer. The text "Hello World!" is visible

Each font must be registered with the PdfDocumentBuilder prior to use, which enables pages to share the font resources. Only Standard 14 fonts and TrueType fonts (.ttf) are supported.

Advanced Document Extraction

In this example a more advanced document extraction is performed. PdfDocumentBuilder is used to create a copy of the PDF with debug information (bounding boxes and reading order) added.

using System.IO;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.ReadingOrderDetector;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;
using UglyToad.PdfPig.Fonts.Standard14Fonts;
using UglyToad.PdfPig.Writer;

var sourcePdfPath = "";
var outputPath = "";
var pageNumber = 1;
using (var document = PdfDocument.Open(sourcePdfPath))
{
    var builder = new PdfDocumentBuilder { };
    PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);
    var pageBuilder = builder.AddPage(document, pageNumber);
    pageBuilder.SetStrokeColor(0, 255, 0);
    var page = document.GetPage(pageNumber);

    var letters = page.Letters; // no preprocessing

    // 1. Extract words
    var wordExtractor = NearestNeighbourWordExtractor.Instance;

    var words = wordExtractor.GetWords(letters);

    // 2. Segment page
    var pageSegmenter = DocstrumBoundingBoxes.Instance;

    var textBlocks = pageSegmenter.GetBlocks(words);

    // 3. Postprocessing
    var readingOrder = UnsupervisedReadingOrderDetector.Instance;
    var orderedTextBlocks = readingOrder.Get(textBlocks);

    // 4. Add debug info - Bounding boxes and reading order
    foreach (var block in orderedTextBlocks)
    {
        var bbox = block.BoundingBox;
        pageBuilder.DrawRectangle(bbox.BottomLeft, bbox.Width, bbox.Height);
        pageBuilder.AddText(block.ReadingOrder.ToString(), 8, bbox.TopLeft, font);
    }

    // 5. Write result to a file
    byte[] fileBytes = builder.Build();
    File.WriteAllBytes(outputPath, fileBytes); // save to file
}

Image shows a PDF document created by the above code block with the bounding boxes and reading order of the words displayed

See Document Layout Analysis for more information on advanced document analysis.

See Export for more advanced tooling to analyse document layouts.

Usage

PdfDocument

The PdfDocument class provides access to the contents of a document loaded either from file or passed in as bytes. To open from a file use the PdfDocument.Open static method:

using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

using (PdfDocument document = PdfDocument.Open(@"C:\my-file.pdf"))
{
	int pageCount = document.NumberOfPages;

	// Page number starts from 1, not 0.
	Page page = document.GetPage(1);

	decimal widthInPoints = page.Width;
	decimal heightInPoints = page.Height;

	string text = page.Text;
}

PdfDocument should only be used in a using statement since it implements IDisposable (unless the consumer disposes of it elsewhere).

Encrypted documents can be opened by PdfPig. To supply an owner or user password, pass the optional ParsingOptions to Open with the Password property set. For example:

using (PdfDocument document = PdfDocument.Open(@"C:\my-file.pdf",  new ParsingOptions { Password = "password here" }))

You can also provide a list of passwords to try:

using (PdfDocument document = PdfDocument.Open(@"C:\file.pdf", new ParsingOptions
{
	Passwords = new List<string> { "One", "Two" }
}))

The document contains the version of the PDF specification it complies with, accessed by document.Version:

decimal version = document.Version;

Document Creation (0.0.5)

The PdfDocumentBuilder creates a new document with no pages or content.

For text content, a font must be registered with the builder. This library supports Standard 14 fonts provided by Adobe by default and TrueType format fonts.

To add a Standard 14 font use:

public AddedFont AddStandard14Font(Standard14Font type)

Or for a TrueType font use:

AddedFont AddTrueTypeFont(IReadOnlyList<byte> fontFileBytes)

Passing in the bytes of a TrueType file (.ttf). You can check the suitability of a TrueType file for embedding in a PDF document using:

bool CanUseTrueTypeFont(IReadOnlyList<byte> fontFileBytes, out IReadOnlyList<string> reasons)

Which provides a list of reasons why the font cannot be used if the check fails. You should check the license for a TrueType font prior to use, since the compressed font file is embedded in, and distributed with, the resultant document.

The AddedFont class represents a key to the font stored on the document builder. This must be provided when adding text content to pages. To add a page to a document use:

PdfPageBuilder AddPage(PageSize size, bool isPortrait = true)

This creates a new PdfPageBuilder with the specified size. The first added page is page number 1, then 2, then 3, etc. The page builder supports adding text, drawing lines and rectangles and measuring the size of text prior to drawing.

To draw lines and rectangles use the methods:

void DrawLine(PdfPoint from, PdfPoint to, decimal lineWidth = 1)
void DrawRectangle(PdfPoint position, decimal width, decimal height, decimal lineWidth = 1)

The line width can be varied and defaults to 1. Rectangles are unfilled and the fill color cannot be changed at present.

To write text to the page you must have a reference to an AddedFont from the methods on PdfDocumentBuilder as described above. You can then draw the text to the page using:

IReadOnlyList<Letter> AddText(string text, decimal fontSize, PdfPoint position, PdfDocumentBuilder.AddedFont font)

Where position is the baseline of the text to draw. Currently only ASCII text is supported. You can also measure the resulting size of text prior to drawing using the method:

IReadOnlyList<Letter> MeasureText(string text, decimal fontSize, PdfPoint position, PdfDocumentBuilder.AddedFont font)

Which does not change the state of the page, unlike AddText.

Changing the RGB color of text, lines and rectangles is supported using:

void SetStrokeColor(byte r, byte g, byte b)
void SetTextAndFillColor(byte r, byte g, byte b)

Which take RGB values between 0 and 255. The color will remain active for all operations called after these methods until reset is called using:

void ResetColor()

Which resets the color for stroke, fill and text drawing to black.

Document Information

The PdfDocument provides access to the document metadata as DocumentInformation defined in the PDF file. These entries are often not provided, so most of them will be null:

PdfDocument document = PdfDocument.Open(fileName);

// The name of the program used to convert this document to PDF.
string producer = document.Information.Producer;

// The title given to the document
string title = document.Information.Title;
// etc...

Document Structure (0.0.3)

The document now has a Structure member:

UglyToad.PdfPig.Structure structure = document.Structure;

This provides access to tokenized PDF document content:

Catalog catalog = structure.Catalog;
DictionaryToken pagesDictionary = catalog.PagesDictionary;

The pages dictionary is the root of the pages tree within a PDF document. The structure also exposes a GetObject(IndirectReference reference) method which allows random access to any object in the PDF as long as its identifier number is known. This is an identifier of the form 69 0 R where 69 is the object number and 0 is the generation.
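
As a minimal sketch of how the textual form maps onto the two numbers, the snippet below splits a reference such as 69 0 R into its object number and generation. ReferenceText and Parse are hypothetical names introduced for illustration and are not part of PdfPig; the resulting pair would still need to be wrapped in an IndirectReference before being passed to GetObject.

```csharp
using System;

// Split the textual reference from the paragraph above into its two numbers.
var (objectNumber, generation) = ReferenceText.Parse("69 0 R");
Console.WriteLine($"object {objectNumber}, generation {generation}");
// prints "object 69, generation 0"

// Hypothetical helper, not part of PdfPig.
static class ReferenceText
{
    public static (long ObjectNumber, int Generation) Parse(string text)
    {
        // Expected shape: "<object number> <generation> R".
        string[] parts = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);

        if (parts.Length != 3 || parts[2] != "R")
        {
            throw new FormatException($"Not an indirect reference: '{text}'.");
        }

        return (long.Parse(parts[0]), int.Parse(parts[1]));
    }
}
```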

Page

The Page contains the page width and height in points as well as mapping to the PageSize enum:

PageSize size = page.Size;

bool isA4 = size == PageSize.A4;

Page provides access to the text of the page:

string text = page.Text;

There is a new (0.0.3) method which provides access to the words. This uses basic heuristics and is not reliable or well-tested:

IEnumerable<Word> words = page.GetWords();

You can also (0.0.6) access the raw operations used in the page's content stream for drawing graphics and content on the page:

IReadOnlyList<IGraphicsStateOperation> operations = page.Operations;

Consult the PDF specification for the meaning of individual operators.

There is also an early access (0.0.3) API for retrieving the raw bytes of PDF image objects per page:

IEnumerable<XObjectImage> images = page.ExperimentalAccess.GetRawImages();

This API will be changed in future releases.

Letter

Due to the way a PDF is structured internally the page text may not be a readable representation of the text as it appears in the document. Since PDF is a presentation format, text can be drawn in any order, not necessarily reading order. This means spaces may be missing or words may be in unexpected positions in the text.

To help users resolve actual text order on the page, the Page class provides access to a list of the letters:

IReadOnlyList<Letter> letters = page.Letters;

These letters contain:

  • The text of the letter: letter.Value.
  • The location of the lower left of the letter: letter.Location.
  • The width of the letter: letter.Width.
  • The font size in unscaled relative text units (these sizes are internal to the PDF and do not correspond to sizes in pixels, points or other units): letter.FontSize.
  • The name of the font used to render the letter if available: letter.FontName.
  • A rectangle which is the smallest rectangle that completely contains the visible region of the letter/glyph: letter.GlyphRectangle.
  • The points at the start and end of the baseline StartBaseLine and EndBaseLine which indicate if the letter is rotated. The TextDirection indicates if this is a commonly used rotation or a custom rotation.

Letter position is measured in PDF coordinates where the origin is the lower left corner of the page. Therefore a higher Y value means closer to the top of the page.
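
Because most image and UI libraries place the origin at the top left instead, a letter position usually has to be flipped vertically before it can be drawn on a rendered page. The following is a rough sketch of that conversion, assuming a known page height in points and a uniform zoom factor; PdfToImage and ToTopLeftPixels are hypothetical names and not part of PdfPig.

```csharp
using System;

double pageHeight = 842; // assumed page height in points (roughly A4)
double zoom = 2.0;       // assumed scale: pixels per point

// A point 700 points above the bottom edge sits 142 points below the top edge.
var (px, py) = PdfToImage.ToTopLeftPixels(25, 700, pageHeight, zoom);
Console.WriteLine($"{px}, {py}"); // prints "50, 284"

// Hypothetical helper, not part of PdfPig.
static class PdfToImage
{
    // Converts a PDF point (origin bottom left, Y grows upwards)
    // to image pixels (origin top left, Y grows downwards).
    public static (double X, double Y) ToTopLeftPixels(
        double x, double y, double pageHeight, double zoom)
    {
        return (x * zoom, (pageHeight - y) * zoom);
    }
}
```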

Annotations (0.0.5)

Early support for retrieving annotations on each page is provided using the method:

page.ExperimentalAccess.GetAnnotations()

This call is not cached and the document must not have been disposed prior to use. The annotations API may change in future.

Bookmarks (0.0.10)

The bookmarks (outlines) of a document may be retrieved at the document level:

bool hasBookmarks = document.TryGetBookmarks(out Bookmarks bookmarks);

This will return false if the document does not define any bookmarks.

Forms (0.0.10)

Form fields for interactive forms (AcroForms) can be retrieved using:

bool hasForm = document.TryGetForm(out AcroForm form);

This will return false if the document does not contain a form.

The fields can be accessed using the AcroForm's Fields property. Since the form is defined at the document level this will return fields from all pages in the document. Fields are of the types defined by the enum AcroFieldType, for example PushButton, Checkbox, Text, etc.

Please note the forms are read-only and values cannot be changed or added using PdfPig.

Hyperlinks (0.1.0)

A page has a method to extract hyperlinks (annotations of link type):

IReadOnlyList<UglyToad.PdfPig.Content.Hyperlink> hyperlinks = page.GetHyperlinks();

TrueType (0.1.0)

The classes used to work with TrueType fonts in the PDF file are now available for public consumption. Given an input file:

using UglyToad.PdfPig.Fonts.TrueType;
using UglyToad.PdfPig.Fonts.TrueType.Parser;

byte[] fontBytes = System.IO.File.ReadAllBytes(@"C:\font.ttf");
TrueTypeDataBytes input = new TrueTypeDataBytes(fontBytes);
TrueTypeFont font = TrueTypeFontParser.Parse(input);

The parsed font can then be inspected.

Embedded Files (0.1.0)

PDF files may contain other files entirely embedded inside them for document annotations. The list of embedded files and their byte content may be accessed:

if (document.Advanced.TryGetEmbeddedFiles(out IReadOnlyList<EmbeddedFile> files)
    && files.Count > 0)
{
    var firstFile = files[0];
    string name = firstFile.Name;
    IReadOnlyList<byte> bytes = firstFile.Bytes;
}

Merging (0.1.2)

You can merge 2 or more existing PDF files using the PdfMerger class:

var resultFileBytes = PdfMerger.Merge(filePath1, filePath2);
File.WriteAllBytes(@"C:\pdfs\outputfilename.pdf", resultFileBytes);

API Reference

If you wish to generate doxygen documentation, run doxygen doxygen-docs and open docs/doxygen/html/index.html.

See also the wiki for detailed documentation on parts of the API.

Issues

Please do file an issue if you encounter a bug.

However, in order for us to assist you, you must provide the file which causes your issue. Please host this in a publicly available place.

Credit

This project wouldn't be possible without the work done by the PDFBox team and the Apache Foundation.

pdfpig's People

Contributors

bobld, davebrokit, davmarksman, eliotjones, fnatzke, giovanninova, grinay, huzhiguan, iamcarbon, inusualz, jonowa, jot85, kapiosk, kasperdaff, listm, michaelschnyder, modest-as, mvantzet, numpsy, otuncelli, plaisted, plaisted-work, pme8hw0krfqa, poltuu, sbruyere, theolivenbaum, thinkbeforecoding, vadik299, yufeih, zlangner

pdfpig's Issues

Inspect Type 1 glyph positions and locations

In the visual verification test for the LaTeX integration test document, glyph bounding boxes for Type 1 fonts appear to be roughly the right shape but are drawn in the wrong position and at the wrong scale.

image

It's possible this is down to not using the right font matrix for Type 1 fonts or something else entirely. Add a test or tests which assert against glyph positions from a 3rd party tool similar to the SinglePageNonLatinAcrobatDistillerTests. You can use https://www.xfiniumpdf.com/xfinium-pdf-downloads.html to get these bounding boxes.

If they prove to be incorrect, fix them.

Support for custom document properties in PDFs?

Hi,

Can I ask if there are any plans and/or possibility of supporting custom document properties in PDF files as well as the 'known' ones (Author and Keywords and such)?

I haven't used PDFBox, but the documentation for PDDocumentInformation does seem to have functions to access custom properties.

Some paths are missing in page

Hi,

I am trying to retrieve all the paths from this PDF document, but it seems some of them are missing.

When drawing all the bounding boxes found, this is what I get (the PdfPath.BezierCurve are in red, and the PdfPath.Line are in blue):
104-7-3

As you can see, for each of the charts, only one line contains bounding boxes, the others seem to be ignored. Same issue for grid lines: some are drawn and some are not.

Am I doing something wrong, or are they really missing?
Thanks,

The code I used is the following:

        using (PdfDocument document = PdfDocument.Open(path))
        {
            for (var i = 0; i < document.NumberOfPages; i++)
            {
                var page = document.GetPage(i + 1);
                var paths = page.ExperimentalAccess.Paths;

                using (var bitmap = converter.GetPage(i + 1, zoom))
                using (var graphics = Graphics.FromImage(bitmap))
                {
                    var imageHeight = bitmap.Height;

                    foreach (var p in paths)
                    {
                        if (p == null) continue;
                        var commands = p.Commands;

                        foreach (var command in commands)
                        {
                            if (command is PdfPath.Line line)
                            {
                                var bbox = line.GetBoundingRectangle();
                                if (bbox.HasValue)
                                {
                                    var rect = new Rectangle(
                                        (int)(bbox.Value.Left * (decimal)zoom),
                                        imageHeight - (int)(bbox.Value.Top * (decimal)zoom),
                                        (int)(bbox.Value.Width == 0 ? 1 : bbox.Value.Width * (decimal)zoom),
                                        (int)(bbox.Value.Height == 0 ? 1 : bbox.Value.Height * (decimal)zoom));
                                    graphics.DrawRectangle(bluePen, rect);
                                }
                            }
                            else if (command is PdfPath.BezierCurve curve)
                            {
                                var bbox = curve.GetBoundingRectangle();
                                if (bbox.HasValue)
                                {
                                    var rect = new Rectangle(
                                        (int)(bbox.Value.Left * (decimal)zoom),
                                        imageHeight - (int)(bbox.Value.Top * (decimal)zoom),
                                        (int)(bbox.Value.Width == 0 ? 1 : bbox.Value.Width * (decimal)zoom),
                                        (int)(bbox.Value.Height == 0 ? 1 : bbox.Value.Height * (decimal)zoom));
                                    graphics.DrawRectangle(redPen, rect);
                                }
                            }
                        }
                    }
                }
            }
        }

Make all the token classes public. Expose via a StructureExplorer class or similar.

It will be useful for more advanced users to directly access the underlying PDF tokens and objects to work around currently unsupported behaviour.

Suggested API would be something like:

document.ContentExplorer

Which would provide access to the xref table to navigate directly to objects as well as inspecting the tokens forming those objects and being able to decode streams with filters.

To this end the classes in the UglyToad.PdfPig.Tokenization.Tokens namespace should be moved to UglyToad.PdfPig.Tokens, gaps in test coverage fixed and any mutability prevented. A general sanity check before exposing on the public API.

Fonts.Type1.Type1FontParserTests.CanReadHexEncryptedPortion Test fails

Changeset: 7fab13e

Test Name:	UglyToad.PdfPig.Tests.Fonts.Type1.Type1FontParserTests.CanReadHexEncryptedPortion
Test FullName:	UglyToad.PdfPig.Tests.Fonts.Type1.Type1FontParserTests.CanReadHexEncryptedPortion
Test Source:	C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig.Tests\Fonts\Type1\Type1FontParserTests.cs : line 15
Test Outcome:	Failed
Test Duration:	0:00:00.005

Result StackTrace:	
at UglyToad.PdfPig.Fonts.Type1.Parser.Type1Tokenizer.ReadNextToken() in C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig\Fonts\Type1\Parser\Type1Tokenizer.cs:line 59
   at UglyToad.PdfPig.Fonts.Type1.Parser.Type1Tokenizer.GetNext() in C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig\Fonts\Type1\Parser\Type1Tokenizer.cs:line 33
   at UglyToad.PdfPig.Fonts.Type1.Parser.Type1EncryptedPortionParser.Parse(IReadOnlyList`1 bytes, Boolean isLenientParsing) in C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig\Fonts\Type1\Parser\Type1EncryptedPortionParser.cs:line 40
   at UglyToad.PdfPig.Fonts.Type1.Parser.Type1FontParser.Parse(IInputBytes inputBytes, Int32 length1, Int32 length2) in C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig\Fonts\Type1\Parser\Type1FontParser.cs:line 149
   at UglyToad.PdfPig.Tests.Fonts.Type1.Type1FontParserTests.CanReadHexEncryptedPortion() in C:\Users\Jacob\dev\PdfPig\src\UglyToad.PdfPig.Tests\Fonts\Type1\Type1FontParserTests.cs:line 19
Result Message:	System.InvalidOperationException : Encountered an end of string ')' outside of string.

Document metadata/XMP access?

Hi,

Any plans or thoughts about adding a direct means of getting the XMP metadata of a document?
Looks like I can get hold of the data by doing something like

doc.Structure.Catalog.CatalogDictionary.TryGet<IndirectReferenceToken>(NameToken.Metadata, out var token);

var objectToken = doc.Structure.GetObject(token.Data);
var streamToken = objectToken.Data as UglyToad.PdfPig.Tokens.StreamToken;

and then parsing streamToken.Data with XmpCore, but it might be useful to be able to get at the data more directly (not sure what the best format to expose it as would be though).

StackOverflowException reading corrupt PDF document

Hi,

I've been doing a few tests with PdfPig 0.0.6, and one of the things I tried was loading the invalid pdf file in corrupt.zip in it, and that seems to result in a StackOverflowException being thrown from mscorlib (via NameTokenizer.TryTokenize I think).

For reference, that file was generated by running SharpFuzz against the PDFClown library (it also fails with a stackoverflow trying to load that file).

[enhancement] Add PdfRectangle.IntersectsWith

It would be helpful to have bool PdfRectangle.IntersectsWith(PdfRectangle other) added.

I often need to extract text from a given location so being able to check the bounding boxes using this would be convenient.

(IntersectsWith is what System.Drawing.Rectangle uses as the name, so suggesting that to be consistent)
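
For reference, the check being requested could look roughly like the sketch below. It uses a hypothetical Box struct in place of PdfRectangle and assumes axis-aligned rectangles in PDF coordinates (Bottom is less than Top). Like System.Drawing.Rectangle.IntersectsWith, it uses strict comparisons, so rectangles that only share an edge do not count as intersecting.

```csharp
using System;

var a = new Box(0, 0, 10, 10);
var b = new Box(5, 5, 15, 15);  // overlaps a
var c = new Box(10, 0, 20, 10); // shares only an edge with a

Console.WriteLine(a.IntersectsWith(b)); // prints "True"
Console.WriteLine(a.IntersectsWith(c)); // prints "False"

// Hypothetical stand-in for PdfRectangle, for illustration only.
readonly struct Box
{
    public Box(double left, double bottom, double right, double top)
    {
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }

    public double Left { get; }
    public double Bottom { get; }
    public double Right { get; }
    public double Top { get; }

    // True when the interiors of the two rectangles overlap on both axes.
    public bool IntersectsWith(Box other)
    {
        return other.Left < Right && Left < other.Right
            && other.Bottom < Top && Bottom < other.Top;
    }
}
```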

Expected name as dictionary key, instead got: Ghostscript

Hi there,

I'm trying to extract the text of a PDF generated by Ghostscript. The pdf itself seems fine, I tried to display it with a PDF viewer, which works. Also text extraction with iTextSharp seems to work. However, if I try to read the PDF with PdfPig, then I get the following exception:

PdfDocumentFormatException: Expected name as dictionary key, instead got: Ghostscript

I've looked at the pdf source to look for references to 'Ghostscript' and found the following snippet:

<?xpacket end='w'?>
endstream
endobj
2 0 obj
<</Producer(GPL Ghostscript 9.25)
/CreationDate(D:20190813110636Z00'00')
/ModDate(D:20190813110636Z00'00')
/Creator(OpenText Capture Recognition Engine \(RecoStar\) 7.8.0)>>endobj
xref

If I set a breakpoint and inspect the tokens, this indeed seems the place where the exception occurs. It seems that the parser cannot handle this kind of syntax. I must say, I don't have enough knowledge around the PDF format to know if this syntax is allowed, but in any case it exists in the wild with documents generated by RecoStar / Ghostscript.

Do you have any advice?

Make PdfRectangle rotatable

Currently PDF rectangle is always assumed to be horizontal. This does not work for rotated text. Make sure it supports angled rectangles too. The result of these changes can be assessed against the visual verification for Rotated Text Libre Office.pdf
image

Test odd page numbered documents

A PDF document can be created containing only the pages 3, 5 and 7. Test how the current pages API handles this and make any necessary changes to allow consumption of documents with missing pages.

GetPage fails with error : 'Cannot convert array to rectangle'

Hello,

When I try to open a PDF file and read it, I get an error:

UglyToad.PdfPig.Exceptions.PdfDocumentFormatException : 'Cannot convert array to rectangle, expected 4 values instead got: [ 0, 0 ].'

UglyToad.PdfPig.Exceptions.PdfDocumentFormatException
HResult=0x80131500
Message=Cannot convert array to rectangle, expected 4 values instead got: [ 0, 0 ].
Source=UglyToad.PdfPig
Stack trace:
at UglyToad.PdfPig.Util.ArrayTokenExtensions.ToIntRectangle(ArrayToken array)
at UglyToad.PdfPig.Parser.PageFactory.GetMediaBox(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, Boolean isLenientParsing)
at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, Boolean isLenientParsing)
at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber)
at UglyToad.PdfPig.PdfDocument.GetPage(Int32 pageNumber)
at WindowsFormsApp2.Form1.button1_Click(Object sender, EventArgs e) in C:\Users\source\repos\WindowsFormsApp2\WindowsFormsApp2\Form1.cs:line 40
at System.Windows.Forms.Control.OnClick(EventArgs e)
at System.Windows.Forms.Button.OnClick(EventArgs e)
at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)
at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)
at System.Windows.Forms.Control.WndProc(Message& m)
at System.Windows.Forms.ButtonBase.WndProc(Message& m)
at System.Windows.Forms.Button.WndProc(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.OnMessage(Message& m)
at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)
at System.Windows.Forms.NativeWindow.DebuggableCallback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)
at System.Windows.Forms.UnsafeNativeMethods.DispatchMessageW(MSG& msg)
at System.Windows.Forms.Application.ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr dwComponentID, Int32 reason, Int32 pvLoopData)
at System.Windows.Forms.Application.ThreadContext.RunMessageLoopInner(Int32 reason, ApplicationContext context)
at System.Windows.Forms.Application.ThreadContext.RunMessageLoop(Int32 reason, ApplicationContext context)
at System.Windows.Forms.Application.Run(Form mainForm)
at WindowsFormsApp2.Program.Main() in C:\Users\source\repos\WindowsFormsApp2\WindowsFormsApp2\Program.cs:line 19

Hope it will help you to fix it,

Fix Bezier curve behaviour in Type 2 charstring for CFF font

It looks like the interpretation of Bezier curves in Type2CharStringParser has gone slightly wrong:
image

This is probably in one of the curveto commands.

This B is from the PigProductionHandbookTests.CanReadContent test. Take a look and see where the error is and fix it.

Letter width/font is incorrect

Hi, I'm trying to extract text letters and positions from PDFs. For most documents it's working great, but for attached sample (and many others) it's returning Letter.Width=0 and Letter.FontSize=1
Any ideas how to work around this? Thank you!
letter_size_problem.pdf

            using (PdfDocument document = PdfDocument.Open(@"letter_size_problem.pdf"))
            {
                var page = document.GetPage(1);

                Letter l = page.Letters[0];

                decimal x = l.Location.X;
                decimal y = l.Location.Y;
                decimal width = l.Width;
                decimal fontSize = l.FontSize;
            }

How to get page orientation?

First of all, thank you! This library is great! But it seems there are some minor issues :-)

It seems page width/height is not correctly reported when page is rotated.
E.g. in the attached document it's showing width=612 and height = 792, but it's in landscape. So should it be reversed? Or have some orientation flag similar to PdfBox "page.findRotation()" method?

letter_size_problem.pdf

            
using (PdfDocument document = PdfDocument.Open(@"letter_size_problem.pdf"))
{
    var page = document.GetPage(1);
    decimal width = page.Width;
    decimal height = page.Height;
}

Complete access to images from the PDF

Somewhere in the code I added support for reading images from the PDF just as the raw object stream bytes. We won't add much more than this for now but this should be nicely wrapped in an image class with a type enum, size and position on the page if this doesn't require a PNG decoder or something fancy.

Any other easily exposed metadata should be included. Add documents to test this.

Issues reading unicode document properties?

Hi,

I had a try with loading some simple test documents into PdfPig 0.0.6, and noticed that unicode document properties don't seem to be handled correctly.

e.g., if I open the attached minimal.pdf in Acrobat reader it displays:

image

but in PdfDocument.Information, I get:

image

Would this be expected to work?

Thanks.

Letters extracted with no Value

a.pdf

In this file all letters on pages 1-54 have Value=null. I guess this is due to font "TTdcr10"?
Starting from page 55 letters are extracted correctly (when font is changing to ArialMT).

Is this a known issue? It seems to be working in PdfBox.

Optimize SystemFontFinder

In profiling done for #47 SystemFontFinder.GetTrueTypeFontNamed was called 18 times for a total of 4 seconds of the 29 second total.

This code is slow because it has to scan all fonts on the host operating system, but it can be optimized trivially by using a static cache rather than a per-call one. It may also be quicker to use File.ReadAllBytes rather than using a FileStream as the input to TrueType parsing.
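
A static cache of this kind could be sketched as follows. FontPathCache and its members are hypothetical names, not the actual SystemFontFinder code, and the slow lookup is simulated by a lambda that counts how often it runs.

```csharp
using System;
using System.Collections.Concurrent;

int scans = 0;

// Stand-in for the expensive scan of every font on the system (hypothetical).
Func<string, string> slowLookup = name =>
{
    scans++;
    return "/fonts/" + name + ".ttf";
};

string first = FontPathCache.GetOrAdd("Arial", slowLookup);
string second = FontPathCache.GetOrAdd("Arial", slowLookup);

Console.WriteLine(first == second); // prints "True"
Console.WriteLine(scans);           // prints "1", the second call hit the cache

// Hypothetical process-wide cache shared by all callers.
static class FontPathCache
{
    private static readonly ConcurrentDictionary<string, string> Cache =
        new ConcurrentDictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    // Runs the lookup only when the name has not been resolved before.
    public static string GetOrAdd(string fontName, Func<string, string> lookup)
    {
        return Cache.GetOrAdd(fontName, lookup);
    }
}
```

Under concurrent access ConcurrentDictionary.GetOrAdd may invoke the lookup more than once for the same key, which is harmless here because the lookup has no side effects beyond its cost.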

Try using floats instead of decimals for calculated values

Due to the poor performance of PdfPig in end-user scenarios we should measure the impact of replacing decimals with floats where the values are used in calculations (all TransformationMatrix-based code).

If the benefits from #64 aren't considered good enough then it may be that calculated values are better off being float-based.

Add word extraction

A barrier to adoption of the library is probably the lack of a "batteries included" text extraction API. We support retrieving letters and their size but each client must write their own word generation logic. We should include a naive default with pluggable interface. For example:

var document = PdfDocument.Open("somedocument.pdf");
var page = document.GetPage(1);
IEnumerable<Word> words = page.GetWords();

Where the get words method is using an optional parameter:

IEnumerable<Word> GetWords(IWordExtractor extractor = null)

Which if not set uses the internal library implementation. It doesn't need to be great for now but should at least do the obvious things right...
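The proposal could look roughly like the sketch below. All types here (`SketchLetter`, `SketchWord`, `ISketchWordExtractor`, `SketchPage`) are simplified stand-ins invented for illustration, and the default extractor assumes letters arrive in reading order and simply splits on whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public sealed class SketchLetter
{
    public SketchLetter(string value) { Value = value; }
    public string Value { get; }
}

public sealed class SketchWord
{
    public SketchWord(string text) { Text = text; }
    public string Text { get; }
}

// The pluggable interface: clients can supply their own word logic.
public interface ISketchWordExtractor
{
    IEnumerable<SketchWord> GetWords(IReadOnlyList<SketchLetter> letters);
}

// Naive default: a word is a run of non-whitespace letters.
public sealed class WhitespaceWordExtractor : ISketchWordExtractor
{
    public IEnumerable<SketchWord> GetWords(IReadOnlyList<SketchLetter> letters)
    {
        var current = new StringBuilder();
        foreach (var letter in letters)
        {
            if (string.IsNullOrWhiteSpace(letter.Value))
            {
                if (current.Length > 0)
                {
                    yield return new SketchWord(current.ToString());
                    current.Clear();
                }
            }
            else
            {
                current.Append(letter.Value);
            }
        }

        if (current.Length > 0)
        {
            yield return new SketchWord(current.ToString());
        }
    }
}

public sealed class SketchPage
{
    private static readonly ISketchWordExtractor DefaultExtractor = new WhitespaceWordExtractor();
    private readonly IReadOnlyList<SketchLetter> letters;

    public SketchPage(IReadOnlyList<SketchLetter> letters) { this.letters = letters; }

    // Falls back to the built-in extractor when none is supplied.
    public IEnumerable<SketchWord> GetWords(ISketchWordExtractor extractor = null)
    {
        return (extractor ?? DefaultExtractor).GetWords(letters);
    }
}
```

A real implementation would also need letter positions to detect gaps where no space glyph is present; this sketch deliberately ignores geometry.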

ArgumentOutOfRangeException occurs when executing document.GetPage(i + 1)

Hello there,

When I run any of the samples you provided,
an ArgumentOutOfRangeException occurs when executing var page = document.GetPage(i + 1);.
However, when document.NumberOfPages is used to fetch the page count,
the number of pages obtained is correct. The relevant information is as follows.

The StackTrace :
at System.DateTime.Add(Double value, Int32 scale)
at UglyToad.PdfPig.Fonts.TrueType.TrueTypeDataBytes.ReadInternationalDate() at C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\TrueType\TrueTypeDataBytes.cs: line 114
at UglyToad.PdfPig.Fonts.TrueType.Tables.HeaderTable.Load(TrueTypeDataBytes data, TrueTypeHeaderTable table) at C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\TrueType\Tables\HeaderTable.cs: line 97
at UglyToad.PdfPig.Fonts.TrueType.Parser.TrueTypeFontParser.ParseTables(Decimal version, IReadOnlyDictionary`2 tables, TrueTypeDataBytes data) at C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\TrueType\Parser\TrueTypeFontParser.cs: line 59
at UglyToad.PdfPig.Fonts.TrueType.Parser.TrueTypeFontParser.Parse(TrueTypeDataBytes data) at C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\TrueType\Parser\TrueTypeFontParser.cs: line 35
at UglyToad.PdfPig.Fonts.Parser.Parts.CidFontFactory.ReadDescriptorFile(FontDescriptor descriptor) at C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\Parser\Parts\CidFontFactory.cs: line 114
at UglyToad.PdfPig.Fonts.Parser.Parts.CidFontFactory.Generate(DictionaryToken dictionary, Boolean isLenientParsing) at C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\Parser\Parts\CidFontFactory.cs: line 56
at UglyToad.PdfPig.Fonts.Parser.Handlers.Type0FontHandler.ParseDescendant(DictionaryToken dictionary, Boolean isLenientParsing) at C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\Parser\Handlers\Type0FontHandler.cs: line 128
at UglyToad.PdfPig.Fonts.Parser.Handlers.Type0FontHandler.Generate(DictionaryToken dictionary, Boolean isLenientParsing) at C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\Parser\Handlers\Type0FontHandler.cs: line 34
at UglyToad.PdfPig.Fonts.FontFactory.Get(DictionaryToken dictionary, Boolean isLenientParsing) at C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Fonts\FontFactory.cs: line 51
at UglyToad.PdfPig.Content.ResourceContainer.LoadFontDictionary(DictionaryToken fontDictionary, Boolean isLenientParsing) at C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Content\ResourceContainer.cs: line 93
at UglyToad.PdfPig.Content.ResourceContainer.LoadResourceDictionary(DictionaryToken resourceDictionary, Boolean isLenientParsing) at C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Content\ResourceContainer.cs: line 33
at UglyToad.PdfPig.Parser.PageFactory.LoadResources(DictionaryToken dictionary, Boolean isLenientParsing) at C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Parser\PageFactory.cs: line 215
at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, Boolean isLenientParsing) at C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Parser\PageFactory.cs: line 67
at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber) at C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\Content\Pages.cs: line 62
at UglyToad.PdfPig.PdfDocument.GetPage(Int32 pageNumber) at C:\git\csharp\UglyToad.PdfPig\src\UglyToad.PdfPig\PdfDocument.cs: line 158
at DocumentLayoutAnalysis.ImageTest.Run(String path) at F:\IISC\My Lab\DocumentLayoutAnalysis-master\DocumentLayoutAnalysis\DocumentLayoutAnalysis\ImageTest.cs: line 25

Details on the system :
OS : MS Windows v10
VS : VS 2017 C#
.NET version : .NET framework 4.6.1

19937571.pdf

Page.Letters is empty for document which contains text

Using attached document and program below nothing is written to the console. Sample PDF came from a commercial HTML to PDF library one of our customers uses.

Getting_Started.pdf

using System;
using UglyToad.PdfPig;

namespace ExtractTest
{
    class Program
    {
        static void Main(string[] args)
        {
            using (PdfDocument document = PdfDocument.Open("Getting_Started.pdf"))
            {
                for (var i = 0; i < document.NumberOfPages; i++)
                {
                    var page = document.GetPage(i + 1);
                    foreach (var letter in page.Letters)
                    {
                        Console.WriteLine(letter.Value);
                    }
                }
            }
        }
    }
}

Create "Hello World" PDF

The major feature of the next release should be the ability to create PDF documents; for now they will only support the addition of plain text.

This is the first ticket to implement enough of the API to create a single page PDF A4 document containing the text "Hello World!" on a single line.

Optimize TryReadStream

In profiling done for #47 PdfTokenScanner.TryReadStream takes up 1/6th of the total time for parsing a set of 26 documents being called a total of 1,148 times. This is probably low-hanging fruit for performance optimization since in general we know the length of the stream ahead of time.
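To illustrate the idea of using the known length, here is a sketch that is independent of the real PdfTokenScanner. It jumps straight to the end of the data when a declared length is available and is confirmed by a following `endstream` keyword, and only falls back to scanning when that check fails. `StreamReadSketch` and its members are hypothetical names.

```csharp
using System;
using System.Text;

public static class StreamReadSketch
{
    private static readonly byte[] EndStream = Encoding.ASCII.GetBytes("endstream");

    public static byte[] ReadStreamData(byte[] input, int start, int? declaredLength)
    {
        // Fast path: trust the declared length if "endstream" follows the data.
        if (declaredLength.HasValue
            && declaredLength.Value >= 0
            && declaredLength.Value <= input.Length - start
            && IsFollowedByEndStream(input, start + declaredLength.Value))
        {
            return Copy(input, start, declaredLength.Value);
        }

        // Slow path: scan for the keyword, then drop one end-of-line before it.
        int marker = IndexOf(input, EndStream, start);
        if (marker < 0) throw new FormatException("No endstream marker found.");
        int stop = marker;
        if (stop > start && input[stop - 1] == '\n') stop--;
        if (stop > start && input[stop - 1] == '\r') stop--;
        return Copy(input, start, stop - start);
    }

    private static byte[] Copy(byte[] input, int start, int count)
    {
        var data = new byte[count];
        Buffer.BlockCopy(input, start, data, 0, count);
        return data;
    }

    private static bool IsFollowedByEndStream(byte[] input, int position)
    {
        while (position < input.Length
               && (input[position] == '\r' || input[position] == '\n' || input[position] == ' '))
        {
            position++;
        }
        return MatchesAt(input, EndStream, position);
    }

    private static bool MatchesAt(byte[] input, byte[] pattern, int position)
    {
        if (position < 0 || position + pattern.Length > input.Length) return false;
        for (int i = 0; i < pattern.Length; i++)
        {
            if (input[position + i] != pattern[i]) return false;
        }
        return true;
    }

    private static int IndexOf(byte[] input, byte[] pattern, int from)
    {
        for (int i = from; i <= input.Length - pattern.Length; i++)
        {
            if (MatchesAt(input, pattern, i)) return i;
        }
        return -1;
    }
}
```

The fallback matters because some documents declare an incorrect length; the fast path only pays off when the declared value can be verified cheaply.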

image

Optimize TransformationMatrix

As seen in #47, multiplication operations on TransformationMatrix take over 1/3rd of the total parsing time for PdfPig. We will investigate the optimizations and tradeoffs of using floats instead of decimals, which may result in a large speedup. It is also worth checking the performance impact of storing values directly (9 internal decimals rather than an array), which may either be slower due to the larger value to copy, or faster due to removing repeated array access.
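A rough sketch of the "values directly" layout, combined with floating point, is shown below. `MatrixSketch` is an illustrative type and not the library's TransformationMatrix; it uses `double` and nine fields purely to show the shape of the change, and any performance claim would still need to be measured.

```csharp
using System;

public readonly struct MatrixSketch
{
    // Nine fields instead of an array: no allocation and no bounds checks.
    public readonly double M11, M12, M13;
    public readonly double M21, M22, M23;
    public readonly double M31, M32, M33;

    public MatrixSketch(
        double m11, double m12, double m13,
        double m21, double m22, double m23,
        double m31, double m32, double m33)
    {
        M11 = m11; M12 = m12; M13 = m13;
        M21 = m21; M22 = m22; M23 = m23;
        M31 = m31; M32 = m32; M33 = m33;
    }

    public static MatrixSketch Translation(double x, double y) =>
        new MatrixSketch(1, 0, 0, 0, 1, 0, x, y, 1);

    public static MatrixSketch Scale(double sx, double sy) =>
        new MatrixSketch(sx, 0, 0, 0, sy, 0, 0, 0, 1);

    // Standard 3x3 product (this * o); 'in' avoids copying the 72-byte struct.
    public MatrixSketch Multiply(in MatrixSketch o) => new MatrixSketch(
        M11 * o.M11 + M12 * o.M21 + M13 * o.M31,
        M11 * o.M12 + M12 * o.M22 + M13 * o.M32,
        M11 * o.M13 + M12 * o.M23 + M13 * o.M33,
        M21 * o.M11 + M22 * o.M21 + M23 * o.M31,
        M21 * o.M12 + M22 * o.M22 + M23 * o.M32,
        M21 * o.M13 + M22 * o.M23 + M23 * o.M33,
        M31 * o.M11 + M32 * o.M21 + M33 * o.M31,
        M31 * o.M12 + M32 * o.M22 + M33 * o.M32,
        M31 * o.M13 + M32 * o.M23 + M33 * o.M33);
}
```

Passing the operand with `in` is one way to address the copy-cost concern raised above while keeping the struct immutable.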

Some text is missing

For the attached PDF the charts and the text around them are missing. For example, the text "Historical Arrears by Month" is not in the Letters collection or the extracted page text, and neither are the numbers and labels on the charts. The lines (paths collection) are also missing everything around the charts area. Is there possibly a sub-stream which is not being processed?

missing_text_sample.pdf

Incorrect number format when using a non-"en-US" number style

Hi. Thank you for this library; it is good, but I have a problem.
When I use a PC with a non-"en-US" number format, an exception is thrown by default when I try to open any PDF file.
I tried to fix this but have not yet found all the parse functions.

for example:

    private static decimal ReadDecimal(IInputBytes input)
    {
        decimal result;

        var str = ReadString(input);

        Decimal.TryParse(str, NumberStyles.Any, new CultureInfo("en-US"), out result); // <-

        return result;
    }
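As a general pattern (not a description of how the library resolved this), numeric text from a PDF can be parsed with `CultureInfo.InvariantCulture`, since PDF numbers always use '.' as the decimal separator. `NumberParsingSketch` and `ParsePdfNumber` are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class NumberParsingSketch
{
    // Parses with the invariant culture so the result does not depend on
    // the machine's regional settings (for example ',' as decimal separator).
    public static decimal ParsePdfNumber(string text)
    {
        return decimal.Parse(text, NumberStyles.Float, CultureInfo.InvariantCulture);
    }
}
```

The same call returns 1.5 for the input "1.5" whether the current culture is en-US or one such as de-DE, which is the behaviour the report asks for.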

Sorry for my English. Thank you for your work.

Support PDF documents using named system fonts

The height of letters in the document Multiple Page - from Mortality Statistics.pdf is currently wrong because it uses a TrueType font (ArialMT) which is not included in the document. In this case the provider is meant to use the font files from the host operating system.

Line 78 of TrueTypeFontHandler has a TODO describing what PDFBox does in this situation, we should do the same thing.

testing page.text

page.Text did not return text with newlines.
I am using this code with the latest PdfPig code:
using (PdfDocument doc = PdfDocument.Open(textBox1.Text, new ParsingOptions { Password = textBox2.Text }))
{
    var page = doc.GetPage(1);
    string pagetext = page.Text;
    File.WriteAllText("text.txt", pagetext);
    textBox3.Text = pagetext;
}

Implement support for the gs content stream operator

PDF Page content streams can contain the gs operator:

Set[s] parameters from graphics state parameter dictionary

This currently has no effect which can lead to letters being given the wrong size.

Each entry in the parameter dictionary specifies the value of an individual graphics state parameter, as shown in Table 4.8. All entries need not be present for every invocation of the gs operator; the supplied parameter dictionary may include any combination of parameter entries.

Implement support for setting the graphics state from the graphics state parameter dictionary.
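The "any combination of entries" rule could be handled along the lines of the sketch below. It covers only two entries from the parameter dictionary, LW (line width) and Font (an array of font and size), and represents the font by a plain name string for simplicity. `GraphicsStateSketch` and `ExtGStateSketch` are invented for illustration and are not the library's types.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class GraphicsStateSketch
{
    public decimal LineWidth { get; set; } = 1m;
    public string FontName { get; set; }
    public decimal FontSize { get; set; }
}

public static class ExtGStateSketch
{
    // Applies only the entries that are present; everything else is left untouched.
    public static void Apply(IReadOnlyDictionary<string, object> parameters, GraphicsStateSketch state)
    {
        if (parameters.TryGetValue("LW", out var lineWidth))
        {
            state.LineWidth = Convert.ToDecimal(lineWidth, CultureInfo.InvariantCulture);
        }

        // Font is a two-element array: the font (simplified to a name here) and its size.
        if (parameters.TryGetValue("Font", out var font)
            && font is object[] pair
            && pair.Length == 2)
        {
            state.FontName = (string)pair[0];
            state.FontSize = Convert.ToDecimal(pair[1], CultureInfo.InvariantCulture);
        }
    }
}
```

The Font entry is the one most relevant to the wrong letter sizes mentioned above, since it can change the font size outside of the usual Tf operator.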

Create 2 page PDF document with wrapped Lorem Ipsum placeholder text

As part of the document creation epic for the next release we should handle wrapping text automatically (nothing fancy like working out the right place to line-break), create a document that shows the following capabilities:

  • Two or more pages
  • Text wrapping
  • Different font sizes, weights and faces
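The "nothing fancy" wrapping could be as simple as the greedy sketch below: words are added to the current line until the measured width would exceed the limit. `TextWrapSketch` is a hypothetical helper, and the `measureWidth` callback stands in for whatever font metrics the document builder would provide.

```csharp
using System;
using System.Collections.Generic;

public static class TextWrapSketch
{
    // Greedy wrap: break before the first word that would overflow the line.
    public static List<string> Wrap(string text, decimal maxWidth, Func<string, decimal> measureWidth)
    {
        var lines = new List<string>();
        var current = string.Empty;
        var separators = new[] { ' ', '\t', '\r', '\n' };

        foreach (var word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            var candidate = current.Length == 0 ? word : current + " " + word;

            if (current.Length > 0 && measureWidth(candidate) > maxWidth)
            {
                lines.Add(current);
                current = word;
            }
            else
            {
                // A single word wider than the line is kept whole rather than split.
                current = candidate;
            }
        }

        if (current.Length > 0)
        {
            lines.Add(current);
        }

        return lines;
    }
}
```

For a quick check, a character count can serve as the width function: `TextWrapSketch.Wrap("lorem ipsum dolor sit", 11, s => s.Length)` yields the two lines "lorem ipsum" and "dolor sit".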

Relatively slow processing

This is not exactly an issue, but more of a general question. While extracting the Letters collection I noticed that overall the process runs about 4-5 times slower than PdfBox. I run PdfBox through Ikvm, so I was expecting it to be the other way around :-)
Of course there can be many things contributing to this, but I did one quick test: I ran a mass replace of the word "decimal" with "double" across the whole code base. And yes, the speed got right on par with PdfBox! Changing it to "float" made it even a little faster (probably due to the smaller memory footprint).
Sure, double/float is not precise, but personally I often need to run extraction over hundreds of thousands of PDFs, so speed is crucial and the time difference is substantial. On the other hand I think that letter/line positions and dimensions would be OK with 2 digits of precision at most (ok, maybe 3 :-))
I see a few ways to approach this:

  1. Change all properties to float. This is quick, but not very user friendly and may have some minor issues if anyone ever tries to compare numbers directly.
  2. Keep values internally as Int multiplied by 100 (or 1000). Do all calculations on Int, then return to the user as decimal (divide by 100) - this may be better, but probably harder to do and potentially more confusing.
  3. Keep everything as is and suggest getting a better server :-) (or optimize somewhere else)
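Option 2 might look something like this sketch, which stores values as integers scaled by 1000 and converts to decimal only at the API boundary. `FixedPoint` is a hypothetical type; multiplication truncates toward zero here, and overflow handling is omitted.

```csharp
using System;

public readonly struct FixedPoint
{
    private const long Scale = 1000; // three decimal digits of precision
    private readonly long raw;

    private FixedPoint(long raw) { this.raw = raw; }

    public static FixedPoint FromDecimal(decimal value) =>
        new FixedPoint((long)Math.Round(value * Scale));

    // Conversion back to decimal happens only when handing values to the user.
    public decimal ToDecimal() => (decimal)raw / Scale;

    public static FixedPoint operator +(FixedPoint a, FixedPoint b) =>
        new FixedPoint(a.raw + b.raw);

    // The product of two scaled values carries Scale twice, so divide once.
    public static FixedPoint operator *(FixedPoint a, FixedPoint b) =>
        new FixedPoint(a.raw * b.raw / Scale);
}
```

The extra division on every multiplication shows why this option is described as harder and potentially more confusing than switching to float.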

Thoughts? Thank you for your time and all the work put in this library!

GetPage fails with error "Could not find a name for this font"

Steps to reproduce:

  1. Download this PDF File: Tackling the Poor Assumptions of Naive Bayes Text Classifiers
  2. Call PdfDocument.Open(...)
  3. Call document.GetPage(1)

The call to GetPage fails with the following error:

UglyToad.PdfPig.Fonts.Exceptions.InvalidFontFormatException: Could not find a name for this font (/Type, /Font) (/Subtype, /Type1) (/FirstChar, COSInt{0}) (/LastChar, COSInt{127}) (/Widths, COSObject{325, 0}) (/BaseFont, COSObject{331, 0}) (/FontDescriptor, COSObject{332, 0}) .
   at UglyToad.PdfPig.Fonts.Parser.FontDictionaryAccessHelper.GetName(PdfDictionary dictionary, FontDescriptor descriptor)
   at UglyToad.PdfPig.Fonts.Parser.Handlers.Type1FontHandler.Generate(PdfDictionary dictionary, IRandomAccessRead reader, Boolean isLenientParsing)
   at UglyToad.PdfPig.Fonts.FontFactory.Get(PdfDictionary dictionary, IRandomAccessRead reader, Boolean isLenientParsing)
   at UglyToad.PdfPig.Content.ResourceContainer.LoadFontDictionary(PdfDictionary fontDictionary, IRandomAccessRead reader, Boolean isLenientParsing)
   at UglyToad.PdfPig.Content.ResourceContainer.LoadResourceDictionary(PdfDictionary dictionary, IRandomAccessRead reader, Boolean isLenientParsing)
   at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, PdfDictionary dictionary, PageTreeMembers pageTreeMembers, IRandomAccessRead reader, Boolean isLenientParsing)
   at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber)
   at FactReapUsage.Program.Main(String[] args) in C:\Code\FactReapUsage\FactReapUsage\FactReapUsage\Program.cs:line 19

BoundingBox for Images

Hi,

Unless I am mistaken, there is no support for getting an image's BoundingBox. Is there any plan to add this functionality?

Thanks!

Inspect Type 1 CFF glyph positions and sizes

The Pig Production Handbook.pdf page 1 has a visual verification test:
image

As you can see the glyph boxes are way out. Investigate, add a test that asserts the positions of some letters using either xfinium pdf inspector or pdfbox to find the expected positions and fix this.

The test is:

GenerateLetterBoundingBoxImages.PigProductionCompactFontFormat

Support AES-256bit encryption

I really like this library. The letters feature that shows me the glyph rectangle is very helpful. The problem I'm having is that a lot of my PDF documents are encrypted with 256-bit AES. Please support it! Thank you!

Make content stream operators public

The classes in the UglyToad.PdfPig.Graphics.Operations namespace represent all operations a page's content stream can contain. Finish implementing writing for all of them and make them public. Use a reflection based test to ensure they can all be written.
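A reflection-based check of this kind could be sketched as follows. The interface and the two operation classes are stand-ins invented for the example (BT and ET are the real begin-text and end-text operators), and the sketch assumes every operation has a parameterless constructor, which will not hold for operations that take operands.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reflection;

public interface IOperationSketch
{
    void Write(TextWriter writer);
}

public sealed class BeginTextSketch : IOperationSketch
{
    public void Write(TextWriter writer) => writer.Write("BT");
}

public sealed class EndTextSketch : IOperationSketch
{
    public void Write(TextWriter writer) => writer.Write("ET");
}

public static class OperationWriteCheck
{
    // Returns a description of every operation type that cannot be written.
    public static IReadOnlyList<string> FindOperationsThatCannotBeWritten(Assembly assembly)
    {
        var failures = new List<string>();
        var operationTypes = assembly.GetTypes()
            .Where(t => t.IsClass && !t.IsAbstract && typeof(IOperationSketch).IsAssignableFrom(t));

        foreach (var type in operationTypes)
        {
            try
            {
                var constructor = type.GetConstructor(Type.EmptyTypes);
                if (constructor == null)
                {
                    failures.Add(type.Name + ": no parameterless constructor");
                    continue;
                }

                var operation = (IOperationSketch)constructor.Invoke(null);
                using (var writer = new StringWriter())
                {
                    operation.Write(writer);
                    if (writer.ToString().Length == 0)
                    {
                        failures.Add(type.Name + ": wrote nothing");
                    }
                }
            }
            catch (Exception ex)
            {
                failures.Add(type.Name + ": " + ex.GetType().Name);
            }
        }

        return failures;
    }
}
```

A test would then assert that the returned list is empty, so any newly added operation without a working Write method fails the build.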

Rework public API for letters

A letter in a PDF has the following information:

  • A placement origin position (x, y)
  • A bounding box which entirely surrounds the actual visible shape of the glyph
  • A width by which the rendering advances (advance width) to place the next character which may be greater or less than the width of the bounding box

To illustrate this, consider the following SVG of a character 'o' or '0' taken from a PDF:

image

The red dot illustrates the placement origin and the blue box illustrates the bounding box for the glyph itself. Notice how the box extends below the origin; it can also extend to the left of the origin or, as in this case, not include the origin at all. The advance width for this character would probably be greater than the bounding box width since the origin is outside the character.

A letter should have origin as PdfPoint, glyph bounding box as PdfRectangle and Width as decimal with comments explaining the above.
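The shape described above might be sketched like this. `PointSketch`, `RectangleSketch` and `LetterSketch` are simplified stand-ins for PdfPoint, PdfRectangle and the letter type, written only to show the three pieces of information side by side.

```csharp
using System;

public readonly struct PointSketch
{
    public PointSketch(decimal x, decimal y) { X = x; Y = y; }
    public decimal X { get; }
    public decimal Y { get; }
}

public readonly struct RectangleSketch
{
    public RectangleSketch(PointSketch bottomLeft, PointSketch topRight)
    {
        BottomLeft = bottomLeft;
        TopRight = topRight;
    }

    public PointSketch BottomLeft { get; }
    public PointSketch TopRight { get; }
    public decimal Width => TopRight.X - BottomLeft.X;

    public bool Contains(PointSketch p) =>
        p.X >= BottomLeft.X && p.X <= TopRight.X &&
        p.Y >= BottomLeft.Y && p.Y <= TopRight.Y;
}

public sealed class LetterSketch
{
    public LetterSketch(string value, PointSketch origin, RectangleSketch glyphRectangle, decimal width)
    {
        Value = value;
        Origin = origin;
        GlyphRectangle = glyphRectangle;
        Width = width;
    }

    public string Value { get; }

    // Placement origin; it may lie outside the glyph's bounding box.
    public PointSketch Origin { get; }

    // Box that entirely surrounds the visible shape of the glyph.
    public RectangleSketch GlyphRectangle { get; }

    // Advance width used to place the next character; it may be greater
    // or less than GlyphRectangle.Width.
    public decimal Width { get; }
}
```

Keeping the advance width separate from the rectangle makes the distinction explicit for callers who previously assumed the two were the same.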

Test, refactor and prepare the FontDescriptor class for public API

From the spec:

A font descriptor specifies metrics and other attributes of a simple font or a CIDFont as a whole, as distinct from the metrics of individual glyphs. These font metrics provide information that enables a consumer application to synthesize a substitute font or select a similar font when the font program is unavailable. The font descriptor may also be used to embed the font program in the PDF file.

We have a class to represent the FontDescriptor but it's a bit of a mess, test it, tidy up its creation and generally improve the public API for this class (it is internal but aim to make it public for 0.0.2).
