Comments (2)
Hello Philipp,
I intentionally left an exception here because the PDF file format is so
tricky regarding page description that I was sure I would one day
encounter a case like yours.
For your curiosity, there is a triple indirection in the way text
objects are assigned to pages:
- Object #x contains a keyword that specifies a certain number of
  objects y1, ..., yn1
- Each object y1, ..., yn1 references objects z1, ..., zn2. These are
  the contents for one page
- In turn, each object z1, ..., zn2 lists the objects that contain the
  text drawing instructions to draw a part of the page
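
The three levels described above can be sketched in PHP over a hand-built object table. This is only an illustration: the array layout, the `refs` and `stream` keys, the object numbers and the function name `collect_page_streams` are all assumptions made for the example, not the structures PdfToText uses internally.

```php
<?php
// Hypothetical, simplified object table: object number => parsed attributes.
// Real PDF objects are dictionaries and streams; this only mirrors the indirection.
$objects = [
    1   => ['refs' => [2, 3]],       // object #x: lists y1..yn1
    2   => ['refs' => [10, 11]],     // y1: references z1..zn2 (contents of one page)
    3   => ['refs' => [12]],         // y2
    10  => ['refs' => [100]],        // z1: lists the objects holding drawing instructions
    11  => ['refs' => [101]],
    12  => ['refs' => [102]],
    100 => ['stream' => 'BT (Hello) Tj ET'],
    101 => ['stream' => 'BT (world) Tj ET'],
    102 => ['stream' => 'BT (Page two) Tj ET'],
];

/**
 * Follows x -> y -> z -> drawing objects and returns, for each y,
 * the drawing-instruction streams reachable from it.
 * Missing objects or keys are skipped instead of raising an error.
 */
function collect_page_streams(array $objects, int $root): array
{
    $pages = [];
    foreach ($objects[$root]['refs'] ?? [] as $y) {
        $streams = [];
        foreach ($objects[$y]['refs'] ?? [] as $z) {
            foreach ($objects[$z]['refs'] ?? [] as $drawing) {
                if (isset($objects[$drawing]['stream'])) {
                    $streams[] = $objects[$drawing]['stream'];
                }
            }
        }
        $pages[$y] = $streams;
    }
    return $pages;
}

print_r(collect_page_streams($objects, 1));
```

Skipping unresolved references rather than throwing is a deliberate choice in this sketch, in line with the idea that a reader has to tolerate imperfect files.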
And you can even find PDF files without any page description at all! This
is the case, for example, of the official Adobe PDF Specification document...
I suspect that your PDF samples have a small inconsistency: they say that
the page contents for one page are described by 32 objects, while only 6 are
referenced. This may be due to a bug in the application that generated them,
but even if that is the case, PDF readers need to be highly tolerant, so I
will change my class accordingly.
Regarding issue #2 (the repeating error), I suspect that I need to add a
check somewhere.
OK, I'll put that in my bug tracking system.
Christian.
From: phisu [mailto:[email protected]]
Sent: Wednesday, July 27, 2016 11:00
To: christian-vigh-phpclasses/PdfToText
Subject: [christian-vigh-phpclasses/PdfToText] error by the /Count parameter
(#8)
hello christian.
i get an error concerning page count. i did:
$pdf = new PdfToText ($filename) ;
echo $pdf->Text;
and i got the following error:
Object #202 : Page count given by the /Count parameter (32) differs from the
actual number of objects referenced by the /Kids parameter (6).
PdfToText.php
545
512
the following files produce similar errors:
* http://www.cleanclothes.at/media/common/uploads/download/cck-label-check/CCK-LabelCheck_screen.pdf
* https://www.uni-muenster.de/imperia/md/content/physikalische_chemie/praktikum/h_p_saetze.pdf
* http://www.cleanclothes.at/media/common/uploads/download/cck-label-check/CCK-LabelCheck_screen.pdf
and the same error on the following file. but a repeating error too:
http://www.umweltberatung.at/downloads/mehrweggetraenke-bezugsquellen-abfall.pdf
Undefined offset: 1
/PdfToText.php
2115
philipp
Hello Philipp,
I corrected the repeating "Undefined offset: 1" problem. This was due to
improper parsing of the floating point numbers used for specifying
coordinates: a value such as "0.12" was recognized, while ".12" was
discarded.
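
To illustrate the kind of mismatch involved, here is a small PHP sketch comparing two regular expressions. Neither pattern is taken from PdfToText; both, along with the function names, are assumptions written for this example. The first requires a digit before the decimal point and therefore misses ".12"; the second also accepts a leading or trailing dot, which PDF numeric syntax allows.

```php
<?php
// Hypothetical strict pattern: requires at least one digit before the dot.
function is_number_strict(string $token): bool
{
    return preg_match('/\A[+-]?\d+(?:\.\d+)?\z/', $token) === 1;
}

// More tolerant pattern: accepts "0.12", ".12", "4." and signed variants.
function is_pdf_number(string $token): bool
{
    return preg_match('/\A[+-]?(?:\d+\.?\d*|\.\d+)\z/', $token) === 1;
}

foreach (['0.12', '.12', '-.002', '4.', '.', 'abc'] as $token) {
    printf(
        "%-6s strict=%s tolerant=%s\n",
        $token,
        var_export(is_number_strict($token), true),
        var_export(is_pdf_number($token), true)
    );
}
```

The anchors `\A` and `\z` are used instead of `^` and `$` so that a token followed by a newline is not accepted by accident.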
Regarding the warning ("Page count given by the /Count parameter..."), your
samples made me discover that page maps can be nested: the top-level page
map lists only objects describing further page maps and gives their total
count (yet another PDF surprise!).
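
A nested page map explains why the two numbers in the warning can differ: as the PDF specification describes it, /Count is the number of leaf page objects below a node, while /Kids lists only its immediate children. The PHP sketch below counts leaves recursively over a made-up object table. The array layout, keys, object numbers and the function name `count_leaf_pages` are assumptions for the example and do not reflect the PdfTexterPageMap class.

```php
<?php
// Hypothetical object table: a top-level page map whose kids are page maps themselves.
$objects = [
    1  => ['type' => 'Pages', 'count' => 5, 'kids' => [2, 3]],
    2  => ['type' => 'Pages', 'count' => 3, 'kids' => [10, 11, 12]],
    3  => ['type' => 'Pages', 'count' => 2, 'kids' => [13, 14]],
    10 => ['type' => 'Page'],
    11 => ['type' => 'Page'],
    12 => ['type' => 'Page'],
    13 => ['type' => 'Page'],
    14 => ['type' => 'Page'],
];

/**
 * Counts the leaf pages reachable from a node.
 * Missing objects and reference cycles contribute zero instead of failing.
 */
function count_leaf_pages(array $objects, int $id, array $visited = []): int
{
    if (!isset($objects[$id]) || isset($visited[$id])) {
        return 0;
    }
    $visited[$id] = true;
    $node = $objects[$id];

    if (($node['type'] ?? '') === 'Page') {
        return 1;
    }

    $total = 0;
    foreach ($node['kids'] ?? [] as $kid) {
        $total += count_leaf_pages($objects, $kid, $visited);
    }
    return $total;
}

$root = $objects[1];
printf(
    "declared /Count: %d, direct /Kids: %d, leaf pages found: %d\n",
    $root['count'],
    count($root['kids']),
    count_leaf_pages($objects, 1)
);
```

In this table the root declares 5 pages but has only 2 direct kids, which is the same shape as the 32 versus 6 mismatch reported above; comparing /Count with the recursive leaf count rather than with the number of kids avoids the false warning.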
I disabled this warning in non-debug mode. I am not yet able to evaluate
whether the individual page contents extracted from your samples will be
correct; however, I know that I have to modify the PdfTexterPageMap class
in my source to handle this new crazy situation. This is an issue I added to
my list of open issues...
Regarding the text positioning issues you reported to me in another mail (with
extra spaces and extraneous line breaks), don't worry, I'm handling them in
a separate thread...
Christian.