
pdflib's Introduction

pdflib

Python binding for poppler.

Installation

Using pip: pip install pdflib

From source:

  • Clone poppler source code and compile it:
git clone --branch poppler-0.63.0 --depth 1 https://anongit.freedesktop.org/git/poppler/poppler.git poppler_src
cd poppler_src/
cmake -DENABLE_SPLASH=OFF -DBUILD_GTK_TESTS=OFF -DENABLE_UTILS=OFF -DENABLE_LIBOPENJPEG=none .
make
  • Set the POPPLER_ROOT environment variable to the poppler source directory
export POPPLER_ROOT=/pdflib/poppler_src/
  • Install cython
pip install cython
  • Build extension
python setup.py build_ext --inplace
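Before running the build step, it can help to confirm that the variable exported above is actually visible and points at a real directory. The following is a minimal sketch, not part of pdflib: the helper name check_poppler_root is hypothetical, and it assumes only that the build reads POPPLER_ROOT as in the export step.

```python
import os
from pathlib import Path


def check_poppler_root(env=None):
    """Return POPPLER_ROOT as a Path, or raise RuntimeError with a hint.

    Hypothetical helper for illustration; `env` defaults to os.environ
    and can be replaced with a plain dict.
    """
    env = os.environ if env is None else env
    root = env.get("POPPLER_ROOT")
    if not root:
        raise RuntimeError("POPPLER_ROOT is not set; export it before building")
    path = Path(root)
    if not path.is_dir():
        raise RuntimeError(f"POPPLER_ROOT={root!r} is not a directory")
    return path
```

Calling check_poppler_root() in the shell where the build will run either returns the directory or fails with a message naming the problem.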

Usage

>>> from pdflib import Document
>>> doc = Document("path/to/file.pdf")

Getting metadata

>>> print(doc.metadata)
>>> print(doc.xmp_metadata)

Getting text content of each page

>>> for page in doc:
...     print(' \n'.join(page.lines).strip())
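For scripts that need the text rather than printed output, the same loop can be wrapped in a small generator. This is a sketch under the assumption, taken from the usage above, that iterating a document yields pages and that each page exposes a lines sequence of strings; page_texts is a hypothetical name, not a pdflib function.

```python
def page_texts(doc):
    """Yield (page_number, text) pairs, numbering pages from 1.

    `doc` is any iterable of objects with a `lines` attribute, such as
    the Document shown above (assumed, not verified against pdflib).
    """
    for number, page in enumerate(doc, start=1):
        # Same join as the README example: lines separated by " \n".
        yield number, " \n".join(page.lines).strip()
```

With a real document this would be used as: for number, text in page_texts(Document("path/to/file.pdf")): ...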

Getting images from each page

>>> for page in doc:
...     page.extract_images(path='images', prefix='img')

LICENSE

pdflib is available under GPL v3 (poppler is GPL).

pdflib's People

Contributors

dependabot-preview[bot], pudo, sunu, tjstavenger-pnnl


pdflib's Issues

Installing the package fails with a poppler error on macOS

I tried pip3 install pdflib and pip install pdflib, and both give this error.
[Screenshot, 2023-10-23, 2:56:40 PM]

Then I followed the README steps to install from source, and got a poppler "nss3 not found" error.
[Screenshot, 2023-10-23, 2:59:27 PM]

I couldn't find a way to solve this ld: library 'nss3' not found error.

In venv
python version - 3.7.9
pip3 version 23.3.1

Right-to-left language PDF text extraction is backwards in Aleph

I've found that text extracted from PDFs in right-to-left languages comes out backwards (reversed). I can reproduce the error by calling the pdflib package directly, but Poppler's pdftotext utility produces correct output. It looks like Aleph/pdflib take a different approach to extracting text page by page than Poppler does: compare https://github.com/alephdata/aleph/blob/master/services/ingest-file/ingestors/support/pdf.py#L13 with https://gitlab.freedesktop.org/poppler/poppler/-/blob/master/utils/pdftotext.cc#L400 (in Poppler you have to read past the HTML handling, because HTML and plain-text output are mixed together in one function).

https://github.com/alephdata/pdflib/blob/master/pdflib/poppler.pyx#L379-L388 appears to read the text right-to-left and then write it out left-to-right.
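Since the report says pdftotext gives correct output, one possible workaround is to shell out to it per page. The sketch below is not how pdflib or Aleph work; it assumes pdftotext (from poppler's utilities) is installed and on PATH, and both function names are hypothetical. It uses pdftotext's -f, -l and -enc options, with "-" as the output file to write to stdout.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, page):
    """Build the argument list for extracting one page to stdout."""
    return [
        "pdftotext",
        "-f", str(page),   # first page to convert
        "-l", str(page),   # last page to convert
        "-enc", "UTF-8",   # output text encoding
        pdf_path,
        "-",               # write the text to stdout
    ]


def page_text_via_pdftotext(pdf_path, page):
    """Run pdftotext for a single page and return its text."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, page),
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8")
```

Comparing page_text_via_pdftotext(path, n) with the pdflib output for the same page would show whether the reversal is specific to pdflib's extraction path.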

Installation on macOS

Hi — is this able to be installed on a mac? I'm getting errors when I try to pip install pdflib.

Collecting pdflib
  Using cached pdflib-0.1.2.tar.gz (49 kB)
    ERROR: Command errored out with exit status 1:
     command: /usr/local/anaconda3/envs/chapter-extraction/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/kp/03rd6_8x1835gg33z1lg18vm0000gp/T/pip-install-hx2shit4/pdflib_6966299fddf3403c86746bcd924e7ff7/setup.py'"'"'; __file__='"'"'/private/var/folders/kp/03rd6_8x1835gg33z1lg18vm0000gp/T/pip-install-hx2shit4/pdflib_6966299fddf3403c86746bcd924e7ff7/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/kp/03rd6_8x1835gg33z1lg18vm0000gp/T/pip-pip-egg-info-a9e9_nrp
         cwd: /private/var/folders/kp/03rd6_8x1835gg33z1lg18vm0000gp/T/pip-install-hx2shit4/pdflib_6966299fddf3403c86746bcd924e7ff7/
    Complete output (11 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/kp/03rd6_8x1835gg33z1lg18vm0000gp/T/pip-install-hx2shit4/pdflib_6966299fddf3403c86746bcd924e7ff7/setup.py", line 54, in <module>
        ext_modules=cythonize([poppler_ext]),
      File "/usr/local/anaconda3/envs/chapter-extraction/lib/python3.9/site-packages/Cython/Build/Dependencies.py", line 965, in cythonize
        module_list, module_metadata = create_extension_list(
      File "/usr/local/anaconda3/envs/chapter-extraction/lib/python3.9/site-packages/Cython/Build/Dependencies.py", line 815, in create_extension_list
        for file in nonempty(sorted(extended_iglob(filepattern)), "'%s' doesn't match any files" % filepattern):
      File "/usr/local/anaconda3/envs/chapter-extraction/lib/python3.9/site-packages/Cython/Build/Dependencies.py", line 114, in nonempty
        raise ValueError(error_msg)
    ValueError: 'pdflib/poppler.pyx' doesn't match any files

I have poppler installed and $POPPLER_ROOT exported.
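One reading of the traceback above is that the downloaded source archive does not contain pdflib/poppler.pyx, in which case exporting POPPLER_ROOT would not help. That is a guess, not a confirmed diagnosis. A minimal sketch for checking a locally saved archive follows; sdist_contains is a hypothetical helper and the archive path is whatever file you have on disk.

```python
import tarfile


def sdist_contains(archive_path, suffix="pdflib/poppler.pyx"):
    """Return True if any member of the tar archive ends with `suffix`."""
    with tarfile.open(archive_path) as archive:
        return any(name.endswith(suffix) for name in archive.getnames())
```

For example, sdist_contains("pdflib-0.1.2.tar.gz") returning False would support the missing-file explanation, and building from a git checkout would then be the thing to try.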
