pdfminer3k's Introduction

See docs/index.html

pytest is needed to run tests in the 'tests' folder.

pdfminer3k's People

Contributors

jaepil

pdfminer3k's Issues

root logger in psparser

Root level logging is still present in pdfminer.psparser.nextobject():

logging.debug('do_keyword: pos=%r, token=%r, stack=%r', pos, token, self.curstack)

I guess it was not intended :)

KeyError: 'ID'

pdfFile.set_parser(parser)
File "C:\Users\Administrator\Envs\artcle\lib\site-packages\pdfminer\pdfparser.py", line 431, in set_parser
self.encryption = (list_value(trailer['ID']),dict_value(trailer['Encrypt']))
KeyError: 'ID'
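
The traceback shows trailer['ID'] being read unconditionally. A minimal sketch of a more defensive lookup follows, using a plain dict as a stand-in for the parsed trailer. The read_encryption helper and the empty-ID fallback are assumptions for illustration, not pdfminer3k's actual code.

```python
def read_encryption(trailer):
    """Return (doc_id, encrypt_dict), or None if the trailer is not encrypted.

    `trailer` is a plain dict standing in for a parsed PDF trailer.
    """
    if 'Encrypt' not in trailer:
        return None
    # Assumption: fall back to two empty IDs when the 'ID' entry is absent.
    doc_id = trailer.get('ID', [b'', b''])
    return (list(doc_id), dict(trailer['Encrypt']))


print(read_encryption({'Encrypt': {'V': 1}}))
```

Whether empty IDs are an acceptable fallback for decryption depends on the file; the sketch only shows how to avoid the KeyError.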

ModuleNotFoundError: No module named 'pdfminer.pdfpage'

I am using Anaconda and installed pdfminer3k from conda-forge.

Error:

runfile('C:/Phoenix/Python/listpdfsandcountwords.py', wdir='C:/Phoenix/Python')
Traceback (most recent call last):

File "", line 1, in
runfile('C:/Phoenix/Python/listpdfsandcountwords.py', wdir='C:/Phoenix/Python')

File "C:\Work\lib\site-packages\spyder\utils\site\sitecustomize.py", line 710, in runfile
execfile(filename, namespace)

File "C:\Work\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Phoenix/Python/listpdfsandcountwords.py", line 14, in <module>
from pdfminer.pdfpage import PDFPage

ModuleNotFoundError: No module named 'pdfminer.pdfpage'

Conda Environment:

(C:\Work) C:\Users\dparamanand>conda info
Current conda install:

           platform : win-64
      conda version : 4.3.29
   conda is private : False
  conda-env version : 4.3.29
conda-build version : 3.0.27
     python version : 3.6.3.final.0
   requests version : 2.18.4
   root environment : C:\Work  (writable)
default environment : C:\Work
   envs directories : C:\Work\envs
                      C:\Users\dparamanand\AppData\Local\conda\conda\envs
                      C:\Users\dparamanand\.conda\envs
      package cache : C:\Work\pkgs
                      C:\Users\dparamanand\AppData\Local\conda\conda\pkgs
       channel URLs : https://repo.continuum.io/pkgs/main/win-64
                      https://repo.continuum.io/pkgs/main/noarch
                      https://repo.continuum.io/pkgs/free/win-64
                      https://repo.continuum.io/pkgs/free/noarch
                      https://repo.continuum.io/pkgs/r/win-64
                      https://repo.continuum.io/pkgs/r/noarch
                      https://repo.continuum.io/pkgs/pro/win-64
                      https://repo.continuum.io/pkgs/pro/noarch
                      https://repo.continuum.io/pkgs/msys2/win-64
                      https://repo.continuum.io/pkgs/msys2/noarch
        config file : C:\Users\dparamanand\.condarc
         netrc file : None
       offline mode : False
         user-agent : conda/4.3.29 requests/2.18.4 CPython/3.6.3 Windows/10 Windows/10.0.16299
      administrator : False

Code:

# -*- coding: utf-8 -*-

"""
Created on Fri Sep 29 10:43:29 2017

@author: dpar0004
"""

import os
#for reading the pdf
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from nltk.corpus import stopwords
from nltk.collocations import TrigramCollocationFinder
from nltk.collocations import QuadgramCollocationFinder

#for counting the sentences and words
import nltk
import collections
from nltk import word_tokenize
from collections import Counter

#for counting most frequent words
import re

def convert(filename, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    # Collect the extracted text in an in-memory buffer.
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = open(filename, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text

pdfFiles = []
dir_name = r'C:\Phoenix\Documents from Bryan'
# Build the stopword set once instead of once per word.
stop_words = set(stopwords.words('english'))
for filename in os.listdir(dir_name):
    # Match the .pdf extension regardless of case.
    if filename.lower().endswith('.pdf'):
        pdfFiles.append(filename)
        text = convert(os.path.join(dir_name, filename))
        sentence_count = len(nltk.tokenize.sent_tokenize(text))
        word_count = len(nltk.tokenize.word_tokenize(text))
        print('\nThe file ', filename, ' has ', word_count, 'words and ', sentence_count, ' sentences in it.\n')

        #use findall for counting most common words, quadgrams, trigrams
        all_text = re.findall(r'\w+', text)
        all_text = map(lambda x: x.lower(), all_text)
        filtered_words = list(filter(lambda word: word not in stop_words and word.isalpha(), all_text))

        word_counts = Counter(filtered_words).most_common(20)
        print('The 20 most commonly occurring words in this file are : \n\n', word_counts)

        print('\nThe 10 most common 3 word combinations appearing in this file are: \n')
        trigram = TrigramCollocationFinder.from_words(filtered_words)
        print(sorted(trigram.ngram_fd.items(), key=lambda t: (-t[1], t[0]))[:10])

        fourgrams = QuadgramCollocationFinder.from_words(filtered_words)
        print('\nThe 10 most common 4 word combinations appearing in this file are: \n')
        print(sorted(fourgrams.ngram_fd.items(), key=lambda t: (-t[1], t[0]))[:10])

        print('----------------------------------------------------------------------------------------------------')
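
The traceback says the installed pdfminer package has no pdfpage submodule. As far as I know, pdfminer.pdfpage belongs to the pdfminer.six distribution, while the other reports on this page use pdfminer3k's older layout (PDFDocument imported from pdfminer.pdfparser). A small sketch for checking which submodules the installed package actually provides; has_module is a hypothetical helper name.

```python
import importlib.util


def has_module(dotted_name):
    """Return True if the module can be located (parent packages get imported)."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. pdfminer itself) is missing.
        return False


# Report which of the relevant modules are importable in this environment.
for name in ('pdfminer', 'pdfminer.pdfpage', 'pdfminer.pdfparser'):
    print(name, has_module(name))
```

If pdfminer.pdfpage is reported as missing while pdfminer.pdfparser is present, the script was probably written against a different pdfminer distribution than the one installed.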

PDFMiner3k: Maximum recursion depth exceeded while calling a Python object

I get a maximum recursion depth exceeded error when using pdfminer3k.
Here is my code:

def readPDF(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    with io_open(path, 'rb') as pdfFile:
        process_pdf(rsrcmgr, device, pdfFile)
    device.close()

    content = retstr.getvalue()
    retstr.close()

    filename = path.replace('pdf', 'txt')
    with open(filename, 'w') as f:
        f.write(content)

This is the error received:

--- Logging error ---
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.6/logging/__init__.py", line 992, in emit
msg = self.format(record)
File "/usr/local/python3/lib/python3.6/logging/__init__.py", line 838, in format
return fmt.format(record)
File "/usr/local/python3/lib/python3.6/logging/__init__.py", line 575, in format
record.message = record.getMessage()
File "/usr/local/python3/lib/python3.6/logging/__init__.py", line 338, in getMessage
msg = msg % self.args
File "/usr/local/python3/lib/python3.6/site-packages/pdfminer/pdftypes.py", line 132, in __repr__
return '<PDFStream(%r): raw=%d, %r>' % (self.objid, len(self.rawdata), self.attrs)
RecursionError: maximum recursion depth exceeded while calling a Python object
Call stack:
File "run_history.py", line 11, in <module>
cmdline.execute("scrapy crawl sse_listedinfo_announcement -a begin_date={0} -a end_date={1} -a path={2}".format(begin_date, end_date, path).split())
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/cmdline.py", line 142, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/local/python3/lib/python3.6/site-packages/scrapy/cmdline.py", line 88, in _run_print_help

Fatal Python error: Cannot recover from stack overflow:
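
The traceback points at PDFStream.__repr__, which formats self.attrs with %r while a log message is being built. If the attributes refer back to the stream itself (an assumption, the report does not confirm it), repr() recurses without end. A minimal sketch of guarding a __repr__ against that with the standard library's reprlib.recursive_repr; the Stream class is a hypothetical stand-in, not pdfminer3k's PDFStream.

```python
import reprlib


class Stream:
    """Hypothetical stand-in for an object whose attrs may refer back to it."""

    def __init__(self, objid):
        self.objid = objid
        self.attrs = {}

    # On a re-entrant call for the same object, return the fill value
    # instead of recursing again.
    @reprlib.recursive_repr(fillvalue='<...>')
    def __repr__(self):
        return '<Stream(%r): %r>' % (self.objid, self.attrs)


s = Stream(1)
s.attrs['Parent'] = s  # self-reference that would otherwise recurse forever
text = repr(s)
print(text)
```

This only protects against self-referencing cycles. Very deep but acyclic nesting would still need a different approach, such as truncating the attribute dump.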

WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont=

When extracting text from a PDF (https://www.aanda.org/articles/aa/pdf/2006/02/aa3061-05.pdf), I got a lot of warnings and the extraction failed.

My code is as follows:
import os
import sys
import importlib
importlib.reload(sys)
from pdfminer.pdfparser import PDFParser,PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal,LAParams
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
def parse(path, target):
    # Start from a clean output file, since results are appended below.
    if os.path.exists(target):
        os.remove(target)
    fp = open(path, 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument()
    parser.set_document(doc)
    doc.set_parser(parser)

    doc.initialize()

    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        rsrcmgr = PDFResourceManager()
        laparams = LAParams(all_texts=True)
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        for page in doc.get_pages():  # doc.get_pages() returns the list of pages
            interpreter.process_page(page)
            layout = device.get_result()
            for x in layout:
                if isinstance(x, LTTextBoxHorizontal):
                    with open(target, 'a', encoding='utf-8') as f:
                        results = x.get_text()
                        # print(results)
                        f.write(results + '\n')

if __name__ == '__main__':
    path = r'./pdf/aa3061-05.pdf'
    parse(path, path.replace('.pdf', '.txt'))

The warnings:
......
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BICMGG+txex'>, 4
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BICMGG+txex'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
WARNING:pdfminer.converter:undefined: <PDFType1Font: basefont='BIBNJI+txsy'>, 5
......
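
The "WARNING:pdfminer.converter:" prefix indicates that these messages come from a logger named pdfminer.converter. A minimal sketch of raising that logger's threshold so the warnings are not printed; this only hides the messages and is not expected to change what text gets extracted.

```python
import logging

# Keep ordinary warnings from other libraries visible.
logging.basicConfig(level=logging.WARNING)

# Logger name taken from the prefix of the warning lines above.
converter_log = logging.getLogger('pdfminer.converter')
# Only ERROR and above from this logger will be emitted from now on.
converter_log.setLevel(logging.ERROR)

converter_log.warning('undefined: this message is suppressed')
```

Setting the level back to logging.WARNING restores the messages, which is useful when checking which fonts have undefined character mappings.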
