micka33 / content-extractor

Extract meaningful content from PDF and PSD files, such as text and images, both linked into a common JSON string

Home Page: https://micka33.github.io/content-extractor/

Python 97.39% Shell 0.85% Ruby 1.75%

content-extractor's Issues

No module named psd_tools

Hi,

When I execute the parser script, I get the following error:

Traceback (most recent call last):
  File "./parser.py", line 36, in <module>
    json = parse(sys.argv[1], sys.argv[2])
  File "./parser.py", line 24, in parse
    from psdtools import main
  File "/Users/xyz/Documents/tools/content-extractor/psdtools/main.py", line 1, in <module>
    from psd_tools import PSDImage
ImportError: No module named psd_tools

I have installed all the dependencies (curl, brew, python 2.7 ...).

I also tried to install psd_tools separately:

Requirement already satisfied (use --upgrade to upgrade): psd-tools in /usr/local/lib/python2.7/site-packages
Requirement already satisfied (use --upgrade to upgrade): docopt>=0.5 in /usr/local/lib/python2.7/site-packages (from psd-tools)

I always get the same error. Do you have any idea what is going wrong?
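
One possible cause, and this is an assumption since the report does not show which interpreter runs parser.py, is that pip installed psd-tools for a different Python than the one executing the script. A minimal sketch to check this is below; `describe_module` is a hypothetical helper, not part of content-extractor. Run it with the same interpreter that runs parser.py.

```python
import importlib
import sys


def describe_module(name):
    """Report whether `name` is importable by the interpreter running this script."""
    info = {"interpreter": sys.executable, "module": name}
    try:
        module = importlib.import_module(name)
    except ImportError as exc:
        info["found"] = False
        info["error"] = str(exc)
    else:
        info["found"] = True
        # Built-in modules have no __file__, so fall back to None.
        info["location"] = getattr(module, "__file__", None)
    return info


if __name__ == "__main__":
    report = describe_module("psd_tools")
    print(report)
    if not report["found"]:
        # Listing sys.path shows where this interpreter looks for packages.
        for entry in sys.path:
            print(entry)
```

If `found` is False and /usr/local/lib/python2.7/site-packages is missing from the printed path, installing with that interpreter's own pip (`python -m pip install psd-tools`) may resolve the error.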

pdfminer has changed its API and broken some links

The euske / pdfminer repository has changed the location of the PDFDocument class, as noted in its README. That class is easy to find again, but other things have changed as well, as the following error message shows. I will not pursue this any further and will use pdfminer directly.

[..]/pdfsplitter/content_extractor/pdfreader/util/convert.py in <module>()
        2 from pdfminer.pdfparser import PDFParser
        3 from pdfminer.pdfdocument import PDFDocument
----> 4 from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
        5 from pdfminer.pdfdevice import PDFDevice, TagExtractor
        6 from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter

ImportError: cannot import name process_pdf
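
For anyone who wants to keep convert.py working, a possible workaround is a small compatibility shim. The sketch below assumes a pdfminer release that provides `pdfminer.pdfpage.PDFPage` and no longer exports `process_pdf`; the function mirrors the old helper's arguments as I remember them, so treat the signature as an assumption rather than the library's API.

```python
def process_pdf(rsrcmgr, device, fp, pagenos=None, maxpages=0,
                password="", caching=True, check_extractable=True):
    """Stand-in for the removed pdfminer.pdfinterp.process_pdf helper."""
    # Imported lazily so this shim can be defined even if pdfminer's
    # module layout differs from what is assumed here.
    from pdfminer.pdfinterp import PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Newer pdfminer iterates pages explicitly instead of using process_pdf.
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=check_extractable):
        interpreter.process_page(page)
```

With this in place, line 4 of convert.py would import only PDFResourceManager and PDFPageInterpreter from pdfminer.pdfinterp and take `process_pdf` from the shim. Other API changes may still surface, as the report suggests.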

IndexError: list index out of range

I get the following error, although the examples are processed successfully:

  File "general.py", line 12, in <module>
    json = main.run("Programming with PDFMiner.pdf", "./images/")
  File "D:\codegit\python2.7\pdfminer\content-extractor-master\pdfreader\main.py", line 82, in run
    dict_book = text_to_dict(pdf_file)
  File "D:\codegit\python2.7\pdfminer\content-extractor-master\pdfreader\main.py", line 13, in text_to_dict
    b = book(xml)
  File "D:\codegit\python2.7\pdfminer\content-extractor-master\pdfreader\lib\book.py", line 20, in __init__
    self._pages.append(page(p))
  File "D:\codegit\python2.7\pdfminer\content-extractor-master\pdfreader\lib\page.py", line 17, in __init__
    self._paragraphs.append(paragraph(p))
  File "D:\codegit\python2.7\pdfminer\content-extractor-master\pdfreader\lib\paragraph.py", line 39, in __init__
    self._lines.append(line(l))
  File "D:\codegit\python2.7\pdfminer\content-extractor-master\pdfreader\lib\line.py", line 40, in __init__
    self._chars.append(char(c))
  File "D:\codegit\python2.7\pdfminer\content-extractor-master\pdfreader\lib\char.py", line 30, in __init__
    self._font = xml_char.get('font').split('+')[1] if xml_char.get('font') != None else None
IndexError: list index out of range
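
The failing line assumes every font attribute contains a '+', which holds for subset fonts (for example "ABCDEF+Helvetica") but not for fonts without a subset prefix. A likely explanation, though not confirmed by the maintainer, is that this PDF contains such a font, so `split('+')` returns a one-element list and index 1 does not exist. A minimal sketch of a tolerant parser follows; `font_name` is a hypothetical helper, not part of content-extractor.

```python
def font_name(raw_font):
    """Return the font name without its optional subset prefix.

    Subset fonts look like "ABCDEF+Helvetica"; other fonts carry no
    '+' at all, and a missing attribute arrives as None.
    """
    if raw_font is None:
        return None
    # Split at most once and take the last piece, so a value without
    # '+' is returned unchanged instead of raising IndexError.
    return raw_font.split("+", 1)[-1]
```

Line 30 of char.py could then read `self._font = font_name(xml_char.get('font'))`.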
