jlsutherland / doc2text

Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

License: MIT License

Python 93.93% Shell 6.07%

doc2text's Issues

Support for non scanned documents (.doc, .docx, regular pdf)

Hi @jlsutherland, and thanks for this cool module. OCR is a hard problem, and you provide a pretty efficient and simple solution.

Would you be interested in a PR that adds text extraction for non-scanned documents? I think it fits the module name "doc2text" quite well, but maybe you want to stick with just OCR. Let me know.

No module named PythonMagick

I have installed doc2text and its required packages, but when I try to import doc2text I get the error "No module named PythonMagick".

Maybe a stupid question about the api, can't find in source code

Thanks for the nice code.

I have a question about the code. The examples show this usage:

doc = doc2text.Document()

# You can pass the lang (as 3 letters code) to the class to improve accuracy
# On ubuntu it requires the package tesseract-ocr-$lang$
# On other OS, see https://github.com/tesseract-ocr/langdata
doc = doc2text.Document(lang="eng")

# Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.
# If reading a PDF, doc2text will split the PDF into its component pages.
doc.read('./path/to/my/file')

# Crop the pages down to estimated text regions, deskew, and optimize for OCR.
doc.process()

# Extract text from the pages.
doc.extract_text()
text = doc.get_text()

But when I look for API methods such as .process and .read,
I can't find them in the source. Any suggestions?
Thanks
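
One way to answer this kind of question is to ask Python itself where a method is defined. The sketch below uses only the standard inspect module. where_defined is a hypothetical helper name, and json.JSONDecoder is just a stand-in for doc2text.Document so that the example runs without doc2text installed.

```python
import inspect
import json


def where_defined(cls, method_name):
    """Return (source_file, first_line_number) for a method of cls."""
    method = getattr(cls, method_name)
    source_file = inspect.getsourcefile(method)
    _, first_line = inspect.getsourcelines(method)
    return source_file, first_line


# json.JSONDecoder is only a stand-in; with doc2text installed one could
# call where_defined(doc2text.Document, "process") instead.
path, line = where_defined(json.JSONDecoder, "decode")
print(path, line)
```

The printed path points at the file to open, and the line number at the start of the method definition.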

Does not work on python3

I installed with pip install doc2text, then tried to import doc2text in an IPython shell. This gave an error at __init__.py line 77 because of a print statement with no parentheses.

it'd be nice if this could produce text-overlaid PDFs

Tesseract now seems able to produce PDFs with text overlaid on the image. This is useful for searching in the PDF when viewing it that way.

It would also be nice if this could produce clean, de-skewed PDFs.
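
For reference, the tesseract command-line tool writes a searchable PDF when pdf is given as the output config. The sketch below only builds that command. searchable_pdf_command is a hypothetical helper, not part of doc2text, and the file names are placeholders.

```python
import subprocess


def searchable_pdf_command(image_path, output_base, lang="eng"):
    """Build the tesseract CLI call that writes <output_base>.pdf with a text layer."""
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


cmd = searchable_pdf_command("page.png", "page_ocr")
print(" ".join(cmd))
# To actually run it (requires tesseract on PATH and an existing page.png):
# subprocess.run(cmd, check=True)
```

This does not deskew the page; any deskewing would have to happen on the image before it is handed to tesseract.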

Can't pip install this

Can't seem to install it. Lots of errors: http://hastebin.com/alotujupog.vhdl

When I call the process_image() method, the image is not cropped accurately (attached below). Which of the calls in the method do I need to modify, and how, to get accurately cropped output?

AttributeError: Document instance has no attribute 'file_basename'

>>> import doc2text
>>> doc = doc2text.Document()
>>> doc.read('test.png')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jwilk/.local/lib/python2.7/site-packages/doc2text/__init__.py", line 67, in read
    self.file_basepath, self.file_basename + '_temp.png'
AttributeError: Document instance has no attribute 'file_basename'

Tested with git master (41dca91).
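
The traceback shows read() building a temporary file name from self.file_basepath and self.file_basename before the latter has been set. How doc2text assigns those attributes is not shown here, so the following is only a sketch of one way to derive both values up front for any input type. split_input_path is a hypothetical helper.

```python
import os


def split_input_path(path):
    """Return (directory, base name without extension, lowercase extension)."""
    directory, filename = os.path.split(os.path.abspath(path))
    basename, extension = os.path.splitext(filename)
    return directory, basename, extension.lower()


directory, basename, extension = split_input_path("test.png")
# Same naming pattern as the line quoted in the traceback.
temp_path = os.path.join(directory, basename + "_temp.png")
print(temp_path)
```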

Unable to process

Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 25
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 211
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 80
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 25
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 211
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 80
dst is not a numpy array, neither a scalar

Error on pip install PythonMagick

PythonMagick is a required package for doc2text. I installed it through pip.

(doc2txt) ➜  Programs pip install PythonMagick
Collecting PythonMagick
  Could not find a version that satisfies the requirement PythonMagick (from versions: )
No matching distribution found for PythonMagick

Does anyone know what's wrong? Thanks.

FileNotFoundError

When I call doc.extract_text(), it fails with a "file not found" error.
I'm using Windows 10, Python 3.6, and Jupyter.
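
The report does not include a traceback, so the cause is unknown. One possibility, stated here as an assumption, is that the tesseract executable that pytesseract invokes is not on PATH. A minimal check using the standard library (find_executable is a hypothetical helper name):

```python
import shutil


def find_executable(name):
    """Return the full path of an executable on PATH, or None if absent."""
    return shutil.which(name)


tesseract_path = find_executable("tesseract")
if tesseract_path is None:
    print("tesseract was not found on PATH")
else:
    print("tesseract found at", tesseract_path)
```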

Python 3.5 compatibility

The library does not seem to be 100% Python 3 compatible. When I try to run this simple code:

import doc2text

doc = doc2text.Document()
doc = doc2text.Document(lang="eng")
doc.read('pdf-sample.pdf')

I'm getting

Traceback (most recent call last):
  File "doc2text_test.py", line 13, in <module>
    doc.read('pdf-sample.pdf')
  File "/usr/local/lib/python3.5/dist-packages/doc2text/__init__.py", line 44, in read
    for i in xrange(self.num_pages):
NameError: name 'xrange' is not defined
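
For context, xrange exists only in Python 2; in Python 3, range is already lazy and takes its place. The sketch below shows a common compatibility shim. range_compat and num_pages are illustrative names, not taken from the doc2text source.

```python
try:
    range_compat = xrange  # Python 2: lazy integer sequence
except NameError:
    range_compat = range   # Python 3: range is already lazy

num_pages = 3  # placeholder page count for illustration
for i in range_compat(num_pages):
    print("page", i)
```

The same pattern does not help with Python 2 print statements, which are syntax errors in Python 3 and have to be rewritten as print() calls.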

Error passing the lang to the class

When I try to pass the language as in the example:

doc = doc2text.Document(lang="por")

I receive the following error message:

    doc = doc2text.Document(lang="por")
TypeError: __init__() got an unexpected keyword argument 'lang'
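
The error suggests that the installed release predates the lang parameter shown in the example; that is an assumption, since the issue does not state the installed version. A sketch of how calling code could check which keyword arguments a constructor accepts before passing them follows. supported_kwargs, OldDocument, and NewDocument are all hypothetical stand-ins.

```python
import inspect


def supported_kwargs(cls, **kwargs):
    """Keep only the keyword arguments that cls.__init__ accepts."""
    params = inspect.signature(cls.__init__).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {name: value for name, value in kwargs.items() if name in params}


class OldDocument:
    """Stand-in for a release whose constructor has no lang parameter."""
    def __init__(self):
        self.lang = None


class NewDocument:
    """Stand-in for a release whose constructor accepts lang."""
    def __init__(self, lang=None):
        self.lang = lang


old_doc = OldDocument(**supported_kwargs(OldDocument, lang="por"))
new_doc = NewDocument(**supported_kwargs(NewDocument, lang="por"))
print(old_doc.lang, new_doc.lang)
```

Silently dropping lang means OCR falls back to the default language, so in practice upgrading the package is likely the better fix.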

issue with extract_text

When doing:

import doc2text
doc = doc2text.Document()
doc.read('something.pdf')
doc.process()
doc.extract_text()

I get the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-5-57184997370d> in <module>()
----> 1 doc.extract_text()

/usr/local/lib/python2.7/dist-packages/doc2text/__init__.pyc in extract_text(self)
     89             for page in self.processed_pages:
     90                 new = page
---> 91                 text = new.extract_text()
     92                 self.page_content.append(text)
     93         else:

/usr/local/lib/python2.7/dist-packages/doc2text/page.pyc in extract_text(self)
     36     def extract_text(self):
     37         temp_path = 'text_temp.png'
---> 38         cv2.imwrite(temp_path, self.image)
     39         self.text = pytesseract.image_to_string(Image.open(temp_path))
     40         os.remove(temp_path)

AttributeError: Page instance has no attribute 'image'

ModuleNotFoundError: No module named 'PyPDF2'

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    import doc2text
  File "/Users/Stan/Downloads/doc2text-master/doc2text/__init__.py", line 6, in <module>
    import PyPDF2 as pyPdf
ModuleNotFoundError: No module named 'PyPDF2'
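
Several of these reports come down to a missing dependency. A quick way to see which imports would fail, without actually importing them, is sketched below. The module names are the ones that appear in the tracebacks in these issues; missing_modules is a hypothetical helper.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names taken from the tracebacks in these issues.
required = ["PyPDF2", "PythonMagick", "cv2", "pytesseract"]
print(missing_modules(required))
```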

AttributeError: 'Page' object has no attribute 'image' ISSUE

Hi there,
I am testing your product, but I am getting this error:

Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 25
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 211
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 80
dst is not a numpy array, neither a scalar
Traceback (most recent call last):
  File "example_doc2text.py", line 19, in <module>
    doc.extract_text()
  File "/usr/local/lib/python2.7/dist-packages/doc2text/__init__.py", line 96, in extract_text
    text = new.extract_text()
  File "/usr/local/lib/python2.7/dist-packages/doc2text/page.py", line 46, in extract_text
    cv2.imwrite(temp_path, self.image)
AttributeError: 'Page' object has no attribute 'image'

My test file is as follows:

> import doc2text
> 
> # Initialize the class.
> doc = doc2text.Document()
> 
> # You can pass the lang (as 3 letters code) to the class to improve accuracy
> # On ubuntu it requires the package tesseract-ocr-$lang$
> # On other OS, see https://github.com/tesseract-ocr/langdata
> doc = doc2text.Document(lang="eng")
> 
> # Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.
> # If reading a PDF, doc2text will split the PDF into its component pages.
> doc.read('myfile.tiff')
> 
> # Crop the pages down to estimated text regions, deskew, and optimize for OCR.
> doc.process()
> 
> # Extract text from the pages.
> doc.extract_text()
> text = doc.get_text()
> print text

Could you please help me?
Thanks a lot.

What is wrong with this ? Can someone please explain ?

doc.read('/home/ubuntu/doc2text/test.jpg')
  File "/usr/local/lib/python2.7/dist-packages/doc2text/__init__.py", line 78, in read
    raise FileNotAcceptedException
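
The traceback does not show how doc2text decides that a file is not accepted, so the following is only a pre-check sketch under that caveat. It tests the two most obvious conditions: that the path exists and that its extension is among the formats listed in the usage example. precheck and ACCEPTED_EXTENSIONS are hypothetical names.

```python
import os

# Formats listed in the usage example quoted in these issues.
ACCEPTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".bmp", ".tiff"}


def precheck(path):
    """Return a short reason the file may be rejected, or None if it looks fine."""
    if not os.path.isfile(path):
        return "file does not exist"
    extension = os.path.splitext(path)[1].lower()
    if extension not in ACCEPTED_EXTENSIONS:
        return "unexpected extension: " + repr(extension)
    return None


print(precheck("/home/ubuntu/doc2text/test.jpg"))
```

If both checks pass and the exception still occurs, the library's own file-type detection would be the next place to look.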

Error on doc.process()

When doing:

import doc2text
doc = doc2text.Document()
doc.read('something.pdf')
doc.process()

I get:

Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 23
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 197
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 77
dst is not a numpy array, neither a scalar

And then, when I do:

doc.extract_text()

I get:

AttributeError                            Traceback (most recent call last)
<ipython-input-5-57184997370d> in <module>()
----> 1 doc.extract_text()

/usr/local/lib/python2.7/dist-packages/doc2text/__init__.pyc in extract_text(self)
     89             for page in self.processed_pages:
     90                 new = page
---> 91                 text = new.extract_text()
     92                 self.page_content.append(text)
     93         else:

/usr/local/lib/python2.7/dist-packages/doc2text/page.pyc in extract_text(self)
     36     def extract_text(self):
     37         temp_path = 'text_temp.png'
---> 38         cv2.imwrite(temp_path, self.image)
     39         self.text = pytesseract.image_to_string(Image.open(temp_path))
     40         os.remove(temp_path)

AttributeError: Page instance has no attribute 'image'

text extraction from png files does not seem to work

Thank you for this fantastic utility.

Text extraction fails for every PNG image containing text, while JPG and PDF files work. Is this a known issue, and will there be a fix? Thanks.

Can not install pythonmagick.

I tried sudo apt-get install python-pythonmagick and pip install so that doc2text would import properly. But even after successfully installing python-pythonmagick via apt, I still cannot import doc2text.

I checked the package that apt installed for python-pythonmagick, and it seems to support only Python 2.

Could you help fix this? I want to use doc2text on Python 3 (Ubuntu).

Does it support stream data?

I have a Flask app that receives a file through its API, and I want to extract the text without saving the file to disk. Is there any way to do this? When I pass the stream contents directly, I get the error below.

Code:
file = request.files['file']
file_data = file.stream.read()

Error:

\venv\lib\site-packages\docx2txt\docx2txt.py", line 76, in process
    zipf = zipfile.ZipFile(docx)
  File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 1225, in __init__
    self._RealGetContents()
  File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 1288, in _RealGetContents
    endrec = _EndRecData(fp)
  File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 259, in _EndRecData
    fpin.seek(0, 2)
AttributeError: 'bytes' object has no attribute 'seek'
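
Note that this traceback comes from docx2txt rather than doc2text. The failure itself is generic: zipfile.ZipFile needs a path or a seekable file-like object, and raw bytes are neither. Wrapping the bytes in io.BytesIO supplies the missing seek(). The sketch below builds a tiny archive in memory as a stand-in for the uploaded file; the archive contents are placeholders.

```python
import io
import zipfile

# Build a small zip archive in memory to stand in for an uploaded file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
file_data = buffer.getvalue()  # raw bytes, like file.stream.read()

# Passing file_data directly is what triggers the missing seek() error.
# Wrapping the bytes in BytesIO gives a seekable file-like object.
with zipfile.ZipFile(io.BytesIO(file_data)) as archive:
    names = archive.namelist()
print(names)
```

The doc2text examples above pass a file path to read(), so for doc2text itself a temporary file may still be needed; whether it accepts file-like objects is not shown here.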
