jlsutherland / doc2text
Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.
License: MIT License
>>> import doc2text
>>> doc = doc2text.Document()
>>> doc.read('test.png')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jwilk/.local/lib/python2.7/site-packages/doc2text/__init__.py", line 67, in read
self.file_basepath, self.file_basename + '_temp.png'
AttributeError: Document instance has no attribute 'file_basename'
Tested with git master (41dca91).
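The traceback suggests that `read()` reaches the `_temp.png` line on a code path where `file_basename` was never set. A minimal sketch, not the library's actual code, of deriving both pieces from the input path up front; `split_input_path` is a hypothetical helper and only the attribute names come from the traceback:

```python
import os

def split_input_path(path):
    """Return (directory, stem, extension) for an input file path."""
    directory, filename = os.path.split(os.path.abspath(path))
    stem, extension = os.path.splitext(filename)
    return directory, stem, extension.lower()

# Build the temp file name the same way the traceback line does.
directory, stem, extension = split_input_path('test.png')
temp_path = os.path.join(directory, stem + '_temp.png')
```

Computing these values before any branch on file type would mean every branch can rely on them.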
When doing:
import doc2text
doc = doc2text.Document()
doc.read('something.pdf')
doc.process()
doc.extract_text()
I get the following error:
AttributeError Traceback (most recent call last)
<ipython-input-5-57184997370d> in <module>()
----> 1 doc.extract_text()
/usr/local/lib/python2.7/dist-packages/doc2text/__init__.pyc in extract_text(self)
89 for page in self.processed_pages:
90 new = page
---> 91 text = new.extract_text()
92 self.page_content.append(text)
93 else:
/usr/local/lib/python2.7/dist-packages/doc2text/page.pyc in extract_text(self)
36 def extract_text(self):
37 temp_path = 'text_temp.png'
---> 38 cv2.imwrite(temp_path, self.image)
39 self.text = pytesseract.image_to_string(Image.open(temp_path))
40 os.remove(temp_path)
AttributeError: Page instance has no attribute 'image'
I installed with pip install doc2text, then tried to import doc2text in an ipython shell. This gave an error at line 77 of __init__.py because of a print statement with no parentheses.
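This is the usual Python 2 versus Python 3 difference in `print`. As a small illustration (the message text is made up, since the actual line 77 is not shown in this thread), the call form with a single argument runs under both interpreters:

```python
# Hypothetical status message; the real line in doc2text is not quoted here.
num_pages = 3
message = "Read {} pages".format(num_pages)

# Python 2 only:   print message
# Python 2 and 3:  print(message)
print(message)
```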
tesseract seems to be able to produce PDFs these days with text overlaid on the image. This is useful for searching in the PDF when viewing it that way.
It'd be nice if this could produce nice de-skewed PDFs.
Traceback (most recent call last):
File "test.py", line 1, in <module>
import doc2text
File "/Users/Stan/Downloads/doc2text-master/doc2text/__init__.py", line 6, in <module>
import PyPDF2 as pyPdf
ModuleNotFoundError: No module named 'PyPDF2'
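Running from a downloaded source tree skips pip's dependency installation, so imports like this one can be missing. A quick way to see which ones are absent is sketched below; the module list is an assumption pieced together from the tracebacks in this thread (`PIL` is a guess based on `Image.open`), not the project's official requirements:

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names seen in the tracebacks here; adjust to the real requirements.
required = ["PyPDF2", "cv2", "pytesseract", "PIL"]
print("Missing:", missing_modules(required))
```

Anything the check reports would then need to be installed into the same interpreter that runs test.py.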
When doing:
import doc2text
doc = doc2text.Document()
doc.read('something.pdf')
doc.process()
I get:
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 23
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 197
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 77
dst is not a numpy array, neither a scalar
And then, when I do:
doc.extract_text()
I get:
AttributeError Traceback (most recent call last)
<ipython-input-5-57184997370d> in <module>()
----> 1 doc.extract_text()
/usr/local/lib/python2.7/dist-packages/doc2text/__init__.pyc in extract_text(self)
89 for page in self.processed_pages:
90 new = page
---> 91 text = new.extract_text()
92 self.page_content.append(text)
93 else:
/usr/local/lib/python2.7/dist-packages/doc2text/page.pyc in extract_text(self)
36 def extract_text(self):
37 temp_path = 'text_temp.png'
---> 38 cv2.imwrite(temp_path, self.image)
39 self.text = pytesseract.image_to_string(Image.open(temp_path))
40 os.remove(temp_path)
AttributeError: Page instance has no attribute 'image'
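The "dst is not a numpy array" messages appear to mean that processing failed, so `image` was never set on the page and `extract_text()` then trips over the missing attribute. A hedged sketch for spotting such pages before extraction; `pages_with_images` and `FakePage` are hypothetical, and only the names `processed_pages` and `image` come from the traceback:

```python
def pages_with_images(pages):
    """Return (ready, failed) lists of page indices, split on a usable `image`."""
    ready, failed = [], []
    for index, page in enumerate(pages):
        if getattr(page, "image", None) is not None:
            ready.append(index)
        else:
            failed.append(index)
    return ready, failed

class FakePage(object):
    """Stand-in for doc2text's Page class, for illustration only."""
    def __init__(self, image=None):
        if image is not None:
            self.image = image

# One page that processed correctly and one that did not.
ready, failed = pages_with_images([FakePage("pixels"), FakePage()])
```

With a real document, something like `pages_with_images(doc.processed_pages)` after `doc.process()` would show which pages failed, which is useful detail to add to a bug report.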
I have a Flask app that receives the file through the API, and I want to extract the text from it without saving it to disk. Is there any way to do this? I tried passing the contents of the stream directly, but that gives me the error below.
Code:
file = request.files['file']
file_data = file.stream.read()
Error:
\venv\lib\site-packages\docx2txt\docx2txt.py", line 76, in process
zipf = zipfile.ZipFile(docx)
File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 1225, in __init__
self._RealGetContents()
File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 1288, in _RealGetContents
endrec = _EndRecData(fp)
File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 259, in _EndRecData
fpin.seek(0, 2)
AttributeError: 'bytes' object has no attribute 'seek'
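The traceback shows the argument being handed straight to `zipfile.ZipFile`, which accepts a path or a seekable file-like object but not raw `bytes`. A minimal sketch of wrapping the bytes in `io.BytesIO`; the in-memory archive below only stands in for an uploaded .docx, and `open_zip_from_bytes` is a hypothetical helper:

```python
import io
import zipfile

def open_zip_from_bytes(data):
    """Wrap raw bytes in a seekable in-memory file so zipfile can read them."""
    return zipfile.ZipFile(io.BytesIO(data))

# Build a small archive in memory to stand in for the uploaded file's bytes.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
file_data = buffer.getvalue()

# Reading works without touching the disk.
with open_zip_from_bytes(file_data) as zipf:
    names = zipf.namelist()
```

In the Flask handler, passing `io.BytesIO(file_data)` instead of `file_data` is the thing to try. Whether the rest of docx2txt is happy with a file-like object is an assumption based on this traceback alone.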
I tried sudo apt-get install python-pythonmagick and pip install to make sure doc2text imports correctly. But even though python-pythonmagick installed successfully via apt, I still cannot import doc2text.
I checked the package that apt installed for python-pythonmagick, and it seems to support only Python 2.
Could you help fix this? I want to use doc2text on Python 3 (Ubuntu).
When I try to pass the language as in the example:
doc = doc2text.Document(lang="por")
I received the following error message:
doc = doc2text.Document(lang="por")
TypeError: __init__() got an unexpected keyword argument 'lang'
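One possible explanation is that the installed release is older than the README it was copied from. A hedged way to check what the installed constructor accepts, using `inspect.signature`; `accepts_keyword` and the two example classes are hypothetical stand-ins, not doc2text code:

```python
import inspect

def accepts_keyword(callable_obj, name):
    """Return True if callable_obj takes `name` as a parameter or has **kwargs."""
    params = inspect.signature(callable_obj).parameters
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())

class OldDocument(object):
    """Stand-in for a release whose constructor takes no arguments."""
    def __init__(self):
        pass

class NewDocument(object):
    """Stand-in for a release whose constructor accepts `lang`."""
    def __init__(self, lang=None):
        self.lang = lang

old_ok = accepts_keyword(OldDocument, "lang")
new_ok = accepts_keyword(NewDocument, "lang")
```

Running `accepts_keyword(doc2text.Document, "lang")` against the installed package would show whether `lang` is supported there.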
Thanks for the nice code.
I have a question about it. The examples use:
doc = doc2text.Document()
# You can pass the lang (as 3 letters code) to the class to improve accuracy
# On ubuntu it requires the package tesseract-ocr-$lang$
# On other OS, see https://github.com/tesseract-ocr/langdata
doc = doc2text.Document(lang="eng")
# Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.
# If reading a PDF, doc2text will split the PDF into its component pages.
doc.read('./path/to/my/file')
# Crop the pages down to estimated text regions, deskew, and optimize for OCR.
doc.process()
# Extract text from the pages.
doc.extract_text()
text = doc.get_text()
but when I look for methods such as .process and .read, I can't find them in the source. Any suggestions?
Thanks
Thank you for this fantastic utility.
Text extraction is not successful for any PNG image containing text, while JPG and PDF work. Is this a known issue, and will there be a fix? Thanks.
Can't seem to install it. Lots of errors: http://hastebin.com/alotujupog.vhdl
It seems the library is not 100% Python 3 compatible. When I try to run this simple code:
import doc2text
doc = doc2text.Document()
doc = doc2text.Document(lang="eng")
doc.read('pdf-sample.pdf')
I'm getting
Traceback (most recent call last):
File "doc2text_test.py", line 13, in <module>
doc.read('pdf-sample.pdf')
File "/usr/local/lib/python3.5/dist-packages/doc2text/__init__.py", line 44, in read
for i in xrange(self.num_pages):
NameError: name 'xrange' is not defined
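`xrange` was removed in Python 3, where `range` is already lazy. A common compatibility shim is sketched below; it illustrates the general pattern rather than a patch taken from the project, and `num_pages` is given a made-up value:

```python
# Fall back to range on Python 3, where xrange no longer exists.
try:
    xrange
except NameError:
    xrange = range

num_pages = 3  # hypothetical page count
page_indices = [i for i in xrange(num_pages)]
```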
Is this supported on Windows?
doc.read('/home/ubuntu/doc2text/test.jpg')
File "/usr/local/lib/python2.7/dist-packages/doc2text/__init__.py", line 78, in read
raise FileNotAcceptedException
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 25
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 211
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 80
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 25
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 211
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 80
dst is not a numpy array, neither a scalar
Hi there,
I am testing your product, but I am getting this type of error:
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 25
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 211
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 80
dst is not a numpy array, neither a scalar
Traceback (most recent call last):
File "example_doc2text.py", line 19, in <module>
doc.extract_text()
File "/usr/local/lib/python2.7/dist-packages/doc2text/__init__.py", line 96, in extract_text
text = new.extract_text()
File "/usr/local/lib/python2.7/dist-packages/doc2text/page.py", line 46, in extract_text
cv2.imwrite(temp_path, self.image)
AttributeError: 'Page' object has no attribute 'image'
My test file is as follows:
> import doc2text
>
> # Initialize the class.
> doc = doc2text.Document()
>
> # You can pass the lang (as 3 letters code) to the class to improve accuracy
> # On ubuntu it requires the package tesseract-ocr-$lang$
> # On other OS, see https://github.com/tesseract-ocr/langdata
> doc = doc2text.Document(lang="eng")
>
> # Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.
> # If reading a PDF, doc2text will split the PDF into its component pages.
> doc.read('myfile.tiff')
>
> # Crop the pages down to estimated text regions, deskew, and optimize for OCR.
> doc.process()
>
> # Extract text from the pages.
> doc.extract_text()
> text = doc.get_text()
> print text
Could you please help me?
Thanks a lot.
When I call doc.extract_text(), it gives a file-not-found error.
I'm using Windows 10, Python 3.6, and Jupyter.
PythonMagick is a required package for doc2text. I tried to install it through pip:
(doc2txt) ➜ Programs pip install PythonMagick
Collecting PythonMagick
Could not find a version that satisfies the requirement PythonMagick (from versions: )
No matching distribution found for PythonMagick
Does anyone know what's wrong? Thanks.
Hi @jlsutherland, and thanks for this cool module. OCR is a hard problem, and you provide a pretty efficient and simple solution.
Would you be interested in a PR adding text extraction for non-scanned documents? I think it fits the module name "doc2text" quite well, but maybe you want to stick with just OCR. Let me know.