jlsutherland / doc2text

Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

License: MIT License

Python 93.93% Shell 6.07%

doc2text's Introduction

doc2text


doc2text extracts higher quality text by fixing common scan errors

Developing text corpora can be a massive pain in the butt. Much of the text data we are interested in as scientists is locked away in PDFs that are poorly scanned. These scans can be off-kilter, low resolution, or have a hand in them... and if you OCR these scans without fixing these errors, the OCR doesn't turn out so well. doc2text was created to help researchers fix these errors and extract the highest-quality text possible from their PDFs.

doc2text is super duper alpha atm

doc2text is developed and tested on Ubuntu 16.04 LTS Xenial Xerus. We do not pretend to serve all operating systems at the moment because that would be irresponsible. Please use this software with a huge grain of salt. We are currently working on:

  • Increasing the responsiveness of the text block identifier.
  • Optimizing the binarization for tesseract detection.
  • Identifying text in multiple columns (right now, it treats them as one big column).
  • Handling tables.
  • Many other optimizations.

Support and Contributions

If you have feedback or would like to contribute, please, please submit a pull request or contact me at joseph dot sutherland at columbia dot edu.

Installation

To install the doc2text package, simply:

pip install doc2text

doc2text relies on the OpenCV, tesseract, and PythonMagick libraries. To execute the quick-install script, which installs OpenCV, tesseract, and PythonMagick:

curl https://raw.githubusercontent.com/jlsutherland/doc2text/master/install_deps.sh | bash

Manual installation

To install OpenCV manually:

sudo apt-get install -y build-essential
sudo apt-get install -y cmake git libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev
sudo apt-get install -y python-dev python-numpy libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libjasper-dev libdc1394-22-dev
git clone https://github.com/opencv/opencv.git opencv
git clone https://github.com/opencv/opencv_contrib.git opencv_contrib
cd opencv
git checkout 3.1.0
cd ../opencv_contrib
git checkout 3.1.0
cd ../opencv
mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D INSTALL_C_EXAMPLES=OFF -D INSTALL_PYTHON_EXAMPLES=ON -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib/modules -D BUILD_EXAMPLES=ON ..
make -j4
sudo make install
sudo ldconfig

To install tesseract manually:

sudo apt-get install tesseract-ocr

To install PythonMagick manually:

sudo apt-get install python-pythonmagick

Example usage

import doc2text

# Initialize the class.
doc = doc2text.Document()

# You can pass the lang (as a 3-letter code) to the class to improve accuracy.
# On Ubuntu this requires the package tesseract-ocr-$lang$
# On other OS, see https://github.com/tesseract-ocr/langdata
doc = doc2text.Document(lang="eng")

# Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.
# If reading a PDF, doc2text will split the PDF into its component pages.
doc.read('./path/to/my/file')

# Crop the pages down to estimated text regions, deskew, and optimize for OCR.
doc.process()

# Extract text from the pages.
doc.extract_text()
text = doc.get_text()
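
Since the project describes itself as a bulk tool, the sequence above can be wrapped in a loop. The following is only a minimal sketch: `ocr_files` and its `make_document` parameter are hypothetical names introduced here, not part of doc2text. The document factory is passed in so that the sketch relies on nothing beyond the four methods shown in the example.

```python
def ocr_files(paths, make_document):
    """Run read/process/extract_text/get_text over several files.

    make_document is a zero-argument callable that returns a fresh
    document object exposing the four methods from the README example.
    Returns (texts, errors), both dictionaries keyed by path.
    """
    texts = {}
    errors = {}
    for path in paths:
        doc = make_document()
        try:
            doc.read(path)
            doc.process()
            doc.extract_text()
            texts[path] = doc.get_text()
        except Exception as exc:
            # Record the failure and continue, so that one bad scan
            # does not stop the rest of the batch.
            errors[path] = exc
    return texts, errors
```

With doc2text installed, a call might look like `texts, errors = ocr_files(['a.pdf', 'b.png'], doc2text.Document)`; the file names are placeholders.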

Big thanks

doc2text would be nothing without the open-source contributions of:

doc2text's Issues

No module name PythonMagick

I have installed doc2text and the required packages, but when I try to import doc2text I get the error "No module named PythonMagick".
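
A quick way to narrow down this kind of report is to check which dependencies the current interpreter can actually see. The sketch below assumes Python 3 and uses import names taken from the tracebacks on this page (`cv2`, `PythonMagick`, `pytesseract`, `PyPDF2`); the helper `missing_modules` is a hypothetical name, not something doc2text provides.

```python
import importlib.util

# Import names assumed from the README's dependency list and the
# tracebacks quoted in the issues; adjust if your install differs.
ASSUMED_MODULES = ("cv2", "PythonMagick", "pytesseract", "PyPDF2")


def missing_modules(names=ASSUMED_MODULES):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Running `print(missing_modules())` with the same interpreter that fails on `import doc2text` shows whether the package was installed for a different Python version.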

2021-07-21

Does not work on python3

I installed with pip install doc2text, then tried to import doc2text in an IPython shell. This gave an error in __init__.py at line 77 because of a print statement with no parentheses.
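
The failure described here is a syntax difference rather than a runtime bug: Python 3 rejects the Python 2 print statement at compile time. A small sketch, run under Python 3, that demonstrates the difference (the helper name `compiles` is made up for illustration):

```python
def compiles(source):
    """Return True if the running interpreter accepts `source` as valid syntax."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


py2_style = "print text"   # Python 2 print statement
py3_style = "print(text)"  # function-call form, accepted by both versions
```

On Python 3, `compiles(py2_style)` is False and `compiles(py3_style)` is True. Code that must run on both versions commonly adds `from __future__ import print_function` at the top of the module and uses the function-call form.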

Maybe a stupid question about the api, can't find in source code

Thanks for the nice code.

Just a question about the code. I see that the example uses:

doc = doc2text.Document()

# You can pass the lang (as 3 letters code) to the class to improve accuracy
# On ubuntu it requires the package tesseract-ocr-$lang$
# On other OS, see https://github.com/tesseract-ocr/langdata
doc = doc2text.Document(lang="eng")

# Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.
# If reading a PDF, doc2text will split the PDF into its component pages.
doc.read('./path/to/my/file')

# Crop the pages down to estimated text regions, deskew, and optimize for OCR.
doc.process()

# Extract text from the pages.
doc.extract_text()
text = doc.get_text()

But when I try to find API methods like .process and .read,
I can't find them in the source. Any suggestions?
Thanks

AttributeError: Document instance has no attribute 'file_basename'

>>> import doc2text
>>> doc = doc2text.Document()
>>> doc.read('test.png')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jwilk/.local/lib/python2.7/site-packages/doc2text/__init__.py", line 67, in read
    self.file_basepath, self.file_basename + '_temp.png'
AttributeError: Document instance has no attribute 'file_basename'

Tested with git master (41dca91).

AttributeError: 'Page' object has no attribute 'image' ISSUE

Hi there,
I am testing your product, but I am getting this error:

Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 25
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 211
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 80
dst is not a numpy array, neither a scalar
Traceback (most recent call last):
  File "example_doc2text.py", line 19, in <module>
    doc.extract_text()
  File "/usr/local/lib/python2.7/dist-packages/doc2text/__init__.py", line 96, in extract_text
    text = new.extract_text()
  File "/usr/local/lib/python2.7/dist-packages/doc2text/page.py", line 46, in extract_text
    cv2.imwrite(temp_path, self.image)
AttributeError: 'Page' object has no attribute 'image'

My test file is as follows:

> import doc2text
> 
> # Initialize the class.
> doc = doc2text.Document()
> 
> # You can pass the lang (as 3 letters code) to the class to improve accuracy
> # On ubuntu it requires the package tesseract-ocr-$lang$
> # On other OS, see https://github.com/tesseract-ocr/langdata
> doc = doc2text.Document(lang="eng")
> 
> # Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.
> # If reading a PDF, doc2text will split the PDF into its component pages.
> doc.read('myfile.tiff')
> 
> # Crop the pages down to estimated text regions, deskew, and optimize for OCR.
> doc.process()
> 
> # Extract text from the pages.
> doc.extract_text()
> text = doc.get_text()
> print text

Could you please help me?
Thanks a lot

Image not cropped accurately

When I call the process_image() method, the image is not cropped accurately (see the attached images). Which of the calls in the method do I need to modify, and how, to get accurately cropped output?

[Attached images: test1_edit_v2, test2_edit_v2, test1, test2]

FileNotFoundError

When I call doc.extract_text(), it fails with a "file not found" error.
I'm using Windows 10, Python 3.6, and Jupyter.
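
The report does not say which file was missing. One possible cause, offered here only as an assumption, is that the tesseract executable is not on the PATH, since the OCR step shells out to it through pytesseract. A minimal check using only the standard library (`find_executable` is a hypothetical helper name):

```python
import shutil


def find_executable(name="tesseract"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)
```

If `find_executable()` returns None in the same environment that raises the error, installing tesseract or adding its folder to PATH is worth trying before looking elsewhere.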

Error on pip install PythonMagick

PythonMagick is a required package for doc2text. I installed it through pip.

(doc2txt) ➜  Programs pip install PythonMagick
Collecting PythonMagick
  Could not find a version that satisfies the requirement PythonMagick (from versions: )
No matching distribution found for PythonMagick

Does anyone know what's wrong with it? Thanks.

Error on doc.process()

When doing:

import doc2text
doc = doc2text.Document()
doc.read('something.pdf')
doc.process()

I get:

Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 23
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 197
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 77
dst is not a numpy array, neither a scalar

And then, when I do:

doc.extract_text()

I get:

AttributeError                            Traceback (most recent call last)
<ipython-input-5-57184997370d> in <module>()
----> 1 doc.extract_text()

/usr/local/lib/python2.7/dist-packages/doc2text/__init__.pyc in extract_text(self)
     89             for page in self.processed_pages:
     90                 new = page
---> 91                 text = new.extract_text()
     92                 self.page_content.append(text)
     93         else:

/usr/local/lib/python2.7/dist-packages/doc2text/page.pyc in extract_text(self)
     36     def extract_text(self):
     37         temp_path = 'text_temp.png'
---> 38         cv2.imwrite(temp_path, self.image)
     39         self.text = pytesseract.image_to_string(Image.open(temp_path))
     40         os.remove(temp_path)

AttributeError: Page instance has no attribute 'image'
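
The "dst is not a numpy array" messages appear during doc.process(), before the AttributeError, which suggests (as an assumption, not something confirmed in the report) that processing failed and the page objects never received an image attribute. A small diagnostic sketch; `pages_without_image` is a hypothetical helper, and `processed_pages` is the attribute name visible in the traceback above:

```python
def pages_without_image(pages):
    """Return the indices of page objects that have no 'image' attribute."""
    return [i for i, page in enumerate(pages) if not hasattr(page, "image")]
```

Calling `pages_without_image(doc.processed_pages)` after `doc.process()` and before `doc.extract_text()` would show which pages were affected, without changing the library itself.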

Python 3.5 compatibility

It seems the library is not 100% Python 3 compatible. When I try to run this simple code:

import doc2text

doc = doc2text.Document()
doc = doc2text.Document(lang="eng")
doc.read('pdf-sample.pdf')

I'm getting

Traceback (most recent call last):
  File "doc2text_test.py", line 13, in <module>
    doc.read('pdf-sample.pdf')
  File "/usr/local/lib/python3.5/dist-packages/doc2text/__init__.py", line 44, in read
    for i in xrange(self.num_pages):
NameError: name 'xrange' is not defined
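
The NameError arises because Python 3 removed xrange; its range is already lazy. A common compatibility shim, shown here as a sketch rather than as the project's own fix (`page_indices` is an illustrative helper, not a doc2text function):

```python
try:
    xrange  # Python 2: the name exists
except NameError:
    xrange = range  # Python 3: range already behaves like xrange


def page_indices(num_pages):
    """Return the page indices 0..num_pages-1 on either Python version."""
    return list(xrange(num_pages))
```

With the shim in place, a loop such as `for i in xrange(self.num_pages):` runs unchanged on both versions.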

ModuleNotFoundError: No module named 'PyPDF2'

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    import doc2text
  File "/Users/Stan/Downloads/doc2text-master/doc2text/__init__.py", line 6, in <module>
    import PyPDF2 as pyPdf
ModuleNotFoundError: No module named 'PyPDF2'

Support for non scanned documents (.doc, .docx, regular pdf)

Hi @jlsutherland and thanks for this cool module, OCR is a hard problem and you provide a pretty efficient and simple solution.

Would you be interested in a PR that adds text extraction for non-scanned documents? I think it fits the module name "doc2text" quite well, but maybe you want to stick with just OCR. Let me know.

Error passing the lang to the class

When I try to pass the language as in the example:

doc = doc2text.Document(lang="por")

I received the following error message:

    doc = doc2text.Document(lang="por")
TypeError: __init__() got an unexpected keyword argument 'lang'
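
The error suggests that the installed version's constructor does not take `lang`, although the README example passes it. One way to check before calling, assuming Python 3, is to inspect the constructor signature. `accepts_keyword` is a hypothetical helper written for this sketch:

```python
import inspect


def accepts_keyword(cls, keyword):
    """Return True if calling cls(...) would accept `keyword` as a keyword argument."""
    params = inspect.signature(cls).parameters
    if keyword in params:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())
```

For example, `accepts_keyword(doc2text.Document, 'lang')` returning False would indicate that the installed release predates the `lang` argument described in the README.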

Can not install pythonmagick.

I tried sudo apt-get install python-pythonmagick and pip install so that doc2text would import correctly. But even though python-pythonmagick installed successfully via apt, I still cannot import doc2text.

I checked the package that apt installed for python-pythonmagick, and it seems to support only Python 2.

Could you help fix this problem? I want to use doc2text on Python 3 (Ubuntu).

Unable to process

Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 25
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 211
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 80
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 25
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 211
dst is not a numpy array, neither a scalar
Error in /usr/local/lib/python2.7/dist-packages/doc2text/page.py on line 80
dst is not a numpy array, neither a scalar

issue with extract_text

When doing:

import doc2text
doc = doc2text.Document()
doc.read('something.pdf')
doc.process()
doc.extract_text()

I get the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-5-57184997370d> in <module>()
----> 1 doc.extract_text()

/usr/local/lib/python2.7/dist-packages/doc2text/__init__.pyc in extract_text(self)
     89             for page in self.processed_pages:
     90                 new = page
---> 91                 text = new.extract_text()
     92                 self.page_content.append(text)
     93         else:

/usr/local/lib/python2.7/dist-packages/doc2text/page.pyc in extract_text(self)
     36     def extract_text(self):
     37         temp_path = 'text_temp.png'
---> 38         cv2.imwrite(temp_path, self.image)
     39         self.text = pytesseract.image_to_string(Image.open(temp_path))
     40         os.remove(temp_path)

AttributeError: Page instance has no attribute 'image'

Does it support stream data?

I have a Flask app that receives a file through the API, and I want to extract its text without saving it to disk. Is there any way to do that? When I pass the stream contents directly, I get the error below.

code :
file = request.files['file']
file_data = file.stream.read()

error:

\venv\lib\site-packages\docx2txt\docx2txt.py", line 76, in process
    zipf = zipfile.ZipFile(docx)
  File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 1225, in __init__
    self._RealGetContents()
  File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 1288, in _RealGetContents
    endrec = _EndRecData(fp)
  File "C:\ProgramData\Anaconda3\lib\zipfile.py", line 259, in _EndRecData
    fpin.seek(0, 2)
AttributeError: 'bytes' object has no attribute 'seek'
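
Note that this traceback comes from docx2txt, a different package, not from doc2text. The immediate cause is that zipfile.ZipFile received a raw bytes object, which has no seek method. Wrapping the bytes in io.BytesIO gives a seekable in-memory file. A minimal sketch using only the standard library (`open_zip_from_bytes` is a hypothetical helper):

```python
import io
import zipfile


def open_zip_from_bytes(data):
    """Open a ZIP archive held in memory as bytes, without touching the disk."""
    # io.BytesIO provides the seek/read interface that ZipFile needs.
    return zipfile.ZipFile(io.BytesIO(data))
```

In the Flask snippet above, that would mean passing `io.BytesIO(file_data)` instead of `file_data`. Whether doc2text itself can read from a stream is not established here; its README only shows doc.read() taking a file path.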
