How To Use • How To Run Locally • Build process • Feedback
You may be interested in this bot if you need to recognize text in an image. It's free and quick.
Supported languages:
- ✔️ English
- ✔️ Russian
The 🗝️ key technology is Tesseract OCR, an open-source engine whose development was sponsored by Google, used here through its Python wrapper pytesseract.
🤖 Bot link: https://t.me/boramorka_text_extraction_bot
- Send a photo of the text. Type /lang to choose a language. ✔️
- Make sure that your document has a white background and readable black letters, and that the picture is not rotated. ✔️
- If you choose EN+RU mode, the bot recognizes both languages at the same time, but more artifacts may appear. If your document is in one language, please select that language. ✔️
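The language modes above map naturally onto Tesseract language codes. Here is a minimal sketch, assuming the bot stores the chosen mode as one of the strings "EN", "RU" or "EN+RU". The mode names and the helper `tesseract_lang` are hypothetical and not taken from the bot's source; `eng` and `rus` are the standard Tesseract language codes, and `+` joins several languages.

```python
# Hypothetical mapping from the bot's language modes to Tesseract language codes.
LANG_CODES = {
    "EN": "eng",
    "RU": "rus",
    "EN+RU": "eng+rus",  # Tesseract joins multiple languages with "+"
}


def tesseract_lang(mode: str) -> str:
    """Return the Tesseract `lang` string for a mode; fall back to English."""
    return LANG_CODES.get(mode.strip().upper(), "eng")
```

The returned string can be passed as the `lang` argument of `pytesseract.image_to_string`.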
# Clone this repository
$ git clone https://github.com/boramorka/text-extraction-app.git
# Go into the repository
$ cd text-extraction-app
# Install dependencies
$ pip install -r requirements.txt
# Run app
$ python bot.py
First, we create an app.py file for the main app. It contains:
# Path to pytesseract
pytesseract.pytesseract.tesseract_cmd
# Code for text recognition
def get_text(): ...............
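The body of `get_text` is elided above, so the following is only a sketch of what such a function might look like, not the author's implementation. It assumes the function receives an image (a file path or a PIL image, both of which `pytesseract.image_to_string` accepts) and a Tesseract language string. The `engine` parameter is a hypothetical hook added here so the function can be exercised without Tesseract installed.

```python
def get_text(image, lang="eng", engine=None):
    """Recognize text in `image` and return it without surrounding whitespace.

    `engine` is a hypothetical hook: any callable (image, lang=...) -> str.
    When omitted, pytesseract.image_to_string is used.
    """
    if engine is None:
        # Imported lazily so this module loads even where OCR is not installed.
        import pytesseract

        engine = pytesseract.image_to_string
    return engine(image, lang=lang).strip()
```

With Tesseract and the Russian language data installed, a call such as `get_text("scan.png", lang="eng+rus")` would run OCR on a local file; the file name is illustrative.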
The bot.py script starts the bot. It uses AIOGram, a simple and fully asynchronous framework for the Telegram Bot API, written in Python 3.7 with asyncio and aiohttp. It helps you make your bots faster and simpler.
import os

from aiogram import Bot, Dispatcher

# Bot class takes an API key to connect to the Telegram servers.
# Note: the API key is read from an environment variable.
bot = Bot(token=os.getenv("TEXT_EXTRACTOR_API_KEY"))

"""
Dispatcher will process incoming updates:
• messages
• edited messages
• channel posts
• edited channel posts
• inline queries
• chosen inline results
• callback queries
• shipping queries
• pre-checkout queries.
"""
dp = Dispatcher(bot)

# Decorator that takes a message and processes it.
@dp.message_handler(text=message)
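To show how the pieces above could fit together, here is a hedged sketch of a photo handler, assuming aiogram 2.x. The names `handle_photo`, `main` and `recognize` are hypothetical; `recognize` stands for any function that turns image bytes into text. The actual handlers in bot.py may be organized differently.

```python
import asyncio
import io
import os


async def handle_photo(message, recognize):
    """Download the largest version of the photo, run OCR, reply with the text."""
    buffer = io.BytesIO()
    # Telegram sends several sizes of each photo; the last one is the largest.
    await message.photo[-1].download(destination_file=buffer)
    # OCR is blocking, so run it in a worker thread to keep the event loop free.
    loop = asyncio.get_running_loop()
    text = await loop.run_in_executor(None, recognize, buffer.getvalue())
    await message.answer(text or "No text recognized.")


def main(recognize):
    # aiogram 2.x imports, kept local so handle_photo can be tested without aiogram.
    from aiogram import Bot, Dispatcher, executor, types

    bot = Bot(token=os.getenv("TEXT_EXTRACTOR_API_KEY"))
    dp = Dispatcher(bot)

    async def on_photo(message: types.Message):
        await handle_photo(message, recognize)

    # Route only photo messages to the OCR handler.
    dp.register_message_handler(on_photo, content_types=types.ContentType.PHOTO)
    executor.start_polling(dp, skip_updates=True)
```

Keeping the OCR call behind `run_in_executor` is a design choice of this sketch: Tesseract can take a noticeable amount of time per image, and running it directly inside the coroutine would stall every other chat while it works.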
Heroku deployment
Important files:
- 📄 bot.py : the bot application (refer to my GitHub for the source code)
- 📄 Aptfile : the third-party dependencies for Heroku to install (e.g. tesseract-ocr)
- 📄 Procfile : a list of process types in an app (on Heroku)
- 📄 requirements.txt : a list of dependencies to install
- 📄 runtime.txt : version of Python to run on Heroku (optional)
# HEROKU DEPLOYMENT PROCESS
# Note: add this line to bot.py
# (refer to my GitHub for the source code)
pytesseract.pytesseract.tesseract_cmd = "/app/.apt/usr/bin/tesseract"

# Log in to Heroku and create a new app:
$ heroku login
$ git init
$ heroku create boramorka-text-extraction-app
$ heroku git:remote -a boramorka-text-extraction-app

# Add buildpacks:
$ heroku buildpacks:add --index 1 https://github.com/heroku/heroku-buildpack-apt
$ heroku buildpacks:add --index 2 heroku/python

# Add config vars:
$ heroku config:set TESSDATA_PREFIX=/app/.apt/usr/share/tesseract-ocr/4.00/tessdata

# The heroku-20 stack has poor compatibility with tesseract.
# You may need to change the Heroku stack from 20 to 18 using this command:
$ heroku stack:set heroku-18

# Deploy the app on Heroku:
$ git add .
$ git commit -m "Initial commit to Heroku"
$ heroku git:remote -a boramorka-text-extraction-app
$ git push heroku master

# Check worker status:
$ heroku ps

# Run the worker:
$ heroku ps:scale worker=1
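Hard-coding the Heroku path in bot.py means the same file will not find Tesseract on a local machine. One possible way around that is sketched below; it is an assumption of this example, not something the repository is known to do. The helper `resolve_tesseract_cmd` is hypothetical, and it relies on the `DYNO` environment variable that Heroku sets on its dynos.

```python
import os
import shutil

# Path from the deployment notes above (installed by the apt buildpack).
HEROKU_TESSERACT = "/app/.apt/usr/bin/tesseract"


def resolve_tesseract_cmd(env=None):
    """Pick the tesseract binary: the apt buildpack path on Heroku, else a PATH lookup."""
    env = os.environ if env is None else env
    # Treating DYNO as the "running on Heroku" signal is an assumption of this sketch.
    if "DYNO" in env:
        return HEROKU_TESSERACT
    # Locally, use whatever tesseract is on PATH; fall back to the bare command name.
    return shutil.which("tesseract") or "tesseract"
```

In bot.py this could be used as `pytesseract.pytesseract.tesseract_cmd = resolve_tesseract_cmd()`.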
🤵 Feel free to send me feedback on Telegram. Feature requests are always welcome.