bweigel / aws-lambda-tesseract-layer

A layer for AWS Lambda containing the tesseract C libraries and tesseract executable.

License: Apache License 2.0

Dockerfile 0.29% Shell 1.37% Python 6.66% TypeScript 90.07% JavaScript 1.61%

lambda aws-lambda lambda-layer serverless serverless-framework tesseract amazon-linux

Tesseract OCR Lambda Layer

AWS Lambda layer containing the tesseract OCR libraries and command-line binary for Lambda Runtimes running on Amazon Linux 1 and 2.

⚠️ The Amazon Linux AMI (Version 1) is being deprecated. Users are advised not to use Lambda runtimes based on this version (e.g. Python 3.6). Refer also to the AWS Lambda runtime deprecation policy.

Quickstart
Ready-to-use binaries
- Use with Serverless Framework
- Use with AWS CDK
Build tesseract layer from source using Docker
Known Issues
- Avoiding Pillow library issues
- Unable to import module 'handler': cannot import name '_imaging'
Contributors ❤️

Quickstart

This repo comes with ready-to-use binaries compiled against the AWS Lambda Runtimes (based on Amazon Linux 1 and 2). Example Projects in Python 3.6 (& 3.8) using Serverless Framework and CDK are provided:

## Demo using Serverless Framework and prebuilt layer
cd example/serverless
npm ci
npx sls deploy

## or ..

## Demo using CDK and prebuilt layer
cd example/cdk
npm ci
npx cdk deploy

Ready-to-use binaries

For compiled, ready to use binaries that you can put in your layer see ready-to-use, or check out the latest release.

See examples for some ready-to-use examples.

Use with Serverless Framework

Reference the path to the ready-to-use layer contents in your serverless.yml:

service: tesseract-ocr-layer

provider:
  name: aws

# define layer
layers:
  tesseractAl2:
    # and path to contents
    path: ready-to-use/amazonlinux-2
    compatibleRuntimes:
      - python3.8

functions:
  tesseract-ocr:
    handler: ...
    runtime: python3.8
    # reference layer in function
    layers:
      - { Ref: TesseractAl2LambdaLayer }
    events:
      - http:
          path: ocr
          method: post

Deploy

npx sls deploy

Use with AWS CDK

Reference the path to the layer contents in your constructs:

// Imports assume AWS CDK v2 (aws-cdk-lib); adjust the module paths for CDK v1.
import * as path from 'path';
import { App, Duration, Stack } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';

const app = new App();
const stack = new Stack(app, 'tesseract-lambda-ci');

const al2Layer = new lambda.LayerVersion(stack, 'al2-layer', {
    // reference the directory containing the ready-to-use layer
    code: lambda.Code.fromAsset(path.resolve(__dirname, './ready-to-use/amazonlinux-2')),
    description: 'AL2 Tesseract Layer',
});
new lambda.Function(stack, 'python38', {
    // reference the source code to your function
    code: lambda.Code.fromAsset(path.resolve(__dirname, 'lambda-handlers')),
    runtime: lambda.Runtime.PYTHON_3_8,
    // add tesseract layer to function
    layers: [al2Layer],
    memorySize: 512,
    timeout: Duration.seconds(30),
    handler: 'handler.main',
});

Build tesseract layer from source using Docker

You can build layer contents manually with the provided Dockerfiles.

Build layer using your preferred Dockerfile:

## build
docker build -t tesseract-lambda-layer -f [Dockerfile.al1|Dockerfile.al2] .
## run container
export CONTAINER=$(docker run -d tesseract-lambda-layer false)
## copy tesseract files from container to local folder layer
docker cp $CONTAINER:/opt/build-dist layer
## remove Docker container
docker rm $CONTAINER
unset CONTAINER

Available Dockerfiles

Dockerfile	Base image	Compatible runtimes
`Dockerfile.al1` (⚠️ deprecated)	Amazon Linux 1	Python 2.7/3.6/3.7, Ruby 2.5, Java 8 (OpenJDK), Go 1.x, .NET Core 2.1
`Dockerfile.al2`	Amazon Linux 2	Python 3.8, Ruby 2.7, Java 8/11 (Corretto), .NET Core 3.1
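
The table above can also be expressed as a small lookup, which may be handy in a build script. This is only a sketch: the runtime strings are assumed to be the AWS Lambda runtime identifiers corresponding to the table rows, and `dockerfile_for` is a hypothetical helper name that is not part of this repository.

```python
# Runtime identifiers are an assumption: they are meant to mirror the
# table rows using AWS Lambda's runtime naming.
AL1_RUNTIMES = {
    "python2.7", "python3.6", "python3.7",
    "ruby2.5", "java8", "go1.x", "dotnetcore2.1",
}
AL2_RUNTIMES = {
    "python3.8", "ruby2.7", "java8.al2", "java11", "dotnetcore3.1",
}


def dockerfile_for(runtime: str) -> str:
    """Return the Dockerfile whose base image matches the given runtime."""
    if runtime in AL2_RUNTIMES:
        return "Dockerfile.al2"
    if runtime in AL1_RUNTIMES:
        return "Dockerfile.al1"  # deprecated base image
    raise ValueError(f"runtime not covered by the table: {runtime}")
```

Runtimes that are not listed in the table raise an error instead of falling back to a default, since a mismatched base image is one source of the issues described under Known Issues.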

Building a different tesseract version and/or language

By default, the build produces the tesseract 4.1.3 (amazonlinux-1) or 5.2.0 (amazonlinux-2) OCR libraries and includes the fast German, English and osd (orientation and script detection) data files.

The build process can be modified using different build time arguments (defined as ARG in Dockerfile.al[1|2]), using the --build-arg option of docker build.

Build-Argument	description	available versions
`TESSERACT_VERSION`	the tesseract OCR engine	https://github.com/tesseract-ocr/tesseract/releases
`LEPTONICA_VERSION`	fundamental image processing and analysis library	https://github.com/danbloomberg/leptonica/releases
`OCR_LANG`	Language to install (in addition to `eng` and `osd`)	https://github.com/tesseract-ocr/tessdata (`<lang>.traineddata`)
`TESSERACT_DATA_SUFFIX`	Trained LSTM models for tesseract. Can be empty (default), `_best` (best inference) and `_fast` (fast inference).	https://github.com/tesseract-ocr/tessdata, https://github.com/tesseract-ocr/tessdata_best, https://github.com/tesseract-ocr/tessdata_fast
`TESSERACT_DATA_VERSION`	Version of the trained LSTM models for tesseract. (currently - in July 2022 - only `4.1.0` is available)	https://github.com/tesseract-ocr/tessdata/releases/tag/4.1.0

Example of custom build

## Build a Docker image based on Amazon Linux 2, with French language support
docker build --build-arg OCR_LANG=fra -t tesseract-lambda-layer-french -f Dockerfile.al2 .
## Build a Docker image based on Amazon Linux 2, with Tesseract 4.0.0 and French language support
docker build --build-arg TESSERACT_VERSION=4.0.0 --build-arg OCR_LANG=fra -t tesseract-lambda-layer -f Dockerfile.al2 .
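
If several variants need to be built, the commands above could be assembled programmatically. The following is a minimal sketch, not part of this repository: `docker_build_command` is a hypothetical helper, and it only uses the `--build-arg`, `-t` and `-f` options already shown above.

```python
import shlex


def docker_build_command(dockerfile, tag, **build_args):
    """Assemble a `docker build` command line as a list of arguments."""
    cmd = ["docker", "build"]
    # Sort the arguments so the generated command is reproducible.
    for name, value in sorted(build_args.items()):
        cmd += ["--build-arg", f"{name}={value}"]
    cmd += ["-t", tag, "-f", dockerfile, "."]
    return cmd


if __name__ == "__main__":
    # Same build as the second example above, printed as a shell command.
    print(shlex.join(docker_build_command(
        "Dockerfile.al2",
        "tesseract-lambda-layer",
        TESSERACT_VERSION="4.0.0",
        OCR_LANG="fra",
    )))
```

The list form can be passed directly to `subprocess.run` without going through a shell.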

Deployment size optimization

The library files in the layer are stripped before deployment to reduce their size for the Lambda environment. See the Dockerfiles:

RUN ... \
  find ${DIST}/lib -name '*.so*' | xargs strip -s

Stripping can cause issues when the build runtime and the Lambda runtime differ (e.g. when building on Amazon Linux 1 and running on Amazon Linux 2).

Building the layer binaries directly using CDK

You can build the layer directly and get the artifacts (like in ready-to-use). This is done using AWS CDK with the bundling option.

Refer to continous-integration and the corresponding GitHub workflow for an example.

Layer contents

The layer contents are deployed to /opt when the layer is used by a function. See here for details. See ready-to-use for the layer contents for Amazon Linux 1 and Amazon Linux 2 (TODO).
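
Because the layer is extracted to /opt, a function can call the tesseract executable directly, without any Python wrapper package. The sketch below assumes the executable ends up at bin/tesseract inside the layer (so /opt/bin/tesseract at runtime) and that the required language data is found by tesseract. `tesseract_command` and `ocr` are hypothetical helper names.

```python
import subprocess
from pathlib import Path

# Lambda extracts layers to /opt; the bin/tesseract location is an assumption.
LAYER_ROOT = Path("/opt")


def tesseract_command(image_path, lang="eng", layer_root=LAYER_ROOT):
    """Build the argument list for a tesseract run that prints text to stdout."""
    binary = Path(layer_root) / "bin" / "tesseract"
    # `stdout` as the output base makes tesseract write to standard output.
    return [str(binary), str(image_path), "stdout", "-l", lang]


def ocr(image_path, lang="eng"):
    """Run tesseract on an image file and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

Wrapper packages such as pytesseract are not part of the layer contents described here, so they would have to be bundled with the function code or provided by a separate layer.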

Known Issues

Avoiding Pillow library issues

Use the AWS Cloud9 IDE with Amazon Linux to deploy the example. Alternatively, follow the instructions for getting the correct binaries for Lambda using EC2. AWS Lambda runs on Amazon Linux, so the Python binaries must be built for it. This step is not needed for deploying the layer itself: the layer and the example function are deployed separately.

Unable to import module 'handler': cannot import name '_imaging'

You might run into an issue like this:

/var/task/PIL/_imaging.cpython-36m-x86_64-linux-gnu.so: ELF load command address/offset not properly aligned
Unable to import module 'handler': cannot import name '_imaging'

The root cause is faulty stripping of the libraries with strip in the Dockerfile.

Quickfix

You can disable stripping by commenting out the line in the Dockerfile, so the libraries (*.so) are left unstripped. This also means the library files will be larger, and your artifact might exceed Lambda limits.
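
Before deploying an unstripped build, it can help to check how large the extracted layer folder is. This is a rough sketch under stated assumptions: the 250 MB figure is Lambda's documented quota for the unzipped deployment package including all layers, it is interpreted here as 250 * 1024 * 1024 bytes, and the function code counts towards the same quota. `directory_size` and `fits_lambda_limit` are hypothetical helpers.

```python
import os

# Assumption: documented 250 MB unzipped quota, read as 250 MiB.
LAMBDA_UNZIPPED_LIMIT_BYTES = 250 * 1024 * 1024


def directory_size(root):
    """Return the total size in bytes of all regular files below root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # skip symlinks such as libfoo.so -> libfoo.so.1
                total += os.path.getsize(path)
    return total


def fits_lambda_limit(root, limit=LAMBDA_UNZIPPED_LIMIT_BYTES):
    """Check whether the layer folder alone stays within the given limit."""
    return directory_size(root) <= limit
```

For example, `fits_lambda_limit("layer")` could be run on the folder copied out of the build container.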

A lengthier fix

AWS Lambda Runtimes work on top of Amazon Linux. Depending on the Runtime, AWS Lambda uses Amazon Linux Version 1 or Version 2 under the hood. For example, the Python 3.8 Runtime uses Amazon Linux 2, whereas Python <= 3.7 uses version 1.

The current Dockerfile runs on top of Amazon Linux Version 1, so artifacts for runtimes running version 2 will throw the above error. You can try to use a base Docker image for Amazon Linux 2 in these cases:

FROM lambci/lambda-base-2:build
...

or, as @secretshardul suggested:

Simple solution: Use AWS Cloud9 to deploy the example folder. The layer can be deployed from anywhere.
Complex solution: Deploy an EC2 instance with Amazon Linux and get the correct binaries.

Contributors ❤️

@secretshardul
@TheLucasMoore for providing a Dockerfile that builds working binaries for Python 3.8 / Amazon Linux 2

aws-lambda-tesseract-layer's Issues

Cannot import ready-to-use-Lambda layer

I'm unable to use the ready-to-use Lambda Layer.

Steps I followed:

Step 1:
Download repo

Step 2:
cd ready-to-use

Step 3:
Zip the "amazonlinux-2" folder and attach it to my Python3.8 Lambda function

Step 4:
Inside my function, I'm running the below code which simply gets a file from S3 and attempts to use pytesseract.image_to_string:

import json
import os
import boto3
import botocore
import pytesseract
from PIL import Image

def lambda_handler(event, context):

    s3 = boto3.resource('s3')
    #downloads file from S3 to /tmp directory
    BUCKET_NAME = 'xxxxxxx' # replace with your bucket name
    KEY = 'test.jpg'
    try:
        s3.Bucket(BUCKET_NAME).download_file(KEY, '/tmp/my_local_image.jpg')
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == "404":
            print("The object does not exist.")
        else:
            raise
    
    image = '/tmp/my_local_image.jpg'

    text = pytesseract.image_to_string(Image.open(image))
    print(text)
    
    return "Hellow world"

Result:
I run into error -> Unable to import module 'lambda_function': No module named 'pytesseract'.

Kindly suggest how to make use of this layer, or could you provide the .zip file to upload as a layer directly.

Unable to import module 'handler': cannot import name '_imaging'

I ran example.
Response is Internal server error.
The error is the title.

Question about Dockerfile in al2-serverless example

I got https://github.com/bweigel/aws-lambda-tesseract-layer/tree/master/example/al2-serverless example to work.

What's the use of the Dockerfile?: https://github.com/bweigel/aws-lambda-tesseract-layer/blob/master/example/al2-serverless/Dockerfile

Do I need it? How and why? What can I practically do with it? Or I shouldn't touch it?

I used to use Docker to handle my lambdas, but I was having lots of issues creating a Docker image with tesseract, and your layer is a blessing!

In principle I could simply change your example handler.py to my application needs (very similar actually to the example, except that I need to fetch images from an S3 bucket and save the output there as well), so I don't see why I would need this Dockerfile.

So, in the end I just want to know about that Dockerfile. I removed it and it apparently still worked.

Error opening data file

Hi Benjamin,

Thanks very much for detailing out how to set up tesseract as a lambda layer. It has been super helpful to me.

Although I recognize this directory is probably not actively maintained anymore, I thought I'd report a bug to help others avoid running into it as well.

The locations from which the DockerFile (https://github.com/bweigel/aws-lambda-tesseract-layer/blob/master/Dockerfile) downloads the tessdata files seem to be outdated. This causes a slightly misleading error (in pytesseract):

'Error opening data file Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.

The problem is not the location of the tessdata directory, but its contents which are downloaded from stale links. I managed to get things working by updating:
curl -L https://github.com/tesseract-ocr/tessdata${TESSERACT_DATA_SUFFIX}/raw/${TESSERACT_DATA_VERSION}/osd.traineddata > osd.traineddata && \

to:
curl -L https://github.com/tesseract-ocr/tessdata${TESSERACT_DATA_SUFFIX}/raw/master/osd.traineddata > osd.traineddata && \

Compiling for Python 3.9

How would I do this?

For context, I have a lambda function that runs on Python3.9 and needs to use Tesseract so I was going to build it following your instructions and then link the layer. Would I even need to build this in Python 3.9, or would Python3.8 be fine since it is a separate layer?

My bad, not really an issue, more of just a general question.

Update for Python 3.8 and lambda-base-2:build

I'm happy to put up a PR for this.

As suggested in the known issues in the README, updating the base image to FROM: lambci/lambda-base-2:build for Python 3.8 requires some changes.

I found this issue with Klayers for tesseract.

"errorMessage": "(127, 'tesseract: error while loading shared libraries: libjpeg.so.62: cannot open shared object file: No such file or directory')", "errorType": "TesseractError"

Working through the errors still, but I was able to get each lib working by copying them into lib/

WORKDIR /opt
RUN mkdir -p ${DIST}/lib && mkdir -p ${DIST}/bin && \
    cp ${TESSERACT}/bin/tesseract ${DIST}/bin/ && \
    cp ${TESSERACT}/lib/libtesseract.so.4  ${DIST}/lib/ && \
    cp ${LEPTONICA}/lib/liblept.so.5 ${DIST}/lib/liblept.so.5 && \
    cp /usr/lib64/libwebp.so.4 ${DIST}/lib/ && \
    cp /usr/lib64/libpng15.so.15 ${DIST}/lib/ && \
    cp /usr/lib64/libjpeg.so.62 ${DIST}/lib/ && \
    cp /usr/lib64/libtiff.so.5 ${DIST}/lib/ && \
    echo -e "LEPTONICA_VERSION=${LEPTONICA_VERSION}\nTESSERACT_VERSION=${TESSERACT_VERSION}\nTESSERACT_DATA_FILES=tessdata${TESSERACT_DATA_SUFFIX}/${TESSERACT_DATA_VERSION}" > ${DIST}/TESSERACT-README.md && \
    find ${DIST}/lib -name '*.so*' | xargs strip -s

Create binaries for ARM64 that'll work on Lambda ARM runtimes

It would be nice to have binaries compiled against ARM64 architecture to be able to use AWS Lambda ARM runtimes.

Pytesseract image_to_pdf_or_hocr function throws an error.

First of all,
Thank you for making our lives easier by developing this. It helped a lot.

Other functions that pytesseract offers, like image_to_string and image_to_data, work well without any hiccups.

But, when I try to use image_to_pdf_or_hocr like this:

pdf = pytesseract.image_to_pdf_or_hocr(f'/tmp/{file_name}/{page.number}.png', extension='pdf')

it does not work and throws error like:

Traceback (most recent call last):
  File "/var/task/helpers/ocr_helper.py", line 36, in save_searchable_pdf
    f'/tmp/{file_name}/{page.number}.png', extension='pdf')
  File "/var/task/pytesseract/pytesseract.py", line 432, in image_to_pdf_or_hocr
    return run_and_get_output(*args)
  File "/var/task/pytesseract/pytesseract.py", line 289, in run_and_get_output
    with open(filename, 'rb') as output_file:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tess_6_hu78b0.pdf'

It says that the file tess_6_hu78b0.pdf does not exist. What does this mean? I have no file with tess_6_hu78b0 name to begin with.
The path that I am passing to image_to_pdf_or_hocr function is 100% correct and an image is present there. I have confirmed and the same thing works on my local.

I found somewhere that I needed to install libtesseract-dev too. Hence, I modified my dockerfile as:

FROM lambci/lambda:build-python3.6
RUN sudo apt install tesseract-ocr
RUN sudo apt install libtesseract-dev

but unfortunately this too did not work.

Can someone please help me out on this? Thank you.
