
This project is forked from johan456789/slide-extractor.


slide-extractor

A script that extracts slides from lecture video and converts them into a searchable OCRed PDF.

This script recursively scans the current directory for lecture videos, extracts the distinct frames from each video (imagehash, cv2), combines the frames into image-only PDFs (img2pdf), OCRs the frames into text-only PDFs (tesseract, ghostscript), and merges the text-only and image-only PDFs into high-quality searchable lecture slides.
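The frame-selection step can be pictured with a small sketch. It is not the script's actual code: the script relies on imagehash and cv2, whereas this stand-in uses a simplified pure-Python average hash (no resizing, one bit per pixel) so that it runs without extra packages. The names average_hash, hamming, and distinct_frames and the threshold value are assumptions made for illustration.

```python
from typing import List, Optional, Sequence

# A frame is modelled as rows of grayscale pixel values (0-255).
Frame = Sequence[Sequence[int]]


def average_hash(frame: Frame) -> List[bool]:
    """One bit per pixel: True where the pixel is brighter than the frame mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return [p > mean for p in pixels]


def hamming(a: List[bool], b: List[bool]) -> int:
    """Count the positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def distinct_frames(frames: Sequence[Frame], threshold: int = 2) -> List[int]:
    """Return indices of frames that differ from the last kept frame.

    A frame is kept when its hash is more than `threshold` bits away from
    the hash of the previously kept frame (threshold is a hypothetical value).
    """
    kept: List[int] = []
    last_hash: Optional[List[bool]] = None
    for index, frame in enumerate(frames):
        current = average_hash(frame)
        if last_hash is None or hamming(current, last_hash) > threshold:
            kept.append(index)
            last_hash = current
    return kept
```

Comparing each frame only with the last kept frame means that a slide shown again later in the video is extracted again, which may or may not match what the script does.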

Usage:

Put slide-extractor.py in the video directory and run python slide-extractor.py. Each output PDF is stored in the same directory or subdirectory as the video it was made from.
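As a rough illustration of the recursive lookup and of keeping each PDF beside its video, here is a minimal sketch using pathlib. The extension list, the function names, and the choice to name the PDF after the video are all assumptions; the README only states that the output lands in the same (sub)directory.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed extensions; the README does not list the formats the script accepts.
VIDEO_EXTENSIONS = (".mp4", ".mkv", ".mov")


def find_videos(root: Path, extensions: Iterable[str] = VIDEO_EXTENSIONS) -> List[Path]:
    """Collect video files under root, including all subdirectories."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )


def output_pdf_path(video: Path) -> Path:
    """Place the PDF next to the video, reusing its file name (hypothetical naming)."""
    return video.with_suffix(".pdf")
```

Calling find_videos(Path(".")) from the video directory mirrors running the script there.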

Dependencies

brew install tesseract ghostscript
pip install tqdm pillow imagehash opencv-python PyPDF2 img2pdf

Tested environment: Python 3.7.2, macOS

Homebrew packages: tesseract, ghostscript

Python packages: tqdm, pillow, imagehash, opencv-python, PyPDF2, img2pdf
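Before a long run it can help to confirm that everything above is installed. The following is a small sketch of such a check, not part of the script; the function names are made up for illustration. It uses the executable names tesseract and gs, and the import names PIL and cv2, which differ from the pip package names pillow and opencv-python.

```python
import importlib.util
import shutil
from typing import Iterable, List

# Executables provided by the Homebrew packages (Ghostscript installs "gs").
REQUIRED_TOOLS = ("tesseract", "gs")
# Import names for the pip packages listed above.
REQUIRED_MODULES = ("tqdm", "PIL", "imagehash", "cv2", "PyPDF2", "img2pdf")


def missing_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the top-level modules that cannot be imported."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Missing tools:", missing_tools())
    print("Missing modules:", missing_modules())
```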

  • Other possible candidate libraries for this tiny project and why they are not used:

    • imagemagick: convert *.png out.pdf re-encodes the images. With Zip compression (-compress Zip) the output is lossless, but the file is larger. img2pdf does not re-encode by default, runs faster, and uses less memory, so img2pdf is used.

    • OCRmyPDF: ocrmypdf in.pdf out-ocr.pdf. In the timing below, the Tesseract and Ghostscript pipeline is faster and gives better image quality, because it uses the original images in the OCRed PDFs (downsides: high I/O, larger output files), so ocrmypdf is not used. If a smaller PDF is desired, compress the result further with other software.

      $ time (for i in frame*.png; do tesseract -c textonly_pdf=1 $i $i pdf; done; gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=combine-text.pdf -dBATCH frame*.pdf; python merge.py;)
      
      real	0m35.962s
      user	0m28.935s
      sys	0m1.890s
      
      $ time ocrmypdf in.pdf out-ocr.pdf
      
      real	0m39.866s
      user	1m11.777s
      sys	0m7.876s
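The Tesseract and Ghostscript commands timed above could be driven from Python roughly as follows. This is a sketch that mirrors the shell one-liner, not the script's own code; the function names are hypothetical, and the final merge step (merge.py) is left out. As in the shell loop, the image path is also used as Tesseract's output base, so frame001.png produces frame001.png.pdf.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def tesseract_cmd(image: Path) -> List[str]:
    """Build the command that OCRs one frame into a text-only PDF."""
    return ["tesseract", "-c", "textonly_pdf=1", str(image), str(image), "pdf"]


def ghostscript_cmd(pdfs: Sequence[Path], output: Path) -> List[str]:
    """Build the command that concatenates per-frame PDFs into one file."""
    return [
        "gs",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        f"-sOUTPUTFILE={output}",
        "-dBATCH",
        *[str(pdf) for pdf in pdfs],
    ]


def ocr_frames(frame_dir: Path, output: Path) -> None:
    """Run both steps for every frame*.png in frame_dir (needs both tools installed)."""
    frames = sorted(frame_dir.glob("frame*.png"))
    for frame in frames:
        subprocess.run(tesseract_cmd(frame), check=True)
    # Tesseract appends ".pdf" to the output base, which here is the image path.
    pdfs = [Path(f"{frame}.pdf") for frame in frames]
    subprocess.run(ghostscript_cmd(pdfs, output), check=True)
```

Passing argument lists to subprocess.run avoids shell quoting problems with file names that contain spaces.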

Sidenote

This program is intended for use on MOOC videos. For Coursera and edX, you can use coursera-dl and edx-dl to download videos in batch.
