
romaingehrig / pdfcollate


Merge PDFs if your document scanner can't do duplex scanning

License: MIT License

Dockerfile 5.38% Shell 0.93% Python 93.70%

pdfcollate's Introduction

What's that

If you want to scan documents on both sides but your automatic document feeder (ADF) only scans one side, this project may help you.

If you just want to merge two PDFs in the correct order once and have pdftk installed on your machine, the command is pdftk A=first.pdf B=second.pdf shuffle A Bend-1 output collated.pdf (adapt names to your situation). This project does it automagically for you.
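As a rough sketch, the one-off command above could be wrapped in Python like this. The function names `build_pdftk_command` and `collate_once` are hypothetical and not part of pdfcollate; only the pdftk arguments come from the command quoted above.

```python
import subprocess


def build_pdftk_command(first, second, output):
    """Return the argv list for the one-off pdftk collation command."""
    return [
        "pdftk",
        f"A={first}",          # front sides, in scan order
        f"B={second}",         # back sides, scanned from the last page
        "shuffle", "A", "Bend-1",  # interleave A with B read backwards
        "output", output,
    ]


def collate_once(first, second, output):
    # Needs the pdftk binary on PATH; raises CalledProcessError if pdftk fails.
    subprocess.run(build_pdftk_command(first, second, output), check=True)
```

Calling `collate_once("first.pdf", "second.pdf", "collated.pdf")` would run the same command as the shell line, assuming pdftk is installed.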

How does it work

The script watches for PDF files created in SOURCE_DIRECTORY. The first one is used as is. The second one is used with its pages in reverse order (because we flip the document and start scanning from the end). The resulting PDF file is created in DESTINATION_DIRECTORY.
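To make the page ordering concrete, here is a small illustration of how the two scans interleave. The helper `collated_order` is hypothetical and only models the ordering; it does not touch any PDF file.

```python
def collated_order(page_count):
    """Return (source, page) pairs in final reading order.

    "A" is the first scan (front sides), "B" is the second scan
    (back sides), whose pages are consumed from last to first.
    """
    order = []
    for i in range(1, page_count + 1):
        order.append(("A", i))
        order.append(("B", page_count - i + 1))
    return order


print(collated_order(3))
# [('A', 1), ('B', 3), ('A', 2), ('B', 2), ('A', 3), ('B', 1)]
```

For a three-sheet document, the back of sheet 1 is the last page of the second scan, which is why B is read backwards.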

How common problems are solved:

  • Merging a new PDF with an old one: There is a timeout COLLATE_TIMEOUT that runs from the moment the first PDF is done writing (event IN_CLOSE_WRITE). If a new PDF is created (event IN_CREATE) before the timeout ends, then this new PDF is understood to be the second one. Otherwise (timeout passed), the new PDF becomes the first one and the previous one is evicted with a timeout warning.
  • Merging incompatible PDFs: The number of pages should be equal for PDFs to be merged. If it's not the case, the second PDF replaces the first with a warning.
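The two rules above can be sketched as a single decision function. This is a simplified model, not the project's actual code: the `Pending` class, the `classify` function, and the returned labels are all assumptions, and the exact boundary behaviour when the elapsed time equals the timeout is a guess.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pending:
    """The first PDF, waiting for its counterpart."""
    path: str
    pages: int
    closed_at: float  # when the file finished writing (IN_CLOSE_WRITE)


def classify(pending: Optional[Pending], new_pages: int,
             created_at: float, timeout: float) -> str:
    """Decide what a newly created PDF means for the pending one."""
    if pending is None:
        return "first"      # nothing waiting: the new PDF becomes the first
    if created_at - pending.closed_at > timeout:
        return "timeout"    # too late: evict the old PDF, new one becomes first
    if new_pages != pending.pages:
        return "mismatch"   # page counts differ: new PDF replaces the first
    return "merge"          # same page count, within the timeout: collate
```

In the "timeout" and "mismatch" cases the caller would log a warning and keep the new PDF as the pending one.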

Limitations:

  • Depends on inotify, so it can be used only on Linux (Docker can solve that).
  • Cannot distinguish between PDFs coming from your scanner and those created some other way (e.g. copied or temporary files). Set SOURCE_DIRECTORY to a directory that only your scanner writes to, with no subdirectories. Also, don't set DESTINATION_DIRECTORY to the same directory.

Installation and configuration

The Docker image is available as cranium/pdfcollate.

Also available is a Python package you can download with pip install pdfcollate.

My usage of the project:

  • I have a NAS with two SAMBA directories: one for single-sided scans (/Scans), and the other for two-sided scans (/DuplexScans).
  • My NAS docker-compose uses the project's Dockerfile and sets two volumes /DuplexScans:/files and /Scans:/output.
  • My scanner has the two SAMBA directories as possible scan destinations. When I want to scan both sides: I put the document in the ADF, select scan to duplex directory (this scans one side), then retrieve the document from the tray, put it on its flip side, and select scan to duplex directory again. Once the scan is done, PDFCollate finds both documents and creates the collated document in the destination directory.

Environment variables used by the Python script:

  • SOURCE_DIRECTORY: Directory watched for new PDF files
  • DESTINATION_DIRECTORY: Where the collated PDF will be created
  • COLLATE_TIMEOUT: How much time before we consider two PDFs to be unrelated.
  • OUTPUT_NAME_SUFFIX: Added to the output PDF name between the document name and .pdf
  • DELETE_OLD_FILES: Remove files that have been correctly merged (True by default)
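A minimal sketch of how these variables might be read in Python. The `load_config` and `output_name` helpers are hypothetical. The default timeout of 600, its unit (seconds), the empty default suffix, and the accepted spellings of "true" are assumptions made for illustration; only the `True` default for DELETE_OLD_FILES comes from the list above.

```python
import os


def load_config(env=None):
    """Read pdfcollate-style settings from an environment mapping."""
    env = os.environ if env is None else env
    return {
        "source": env["SOURCE_DIRECTORY"],            # required
        "destination": env["DESTINATION_DIRECTORY"],  # required
        # Assumed default and unit (seconds); the README does not state them.
        "timeout": float(env.get("COLLATE_TIMEOUT", "600")),
        "suffix": env.get("OUTPUT_NAME_SUFFIX", ""),
        "delete_old": env.get("DELETE_OLD_FILES", "True").strip().lower()
        in ("1", "true", "yes"),
    }


def output_name(document_name, suffix):
    """Insert the suffix between the document name and its extension."""
    stem, ext = os.path.splitext(document_name)
    return f"{stem}{suffix}{ext}"
```

For example, `output_name("scan.pdf", "_collated")` gives `scan_collated.pdf`.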

Why

Necessity is the mother of innovation. And I needed to scan both sides without too much hassle.

TODOs (don't hesitate to make a PR!)

  • Upgrade alpine: Stuck at alpine:3.8 because it has the pdftk binary.
  • Document utilisation: Can be used as pure Python, as a Docker image, or in a docker-compose file
  • Add CLI arguments for configuration: For improved flexibility
  • Add tests: Making sure we do the right thing in every case.

Done

  • Make it a Python package: Would enable one-off use. Eg: python3 -m pdfcollate
  • Publish image to Docker registry: Easier installation and docker-compose integration
  • Improve file permissions: We should copy the input file permissions to the output files.
  • Remove old files: Once the merge is successful, we can remove the two old PDFs.
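The last two items could look roughly like the following. This is only a sketch using the standard library, not the project's implementation; the function name `finalize_output` and its parameters are hypothetical.

```python
import os
import shutil


def finalize_output(first, second, output, delete_old=True):
    """Copy permissions from the first input to the output, then clean up."""
    # shutil.copymode copies only the permission bits, not owner or timestamps.
    shutil.copymode(first, output)
    if delete_old:
        # Remove the two source PDFs once the merged file exists.
        os.remove(first)
        os.remove(second)
```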

pdfcollate's People

Contributors

romaingehrig


pdfcollate's Issues

Inotify alternative

It would be great to have a way to use pdfcollate without inotify. Today I tried to use pdfcollate on a machine with some NFS mounts, but NFS doesn't support inotify, so it doesn't work.

Maybe add a "switch" to disable inotify and check for new files at a fixed interval? If pdfcollate could check every x seconds for new files, it could be used with NFS or CIFS shares.
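One possible shape for such a polling fallback, as a sketch only: pdfcollate does not provide this, and the function name `poll_new_pdfs` is made up for illustration. Note that plain polling cannot tell whether a file has finished writing, so a real version would need an extra check (for instance, waiting until the file size stops changing).

```python
import os


def poll_new_pdfs(directory, seen):
    """Return (new_names, updated_seen) for PDFs that appeared since last poll.

    `seen` is the set of PDF file names observed on previous polls.
    """
    current = {
        name for name in os.listdir(directory)
        if name.lower().endswith(".pdf")
    }
    new_names = sorted(current - seen)
    return new_names, current
```

A caller would invoke this in a loop with `time.sleep(x)` between polls, passing the returned set back in each time.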

Remove old files

I know that this is on the todo list :)

Pdfcollate made one of my servers die for a few hours because of a full disk (full of source PDFs that had already been collated) ;)
