rinse

Web service that converts untrusted documents to image-based PDFs in a sandbox.

Requirements

  • podman is required.
  • gVisor is highly recommended, but optional.
  • At least 2 GB of free disk space on /tmp, as the conversion process can use a lot of space.
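
A deployment could check the disk-space requirement before accepting jobs. The following is a minimal sketch, not part of rinse itself; the function name `has_enough_space` and the reading of "2 GB" as 2 GiB are assumptions made for illustration.

```python
import shutil

# Assumption: the "2 GB" requirement is treated as 2 GiB here.
MIN_FREE_BYTES = 2 * 1024**3


def has_enough_space(path: str = "/tmp", min_free: int = MIN_FREE_BYTES) -> bool:
    """Return True if the filesystem holding `path` has at least `min_free` bytes free."""
    return shutil.disk_usage(path).free >= min_free
```

Calling `has_enough_space()` with no arguments checks /tmp against the 2 GiB threshold.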

Container security

The container runs with a read-only filesystem, no privileges, and no network access. If gVisor is installed and rinse runs as root (which gVisor requires), gVisor is used to further sandbox the container.
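
The restrictions above map onto standard podman options. The sketch below builds such a command line; it is an illustration under assumptions, not rinse's actual invocation. In particular, reading "no privileges" as `--cap-drop=all` plus `--security-opt=no-new-privileges`, detecting gVisor by looking for `runsc` on PATH, and the names `detect_gvisor` and `podman_run_args` are all hypothetical.

```python
import os
import shutil
from typing import List, Optional


def detect_gvisor() -> Optional[str]:
    """Return the path to gVisor's runsc binary if we are root and it is on PATH."""
    if os.geteuid() != 0:
        return None
    return shutil.which("runsc")


def podman_run_args(job_dir: str, image: str, command: List[str],
                    runsc_path: Optional[str] = None) -> List[str]:
    """Build a locked-down `podman run` argv (assumed flag set, see lead-in)."""
    args = ["podman"]
    if runsc_path:
        # --runtime is a global podman option, so it precedes the subcommand.
        args += ["--runtime", runsc_path]
    args += [
        "run", "--rm",
        "--read-only",                        # read-only root filesystem
        "--network=none",                     # no network
        "--cap-drop=all",                     # drop all capabilities
        "--security-opt=no-new-privileges",   # block privilege escalation
        "-v", f"{job_dir}:/var/rinse",        # job directory mount
        image,
    ]
    return args + list(command)
```

A caller would pass `runsc_path=detect_gvisor()` so that gVisor is used only when it is available and the process runs as root.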

Process

First, a temporary directory is created for the job. This will be mounted in the container as /var/rinse.

The original document is renamed to input, with its extension preserved.
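
As a rough illustration of these two steps, the sketch below creates a per-job temporary directory and moves the uploaded file into it as `input` plus the original extension. The function name `prepare_job` and the `rinse-` directory prefix are hypothetical.

```python
import os
import shutil
import tempfile


def prepare_job(original_path: str) -> str:
    """Create a job directory and move the document into it as input<ext>."""
    job_dir = tempfile.mkdtemp(prefix="rinse-")   # hypothetical prefix
    _, ext = os.path.splitext(original_path)      # e.g. ".docx"
    dest = os.path.join(job_dir, "input" + ext)
    shutil.move(original_path, dest)
    return dest
```

For example, a file named `Report Final.docx` ends up as `input.docx` inside the new job directory, so the original filename never reaches the container.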

Then, each of the following stages runs in its own container, which is destroyed as soon as the stage completes. rinse removes files as soon as possible after each stage. At the end, only the final document-rinsed.pdf file remains on disk until the job is deleted, which also deletes the final PDF.

If the language is to be auto-detected, Apache Tika is used to do so.
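
Language detectors such as Tika typically report two-letter ISO 639-1 codes, while tesseract's language data uses three-letter names such as `eng`. If the detected language feeds the OCR stage, some mapping between the two is needed. The sketch below shows one way to do that; whether and how rinse does this is not stated here, and the table contents, the `eng` fallback, and the name `tesseract_lang` are assumptions.

```python
# Assumption: a small, hand-picked subset of ISO 639-1 to tesseract names.
ISO_TO_TESSERACT = {
    "en": "eng",
    "de": "deu",
    "fr": "fra",
    "es": "spa",
    "sv": "swe",
}


def tesseract_lang(detected: str, default: str = "eng") -> str:
    """Map a detected two-letter language code to a tesseract language name."""
    code = detected.strip().lower().split("-")[0]   # "en-US" -> "en"
    return ISO_TO_TESSERACT.get(code, default)
```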

If the document is not a PDF, LibreOffice is used to try to convert it to one. If the conversion succeeds, the original document is deleted.
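
A minimal sketch of this stage follows, using LibreOffice's headless `--convert-to` mode. It runs `soffice` directly rather than inside a container to keep the example short; the function name `convert_to_pdf` and the injectable `run` parameter are hypothetical.

```python
import os
import subprocess


def convert_to_pdf(input_path: str, run=subprocess.run) -> str:
    """Convert a non-PDF document to PDF and delete the original on success."""
    if input_path.lower().endswith(".pdf"):
        return input_path                      # already a PDF, nothing to do
    out_dir = os.path.dirname(input_path) or "."
    pdf_path = os.path.splitext(input_path)[0] + ".pdf"
    run(["soffice", "--headless", "--convert-to", "pdf",
         "--outdir", out_dir, input_path], check=True)
    if not os.path.exists(pdf_path):
        raise RuntimeError("conversion produced no PDF")
    os.remove(input_path)                      # original is no longer needed
    return pdf_path
```

The `run` parameter exists only so that the command can be replaced in tests or routed through a container launcher.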

The input.pdf file is converted to a set of image files using pdftoppm.
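
The rasterization step can be sketched as below. The resolution, the PNG output format, and the names `pdftoppm_args` and `page_images` are assumptions; the options rinse actually passes are not stated here.

```python
import glob
import re
from typing import List


def pdftoppm_args(pdf_path: str, out_prefix: str, dpi: int = 300) -> List[str]:
    """Build a pdftoppm argv that writes <out_prefix>-<page>.png files."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def page_images(out_prefix: str) -> List[str]:
    """Return the generated page images sorted by page number."""
    def page_number(path: str) -> int:
        match = re.search(r"-(\d+)\.png$", path)
        return int(match.group(1)) if match else 0

    return sorted(glob.glob(glob.escape(out_prefix) + "-*.png"), key=page_number)
```

Sorting by the numeric suffix keeps pages in order regardless of how the page numbers are padded.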

The set of image files is OCR-ed and processed into a PDF using tesseract.
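
tesseract accepts a text file listing one image per line as its input and, with the `pdf` config, writes a single multi-page PDF. The sketch below uses that mode. The list-file approach and the names `write_image_list` and `tesseract_args` are assumptions for illustration, not necessarily how rinse invokes tesseract.

```python
from typing import List


def write_image_list(images: List[str], list_file: str) -> None:
    """Write one image path per line, in page order."""
    with open(list_file, "w", encoding="utf-8") as handle:
        for image in images:
            handle.write(image + "\n")


def tesseract_args(list_file: str, output_base: str, lang: str = "eng") -> List[str]:
    """Build a tesseract argv; the result is written to <output_base>.pdf."""
    return ["tesseract", list_file, output_base, "-l", lang, "pdf"]
```

With `output_base` set to `/var/rinse/document-rinsed`, the output lands in `document-rinsed.pdf`, matching the final file name described above.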
