This project forked from jsicot/plugin-extractocr

This Omeka plugin creates XML files from PDF files using pdftohtml. The XML is stored as a new file associated with the item.

Extract OCR (plugin for Omeka)

Summary

Omeka plugin that extracts OCR text from PDF files as XML, enabling full-text search within the BookReader plugin for Omeka.

See a demo of the plugin in the Bibliothèque numérique de l'université Rennes 2 (France).
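
As a rough sketch of the extraction step, the conversion can be reproduced by hand with pdftohtml. The helper name pdf_to_xml and the -i flag (skip images) are assumptions made for illustration; the exact options the plugin passes are not stated in this README.

```shell
# Convert one PDF to XML with pdftohtml (part of poppler-utils).
#   -xml  write XML output instead of HTML
#   -i    ignore images (assumption: only the text layer is needed)
# pdf_to_xml is a hypothetical helper, not part of the plugin.
pdf_to_xml() {
    pdf=$1
    out=${pdf%.pdf}    # output base name; pdftohtml adds the .xml suffix
    pdftohtml -xml -i "$pdf" "$out"
}

# Example with a hypothetical path:
# pdf_to_xml /tmp/sample.pdf    # expected to write /tmp/sample.xml
```

Running the helper on a PDF that has an OCR text layer is a quick way to check what kind of XML the server's pdftohtml produces before installing the plugin.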

Installation

  • This plugin needs the pdftohtml command-line tool on your server:
    sudo apt-get install poppler-utils
  • Upload the Extract OCR plugin folder into the plugins folder on your server;
  • Alternatively, install the plugin from GitHub:
    cd omeka/plugins  
    git clone git@github.com:symac/Plugin-ExtractOcr.git "ExtractOcr"
  • Activate it from the admin → Settings → Plugins page
  • Click the Configure link to choose whether existing PDF files should be processed.
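
Before activating the plugin, it may help to confirm that pdftohtml is actually reachable on the server. The following is a minimal sketch; the helper name have_tool is hypothetical.

```shell
# Report whether a command-line tool is available on PATH.
# have_tool is a hypothetical helper used only for this check.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool pdftohtml; then
    echo "pdftohtml found: $(command -v pdftohtml)"
else
    echo "pdftohtml not found; install poppler-utils first" >&2
fi
```

Note that the web server user may have a different PATH than your login shell, so a check run interactively is only an indication.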

Using the Extract OCR Plugin

  • Create an item
  • Add PDF file(s) to this item
  • Save Item
  • To locate the extracted OCR XML file, select the item to which the PDF is attached. You should see an XML file attached to the record with the same filename as the PDF file.
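
The naming rule in the last step can be sketched as follows. This assumes that "same filename" means the .pdf extension is replaced by .xml; the helper xml_name_for and the example path are hypothetical.

```shell
# Derive the XML filename expected for a given PDF.
# Assumption: the base name is kept and the extension becomes .xml.
# xml_name_for is a hypothetical helper, not part of the plugin.
xml_name_for() (
    base=$(basename "$1")
    printf '%s.xml\n' "${base%.*}"
)

xml_name_for "files/original/report-2014.pdf"   # prints report-2014.xml
```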

Optional plugins

  • BookReader: this plugin adds the Internet Archive BookReader to Omeka. If both plugins (BookReader and ExtractOcr) are installed, full-text search is available within the BookReader frame. To enable it, overwrite Bookreader/libraries/BookReaderCustom.php with Bookreader/libraries/BookReaderCustom_extractOCR.php.
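
The overwrite described above can be done with a plain copy. The sketch below also keeps a backup of the original file first, which is an addition for safety and not something the plugin requires; the helper name enable_bookreader_search is hypothetical, and the directory is assumed to be given relative to the Omeka plugins folder.

```shell
# Replace BookReaderCustom.php with the ExtractOcr-aware variant.
# A .bak copy of the original is kept first (a precaution, not a plugin requirement).
# enable_bookreader_search is a hypothetical helper name.
enable_bookreader_search() {
    lib_dir=$1    # path to Bookreader/libraries
    cp "$lib_dir/BookReaderCustom.php" "$lib_dir/BookReaderCustom.php.bak" &&
    cp "$lib_dir/BookReaderCustom_extractOCR.php" "$lib_dir/BookReaderCustom.php"
}

# Example, run from the Omeka plugins folder:
# enable_bookreader_search Bookreader/libraries
```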

Troubleshooting

See the online Extract OCR issues.

License

This plugin is published under the GNU/GPL license.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Contact

  • Sylvain Machefert, Université Bordeaux 3 (see symac)
