dlft's Introduction

DownLoad Full Text

A tool to get the full text of a multipage item in Harvard Digital Collections

Description

In version 1 of Harvard Digital Collections, users who want the full OCR text of a multi-page item must copy it from a popup in the Mirador viewer, which takes multiple clicks and becomes time-consuming for items longer than a few pages.

dlft is a proof of concept for single-click full-text download. The script takes the URL of a Harvard Digital Collections (HDC) item page as input. After locating the Digital Repository Service (DRS) identifier on the HDC page, the script calls the endpoint https://pds.lib.harvard.edu/pds/get/ to get the text of each page, then concatenates the page texts into a single TXT file, saved in /Results/.

Dependencies

The program was made with Python 3.7.3 and needs the following modules installed in the run environment:

  • requests
  • bs4
  • tqdm

To install Python modules, use pip from a bash console: pip install <name of module>. Note that the bs4 module is provided by the beautifulsoup4 package.

How to run the program

  1. Check to make sure you have the above external modules installed.
  2. Change the value of HDC_url at the top of dlft.py to the desired URL. Example:
HDC_url = 'https://digitalcollections.library.harvard.edu/catalog/990043816950203941'
  3. Run dlft.py in a bash console with python dlft.py.
  4. Wait. It will take a while, depending on the length of the book. The page delivery service seems to be able to return about 2-3 pages per second.
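As a rough guide to the wait, the observed rate of about 2-3 pages per second can be turned into a time estimate. The helper below is a hypothetical illustration (it is not part of dlft.py), and the rate is only the approximate figure quoted above, so actual times may differ.

```python
def estimate_download_seconds(page_count: int,
                              pages_per_second_low: float = 2.0,
                              pages_per_second_high: float = 3.0):
    """Return (fastest, slowest) estimated run time in seconds."""
    fastest = page_count / pages_per_second_high
    slowest = page_count / pages_per_second_low
    return fastest, slowest


fastest, slowest = estimate_download_seconds(300)
print(f"300 pages: roughly {fastest:.0f}-{slowest:.0f} seconds")
# prints "300 pages: roughly 100-150 seconds"
```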

Options

You can limit the download to the OCR text for a specific page range. Set manual_pagination to True and set manual_page_start and manual_page_end to the desired start and end pages:

manual_pagination = False
manual_page_start = 1
manual_page_end = 11
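One way these three settings could be turned into the list of pages to fetch is sketched below. The function name is hypothetical, and treating manual_page_end as inclusive and clamping the range to the pages that exist are assumptions; dlft.py may handle these cases differently.

```python
def resolve_page_range(total_pages: int,
                       manual_pagination: bool = False,
                       manual_page_start: int = 1,
                       manual_page_end: int = 11) -> range:
    """Return the 1-based page numbers to download.

    Assumes manual_page_end is inclusive and clamps the requested
    range to the pages that actually exist in the item.
    """
    if total_pages < 1:
        return range(0)
    if not manual_pagination:
        # Default behaviour: every page of the item.
        return range(1, total_pages + 1)
    start = max(1, manual_page_start)
    end = min(total_pages, manual_page_end)
    # If start ends up greater than end, this range is simply empty.
    return range(start, end + 1)


print(len(resolve_page_range(300)))              # all 300 pages
print(list(resolve_page_range(300, True, 1, 11)))  # pages 1 through 11
print(list(resolve_page_range(5, True, 1, 11)))    # clamped to pages 1 through 5
```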

Future development

Before this tool is added to the interface of Harvard Digital Collections, it will need to be rewritten in JavaScript. More work might also be done to find the best endpoint. https://pds.lib.harvard.edu/pds/get/ may not be as fast as http://fds.lib.harvard.edu/fds/deliver/, but the File Delivery Service requires the exact DRS ID of each TXT file, and those IDs would need to be scraped from a comprehensive XML page available from the same endpoint.
