GithubHelp home page GithubHelp logo

ar-jan / corpus-scraper Goto Github PK

View Code? Open in Web Editor NEW

This project forked from centrefordigitalhumanities/corpus-scraper

0.0 1.0 0.0 79 KB

Automated corpus data collection for several online word-in-context corpuses

License: MIT License

JavaScript 100.00%

corpus-scraper's Introduction

CorpusScraper

By Digital Humanities Lab, Utrecht University

Motivation

Some online text concordance corpora, such as Corpus del Español, do not offer a button to download your search results in a practical format, effectively expecting you to copy, paste and edit the data manually page by page. CorpusScraper is a bookmarklet that automates that work for you, so saving your search results becomes nearly as easy as if there would be such a button.

See https://uudigitalhumanitieslab.github.io/corpus-scraper/ for a full explanation of the usage.

What’s included

The bookmarklet is written for Chrome and appears to work in Safari and Firefox as well. Internet Explorer seems not to work for most corpora, but your mileage may vary (either way, you should have at least version 9 installed).

Currently, the script can scrape data from all byu.edu corpora and from Real Academia Española (CREA/CORDE). As of version 2.0, corpora from the Fundación Rafael Lapesa are supported as well (CORPES, CREA, CNDHE). It is written such that you can add your own implementations for other online concordance corpora.

How to add support for a new online text corpus

The top anonymous function of the script contains a variable declaration that looks like this:

var domains = {
    ...
};

containing key-value pairs that look like this:

'corpus.example.org': {...}

add such a key-value pair for each domain that you want to add support for, using an existing pair as an example. Note that you can create aliases afterwards if the same implementation works on multiple domains, like in the following line:

domains['www.corpusdelespanol.org'] = domains['corpus.byu.edu'];

At the very least, the value part should contain the following four members:

columns: [...]

An array of strings, representing the names of the columns that will be extracted. This is considered a promise. Every data row that your domain implementation outputs, should have the same length as this array and contain the corresponding fields in the same order.

init: function ( ) {...}

Initializes the target and progressSteps top-level variables. target must be set to the innermost frame or window that contains the first page of data, while progressSteps should be set to the number of pages to extract from. Make sure to have a special case for when there is only a single page of results with no navigation available (see the Corpus del Español example).

getNext: function (doc) {...}

Returns either the URL to the next page in doc, which is a Document object, or a function which accepts a callback, fetches the next document and finally calls the callback with the next document. If there is no next page it returns undefined. Note that you cannot return the URL simply as anchor.href, because Chrome does not add the href property to anchor elements when the containing document is parsed in the background without rendering it. Instead, you should return anchor.getAttribute('href').

scrape1page: function (doc) {...}

Extracts the data from the page represented by doc (again a Document object) and appends those data to the top-level data array, one nested array per row (so data becomes an array of arrays of strings or numbers). Note that each row should contain the same number of fields in the same order as described in your columns member.

Please take note that your custom domain implementation should only use the target, data and progressSteps identifiers from the top-level function. Do not use window.document or window.jQuery, because that will probably not work out as you expect. You can however use doc.querySelector and doc.querySelectorAll instead. See http://www.w3.org/TR/selectors-api/ for a specification of these selectors.

corpus-scraper's People

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.