samuelbradshaw / python-scripture-scraper

Tool for downloading public domain scripture content from ChurchofJesusChrist.org.

License: MIT License

Python 99.10% CSS 0.77% HTML 0.12%
bible book-of-mormon church-of-jesus-christ csv html json markdown mysql scriptures sqlite

python-scripture-scraper's Introduction

Python Scripture Scraper

This tool provides a way to download scripture content and metadata from ChurchofJesusChrist.org. Content is pulled from public-facing pages and can be output to several formats, including JSON, HTML, Markdown, plain text, CSV, TSV, SQL (MySQL), and SQL (SQLite).

The Python Scripture Scraper is licensed under the MIT License. Scripture content downloaded by the Python Scripture Scraper using default settings is in public domain (see the Legal Q&A section below).

Sample data

Sample data downloaded by the Python Scripture Scraper can be found in the sample folder in this repository. You are welcome to copy and use the sample data, or run the script yourself, following the instructions below. (Running the script yourself will allow you to adjust several parameters.)

Running the script

  1. Verify that Python 3 is installed (installing Python with venv is recommended):

python3 -V

  2. Download the Python Scripture Scraper.

  3. In Terminal, go to the python-scripture-scraper directory:

cd /path/to/python-scripture-scraper

  4. Install dependencies:

pip3 install -r requirements.txt

  5. Configure any parameters you’d like to set in resources/config.py.

  6. Run the Python Scripture Scraper:

python3 scrape.py

Content will be downloaded to a folder called _output. Any previously-downloaded content in the _output folder will be overwritten when you run the script.
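
Because the _output folder is overwritten on each run, you may want to copy it aside first. The following is a minimal sketch, not part of the Python Scripture Scraper itself; the function name backup_output and the backup folder naming scheme are assumptions made for illustration.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_output(output_dir="_output", now=None):
    """Copy output_dir to a timestamped sibling folder.

    Returns the path of the backup, or None if output_dir does not exist.
    The "-backup-<timestamp>" suffix is a hypothetical naming convention.
    """
    source = Path(output_dir)
    if not source.is_dir():
        return None
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    destination = source.with_name(f"{source.name}-backup-{stamp}")
    shutil.copytree(source, destination)
    return destination
```

Calling backup_output() from the python-scripture-scraper directory before running scrape.py would leave a copy such as _output-backup-20240102-030405 next to the original folder.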

Configuration parameters

For the full list of configuration parameters, see resources/config.py.

Legal Q&A

If you plan to distribute scraped content publicly or use it commercially, you may want to consult with a legal professional; however, the information below might be helpful for personal projects.

Is scripture content copyrighted?

The complete text of the standard works of The Church of Jesus Christ of Latter-day Saints in English is in public domain in the United States, except for Official Declaration 2, which was first published in 1978. Public domain content is not subject to copyright, and can be used freely for any purpose.

The following content is in public domain, because it was first published more than 95 years ago:

  • Old Testament (English) – King James Version, first published in 1611.
  • New Testament (English) – King James Version, first published in 1611.
  • Book of Mormon (English) – first published in 1830.
  • Doctrine and Covenants (English) – first published in 1835, with occasional additions and removals, up through Section 138 (1918) and Official Declaration 1 (1890).
  • Pearl of Great Price (English) – first published in 1851, with occasional additions and removals.
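
The "more than 95 years ago" rule of thumb used for the list above is simple year arithmetic. The sketch below only encodes that rule as stated in this section; the function name is hypothetical, and it does not account for any other factor that affects copyright status.

```python
from datetime import date

# Term taken from the rule of thumb in the text: "first published more than 95 years ago".
TERM_YEARS = 95


def published_more_than_95_years_ago(year_published, current_year=None):
    """Return True if year_published is more than 95 years before current_year."""
    if current_year is None:
        current_year = date.today().year
    return current_year - year_published > TERM_YEARS


# Publication years below are the ones given in this section.
for title, year in [("Book of Mormon", 1830), ("Official Declaration 2", 1978)]:
    print(title, year, published_more_than_95_years_ago(year))
```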

The following content is not in public domain, and requires permission from the Church before it can be copied for anything other than personal or Church use:

  • Official Declaration 2 (1978).
  • Scripture study helps, including footnotes, chapter summaries, indexes, and other reference materials first published with the Church’s 1979/1981 edition of the scriptures.
  • Scripture study helps that have been added since 1981.
  • Translations of the scriptures first published within the last 95 years (most translations currently in use are still under copyright).
  • Audio recordings of the scriptures.
  • Scripture cover artwork.

By default, the Python Scripture Scraper will not download copyrighted content. A configuration setting to include copyrighted content is available, but should be used at your own risk, and is not intended for public or commercial use.

Is the most recent edition of the scriptures in public domain?

In the United States, copyright protections are available for “derivative works,” which include substantial revisions or translations of an earlier work.

Section headers and other study helps added in recent editions of the scriptures qualify for their own copyright protection. How content is organized and structured in the printed book and on the Church website may also be copyrightable. However, the main scripture text has not changed significantly in the past 95 years.

In order to qualify for copyright, a newly-published work must be original and creative. For example, these types of changes generally can’t qualify for copyright protection on their own:

  • Punctuation changes, modernizing spelling, and fixing typos (the changes aren’t creative).
  • Changes that bring the work closer to its original manuscript (the changes aren’t original).
  • Adding a table of contents (the layout may be protectable, but the list of books isn’t creative).

The most recent major edition of the English scriptures was published in 2013. A summary of the changes can be found here: Summary of Approved Adjustments for the 2013 Edition of the Scriptures (PDF).

Based on the above, the main scripture text in the latest English edition of the standard works is in public domain, inheriting from the main English scripture text in previous editions.

Is it legal to scrape content from a website?

Generally, courts in the United States have found web scraping of publicly-available web pages to be legal, but various factors are taken into consideration:

  • Is the data publicly available (can anyone access it)?
  • Is the data easy to access (or does it require digging in source code for an API)?
  • Does accessing the data require signing in?
  • Is the data sensitive (personally identifiable information)?
  • Is the data copyrighted?

The Church of Jesus Christ of Latter-day Saints provides a lot of content at ChurchofJesusChrist.org for Church members and others to use. However, the Church does not have unlimited resources, and the Church website is primarily designed to be used by humans (rather than scripts). Please be respectful in how you use this tool, to avoid overloading Church servers with too many or too frequent server requests. You will also want to avoid running the script during peak traffic times, such as Sundays.
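
One way to keep request volume low is to enforce a minimum delay between requests and to skip runs on Sundays. The sketch below illustrates that idea only; the Throttle class, the is_sunday helper, and the delay value are assumptions, and the scraper may already handle pacing differently in its own code or configuration.

```python
import time
from datetime import date


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now


def is_sunday(day=None):
    """Return True if the given date (default: today, local time) is a Sunday."""
    day = day or date.today()
    return day.weekday() == 6  # Monday is 0, Sunday is 6
```

A script would create one Throttle (for example, Throttle(2.0) for a two-second gap, a value chosen here arbitrarily) and call its wait() method immediately before each request.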

python-scripture-scraper's Issues

Unable to get current languages.

The current languages can no longer be scraped from https://www.churchofjesuschrist.org/languages?lang=eng.

Getting languages
Found 0 languages


Creating metadata-languages.json
Creating metadata-scriptures.json
Creating metadata-uri-to-name.json

Traceback (most recent call last):
  File "d:\...\python-scripture-scraper\scrape.py", line 1265, in <module>
    main()
  File "d:\...\python-scripture-scraper\scrape.py", line 132, in main
    output_full_content(config.DEFAULT_LANG)
  File "d:\...\python-scripture-scraper\scrape.py", line 727, in output_full_content
    if metadata_scriptures['languages'][bcp47_lang]['churchAvailability'][publication_slug]:
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'en'
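
The log above shows "Found 0 languages" followed by a KeyError when the default language is looked up. A guard along the following lines would turn that into a clearer error. This is a sketch only, not the project's code: the function name is hypothetical, and the dictionary layout is inferred from the traceback.

```python
def get_language_metadata(metadata_scriptures, bcp47_lang):
    """Return the metadata entry for bcp47_lang, or raise a descriptive error.

    The 'languages' key mirrors the lookup shown in the traceback; the rest
    of the structure is an assumption.
    """
    languages = metadata_scriptures.get('languages', {})
    if not languages:
        raise RuntimeError(
            "No languages were found; the languages page may have changed."
        )
    if bcp47_lang not in languages:
        raise RuntimeError(
            f"Language {bcp47_lang!r} is missing from the metadata; "
            f"found: {sorted(languages)}"
        )
    return languages[bcp47_lang]
```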

AttributeError: 'NoneType' object has no attribute 'find_next_sibling'

Python 3.11.3
Manjaro Linux
Linux 5.10.181-2-MANJARO

Getting languages
Warning: Language “asf” is not recognized and will be ignored (most likely it was added to Church content recently). To include this language, add it to the language mappings in resources.py.
Found 140 languages

Gathering metadata: af / Afrikaans (Afrikaans)
Gathering metadata: am / አማርኛ (Amharic)
Gathering metadata: ar / العربية (Arabic)
Gathering metadata: ase / American Sign Language (ASL) (ASL)
Gathering metadata: ay / Aymar Aru (Aymara)
Gathering metadata: bg / Български (Bulgarian)
Gathering metadata: bi / Bislama (Bislama)
Gathering metadata: bik / Bikol (Bikolano)
Gathering metadata: bla / Nitsi’powahsin (Blackfoot)
Gathering metadata: bm / Bambara (Bambara)
Gathering metadata: bn / বাংলা (Bengali)
Gathering metadata: ca / Català (Catalan)
Gathering metadata: cag / Nivacle (Chulupi)
Gathering metadata: cak / Cakchiquel (Kaqchikel)
Gathering metadata: ceb / Cebuano (Cebuano)
Gathering metadata: ch / Chamoru (Chamorro)
Gathering metadata: chk / Fosun Chuuk (Chuukese)
Gathering metadata: cmn-Hans / 简体中文 - 普通话 (Simplified Chinese/Mandarin)
Gathering metadata: cmn-Hant / 繁體中文 - 國語 (Traditional Chinese/Mandarin)
Gathering metadata: cs / Česky (Czech)
Gathering metadata: cuk / Dulegaya (Kuna)
Gathering metadata: cy / Cymraeg (Welsh)
Gathering metadata: da / Dansk (Danish)
Gathering metadata: de / Deutsch (German)
Traceback (most recent call last):
  File "/home/jared/Projects/scripture-scraper/scrape.py", line 1042, in <module>
    main()
  File "/home/jared/Projects/scripture-scraper/scrape.py", line 89, in main
    gather_metadata_for_language(language)
  File "/home/jared/Projects/scripture-scraper/scrape.py", line 282, in gather_metadata_for_language
    verse_range_separator_example = footnotes.select_one('#note1d_p1 a').find_next_sibling('a').text  # 'Mosiah 1:2–3'
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'find_next_sibling'
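
The traceback indicates that select_one returned None for this language, so the chained find_next_sibling call failed. A defensive version of that lookup might look like the sketch below. The function name is hypothetical and the selector is copied from the traceback; it assumes footnotes behaves like a BeautifulSoup element, where select_one and find_next_sibling return None when nothing matches.

```python
def get_verse_range_example(footnotes, selector='#note1d_p1 a'):
    """Return the text of the link after the first match, or None if absent."""
    first_link = footnotes.select_one(selector)
    if first_link is None:
        return None  # the expected footnote is not on this page
    next_link = first_link.find_next_sibling('a')
    if next_link is None:
        return None  # the footnote has no second link
    return next_link.text
```

The caller would then need to decide how to proceed when None is returned, for example by falling back to a default separator or skipping that language.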
