morrizzzzz / digidure

This repository contains code and scripts developed for the Digital Dutch Religion Portal 1500-2000 project.

Home Page: https://research-software-directory.org/projects/digidure

License: Apache License 2.0

Jupyter Notebook 100.00%

digidure's Introduction

DigiDuRe

This repository contains code and scripts developed for the Digital Dutch Religion Portal 1500-2000 project, a collaboration between the Vrije Universiteit Amsterdam, Faculty of Humanities, HDC and the Netherlands eScience Center. The main researcher from the VU is prof. dr. Fred van Lieburg, and the lead research software engineer from the Netherlands eScience Center is Dr. Maurice de Kleijn.

The project aims to map long-term developments in Dutch public discourse, especially in religion. An analysis of book titles and connected metadata with multiple other data sources should deliver a bottom-up reconstruction of trends and changes in thematization. To do so, four main activities are defined.

Overview of the various activities in DigiDuRe

1. Data Harmonization - Protestant individuals

A structured dataset with careers of Dutch Protestant ministers from 1555 until 2004 is created out of a series of semi-structured data sources that are the result of archival historical research. This process is considered data harmonization and entails the modeling of the data structure and a series of processing steps. The processing steps are available as Jupyter notebooks and are grouped under 1_Data_Harmonization. On top of the data harmonization, a curation process took place as well, especially since multiple data sources with information from the same individual have been integrated and contained differences in spelling. This process has resulted in a new dataset named CLERUS - Database Dutch Reformed Clergy, where the emphasis lies on ministers that have served in Dutch churches. The database and data model could be extended with data from ministers that were active in other countries (i.e., in the West and East Indies), as well as information about individuals that obtained the right to act as Protestant ministers (proponenten) but followed a different career path (e.g., school teacher, professor, medical doctor, etc.).

2. Linking and Enriching Data

To use book title data with the CLERUS, a script has been created that extracts the data from the Royal Library’s Linked Data SPARQL endpoint and translates that to separate tables. The main reason for this is that the digital skill level of the target audience did not match the required skill level needed for using Linked Data. In addition, the history scholars participating in this project want to be able to store the dataset locally and have the opportunity to manipulate it. 2_Linking_data thus consists of a script that allows downloading data from the Royal Library’s SPARQL endpoint to extract a selection from the Short-Title Catalogue Netherlands (STCN) dataset and the Nederlandse Bibliografie Totaal (NBT).
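As an illustration of the "Linked Data to separate tables" step, the sketch below flattens a response in the standard SPARQL 1.1 JSON results format into plain rows and writes them as CSV text. It is not the repository's script: the function names and the sample response (including the example.org URIs and titles) are assumptions made for the example, and the actual query and HTTP call to the Royal Library's endpoint are left out.

```python
import csv
import io


def bindings_to_rows(sparql_json):
    """Flatten a SPARQL 1.1 JSON result into a list of plain dicts."""
    variables = sparql_json["head"]["vars"]
    rows = []
    for binding in sparql_json["results"]["bindings"]:
        # A variable that is unbound in a solution is absent from the binding,
        # so fall back to an empty string to keep every row the same shape.
        rows.append({var: binding.get(var, {}).get("value", "") for var in variables})
    return rows


def rows_to_csv(rows, fieldnames):
    """Serialise the flattened rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical response; real STCN/NBT results will use other variables.
sample = {
    "head": {"vars": ["book", "title", "year"]},
    "results": {
        "bindings": [
            {
                "book": {"type": "uri", "value": "http://example.org/book/1"},
                "title": {"type": "literal", "value": "Example title"},
                "year": {"type": "literal", "value": "1650"},
            },
            {
                "book": {"type": "uri", "value": "http://example.org/book/2"},
                "title": {"type": "literal", "value": "Another title"},
            },
        ]
    },
}

rows = bindings_to_rows(sample)
csv_text = rows_to_csv(rows, sample["head"]["vars"])
```

Keeping the flattening separate from the download means the resulting tables can be stored locally and opened in a spreadsheet or loaded into a notebook without any Linked Data tooling.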

Linking CLERUS with the book title datasets is done based on string matching. However, the spelling of names in both datasets does not always align, and in many cases different people have been given the same names. 2_Linking_data provides a script that links CLERUS with the book title data.

Besides linking book title data, a script that connects CLERUS with the Dutch Biography Portal (BP) is made. Furthermore, a geocoding script has been developed to link place names in CLERUS (e.g., places of birth and where certain individuals acted as ministers, etc.) to XY coordinates allowing them to be plotted on a map.

3. Data Analysis

With the CLERUS dataset available, a series of analysis scripts have been developed to explore the data and add to the analysis of book titles, allowing for a bottom-up reconstruction of trends and changes in thematization. These analysis scripts can be found under 3_Data_Analysis. The scripts are dynamic and are presented for CLERUS v1, but can easily be rerun once CLERUS_v2 and CLERUS+ are processed and curated (see 1_Data_Harmonization).

4. Dissemination

The project has been presented on many occasions. Under 4_Dissemination, slide decks can be found, as well as installation instructions for the software used.

5. Collaborators

Doreen van den Boogaart (VU researcher) - Shaping data model, curating multiple datasources and transcribing archival resources
Robin Korink (VU Student Assistant) - Curation DRC
Renée Brouwer (VU Student Assistant) - Curation DRC and DM
Cécile Bras (VU Student Assistant) - Curation DRC
Fred van Lieburg (professor at VU – director of HDC Centre for Religious History) - Project Leader
Mart van Lieburg (emeritus professor – director Trefpunt Medische Geschiedenis in Nederland) - External advisor
Maurice de Kleijn (senior Research Software Engineer – Netherlands eScience Center) - Developing data model, harmonizing dataset, performing analyses, creating and authoring this repository.

6. Use of generative AI

Parts of the code in this repository have been generated with the help of ChatGPT 3.5. In this process we never entered actual names of individuals; we used dummy data instead. All AI output has been verified for correctness, accuracy and completeness, adapted where needed, and approved by the author.

digidure's Issues

translate CLERUS to LOD

Map the various fields from CLERUS to vocabularies such as schema.org.
Develop a pipeline that translates CLERUS to RDF triples.
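A minimal sketch of what such a translation could look like, written in plain Python and emitting N-Triples lines. The CLERUS field names (`family_name`, `given_name`, `birth_date`), the base URI and the record are hypothetical; only the schema.org terms (`Person`, `familyName`, `givenName`, `birthDate`) are existing vocabulary. A real pipeline would more likely use an RDF library and a reviewed mapping.

```python
SCHEMA = "https://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "https://example.org/clerus/person/"  # hypothetical base URI

# Hypothetical CLERUS column names mapped to schema.org properties.
FIELD_TO_PREDICATE = {
    "family_name": SCHEMA + "familyName",
    "given_name": SCHEMA + "givenName",
    "birth_date": SCHEMA + "birthDate",
}


def escape_literal(value):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def record_to_ntriples(record):
    """Translate one record (a dict) into a list of N-Triples lines."""
    subject = f"<{BASE}{record['id']}>"
    lines = [f"{subject} <{RDF_TYPE}> <{SCHEMA}Person> ."]
    for field, predicate in FIELD_TO_PREDICATE.items():
        value = record.get(field)
        if value:  # skip empty or missing fields instead of emitting blank literals
            lines.append(f'{subject} <{predicate}> "{escape_literal(str(value))}" .')
    return lines


# Dummy record, not an actual individual.
record = {"id": "p1", "family_name": "Jansen", "given_name": "Jan", "birth_date": ""}
triples = record_to_ntriples(record)
```

Holding the mapping in a single dictionary keeps the choice of vocabulary separate from the serialisation, so the mapping can be discussed and revised without touching the code that writes triples.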

Documentation for bz_linkage branch

The branch bz_linkage links the Boekzaallijst with the DRC dataset. It suggests links based on rule-based matching combined with Levenshtein distance. It does what we need it to do; however, the documentation needs to be updated or added. After that it can be merged into main.

Analysis scripts

Meaningfully link the various datasets and apply specific rule-based queries to generate new insights into various discourses in religious history. This needs to be discussed with the Lead Applicant of the project.

Restructuring of the repository

Now that the various activities of the project have become clear, we need to restructure the repo.

Update Boekzaallijst datalinking script

A script to propose links between the Boekzaallijst and DRC needs to be checked and updated with comments.

Extracting STCN data pipeline

Review the pipeline that extracts data from the Short-Title Catalogue Netherlands (STCN) via the KB SPARQL endpoint.

Biography portal

A pipeline to map data from the Biography portal with CLERUS. This pipeline needs to be reviewed.

Generate pipeline that geocodes the place names in the CLERUS dataset

An initial pipeline to geocode the various placenames in the dataset has been produced using a join with Geonames.org.

We need to check whether any GeoNames-like tools or datasets exist for mapping historical place names. The dataset from Rombert Stapel is interesting in that regard.
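The join with a gazetteer can be sketched as a normalised name lookup. This is an illustration under stated assumptions, not the pipeline in the repository: the gazetteer is a small in-memory list with hypothetical field names (`name`, `lat`, `lon`) and approximate coordinates, standing in for a GeoNames extract.

```python
def normalise(name):
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.casefold().split())


def geocode(place_names, gazetteer):
    """Join place names with gazetteer entries on the normalised name.

    Returns a dict of matched names with (lat, lon) and a list of names
    that found no match and need manual or historical lookup.
    """
    index = {normalise(entry["name"]): (entry["lat"], entry["lon"]) for entry in gazetteer}
    matched, unmatched = {}, []
    for name in place_names:
        coords = index.get(normalise(name))
        if coords is None:
            unmatched.append(name)
        else:
            matched[name] = coords
    return matched, unmatched


# Hypothetical gazetteer extract with approximate coordinates.
gazetteer = [
    {"name": "Amsterdam", "lat": 52.37, "lon": 4.89},
    {"name": "Utrecht", "lat": 52.09, "lon": 5.12},
]

matched, unmatched = geocode(["amsterdam ", "Utrecht", "Amstelredam"], gazetteer)
```

The unmatched list is where a historical gazetteer would come in: spellings that a modern gazetteer does not know are returned separately instead of being dropped silently.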

Ideas for improvement

Great project @Morrizzzzz!

Here are some thoughts on taking this repository to a more reusable state:

Focus on writing functions where you can

Functions are reusable bits of code
They are important starting points for documentation and tests as well

Instead of (or in addition to) notebooks, write Python scripts

Particularly the functions should be written in .py files
They can be imported in your notebook as needed

Go ahead and create a new branch, and a pull request, and I will be happy to review!

Additional data from reformed ministers overseas

The LA will produce an additional dataset with Dutch reformed ministers that went overseas. This will be integrated into CLERUS according to the same structure as DRC.

Review the DRC pipeline

Review the pipeline that translated a text field into a relational database, followed by a process of manual curation by a group of student assistants who were supervised by a domain expert.

NBT SPARQL endpoint

Add functionality to activity 4 that allows retrieving data from NBT as well.

Update the Database schema for CLERUS

As the main output of the project, a data model is created that allows integrating various types of data that have already been collected or that are being collected and curated from various archives.

The data model and harmonized dataset this project will produce is called CLERUS; it contains data about Dutch Reformed clergy from 1500 to 2000.

It consists of the following data sources:

1. Dutch Reformed Clergy data (DRC) (1500-1816) - A semi-structured text file that has been generated since the 1990s as part of the dissertation research performed by prof. dr. Fred van Lieburg and updated ever since. The dataset contains all individuals who were Protestant ministers from 1500 onward and took their "minister exam" before 1816.
2. Dutch Ministers (DM) (1500-2005) - A table with one row for every time a Dutch Reformed minister has acted as such in a Dutch parish.
3. ACTA Classis (~1620-1735) - This dataset contains all individuals who passed an exam that gives them the right to act as an official Reformed minister (they are called "proponenten" in Dutch). Note that not all individuals who took this exam actually pursued such a career. Individuals from DRC and DM from this period will be present in this dataset, but it will contain more than that. The collection of this dataset has started; this data model therefore functions as a framework with which the data can be structured.
4. Keppel (published in 1747) - Contains information about all "proponenten" from the period ~1700 to 1747. This dataset is complete, and connections with the DRC have been made. It could, however, be that some links are missing.
5. Boekzaallijst (1736-1816) - Like ACTA Classis (3) and Keppel (4), this dataset contains all proponenten, but from a different, partly overlapping period. This dataset is complete, and connections with the DRC have been made. It could, however, be that some links are missing.
6. Naamregister (1717-1739) - Like 3, 4 and 5, this dataset contains proponenten and is complete. Connections with the DRC have been made. It could, however, be that some links are missing.

Pipeline to translate datasets to CLERUS

In the project a series of datasets will be integrated to result in a full version of CLERUS.

For this integration the DRC dataset has been used as a starting point.

Data points in the datasets that represent the same individual are connected through the DRC id. This has been done manually, supported by joins using, for example, Levenshtein distance and combinations of years and surnames. It would be an idea to include the tooling developed for this project in a set of tutorials (as mentioned in issue 8 #14).

A script should be developed that integrates and maps all these datasets into the CLERUS datamodel #7

General Linking of datasets

The core challenge in this project is that datasets are to be linked based on the values of various fields. Many options exist for this, for instance combining the year and surname from two datasets or applying methods like Levenshtein distance (combining the two also makes a lot of sense).

It would be valuable for this project to write a series of tutorials and implementations that allow users to work with these tools.
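As a possible starting point for such a tutorial, the sketch below combines the two ideas mentioned above: records are only compared when their years are close, and within that window a link is proposed when the Levenshtein distance between the surnames is small. The record layout (`id`, `surname`, `year`), the thresholds and the names are assumptions for illustration and use dummy data; this is not the linking script from the repository.

```python
def levenshtein(a, b):
    """Edit distance between two strings, using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def propose_links(left, right, max_distance=2, max_year_gap=1):
    """Propose (left_id, right_id, distance) links for manual review."""
    proposals = []
    for l_rec in left:
        for r_rec in right:
            # Rule 1: the years must be close enough to compare at all.
            if abs(l_rec["year"] - r_rec["year"]) > max_year_gap:
                continue
            # Rule 2: the surnames must be within a small edit distance.
            distance = levenshtein(l_rec["surname"].lower(), r_rec["surname"].lower())
            if distance <= max_distance:
                proposals.append((l_rec["id"], r_rec["id"], distance))
    return sorted(proposals, key=lambda proposal: proposal[2])


# Dummy records, not actual individuals.
left = [
    {"id": "A1", "surname": "Jansen", "year": 1700},
    {"id": "A2", "surname": "Pietersen", "year": 1710},
]
right = [
    {"id": "B1", "surname": "Janssen", "year": 1701},
    {"id": "B2", "surname": "Jansen", "year": 1750},
    {"id": "B3", "surname": "Pieterszen", "year": 1710},
]

proposals = propose_links(left, right)
```

The output is a list of suggestions, not final links: because different people can share the same name, each proposal still needs a manual check, in line with the curation process described for the project.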