morrizzzzz / digidure

This repository contains code and scripts developed for the Digital Dutch Religion Portal 1500-2000 project.

Home Page: https://research-software-directory.org/projects/digidure

License: Apache License 2.0

Jupyter Notebook 100.00%

digidure's Introduction

DigiDuRe

This repository contains code and scripts developed for the Digital Dutch Religion Portal 1500-2000 project, a collaboration between the Vrije Universiteit Amsterdam, Faculty of Humanities, HDC and the Netherlands eScience Center. The main researcher from the VU is prof. dr. Fred van Lieburg, and the lead research software engineer from the Netherlands eScience Center is Dr. Maurice de Kleijn.

The project aims to map long-term developments in Dutch public discourse, especially in religion. An analysis of book titles and connected metadata, combined with multiple other data sources, should deliver a bottom-up reconstruction of trends and changes in thematization. To do so, four main activities are defined.

Figure 1 shows a schematic overview of the various activities.

Figure 1: Overview of the various activities in DigiDuRe

1. Data Harmonization - Protestant individuals

A structured dataset with careers of Dutch Protestant ministers from 1555 until 2004 is created out of a series of semi-structured data sources that are the result of archival historical research. This process is considered data harmonization and entails the modeling of the data structure and a series of processing steps. The processing steps are available as Jupyter notebooks and are grouped under 1_Data_Harmonization. On top of the data harmonization, a curation process took place as well, especially since multiple data sources with information from the same individual have been integrated and contained differences in spelling. This process has resulted in a new dataset named CLERUS - Database Dutch Reformed Clergy, where the emphasis lies on ministers that have served in Dutch churches. The database and data model could be extended with data from ministers that were active in other countries (i.e., in the West and East Indies), as well as information about individuals that obtained the right to act as Protestant ministers (proponenten) but followed a different career path (e.g., school teacher, professor, medical doctor, etc.).

2. Linking and Enriching Data

To use book title data with CLERUS, a script has been created that extracts the data from the Royal Library’s Linked Data SPARQL endpoint and translates it into separate tables. The main reason for this is that the digital skill level of the target audience did not match the skill level required for using Linked Data. In addition, the history scholars participating in this project want to be able to store the dataset locally and manipulate it. 2_Linking_data thus contains a script that downloads data from the Royal Library’s SPARQL endpoint to extract a selection from the Short-Title Catalogue Netherlands (STCN) dataset and the Nederlandse Bibliografie Totaal (NBT).
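As an illustration of the step from Linked Data to tables, the sketch below flattens a response in the standard SPARQL 1.1 JSON results format into a header and rows and writes them as CSV text. It is a minimal sketch, not the script in 2_Linking_data: the function names `bindings_to_rows` and `rows_to_csv`, the variable names `title`, `author` and `year`, and the hand-written dummy response are all assumptions made for this example.

```python
import csv
import io


def bindings_to_rows(sparql_json):
    """Flatten a SPARQL 1.1 JSON result into (header, rows)."""
    header = sparql_json["head"]["vars"]
    rows = []
    for binding in sparql_json["results"]["bindings"]:
        # Unbound variables are absent from a binding, so default to "".
        rows.append([binding.get(var, {}).get("value", "") for var in header])
    return header, rows


def rows_to_csv(header, rows):
    """Serialize the table as CSV text that can be saved locally."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hand-written dummy response (hypothetical variables, no real catalogue data).
example = {
    "head": {"vars": ["title", "author", "year"]},
    "results": {
        "bindings": [
            {
                "title": {"type": "literal", "value": "Dummy title A"},
                "author": {"type": "literal", "value": "Dummy Author"},
                "year": {"type": "literal", "value": "1650"},
            },
            {
                # No author bound for this title.
                "title": {"type": "literal", "value": "Dummy title B"},
                "year": {"type": "literal", "value": "1702"},
            },
        ]
    },
}

header, rows = bindings_to_rows(example)
csv_text = rows_to_csv(header, rows)
```

Keeping the flattening step separate from the download step means the same function can be reused for the STCN and NBT selections, whatever query produced them.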

Linking CLERUS with the book title datasets is done based on string matching. However, the spelling of names in both datasets does not always align, and in many cases different people have been given the same name, so matches need to be treated as candidates rather than certainties. 2_Linking_data provides a script that links CLERUS with the book title data.
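The two complications mentioned above can be made concrete with a small sketch. It normalizes names (case, accents, punctuation) before comparing them and returns every CLERUS id that shares the normalized name, so that namesakes show up as multiple candidates instead of being silently merged. This is an assumption about how such matching could look, not the repository's linking script; `normalize_name`, `candidate_links` and the dummy records are hypothetical.

```python
import unicodedata
from collections import defaultdict


def normalize_name(name):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in without_accents
    )
    return " ".join(cleaned.lower().split())


def candidate_links(clerus_records, book_records):
    """Map each book id to all CLERUS ids with the same normalized name."""
    index = defaultdict(list)
    for person_id, name in clerus_records:
        index[normalize_name(name)].append(person_id)
    return {
        book_id: list(index.get(normalize_name(author), []))
        for book_id, author in book_records
    }


# Dummy data only: (id, name) pairs.
clerus = [(1, "Jan de Vries"), (2, "Jan de Vries"), (3, "Pieter Janssen")]
books = [("b1", "JAN DE VRIES"), ("b2", "Pieter Janssén"), ("b3", "Unknown Author")]

links = candidate_links(clerus, books)
```

Here `b1` has two candidates because two individuals share a name, which is exactly the case that needs extra fields (such as years) or manual curation to resolve. Normalization only covers case, accents and punctuation; genuine spelling variants need a fuzzy comparison.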

Besides linking book title data, a script has been made that connects CLERUS with the Dutch Biography Portal (BP). Furthermore, a geocoding script has been developed to link place names in CLERUS (e.g., places of birth and places where individuals acted as ministers) to XY coordinates, allowing them to be plotted on a map.
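A gazetteer lookup is one simple way to attach XY coordinates to place names. The sketch below assumes a small in-memory gazetteer and an alias table for variant names; both tables, the function `geocode`, and the approximate coordinates (x = longitude, y = latitude, rounded to two decimals) are illustrative assumptions and do not describe the geocoding script in this repository.

```python
# Approximate coordinates as (x, y) = (longitude, latitude); illustrative only.
GAZETTEER = {
    "amsterdam": (4.90, 52.37),
    "utrecht": (5.12, 52.09),
    "den haag": (4.31, 52.08),
}

# Variant names mapped to the gazetteer key they should resolve to.
ALIASES = {
    "'s-gravenhage": "den haag",
}


def geocode(place_name, gazetteer=GAZETTEER, aliases=ALIASES):
    """Return (x, y) for a place name, or None when it is not in the gazetteer."""
    key = " ".join(place_name.lower().split())
    key = aliases.get(key, key)
    return gazetteer.get(key)


# Dummy records with a hypothetical 'place' field.
records = [{"id": 1, "place": " Utrecht "}, {"id": 2, "place": "Nowhere"}]
located = [{**rec, "xy": geocode(rec["place"])} for rec in records]
```

Returning `None` for unknown places keeps unmatched names visible, so they can be added to the alias table or curated by hand.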

3. Data Analysis

With the CLERUS dataset available, a series of analysis scripts have been developed to explore the data and add to the analysis of book titles, allowing for a bottom-up reconstruction of trends and changes in thematization. These analysis scripts can be found under 3_Data_Analysis. The scripts are dynamic and are presented for CLERUS v1, but can easily be rerun once CLERUS_v2 and CLERUS+ are processed and curated (see 1_Data_Harmonization).

4. Dissemination

The project has been presented on many occasions. Under 4_Dissemination, slide decks can be found, as well as installation instructions for the software used.

5. Collaborators

  • Doreen van den Boogaart (VU researcher) - Shaping data model, curating multiple data sources and transcribing archival resources

  • Robin Korink (VU Student Assistant) - Curation DRC

  • Renée Brouwer (VU Student Assistant) - Curation DRC and DM

  • Cécile Bras (VU Student Assistant) - Curation DRC

  • Fred van Lieburg (professor at VU – director of HDC Centre for Religious History) - Project Leader

  • Mart van Lieburg (emeritus professor – director Trefpunt Medische Geschiedenis in Nederland) - External advisor

  • Maurice de Kleijn (senior Research Software Engineer – Netherlands eScience Center) - Developing data model, harmonizing dataset, performing analyses, creating and authoring this repository.

6. Use of generative AI

Parts of the code in this repository have been generated with the help of ChatGPT 3.5. In this process we never entered actual names of individuals, but used dummy data instead. All AI output has been verified for correctness, accuracy and completeness, adapted where needed, and approved by the author.

digidure's Issues

translate CLERUS to LOD

  • Map the various fields from CLERUS to various vocabularies like schema.org etc.

  • Develop a pipeline that translates CLERUS to RDF triples.
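A possible starting point for both steps is sketched below: a field-to-predicate mapping onto schema.org, and a function that serializes one record as N-Triples lines. The field names (`id`, `surname`, `first_name`, `birth_date`), the placeholder base IRI under example.org and the choice of properties are assumptions for illustration; the actual mapping still has to be decided.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA = "https://schema.org/"
BASE = "https://example.org/clerus/person/"  # placeholder IRI, not a real namespace

# Hypothetical CLERUS field names mapped to schema.org properties.
FIELD_TO_PREDICATE = {
    "surname": SCHEMA + "familyName",
    "first_name": SCHEMA + "givenName",
    "birth_date": SCHEMA + "birthDate",
}


def escape_literal(value):
    """Escape the characters that must not appear raw in an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(record):
    """Translate one record (a dict) into a list of N-Triples lines."""
    subject = f"<{BASE}{record['id']}>"
    lines = [f"{subject} <{RDF_TYPE}> <{SCHEMA}Person> ."]
    for field, predicate in FIELD_TO_PREDICATE.items():
        value = record.get(field)
        if value:  # skip empty or missing fields
            lines.append(f'{subject} <{predicate}> "{escape_literal(str(value))}" .')
    return lines


# Dummy record only.
triples = record_to_ntriples({"id": 1, "surname": 'De "Test"', "first_name": "Jan"})
```

Writing plain N-Triples keeps the sketch free of dependencies; a library such as rdflib would be a natural choice once the mapping stabilizes.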

Documentation for bz_linkage branch

The bz_linkage branch links the Boekzaallijst with the DRC dataset. It suggests links based on rule-based matching combined with Levenshtein distance. It does what we need it to do; however, the documentation needs to be updated or added. After that, it can be merged into main.

Analysis scripts

Link the various datasets in a meaningful way and apply specific rule-based queries to generate new insights into various discourses in religious history. This needs to be discussed with the lead applicant of the project.

Ideas for improvement

Great project @Morrizzzzz!

Here are some thoughts on taking this repository to a more reusable state:

Focus on writing functions where you can

  • Functions are reusable bits of code
  • They are important starting points for documentation and tests as well

Instead of (or in addition to) notebooks, write Python scripts

  • Particularly the functions should be written in .py files
  • They can be imported in your notebook as needed
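To make this concrete, here is a sketch of what one such function and its test could look like in a .py file. The function `parse_period` and the idea of parsing period strings are only an example I made up for illustration, not something that exists in the repository.

```python
import re


def parse_period(text):
    """Parse a period such as '1736 -1816' into (start_year, end_year)."""
    match = re.fullmatch(r"\s*(\d{4})\s*-\s*(\d{4})\s*", text)
    if match is None:
        raise ValueError(f"not a period: {text!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        raise ValueError(f"period ends before it starts: {text!r}")
    return start, end


def test_parse_period():
    # A test documents the expected behaviour and guards against regressions.
    assert parse_period("1736 -1816") == (1736, 1816)
    assert parse_period("1500-2000") == (1500, 2000)


test_parse_period()
```

If this lived in a file called, say, periods.py (a hypothetical name), the notebook would only need `from periods import parse_period`.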

Go ahead and create a new branch, and a pull request, and I will be happy to review!

Review the DRC pipeline

Review the pipeline that translates a text field into a relational database, followed by a process of manual curation by a group of student assistants who were supervised by a domain expert.

Update the Database schema for CLERUS

As the main output of the project, a datamodel is created that allows integrating various types of data that have already been collected or that are being collected and curated from various archives.

The datamodel and harmonized dataset this project will produce is called CLERUS, which contains data about Dutch Reformed Clergy from 1500 to 2000.

It consists of the following data sources:

  1. Dutch Reformed Clergy data (DRC) (1500-1816) - A semi-structured text file that was generated in the 1990s as part of the dissertation research performed by prof. dr. Fred van Lieburg and has been updated ever since. The dataset contains all individuals who became Protestant ministers from 1500 onwards and took their "minister exam" before 1816.

  2. Dutch Ministers (DM) (1500-2005) - A table with one row for every time a Dutch Reformed minister has acted as such in a Dutch parish.

  3. ACTA Classis (~1620-1735) - This dataset contains all individuals who have taken an exam that gives them the right to act as an official Reformed minister (named "proponenten" in Dutch). Note that not all individuals who took this exam actually pursued such a career. Individuals from DRC and DM from this period will be present in this dataset, but it will contain more than that. The collection of this dataset has started; this datamodel therefore functions as a framework with which the data can be structured.

  4. Keppel (published in 1747) - Contains information about all "proponenten" from the period ~1700 to 1747. This dataset is complete, and connections with the DRC have been made. It could, however, be that some links are missing.

  5. Boekzaallijst (1736-1816) - Like ACTA Classis (3) and Keppel (4), this dataset contains all proponenten, but from a different, partly overlapping period. This dataset is complete, and connections with the DRC have been made. It could, however, be that some links are missing.

  6. Naamregister (1717-1739) - Like 3, 4 and 5, this dataset contains proponenten and is complete. Connections with the DRC have been made. It could, however, be that some links are missing.

Pipeline to translate datasets to CLERUS

In the project a series of datasets will be integrated to result in a full version of CLERUS.

For this integration the DRC dataset has been used as a starting point.

Data points in the datasets that represent the same individual are connected through the DRC id. This has been done manually, supported by joins that use, for example, Levenshtein distance and combinations of years and surnames. It would be an idea to include the tooling developed for this project in a set of tutorials (as mentioned in issue 8 #14 )

A script should be developed that integrates and maps all these datasets into the CLERUS datamodel #7

General Linking of datasets

The core challenge in this project is that datasets have to be linked based on the values of various fields. Many options exist for doing so, for instance combining the year and surname from two datasets, or applying methods like Levenshtein distance (combining the two also makes a lot of sense).

It would be valuable for this project to write a series of tutorials and implementations that allows users to work with these tools.
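As a first step towards such a tutorial, the sketch below combines the two options mentioned above: records are first grouped by year, and within the same year surnames are compared with Levenshtein distance. It is a minimal sketch under stated assumptions, not the tooling used in the project: the record fields (`id`, `surname`, `year`), the function names and the threshold `max_distance=2` are hypothetical, and the data is dummy data.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def suggest_links(left, right, max_distance=2):
    """Suggest (left_id, right_id, distance) for records with the same year
    whose surnames are within max_distance edits of each other."""
    by_year = {}
    for record in right:
        by_year.setdefault(record["year"], []).append(record)
    suggestions = []
    for record in left:
        for other in by_year.get(record["year"], []):
            distance = levenshtein(record["surname"].lower(), other["surname"].lower())
            if distance <= max_distance:
                suggestions.append((record["id"], other["id"], distance))
    return sorted(suggestions, key=lambda s: (s[0], s[2]))


# Dummy data only.
left = [
    {"id": 1, "surname": "Jansen", "year": 1700},
    {"id": 2, "surname": "Bakker", "year": 1710},
]
right = [
    {"id": "a", "surname": "Janssen", "year": 1700},
    {"id": "b", "surname": "Jansen", "year": 1750},
    {"id": "c", "surname": "Backer", "year": 1710},
    {"id": "d", "surname": "Visser", "year": 1700},
]

suggestions = suggest_links(left, right)
```

Grouping by year first keeps the number of string comparisons small and prevents an exact surname match in a different year (record `b`) from being suggested. The output is a list of suggestions for a curator to review, not a final set of links.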
