Pubmed-Wikibase

Automated creation and filling of a new Wikibase instance with PubMed metadata and MeSH terms.
  • Automated inserts into a fresh RaiseWikibase instance
  • Scripts for automated property and item creation given a query term
  • Scripts tested on Windows within an Ubuntu 20.04 (WSL) terminal with Python 3.9.2
  • The documentation can also be read on the hosted GitHub Page

Overview

A repository specializing in creating a fresh Wikibase instance filled with relevant PubMed metadata and MeSH Headings from the NLM database.

Entities and Properties

The knowledge graph of this Wikibase instance is based on two types of entities:

  • PubMed Articles
  • MeSH Term Entities

The two entity types are connected to each other semantically via the shared "MeSH ID".
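For illustration, a minimal sketch of that linkage (the property labels follow the scripts further below; the concrete descriptor ID and labels are just examples):

# An article item and a MeSH item share the same MeSH descriptor ID:
article_item = {
    'label': 'Example PubMed article',
    'claims': {'MeSH ID': 'D003141'},         # points at the MeSH item below
}
mesh_item = {
    'label': 'Communicable Diseases',
    'claims': {'MeSH Unique ID': 'D003141'},  # the shared identifier
}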

How to use

Installation

After cloning/forking this repository, the following installation steps are required:
For Windows: Install & Enable WSL2

Microsoft Docs : https://docs.microsoft.com/en-us/windows/wsl/install-win10

For Windows: Install Ubuntu Terminal

Ubuntu Docs : https://ubuntu.com/tutorials/ubuntu-on-windows#1-overview

Install Docker & Docker Desktop

Docker Docs : https://docs.docker.com/get-docker/

Creating a Virtual Environment for Usage
On Linux and MacOS

$ python3 -m venv venv
$ source venv/bin/activate

On Windows

> py -m venv venv
> .\venv\Scripts\activate

Requirements.txt

pip3 install -r requirements.txt

Further Instructions

RaiseWikibase as further reference: https://github.com/UB-Mannheim/RaiseWikibase

git clone https://github.com/UB-Mannheim/RaiseWikibase
cd RaiseWikibase/
sudo apt install default-libmysqlclient-dev
pip3 install -e .
cp env.tmpl .env    # then insert usernames/passwords in .env
docker-compose up -d

Easy Errors to avoid on setup

  • Fill the Wikibase instance as soon as possible with first items/properties. An empty Wikibase is prone to many errors, especially on restart.
  • env.tmpl: passwords must be at least 10 characters long, or else the Wikibase services will refuse to properly authenticate.
  • env.tmpl: the values of WB_DB_USER, WB_DB_NAME and WB_ADMIN_USER must be fully capitalized in order to avoid DB connection errors when inserting items (see the sketch below).
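A hypothetical .env sketch illustrating these constraints (variable names follow the bullets above; the password variable name and all values are placeholders — check env.tmpl for the real list):

WB_DB_USER=SQLUSER          # fully capitalized value
WB_DB_NAME=MY_WIKI          # fully capitalized value
WB_ADMIN_USER=WBADMIN       # fully capitalized value
WB_ADMIN_PASS=0123456789x   # at least 10 characters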

Customizing Wikibase

With extensions, it is possible to customize the Wikibase instance.

Making Changes

- Edit LocalSettings.php.template

Extensions

- Download the extension into ./RaiseWikibase/extensions
- Add a volume to docker-compose.yml:
    - ./extensions/TemplateStyles:/var/www/html/extensions/TemplateStyles
- Load the extension via LocalSettings.php.template:

wfLoadExtension( 'TemplateStyles' );
${DOLLAR}wgTidyConfig = [
    'driver' => 'RaggettInternalPHP',
    'tidyConfigFile' => "${DOLLAR}IP/includes/tidy/tidy.conf",
];
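For orientation, a hypothetical docker-compose.yml excerpt showing where such a volume mapping lives (the service name wikibase follows RaiseWikibase's compose file; verify against your copy):

wikibase:
  # ...image, ports, environment etc. unchanged...
  volumes:
    - ./extensions/TemplateStyles:/var/www/html/extensions/TemplateStyles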

Important Scripts

The following sections introduce the scripts responsible for the automated ingestion into a fresh Wikibase instance.

Systems.py

Main script that executes all important scripts to successfully fill the Wikibase instance:
systems.py

# Import main scripts
import create_properties
import data_retrieval
import create_mesh_items
import create_items_wd

retmaximum = 10                     # maximum number of articles to retrieve
queryterm = 'infectious diseases'   # PubMed query term

# Execute scripts
create_properties.first_property_setup()
metadata = data_retrieval.main(retmaximum, queryterm)
create_mesh_items.main('meshtermlist.csv')
create_items_wd.main(metadata)
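Assuming the Docker containers are up and the virtual environment is active, the whole pipeline can then be started from the repository root:

python3 systems.py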

Wikibase Log-in

Creates a bot, saves the credentials in config.json and uses those credentials to log into the Wikibase instance:

# Assumed imports: wdi_config and wdi_login come from WikidataIntegrator;
# Settings and create_bot are helpers from RaiseWikibase/this repo
# (verify the exact module paths against the codebase).
from wikidataintegrator import wdi_login
from wikidataintegrator.wdi_config import config as wdi_config

def wikibase_login():
    ## Create a bot and save its credentials in config.json
    create_bot()
    config = Settings()

    ## Point WikidataIntegrator at the local Wikibase instance
    wdi_config['MEDIAWIKI_API_URL'] = config.mediawiki_api_url
    wdi_config['SPARQL_ENDPOINT_URL'] = config.sparql_endpoint_url
    wdi_config['WIKIBASE_URL'] = config.wikibase_url

    # The config can then be used with WikidataIntegrator to create a login instance:
    login_instance = wdi_login.WDLogin(user=config.username, pwd=config.password)
    return login_instance
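The returned login_instance is then handed to the upload functions shown below, e.g. upload_data(login_instance, config, metadata).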

Creating Properties

Lets the user create properties in the Wikibase instance automatically, both by extracting properties directly from Wikidata and by creating custom properties:
create_properties.py

# Extract properties from Wikidata (WD) and insert them into the Wikibase instance
p1 = property_wd('P932')  # PMCID
p2 = property_wd('P698')  # PubMed ID
batch('wikibase-property', [p1, p2])

# Create custom MeSH-relevant properties
p27 = create_property('MeSH Heading', ['Heading'], 'Label of the MeSH descriptor ID', 'string')
p28 = create_property('ScopeNote', ['Note'], 'Note further describing the label of a MeSH Heading', 'string')
p29 = create_property('MeshBrowserLink', ['Mesh URL', 'MeSH Browser'], 'MeSH Browser URL of a specific MeSH Heading', 'url')
p30 = create_property('MeSH Treecode', ['Tree', 'MeSH Tree'], 'MeSH tree code of a specific MeSH Heading', 'string')
batch('wikibase-property', [p27, p28, p29, p30])
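Here batch is RaiseWikibase's bulk-insert helper, while property_wd and create_property are defined in create_properties.py. As rough orientation, a plausible sketch of create_property built on RaiseWikibase's datamodel helpers (label, alias, description, entity); the repo's actual implementation may differ:

from RaiseWikibase.datamodel import alias, description, entity, label

def create_property(label_en, aliases_en, description_en, datatype):
    # Build the JSON for a new property with an English label, aliases and description
    return entity(labels=label(value=label_en),
                  aliases=alias(value=aliases_en),
                  descriptions=description(value=description_en),
                  etype='property',
                  datatype=datatype)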

Data Retrieval

Automatically creates a CSV file with all used MeSH terms, their descriptions and tree numbers, as well as a list of all articles found for the query term:
data_retrieval.py

# Assumes earlier in the script: df (DataFrame of MeSH terms), options (Chrome options),
# TNlist and timelist (result lists), plus imports of time and selenium's webdriver.
urllist = df['MeSHBrowserLink'].tolist()
for url in urllist:
    start = time.time()
    driver = webdriver.Chrome(options=options)  # 'chrome_options=' is deprecated in Selenium 4
    driver.get(url)
    time.sleep(3)  # wait for the tree numbers to render
    # Selenium 3 API; in Selenium 4 use find_elements(By.XPATH, ...)
    elements = driver.find_elements_by_xpath('//a[contains(@id,"treeNumber_")]')
    TNlist.append([e.text for e in elements])
    driver.quit()
    end = time.time()
    timelist.append(end - start)
df['TreeNumbers'] = TNlist
df.to_csv('meshtermlist.csv')
return dicts
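The '#text' and '@UI' keys used in the scripts below suggest the PubMed XML is parsed with xmltodict. A hedged sketch of how such metadata can be retrieved via NCBI's public E-utilities (data_retrieval.py may structure this differently):

import requests
import xmltodict

def fetch_pubmed(queryterm, retmax):
    base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils'
    # esearch returns the PMIDs matching the query term
    r = requests.get(f'{base}/esearch.fcgi',
                     params={'db': 'pubmed', 'term': queryterm,
                             'retmax': retmax, 'retmode': 'json'})
    pmids = r.json()['esearchresult']['idlist']
    # efetch returns the full XML record for each PMID
    records = []
    for pmid in pmids:
        xml = requests.get(f'{base}/efetch.fcgi',
                           params={'db': 'pubmed', 'id': pmid, 'retmode': 'xml'}).text
        records.append(xmltodict.parse(xml))
    return records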

Creating MeSH Items

Automatically creates MeSH items with retrieved metadata:
create_mesh_items.py

# Assumed imports: pandas as pd, pprint, and wdi_core from wikidataintegrator
def upload_data(login_instance, config, meshtermlist):
    # Load the csv table to be ingested into Wikibase
    mydata = pd.read_csv(meshtermlist)
    for index, row in mydata.iterrows():
        ## Prepare the statements to be added
        item_statements = []  # all statements for one item
        item_statements.append(wdi_core.WDString(mydata.loc[index].at['MeSH Unique ID'], prop_nr="P26"))  # MeSH Unique ID
        item_statements.append(wdi_core.WDString(mydata.loc[index].at['MeSH Heading'], prop_nr="P27"))    # MeSH Heading
        item_statements.append(wdi_core.WDUrl(mydata.loc[index].at['MeSHBrowserLink'], prop_nr="P29"))    # MeSH URL

        ## Instantiate the Wikibase page, add statements, labels and descriptions
        wbPage = wdi_core.WDItemEngine(data=item_statements, mediawiki_api_url=config.wikibase_url + "/w/api.php")
        wbPage.set_label(mydata.loc[index].at['MeSH Heading'], lang="en")
        wbPage.set_description("MeSH Entity extracted from NLM", lang="en")

        ## Sanity check (debug)
        pprint.pprint(wbPage.get_wd_json_representation())

        ## Write data to Wikibase
        wbPage.write(login_instance)
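The property numbers used here (P26, P27, P29) appear to correspond to the custom MeSH properties created by create_properties.py above (p27 = MeSH Heading, p29 = MeshBrowserLink, and so on).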

Creating PubMed Items

Automatically creates PubMed article items with retrieved metadata:
create_items_wd.py

# Assumed imports: pandas as pd and wdi_core from wikidataintegrator;
# safeget is the nested-dict helper sketched below.
def upload_data(login_instance, config, metadata):
    df = pd.read_csv('meshtermlist.csv')

    for index in metadata:
        # Get relevant values from the retrieved metadata
        PMID = safeget(index, 'PubmedArticleSet', 'PubmedArticle', 'MedlineCitation', 'PMID', '#text')
        title = safeget(index, 'PubmedArticleSet', 'PubmedArticle', 'MedlineCitation', 'Article', 'ArticleTitle')
        pdate = safeget(index, 'PubmedArticleSet', 'PubmedArticle', 'MedlineCitation', 'Article', 'ArticleDate', 'Day') + '.' + \
                safeget(index, 'PubmedArticleSet', 'PubmedArticle', 'MedlineCitation', 'Article', 'ArticleDate', 'Month') + '.' + \
                safeget(index, 'PubmedArticleSet', 'PubmedArticle', 'MedlineCitation', 'Article', 'ArticleDate', 'Year')
        # 'or []' guards against articles without authors or MeSH headings
        author_list = safeget(index, 'PubmedArticleSet', 'PubmedArticle', 'MedlineCitation', 'Article', 'AuthorList', 'Author') or []
        language = safeget(index, 'PubmedArticleSet', 'PubmedArticle', 'MedlineCitation', 'Article', 'Language')
        mesh_list = safeget(index, 'PubmedArticleSet', 'PubmedArticle', 'MedlineCitation', 'MeshHeadingList', 'DescriptorName') or []

        ## Prepare the statements to be added
        item_statements = []  # all statements for one item
        item_statements.append(wdi_core.WDString(PMID, prop_nr="P5"))    # PMID
        item_statements.append(wdi_core.WDString(title, prop_nr="P11"))  # title
        item_statements.append(wdi_core.WDString(pdate, prop_nr="P14"))  # publication date
        for a in author_list:
            item_statements.append(wdi_core.WDString(safeget(a, 'LastName') + ',' + safeget(a, 'ForeName'), prop_nr="P13"))  # author name string
        item_statements.append(wdi_core.WDString(language, prop_nr="P18"))  # language
        for m in mesh_list:
            matches = df[df['MeSH Unique ID'] == safeget(m, '@UI')].index
            if len(matches):
                # Row-to-QID mapping assumes the MeSH items were created in csv order
                # starting at Q1; WDItem expects an item ID such as 'Q7', not a full URL
                item_statements.append(wdi_core.WDItem('Q' + str(matches[0] + 1), prop_nr="P39"))

        ## Instantiate the Wikibase page, add statements, labels and descriptions
        wbPage = wdi_core.WDItemEngine(data=item_statements, mediawiki_api_url=config.wikibase_url + "/w/api.php")
        wbPage.set_label(title, lang="en")
        wbPage.set_description("Article retrieved from PubMed", lang="en")

        ## Write data to Wikibase
        wbPage.write(login_instance)
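The snippet relies on a safeget helper for walking nested dictionaries without raising KeyError; a minimal sketch of such a getter (the repo defines its own version):

def safeget(dct, *keys):
    # Walk the nested dict, returning None as soon as a key is missing
    for key in keys:
        try:
            dct = dct[key]
        except (KeyError, TypeError):
            return None
    return dct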

Useful Docker Commands

The following Docker commands can be of use when customizing and editing the Wikibase instance.

Stop Wikibase Docker

docker-compose down

Remove uploaded Data & run fresh Wikibase Instance

sudo rm -rf mediawiki-*  query-service-data/ quickstatements-data/
docker-compose up -d

Reload a single service (example: wikibase) to adopt new changes in settings

docker-compose up --no-deps -d wikibase

Creating Backups

Volume backups and snapshots can be made with the following commands (replace the wikibase-registry_ volume name prefix with your own compose project's prefix):

docker run -v wikibase-registry_mediawiki-mysql-data:/volume -v /root/volumeBackups:/backup --rm loomchild/volume-backup backup mediawiki-mysql-data_20190129
docker run -v wikibase-registry_mediawiki-images-data:/volume -v /root/volumeBackups:/backup --rm loomchild/volume-backup backup mediawiki-images-data_20190129
docker run -v wikibase-registry_query-service-data:/volume -v /root/volumeBackups:/backup --rm loomchild/volume-backup backup query-service-data_20190129

Performance

Inspecting the performance, we see that the total insertion time grows linearly with the number of articles, averaging around 1.88 seconds per article insertion.

Number of Articles    Time (s)    Seconds per Article
100                   191         1.91
250                   456         1.82
500                   948         1.90
800                   1520        1.90

Acknowledgements

Many thanks to UB-Mannheim for providing the RaiseWikibase repository, Adam Shorland for providing many resources on building and customizing Wikibase instances, and Konrad Förstner and Eva Seidlmayer for their continued guidance throughout the course.

Further Research

The following issues and possible points of further research interest have been identified:

  • Dynamically Adapt Property Creation with CSV
  • Extract MeSH Entities directly from Wikidata
  • Enter Wikibase into Wikibase Registry
  • Authenticate Author Entities with external databases such as Scholia
  • Calculate IR/NLP metrics on Abstracts
  • Implement MeSH Tree Hierarchy as Knowledge Graph Structure
  • Complement Knowledge Graph Structure with semantic connection between articles, author entities and affiliation entities
  • Measure Accuracy of the MeSH Term Enrichment Process in data_retrieval.py
