Biomedical Retractions

Identifying biomedical articles at risk for retraction

  • Tagline: Identifying biomedical articles at risk for retraction
  • Date: October 2016
  • Category: Applied Research
  • Author(s): Graham Mueller

Project Status:

  • Brainstorming Phase: Currently collecting data for baseline models.

Community Links:

Problem description:

Develop a model capable of reliably flagging biomedical articles (appearing on bioRxiv or in biomedical scientific publications) that may be at risk of retraction. Such articles would then be carefully reviewed by peers in the community.

Why this problem matters:

Although the retraction of a scientific article in the biomedical literature is still a rare event, it is becoming increasingly frequent (see References).

  1. Retractions reflect error, misconduct, and fraud, which can significantly affect the scientific community and undermine the trust that the public puts in science.
  2. Detecting articles at risk of retraction could help focus the attention of efforts like Retraction Watch and other post-publication peer review groups.
  3. In turn, more effective detection of problematic articles would reduce the incentive for fraud and raise the cost of errors, which should improve the overall quality and reliability of the biomedical literature.

Datasets:

PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM).

The Open Access Subset (OA Subset) is the largest collection of articles available for text mining via PMC. Articles in the OA Subset are still protected by copyright in most cases, but are made available for download under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work.

To download a collection in PMC for text mining, you must use the designated services (usually the PMC FTP service).

https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/

Extracting Data - Preprocessing details:

Once you have downloaded the data, extract the directories and place them in the data folder. Please do not commit the ~45 GB of XML files to this repo!

From the root directory, you can create the text file which lists the path of each journal article using the following:

find . -iname '*.nxml' > files-all.txt
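
For environments without `find`, the same listing can be produced from Python. The following is a minimal sketch using only the standard library; the function names `list_nxml_files` and `write_file_list` are hypothetical and are not part of this repo. Like `-iname`, it matches the `.nxml` extension case-insensitively, but it lists files only.

```python
import os


def list_nxml_files(root="."):
    """Return sorted paths of all files under root ending in .nxml (any case)."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".nxml"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def write_file_list(paths, out_path="files-all.txt"):
    """Write one article path per line, mirroring the output of the find command."""
    with open(out_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
```

Calling `write_file_list(list_nxml_files("."))` from the root directory should give a `files-all.txt` comparable to the one created by the `find` command, although the ordering of lines may differ.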

The Spark script takes the path of each article as input, loads the XML, and parses out the article-type, which identifies whether an article was retracted. To run the script, use:

spark-submit scripts/spark_retractions.py
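
The parsing step for a single article can be sketched as follows. This is not the code in `scripts/spark_retractions.py`, only an illustration under stated assumptions: it assumes `article-type` is an attribute on the root element of each `.nxml` file, and that the value `"retraction"` marks retraction-related records, which should be checked against the downloaded data. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml so that it runs without extra dependencies; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def article_type_from_file(path):
    """Return the article-type attribute of the root element, or None if absent."""
    root = ET.parse(path).getroot()
    return root.get("article-type")


def is_retraction(path, retraction_value="retraction"):
    """True if the article-type matches retraction_value (an assumed label)."""
    return article_type_from_file(path) == retraction_value
```

In a Spark job, a function like `is_retraction` could be mapped over the paths listed in `files-all.txt`, so that each worker parses only the files it is handed.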

Requirements

  • Python 2.7+
  • lxml

Apache Spark is used for parsing the xml documents. http://spark.apache.org

Relevant Work:

Contribute:

  • Provide a starting-point readme file and the current project status for new researchers. These projects can take months or longer to complete, and such information helps new contributors get onboarded faster.
  • Provide guidelines on how to edit or add resources to this project, and mention any specific requirements, e.g.:
    • Please create a branch and open a pull request when adding to this example project.
    • Open an issue if something in the readme is unclear or if you find linguistic or grammar mistakes.

References:


  1. A Comprehensive Analysis of Articles Retracted Between 2004 and 2013 from Biomedical Literature – A Call for Reforms
  2. Why Has the Number of Scientific Retractions Increased?
  3. Why and how do journals retract articles? An analysis of Medline retractions 1988-2008

PS: Last few notes:

  • Be Nice & Be Respectful.
  • Value other people's work and reference it. Don't just copy & paste what you find elsewhere when sharing information.
  • Give constructive criticism: if you see something that is wrong or not working, suggest a way to resolve the issue.
  • Please Ask Questions: This is an attempt to open up the opportunity for everyone to contribute, if they can add value to these research topics.
  • Also, keep in mind that most of the researchers opening these projects may have full-time work or research. If you have a specific question, try opening an issue and use the given open communication channels rather than direct contact.
