wcmc-its / reciter

ReCiter: an enterprise open source author disambiguation system for academic institutions

License: Apache License 2.0

Java 98.60% JavaScript 1.34% Dockerfile 0.05% Procfile 0.01%

machine-learning-algorithms entity-resolution clustering reciter spring-boot scopus maven aws dynamodb s3

reciter's Issues

Leverage data on name variants to improve phase two matching

*** Scroll down for a description of this project ***

Delimited name variant list for full-time WCMC faculty will be based on name variants in WCMC systems of record. This list is in the table RC_identity_directory.

Examine calculation for sensitivity/recall=zero

Update data flow diagram and entity relationship diagram

Design plan for handling cases where a person has more than a certain number of publications

e.g. limit by top 5 most frequent words in their affiliation
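
One way to read the suggestion above is to count words across a person's known affiliation strings and keep the five most frequent. The sketch below assumes that reading; the function name is hypothetical and the stopword list is borrowed from the institutionStopwords setting that appears later on this page.

```python
import re
from collections import Counter

# Stopwords borrowed from the institutionStopwords setting (an assumption here).
STOPWORDS = frozenset({"of", "the", "for", "and", "to"})

def top_affiliation_words(affiliations, n=5):
    """Return the n most frequent non-stopword words across affiliation strings."""
    words = []
    for text in affiliations:
        # Lowercase and keep alphabetic runs only.
        words.extend(w for w in re.findall(r"[a-z]+", text.lower())
                     if w not in STOPWORDS)
    return [word for word, _ in Counter(words).most_common(n)]
```

The resulting words could then be used to narrow the set of candidate articles for authors with very common names.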

Run ReCiter locally and identify problems

see notes from meeting on 2-25-15 at #error

Use TF-IDF for year

Implement code that picks 0 to many clusters, depending on threshold

Leverage known co-investigators on grants to improve phase two matching

Overview

In phase 2 matching ReCiter selects one or more piles, and the resulting set of articles constitutes ReCiter's final determination as to the articles that were written by the target author. In some cases ReCiter fails to include a pile that should be selected for a target author, resulting in errors of omission and reduced recall. One piece of evidence that ReCiter can use to make better decisions in phase 2 matching is co-investigator relationships on grants. To be specific, if a given pile of articles includes a name that matches that of a person who has previously served as a co-investigator with the target author on a grant, this should increase the likelihood that the articles in the given pile were in fact written by the target author.

Operationalization

Look up the grant IDs associated with the CWID in the rc_identity_grant table.
Identify CWIDs also associated with those grant IDs.
Use each CWID to look up the first and last name of the co-investigator in the rc_identity table.
Add these names to the list of known co-authors, which will be used in Phase Two matching.
(Need to identify where list of known co-authors resides)
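
The lookup steps above can be sketched with in-memory stand-ins for the two tables. The row shapes are assumptions made for illustration: rc_identity_grant is modeled as (cwid, grant_id) pairs and rc_identity as a mapping from CWID to (first name, last name). The function name is hypothetical.

```python
def known_coinvestigators(cwid, identity_grant_rows, identity):
    """Return (first, last) names of people sharing a grant with cwid.

    identity_grant_rows: iterable of (cwid, grant_id) pairs (assumed shape).
    identity: dict mapping cwid -> (first_name, last_name) (assumed shape).
    """
    rows = list(identity_grant_rows)
    # Step 1: grant IDs associated with the target CWID.
    grant_ids = {grant for person, grant in rows if person == cwid}
    # Step 2: other CWIDs associated with those grant IDs.
    co_cwids = {person for person, grant in rows
                if grant in grant_ids and person != cwid}
    # Step 3: resolve each CWID to a name, skipping unknown CWIDs.
    return sorted(identity[c] for c in co_cwids if c in identity)
```

The returned names would then be appended to the list of known co-authors used in Phase Two matching.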

Example

One example where this should work:

Shahin Rafii (srafii) has worked on a number of grants with co-investigators whose names should map to the co-authors of these papers in PubMed: 24424657 22282653 24074864 20211137 18565886 23574623 24733255 24799717 24298994

Allow ReCiter to be used as a web service by Academic Staff Management System

Output a record for every CWID and PMID combination where a suggestion is positive or a user has provided feedback.

Only output anything that ReCiter suggests.

Outline

Citation
- pubDate
- authorList
  - author1 (targetAuthor="N")
  - author2 (targetAuthor="Y")
  - author3 (targetAuthor="N")
  - ...
- journal
  - verbose (full journal title)
  - medlineTA (title abbreviation)
- volume
- issue
- pages
- PMCID
- DOI
userAssertion (e.g., "accepted, rejected, null")
Score
- See below
Evidence
- See below

Score

An integer that can be used to sort suggestions in the web interface.

+10 for email match
+2 for full exact name match (firstName, middle initial, lastName)
+1 for abbreviated name match (firstInitial, middle initial, lastName)
+1 for every other type of evidence... Count each instance of department, institutional affiliation, common collaborator, known relationship, etc. separately as an additional point.
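
A minimal sketch of the point scheme above. It assumes the full and abbreviated name matches are alternatives (an article earns one or the other, not both), which the list does not state explicitly. The function and parameter names are hypothetical.

```python
def suggestion_score(email_match, full_name_match, abbreviated_name_match,
                     other_evidence_count):
    """Integer score used to sort suggestions in the web interface."""
    score = 0
    if email_match:
        score += 10
    # Assumption: a full exact name match supersedes an abbreviated one.
    if full_name_match:
        score += 2
    elif abbreviated_name_match:
        score += 1
    # One point per additional instance of evidence (department,
    # institutional affiliation, common collaborator, relationship, ...).
    score += other_evidence_count
    return score
```

For example, an email match plus a full name match plus three other pieces of evidence yields 10 + 2 + 3 = 15.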

Evidence

This article was suggested for the following reasons:

Name similarity

The name [John Smith] is similar or identical to an author of this article: {J. S. Smith}.

Matching relationships

[John Smith] is associated with: {Curt Cole (as a co-investigator)}.

-OR, if two relationships-

[John Smith] is associated with: {Charles Smith (as a co-investigator)} and {Rainu Kaushal (as a mentor/mentee)}.

-OR, if three or more relationships-

[John Smith] is associated with: {Curtis Cole (as a co-investigator)}, {Rainu Kaushal (as a mentor/mentee)}, and {J. Schmitz (as a person in the same organizational unit)}.
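
The three sentence templates above differ only in how the list of relationships is joined. A small sketch of that joining logic follows; the function name and the (name, role) tuple shape are hypothetical.

```python
def describe_relationships(person, relationships):
    """Build the 'is associated with' sentence for one, two, or more relationships.

    relationships: list of (name, role) tuples, e.g. ("Curt Cole", "co-investigator").
    """
    parts = ["{%s (as a %s)}" % (name, role) for name, role in relationships]
    if not parts:
        return ""
    if len(parts) == 1:
        joined = parts[0]
    elif len(parts) == 2:
        joined = " and ".join(parts)
    else:
        # Three or more: comma-separated with a final ", and".
        joined = ", ".join(parts[:-1]) + ", and " + parts[-1]
    return "[%s] is associated with: %s." % (person, joined)
```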

Affiliation match

The affiliation of the matching author in the article is {affiliation string}. This is consistent with the following:

[[email protected]] (known email)
[Department of Genetic Medicine] (known departmental affiliation)
[Emory University] (known institutional affiliation)
[Hospital for Special Surgery] (a common-collaborator of [Weill Cornell Medicine])

Grant match

[John Smith] is listed on grant #[grantNumber], which matches a grant number listed in the article record: {funding statement in article}.

Timing of academic degrees

The article was published in {pubYear}, which is [#] years [before/after] [John Smith] received Bachelor's degree and [#] years [before/after] receiving a terminal degree.

Clustering

The article was selected because it has the following in common with certain selected articles:

This article shares MeSH major term of [MeSH major term] with following article(s).
This article cites following article(s).
This article is cited by following article(s).
A third article cites this article and a clustered article.
This article is in the same journal, [Journal], as the clustered article.

Update ReCiter and installation instructions so that it can be run outside of WCMC

Currently, ReCiter only runs if users have a password to an internal WCMC database. We need to update the documentation at README.md with separate installation instructions for people internal to WCMC and for external developers, and make any needed changes to the ReCiter code base so that ReCiter can be generalized to run outside of WCMC.

Store output and scores in the database

ReCiter output is currently written to a series of .csv files, one for each target author. The goal of this project is to write ReCiter output data to the ReCiter database (in addition to the .csv files). The table name is rc_results and the column headers are identical to the column headers in the .csv output files (current as of the 7-15-15 versions of the .csv files).

Please note: As mentioned above, writing output to the database is to be added as a secondary default behavior -- we wish to maintain ReCiter's current default behavior of writing the .csv output files (at least for now) because the files in effect filter the results and can be used in a nimble fashion for error analysis.
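
As a rough illustration of the dual-output behavior described above, the sketch below writes the same rows to a .csv stream and to an rc_results table. It uses Python's built-in sqlite3 purely as a stand-in for the ReCiter database, and the column names in the usage note are hypothetical since the actual .csv headers are not listed here.

```python
import csv
import sqlite3

def write_results(rows, header, csv_file, conn):
    """Write rows to a CSV stream (primary) and to rc_results (secondary)."""
    # Primary behavior: the existing .csv output.
    writer = csv.writer(csv_file)
    writer.writerow(header)
    writer.writerows(rows)
    # Secondary behavior: same columns, same rows, in the database.
    columns = ", ".join('"%s"' % name for name in header)
    conn.execute("CREATE TABLE IF NOT EXISTS rc_results (%s)" % columns)
    placeholders = ", ".join("?" for _ in header)
    conn.executemany("INSERT INTO rc_results VALUES (%s)" % placeholders, rows)
    conn.commit()
```

A call might look like write_results([("abc1234", 12345, 1)], ["cwid", "pmid", "cluster"], open("abc1234.csv", "w", newline=""), sqlite3.connect("reciter.db")), where every name and value is illustrative only.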

Enumerate and describe types of errors

Author name disambiguation is a complex challenge in informatics and computer science. Every disambiguation system makes mistakes, and ReCiter is no exception -- we know that ReCiter sometimes does not properly assign articles to target authors. However, as of April 2015 we do not have a sufficiently detailed accounting of the reasons for the errors. A thorough, exhaustive, and iterative error analysis is needed, and we expect that in many cases the results will inform modifications to the system that will improve performance. Additional detail is available on the Wiki at https://github.com/wcmc-its/ReCiter/wiki/Error-Analysis

Create interface to run ReCiter

e.g. a script that allows Prakash to run a shell script from the server, or whatever format required by web developer

Add decremental lookup algorithm to identify multi-word phrases from the UMLS Metathesaurus

Identify exact target author's affiliations from Scopus XML

Given a list of known publications for a given author, determine the author's institutional affiliation at the time each article was published. Add the prior institutional affiliations and years to the database.

Example:

Suppose we identify this candidate record for a Rainu Kaushal - http://www.ncbi.nlm.nih.gov/pubmed/25571986
Look up the equivalent record in Scopus. Here is the record.
Count the number of authors that match the surname.
- If there's one, use that one. That's the case here. There's only one author whose surname is "Kaushal"
- If there's more than one, you will need to use first name to find the correct author.
Alternative to above: use author "rank" but this is sometimes unreliable.
The author element looks like this:

<author>
<author-url>http://api.elsevier.com/content/author/author_id:7005295324</author-url>
  <authid>7005295324</authid>
  <authname>Kaushal,R.</authname>
  <surname>Kaushal</surname>
  <given-name>Rainu</given-name>
  <initials>R.</initials>
  <afid>60007997</afid>
  <afid>60007997</afid>
  <afid>60007997</afid>
  <afid>60018043</afid>
  <afid>112593445</afid>
  <afid>60007997</afid>
</author>

As far as I know, the AF-IDs are relatively persistent (so you could reuse them from one lookup to another) and correspond to the institutions listed above in the XML.
Here's what the XML for a given affiliation looks like:

<affiliation>
  <affiliation-url>http://api.elsevier.com/content/affiliation/affiliation_id:60007997</affiliation-url>
  <afid>60007997</afid>
  <affilname>Weill Cornell Medical College</affilname>
  <name-variant>Weill Cornell Medical College</name-variant>
  <name-variant>Weill Medical College of Cornell University</name-variant>
  <name-variant>Cornell University</name-variant>
  <name-variant>Cornell University Medical College</name-variant>
  <affiliation-city>New York</affiliation-city>
  <affiliation-country>United States</affiliation-country>
</affiliation>

Of course, Rainu has six (!) affiliations, so you would want to grab all of them.
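
The surname-based lookup described above can be sketched with Python's standard xml.etree.ElementTree. This assumes the author and affiliation elements sit under a single namespace-free root like the excerpts shown; the wrapper element name, the trimmed sample, and the function name are hypothetical, and a real Scopus response may need namespace handling.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample in the shape of the excerpts above.
SAMPLE = """<entry>
  <author>
    <surname>Kaushal</surname>
    <given-name>Rainu</given-name>
    <afid>60007997</afid>
    <afid>60018043</afid>
    <afid>60007997</afid>
  </author>
</entry>"""

def target_author_afids(xml_text, surname, given_name=None):
    """Return the unique afids of the one author matching the surname."""
    root = ET.fromstring(xml_text)
    authors = [a for a in root.iter("author")
               if a.findtext("surname") == surname]
    # More than one surname match: fall back to the given name.
    if len(authors) > 1 and given_name is not None:
        authors = [a for a in authors
                   if a.findtext("given-name") == given_name]
    if len(authors) != 1:
        return []  # ambiguous or missing; the caller decides what to do
    afids = [e.text for e in authors[0].findall("afid")]
    # De-duplicate while preserving document order.
    return list(dict.fromkeys(afids))
```

Each returned afid can then be resolved to an institution name through the matching affiliation element.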

Move selected documentation from ReCiter wiki to GitHub

Review Michael's draft descriptions for JReCiter classes

Currently in ReCiter - Code Documentation at https://nexus.med.cornell.edu/display/vivo/ReCiter+-+Code+Documentation
Will be moved to this repo

Implement stemming of terms used in phase one clustering

Is a Phase One clustering improvement.
Jie added a parameter to config.properties

Convert the publication data into format required for scikit-learn Python machine learning tools

Integrate ReCiter with PubAdmin interface: accepts and rejects

Prakash assigned

Document steps required to run clustering locally

Notify Michael when this is ready so that he can begin error analysis

Year-based clustering and matching

Background:

Year comparison
- Goal: leverage year of publication to create clusters that account for the fact that articles written by the same person tend to be published around the same date.
- From the gold standard, Paul did a quick (imperfect) analysis of 500k random pairs of articles written by the same person. See here. For example, 88% of article pairs with known shared authors were written within one year or greater of each other. 79% within two years or greater. ~50% of article pairs were written within 5 years of each other.
Terminal degree

Phase One clustering:

Suggested approach:
- Compute difference in year between candidate article and closest year in article cluster.
- Lookup score in table (see link above) and use to decide whether to include in cluster
- For example, suppose a candidate article has a 7 year difference with the closest year in a cluster. Lookup "> 6" in the table and multiply the similarity score by 0.486 and then some constant.

Phase Two matching:

Use for clustering (see above)
Leverage terminal degree
Suggested approach for terminal degree:
- Moderately penalize articles that were published slightly before (0 - 7 years) a person's terminal degree, and strongly penalize articles that were published well before (>7 years) a person's terminal degree.
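
A minimal sketch of the two-tier penalty suggested above. The 7-year boundary comes from the text; the penalty magnitudes are placeholders chosen for illustration, not values from ReCiter.

```python
def terminal_degree_penalty(pub_year, degree_year, moderate=-1.0, strong=-3.0):
    """Penalty for an article published before the person's terminal degree.

    moderate and strong are hypothetical magnitudes.
    """
    years_before = degree_year - pub_year
    if years_before < 0:
        return 0.0       # published after the degree: no penalty
    if years_before <= 7:
        return moderate  # slightly before (0 to 7 years)
    return strong        # well before (more than 7 years)
```

With a terminal degree in 2001, an article from 1977 falls 24 years before the degree and receives the strong penalty.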

Possible approaches include:

Ideal: employ a probabilistic approach where the likelihood diminishes in proportion to the number of standard deviations away from the center of the distribution of the years in which the author's known publications were published
Populate a vector of the frequency of articles the author is known to have published in each year
Use TF/IDF measure using the individual years as terms
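
The first two approaches can be sketched together: a per-year frequency vector of known publications, and a distance measured in standard deviations from the mean publication year. How that distance would be converted into a likelihood is left open; the function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def year_frequency(known_years):
    """Frequency of the author's known publications in each year."""
    return Counter(known_years)

def year_z_score(candidate_year, known_years):
    """Distance of candidate_year from the mean known year, in standard deviations."""
    spread = pstdev(known_years)
    if spread == 0:
        # All known publications share one year; treat any other year as far away.
        return 0.0 if candidate_year == known_years[0] else float("inf")
    return abs(candidate_year - mean(known_years)) / spread
```

A larger z-score would translate into a lower likelihood that the candidate article belongs to the author.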

Example:

Ben Gold got a PhD in 2001, but these publications were published in 1977 (they are false positives and should be true negatives).
- 402849
- 892358

Leverage year of terminal degree

Add documentation on how to compile and run

Prepare specs for front-end developer

Report precision and recall at the article level to mirror 2014 ReCiter paper

Create web interface for displaying ReCiter results

Use journal similarity for phase one and two matching

Background: Paul downloaded five years of Medline records and their associated MeSH terms. Then, using Jie's workflow for the grant recommendation tool project, Paul calculated the "field scores" for each journal. Then, Paul used some basic (non-rigorous) arithmetic to calculate similarity between journals such that any two journals that had at least one MeSH term in the past five years (n = 2300) have a similarity score relative to each other.

Understanding the scores:

The range is 0-1. The average score is ~0.65.
A high score suggests high similarity and a greater likelihood that a given author wrote for both journals. A low score suggests the opposite.
The matching is done based on Medline title abbreviation. For example, the PMID 25864809 has a title abbreviation of Acad Pediatr.

How can this be used?

Phase One: decide if a publication should be part of a cluster; for example:
- You have 3 articles, A, B, and C. A has the most complete information followed by B.
- Do a lookup in the table to see how similar A is to B. Their similarity is 0.91. So you put them in the same cluster.
- The similarity of C to A and C to B is an average of 0.6 (see distribution), which falls below the predetermined arbitrary threshold (0.8?), so C is in its own cluster.
Phase Two: use default department scores when people have few publications
- The same sort of matching described above can work for Phase Two matching.
- This is especially true if someone has few or no publications, in which case we can calculate default scores by department (not ready yet).
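
The Phase One example above (A and B cluster together, C stands alone) can be sketched as a greedy pass over the articles. The 0.8 threshold is the tentative value mentioned in the text; the similarity table shape, the journal labels in the usage note, and the function name are hypothetical.

```python
def cluster_by_journal(articles, similarity, threshold=0.8):
    """Greedy clustering on average journal similarity.

    articles: list of (article_id, journal) in priority order.
    similarity: dict mapping frozenset({journal_a, journal_b}) -> score in 0..1.
    """
    clusters = []
    for article_id, journal in articles:
        for cluster in clusters:
            scores = []
            for _, other in cluster:
                if other == journal:
                    scores.append(1.0)  # same journal: maximal similarity
                else:
                    scores.append(similarity.get(frozenset((journal, other)), 0.0))
            if sum(scores) / len(scores) >= threshold:
                cluster.append((article_id, journal))
                break
        else:
            # No existing cluster was similar enough: start a new one.
            clusters.append([(article_id, journal)])
    return clusters
```

With similarities of 0.91 for the journals of A and B and 0.6 for C against each of them, this yields one cluster holding A and B and a second holding only C.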

I will provide a link to the similarity file (300 MB) outside of GitHub.

Use target author's known publications to populate first cluster

Jie is assessing the use of a random forest classifier for this purpose (this issue previously pertained to phase two matching).

Add primary and/or other department name(s) to list of topic keywords

Primary and other department for WCMC faculty are available in the reciter_pubs database, available for download via Downloads and additional documentation.

Related to #79.

Phase two error analysis

Update ReCiter clustering so that it can be run locally and produce readable output

A data file (e.g., CSV or other common format) in which each row is an article. The first row has field names, which pertain to the following:
-All available article metadata, including title, author names, affiliations, etc., in separate fields
-Which cluster the article was assigned to
-An indication of the order in which the cluster was created as ReCiter ran down through the list of articles. (This could be accomplished by assigning an integer to the clusters, which corresponds to the order in which they were created; maybe you already have this in the code.)
-Evidence ReCiter used when assigning the article to the cluster, if applicable
-Indication of whether the assigned cluster was correct based on reference standard, if applicable
-Other data you think could be useful in examining the output in detail, if any

Include Phase Two matching score in output

When matching a cluster to a person in Phase Two, include the matching score. This will allow us to see how certain the match is.

Add author's known prior institutional affiliations to the database

Implement code to accomplish the following: Given a list of known publications for a given author, determine the author's institutional affiliation at the time each article was published. Add the prior institutional affiliations and years to the database.

Create authorAffiliationScoringStrategy

Overview

With this scoring strategy, we're trying to account for the extent to which affiliation of all authors affects the likelihood a given targetAuthor authored an article.

To do this, we need to ask and answer several questions.

Which sources are we using to make the match?

Scopus - does institutional disambiguation; provides affiliations as numeric codes (e.g., 60007997)
PubMed - affiliations are just strings

Which affiliation(s) are we considering?

targetAuthor
non-targetAuthor

What type of match is this?

explicitly defined for the individual, e.g., Dr. X got an undergraduate degree from Georgetown University, did her residency at Montefiore, etc.
explicitly defined for the institution, e.g., Weill Cornell faculty frequently co-author papers with individuals from Hospital for Special Surgery
match was not attempted because there was no available affiliation data
match was attempted but failed

About Scopus data

There are currently 276,666 institutional affiliations in the Identity table, which represent 3,861 unique institutions. This comes from several sources, which use a controlled vocabulary.

We've looked up the Scopus Institution ID for the 1,786 institutions that are most often cited as being a current or historical affiliation. This collectively represents 273,006 affiliations. In other words, ~99% of the time we can predict what the Scopus Institution ID could be. Note that a given institution such as Weill Cornell might have multiple institution IDs.

Values in application.properties

targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 3
targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1.5
targetAuthor-institutionalAffiliation-matchType-null-score: 0
targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2

nonTargetAuthor-institutionalAffiliation-weight: 0.5
nonTargetAuthor-institutionalAffiliation-maxScore: 3

homeInstitution-scopusInstitutionIDs: 60007997, 60019868, 60000247, 60072750, 60109878

homeInstitution-keywords: weill|cornell, weill|medicine, cornell|medicine, cornell|medical, weill|medical, weill|bugando, weill|graduate, cornell|presbyterian, weill|presbyterian, 10065|cornell, 10065|presbyterian, 10021|cornell, 10021|presbyterian, weill|qatar, cornell|qatar, @med.cornell.edu, @qatar-med.cornell.edu

institutionStopwords: of, the, for, and, to

collaboratingInstitutions-scopusInstitutionIDs: 60010570, , 60025849, 60012732, 60018043, 60008981, 60022875, 60019970, 60025879, 60009343, 60009656, 60072743, 60072746, 60104769, 60012981, 60000764, 60004670, 60014933, 60022377, 60005705, 60003158, 60027954, 60003711, 60103484, 60029961, 60031841, 60005208, 60002388, 60024099, 60030304, 60029652, 60026273, 60024541, 60023247, 60007555, 60017027, 60002896, 60011605, 60027565

collaboratingInstitutions-keywords: new|york|presbyterian, HSS, hospital|special|surgery, North|Shore|hospital, Long|Island|Jewish, memorial|sloan, sloan|kettering, hamad, mount|sinai, methodist|houston, National|Institute|Mental|Health, beth israel, University|Pennsylvania|Medicine, Merck|Research, New|York|Medical|College, Medicine|Dentistry|New|Jersey, Montefiore, Lenox|Hill, Cold|Spring|Harbor, St|Luke|Roosevelt, New|York|University|Medicine, Langone, SUNY|Downstate, Albert|Einstein|Medicine, Yeshiva, UMDNJ, Icahn|Medicine, Mount|Sinai, columbia|medical, columbia|physicians

Desired output

Variables

targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
targetAuthor-institutionalAffiliation-matchType: null
targetAuthor-institutionalAffiliation-matchType: noMatch

targetAuthor-institutionalAffiliation-source: Scopus
targetAuthor-institutionalAffiliation-source: PubMed

nonTargetAuthor-institutionalAffiliation-source: Scopus
nonTargetAuthor-institutionalAffiliation-source: PubMed

TargetAuthor

Case 1: Target author has affiliation statements in Scopus and PubMed

targetAuthorAffiliation
	Scopus
		1 
			targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
			targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University"
			targetAuthor-institutionalAffiliation-source: Scopus
			targetAuthor-institutionalAffiliation-article-scopusLabel: "Weill Cornell Medicine" 
			targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "60007997"  
			targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 3
		2
			targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
			targetAuthor-institutionalAffiliation-source: Scopus
			targetAuthor-institutionalAffiliation-identity: "Hospital for Special Surgery"
			targetAuthor-institutionalAffiliation-article-scopusLabel: "Hospital for Special Surgery"  
			targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "61492421"  
			targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1.5
		3
			targetAuthor-institutionalAffiliation-matchType: noMatch
			targetAuthor-institutionalAffiliation-source: Scopus
			targetAuthor-institutionalAffiliation-article-scopusLabel: "University of Adelaide"  
			targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "6999421"  
			targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2
		etc...
	PubMed
			targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Medicine, New York, NY 10065"

Notes:

One target author can have N affiliations in Scopus (as opposed to PubMed). Each of these matches will count towards additional points.
We output the PubMed affiliation statement, but that's just for reference. We're not using it for scoring purposes.

Case 2: Target author has affiliation statements in Scopus only

targetAuthorAffiliation
	Scopus
		1 
			targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
			targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University"
			targetAuthor-institutionalAffiliation-source: Scopus
			targetAuthor-institutionalAffiliation-article-scopusLabel: "Weill Cornell Medicine" 
			targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "60007997"  
			targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 3
		2
			targetAuthor-institutionalAffiliation-matchType: noMatch
			targetAuthor-institutionalAffiliation-source: Scopus
			targetAuthor-institutionalAffiliation-article-scopusLabel: "University of Adelaide"  
			targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "6999421"  
			targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2

Case 3: Target author has affiliation statements only in PubMed

targetAuthorAffiliation
    PubMed
		targetAuthor-institutionalAffiliation-source: PubMed
		targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Graduate School of Medical Sciences, New York, New York, USA."
		targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
		targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
		targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2

Non-target author

Case 4: Non-target author(s) have one or more affiliation statements in Scopus

nonTargetAuthorAffiliation
	Scopus
		nonTargetAuthor-institutionalAffiliation-source: Scopus
		nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Cornell Medicine, 60007997, 3
		nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Graduate School of Medical Sciences, 60000247, 2
		nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: Methodist Hospital System, 60008981, 2
		nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Medical Research Institute, 60022377, 1
		nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Rehabilitation Hospital, 60005705, 1
		nonTargetAuthor-institutionalAffiliation-matchType-match-score: 2.4  /* example */

Notes:

Don't worry about displaying PubMed affiliations in this case.

Case 5: Non-target author(s) have an affiliation statement in PubMed but not Scopus

We don't consider this case.

Pseudocode

Evaluate targetAuthor

Decide which source to use for scoring.

We generally prefer to use Scopus if it's available. If it's not, we still need to provide the option to use PubMed alone.

1. As set in application.properties, is use.scopus.articles=true?

If yes, go to 2
If no, go to 3

2. Does article have a Scopus affiliation for targetAuthor?

If no, go to 3
If yes, go to "Evaluate Scopus affiliation"

3. Does candidate article have a PubMed affiliation for targetAuthor?

If no, go to 4
If yes, go to "Evaluate PubMed Affiliation"

4. Return the following:

targetAuthor-institutionalAffiliation-matchType: null
targetAuthor-institutionalAffiliation-matchType-null-score: 0
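
The four-step decision above reduces to a short function. This is a sketch only: the argument names are hypothetical, with use_scopus_articles standing for the use.scopus.articles setting.

```python
def choose_affiliation_source(use_scopus_articles, scopus_affiliations,
                              pubmed_affiliation):
    """Return 'Scopus', 'PubMed', or None when no affiliation data is available."""
    # Steps 1 and 2: Scopus is preferred when enabled and present.
    if use_scopus_articles and scopus_affiliations:
        return "Scopus"
    # Step 3: fall back to the PubMed affiliation string.
    if pubmed_affiliation:
        return "PubMed"
    # Step 4: nothing to evaluate; the caller reports matchType null, score 0.
    return None
```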

Evaluate Scopus affiliation

1. Get list of institutions (these are strings) from identity.Institution for target person. Also, get Scopus institution IDs from `homeInstitution-scopusInstitutionIDs` from application.properties.

2. Get any scopusInstitutionIDs (e.g., 60007997) from article.affiliation for targetAuthor.

3. Use values from identity.Institution to lookup Scopus institutional identifiers in InstitutionAfid table. For example `Weill Graduate School of Medical Sciences of Cornell University` returns:

  "afids": [
    "60007997",
    "60019868",
    "60000247",
    "60072750",
    "60026978",
    "60025849",
    "105533257"
    ]

4. Attempt match between article and identity.

If there's a positive match between article and identity, output the following:

targetAuthor-institutionalAffiliation-source: Scopus

For EACH positive match between article and identity, output the following:

targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
targetAuthor-institutionalAffiliation-article-scopusLabel: "Weill Cornell Graduate School of Medical Sciences"  /* example */
targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "60007997"  /* example */
targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2
 /* value stored in application.properties */

If match, go to 7.
If no match, go to 5.

5. Attempt match using collaborating institutions, which are defined at the institutional level. Grab values from collaboratingInstitutions-scopusInstitutionIDs (stored in application.properties). Look for overlap between the two.

If there's any one positive match between article and identity, output the following for all matches:

targetAuthor-institutionalAffiliation-source: Scopus
targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1
 /* value stored in application.properties */
targetAuthor-institutionalAffiliation-article-scopusLabel: "Hospital for Special Surgery"  /* example */
targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "61492421"  /* example */

While there can be multiple matches, the maximum score returned for this type of match should be 1.

If no match, go to 6.

6. There's no match. Output:

targetAuthor-institutionalAffiliation-source: Scopus
targetAuthor-institutionalAffiliation-article-scopusLabel: "Hospital for Sick Children"  /* example */
targetAuthor-institutionalAffiliation-matchType: noMatch
targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2  /* value stored in application.properties */

Test case: meb7002 and 22667600

Go to 7.

7. If PubMed affiliation exists, output that (but don't score it):

targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Medicine, New York, NY 10065"
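
Steps 4 through 6 above can be sketched as one cascade over the article's Scopus affiliation IDs. The default scores are taken from the application.properties listing earlier on this page (3, 1.5, and -2), which differ from the example values shown in the steps; the function name and the returned dictionary shape are hypothetical.

```python
def evaluate_scopus_affiliation(article_affiliations, identity_afids,
                                collaborating_afids, individual_score=3,
                                institution_score=1.5, no_match_score=-2):
    """article_affiliations: list of (afid, label) pairs for the target author.

    identity_afids: home-institution IDs plus IDs looked up for the identity.
    collaborating_afids: IDs from collaboratingInstitutions-scopusInstitutionIDs.
    """
    # Step 4: each individual-level match earns its own points.
    individual = [(a, label) for a, label in article_affiliations
                  if a in identity_afids]
    if individual:
        return {"matchType": "positiveMatch-individual", "matches": individual,
                "score": individual_score * len(individual)}
    # Step 5: collaborating institutions; all matches listed, score counted once.
    institution = [(a, label) for a, label in article_affiliations
                   if a in collaborating_afids]
    if institution:
        return {"matchType": "positiveMatch-institution", "matches": institution,
                "score": institution_score}
    # Step 6: affiliation data exists but nothing matched.
    return {"matchType": "noMatch", "matches": list(article_affiliations),
            "score": no_match_score}
```

The PubMed label from step 7 would be attached to the output separately, since it is reported but not scored.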

Evaluate PubMed affiliation

1. Get list of institutions (these are strings) from identity.institutions for person under consideration.

2. Get article.affiliation for targetAuthor.

3. Preprocess.

Get list of stopwords from the institutionStopwords field in application.properties.

Remove stopwords, commas, and dashes from article.affiliation and identity.institutions.

Ignore any words inside parentheses. These are typically countries and are not included in affiliation statements.

4. Attempt match from article.affiliation and identity.institutions. The logic here is that keywords from identity.institutions are some substring of article.affiliation.

Here's how we do this match. Grab each affiliation and see if all the keywords are represented in a single affiliation. For example, suppose an author has a known affiliation in identity.institutions of "Weill Cornell Medical College". And, suppose the article affiliation is "Department of Pharmacology, Medical College of Weill Cornell." This would be a match because all the words in the identity affiliation are represented in the article affiliation.

If there's a match, output the following:

targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Graduate School of Medical Sciences, New York, New York, USA." /* example */
targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2
 /* value stored in application.properties */

Maximum of one match.

If there's no match, go to 5.
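
The preprocessing and matching rule in steps 3 and 4 can be sketched as a set-containment test on cleaned-up words. Removing periods is an added assumption (the text names only commas and dashes) so that a trailing "Cornell." still matches; the function names are hypothetical.

```python
import re

STOPWORDS = {"of", "the", "for", "and", "to"}  # institutionStopwords

def affiliation_words(text):
    """Lowercased words with parentheticals, punctuation, and stopwords removed."""
    text = re.sub(r"\([^)]*\)", " ", text)          # ignore words in parentheses
    text = re.sub(r"[,\-.]", " ", text.lower())     # periods: an assumption
    return {word for word in text.split() if word not in STOPWORDS}

def affiliation_matches(identity_institution, article_affiliation):
    """True when every identity keyword appears in the article affiliation."""
    needed = affiliation_words(identity_institution)
    return bool(needed) and needed <= affiliation_words(article_affiliation)
```

Applied to the example above, "Weill Cornell Medical College" matches "Department of Pharmacology, Medical College of Weill Cornell." because all four identity words are present.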

5. Attempt match against homeInstitution-keywords.

Get homeInstitution-keywords from application.properties.

Look for cases where a homeInstitution keyword group is present in the affiliation string. Here's how we do this. Take any group of terms from homeInstitution-keywords, e.g., "weill|cornell". In order for this to be a match, all terms in the group must be present, in any order and in any case.

These are matches: "Cornell Weill Medical College", "The Weill Medical School of Cornell University"
These are not matches: "Cornell University", "Cornell Med"

If there's a match, output the following:

targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Graduate School of Medical Sciences, New York, New York, USA." /* example */
targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2
homeInstitution-Label: Weill Cornell Medicine / NewYork-Presbyterian Hospital
 /* value stored in application.properties */

Maximum of one match.

If there's no match, go to 6.
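
The keyword-group test in step 5 can be sketched as a case-insensitive substring check on every term in a group. Substring matching (rather than whole-word matching) is an assumption, chosen so that entries such as "@med.cornell.edu" also work; the function names are hypothetical.

```python
def matches_keyword_group(affiliation, group):
    """True when every '|'-separated term of the group occurs in the affiliation."""
    text = affiliation.lower()
    return all(term.strip().lower() in text for term in group.split("|"))

def matches_any_group(affiliation, keywords):
    """keywords: the comma-separated value of homeInstitution-keywords."""
    groups = [g.strip() for g in keywords.split(",")]
    # Skip empty entries so a stray ", ," cannot match everything.
    return any(matches_keyword_group(affiliation, g) for g in groups if g)
```

With the groups "weill|cornell, weill|medicine, cornell|medicine, cornell|medical", the strings "Cornell Weill Medical College" and "The Weill Medical School of Cornell University" match, while "Cornell University" and "Cornell Med" do not.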

6. Attempt match using collaborating institutions, which are defined at the institutional level. Grab values from collaboratingInstitutions-keywords (stored in application.properties). Look for overlap between the two.

If there's any one positive match between article and identity, output the following for all matches:

targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1
 /* value stored in application.properties */
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Hospital for Special Surgery, New York, NY 10021"  /* example */

While there can be multiple matches, the maximum score returned for this type of match should be 1.

If there's no match, go to 7.

7. There's no match. Output:

targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-article-pubMedLabel: "Hospital for Sick Children, Quebec City, Quebec, Canada YRV MX1"  /* example */
targetAuthor-institutionalAffiliation-matchType: noMatch
targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2  /* value stored in application.properties */

Evaluate nonTargetAuthor

Decide which source to use

We generally prefer to use Scopus if it's available. If it's not, we still need to provide the option to use PubMed alone.

1. As set in application.properties, is use.scopus.articles=true?

If yes, go to 2
If no, go to 3

2. Does article have any Scopus affiliation for nonTargetAuthor?

If no, go to 3
If yes, go to "Evaluate Scopus affiliation"

3. Does candidate article have any PubMed affiliation for nonTargetAuthor?

If no, go to 4
If yes, go to "Evaluate PubMed affiliation"

4. Return the following:

nonTargetAuthor-institutionalAffiliation-matchType: null
nonTargetAuthor-institutionalAffiliation-matchType-null-score: 0
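
The four decision steps above amount to a short fall-through. The sketch below is one way to express it; the function names and the returned labels ("scopus", "pubmed", "none") are hypothetical, while the two output keys are taken from step 4.

```python
from typing import Dict, Optional, Sequence


def choose_non_target_source(use_scopus: bool,
                             scopus_affiliations: Sequence[str],
                             pubmed_affiliations: Sequence[str]) -> str:
    """Pick the affiliation source for non-target authors.

    use_scopus mirrors use.scopus.articles in application.properties.
    """
    # Steps 1 and 2: Scopus is preferred when enabled and present.
    if use_scopus and scopus_affiliations:
        return "scopus"
    # Step 3: otherwise fall back to PubMed.
    if pubmed_affiliations:
        return "pubmed"
    # Step 4: nothing to evaluate.
    return "none"


def null_result() -> Dict[str, Optional[int]]:
    """Output for step 4, when no affiliation is available."""
    return {
        "nonTargetAuthor-institutionalAffiliation-matchType": None,
        "nonTargetAuthor-institutionalAffiliation-matchType-null-score": 0,
    }
```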

Evaluate Scopus affiliation

1. Preprocessing

A. Create scopusIDsNonTargetAuthor-Article.

This contains all scopusInstitutionIDs (e.g., 60007997) from article.affiliation for all nonTargetAuthors.

B. Create scopusIDsNonTargetAuthor-Identity-KnownInstitutions.

This contains all Scopus Institution IDs from homeInstitution-scopusInstitutionIDs as stored in application.properties.
It also contains all Scopus Institution IDs for targetAuthor from identity.institutions; do this by matching against identity.institutionafids as described above.

C. Create scopusIDsNonTargetAuthor-Identity-CollaboratingInstitutions

This contains all Scopus Institution IDs from collaboratingInstitution-scopusInstitutionIDs as stored in application.properties.

2. Determine overlap.

Compute the following:

countScopusIDNonTargetAuthor-Affiliations - non-unique count of all Scopus affiliation IDs for all authors
countScopusIDsNonTargetAuthor-Article-KnownInstitution - count of cases where affiliation ID from scopusIDsNonTargetAuthor-Article is in scopusIDsNonTargetAuthor-Identity-KnownInstitutions
countScopusIDsNonTargetAuthor-Article-CollaboratingInstitution - count of cases where affiliation ID from scopusIDsNonTargetAuthor-Article is in scopusIDsNonTargetAuthor-Identity-CollaboratingInstitutions
countScopusIDsNonTargetAuthor-Article-NoMatch - count of cases in which none of the above are true

3. Compute overall score.

Get nonTargetAuthor-institutionalAffiliation-collaboratingInstitution-weight and nonTargetAuthor-institutionalAffiliation-maxScore from application.properties.

nonTargetAuthor-institutionalAffiliation-maxScore * (countScopusIDsNonTargetAuthor-Article-KnownInstitution + (countScopusIDsNonTargetAuthor-Article-CollaboratingInstitution * nonTargetAuthor-institutionalAffiliation-collaboratingInstitution-weight )) / countScopusIDNonTargetAuthor-Affiliations
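
As a worked illustration of the formula above, here is a minimal sketch. Two points are assumptions rather than part of the spec: an ID found in both the known and collaborating sets is counted once, as known, and an article with no affiliation IDs scores 0. The maxScore of 3.0, the weight of 0.5, and the ID 60099999 are made-up example values.

```python
from typing import Iterable, Set


def non_target_affiliation_score(article_ids: Iterable[str],
                                 known_ids: Set[str],
                                 collaborating_ids: Set[str],
                                 max_score: float,
                                 collaborating_weight: float) -> float:
    """maxScore * (known + collaborating * weight) / total affiliation IDs."""
    ids = list(article_ids)          # non-unique: repeats are counted
    if not ids:
        return 0.0                   # assumption: no IDs means a score of 0
    known = sum(1 for i in ids if i in known_ids)
    # Assumption: known institutions take precedence over collaborating ones.
    collaborating = sum(1 for i in ids
                        if i not in known_ids and i in collaborating_ids)
    return max_score * (known + collaborating * collaborating_weight) / len(ids)


if __name__ == "__main__":
    score = non_target_affiliation_score(
        ["60007997", "60007997", "60008981", "60099999"],
        known_ids={"60007997"},
        collaborating_ids={"60008981"},
        max_score=3.0,
        collaborating_weight=0.5,
    )
    print(score)  # 3.0 * (2 + 1 * 0.5) / 4 = 1.875
```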

4. Output values

nonTargetAuthor-institutionalAffiliation-source: Scopus
nonTargetAuthor-institutionalAffiliation-matchType-match-score: 2.4  /* example */

/* Here we're outputting Scopus institution labels, identifiers, and counts for all matching institutions. */
nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Cornell Medicine, 60007997, 3
nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Graduate School of Medical Sciences, 60000247, 2
nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: Methodist Hospital System, 60008981, 2
nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Medical Research Institute, 60022377, 1
nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Rehabilitation Hospital, 60005705, 1

Evaluate PubMed affiliation

At this time, we're not evaluating PubMed affiliation for nonTargetAuthors.

Run ReCiter against data from Columbia from ReCiter (Version 1) manuscript

Michael to do with Steve.
Update 3-31-15 -- Steve is preparing data set.

Normalize Unicode characters to Roman equivalents

This is a Phase One clustering improvement.
Jie notes that he tried the approach at http://stackoverflow.com/questions/1008802/converting-symbols-accent-letters-to-english-alphabet, but it doesn't seem to remove the diacritic in "Grün".
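
For reference, one common technique is canonical decomposition followed by dropping combining marks. The sketch below shows it in Python with the standard `unicodedata` module; the function name is hypothetical, and this is not necessarily the approach ReCiter ended up using.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_diacritics("Grün"))   # Grun
    print(strip_diacritics("José"))   # Jose
```

Letters with no canonical decomposition, such as "ø" or "ß", pass through unchanged, so they would need an explicit mapping table if they must also be converted.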

Integrate ReCiter with PubAdmin interface: input seed publication

Prakash assigned

Move selected to-do items from ReCiter wiki to GitHub; assign tasks to milestones

Use last name, first initial queries to download XML via eFetch for full-time faculty

Jie, if you continue to have problems with the error message I can help reach out to the support team for eFetch. -MB

Assess runtime performance for common names like Y. Wang

Prepare XML retrieval for all CWIDs and assertions (gold standard)

Add MySQL db for Phase Two matching to repo

Test ReCiter performance for all faculty

Test ReCiter performance for all faculty, for the main clustering algorithm, assuming correct cluster selection.

Determine how ReCiter will be invoked, in practice, for each of the two main use cases

Michael and Drew to work on this.

First use case is running ReCiter for the first time, e.g. when a new faculty member joins. Second use case is determining whether a person did in fact write a publication that has been newly identified via an institution-based search.

ReCiter is not storing the exact number of XML results returned by PubMed.

When retrieving XML for a given CWID, the number of publications returned by the query sometimes does not equal the number of publications stored on disk.

The bug can be reproduced by querying

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retstart=0&retmax=10000&usehistory=y&term=Toth%20Miklos[au]

for which PubMed currently returns 208 publications, but only 206 are stored on disk.

This is likely due to the values of retstart and retmax.
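
One way to check the paging hypothesis is to walk the result set in explicit windows and then diff the PMIDs the query returned against the PMIDs stored on disk. The sketch below covers only that bookkeeping; it does not call eUtils, and both helper names are hypothetical. `retstart` and `retmax` are the same parameters that appear in the URL above.

```python
from typing import Iterable, Iterator, List, Tuple


def esearch_windows(total_count: int, page_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (retstart, retmax) pairs covering every record exactly once."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for retstart in range(0, total_count, page_size):
        yield retstart, min(page_size, total_count - retstart)


def missing_pmids(returned: Iterable[str], stored: Iterable[str]) -> List[str]:
    """PMIDs the query returned that are not stored on disk."""
    return sorted(set(returned) - set(stored))


if __name__ == "__main__":
    print(list(esearch_windows(208, 100)))
    # [(0, 100), (100, 100), (200, 8)]
```

Knowing which two PMIDs are missing should help show whether the loss happens at a window boundary or elsewhere, for example when the XML is written to disk.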

Decide on data representation for author profiles used in the Phase Two matching

Michael and Paul to work on this
a. department
b. relationships
i. PI. See PrincipalInvestigators-PhDandMDPhDstudents.xlsx (available by request from Michael Bales, [email protected])
ii. Examples of co-authors: http://www.ncbi.nlm.nih.gov/pubmed/?term=Nimer+Hatlen
c. year of terminal degree
d. clinical expertise
i. list of ID's - https://pops.weillcornell.org/providerprofiles/ids.json
ii. sample provider profile - http://pops.weillcornell.org/providerprofiles/38.json
iii. equivalent on web site - http://weillcornell.org/patmarino
e. board certifications. See BoardCertifications.xls (available by request from Michael Bales, [email protected])
f. other fields?
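
To make the discussion concrete, the fields listed above could be carried in a single record along these lines. This is only a sketch for discussion: every class and field name here is hypothetical, and the final representation is still to be decided.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AuthorProfile:
    """Hypothetical container for the Phase Two matching evidence."""
    cwid: str
    department: Optional[str] = None                                # a.
    relationship_names: List[str] = field(default_factory=list)     # b. PIs, co-authors
    terminal_degree_year: Optional[int] = None                      # c.
    clinical_expertise: List[str] = field(default_factory=list)     # d.
    board_certifications: List[str] = field(default_factory=list)   # e.


if __name__ == "__main__":
    profile = AuthorProfile(cwid="jww2001",
                            clinical_expertise=["Nuclear Cardiology"])
    print(profile)
```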

Leverage data on board certifications to improve phase two matching

Background

In phase 2 matching, ReCiter selects one or more piles, and the resulting set of articles constitutes ReCiter's final determination as to which articles were written by the target author. In some cases ReCiter fails to include a pile that should be selected for a target author, resulting in errors of omission and reduced recall. One piece of evidence that ReCiter can use to make better decisions in phase 2 matching is board certification data. To be specific, based on cosine similarity, if a given pile of articles includes words that match words in one or more of the target author's board certifications, this should increase the likelihood that the articles in the given pile were in fact written by the target author.
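
For readers unfamiliar with the measure, here is a minimal sketch of cosine similarity over bags of words. It assumes raw term counts as the vector weights; the weighting ReCiter actually uses may differ, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def cosine_similarity(a_tokens: Iterable[str], b_tokens: Iterable[str]) -> float:
    """Cosine of the angle between two term-count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    # Two empty or disjoint vectors have similarity 0.
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    pile_words = ["oncology", "breast"]
    certification_words = ["oncology"]
    print(cosine_similarity(pile_words, certification_words))
```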

Operationalization

Board certifications for WCMC available here.

Using CWID retrieve any available board certifications or areas of expertise.
Pre-process data:
- Break up terms into multiple words: "Blood Banking" >> "Blood", "Banking"
- Break up terms containing a slash into two distinct terms: "Obstetrics/Gynecology" >> "Obstetrics", "Gynecology"
- Remove any of the following terms: "and", "the", "medicine", "-", "with", "in", "med", "adult", "general"
Append processed board certification data to the list of elements used by ReCiter in the cluster selection step
Build out cluster selection step to include processed board certification data
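
The pre-processing steps above can be sketched as follows. The function and constant names are hypothetical, and comparing against the removal list case-insensitively is an assumption.

```python
import re
from typing import Iterable, List

# Terms to drop, copied from the list above.
STOP_TERMS = {"and", "the", "medicine", "-", "with", "in", "med", "adult", "general"}


def preprocess_certifications(certifications: Iterable[str]) -> List[str]:
    """Split each certification on whitespace and slashes, then drop stop terms."""
    tokens: List[str] = []
    for certification in certifications:
        for word in re.split(r"[\s/]+", certification.strip()):
            # Assumption: stop terms are matched case-insensitively.
            if word and word.casefold() not in STOP_TERMS:
                tokens.append(word)
    return tokens


if __name__ == "__main__":
    print(preprocess_certifications(["Blood Banking", "Obstetrics/Gynecology"]))
    # ['Blood', 'Banking', 'Obstetrics', 'Gynecology']
```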

Examples

Jonathan W. Weinsaft (jww2001) has a variety of specialties associated with him:
- Nuclear Cardiology
- Cardiology
- Clinical Expertise
- Cardiovascular Nuclear Imaging
- Radioisotope Imaging For Heart Diseases
- Cardiovascular Stress Testing
- Stress Testing
When you do phase two matching, that should pick up on these PMID's, all of which are related to these topics:
- 25799706
- 22835669
- 22815751
- 21812692
- 21757159
- 20579652
- 18598895
- 19808512
- 17976589
- 17478239

Linda Vahdat (ltv2001) has a board certification in Medical Oncology. These papers, which are currently false negatives, consistently use the oncology keyword.

- 24202699
- 24682463
- 23403636
- 21376385
- 20679609
- 20299316
- 19349550
- 24699910
- 21937232
- 18650153
- 17606975
- 17606972
- 18762793
- 18378531
- 16921067
- 16821602
- 15945506
- 14585260
- 12548594
- 7949141
- 7949097

Thomas A. Caputo (tac2001) is board certified in Obstetrics and Gynecology. Those terms should map to these PMID's:
- 760018
- 876565
- 10920302
- 3976767
- 22101154
- 11104615
- 11006044
- 9740708
- 8626101
- 9234922
- 2909447

Explore how scores improve as asserted publications are used to select clusters rather than seed them

As ReCiter is a greedy algorithm that does agglomerative clustering, we may assume that performance will improve if we provide one or more "seed" articles that are known to have been written by an author. We wish to quantify the extent to which precision and recall improve in specific cases. This project involves running ReCiter several times for a fixed group of authors, in each case varying the number of "seed" articles provided. The results of this study could be written up and submitted to a journal.
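
For each run, precision and recall can be computed by comparing the PMIDs ReCiter selects with the gold-standard PMIDs for that author. A minimal sketch follows; the function name is hypothetical, and returning 0.0 when a set is empty is an assumption.

```python
from typing import Iterable, Tuple


def precision_recall(selected: Iterable[int], gold: Iterable[int]) -> Tuple[float, float]:
    """Precision and recall of the selected PMIDs against the gold standard."""
    selected_set, gold_set = set(selected), set(gold)
    true_positives = len(selected_set & gold_set)
    precision = true_positives / len(selected_set) if selected_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Four articles selected, two of them correct, out of three in the gold standard.
    print(precision_recall({1, 2, 3, 4}, {2, 3, 5}))
```

Repeating this for each seed count gives one (precision, recall) pair per author per run, which can then be averaged or plotted against the number of seeds.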
