
wcmc-its / reciter


ReCiter: an enterprise open source author disambiguation system for academic institutions

License: Apache License 2.0

Java 98.60% JavaScript 1.34% Dockerfile 0.05% Procfile 0.01%
machine-learning-algorithms entity-resolution clustering reciter spring-boot scopus maven aws dynamodb s3

reciter's Issues

Leverage known co-investigators on grants to improve phase two matching

Overview

In phase 2 matching, ReCiter selects one or more piles, and the resulting set of articles constitutes ReCiter's final determination as to which articles were written by the target author. In some cases ReCiter fails to include a pile that should be selected for a target author, resulting in errors of omission and reduced recall. One piece of evidence that ReCiter can use to make better decisions in phase 2 matching is co-investigatorship on grants. Specifically, if a given pile of articles includes a name that matches that of a person who has previously served as a co-investigator with the target author on a grant, this should increase the likelihood that the articles in the given pile were in fact written by the target author.

Operationalization

  • Look up grant IDs associated with the CWID in the rc_identity_grant table.
  • Identify CWIDs also associated with those grant IDs.
  • Use each CWID to look up the first and last name of the co-investigator in the rc_identity table.
  • Add to list of known co-authors which will be used in Phase Two matching
  • (Need to identify where list of known co-authors resides)
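
The lookup steps above could be sketched as a single self-join. This is only an illustration: the table names rc_identity_grant and rc_identity come from the issue, but the column names (cwid, grant_id, first_name, last_name), the demo rows, and the function names are assumptions, and SQLite stands in for whatever database ReCiter actually uses.

```python
import sqlite3


def known_co_investigators(conn, cwid):
    """Return (first_name, last_name) for everyone sharing a grant with cwid.

    Column names are hypothetical; only the table names appear in the issue.
    """
    query = """
        SELECT DISTINCT i.first_name, i.last_name
        FROM rc_identity_grant AS mine
        JOIN rc_identity_grant AS theirs ON theirs.grant_id = mine.grant_id
        JOIN rc_identity AS i ON i.cwid = theirs.cwid
        WHERE mine.cwid = ? AND theirs.cwid <> ?
        ORDER BY i.last_name, i.first_name
    """
    return conn.execute(query, (cwid, cwid)).fetchall()


def demo_connection():
    """Build an in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE rc_identity (cwid TEXT PRIMARY KEY, first_name TEXT, last_name TEXT);
        CREATE TABLE rc_identity_grant (cwid TEXT, grant_id TEXT);
        INSERT INTO rc_identity VALUES
            ('aaa1001', 'Ada', 'Example'),
            ('bbb2002', 'Ben', 'Sample'),
            ('ccc3003', 'Cy', 'Other');
        INSERT INTO rc_identity_grant VALUES
            ('aaa1001', 'G1'), ('bbb2002', 'G1'),
            ('ccc3003', 'G2'), ('aaa1001', 'G3');
    """)
    return conn
```

The returned names would then be appended to the list of known co-authors used in Phase Two matching, wherever that list turns out to reside.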

Example

One example where this should work:

  • Shahin Rafii (srafii) has worked on a number of grants with co-investigators whose names should map to the co-authors of these papers in PubMed: 24424657 22282653 24074864 20211137 18565886 23574623 24733255 24799717 24298994

Allow ReCiter to be used as a web service by Academic Staff Management System

Return output for every CWID and PMID combination where a suggestion is positive or a user has provided feedback.

  • Only output anything that ReCiter suggests.

Outline

  • Citation
    • pubDate
    • authorList
      • author1 (targetAuthor="N")
      • author2 (targetAuthor="Y")
      • author3 (targetAuthor="N")
      • ...
    • journal
      • verbose (full journal title)
      • medlineTA (title abbreviation)
    • volume
    • issue
    • pages
    • PMCID
    • DOI
  • userAssertion (e.g., "accepted", "rejected", or null)
  • Score
    • See below
  • Evidence
    • See below

Score

An integer that can be used to sort suggestions in the web interface.

+10 for email match
+2 for full exact name match (firstName, middle initial, lastName)
+1 for abbreviated name match (firstInitial, middle initial, lastName)
+1 for every other type of evidence... Count each instance of department, institutional affiliation, common collaborator, known relationship, etc. separately as an additional point.
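
A minimal sketch of this point scheme follows. The function and parameter names are hypothetical, and it assumes that a full name match supersedes an abbreviated one rather than both being counted; the issue does not say which is intended.

```python
def suggestion_score(email_match, full_name_match, abbreviated_name_match,
                     other_evidence_count):
    """Integer score used only for sorting suggestions in the web interface."""
    score = 0
    if email_match:
        score += 10
    if full_name_match:
        score += 2
    elif abbreviated_name_match:
        # Assumption: counted only when there is no full exact name match.
        score += 1
    # One point per instance of department, affiliation, collaborator, etc.
    score += other_evidence_count
    return score
```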

Evidence

This article was suggested for the following reasons:

Name similarity

The name [John Smith] is similar or identical to an author of this article: {J. S. Smith}.

Matching relationships

[John Smith] is associated with: {Curt Cole (as a co-investigator)}.

-OR, if two relationships-

[John Smith] is associated with: {Charles Smith (as a co-investigator)} and {Rainu Kaushal (as a mentor/mentee)}.

-OR, if three or more relationships-

[John Smith] is associated with: {Curtis Cole (as a co-investigator)}, {Rainu Kaushal (as a mentor/mentee)}, and {J. Schmitz (as a person in the same organizational unit)}.
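
The three phrasings above differ only in how the list is joined, so one helper could produce all of them. A sketch, with a hypothetical function name and a (name, role) tuple format that is not part of the issue:

```python
def relationship_sentence(target_name, relationships):
    """Build the 'is associated with' sentence for one, two, or 3+ relationships.

    relationships is a list of (name, role) tuples, e.g.
    ("Curt Cole", "co-investigator").
    """
    if not relationships:
        return ""
    phrases = ["{" + name + " (as a " + role + ")}" for name, role in relationships]
    if len(phrases) == 1:
        joined = phrases[0]
    elif len(phrases) == 2:
        joined = phrases[0] + " and " + phrases[1]
    else:
        # Three or more: comma-separated with a final ", and".
        joined = ", ".join(phrases[:-1]) + ", and " + phrases[-1]
    return "[" + target_name + "] is associated with: " + joined + "."
```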

Affiliation match

The affiliation of the matching author in the article is {affiliation string}. This is consistent with the following:

  • [[email protected]] (known email)
  • [Department of Genetic Medicine] (known departmental affiliation)
  • [Emory University] (known institutional affiliation)
  • [Hospital for Special Surgery] (a common-collaborator of [Weill Cornell Medicine])

Grant match

[John Smith] is listed on grant #[grantNumber], which matches a grant number listed in the article record: {funding statement in article}.

Timing of academic degrees

The article was published in {pubYear}, which is [#] years [before/after] [John Smith] received a bachelor's degree and [#] years [before/after] receiving a terminal degree.

Clustering

The article was selected because it has the following in common with certain selected articles:

Update ReCiter and installation instructions so that it can be run outside of WCMC

Currently, ReCiter only runs if users have a password to an internal WCMC database. We need to update the documentation at README.md with separate installation instructions for people internal to CUMC and for external developers, and make any needed changes to the ReCiter code base so that ReCiter can run outside of WCMC.

Store output and scores in the database

ReCiter output is currently written to a series of .csv files, one for each target author. The goal of this project is to write ReCiter output data to the ReCiter database (in addition to the .csv files). The table name is rc_results and the column headers are identical to the column headers in the .csv output files (current as of the 7-15-15 versions of the .csv files).

Please note: As mentioned above, writing output to the database is to be added as a secondary default behavior -- we wish to maintain ReCiter's current default behavior of writing the .csv output files (at least for now) because the files in effect filter the results and can be used in a nimble fashion for error analysis.

Enumerate and describe types of errors

Author name disambiguation is a complex challenge in informatics and computer science. Every disambiguation system makes mistakes, and ReCiter is no exception -- we know that ReCiter sometimes does not properly assign articles to target authors. However, as of April 2015 we do not have a sufficiently detailed accounting of the reasons for the errors. A thorough, exhaustive, and iterative error analysis is needed, and we expect that in many cases the results will inform modifications to the system that will improve performance. Additional detail is available on the Wiki at https://github.com/wcmc-its/ReCiter/wiki/Error-Analysis

Identify exact target author's affiliations from Scopus XML

Given a list of known publications for a given author, determine the author's institutional affiliation at the time each article was published. Add the prior institutional affiliations and years to the database.

Example:

  • Suppose we identify this candidate record for Rainu Kaushal - http://www.ncbi.nlm.nih.gov/pubmed/25571986
  • Look up the equivalent record in Scopus. Here is the record.
  • Count the number of authors that match the surname.
    • If there's one, use that one. That's the case here. There's only one author whose surname is "Kaushal"
    • If there's more than one, you will need to use first name to find the correct author.
  • Alternative to above: use author "rank" but this is sometimes unreliable.
  • The author element looks like this:
<author>
  <author-url>http://api.elsevier.com/content/author/author_id:7005295324</author-url>
  <authid>7005295324</authid>
  <authname>Kaushal,R.</authname>
  <surname>Kaushal</surname>
  <given-name>Rainu</given-name>
  <initials>R.</initials>
  <afid>60007997</afid>
  <afid>60007997</afid>
  <afid>60007997</afid>
  <afid>60018043</afid>
  <afid>112593445</afid>
  <afid>60007997</afid>
</author>
  • As far as I know, the AF-IDs are relatively persistent (so you could use them between one lookup and another) and correspond to institutions listed above in the XML
  • Here's what the XML for a given affiliation looks like:
<affiliation>
  <affiliation-url>http://api.elsevier.com/content/affiliation/affiliation_id:60007997</affiliation-url>
  <afid>60007997</afid>
  <affilname>Weill Cornell Medical College</affilname>
  <name-variant>Weill Cornell Medical College</name-variant>
  <name-variant>Weill Medical College of Cornell University</name-variant>
  <name-variant>Cornell University</name-variant>
  <name-variant>Cornell University Medical College</name-variant>
  <affiliation-city>New York</affiliation-city>
  <affiliation-country>United States</affiliation-country>
</affiliation>
  • Of course, Rainu has six (!) affiliations, so you would want to grab all of them.
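
The surname-then-first-name lookup described above might look like the following sketch. It assumes the author elements are wrapped in some parent element and carry no XML namespaces, as in the excerpt; a real Scopus response may differ. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def affiliation_ids_for_author(authors_xml, surname, given_name=None):
    """Return the distinct afid values, in document order, for the one author
    matching surname (falling back to given name when the surname is ambiguous).

    Returns an empty list when no single author can be identified.
    """
    root = ET.fromstring(authors_xml)
    candidates = [
        a for a in root.iter("author")
        if (a.findtext("surname") or "").lower() == surname.lower()
    ]
    if len(candidates) > 1 and given_name:
        # More than one author shares the surname: narrow by first name.
        candidates = [
            a for a in candidates
            if (a.findtext("given-name") or "").lower().startswith(given_name.lower())
        ]
    if len(candidates) != 1:
        return []
    distinct = []
    for afid in candidates[0].findall("afid"):
        if afid.text and afid.text not in distinct:
            distinct.append(afid.text)
    return distinct
```

Applied to the author element shown above, the six afid entries collapse to three distinct identifiers, which can then be resolved against the affiliation elements.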

Year-based clustering and matching

Background:

  • Year comparison
    • Goal: leverage year of publication to create clusters that account for the fact that articles written by the same person tend to be published around the same date.
    • From the gold standard, Paul did a quick (imperfect) analysis of 500k random pairs of articles written by the same person. See here. For example, 88% of article pairs with known shared authors were published one or more years apart and 79% two or more years apart, while ~50% of article pairs were published within 5 years of each other.
  • Terminal degree

Phase One clustering:

  • Suggested approach:
    • Compute difference in year between candidate article and closest year in article cluster.
    • Look up the score in the table (see link above) and use it to decide whether to include the article in the cluster.
    • For example, suppose a candidate article has a 7 year difference with the closest year in a cluster. Look up "> 6" in the table, then multiply the similarity score by the listed value (0.486) and by some constant.

Phase Two matching:

  • Use for clustering (see above)
  • Leverage terminal degree
  • Suggested approach for terminal degree:
    • Moderately penalize articles that were published slightly before (0 - 7 years) a person's terminal degree, and strongly penalize articles that were published well before (>7 years) a person's terminal degree.
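
One way to express the terminal degree penalty is as a multiplier on an article's score. In this sketch the 0 to 7 year and more than 7 year bands come from the suggestion above, while the multiplier values (0.5 and 0.1) and the function name are placeholders, not tuned or agreed-upon numbers.

```python
def terminal_degree_penalty(pub_year, degree_year, moderate=0.5, strong=0.1):
    """Return a multiplier for an article score based on publication timing.

    moderate and strong are hypothetical placeholder values.
    """
    years_before = degree_year - pub_year
    if years_before > 7:
        return strong      # published well before the terminal degree
    if years_before >= 0:
        return moderate    # published 0 to 7 years before the degree
    return 1.0             # published after the degree: no penalty
```

For instance, an article from 1977 for an author whose terminal degree dates from 2001 falls in the strongly penalized band.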

Possible approaches include:

  • Ideal: employ a probabilistic approach where the likelihood diminishes in proportion to the number of standard deviations away from the center of the distribution of the years in which the author's known publications were published
  • Populate a vector of the frequency of articles the author is known to have published in each year
  • Use TF/IDF measure using the individual years as terms
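
As a rough illustration of the first (probabilistic) option, the sketch below scores a candidate year by how many standard deviations it lies from the mean of the author's known publication years. The Gaussian-shaped falloff, the fallback of one year when the spread is zero, and the function name are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def year_likelihood(candidate_year, known_years):
    """Return a value in (0, 1] that shrinks as the candidate year moves away
    from the center of the author's known publication years."""
    if not known_years:
        return 1.0  # no evidence either way
    center = mean(known_years)
    spread = pstdev(known_years) or 1.0  # avoid dividing by zero
    z = abs(candidate_year - center) / spread
    return math.exp(-0.5 * z * z)
```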

Example:

  • Ben Gold got a PhD in 2001, but these publications, which were published in 1977, are currently false positives and should be true negatives:
    • 402849
    • 892358

Use journal similarity for phase one and two matching

Background: Paul downloaded five years of Medline records and their associated MeSH terms. Then, using Jie's workflow for the grant recommendation tool project, Paul calculated the "field scores" for each journal. Then, Paul used some basic (non-rigorous) arithmetic to calculate similarity between journals such that any two journals that had at least one MeSH term in the past five years (n = 2300) have a similarity score relative to each other.

Understanding the scores:

  • The range is 0-1. The average score is ~0.65.
  • A high score suggests high similarity and a greater likelihood that a given author wrote for both journals. A low score suggests the opposite.
  • The matching is done based on Medline title abbreviation. For example, the PMID 25864809 has a title abbreviation of Acad Pediatr.

How can this be used?

  • Phase One: decide if a publication should be part of a cluster; for example:
    • You have 3 articles, A, B, and C. A has the most complete information followed by B.
    • Do a lookup in the table to see how similar A is to B. Their similarity is 0.91. So you put them in the same cluster.
    • The similarity of C to A and C to B is an average of 0.6 (see distribution), which falls below the predetermined arbitrary threshold (0.8?), so C is in its own cluster.
  • Phase Two: use default department scores when people have few publications
    • The same sort of matching described above can work for Phase Two matching.
    • This is especially true if someone has no or few publications, in which case we can calculate default scores by department (not ready yet).
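
The Phase One example with articles A, B, and C can be sketched as average-linkage clustering over a journal similarity lookup. The 0.8 threshold is the tentative value mentioned above; the function name, the lookup format (a dict keyed by a frozenset of two Medline title abbreviations), and treating a missing pair as 0 are assumptions.

```python
def cluster_by_journal(articles, similarity, threshold=0.8):
    """Greedily cluster (article_id, journal) pairs by average journal similarity.

    Articles are processed in the order given (most complete record first).
    """
    clusters = []
    for article_id, journal in articles:
        for cluster in clusters:
            scores = [
                1.0 if journal == other
                else similarity.get(frozenset((journal, other)), 0.0)
                for _, other in cluster
            ]
            if sum(scores) / len(scores) >= threshold:
                cluster.append((article_id, journal))
                break
        else:
            # No existing cluster was similar enough: start a new one.
            clusters.append([(article_id, journal)])
    return [[article_id for article_id, _ in cluster] for cluster in clusters]
```

With a similarity of 0.91 between the journals of A and B, and an average of 0.6 between C's journal and theirs, A and B share a cluster and C stands alone.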

I will provide a link to the similarity file (300 MB) outside of GitHub.

Update ReCiter clustering so that it can be run locally and produce readable output

The desired output is a data file (e.g., CSV or other common format) in which each row is an article. The first row has field names, which pertain to the following:
-All available article metadata, including title, author names, affiliations, etc., in separate fields
-Which cluster the article was assigned to
-An indication of the order in which the cluster was created as ReCiter ran down through the list of articles. (This could be accomplished by assigning an integer to the clusters, which corresponds to the order in which they were created; maybe you already have this in the code.)
-Evidence ReCiter used when assigning the article to the cluster, if applicable
-Indication of whether the assigned cluster was correct based on reference standard, if applicable
-Other data you think could be useful in examining the output in detail, if any

Create authorAffiliationScoringStrategy

Overview

With this scoring strategy, we're trying to account for the extent to which affiliation of all authors affects the likelihood a given targetAuthor authored an article.

To do this, we need to ask and answer several questions.

  1. Which sources are we using to make the match?
    • Scopus - does institutional disambiguation; provides affiliations as numeric codes (e.g., 60007997)
    • PubMed - affiliations are just strings
  2. Which affiliation(s) are we considering?
    • targetAuthor
    • non-targetAuthor
  3. What type of match is this?
    • explicitly defined for the individual, e.g., Dr. X got an undergraduate degree from Georgetown University, did her residency at Montefiore, etc.
    • explicitly defined for the institution, e.g., Weill Cornell faculty frequently co-author papers with individuals from Hospital for Special Surgery
    • match was not attempted because there was no available affiliation data
    • match was attempted but failed

About Scopus data

There are currently 276,666 institutional affiliations in the Identity table, which represent 3,861 unique institutions. These come from several sources, which use a controlled vocabulary.

We've looked up the Scopus Institution ID for the 1,786 institutions that are most often cited as being a current or historical affiliation. This collectively represents 273,006 affiliations. In other words, ~99% of the time we can predict what the Scopus Institution ID could be. Note that a given institution such as Weill Cornell might have multiple institution IDs.

Values in application.properties

targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 3
targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1.5
targetAuthor-institutionalAffiliation-matchType-null-score: 0
targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2

nonTargetAuthor-institutionalAffiliation-weight: 0.5
nonTargetAuthor-institutionalAffiliation-maxScore: 3

homeInstitution-scopusInstitutionIDs: 60007997, 60019868, 60000247, 60072750, 60109878

homeInstitution-keywords: weill|cornell, weill|medicine, cornell|medicine, cornell|medical, weill|medical, weill|bugando, weill|graduate, cornell|presbyterian, weill|presbyterian, 10065|cornell, 10065|presbyterian, 10021|cornell, 10021|presbyterian, weill|qatar, cornell|qatar, @med.cornell.edu, @qatar-med.cornell.edu

institutionStopwords: of, the, for, and, to

collaboratingInstitutions-scopusInstitutionIDs: 60010570, , 60025849, 60012732, 60018043, 60008981, 60022875, 60019970, 60025879, 60009343, 60009656, 60072743, 60072746, 60104769, 60012981, 60000764, 60004670, 60014933, 60022377, 60005705, 60003158, 60027954, 60003711, 60103484, 60029961, 60031841, 60005208, 60002388, 60024099, 60030304, 60029652, 60026273, 60024541, 60023247, 60007555, 60017027, 60002896, 60011605, 60027565

collaboratingInstitutions-keywords: new|york|presbyterian, HSS, hospital|special|surgery, North|Shore|hospital, Long|Island|Jewish, memorial|sloan, sloan|kettering, hamad, mount|sinai, methodist|houston, National|Institute|Mental|Health, beth israel, University|Pennsylvania|Medicine, Merck|Research, New|York|Medical|College, Medicine|Dentistry|New|Jersey, Montefiore, Lenox|Hill, Cold|Spring|Harbor, St|Luke|Roosevelt, New|York|University|Medicine, Langone, SUNY|Downstate, Albert|Einstein|Medicine, Yeshiva, UMDNJ, Icahn|Medicine, Mount|Sinai, columbia|medical, columbia|physicians

Desired output

Variables

targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
targetAuthor-institutionalAffiliation-matchType: null
targetAuthor-institutionalAffiliation-matchType: noMatch

targetAuthor-institutionalAffiliation-source: Scopus
targetAuthor-institutionalAffiliation-source: PubMed

nonTargetAuthor-institutionalAffiliation-source: Scopus
nonTargetAuthor-institutionalAffiliation-source: PubMed

TargetAuthor

Case 1: Target author has affiliation statements in Scopus and PubMed

targetAuthorAffiliation
	Scopus
		1 
			targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
			targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University"
			targetAuthor-institutionalAffiliation-source: Scopus
			targetAuthor-institutionalAffiliation-article-scopusLabel: "Weill Cornell Medicine" 
			targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "60007997"  
			targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 3
		2
			targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
			targetAuthor-institutionalAffiliation-source: Scopus
			targetAuthor-institutionalAffiliation-identity: "Hospital for Special Surgery"
			targetAuthor-institutionalAffiliation-article-scopusLabel: "Hospital for Special Surgery"  
			targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "61492421"  
			targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1.5
		3
			targetAuthor-institutionalAffiliation-matchType: noMatch
			targetAuthor-institutionalAffiliation-source: Scopus
			targetAuthor-institutionalAffiliation-article-scopusLabel: "University of Adelaide"  
			targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "6999421"  
			targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2
		etc...
	PubMed
			targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Medicine, New York, NY 10065" 

Notes:

  • One target author can have N affiliations in Scopus (as opposed to PubMed). Each of these matches will count towards additional points.
  • We output the PubMed affiliation statement, but that's just for reference. We're not using it for scoring purposes.

Case 2: Target author has affiliation statements in Scopus only

targetAuthorAffiliation
	Scopus
		1 
			targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
			targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University"
			targetAuthor-institutionalAffiliation-source: Scopus
			targetAuthor-institutionalAffiliation-article-scopusLabel: "Weill Cornell Medicine" 
			targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "60007997"  
			targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 3
		2
			targetAuthor-institutionalAffiliation-matchType: noMatch
			targetAuthor-institutionalAffiliation-source: Scopus
			targetAuthor-institutionalAffiliation-article-scopusLabel: "University of Adelaide"  
			targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "6999421"  
			targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2

Case 3: Target author has affiliation statements only in PubMed

targetAuthorAffiliation
    PubMed
		targetAuthor-institutionalAffiliation-source: PubMed
		targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Graduate School of Medical Sciences, New York, New York, USA."
		targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
		targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
		targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2

Non-target author

Case 4: Non-target author(s) have one or more affiliation statements in Scopus

nonTargetAuthorAffiliation
	Scopus
		nonTargetAuthor-institutionalAffiliation-source: Scopus
		nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Cornell Medicine, 60007997, 3
		nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Graduate School of Medical Sciences, 60000247, 2
		nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: Methodist Hospital System, 60008981, 2
		nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Medical Research Institute, 60022377, 1
		nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Rehabilitation Hospital, 60005705, 1
		nonTargetAuthor-institutionalAffiliation-matchType-match-score: 2.4  /* example */

Notes:

  • Don't worry about displaying PubMed affiliations in this case.

Case 5: Non-target author(s) have an affiliation statement in PubMed but not Scopus

We don't consider this case.

Pseudocode

Evaluate targetAuthor

Decide which source to use for scoring.

We generally prefer to use Scopus if it's available. If it's not, we still need to provide the option to use PubMed alone.

1. As set in application.properties, is use.scopus.articles=true?
  • If yes, go to 2
  • If no, go to 3
2. Does article have a Scopus affiliation for targetAuthor?
  • If no, go to 3
  • If yes, go to "Evaluate Scopus affiliation"
3. Does candidate article have a PubMed affiliation for targetAuthor?
  • If no, go to 4
  • If yes, go to "Evaluate PubMed Affiliation"
4. Return the following:
targetAuthor-institutionalAffiliation-matchType: null
targetAuthor-institutionalAffiliation-matchType-null-score: 0
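
The four-step source decision above reduces to a short function. This is a sketch with hypothetical names; the use_scopus_articles argument stands for the use.scopus.articles setting in application.properties.

```python
def choose_affiliation_source(use_scopus_articles, scopus_affiliations,
                              pubmed_affiliation):
    """Return "Scopus", "PubMed", or None following steps 1 to 4.

    None corresponds to step 4: matchType null with a score of 0.
    """
    if use_scopus_articles and scopus_affiliations:
        return "Scopus"
    if pubmed_affiliation:
        return "PubMed"
    return None
```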

Evaluate Scopus affiliation

1. Get list of institutions (these are strings) from identity.Institution for target person. Also, get Scopus institution IDs from homeInstitution-scopusInstitutionIDs from application.properties.
2. Get any scopusInstitutionIDs (e.g., 60007997) from article.affiliation for targetAuthor.
3. Use values from identity.Institution to look up Scopus institutional identifiers in the InstitutionAfid table. For example, Weill Graduate School of Medical Sciences of Cornell University returns:
  "afids": [
    "60007997",
    "60019868",
    "60000247",
    "60072750",
    "60026978",
    "60025849",
    "105533257"
    ]
4. Attempt match between article and identity.

If there's a positive match between article and identity, output the following:

targetAuthor-institutionalAffiliation-source: Scopus

For EACH positive match between article and identity, output the following:

targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
targetAuthor-institutionalAffiliation-article-scopusLabel: "Weill Cornell Graduate School of Medical Sciences"  /* example */
targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "60007997"  /* example */
targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2
 /* value stored in application.properties */

If match, go to 7.
If no match, go to 5.

5. Attempt match using collaborating institutions, which are defined at the institutional level. Grab values from collaboratingInstitutions-scopusInstitutionIDs (stored in application.properties). Look for overlap between the two.

If there's any positive match between the article and the collaborating institutions, output the following for all matches:

targetAuthor-institutionalAffiliation-source: Scopus
targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1
 /* value stored in application.properties */
targetAuthor-institutionalAffiliation-article-scopusLabel: "Hospital for Special Surgery"  /* example */
targetAuthor-institutionalAffiliation-article-scopusAffiliationID: "61492421"  /* example */

While there can be multiple matches, the maximum score returned for this type of match should be 1.

If no match, go to 6.

6. There's no match. Output:
targetAuthor-institutionalAffiliation-source: Scopus
targetAuthor-institutionalAffiliation-article-scopusLabel: "Hospital for Sick Children"  /* example */
targetAuthor-institutionalAffiliation-matchType: noMatch
targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2  /* value stored in application.properties */

Test case: meb7002 and 22667600

Go to 7.

7. If PubMed affiliation exists, output that (but don't score it):
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Medicine, New York, NY 10065" 
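
Steps 4 through 6 of the Scopus evaluation amount to two set-membership checks in a fixed order. The sketch below returns only the match type and the affiliation IDs involved; looking up the corresponding score in application.properties is left out. The function and parameter names are hypothetical.

```python
def classify_scopus_affiliation(article_afids, identity_afids, home_afids,
                                collaborating_afids):
    """Classify the target author's Scopus affiliations on one article.

    Returns (matchType, afids), where afids are the article's IDs that
    produced the match (or all of them when nothing matched).
    """
    known = set(identity_afids) | set(home_afids)
    individual = [afid for afid in article_afids if afid in known]
    if individual:
        return "positiveMatch-individual", individual
    collaborating = set(collaborating_afids)
    institutional = [afid for afid in article_afids if afid in collaborating]
    if institutional:
        return "positiveMatch-institution", institutional
    return "noMatch", list(article_afids)
```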

Evaluate PubMed affiliation

1. Get list of institutions (these are strings) from identity.institutions for person under consideration.
2. Get article.affiliation for targetAuthor.
3. Preprocess.

Get list of stopwords from the institutionStopwords field in application.properties.

Remove stopwords, commas, and dashes from article.affiliation and identity.institutions.

Ignore any words inside parentheses. These are typically countries and are not included in affiliation statements.

4. Attempt match from article.affiliation and identity.institutions. The logic here is that keywords from identity.institutions are some substring of article.affiliation.

Here's how we do this match. Grab each affiliation and see if all the keywords are represented in a single affiliation. For example, suppose an author has a known affiliation in identity.institutions of "Weill Cornell Medical College". And, suppose the article affiliation is "Department of Pharmacology, Medical College of Weill Cornell." This would be a match because all the words in the identity affiliation are represented in the article affiliation.

If there's a match, output the following:

targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Graduate School of Medical Sciences, New York, New York, USA." /* example */
targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2
 /* value stored in application.properties */

Maximum of one match.

If there's no match, go to 5.
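
The preprocessing in step 3 and the subset test in step 4 could be sketched as follows. Stripping trailing periods and replacing commas and dashes with spaces (rather than deleting them outright) are assumptions added so that the worked example matches; the function names are hypothetical.

```python
import re


def tokenize(text, stopwords):
    """Lowercase words of text, minus stopwords and parenthesized content."""
    text = re.sub(r"\([^)]*\)", " ", text)  # ignore words inside parentheses
    text = re.sub(r"[,\-]", " ", text)      # drop commas and dashes
    words = [word.strip(".").lower() for word in text.split()]
    return {word for word in words if word and word not in stopwords}


def identity_matches_affiliation(identity_institution, article_affiliation,
                                 stopwords):
    """True when every identity keyword appears in the article affiliation."""
    needed = tokenize(identity_institution, stopwords)
    return bool(needed) and needed <= tokenize(article_affiliation, stopwords)
```

Because the comparison is on word sets, word order in the affiliation string does not matter.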

5. Attempt match against homeInstitution-keywords.

Get homeInstitution-keywords from application.properties.

Look for cases where a homeInstitution keyword group is present in the affiliation string in any order. Here's how we do this. Take any group of terms from homeInstitution-keywords, e.g., "weill|cornell". In order for this to be a match, both terms must be present, in any order and in any case.

  • These are matches: "Cornell Weill Medical College", "The Weill Medical School of Cornell University"
  • These are not matches: "Cornell University", "Cornell Med"

If there's a match, output the following:

targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Weill Cornell Graduate School of Medical Sciences, New York, New York, USA." /* example */
targetAuthor-institutionalAffiliation-identity: "Weill Graduate School of Medical Sciences of Cornell University" /* example */
targetAuthor-institutionalAffiliation-matchType: positiveMatch-individual
targetAuthor-institutionalAffiliation-matchType-positiveMatch-individual-score: 2
 /* value stored in application.properties */
homeInstitution-Label: Weill Cornell Medicine / NewYork-Presbyterian Hospital

Maximum of one match.

If there's no match, go to 6.
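
The keyword-group test in step 5 can be sketched as below. It assumes groups are comma-separated in application.properties with terms joined by "|", and it uses case-insensitive substring matching (so that entries such as "@med.cornell.edu" or a zip code also work); whole-word matching would be a stricter alternative. The function names are hypothetical.

```python
def parse_keyword_groups(property_value):
    """Split a property such as "weill|cornell, weill|medicine" into groups."""
    return [group.strip() for group in property_value.split(",") if group.strip()]


def matches_keyword_group(affiliation, keyword_groups):
    """Return the first group whose terms all occur in affiliation, else None."""
    text = affiliation.lower()
    for group in keyword_groups:
        terms = [term.strip().lower() for term in group.split("|")]
        if all(term in text for term in terms):
            return group
    return None
```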

6. Attempt match using collaborating institutions, which are defined at the institutional level. Grab values from collaboratingInstitutions-keywords (stored in application.properties). Look for overlap between the two.

If there's any positive match between the article affiliation and the collaborating institution keywords, output the following for all matches:

targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-matchType: positiveMatch-institution
targetAuthor-institutionalAffiliation-matchType-positiveMatch-institution-score: 1
 /* value stored in application.properties */
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Hospital for Special Surgery, New York, NY 10021"  /* example */

While there can be multiple matches, the maximum score returned for this type of match should be 1.

If there's no match, go to 7.

7. There's no match. Output:
targetAuthor-institutionalAffiliation-source: PubMed
targetAuthor-institutionalAffiliation-article-pubmedLabel: "Hospital for Sick Children, Quebec City, Quebec, Canada YRV MX1"  /* example */
targetAuthor-institutionalAffiliation-matchType: noMatch
targetAuthor-institutionalAffiliation-matchType-noMatch-score: -2  /* value stored in application.properties */

Evaluate nonTargetAuthor

Decide which source to use

We generally prefer to use Scopus if it's available. If it's not, we still need to provide the option to use PubMed alone.

1. As set in application.properties, is use.scopus.articles=true?
  • If yes, go to 2
  • If no, go to 3
2. Does article have any Scopus affiliation for nonTargetAuthor?
  • If no, go to 3
  • If yes, go to "Evaluate Scopus affiliation"
3. Does candidate article have any PubMed affiliation for nonTargetAuthor?
  • If no, go to 4
  • If yes, go to "Evaluate PubMed Affiliation"
4. Return the following:
nonTargetAuthor-institutionalAffiliation-matchType: null
nonTargetAuthor-institutionalAffiliation-matchType-null-score: 0

Evaluate Scopus affiliation

1. Preprocessing

A. Create scopusIDsNonTargetAuthor-Article.

  • This contains all scopusInstitutionIDs (e.g., 60007997) from article.affiliation for all nonTargetAuthors.

B. Create scopusIDsNonTargetAuthor-Identity-KnownInstitutions.

  • This contains all Scopus Institution IDs from homeInstitution-scopusInstitutionIDs as stored in application.properties.
  • It also contains all Scopus Institution IDs for targetAuthor from identity.institutions; do this by matching against identity.institutionafids as described above.

C. Create scopusIDsNonTargetAuthor-Identity-CollaboratingInstitutions

  • This contains all Scopus Institution IDs from collaboratingInstitutions-scopusInstitutionIDs as stored in application.properties.
2. Determine overlap.

Compute the following:

  • countScopusIDNonTargetAuthor-Affiliations - non-unique count of all Scopus affiliation IDs for all nonTargetAuthors
  • countScopusIDsNonTargetAuthor-Article-KnownInstitution - count of cases where affiliation ID from scopusIDsNonTargetAuthor-Article is in scopusIDsNonTargetAuthor-Identity-KnownInstitutions
  • countScopusIDsNonTargetAuthor-Article-CollaboratingInstitution - count of cases where affiliation ID from scopusIDsNonTargetAuthor-Article is in scopusIDsNonTargetAuthor-Identity-CollaboratingInstitutions
  • countScopusIDsNonTargetAuthor-Article-NoMatch - count of cases in which none of the above are true
3. Compute overall score.

Get nonTargetAuthor-institutionalAffiliation-collaboratingInstitution-weight and nonTargetAuthor-institutionalAffiliation-maxScore from application.properties.

nonTargetAuthor-institutionalAffiliation-maxScore * (countScopusIDsNonTargetAuthor-Article-KnownInstitution + (countScopusIDsNonTargetAuthor-Article-CollaboratingInstitution * nonTargetAuthor-institutionalAffiliation-collaboratingInstitution-weight )) / countScopusIDNonTargetAuthor-Affiliations

4. Output values
nonTargetAuthor-institutionalAffiliation-source: Scopus
nonTargetAuthor-institutionalAffiliation-matchType-match-score: 2.4  /* example */

/* Here we're outputting Scopus institution labels, identifiers, and counts for all matching institutions. */
nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Cornell Medicine, 60007997, 3
nonTargetAuthor-institutionalAffiliation-match-knownInstitution: Weill Graduate School of Medical Sciences, 60000247, 2
nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: Methodist Hospital System, 60008981, 2
nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Medical Research Institute, 60022377, 1
nonTargetAuthor-institutionalAffiliation-match-CollaboratingInstitution: The Burke Rehabilitation Hospital, 60005705, 1
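
The overlap counts from step 2 and the formula from step 3 can be sketched as follows. The class and parameter names are hypothetical, and the sketch makes one assumption the text does not settle: an ID that appears in both the known and the collaborating sets is counted in both tallies, which is the literal reading of the two definitions.

```java
import java.util.List;
import java.util.Set;

/** Illustrative sketch of the Scopus affiliation score; names are hypothetical. */
final class ScopusAffiliationScore {

    /**
     * @param articleIds           Scopus institution IDs for all nonTargetAuthors (non-unique)
     * @param knownIds             scopusIDsNonTargetAuthor-Identity-KnownInstitutions
     * @param collaboratingIds     scopusIDsNonTargetAuthor-Identity-CollaboratingInstitutions
     * @param maxScore             nonTargetAuthor-institutionalAffiliation-maxScore
     * @param collaboratingWeight  ...-collaboratingInstitution-weight
     */
    static double score(List<String> articleIds,
                        Set<String> knownIds,
                        Set<String> collaboratingIds,
                        double maxScore,
                        double collaboratingWeight) {
        int total = articleIds.size();   // non-unique count of affiliation IDs
        if (total == 0) {
            return 0.0;                  // avoid dividing by zero when there is no data
        }
        int known = 0;
        int collaborating = 0;
        for (String id : articleIds) {
            // Assumption: the two tallies are independent, as the definitions read literally.
            if (knownIds.contains(id)) {
                known++;
            }
            if (collaboratingIds.contains(id)) {
                collaborating++;
            }
        }
        return maxScore * (known + collaborating * collaboratingWeight) / total;
    }
}
```

For example, with a hypothetical maxScore of 3.0 and weight of 0.5, four affiliation IDs of which two are known and one is collaborating give 3.0 * (2 + 0.5) / 4 = 1.875.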

Evaluate PubMed affiliation

At this time, we're not evaluating PubMed affiliation for nonTargetAuthors.

ReCiter is not storing the exact number of XML results returned by PubMed.

When retrieving XML for a given CWID, the number of publications returned by the query is sometimes not equal to the number of publications stored on disk.

The bug can be reproduced by querying

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retstart=0&retmax=10000&usehistory=y&term=Toth%20Miklos[au]

for which PubMed currently returns 208 publications, whereas only 206 publications are stored on disk.

This is likely due to the values of retstart and retmax.
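
One way to investigate this hypothesis is to make the paging explicit and to compare the count PubMed reports with the count written to disk. The sketch below is a diagnostic aid only, not a confirmed fix; the class and method names are hypothetical, and the page size is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative paging helper for esearch requests; names are hypothetical. */
final class ESearchPaging {

    /** Returns the retstart value for each page needed to cover totalCount results. */
    static List<Integer> retstartOffsets(int totalCount, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<Integer> offsets = new ArrayList<>();
        for (int start = 0; start < totalCount; start += pageSize) {
            offsets.add(start);
        }
        return offsets;
    }

    /** Builds the esearch URL for one page; the term must already be URL-encoded. */
    static String pageUrl(String encodedTerm, int retstart, int retmax) {
        return "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed"
                + "&retstart=" + retstart
                + "&retmax=" + retmax
                + "&usehistory=y&term=" + encodedTerm;
    }

    /** True when the number of stored records matches the count PubMed reported. */
    static boolean isComplete(int reportedCount, int storedCount) {
        return reportedCount == storedCount;
    }
}
```

Logging a warning whenever isComplete returns false would at least make the discrepancy visible at retrieval time.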

Decide on data representation for author profiles used in the Phase Two matching

Michael and Paul to work on this
a. department
b. relationships
  i. PI. See PrincipalInvestigators-PhDandMDPhDstudents.xlsx (available by request from Michael Bales, [email protected])
  ii. Examples of co-authors: http://www.ncbi.nlm.nih.gov/pubmed/?term=Nimer+Hatlen
c. year of terminal degree
d. clinical expertise
  i. list of ID's - https://pops.weillcornell.org/providerprofiles/ids.json
  ii. sample provider profile - http://pops.weillcornell.org/providerprofiles/38.json
  iii. equivalent on web site - http://weillcornell.org/patmarino
e. board certifications. See BoardCertifications.xls (available by request from Michael Bales, [email protected])
f. other fields?

Leverage data on board certifications to improve phase two matching

Background

In phase 2 matching, ReCiter selects one or more piles, and the resulting set of articles constitutes ReCiter's final determination of which articles were written by the target author. In some cases ReCiter fails to include a pile that should be selected for a target author, resulting in errors of omission and reduced recall. One piece of evidence that ReCiter can use to make better decisions in phase 2 matching is board certification data. Specifically, if a given pile of articles includes words that match words in one or more of the target author's board certifications, as measured by cosine similarity, this should increase the likelihood that the articles in that pile were in fact written by the target author.
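
As a rough illustration of the cosine similarity mentioned above, the sketch below compares two bags of words by term frequency. It is a generic textbook formulation, not ReCiter's implementation; the class name and the choice of raw term counts (rather than, say, TF-IDF weights) are assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative bag-of-words cosine similarity; names and weighting are assumptions. */
final class CosineSimilarity {

    /** Counts how often each lower-cased term occurs. */
    static Map<String, Integer> termCounts(List<String> terms) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : terms) {
            counts.merge(term.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine of the angle between the two term-count vectors; 0.0 if either is empty. */
    static double cosine(List<String> a, List<String> b) {
        Map<String, Integer> countsA = termCounts(a);
        Map<String, Integer> countsB = termCounts(b);
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : countsA.entrySet()) {
            Integer other = countsB.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double normA = norm(countsA);
        double normB = norm(countsB);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Integer> counts) {
        double sumOfSquares = 0.0;
        for (int count : counts.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

Here one list would hold the words drawn from a pile of articles and the other the words from the target author's board certifications; a higher value indicates more shared vocabulary.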

Operationalization

Board certifications for WCMC available here.

  • Using the CWID, retrieve any available board certifications or areas of expertise.
  • Pre-process data:
    • Break up terms into multiple words: "Blood Banking" >> "Blood", "Banking"
    • Break up terms containing a slash into two distinct terms: "Obstetrics/Gynecology" >> "Obstetrics", "Gynecology"
    • Remove any of the following terms: "and", "the", "medicine", "-", "with", "in", "med", "adult", "general"
  • Append processed board certification data to the list of elements used by ReCiter in the cluster selection step
  • Build out cluster selection step to include processed board certification data
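
The pre-processing rules above can be sketched as a single tokenizing pass. The class and method names are hypothetical, and the sketch assumes that the removal list is matched case-insensitively (so that "Medicine" is dropped as well as "medicine"), which the text does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative pre-processing of board certification terms; names are hypothetical. */
final class BoardCertificationTerms {

    // Terms to remove, taken from the list above.
    private static final Set<String> REMOVED_TERMS = new HashSet<>(Arrays.asList(
            "and", "the", "medicine", "-", "with", "in", "med", "adult", "general"));

    /** Splits each certification on whitespace and slashes, then drops removed terms. */
    static List<String> preprocess(List<String> certifications) {
        List<String> words = new ArrayList<>();
        for (String certification : certifications) {
            for (String token : certification.split("[\\s/]+")) {
                if (token.isEmpty()) {
                    continue; // produced by leading whitespace or a leading slash
                }
                // Assumption: the removal list is applied case-insensitively.
                if (REMOVED_TERMS.contains(token.toLowerCase(Locale.ROOT))) {
                    continue;
                }
                words.add(token);
            }
        }
        return words;
    }
}
```

With these rules, "Obstetrics/Gynecology" and "Obstetrics and Gynecology" both reduce to the two words "Obstetrics" and "Gynecology".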

Examples

  • Jonathan W. Weinsaft (jww2001) has a variety of specialties associated with him:
    • Nuclear Cardiology
    • Cardiology
    • Clinical Expertise
    • Cardiovascular Nuclear Imaging
    • Radioisotope Imaging For Heart Diseases
    • Cardiovascular Stress Testing
    • Stress Testing
  • Phase two matching should pick up the following PMID's, all of which are related to these topics:
    • 25799706
    • 22835669
    • 22815751
    • 21812692
    • 21757159
    • 20579652
    • 18598895
    • 19808512
    • 17976589
    • 17478239

  • Linda Vahdat (ltv2001) has a board certification in Medical Oncology. These papers, which are currently false negatives, consistently use the oncology keyword.
    • 24202699
    • 24682463
    • 23403636
    • 21376385
    • 20679609
    • 20299316
    • 19349550
    • 24699910
    • 21937232
    • 18650153
    • 17606975
    • 17606972
    • 18762793
    • 18378531
    • 16921067
    • 16821602
    • 15945506
    • 14585260
    • 12548594
    • 7949141
    • 7949097

  • Thomas A. Caputo (tac2001) is board certified in Obstetrics and Gynecology. Those terms should map to these PMID's:
    • 760018
    • 876565
    • 10920302
    • 3976767
    • 22101154
    • 11104615
    • 11006044
    • 9740708
    • 8626101
    • 9234922
    • 2909447

Explore how scores improve as asserted publications are used to select clusters rather than seed them

Because ReCiter is a greedy algorithm that performs agglomerative clustering, we may assume that performance will improve if we provide one or more "seed" articles that are known to have been written by an author. We wish to quantify the extent to which precision and recall improve in specific cases. This project involves running ReCiter several times for a fixed group of authors, in each case varying the number of "seed" articles provided. The results of this study could be written up and submitted to a journal.
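
To compare runs with different numbers of seed articles, precision and recall can be computed per author from the set of PMIDs that ReCiter selects and a gold-standard set of that author's known publications. The sketch below uses the standard definitions; the class name and the use of PMID sets as input are assumptions about how such an experiment might be set up.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative precision/recall over sets of PMIDs; names are hypothetical. */
final class PrecisionRecall {

    /** Fraction of selected articles that are in the gold standard. */
    static double precision(Set<Long> selected, Set<Long> goldStandard) {
        if (selected.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(selected, goldStandard) / selected.size();
    }

    /** Fraction of gold-standard articles that were selected. */
    static double recall(Set<Long> selected, Set<Long> goldStandard) {
        if (goldStandard.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(selected, goldStandard) / goldStandard.size();
    }

    private static int truePositives(Set<Long> selected, Set<Long> goldStandard) {
        Set<Long> intersection = new HashSet<>(selected);
        intersection.retainAll(goldStandard);
        return intersection.size();
    }
}
```

Recording both values for each author at each seed count would show where additional seeds stop helping.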
