
PromptLink

This repo contains the code for our paper "PromptLink: Leveraging Large Language Models for Cross-Source Biomedical Concept Linking".

Task Description

In this paper, we address the biomedical concept linking task, which aims to link biomedical concepts across sources/systems based on their semantic meanings and biomedical knowledge. Because it relies solely on concept names, it covers a much broader range of real-world applications than existing tasks such as entity linking, entity alignment, and ontology matching, which depend on additional contextual or topological information. A toy example of the biomedical concept linking task is shown in the following figure.

toy-example

Figure 1: A toy example. Left: concepts in the EHR. Right: concepts in the biomedical KG.

PromptLink Framework

PromptLink is a novel biomedical concept linking framework that leverages Large Language Models (LLMs). It first employs a pre-trained language model specialized in biomedicine to generate candidate concepts that fit within the LLM context windows. Then, it utilizes an LLM to link concepts through two-stage prompts. The first-stage prompt aims to elicit biomedical prior knowledge from the LLM for the concept linking task, while the second-stage prompt compels the LLM to reflect on its own predictions to further enhance their reliability. The overview of the PromptLink Framework is illustrated in the following figure.

framework

Figure 2: Overview of our proposed PromptLink framework.
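The two-stage prompting described above can be sketched in a few lines. This is a minimal illustration, not the repository's actual prompts: the prompt wording, the `gpt-3.5-turbo` model name, and the helper names below are assumptions, and the API call uses the `openai==0.28.1` `ChatCompletion` interface listed in the requirements.

```python
# Minimal sketch of two-stage prompting for concept linking.
# The prompt text and model name are illustrative assumptions,
# not the exact prompts used in the paper.
from typing import List


def first_stage_prompt(query: str, candidates: List[str]) -> str:
    """Stage 1: elicit biomedical prior knowledge by asking the LLM to
    pick the candidate that denotes the same concept as the query."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return (
        f'Which of the following candidate concepts refers to the same '
        f'biomedical concept as "{query}"?\n{numbered}\n'
        "Answer with the number of the best candidate."
    )


def second_stage_prompt(query: str, prediction: str) -> str:
    """Stage 2: ask the LLM to reflect on its own first-stage prediction
    to improve reliability."""
    return (
        f'You previously linked "{query}" to "{prediction}". '
        "Reflect on this prediction: is it correct? Answer yes or no, "
        "and briefly justify your answer."
    )


def ask_llm(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Single chat call (requires the openai package and an API key)."""
    import openai  # deferred import so prompt building works without it

    resp = openai.ChatCompletion.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp["choices"][0]["message"]["content"]


if __name__ == "__main__":
    p1 = first_stage_prompt("heart attack", ["myocardial infarction", "cardiac arrest"])
    print(p1)
```

In this scheme, the stage-1 answer would be fed into `second_stage_prompt` for self-verification before accepting the link.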

Package

The "requirements.txt" file can be used to install the Python packages automatically.

  • python==3.8.10

  • editdistance==0.6.2

  • fire==0.5.0

  • numpy==1.19.5

  • openai==0.28.1

  • pandas==1.3.4

  • rank_bm25==0.2.2

  • scipy==1.12.0

  • simstring-fast==0.3.0

  • textdistance==4.6.1

  • torch==1.10.0+cu111

  • tqdm==4.66.1

  • transformers==4.33.3
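Assuming a standard pip setup, the pinned dependencies above can be installed with:

```shell
# Create an isolated environment and install the pinned dependencies.
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Note that the CUDA-specific build `torch==1.10.0+cu111` may require installing from the PyTorch wheel index rather than plain PyPI.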

Data

We curate two biomedical concept linking benchmark datasets, MIID (MIMIC-III-iBKH-Disease) and CISE (CRADLE-iBKH-Side-Effect), using data from the MIMIC-III EHR dataset (MIMIC Link), the CRADLE EHR dataset (a private EHR dataset collected from a large healthcare system in the United States), the iBKH KG dataset (iBKH Link), and the UMLS coding system (UMLS Link). Due to the sensitive nature of medical data and privacy considerations, there are restrictions on data sharing; appropriate training and credentials may be required to gain access to these medical datasets. For assistance with data access or other related inquiries, please feel free to reach out to our author team.

Code

Most of the code is stored in three folders: "gen_candidates", "gen_gpt_responses", and "baselines". More details can be found within each folder.

  • Folder "gen_candidates": This folder contains the code for PromptLink's concept representation and candidate generation process.

  • Folder "gen_gpt_responses": This folder shows how PromptLink leverages the LLM to obtain the final prediction.

  • Folder "baselines": This folder contains the code for running all compared baseline methods, including BM25, Levenshtein Distance, BioBERT, and SAPBERT.
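As a rough illustration of name-only candidate generation (and of string-similarity baselines such as Levenshtein distance), KG concept names can be ranked against a query name and the top-k kept. This sketch substitutes standard-library `difflib` similarity for the biomedical PLM (SapBERT) embeddings used in the actual code, so it is an assumption-laden stand-in, not the repository's method.

```python
# Toy name-only candidate generator in the retrieve-and-rank spirit.
# difflib string similarity stands in for the PLM embedding similarity
# used in the real pipeline.
from difflib import SequenceMatcher
from typing import List, Tuple


def generate_candidates(
    query: str, kg_concepts: List[str], k: int = 3
) -> List[Tuple[str, float]]:
    """Rank KG concept names by surface similarity to the query, keep top-k."""
    scored = [
        (concept, SequenceMatcher(None, query.lower(), concept.lower()).ratio())
        for concept in kg_concepts
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    kg = [
        "myocardial infarction",
        "type 2 diabetes mellitus",
        "heart failure",
        "hypertension",
    ]
    print(generate_candidates("heart failure nos", kg, k=2))
```

The top-k candidates produced this way are what would then be placed into the LLM prompt for the two-stage linking step.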


promptlink's Issues

Access to Data or Embedding

Hello,
Thank you for your work on this project. I am particularly interested in the methods and data you've used in your experiments.

  1. Access to Data/Embeddings: Is it possible to obtain access to the embeddings or the dataset utilized in your experiments? Access to these resources would greatly assist others in the community by enabling replication and further research based on your findings.
  2. Could you clarify whether the SAPBert model is solely used for generating entity mentions (names only) and concept mentions from vocabularies like SNOMED and ICD-10? It appears that the system relies on SAPBert as a retriever. If so, could you explain how this setup is distinct from using SAPBert semantic search as mentioned in the SAPBert evaluation code on GitHub? Specifically, I am interested in understanding the comparisons drawn in your results.
