monarch-initiative / curate-gpt

LLM-driven curation assist tool (pre-alpha)

Home Page: https://monarch-initiative.github.io/curate-gpt/

License: BSD 3-Clause "New" or "Revised" License

Python 23.21% Makefile 0.29% Jupyter Notebook 76.50%
ai curation gpt llm monarchinitiative obofoundry ontogpt ontologies ontology-tools biocuration

curate-gpt's People

Contributors

caufieldjh, cmungall, hrshdhgd, justaddcoffee, oneilsh, realmarcin

curate-gpt's Issues

load-db-hpoa_by_pub should stream output

Currently this loader only generates output at the end. It does this because it needs to aggregate by publication, but the strategy is still pretty naive, and very inconvenient.

if self.group_by_publication:
    for pub in by_pub.values():
        yield pub

Instead it should (see the sketch after this list):

  1. load all hpoa as one TSV
  2. aggregate by pub
  3. index these one at a time, yielding results
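A minimal sketch of what the streaming version might look like, assuming a hypothetical parse step that yields HPOA row dicts with a publication field (the names here are illustrative, not the actual loader API):

from collections import defaultdict
from typing import Dict, Iterator, List

def stream_by_publication(rows: Iterator[Dict]) -> Iterator[Dict]:
    """Aggregate HPOA rows by publication, then yield one object per publication."""
    by_pub: Dict[str, List[Dict]] = defaultdict(list)
    # The aggregation still needs a full pass over the TSV...
    for row in rows:
        by_pub[row["publication"]].append(row)
    # ...but the aggregated objects are yielded one at a time, so the indexer
    # can embed and store them incrementally rather than waiting for the end.
    for pub_id, pub_rows in by_pub.items():
        yield {"publication": pub_id, "associations": pub_rows}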

@julesjacobsen

Bypass OpenAI server overload and HTTP 500 Error

This issue was already handled in this old PR but it had a commit mistake. A new PR will be made to have a cleaner history.

When loading ontologies into CurateGPT the insertion of the data into chromaDB is very often interrupted because of a server overload on the API side.

openai.error.ServiceUnavailableError: The server is overloaded or not ready yet.

Implementing an exponential_backoff_request helped me bypass this by retrying with a progressively longer sleep each time it failed.
It's not a fancy solution, but it gets the job done.

Another frequently occurring problem is an HTTP 500 error, which could also be caught.
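A minimal sketch of the kind of exponential_backoff_request helper described above, assuming the pre-1.0 openai package (matching the openai.error exceptions shown in the logs below); the function and parameter names are illustrative:

import time
import openai

def exponential_backoff_request(func, *args, max_retries=6, base_delay=1.0, **kwargs):
    """Call func, retrying with exponentially growing sleeps on transient API errors."""
    for attempt in range(max_retries):
        try:
            return func(*args, **kwargs)
        except (openai.error.ServiceUnavailableError, openai.error.APIError):
            # ServiceUnavailableError covers "server is overloaded"; APIError covers HTTP 500s.
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# e.g. exponential_backoff_request(openai.Embedding.create, input=texts, model="text-embedding-ada-002")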

poetry run curategpt ontology index --index-fields label,definition,relationships -p stagedb -c ont_mp -m openai: sqlite:obo:mp
Configuration file exists at /Users/carlo/Library/Preferences/pypoetry, reusing this directory.

Consider moving TOML configuration files to /Users/carlo/Library/Application Support/pypoetry, as support for the legacy directory will be removed in an upcoming release.
WARNING:curate_gpt.store.chromadb_adapter:Cumulative length = 3040651, pausing ...
WARNING:curate_gpt.store.chromadb_adapter:Cumulative length = 3010451, pausing ...
ERROR:curate_gpt.store.chromadb_adapter:Failed to process batch after retries: The server is overloaded or not ready yet.
poetry run curategpt ontology index --index-fields label,definition,relationships -p stagedb -c ont_mondo -m openai: sqlite:obo:mondo
Configuration file exists at /Users/carlo/Library/Preferences/pypoetry, reusing this directory.

Consider moving TOML configuration files to /Users/carlo/Library/Application Support/pypoetry, as support for the legacy directory will be removed in an upcoming release.
ERROR:curate_gpt.store.chromadb_adapter:Failed to process batch after retries: The server had an error while processing your request. Sorry about that! {
  "error": {
    "message": "The server had an error while processing your request. Sorry about that!",
    "type": "server_error",
    "param": null,
    "code": null
  }
}
 500 {'error': {'message': 'The server had an error while processing your request. Sorry about that!', 'type': 'server_error', 'param': None, 'code': None}} {'Date': 'Wed, 17 Jan 2024 11:49:08 GMT', 'Content-Type': 'application/json', 'Content-Length': '176', 'Connection': 'keep-alive', 'access-control-allow-origin': '*', 'openai-organization': 'lawrence-berkeley-national-laboratory-8', 'openai-processing-ms': '867', 'openai-version': '2020-10-01', 'strict-transport-security': 'max-age=15724800; includeSubDomains', 'x-ratelimit-limit-requests': '10000', 'x-ratelimit-limit-tokens': '10000000', 'x-ratelimit-remaining-requests': '9999', 'x-ratelimit-remaining-tokens': '9998545', 'x-ratelimit-reset-requests': '6ms', 'x-ratelimit-reset-tokens': '8ms', 'x-request-id': '1751b1c8c5e4386f901047e4380709fb', 'CF-Cache-Status': 'DYNAMIC', 'Server': 'cloudflare', 'CF-RAY': '846e5f804ecd79c3-LHR', 'alt-svc': 'h3=":443"; ma=86400'}

use agent-smith-ai to wrap additional endpoints

Currently curategpt allows both static and dynamic wrappers.

  • Static: loaded in advance
  • Dynamic: requires the backend to support some kind of relevancy-backed search

This restricts us to ETL-able objects or things like PubMed.

If we integrate agent-smith
https://github.com/monarch-initiative/agent-smith-ai

then we can wrap many more knowledge sources.

The basic workflow here would be:

  • the user asks a general knowledge question
  • agent-smith figures out the correct APIs and issues the query
  • the chat agent wraps the results in a text blob with citations
  • the user clicks curate and the text is auto-structured according to the schema

Chat interface uses incorrect OpenAI API key

Setting the OpenAI API key as stated in the README may not consistently set it in a way the app can access.
If I do the following:

$ export OPENAI_API_KEY=(key_here)
$ make ont-maxo
$ cp -r stagedb/* db/
$ make app

and then use the Chat interface, I get an authentication error:

openai.error.AuthenticationError: Incorrect API key provided: sk-22ENy***************************************VnKr. You can find your API key at https://platform.openai.com/account/api-keys.
2024-02-02 12:55:41.247 Removing orphaned files...
2024-02-02 12:55:41.448 Script run finished successfully; removing expired entries from MessageCache (max_age=2)

That's not my API key...but it is one I have used in the past!
It's not active anymore so it won't work here, and I'm not certain where CurateGPT is finding it, or why it isn't using the one I just set in OPENAI_API_KEY.

Add command to load existing embeddings into a collection

See #35

So for example you can do this:

curategpt embeddings index /path/to/local/embeddings.parquet

or this:

curategpt embeddings index https://huggingface.co/datasets/biomedical-translator/monarch_kg_embeddings/resolve/main/deepwalk_embedding.parquet?download=true -f parquet
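A minimal sketch of what such a command might do under the hood, assuming the parquet file provides an id column plus an embedding column of vectors (the column names and the helper name are illustrative, not an existing curategpt API):

import chromadb
import pandas as pd

def load_embeddings_from_parquet(path: str, collection_name: str, db_path: str = "db") -> None:
    """Insert pre-computed embeddings from a parquet file into a chromadb collection."""
    df = pd.read_parquet(path)  # local path; a remote URL would need to be downloaded first
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(collection_name)
    # Assumed layout: an "id" column plus an "embedding" column holding list-like vectors.
    collection.add(
        ids=df["id"].astype(str).tolist(),
        embeddings=[list(v) for v in df["embedding"]],
    )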

Add a chat interface

The current way of interacting with the app is incredibly clunky: multiple selectors on the left, confusing, and easy to select the wrong thing.

It should be a chat interface like @oneilsh's PA interface.

There could be power-user CLI commands that bypass GPT, e.g.

  • !ont_hp/search liver phenotype
  • !ont_envo/chat[gpt4] what are volcanoes?
  • !maxoa/extract PMID:123456

But anything else should be passed to a model and use ReAct to trigger the appropriate action.
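A minimal sketch of how the power-user command routing could work; parse_command and the bang-syntax handling here are hypothetical, not the actual app's API:

import re
from typing import Optional, Tuple

COMMAND_RE = re.compile(r"^!(?P<collection>\w+)/(?P<command>\w+)(?:\[(?P<model>[^\]]+)\])?\s*(?P<args>.*)$")

def parse_command(text: str) -> Optional[Tuple[str, str, Optional[str], str]]:
    """Parse '!ont_hp/search liver phenotype' style power-user commands."""
    m = COMMAND_RE.match(text.strip())
    if m is None:
        return None  # not a bang command; hand the message to the model (ReAct-style routing)
    return m["collection"], m["command"], m["model"], m["args"]

# parse_command("!ont_envo/chat[gpt4] what are volcanoes?")
# -> ("ont_envo", "chat", "gpt4", "what are volcanoes?")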

Support triple extraction use case

In discussion with the RNA-KG group (Marco Mesiti, Elena Casiraghi, Emanuele Cavalleri) and @justaddcoffee -
we would like to be able to extract triples (s, p, o) from a provided text, using graph embeddings to guide the process.
The goal is to find additional content for RNA-KG. Using OntoGPT has worked well for this so far but does not take advantage of the existing relations within the KG.

This would involve:

  • Including an interface (CLI and/or GUI) to use a text document as input
  • Providing a way to index KGX and/or derive a schema from it
  • Building a wrapper for graph embeddings.
    • Using GRAPE directly through this project would be a heavy lift, so retrieving embeddings from an external source like Huggingface would likely work better, save time, and avoid introducing many new dependencies (see the sketch after this list)
  • Writing documentation for the above
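A minimal sketch of how pre-computed graph embeddings could guide the process, assuming an embeddings dict mapping node CURIEs to vectors (for example, built from the Monarch KG embeddings parquet referenced in the issue above); rank_candidate_objects is an illustrative name, not an existing API:

import numpy as np
from typing import Dict, List, Tuple

def rank_candidate_objects(
    subject_id: str,
    candidate_ids: List[str],
    embeddings: Dict[str, np.ndarray],
) -> List[Tuple[str, float]]:
    """Rank candidate object nodes by cosine similarity to the subject node."""
    subj = embeddings[subject_id]
    scores = []
    for cand in candidate_ids:
        vec = embeddings[cand]
        sim = float(np.dot(subj, vec) / (np.linalg.norm(subj) * np.linalg.norm(vec)))
        scores.append((cand, sim))
    # Nodes the KG already places close together score higher, which can be used to
    # prioritize or sanity-check LLM-extracted (s, p, o) triples.
    return sorted(scores, key=lambda x: x[1], reverse=True)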

Integrating some process for comparison of the extracted triples would be ideal (e.g., A vs B appears in 20 documents, 15 of them from different sources, etc).

The RNA-KG group has also suggested trying LlamaIndex (https://www.llamaindex.ai/) to see if it works better for RAG with KG data.

Figure out chromadb slowness issues

At some point, after loading a certain number of sources, simple metadata/peek operations start going incredibly slowly.

Even curategpt collections list takes ~1m

This causes the UI to slow down too

I am pretty sure this is something in the chromadb layer (likely in the metadata extraction), not something we are doing on top.

consider chromadb alternatives

chroma is very slow on the EC2 instance. I don't think the issue is with any fancy vector operations - it's just basic lookup operations and extracting metadata for a collection that seem to be slow.

I am not sure we actually need a vector database. It may be better to use a dedicated document store that has some kind of vector plugin

Solr/ES have vector extensions. However, I don't think this would be good as a primary store for editable data.

SQLite has vector extensions (https://simonwillison.net/2023/Oct/23/embeddings/) -- this seems to require a plugin (https://github.com/asg017/sqlite-vss), which may make the overall build more complicated...

The native datamodel for curategpt is JSON documents, so using MongoDB as a base would be a good fit. There is Atlas (https://www.mongodb.com/products/platform/atlas-vector-search), but this seems to force some kind of cloud deployment -- it might be a good way to go, but we want the option to keep it simple with local files.

cc @julesjacobsen

Should 'GPT' be used in the app name?

I'm concerned about the use of 'GPT' in the name of this app, since (presumably) this app is not affiliated with OpenAI, and their brand guidelines explicitly forbid this usage:

We do not permit the use of OpenAI models or "GPT" in product or app names because it confuses end users.

See https://openai.com/brand

Unless you strongly disagree with OpenAI's position, you might want to consider adding an obvious disclaimer to the top of the readme (and anywhere else that's applicable) that this app is not affiliated with, endorsed, or sponsored by OpenAI, or rename the app.

Evaluate groq

Groq has jaw-droppingly fast access to Mixtral. Currently you can use the UI and API at no cost. There is throttling, but it seems quite generous.

It's easy to use via the awesome litellm.

See https://github.com/monarch-initiative/curate-gpt/blob/main/README.md#selecting-models for general setup

First make sure you are up to date:

pipx upgrade litellm

Then fire it up:

litellm -m groq/mixtral-8x7b-32768

Add this to extra-openai-models.yaml as detailed in the llm docs:

- model_name: litellm-groq-mixtral
  model_id: litellm-groq-mixtral
  api_base: "http://0.0.0.0:8000"

You can use the CLI: llm -m litellm-groq-mixtral "10 names for a pet pelican"
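It can also be called from Python via the llm library's API (a brief sketch; assumes the litellm proxy above is running and the model is registered in extra-openai-models.yaml):

import llm

model = llm.get_model("litellm-groq-mixtral")
response = model.prompt("10 names for a pet pelican")
print(response.text())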

Include examples for using the annotate command

curategpt includes an annotate command that functions as traditional text annotation / concept recognition (CR). Give it some text, and it will give back ontology term IDs. It's not guaranteed to find the spans, but that could be done as post-processing; the priority has been to find the concepts.

annotate has a --method option with values:

  • inline
  • concept_list
  • two_pass

Under the hood it uses https://github.com/monarch-initiative/curate-gpt/blob/main/src/curate_gpt/agents/concept_recognition_agent.py

We really need to (a) have better docstrings and (b) expose this via sphinx... but for now this issue serves as temporary docs.

We'll use these texts as a running example:

  • A minimum diagnostic criterion is the combination of either the skin tumours
    or multiple odontogenic keratocysts plus a positive family history for this disorder,
    bifid ribs, lamellar calcification of the falx cerebri or any one of the skeletal
    abnormalities typical of this syndrome
  • A clinical concept has been produced, with a diagnostic check list including
    a genetic and a dermatological routine work up as well as a radiological survey
    of the jaws and skeleton

And assume we have HPO pre-indexed using the standard curate-gpt loader.
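For reference, that pre-indexing step would look something like the ontology index commands shown in the earlier issue, adapted for HP (the collection name and path just need to match what you pass to annotate below):

curategpt ontology index --index-fields label,definition,relationships -p stagedb -c ont_hp -m openai: sqlite:obo:hp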

inline

This is a RAG-based approach that finds the N most relevant concepts in the given ontology (pre-indexed in chromadb). It then presents this in the (system) prompt as a CSV of id, label pairs.

This method is designed to return the annotated spans "inlined" into the existing text, via this prompt:

Your role is to annotate the supplied text with selected concepts.
return the original text with each conceptID in square brackets.
After the occurrence of that concept.
You can use synonyms. For example, if the concept list contains
zucchini // DB:12345
Then for the text 'I love courgettes!' you should return
'I love [courgettes DB:12345]!'
Always try and match the longest span.
the concept ID should come only from the list of candidate concepts supplied to you.

Ideally the system prompt will include DB:12345,courgettes, but the chance of this diminishes with the size of the input document and, to a lesser extent, the size of the ontology.
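A minimal sketch of how the candidate-concept CSV could be assembled from the pre-indexed collection; the chromadb calls are standard, but the collection layout (label as the document, original_id in the metadata) is an assumption here for illustration:

import chromadb

def candidate_concepts_csv(text: str, collection_name: str = "ont_hp", db_path: str = "stagedb", n: int = 50) -> str:
    """Retrieve the N concepts most relevant to the text and format them as 'id,label' lines."""
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_collection(collection_name)
    results = collection.query(query_texts=[text], n_results=n)
    lines = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        # Assumed layout: the indexed document text holds the label; metadata holds the ontology ID.
        lines.append(f'{meta.get("original_id", "?")},{doc}')
    return "\n".join(lines)  # prepended to the system prompt as the candidate concept list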

Example output from:

curategpt annotate -M inline -m gpt-4-1106-preview --prefix HP --category PhenotypicFeature -l 50 -c ont_hp -I original_id -p stagedb -s -i tests/input/example-disease-text.txt

The output annotated text in one run is:

A minimum diagnostic criterion is the combination of either the skin
  tumours or multiple [odontogenic keratocysts HP:0010603] plus a positive family
  history for this disorder, [bifid ribs HP:0030280], [lamellar calcification of the
  falx cerebri HP:0005462] or any one of the skeletal abnormalities typical of this
  syndrome

That was an easy one, since the mentions in the text are more or less exact matches with HPO.

For the other text:

A clinical concept has been produced, with a diagnostic check list
  including a genetic and a [dermatological routine work up HP:0001005] as well as
  a radiological survey of the [jaws HP:0000347] and [skeleton HP:0033127].

Hmm. We can see the concepts here:

spans:
- text: dermatological routine work up
  start: null
  end: null
  concept_id: HP:0001005
  concept_label: Dermatological manifestations of systemic disorders
  is_suspect: false
- text: jaws
  start: null
  end: null
  concept_id: HP:0000347
  concept_label: Micrognathia
  is_suspect: false
- text: skeleton
  start: null
  end: null
  concept_id: HP:0033127
  concept_label: Abnormality of the musculoskeletal system
  is_suspect: false

So it's getting creative, and this is wrong: the actual phenotype in GG is jaw cysts, not small jaws...

concept list

This is similar to inline but doesn't attempt to relate the match to a span of text.

This is a RAG-based approach that finds the N most relevant concepts in the given ontology (pre-indexed in chromadb). It then presents this in the (system) prompt as a CSV of id, label pairs.

It uses the prompt:

Your role is to list all instances of the supplied candidate concepts in the supplied text.
Return the concept instances as a CSV of ID,label,text pairs, where the ID
is the concept ID, label is the concept label, and text is the mention of the
concept in the text.
The concept ID and label should come only from the list of candidate concepts supplied to you.
Only include a row if the meaning of the text section is the same as the concept.
If there are no instances of a concept in the text, return an empty string.
Do not include additional verbiage.

For the easier text:

curategpt annotate -M concept_list -m gpt-4-1106-preview --prefix HP --category PhenotypicFeature -l 50 -c ont_hp -I original_id -p stagedb -s -i tests/input/example-disease-text.txt

- text: '"odontogenic keratocysts"'
  start: null
  end: null
  concept_id: HP:0010603
  concept_label: Odontogenic keratocysts of the jaw
  is_suspect: false
- text: '"calcification of the falx cerebri"'
  start: null
  end: null
  concept_id: HP:0005462
  concept_label: Calcification of falx cerebri
  is_suspect: false
- text: '"bifid ribs"'
  start: null
  end: null
  concept_id: HP:0030280
  concept_label: Rib gap
  is_suspect: false

For the harder text it has lower recall, but nothing is, IMO, outright wrong:

spans:
- text: radiological survey of the jaws
  start: null
  end: null
  concept_id: HP:0010603
  concept_label: Odontogenic keratocysts of the jaw
  is_suspect: false
- text: radiological survey of the skeleton
  start: null
  end: null
  concept_id: HP:0033127
  concept_label: Abnormality of the musculoskeletal system
  is_suspect: false

two pass

This does a first pass where it asks for all concepts found in the doc to be listed (no RAG, very vanilla ChatGPT usage). These are requested as human-readable terms, not IDs, to limit hallucination. It asks for the concepts to be inlined in square brackets.

Then a second pass is done on each concept, essentially grounding it. The grounding DOES use the concept_list/inline RAG method above, but in theory this should be more accurate and include the relevant concepts within the cutoff, since we are just grounding rather than presenting the whole text.
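A minimal sketch of the two-pass flow, with hypothetical llm_list_concepts and ground_concept helpers standing in for the real agent internals:

import re
from typing import List, Optional, Tuple

BRACKETED = re.compile(r"\[([^\]]+)\]")

def two_pass_annotate(text: str) -> List[Tuple[str, Optional[str]]]:
    """Pass 1: ask the model to bracket mentions; pass 2: ground each mention via RAG."""
    # Pass 1 (no RAG): the model returns the text with concept mentions in square brackets,
    # as human-readable terms rather than IDs, to limit hallucination.
    annotated = llm_list_concepts(text)                      # hypothetical helper
    mentions = BRACKETED.findall(annotated)
    # Pass 2 (RAG): each mention is grounded separately against its own top-k candidates,
    # so the right concept is much more likely to fall within the retrieval cutoff.
    return [(m, ground_concept(m, context=text)) for m in mentions]  # hypothetical helper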

The grounding prompt is:

Your role is to assign a concept ID that best matches the supplied text, using
the supplied list of candidate concepts.
Return as a string "CONCEPT NAME // CONCEPT ID".
Only return a result if the input text represents the same or equivalent
concept, in the provided context.
If there is no match, return an empty string.

Let's see how it does on the easier text:

curategpt annotate -M two_pass -m gpt-4-1106-preview --prefix HP --category PhenotypicFeature -l 5 -c ont_hp -I original_id -p stagedb -s -i tests/input/example-disease-text.txt

(We can use a lower value for -l, as for grounding the top 5 is likely to include the right concept [untested].)

annotated_text: A minimum diagnostic criterion is the combination of either the [skin
  tumours] or multiple [odontogenic keratocysts] plus a positive family history for
  this disorder, [bifid ribs], [lamellar calcification of the falx cerebri] or any
  one of the [skeletal abnormalities] typical of this syndrome
spans:
- text: skin tumours
  start: null
  end: null
  concept_id: HP:0008069
  concept_label: Neoplasm of the skin
  is_suspect: false
- text: odontogenic keratocysts
  start: null
  end: null
  concept_id: HP:0010603
  concept_label: Odontogenic keratocysts of the jaw
  is_suspect: false
- text: bifid ribs
  start: null
  end: null
  concept_id: HP:0000892
  concept_label: Bifid ribs
  is_suspect: false
- text: lamellar calcification of the falx cerebri
  start: null
  end: null
  concept_id: HP:0005462
  concept_label: Calcification of falx cerebri
  is_suspect: false
- text: skeletal abnormalities
  start: null
  end: null
  concept_id: HP:0011842
  concept_label: Abnormal skeletal morphology
  is_suspect: false

And the harder one:

annotated_text: A clinical concept has been produced, with a diagnostic check list
  including a [genetic] and a [dermatological] routine work up as well as a [radiological]
  survey of the [jaws] and [skeleton].
spans:
- text: jaws
  start: null
  end: null
  concept_id: HP:0012802
  concept_label: Broad jaw
  is_suspect: false
- text: skeleton
  start: null
  end: null
  concept_id: C0037253
  concept_label: null
  is_suspect: false
