darpa-askem / gollm

Service to run GoLLM and define its tools.

License: Apache License 2.0

HCL 2.47% Python 97.08% Dockerfile 0.45%

gollm's Introduction

GoLLM

This repository contains endpoints for various Terarium LLM workflows.

Getting Started

Running the API

cd into the repository root
run: docker build -t gollm .
run: docker run -p 8000:8000 -e OPENAI_API_KEY=$OPENAI_API_KEY gollm

AMR configuration from paper and AMR

Once the API has been started, the /configure endpoint will consume a JSON with the structure:
{ research_paper: str, amr: obj }

The API will return a model configuration candidate with the structure:

{response: obj}

where `response` contains the AMR populated with configuration values.

Note: This is a work in progress. It is unoptimized and currently serves as a test case for integrating LLM features with Terarium.
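
As a rough illustration of the request described above, the following sketch builds and sends a /configure call using only the Python standard library. The base URL follows from the `docker run -p 8000:8000` command; the use of POST and the helper names `build_configure_request` and `configure` are assumptions, not part of the service.

```python
import json
import urllib.request

# Assumption: the container from "Running the API" is listening locally on port 8000.
BASE_URL = "http://localhost:8000"


def build_configure_request(research_paper: str, amr: dict) -> urllib.request.Request:
    """Build (but do not send) a request for the /configure endpoint.

    Assumption: the endpoint accepts a JSON body via POST.
    """
    body = json.dumps({"research_paper": research_paper, "amr": amr}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/configure",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def configure(research_paper: str, amr: dict) -> dict:
    """Send the request and return the populated AMR stored under `response`."""
    request = build_configure_request(research_paper, amr)
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["response"]
```

Splitting request construction from sending keeps the payload shape easy to inspect without a running container.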

AMR model card from paper

Once the API has been started, the /model_card endpoint will consume a JSON with the structure:

{ research_paper: str }

The API will return a model card in JSON format:

{response: obj}

Note: This is a WIP
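
A minimal sketch of the /model_card exchange, assuming the body is sent as JSON and the reply uses the `{response: obj}` envelope described above. The helper names are hypothetical.

```python
import json


def build_model_card_payload(research_paper: str) -> str:
    """Serialize the request body expected by /model_card."""
    return json.dumps({"research_paper": research_paper})


def unwrap_response(raw_body: str) -> dict:
    """Return the object stored under `response`, or raise if the envelope is malformed."""
    body = json.loads(raw_body)
    if not isinstance(body, dict) or "response" not in body:
        raise ValueError("expected a JSON object with a 'response' key")
    return body["response"]
```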

License

Apache License 2.0

gollm's Issues

Enrich AMR Task

@pascaleproulx

I have a working GoLLM task for AMR enrichment. We need to integrate it with the rest of the task runner architecture @kbirk. Example outputs below.

BIOMD0000000024

{
  "initials": {
    "protein": {
      "description": "Relative concentration of the effective protein, which is in the molecular state capable of inhibiting mRNA production.",
      "unit": "dimensionless"
    },
    "mRNA": {
      "description": "Relative concentration of mRNA, which is involved in the production of the effective protein.",
      "unit": "dimensionless"
    }
  },
  "parameters": {
    "k": {
      "description": "Scaling constant used in the nonlinear term of the mRNA production rate equation.",
      "unit": "dimensionless"
    },
    "n": {
      "description": "Hill coefficient representing the cooperativity in the negative feedback loop of protein on mRNA production.",
      "unit": "dimensionless"
    },
    "rM": {
      "description": "Scaled mRNA production rate constant.",
      "unit": "hr^-1"
    },
    "m": {
      "description": "Exponent representing the nonlinearity in the protein production cascade.",
      "unit": "dimensionless"
    },
    "parameter_0000009": {
      "description": "Not explicitly described in the provided text.",
      "unit": "N/A"
    },
    "rP": {
      "description": "Protein production rate constant.",
      "unit": "hr^-1"
    },
    "qM": {
      "description": "mRNA degradation rate constant.",
      "unit": "hr^-1"
    },
    "qP": {
      "description": "Protein degradation rate constant.",
      "unit": "hr^-1"
    },
    "compartment_0000004": {
      "description": "Not explicitly described in the provided text.",
      "unit": "N/A"
    }
  }
}

BIOMD0000001048

{
  "initials": {
    "Ttum": {
      "description": "Cell concentration of the original tumor",
      "unit": "cells/ml"
    },
    "Tplas": {
      "description": "Cancer cell concentration in the plasma",
      "unit": "cells/ml"
    },
    "Tnew": {
      "description": "Cell concentration of new and developing tumor",
      "unit": "cells/ml"
    }
  },
  "parameters": {
    "b": {
      "description": "Relative drug efficacy factor for specific growth rate",
      "unit": "dimensionless"
    },
    "kf1": {
      "description": "Rate constant for cell release from the original tumor to plasma",
      "unit": "day^-1"
    },
    "kr1": {
      "description": "Rate constant for cell attachment from plasma to the original tumor",
      "unit": "day^-1"
    },
    "c": {
      "description": "Rate constant for plasma clearance",
      "unit": "day^-1"
    },
    "d": {
      "description": "Relative drug efficacy factor for plasma clearance",
      "unit": "dimensionless"
    },
    "kf2": {
      "description": "Rate constant for cell release from plasma to new tumor",
      "unit": "day^-1"
    },
    "kr2": {
      "description": "Rate constant for cell attachment from new tumor to plasma",
      "unit": "day^-1"
    },
    "T0": {
      "description": "Equilibrium tumor cell concentration in the tumor",
      "unit": "cells/ml"
    },
    "a": {
      "description": "Relative drug efficacy factor for cell release rate",
      "unit": "dimensionless"
    },
    "r": {
      "description": "Specific growth rate of tumor cells",
      "unit": "day^-1"
    },
    "n": {
      "description": "Number of new tumors being developed simultaneously",
      "unit": "dimensionless"
    },
    "Tumor": {
      "description": "General term for tumor cell concentration",
      "unit": "cells/ml"
    }
  }
}
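
Some entries in the outputs above carry placeholder text (for example `parameter_0000009`). A small post-processing sketch such as the following could flag them for review. The placeholder strings are taken from these two examples only, and the function name is hypothetical.

```python
# Placeholder strings observed in the example outputs; other placeholders
# may exist, so treat these values as assumptions.
PLACEHOLDER_UNITS = {"N/A", "n/a"}
PLACEHOLDER_PREFIX = "Not explicitly described"


def find_unresolved(enrichment: dict) -> list[str]:
    """List the initials and parameters the LLM could not describe."""
    unresolved = []
    for section in ("initials", "parameters"):
        for name, entry in enrichment.get(section, {}).items():
            unit = entry.get("unit")
            description = entry.get("description", "")
            if unit in PLACEHOLDER_UNITS or description.startswith(PLACEHOLDER_PREFIX):
                unresolved.append(f"{section}.{name}")
    return unresolved
```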

Using MathML for units

MODEL8262229752

{
  "initials": {},
  "parameters": {
    "b_reac_r": {
      "description": "Bio-reaction rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "AdoMet_r": {
      "description": "Methionine adenosyl transfer rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "Methy_trans": {
      "description": "Methyl transfer rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "SAM_Dec": {
      "description": "SAM decarboxylation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "Spermi_uti": {
      "description": "Spermidine utilization rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "MTR_e": {
      "description": "MTR excretion rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "Polyamine_uti": {
      "description": "Polyamine utilization rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "Pfs_prot_d": {
      "description": "Pfs protein degradation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "pfs_transl": {
      "description": "Pfs translation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "pfs_mRNA_d": {
      "description": "Pfs mRNA degradation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "pfs_transc": {
      "description": "Pfs transcription rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "Met_recov": {
      "description": "Methionine recovery rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "DPD_deg_r": {
      "description": "DPD degradation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "LuxS_prot_d": {
      "description": "LuxS protein degradation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "LuxS_transl": {
      "description": "LuxS translation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "LuxS_mRNA_d": {
      "description": "LuxS mRNA degradation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "LuxS_transc": {
      "description": "LuxS transcription rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "AI2_syn_r": {
      "description": "AI-2 synthesis rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "AI2_excret_r": {
      "description": "AI-2 excretion rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "AI2_trans_r": {
      "description": "AI-2 transport rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "AI2_phos_r": {
      "description": "AI-2 phosphorylation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "MTR_syn_r": {
      "description": "MTR synthesis rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "SAH_Hydro_r": {
      "description": "SAH hydrolysis rate constant",
      "units": {
        "expression": "1/(M*min)",
        "expression_mathml": "<apply><divide/><cn>1</cn><apply><times/><ci>M</ci><ci>min</ci></apply></apply>"
      }
    },
    "SRH_cleav": {
      "description": "SRH cleavage rate constant",
      "units": {
        "expression": "1/(M*min)",
        "expression_mathml": "<apply><divide/><cn>1</cn><apply><times/><ci>M</ci><ci>min</ci></apply></apply>"
      }
    },
    "SpeE_syn_r": {
      "description": "Spermidine synthesis rate constant",
      "units": {
        "expression": "1/(M*min)",
        "expression_mathml": "<apply><divide/><cn>1</cn><apply><times/><ci>M</ci><ci>min</ci></apply></apply>"
      }
    },
    "compartment": {
      "description": "Compartment for the reactions",
      "units": {
        "expression": "n/a",
        "expression_mathml": "<ci>n/a</ci>"
      }
    }
  }
}

MODEL9086926384

{
  "initials": {},
  "parameters": {
    "kb": {
      "description": "Rate constant for the backward reaction in the model.",
      "units": {
        "expression": "1/s",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>s</ci></apply>"
      }
    },
    "kf": {
      "description": "Rate constant for the forward reaction in the model.",
      "units": {
        "expression": "1/s",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>s</ci></apply>"
      }
    },
    "default_compartment": {
      "description": "The default compartment where the reactions take place, typically representing the synaptic or cytosolic volume.",
      "units": {
        "expression": "fl",
        "expression_mathml": "<ci>fl</ci>"
      }
    },
    "geometry": {
      "description": "The geometric configuration of the synaptic and cytosolic compartments, including their volumes.",
      "units": {
        "expression": "fl",
        "expression_mathml": "<ci>fl</ci>"
      }
    },
    "geometry_sbo_1_sbc_": {
      "description": "Specific geometric parameter related to the synaptic and cytosolic compartments, possibly a scaling factor or specific volume.",
      "units": {
        "expression": "fl",
        "expression_mathml": "<ci>fl</ci>"
      }
    },
    "k1": {
      "description": "Rate constant for a specific reaction involving AMPAR or CaMKII in the model.",
      "units": {
        "expression": "1/s",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>s</ci></apply>"
      }
    },
    "k2": {
      "description": "Rate constant for another specific reaction involving AMPAR or CaMKII in the model.",
      "units": {
        "expression": "1/s",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>s</ci></apply>"
      }
    },
    "k3": {
      "description": "Rate constant for yet another specific reaction involving AMPAR or CaMKII in the model.",
      "units": {
        "expression": "1/s",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>s</ci></apply>"
      }
    }
  }
}
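
The `expression_mathml` values above follow directly from the plain `expression` strings, so they could also be derived deterministically rather than generated by the LLM. The sketch below is one way to do that with Python's `ast` module. It is not how GoLLM produces these fields, and it only handles names, numbers, `*`, `/` and `**`. Placeholders such as "n/a" and notations such as "hr^-1" would need separate handling.

```python
import ast

# Python operator node -> content MathML element name.
_OPERATORS = {ast.Div: "divide", ast.Mult: "times", ast.Pow: "power"}


def unit_to_mathml(expression: str) -> str:
    """Convert a simple unit expression such as "1/(M*min)" to content MathML."""

    def render(node: ast.AST) -> str:
        if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
            operator = _OPERATORS[type(node.op)]
            return f"<apply><{operator}/>{render(node.left)}{render(node.right)}</apply>"
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return f"<cn>{node.value}</cn>"
        if isinstance(node, ast.Name):
            return f"<ci>{node.id}</ci>"
        raise ValueError(f"unsupported unit expression: {expression!r}")

    # mode="eval" parses a single expression; nothing is ever evaluated.
    return render(ast.parse(expression, mode="eval").body)
```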

Investigate using smaller context length models for model card creation

Now that we have gathered baseline results from using GPT-4 to create model cards for ~775 different models, we should look into using a cheaper model for this task.

TODO:

Scope techniques for parsing out model card fields from a research paper using smaller context lengths (i.e., 4096).
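
One technique worth scoping is splitting the paper into overlapping windows that fit the smaller context. The sketch below approximates tokens with whitespace-separated words, which is an assumption. A real implementation would count tokens with the target model's tokenizer and leave room for the prompt and the generated output.

```python
def chunk_text(text: str, max_tokens: int = 4096, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows of at most `max_tokens` words."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk could then be queried for model card fields separately, with the per-chunk answers merged afterwards.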

Q&A for parameters over documents (TBD)

MITRE has requested the ability to query values for a given model parameter over a set of documents.

e.g. "Find me values for the infectivity rate constant (beta) in these documents"

[Task] make sure gollm can extract constants and ranges (min/max)
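
One possible shape for an extracted value that covers both cases is sketched below. The class and field names are hypothetical and do not reflect an existing GoLLM schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractedValue:
    """A parameter value found in a document: a constant, a range, or both."""

    parameter: str
    value: Optional[float] = None
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def __post_init__(self) -> None:
        # A range needs both bounds; a lone minimum or maximum is rejected.
        if (self.minimum is None) != (self.maximum is None):
            raise ValueError("minimum and maximum must be given together")
        if self.value is None and self.minimum is None:
            raise ValueError("either a constant value or a range is required")
        if self.minimum is not None and self.minimum > self.maximum:
            raise ValueError("minimum must not exceed maximum")

    @property
    def is_range(self) -> bool:
        return self.minimum is not None
```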

Use updated format for input models

We need to account for 3 different forms of parameter matrices to reduce ambiguity in configuration pipelines:

subject/outcome
subject/controller
outcome/controller

Update output from HMI and the GoLLM prompts with few-shot examples.

Create a Dockerfile to build the image

add embeddings to datasets

Given a set of datasets (TBD):

enrich each dataset by creating a data card (MIT enrichment)
add embeddings to the dataset
upload back into Terarium
provide search support

Add groundings to petrinets within gollm

Add groundings to give the LLM contextual information about states and parameters.

Extend the Model Card generation tool from PetriNet AMR & Epi corpus to Decapode AMR & Climate corpus

Create executables for stateless GoLLM tasks to be run by task runner

Goal

Create executable "tasks" for each GoLLM function

TODO

build tasks under task dir.
Unit tests for tasks
write input validation layer for validating input JSON schema
error handling
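
The input validation layer listed above could start as small as the sketch below, which checks required top-level fields and their types with the standard library only. The schema shown mirrors the /configure input from the README. The function and constant names are hypothetical, and a full JSON Schema validator may be a better fit once the task inputs stabilize.

```python
import json

# Required top-level fields for the /configure task, per the README.
CONFIGURE_SCHEMA = {"research_paper": str, "amr": dict}


def validate_input(raw: str, required: dict[str, type]) -> dict:
    """Parse a task's JSON input and check that required fields have the expected types."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"input is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("input must be a JSON object")
    for field, expected in required.items():
        if field not in payload:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(payload[field], expected):
            raise ValueError(f"field {field!r} must be of type {expected.__name__}")
    return payload
```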

Update GitHub Actions to run the Dockerfile

Use the MIRA DKG API to ground model variables & parameters to concepts during Model Card generation

DKG = Domain Knowledge Graph

The task is to ground model variables & parameters to concepts in the DKG if they are missing from the input AMR or plainly wrong according to the input document.

These groundings would be later used with the entity_similarity endpoint to automatically map model quantities to dataset features (as in the Calibrate workflow box).

The API is here:
http://34.230.33.149:8771/
http://34.230.33.149:8771/docs

Model Search

Depends on DARPA-ASKEM/data-service#363

Goal

Build out the tasks and data required for semantic search across AMRs

Tasks

build model card task
Scrape models from BioModels, convert to AMR, fetch associated papers
Create model cards for each model from associated paper
Create embedding task within GoLLM
Create embedding for each model card
Upload AMR, paper DOI, model card and model card embedding to ES
Configure facet in HMI server. User query gets sent to embedding task, then an ANN search is performed against the model card embeddings. Top K models are returned to the user.
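
The retrieval step at the end of the list can be sketched as a top-K similarity search. The version below is an exact brute-force cosine search in pure Python, swapped in for the ANN index purely for illustration. The function names and the dictionary of embeddings are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], card_embeddings: dict[str, list[float]], k: int = 5) -> list[tuple[str, float]]:
    """Return the k model ids whose model card embeddings are closest to the query."""
    scored = [(model_id, cosine_similarity(query, emb)) for model_id, emb in card_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```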

Misc

Add model created date to model card

Duplicate GoLLM within this repository

Improve OAI consistency when generating model cards

We are seeing inconsistent output formats for model cards:

"ModelCardAuthors": [
                "Giulia Giordano",
                "Franco Blanchini",
                "Raffaele Bruno",
                "Patrizio Colaneri",
                "Alessandro Di Filippo",
                "Angela Di Matteo",
                "Marta Colaneri"
            ],

and

"ModelCardAuthors": [
                {
                    "Giulia Giordano": "Department of Industrial Engineering, University of Trento, Trento, Italy"
                },
                {
                    "Franco Blanchini": "Dipartimento di Scienze Matematiche, Informatiche e Fisiche, University of Udine, Udine, Italy"
                },
                {
                    "Raffaele Bruno": "Division of Infectious Diseases I, Fondazione IRCCS Policlinico San Matteo, Pavia, Italy"
                },
                {
                    "Patrizio Colaneri": "Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy"
                }
            ]

TODO:

use deterministic generation params
add more detailed schema information to the model card prompt
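
Until the prompt reliably yields one format, a post-processing step could coerce both shapes shown above into a single structure. This is a sketch of that idea. The output shape with `name` and `affiliation` keys is an assumption, not an agreed schema.

```python
def normalize_authors(authors: list) -> list[dict]:
    """Coerce both observed ModelCardAuthors formats into name/affiliation records."""
    normalized = []
    for entry in authors:
        if isinstance(entry, str):
            # Format 1: a bare author name.
            normalized.append({"name": entry, "affiliation": None})
        elif isinstance(entry, dict):
            # Format 2: a single-key object mapping name to affiliation.
            for name, affiliation in entry.items():
                normalized.append({"name": name, "affiliation": affiliation})
        else:
            raise TypeError(f"unexpected author entry: {entry!r}")
    return normalized
```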

Create fastapi endpoint

Use a FastAPI endpoint instead of a runnable CLI tool. Update the Dockerfile.

Build Task Factory

Build a task factory to avoid duplicating Task code.
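
One common way to avoid repeating per-task boilerplate is a registry that maps task names to handler functions and wraps every result the same way. The sketch below illustrates that pattern only. The names `register_task`, `run_task` and the `echo` task are hypothetical and not taken from the GoLLM code base.

```python
from typing import Callable

TaskFn = Callable[[dict], dict]
_REGISTRY: dict[str, TaskFn] = {}


def register_task(name: str) -> Callable[[TaskFn], TaskFn]:
    """Decorator that registers a handler under a task name."""
    def decorator(fn: TaskFn) -> TaskFn:
        if name in _REGISTRY:
            raise ValueError(f"task already registered: {name}")
        _REGISTRY[name] = fn
        return fn
    return decorator


def run_task(name: str, payload: dict) -> dict:
    """Look up a task and wrap its result in the shared response envelope."""
    try:
        handler = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown task: {name}") from None
    return {"response": handler(payload)}


@register_task("echo")
def echo(payload: dict) -> dict:
    """Trivial example task that returns its input unchanged."""
    return payload
```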

Create evaluation datasets for various GoLLM Tasks

Problem

We do not have large scale datasets that we can use to evaluate GoLLM tasks
We do not have a method in place for sourcing or creating datasets for new GoLLM tasks

Approach

Create distributions that we can sample from to create synthetic datasets. For example, we can likely create an arbitrary AMR, stratify it and then create synthetic interaction matrices which have cells that map to the newly created AMR.
A dataset of real data will be better where applicable, but we will have significant hurdles to overcome in terms of annotation costs, licensing, and time spent.

Tasks

TBD

[Task] benchmark gollm for config extraction

Calculate Benchmarks Using Google Vision Extracted Features

Rerun benchmarks using additional extractions:

tables
latex

We have created a config-from-document dataset with approximately 900 pairs of documents and AMRs. For evaluation, mask out the values for parameters and initials, then compare predictions against the ground truth. Evaluate using precision, recall, and F1. The dataset is uploaded to the shared drive.
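
A minimal sketch of that scoring for a single document/AMR pair follows. It assumes masked values are numeric, that a prediction of `None` means the model abstained, and that a prediction counts as correct when it matches the ground truth within a relative tolerance. These matching rules are assumptions, not the benchmark's actual definition.

```python
import math


def score_config(predicted: dict, truth: dict, rel_tol: float = 1e-6) -> dict:
    """Compute precision, recall and F1 over masked parameter and initial values."""
    attempted = {name: value for name, value in predicted.items() if value is not None}
    correct = sum(
        1
        for name, value in attempted.items()
        if name in truth and math.isclose(value, truth[name], rel_tol=rel_tol)
    )
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, with ground truth `{"beta": 0.3, "gamma": 0.1, "N": 1000}` and predictions `{"beta": 0.3, "gamma": 0.2, "N": None}`, one of two attempted values is correct (precision 0.5) and one of three true values is recovered (recall 1/3), giving F1 = 0.4.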

The config-from-dataset strategy is TBD. Perhaps we can reuse this same dataset and map values from the existing AMRs into tabular format, then evaluate the model's ability to map the values in the tables back into the AMR.

[Research] investigate whether variable extractions are useful to gollm and worth keeping

Document to Model Configuration

User Input

POST /document-to-model-config

{ 
    model: AMR, // Full AMR from `data-service`
    document: text, // OCR text extracted from the document
}

Response

Define Model Configuration options list object.

Score the "probability" for a given paper to contain an AMR-compatible model

Goal

Rank papers based on their likelihood to contain a system of ODEs that can be represented as an AMR

Approach

TBD, but I think we should fit a distribution over the embeddings of a representative sample of papers that are known to contain AMR-compatible models. We can then score a paper by the distance from its embedding to the cluster of AMR-compatible points.
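
As a simplified stand-in for fitting a distribution, the sketch below scores a query embedding by its Euclidean distance to the centroid of the reference embeddings. The mapping from distance to a score in (0, 1] is an arbitrary choice made for illustration, and the result is a ranking signal rather than a calibrated probability.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def distance_score(query: list[float], reference_embeddings: list[list[float]]) -> float:
    """Score in (0, 1]: 1.0 at the centroid, decreasing as the query moves away."""
    distance = math.dist(query, centroid(reference_embeddings))
    return 1.0 / (1.0 + distance)
```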

Document Search

Goal

Perform semantic search over documents

Tasks

figure out schema with @dgauldie & @kbirk
bulk ingest docs into ES
embedding service for user queries
connect to semantic search UI
build summarization service to summarize search results

Note: I have already built the vector index for this and have shared it with the team offline.

Integrate Message Queue

Describe

Because most calls to the services will be asynchronous, we need to integrate a message queue to communicate with the hmi-server.

Properly map bad inputs to null space

If a user uploads a paper or dataset that does not match an AMR, GoLLM should properly output null values, or a warning to the user.

@mwdchang can you provide some examples of the bad outputs you were sending GoLLM so I can reproduce?

Update Unittests

All unit tests are out of date.
They should be updated after we refactor GoLLM. Currently the repo is still research-y.
