
gollm's Introduction

GoLLM

This repository contains API endpoints for various Terarium LLM workflows.

Getting Started

Running the API

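1. cd into the repository root.
2. Build the image: docker build -t gollm .
3. Start the container, passing your OpenAI key from the shell environment: docker run -p 8000:8000 -e OPENAI_API_KEY=$OPENAI_API_KEY gollm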

AMR configuration from paper and AMR

Once the API is running, the /configure endpoint accepts a JSON body with the structure:

{ research_paper: str, amr: obj }

The API returns a model configuration candidate with the structure:

{response: obj}

where `response` contains the AMR populated with configuration values.
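
A minimal client sketch for the request described above. It assumes the API is reachable at localhost:8000 (the port mapping from the docker run command) and that /configure expects a POST with a JSON body; the HTTP method is not stated here, so treat it as an assumption. The helper names build_configure_request and configure are hypothetical.

```python
import json
import urllib.request

# Assumption: host and port follow the `docker run -p 8000:8000` command.
BASE_URL = "http://localhost:8000"


def build_configure_request(research_paper: str, amr: dict) -> urllib.request.Request:
    """Build (but do not send) a request for the /configure endpoint."""
    body = json.dumps({"research_paper": research_paper, "amr": amr}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/configure",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the endpoint expects POST
    )


def configure(research_paper: str, amr: dict) -> dict:
    """Send the request and return the `response` field of the reply."""
    request = build_configure_request(research_paper, amr)
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)["response"]
```

Calling configure(paper_text, amr) with the container running should return the populated AMR, provided the assumptions above hold.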

Note: This is a work in progress. It is unoptimized and currently serves as a test case for integrating LLM features with Terarium.

AMR model card from paper

Once the API is running, the /model_card endpoint accepts a JSON body with the structure:

{ research_paper: str }

The API returns a model card in JSON format:

{response: obj}

Note: This is a work in progress.


License

Apache License 2.0

gollm's People

Contributors

j2whiting, kbirk, dgauldie, jryu01, yohannparis, mwdchang

Watchers

Charles Coleman, Brandon Rose, Joshua Cruz, Kostas Georgiou

gollm's Issues

Enrich AMR Task

@pascaleproulx

I have a working GoLLM task for AMR enrichment. We need to integrate it with the rest of the task runner architecture @kbirk. Example outputs below.


BIOMD0000000024

{
  "initials": {
    "protein": {
      "description": "Relative concentration of the effective protein, which is in the molecular state capable of inhibiting mRNA production.",
      "unit": "dimensionless"
    },
    "mRNA": {
      "description": "Relative concentration of mRNA, which is involved in the production of the effective protein.",
      "unit": "dimensionless"
    }
  },
  "parameters": {
    "k": {
      "description": "Scaling constant used in the nonlinear term of the mRNA production rate equation.",
      "unit": "dimensionless"
    },
    "n": {
      "description": "Hill coefficient representing the cooperativity in the negative feedback loop of protein on mRNA production.",
      "unit": "dimensionless"
    },
    "rM": {
      "description": "Scaled mRNA production rate constant.",
      "unit": "hr^-1"
    },
    "m": {
      "description": "Exponent representing the nonlinearity in the protein production cascade.",
      "unit": "dimensionless"
    },
    "parameter_0000009": {
      "description": "Not explicitly described in the provided text.",
      "unit": "N/A"
    },
    "rP": {
      "description": "Protein production rate constant.",
      "unit": "hr^-1"
    },
    "qM": {
      "description": "mRNA degradation rate constant.",
      "unit": "hr^-1"
    },
    "qP": {
      "description": "Protein degradation rate constant.",
      "unit": "hr^-1"
    },
    "compartment_0000004": {
      "description": "Not explicitly described in the provided text.",
      "unit": "N/A"
    }
  }
}

BIOMD0000001048

{
  "initials": {
    "Ttum": {
      "description": "Cell concentration of the original tumor",
      "unit": "cells/ml"
    },
    "Tplas": {
      "description": "Cancer cell concentration in the plasma",
      "unit": "cells/ml"
    },
    "Tnew": {
      "description": "Cell concentration of new and developing tumor",
      "unit": "cells/ml"
    }
  },
  "parameters": {
    "b": {
      "description": "Relative drug efficacy factor for specific growth rate",
      "unit": "dimensionless"
    },
    "kf1": {
      "description": "Rate constant for cell release from the original tumor to plasma",
      "unit": "day^-1"
    },
    "kr1": {
      "description": "Rate constant for cell attachment from plasma to the original tumor",
      "unit": "day^-1"
    },
    "c": {
      "description": "Rate constant for plasma clearance",
      "unit": "day^-1"
    },
    "d": {
      "description": "Relative drug efficacy factor for plasma clearance",
      "unit": "dimensionless"
    },
    "kf2": {
      "description": "Rate constant for cell release from plasma to new tumor",
      "unit": "day^-1"
    },
    "kr2": {
      "description": "Rate constant for cell attachment from new tumor to plasma",
      "unit": "day^-1"
    },
    "T0": {
      "description": "Equilibrium tumor cell concentration in the tumor",
      "unit": "cells/ml"
    },
    "a": {
      "description": "Relative drug efficacy factor for cell release rate",
      "unit": "dimensionless"
    },
    "r": {
      "description": "Specific growth rate of tumor cells",
      "unit": "day^-1"
    },
    "n": {
      "description": "Number of new tumors being developed simultaneously",
      "unit": "dimensionless"
    },
    "Tumor": {
      "description": "General term for tumor cell concentration",
      "unit": "cells/ml"
    }
  }
}

Using MathML for units

MODEL8262229752

{
  "initials": {},
  "parameters": {
    "b_reac_r": {
      "description": "Bio-reaction rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "AdoMet_r": {
      "description": "Methionine adenosyl transfer rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "Methy_trans": {
      "description": "Methyl transfer rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "SAM_Dec": {
      "description": "SAM decarboxylation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "Spermi_uti": {
      "description": "Spermidine utilization rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "MTR_e": {
      "description": "MTR excretion rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "Polyamine_uti": {
      "description": "Polyamine utilization rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "Pfs_prot_d": {
      "description": "Pfs protein degradation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "pfs_transl": {
      "description": "Pfs translation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "pfs_mRNA_d": {
      "description": "Pfs mRNA degradation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "pfs_transc": {
      "description": "Pfs transcription rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "Met_recov": {
      "description": "Methionine recovery rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "DPD_deg_r": {
      "description": "DPD degradation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "LuxS_prot_d": {
      "description": "LuxS protein degradation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "LuxS_transl": {
      "description": "LuxS translation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "LuxS_mRNA_d": {
      "description": "LuxS mRNA degradation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "LuxS_transc": {
      "description": "LuxS transcription rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "AI2_syn_r": {
      "description": "AI-2 synthesis rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "AI2_excret_r": {
      "description": "AI-2 excretion rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "AI2_trans_r": {
      "description": "AI-2 transport rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "AI2_phos_r": {
      "description": "AI-2 phosphorylation rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "MTR_syn_r": {
      "description": "MTR synthesis rate constant",
      "units": {
        "expression": "1/min",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>min</ci></apply>"
      }
    },
    "SAH_Hydro_r": {
      "description": "SAH hydrolysis rate constant",
      "units": {
        "expression": "1/(M*min)",
        "expression_mathml": "<apply><divide/><cn>1</cn><apply><times/><ci>M</ci><ci>min</ci></apply></apply>"
      }
    },
    "SRH_cleav": {
      "description": "SRH cleavage rate constant",
      "units": {
        "expression": "1/(M*min)",
        "expression_mathml": "<apply><divide/><cn>1</cn><apply><times/><ci>M</ci><ci>min</ci></apply></apply>"
      }
    },
    "SpeE_syn_r": {
      "description": "Spermidine synthesis rate constant",
      "units": {
        "expression": "1/(M*min)",
        "expression_mathml": "<apply><divide/><cn>1</cn><apply><times/><ci>M</ci><ci>min</ci></apply></apply>"
      }
    },
    "compartment": {
      "description": "Compartment for the reactions",
      "units": {
        "expression": "n/a",
        "expression_mathml": "<ci>n/a</ci>"
      }
    }
  }
}

MODEL9086926384

{
  "initials": {},
  "parameters": {
    "kb": {
      "description": "Rate constant for the backward reaction in the model.",
      "units": {
        "expression": "1/s",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>s</ci></apply>"
      }
    },
    "kf": {
      "description": "Rate constant for the forward reaction in the model.",
      "units": {
        "expression": "1/s",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>s</ci></apply>"
      }
    },
    "default_compartment": {
      "description": "The default compartment where the reactions take place, typically representing the synaptic or cytosolic volume.",
      "units": {
        "expression": "fl",
        "expression_mathml": "<ci>fl</ci>"
      }
    },
    "geometry": {
      "description": "The geometric configuration of the synaptic and cytosolic compartments, including their volumes.",
      "units": {
        "expression": "fl",
        "expression_mathml": "<ci>fl</ci>"
      }
    },
    "geometry_sbo_1_sbc_": {
      "description": "Specific geometric parameter related to the synaptic and cytosolic compartments, possibly a scaling factor or specific volume.",
      "units": {
        "expression": "fl",
        "expression_mathml": "<ci>fl</ci>"
      }
    },
    "k1": {
      "description": "Rate constant for a specific reaction involving AMPAR or CaMKII in the model.",
      "units": {
        "expression": "1/s",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>s</ci></apply>"
      }
    },
    "k2": {
      "description": "Rate constant for another specific reaction involving AMPAR or CaMKII in the model.",
      "units": {
        "expression": "1/s",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>s</ci></apply>"
      }
    },
    "k3": {
      "description": "Rate constant for yet another specific reaction involving AMPAR or CaMKII in the model.",
      "units": {
        "expression": "1/s",
        "expression_mathml": "<apply><divide/><cn>1</cn><ci>s</ci></apply>"
      }
    }
  }
}
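
The "expression" and "expression_mathml" fields in the MathML examples above are redundant, so one could be derived from the other rather than generated separately. The following is a rough sketch of that idea, not the task's actual implementation: it reuses Python's own expression parser, handles only names, numbers, `/` and `*`, and the function name unit_to_mathml is hypothetical.

```python
import ast

# Only division and multiplication are supported in this sketch.
_OPERATORS = {ast.Div: "divide", ast.Mult: "times"}


def _node_to_mathml(node: ast.AST) -> str:
    """Recursively render one parsed node as Content MathML."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return f"<cn>{node.value}</cn>"
    if isinstance(node, ast.Name):
        return f"<ci>{node.id}</ci>"
    if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
        tag = _OPERATORS[type(node.op)]
        left = _node_to_mathml(node.left)
        right = _node_to_mathml(node.right)
        return f"<apply><{tag}/>{left}{right}</apply>"
    raise ValueError(f"unsupported unit expression: {ast.dump(node)}")


def unit_to_mathml(expression: str) -> str:
    """Convert a unit expression such as "1/(M*min)" to Content MathML."""
    tree = ast.parse(expression, mode="eval")
    return _node_to_mathml(tree.body)
```

Placeholder strings such as "n/a" would be parsed as a division of n by a, so they need to be filtered out before calling the converter. Exponent notation such as "hr^-1" is rejected with a ValueError.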

Investigate using smaller context length models for model card creation

Now that we have gathered baseline results from using GPT-4 to create model cards for ~775 different models, we should look into using a cheaper model for this task.

TODO:

Scope techniques for parsing out model card fields from a research paper using smaller context lengths (i.e., 4096).

Q&A for parameters over documents (TBD)

MITRE has requested the ability to query values for a given model parameter over a set of documents.

e.g. "Find me values for the infectivity rate constant (beta) in these documents"

Use updated format for input models

We need to account for 3 different forms of parameter matrices to reduce ambiguity in configuration pipelines:

  • subject/outcome
  • subject/controller
  • outcome/controller

Update the output from HMI and the GoLLM prompts with few-shot examples.

add embeddings to datasets

Given a set of datasets (TBD):

  • enrich the dataset by creating a data card (MIT enrichment)
  • add embeddings to the dataset
  • upload it back into Terarium
  • provide search support

Use the MIRA DKG API to ground model variables & parameters to concepts during Model Card generation

DKG = Domain Knowledge Graph

The task is to ground model variables & parameters to concepts in the DKG if they are missing from the input AMR or plainly wrong according to the input document.

These groundings would later be used with the entity_similarity endpoint to automatically map model quantities to dataset features (as in the Calibrate workflow box).

The API is here:
http://34.230.33.149:8771/
http://34.230.33.149:8771/docs

Model Search

Depends on DARPA-ASKEM/data-service#363

Goal

Build out the tasks and data required for semantic search across AMRs

Tasks

  • Build model card task
  • Scrape models from BioModels, convert them to AMR, and fetch the associated papers
  • Create a model card for each model from its associated paper
  • Create embedding task within GoLLM
  • Create an embedding for each model card
  • Upload the AMR, paper DOI, model card and model card embedding to ES
  • Configure facet in HMI server. User query gets sent to embedding task, then an ANN search is performed against the model card embeddings. Top K models are returned to the user.
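
The final retrieval step in the task list above can be sketched as follows. This is an exact brute-force cosine search used as a stand-in for the ANN search, which would normally run inside ES; the function names and the toy embeddings are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_models(query_embedding, card_embeddings, k=3):
    """Return the k (model_id, score) pairs most similar to the query.

    card_embeddings maps a model ID to its model card embedding.
    """
    scored = [
        (model_id, cosine_similarity(query_embedding, embedding))
        for model_id, embedding in card_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional embeddings; real embeddings would come from the embedding task.
cards = {"model_a": [1.0, 0.0], "model_b": [0.0, 1.0], "model_c": [1.0, 1.0]}
results = top_k_models([1.0, 0.0], cards, k=2)
```

Brute force is fine for a few hundred model cards; an ANN index only pays off at larger scale.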

Misc

  • Add model created date to model card

Improve OAI consistency when generating model cards

We are seeing inconsistent output formats for model cards:

"ModelCardAuthors": [
                "Giulia Giordano",
                "Franco Blanchini",
                "Raffaele Bruno",
                "Patrizio Colaneri",
                "Alessandro Di Filippo",
                "Angela Di Matteo",
                "Marta Colaneri"
            ],

and

"ModelCardAuthors": [
                {
                    "Giulia Giordano": "Department of Industrial Engineering, University of Trento, Trento, Italy"
                },
                {
                    "Franco Blanchini": "Dipartimento di Scienze Matematiche, Informatiche e Fisiche, University of Udine, Udine, Italy"
                },
                {
                    "Raffaele Bruno": "Division of Infectious Diseases I, Fondazione IRCCS Policlinico San Matteo, Pavia, Italy"
                },
                {
                    "Patrizio Colaneri": "Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy"
                }
            ]

TODO:

  • use deterministic generation params
  • add more detailed schema information to the model card prompt
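
Until the prompt produces a stable format, one possible stopgap is to normalize both observed shapes of "ModelCardAuthors" after generation. This is only a sketch of that idea; the function name normalize_authors and the target shape (a list of name/affiliation objects) are assumptions, not a schema the project has agreed on.

```python
def normalize_authors(authors):
    """Return authors as a list of {"name": ..., "affiliation": ...} dicts.

    Accepts either plain name strings or single-key {name: affiliation} dicts,
    the two shapes observed in generated model cards.
    """
    normalized = []
    for entry in authors:
        if isinstance(entry, str):
            normalized.append({"name": entry, "affiliation": None})
        elif isinstance(entry, dict):
            for name, affiliation in entry.items():
                normalized.append({"name": name, "affiliation": affiliation})
        else:
            raise TypeError(f"unexpected author entry: {entry!r}")
    return normalized


plain = normalize_authors(["Giulia Giordano", "Franco Blanchini"])
detailed = normalize_authors(
    [{"Giulia Giordano": "Department of Industrial Engineering, University of Trento, Trento, Italy"}]
)
```

Both calls yield the same record shape, so downstream code only has to handle one format.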

Create evaluation datasets for various GoLLM Tasks

Problem

  • We do not have large scale datasets that we can use to evaluate GoLLM tasks
  • We do not have a method in place for sourcing or creating datasets for new GoLLM tasks

Approach

  • Create distributions that we can sample from to create synthetic datasets. For example, we can likely create an arbitrary AMR, stratify it and then create synthetic interaction matrices which have cells that map to the newly created AMR.
  • A dataset of real data will be better where applicable, but we will have significant hurdles to overcome in terms of annotation costs, licensing, and time spent.

Tasks

TBD

Create Evaluation Datasets

  • Config from Document
  • Config from Dataset

We have created a config-from-document dataset with approximately 900 pairs of documents and AMRs. For evaluation, mask out the values of parameters and initials, then compare the predictions against the ground truth. Evaluate using precision, recall and F1. The dataset is uploaded to the shared drive.
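
A minimal sketch of the masked-value evaluation described above. The counting convention is an assumption: a non-null prediction that matches the ground truth is a true positive, a non-null prediction that does not match is a false positive, and every ground-truth value without a correct prediction is a false negative. The function name score_config and the sample values are hypothetical.

```python
import math


def score_config(predicted, truth, rel_tol=1e-6):
    """Compute precision, recall and F1 for predicted parameter/initial values.

    predicted maps names to numbers, or None when the model abstained.
    truth maps names to the masked-out ground-truth numbers.
    """
    tp = fp = 0
    for name, value in predicted.items():
        if value is None:
            continue  # an abstention is not counted as a prediction
        expected = truth.get(name)
        if expected is not None and math.isclose(value, expected, rel_tol=rel_tol):
            tp += 1
        else:
            fp += 1
    fn = len(truth) - tp  # ground-truth values that were not recovered
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


truth = {"beta": 0.5, "gamma": 0.1, "S0": 990.0, "I0": 10.0}
predicted = {"beta": 0.5, "gamma": 0.2, "S0": 990.0, "I0": None}
scores = score_config(predicted, truth)
```

In this toy case two of three predictions are correct and two of four ground-truth values are recovered.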

Config from dataset strategy is TBD. Perhaps we can use this same dataset, and then map values from the existing AMRs into tabular format. Evaluate the model's ability to map the values in the tables back into the AMR.

Document to Model Configuration

User Input

POST /document-to-model-config

{ 
    model: AMR, // Full AMR from `data-service`
    document: text, // OCR text extracted from the document
}

Response

  • Define Model Configuration options list object.

Score the "probability" for a given paper to contain an AMR-compatible model

Goal

Rank papers based on their likelihood to contain a system of ODEs that can be represented as an AMR

Approach

  • TBD, but I think we should fit a distribution over the embeddings of a representative sample of papers that are known to contain AMR-compatible models. We can then score a query paper by its distance from the cluster of AMR-compatible points.
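
A deliberately simplified sketch of the distance-based scoring idea. Instead of fitting a full distribution, it uses the centroid of the known AMR-compatible embeddings and maps Euclidean distance to a score in (0, 1]. The function names, the 1 / (1 + distance) mapping and the toy vectors are all assumptions made for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def amr_compatibility_score(query_embedding, known_embeddings):
    """Score a paper by how close its embedding is to the known cluster.

    Returns 1.0 at the centroid and approaches 0.0 as the distance grows.
    """
    center = centroid(known_embeddings)
    distance = math.dist(query_embedding, center)
    return 1.0 / (1.0 + distance)


# Toy embeddings of papers known to contain AMR-compatible models.
known = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
near_score = amr_compatibility_score([1.0, 1.0], known)
far_score = amr_compatibility_score([4.0, 5.0], known)
```

Papers can then be ranked by descending score. A fitted distribution (for example, one that accounts for per-dimension variance) would replace the plain Euclidean distance here.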

Document Search

Goal

Perform semantic search over documents

Tasks

  • figure out schema with @dgauldie & @kbirk
  • bulk ingest docs into ES
  • embedding service for user queries
  • connect to semantic search UI
  • build summarization service to summarize search results

Note: I have already built the vector index for this and shared it with the team offline.

Integrate Message Queue

Describe

Because most calls to these services will be asynchronous, we need to integrate a message queue to communicate with the hmi-server.

Properly map bad inputs to null space

If a user uploads a paper or dataset that does not match an AMR, GoLLM should properly output null values, or a warning to the user.

@mwdchang can you provide some examples of the bad outputs you were sending GoLLM so I can reproduce?

Update Unittests

  • All unit tests are out of date
  • These should be updated after we perform a refactor of GoLLM. Currently the repo is still research-y
