shawnschulz / dna-llama

A Python class for preparing somatic mutation VCFs and TSVs for analysis with LLaMA and other LLMs

License: GNU General Public License v3.0

Shell 1.28% Jupyter Notebook 93.09% Python 5.63%

dna-llama's Introduction

THESE ARE ALL WORKS IN PROGRESS - USE AND COMMENT ON AT YOUR OWN RISK

Usage:

Clone the repo:

git clone https://github.com/shawnschulz/dna-llama.git
cd dna-llama

Install the requirements:

# for now, the Enformer and BERT dependencies need to be installed on your own
pip install -r requirements.txt

In Python, you can use the dnaDataSet class by copying dnaDataSet.py into the directory that holds your notebook or script and adding this line to it:

#hope to make this a little less janky soon
from dnaDataSet import dnaDataSet

When you initialize a dnaDataSet, you can set modelPath to a quantized ggml binary usable by llama.cpp:

#the .bin is important!
dnaset = dnaDataSet(modelPath='/path/to/ggml_quantized_llama.bin')

Alternatively, you can set modelPath to a Hugging Face repository string (user/repo-name) or to the path of the config.json of a locally installed model:

dnaset = dnaDataSet(modelPath='pollner/dna_dataset')
dnaset = dnaDataSet(modelPath='/path/to/config.json')
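
The three kinds of modelPath value described above can be told apart by their shape alone. The following is a minimal sketch of that idea, not the class's actual logic: `classify_model_path` and its return labels are hypothetical names introduced here for illustration, and the repository-string pattern is an assumption.

```python
import re

# Hypothetical helper: not part of dnaDataSet, it only illustrates the three
# kinds of modelPath value described in this README.
_HF_REPO_ID = re.compile(r"^[\w.-]+/[\w.-]+$")


def classify_model_path(model_path: str) -> str:
    """Return 'ggml', 'hf_config', 'hf_repo', or 'unknown' for a modelPath string."""
    if model_path.endswith(".bin"):
        # A quantized ggml binary for llama.cpp (the .bin suffix matters).
        return "ggml"
    if model_path.endswith("config.json"):
        # Path to the config.json of a locally installed model.
        return "hf_config"
    if _HF_REPO_ID.match(model_path):
        # Exactly one slash and no leading slash: looks like user/repo-name.
        return "hf_repo"
    return "unknown"
```

Note that a relative path with a single slash (for example `models/llama`) would also match the repository pattern, so a real implementation would likely check the filesystem first.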

dnaDataSet.py

Defines the dnaDataSet class, which is meant to be something like scanpy for working with DNA mutations, interoperable with LLM datasets so that DNA information from TSVs and VCFs can be used for fine-tuning and prompting. It currently has a method to collect some DNA dataset information from an annotated TSV and BAM files, and a method for few-shot learning with llama.cpp quantized models, since those are the only models I can reliably run inference on at the moment. It also saves conversations and outputs as JSON files.
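
To make the few-shot idea concrete, here is a rough sketch of how labeled rows from an annotated TSV could be turned into a prompt, and how a prompt/output pair could be saved as JSON. It is not the code in dnaDataSet.py: the function names, the prompt layout, the column names, and the example rows are all made up for illustration.

```python
import csv
import io
import json


def build_few_shot_prompt(tsv_text, label_column, query_row):
    """Turn labeled TSV rows into a few-shot prompt ending with an unlabeled query."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    blocks = []
    for row in reader:
        # Separate the label from the remaining annotation columns.
        label = row.pop(label_column)
        features = ", ".join(f"{key}={value}" for key, value in row.items())
        blocks.append(f"Mutation: {features}\nLabel: {label}")
    # The query has the same layout but leaves the label for the model to fill in.
    query = ", ".join(f"{key}={value}" for key, value in query_row.items())
    blocks.append(f"Mutation: {query}\nLabel:")
    return "\n\n".join(blocks)


def save_conversation(path, prompt, output):
    """Write one prompt/output pair to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"prompt": prompt, "output": output}, handle, indent=2)


# Made-up example data; these column names and labels are assumptions.
EXAMPLE_TSV = (
    "chrom\tpos\tref\talt\tlabel\n"
    "chr1\t12345\tA\tT\tartifact\n"
    "chr2\t67890\tG\tC\tsomatic\n"
)
prompt = build_few_shot_prompt(
    EXAMPLE_TSV, "label", {"chrom": "chr3", "pos": "222", "ref": "C", "alt": "T"}
)
```

The resulting `prompt` string could then be passed to whatever inference backend is in use, and the reply stored with `save_conversation`.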

To-do

  • Add an option to specify few-shot learning prompts via a JSON file rather than generating them automatically from TSVs and BAMs (might do this later)
  • Add methods using the LangChain or Pinecone API for saving prompt and conversation info to a vector database
  • Add methods to use non-llama.cpp models
  • Add methods to fine-tune non-llama.cpp models
  • Add a method to run vcf2tsv and ANNOVAR for the user from Python
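
As a placeholder for the last item, the sketch below shows the general shape of a VCF-to-TSV step in plain Python. It does not call vcf2tsv or ANNOVAR and does not expand the INFO field; it only flattens the eight fixed VCF columns, writing one row per ALT allele. `vcf_to_tsv` is a hypothetical name, and the input is assumed to be well-formed, tab-delimited VCF text.

```python
# The eight fixed columns of a VCF data line, in order.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def vcf_to_tsv(vcf_text):
    """Flatten VCF data lines into TSV text, one row per ALT allele."""
    alt_index = VCF_COLUMNS.index("ALT")
    out = ["\t".join(VCF_COLUMNS)]
    for line in vcf_text.splitlines():
        # Skip blank lines, ## meta lines, and the #CHROM header line.
        if not line or line.startswith("#"):
            continue
        # Keep only the fixed columns; per-sample genotype columns are dropped.
        fields = line.split("\t")[:8]
        # Multi-allelic sites list ALT alleles separated by commas.
        for alt in fields[alt_index].split(","):
            row = list(fields)
            row[alt_index] = alt
            out.append("\t".join(row))
    return "\n".join(out) + "\n"
```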

dna-llama.ipynb

Few-shot learning using a quantized Alpaca-LoRA 30B model; parses annotated TSV files.

To-do

  • Add interoperability with scanpy so that cell embeddings produced by scanpy can be stored in vector databases via the Pinecone API
  • Format inputs into prompts for few-shot learning
  • Add an option to input a VCF and run vcf2tsv and ANNOVAR for the user

dna-enformer

Takes a FASTA file containing many DNA artifacts and constructs genomic tracks for the mutations, then compares them to test files to find differences between genomic tracks, ultimately using those differences to remove artifacts.

To-do

  • Add in visualization to see whether it worked or not lol
  • If it's working well, remove mutations with low loss
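
The track comparison can be pictured roughly as follows. This is only a sketch under assumptions: the README does not say which loss is used, so mean squared error stands in here, tracks are plain lists of floats rather than Enformer outputs, and `track_loss`, `split_by_loss`, and the threshold are hypothetical.

```python
def track_loss(predicted_track, reference_track):
    """Mean squared error between two equal-length genomic tracks."""
    if len(predicted_track) != len(reference_track):
        raise ValueError("tracks must have the same length")
    squared = [(p - r) ** 2 for p, r in zip(predicted_track, reference_track)]
    return sum(squared) / len(squared)


def split_by_loss(tracks_by_mutation, reference_track, threshold):
    """Split mutation IDs into (below_threshold, at_or_above_threshold) by loss."""
    low, high = [], []
    for mutation_id, track in tracks_by_mutation.items():
        loss = track_loss(track, reference_track)
        (low if loss < threshold else high).append(mutation_id)
    return low, high
```

With this split, the low-loss group is the one the to-do item above proposes removing.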

dna-BERT

Similar idea to dna-enformer, but uses BERT instead.
