
🧬 Generative modeling of regulatory DNA sequences with diffusion probabilistic models 💨

Home Page: https://pinellolab.github.io/DNA-Diffusion/

License: Other

Topics: deep-learning, diffusion-models, genomics, regulatory-genomics, stable-diffusion, generative-model, diffusion-probabilistic-models

dna-diffusion's Introduction

DNA Diffusion

Generative modeling of regulatory DNA sequences with diffusion probabilistic models.




Documentation: https://pinellolab.github.io/DNA-Diffusion

Source Code: https://github.com/pinellolab/DNA-Diffusion


Abstract

The Human Genome Project has laid bare the DNA sequence of the entire human genome, revealing the blueprint for tens of thousands of genes involved in a plethora of biological processes and pathways. In addition to this (coding) part of the human genome, DNA contains millions of non-coding elements involved in the regulation of said genes.

Such regulatory elements control the expression levels of genes, in a way that is, at least in part, encoded in their primary genomic sequence. Many human diseases and disorders are the result of genes being misregulated. As such, being able to control the behavior of such elements, and thus their effect on gene expression, offers the tantalizing opportunity of correcting disease-related misregulation.

Although such cellular programming should in principle be possible by changing the sequence of regulatory elements, the rules for doing so are largely unknown. A number of experimental efforts have been guided by preconceived notions and assumptions about what constitutes a regulatory element, essentially resulting in a "trial and error" approach.

Here, we instead propose a large-scale, data-driven approach that uses the latest generative modelling techniques to learn and apply the rules underlying regulatory element sequences.

Introduction and Prior Work

The goal of this project is to investigate the application and adaptation of recent diffusion models (see https://lilianweng.github.io/posts/2021-07-11-diffusion-models/ for a nice intro and references) to genomics data. Diffusion models are powerful models that have been used for image generation (e.g. Stable Diffusion, DALL-E) and music generation (recent versions of the Magenta project) with outstanding results. A particular model formulation called "guided" diffusion makes it possible to bias the generative process in a particular direction if text or continuous/discrete labels are provided during training. This enables "AI artists" that, given a text prompt, can create beautiful and complex images (many examples here: https://www.reddit.com/r/StableDiffusion/).
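
For intuition, here is a minimal sketch of classifier-free guidance, one common way "guided" diffusion is implemented; the model interface and argument names are illustrative, not this project's API:

def guided_noise_prediction(model, x_t, t, cond, guidance_scale=7.5):
    # Blend unconditional and conditional noise predictions at one denoising step.
    # `model` is a hypothetical denoiser taking a noisy sample, a timestep, and
    # an optional conditioning signal (text embedding or discrete label).
    eps_uncond = model(x_t, t, cond=None)   # prediction without conditioning
    eps_cond = model(x_t, t, cond=cond)     # prediction with conditioning
    # guidance_scale > 1 pushes generation toward the conditioning signal
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)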

Some groups have reported the possibility of generating synthetic DNA regulatory elements in a context-dependent system, for example, cell-specific enhancers. (https://elifesciences.org/articles/41279 , https://www.biorxiv.org/content/10.1101/2022.07.26.501466v1)

Step 1: generative model

We propose to develop models that can generate cell-type-specific or context-specific DNA sequences with certain regulatory properties based on an input text prompt. For example:

  • "A sequence that will correspond to open (or closed) chromatin in cell type X"

  • "A sequence that will activate a gene to its maximum expression level in cell type X"

  • "A sequence active in cell type X that contains binding site(s) for the transcription factor Y"

  • "A sequence that activates a gene in liver and heart, but not in brain"

Step 2: extensions and improvements

Beyond individual regulatory elements, so-called "Locus Control Regions" are known to harbour multiple regulatory elements in specific configurations, working in concert to produce more complex regulatory rulesets. Drawing a parallel with "collaging" approaches, in which multiple stable diffusion steps are combined into one final (graphical) output, we want to apply this notion to DNA sequences with the goal of designing larger regulatory loci. This is a particularly exciting and, to our knowledge, hitherto unexplored direction.

Besides creating synthetic DNA, a diffusion model can help us understand and interpret the components of regulatory sequence elements, and can, for instance, be a valuable tool for studying single nucleotide variations (https://www.biorxiv.org/content/10.1101/2022.08.22.504706v1) and evolution (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1502-5).

Taken together, we believe our work can accelerate our understanding of the intrinsic properties of DNA regulatory sequences in normal development and in different diseases.

Proposed framework

For this work we propose to build a Bit Diffusion model based on the formulation proposed by Chen, Zhang and Hinton https://arxiv.org/abs/2208.04202. This model is a generic approach for generating discrete data with continuous diffusion models. An implementation of this approach already exists, and this is a potential code base to build upon:

https://github.com/lucidrains/bit-diffusion
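
For a feel of the core idea, here is a minimal sketch (our own illustration, following the paper's "analog bits" recipe) of how DNA could be encoded for a continuous diffusion model and decoded back:

import numpy as np

# Two bits per base, shifted and scaled to {-1.0, 1.0} "analog bits"
BASE_TO_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def encode_analog_bits(sequence):
    # (len(sequence), 2) array in {-1, 1} that a continuous model can diffuse
    bits = np.array([BASE_TO_BITS[b] for b in sequence], dtype=np.float32)
    return bits * 2.0 - 1.0

def decode_analog_bits(x):
    # Threshold real-valued model output back to discrete bases
    inv = {v: k for k, v in BASE_TO_BITS.items()}
    bits = (x > 0).astype(int)
    return "".join(inv[tuple(row)] for row in bits)

assert decode_analog_bits(encode_analog_bits("ACGT")) == "ACGT"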

Tasks and potential roadmap:

  • Collecting genomic datasets
  • Implementing the guided diffusion based on the code base
  • Thinking about the best encoding of biological information for the guided diffusion (e.g. cell type: K562, very strong activating sequence for chromatin, or cell type: GM12878, very open chromatin)
  • Plans for validation based on existing datasets or how to perform new biological experiments (we need to think about potential active learning strategies).

Deliverables

  • Dataset: compile and provide a complete database of cell-specific regulatory regions (DNase assay) to allow scientists to train and generate different diffusion models based on regulatory sequences.

  • Models: Provide a model that can generate regulatory sequences given a specific cell type and genomic context.

  • API: Provide an API to make it possible to manipulate DNA regulatory models and a visual playground to generate synthetic contextual sequences.

Datasets

DHS Index:

Chromatin (DNA + associated proteins) that is actively used for the regulation of genes (i.e. "regulatory elements") is typically accessible to DNA-binding proteins such as transcription factors (review, relevant paper). Through the use of a technique called DNase-seq, we've measured which parts of the genome are accessible across 733 human biosamples encompassing 438 cell and tissue types and states, resulting in more than 3.5 million DNase Hypersensitive Sites (DHSs). Using Non-Negative Matrix Factorization, we've summarized these data into 16 components, each corresponding to a different cellular context (e.g. 'cardiac', 'neural', 'lymphoid').

For the efforts described in this proposal, and as part of an earlier ongoing project in the research group of Wouter Meuleman, we've put together smaller subsets of these data that can be used to train models to generate synthetic sequences for each NMF component.

Please find these data, along with a data dictionary, here.

Other potential datasets:

  • DNA sequence data corresponding to annotated regulatory sequences, such as gene promoters or distal regulatory sequences like enhancers, annotated (based on chromatin marks or accessibility) across hundreds of cell types by NHGRI-funded projects such as ENCODE or Roadmap Epigenomics.

  • Data from MPRA assays that test the regulatory potential of hundreds of DNA sequences in parallel (https://elifesciences.org/articles/69479.pdf , https://www.nature.com/articles/s41588-021-01009-4 , ... )

  • MIAA assays, which test the ability of sequences to produce open chromatin within a given cell type.

Models

Input modality:

A) Cell type + regulatory element. Ex: brain tumor cell, weak enhancer
B) Cell type + regulatory element + TF combination (presence or absence). Ex: prostate cell, enhancer, AR (present), TFAP2A (present), ER (absent)
C) Cell type + TF combination + TF positions. Ex: blood stem cell, GATA2 (present), ER (absent), GATA1 (positions 100-108)
D) Sequence containing a GENETIC VARIANT -> a low number of diffusion steps = nucleotide importance prediction

Output:

	DNA sequence

Model size: The number of enhancers and biological sequences is no larger than the number of images available in the LAION dataset. The dimensionality of our generated DNA outputs should be no larger than 4 bases [A,C,T,G] x ~1kb. The final models should be no bigger than ~2 GB.

Models: Different models can be created based on the total sequence length.

APIs

TBD depending on interest

Paper

Can the project be turned into a paper? What does the evaluation process for such a paper look like? What conferences are we targeting? Can we release a blog post as well as the paper?

Yes. We intend to use a mix of in silico generations and experimental validations to study our models' performance on classic regulatory systems (e.g. sickle cell disease and cancer). Our group and collaborators have a substantial reputation in the academic community, with publications in high-impact journals such as Nature and Cell.

Resource Requirements

What kinds of resources (e.g. GPU hours, RAM, storage) are needed to complete the project?

Our initial model can be trained on small datasets (~1k sequences) in about 3 hours (~500 epochs) on a single Tesla K80 GPU on Colab Pro (24 GB RAM). Based on this, we expect that training this or similar models on the large dataset mentioned above (~3 million sequences, 4x200) will require several high-performance GPUs for about 3 months. (Optimization suggestions are welcome!)

Timeline

What is a (rough) timeline for this project?

6 months to 1 year.

Broader Impact

How is the project expected to positively impact biological research at large?

We believe this project will help us better understand genomic regulatory sequences: their composition and the potential regulators acting on them in different biological contexts, with the potential to create therapeutics based on this knowledge.

Reproducibility

We will follow best practices to make sure our code is reproducible and versioned. We will release data-processing scripts and conda environments/Docker images so that other researchers can easily run our code.

We have several assays and technologies to test the synthetic sequences generated by these models at scale based on CRISPR genome editing or massively parallel reporter assays (MPRA).

Failure Case

Regardless of the performance of the final models, we believe it is important to test diffusion models on novel domains, so that other groups can build on top of our investigations.

Preliminary Findings

Using the Bit Diffusion model, we were able to reconstruct 200 bp sequences whose motif composition is very similar to that of the training sequences. The plan is to add cell-type conditional variables to the model to check how different regulatory regions depend on the cell-specific context.

Next Steps

Expand the model's output length to generate complete regulatory regions (enhancer + gene promoter pairs). Use our synthetic enhancers in in vivo models and check how they regulate transcriptional dynamics in biological scenarios (beyond the MPRA arrays).

How to contribute

If this project sounds exciting to you, please join us! Join the OpenBioML Discord: https://discord.gg/Y9CN2dUzQJ. We are discussing this project in the dna-diffusion channel, where we will provide instructions on how to get involved.

Known contributors

You can access the contributor list here.

Development

Setup environment

We use hatch to manage the development environment and production build. It is often convenient to install hatch with pipx.
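
For example, with pipx available:

pipx install hatch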

Run unit tests

You can run all the tests with:

hatch run test

Format the code

Execute the following command to apply linting and check typing:

hatch run lint

Publish a new version

You can check the current version with:

hatch version

You can bump the version with commands such as hatch version dev, hatch version patch, hatch version minor, or hatch version major, or by editing the src/dnadiffusion/__about__.py file. After changing the version, when you push to GitHub, the Test Release workflow will automatically publish it on Test-PyPI and create a draft GitHub release.
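
For example (version numbers are illustrative):

hatch version        # show the current version, e.g. 0.1.0
hatch version patch  # 0.1.0 -> 0.1.1
hatch version minor  # 0.1.0 -> 0.2.0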

Serve the documentation

You can serve the mkdocs documentation with:

hatch run docs-serve

This will automatically watch for changes in your code.

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Lucas Ferreira da Silva: 🤔 💻
Luca Pinello: 🤔
Simon: 🤔 💻

This project follows the all-contributors specification. Contributions of any kind welcome!

dna-diffusion's People

Contributors

1edv, aaronwtr, allcontributors[bot], cameronraysmith, hssn-20, ihabbendidi, jxilt, lucapinello, lucassilvaferreira, mansoldm, mateibejan1, meuleman, mihirneal, noahweber1, nz99, renovate[bot], ryams, sauravmaheshkar, ssenan, ttunja, zanussbaum


dna-diffusion's Issues

Create a function called KL_divergence_motifs

def KL_divergence_motifs(original, generated)

'original': list or pandas Series with the original sequences used to train the model
'generated': list or pandas Series with sequences generated by the model

Returns a 'score' (float): the Kullback-Leibler divergence between the motif distributions of the two sequence sets.
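
A minimal sketch, assuming a motif_count-style helper (specified in the motif_count issue below) that returns per-motif probabilities as a pandas DataFrame:

import numpy as np

def KL_divergence_motifs(original, generated, eps=1e-10):
    # Motif probability vectors for each sequence set; motif_count is the
    # helper described in the motif_count issue and is assumed available here
    p = motif_count(list(original), return_probability=True)["probability"]
    q = motif_count(list(generated), return_probability=True)["probability"]
    p, q = p.align(q, fill_value=0.0)   # match the two motif indices
    p = p.to_numpy() + eps              # a small epsilon avoids log(0)
    q = q.to_numpy() + eps
    p, q = p / p.sum(), q / q.sum()     # renormalize to proper distributions
    return float(np.sum(p * np.log(p / q)))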

What is a contribution? (contribution file linked to the README)

Discussed in #22

Originally posted by IhabBendidi October 17, 2022
An idea is to make a contribution file, to keep account of contributors.
But what is a contribution?

A code push? That would not count the think tanks that theorized the project, so it would be unfair. At the same time, what counts as contributing to a discussion, and what doesn't? Lines can be blurry in some edge cases.

What are your thoughts?

The contribution list is going to become huge, and we need to keep proper track of it, so a file linked from the README would be better.

Create a Data Loader Class with PyTorch Lightning

For the first version, make sure to use the latest data provided by @meuleman here:

https://www.meuleman.org/research/synthseqs/#material

Specifically, we have the following datasets available:

training set: 160k sequences, 10k per NMF component (chr3-chrY, .csv.gz)
validation set: 16k sequences, 1k per NMF component (chr2 only, .csv.gz)
test set: 16k sequences, 1k per NMF component (chr1 only, .csv.gz)

Each of these contains the genomic locations (human genome assembly hg38, first 3 columns) of accessible genome elements, their majority NMF component (column: component), as well as their 200bp nucleotide sequence (column: raw_sequence).

Column         Example        Description
seqname        chr16          Chromosome
start          68843660       Start position
end            68843880       End position
DHS_width      220            Width of original DHS
summit         68843790       Position of center of mass of DHS
total_signal   122.770678     Sum of DNase-seq signal across biosamples
numsamples     61             Number of biosamples with the DHS
raw_sequence   GAGGCATTG…     200bp extracted nucleotide sequence
component      1              Dominant DHS Vocabulary component
proportion     0.767371514    Proportion of NMF loadings in this component
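
A minimal sketch of such a data module, assuming the .csv.gz files and columns described above (the one-hot encoding choice and class names are ours):

import pandas as pd
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

class DHSDataset(Dataset):
    def __init__(self, csv_path):
        df = pd.read_csv(csv_path)  # pandas infers gzip compression from the name
        self.sequences = df["raw_sequence"].tolist()
        self.components = df["component"].tolist()

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        # One-hot encode the 200bp sequence into a (4, 200) float tensor
        seq = self.sequences[idx]
        x = torch.zeros(4, len(seq))
        for i, base in enumerate(seq):
            x[BASES[base], i] = 1.0
        return x, self.components[idx]

class DHSDataModule(pl.LightningDataModule):
    def __init__(self, train_csv, val_csv, test_csv, batch_size=64):
        super().__init__()
        self.paths = {"train": train_csv, "val": val_csv, "test": test_csv}
        self.batch_size = batch_size

    def setup(self, stage=None):
        self.datasets = {k: DHSDataset(p) for k, p in self.paths.items()}

    def train_dataloader(self):
        return DataLoader(self.datasets["train"], batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.datasets["val"], batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.datasets["test"], batch_size=self.batch_size)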

Implement a sequence quality metric based on k-mer composition

Roughly, the idea is: given a set of sequences (e.g. the training set or a generated set), first write them to a temporary FASTA file; then call the kat hist function on the resulting file (e.g. sequences.fa) to get a histogram of k-mer counts (here we can use k=7). Rescale each k-mer's count by the total sum to obtain k-mer probabilities. Given two such probability vectors, one for the training sequences and one for the generated sequences, calculate the KL divergence using the PyTorch implementation that Cesar is also using. See: #14
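
A minimal sketch that counts 7-mers directly in Python instead of going through a temporary FASTA file and kat hist; the end result, two k-mer probability vectors fed into a KL computation, is the same:

from collections import Counter
import numpy as np

def kmer_probabilities(sequences, k=7):
    # Count all overlapping k-mers across the sequence set, then normalize
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def kmer_kl(train_seqs, generated_seqs, k=7, eps=1e-10):
    p = kmer_probabilities(train_seqs, k)
    q = kmer_probabilities(generated_seqs, k)
    kmers = sorted(set(p) | set(q))  # shared support for both vectors
    pv = np.array([p.get(m, 0.0) for m in kmers]) + eps
    qv = np.array([q.get(m, 0.0) for m in kmers]) + eps
    pv, qv = pv / pv.sum(), qv / qv.sum()
    return float(np.sum(pv * np.log(pv / qv)))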

Implement a sequence quality metric based on existing neural networks that can predict enhancer activity (BERT-based), expression (Enformer) or chromatin accessibility (BPNet)

@sg134 Can you help to explore sequences -> enhancer models?
We can use these enhancer classifiers as orthogonal metrics to evaluate our synthetic sequences.
Probably we can find tons of these classifiers, but here are some places to start your search:

bert enhancer
DeepEnhancer

What should be the focus:

  • Sequence length?
  • Do we need to retrain it? Can we retrain it on our 16 cell types?
  • Create a function to receive a sequence and report the probability of it being classified as an enhancer.
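
For the last point, a minimal sketch of such a function; model and tokenize are placeholders for whichever enhancer classifier is chosen (e.g. a BERT-based one), not a specific library's API:

import torch

def enhancer_probability(sequence, model, tokenize):
    # Return the probability that `sequence` is classified as an enhancer,
    # assuming the classifier outputs logits of shape (1, 2): [negative, enhancer]
    model.eval()
    with torch.no_grad():
        logits = model(tokenize(sequence))
        probs = torch.softmax(logits, dim=-1)
    return probs[0, 1].item()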

Create function motif_count

def motif_count(sequences, return_probability=False)

'sequences': list of sequences to scan
'return_probability': if True, scale the occurrences by the total number of sequences in the input
Returns: a pandas DataFrame with index 'motif_id_name' and value 'occurrences' (or 'probability') of each motif in the sequences
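
A minimal sketch using naive consensus-string matching; a real implementation would scan with position weight matrices, and the motifs below are placeholders only:

import pandas as pd

def motif_count(sequences, return_probability=False):
    # Placeholder consensus motifs; swap in a real motif database (e.g. JASPAR)
    motifs = {"GATA1_consensus": "GATA", "CTCF_core": "CCCTC"}
    counts = {m: sum(seq.count(pat) for seq in sequences) for m, pat in motifs.items()}
    df = pd.DataFrame.from_dict(counts, orient="index", columns=["occurrences"])
    df.index.name = "motif_id_name"
    if return_probability:
        # Scale occurrences by the total number of input sequences
        df["probability"] = df["occurrences"] / max(len(sequences), 1)
        df = df[["probability"]]
    return df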

Design config file

Design a template config file which we can use to decide what parameters the train and sampler parsers have to receive and pass to pl.Trainer, and what the final parameters of the Diffusion and UNet classes are.
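
A minimal sketch of what such a template might contain; every parameter name and value below is illustrative, not the final interface:

# config template (illustrative only)
config = {
    "trainer": {              # forwarded to pl.Trainer
        "max_epochs": 500,
        "accelerator": "gpu",
        "devices": 1,
    },
    "unet": {                 # forwarded to the UNet class
        "dim": 64,
        "dim_mults": (1, 2, 4),
    },
    "diffusion": {            # forwarded to the Diffusion class
        "timesteps": 1000,
        "seq_length": 200,
    },
}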

Use reformers to greatly speed up training time and reduce memory usage

Transformers (like the ones used in your model) have a memory and time complexity of O(n^2). Reformers (explained here) have a memory and time complexity of O(n log n) and, as such, require far less RAM and compute to train.

I am strongly suggesting you use reformers since they can replace transformers with almost no change and with a massive benefit. For reference, replacing transformers with reformers on one of my projects took the memory usage down from >10GB to <1GB and was able to train at 45 mins per epoch on an 8-core CPU (I was testing it out before migrating to colab)
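
A minimal sketch using lucidrains' reformer-pytorch, with constructor arguments as shown in that project's README (hyperparameters here are illustrative):

import torch
from reformer_pytorch import Reformer

model = Reformer(
    dim=512,
    depth=6,
    max_seq_len=4096,
    heads=8,
    lsh_dropout=0.1,
    causal=False,
)
x = torch.randn(1, 4096, 512)  # (batch, sequence length, embedding dim)
y = model(x)                   # output has the same shape as the input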

Implement score-interpolation

GOAL: Implement the score interpolation described in "Continuous diffusion for categorical data"

Please see Figure 3 of the paper for the overall implementation picture and requirements.


Up-to-date notebook progress can be found at https://github.com/pinellolab/DNA-Diffusion/blob/score_interpolation_latest/notebooks/experiments/conditional_diffusion/dna_diff_baseline_conditional_UNET.ipynb; click on the history button to check the actual code commit modifications.

DONE :

  1. Added DPM-Solver++ ODE solver block
  2. Added Interpolate Embeddings block
  3. Added reparameterization trick for p_losses()
  4. Fed the output of Time Embedding stage into LayerNorm inside Unet model

TODO :

  • Enable the self_conditioning logic properly without runtime dimension issue

  • Verify the Interpolate Embeddings block

  • Verify that the LearnedSinusoidalPosEmb class is actually following the theoretical concept behind Random Fourier Embedding block

  • Enable the initial denoising inside p_losses() properly considering the actual mean and std at denoising timestep t=0

  • Implement all aspects of Input embedding stage properly for the Interpolate scores block

  • Enhance the denoising process with noise = sigma * epsilon, where sigma is a learned NN parameter, so that the noise is a deterministic function of the input and the parameter rather than independent and randomly generated, as suggested by ChatGPT

  • Verify the integration of Time Embedding stage into Unet model

  • Enhance the training pipeline using a modified version of the Tweedie formula from the soft-diffusion paper

remove merged/unused branches

We would like to clean up branches that have been merged into the default branch while making sure we do not delete existing branches where work is ongoing. Please comment here with the name of the branch you are currently working on and a projected timeframe to merge.

Develop a standardized PR ticket template

My interaction over at #50 made me think:

We may or may not have previously talked about setting a template for opening new pull requests. That interaction includes a call for:

  1. A link to the ticket it's addressing
  2. A short explanation of why (based on what, though?)

Maybe these two could be good starting points for a default ticket template? I.e., copy and paste, then fill in with your contribution. Additionally, maybe add whether your contribution is going into the codebase/experiments/other folder (or where in the codebase, once it grows), although this is probably covered by 1 already. Anything else you consider relevant or good/standard practice?
