hazyresearch / hyena-dna

Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena

Home Page: https://arxiv.org/abs/2306.15794

License: Apache License 2.0

foundation-models genomics language-models

hyena-dna's Introduction

HyenaDNA

HyenaDNA_pipeline


Intro

Welcome to the HyenaDNA repo! HyenaDNA is a long-range genomic foundation model pretrained on context lengths of up to *1 million tokens* at *single nucleotide resolution*.

The repo is a work in progress, but we're very excited to get this into the hands of researchers, so bear with us :)

This repo is best suited for those who want to pretrain a HyenaDNA model, or try one of the downstream tasks from the paper.

For the easiest entry point though, check out the HyenaDNA colab, a self-contained, Hugging Face-integrated notebook. You'll be able to load pretrained weights and fine-tune on the GenomicBenchmarks dataset. You'll also be able to do inference and get embeddings on DNA sequences up to 450k nucleotides on the free tier. For 1 million long DNA sequences, you can get an A100 on Colab (paid tier), or run the notebook on your own machine.

Credit: much of the code is forked and extended from S4 and Safari.

Discord

Trying Discord out! Maybe it'll be conducive to sharing ideas / tips on how HyenaDNA could be applied in different ways. Feel free to post questions there.

Hugging Face pretrained weights

Check these out :) There are different model sizes, and different maximum training sequence lengths they can handle. All were pretrained on a single human reference genome (hg38).

See the suggested GPU requirements for each model.

There are a few ways to use these HuggingFace weights, each with a different flavor (a quick sketch of the HuggingFace route is shown after this list):

  1. colab
  2. Pytorch Lightning in this repo
  3. standalone
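As a minimal sketch of the HuggingFace route: the snippet below assumes the LongSafari "-hf" checkpoints on the Hugging Face Hub and their custom remote code (trust_remote_code=True), mirroring the user snippets quoted in the issues further down; pick a checkpoint size that fits your GPU.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint name (one of the smaller HyenaDNA models on the Hub)
checkpoint = "LongSafari/hyenadna-tiny-16k-seqlen-d128-hf"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, trust_remote_code=True)

# character-level tokenization of a toy DNA string, then a single forward pass
input_ids = torch.LongTensor(tokenizer("ACTG" * 256)["input_ids"]).unsqueeze(0)
with torch.inference_mode():
    out = model(input_ids)
logits = out.logits if hasattr(out, "logits") else out
print(logits.shape)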

Dependencies

For this repo, let's start with the dependencies that are needed. (If you're familiar with Docker, you can skip this section and jump to the Docker setup below.) The repo is built using Pytorch Lightning (a training library) and Hydra (a config-oriented ML library). It'll be super helpful to get familiar with those tools.

  • clone repo, cd into it
git clone --recurse-submodules https://github.com/HazyResearch/hyena-dna.git && cd hyena-dna
  • create a conda environment, with Python 3.8+
conda create -n hyena-dna python=3.8
  • The repo is developed with Pytorch 1.13, using cuda 11.7
conda install pytorch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 pytorch-cuda=11.7 -c pytorch -c nvidia
  • install requirements:
pip install -r requirements.txt
  • install Flash Attention, these notes will be helpful.
cd hyena-dna
git submodule update --init
cd flash-attention
git submodule update --init
pip install -e . --no-build-isolation
  • optional fused layers for speed (takes a bit of time)
# from inside flash-attn/
cd csrc/layer_norm && pip install . --no-build-isolation

Dockerfile

Even better, if you're familiar with Docker, we have an image you can pull with all the dependencies installed. It's the simplest and surest route, but it does require some familiarity with using Docker containers.

Slight complication: you also need to clone the flash-attn repo that's used as a submodule in the main hyena-dna repo, which means you need the --recurse-submodules flag in case you cloned without it.

# clones main and submodule repos
git clone --recurse-submodules https://github.com/HazyResearch/hyena-dna.git && cd hyena-dna

Prepare docker container

# build the image within the hyena-dna repo (it will grab the Dockerfile here). Replace $USER_NAME with your own Dockerhub username.
docker build . -t $USER_NAME/hyena-dna

Or,

# pull already built image (our $USER_NAME is hyenadna)
docker pull hyenadna/hyena-dna:latest

# run the container: this will give you an interactive shell with the dependencies
docker run --gpus all -it -p80:3000 hyenadna/hyena-dna /bin/bash

Update:

We actually have a second Docker image, which has all the Nucleotide Transformer datasets, checkpoint, and exact commands and hyperparameter settings used to reproduce the best results in the HyenaDNA paper.

docker pull hyenadna/hyena-dna-nt6:latest 
docker run --gpus all -it -p80:3000 hyenadna/hyena-dna-nt6 /bin/bash

This will land you inside /wdr, which has a file named launch_commands_nucleotide_transformer with all the launch commands (and associated hyperparameters) for the 18 Nucleotide Transformer datasets.

What's the difference from the first Docker image, you ask? Not much, just some different dependency versions.

Quick Entry point

A quick start for this repo is to train from scratch on a small genomics dataset. Let's try this just to see if things are set up ok.

The command below should auto-download a small dataset into data/. It uses a small 2 layer HyenaDNA model with a linear decoder (head) on a binary classification task. It already beats the SotA by 7 pts (one task from GenomicBenchmarks), but we can do even better with a pretrained model.

python -m train wandb=null experiment=hg38/genomic_benchmark_scratch

Let's describe this.

  • -m lets you run the script as a module (no .py in the name).
  • train calls the main train.py script that launches all training / finetuning experiments.
  • wandb=null disables wandb logging for quick testing; otherwise you can use something like wandb.group=custom_name_here.
  • experiment passes the config for the experiment, using the genomic_benchmark_scratch.yaml file located in configs/experiments/hg38/.
  • You can pass other configs on the command line the same way, eg, dataset=your_custom_dataset_name (see the example after this list). But more on that later.
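For example, you can stack several overrides on the same entry point in one command; the values here are only illustrative:

python -m train wandb.group=my_test_runs experiment=hg38/genomic_benchmark_scratch dataset.batch_size=128 trainer.devices=1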

Loading pretrained weights

There are 2 ways to use the pretrained weights from HuggingFace:

  1. HuggingFace integration (best example), via colab
  2. Pytorch Lightning in this repo:
  • You can clone the HuggingFace repo and pass the ckpt path to Pytorch Lightning (the .ckpt actually comes from Lightning)
  • the flag is train.pretrained_model_path=/path/to/ckpt
  • you'll need to make sure the model config settings are the same when launching. The config is also in the HuggingFace repo.

Standalone code (HuggingFace too)

We actually have a 3rd way, but it's really just a copy of the colab put into this repo as a .py file (in case that's more your thing). It's HuggingFace integrated, not Pytorch Lightning, so you don't get all the bells and whistles, but it is standalone, meaning it's easier to port to your own codebase. It assumes you have all the dependencies installed already.

  • see the huggingface.py script for an example of inference and loading pretrained weights from HF (a condensed sketch of that flow follows this list)
  • and the standalone_hyenadna.py, which has all the classes you need to create a HyenaDNA model
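Condensed, the inference flow in those scripts boils down to something like the sketch below. It assumes model and tokenizer were already built the way huggingface.py / standalone_hyenadna.py do it (the same pattern as the user snippet quoted in the issues further down), so treat it as a shape reference rather than a drop-in script:

import torch

def get_embeddings(model, tokenizer, sequence: str, device: str = "cuda"):
    # `model` and `tokenizer` are assumed to come from huggingface.py / standalone_hyenadna.py
    input_ids = tokenizer(sequence)["input_ids"]                      # character-level token ids
    input_ids = torch.LongTensor(input_ids).unsqueeze(0).to(device)   # add batch dim
    model.to(device)
    model.eval()
    with torch.inference_mode():
        return model(input_ids)                                       # per-token embeddings, (1, seq_len, d_model)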

Experiments

We share our training and dataloading code for pretraining on the human reference genome (HG38), fine-tuning on a number of downstream tasks, and examples of our in-context learning variants using soft prompt tokens and instruction fine-tuning. You'll need to download and preprocess the data on your own for now; we'll share our steps for those later.

In general, get comfortable with the configs in configs/experiments/hg38, all our (sample) experiment settings are there.

Pretraining on Human Reference Genome

The first step is to download the Human Reference Genome data. It comprises 2 files: one with all the sequences (the .fasta file), and one with the intervals we use (the .bed file).

The file structure should look like

data
|-- hg38/
    |-- hg38.ml.fa
    |-- human-sequences.bed

  • Download the fasta (.fa format) file of the entire human genome into hyena-dna/data/hg38. There are ~24 chromosomes in the whole genome (merged into 1 file); each chromosome is essentially one continuous sequence. Then download the .bed file with sequence intervals (it contains chromosome name, start, end, and split, which let you retrieve sequences from the fasta file). An illustrative .bed excerpt is shown after the download commands.
mkdir -p data/hg38/
curl https://storage.googleapis.com/basenji_barnyard2/hg38.ml.fa.gz > data/hg38/hg38.ml.fa.gz
curl https://storage.googleapis.com/basenji_barnyard2/sequences_human.bed > data/hg38/human-sequences.bed
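For orientation, each row of the .bed file is a tab-separated interval. The values below are made up, but the columns follow the description above (chromosome name, start, end, split):

chr1	10000	42768	train
chr1	42768	75536	valid
chr2	50000	82768	test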

launch pretraining run

python -m train wandb=null experiment=hg38/hg38_hyena model.d_model=128 model.n_layer=2 dataset.batch_size=256 train.global_batch_size=256 dataset.max_length=1024 optimizer.lr=6e-4 trainer.devices=1

Let's describe this command a little.

  • experiment=hg38/hg38_hyena passes the config for this experiment using a Hyena(DNA) model
  • model.d_model=128 and model.n_layer=2 select the model width and depth, the key hyperparams
  • dataset.max_length=1024 sets the max sequence length sampled from the dataset, the model layer max length is set from this too, or...
  • model.layer.l_max # you can set the max model length manually
  • model.d_inner # likewise, the reverse bottleneck width can be set manually too (default is 4x d_model)

Lots of other commands you can pass and customize, feel free to check out the experiment=hg38/hg38_hyena for details.

Pretraining on your own data

To pretrain on your own data, all you need is (ideally) a .fasta file. You don't need a .bed file like we used for HG38, we just used that for convenience. You can follow our species classification dataloader for how to setup a general pretraining dataloader that would randomly sample a chromosome and then a sequence of a given length.

Sample pretraining dataloader

src/dataloaders/datasets/species_dataset.py

The species dataloader can be used for pretraining as well by swapping out the .fasta file (for your own) and doing some wrangling with the configs. There are also some code changes needed to map the actual chromosomes you have in your .fasta file, so you'll have to dive into the code and what the dataloader is doing. Most of the work in using this repo is just setting up dataloaders and configs (which takes time, but it's worth it!).

Note: if you plan on pretraining on your own data, make sure to preprocess your data correctly and check that your samples are what you expect in the dataloader (things like uppercase/lowercase, unknown characters, etc.). Also, if your sequences are variable length (in our setting we mostly used fixed lengths, though next token prediction should theoretically handle variable length sequences), then padding may become significant or an issue. ie, if your length range is 100-32k, then the length-100 sequences will have a lot of padding, so you'll need to ignore those tokens in the loss to avoid instability in training. The padding token should be 4 by default, so you can pass this on the command line, +task.loss.ignore_index=4, or modify the config too (under task.loss). See the sketch below for how the ignore_index interacts with padding.
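To make the padding/loss point concrete, here's a tiny, self-contained sketch. The character-to-id mapping is hypothetical (unknown characters are also mapped to the pad id purely for illustration), the repo's tokenizer may differ, and in the repo you'd set this via +task.loss.ignore_index=4 rather than by hand:

import torch
import torch.nn.functional as F

PAD_ID = 4                                       # default pad token id mentioned above
CHAR_TO_ID = {"A": 0, "C": 1, "G": 2, "T": 3}    # hypothetical mapping

def encode(seq, max_length):
    # uppercase, truncate, then right-pad to a fixed length with PAD_ID
    ids = [CHAR_TO_ID.get(c, PAD_ID) for c in seq.upper()[:max_length]]
    ids += [PAD_ID] * (max_length - len(ids))
    return torch.tensor(ids)

targets = torch.stack([encode("acgtACGT", 16), encode("GGGTTT", 16)])
logits = torch.randn(targets.shape[0], 16, 5)    # stand-in for model output (vocab size 5)
# padded positions contribute nothing to the loss, avoiding the instability described above
loss = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=PAD_ID)
print(loss.item())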

GenomicBenchmarks

GenomicBenchmarks is an easy-to-use set of datasets for sequence-level classification. We use it as a good entry point to try new things out.

Sample run:

python -m train wandb=null experiment=hg38/genomic_benchmark dataset_name=human_enhancers_cohn train.pretrained_model_path=/path/to/ckpt dataset.max_length=500 model.layer.l_max=1024

This runs a HyenaDNA model on one of the datasets, auto-downloaded into data/. Here are the other datasets and their stats, which you can pass into this config too. The config in configs/dataset/genomic_benchmark is setup to pull in the correct dataset metadata (num_samples, classes, etc).

Just like the quick entry point explained above, you'll need to set the flag for the dataset.max_length you want to use, as well as model.layer.l_max, which tells the model the max length you want to use. The inputs will be padded up to model.layer.l_max; eg, if a data sample has length 500 and l_max = 1024, positions 501 to 1024 will be padding.

The new flag here for this fine-tune experiment is to pass a pretrained ckpt via train.pretrained_model_path=/path/to/ckpt.

There are 8 datasets in this suite, choose 1 at a time (passing the dataset.dataset_name sets the num_classes and num_seqs automatically).

# name                                num_seqs        num_classes     median len    std
# dummy_mouse_enhancers_ensembl       1210            2               2381          984.4  
# demo_coding_vs_intergenomic_seqs    100_000         2               200           0
# demo_human_or_worm                  100_000         2               200           0
# human_enhancers_cohn                27791           2               500           0
# human_enhancers_ensembl             154842          2               269           122.6
# human_ensembl_regulatory            289061          3               401           184.3
# human_nontata_promoters             36131           2               251           0
# human_ocr_ensembl                   174756          2               315           108.1

Nucleotide Transformer datasets

You can check out the Nucleotide Transformer paper appendix for how to download and process the datasets.

If you'd like to use the pretrained weights we used to finetune on, you'll need the tiny-1k-d256 weights on Huggingface.

Update: Or, you can invest a bit of time and learn how to use Docker, and just use our pre-built Docker image that has the exact Nucleotide Transformer datasets/splits, pretrained weights, and hyperparameters used to obtain the results in the HyenaDNA paper (by far the most convenient way to reproduce results).

sample run

# trains from scratch
python -m train wandb=null experiment=hg38/nucleotide_transformer dataset_name=enhancer dataset.max_length=500 model.layer.l_max=1026

As with GenomicBenchmarks, we need to select which dataset to use from the 17 Nucleotide Transformer datasets.

See the dataset config in configs/dataset/nucleotide_transformer for more dataset metadata, but here's some:

Fields
name max_len n_classes n_samples metric

# enhancer 200   2  14968 MCC
# enhancer_types 200   3  14968 MCC
# H3 500   2  13468 MCC
# H3K4me1  500   2  28509 MCC
# H3K4me2  500   2  27614 MCC
# H3K4me3  500   2  33119 MCC
# H3K9ac   500   2  25003 MCC
# H3K14ac  500   2  29743 MCC
# H3K36me3 500   2  31392 MCC
# H3K79me3 500   2  25953 MCC
# H4 500   2  13140 MCC
# H4ac  500   2  30685 MCC
# promoter_all   300   2  53276 F1
# promoter_non_tata 300   2  47759 F1
# promoter_tata  300   2  5517  F1
# splice_sites_acceptor   600   2  19961 F1
# splice_sites_donor   600   2  19775 F1

The file structure for the data should look like:

data
|-- nucleotide_transformer/
    |-- enhancer/
        |-- all_test_enhancer.fasta
        |-- all_train_enhancer.fasta
    |-- H3/
        |-- H3_test.fasta
        |-- H3_train.fasta
    |-- promoter_tata/
        |-- promoter_tata_test.fasta
        |-- promoter_tata_train.fasta
    |-- ...

In-context Learning

We use the GenomicBenchmarks for exploring in-context learning (ICL). It should autodownload the data into data/.

Soft prompting example run:

python -m evals.soft_prompting_genomics

instruction fine-tune example:

python -m evals.instruction_tuned_genomics

Chromatin Profile

You'll need to see the DeepSEA paper and repo for info on how to download and preprocess the data.

example chromatin profile run:

python -m train wandb=null experiment=hg38/chromatin_profile dataset.ref_genome_path=/path/to/fasta/hg38.ml.fa dataset.data_path=/path/to/chromatin_profile dataset.ref_genome_version=hg38
  • dataset.ref_genome_path # path to a human ref genome file (the input sequences)
  • dataset.ref_genome_version # the version of the ref genome (hg38 or hg19, we use hg38)
  • dataset.data_path # path to the labels of the dataset

Species Classification

You'll need to download fasta files for each species that you want to use (just the .zips; the dataloader will unzip them automatically). You can download them using the following commands:

# Human
wget -P human/ -r -nH --cut-dirs=12 --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/GCA_009914755.4_T2T-CHM13v2.0/GCA_009914755.4_T2T-CHM13v2.0_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
# Lemur
wget -P lemur/ -r -nH --cut-dirs=11 --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Lemur_catta/latest_assembly_versions/GCA_020740605.1_mLemCat1.pri/GCA_020740605.1_mLemCat1.pri_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
# House mouse
wget -P mouse/ -r -nH --cut-dirs=11 --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Mus_musculus/latest_assembly_versions/GCA_921998355.2_A_J_v3/GCA_921998355.2_A_J_v3_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
# Pig
wget -P pig/ -r -nH --cut-dirs=11 --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Sus_scrofa/latest_assembly_versions/GCA_002844635.1_USMARCv1.0/GCA_002844635.1_USMARCv1.0_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/
# Hippo
wget -P hippo/ -r -nH --cut-dirs=11 --no-parent ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/Hippopotamus_amphibius/latest_assembly_versions/GCA_023065835.1_ASM2306583v1/GCA_023065835.1_ASM2306583v1_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/

Your folder structure should look like this:

data
|-- species/
    |-- chimpanzee/
        |-- chr1.fna
        |-- chr2.fna
        |-- ...
    |-- hippo/
        |-- chr1.fna
        |-- chr2.fna
        |-- ...
    |-- human/
        |-- chr1.fna
        |-- chr2.fna
        |-- ...
    |-- mouse/
        |-- chr1.fna
        |-- chr2.fna
        |-- ...
    |-- orangutan/
        |-- chr1.fna
        |-- chr2.fna
        |-- ...
    |-- other species ...
|-- ...

Sample species run:

python -m train wandb=null experiment=hg38/species dataset.species=[human,mouse,hippo,pig,lemur] train.global_batch_size=256 optimizer.lr=6e-5 trainer.devices=1 dataset.batch_size=1 dataset.max_length=1024 dataset.species_dir=/path/to/data/species/ model.layer.l_max=1026 model.d_model=128 model.n_layer=2 trainer.max_epochs=150 decoder.mode=last train.pretrained_model_path=null train.pretrained_model_state_hook=null

Let's break some of these args down:

  • experiment=hg38/species # main config for this experiment
  • dataset.species # list of species you want (and already downloaded their .fasta files)
  • decoder.mode=last # using the last token to classify (instead of default pooling)
  • train.pretrained_model_path # if using a pretrained model, point to it, if not, set to null
  • train.pretrained_model_state_hook=null # if using a pretrained model, this will load the weights properly (and not head). if not, set to null

More advanced stuff below

Setting up downstream experiments (fine tuning)

Let's see what's needed to set up a downstream task.

The main ingredients are:

  1. Model weights and model config (which are provided via HuggingFace at the top)
  2. Custom dataset class and dataloader class
  3. Configs for experiment, dataset, pipeline, model. Don't worry, we have examples for each of these.

Again, example run, breakdown in launch command:

python -m train wandb=null experiment=hg38/genomic_benchmark

Model config:

We talked about some of the model config settings above. We placed the model config within the experiment config for convenience (which can override it, basically), but you can place it in the configs/model dir if you want. There is a separate layer config at configs/model/layer. This is where it's useful to understand the Hydra config stuff.

Flags for using ultralong context (gradient checkpointing)

We have a checkpoint flag that allows ~3x less memory on a GPU (to enable longer sequences). However, this means you may have trouble loading checkpoints if you don't set the flags correctly (they need to be True if the model was pretrained with them, and False if not).

  • model.checkpoint_mixer: True # set true for memory reduction
  • model.checkpoint_mlp: True # set true for memory reduction

Note: if a flag isn't in the config and you want to pass it on the command line, you need to add a + in front, like this: +model.checkpoint_mixer=True

If you get an error (like the one below) with the state_dict keys not matching, it's likely due to these flags, so toggle them on/off:

Missing key in pretrained model! backbone.layers.0.mixer.layer.filter_fn.bias

Setting up a Dataset class

Here's a sample dataset class for a DNA downstream task.

src/dataloaders/datasets/genomic_bench_dataset.py

It's basically a standard Pytorch dataset. Place data in data/, with something like data/your_custom_dataset_name, so the repo can find it.

Here's a sample dataloader for a DNA downstream task. This one requires a bit more actual wiring into the HyenaDNA repo.

src/dataloaders/genomic_bench_dataloader.py

Notice the name is set with _name_ = "genomic_benchmark" as a class attribute. This name is how we find it. Also, we need to add the dataloader file to the __init__; see the top of this script: src/dataloaders/__init__.py.

I would emulate this dataloader file. It's basically a way for Pytorch Lightning to handle a lot of the dataloading stuff in the background. Pass the params you need to the init to create it. Notice the def setup(); this is where the dataset class is instantiated. setup() gets called in the training script (more on that later).

There are 3 dataloader functions that create the train/val/test dataloaders. (In this example, the dataset only uses the train and test dataloaders.) A rough skeleton is sketched below.
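To make the shape of that file concrete, here's a rough, hypothetical skeleton (class, dataset, and argument names are all illustrative; the real base class and registry hookup live in the repo, so emulate src/dataloaders/genomic_bench_dataloader.py for the actual details):

from torch.utils.data import DataLoader, Dataset

class MyCustomDataset(Dataset):
    # stand-in dataset; see src/dataloaders/datasets/genomic_bench_dataset.py for a real one
    def __init__(self, split, max_length):
        self.samples = [("ACGT" * (max_length // 4), 0)]   # (sequence, label) placeholder

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

class MyCustomDataloader:
    _name_ = "my_custom_dataset"    # the configs/registry find the dataloader by this name

    def __init__(self, max_length=1024, batch_size=32, num_workers=4, **kwargs):
        self.max_length = max_length
        self.batch_size = batch_size
        self.num_workers = num_workers

    def setup(self):
        # called from the training script; instantiate the Dataset objects here
        self.dataset_train = MyCustomDataset("train", self.max_length)
        self.dataset_test = MyCustomDataset("test", self.max_length)

    def train_dataloader(self, **kwargs):
        return DataLoader(self.dataset_train, batch_size=self.batch_size, num_workers=self.num_workers, shuffle=True)

    def val_dataloader(self, **kwargs):
        # this example reuses the test split for validation
        return DataLoader(self.dataset_test, batch_size=self.batch_size, num_workers=self.num_workers)

    def test_dataloader(self, **kwargs):
        return DataLoader(self.dataset_test, batch_size=self.batch_size, num_workers=self.num_workers)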

Creating Configs

As mentioned above, the main config is the experiment config, and for our example, located here configs/experiment/hg38/genomic_benchmark.yaml.

You can think of each of these sections as their own configs too, eg, model, task, optimizer, etc. You can write them in here, or have them referenced at the top (as a default or override; subtle differences).

For a new dataset, we need a new dataset config and a pipeline config. These configs get passed when they're instantiated.

The pipeline config hasn't been mentioned yet, but it's where we define a few different things. Take a look inside:

configs/pipeline/genomic_benchmark.yaml

Try to emulate this config too; it gets referenced at the top of the experiment config. We select the optimizer, scheduler, name of the dataset, and the task (typically classification for these downstreams, but we have other options for the decoder). Don't worry about the encoder. We do use a decoder, which is just a single MLP that maps the backbone to the number of classes we're trying to predict. When you create the dataset class, it will require a d_output for the number of classes, and the decoder will automatically pull this attribute in the background, as well as the dimension of the backbone from d_model. The decoder can also have options, like pool, where we average the token embeddings, or last or first, meaning which token the MLP learns from.

If you want to train at different sequence lengths, there are a few places we need to change too, namely the dataset config and the model configs. You could change these in the experiment config, or individually set up defaults in the standalone dataset / dataloader configs, up to you.

The dataset config expects a max_length to be set.

model.layer.l_max expects a length too, usually set to the dataset max_length + 2.
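As a minimal illustration (values are examples only, and the nesting mirrors the dotted override paths used throughout this README), the two settings look like this inside an experiment config:

dataset:
  max_length: 1024      # length the dataset samples / pads to
model:
  layer:
    l_max: 1026         # usually dataset max_length + 2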

Launch a finetuning experiment

# example downstream task
python -m train wandb=null experiment=hg38/genomic_benchmark train.pretrained_model_path=<path_to_ckpt>

The dataset will automatically download to the data/ dir (probably), and it's not that large; setup takes ~5-10 min. All you need to do is download the weights from HuggingFace above, and change the configs to match the model settings and the dataset seq_len you want to use. It might take some fumbling around to get right, but it'll be worth it!

To describe this experiment config a little more, let's dive in. It finetunes a HyenaDNA model (GPT-like). Let's focus on the train arguments.

  • remove_test_loader_in_eval: true # no test set in this benchmark
    We have the option to remove an extra test_loader, eg, if val and test are the same.

  • pretrained_model_strict_load: False # false allows encoder/decoder to be used if new model uses it
    Set false to play nicely when loading pretrained weights

Loading the backbone (and not the head) requires both of the flags below (see the config snippet after this list):

  • pretrained_model_path: /home/workspace/eric/safari-internal/outputs/2023-03-23/07-10-41-239444/checkpoints/val/loss.ckpt This is where we pass the pretrained model to use as a backbone

  • pretrained_model_state_hook

  • _name_: load_backbone This is a custom hook function that will load the backbone properly with a new MLP decoder head for the downstream task.

  • freeze_backbone: false # seems to work much better if false (ie finetune entire model)
    We have the option to freeze here.
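Putting those keys together, the train section of the experiment config looks roughly like this (the path is a placeholder and the exact nesting is approximate; check the provided genomic_benchmark.yaml for the authoritative version):

train:
  remove_test_loader_in_eval: true        # no test set in this benchmark
  pretrained_model_strict_load: false     # play nicely when loading pretrained weights
  pretrained_model_path: /path/to/downloaded/weights.ckpt
  pretrained_model_state_hook:
    _name_: load_backbone                 # loads the backbone, attaches a fresh decoder head
    freeze_backbone: false                # false = finetune the entire model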

Loading a finetuned model

Next, we'll show an example of loading weights that were finetuned on a downstream task (it will continue to train, though).

  • see weights from HuggingFace above.
  • They are for a 2 layer, d_model=128 (width) model, with max_length=1024 (sequence len).
  • Place these somewhere in the repo; typically we place them in the outputs/ dir.

The main things we need to do now are to update appropriate args in a config.

# path to the finetuned model's config
safari-internal/configs/experiment/hg38/genomic_benchmark_load_finetuned_model.yaml

For this config, select the dataset you want to train on with dataset.dataset_name; we'll use human_nontata_promoters, since this is what the weights above were fine-tuned on.

Next, you need to update train.pretrained_model_path: path_to_ckpt, to wherever you placed them in the repo.

Now we can launch a run with this:

python -m train wandb=null experiment=hg38/genomic_benchmark_load_finetuned_model

This will run the main src/train.py script.

Let's point out a few key locations in the train.py script, since it's a little confusing where everything gets called.

  • loading weights occurs in train.py, in the def load_state_dict() function. It actually calls a custom state hook to load gracefully (the load_backbone() function in src/models/sequence/long_conv_lm.py).

  • forward prop is done in the def forward() function, inside the SequenceLightningModule of train.py, but really it calls self.task.forward(), which actually makes the call to the model. That is to say, you need to go to src/tasks/tasks.py, find the class LMTask, and its def forward() function. There you'll see the actual call to the model. Note, the decoder head (a single MLP for classification) is separate from the main model backbone (feature extractor).

Sequence Length Warmup Callback

We have a sequence length warmup scheduler, implemented using a callback, which increases the sequence length in stages during training. Basically, the script checks what epoch and "stage" the training is at, and updates the dataset/dataloaders to the parameters for that stage. Currently, you need to specify the stages manually in a config; the example config is linked below, and the relevant portion is at the bottom of that config (also shown here):

configs/experiment/hg38/hg38_hyena_seqlen_warmup_reload.yaml

Guidance: You have to be careful to know ahead of time that the batch size and seq len will fit into memory for EACH stage.

To make your dataloader compatible with the seqlen warmup, you need to implement an interface, init_datasets(); a rough sketch of the idea is below.
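As a hypothetical sketch only (the real interface lives in the repo's dataloaders, and the exact signature may differ): the callback updates attributes like max_length and batch_size for the new stage and then calls init_datasets() so the dataloader rebuilds its datasets accordingly.

import torch
from torch.utils.data import TensorDataset

class MyPretrainingDataloader:
    _name_ = "my_pretraining_dataset"

    def __init__(self, max_length=1024, batch_size=256, **kwargs):
        self.max_length = max_length
        self.batch_size = batch_size
        self.init_datasets()

    def init_datasets(self):
        # rebuild the datasets with the *current* self.max_length / self.batch_size;
        # the seqlen warmup callback updates those attributes each stage, then calls this again
        dummy = torch.randint(0, 5, (8, self.max_length + 1))   # placeholder token ids
        self.dataset_train = TensorDataset(dummy[:, :-1], dummy[:, 1:])
        self.dataset_test = TensorDataset(dummy[:, :-1], dummy[:, 1:])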

The sharp edges:

To use this callback, we'll use the sample config above, configs/experiment/hg38/hg38_hyena_seqlen_warmup_reload.yaml.

You'll need to design the stages manually, ie, at what epoch and seq len you want to gradually increase the seq len (and lower the batch size). Note, the epochs at each stage mean how long we run that stage for (it's not cumulative).

callbacks:
  seqlen_warmup_reload:
    # epochs refers to how long to run at that stage (not cumulative!)
    # this is just a sample
    stage_params:
      - epochs: 2  # means run this stage for 2 epochs (0, and 1)
        seq_len: 1024
        batch_size: 256  # in the background, grad accum = 1, since train.global_batch_size=256
      - epochs: 2  # run for 2 epochs (2 and 3)
        seq_len: 2048
        batch_size: 128
      - epochs: 2  # run for epochs 4, 5
        seq_len: 4096  #
        batch_size: 64
      - epochs: 2  # epoch 6, 7
        seq_len: 8192  
        batch_size: 32
      - epochs: 4  #  epoch 8, 9, 10, 11
        seq_len: 16_384  # 
        batch_size: 16
      - epochs: 4  # epoch 12, 13, 14, 15
        seq_len: 32_768
        batch_size: 8

As for the other important parameters you pass on the command line, see the sample config and note the following:

  • train.global_batch_size # don't forget to set this! It controls accumulate_grad_batches to keep the lr consistent at each stage; eg, 256 or 128 typically (maybe 64 for very long seqs). A sample launch is shown after this list.
  • dataset.batch_size # now refers to the test (ie, final stage) batch size; the test set will always be the same
  • dataset.max_length # now refers to the test (ie, final stage) max_length; the test set will always be the same
  • model.layer.l_max # needs to be set to the highest seq len + 2 (the test set size)
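For example, a launch pairing the warmup config with these flags might look like the following (values are illustrative and match the final stage of the sample schedule above):

python -m train wandb=null experiment=hg38/hg38_hyena_seqlen_warmup_reload train.global_batch_size=256 dataset.batch_size=8 dataset.max_length=32768 model.layer.l_max=32770 trainer.devices=1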

Things to note:

The train dataset will change during training, but the test set will always be fixed. The test len/batch size is set the normal way in your command launch, ie, dataset.batch_size and dataset.max_length.

Getting logits from pretrained model

Here's a simple script to get the logits from a pretrained model.

This isn't automated, so you'll need to download the weights manually from HF, and place them locally somewhere. You need the model head to get the logits.

Difference from the huggingface.py script: that one is meant for getting embeddings easily, which doesn't use the model head. We don't have a current use case for the logits yet, so there are some extra steps if you want those.
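In the meantime, if you do have a model whose forward pass returns next-token logits of shape (batch, seq_len, vocab_size) (ie, with the LM head attached), the per-sequence log-likelihood can be computed with plain PyTorch; this is a generic sketch, not the repo's script:

import torch
import torch.nn.functional as F

def sequence_log_likelihood(logits, input_ids):
    # predict token t+1 from position t, so drop the last logit and shift targets left by one
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum(dim=-1)            # one log-likelihood per sequence

# toy check with random stand-in logits (vocab size 12 is arbitrary here)
ids = torch.randint(0, 12, (1, 16))
print(sequence_log_likelihood(torch.randn(1, 16, 12), ids))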

Experimental

  1. We have an experimental bidirectional implementation of HyenaDNA. We used this in a recent ablation on the GenomicBenchmarks dataset where we trained from scratch, ie, did not pretrain using masked language modeling (BERT-style). We compared this to the standard causal HyenaDNA, and the causal version performed better. But some people very much want a bidirectional HyenaDNA, so we provide one instantiation of it, of which there are many ways to do bidirectionality.

As for how we implemented it, we simply manipulate the padding of the FFT convolution. Check out the src/models/sequence/hyena.py script for more details (eg, just search for bidirectional).

To use bidirectional, pass in the flag (at launch) model.bidirectional=True, that's it!

Note, the codebase only supports bidirectional training from scratch on a downstream task, ie, no masked language model pretraining. It doesn't make sense to do causal pretraining with bidirectionality, so use at your own risk!

  2. For downstream tasks, we added an option to pass a mask so that only the non-padded tokens are used / averaged over. We updated the GenomicBenchmarks and Nucleotide Transformer datasets with this ability; see their dataset classes for how it's implemented. To use it:
  • you need to set the right config settings. See their experiment configs, eg, configs/experiment/hg38/genomic_benchmark.yaml, and in particular, use dataset.return_mask=True and dataset.padding_side=right
  • you need to set the new task, called masked_multiclass, also in the experiment config. All this does (differently than before) is handle passing the masks correctly to the model.

In practice, for short-range tasks without a lot of padding, we noticed it didn't make too much of a difference. But if your sequences have a ton of padding, then this will definitely help. In the paper, we didn't use this, and had left-side padding by default. An example launch combining these flags is below.
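Putting the pieces together, a from-scratch launch might look like the following (illustrative; remember that the masked_multiclass task also needs to be set in the experiment config, as noted above):

python -m train wandb=null experiment=hg38/genomic_benchmark model.bidirectional=True dataset.return_mask=True dataset.padding_side=right train.pretrained_model_path=null train.pretrained_model_state_hook=null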

Change Log / updates:

  • Added more weights to HuggingFace.
  • Added a Docker image with the Nucleotide Transformer datasets, weights, and exact hyperparameters to reproduce results.
  • Added an experimental bidirectional option. See Experimental.
  • Added an option to pass a mask and ignore padded tokens for downstream tasks. See Experimental.
  • Added some tips on pretraining on your own data.
  • Added an example to get logits from a pretrained model.

Citation

Feel free to cite us if you find our work useful :)

@article{nguyen2023hyenadna,
      title={HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution}, 
      author={Eric Nguyen and Michael Poli and Marjan Faizi and Armin Thomas and Callum Birch-Sykes and Michael Wornow and Aman Patel and Clayton Rabideau and Stefano Massaroli and Yoshua Bengio and Stefano Ermon and Stephen A. Baccus and Chris Ré},
      year={2023},
      eprint={2306.15794},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

hyena-dna's People

Contributors

awaelchli, cbirchsy, eltociear, exnx, jondeaton, miking98


hyena-dna's Issues

Nucleotide vs codons

Have you ever considered using codons instead of single-nucleotide prediction?
I mean, convert the sequence to codon codes, train, and try to predict the next codon?

flash-attention installation

Hi All,

Thank you for building this tool, we are excited to try it. I'm having an issue with the flash-attention installation.

I have torch installed, although with the pip install -e . step, I'm getting a module error.

cd flash-attention/
git submodule update --init
pip install -e .

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Obtaining file:///home/rohan/rohan/hyena-dna/flash-attention
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... error
ERROR: Command errored out with exit status 1:
command: /home/rohan/anaconda3/bin/python /home/rohan/anaconda3/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_editable /tmp/tmpdk8hywoz
cwd: /home/rohan/rohan/hyena-dna/flash-attention
Complete output (17 lines):
Traceback (most recent call last):
File "/home/rohan/anaconda3/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in
main()
File "/home/rohan/anaconda3/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/home/rohan/anaconda3/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 144, in get_requires_for_build_editable
return hook(config_settings)
File "/tmp/pip-build-env-n_q31tvm/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 450, in get_requires_for_build_editable
return self.get_requires_for_build_wheel(config_settings)
File "/tmp/pip-build-env-n_q31tvm/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 341, in get_requires_for_build_wheel
return self._get_build_requires(config_settings, requirements=['wheel'])
File "/tmp/pip-build-env-n_q31tvm/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 323, in _get_build_requires
self.run_setup()
File "/tmp/pip-build-env-n_q31tvm/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 338, in run_setup
exec(code, locals())
File "", line 13, in
ModuleNotFoundError: No module named 'torch'

Flash Attention 2

Hi team Hazy Research,
It's just a matter of time before you get this question, but is HyenaDNA going to use Flash Attention 2 vs 1? The improvements listed on the repo for v2 seem pretty significant, but v1 is what's linked in HyenaDNA.
I also see that you work with the Flash Attention team based on the author/contributor list, so it probably won't be long until we see this change...

Language model inference

Thank you for your great work!

While the Huggingface examples seem to be about embeddings, I would love to do inference with the language modeling head. I'm particularly interested in doing variant effect prediction using the log-likelihood of a sequence (as in this protein paper). It would also be helpful if you could implement the AutoModelForCausalLM API.

Next token prediction - head code location / config to pass

Hello! Great work and thanks for the opensource! I'm trying to check the pretraining and see the next token prediction on a dataset we have.

I find the standalone model probably better for loading pretrained weights and changing the dataloading for this purpose. But in the standalone model, there is no head provided for next token prediction. It is said to be in the main HyenaDNA code, but it is a bit hard for me to find where it lies and how I can modify the standalone .py for it. Can you help me with where this part of the code is, or maybe how we can modify the config file for this purpose?

Also, I'm not an expert in genome study, so I'm not familiar with the data structure if it is assumed to be well known. Since I don't have access to the main data you use, but do have some other single-chromosome genome sequences, can you let me know the data structure so I can generate the correct files needed? (The .fa, I think, consists of a line of info and a line of sequence for each record? And the .bed has the starting and ending position information?)

Thanks!

How to convert the batch cell from the GenomicBenchmarks data to user data? CUDA memory overload if running "Single example" cell multiple times to produce embeddings.

Could you, please, help me with using HyenaDNA for inference? I'm trying to produce embeddings for a series of long sequences (about 1500 sequences of up to 400,000 nucleotides). When I try running the "single example" method from colab notebook, it can only be run one time before CUDA memory is filled (torch.cuda.empty_cache() doesn't help) and colab session needs to be restarted. Most likely it is necessary to use the "Batch example" method but it seems to be designed around the GenomicBenchmarks dataset. Is there any way to repurpose it towards user-input data? Effectively I have a list of DNA sequences strings; how do I pass them to the model correctly in batch format?

Predicting probability vectors of equal length to input sequence

Hi Eric and HazyResearch team

Thank you for providing such exciting research to the public!

I am currently interested in whether HyenaDNA fine-tuning can operate without the binary classification decoder.

The reason is because I wish to predict at least two probabilities per input nucleotide, for instance:
input: [A, T, C, G, ...]
output: [ [0.8, 0, 0, 0], [0, 0, 0, 0.7] ]

May I seek your advice on whether this is possible, and if so, how to do so?

Thank you for your time.

Best Regards
WY

Unable to download the Human Reference Genome data

Unable to download the hg38 dataset: AccessDeniedException: 403 user does not have serviceusage.services.use access to the Google Cloud project. Permission 'serviceusage.services.use' denied on resource (or it may not exist).
Is there any way we could download/access the dataset? Thanks.

CUFFT-type error when running huggingface.py to generate embeddings

Hello,
I am using a slightly modified version of the huggingface.py script to generate embeddings from fasta files. I am using the largest model (1M window size), and running it on an A100 80GB.

I just added a loop at the end of huggingface.py which loads fasta files and gets embeddings:

for record in records:
    print(record.id)
    sequence = str(record.seq)[0:max_length]
    tok_seq = tokenizer(sequence)
    tok_seq = tok_seq["input_ids"]  # grab ids

    # place on device, convert to tensor
    tok_seq = torch.LongTensor(tok_seq).unsqueeze(0)  # unsqueeze for batch dim
    tok_seq = tok_seq.to(device)

    # prep model and forward
    model.to(device)
    model.eval()
    with torch.inference_mode():
        embeddings = model(tok_seq)

However, after a few hundred iterations I get the following CUFFT error, which seems related to out of memory issues:

Traceback (most recent call last):
  File "huggingface_1Mbp.py", line 271, in <module>
    embeddings = model(tok_seq)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hyena-dna/standalone_hyenadna.py", line 914, in forward
    hidden_states = self.backbone(input_ids, position_ids=position_ids)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hyena-dna/standalone_hyenadna.py", line 728, in forward
    hidden_states, residual = layer(hidden_states, residual)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hyena-dna/standalone_hyenadna.py", line 530, in forward
    hidden_states = self.mixer(hidden_states, **mixer_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hyena-dna/standalone_hyenadna.py", line 288, in forward
    v = self.filter_fn(v, l_filter, k=k[o], bias=bias[o])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hyena-dna/standalone_hyenadna.py", line 222, in forward
    y = fftconv(x, k, bias)
  File "/home/hyena-dna/standalone_hyenadna.py", line 53, in fftconv
    k_f = torch.fft.rfft(k, n=fft_size) / fft_size
RuntimeError: cuFFT error: CUFFT_ALLOC_FAILED

So I was wondering, if there is a way to flush the memory between iterations, in order to prevent this kind of error?
Thanks!

CUDA out of memory with hyena-1m on A100-80G

Hello,

Thank you for sharing such great work!
Based on the A100-80G, I tried to use hyena-1m on a species classification task but got the error "CUDA out of memory."
Here is my training command
python -m train wandb=null experiment=hg38/species dataset.species=[human,mouse,hippo,pig,lemur] train.global_batch_size=256 optimizer.lr=6e-5 trainer.devices=4 dataset.batch_size=1 dataset.max_length=1000000 dataset.species_dir=/data/species_cls/ model.layer.l_max=1000002 model.d_model=256 model.n_layer=8 trainer.max_epochs=150 decoder.mode=last train.pretrained_model_path=null train.pretrained_model_state_hook=null

I noticed that the A100-80G should be able to train 1m models. Is there anything extra I should be aware of?

Addition to Transformers

First of all, excellent work! This model holds so much promise!

Is this model already in Hugging Face Transformers? If not, I would like to volunteer to add this model to the Hugging Face Transformers framework so it can be used by researchers easily.

Thanks!

Questions on running as module

I have completed the conda environment setup, and the code in the "Quick Entry point" section, which is
python -m train wandb=null experiment=hg38/genomic_benchmark_scratch, runs correctly.

But when I run the code in the "In-context Learning" section, something goes wrong, as shown below. Could you help me figure out what's wrong? Thank you for your time ~

(hyena-dna) zhguo@Dell:~/git/hyena-dna$ python -m evals/soft_prompting_genomics
/home/zhguo/app/miniconda/envs/hyena-dna/bin/python: No module named evals/soft_prompting_genomics

Bugs when I try to access the embeddings

Hi, I ran into a bug when trying to access the embeddings from HyenaDNA, specifically with this code:

/evals/hg38_inference.py

Traceback (most recent call last):
File "/gpfs/gibbs/pi/zhao/tl688/hyena-dna/evals/hg38_inference.py", line 115, in
task = HG38Encoder(args.model_cfg, args.ckpt_path, max_seq_len=1024)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfs/gibbs/pi/zhao/tl688/hyena-dna/evals/hg38_inference.py", line 34, in init
self.model, self.tokenizer = self.load_model(model_cfg, ckpt_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfs/gibbs/pi/zhao/tl688/hyena-dna/evals/hg38_inference.py", line 60, in load_model
config = yaml.load(open(model_cfg, 'r'), Loader=yaml.FullLoader)
^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './configs/evals/hyena_small_150b.yaml'

git-lfs missing from container

Hi!

Thanks for this very cool repository, the preprint is very cool, too!

I am a total beginner with the relevant machine learning libraries, but I think git-lfs is missing from the hyena-dna docker container. Admittedly, I did a bit of weird stuff with it because I can't execute Docker on the HPC, so I converted it to Singularity. Nevertheless, I'd expect to be able to call git-lfs from there too, and it seems to be missing.

Here's what I did:

# build image from existing docker container
singularity build hyena-dna.sif docker://hyenadna/hyena-dna-public:latest

# can not use the hyena-dna inside the container because it tries to write into the same folder, therefore using a local clone, container only holds the dependencies
git clone https://github.com/HazyResearch/hyena-dna.git
cd hyena-dna

# SINGULARITYENV_CUDA_VISIBLE_DEVICES=1 says "use GPU_1" (it will otherwise use GPU_0 by default)
SINGULARITYENV_CUDA_VISIBLE_DEVICES=1 singularity exec --nv ~/images/hyena-dna.sif python -m train wandb=null experiment=hg38/genomic_benchmark_scratch # works like charm! 

SINGULARITYENV_CUDA_VISIBLE_DEVICES=1 singularity exec --nv ~/images/hyena-dna.sif python -m huggingface # fails

Error:

Using device: cuda
git: 'lfs' is not a git command. See 'git --help'.

The most similar command is
	log
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 185, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/opt/conda/lib/python3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/home/hoffk83/images/git/hyena-dna/huggingface.py", line 251, in <module>
    inference_single()
  File "/home/hoffk83/images/git/hyena-dna/huggingface.py", line 209, in inference_single
    model = HyenaDNAPreTrainedModel.from_pretrained(
  File "/home/hoffk83/images/git/hyena-dna/huggingface.py", line 106, in from_pretrained
    config = json.load(open(os.path.join(pretrained_model_name_or_path, 'config.json')))
FileNotFoundError: [Errno 2] No such file or directory: './checkpoints/hyenadna-small-32k-seqlen/config.json'
(base) hoffk83@vision-05:~/images/git/hyena-dna$ ls ./checkpoints/hyenadna-small-32k-seqlen/config.json
ls: cannot access './checkpoints/hyenadna-small-32k-seqlen/config.json': No such file or directory
(base) hoffk83@vision-05:~/images/git/hyena-dna$ ls ./checkpoints/hyenadna-small-32k-seqlen/config.json
ls: cannot access './checkpoints/hyenadna-small-32k-seqlen/config.json': No such file or directory

I think it could be fixed by adding git-lfs to the requirements.txt, and then rebuilding the container.

Trouble reproducing Genomics Benchmark Result

Hello,

Thank you for the great work and repo!

I am trying to reproduce the HyenaDNA column of Table 4.1 (GenomicsBenchmark). I am using the weights from LongSafari/hyenadna-tiny-1k-seqlen. However, I am unable to reproduce the results.

Would it be possible for you to specify which hyperparams from Table A.3 of the paper were used for each of the datasets in this benchmark?

Error in Pretraining on Human Genome

Hello, I was trying to follow your directions on pretraining on the human genome (as a test before I try to pretrain on my own data) and I keep getting this error:

RuntimeError: Trying to resize storage that is not resizable

The first time it happened after training epoch 40 and the second time after training epoch 60. Do you know what the error could be?

Thanks for any help. I do not seem to have any problems with Fine tuning.

Thanks,
LeAnn

Epoch 60: 95%|▉| 135/142 [00:15<00:00, 8.94it/s, loss=1.17, val/loss=1.170, val/num_tokens=1.37e+8, val/perplexity=3.230, test/loss=1.170, test/num_tokens=1.2e+8, test/perplex...]
Error executing job with overrides: ['wandb=null', 'experiment=hg38/hg38_hyena', 'model.d_model=128', 'model.n_layer=2', 'dataset.batch_size=256', 'train.global_batch_size=256', 'dataset.max_length=1024', 'optimizer.lr=6e-4', 'trainer.devices=1']
Traceback (most recent call last):
File "/uufs/chpc.utah.edu/common/home/sundar-group2/PHAGE/MODELS/P100_HYENA/hyena-dna/train.py", line 691, in main
train(config)
File "/uufs/chpc.utah.edu/common/home/sundar-group2/PHAGE/MODELS/P100_HYENA/hyena-dna/train.py", line 672, in train
trainer.fit(model)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
results = self._run_stage()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
self._run_train()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
self.fit_loop.run()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.on_advance_end()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 251, in on_advance_end
self._run_validation()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 310, in _run_validation
self.val_loop.run()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 121, in advance
batch = next(data_fetcher)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in next
return self.fetching_function()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 265, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/pytorch_lightning/utilities/fetching.py", line 280, in _fetch_next_batch
batch = next(iterator)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in next
data = self._next_data()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
return self._process_data(data)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
data.reraise()
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
return self.collate_fn(data)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
return collate(batch, collate_fn_map=default_collate_fn_map)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 143, in collate
return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed] # Backwards compatibility.
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 143, in
return [collate(samples, collate_fn_map=collate_fn_map) for samples in transposed] # Backwards compatibility.
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 120, in collate
return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
File "/uufs/chpc.utah.edu/common/home/u1323098/software/pkg/miniconda3/envs/p100_hyena-dna/lib/python3.8/site-packages/torch/utils/data/utils/collate.py", line 162, in collate_tensor_fn
out = elem.new(storage).resize
(len(batch), *list(elem.size()))

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Running with standard Huggingface config and trainer files does not give optimal results

Hello, I have been running your model since last summer using a standard Huggingface model framework (see code below), and it has not been giving us the same results on the benchmarking tests as you report in the paper. For example:

GenomicBenchmarks

Mouse Enhancers, you report 85.1. our results 63.6
Human Enhancers Cohn, you report 74.2, our results 66.3

I think it is possible that this is because we are not using the parameter that you have at the bottom of the config file, freeze_backbone: false, but I am not sure how to incorporate this into a standard Huggingface trainer.

Do you support the huggingface trainer or only the hydra/lightning trainer?

My concern is that since we are not able to match your reported results, perhaps your model is not performing optimally on our specific classification task. I had expected it to be comparable in performance to DNABERT2 but it was not. I think this may be because we have not set up our run correctly. Any direction would be appreciated. Thank you.

Sample Code
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

checkpoint = 'LongSafari/hyenadna-tiny-16k-seqlen-d128-hf'
max_length = 4010

# load the tokenizer and model before constructing the Trainer (they were defined after it in the original)
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto",
    pad_token_id=tokenizer.pad_token_id, trust_remote_code=True)

args = {
    "output_dir": "test_output",
    "num_train_epochs": 25,
    "per_device_train_batch_size": 512,
    "per_device_eval_batch_size": 512,
    "gradient_accumulation_steps": 4,
    "gradient_checkpointing": False,
    "learning_rate": 2e-5,
    "evaluation_strategy": "steps",
    "eval_steps": 1,
    "report_to": "none",  # TrainingArguments has no "wandb" field; this disables W&B logging
}
training_args = TrainingArguments(**args)

trainer = Trainer(model=model, args=training_args, train_dataset=ds_tok_train,
                  eval_dataset=ds_tok_val, compute_metrics=compute_metrics)
trainer.train()
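
If you want to mirror the repo's freeze_backbone flag when using a plain Hugging Face Trainer, it comes down to turning off gradients on the backbone parameters before constructing the Trainer. A minimal sketch follows; the head-name keywords are assumptions, not the checkpoint's actual parameter names, so inspect model.named_parameters() to see what your model calls its classification head:

    # hypothetical sketch: freeze everything except the classification head
    # (the keyword list is a guess -- check model.named_parameters() for the real names)
    head_keywords = ("score", "classifier", "head")
    for name, param in model.named_parameters():
        param.requires_grad = not any(k in name for k in head_keywords)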

ImportError: dropout_add_layer_norm is not installed

Hi,

Great work! When I was trying to run python -m train wandb=null experiment=hg38/genomic_benchmark_scratch, I encountered the following error:

Error executing job with overrides: ['wandb=null', 'experiment=hg38/genomic_benchmark_scratch']
Traceback (most recent call last):
File "/home/fankunjie/toxic_gene/hyena-dna/train.py", line 691, in main
train(config)
File "/home/fankunjie/toxic_gene/hyena-dna/train.py", line 653, in train
model = SequenceLightningModule(config)
File "/home/fankunjie/toxic_gene/hyena-dna/train.py", line 148, in init
self.setup() ## Added by KS
File "/home/fankunjie/toxic_gene/hyena-dna/train.py", line 172, in setup
self.model = utils.instantiate(registry.model, self.hparams.model)
File "/home/fankunjie/toxic_gene/hyena-dna/src/utils/config.py", line 104, in instantiate
return obj()
File "/home/fankunjie/toxic_gene/hyena-dna/src/models/sequence/dna_embedding.py", line 36, in init
self.backbone = LMBackbone(
File "/home/fankunjie/toxic_gene/hyena-dna/src/models/sequence/long_conv_lm.py", line 305, in init
raise ImportError("dropout_add_layer_norm is not installed")
ImportError: dropout_add_layer_norm is not installed

I followed exactly the same steps mentioned in the Readme, except the installation of flash attention. I had trouble installing flash attention, and successfully installed it using pip install flash-attn --no-build-isolation following this post: Dao-AILab/flash-attention#246.
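
For reference, the missing symbol comes from the optional fused layer-norm extension built from flash-attention's csrc/layer_norm, not from the flash-attn package itself. A quick sketch to check whether it is importable in your environment, assuming the import path this repo's long_conv_lm.py uses; if it is not available, either build csrc/layer_norm or turn off the fused path in the model config:

    # sketch: check for the fused kernel behind "dropout_add_layer_norm is not installed"
    try:
        from flash_attn.ops.layer_norm import dropout_add_layer_norm  # noqa: F401
        print("fused dropout_add_layer_norm is available")
    except ImportError:
        print("not available -- build flash-attention/csrc/layer_norm, or disable the fused layer norm in the model config")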

Thanks for your help!

The default for pretrained_model_path in config files is a personal directory

I noticed when trying to repeat the genomic benchmark experiments that your default model path in all of your config files is a checkpoint within a personal directory

For example, line 92 in hyena-dna/configs/experiment/hg38/genomic_benchmark.yaml:

pretrained_model_path: /local-scratch/nigam/projects/mwornow/projects/safari-internal/outputs/2023-04-14/2_128_1024.ckpt

I would like to use a pretrained model, but this parameter does not seem to accept the huggingface model, or a downloaded directory of the huggingface model (I tried both). It is looking for a file ending in .ckpt not a directory.

Can you provide us with the appropriate pre-trained checkpoint files?

Or provide instructions on how to obtain them from the model files on huggingface?
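
In the meantime, one possible workaround is to download the original (non -hf) weights locally with huggingface_hub and point pretrained_model_path at the checkpoint file inside the downloaded folder. A sketch is below, with the caveat that the exact filename inside the repo is not asserted here, so list the directory after downloading:

    from huggingface_hub import snapshot_download

    # sketch: fetch the original (non -hf) repo and look for a Lightning-style checkpoint inside
    local_dir = snapshot_download("LongSafari/hyenadna-small-32k-seqlen")
    print(local_dir)  # set pretrained_model_path to the checkpoint file found in this directory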

Thank you,
LeAnn

How to correctly provide padding tokens to forward pass of pretrained model?

Hi there, thanks for this repo and the pretrained models.

I have a question on batching sequences of varying length. I've found the padding token and tokenizer to work effectively, but I see no input of an attention mask to the forward pass of the model.

I've tried passing a padded sequence, e.g. padded with 4s as output by the tokenizer, and a non-padded sequence. The resulting embeddings of at least the last few tokens are very different between these two examples.

The common pattern is to also provide an attention mask. I tried passing it like model(input_ids, attn_mask=attn_mask), but that isn't how the model is set up. I looked through the source code and can't find any mechanism by which an attention mask would be applied.

Is there a way to batch sequences of varying length, and if so, how should I do it?
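
One pattern that may help while there is no attention-mask input: build a mask outside the model from the pad token id and pool only over the real tokens (or take the embedding at the last non-pad position). A minimal sketch under that assumption, with hypothetical tokenizer/model objects already loaded:

    import torch

    seqs = ["ACGTACGT", "ACGT"]
    enc = tokenizer(seqs, padding=True, return_tensors="pt")  # hypothetical: your HyenaDNA tokenizer
    input_ids = enc["input_ids"]

    with torch.no_grad():
        out = model(input_ids)  # hypothetical: your loaded HyenaDNA backbone
        hidden = out.last_hidden_state if hasattr(out, "last_hidden_state") else out  # (B, L, D)

    # mask out padding positions and mean-pool over real nucleotides only
    mask = (input_ids != tokenizer.pad_token_id).unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)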

Sanity Checking DataLoader error

Hi

I am running the quick start command to check my hyena-dna installation.
python -m train wandb=null experiment=hg38/genomic_benchmark_scratch
and I got the error report below, does anyone know how to fix this?

Sanity Checking DataLoader 0:   0%|                                                                      | 0/2 [00:00<?, ?it/s]
CUDA Error: invalid device function /PATH/hyena-dna/flash-attention/csrc/layer_norm/ln_fwd_kernels.cuh 271

Best
Dié

Some doubts about downstream tasks

Hello, first of all, thank you for your open-source contributions and the detailed README file!

I'm an undergraduate student who has just started exploring deep learning. My research focus is on DNA tokenization, but there are very few datasets available. My idea is to use prompt learning to tackle this task.

Given the limited research in this area, I've come across the possibility of using DNA tokenization as a downstream task for your model. However, after carefully reading your paper and the repository's README file, especially the "More advanced stuff below" section, I find it challenging to understand all the content due to my limited expertise. I'm still unsure whether DNA tokenization can be used as a downstream task for your model. I would like to ask whether this is possible.

I would greatly appreciate it if you could provide some advice or guidance on this.

I understand that this may not be within your obligations, so if you're too busy to respond, please feel free to close this issue. Thank you for taking the time to consider my request!

nucleotide finetuning

When I ran python -m train wandb=null experiment=hg38/nucleotide_transformer dataset_name=enhancer dataset.max_length=500 model.layer.l_max=1026,
something went wrong:
Could not override 'dataset_name'.
To append to your config use +dataset_name=enhancer
Key 'dataset_name' is not in struct
full_key: dataset_name
object_type=dict

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Segmentation fault (core dumped)

The tokenizer's bug in Huggingface

The code in the tokenizer has a bug; it seems to be missing the [CLS] token.

The code on Hugging Face is:

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        sep = [self.sep_token_id]
        # cls = [self.cls_token_id]
        result = token_ids_0 + sep
        if token_ids_1 is not None:
            result += token_ids_1 + sep
        return result

but I think it should be

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]
        result = cls + token_ids_0 + sep
        if token_ids_1 is not None:
            result += token_ids_1 + sep
        return result

according to the code on GitHub.

Clarifying the models available on HF

Hi,

On the LongSafari HF space there appear to be 2 copies of each model, one with -hf at the end of the name and one without.

I was wondering what the difference is between these models (other than one being compatible with AutoModel), because despite the names being the same and the variables in the config files looking almost identical (i.e., same d_model and n_layers), they have very different numbers of parameters.

Which version of these models corresponds to the ones used in the paper experiments? If I am not mistaken, it should be the first one (i.e., the one without -hf in the name)?

Genome build versions are inconsistent in reference and chromatin profile (DeepSEA benchmark)

I am attempting to replicate the validation process for the DeepSEA benchmark. The original DeepSEA version is hg19, while the reference genome is hg38. I've noticed that liftover is available in the source code, specifically within the ChromatinProfileDataset class. However, using this liftover functionality seems to be restricted unless I directly use the ChromatinProfileDataset.

Within the class ChromatinProfile, arguments for ChromatinProfileDataset are:

ref_genome_version = self.ref_genome_version
coords_target_path = f'{self.data_path}/{split}_{self.ref_genome_version}_coords_targets.csv'

This code forces the genome version of the reference and dataset to be the same.

My question is whether it's possible to introduce flexibility into the package or provide an updated version of the DeepSEA benchmark that supports the hg38 reference genome?
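
As a stopgap, if you need hg19 coordinates lifted to hg38 outside of ChromatinProfileDataset, something like the following may work (a sketch; it assumes pyliftover is installed, and the CSV/column names are placeholders for your coords_targets file):

    import pandas as pd
    from pyliftover import LiftOver

    lo = LiftOver('hg19', 'hg38')                           # downloads the UCSC chain file
    coords = pd.read_csv('train_hg19_coords_targets.csv')   # placeholder filename/columns

    def lift(chrom, pos):
        hits = lo.convert_coordinate(chrom, pos)
        return hits[0][1] if hits else None                 # None when the position does not lift over

    coords['pos_hg38'] = [lift(c, p) for c, p in zip(coords['chrom'], coords['pos'])]
    coords = coords.dropna(subset=['pos_hg38'])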

Model output changes randomly

First of all, thanks for making this great tool available!

I was trying to obtain embeddings for some sequences, and noticed that the model output changed when calling it repeatedly with the same input:

tokenizer = CharacterTokenizer(
        characters=['A', 'C', 'G', 'T', 'N'],
        model_max_length=32768 + 2,
        add_special_tokens=False,
        padding_side='left',
    )

model = HyenaDNAPreTrainedModel.from_pretrained(
            './checkpoints',
            'hyenadna-small-32k-seqlen',
            download=True,
            config=None,
            device="cpu",
            use_head=False,
            n_classes=2,
        )

sequence = "ATCG"

model(torch.LongTensor(tokenizer(sequence)["input_ids"]).unsqueeze(0))[0,0,:5]
# [-0.6169,  1.0377,  0.0526, -1.0487,  0.9169]
model(torch.LongTensor(tokenizer(sequence)["input_ids"]).unsqueeze(0))[0,0,:5]
# [-0.5084,  0.6375,  0.4707, -0.8912,  1.1417]
model(torch.LongTensor(tokenizer(sequence)["input_ids"]).unsqueeze(0))[0,0,:5]
# [-0.5762,  0.3669,  0.1919, -0.7438,  1.0702]

I would have expected the model to be deterministic... what am I missing here?
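
A likely explanation (worth checking) is that the model is still in training mode, so its dropout layers are stochastic; PyTorch modules stay in train mode after loading unless eval() is called. A minimal sketch of the deterministic-inference pattern, reusing the names from the snippet above:

    model.eval()  # disable dropout and other train-time stochastic behavior
    with torch.no_grad():
        ids = torch.LongTensor(tokenizer(sequence)["input_ids"]).unsqueeze(0)
        emb1 = model(ids)[0, 0, :5]
        emb2 = model(ids)[0, 0, :5]
    print(torch.allclose(emb1, emb2))  # expected: True once dropout is disabled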

MCC value problem on Nucleotide Transformer

I really appreciate your work, but when I train the model on the Nucleotide Transformer downstream task, it reports that no MCC result is returned. The details are as follows; could you please help me solve it? Thank you very much!
Traceback (most recent call last):
File "/root/code/dnabert3/hyena-dna/train.py", line 692, in main
train(config)
File "/root/code/dnabert3/hyena-dna/train.py", line 673, in train
trainer.fit(model)
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
results = self._run_stage()
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
self._run_train()
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
self.fit_loop.run()
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
self.on_advance_end()
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 295, in on_advance_end
self.trainer._call_callback_hooks("on_train_epoch_end")
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1380, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 312, in on_train_epoch_end
self._save_topk_checkpoint(trainer, monitor_candidates)
File "/root/miniconda3/envs/hyena-dna/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 367, in _save_topk_checkpoint
raise MisconfigurationException(m)
lightning_lite.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='val/mcc') could not find the monitored key in the returned metrics: ['trainer/loss', 'trainer/epoch', 'val/accuracy', 'val/loss', 'train/accuracy', 'train/loss', 'epoch', 'step']. HINT: Did you call log('val/mcc', value) in the LightningModule?
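
For context, ModelCheckpoint is monitoring val/mcc but nothing logs a metric under that key, so the run may simply be missing mcc in the task's metrics configuration. As a generic illustration (not this repo's code), here is a sketch of logging an MCC metric from a LightningModule so the monitored key exists, assuming a recent torchmetrics and binary labels:

    import pytorch_lightning as pl
    import torchmetrics

    class LitClassifier(pl.LightningModule):
        def __init__(self, backbone):
            super().__init__()
            self.backbone = backbone
            self.val_mcc = torchmetrics.MatthewsCorrCoef(task="binary")

        def validation_step(self, batch, batch_idx):
            x, y = batch
            preds = self.backbone(x).argmax(dim=-1)
            self.val_mcc.update(preds, y)
            # log under the exact key that ModelCheckpoint(monitor='val/mcc') looks for
            self.log("val/mcc", self.val_mcc, on_epoch=True)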

pretrain model weights miss pos_emb

Hi, exnx

I'm sorry to bother you. I want to use the pre-trained model weights you provided to fine-tune for downstream tasks. I found that some model parameters are missing. Is there anything wrong with my approach?

Some refactoring of the model parameter names was done using the code you provided before loading the model weights

In more detail:

import json
import torch
# `hyenadna` below refers to the standalone model code (standalone_hyenadna.py)

pretrained_model_name = 'hyenadna-small-32k-seqlen'

max_lengths = {
    'hyenadna-tiny-1k-seqlen': 1024,
    'hyenadna-small-32k-seqlen': 32768,
    'hyenadna-medium-160k-seqlen': 160000,
    'hyenadna-medium-450k-seqlen': 450000,  # T4 up to here
    'hyenadna-large-1m-seqlen': 1_000_000,  # only A100 (paid tier)
}

max_length = max_lengths[pretrained_model_name]  # auto selects

# data settings:
use_padding = True
rc_aug = False  # reverse complement augmentation
add_eos = False  # add end of sentence token

# we need these for the decoder head, if using
use_head = True
n_classes = 1  # not used for embeddings only

# you can override with your own backbone config here if you want,
# otherwise we'll load the HF one if None
backbone_cfg = json.load(open(f'hungging_face_models/hyenadna/{pretrained_model_name}/config.json', 'r'))

model = hyenadna.HyenaDNAModel(**backbone_cfg, use_head=use_head, n_classes=n_classes)
pretrain_w = torch.load(f'hungging_face_models/hyenadna/{pretrained_model_name}/pytorch_model.bin')


RuntimeError Traceback (most recent call last)
Cell In[12], line 1
----> 1 model.load_state_dict(pretrain_w)

File /Data/luokai/biotools/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py:1667, in Module.load_state_dict(self, state_dict, strict)
1662 error_msgs.insert(
1663 0, 'Missing key(s) in state_dict: {}. '.format(
1664 ', '.join('"{}"'.format(k) for k in missing_keys)))
1666 if len(error_msgs) > 0:
-> 1667 raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
1668 self.__class__.__name__, "\n\t".join(error_msgs)))
1669 return _IncompatibleKeys(missing_keys, unexpected_keys)

RuntimeError: Error(s) in loading state_dict for HyenaDNAModel:
Missing key(s) in state_dict: "backbone.layers.0.mixer.filter_fn.pos_emb.z", "backbone.layers.0.mixer.filter_fn.pos_emb.t", "backbone.layers.0.mixer.filter_fn.implicit_filter.3.freq", "backbone.layers.0.mixer.filter_fn.implicit_filter.5.freq", "backbone.layers.0.mixer.filter_fn.modulation.deltas", "backbone.layers.1.mixer.filter_fn.pos_emb.z", "backbone.layers.1.mixer.filter_fn.pos_emb.t", "backbone.layers.1.mixer.filter_fn.implicit_filter.3.freq", "backbone.layers.1.mixer.filter_fn.implicit_filter.5.freq", "backbone.layers.1.mixer.filter_fn.modulation.deltas", "backbone.layers.2.mixer.filter_fn.pos_emb.z", "backbone.layers.2.mixer.filter_fn.pos_emb.t", "backbone.layers.2.mixer.filter_fn.implicit_filter.3.freq", "backbone.layers.2.mixer.filter_fn.implicit_filter.5.freq", "backbone.layers.2.mixer.filter_fn.modulation.deltas", "backbone.layers.3.mixer.filter_fn.pos_emb.z", "backbone.layers.3.mixer.filter_fn.pos_emb.t", "backbone.layers.3.mixer.filter_fn.implicit_filter.3.freq", "backbone.layers.3.mixer.filter_fn.implicit_filter.5.freq", "backbone.layers.3.mixer.filter_fn.modulation.deltas", "head.output_transform.weight", "head.output_transform.bias".

Thanks,

Human Reference Genome questions

Hello, thank you very much for open-sourcing your work. I have a couple of questions about the Human Genome Reference dataset.

[Q1] In your instructions for downloading Human Reference Genome you mention:

First step is download the Human Reference Genome data. It's comprised of 2 files, 1 with all the sequences (the .fasta file), and with the intervals we use (.bed file).
However, you'll need to have a GCP account to download the exact files we used (from the Enformer), and it cost a little to download. At some point we'll try to upload somewhere to share that data.

As far as I know Enformer was using a mix of Basenji dataset and Human Reference Genome. Specifically in their paper they say:

We modified the Basenji2 dataset by extending the input sequence to 196,608bp from the original 131,072bp using the hg38 reference genome.

In the filenames for Human Reference Genome you also have reference to Basenji, i.e.

Download fasta (.fa format) file (of the entire human genome) into hyena-dna/data/hg38. ~24 chromosomes in the whole genome (merged into 1 file), each chromosome is a continuous sequence, basically
gsutil -u hai-gcp-hippo cp gs://basenji_barnyard/hg38.ml.fa.gz ./ && gunzip hg38.ml.fa.gz

The HyenaDNA paper does not contain any reference to Basenji, so I wonder how these two datasets relate to each other. Do you train on a mix of them, as Enformer does, or only on the Human Reference Genome? Also, in the Appendix you mention:

For pretraining, we use a single human reference genome (Genome Reference Consortium, 2013), and leverage the training and validation intervals (start and end) from (Avsec et al., 2021).

Does this imply that you use exactly the same data as Enformer, with the same train/val splits?

[Q2] Since accessing the data on GCP incurs a cost, you mentioned plans to make the data more accessible. Do you have a timeline for this? Maybe there is a script to convert the original data into the Enformer format?

hyenaDNA for regression?

I would like to use the pre-trained HyenaDNA model for a regression task. I took your code from the Google Colab and have adapted it successfully for a classification task on the same dataset. However, when I attempted to modify my code for regression (essentially just setting n_classes = 1 and changing the dataloader to not encode the labels), training runs without errors, but the training loss does not decrease at all over time.

Have you tried to use HyenaDNA for a regression task before? I can provide more code if you expect it to work for this type of task.
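
In case it helps, one common pitfall is keeping a classification loss: cross-entropy over a single output class is constant, so nothing is learned. A minimal sketch of a regression step, assuming the standalone model was built with use_head=True and n_classes=1, and with hypothetical input_ids/targets tensors from your dataloader:

    import torch
    import torch.nn as nn

    criterion = nn.MSELoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    model.train()
    preds = model(input_ids).squeeze(-1)        # (batch,) -- one regression output per sequence
    loss = criterion(preds, targets.float())    # targets must be floats, not class indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()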

cannot run hg38_hyena_seqlen_warmup_reload

Hi,

Thanks for your great work! I am trying to run the hg38/hg38_hyena_seqlen_warmup_reload.yaml experiment and got the following error message:

[screenshot of the error message omitted]

I did some initial searching on this issue and found this. I set monitor: test/loss and it still doesn't work, but I have no problem running hg38/hg38_hyena.yaml.

Do you have any insights on this issue? Is it related to the sequence length warmup callback? I ask because I can run hg38/hg38_hyena.yaml without this callback. I am using pytorch_lightning v1.8.6.

Pre-training on local genome data

Hi Authors,
This work is interesting and worth following.
Can you offer an example of how to pre-train a model on local genome data, say a species dataset for human or mouse?
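
For anyone sketching this out while waiting for an official example: the data side can be as simple as sampling fixed-length windows from a local FASTA and training next-token prediction. The snippet below is only an illustration, not the repo's dataset class; it assumes pyfaidx is installed and a character-level tokenizer is passed in:

    import torch
    from torch.utils.data import Dataset
    from pyfaidx import Fasta  # assumption: pyfaidx is available

    class LocalGenomeWindows(Dataset):
        """Sketch: non-overlapping fixed-length windows from one chromosome of a local FASTA."""
        def __init__(self, fasta_path, chrom, seq_len, tokenizer):
            self.fasta, self.chrom = Fasta(fasta_path), chrom
            self.seq_len, self.tokenizer = seq_len, tokenizer

        def __len__(self):
            return len(self.fasta[self.chrom]) // self.seq_len

        def __getitem__(self, idx):
            start = idx * self.seq_len
            seq = str(self.fasta[self.chrom][start:start + self.seq_len]).upper()
            ids = torch.LongTensor(self.tokenizer(seq)["input_ids"])
            return ids[:-1], ids[1:]  # causal LM pairs: predict the next nucleotide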
Thanks!

Cuda out of memory for huggingface pre-trained model on A100-80GB

I'm using the standalone_hyenadna.py script and loading the pre-trained weights of the large-1m model from huggingface in order to have a standalone code similar to the colab.
When performing fine-tuning and testing on the dummy_mouse_enhancers_ensembl dataset from Genomic Benchmark with a sequence max_length of 3400, I get a "CUDA out of memory" error on an A100 GPU.
As suggested in the README, I tried to modify the downloaded config.json file found in checkpoints_path/hyenadna-large-1m-seqlen by setting the following fields to True:
checkpoint_mixer: True
checkpoint_mlp: True
But now I'm getting the error:
scratch_dict[key] = pretrained_dict[key_loaded]
KeyError: 'model.backbone.layers.0.mixer.layer.in_proj.weight'
As you suggest I tried to toggle on/off these params in order to find the working combination, but I either get this key error or the cuda out of memory error.

It seems to me that, since the pre-trained model loaded from Hugging Face was probably trained with those flags set to False, there is now a configuration mismatch.
Am I missing something? How can I work with ultra-long sequences without getting memory errors when using pre-trained models downloaded from Hugging Face?
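
For what it's worth, the KeyError suggests the state-dict key layout changes when activation checkpointing wraps the mixer: the scratch model expects ...mixer.layer.in_proj.weight while the downloaded weights presumably store ...mixer.in_proj.weight. A hedged sketch of remapping the pretrained keys before loading, under the assumption that this wrapper naming is the only difference:

    # hypothetical remap: insert the ".layer" wrapper that checkpoint_mixer introduces
    remapped = {}
    for key, value in pretrained_dict.items():       # pretrained_dict: the downloaded state dict
        if ".mixer." in key and ".mixer.layer." not in key:
            key = key.replace(".mixer.", ".mixer.layer.")
        remapped[key] = value
    model.load_state_dict(remapped, strict=False)    # model: the scratch model built with checkpointing on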

Thanks in advance for your response and for your valuable contribution to this research field!

Pre-trained model for genomic benchmarks

Hello. We've been working through replicating the results from your arXiv paper to better understand HyenaDNA and how we can use it for our own purposes. Using the hyena-dna-nt6 Docker image, we were able to successfully replicate the "Nucleotide Transformer" experiments. However, we've been unable to replicate the "Genomic Benchmark" experiments because the pretrained_model_path in the configs/experiment/hg38/genomic_benchmark.yaml file is set to a path that doesn't exist in the container: /local-scratch/nigam/projects/mwornow/projects/safari-internal/outputs/2023-04-14/2_128_1024.ckpt.

Is this file available somewhere that we can access it? This appears to be the file that's generated by the Quick Entry point experiment, but we would like to use a pre-trained model if possible before generating our own for this first round of validation.

Thanks!

Chromatin Preprocessing

I'm working with the chromatin profile portion of the project; however, I am running into issues with the preprocessing. Can you provide any more details on how you generated the initial data? I have followed Sei and DeepSEA, but I seem to be missing train_hg38_coords_targets.csv. There have been several versions of Sei and DeepSEA, so I am currently going through the old versions, but when I follow the previous steps that file still seems to be missing.

Reproducing the HyenaDNA results on NT Benchmarks

Hello,

I have been unsuccessful at reproducing your NT benchmarks by just running the tests on the NT datasets in my environment. I would like to be able to see the parameters that you used, but unfortunately our CHPC system does not allow Docker; they recommend Apptainer instead. I have been able to successfully run HyenaDNA on my own datasets and on the NT datasets; I am just not getting the same results that you report in the paper.

I have gotten this far using the instructions in the apptainer thread (see below):

But I can't seem to find the file that you mention in the instructions: "This will land you inside the /wdr, which has a file named launch_commands_nucleotide_transformer with all the launch commands and (associated hyperparameters) for the 18 Nucleotide Transformer datasets."

Would it be possible for you to release this file (launch commands and associated hyperparameters) outside of the Docker container for those of us that have had challenges reproducing your results on HPC systems that do not allow Docker?

Thank you,
LeAnn

(A100-3-hyena-dna) [u1323098@notch372:hyena-dna]$ apptainer exec --nv hyena-dna-nt6.sif /bin/bash

INFO:    gocryptfs not found, will not be able to use gocryptfs

INFO:    underlay of /etc/localtime required more than 50 (96) bind mounts

INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (475) bind mounts

13:4: not a valid test operator: (

13:4: not a valid test operator: 550.54.14

Apptainer> ls -lrt
total 17203372
-rw-r--r-- 1 u1323098 sundar 11357 Dec 5 09:52 LICENSE
-rw-r--r-- 1 u1323098 sundar 407 Dec 5 09:52 Dockerfile
-rw-r--r-- 1 u1323098 sundar 35655 Dec 5 09:52 README.md
drwxr-xr-x 2 u1323098 sundar 34 Dec 5 09:52 assets
drwxr-xr-x 13 u1323098 sundar 244 Dec 5 09:53 configs
drwxr-xr-x 3 u1323098 sundar 29 Dec 5 09:53 csrc
drwxr-xr-x 2 u1323098 sundar 155 Dec 5 09:53 evals
-rw-r--r-- 1 u1323098 sundar 8622 Dec 5 09:53 huggingface.py
-rw-r--r-- 1 u1323098 sundar 530 Dec 5 09:53 requirements.txt
drwxr-xr-x 8 u1323098 sundar 121 Dec 5 09:53 src
-rw-r--r-- 1 u1323098 sundar 42633 Dec 5 09:53 standalone_hyenadna.py
-rw-r--r-- 1 u1323098 sundar 27561 Dec 5 09:53 train.py
drwxr-xr-x 9 u1323098 sundar 4096 Dec 5 17:06 flash-attention
drwxr-xr-x 2 u1323098 sundar 42 Dec 5 17:07 __pycache__
drwxr-xr-x 4 u1323098 sundar 55 Dec 5 17:17 data
drwxr-xr-x 4 u1323098 sundar 54 Jan 13 13:54 outputs
drwxr-xr-x 10 u1323098 sundar 4096 Jan 14 08:52 wandb
-rwxr-xr-x 1 u1323098 sundar 4279 Mar 13 15:17 finetune_model_test.py
-rwxr-xr-x 1 u1323098 sundar 9645101056 Mar 25 07:28 hyena-dna-nt6.sif
-rwxr-xr-x 1 u1323098 sundar 7970955264 Mar 25 07:29 hyena-dna.sif
Apptainer>

Question about HyenaDNA working

Hello,

I have been pre-training HyenaDNA on my custom dataset, and I wanted some insight into how it works based on the output logs from the pre-training run. For every epoch, after updating the model weights on the training data, the tool evaluates performance on the validation data by calculating val/loss. In the process, it loads validation data from two different dataloaders (Validation Dataloader 0 and Validation Dataloader 1). I provided the test data separately from the validation data, and after varying the train : validation : test splits, I found that the size of Validation Dataloader 1 changes in proportion to the size of the test set. Why does this happen, given that the test set is separate from the validation set and that the tool reports a separate test/loss value after every epoch?

ICL question

Hi! Thanks for the great work!
I'm currently investigating your ICL solutions (soft tokens and instruction tuning) and can't understand what "shots" actually means and what it is for. Could you please add more details?
Thanks

Could this HyenaDNA model be used for a pure language task?

Could this HyenaDNA model be used for a pure language task? Of course with some changes, such as a tokenizer for language, and maybe some other things. Which other things would those be?

If this can be done, then that would give the enormous advantage of being able to work with a giant context size while still having acceptable (or even very good) training and inference performance! Am I correct here?

Also I saw a mention of the HyenaDNA model being able of in-context learning, which is a very important prerequisite of such a model!

I have not read the paper, but could you show a table comparing the pros and cons of a standard Transformer and a HyenaDNA model?

But my main question is:
Could this HyenaDNA model be used for a pure language task? And how exactly would one go about implementing that? What would be gained compared with using a conventional Transformer?

Thank you for this amazing development! What a time to be alive!

Happy New Year for you and the entire team of HyenaDNA!

Pretraining runtimes from the paper

Hi! Great work, and also great youtube presentation, thanks for making that public.

I have a question about the runtimes. Table A.2 says that pre-training took 80 min for the model with 1.6M parameters. When I pretrain a model with 3.3M parameters (input size 16k, 3 Hyena layers, embedding dim 256) on my dataset of only 21,000 samples, it takes around 16 hours. Is anything wrong with my setup? Could you please specify more explicitly what data size went into Table A.2: how many samples, of which sequence length, with what batch size?
And, if it's possible to tell, what share of the human genome's nucleotides did the pretrained model (say, the 32k one) end up seeing?

Thank you for the nice work!
