cognano / avida-sars-cov-2

Code for paper "A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models"

Home Page: https://datasets.cognanous.com

License: MIT License

A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models

This repository contains the supplementary material accompanying the paper "A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models." In this paper, we introduced AVIDa-SARS-CoV-2, a labeled dataset of SARS-CoV-2-VHH interactions, and VHHCorpus-2M, which contains over two million VHH sequences, providing novel datasets for the evaluation and pre-training of antibody language models. The datasets are available at https://datasets.cognanous.com under a CC BY-NC 4.0 license.

Figure: Overview of the data generation process for AVIDa-SARS-CoV-2.

Environment

To get started, clone this repository and run the following commands to create a virtual environment and install the dependencies.

python -m venv ./venv
source ./venv/bin/activate
pip install -r requirements.txt

Datasets

Links

Dataset            Links
-----------------  ------------------------------
VHHCorpus-2M       Hugging Face Hub, Project Page
AVIDa-SARS-CoV-2   Hugging Face Hub, Project Page

Data Processing

The code for converting the raw data (FASTQ files) obtained from next-generation sequencing (NGS) into the labeled dataset, AVIDa-SARS-CoV-2, can be found under ./dataset. We released the FASTQ files for the antigen type "OC43" here so that the data processing can be reproduced.

First, you need to create a Docker image.

docker build -t vhh_constructor:latest ./dataset/vhh_constructor

After placing the FASTQ files under dataset/raw/fastq, execute the following command to output a labeled CSV file.

bash ./dataset/preprocess.sh
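
For readers unfamiliar with the input format, the following is a minimal sketch of reading FASTQ records in Python. It is not the logic of preprocess.sh or vhh_constructor; it only illustrates the standard four-line FASTQ layout (header starting with "@", sequence, separator starting with "+", quality string). The names read_fastq and FastqRecord are hypothetical, and the two reads in the example are made up rather than taken from the released OC43 files. The sketch assumes uncompressed input with unwrapped sequences.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read from a FASTQ file (hypothetical helper type)."""
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:].strip(), sequence, quality)


# Made-up two-read input for illustration only.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n")
records = list(read_fastq(example))
for record in records:
    print(record.read_id, record.sequence)
```

To read a real file, open it with open(path, "r") and pass the handle to read_fastq; gzip-compressed files would need gzip.open(path, "rt") instead.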

Benchmarks

Pre-training

VHHBERT is a RoBERTa-based model pre-trained on two million VHH sequences in VHHCorpus-2M. VHHBERT can be pre-trained with the following command.

python benchmarks/pretrain.py --vocab-file "benchmarks/data/vocab_vhhbert.txt" \
  --epochs 20 \
  --batch-size 128 \
  --save-dir "outputs"

Arguments:

Argument      Required  Default  Description
------------  --------  -------  --------------------------
--vocab-file  Yes       -        Path of the vocabulary file
--epochs      No        20       Number of epochs
--batch-size  No        128      Size of mini-batch
--seed        No        123      Random seed
--save-dir    No        ./saved  Path of the save directory
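
The table above can be read as an argparse specification. The sketch below mirrors it to show how the defaults apply when a flag is omitted. It is an illustration only and not the parser in benchmarks/pretrain.py; the function name build_parser and the int types for the numeric arguments are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented pre-training arguments."""
    parser = argparse.ArgumentParser(description="Pre-training arguments (sketch)")
    parser.add_argument("--vocab-file", required=True,
                        help="Path of the vocabulary file")
    # Numeric types below are assumed; the README lists only the defaults.
    parser.add_argument("--epochs", type=int, default=20,
                        help="Number of epochs")
    parser.add_argument("--batch-size", type=int, default=128,
                        help="Size of mini-batch")
    parser.add_argument("--seed", type=int, default=123,
                        help="Random seed")
    parser.add_argument("--save-dir", default="./saved",
                        help="Path of the save directory")
    return parser


# Only --vocab-file is required; every omitted flag falls back to its default.
args = build_parser().parse_args([
    "--vocab-file", "benchmarks/data/vocab_vhhbert.txt",
    "--save-dir", "outputs",
])
print(args.epochs, args.batch_size, args.seed, args.save_dir)
```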

The pre-trained VHHBERT, released under the MIT License, is available on the Hugging Face Hub.

Fine-tuning

To evaluate the performance of various pre-trained language models for antibody discovery, we defined a binary classification task to predict the binding or non-binding of unknown antibodies to 13 antigens using AVIDa-SARS-CoV-2. For more information on the benchmarking task, see the paper.
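
As a rough illustration of how predictions on such a binary task can be scored, the sketch below computes accuracy, precision, recall, and F1 from label lists. This README does not state which metrics the paper reports, so this choice of metrics is an assumption, as is the encoding of 1 for binding and 0 for non-binding. The function name binary_metrics and the example labels are hypothetical.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute standard binary classification metrics (1 = binder, 0 = non-binder)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = len(pairs)
    # Guard each ratio against an empty denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for six antibody-antigen pairs.
metrics = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(metrics)
```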

Fine-tuning of the language models can be performed using the following command.

python benchmarks/finetune.py --palm-type "VHHBERT" \
  --epochs 30 \
  --batch-size 32 \
  --save-dir "outputs"

The value of --palm-type must be one of the following:

  • VHHBERT
  • VHHBERT-w/o-PT
  • AbLang
  • AntiBERTa2
  • AntiBERTa2-CSSP
  • IgBert
  • ProtBert
  • ESM-2-150M
  • ESM-2-650M
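
To benchmark every model in the list above, one option is to generate one fine-tuning command per model type. The sketch below only builds and prints the commands; it does not run them. The helper finetune_command is hypothetical, and writing each run to its own subdirectory of the save directory is an assumption made here for convenience, not something the repository requires. Because "VHHBERT-w/o-PT" contains a slash, the sketch replaces "/" with "-" when deriving the directory name.

```python
import shlex
from typing import List

# Model names as listed in this README.
PALM_TYPES = [
    "VHHBERT", "VHHBERT-w/o-PT", "AbLang", "AntiBERTa2", "AntiBERTa2-CSSP",
    "IgBert", "ProtBert", "ESM-2-150M", "ESM-2-650M",
]


def finetune_command(palm_type: str, epochs: int = 30, batch_size: int = 32,
                     save_dir: str = "outputs") -> List[str]:
    """Build the argument list for one fine-tuning run (hypothetical helper)."""
    if palm_type not in PALM_TYPES:
        raise ValueError(f"unknown palm-type: {palm_type!r}")
    # Assumed layout: one subdirectory per model, with '/' made path-safe.
    run_dir = f"{save_dir}/{palm_type.replace('/', '-')}"
    return [
        "python", "benchmarks/finetune.py",
        "--palm-type", palm_type,
        "--epochs", str(epochs),
        "--batch-size", str(batch_size),
        "--save-dir", run_dir,
    ]


commands = [finetune_command(name) for name in PALM_TYPES]
for command in commands:
    print(shlex.join(command))  # shell-quoted form, safe to copy and paste
```

Each list can be passed to subprocess.run(command, check=True) from the repository root to launch the runs sequentially.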

Arguments:

Argument           Required  Default                                   Description
-----------------  --------  ----------------------------------------  ------------------------------------
--palm-type        No        VHHBERT                                   Model name
--embeddings-file  No        ./benchmarks/data/antigen_embeddings.pkl  Path of embeddings file for antigens
--epochs           No        20                                        Number of epochs
--batch-size       No        128                                       Size of mini-batch
--seed             No        123                                       Random seed
--save-dir         No        ./saved                                   Path of the save directory

Citation

If you use AVIDa-SARS-CoV-2, VHHCorpus-2M, or VHHBERT in your research, please use the following citation.

@article{tsuruta2024sars,
  title={A {SARS}-{C}o{V}-2 Interaction Dataset and {VHH} Sequence Corpus for Antibody Language Models},
  author={Hirofumi Tsuruta and Hiroyuki Yamazaki and Ryota Maeda and Ryotaro Tamura and Akihiro Imura},
  journal={arXiv preprint arXiv:2405.18749},
  year={2024}
}
