
gRNAde: Geometric Deep Learning for 3D RNA inverse design

Home Page: https://arxiv.org/abs/2305.14749

License: MIT License


geometric-rna-design's Introduction

💣 gRNAde: Geometric Deep Learning for 3D RNA Inverse Design

gRNAde is a geometric deep learning pipeline for 3D RNA inverse design, analogous to ProteinMPNN for protein design.

🧬 Tutorial notebook to get started: gRNAde 101 (Open in Colab)

⚙️ Using gRNAde for custom RNA design scenarios: Design notebook (Open in Colab)

✍️ New to 3D RNA modelling? Here's a curated reading + watch list for beginners: Resources

📄 For more details on the methodology, see the accompanying paper: 'gRNAde: Geometric Deep Learning for 3D RNA inverse design'

Chaitanya K. Joshi, Arian R. Jamasb, Ramon Viñas, Charles Harris, Simon Mathis, Alex Morehead, and Pietro Liò. gRNAde: Geometric Deep Learning for 3D RNA inverse design. ICML Computational Biology Workshop, 2023.

PDF | Tweet | Slides

gRNAde generates an RNA sequence conditioned on one or more 3D RNA backbone conformations, i.e. both single- and multi-state fixed-backbone sequence design. RNA backbones are featurized as geometric graphs and processed via a multi-state GNN encoder which is equivariant to 3D roto-translation of coordinates as well as conformer order, followed by conformer order-invariant pooling and sequence design.
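
To make the conformer-order invariance concrete, the sketch below mean-pools per-nucleotide feature vectors across conformers using plain Python lists. It illustrates the pooling idea only and is not gRNAde's implementation; the function name `pool_over_conformers`, the `features[k][i]` layout, and the toy values are all assumptions made for this example.

```python
from itertools import permutations


def pool_over_conformers(features):
    """Average per-nucleotide features over the conformer axis.

    `features[k][i]` is the feature vector of nucleotide i in conformer k
    (hypothetical layout). Returns one pooled vector per nucleotide.
    """
    num_conformers = len(features)
    num_nodes = len(features[0])
    dim = len(features[0][0])
    pooled = []
    for i in range(num_nodes):
        pooled.append([
            sum(features[k][i][d] for k in range(num_conformers)) / num_conformers
            for d in range(dim)
        ])
    return pooled


# Toy input: 3 conformers, 2 nucleotides, 2 feature channels.
toy = [
    [[1.0, 0.0], [3.0, 6.0]],
    [[3.0, 3.0], [0.0, 6.0]],
    [[2.0, 6.0], [6.0, 0.0]],
]
reference = pool_over_conformers(toy)

# Reordering the conformers leaves the pooled result unchanged.
invariant = all(
    pool_over_conformers(list(p)) == reference for p in permutations(toy)
)
```

Mean pooling is only one symmetric choice; a sum or max over the conformer axis would be order-invariant in the same way.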

Installation

To get started, set up a Python environment by following the installation instructions below. We have tested gRNAde on Linux with Python 3.10.12 and CUDA 11.8 on NVIDIA A100 80GB GPUs and on Intel XPUs, as well as on macOS (CPU).

# Clone gRNAde repository
cd ~  # change this to your preferred download location
git clone https://github.com/chaitjo/geometric-rna-design.git
cd geometric-rna-design

# Install mamba (a faster conda)
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh
source ~/.bashrc
# You may also use conda or virtualenv to create your environment

# Create new environment and activate it
mamba create -n rna python=3.10
mamba activate rna

Set up your new python environment, starting with PyTorch and PyG:

# Install PyTorch on NVIDIA GPUs (ensure appropriate CUDA version for your hardware)
mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Install PyTorch Geometric (the wheel URL must match your installed PyTorch and CUDA versions)
pip install torch_geometric
pip install torch_scatter torch_cluster -f https://data.pyg.org/whl/torch-2.1.2+cu118.html

# Install PyTorch/PyG on Intel XPUs (specific to Cambridge's Dawn supercomputer)
module load default-dawn
source /usr/local/dawn/software/external/intel-oneapi/2024.0/setvars.sh
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
python -m pip install torch==2.1.0a0 torchvision==0.16.0a0 torchaudio==2.1.0a0 intel-extension-for-pytorch==2.1.10+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install torch_scatter torch_cluster

Next, install other compulsory dependencies:

# Install other python libraries
mamba install jupyterlab matplotlib seaborn pandas biopython biotite -c conda-forge
pip install wandb gdown pyyaml ipdb python-dotenv tqdm cpdb-protein torchmetrics einops ml_collections mdanalysis MDAnalysisTests draw_rna

# Install X3DNA for secondary structure determination
cd ~/geometric-rna-design/tools/
tar -xvzf x3dna-v2.4-linux-64bit.tar.gz
./x3dna-v2.4/bin/x3dna_setup
# Follow the instructions to test your installation

# Install EternaFold for secondary structure prediction
cd ~/geometric-rna-design/tools/
git clone --depth=1 https://github.com/eternagame/EternaFold.git && cd EternaFold/src
make
# Notes: 
# - Multithreaded version of EternaFold did not install for me
# - To install on MacOS, start a shell in Rosetta using `arch -x86_64 zsh`

# Download RhoFold checkpoint (~500MB)
cd ~/geometric-rna-design/tools/rhofold/
gdown https://drive.google.com/uc?id=1To2bjbhQLFx1k8hBOW5q1JFq6ut27XEv

Optionally, you can also set up some extra tools and dependencies.

# (Optional) Install CD-HIT for sequence identity clustering
mamba install cd-hit -c bioconda

# (Optional) Install US-align/qTMclust for structural similarity clustering
cd ~/geometric-rna-design/tools/
git clone https://github.com/pylelab/USalign.git && cd USalign/ && git checkout 97325d3aad852f8a4407649f25e697bbaa17e186
g++ -static -O3 -ffast-math -lm -o USalign USalign.cpp
g++ -static -O3 -ffast-math -lm -o qTMclust qTMclust.cpp
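
After installing, a quick way to see whether anything was skipped is to ask Python which modules it can locate. The sketch below uses only the standard library; the `REQUIRED_MODULES` list and the `missing_modules` helper are assumptions for illustration, and the package-to-module mapping (for example biopython to `Bio`, pyyaml to `yaml`, python-dotenv to `dotenv`) should be adjusted to what you actually installed.

```python
import importlib.util

# Top-level import names for packages installed above (assumed mapping
# from package name to module name; edit to match your environment).
REQUIRED_MODULES = [
    "torch", "torch_geometric", "torch_scatter", "torch_cluster",
    "Bio", "biotite", "wandb", "yaml", "dotenv", "tqdm",
    "torchmetrics", "einops", "ml_collections", "MDAnalysis",
]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

This only checks that the modules can be located; it does not verify versions or that the CUDA build of PyTorch matches your hardware.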

Once your python environment is set up, create your .env file with the appropriate environment variables; see the .env.example file included in the codebase for reference.

cd ~/geometric-rna-design/
touch .env
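
For readers unfamiliar with the format, a .env file is a list of KEY=VALUE lines. The standard-library sketch below parses such text to show what ends up in the environment; the codebase installs python-dotenv for this job, so the `parse_env_text` helper and the example path are purely illustrative assumptions. DATA_PATH is the only variable named in this README; see .env.example for the actual list.

```python
def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


# Hypothetical contents; see .env.example for the real variable names.
example = """
# Location of raw and processed data
DATA_PATH=/home/user/geometric-rna-design/data/
"""
env = parse_env_text(example)
print(env["DATA_PATH"])
```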

You're now ready to use gRNAde via the tutorial. To train your own models from scratch, however, you still need to download and process raw RNA structures from RNAsolo (instructions below).

Directory Structure and Usage

Detailed usage instructions are available in the tutorial notebook.

.
├── README.md
├── LICENSE
|
├── gRNAde.py                       # gRNAde python module and command line utility
├── main.py                         # Main script for training and evaluating models
|
├── .env.example                    # Example environment file
├── .env                            # Your environment file
|
├── checkpoints                     # Saved model checkpoints
├── configs                         # Configuration files directory
├── data                            # Dataset and data files directory
├── notebooks                       # Directory for Jupyter notebooks
├── tutorial                        # Tutorial with example usage
|
├── tools                           # Directory for external tools
|   ├── draw_rna                    # RNA secondary structure visualization
|   ├── EternaFold                  # RNA sequence to secondary structure prediction tool
|   ├── RhoFold                     # RNA sequence to 3D structure prediction tool
|   ├── ribonanzanet                # RNA sequence to chemical mapping prediction tool
|   └── x3dna-v2.4                  # RNA secondary structure determination from 3D
|
└── src                             # Source code directory
    ├── constants.py                # Constant values for data, paths, etc.
    ├── evaluator.py                # Evaluation loop and metrics
    ├── layers.py                   # PyTorch modules for building Multi-state GNN models
    ├── models.py                   # Multi-state GNN models for gRNAde
    ├── trainer.py                  # Training loop
    |
    └── data                        # Data-related code
        ├── clustering_utils.py     # Methods for clustering by sequence and structural similarity
        ├── data_utils.py           # Methods for loading PDB files and handling coordinates
        ├── dataset.py              # Dataset and batch sampler class
        ├── featurizer.py           # Featurizer class
        └── sec_struct_utils.py     # Methods for secondary structure prediction and determination

Downloading and Preparing Data

gRNAde is trained on all RNA structures from the PDB at ≤4.0 Å resolution (12K 3D structures from 4.2K unique RNAs), downloaded via RNAsolo with a date cutoff of 31 October 2023. If you would like to train your own models from scratch, download and extract the raw .pdb files via the following script into the data/raw/ directory (or another location indicated by the DATA_PATH environment variable in your .env file).

🚨 Note: As an alternative to the instructions below, you can download a pre-processed .pt file and .csv metadata, and place them into the data/ directory.

Method 1: Script

# Download structures in PDB format from RNAsolo (31 October 2023 cutoff)
mkdir -p ~/geometric-rna-design/data/raw
cd ~/geometric-rna-design/data/raw
gdown https://drive.google.com/uc?id=10NidhkkJ-rkbqDwBGA_GaXs9enEBJ7iQ
tar -zxvf RNAsolo_31102023.tar.gz

# Older instructions for downloading from RNAsolo (no longer working)
curl -O https://rnasolo.cs.put.poznan.pl/media/files/zipped/bunches/pdb/all_member_pdb_4_0__3_300.zip
unzip all_member_pdb_4_0__3_300.zip
rm all_member_pdb_4_0__3_300.zip

RNAsolo recently stopped hosting downloads for older versions, such as the 31 October 2023 cutoff that we used in our current work, so you can download the exact data we used via our Google Drive link.

Method 2: Manual

Manual download link: https://rnasolo.cs.put.poznan.pl/archive. Select the following for creating the download: 3D (PDB) + all molecules + all members + res. ≤4.0

Next, process the raw PDB files into our ML-ready format, which will be saved under data/processed.pt. You need to install the optional dependencies (US-align, CD-HIT) for processing.

# Process raw data into ML-ready format (this may take several hours)
cd ~/geometric-rna-design/
python data/process_data.py

Each RNA will be processed into the following format (most of the metadata is optional for simply using gRNAde):

{
    'sequence'                   # RNA sequence as a string
    'id_list'                    # list of PDB IDs
    'coords_list'                # list of structures, i.e. 3D coordinates of shape ``(length, 27, 3)``
    'sec_struct_list'            # list of secondary structure strings in dotbracket notation
    'sasa_list'                  # list of per-nucleotide SASA values
    'rfam_list'                  # list of RFAM family IDs
    'eq_class_list'              # list of non-redundant equivalence class IDs
    'type_list'                  # list of structure types (RNA-only, RNA-protein complex, etc.)
    'rmsds_list'                 # dictionary of pairwise C4' RMSD values between structures
    'cluster_seqid0.8'           # cluster ID of sequence identity clustering at 80%
    'cluster_structsim0.45'      # cluster ID of structure similarity clustering at 45%
}
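
As a rough illustration of how these fields relate, the sketch below builds a toy record and checks that each structure and each dot-bracket string agrees with the sequence length, then extracts base pairs from the dot-bracket notation. The helper names `dotbracket_to_pairs` and `check_record`, and all values in the toy record, are assumptions for this example rather than functions or data from the codebase.

```python
def dotbracket_to_pairs(sec_struct):
    """Return sorted (i, j) base-pair indices from a dot-bracket string.

    Only '(' and ')' are paired; every other character is treated as
    unpaired. Raises ValueError if the brackets are unbalanced.
    """
    stack, pairs = [], []
    for i, char in enumerate(sec_struct):
        if char == "(":
            stack.append(i)
        elif char == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def check_record(record):
    """Check that per-structure lists agree with the sequence length."""
    length = len(record["sequence"])
    assert len(record["id_list"]) == len(record["coords_list"])
    for coords, sec_struct in zip(record["coords_list"], record["sec_struct_list"]):
        assert len(coords) == length      # one block of atoms per nucleotide
        assert len(sec_struct) == length  # one symbol per nucleotide
    return True


# Toy record with made-up values; coordinates have shape (length, 27, 3).
toy_record = {
    "sequence": "GGAAACC",
    "id_list": ["toy_1"],
    "coords_list": [[[[0.0, 0.0, 0.0]] * 27 for _ in range(7)]],
    "sec_struct_list": ["((...))"],
}
check_record(toy_record)
pairs = dotbracket_to_pairs(toy_record["sec_struct_list"][0])
```

Raising on unbalanced brackets, rather than silently stopping, makes malformed secondary structure strings visible during data processing.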

Splits for Benchmarking

We have provided the splits used in our experiments in the data/ directory:

  • Single-state split from Das et al., 2010: data/das_split.pt (called the Das split for compatibility with older code)
  • Multi-state split of structurally flexible RNAs: data/structsim_split.pt

The precise procedure for creating the splits (which can be used to modify and customise them) can be found in the notebooks/ directory. The exact PDB IDs used for each of the splits are also available in the data/split_ids/ directory, in case you are using a different version of RNAsolo after the 31 October 2023 cutoff.
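
Because the split files refer to entries of the processed dataset, they only make sense together with the data version they were built from. The sketch below assumes, hypothetically, that a split is a set of integer index lists into the list of processed RNAs, and shows a bounds check that fails with a descriptive message when the two are out of sync. The toy data and the error wording are illustrative only.

```python
def index_list_by_indices(items, indices):
    """Select items by index, failing early with a descriptive error."""
    out_of_range = [i for i in indices if not 0 <= i < len(items)]
    if out_of_range:
        raise IndexError(
            f"{len(out_of_range)} split indices fall outside the dataset "
            f"(size {len(items)}), e.g. {out_of_range[0]}; the split may have "
            "been built from a different version of the processed data"
        )
    return [items[i] for i in indices]


# Toy data: five processed RNAs and a hypothetical (train, val, test) split.
data_list = ["rna_0", "rna_1", "rna_2", "rna_3", "rna_4"]
train_idx, val_idx, test_idx = [0, 1, 2], [3], [4]
train_list = index_list_by_indices(data_list, train_idx)
```

If such a check fails after processing a newer RNAsolo release, rebuilding the splits from the PDB IDs in data/split_ids/ is the safer route.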

Citation

@article{joshi2023grnade,
  title={gRNAde: Geometric Deep Learning for 3D RNA inverse design},
  author={Joshi, Chaitanya K. and Jamasb, Arian R. and Vi{\~n}as, Ramon and Harris, Charles and Mathis, Simon and Morehead, Alex and Anand, Rishabh and Li{\`o}, Pietro},
  journal={arXiv preprint},
  year={2023},
}

geometric-rna-design's Issues

Splits for benchmark

Hi,

Congrats on the great work. As I understand it, data/seqid_split.pt, data/das_split.pt and data/structsim_split.pt contain the train, test and validation indices for the different splits, and the corresponding pdbid_<chain> entries can be extracted from processed_df.csv. Could you provide this preprocessed processed_df.csv file, or the PDB IDs directly, for easier access?

Thanks!

Where is RibonanzaNet?

Hello,

Thanks for your paper and open-source code. I am currently working on running your code and ran into one problem: the module needed for from tools.ribonanzanet.network import RibonanzaNet in notebooks/design.ipynb is missing. How can I solve this issue?

Thanks in advance!

Best,

list index out of range

Dear authors,

Thank you so much for such great work. I'm really interested in it.
I got an issue here, after getting processed.pt, I tried to run main.py. It uses das_split.pt to split the data into train, val and test, right? But I got an "index out of range" error. I wonder if you have any clues why this happened? By the way, I saw under the data folder, you have three '_split.pt' files, can you please tell me the difference between them?

Error log:
Traceback (most recent call last):
  File "/home/yjwang/geometric-rna-design/main.py", line 246, in <module>
    main(config, device)
  File "/home/yjwang/geometric-rna-design/main.py", line 39, in main
    train_list, val_list, test_list = get_data_splits(config, split_type=config.split)
  File "/home/yjwang/geometric-rna-design/main.py", line 119, in get_data_splits
    train_list = index_list_by_indices(data_list, train_idx_list)
  File "/home/yjwang/geometric-rna-design/main.py", line 113, in index_list_by_indices
    return [lst[index] for index in indices]
  File "/home/yjwang/geometric-rna-design/main.py", line 113, in <listcomp>
    return [lst[index] for index in indices]
IndexError: list index out of range

Thanks in advance,
yuehua

where is cluster.txt?

Your work is very impressive, and I would like to process the data following your instructions. However, when I run process_data.py, I get the following error:

Traceback (most recent call last):
  File "/cpfs01/projects-HDD/cfff-282dafecea22_HDD/cfff_siqi/RNA/geometric-rna-design/data/process_data.py", line 230, in <module>
    cluster_list_structsim = cluster_structure_similarity(
  File "/cpfs01/projects-HDD/cfff-282dafecea22_HDD/cfff_siqi/RNA/geometric-rna-design/data/../src/data/clustering_utils.py", line 148, in cluster_structure_similarity
    clustered_structures = run_qtmclust(
  File "/cpfs01/projects-HDD/cfff-282dafecea22_HDD/cfff_siqi/RNA/geometric-rna-design/data/../src/data/clustering_utils.py", line 116, in run_qtmclust
    output_clusters = parse_qtmclust_cluster_file(output_cluster_filepath)
  File "/cpfs01/projects-HDD/cfff-282dafecea22_HDD/cfff_siqi/RNA/geometric-rna-design/data/../src/data/clustering_utils.py", line 76, in parse_qtmclust_cluster_file
    with open(file_path) as file:
FileNotFoundError: [Errno 2] No such file or directory: 'cluster.txt'

I searched the ./data folder carefully but could not find this cluster.txt file. Is the file missing from this repo, or is something else going wrong? I'm looking forward to your answer, thanks!

Request for checkpoints and hyperparameters for NAR models

Hi @chaitjo, thank you for this awesome work!

Could you kindly share the checkpoints and the hyperparameters for the NAR models reported in Table-1 of your paper (image attached)? I tried to reproduce the scores using the configs/default.yaml file, with 'model' set to 'NARv1', and 'num_layers' set to '8', as discussed in the paper. But I think I am missing something here, because the sequence recovery scores I got were much lower (~0.52 vs 0.584).

Thank you for your help!

-Sazan

image

Ask for checkpoints

Hi @chaitjo,

Thanks for your great work first!
Could you provide a pretrained checkpoint to run inference directly?

Best regards,
Han

eval and test are very slow; is this normal?

Hello author,
While reproducing the experiments, I found that eval and test take a very long time. Is this normal?
image

Because the dataset has been updated relative to the original one, it now contains many new entries. In some of them, sec_struc has more left brackets than right brackets, which raised an error when computing the corresponding adj. I therefore modified the original code as follows, wrapping the computation in a try block and breaking out of the loop on error:

def dotbracket_to_adjacency(sec_struct: str) -> np.ndarray:
    """
    Convert secondary structure in dot-bracket notation to adjacency matrix.
    """
    n = len(sec_struct)
    adj = np.zeros((n, n), dtype=np.int8)
    stack = []
    empty_count = 0
    for i, db_char in enumerate(sec_struct):
        if db_char == '(':
            stack.append(i)
        elif db_char == ')':
            try:
                j = stack.pop()
                adj[i, j] = 1
                adj[j, i] = 1
            except IndexError:
                break
    return adj

Nothing else was changed.

eval and test each take close to 8 hours. Looking at resource usage, the GPU sits idle for long stretches while the CPU keeps running a process called contrafold. I would appreciate any insight.
