
This project forked from ml4bio/rna-fm

RNA foundation model

Home Page: https://ml4bio.github.io/RNA-FM/

License: MIT License

Python 79.78% Jupyter Notebook 20.22%

RNA-FM

This repository contains the code and pre-trained models for the RNA foundation model (RNA-FM). RNA-FM outperforms all tested single-sequence RNA language models across a variety of structure prediction tasks as well as several function-related tasks. More details about RNA-FM are available in our paper, "Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions" (Chen et al., 2022).

Overview

Citation
@article{chen2022interpretable,
  title={Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions},
  author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and King, Irwin and others},
  journal={arXiv preprint arXiv:2204.00300},
  year={2022}
}

Create Environment with Conda

First, download the repository and create the environment.

git clone https://github.com/ml4bio/RNA-FM.git
cd ./RNA-FM
conda env create -f environment.yml

Then activate the "RNA-FM" environment and enter the workspace.

conda activate RNA-FM
cd ./redevelop

Access pre-trained models.

Download the pre-trained models from this Google Drive link and place the .pth files in the pretrained folder.

Apply RNA-FM with Existing Scripts.

1. Embedding Extraction.

python launch/predict.py --config="pretrained/extract_embedding.yml" \
--data_path="./data/examples/example.fasta" --save_dir="./results" \
--save_frequency 1 --save_embeddings

RNA-FM embeddings of shape (L, 640), where L is the sequence length, will be saved in $save_dir/representations.
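
The saved embeddings can be read back for downstream analysis. The sketch below is a minimal example that assumes each sequence is written as its own NumPy .npy file inside $save_dir/representations; the file naming and the helper name `load_embeddings` are assumptions for illustration, not part of the RNA-FM API.

```python
from pathlib import Path

import numpy as np


def load_embeddings(rep_dir, expected_dim=640):
    """Load every .npy file in rep_dir into a {file_stem: array} dict.

    Assumes one file per sequence, each holding an (L, 640) array.
    """
    embeddings = {}
    for path in sorted(Path(rep_dir).glob("*.npy")):
        emb = np.load(path)
        # Each array should be 2-D: one 640-dimensional row per position.
        if emb.ndim != 2 or emb.shape[1] != expected_dim:
            raise ValueError(f"{path.name}: unexpected shape {emb.shape}")
        embeddings[path.stem] = emb
    return embeddings
```

Point `load_embeddings` at the representations folder under whatever `--save_dir` you passed to the script. The shape check fails early if a file does not match the (L, 640) layout described above.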

2. Downstream Prediction - RNA secondary structure.

python launch/predict.py --config="pretrained/ss_prediction.yml" \
--data_path="./data/examples/example.fasta" --save_dir="./results" \
--save_frequency 1

The predicted probability maps are saved as .npy files, and the post-processed binary predictions are saved as .ct files. Both are written to $save_dir/r-ss.
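
To use the binary predictions programmatically, the .ct output can be turned into a list of base pairs. The sketch below assumes the files follow the common connectivity-table layout (a header line starting with the base count, then one line per base with the partner index in the fifth column, 0 meaning unpaired); the exact header text RNA-FM writes is not specified here, and `parse_ct` is a hypothetical helper name.

```python
def parse_ct(text):
    """Return (sequence, pairs) from the contents of a .ct file.

    pairs is a sorted list of 1-based (i, j) tuples with i < j.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header: base count first, followed by a free-text title.
    n_bases = int(lines[0].split()[0])
    sequence = []
    pairs = set()
    for line in lines[1:1 + n_bases]:
        fields = line.split()
        index, base, partner = int(fields[0]), fields[1], int(fields[4])
        sequence.append(base)
        if partner > 0:  # 0 marks an unpaired base
            # Each pair appears twice in the file; the set keeps one copy.
            pairs.add((min(index, partner), max(index, partner)))
    return "".join(sequence), sorted(pairs)
```

Read a file with `open(path).read()` and pass the text to `parse_ct`. Storing each pair as (smaller index, larger index) makes the result independent of which partner's line is read first.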

3. Online Version - RNA-FM server.

If you have trouble deploying RNA-FM locally, you can use the online version, the RNA-FM server, from this link. You can submit jobs to the server and download the results afterwards, without setting up an environment or using your own computational resources.

Quick Start for Further Development.

PyTorch is a prerequisite and must be installed before you use this repository. If you only want to use the pre-trained language model, you can install rna-fm in your own environment with pip. You can either install rna-fm from PyPI:

pip install rna-fm

or install rna-fm from GitHub:

cd ./RNA-FM
pip install .

After installation, you can load the RNA-FM and extract its embeddings with the following code:

import torch
import fm

# Load RNA-FM model
model, alphabet = fm.pretrained.rna_fm_t12()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disables dropout for deterministic results

# Prepare data
data = [
    ("RNA1", "GGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCCCU"),
    ("RNA2", "GGGUGUCGCUCAGUUGGUAGAGUGCUUGCCUGGCAUGCAAGAAACCUUGGUUCAAUCCCCAGCACUGCA"),
    ("RNA3", "CGAUUCNCGUUCCC--CCGCCUCCA"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Extract embeddings (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[12])
token_embeddings = results["representations"][12]
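
The `data` list above is a list of (name, sequence) tuples, so sequences stored in a FASTA file can be fed to `batch_converter` once they are read into that form. The following is a minimal sketch using only the standard library; `read_fasta` is a hypothetical helper, not part of the `fm` package. It upper-cases sequences and rewrites T as U on the assumption that the input may be DNA-style; remove that step if your input is already RNA.

```python
def read_fasta(path):
    """Return a list of (name, sequence) tuples from a FASTA file."""
    records = []
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                # Use the first word of the header as the record name.
                name = (line[1:].split() or ["unnamed"])[0]
                chunks = []
            elif name is not None:
                # Assumption: normalise case and write T as U.
                chunks.append(line.upper().replace("T", "U"))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

The returned list can be passed directly in place of `data`, for example `batch_converter(read_fasta("./data/examples/example.fasta"))`.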

More tutorials can be found at https://ml4bio.github.io/RNA-FM/. The related notebooks are stored in the tutorials folder.

Citations

If you find the models useful in your research, we ask that you cite the relevant paper:

For RNA-FM:

@article{chen2022interpretable,
  title={Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions},
  author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and King, Irwin and others},
  journal={arXiv preprint arXiv:2204.00300},
  year={2022}
}

The model in this code builds on the esm sequence modeling framework, and we use the fairseq sequence modeling framework to train our RNA language model. We greatly appreciate these two excellent works!

License

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

Contributors

mydkzgj, liyu95
