I have been reading your paper and trying to understand the code in this GitHub repository. I have several questions I would like to ask you. In this issue, I will ask about the code corresponding to the loss function.
class StructuredLossWithTurner(nn.Module):
    """Structured loss combining a hinge-style gap with a Turner-model term.

    The returned loss is the length-normalized gap between the
    loss-augmented predicted score and the reference score, plus
    ``sl_weight`` times the squared difference between the reference score
    and the Turner-model score, plus an optional L1 penalty.

    NOTE(review): the margin/max part of the structured hinge loss is not
    visible here — ``self.model`` is called with ``reference=pairs`` and the
    ``loss_pos_*``/``loss_neg_*`` penalties, which presumably performs
    loss-augmented decoding internally; confirm against the model code.
    """

    def __init__(self, model, loss_pos_paired=0, loss_neg_paired=0, loss_pos_unpaired=0, loss_neg_unpaired=0,
                 l1_weight=0., l2_weight=0., sl_weight=1., verbose=False):
        """Store the wrapped model, margin penalties, and regularization weights.

        Args:
            model: scoring/folding model being trained.
            loss_pos_paired: margin penalty for paired positions (positive side).
            loss_neg_paired: margin penalty for paired positions (negative side).
            loss_pos_unpaired: margin penalty for unpaired positions (positive side).
            loss_neg_unpaired: margin penalty for unpaired positions (negative side).
            l1_weight: coefficient of the L1 penalty on model parameters.
            l2_weight: stored but currently unused — the L2 term in
                ``forward`` is commented out.
            sl_weight: weight of the squared (ref - turner_ref) term.
            verbose: if True, print per-example diagnostics in ``forward``.
        """
        super(StructuredLossWithTurner, self).__init__()
        self.model = model
        self.loss_pos_paired = loss_pos_paired
        self.loss_neg_paired = loss_neg_paired
        self.loss_pos_unpaired = loss_pos_unpaired
        self.loss_neg_unpaired = loss_neg_unpaired
        self.l1_weight = l1_weight
        self.l2_weight = l2_weight
        self.sl_weight = sl_weight
        self.verbose = verbose
        # Deferred, function-scope imports — presumably to avoid a circular
        # import at package load time; verify against the package layout.
        from .fold.rnafold import RNAFold
        from . import param_turner2004
        # Reuse the model's own Turner scorer when it exposes one;
        # otherwise build a standalone RNAFold with Turner 2004 parameters
        # on the same device as the model's parameters.
        if getattr(self.model, "turner", None):
            self.turner = self.model.turner
        else:
            self.turner = RNAFold(param_turner2004).to(next(self.model.parameters()).device)

    def forward(self, seq, pairs, fname=None):
        """Compute the loss for a batch of sequences.

        Args:
            seq: batch of sequences (objects supporting ``len``; presumably
                strings — confirm against the data loader).
            pairs: reference structures, used both as the margin reference
                for loss-augmented prediction and as folding constraints.
            fname: optional identifier printed only when the loss diverges.

        Returns:
            Loss tensor (scalar per the ``.item()`` calls below, which
            assume batch size 1 — TODO confirm with callers).
        """
        # Loss-augmented prediction: the reference structure and the
        # loss_* penalties are handed to the model, so any margin-based
        # augmentation happens inside the model call itself.
        pred, pred_s, _, param = self.model(seq, return_param=True, reference=pairs,
                                            loss_pos_paired=self.loss_pos_paired, loss_neg_paired=self.loss_neg_paired,
                                            loss_pos_unpaired=self.loss_pos_unpaired, loss_neg_unpaired=self.loss_neg_unpaired)
        # Score of the reference structure under the same parameters.
        ref, ref_s, _ = self.model(seq, param=param, constraint=pairs, max_internal_length=None)
        # Turner-model score of the reference structure; no gradients needed.
        with torch.no_grad():
            ref2, ref2_s, _ = self.turner(seq, constraint=pairs, max_internal_length=None)
        # Per-example sequence lengths, used to length-normalize the loss.
        l = torch.tensor([len(s) for s in seq], device=pred.device)
        # Length-normalized gap between augmented prediction and reference.
        loss = (pred - ref) / l
        # Squared deviation of the reference score from the Turner score.
        loss += self.sl_weight * (ref - ref2) * (ref - ref2) / l
        if self.verbose:
            print("Loss = {} = ({} - {})".format(loss.item(), pred.item(), ref.item()))
            print(seq)
            print(pred_s)
            print(ref_s)
        # Diagnostic dump when the loss explodes or becomes NaN.
        if loss.item() > 1e10 or torch.isnan(loss):
            print()
            print(fname)
            print(loss.item(), pred.item(), ref.item())
            print(seq)
        if self.l1_weight > 0.0:
            # L1 regularization accumulated over all model parameters.
            for p in self.model.parameters():
                loss += self.l1_weight * torch.sum(torch.abs(p))
        # L2 regularization is currently disabled (kept for reference);
        # note self.l2_weight is therefore unused.
        # if self.l2_weight > 0.0:
        #     l2_reg = 0.0
        #     for p in self.model.parameters():
        #         l2_reg += torch.sum((self.l2_weight * p) ** 2)
        #     loss += torch.sqrt(l2_reg)
        return loss
From the Python code alone, I am not able to match this to the loss function in the paper. I can see both the L1 and L2 regularization terms, but I cannot find the code for the structured hinge loss (the margin term and the max function). I suppose f(x, y) in the paper corresponds to the score (the first returned item) from self.model, so I assume the max-margin part is embedded in self.model, which is implemented in the C++ part of the code. Since the paper uses MixedFold, could you kindly point out which part and lines of the C++ code implement the max-margin function?