rfam / rfam-family-pipeline

Backend for the Rfam family building pipeline

Perl 98.83% R 0.26% Shell 0.13% Dockerfile 0.18% Python 0.60%
ncrna bioinformatics perl

rfam-family-pipeline's Introduction

Running the Rfam family-building-pipeline locally using Docker

  1. Download and install Docker for your favourite OS
  2. Clone the rfam-family-pipeline repository from GitHub:
git clone https://github.com/Rfam/rfam-family-pipeline.git
  3. Go to the rfam-family-pipeline directory and build a Docker image using the Dockerfile:
cd /path/to/rfam-family-pipeline
docker image build -t rfam-local .
  4. Once the image is built, start a new container by calling:
docker run -i -t rfam-local:latest /bin/bash
  5. In the container, create a working directory and start building families

⚠️ Your work will be lost after killing the container. To prevent that from happening do one of the following:

  1. Copy all your hard work from within the container to your local machine:
docker container ls # use this to find the container id
docker container cp CONTAINER_ID:/workdir/within/container /path/to/local/dir
  2. Mount a dedicated working directory on your local machine to the one in the docker container (the -v option must come before the image name):
docker run -i -t -v /path/to/local/workdir:/workdir rfam-local:latest /bin/bash

Note: Update Rfam/Conf/rfam.conf to provide a local location of the sequence database and Rfam/Conf/rfam_local.conf to establish a connection with the public MySQL database.

For Developers:

To easily test any changes you make to the code, mount your local directory to the rfam-family-pipeline directory in the docker container.

docker run -i -t -v /path/to/local/rfam-family-pipeline:/Rfam/rfam-family-pipeline rfam-local:latest bash

❗ You can also mount a directory on your machine to the /workdir inside the container to have any testing output generated directly on your local machine:

docker run -i -t -v /path/to/local/rfam-family-pipeline:/Rfam/rfam-family-pipeline -v /path/to/local/dir:/workdir rfam-local:latest bash

rfam-family-pipeline's People

Contributors

alexbateman1, antonpetrov, aurel-l, blakesweeney, carlosribas, codegit, emmaco, evanfloden, ilavidas, jainamistry, jgtate, kalvari, nawrockie, neomorphic, ppgardne, samgriffithsjones, swburge, testuser12342

Forkers

pythseq hlkfoz

rfam-family-pipeline's Issues

Improve -onlydesc option in rfci

Committing old families with the -onlydesc option currently raises a Full_Region error during QC. The QC tries to match hits in the full_region table with hits in the outlist and tblout files of the family. It fails for families that were built from old sequence databases, as those hits have been replaced by hits from the most recent genome searches. We need to either disable this check or limit it to new families only by adding an option to rfci.

Rewrite the PDB plugin

  • The pdb plugin should delete any entries associated with a family accession and then populate table with new hits
  • The PDB fasta file should be generated dynamically and its location should be stored in rfam.conf
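
The first bullet describes a delete-then-repopulate pattern. A minimal sketch of that pattern follows, using Python's sqlite3 module as a stand-in for the real database. The table name pdb_hits, its columns, and the function name replace_pdb_hits are all hypothetical and are not taken from the Rfam schema or the plugin code.

```python
import sqlite3


def replace_pdb_hits(conn, rfam_acc, hits):
    """Replace all PDB hits for one family inside a single transaction.

    `hits` is an iterable of (pdb_id, chain) pairs. The table and column
    names are assumptions made for this sketch only.
    """
    # `with conn` commits on success and rolls back on any exception, so the
    # table never ends up with the old hits deleted and the new ones missing.
    with conn:
        conn.execute("DELETE FROM pdb_hits WHERE rfam_acc = ?", (rfam_acc,))
        conn.executemany(
            "INSERT INTO pdb_hits (rfam_acc, pdb_id, chain) VALUES (?, ?, ?)",
            [(rfam_acc, pdb_id, chain) for pdb_id, chain in hits],
        )
```

Running both statements in one transaction means a failed insert leaves the previous hits in place.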

Resolve SS_cons2 tags in Rfam.seed alignments

Some of the families in Rfam from release 12.0 onwards include SS_cons2 tags, which do not comply with the STOCKHOLM format standard and are therefore not recognized by other tools such as easel.

Conflict with R2R tags.

Omit versions of rfamseq on Cloud

The filename of Rfamseq on the EBI cluster always includes the version, e.g. rfamseq14_2.fa.
To avoid updating rfam_cloud.conf with every release, replace the path to the latest version of the sequence database with a symbolic link whose name stays constant across releases.

For example:

Replace path in rfam_cloud.conf from /Rfam/rfamseq/rfamseq14_2.fa to /Rfam/rfamseq/rfamseq.fa

ln -s /path/to/rfamseq/14.2/rfamseq14_2.fa  /Rfam/rfamseq/rfamseq.fa

NOTE: /Rfam/rfamseq/rfamseq.fa needs to be re-indexed using esl-sfetch
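
When a new release arrives, the constant link has to be re-pointed. The following is a minimal Python sketch of doing that without a window in which the link is missing. The function name point_symlink and the temporary ".tmp" suffix are assumptions for illustration and are not part of the pipeline.

```python
import os


def point_symlink(target, link_path):
    """Point link_path at target, replacing any existing link in one step."""
    tmp_path = link_path + ".tmp"  # hypothetical temporary name
    # Clear a leftover temporary link from an earlier interrupted run.
    if os.path.islink(tmp_path):
        os.remove(tmp_path)
    # Create the new link under the temporary name, then rename it over the
    # old one. The rename replaces the link itself, not the file it points to.
    os.symlink(target, tmp_path)
    os.replace(tmp_path, link_path)
```

As the note above says, the sequence file behind the link still needs to be re-indexed afterwards (for example with esl-sfetch --index).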

Author duplicates

I've noticed a couple of duplicates in the author table. It looks like a synonym lookup weakness, meaning that although the author's synonym is there the software is not capable of identifying that, resulting in a new entry in the table.

TO DO:

  • Fix the bug in the code
  • Resolve any data issues in the tables author and family_author

Add author ORCiD in DESC files

Add Manja Marz ORCiD in the DESC files of the following families:

  • U3
  • Fungi_U3
  • U11

Make sure the orcid is in the database

Update ZWD families missing pseudoknots

A while back (see RT #282705) Zasha reported that some of his families are missing pseudoknots. I suggest updating the following families using the original ZWD alignments with RNAcentral IDs, as discussed in the Rfam 14 NAR paper.

For more information related to these updates, see:
https://docs.google.com/spreadsheets/d/1JTZwtmNm6Zrozm4H7oYMWWBAj9CZPLeX20eF9yV9PbA/edit?usp=sharing

Here are the families:

  • RF01689|AdoCbl-variant
  • RF01696|Chlorobi-1
  • RF01734|crcB
  • RF01704|Downstream-peptide
  • RF01735|epsC
  • RF01739|glnA
  • RF01745|manA
  • RF01750|pfl
  • RF01717|PhotoRC-II
  • RF01754|radC
  • RF01725|SAM-I-IV-variant
  • RF01728|STAXI
  • RF01761|wcaG
  • RF01763|ykkC-III
  • RF02032|GOLLD
  • RF02033|HEARO
  • RF02034|IMES-1
  • RF02035|IMES-2

You can find all ZWD alignments with RNAcentral IDs here:
https://www.dropbox.com/s/7q7r3bairt3s4in/zwd-rnacentral-ids.zip?dl=0

Please note that some of the families may already have been fixed.

Rfam view plugins crashing

One of the view plugins is crashing for families RF00641 and RF00230, which breaks pseudoknot extraction and R-scape secondary structure image generation.

Pseudoknots and SecondaryStructure plugins run without any issues on their own.

Do not validate taxids when using -onlydesc in rfci.pl

The following command

rfci.pl -onlydesc -m 'Fix characters in AU line' RF04184

resulted in an error:

ERROR trying to fetch taxids from genbank, reached maximum allowed number of failed attempts (10)

The family was successfully checked in after a few attempts, but in general -onlydesc should have prevented the taxid check.

Seems like the problem is in _commitEntry of Commit.pm.

More logs:

rfci.pl -onlydesc -m 'Fix characters in AU line' RF04184
Successfully loaded local copy of RF04184 through middleware
Successfully loaded SVN copy of RF04184 through middleware
No GO mappings for this family-have you tried to add any?
Look up:
http://www.geneontology.org/ or  http://www.ebi.ac.uk/QuickGO/
Failed to commit family, RF04184: [A repository hook failed: Commit failed (details follow):: Commit blocked by pre-commit hook (exit code 1) with output:
[
    [0] "trunk/Families/RF04184/DESC"
]
trunk/Families/RF04184/DESC[]
[]
ERROR trying to fetch taxids from genbank, reached maximum allowed number of failed attempts (10)
Tried to fetch taxid
-
at rfam-family-pipeline/Rfam/Lib/Bio/Rfam/FamilyIO.pm line 4156.
DBIx::Class::Storage::TxnScopeGuard::DESTROY(): A DBIx::Class::Storage::TxnScopeGuard went out of scope without explicit commit or error. Rolling back. at perl5/lib/perl5/Carp.pm line 291
 at rfam-family-pipeline/Rfam/Lib/Bio/Rfam/SVN/Client.pm line 597.
]
 at rfam-family-pipeline/Rfam/Lib/Bio/Rfam/SVN/Client.pm line 599.
	Bio::Rfam::SVN::Client::commitFamilyDESC(Bio::Rfam::SVN::Client=HASH(0x1cf17c8), "RF04184") called at rfam-family-pipeline/Rfam/Scripts/svn/rfci.pl line 187


RF02360 mismatching seed regions

The regions in the SEED alignment (start-end positions) do not match the ones in the seed_region table. As a result, the sequences cannot be extracted from the SEED alignment.

We need to:

  • Resolve the inconsistency
  • Generate seed region md5s for this family
  • Update the fasta file on the ftp
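
To locate the inconsistency, the start-end positions in the SEED sequence names can be compared with the rows of the seed_region table. A minimal Python sketch follows. It assumes SEED names have the form accession/start-end and that table rows are available as (accession, start, end) tuples. The function names are hypothetical.

```python
import re

# Assumed name layout: <accession>/<start>-<end>
NAME_RE = re.compile(r"^(?P<acc>\S+)/(?P<start>\d+)-(?P<end>\d+)$")


def parse_seed_name(name):
    """Split 'ACC/start-end' into (acc, start, end) with integer coordinates."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected SEED sequence name: {name!r}")
    return match["acc"], int(match["start"]), int(match["end"])


def mismatched_regions(seed_names, table_rows):
    """Return (only_in_seed, only_in_table) as sorted lists of regions."""
    in_seed = {parse_seed_name(name) for name in seed_names}
    in_table = set(table_rows)
    return sorted(in_seed - in_table), sorted(in_table - in_seed)
```

Two empty lists mean the SEED and the table agree. Anything else lists the regions to reconcile.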

Develop rfview.pl tool to manually launch family view processes

This will be a new Perl script that enables curators to launch view processes after committing a new family to Rfam.

The script will:

  1. fetch the corresponding family UUID from the database
  2. submit a new LSF job running all family view plugins
  3. notify the user via email when the results are ready.

The corresponding family page URL should also be included in the email. For example, https://rfam.org/family/RF03116

Possible options:

  • --rfam-acc: specifies the family accession
  • --plugin: specifies which plugin to launch (e.g. SecondaryStructure). Run all if not specified
  • --no-email: disables email notifications
  • --email [email@address]: allows the user to specify an alternative email address
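
The proposed command-line interface can be prototyped before the Perl script is written. Below is a minimal Python sketch using argparse that mirrors the options listed above. Making --no-email and --email mutually exclusive is an assumption rather than something the proposal states, and the helper family_url is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the proposed rfview options."""
    parser = argparse.ArgumentParser(prog="rfview")
    parser.add_argument("--rfam-acc", required=True,
                        help="family accession, e.g. RF03116")
    parser.add_argument("--plugin", default=None,
                        help="plugin to launch; all plugins run if omitted")
    # Assumption: disabling email and redirecting it cannot be combined.
    email = parser.add_mutually_exclusive_group()
    email.add_argument("--no-email", action="store_true",
                       help="disable email notifications")
    email.add_argument("--email", metavar="ADDRESS",
                       help="alternative notification address")
    return parser


def family_url(rfam_acc):
    """Family page URL to include in the notification email."""
    return f"https://rfam.org/family/{rfam_acc}"
```

For example, build_parser().parse_args(["--rfam-acc", "RF03116", "--plugin", "SecondaryStructure"]) yields a namespace with rfam_acc and plugin set and no_email left as False.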

Overhaul rfreplace.pl script for new independent SEED paradigm

This was originally motivated by a suggestion from Franz Lang that users should not need to enforce the Rfam convention for sequence names in SEED alignments - a script should be able to do that. Anton suggested rfreplace could do that. After looking at current rfreplace code it is clear it needs to be scrapped and rewritten anyway for the new independent SEED paradigm (SEED seqs now only need to be in GenBank/RNAcentral as opposed to in Rfamseq (old paradigm)). Whereas current rfreplace looked only at hits in Rfamseq as candidates for replacing existing SEED seqs, we can now look at all seqs in GenBank/RNAcentral. Getting back to Franz's suggestion, if a sequence is found that is 100% identical in GenBank/RNAcentral that sequence's name can be used to rename the sequence in the input SEED.
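
The renaming idea can be illustrated with a small Python sketch. It only handles the simplest case, where a SEED sequence is identical to a whole reference sequence already held in memory. A real implementation would need to search GenBank/RNAcentral and handle matches to subsequences of longer records. The function names and the dictionary-based inputs are assumptions for illustration.

```python
def normalise(seq):
    """Upper-case, drop alignment gaps ('-' and '.'), and treat T as U."""
    return seq.upper().replace("-", "").replace(".", "").replace("T", "U")


def rename_seed(seed, reference):
    """Map SEED names to reference names for 100% identical sequences.

    `seed` maps a user-chosen name to its (possibly gapped) sequence and
    `reference` maps a database name to its sequence. SEED sequences with
    no identical reference sequence are left out of the result.
    """
    by_sequence = {}
    for name, seq in reference.items():
        # Keep the first reference name seen for each distinct sequence.
        by_sequence.setdefault(normalise(seq), name)
    renames = {}
    for old_name, seq in seed.items():
        new_name = by_sequence.get(normalise(seq))
        if new_name is not None:
            renames[old_name] = new_name
    return renames
```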

Multiple taxids for a single sequence

@nancyontiveros encountered a failure when running rfci.pl to check in RF00050 which has sequence AE017333.1 in its seed. AE017333.1 has multiple taxids in GenBank (https://www.ncbi.nlm.nih.gov/nuccore/AE017333.1/) due to the presence of prophage sequences in the genome (search for 'taxon' on that page).

Excerpts from the rfci.pl error, reported by Nancy:

A repository hook failed: Commit failed (details follow):: Commit blocked by pre-commit hook
genbank_fetch_seq_info for AE017333.1, > 1 taxids read: 279010 and 1230651

I remember coding this and putting in the check that only 1 taxid be returned per sequence because otherwise we'd have to know which one was correct and it was not immediately obvious how to know that. I'll look into it more now.

Family inconsistency between rfam_live and SVN

There is some inconsistency between the families existing in rfam_live database and the SVN repo, which would need to be resolved.

The corresponding Rfam family accessions are listed below:

  • RF03551
  • RF03928
  • RF03708
  • RF04034
  • RF03529

These accessions belong to miRNA families, which were recently updated from miRBase.

For example:
RF03551 represents miRNA mir-506, along with accessions RF03529 and RF01910, the latter of which is the old/initial accession assigned to this family. Attempting to re-commit families with IDs that already exist in Rfam creates a new entry in rfam_live, but the SVN commit fails, resulting in "ghost families".

All associated entries can be found by executing the following query:

Select * from family where rfam_id='mir-506'

This query will return the following 3 accessions for mir-506:

  • RF01910
  • RF03529
  • RF03551

Solution:

  1. Find all associated miRNA entries from rfam_live
  2. Checkout oldest families from the SVN using rfco.pl (e.g. RF01910)
  3. Replace old SEED with the updated one from miRBase
  4. Rerun rfsearch.pl followed by rfmake.pl (for thresholds see the relevant report)
  5. Update DESC with miRBase latest literature ref using add_ref.pl
  6. Ensure family passes QC using rqc-all.pl
  7. Recommit family back to Rfam using rfci.pl
  8. Delete redundant entries from rfam_live (e.g. RF03529, RF03551)

Note: rfkill.pl does not work in this case because there are no entities in the SVN repository for accessions RF03529, RF03551. Hence the term "ghost families".
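
Step 1 of the solution can be generalised from a single rfam_id to every rfam_id that has more than one accession. The sketch below shows the idea in Python, with an in-memory sqlite3 connection standing in for rfam_live (which is a MySQL database). The column name rfam_acc and the function name duplicated_ids are assumptions. Only the family table and the rfam_id column appear in the query above.

```python
import sqlite3


def duplicated_ids(conn):
    """Return {rfam_id: [accessions]} for IDs with more than one accession."""
    rows = conn.execute(
        "SELECT rfam_id, rfam_acc FROM family ORDER BY rfam_id, rfam_acc"
    ).fetchall()
    grouped = {}
    for rfam_id, rfam_acc in rows:
        grouped.setdefault(rfam_id, []).append(rfam_acc)
    # Keep only IDs shared by several accessions: candidate ghost families.
    return {rfam_id: accs for rfam_id, accs in grouped.items() if len(accs) > 1}
```

The oldest accession in each group is the one to check out from SVN. The remaining ones are candidates for deletion from rfam_live.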

Re-threshold RF00857

According to Sam's suggestions, mir-233 (RF00857) requires re-thresholding with the cutoff set to 52.10 to get rid of any sequences that aren't from C. elegans.

Install Perl libraries on Codon

  • Generate an autobundle on the old cluster and copy to Codon
    • Use an existing bundle in case of problems /nfs/production/xfam/rfam/Snapshot_2020_01_28_00.pm
  • Install on Codon
  • 🤞

Enable rfco/rfci on the cloud

Tools accessing the SVN currently do not work on Rfam cloud. We need to enable SVN functionality on cloud to allow the editing of existing families and Rfam curators to commit new families.

  • rfco
  • rfci

rfseed.pl remove

An option to remove a SEED sequence is needed. After running rfseed.pl addme, there is no quick way to remove a sequence once it has been added. Currently it can be done manually, but then other lines have to be adjusted.

QC step based on length of model

Anton and I decided that a new QC step should be added that prevents families being built with consensus lengths less than 50, and warns users for lengths under 60.
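
The rule can be written down as a small Python sketch. The thresholds 50 and 60 come from the description above. The function name and the "fail", "warn" and "pass" labels are hypothetical, and the real check would live in the Perl QC code.

```python
def check_consensus_length(clen, fail_below=50, warn_below=60):
    """Classify a model consensus length for the proposed QC step."""
    if clen < fail_below:
        return "fail"  # family must not be built
    if clen < warn_below:
        return "warn"  # allowed, but the curator is warned
    return "pass"
```

With these defaults, a length of 49 fails, lengths from 50 to 59 produce a warning, and 60 or more pass.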

Install external software on Codon

  • Use Rfam Cloud Dockerfile as a guide to install the software from the list below on Codon. At the end of the process the Old folder and the New folder should be identical.
  • Create a new file Dockerfile_codon based on Dockerfile_cloud_dev that can install all the software from the list below.

Old folder: /nfs/production/xfam/rfam/rfam_rh74/software
New folder: /hps/software/users/agb/rfam

Everything needs to be done using the rfamprod user to ensure correct permissions.

⚠️ Try installing the same software versions where possible, except for Infernal and R-scape which should be on the latest version.

  • alimask
  • aspell
  • bedToBigBed
  • clustalo
  • cmalign
  • cmbuild
  • cmcalibrate
  • cmconvert
  • cmemit
  • cmfetch
  • cmpress
  • cmscan
  • cmsearch
  • cmstat
  • dnaml
  • esl-afetch
  • esl-alimanip
  • esl-alimap
  • esl-alimask
  • esl-alimerge
  • esl-alipid
  • esl-alirev
  • esl-alistat
  • esl-cluster
  • esl-compalign
  • esl-compstruct
  • esl-construct
  • esl-histplot
  • esl-mask
  • esl-randomize-sqfile.pl
  • esl-reformat
  • esl-selectn
  • esl-seqrange
  • esl-seqstat
  • esl-sfetch
  • esl-shuffle
  • esl-ssdraw
  • esl-translate
  • esl-weight
  • faSplit
  • FastTree
  • fetchChromSizes
  • hmmalign
  • hmmbuild
  • hmmconvert
  • hmmemit
  • hmmfetch
  • hmmlogo
  • hmmpgmd
  • hmmpgmd_shard
  • hmmpress
  • hmmscan
  • hmmsearch
  • hmmsim
  • hmmstat
  • hubCheck
  • jackhmmer
  • mafft
  • makehmmerdb
  • muscle
  • nextflow
  • nhmmer
  • nhmmscan
  • phmmer
  • plot_outlist.R
  • RNA2Dfold
  • RNAaliduplex
  • RNAalifold
  • RNAcode
  • RNAcofold
  • RNAdistance
  • RNAdos
  • RNAduplex
  • RNAeval
  • RNAfold
  • RNAforester
  • RNAheat
  • RNAinverse
  • RNALalifold
  • RNALfold
  • RNAlocmin
  • RNApaln
  • RNAparconv
  • RNApdist
  • RNAPKplex
  • RNAplex
  • RNAplfold
  • RNAplot
  • RNApvmin
  • RNAsnoop
  • RNAsubopt
  • RNAup
  • R-scape (latest)
  • seqkit
  • stockholm2Arc.R
  • t_coffee

Allow PDB sequences in alignments

The new rfci-md5 code currently only allows a sequence to exist in GenBank, RNAcentral or Rfamseq. The code needs to be expanded to allow PDB sequences in SEED alignments too. For example, this improvement will enable the building of the sgRNA family with RNAcentral accession URS0000B21DDC and PDB sequences in the SEED alignment.

Resolve group name issue in rfam/cloud:kubes image

This isn't really an error, but has been confusing for the users. Find a way to silence group error message when running:

rfcloud --start

The previous command will result in the following error:

groups: cannot find name for group ID 2000
rfam-user@rfam-login-pod-username-59cdc77574-s2kzd:/workdir$ 

Fix clan checkin (rclci) to populate clan_membership table

On clan check-in, the clan_membership table does not get populated. This issue occurred when clan CL00114 was stuck in pending mode. The clan has been restored and successfully checked in to the SVN repository. However, no family members have been assigned to it.
CLANDESC and all member family DESC files seem to be correct.

Fix family entry RF04034 (_2 label issue)

This is a new entry for MIR2118 (old Rfam entry RF01911). The family was not correctly committed to the SVN repository due to the previous version. However, a new entry was created in rfam_live.

TO DO:

  • Delete erroneous RF04034 entry from the database
  • Checkout RF01911 from the SVN
  • Replace old RF01911 SEED with new MIR2118 from miRBase
  • Rebuild and commit the new family

*Note: Potential candidate for merging SEEDs

Fix duplicated regions in SEEDSCORES files

The current version of the code oddly misbehaves by appending a second pair of coordinates to the already existing ones in the SEEDSCORES files.

Example from the RF00050 SEEDSCORES file:

ABCH01000022.1/43466-43613/1-148    1  148  ABCH01000022.1/43466-43613    133.2  2.1e-27   1  140  0  seed
AAKK02000001.1/258130-258269/1-140  1  140  AAKK02000001.1/258130-258269  132.1  4.1e-27   1  140  0  seed
AE016796.2/1572120-1571980/1-141    1  141  AE016796.2/1572120-1571980    131.7  5.3e-27   1  140  0  seed
AAWP01000010.1/3949-3809/1-141      1  141  AAWP01000010.1/3949-3809      130.9  8.7e-27   1  140  0  seed
AANE01000019.1/87182-87037/1-146    1  146  AANE01000019.1/87182-87037    130.6    1e-26   1  140  0  seed
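
Until the bug is fixed, affected lines can be found by looking for a second coordinate pair appended to the name in the first column. A minimal Python sketch follows, assuming whitespace-separated columns as in the example above. The function name find_duplicated_regions is hypothetical.

```python
import re

# A name that ends in two coordinate pairs: <acc>/<start>-<end>/<start>-<end>
DOUBLE_COORDS = re.compile(r"^(?P<name>\S+/\d+-\d+)/(?P<extra>\d+-\d+)$")


def find_duplicated_regions(lines):
    """Return (expected_name, appended_coordinates) for each affected line."""
    affected = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        match = DOUBLE_COORDS.match(fields[0])
        if match:
            affected.append((match["name"], match["extra"]))
    return affected
```

For the first example line, this reports the expected name ABCH01000022.1/43466-43613 and the appended coordinates 1-148.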
