rfam / rfam-family-pipeline

Backend for the Rfam family building pipeline

Perl 98.83% R 0.26% Shell 0.13% Dockerfile 0.18% Python 0.60%

rfam-family-pipeline's Introduction

Running the Rfam family-building-pipeline locally using Docker

Download and install docker for your favourite OS
Clone the rfam-family-pipeline repository from GitHub:

git clone https://github.com/Rfam/rfam-family-pipeline.git

Go to the rfam-family-pipeline directory and build a docker image using the Dockerfile:

cd /path/to/rfam-family-pipeline
docker image build -t rfam-local .

Once the image is built, start a new container by calling:

docker run -i -t rfam-local:latest /bin/bash

In the container, create a working directory and start building families

⚠️ Your work will be lost after killing the container. To prevent that from happening do one of the following:

Copy all your hard work from within the container to your local machine:

docker container ls # use this to find the container id
docker container cp CONTAINER_ID:/workdir/within/container /path/to/local/dir

Mount a dedicated working directory on your local machine to the one in the docker container:

docker run -i -t -v /path/to/local/workdir:/workdir rfam-local:latest /bin/bash

Note: Update Rfam/Conf/rfam.conf to provide a local location of the sequence database and Rfam/Conf/rfam_local.conf to establish a connection with the public MySQL database.

For Developers:

To easily test any changes you make to the code, mount your local directory to the rfam-family-pipeline directory in the docker container:

docker run -i -t -v /path/to/local/rfam-family-pipeline:/Rfam/rfam-family-pipeline rfam-local:latest bash

❗ You can also mount a directory on your machine to the /workdir inside the container to have any testing output generated directly on your local machine:

docker run -i -t -v /path/to/local/rfam-family-pipeline:/Rfam/rfam-family-pipeline -v /path/to/local/dir:/workdir rfam-local:latest bash

rfam-family-pipeline's Issues

Improve -onlydesc option in rfci

Committing old families with the -onlydesc option currently raises a Full_Region error during QC. The QC tries to match hits in the full_region table with hits in the outlist and tblout files of the family. This fails for families that were built from old sequence databases, because those hits have been replaced by hits from the most recent genome searches. We need to either disable this check or limit it to new families only by adding an option to rfci.

Rewrite the PDB plugin

The PDB plugin should delete any entries associated with a family accession and then populate the table with the new hits.
The PDB fasta file should be generated dynamically and its location should be stored in rfam.conf.
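
A minimal sketch of the delete-then-populate step, written in Python with sqlite3 as a stand-in for the real database. The table name `pdb_full_region`, its columns, and the function name `replace_pdb_hits` are assumptions for illustration, not the plugin's actual schema or API. Both statements run in one transaction so a failed insert does not leave the family without hits.

```python
import sqlite3


def replace_pdb_hits(conn, rfam_acc, hits):
    """Replace all PDB hits of one family with a fresh set.

    hits is an iterable of (pdb_id, chain) pairs (hypothetical layout).
    """
    rows = [(rfam_acc, pdb_id, chain) for pdb_id, chain in hits]
    # "with conn" commits on success and rolls back if anything raises.
    with conn:
        conn.execute(
            "DELETE FROM pdb_full_region WHERE rfam_acc = ?", (rfam_acc,)
        )
        conn.executemany(
            "INSERT INTO pdb_full_region (rfam_acc, pdb_id, chain) "
            "VALUES (?, ?, ?)",
            rows,
        )
```

Entries that belong to other family accessions are left untouched.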

Resolve SS_cons2 tags in Rfam.seed alignments

Some of the families in Rfam from release 12.0 onwards include SS_cons2 tags, which do not comply with the STOCKHOLM format standard and are therefore not recognized by other tools such as easel.

Conflict with R2R tags.

Omit versions of rfamseq on Cloud

The filename of Rfamseq on the EBI cluster always displays the version e.g. rfamseq14_2.fa
To avoid updating rfam_cloud.conf with every release, replace the path to the latest version of the sequence database with a symbolic link whose name stays constant across releases.

For example:

Replace path in rfam_cloud.conf from /Rfam/rfamseq/rfamseq14_2.fa to /Rfam/rfamseq/rfamseq.fa

ln -s /path/to/rfamseq/14.2/rfamseq14_2.fa  /Rfam/rfamseq/rfamseq.fa

NOTE: /Rfam/rfamseq/rfamseq.fa needs to be re-indexed using esl-sfetch
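
A rough Python sketch of how the link could be repointed at release time without a moment where it is missing. The function name `point_stable_link` is hypothetical, and the approach (create a temporary link, then rename it over the stable one) assumes a POSIX filesystem.

```python
import os
from pathlib import Path


def point_stable_link(versioned_fasta: Path, stable_link: Path) -> None:
    """Make stable_link a symlink to versioned_fasta, replacing any old link."""
    tmp = stable_link.with_name(stable_link.name + ".tmp")
    tmp.unlink(missing_ok=True)          # clear leftovers from a failed run
    os.symlink(versioned_fasta, tmp)     # new link under a temporary name
    os.replace(tmp, stable_link)         # rename over the old link in one step
    # The index for stable_link still has to be rebuilt with esl-sfetch afterwards.
```

For example, `point_stable_link(Path("/path/to/rfamseq/14.2/rfamseq14_2.fa"), Path("/Rfam/rfamseq/rfamseq.fa"))` would correspond to the `ln -s` command above.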

Author duplicates

I've noticed a couple of duplicates in the author table. It looks like a synonym lookup weakness, meaning that although the author's synonym is there the software is not capable of identifying that, resulting in a new entry in the table.

TO DO:

Fix the bug in the code
Resolve any data issues in the tables author and family_author
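
One possible shape for a synonym-aware lookup, sketched in Python. The record layout (a `name` plus a list of `synonyms`) and the helper names are assumptions, since the real author table and lookup code are not shown here. The idea is to compare a normalised form of the incoming name against the name and every synonym before creating a new entry.

```python
def normalise(name):
    """Lower-case, drop dots and collapse whitespace, e.g. 'Smith J.' -> 'smith j'."""
    return " ".join(name.replace(".", " ").split()).casefold()


def find_author(authors, name):
    """Return the first author whose name or synonym matches, else None."""
    wanted = normalise(name)
    for author in authors:
        candidates = [author["name"], *author.get("synonyms", [])]
        if wanted in {normalise(c) for c in candidates}:
            return author
    return None
```

A new row would only be inserted when `find_author` returns `None`.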

HCV families

Add `is_significant` to PdbFullRegion model

The same changes will be used in the web code.

Add author ORCiD in DESC files

Add Manja Marz ORCiD in the DESC files of the following families:

U3
Fungi_U3
U11

Make sure the orcid is in the database

Update perl test data and resolve failures

Perl tests bio_rfam_htmlalignment.t and rfmake.t are failing after the changes made on branch hotfix-bitsc2evalue. Inputs/data need to be replaced by new examples.

Update ZWD families missing pseudoknots

A while back (see RT #282705) Zasha reported that some of his families are missing pseudoknots. I suggest updating the following families using the original ZWD alignments with RNAcentral IDs, as discussed in the Rfam 14 NAR paper.

Here are the families:
For more information related to these updates, see:
https://docs.google.com/spreadsheets/d/1JTZwtmNm6Zrozm4H7oYMWWBAj9CZPLeX20eF9yV9PbA/edit?usp=sharing

You can find all ZWD alignments with RNAcentral IDs here:
https://www.dropbox.com/s/7q7r3bairt3s4in/zwd-rnacentral-ids.zip?dl=0

Please note that some of the families could have been already fixed.

Rfam view plugins crashing

One of the view plugins is crashing for families RF00641 and RF00230, which breaks pseudoknot extraction and R-scape ss image generation.

Pseudoknots and SecondaryStructure plugins run without any issues on their own.

Do not validate taxids when using -onlydesc in rfci.pl

The following command

rfci.pl -onlydesc -m 'Fix characters in AU line' RF04184

resulted in an error:

ERROR trying to fetch taxids from genbank, reached maximum allowed number of failed attempts (10)

The family was successfully checked in after a few attempts, but in general -onlydesc should have prevented the taxid check.

Seems like the problem is in _commitEntry of Commit.pm.

More logs:

rfci.pl -onlydesc -m 'Fix characters in AU line' RF04184
Successfully loaded local copy of RF04184 through middleware
Successfully loaded SVN copy of RF04184 through middleware
No GO mappings for this family-have you tried to add any?
Look up:
http://www.geneontology.org/ or  http://www.ebi.ac.uk/QuickGO/
Failed to commit family, RF04184: [A repository hook failed: Commit failed (details follow):: Commit blocked by pre-commit hook (exit code 1) with output:
[
    [0] "trunk/Families/RF04184/DESC"
]
trunk/Families/RF04184/DESC[]
[]
ERROR trying to fetch taxids from genbank, reached maximum allowed number of failed attempts (10)
Tried to fetch taxid
-
at rfam-family-pipeline/Rfam/Lib/Bio/Rfam/FamilyIO.pm line 4156.
DBIx::Class::Storage::TxnScopeGuard::DESTROY(): A DBIx::Class::Storage::TxnScopeGuard went out of scope without explicit commit or error. Rolling back. at perl5/lib/perl5/Carp.pm line 291
 at rfam-family-pipeline/Rfam/Lib/Bio/Rfam/SVN/Client.pm line 597.
]
 at rfam-family-pipeline/Rfam/Lib/Bio/Rfam/SVN/Client.pm line 599.
	Bio::Rfam::SVN::Client::commitFamilyDESC(Bio::Rfam::SVN::Client=HASH(0x1cf17c8), "RF04184") called at rfam-family-pipeline/Rfam/Scripts/svn/rfci.pl line 187

ykkC family improvements

rename families
add new publications
update descriptions

Look for Breaker publications

RF02360 mismatching seed regions

The regions in the SEED alignment (start-end positions) do not match the ones in the seed_region table. Due to this the sequences cannot be extracted from the seed_alignment.

We need to:

Resolve the inconsistency
Generate seed region md5s for this family
Update the fasta file on the ftp

Fix `Tick-borne` spelling on codon

Currently we have 3 TBFV families with 3 different versions of spelling:

https://rfam.org/search?q=(RF03536%20or%20RF03537%20or%20RF03538)%20AND%20entry_type:%22Family%22

It should be Tick-borne everywhere.

Develop rfview.pl tool to manually launch family view processes

This will be a new Perl script that enables curators to launch view processes after committing a new family to Rfam.

The script will:

fetch the corresponding family UUID from the database
submit a new LSF job running all family view plugins
notify the user via email when the results are ready.

The corresponding family page URL should also be included in the email. For example, https://rfam.org/family/RF03116

possible options:

--rfam-acc: specifies the family accession
--plugin: specifies which plugin to launch (e.g. SecondaryStructure). Run all if not specified
--no-email: disables email notifications
--email [email@address]: allows the user to specify an alternative email address
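
The proposed interface can be sketched with Python's argparse to check that the options fit together (the real tool is planned as a Perl script). Treating `--no-email` and `--email` as mutually exclusive is an assumption, as are the function names.

```python
import argparse


def build_parser():
    """Option layout mirroring the list above; nothing here submits a job."""
    parser = argparse.ArgumentParser(prog="rfview")
    parser.add_argument("--rfam-acc", required=True,
                        help="family accession, e.g. RF03116")
    parser.add_argument("--plugin", default=None,
                        help="single plugin to run; all plugins if omitted")
    # Assumption: asking for no email and an alternative address conflicts.
    mail = parser.add_mutually_exclusive_group()
    mail.add_argument("--no-email", action="store_true",
                      help="disable email notifications")
    mail.add_argument("--email", metavar="ADDRESS",
                      help="alternative address for the notification")
    return parser


def family_url(rfam_acc):
    """Family page URL to include in the notification email."""
    return f"https://rfam.org/family/{rfam_acc}"
```

Calling `build_parser().parse_args(["--rfam-acc", "RF03116", "--plugin", "SecondaryStructure"])` yields a namespace with `rfam_acc` and `plugin` set and `no_email` left at `False`.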

Add timestamps to view process tables

This will make view process troubleshooting easier

Remove family description from R-scape images

The first text SVG element should be removed because the text is too small to be useful in text search results.

Example: remove RF00360_snoZ107_R87 from this SVG:
http://rfam.xfam.org/family/RF00360/image/rscape
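
A small Python sketch of the removal, assuming the R-scape output uses the standard SVG namespace and that the family label really is the first `text` element in document order. The function name `drop_first_text` is made up for this example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # keep plain <svg>/<text> tags on output


def drop_first_text(svg_source):
    """Return the SVG with its first <text> element removed."""
    root = ET.fromstring(svg_source)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    first_text = next(root.iter(f"{{{SVG_NS}}}text"), None)
    if first_text is not None:
        parent_of[first_text].remove(first_text)
    return ET.tostring(root, encoding="unicode")
```

All other `text` elements, such as nucleotide labels, are left in place.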

Fix Pending families

Duplicated authors

Burge SW author_id: 11
Burge S author_id: 64

Flavivirus PLRV

Overhaul rfreplace.pl script for new independent SEED paradigm

This was originally motivated by a suggestion from Franz Lang that users should not need to enforce the Rfam convention for sequence names in SEED alignments - a script should be able to do that. Anton suggested rfreplace could do that. After looking at current rfreplace code it is clear it needs to be scrapped and rewritten anyway for the new independent SEED paradigm (SEED seqs now only need to be in GenBank/RNAcentral as opposed to in Rfamseq (old paradigm)). Whereas current rfreplace looked only at hits in Rfamseq as candidates for replacing existing SEED seqs, we can now look at all seqs in GenBank/RNAcentral. Getting back to Franz's suggestion, if a sequence is found that is 100% identical in GenBank/RNAcentral that sequence's name can be used to rename the sequence in the input SEED.
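
The renaming idea can be illustrated with a short Python sketch. It only handles whole-sequence identity against an in-memory dictionary of reference sequences, and it ignores case, gaps and T/U differences when comparing; those choices, and the names `canonical` and `rename_by_identity`, are assumptions. A real implementation would need to search GenBank/RNAcentral and add start-end coordinates to the new name.

```python
def canonical(seq):
    """Comparable form of a sequence: upper case, RNA alphabet, no gap characters."""
    return seq.upper().replace("T", "U").replace("-", "").replace(".", "")


def rename_by_identity(seed, reference):
    """Rename SEED sequences that are 100% identical to a reference sequence.

    seed and reference both map sequence name -> sequence string.
    SEED sequences without an identical reference keep their name.
    """
    name_for = {}
    for ref_name, ref_seq in reference.items():
        name_for.setdefault(canonical(ref_seq), ref_name)
    return {
        name_for.get(canonical(seq), name): seq
        for name, seq in seed.items()
    }
```

The aligned sequence itself is returned unchanged, so the alignment columns are preserved.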

Multiple taxids for a single sequence

@nancyontiveros encountered a failure when running rfci.pl to check in RF00050 which has sequence AE017333.1 in its seed. AE017333.1 has multiple taxids in GenBank (https://www.ncbi.nlm.nih.gov/nuccore/AE017333.1/) due to the presence of prophage sequences in the genome (search for 'taxon' on that page).

Excerpts from the rfci.pl error, reported by Nancy:

A repository hook failed: Commit failed (details follow):: Commit blocked by pre-commit hook

genbank_fetch_seq_info for AE017333.1, > 1 taxids read: 279010 and 1230651

I remember coding this and putting in the check that only 1 taxid be returned per sequence because otherwise we'd have to know which one was correct and it was not immediately obvious how to know that. I'll look into it more now.

Unable to download accessions via tree controls

https://rfam.org/family/RNaseP_nuc#tabview=tab4

Family inconsistency between rfam_live and SVN

There is some inconsistency between the families existing in rfam_live database and the SVN repo, which would need to be resolved.

The corresponding Rfam family accessions are listed below:

RF03551
RF03928
RF03708
RF04034
RF03529

These accessions belong to miRNA families, which were recently updated from miRBase.

For example:
RF03551 represents miRNA mir-506 along with accessions RF03529 and RF01910, the latter of which is the old/initial accession assigned to this family. Attempting to re-commit a family with an ID that already exists in Rfam creates a new entry in rfam_live, but the SVN commit fails, resulting in "ghost families".

All associated entries can be found by executing the following query:

Select * from family where rfam_id='mir-506'

This query will return the following 3 accessions for mir-506:

RF01910
RF03529
RF03551
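
To find every affected family rather than querying one ID at a time, a grouped query can list all IDs that map to more than one accession. The Python sketch below uses sqlite3 as a stand-in for rfam_live; the column name `rfam_acc` and the function name are assumptions.

```python
import sqlite3


def duplicated_ids(conn):
    """Map each rfam_id that has several accessions to its sorted accessions."""
    rows = conn.execute(
        "SELECT rfam_id, rfam_acc FROM family "
        "WHERE rfam_id IN ("
        "  SELECT rfam_id FROM family GROUP BY rfam_id HAVING COUNT(*) > 1"
        ") ORDER BY rfam_id, rfam_acc"
    ).fetchall()
    grouped = {}
    for rfam_id, rfam_acc in rows:
        grouped.setdefault(rfam_id, []).append(rfam_acc)
    return grouped
```

With the three mir-506 rows above in the table, the result would contain `"mir-506"` mapped to `["RF01910", "RF03529", "RF03551"]`.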

Solution:

Find all associated miRNA entries from rfam_live
Checkout oldest families from the SVN using rfco.pl (e.g. RF01910)
Replace old SEED with the updated one from miRBase
Rerun rfsearch.pl followed by rfmake.pl (for thresholds see the relevant report)
Update DESC with miRBase latest literature ref using add_ref.pl
Ensure family passes QC using rqc-all.pl
Recommit family back to Rfam using rfci.pl
Delete redundant entries from rfam_live (e.g. RF03529, RF03551)

Note: rfkill.pl does not work in this case because there are no entities in the SVN repository for accessions RF03529, RF03551. Hence the term "ghost families".

Get authorization token

Re-threshold RF00857

According to Sam's suggestions, mir-233 (RF00857) requires re-thresholding with the cutoff set to 52.10 to get rid of any sequences that aren't from C. elegans.

Install Perl libraries on Codon

Generate an autobundle on the old cluster and copy to Codon
- Use an existing bundle in case of problems /nfs/production/xfam/rfam/Snapshot_2020_01_28_00.pm
Install on Codon
🤞

Add plant telomerase family and include in clan http://europepmc.org/article/MED/31392988

Enable rfco/rfci on the cloud

Tools accessing the SVN currently do not work on Rfam cloud. We need to enable SVN functionality on cloud to allow the editing of existing families and Rfam curators to commit new families.

rfco
rfci

rfseed.pl remove

Add an option to remove a SEED sequence. After rfseed.pl addme, it is not possible to quickly remove a sequence once it has been added. Currently this can be done manually, but then other lines have to be adjusted.

QC step based on length of model

Anton and I decided that a new QC step should be added that prevents families being built with consensus lengths less than 50, and warns users for lengths under 60.
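
The rule could look like this in Python (a sketch only; the function name and the returned labels are placeholders, and the thresholds are the ones proposed above):

```python
def check_consensus_length(clen, fail_below=50, warn_below=60):
    """Classify a model by consensus length.

    Returns "fail" below fail_below, "warn" below warn_below, else "pass".
    """
    if clen < fail_below:
        return "fail"
    if clen < warn_below:
        return "warn"
    return "pass"
```

A consensus length of exactly 50 is allowed but still triggers the warning.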

Install external software on Codon

Use Rfam Cloud Dockerfile as a guide to install the software from the list below on Codon. At the end of the process the Old folder and the New folder should be identical.
Create a new file Dockerfile_codon based on Dockerfile_cloud_dev that can install all the software from the list below.

Old folder: /nfs/production/xfam/rfam/rfam_rh74/software
New folder: /hps/software/users/agb/rfam

Everything needs to be done using the rfamprod user to ensure correct permissions.

⚠️ Try installing the same software versions where possible, except for Infernal and R-scape which should be on the latest version.

Allow PDB sequences in alignments

The new rfci-md5 code currently only allows a sequence to exist in GenBank, RNAcentral or Rfamseq. The code needs to be expanded to allow PDB sequences in SEED alignments too. For example, this improvement will enable the building of the sgRNA family with RNAcentral accession URS0000B21DDC with PDB sequences in the SEED alignment.

Resolve group name issue in rfam/cloud:kubes image

This isn't really an error, but has been confusing for the users. Find a way to silence group error message when running:

rfcloud --start

The previous command will result in the following error:

groups: cannot find name for group ID 2000
rfam-user@rfam-login-pod-username-59cdc77574-s2kzd:/workdir$

Fix family RF02791 - no entries in full_region table

Fix clan checkin (rclci) to populate clan_membership table

On clan check-in, the clan_membership table does not get populated. This issue occurred when clan CL00114 was stuck in pending mode. The clan has been restored and successfully checked in to the SVN repository. However, no family members have been assigned to it.
CLANDESC and all member family DESC files seem to be correct.

Create a new S6 regulator family

This is a new family described in one of Michelle Meyer's papers.
The paper also includes an alignment.

Update names for RF00080

Michelle Meyer suggested that we update the names for family RF00080 and pointed us to paper 25794618

Add Telomerase_Asco to a clan

Potential new families in paper 23396277

Michelle pointed to one of her papers (PMID:23396277) which was published in 2013.
The paper contains new families as well as old ones that already exist in Rfam.

Update literature references of L10 families

There is a 2013 paper which can be added to the literature reference of the L10 families.
For example: RF00557

Fix family entry RF04034 (_2 label issue)

This is a new entry for MIR2118 (old Rfam entry RF01911). The family was not correctly committed to the SVN repository due to the previous version. However, a new entry was created in rfam_live.

TO DO:

Delete erroneous RF04034 entry from the database
Checkout RF01911 from the SVN
Replace old RF01911 SEED with new MIR2118 from miRBase
Rebuild and commit the new family

*Note: Potential candidate for merging SEEDs

Find new HCV families and review packaging signals

Family RF02585 - could be a signal to start replication

Fix duplicated regions in SEEDSCORES files

The current version of the code oddly misbehaves by appending a second pair of coordinates to the already existing ones in the SEEDSCORES files.

Example from the RF00050 SEEDSCORES file:

ABCH01000022.1/43466-43613/1-148    1  148  ABCH01000022.1/43466-43613    133.2  2.1e-27   1  140  0  seed
AAKK02000001.1/258130-258269/1-140  1  140  AAKK02000001.1/258130-258269  132.1  4.1e-27   1  140  0  seed
AE016796.2/1572120-1571980/1-141    1  141  AE016796.2/1572120-1571980    131.7  5.3e-27   1  140  0  seed
AAWP01000010.1/3949-3809/1-141      1  141  AAWP01000010.1/3949-3809      130.9  8.7e-27   1  140  0  seed
AANE01000019.1/87182-87037/1-146    1  146  AANE01000019.1/87182-87037    130.6    1e-26   1  140  0  seed
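
Until the root cause is fixed, the extra coordinates in the first column can be detected and stripped with a regular expression. This Python sketch assumes the unwanted part is always a second trailing `/start-end` pair, as in the lines above; the function name is hypothetical.

```python
import re

# name/start-end followed by one more /start-end at the end of the string
_APPENDED = re.compile(r"^(?P<name>.+/\d+-\d+)/\d+-\d+$")


def strip_appended_coords(name):
    """Drop a second trailing /start-end pair; return other names unchanged."""
    match = _APPENDED.match(name)
    return match.group("name") if match else name
```

Names that carry a single coordinate pair, such as the fourth column in the example, pass through untouched.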

Remove duplicates from rfamseq

e.g. BBXM01000045.1/167634-167567

Remove hard-coded paths in updateTaxonomyWebsearch.pl

Last date tested: 03/03/2021

A better/automated way should be established for updating updateTaxonomyWebsearch.pl to be in sync with the most recent data at NCBI and ensuring RfamLive is always up to date for a new release.

To be included as a required release pre-processing step.