The rfam-website from rfam

Improve GO term provenance

Currently Rfam families are associated with GO terms but no additional metadata is captured. It would be useful to provide the following for all GO terms:

Qualifier (especially important if it is 'NOT')
Reference (e.g. PubMed ID or a GO_REF describing how the annotations were made)
Evidence code
Date
Assigned by

Documentation: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/README

If implemented, the information can be propagated to RNAcentral.
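
A minimal sketch of how one annotation with the fields listed above might be represented. The class name, field names and example values are assumptions for illustration, not the Rfam schema:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class GoAnnotation:
    """One GO term attached to one Rfam family, with provenance (hypothetical layout)."""
    rfam_acc: str                    # Rfam family accession
    go_id: str                       # GO term identifier
    evidence_code: str               # GO evidence code, e.g. "IEA"
    reference: str                   # PubMed ID or GO_REF
    assigned_by: str                 # who made the annotation
    annotation_date: date
    qualifier: Optional[str] = None  # e.g. "NOT"; None when no qualifier applies

    def is_negated(self) -> bool:
        # A 'NOT' qualifier reverses the meaning of the annotation, so
        # consumers must never drop it silently.
        if self.qualifier is None:
            return False
        return "NOT" in self.qualifier.upper().split("|")
```

Keeping the qualifier as an explicit field makes it harder to lose a 'NOT' annotation when records are exported elsewhere.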

Add GO terms to all families

As many families as possible need to have GO terms.
GO terms need to be consistent within RNA types (for example, all riboswitches should have a translation regulation GO term).

Problem downloading FASTA alignment of motifs

I tried to download the FASTA alignment of a motif in Safari and got an empty file. For example, the FASTA alignment at http://rfam.xfam.org/motif/RM00004#tabview=tab1 downloads as an empty file, whereas the Stockholm file looks correct.

Nagios status page

Nagios needs a page with a standard string of characters that will only be present if the site is up and the database is available.
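
A rough sketch of the logic such a page could use. The sentinel string, the function name and the idea of passing in a database probe are all assumptions; the real page would be wired into the website's own framework:

```python
from typing import Callable, Tuple

# Hypothetical sentinel: any fixed string that appears nowhere else on the site would do.
SENTINEL = "RFAM_STATUS_OK"


def status_page(db_probe: Callable[[], object]) -> Tuple[int, str]:
    """Return (HTTP status, body); the sentinel is emitted only if the probe succeeds."""
    try:
        db_probe()  # e.g. a cheap query such as "SELECT 1" against the live database
    except Exception:
        # The site itself is up (we are answering), but the database is not.
        return 503, "database unavailable"
    return 200, SENTINEL
```

Nagios would then only need to check that the response body contains the sentinel string.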

Indexing new data for release 13.0

We need to add Genome as a fourth entity type in Rfam search in addition to Family, Clan, and Motif.

The Genome object should contain the following fields:

uniprot_reference_proteome_id - empty if UPID not available
rfam_genome_id - empty if UPID is available
gca_accession - empty if GCA not available
description
length - in nucleotides
taxonomy_lineage - string like Bacteria; Firmicutes; etc
ncbi_taxonomy_id cross reference - see comment below
num_rfam_hits - number of significant Rfam family hits
num_rfam_families - number of distinct Rfam families with significant hits
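
The field list above could be sketched as a record like the one below. The class is hypothetical, and the validation rules reflect one reading of the list (exactly one of the UPID or the Rfam genome id is set), which is an assumption:

```python
from dataclasses import dataclass


@dataclass
class Genome:
    """Search entry for one genome (illustrative only, not the real index schema)."""
    description: str
    length: int                              # in nucleotides
    taxonomy_lineage: str                    # e.g. "Bacteria; Firmicutes; ..."
    ncbi_taxonomy_id: int                    # indexed as a cross reference
    num_rfam_hits: int                       # significant Rfam family hits
    num_rfam_families: int                   # distinct families with significant hits
    uniprot_reference_proteome_id: str = ""  # empty if UPID not available
    rfam_genome_id: str = ""                 # empty if UPID is available
    gca_accession: str = ""                  # empty if GCA not available

    def __post_init__(self) -> None:
        # Assumption: a genome is identified by a UPID or an Rfam genome id, never both.
        if bool(self.uniprot_reference_proteome_id) == bool(self.rfam_genome_id):
            raise ValueError("set exactly one of UPID and rfam_genome_id")
        # Every distinct family contributes at least one hit.
        if self.num_rfam_families > self.num_rfam_hits:
            raise ValueError("num_rfam_families cannot exceed num_rfam_hits")
```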

Change wiki titles of Flavivirus

The Flavivirus families have an error on the Summary page: the wiki content is not loading.
These families have an auto_wiki identifier of 2759, which is Flavivirus_3_UTR. However, https://en.wikipedia.org/wiki/Flavivirus_3_UTR is not a valid entry.
It seems these families should use 2754, Flavivirus_3'_UTR.
I will change these families to Flavivirus_3'_UTR by updating their auto_wiki ID in the database.

RF03536
RF03537
RF03538
RF03539
RF03540
RF03541
RF03542
RF03544
RF03545
RF03547

Text search improvements

Incomplete Rfam.tar.gz mapping for Rfam>=14.0

It seems that the file mapping Rfam families to PDB codes is incomplete in Rfam 14.0 and 14.1.

In 14.0 (ftp://ftp.ebi.ac.uk/pub/databases/Rfam/14.0/Rfam.tar.gz) the PDB code is missing (the first column holds Rfam accessions):

pdb_id	chain	pdb_start	pdb_end	bit_score	evalue_score	cm_start	cm_end	hex_colour
RF00001	C	3	118	77.20	1.6e-20	1	119	0064f4
RF00001	9	1	121	77.40	1.4e-20	1	119	ebeb30
RF00001	9	1	121	77.40	1.4e-20	1	119	93c090
RF00001	B	1	121	77.40	1.4e-20	1	119	c008ae
RF00001	B	1	121	77.40	1.4e-20	1	119	8484c0

In 14.1 (ftp://ftp.ebi.ac.uk/pub/databases/Rfam/14.1/Rfam.pdb.gz) the header is missing and there are no Rfam IDs:

3hjw	D	1	57	48.50	1.3e-11	1	70	1fc01f
3lwo	D	1	57	48.50	1.3e-11	1	70	ff87a4
3lwp	D	1	57	48.50	1.3e-11	1	70	ebeb30
3lwq	D	1	57	48.50	1.3e-11	1	70	f29242
3lwr	D	1	57	48.50	1.3e-11	1	70	8585e6
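
A small sanity check along these lines could catch both problems before a release. It is only a sketch: the function name is made up, the expected header is copied from the 14.0 excerpt, and it assumes that a complete file carries an Rfam accession and a PDB code somewhere in the first two columns:

```python
import re

EXPECTED_HEADER = ["pdb_id", "chain", "pdb_start", "pdb_end", "bit_score",
                   "evalue_score", "cm_start", "cm_end", "hex_colour"]
RFAM_ACC = re.compile(r"^RF\d{5}$")
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")  # 4 characters, first one a digit


def check_mapping(lines):
    """Return a list of problems found in a tab-separated mapping file."""
    problems = []
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    if not rows:
        return ["file is empty"]
    if rows[0] == EXPECTED_HEADER:
        rows = rows[1:]
    else:
        problems.append("header row is missing")
    # Only look at the leading columns: later numeric columns could
    # accidentally look like a 4-character PDB code.
    leading = [value for row in rows for value in row[:2]]
    if not any(RFAM_ACC.match(v) for v in leading):
        problems.append("no Rfam accession found")
    if not any(PDB_ID.match(v) for v in leading):
        problems.append("no PDB code found")
    return problems
```

Run against the two excerpts above, this would flag the missing PDB code in 14.0 and the missing header and Rfam accessions in 14.1.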

Add GFF and BED support for new genome-centric content

provide downloadable files on FTP
add links to sequence and genome summary pages
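
For reference, a sketch of how a single hit might be written in each format. The function names, the "Rfam" source label, the "ncRNA" feature type and the Name attribute are assumptions; the point is the coordinate difference (BED is 0-based and half-open, GFF3 is 1-based and inclusive):

```python
def to_bed(seq_acc, start, end, rfam_acc, bit_score, strand):
    """Format one hit as a BED6 line.

    start/end are 1-based inclusive with start <= end; BED wants 0-based, half-open.
    """
    bed_score = max(0, min(1000, round(bit_score)))  # BED scores are integers in 0..1000
    return "\t".join([seq_acc, str(start - 1), str(end), rfam_acc, str(bed_score), strand])


def to_gff3(seq_acc, start, end, rfam_acc, bit_score, strand):
    """Format one hit as a GFF3 line; coordinates stay 1-based inclusive."""
    return "\t".join([seq_acc, "Rfam", "ncRNA", str(start), str(end),
                      str(bit_score), strand, ".", f"Name={rfam_acc}"])
```

Hits on the reverse strand need their coordinates swapped so that start <= end before calling either function, with the orientation carried by the strand column.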

The average sequence length field

Request to bring back the average sequence length field, which would allow sorting families by increasing length. User feedback: it was very useful to have an overview of sequence lengths in Rfam families when selecting a data set.

Add GO and SO term visualisation

Add a page that lets the users browse GO and SO terms associated with Rfam families.

Inaccurate PDB mapping

It looks like some PDB chains do not match the right family.

For example, chains 1A and 2A from 5DOX should match the Bacterial LSU but they match the 23S pseudoknot (RF01118).

Sort entries by number of 3D structures

I am trying to list all the Rfam families that have both pseudoknots and 3D structures, but at the moment it is not possible to sort by the number of 3D structures. This would be very useful.

Blame @kalvari

Fix broken Sanger links on Rfam wiki pages

Example: http://www.sanger.ac.uk/cgi-bin/Rfam/getacc?RF00177
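
If the links follow the pattern in the example, a bulk rewrite could look roughly like this. The target URL form (https://rfam.org/family/<accession>) and the function name are assumptions, and other Sanger URL shapes are not handled:

```python
import re

# Matches the old Sanger CGI link and captures the family accession.
SANGER_LINK = re.compile(r"https?://www\.sanger\.ac\.uk/cgi-bin/Rfam/getacc\?(RF\d{5})")


def fix_sanger_links(text):
    """Replace old Sanger getacc links with family links on rfam.org."""
    return SANGER_LINK.sub(r"https://rfam.org/family/\1", text)
```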

Switch to the new R-chie

There is a new version of R-chie - we should try to install it and switch to it on the Rfam website soon:
https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkaa708/5911741

R-scape improvements

indicate how many Rscape-significant base pairs are not shown in the displayed structure. This would be both a quality control hint (if an Rfam curated structure was messed up and missing a bunch of base pairs), plus an indication of whether a pseudoknot might be present.
include a note of how many Rfam-annotated base pairs are not shown in the Rfam structure because they're an annotated pseudoknot.
display pseudoknots. Perhaps draw a bracket over a contiguous string of nucleotides that are one side of the PK stem and lettering the two halves A and a, B and b.

Ensure full https support

The left sidebar links are broken when the site is loaded over https.

Change links from NDB to NAKB

NDB is being replaced by NAKB (https://nakb.org/). The links in Rfam that point to NDB should be replaced by links to NAKB.

Missing wiki content

For example, GNRA Rfam motif links to Tetraloop on Wikipedia but the wiki content does not appear on the Rfam page.

Sequence search issue

I used web-search to search for accession AP011496.1 (RF02547 : mtPerm-5S) and the sequence matched all the families in the LSU clan apart from the 5S RNA family I took the sequence from.

The families are listed below:
RF02543 - LSU_rRNA_eukarya
RF02541 - LSU_rRNA_bacteria
RF02540 - LSU_rRNA_archaea

Check that CMs can be downloaded

The website had a bug where CMs could not be downloaded. It isn't clear why but this should be fixed. We should check it post release 14.9.

Fixes for Rfam 13.0

Add ORCID links to all authors

Each author name should appear on a separate line in DESC files followed by ORCID id:

AU Author AB 0000-0000-0000-0000

add ORCID to curation tools
display on the website
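
A sketch of parsing such a line, assuming exactly the layout shown above (the AU tag, the author name, then the ORCID iD separated by whitespace). The function name is made up, and lines without an ORCID iD are simply reported as non-matching here:

```python
import re

# An ORCID iD is four groups of four characters; the last character may be X.
AU_LINE = re.compile(
    r"^AU\s+(?P<name>.+?)\s+(?P<orcid>\d{4}-\d{4}-\d{4}-\d{3}[\dX])\s*$"
)


def parse_author(line):
    """Return (name, orcid) for an AU line, or None if the line does not match."""
    match = AU_LINE.match(line)
    if match is None:
        return None
    return match.group("name"), match.group("orcid")
```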

Mapping spreadsheet

Fix issue with HTML alignments

It seems not all HTML alignments can be viewed. One which cannot is:

https://rfam.org/family/RF04222#tabview=tab2

When trying to view the HTML alignment nothing shows up.

Develop new curation tools for adding new genomes to sequence database

Rfam Curator needs to be able to add or delete a genome using UPID or GCA accession.

Update websearch to clan compete hits

The web search currently does not apply clan competition to hits, which leads to spurious results.
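
For context, a deliberately simplified sketch of what clan competition does: when hits from families in the same clan overlap, only the best-scoring one is kept. The names below are hypothetical, any overlap counts as a conflict, and strand is ignored, so this is not the production pipeline's exact rule:

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    rfam_acc: str
    start: int        # 1-based inclusive, start <= end
    end: int
    bit_score: float


def overlaps(a: Hit, b: Hit) -> bool:
    return a.start <= b.end and b.start <= a.end


def clan_compete(hits: List[Hit], clan_of: Dict[str, str]) -> List[Hit]:
    """Keep the highest-scoring hit among overlapping hits from the same clan."""
    kept: List[Hit] = []
    # Visit hits from best to worst so that a kept hit always outscores later ones.
    for hit in sorted(hits, key=lambda h: h.bit_score, reverse=True):
        clan = clan_of.get(hit.rfam_acc)  # None for families outside any clan
        beaten = clan is not None and any(
            clan_of.get(k.rfam_acc) == clan and overlaps(hit, k) for k in kept
        )
        if not beaten:
            kept.append(hit)
    return kept
```

Families that do not belong to a clan are never removed by this step.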

Show only significant PDB hits

Use the is_significant field to filter the entries.

Redesign homepage

update header and footer
show featured families
embed Twitter feed
remove old search options

Add secondary structure to sequence summary pages

Fold the sequence using the CM, show the structure in Rfam and export it to RNAcentral too.

Add a new Downloads tab to Family pages

Several users pointed out that it's difficult to find where to download various family specific files, so I suggest creating a new Downloads tab on the left hand side of Rfam family pages. It would aggregate all the downloadable items that are spread across all the other tabs, such as:

Covariance model
All FASTA sequences
Sequences as a table
Seed alignment in various formats
Seed tree
Species tree
Secondary structure diagrams as images

There may be other downloadable items that I missed so please click around the family pages to double check.

We can keep the existing download links as they are or drop them in favour of the new tab (this can be decided on a case by case basis).

Add SO term to all thermometers

RNA Thermometer - SO:0002168

See The-Sequence-Ontology/SO-Ontologies#404 for details.

curl SSLv3 alert handshake failure when accessing the website from Ubuntu 20.04

Hi,

This fails with Ubuntu 20.04:

curl https://rfam.xfam.org
#curl: (35) error:14094410:SSL routines:ssl3_read_bytes:sslv3 alert handshake failure

but works fine with Ubuntu < 20.04 and on Windows and macOS Mojave.

This seems to happen with some websites because of a combination of three reasons: server misconfiguration, increased TLS security level in Ubuntu 20.04 by default, and a bug in OpenSSL 1.1.1. See Ensembl/ensembl-rest#427 for a similar issue with the Ensembl server.

FWIW this breaks Bioconductor package rfaRm: https://bioconductor.org/checkResults/3.12/bioc-LATEST/rfaRm/nebbiolo1-install.html

Internally the package tries to access rfam.xfam.org with the following R code:

> library(xml2)

> read_xml("https://rfam.xfam.org/clans")
Error in open.connection(x, "rb") :
  error:14094410:SSL routines:ssl3_read_bytes:sslv3 alert handshake failure

> sessionInfo()
R version 4.0.2 Patched (2020-08-04 r78971)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.0.r78971/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.r78971/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xml2_1.3.2

loaded via a namespace (and not attached):
[1] compiler_4.0.2 curl_4.3

Thanks!
H.

Misleading labels in species sunburst

Example

Species sunburst for Clostridia in RF01315 shows that there are 64 sequences:

An example SQL query confirming the number of sequences:

SELECT CONCAT(t1.rfamseq_acc, '/', seq_start, '-', `seq_end`)
FROM full_region t1, rfamseq t2, taxonomy t3
WHERE t1.rfam_acc = 'RF01315' 
AND t1.rfamseq_acc = t2.rfamseq_acc
AND t2.ncbi_id = t3.ncbi_id
AND t3.tax_string LIKE '%Clostridia;%'
AND is_significant = 1
GROUP BY rfamseq_acc;

64 rows (like in sunburst UI) - note the GROUP BY clause

However, there are many more annotated regions:

SELECT CONCAT(t1.rfamseq_acc, '/', seq_start, '-', `seq_end`)
FROM full_region t1, rfamseq t2, taxonomy t3
WHERE t1.rfam_acc = 'RF01315' 
AND t1.rfamseq_acc = t2.rfamseq_acc
AND t2.ncbi_id = t3.ncbi_id
AND t3.tax_string LIKE '%Clostridia;%'
AND is_significant = 1;

6222 rows - no GROUP BY clause

So the number of entries in the resulting FASTA file is inconsistent with sunburst UI.
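
The gap between the two numbers comes from counting sequences versus counting regions: the GROUP BY collapses all regions on the same accession into one row. A toy illustration with a hypothetical helper, taking strings in the accession/start-end form produced by the queries above:

```python
def count_sequences_and_regions(regions):
    """Return (distinct sequence accessions, annotated regions)."""
    regions = list(regions)
    # Everything before the last "/" is the sequence accession.
    accessions = {region.rsplit("/", 1)[0] for region in regions}
    return len(accessions), len(regions)
```

The sunburst label corresponds to the first number, while the FASTA download contains one entry per region, i.e. the second number.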

Problem with Family page tab redirects on the old Rfam URL

When a user is on an Rfam page and clicks one of the tabs on the left panel, they are redirected to the Rfam home page instead of seeing the requested data.

This issue only occurs when accessing Rfam using the old URL http://rfam.xfam.org. Website functionality through the new URL http://rfam.org seems normal.

Retire old URL?

Sequence download extracts the wrong sequences

Sequence download is not working correctly: the extracted sequence is not the right one. For example, there is an inconsistency between the SEED sequence of family RF00005

AM286415.1/4382210-4382145
GAAGGCACGACAUUGCUCACAUUGCUUCCAGUGUUUACUU-AGCCAGC---CGGG-UGCUGGCUUUUUUUU

and the sequence downloaded directly from the website:

AM286415.1/4382210-4382145
CTTCCGTGCTGTAACGAGTGTAACGAAGGTCACAAATGAATCGGTCGGCCCACGACCGAAAAAAAA
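
One quick check that may help narrow this down: comparing the two strings above, the downloaded sequence appears to be the base-by-base complement of the ungapped SEED sequence (complemented but not reversed, and written in the DNA alphabet). The helper names below are made up, and this is only an observation about the quoted example, not a diagnosis of the bug:

```python
# Map each RNA base to its complementary base, written in the DNA alphabet.
COMPLEMENT = str.maketrans("ACGU", "TGCA")


def ungap(aligned):
    """Remove alignment gap characters."""
    return aligned.replace("-", "").replace(".", "")


def complement_as_dna(rna):
    """Complement an RNA sequence without reversing it."""
    return rna.upper().translate(COMPLEMENT)


seed = "GAAGGCACGACAUUGCUCACAUUGCUUCCAGUGUUUACUU-AGCCAGC---CGGG-UGCUGGCUUUUUUUU"
downloaded = "CTTCCGTGCTGTAACGAGTGTAACGAAGGTCACAAATGAATCGGTCGGCCCACGACCGAAAAAAAA"

is_plain_complement = complement_as_dna(ungap(seed)) == downloaded
print(is_plain_complement)
```

The coordinates 4382210-4382145 run from high to low, so the region is on the reverse strand, which might be where the strand handling goes wrong.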

Review sequence search code to prepare for Job Dispatcher changes

When a user submits a job through the Job Dispatcher APIs, a job identifier (e.g. iprscan5-R20170926-112045-0446-74958378-es) is returned. The identifier may now also end in "-p1m" or "-p2m", in addition to the current values "-es", "-pg" and "-oy". The latter three will eventually disappear, and another announcement will follow with proposed dates. These suffixes denote independent job clusters, which we use to provide resilience and failover.

If your code relies on the characters after the last dash, please take note of the change.

Examples of existing job identifiers:

iprscan5-R20170926-112045-0446-74958378-es
ncbiblast-R20170926-112324-0724-35376033-pg
emboss_needle-I20170926-112309-0557-5534481-oy

In addition to the above you should expect:

emboss_needle-I20170926-112309-0557-5534481-p2m
ncbiblast-R20170926-112324-0724-35376033-p1m
emboss_needle-I20170926-112309-0557-5534481-p2m
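
When reviewing the sequence search code, the main thing to look for is anything that assumes a two-character suffix. A sketch of a length-independent split (the function name is hypothetical):

```python
def split_job_id(job_id):
    """Split a job identifier into (base, cluster suffix) at the last dash."""
    base, _, cluster = job_id.rpartition("-")
    return base, cluster
```

Code that slices the last two characters or matches a pattern such as "-[a-z]{2}$" would mishandle "p1m" and "p2m", whereas splitting on the last dash works for both old and new suffixes.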
