
rfam / rfam-website


Rfam website source code

Home Page: https://rfam.org

License: Apache License 2.0

Perl 34.74% R 0.07% Shell 0.03% CSS 2.02% JavaScript 50.63% Python 11.38% Makefile 0.37% C++ 0.44% C 0.08% Emacs Lisp 0.02% Dockerfile 0.01% EJS 0.01% Cython 0.20%
bioinformatics docker ncrna perl

rfam-website's People

Contributors

alexbateman1, antonpetrov, aurel-l, blakesweeney, carlosribas, codegit, emmaco, evanfloden, ilavidas, jainamistry, jgtate, nawrockie, neomorphic, ppgardne, samgriffithsjones, swburge, testuser12342


Forkers

ankitskvmdam

rfam-website's Issues

Improve GO term provenance

Currently Rfam families are associated with GO terms but no additional metadata is captured. It would be useful to provide the following for all GO terms:

  • Qualifier (especially important if it is 'NOT')
  • Reference (e.g. PubMed ID or a GO_REF describing how the annotations were made)
  • Evidence code
  • Date
  • Assigned by

Documentation: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/README

If implemented, the information can be propagated to RNAcentral.
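
A minimal sketch of what a GO annotation record carrying this provenance could look like. The class name, the field names, and every example value are assumptions made for illustration; none of them is taken from the Rfam schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class GoAnnotation:
    """One family-to-GO-term link plus the provenance fields listed above."""

    rfam_acc: str              # Rfam family accession
    go_id: str                 # GO term identifier
    qualifier: Optional[str]   # e.g. "NOT"; None when there is no qualifier
    reference: str             # a PubMed ID or a GO_REF
    evidence_code: str         # GO evidence code
    annotation_date: date      # when the annotation was made
    assigned_by: str           # who made the annotation

    @property
    def is_negated(self) -> bool:
        # Assumption: several qualifiers may be joined with "|".
        if self.qualifier is None:
            return False
        return "NOT" in self.qualifier.upper().split("|")
```

Keeping the qualifier as its own field means a consumer such as RNAcentral could filter out negated annotations instead of silently treating them as positive ones.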

Add GO terms to all families

  • As many families as possible need to have GO terms.
  • GO terms need to be consistent within RNA types (for example, all riboswitches should have a translation regulation GO term).

Nagios status page

Nagios needs a page with a standard string of characters that will only be present if the site is up and the database is available.
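
A rough sketch of the idea, assuming the page body is produced by a function that first runs a trivial query. The marker string, the function name, and the use of sqlite3 as a stand-in for the real database are all hypothetical.

```python
import sqlite3
from typing import Callable, Tuple

# Hypothetical marker; Nagios would be configured to look for this exact string.
STATUS_OK = "RFAM_STATUS_OK"


def status_page(connect: Callable[[], sqlite3.Connection]) -> Tuple[int, str]:
    """Return (HTTP status, body). The marker appears only if a query succeeds."""
    try:
        conn = connect()
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
    except Exception:
        # Any failure to connect or query means the marker must not be shown.
        return 503, "database unavailable"
    return 200, STATUS_OK
```

Because the marker is emitted only after the query returns, a page that is served while the database is down never contains it.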

Indexing new data for release 13.0

We need to add Genome as a fourth entity type in Rfam search in addition to Family, Clan, and Motif.

The Genome object should contain the following fields:

  • uniprot_reference_proteome_id - empty if UPID not available
  • rfam_genome_id - empty if UPID is available
  • gca_accession - empty if GCA not available
  • description
  • length - in nucleotides
  • taxonomy_lineage - string like Bacteria; Firmicutes; etc
  • ncbi_taxonomy_id cross reference - see comment below
  • num_rfam_hits - number of significant Rfam family hits
  • num_rfam_families - number of distinct Rfam families with significant hits
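
One way to picture the Genome entry is as a small record type with a consistency check. This is only a sketch: the class and method names are invented, and the rule that exactly one of the two identifiers is filled in is an assumption read into the "empty if" notes above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GenomeEntry:
    uniprot_reference_proteome_id: str  # "" if no UPID is available
    rfam_genome_id: str                 # "" if a UPID is available
    gca_accession: str                  # "" if no GCA is available
    description: str
    length: int                         # in nucleotides
    taxonomy_lineage: str               # e.g. "Bacteria; Firmicutes; ..."
    ncbi_taxonomy_id: int               # cross-reference to NCBI Taxonomy
    num_rfam_hits: int                  # significant Rfam family hits
    num_rfam_families: int              # distinct families with significant hits

    def problems(self) -> List[str]:
        """Return a list of consistency problems; empty means the entry looks fine."""
        found = []
        has_upid = bool(self.uniprot_reference_proteome_id)
        has_rfam_id = bool(self.rfam_genome_id)
        if has_upid == has_rfam_id:
            found.append("expected exactly one of UPID and rfam_genome_id")
        if self.length <= 0:
            found.append("length must be positive")
        if self.num_rfam_families > self.num_rfam_hits:
            found.append("more distinct families than hits")
        return found
```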

Change wiki titles of Flavivirus

The Summary page of the flavivirus families has an error: the wiki content does not load.
These families have an auto_wiki identifier of 2759, which is Flavivirus_3_UTR. However, https://en.wikipedia.org/wiki/Flavivirus_3_UTR is not a valid entry.
It seems these families should use 2754, Flavivirus_3'_UTR.
The fix is to change these families to Flavivirus_3'_UTR by updating their auto_wiki ID in the database.

RF03536
RF03537
RF03538
RF03539
RF03540
RF03541
RF03542
RF03544
RF03545
RF03547
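
The update could be sketched as follows. The table name family and the columns rfam_acc and auto_wiki are assumptions about the schema, and sqlite3 stands in for the production database; the accessions and the two IDs are the ones listed in this issue.

```python
import sqlite3
from typing import Sequence

# Accessions from this issue.
FLAVIVIRUS_FAMILIES = [
    "RF03536", "RF03537", "RF03538", "RF03539", "RF03540",
    "RF03541", "RF03542", "RF03544", "RF03545", "RF03547",
]
OLD_AUTO_WIKI = 2759  # Flavivirus_3_UTR (not a valid Wikipedia entry)
NEW_AUTO_WIKI = 2754  # Flavivirus_3'_UTR


def fix_auto_wiki(conn: sqlite3.Connection, accessions: Sequence[str],
                  old_id: int, new_id: int) -> int:
    """Repoint the listed families from old_id to new_id; return rows changed."""
    placeholders = ",".join("?" for _ in accessions)
    cur = conn.execute(
        # Assumed schema: family(rfam_acc, auto_wiki).
        f"UPDATE family SET auto_wiki = ? "
        f"WHERE auto_wiki = ? AND rfam_acc IN ({placeholders})",
        [new_id, old_id, *accessions],
    )
    conn.commit()
    return cur.rowcount
```

Restricting the update to rows that still carry the old ID makes it safe to run twice, since the second run changes nothing.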

Text search improvements

  • add a facet has_pseudoknot to find families with pseudoknots
  • add export in CSV/JSON formats + CM download option
  • index more data (number of non-gapped columns, alignment length etc)
  • add autocomplete
  • index GO/SO term definitions, not just ids
  • add alphabetic sorting by name option
  • hide the brown "0 sequences, 0 structure, 0 species" panel at the top of the search results page
  • index all taxa (Mammalia, Bacteria etc), not just the terminal leaves
  • hide text labels in R-scape previews
  • show an alternative 2D preview in search results when R-scape SVG is not available (e.g. RF01795)

Incomplete Rfam.tar.gz mapping for Rfam>=14.0

It seems that the file mapping Rfam families to PDB codes is incomplete in Rfam 14.0 and 14.1.

In 14.0 (ftp://ftp.ebi.ac.uk/pub/databases/Rfam/14.0/Rfam.tar.gz) the PDB code is missing (the first column, although headed pdb_id, holds Rfam accessions):

pdb_id	chain	pdb_start	pdb_end	bit_score	evalue_score	cm_start	cm_end	hex_colour
RF00001	C	3	118	77.20	1.6e-20	1	119	0064f4
RF00001	9	1	121	77.40	1.4e-20	1	119	ebeb30
RF00001	9	1	121	77.40	1.4e-20	1	119	93c090
RF00001	B	1	121	77.40	1.4e-20	1	119	c008ae
RF00001	B	1	121	77.40	1.4e-20	1	119	8484c0

In 14.1 (ftp://ftp.ebi.ac.uk/pub/databases/Rfam/14.1/Rfam.pdb.gz) the header is missing and there are no Rfam IDs:

3hjw	D	1	57	48.50	1.3e-11	1	70	1fc01f
3lwo	D	1	57	48.50	1.3e-11	1	70	ff87a4
3lwp	D	1	57	48.50	1.3e-11	1	70	ebeb30
3lwq	D	1	57	48.50	1.3e-11	1	70	f29242
3lwr	D	1	57	48.50	1.3e-11	1	70	8585e6
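
Both problems could be caught by an automated check before a release is published. The sketch below is an assumption-laden illustration: the function name is invented, it assumes a tab-separated file, and it guesses that a complete file carries both an Rfam accession and a PDB ID in its first two columns.

```python
import re
from typing import List

RFAM_ACC = re.compile(r"RF\d{5}")
PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")


def diagnose(text: str) -> List[str]:
    """Report which of the problems described above a mapping file shows."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if not rows:
        return ["file is empty"]

    def is_identifier(cell: str) -> bool:
        return bool(RFAM_ACC.fullmatch(cell) or PDB_ID.fullmatch(cell))

    problems = []
    # If the first cell already looks like data, no header row is present.
    has_header = not is_identifier(rows[0][0])
    data = rows[1:] if has_header else rows
    if not has_header:
        problems.append("header row missing")
    # Assumption: both identifiers live in the first two columns.
    leading = [cell for row in data for cell in row[:2]]
    if not any(RFAM_ACC.fullmatch(cell) for cell in leading):
        problems.append("no Rfam accession column")
    if not any(PDB_ID.fullmatch(cell) for cell in leading):
        problems.append("no PDB ID column")
    return problems
```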

The average sequence length field

Request to bring back the average sequence length field, which would allow sorting families by increasing length. User feedback: it was very useful to have an overview of sequence lengths in Rfam families when selecting a data set.

Inaccurate PDB mapping

It looks like some PDB chains do not match the right family.

For example, chains 1A and 2A from 5DOX should match the Bacterial LSU but they match the 23S pseudoknot (RF01118).

R-scape improvements

  • indicate how many R-scape-significant base pairs are not shown in the displayed structure. This would be both a quality-control hint (if an Rfam curated structure was messed up and is missing a bunch of base pairs) and an indication of whether a pseudoknot might be present.

  • include a note of how many Rfam-annotated base pairs are not shown in the Rfam structure because they're an annotated pseudoknot.

  • display pseudoknots. Perhaps draw a bracket over each contiguous string of nucleotides that forms one side of the PK stem, lettering the two halves A and a, B and b.

Missing wiki content

For example, GNRA Rfam motif links to Tetraloop on Wikipedia but the wiki content does not appear on the Rfam page.

Sequence search issue

I used the web search with the sequence of accession AP011496.1 (RF02547: mtPerm-5S), and it matched all the families in the LSU clan but not the 5S RNA family I took the sequence from.

The families are listed below:
RF02543 - LSU_rRNA_eukarya
RF02541 - LSU_rRNA_bacteria
RF02540 - LSU_rRNA_archaea

Check that CMs can be downloaded

The website had a bug where CMs could not be downloaded. The cause isn't clear, but this should be fixed. We should check it after release 14.9.

Fixes for Rfam 13.0

  • num_full field in the family table is not updated
  • no sequences are marked as seed in the Sequences tab
  • PDB chains are still redundant (example: chain 0 from 1s72)
  • Schistosoma japonicum is not found in RF00163 anymore - no reference proteome for this species
  • delete sRNA type in RF01684 so that it's just Gene
  • check that families from this search do not have 0 species http://rfamlive.xfam.org/search?q=GO:0039703
  • species sunburst has Unknown/Uncategorised entries (example: SAM)
  • select * from family where number_of_species = 0; should not return any rows
  • fix all families with 0 species - example: RF00162
  • Example where the sunburst does not include all species (Pseudomonas syringae pv. tomato str. DC3000 is not in the sunburst) http://rfamlive.xfam.org/search?q=RF02749
  • add seed rfamseq entries so that sunbursts have the same species as trees

Redesign homepage

  • update header and footer
  • show featured families
  • embed Twitter feed
  • remove old search options

Add a new Downloads tab to Family pages

Several users pointed out that it's difficult to find where to download various family-specific files, so I suggest creating a new Downloads tab on the left-hand side of Rfam family pages. It would aggregate all the downloadable items that are spread across the other tabs, such as:

  • Covariance model
  • All FASTA sequences
  • Sequences as a table
  • Seed alignment in various formats
  • Seed tree
  • Species tree
  • Secondary structure diagrams as images

There may be other downloadable items that I missed so please click around the family pages to double check.

We can keep the existing download links as they are or drop them in favour of the new tab (this can be decided case by case).

curl SSLv3 alert handshake failure when accessing the website from Ubuntu 20.04

Hi,

This fails with Ubuntu 20.04:

curl https://rfam.xfam.org
#curl: (35) error:14094410:SSL routines:ssl3_read_bytes:sslv3 alert handshake failure

but works fine with Ubuntu < 20.04 and on Windows and macOS Mojave.

This seems to happen with some websites because of a combination of three reasons: server misconfiguration, increased TLS security level in Ubuntu 20.04 by default, and a bug in OpenSSL 1.1.1. See Ensembl/ensembl-rest#427 for a similar issue with the Ensembl server.
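
Until the server configuration is fixed, a client can sometimes work around this kind of failure by lowering the OpenSSL security level for a single connection. The sketch below shows this in Python; whether it helps with this particular server is an assumption, the function names are invented, and lowering the security level weakens the connection, so it is a stopgap only.

```python
import ssl
import urllib.request


def relaxed_tls_context() -> ssl.SSLContext:
    """Default verification settings, but with OpenSSL security level 1."""
    ctx = ssl.create_default_context()
    # "DEFAULT@SECLEVEL=1" is an OpenSSL cipher string; certificate and
    # hostname verification stay enabled.
    ctx.set_ciphers("DEFAULT@SECLEVEL=1")
    return ctx


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Fetch a URL using the relaxed context (requires network access)."""
    with urllib.request.urlopen(url, context=relaxed_tls_context(),
                                timeout=timeout) as response:
        return response.read()
```

For example, fetch("https://rfam.xfam.org/clans") would retry the request from the report with the relaxed settings.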

FWIW this breaks Bioconductor package rfaRm: https://bioconductor.org/checkResults/3.12/bioc-LATEST/rfaRm/nebbiolo1-install.html

Internally the package tries to access rfam.xfam.org with the following R code:

> library(xml2)

> read_xml("https://rfam.xfam.org/clans")
Error in open.connection(x, "rb") :
  error:14094410:SSL routines:ssl3_read_bytes:sslv3 alert handshake failure

> sessionInfo()
R version 4.0.2 Patched (2020-08-04 r78971)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.0.r78971/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.r78971/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xml2_1.3.2

loaded via a namespace (and not attached):
[1] compiler_4.0.2 curl_4.3      

Thanks!
H.

Misleading labels in species sunburst

Example

Species sunburst for Clostridia in RF01315 shows that there are 64 sequences:

[Screenshot of the species sunburst, 2016-02-19]

An example SQL query confirming the number of sequences:

SELECT CONCAT(t1.rfamseq_acc, '/', t1.seq_start, '-', t1.seq_end)
FROM full_region t1, rfamseq t2, taxonomy t3
WHERE t1.rfam_acc = 'RF01315'
AND t1.rfamseq_acc = t2.rfamseq_acc
AND t2.ncbi_id = t3.ncbi_id
AND t3.tax_string LIKE '%Clostridia;%'
AND t1.is_significant = 1
GROUP BY t1.rfamseq_acc;

64 rows (like in sunburst UI) - note the GROUP BY clause

However, there are many more annotated regions:

SELECT CONCAT(t1.rfamseq_acc, '/', t1.seq_start, '-', t1.seq_end)
FROM full_region t1, rfamseq t2, taxonomy t3
WHERE t1.rfam_acc = 'RF01315'
AND t1.rfamseq_acc = t2.rfamseq_acc
AND t2.ncbi_id = t3.ncbi_id
AND t3.tax_string LIKE '%Clostridia;%'
AND t1.is_significant = 1;

6222 rows - no GROUP BY clause

So the number of entries in the resulting FASTA file is inconsistent with the sunburst UI.
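
The difference between the two counts can be reproduced on a toy table. In the sketch below only the table and column names follow the queries above; the rows are invented, sqlite3 stands in for the real database, and the joins to rfamseq and taxonomy are left out for brevity.

```python
import sqlite3
from typing import Tuple


def count_sequences_and_regions(conn: sqlite3.Connection,
                                rfam_acc: str) -> Tuple[int, int]:
    """Return (distinct sequence accessions, annotated regions) for a family."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT rfamseq_acc), COUNT(*) "
        "FROM full_region WHERE rfam_acc = ? AND is_significant = 1",
        (rfam_acc,),
    ).fetchone()
    return row[0], row[1]


def toy_database() -> sqlite3.Connection:
    """Build an in-memory table with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE full_region (rfam_acc TEXT, rfamseq_acc TEXT, "
        "seq_start INTEGER, seq_end INTEGER, is_significant INTEGER)"
    )
    conn.executemany(
        "INSERT INTO full_region VALUES (?, ?, ?, ?, ?)",
        [
            ("RF01315", "ACC1", 1, 50, 1),
            ("RF01315", "ACC1", 100, 150, 1),
            ("RF01315", "ACC1", 200, 250, 1),
            ("RF01315", "ACC2", 5, 60, 1),
            ("RF01315", "ACC2", 300, 350, 0),  # not significant, ignored
        ],
    )
    return conn
```

Labelling the sunburst with the first number while the FASTA download contains the second would produce exactly the mismatch described here.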

Sequence download extracts the wrong sequences

Sequence download is not working correctly: the extracted sequence is not the correct one. For example, there is an inconsistency between the SEED sequence of family RF00005:

AM286415.1/4382210-4382145
GAAGGCACGACAUUGCUCACAUUGCUUCCAGUGUUUACUU-AGCCAGC---CGGG-UGCUGGCUUUUUUUU

and the sequence downloaded directly from the website:

AM286415.1/4382210-4382145
CTTCCGTGCTGTAACGAGTGTAACGAAGGTCACAAATGAATCGGTCGGCCCACGACCGAAAAAAAA
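
A quick way to see how the two sequences relate is to strip the alignment gaps, convert U to T, and compare the download against the seed, its complement, its reverse, and its reverse complement. The helper names below are invented for this sketch.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_dna(seq: str) -> str:
    """Drop alignment gaps and write the sequence in the DNA alphabet."""
    return seq.replace("-", "").replace(".", "").upper().replace("U", "T")


def complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)


def relation(seed_aligned: str, downloaded: str) -> str:
    """Describe how the downloaded sequence relates to the seed sequence."""
    seed = to_dna(seed_aligned)
    got = to_dna(downloaded)
    if got == seed:
        return "identical"
    if got == complement(seed):
        return "complement"
    if got == seed[::-1]:
        return "reverse"
    if got == complement(seed)[::-1]:
        return "reverse complement"
    return "unrelated"


# The pair quoted in this report.
SEED = "GAAGGCACGACAUUGCUCACAUUGCUUCCAGUGUUUACUU-AGCCAGC---CGGG-UGCUGGCUUUUUUUU"
DOWNLOADED = "CTTCCGTGCTGTAACGAGTGTAACGAAGGTCACAAATGAATCGGTCGGCCCACGACCGAAAAAAAA"

if __name__ == "__main__":
    print(relation(SEED, DOWNLOADED))
```

For the pair above the check reports "complement". That would be consistent with the download complementing the reverse-strand region (4382210-4382145) without also reversing it, but this reading is an inference from the two sequences and not something the report states.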

Review sequence search code to prepare for Job Dispatcher changes

When a user submits a job through the Job Dispatcher APIs, a job identifier (e.g. iprscan5-R20170926-112045-0446-74958378-es) is returned. The identifier will also be able to end in "-p1m" or "-p2m", in addition to the current suffixes "-es", "-pg" and "-oy". The latter three will eventually disappear, and another announcement will follow with proposed dates. These suffixes denote independent job clusters, which we use to provide resilience and failover.

If the code relies on the characters after the last dash, please take note of the change.

Examples of existing job identifiers:

iprscan5-R20170926-112045-0446-74958378-es
ncbiblast-R20170926-112324-0724-35376033-pg
emboss_needle-I20170926-112309-0557-5534481-oy

In addition to the above you should expect:

emboss_needle-I20170926-112309-0557-5534481-p2m
ncbiblast-R20170926-112324-0724-35376033-p1m
emboss_needle-I20170926-112309-0557-5534481-p2m
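
Code that inspects the suffix is safest if it splits on the last dash and does not assume a two-character suffix. A minimal sketch, with an invented function name:

```python
from typing import Tuple


def split_job_id(job_id: str) -> Tuple[str, str]:
    """Split a job identifier into (everything before the last dash, suffix)."""
    base, dash, cluster = job_id.rpartition("-")
    if not dash or not base or not cluster:
        raise ValueError(f"unexpected job identifier: {job_id!r}")
    return base, cluster
```

With this approach "-es" and "-p2m" are handled the same way, and nothing breaks when the two-letter suffixes are retired.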
