nextstrain / dengue

Nextstrain build for dengue virus

Home Page: https://nextstrain.org/dengue

Python 78.82% Shell 20.38% Perl 0.80%

dengue's Introduction

This repository is archived and contains the content used to build the documentation and splash page found in nextstrain.org. This content can now be found here.

License and copyright

Source code to Nextstrain is made available under the terms of the GNU Affero General Public License (AGPL). Nextstrain is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

dengue's Issues

Establish some deduplication guidelines within the phylogenetic workflow

Context

Flagged by #28 (comment) as well as prior historical discussions.

Design and implement some deduplication paths in the phylogenetic workflow.

Description

Examples

Possible solution

Preferably, leverage the existing tools in the nextstrain dockerfile, with seqkit being a probable choice.
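
As a rough illustration of what a deduplication step could do, here is a minimal Python sketch that keeps the first record for each distinct sequence. It is not the workflow's implementation; the issue suggests seqkit (whose `rmdup` subcommand covers this use case), and the function names `read_fasta` and `dedup_by_sequence` are hypothetical. The "first seen wins" policy is an assumption that the guidelines would need to confirm.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedup_by_sequence(records: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Keep only the first record seen for each distinct sequence.

    Comparison is case-insensitive; which duplicate to keep is an
    assumption (first seen), not a documented project policy.
    """
    seen = set()
    for header, seq in records:
        key = seq.upper()
        if key in seen:
            continue
        seen.add(key)
        yield header, seq


if __name__ == "__main__":
    example = [">a", "ACGT", ">b", "acgt", ">c", "ACGA"]
    for header, seq in dedup_by_sequence(read_fasta(example)):
        print(f">{header}\n{seq}")
```

Deduplicating on the sequence alone ignores metadata, so two identical genomes sampled in different places would collapse into one record; whether that is acceptable is part of what the guidelines should decide.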

Removal of genome containing plasmid sequence

Dear @j23414

I see that some of the sequences in the all_sequences.fasta file also contain plasmid DNA, so they are circular and longer than 12000 bp. These include:

AY243466
AY243467
AY243468
AY243469
AY376438
AY648301
AY656167
AY656168
AY656169
AY656170
AY744148

Shouldn't these either be removed or have the plasmid sequence trimmed off?
I also see some extremely small genomes of about 2 kb. Genomes that are too large or too small can distort the MSA, so shouldn't they be removed from the resource? If so, what thresholds would you recommend for filtering out the uninformative genomes? Given the graph below, I was thinking of taking an upper boundary of 12184 bp and a lower boundary of 8670 bp.
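
A minimal sketch of the length filter proposed above, assuming records are already available as (name, sequence) pairs. The bounds are the ones suggested in this issue, not values adopted by the repository, and `filter_by_length` is a hypothetical helper name.

```python
LOWER_BP = 8670   # lower boundary proposed in this issue
UPPER_BP = 12184  # upper boundary proposed in this issue


def filter_by_length(records, lower=LOWER_BP, upper=UPPER_BP):
    """Split records into kept and dropped by sequence length.

    `records` is an iterable of (name, sequence) pairs. Bounds are
    inclusive. Dropped entries are returned as (name, length) so they
    can be reviewed before being excluded for good.
    """
    kept, dropped = [], []
    for name, seq in records:
        length = len(seq)
        if lower <= length <= upper:
            kept.append((name, seq))
        else:
            dropped.append((name, length))
    return kept, dropped


if __name__ == "__main__":
    # Synthetic sequences, used only to exercise the bounds.
    demo = [("short", "A" * 2000), ("ok", "A" * 10700), ("long", "A" * 15000)]
    kept, dropped = filter_by_length(demo)
    print([name for name, _ in kept], dropped)
```

Returning the dropped records with their lengths makes it easier to tell a truncated genome from one that carries extra vector sequence, since the two may call for different handling (removal versus trimming).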

Bug: Update dropped strains file to list accession instead of strain

Current Behavior

Currently, strains listed in phylogenetic/config/dropped_strains.txt are not being dropped since 8ab810f

Expected behavior

Strains listed in dropped_strains.txt are not in the final phylogenetic tree.

How to reproduce

Possible solution

Perhaps cherry pick a commit like:

67016d1
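
The issue title suggests the cause is a mismatch between the IDs in the drop list (strain names) and the ID column the workflow now matches on (accessions). The sketch below illustrates that failure mode under that assumption; the column name `accession`, the helper names, and the example rows are hypothetical, and the comment syntax for the list file is assumed.

```python
def read_drop_list(text):
    """Parse an exclusion list: one ID per line, '#' starts a comment."""
    ids = set()
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if entry:
            ids.add(entry)
    return ids


def drop_records(rows, drop_ids, id_column="accession"):
    """Return the rows whose ID is not in the exclusion set."""
    return [row for row in rows if row[id_column] not in drop_ids]


if __name__ == "__main__":
    rows = [
        {"accession": "AY243466", "strain": "strain_one"},
        {"accession": "XX000001", "strain": "strain_two"},
    ]
    # A list keyed by accession removes the record ...
    by_accession = read_drop_list("AY243466  # example entry\n")
    print(len(drop_records(rows, by_accession)))
    # ... while a list keyed by strain name silently removes nothing.
    by_strain = read_drop_list("strain_one\n")
    print(len(drop_records(rows, by_strain)))
```

Because a mismatched list fails silently, a check that every listed ID actually occurs in the metadata would catch this class of bug early.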

Add manual serotype annotations along with justifications to "annotations.tsv"

Context

In response to comment: #28 (comment)

Description

We are relying on ncbi_tax_id to split dengue records into "DENV1" - "DENV4" but some records are missing this information.

Examples

Possible solution

Incorporate any manual annotation into the "annotations.tsv" file in the form of:

DI401607	ncbi_serotype	denv1 # Based on DEFINITION line in GenBank
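
A minimal sketch of how such a line could be parsed and applied to metadata rows, assuming three tab-separated columns (record ID, field name, value) and a trailing `#` justification as in the example above. The helper names and the `accession` column are hypothetical; this is not the repository's annotation code.

```python
def parse_annotations(text):
    """Parse 'id<TAB>field<TAB>value # justification' lines.

    Returns {record_id: {field: value}}. Blank lines and lines that
    start with '#' are skipped; the justification is discarded.
    """
    annotations = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        record_id, field, value = line.split("\t", 2)
        value = value.split("#", 1)[0].strip()
        annotations.setdefault(record_id, {})[field] = value
    return annotations


def apply_annotations(rows, annotations, id_column="accession"):
    """Overwrite fields in each metadata row that has an annotation."""
    for row in rows:
        row.update(annotations.get(row[id_column], {}))
    return rows


if __name__ == "__main__":
    text = "DI401607\tncbi_serotype\tdenv1 # Based on DEFINITION line in GenBank\n"
    rows = [{"accession": "DI401607", "ncbi_serotype": ""}]
    print(apply_annotations(rows, parse_annotations(text)))
```

Keeping the justification on the same line as the annotation means each manual override stays auditable in version control.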

Switch DENV2 genotypes to numeric to be consistent with DENV1, 3, and 4

Context

In response to comment: #28 (comment)

DENV2/AA --> DENV2/III
DENV2/AI --> DENV2/V 
DENV2/AM --> DENV2/I
DENV2/C --> DENV2/II
DENV2/S --> DENV2/VI
DENV4/S --> DENV4/IV

One modification is to keep the S groups, since S=Sylvatic.

Description

Transition to using numeric lineage labels, and less geography-tied naming conventions.

DENV2/AA --> DENV2/III
DENV2/AI --> DENV2/V 
DENV2/AM --> DENV2/I
DENV2/C --> DENV2/II

Before implementing this, check if this is standard in the literature or will cause any confusion.
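
The renaming in the Description can be sketched as a lookup table. This assumes genotype labels are stored as plain strings such as "DENV2/AA"; the names `GENOTYPE_RENAMES` and `rename_genotype` are hypothetical. Following the note in the Context, the sylvatic (S) groups are deliberately left out of the table so they stay unchanged.

```python
# Old label -> new numeric label, as listed in the Description.
# DENV2/S and DENV4/S are intentionally absent (S = Sylvatic is kept).
GENOTYPE_RENAMES = {
    "DENV2/AA": "DENV2/III",
    "DENV2/AI": "DENV2/V",
    "DENV2/AM": "DENV2/I",
    "DENV2/C": "DENV2/II",
}


def rename_genotype(label):
    """Return the numeric label for a genotype, or the label unchanged."""
    return GENOTYPE_RENAMES.get(label, label)


if __name__ == "__main__":
    for old in ["DENV2/AM", "DENV2/S", "DENV1/IV"]:
        print(old, "-->", rename_genotype(old))
```

Keeping the table in one place also gives a natural spot to record the literature check requested above before the new labels are adopted.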

Examples

Possible solution

(Optional)

ENH: Generalize taxon id to serotype map definitions to a configuration file

Context

As a potential enhancement, it may be beneficial to allow users to configure the serotype (and taxon ID) list. This suggestion is inspired by the discussions in the following comments:

This would be particularly useful if we intend to permit users to modify the list of serotypes for curation, especially if taxon IDs become more detailed (e.g., the taxonomy subtree for Dengue).

Possible solution

Open to more suggestions or feedback here, but some solutions include:

Store the list and map in a dedicated config/taxid_to_serotype_map.tsv file.
Store the list and map directly in the config/build.config, following a similar approach to the NCBI field_map configuration.
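
A minimal sketch of the first option, assuming a two-column TSV with headers `taxon_id` and `serotype` (both column names are hypothetical). The taxon IDs in the example are believed to be the NCBI Taxonomy IDs for dengue virus 1 to 4, but they should be verified against NCBI before use.

```python
import csv
import io

# Example contents for a file such as config/taxid_to_serotype_map.tsv.
# The IDs are an assumption to be checked against NCBI Taxonomy.
EXAMPLE_TSV = (
    "taxon_id\tserotype\n"
    "11053\tdenv1\n"
    "11060\tdenv2\n"
    "11069\tdenv3\n"
    "11070\tdenv4\n"
)


def load_taxid_map(handle):
    """Read a taxon-ID-to-serotype TSV into a dict keyed by ID string."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["taxon_id"]: row["serotype"] for row in reader}


if __name__ == "__main__":
    taxid_map = load_taxid_map(io.StringIO(EXAMPLE_TSV))
    print(taxid_map)
```

A standalone TSV can be edited by curators without touching the build configuration, whereas the second option keeps all settings in a single file; either way the loading step stays this small.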

Harmonize with pathogen repo guide

Context

Part of updating the pathogen repos to match a golden path:

https://github.com/nextstrain/pathogen-repo-guide

To Dos

Rename ingest/workflow/snakemake_rules to ingest/rules
Rename ingest/rules/*.smk to match pathogen-repo-guide/rules/*
Move ingest/source-data/* files to ingest/config
Modernize ncbi-field-map in config
Add a CHANGELOG.md file
Rename "config" to "defaults"
Move nextstrain automation rules and configs to ingest/build-configs

Modernize `ncbi-field-map` in config

Add workflow for producing the Nextclade dengue dataset

Context

Add a workflow for producing the Nextclade dataset for dengue serotypes and subtypes in a nextclade folder, following the pathogen-repo-guide. This will ease dataset creation, testing, and debugging.

Description

TBD

Examples

Possible solution

TBD

Split by dengue serotype (denv1-denv4)

Description

Implement strategies (or an ensemble of strategies for cross-validation) to produce pairs of sequences_{serotype}.fasta and metadata_{serotype}.tsv files.

Context

Following the merge of #13, all ingested dengue records now exist in a unified pair of sequences.fasta and metadata.tsv files.

For subsequent analysis #18, and to maintain consistency with the previous approach and ensure a seamless integration with the phylogenetic pipeline, it is necessary to separate these files based on dengue serotypes (e.g., from sequences_denv1.fasta to sequences_denv4.fasta).

Possible solution(s)

Rely on NCBI taxon id annotations for serotype segregation

Historically, for dengue, we obtained each serotype individually in this code, leading to redundant fetching and processing of each sequence. Now that we're using ncbi datasets, these numeric IDs are recorded in a virus-tax-id field, which we can use to separate the serotypes.

Notably, this method carries the risk of missing sequences in individual serotype builds if NCBI did not annotate the record with the lineage ID (~3k records). These records can potentially be assigned with a nextclade "all" dataset.
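
A minimal sketch of the taxon-ID split, assuming metadata rows are dicts with a `virus-tax-id` key and that a taxon-ID-to-serotype map is supplied by the caller. The function name and the "unassigned" bucket are hypothetical; the bucket stands in for the records that lack a serotype-level annotation and would need another method.

```python
from collections import defaultdict


def split_by_serotype(rows, taxid_to_serotype, taxid_column="virus-tax-id"):
    """Group metadata rows by serotype using their taxon ID.

    Rows whose taxon ID is missing or not in the map go into an
    'unassigned' group instead of being dropped silently.
    """
    groups = defaultdict(list)
    for row in rows:
        taxid = str(row.get(taxid_column, "")).strip()
        groups[taxid_to_serotype.get(taxid, "unassigned")].append(row)
    return dict(groups)


if __name__ == "__main__":
    # IDs in this map are placeholders supplied by the caller.
    taxid_map = {"11053": "denv1", "11060": "denv2"}
    rows = [
        {"accession": "A1", "virus-tax-id": "11053"},
        {"accession": "A2", "virus-tax-id": "11060"},
        {"accession": "A3", "virus-tax-id": "99999"},  # made-up, unmapped ID
        {"accession": "A4"},                            # no taxon ID at all
    ]
    for serotype, members in split_by_serotype(rows, taxid_map).items():
        print(serotype, [m["accession"] for m in members])
```

Each group could then be written to its own metadata_{serotype}.tsv, with the matching sequences selected by accession, and the unassigned group reported so its size can be tracked over time.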

Create a Nextclade dataset for finer subtype classification

Originally, the plan was to leverage Nextclade assignment to categorize records into major dengue serotypes and subsequent minor subtypes #16. However, due to the diversity within dengue, the major serotype classification did not align with expectations. Therefore the idea is to rely on NCBI taxon ids for major serotypes, and nextclade datasets for within-serotype sub-classification.

An ensemble method

Ideally, employ a combination of the above methods to consistently and accurately classify records into major serotypes and minor subtypes.

Tasks to solve this issue

#20
#16
- which possibly requires #25

Add E gene builds

Context

By user request:

Is there any chance we could get a E gene build of nextstrain dengue? Much more sequences of E than full genome, especially in some parts of the world

Description

Examples

Possible steps to a solution

1. Pull out the E gene sequence from the dengue reference.gb file to be used as the reference for the E gene builds.
   a. Or follow the rsv rules.
2. Add a filter_length_per_group function for “all_E”, “denv1_E”, “denv2_E”, etc., similar to filter_sequences_per_group.
3. Add E to the dropdown under “Dataset” by appending _E and _genome (e.g. dengue_denv1_genome.json and dengue_denv1_E.json) and updating the nextstrain.org manifest file.

Dependencies
