replikation / porecov

SARS-CoV-2 workflow for nanopore sequence data

Home Page: https://case-group.github.io/

License: GNU General Public License v3.0

Nextflow 62.99% Python 33.58% Shell 2.41% R 1.02%
artic basecalling bioinformatics nanopore nanopore-data sars-cov-2 workflow


porecov's People

Contributors

angelovangel · bwlang · dataspott · hoelzer · marielataretu · raverjay · replikation


porecov's Issues

pangolin version tag

For the sample flag, a pangolin field stating which version was used needs to be added.

missing bases in low-covered homopolymer stretches

In regions with low ONT read coverage, single bases can be missing from the generated consensus due to basecalling issues in homopolymer stretches.

For example, position 11075 at the end of ORF1 (screenshot from 2021-01-08 omitted).

Maybe this can be fixed by additionally checking for deletions after Medaka that fall in homopolymer stretches (e.g. length > 6 nt) and by comparing again to a reference (Wuhan) sequence. However, this can be difficult if real deletions occur in homopolymers.

Maybe it's already fixed in the current ARTIC pipeline.
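
The homopolymer check suggested above could be sketched like this (a minimal illustration with a toy reference, not part of poreCov; the function name and the length threshold of 6 nt are taken from the proposal, everything else is an assumption):

```python
# Sketch: flag deletions that fall inside long homopolymer stretches of
# the reference, where ONT basecalling (and thus Medaka's consensus) is
# most error-prone.

def in_homopolymer(ref: str, pos: int, min_len: int = 6) -> bool:
    """Return True if the base at 0-based `pos` lies in a run of at
    least `min_len` identical bases in `ref`."""
    base = ref[pos]
    start = pos
    while start > 0 and ref[start - 1] == base:
        start -= 1
    end = pos
    while end < len(ref) - 1 and ref[end + 1] == base:
        end += 1
    return (end - start + 1) >= min_len

# Toy reference with a 7-nt T homopolymer at indices 3..9:
ref = "ACGTTTTTTTA"
print(in_homopolymer(ref, 7))  # True: deletion sits inside the T run
```

Deletions flagged this way could then be double-checked against the Wuhan reference before being accepted into the consensus.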

Installation from source without conda/singularity/docker

Hi,

Is it possible to install the poreCov workflow directly from source without using conda/singularity/docker?

Could you please publish a list of software/tools that are required for this workflow?

What I could figure out from the README is that guppy is required. On our HPC cluster we already have a guppy 4.0.11 installation with CUDA support. What else is needed?

Best regards

Sam

Report Insertions as well

It seems Nextclade also provides information on insertions and not only deletions:

> colnames(x)
 [1] "seqName"
 [2] "clade"
 [3] "qc.overallScore"
 [4] "qc.overallStatus"
 [5] "totalGaps"
 [6] "totalInsertions"
 [7] "totalMissing"
 [8] "totalMutations"
 [9] "totalNonACGTNs"
[10] "totalPcrPrimerChanges"
[11] "substitutions"
[12] "deletions"
[13] "insertions"
[14] "missing"
[15] "nonACGTNs"
[16] "pcrPrimerChanges"
[17] "aaSubstitutions"
[18] "totalAminoacidSubstitutions"
[19] "aaDeletions"
[20] "totalAminoacidDeletions"
[21] "alignmentEnd"
[22] "alignmentScore"
[23] "alignmentStart"
[24] "qc.missingData.missingDataThreshold"
[25] "qc.missingData.score"
[26] "qc.missingData.status"
[27] "qc.missingData.totalMissing"
[28] "qc.mixedSites.mixedSitesThreshold"
[29] "qc.mixedSites.score"
[30] "qc.mixedSites.status"
[31] "qc.mixedSites.totalMixedSites"
[32] "qc.privateMutations.cutoff"
[33] "qc.privateMutations.excess"
[34] "qc.privateMutations.score"
[35] "qc.privateMutations.status"
[36] "qc.privateMutations.total"
[37] "qc.snpClusters.clusteredSNPs"
[38] "qc.snpClusters.score"
[39] "qc.snpClusters.status"
[40] "qc.snpClusters.totalSNPs"
[41] "errors"

see column [13] (insertions)

It could be worth integrating this as well.
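
Pulling the `insertions` column out of Nextclade's tabular output could look roughly like this (a sketch with a toy row; the tab-separated layout and the `position:bases` insertion notation are assumptions about the Nextclade output, not verified against a specific version):

```python
# Sketch: extract the `insertions` column from a Nextclade results
# table. A tab-separated file with made-up example values is assumed.
import csv
import io

nextclade_tsv = (
    "seqName\tinsertions\tdeletions\n"
    "sample01\t22204:GAGCCAGAA\t11288-11296\n"
)

with io.StringIO(nextclade_tsv) as handle:
    for row in csv.DictReader(handle, delimiter="\t"):
        print(row["seqName"], row["insertions"])  # prints: sample01 22204:GAGCCAGAA
```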

Update guppy

Should we update Guppy GPU and CPU with the next PR?

4.4.1-1--a3fcea3

Simple Read QC

It would be good to have some QC of the input data. I think the easiest option is a NanoPlot module.

Maybe PycoQC as an option if the summary.txt is available.

As a final output, it would be great to have a summary QC for all input samples/barcodes in which one can immediately spot strange samples.

Update consensus QC

president has a new version v0.6.0:
https://gitlab.com/RKIBioinformaticsPipelines/president/-/releases/v0.6.0

Besides some bug fixes that do not really affect poreCov, the order of the columns in the report.tsv was adjusted to be more intuitive (and not simply alphabetically sorted).

However, I assume this will break poreCov when the container is updated without applying some changes to the code where metrics are extracted from the report.tsv?

Readme

  • cleanup and simplify readme

Report percentage of Ns

Maybe present %Ns rather than (or in addition to) absolute numbers; this will make it easier for people who have a hard time quickly dividing by 29902.

I know that you likely just take this information from president, so I could also escalate it to that tool and you could then simply grep it from the report.tsv?
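
The conversion itself is a one-liner; a minimal sketch (toy sequence, not poreCov's actual code):

```python
# Sketch: percentage of N bases in a consensus, relative to its length.

def percent_n(seq: str) -> float:
    """Percentage of Ns in `seq`."""
    return 100.0 * seq.upper().count("N") / len(seq)

consensus = "ACGT" * 100 + "N" * 100   # toy sequence: 500 nt, 100 Ns
print(f"{percent_n(consensus):.1f}% Ns")  # 20.0% Ns
```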

Summary report also in text table format

It would be really helpful to get a table with the data instead of only an HTML file, because content often has to be copied or accessed programmatically per column.

The HTML report is great, but having the main content also written as a summary table would be awesome.

@replikation I think you already write some JSON files for your auto database input?

But for other users an XLSX (CSV) would actually also be good.

@RaverJay I imagine it should be relatively easy to print out a dataframe that is anyway constructed for the HTML report also as a CSV or TSV?

I set this to high priority because it would really help people who are already actively using the pipeline - but of course I know there are other busy things to do! ;)
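
A minimal sketch of such a table export (the column names are illustrative, not poreCov's actual report fields):

```python
# Sketch: write the per-sample summary also as a TSV next to the HTML
# report, using only the standard library.
import csv

summary = [
    {"sample": "barcode01", "lineage": "B.1.1.7", "percent_n": 1.2},
    {"sample": "barcode02", "lineage": "B.1.351", "percent_n": 4.7},
]

with open("summary_report.tsv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=list(summary[0]), delimiter="\t")
    writer.writeheader()
    writer.writerows(summary)
```

With pandas already in play for the HTML report, the equivalent would be a single `DataFrame.to_csv(..., sep="\t")` call on the same dataframe.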

Publish basecalled FASTQ files

For downstream analyses often the basecalled FASTQ files are needed.

Can we add a parameter e.g. --publish_fastq to activate publishing of .fastq.gz files after basecalling and demultiplexing?

Change/Remove tree approach

  • remove the quick tree build feature and the reference to Nextstrain?
  • or allow a simple drop-in of a) a GISAID FASTA and b) GISAID metadata?

ukj-Flag to create mongoDB database entry and "filename support"

Create Flag [--ukj]

  • if used, poreCov uses the runinfo.txt file provided in the run directory
  • for each barcode, a string in runinfo.txt is given, containing the following information about the sample:
    YYYYMMDD_Location_SampleID_Abbreviation

-> YYYYMMDD = isolation date
-> Location = location of isolation
-> SampleID = internal sample ID
-> Abbreviation = combination of the letter S (= surveillance) or E (= extern) with a number, indicating that it's the x-th run of this sample (starting at 0)

  • parses this info, together with the RKI report, selected president results, the primer info and the analysis date for each sample, into one .csv file for the upload into the database.
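
Parsing the proposed naming scheme could be sketched as follows (the function name, the location value and the returned field names are illustrative):

```python
# Sketch: split a runinfo.txt entry of the form
# YYYYMMDD_Location_SampleID_Abbreviation into its parts.
from datetime import datetime

def parse_runinfo(entry: str) -> dict:
    date, location, sample_id, abbreviation = entry.split("_")
    return {
        "isolation_date": datetime.strptime(date, "%Y%m%d").date(),
        "location": location,
        "sample_id": sample_id,
        "abbreviation": abbreviation,  # e.g. S0 = first surveillance run
    }

print(parse_runinfo("20210119_Jena_s001_S0"))
```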

Generate one summary consensus QC TSV

It would be great to provide one summary TSV that tells the user which samples passed QC and which did not. This makes downstream analyses easier/faster as well.

Warning when older version of pipeline is used

Is it possible to add a warning when an older version of the pipeline is used, so that people automatically become aware of new versions? Not sure if such a feature is maybe even part of Nextflow?
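
One way such a check could work, sketched with illustrative version strings (fetching the latest tag, e.g. from the GitHub releases API, is omitted, and purely numeric dotted tags are assumed):

```python
# Sketch: warn if the running pipeline version is older than the
# latest released tag.

def is_outdated(current: str, latest: str) -> bool:
    """Compare two dotted version strings like 'v0.9.1' numerically."""
    to_tuple = lambda v: tuple(int(x) for x in v.lstrip("v").split("."))
    return to_tuple(current) < to_tuple(latest)

if is_outdated("0.9.1", "0.10.0"):
    print("WARNING: a newer poreCov release is available")
```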

collect consensus sequences in one folder

Hi,

would it be possible to put/link the final consensus sequences into a dedicated output folder, e.g. 'consensus_sequences'? This would facilitate copying out all consensus sequences for further use outside of the pipeline.

Clean-up

Is it possible to implement a parameter for a final clean-up after the pipeline has finished? For example, it would be great to automatically get rid of the -w work folder once the pipeline finished successfully.

Might be possible via onComplete?

Add reference genome coverage to the report

One important metric for us is also the reference genome coverage, i.e. how well the Wuhan reference genome is supported by the reads. Since we map the reads to the Wuhan reference anyway, this should be easy to calculate based on a BED file?

In the Illumina pipeline, we use a coverage of at least 20X as a cutoff.

In the end, what I would like to have is basically: "reads from sampleXY cover 98.12345% of the Wuhan reference genome with at least 20X" as another number in the report.

Secondly, based on that it would also be easy to report a median coverage value (or do you think this is not as useful, @replikation, as we discussed in some other thread regarding nanopore amplicon sequencing?)
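
Given per-position depths (e.g. parsed from `samtools depth` output), the requested number reduces to a simple fraction; a sketch with toy values:

```python
# Sketch: fraction of reference positions covered with at least 20X.

def ref_coverage(depths, min_depth=20):
    """Percentage of positions in `depths` with depth >= `min_depth`."""
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)

depths = [25, 30, 19, 0, 40]            # toy per-position depth values
print(f"{ref_coverage(depths):.1f}% of positions at >= 20X")  # 60.0%
```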

more verbose error messages if the sample input is faulty or wrong

  • better input validation would be great to validate the input CSV
  • the current observation is that users are confused, as the workflow does not explicitly state if something is wrong with their input
  • a solution might be some Groovy syntax to validate the file first, before the channel magic happens?
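
The kind of up-front check meant here, illustrated in Python (in the workflow itself this would live in Groovy before the channel is built; the column names are illustrative, not poreCov's actual sample sheet schema):

```python
# Sketch: fail fast with an explicit message if the sample CSV is
# missing required columns or contains empty sample names.
import csv
import io
import sys

def validate_samplesheet(handle, required=("sample", "barcode")):
    reader = csv.DictReader(handle)
    missing = set(required) - set(reader.fieldnames or [])
    if missing:
        sys.exit(f"ERROR: sample sheet is missing column(s): "
                 f"{', '.join(sorted(missing))}")
    for line_no, row in enumerate(reader, start=2):
        if not row["sample"]:
            sys.exit(f"ERROR: empty sample name on line {line_no}")

validate_samplesheet(io.StringIO("sample,barcode\ns1,BC01\n"))  # passes silently
```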

Output QC

I think currently there is no quality check of the generated consensuses?

I suggest having at least some simple QC, e.g.

  • check for the number of N
  • pairwise sequence identity to e.g. the Wuhan reference sequence

to automatically decide which samples yield good enough consensus sequences for further processing.

We could even simply integrate a Python script that people are working on right now:
https://gitlab.com/RKIBioinformaticsPipelines/president

Here, an input FASTA sequence is aligned to the Wuhan reference strain and a pairwise identity is calculated and reported in tabular format.
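
The identity metric itself is straightforward once the sequences are aligned; a toy sketch (president performs the alignment step first, which is omitted here):

```python
# Sketch: pairwise identity between two already-aligned sequences of
# equal length.

def pairwise_identity(a: str, b: str) -> float:
    """Percentage of matching positions between aligned `a` and `b`."""
    assert len(a) == len(b), "sequences must be aligned to equal length"
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

print(f"{pairwise_identity('ACGTACGT', 'ACGTACCT'):.1f}%")  # 87.5%
```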

Add absolute classification numbers to the report

The new report is great!

One thing: to directly see the number of reads assigned to SC2 or human (in particular for a negative control), it would be great to add these numbers instead of / in addition to the percentage values. 100 % SC2 in a negative control sounds worrying, but then the user has to look into the Kraken classification / Krona plot anyway to see that these are maybe only a handful of (false positive) reads.

It would be then important though, to take care of large numbers and format them in a nice way (e.g. 1k 1m 1g) to not spoil the nice table view.
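
The compact formatting mentioned above could look like this (a small helper sketch, not poreCov code):

```python
# Sketch: format read counts as 1.2k / 1.2M / 1.2G so large numbers do
# not spoil the table layout.

def human_count(n: int) -> str:
    for factor, suffix in ((1_000_000_000, "G"), (1_000_000, "M"), (1_000, "k")):
        if n >= factor:
            return f"{n / factor:.1f}{suffix}"
    return str(n)

print(human_count(1234567))  # 1.2M
print(human_count(42))       # 42
```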

replace rki csv

new template for upload:

SENDING_LAB;DATE_DRAW;SEQ_TYPE;SEQ_REASON;SAMPLE_TYPE;PUBLICATION_STATUS;OWN_FASTA_ID
12346;20210119;ILLUMINA;X;s001;Y;A-899
12346;20210119;OXFORD_NANOPORE;Y;s006;N;A_900
12346;20210119;OTHER;A[B.1.1.7/B.1.351];s017;P;A.901

samples from samplesheet.yml are not reported

This might be an issue with an older version (the analysis was run at the end of February).

Samples that are given in the samplesheet.yml and yield no reads in demultiplexing are not reported in the output. This may lead to missing information when tracking samples further downstream.

Could you add the information about missing reads to a report.csv or similar?

Keep alignment file

For more detailed variant-calling QC it would be good, as a first step, to keep the alignment file. Best would be the PSL format (I think minimap2 can provide it directly).

local config

Hi, thanks for implementing this workflow. New fan here!
Just a minor request for the local profile. There are some processes that one would expect to require few resources (e.g. NanoPlot), but the local profile assigns cpus = params.cores. In my case, I'm running it on a local server with 32 CPUs. That leads to "allocating" the whole server to a single instance of (say) NanoPlot, when this process requires little more than 1 CPU at most, slowing down the whole workflow execution (it blocks parallel execution of multiple instances).
This is very easy to solve with a new configuration; my request is just to improve the experience for users new to Nextflow. I think for most of these processes a default of cpus = 2 would be enough and would allow multiple parallel instances. Some of them may require a little more though, such as artic and bwa.

Only add genomes to final RKI FASTA and report.csv that meet QC criteria

Question/Thought

I think currently in #28 all final consensus seqs are added to the all_genomes.fasta and rki_report.csv?

While this is fine (because then all information is in one place), we could also think of writing only those sequences to these summary files that also meet the consensus QC. Otherwise, they might be rejected anyway when submitted to the RKI.

However, people might also want to work internally with sequences that don't meet the QC criteria, and would otherwise miss them when they are not part of the summary files?

And: when QC thresholds change, this must also be reflected in poreCov so as not to reject sequences that might actually pass the later QC.

Adjust percentage of classified reads

Currently, the report shows e.g. 100% SARS-CoV-2 and 0% human even if there are only 10 reads classified as SARS-CoV-2 (and the rest is unclassified).

Thus, it is good to have the absolute numbers now as well, but would it not be better to report the percentages in accordance with the Krona plot? E.g. 4 % SARS-CoV-2 if 96 % are unclassified?
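
The recalculation amounts to using all reads as the denominator; a toy sketch with the numbers from the example above:

```python
# Sketch: SARS-CoV-2 percentage relative to all reads (classified plus
# unclassified), matching what the Krona plot shows.

def percent_of_total(sc2_reads, human_reads, unclassified_reads):
    total = sc2_reads + human_reads + unclassified_reads
    return 100.0 * sc2_reads / total

print(f"{percent_of_total(4, 0, 96):.1f}% SARS-CoV-2")  # 4.0% SARS-CoV-2
```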

rki conform metadata output

  • add RKI conform metadata output
IMS_ID;SENDING_LAB;DATE_DRAW;SEQ_TYPE;SEQ_REASON;SAMPLE_TYPE;OWN_FASTA_ID
IMS-12345-CVDP-00001;12346;20210119;ILLUMINA;X;s001;A-899
IMS-12345-CVDP-00002;12346;20210119;OXFORD_NANOPORE;Y;s006;A_900
IMS-12345-CVDP-00003;12346;20210119;OTHER;A[B.1.1.7/B.1.351];s017;A.901

report

It would be good to have at least a simple summary report for the reconstructed consensus sequence. This should include:

  • used version of poreCov
  • used tools and versions within poreCov
  • basic stats about the reconstructed consensuses (length, N50, number of Ns, maybe pairwise identity to the Wuhan strain ...)
  • if possible some stats about the called variants

E.g. in a single PDF report per run.

For the first part (technical stats) it might be also enough to use the nextflow internal functions for reporting.

Optimize CPU basecalling

Current PR integrates CPU basecalling #18

We can check for some optimization of the CPU basecalling process, if needed (number of callers, callers per CPU, ...).

E.g. @hoelzer can check how people currently run CPU basecalling on larger machines to improve the current basic command.

Filter FASTQ by length - report

As an initial step, FASTQ files are filtered by length, and if the file size is too small, the FASTQ is not processed any further. It would be good to have this reported somehow.

E.g. I just tested a FAST5 run (V3 primers) that resulted in 24 barcoded FASTQ files, and it seems 9 of them were sorted out and not processed any further.

It would be good to have a TSV with e.g. all IDs and a column stating which were sorted out due to a low number of reads after filtering.
