replikation / poreCov
SARS-CoV-2 workflow for nanopore sequence data
Home Page: https://case-group.github.io/
License: GNU General Public License v3.0
For the sample flag, a pangolin field needs to be added stating which version was used.
In regions with lower ONT read coverage single bases might be missed in the generated consensus due to issues in basecalling homopolymer stretches.
For example (position 11075, end of ORF1):
Maybe this can be fixed by additionally checking for deletions after Medaka that fall in homopolymer stretches (e.g. length > 6 nt) and by again comparing to a reference (Wuhan) sequence. However, this can be difficult if real deletions occur in homopolymers.
Maybe it's already fixed in the current ARTIC pipeline.
https://github.com/connor-lab/ncov2019-artic-nf
Maybe we can have a look here as well in terms of
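A minimal sketch of such a post-Medaka check (the function names and the length cutoff are assumptions for illustration, not poreCov code):

```python
# Flag deletions that sit inside long homopolymer stretches of the
# reference, since those may be basecalling artefacts rather than
# real deletions. Cutoff (>= 7 nt) is an assumption.

def homopolymer_length(ref: str, pos: int) -> int:
    """Length of the homopolymer stretch in `ref` covering 0-based `pos`."""
    base = ref[pos]
    start = pos
    while start > 0 and ref[start - 1] == base:
        start -= 1
    end = pos
    while end + 1 < len(ref) and ref[end + 1] == base:
        end += 1
    return end - start + 1

def deletion_in_homopolymer(ref: str, pos: int, min_len: int = 7) -> bool:
    """True if a deletion at 0-based `pos` falls in a homopolymer of
    at least `min_len` bases - a candidate basecalling artefact."""
    return homopolymer_length(ref, pos) >= min_len
```

Such flagged positions could then be double-checked against the Wuhan reference before accepting the deletion into the consensus.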
Hi,
Is it possible to install the poreCov workflow directly from source without using conda/singularity/docker?
Could you please publish a list of software/tools that are required for this workflow?
What I could figure out from the README is that guppy is required. On our HPC cluster we already have a guppy 4.0.11 installation with CUDA support. What else is needed?
Best regards
Sam
We need to check if guppy also works with Singularity in GPU and CPU mode.
It seems Nextclade also provides information on insertions and not only deletions:
> colnames(x)
[1] "seqName"
[2] "clade"
[3] "qc.overallScore"
[4] "qc.overallStatus"
[5] "totalGaps"
[6] "totalInsertions"
[7] "totalMissing"
[8] "totalMutations"
[9] "totalNonACGTNs"
[10] "totalPcrPrimerChanges"
[11] "substitutions"
[12] "deletions"
[13] "insertions"
[14] "missing"
[15] "nonACGTNs"
[16] "pcrPrimerChanges"
[17] "aaSubstitutions"
[18] "totalAminoacidSubstitutions"
[19] "aaDeletions"
[20] "totalAminoacidDeletions"
[21] "alignmentEnd"
[22] "alignmentScore"
[23] "alignmentStart"
[24] "qc.missingData.missingDataThreshold"
[25] "qc.missingData.score"
[26] "qc.missingData.status"
[27] "qc.missingData.totalMissing"
[28] "qc.mixedSites.mixedSitesThreshold"
[29] "qc.mixedSites.score"
[30] "qc.mixedSites.status"
[31] "qc.mixedSites.totalMixedSites"
[32] "qc.privateMutations.cutoff"
[33] "qc.privateMutations.excess"
[34] "qc.privateMutations.score"
[35] "qc.privateMutations.status"
[36] "qc.privateMutations.total"
[37] "qc.snpClusters.clusteredSNPs"
[38] "qc.snpClusters.score"
[39] "qc.snpClusters.status"
[40] "qc.snpClusters.totalSNPs"
[41] "errors"
see [13]
Could be worth integrating this as well.
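A minimal sketch of pulling the insertions column from a Nextclade TSV export (the example row is made up; the column names match the list above):

```python
# Extract per-sample insertions from a Nextclade TSV export.
import csv
import io

def nextclade_insertions(tsv_text: str) -> dict:
    """Map seqName -> list of insertion entries from a Nextclade TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        field = row.get("insertions", "") or ""
        result[row["seqName"]] = [i for i in field.split(",") if i]
    return result
```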
A minor thing, but as subworkflows are used, with the process code living here:
https://github.com/replikation/poreCov/tree/master/workflows/process
I think processes such as:
https://github.com/replikation/poreCov/blob/master/modules/align_to_reference.nf
should also go into the workflows folder?
Invalid, thx @hoelzer
We should go through everything and check for efficient execution.
Should we update Guppy GPU and CPU with the next PR?
4.4.1-1--a3fcea3
It would be good to have some QC of the input data. I think the easiest is a NanoPlot module. Maybe PycoQC as an option if the summary.txt is available.
As a final output, it would be great to have a summary QC for all input samples/barcodes where one can immediately observe strange samples.
president has a new version v0.6.0:
https://gitlab.com/RKIBioinformaticsPipelines/president/-/releases/v0.6.0
Besides some bug fixes that do not really affect poreCov, the order of the columns in the report.tsv was adjusted to be more intuitive (and not simply alphabetically sorted). However, I assume this will break poreCov when the container is updated w/o applying some changes to the code where metrics are extracted from the report.tsv?
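One way to make the extraction robust against such reordering is to read report.tsv by header name instead of by column position (a sketch; the exact column names need to be checked against president's output):

```python
# Read president's report.tsv by column name, so reordering between
# versions does not break the extraction. Column names in the test
# are assumptions for illustration.
import csv
import io

def read_report(tsv_text: str, columns: list) -> list:
    """Extract the given columns from report.tsv, whatever their order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [{c: row[c] for c in columns} for row in reader]
```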
How about including VCF output from medaka in the results, which gives a nice overview of (potential) SNPs with confidences?
Cheers
Compare: https://gitlab.com/RKIBioinformaticsPipelines/nanoqc/-/blob/master/modules/guppy.nf
We could provide a CPU Guppy version for users w/o GPU access. At the moment, I think, only GPU is supported.
Maybe also present %Ns rather than absolute numbers (will make it easier for people who have a hard time quickly dividing by 29903)? Or both? I know that you likely just take the information from president, so I could also escalate that to that tool and you could then just grep it from the report.tsv?
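For the %N part, a minimal sketch (assuming the Wuhan reference length of 29903 nt as the denominator; adjust if president reports it differently):

```python
# Report Ns as a percentage of the reference genome length instead of
# an absolute count. Denominator is the Wuhan-Hu-1 reference length.

WUHAN_LEN = 29903

def percent_n(consensus: str, ref_len: int = WUHAN_LEN) -> float:
    """Percentage of N bases relative to the reference genome length."""
    return 100.0 * consensus.upper().count("N") / ref_len
```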
It would be really helpful to get a table with the data instead of only an HTML file, because content often has to be copied or accessed programmatically per column of data.
The HTML report is great, but having the main content also written as a summary table would be awesome.
@replikation I think you already write some JSON files for your auto database input?
But for other users an XLSX (CSV) would be actually also good.
@RaverJay I imagine it should be relatively easy to print out a dataframe that is anyway constructed for the HTML report also as a CSV or TSV?
I put this with a high priority because it would really help people who are already actively using the pipeline - but of course I know that there are also other busy things to do! ;)
For downstream analyses, the basecalled FASTQ files are often needed. Can we add a parameter, e.g. --publish_fastq, to activate publishing of .fastq.gz files after basecalling and demultiplexing?
Create Flag [--ukj]
-> YYYYMMDD = Isolation-date
-> Location = Location of isolation
-> SampleID = internal sample ID
-> Abbreviation = Combination of the letter S (= Surveillance) or E (= Extern) with a number, describing that it is the x-th run of this sample (starting at 0)
It would be great to provide one summary TSV that tells the user which samples passed QC and which not. This makes downstream analyses easier/faster as well.
Is it possible to add a warning when an older version of the pipeline is used, so that people automatically become aware of new versions? Not sure if such a feature is maybe even part of Nextflow?
Hi,
would it be possible to put/link the final consensus sequences into an output folder? Something like 'consensus_sequences'? This would facilitate copying out all consensus sequences for further use outside of the pipeline.
Is it possible to implement a parameter for a final clean-up after the pipeline has finished? For example, it would be great to automatically get rid of the -w work folder once the pipeline finished successfully. Might be possible via onComplete?
One important metric for us is also the genome reference coverage, i.e. how well the Wuhan reference genome is supported by the reads. As we map the reads to the Wuhan reference anyway, this should be easy to calculate based on a BED file?
In the Illumina pipeline, we use a coverage of at least 20X as a cutoff.
In the end, what I would like to have is basically: "reads from sampleXY cover 98.12345% of the Wuhan reference genome with at least 20X" as another number in the report.
Secondly, based on that it would then be also easy to report a median coverage value (or do you think this is not as useful @replikation as we discussed in some other thread regarding nanopore amplicon sequencing)
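A sketch of both numbers, assuming per-position depth values as produced by e.g. samtools depth -a (one value per reference position):

```python
# Coverage breadth (% of positions >= cutoff) and median coverage
# from a list of per-position depths. Cutoff default matches the
# 20X threshold mentioned above.
import statistics

def coverage_breadth(depths, min_depth: int = 20) -> float:
    """Percent of reference positions covered at >= min_depth."""
    depths = list(depths)
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)

def median_coverage(depths) -> float:
    """Median per-position coverage."""
    return float(statistics.median(depths))
```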
@replikation can we also add (besides pangolin) https://github.com/nextstrain/nextclade ?
It seems that Nextclade also flags sequences based on quality which could be another interesting metric.
I will also try to get some example code and add it here.
A little thing: Nextclade also outputs indel positions; could you add those w/ the other mutations or as an extra column?
I think currently there is no quality check of the generated consensuses?
I suggest having at least some simple QC, e.g. to automatically decide which samples yield good enough consensus sequences for further processing.
We could even simply implement a python script that people are working on right now:
https://gitlab.com/RKIBioinformaticsPipelines/president
Here, an input FASTA sequence is aligned to the Wuhan reference strain and a pairwise identity is calculated and reported in tabular format.
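For illustration, a toy version of the identity calculation on two already-aligned, equal-length sequences (president itself also handles the alignment step):

```python
# Percent identity between two aligned sequences of equal length.
# Gap characters simply count as mismatches in this toy version.

def pairwise_identity(aln_query: str, aln_ref: str) -> float:
    """Percent of identical positions between two aligned sequences."""
    if len(aln_query) != len(aln_ref):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(q == r for q, r in zip(aln_query, aln_ref))
    return 100.0 * matches / len(aln_ref)
```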
The new report is great!
One thing: to directly see the number of reads assigned to SC2 or human (in particular for a negative control), it would be great to add these numbers instead of/in addition to the percentage values. 100 % SC2 in a negative control sounds worrying, but then the user anyway has to look into the Kraken classification/Krona plot to see that these are maybe only a handful of (false-positive) reads.
It would then be important, though, to take care of large numbers and format them in a nice way (e.g. 1k 1m 1g) so as not to spoil the nice table view.
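A sketch of such formatting (the thresholds and suffix style are just a suggestion):

```python
# Compact 1k / 1m / 1g formatting for large read counts so they
# fit into the report table.

def human_readable(n: int) -> str:
    """Format a read count as e.g. 1.5k, 2m, 1g."""
    for factor, suffix in ((1_000_000_000, "g"), (1_000_000, "m"), (1_000, "k")):
        if n >= factor:
            return f"{n / factor:.1f}{suffix}".replace(".0", "")
    return str(n)
```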
new template for upload:
SENDING_LAB;DATE_DRAW;SEQ_TYPE;SEQ_REASON;SAMPLE_TYPE;PUBLICATION_STATUS;OWN_FASTA_ID
12346;20210119;ILLUMINA;X;s001;Y;A-899
12346;20210119;OXFORD_NANOPORE;Y;s006;N;A_900
12346;20210119;OTHER;A[B.1.1.7/B.1.351];s017;P;A.901
This might be an issue with an older version (analyses was run end of February).
Samples that are given in the samplesheet.yml and yield no reads in demultiplexing are not reported in the output. This may lead to missing information when tracking samples further downstream.
Could you add the information about lacking reads to a report.csv or similar?
For a more detailed variant-calling QC it would be good as a first step to keep the alignment file. Best would be the PAF format (I think minimap2 can provide it directly).
Hi, thanks for implementing this workflow. New fan here!
Just a minor request for the local profile. There are some processes which one would expect to require few resources (e.g. nanoplot), but the local profile assigns cpus = params.cores. In my case, I'm running it on a local server with 32 CPUs. That leads to "allocating" the whole server to a single instance of (say) nanoplot, when this process requires little more than 1 CPU at most, slowing down the whole workflow execution (it blocks parallel execution of multiple instances).
Very easy to solve with a new configuration; my request is just to improve the experience for users new to Nextflow. I think for most of these processes giving cpus = 2 as a default would be enough and would allow multiple parallel instances. Some of them may require a little more though, such as artic and bwa.
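A nextflow.config override along these lines could do it (a sketch only; the process selector names are assumptions and need to match poreCov's actual process names):

```groovy
// Sketch: small default so several processes can run in parallel,
// full cores only for the heavy steps. Selector names are assumptions.
process {
    cpus = 2
    withName: 'artic' { cpus = params.cores }
    withName: 'bwa'   { cpus = params.cores }
}
```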
Question/Thought
I think currently in #28 all final consensus seqs are added to the all_genomes.fasta and rki_report.csv?
While this is fine (because then all information is in one place), we could also think of only writing sequences that meet the consensus QC to these summary files. Otherwise, they might be rejected anyway when submitted to RKI.
However, people might want to work internally with sequences that don't meet the QC criteria and would otherwise miss them when they are not part of the summary files?
And: when QC thresholds change, this must also be reflected in poreCov so as not to reject sequences that might actually pass the later QC.
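One compromise could be to keep both sets but separate them; a sketch (the "pass" label and the status mapping are assumptions for illustration):

```python
# Split consensus records into QC-passing and QC-failing sets, so the
# summary FASTA can contain only passing sequences while failing ones
# remain available in a separate file.

def split_by_qc(records: dict, qc_status: dict) -> tuple:
    """Split {id: sequence} records into (passed, failed) dicts."""
    passed = {i: s for i, s in records.items() if qc_status.get(i) == "pass"}
    failed = {i: s for i, s in records.items() if i not in passed}
    return passed, failed
```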
Currently, the report shows e.g. 100% SARS-CoV-2 and 0% human even if there are only 10 reads classified as SARS-CoV-2 (and the rest is unclassified).
Thus, it is good to have the absolute numbers now as well, but would it not be better to report the percentages in accordance with the Krona plot? E.g. 4 % SARS-CoV-2 if 96 % is unclassified?
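The suggested calculation is simply the class count over all reads, unclassified included; as a sketch:

```python
# Percentage of a taxon's reads relative to ALL reads (classified +
# unclassified), matching what the Krona plot shows.

def pct_of_all_reads(classified: int, total: int) -> float:
    """Percent of reads in a class relative to all reads in the sample."""
    return 100.0 * classified / total if total else 0.0
```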
IMS_ID;SENDING_LAB;DATE_DRAW;SEQ_TYPE;SEQ_REASON;SAMPLE_TYPE;OWN_FASTA_ID
IMS-12345-CVDP-00001;12346;20210119;ILLUMINA;X;s001;A-899
IMS-12345-CVDP-00002;12346;20210119;OXFORD_NANOPORE;Y;s006;A_900
IMS-12345-CVDP-00003;12346;20210119;OTHER;A[B.1.1.7/B.1.351];s017;A.901
I think BAMs are anyway generated for the coverage plots? We just also need them as an output for downstream steps.
@replikation what is actually used for mapping (minimap2?) in the pipeline?
It would be good to have at least a simple summary report for the reconstructed consensus sequence. This should include:
E.g. in a single PDF report per run.
For the first part (technical stats) it might be also enough to use the nextflow internal functions for reporting.
As an initial step, FASTQ files are filtered by length and if the file size is too small, the FASTQ is not processed any further. It would be good to have this somehow reported.
E.g. I just tested a FAST5 run (V3 primers) that resulted in 24 barcoded FASTQs, and it seems 9 of them were sorted out and not processed any further.
It would be good to have a TSV with e.g. all IDs and a column that states which were sorted out due to low number of reads after filtering.
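A sketch of such a TSV (the read-count cutoff is a placeholder; poreCov actually filters by file size):

```python
# Build a TSV stating, per barcode, how many reads remained after
# filtering and whether the sample was processed further.

def filtering_report(read_counts: dict, min_reads: int = 100) -> str:
    """One row per barcode with a processed yes/no flag."""
    lines = ["barcode\treads_after_filtering\tprocessed"]
    for barcode, n in sorted(read_counts.items()):
        status = "yes" if n >= min_reads else "no"
        lines.append(f"{barcode}\t{n}\t{status}")
    return "\n".join(lines)
```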
--clock-rate {params.clock_rate} \