replikation / porecov

SARS-CoV-2 workflow for nanopore sequence data

Home Page: https://case-group.github.io/

License: GNU General Public License v3.0

Nextflow 62.99% Python 33.58% Shell 2.41% R 1.02%
artic basecalling bioinformatics nanopore nanopore-data sars-cov-2 workflow


porecov's People

Contributors

angelovangel · bwlang · dataspott · hoelzer · marielataretu · raverjay · replikation


porecov's Issues

pangolin version tag

For the sample flag, a pangolin field stating which version was used needs to be added.

missing bases in low-covered homopolymer stretches

In regions with low ONT read coverage, single bases can be missing from the generated consensus due to basecalling issues in homopolymer stretches.

For example, position 11075 at the end of ORF1 (screenshot from 2021-01-08 omitted).

Maybe this can be fixed by additionally checking for deletions after Medaka that fall in homopolymer stretches (e.g. length > 6 nt) and by comparing again to a reference (Wuhan) sequence. However, this can be difficult if real deletions occur in homopolymers.

Maybe it's already fixed in the current ARTIC pipeline.
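
The homopolymer check suggested above could be sketched like this (a minimal illustration with a toy reference, not part of poreCov; the function name and the length threshold of 6 nt are taken from the proposal, everything else is an assumption):

```python
# Sketch: flag deletions that fall inside long homopolymer stretches of
# the reference, where ONT basecalling (and thus Medaka's consensus) is
# most error-prone.

def in_homopolymer(ref: str, pos: int, min_len: int = 6) -> bool:
    """Return True if the base at 0-based `pos` lies in a run of at
    least `min_len` identical bases in `ref`."""
    base = ref[pos]
    start = pos
    while start > 0 and ref[start - 1] == base:
        start -= 1
    end = pos
    while end < len(ref) - 1 and ref[end + 1] == base:
        end += 1
    return (end - start + 1) >= min_len

# Toy reference with a 7-nt T homopolymer at indices 3..9:
ref = "ACGTTTTTTTA"
print(in_homopolymer(ref, 7))  # True: deletion sits inside the T run
```

Deletions flagged this way could then be double-checked against the Wuhan reference before being accepted into the consensus.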

Installation from source without conda/singularity/docker

Hi,

Is it possible to install the poreCov workflow directly from source without using conda/singularity/docker?

Could you please publish a list of software/tools that are required for this workflow?

What I could figure out from the README is that guppy is required. On our HPC cluster we already have a guppy 4.0.11 installation with CUDA support. What else is needed?

Best regards

Sam

Report Insertions as well

It seems Nextclade also provides information on insertions and not only deletions:

> colnames(x)
 [1] "seqName"
 [2] "clade"
 [3] "qc.overallScore"
 [4] "qc.overallStatus"
 [5] "totalGaps"
 [6] "totalInsertions"
 [7] "totalMissing"
 [8] "totalMutations"
 [9] "totalNonACGTNs"
[10] "totalPcrPrimerChanges"
[11] "substitutions"
[12] "deletions"
[13] "insertions"
[14] "missing"
[15] "nonACGTNs"
[16] "pcrPrimerChanges"
[17] "aaSubstitutions"
[18] "totalAminoacidSubstitutions"
[19] "aaDeletions"
[20] "totalAminoacidDeletions"
[21] "alignmentEnd"
[22] "alignmentScore"
[23] "alignmentStart"
[24] "qc.missingData.missingDataThreshold"
[25] "qc.missingData.score"
[26] "qc.missingData.status"
[27] "qc.missingData.totalMissing"
[28] "qc.mixedSites.mixedSitesThreshold"
[29] "qc.mixedSites.score"
[30] "qc.mixedSites.status"
[31] "qc.mixedSites.totalMixedSites"
[32] "qc.privateMutations.cutoff"
[33] "qc.privateMutations.excess"
[34] "qc.privateMutations.score"
[35] "qc.privateMutations.status"
[36] "qc.privateMutations.total"
[37] "qc.snpClusters.clusteredSNPs"
[38] "qc.snpClusters.score"
[39] "qc.snpClusters.status"
[40] "qc.snpClusters.totalSNPs"
[41] "errors"

see column [13] (insertions)

It could be worth integrating this as well.
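
Pulling the `insertions` column out of Nextclade's tabular output could look roughly like this (a sketch with a toy row; the tab-separated layout and the `position:bases` insertion notation are assumptions about the Nextclade output, not verified against a specific version):

```python
# Sketch: extract the `insertions` column from a Nextclade results
# table. A tab-separated file with made-up example values is assumed.
import csv
import io

nextclade_tsv = (
    "seqName\tinsertions\tdeletions\n"
    "sample01\t22204:GAGCCAGAA\t11288-11296\n"
)

with io.StringIO(nextclade_tsv) as handle:
    for row in csv.DictReader(handle, delimiter="\t"):
        print(row["seqName"], row["insertions"])  # prints: sample01 22204:GAGCCAGAA
```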

Update guppy

Should we update Guppy GPU and CPU with the next PR?

4.4.1-1--a3fcea3

Simple Read QC

It would be good to have some QC of the input data. I think the easiest option is a NanoPlot module.

Maybe PycoQC as an option if the summary.txt is available.

As a final output, it would be great to have a summary QC for all input samples/barcodes in which one can immediately spot strange samples.

Update consensus QC

president has a new version v0.6.0:
https://gitlab.com/RKIBioinformaticsPipelines/president/-/releases/v0.6.0

Besides some bug fixes that do not really affect poreCov, the order of the columns in the report.tsv was adjusted to be more intuitive (and not simply alphabetically sorted).

However, I assume this will break poreCov when the container is updated without applying some changes to the code where metrics are extracted from the report.tsv?

Readme

  • cleanup and simplify readme

Report percentage of Ns

Maybe present %Ns rather than (or in addition to) absolute numbers; this will make it easier for people who have a hard time quickly dividing by 29902.

I know that you likely just take this information from president, so I could also escalate it to that tool and you could then simply grep it from the report.tsv?
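
The conversion itself is a one-liner; a minimal sketch (toy sequence, not poreCov's actual code):

```python
# Sketch: percentage of N bases in a consensus, relative to its length.

def percent_n(seq: str) -> float:
    """Percentage of Ns in `seq`."""
    return 100.0 * seq.upper().count("N") / len(seq)

consensus = "ACGT" * 100 + "N" * 100   # toy sequence: 500 nt, 100 Ns
print(f"{percent_n(consensus):.1f}% Ns")  # 20.0% Ns
```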

Summary report also in text table format

It would be really helpful to get a table with the data instead of only an HTML file, because content often has to be copied or accessed programmatically per column.

The HTML report is great, but having the main content also written as a summary table would be awesome.

@replikation I think you already write some JSON files for your auto database input?

But for other users an XLSX (CSV) would actually also be good.

@RaverJay I imagine it should be relatively easy to print out a dataframe that is anyway constructed for the HTML report also as a CSV or TSV?

I set this to high priority because it would really help people who are already actively using the pipeline - but of course I know there are other busy things to do! ;)
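
A minimal sketch of such a table export (the column names are illustrative, not poreCov's actual report fields):

```python
# Sketch: write the per-sample summary also as a TSV next to the HTML
# report, using only the standard library.
import csv

summary = [
    {"sample": "barcode01", "lineage": "B.1.1.7", "percent_n": 1.2},
    {"sample": "barcode02", "lineage": "B.1.351", "percent_n": 4.7},
]

with open("summary_report.tsv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=list(summary[0]), delimiter="\t")
    writer.writeheader()
    writer.writerows(summary)
```

With pandas already in play for the HTML report, the equivalent would be a single `DataFrame.to_csv(..., sep="\t")` call on the same dataframe.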

Publish basecalled FASTQ files

For downstream analyses often the basecalled FASTQ files are needed.

Can we add a parameter e.g. --publish_fastq to activate publishing of .fastq.gz files after basecalling and demultiplexing?

Change/Remove tree approach

  • remove the quick tree build feature and the reference to Nextstrain?
  • or allow a simple drop-in of a) a GISAID FASTA and b) GISAID metadata?

ukj-Flag to create mongoDB database entry and "filename support"

Create Flag [--ukj]

  • if used, poreCov uses the runinfo.txt file provided in the run directory
  • for each barcode, a string in runinfo.txt is given, containing the following information about the sample:
    YYYYMMDD_Location_SampleID_Abbreviation

-> YYYYMMDD = isolation date
-> Location = location of isolation
-> SampleID = internal sample ID
-> Abbreviation = combination of the letter S (= surveillance) or E (= extern) with a number, indicating that it's the x-th run of this sample (starting at 0)

  • parses this info, together with the RKI report, selected president results, the primer info and the analysis date for each sample, into one .csv file for the upload into the database.
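
Parsing the proposed naming scheme could be sketched as follows (the function name, the location value and the returned field names are illustrative):

```python
# Sketch: split a runinfo.txt entry of the form
# YYYYMMDD_Location_SampleID_Abbreviation into its parts.
from datetime import datetime

def parse_runinfo(entry: str) -> dict:
    date, location, sample_id, abbreviation = entry.split("_")
    return {
        "isolation_date": datetime.strptime(date, "%Y%m%d").date(),
        "location": location,
        "sample_id": sample_id,
        "abbreviation": abbreviation,  # e.g. S0 = first surveillance run
    }

print(parse_runinfo("20210119_Jena_s001_S0"))
```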

Generate one summary consensus QC TSV

It would be great to provide one summary TSV that tells the user which samples passed QC and which did not. This makes downstream analyses easier/faster as well.

Warning when older version of pipeline is used

Is it possible to add a warning when an older version of the pipeline is used, so that people automatically become aware of new versions? Not sure if such a feature is maybe even part of Nextflow?
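
One way such a check could work, sketched with illustrative version strings (fetching the latest tag, e.g. from the GitHub releases API, is omitted, and purely numeric dotted tags are assumed):

```python
# Sketch: warn if the running pipeline version is older than the
# latest released tag.

def is_outdated(current: str, latest: str) -> bool:
    """Compare two dotted version strings like 'v0.9.1' numerically."""
    to_tuple = lambda v: tuple(int(x) for x in v.lstrip("v").split("."))
    return to_tuple(current) < to_tuple(latest)

if is_outdated("0.9.1", "0.10.0"):
    print("WARNING: a newer poreCov release is available")
```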

collect consensus sequences in one folder

Hi,

would it be possible to put/link the final consensus sequences into a dedicated output folder, e.g. 'consensus_sequences'? This would facilitate copying out all consensus sequences for further use outside of the pipeline.

Clean-up

Is it possible to implement a parameter for a final clean-up after the pipeline has finished? For example, it would be great to automatically get rid of the -w work folder once the pipeline finished successfully.

Might be possible via onComplete?

Add reference genome coverage to the report

One important metric for us is also the reference genome coverage, i.e. how well the Wuhan reference genome is supported by the reads. Since we map the reads to the Wuhan reference anyway, this should be easy to calculate based on a BED file?

In the Illumina pipeline, we use a coverage of at least 20X as a cutoff.

In the end, what I would like to have is basically: "reads from sampleXY cover 98.12345% of the Wuhan reference genome with at least 20X" as another number in the report.

Secondly, based on that it would also be easy to report a median coverage value (or do you think this is not as useful, @replikation, as we discussed in some other thread regarding nanopore amplicon sequencing?)
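
Given per-position depths (e.g. parsed from `samtools depth` output), the requested number reduces to a simple fraction; a sketch with toy values:

```python
# Sketch: fraction of reference positions covered with at least 20X.

def ref_coverage(depths, min_depth=20):
    """Percentage of positions in `depths` with depth >= `min_depth`."""
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths)

depths = [25, 30, 19, 0, 40]            # toy per-position depth values
print(f"{ref_coverage(depths):.1f}% of positions at >= 20X")  # 60.0%
```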

more verbose error messages if the sample input is faulty or wrong

  • better input validation would be great to validate the input CSV
  • the current observation is that users are confused, as the workflow does not explicitly state if something is wrong with their input
  • a solution might be some Groovy syntax to validate the file first, before the channel magic happens?
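
The kind of up-front check meant here, illustrated in Python (in the workflow itself this would live in Groovy before the channel is built; the column names are illustrative, not poreCov's actual sample sheet schema):

```python
# Sketch: fail fast with an explicit message if the sample CSV is
# missing required columns or contains empty sample names.
import csv
import io
import sys

def validate_samplesheet(handle, required=("sample", "barcode")):
    reader = csv.DictReader(handle)
    missing = set(required) - set(reader.fieldnames or [])
    if missing:
        sys.exit(f"ERROR: sample sheet is missing column(s): "
                 f"{', '.join(sorted(missing))}")
    for line_no, row in enumerate(reader, start=2):
        if not row["sample"]:
            sys.exit(f"ERROR: empty sample name on line {line_no}")

validate_samplesheet(io.StringIO("sample,barcode\ns1,BC01\n"))  # passes silently
```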

Output QC

I think currently there is no quality check of the generated consensuses?

I suggest having at least some simple QC, e.g.

  • check for the number of N
  • pairwise sequence identity to e.g. the Wuhan reference sequence

to automatically decide which samples yield good enough consensus sequences for further processing.

We could even simply integrate a Python script that people are working on right now:
https://gitlab.com/RKIBioinformaticsPipelines/president

Here, an input FASTA sequence is aligned to the Wuhan reference strain and a pairwise identity is calculated and reported in tabular format.
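
The identity metric itself is straightforward once the sequences are aligned; a toy sketch (president performs the alignment step first, which is omitted here):

```python
# Sketch: pairwise identity between two already-aligned sequences of
# equal length.

def pairwise_identity(a: str, b: str) -> float:
    """Percentage of matching positions between aligned `a` and `b`."""
    assert len(a) == len(b), "sequences must be aligned to equal length"
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

print(f"{pairwise_identity('ACGTACGT', 'ACGTACCT'):.1f}%")  # 87.5%
```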

Add absolute classification numbers to the report

The new report is great!

One thing: to directly see the number of reads assigned to SC2 or human (in particular for a negative control), it would be great to add these numbers instead of / in addition to the percentage values. 100 % SC2 in a negative control sounds worrying, but then the user has to look into the Kraken classification / Krona plot anyway to see that these are maybe only a handful of (false positive) reads.

It would be then important though, to take care of large numbers and format them in a nice way (e.g. 1k 1m 1g) to not spoil the nice table view.
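
The compact formatting mentioned above could look like this (a small helper sketch, not poreCov code):

```python
# Sketch: format read counts as 1.2k / 1.2M / 1.2G so large numbers do
# not spoil the table layout.

def human_count(n: int) -> str:
    for factor, suffix in ((1_000_000_000, "G"), (1_000_000, "M"), (1_000, "k")):
        if n >= factor:
            return f"{n / factor:.1f}{suffix}"
    return str(n)

print(human_count(1234567))  # 1.2M
print(human_count(42))       # 42
```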

replace rki csv

new template for upload:

SENDING_LAB;DATE_DRAW;SEQ_TYPE;SEQ_REASON;SAMPLE_TYPE;PUBLICATION_STATUS;OWN_FASTA_ID
12346;20210119;ILLUMINA;X;s001;Y;A-899
12346;20210119;OXFORD_NANOPORE;Y;s006;N;A_900
12346;20210119;OTHER;A[B.1.1.7/B.1.351];s017;P;A.901

samples from samplesheet.yml are not reported

This might be an issue with an older version (the analysis was run at the end of February).

Samples that are given in the samplesheet.yml and yield no reads in demultiplexing are not reported in the output. This may lead to missing information when tracking samples further downstream.

Could you add the information about missing reads to a report.csv or similar?

Keep alignment file

For more detailed variant-calling QC it would be good, as a first step, to keep the alignment file. Best would be the PSL format (I think minimap2 can provide it directly).

local config

Hi, thanks for implementing this workflow. New fan here!
Just a minor request for the local profile. There are some processes that one would expect to require few resources (e.g. NanoPlot), but the local profile assigns cpus = params.cores. In my case, I'm running it on a local server with 32 CPUs. That leads to "allocating" the whole server to a single instance of (say) NanoPlot, when this process requires little more than 1 CPU at most, slowing down the whole workflow execution (it blocks parallel execution of multiple instances).
This is very easy to solve with a new configuration; my request is just to improve the experience for users new to Nextflow. I think for most of these processes a default of cpus = 2 would be enough and would allow multiple parallel instances. Some of them may require a little more though, such as artic and bwa.

Only add genomes to final RKI FASTA and report.csv that meet QC criteria

Question/Thought

I think currently in #28 all final consensus seqs are added to the all_genomes.fasta and rki_report.csv?

While this is fine (because then all information is in one place), we could also think of writing only those sequences to these summary files that also meet the consensus QC. Otherwise, they might be rejected anyway when submitted to the RKI.

However, people might also want to work internally with sequences that don't meet the QC criteria, and would otherwise miss them when they are not part of the summary files?

And: when QC thresholds change, this must also be reflected in poreCov so as not to reject sequences that might actually pass the later QC.

Adjust percentage of classified reads

Currently, the report shows e.g. 100% SARS-CoV-2 and 0% human even if there are only 10 reads classified as SARS-CoV-2 (and the rest is unclassified).

Thus, it is good to have the absolute numbers now as well, but would it not be better to report the percentages in accordance with the Krona plot? E.g. 4 % SARS-CoV-2 if 96 % are unclassified?
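
The recalculation amounts to using all reads as the denominator; a toy sketch with the numbers from the example above:

```python
# Sketch: SARS-CoV-2 percentage relative to all reads (classified plus
# unclassified), matching what the Krona plot shows.

def percent_of_total(sc2_reads, human_reads, unclassified_reads):
    total = sc2_reads + human_reads + unclassified_reads
    return 100.0 * sc2_reads / total

print(f"{percent_of_total(4, 0, 96):.1f}% SARS-CoV-2")  # 4.0% SARS-CoV-2
```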

rki conform metadata output

  • add RKI conform metadata output
IMS_ID;SENDING_LAB;DATE_DRAW;SEQ_TYPE;SEQ_REASON;SAMPLE_TYPE;OWN_FASTA_ID
IMS-12345-CVDP-00001;12346;20210119;ILLUMINA;X;s001;A-899
IMS-12345-CVDP-00002;12346;20210119;OXFORD_NANOPORE;Y;s006;A_900
IMS-12345-CVDP-00003;12346;20210119;OTHER;A[B.1.1.7/B.1.351];s017;A.901

report

It would be good to have at least a simple summary report for the reconstructed consensus sequence. This should include:

  • used version of poreCov
  • used tools and versions within poreCov
  • basic stats about the reconstructed consensuses (length, N50, number of Ns, maybe pairwise identity to the Wuhan strain ...)
  • if possible some stats about the called variants

E.g. in a single PDF report per run.

For the first part (technical stats) it might be also enough to use the nextflow internal functions for reporting.

Optimize CPU basecalling

Current PR integrates CPU basecalling #18

We can check for some optimization of the CPU basecalling process, if needed (number of callers, callers per CPU, ...).

E.g. @hoelzer can check how people currently run CPU basecalling on larger machines to improve the current basic command.

Filter FASTQ by length - report

As an initial step, FASTQ files are filtered by length, and if the file size is too small, the FASTQ is not processed any further. It would be good to have this reported somehow.

E.g. I just tested a FAST5 run (V3 primers) that resulted in 24 barcoded FASTQ files, and it seems 9 of them were sorted out and not processed any further.

It would be good to have a TSV with e.g. all IDs and a column stating which were sorted out due to a low number of reads after filtering.
