bccdc-phl / plasmid-screen

Screen plasmids for carbapenemase resistance genes, and classify using MOB Suite

Python 50.47% Nextflow 49.53%
bioinformatics-pipeline plasmids antibiotic-resistance genomic-epidemiology

plasmid-screen's People

Contributors

dfornika

plasmid-screen's Issues

Samplesheet input mode

Add feature to allow input via a samplesheet with the following fields:

If only reads are supplied:
ID,R1,R2,ASSEMBLY

If running in --pre_assembled mode, with assemblies:
ID,R1,R2,ASSEMBLY
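
A minimal sketch of how such a samplesheet could be validated, written in plain Python rather than in the pipeline's own Nextflow code. The function name `parse_samplesheet` is hypothetical, and the sketch assumes that the ASSEMBLY column is only required in --pre_assembled mode and is ignored otherwise.

```python
import csv
import io

REQUIRED_READS = ("ID", "R1", "R2")
REQUIRED_PRE_ASSEMBLED = ("ID", "R1", "R2", "ASSEMBLY")


def parse_samplesheet(handle, pre_assembled=False):
    """Return one dict per sample, checking that the expected columns exist.

    Assumption: ASSEMBLY is only required when pre_assembled is True.
    """
    required = REQUIRED_PRE_ASSEMBLED if pre_assembled else REQUIRED_READS
    reader = csv.DictReader(handle)
    missing = [f for f in required if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"samplesheet is missing columns: {missing}")
    samples = []
    for row in reader:
        for field in required:
            if not row[field]:
                raise ValueError(f"sample {row.get('ID')!r} has no value for {field}")
        # Keep only the columns relevant to the selected mode.
        samples.append({f: row[f] for f in required})
    return samples


# Illustrative file names only.
sheet = io.StringIO(
    "ID,R1,R2,ASSEMBLY\n"
    "sample-01,sample-01_R1.fastq.gz,sample-01_R2.fastq.gz,sample-01.fa\n"
)
samples = parse_samplesheet(sheet, pre_assembled=True)
```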

sample name collision in provenance files

If multiple files share the same combination of sample ID, reference plasmid, and carbapenemase, generating the provenance files fails with an error. This happened when a sample carried more than one copy of the same carbapenemase gene.

Support (optional) long-reads

Add support for incorporating long-read data into the analysis. Long reads could be mapped against plasmid assemblies & reconstructions using minimap2. Other uses(?)

Add optional versioned output directory

The pipeline currently creates one output directory per sample and publishes all outputs there, e.g.:

publishDir "${params.outdir}/${sample_id}", pattern: "${sample_id}_abricate.tsv", mode: 'copy'

When combining this pipeline with others, it may be useful to encapsulate the outputs from this pipeline in a sub-directory that is named with the pipeline name and version.

So by default we would create outputs of this structure:

sample-01/
├── sample-01_20211207163723_provenance.yml
├── sample-01_abricate.tsv
├── ...
├── sample-01_quast.csv
└── NC_019152.1.fa

...but when running with a --versioned_outdir flag, we would produce:

.
└── sample-01
    └── plasmid-screen-v0.1-output
        ├── sample-01_20211207163723_provenance.yml
        ├── sample-01_abricate.tsv
        ├── ...
        ├── sample-01_quast.csv
        └── NC_019152.1.fa

...then a subsequent analysis could produce similar outputs alongside:

...and if this pipeline were updated, we could store those outputs alongside as well:

.
└── sample-01
    ├── plasmid-screen-v0.1-output
    │   ├── sample-01_abricate.tsv
    │   └── ...
    └── plasmid-screen-v0.2-output
        ├── sample-01_abricate.tsv
        └── ...
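
The directory naming described above can be sketched as a small path helper. This is an illustration in Python, not the pipeline's Nextflow code; `sample_outdir` and its parameters are hypothetical names, and the `<name>-v<version>-output` pattern is taken from the trees shown in this issue.

```python
from pathlib import PurePosixPath


def sample_outdir(outdir, sample_id, pipeline_name, pipeline_version, versioned=False):
    """Build the per-sample output directory, optionally nested per pipeline version."""
    base = PurePosixPath(outdir) / sample_id
    if versioned:
        # e.g. sample-01/plasmid-screen-v0.1-output
        return base / f"{pipeline_name}-v{pipeline_version}-output"
    return base


default_dir = sample_outdir("results", "sample-01", "plasmid-screen", "0.1")
versioned_dir = sample_outdir("results", "sample-01", "plasmid-screen", "0.1", versioned=True)
```

Keeping the sample directory as the outer level means outputs from different pipelines, or different versions of this one, sit side by side under the same sample.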

Provenance file may not be produced when no resistance genes localized to plasmid in sample

Due to the way that provenance info is collected in this pipeline:

plasmid-screen/main.nf

Lines 94 to 105 in 7cde3e9

ch_provenance = mob_recon.out.provenance
ch_provenance = ch_provenance.join(abricate.out.provenance).map{ it -> [it[0], [it[1]] << it[2]] }
ch_provenance = ch_provenance.join(trim_reads.out.provenance).map{ it -> [it[0], it[1] << it[2]] }
ch_provenance = ch_provenance.join(hash_files_fastq.out.provenance).map{ it -> [it[0], it[1] << it[2]] }
if (params.pre_assembled) {
    ch_provenance = ch_provenance.join(hash_files_assemblies.out.provenance).map{ it -> [it[0], it[1] << it[2]] }
}
ch_provenance = ch_provenance.join(align_reads_to_reference_plasmid.out.provenance).map{ it -> [it[0], it[1] << it[2]] }
ch_provenance = ch_provenance.join(call_snps.out.provenance).map{ it -> [it[0], it[1] << it[2]] }
ch_provenance = ch_provenance.join(quast.out.provenance).map{ it -> [it[0], it[1] << it[2]] }
ch_provenance = ch_provenance.join(ch_fastq.map{ it -> it[0] }.combine(ch_pipeline_provenance)).map{ it -> [it[0], it[1] << it[2]] }
collect_provenance(ch_provenance)

...there may be cases where the provenance file is lost for certain samples. For example, the .join statement on this line:

ch_provenance = ch_provenance.join(align_reads_to_reference_plasmid.out.provenance).map{ it -> [it[0], it[1] << it[2]] }

This join drops any sample that does not go through the reference plasmid alignment step, so no provenance file is collected for it.
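
The effect can be illustrated outside Nextflow with plain Python dictionaries keyed by sample ID. This is only a model of the behaviour; the function names and file names are hypothetical. In the pipeline itself, one direction worth checking might be the `remainder: true` option of Nextflow's `join` operator, although unmatched entries would then carry null values that the following `map` would need to handle.

```python
def inner_join(left, right):
    """Keep only keys present in both mappings, as a plain join does."""
    return {k: left[k] + [right[k]] for k in left if k in right}


def join_keep_unmatched(left, right):
    """Keep every key from the left mapping; append the right value only if it exists."""
    return {k: left[k] + ([right[k]] if k in right else []) for k in left}


# sample-02 has no plasmid-localized resistance gene, so it never reaches alignment.
provenance = {"sample-01": ["mob_recon.yml"], "sample-02": ["mob_recon.yml"]}
alignment = {"sample-01": "bwa_samtools.yml"}

lost = inner_join(provenance, alignment)          # sample-02 disappears
kept = join_keep_unmatched(provenance, alignment)  # sample-02 keeps its earlier entries
```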

Long/complicated Contig IDs in resistance gene report when using Dragonflye

We've been getting good results doing hybrid assemblies using the dragonflye assembler through our BCCDC-PHL/dragonflye-nf pipeline. But when we run those assemblies through this pipeline, the resistance_gene_contig_id column of the resistance_gene_report.tsv file contains values like this:

sample_id  assembly_file                  resistance_gene_contig_id
XXXXXXXXX  XXXXXXXXX_plasmid_XXXXX.fasta  XXXXXXXXX_contig00005_len=11790_cov=233.0_origname=contig_2_polypolish_polish=racon:1_round(s);polypolish:short_reads_1_round(s);_sw=dragonflye-flye/1.1.0_date=20230907_circular=Y

The 'plasmid reconstruction' (often a single contig for hybrid assembly, but not always) that's produced by mob-recon also includes the long ID:

>XXXXXXXXX_contig00005_len=11790_cov=233.0_origname=contig_2_polypolish_polish=racon:1_round(s);polypolish:short_reads_1_round(s);_sw=dragonflye-flye/1.1.0_date=20230907_circular=Y

But the fasta headers from the original dragonflye assembly include spaces:

>XXXXXXXXX_contig00005 len=11790 cov=233.0 origname=contig_2_polypolish polish=racon:1 round(s);polypolish:short_reads,1 round(s); sw=dragonflye-flye/1.1.0 date=20230907 circular=Y

...so the 'contig ID' would be only the first part of the header up to the first whitespace. We'd prefer if this appeared in the resistance gene report like this:

sample_id  assembly_file                  resistance_gene_contig_id
XXXXXXXXX  XXXXXXXXX_plasmid_XXXXX.fasta  XXXXXXXXX_contig00005
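
Because the spaces have already been replaced with underscores by the time the ID reaches the report, splitting on whitespace is no longer possible there. One possible approach, sketched below in Python, is to read the short IDs from the original assembly headers and match them back as prefixes. The function names are hypothetical and the header is abbreviated from the example above.

```python
def contig_ids_from_headers(fasta_lines):
    """Return the ID (text up to the first whitespace) of every FASTA header line."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


def shorten_contig_id(long_id, known_ids):
    """Map an underscore-joined ID back to the original contig ID.

    Falls back to the unchanged ID when no known ID is a prefix of it.
    """
    matches = [i for i in known_ids if long_id == i or long_id.startswith(i + "_")]
    # Prefer the longest match so a shorter ID that happens to be a prefix cannot win.
    return max(matches, key=len) if matches else long_id


# Header abbreviated from the dragonflye example in this issue.
original_headers = [">XXXXXXXXX_contig00005 len=11790 cov=233.0 circular=Y"]
known_ids = contig_ids_from_headers(original_headers)
long_id = "XXXXXXXXX_contig00005_len=11790_cov=233.0_circular=Y"
short_id = shorten_contig_id(long_id, known_ids)
```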

Align reads against plasmid reconstructions

We currently align the input reads for each sample against the closest reference plasmids. But it would also be helpful for QC and review purposes to align the reads against the plasmid reconstruction(s) themselves.

Collect outputs

Add support for a --collect_outputs flag, which will cause tabular outputs to be collected together into a single file per analysis.

Use the flag --collected_outputs_prefix to control the filename prefix for the collected outputs, with a default value of "collected".
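
A rough sketch of the collection step, assuming every per-sample table of a given type is tab-delimited and starts with the same header line. The function name `collect_tables` is hypothetical, and this is plain Python rather than the pipeline's own code.

```python
def collect_tables(tables):
    """Concatenate tables that share a header line, keeping the header only once."""
    collected = []
    header = None
    for text in tables:
        lines = text.splitlines()
        if not lines:
            continue  # skip empty per-sample outputs
        if header is None:
            header = lines[0]
            collected.append(header)
        elif lines[0] != header:
            raise ValueError("tables have different headers")
        collected.extend(lines[1:])
    return "\n".join(collected) + "\n" if collected else ""


# Illustrative contents only.
per_sample = [
    "sample_id\tgene\nsample-01\tgene_a\n",
    "sample_id\tgene\nsample-02\tgene_b\n",
]
combined = collect_tables(per_sample)
```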

Pipeline fails when no plasmid contigs are found

The mob_recon process will fail because some of the expected outputs are not created when no contigs are assigned to plasmids and no plasmid reconstructions are created:

rename: sample-X/plasmid*.fasta: rename to sample-X/sample-X_plasmid*.fasta failed: No such file or directory
  cp: cannot stat ‘sample-X/sample-X_plasmid*.fasta’: No such file or directory
  cp: cannot stat ‘sample-X/mobtyper_results.txt’: No such file or directory
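
The rename failure comes from a glob that matches nothing. A sketch of a guard, in Python rather than the shell used inside the process: rename whatever matches and return an empty list otherwise. `rename_plasmid_fastas` is a hypothetical helper. The missing mobtyper_results.txt would still need separate handling, for example by declaring that output as optional in the Nextflow process.

```python
import tempfile
from pathlib import Path


def rename_plasmid_fastas(sample_dir, sample_id):
    """Prefix plasmid FASTA files with the sample ID; return the new paths (may be empty)."""
    renamed = []
    # sorted() materializes the matches before any file is renamed.
    for fasta in sorted(Path(sample_dir).glob("plasmid*.fasta")):
        target = fasta.with_name(f"{sample_id}_{fasta.name}")
        fasta.rename(target)
        renamed.append(target)
    return renamed


with tempfile.TemporaryDirectory() as tmp:
    # No plasmid reconstructions: nothing to rename, and no error is raised.
    empty = rename_plasmid_fastas(tmp, "sample-X")
    # One reconstruction present: it gets the sample ID prefix.
    (Path(tmp) / "plasmid_XXXXX.fasta").touch()
    renamed = [p.name for p in rename_plasmid_fastas(tmp, "sample-X")]
```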

Resistance gene report sometimes includes duplicate entries

There are sometimes duplicate (identical) entries in the resistance_gene_report.tsv file. This might complicate collecting results into downstream systems and databases, so it should not happen.

Investigate the cause, and prevent duplicate entries in the resistance_gene_report.tsv output.
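
Until the cause is found, identical rows could be filtered when the report is written. A minimal sketch in Python, with a hypothetical function name, that keeps the first occurrence of each line and preserves the original order:

```python
def drop_duplicate_rows(lines):
    """Return the lines with exact duplicates removed, preserving first-seen order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


# Illustrative rows only.
report = [
    "sample_id\tresistance_gene_contig_id",
    "sample-01\tcontig00005",
    "sample-01\tcontig00005",
    "sample-02\tcontig00002",
]
deduplicated = drop_duplicate_rows(report)
```

This only hides the symptom; if the duplicates come from a join that fans out upstream, fixing that join would be the better long-term fix.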

Pipeline crashes when multiple carbapenemase genes found on same plasmid reconstruction

Due to the way that we are naming provenance files, the pipeline will crash if multiple resistance genes are found on the same plasmid reconstruction:

Error executing process > 'collect_provenance (sample-X)'

Caused by:
  Process `collect_provenance` input file name collision -- There are multiple input files for each of the following file names: sample-X_pBC17Kpn036_bwa_samtools_provenance.yml, sample-X_pBC17Kpn036_freebayes_provenance.yml


Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
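
One way to avoid the collision might be to run alignment and SNP calling once per (sample, reference plasmid) pair rather than once per resistance gene hit, so that each provenance file name is produced only once. A sketch of that grouping in Python; the function name and the gene labels are illustrative assumptions, and an alternative would be to include the gene name in the provenance file name.

```python
def unique_alignment_jobs(hits):
    """Group (sample_id, reference_plasmid, gene) hits into one job per sample and reference."""
    jobs = {}
    for sample_id, reference_plasmid, gene in hits:
        jobs.setdefault((sample_id, reference_plasmid), []).append(gene)
    return jobs


# Two genes on the same reconstruction map to the same reference plasmid.
hits = [
    ("sample-X", "pBC17Kpn036", "gene_a"),
    ("sample-X", "pBC17Kpn036", "gene_b"),
]
jobs = unique_alignment_jobs(hits)
```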

Align against multiple candidate reference plasmids

Our current approach is to use the single reference plasmid with the lowest mash distance to our plasmid reconstruction as a reference for alignment and SNP-calling. But there are often cases where several suitable reference plasmids are closely related. We may get better alignments from those plasmids, even if their mash distances are not the lowest available.

Align & call SNPs against a set of candidate reference plasmids.
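
Selecting the candidate set could look roughly like the following Python sketch, which ranks references by mash distance and keeps the closest few. The function name, the `max_candidates` and `max_distance` parameters, and the distances are all hypothetical.

```python
def candidate_references(mash_distances, max_candidates=3, max_distance=None):
    """Return reference IDs ordered by increasing mash distance, limited to the closest few."""
    ranked = sorted(mash_distances.items(), key=lambda item: item[1])
    if max_distance is not None:
        ranked = [(ref, dist) for ref, dist in ranked if dist <= max_distance]
    return [ref for ref, _ in ranked[:max_candidates]]


# Made-up distances for illustration.
distances = {"ref_A": 0.010, "ref_B": 0.002, "ref_C": 0.050, "ref_D": 0.004}
closest_two = candidate_references(distances, max_candidates=2)
within_cutoff = candidate_references(distances, max_candidates=5, max_distance=0.01)
```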

Collect provenance info

For each sample, collect info on which tool versions were used, which input files were used (including hashes) and which pipeline version was used for the analysis.
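
Hashing the input files is the most mechanical part of this. A minimal Python sketch using SHA-256 from the standard library; the record field names are assumptions, not the pipeline's actual provenance schema.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def input_file_record(path):
    """Build a provenance entry for one input file (field names are hypothetical)."""
    return {"input_filename": os.path.basename(path), "sha256": sha256_of_file(path)}


# Demonstrate on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ACGT\n")
record = input_file_record(tmp.name)
os.remove(tmp.name)
```

Reading in chunks keeps memory use flat for large FASTQ files.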

Look for replicon genes using abricate with plasmidfinder database

We currently get some information about replicon genes from our mob-typer plasmid reports, but we could get more detailed information about exactly which contig the gene is located on, where within the contig, and the accuracy of the match if we were to collect that info using abricate with the plasmidfinder database.

Add another abricate process similar to the one we have, but using the plasmidfinder database.
