bccdc-phl / plasmid-screen

Screen plasmids for carbapenemase resistance genes, and classify using MOB Suite

Python 50.47% Nextflow 49.53%

bioinformatics-pipeline plasmids antibiotic-resistance genomic-epidemiology

plasmid-screen's Issues

Samplesheet input mode

Add feature to allow input via a samplesheet with the following fields:

If only reads are supplied:
ID,R1,R2

If running in --pre_assembled mode, with assemblies:
ID,R1,R2,ASSEMBLY
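
A minimal Python sketch of how such a samplesheet might be validated. The function name `parse_samplesheet` is hypothetical, and treating ASSEMBLY as required only in --pre_assembled mode is an assumption rather than something this issue states.

```python
import csv
import io

# Columns named in the issue; ASSEMBLY is only checked in pre-assembled mode (assumption).
READ_FIELDS = ["ID", "R1", "R2"]


def parse_samplesheet(handle, pre_assembled=False):
    """Return one dict per sample, checking that the expected columns exist."""
    required = READ_FIELDS + (["ASSEMBLY"] if pre_assembled else [])
    reader = csv.DictReader(handle)
    missing = [f for f in required if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"samplesheet is missing fields: {missing}")
    # Keep only the columns this mode needs; extra columns are ignored.
    return [{field: row[field] for field in required} for row in reader]


# Hypothetical file names, for illustration only.
example = io.StringIO(
    "ID,R1,R2,ASSEMBLY\n"
    "sample-01,sample-01_R1.fastq.gz,sample-01_R2.fastq.gz,sample-01.fa\n"
)
samples = parse_samplesheet(example, pre_assembled=True)
```

Raising on a missing column up front keeps a malformed samplesheet from failing later inside a process.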

Update provenance format to match provenance schema

We're trying to standardize the format of our pipeline provenance data to match a common schema

Make any changes necessary to make this pipeline's provenance format match that schema.

Add size of resistance gene contig to resistance gene report

Add the size of the contig that the resistance gene is located on to the resistance gene report.
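
One way this could be done, sketched in Python: measure each contig in the assembly FASTA, then look the length up by contig ID. The column name `resistance_gene_contig_size` and both function names are hypothetical; `resistance_gene_contig_id` is the existing report field.

```python
def contig_lengths(fasta_lines):
    """Map each contig ID (header text up to the first whitespace) to its length."""
    lengths = {}
    current = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def add_contig_size(report_rows, lengths):
    """Return copies of the report rows with a contig size column added."""
    return [
        dict(row, resistance_gene_contig_size=lengths.get(row["resistance_gene_contig_id"]))
        for row in report_rows
    ]


# Toy sequences, for illustration only.
fasta = [">contig00001 len=8", "ACGTACGT", ">contig00002", "ACGT", "AC"]
rows = add_contig_size(
    [{"sample_id": "sample-01", "resistance_gene_contig_id": "contig00002"}],
    contig_lengths(fasta),
)
```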

Sample name collision in provenance files

If multiple files share the same sample ID / reference plasmid / carbapenemase combination, generating the provenance files fails with an error. This happened when a sample carried more than one copy of the same carbapenemase gene.

Support (optional) long-reads

Add support for incorporating long-read data into the analysis. Long reads could be mapped against plasmid assemblies & reconstructions using minimap2. Other uses(?)

Add optional versioned output directory

The pipeline currently creates one output directory per sample and publishes all outputs there, e.g.:

plasmid-screen/modules/abricate.nf

Line 5 in 8419c4f

 publishDir "${params.outdir}/${sample_id}", pattern: "${sample_id}_abricate.tsv", mode: 'copy' 

When combining this pipeline with others, it may be useful to encapsulate the outputs from this pipeline in a sub-directory that is named with the pipeline name and version.

So by default we would create outputs of this structure:

sample-01/
├── sample-01_20211207163723_provenance.yml
├── sample-01_abricate.tsv
├── ...
├── sample-01_quast.csv
└── NC_019152.1.fa

...but when running with a --versioned_outdir flag, we would produce:

.
└── sample-01
    └── plasmid-screen-v0.1-output
        ├── sample-01_20211207163723_provenance.yml
        ├── sample-01_abricate.tsv
        ├── ...
        ├── sample-01_quast.csv
        └── NC_019152.1.fa

...then a subsequent analysis could produce similar outputs alongside:

...and if this pipeline were updated, we could store those outputs alongside as well:

.
└── sample-01
    ├── plasmid-screen-v0.1-output
    │   ├── sample-01_abricate.tsv
    │   └── ...
    └── plasmid-screen-v0.2-output
        ├── sample-01_abricate.tsv
        └── ...
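
The directory naming shown above can be sketched as a small Python helper. The pipeline itself sets this path in Nextflow `publishDir` directives; the function `sample_outdir` and its parameters are hypothetical and only illustrate the naming rule.

```python
from pathlib import PurePosixPath


def sample_outdir(outdir, sample_id, pipeline_name, pipeline_version, versioned=False):
    """Build the per-sample output path, optionally nested in a versioned sub-directory."""
    base = PurePosixPath(outdir) / sample_id
    if versioned:
        return base / f"{pipeline_name}-v{pipeline_version}-output"
    return base


default_dir = sample_outdir("results", "sample-01", "plasmid-screen", "0.1")
versioned_dir = sample_outdir("results", "sample-01", "plasmid-screen", "0.1", versioned=True)
```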

Clean up & organize outputs

Decide on a set of output files, filenames and output directory structure.

Provenance file may not be produced when no resistance genes localized to plasmid in sample

Due to the way that provenance info is collected in this pipeline:

plasmid-screen/main.nf

Lines 94 to 105 in 7cde3e9

    ch_provenance = mob_recon.out.provenance
    ch_provenance = ch_provenance.join(abricate.out.provenance).map{ it -> [it[0], [it[1]] << it[2]] }
    ch_provenance = ch_provenance.join(trim_reads.out.provenance).map{ it -> [it[0], it[1] << it[2]] }
    ch_provenance = ch_provenance.join(hash_files_fastq.out.provenance).map{ it -> [it[0], it[1] << it[2]] }
    if (params.pre_assembled) {
        ch_provenance = ch_provenance.join(hash_files_assemblies.out.provenance).map{ it -> [it[0], it[1] << it[2]] }
    }
    ch_provenance = ch_provenance.join(align_reads_to_reference_plasmid.out.provenance).map{ it -> [it[0], it[1] << it[2]] }
    ch_provenance = ch_provenance.join(call_snps.out.provenance).map{ it -> [it[0], it[1] << it[2]] }
    ch_provenance = ch_provenance.join(quast.out.provenance).map{ it -> [it[0], it[1] << it[2]] }
    ch_provenance = ch_provenance.join(ch_fastq.map{ it -> it[0] }.combine(ch_pipeline_provenance)).map{ it -> [it[0], it[1] << it[2]] }
    collect_provenance(ch_provenance)

...there may be cases where the provenance file is lost for certain samples. For example, the .join statement on this line:

plasmid-screen/main.nf

Line 101 in 7cde3e9

 ch_provenance = ch_provenance.join(align_reads_to_reference_plasmid.out.provenance).map{ it -> [it[0], it[1] << it[2]] } 

...results in samples that do not go through the reference plasmid alignment step being dropped from the provenance collection process.
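
The effect can be illustrated outside Nextflow. The Python sketch below models each provenance channel as a dict keyed by sample ID. It is an analogy for the join behaviour described above, not the pipeline's code, and `join_provenance` and `keep_unmatched` are hypothetical names. Nextflow's `join` operator has a `remainder` option that plays a similar role, though whether it fits here would need testing.

```python
def join_provenance(left, right, keep_unmatched=False):
    """Append right-hand provenance files to the left-hand lists, matching on sample ID.

    By default a sample missing from `right` is dropped, like an inner join.
    With keep_unmatched=True it is carried forward with the files it already has.
    """
    joined = {}
    for sample_id, files in left.items():
        if sample_id in right:
            joined[sample_id] = files + [right[sample_id]]
        elif keep_unmatched:
            joined[sample_id] = list(files)
    return joined


collected = {"sample-01": ["mob_recon.yml"], "sample-02": ["mob_recon.yml"]}
# sample-02 never reached the reference plasmid alignment step.
aligned = {"sample-01": "bwa_samtools.yml"}

strict = join_provenance(collected, aligned)
lenient = join_provenance(collected, aligned, keep_unmatched=True)
```

In the strict case `sample-02` disappears entirely, which matches the lost provenance file reported here.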

Long/complicated Contig IDs in resistance gene report when using Dragonflye

We've been getting good results doing hybrid assemblies using the dragonflye assembler through our BCCDC-PHL/dragonflye-nf pipeline. But when we run those assemblies through this pipeline, the resistance_gene_contig_id field in the resistance_gene_report.tsv file contains the entire FASTA header joined with underscores:

sample_id  assembly_file                  resistance_gene_contig_id
XXXXXXXXX  XXXXXXXXX_plasmid_XXXXX.fasta  XXXXXXXXX_contig00005_len=11790_cov=233.0_origname=contig_2_polypolish_polish=racon:1_round(s);polypolish:short_reads_1_round(s);_sw=dragonflye-flye/1.1.0_date=20230907_circular=Y

The 'plasmid reconstruction' (often a single contig for hybrid assembly, but not always) that's produced by mob-recon also includes the long ID:

>XXXXXXXXX_contig00005_len=11790_cov=233.0_origname=contig_2_polypolish_polish=racon:1_round(s);polypolish:short_reads_1_round(s);_sw=dragonflye-flye/1.1.0_date=20230907_circular=Y

But the fasta headers from the original dragonflye assembly include spaces:

>XXXXXXXXX_contig00005 len=11790 cov=233.0 origname=contig_2_polypolish polish=racon:1 round(s);polypolish:short_reads,1 round(s); sw=dragonflye-flye/1.1.0 date=20230907 circular=Y

...so the 'contig ID' would be only the first part of the header up to the first whitespace. We'd prefer if this appeared in the resistance gene report like this:

sample_id  assembly_file                  resistance_gene_contig_id
XXXXXXXXX  XXXXXXXXX_plasmid_XXXXX.fasta  XXXXXXXXX_contig00005
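
A possible approach, sketched in Python: take the short contig IDs from the original assembly headers, then map each long ID back to the short ID it starts with. Both function names are hypothetical, and the prefix-matching rule is an assumption based on the example above, where the long ID is the short ID followed by an underscore and the rest of the header.

```python
def original_contig_ids(assembly_headers):
    """Extract contig IDs (text up to the first whitespace) from FASTA header lines."""
    ids = []
    for header in assembly_headers:
        fields = header[1:].split() if header.startswith(">") else []
        if fields:
            ids.append(fields[0])
    return ids


def shorten_contig_id(long_id, contig_ids):
    """Return the longest original contig ID that `long_id` equals or extends with '_'."""
    matches = [c for c in contig_ids if long_id == c or long_id.startswith(c + "_")]
    return max(matches, key=len) if matches else long_id


# Headers abbreviated from the example above.
headers = [
    ">XXXXXXXXX_contig00001 len=5000000 circular=Y",
    ">XXXXXXXXX_contig00005 len=11790 cov=233.0 circular=Y",
]
short_id = shorten_contig_id(
    "XXXXXXXXX_contig00005_len=11790_cov=233.0_circular=Y",
    original_contig_ids(headers),
)
```

Falling back to the unmodified ID when nothing matches leaves assemblies from other tools unaffected.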

Align reads against plasmid reconstructions

We currently align the input reads for each sample against the closest reference plasmids. But it would also be helpful for QC and review purposes to align the reads against the plasmid reconstruction(s) themselves.

Collect outputs

Add support for a --collect_outputs flag, which will cause tabular outputs to be collected together into a single file per analysis.

Use the flag --collected_outputs_prefix to control the filename prefixes for the collected outputs, with default value "collected"
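
A rough Python sketch of the collection step, assuming every per-sample table for a given analysis shares the same header. The `<prefix>_<analysis>.tsv` naming and both function names are assumptions; the issue only fixes the default prefix "collected".

```python
import csv
import io


def collect_tables(tables):
    """Concatenate TSV texts that share a header into one TSV with a single header row."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    header = None
    for text in tables:
        rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
        if not rows:
            continue
        if header is None:
            header = rows[0]
            writer.writerow(header)
        elif rows[0] != header:
            raise ValueError("tables do not share the same header")
        writer.writerows(rows[1:])
    return out.getvalue()


def collected_filename(analysis, prefix="collected"):
    """Hypothetical naming scheme for a collected output file."""
    return f"{prefix}_{analysis}.tsv"


# Toy tables, for illustration only.
combined = collect_tables([
    "sample_id\tgene\nsample-01\tgene_a\n",
    "sample_id\tgene\nsample-02\tgene_b\n",
])
```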

Pipeline fails when no plasmid contigs are found

The mob_recon process will fail because some of the expected outputs are not created when no contigs are assigned to plasmids and no plasmid reconstructions are created:

rename: sample-X/plasmid*.fasta: rename to sample-X/sample-X_plasmid*.fasta failed: No such file or directory
  cp: cannot stat ‘sample-X/sample-X_plasmid*.fasta’: No such file or directory
  cp: cannot stat ‘sample-X/mobtyper_results.txt’: No such file or directory
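
The failing step is a shell rename inside the process script. The same logic is sketched here in Python to show the guard that is missing: iterate over whatever matches and treat an empty match as a valid result. The function name is hypothetical, and how the process should declare its outputs as optional is left open.

```python
from pathlib import Path


def prefix_plasmid_fastas(sample_dir, sample_id):
    """Rename plasmid*.fasta files to <sample_id>_plasmid*.fasta.

    Returns the new file names; an empty list means no plasmids were reconstructed.
    """
    renamed = []
    for fasta in sorted(Path(sample_dir).glob("plasmid*.fasta")):
        target = fasta.with_name(f"{sample_id}_{fasta.name}")
        fasta.rename(target)
        renamed.append(target.name)
    return renamed
```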

Add GitHub Actions-based testing workflow

Add a testing workflow. See our routine-assembly pipeline for an example.

Resistance gene report sometimes includes duplicate entries

There are sometimes duplicate (identical) entries in the resistance_gene_report.tsv file. This might complicate collecting results into downstream systems and databases, so it should not happen.

Investigate the cause, and prevent duplicate entries in the resistance_gene_report.tsv output.
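
Until the cause is known, identical rows could be filtered when the report is written. A minimal Python sketch, which treats rows as lists of strings and keeps the first occurrence of each; the function name is hypothetical.

```python
def drop_duplicate_rows(rows):
    """Remove rows that are identical to an earlier row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Toy report rows, for illustration only.
report = [
    ["sample-01", "plasmid_A", "gene_a"],
    ["sample-01", "plasmid_A", "gene_a"],
    ["sample-01", "plasmid_B", "gene_a"],
]
deduplicated = drop_duplicate_rows(report)
```

This hides the symptom only; the duplicated rows may come from an upstream join that should be fixed at its source.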

Pipeline crashes when multiple carbapenemase genes found on same plasmid reconstruction

Due to the way that we are naming provenance files, the pipeline will crash if multiple resistance genes are found on the same plasmid reconstruction:

Error executing process > 'collect_provenance (sample-X)'

Caused by:
  Process `collect_provenance` input file name collision -- There are multiple input files for each of the following file names: sample-X_pBC17Kpn036_bwa_samtools_provenance.yml, sample-X_pBC17Kpn036_freebayes_provenance.yml


Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
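
One way to avoid the collision is to make each provenance file name unique. The Python sketch below appends a counter when a name repeats. The function name and the suffix scheme are assumptions; de-duplicating the inputs before they reach `collect_provenance` would be an alternative.

```python
from collections import Counter


def unique_provenance_names(entries):
    """Build one provenance file name per (sample_id, ref_plasmid, tool) entry.

    Repeated entries get a numeric suffix so that no two names are equal.
    """
    seen = Counter()
    names = []
    for sample_id, ref_plasmid, tool in entries:
        stem = f"{sample_id}_{ref_plasmid}_{tool}"
        seen[stem] += 1
        suffix = "" if seen[stem] == 1 else f"_{seen[stem]}"
        names.append(f"{stem}{suffix}_provenance.yml")
    return names


# Two resistance genes on the same reconstruction produce the same entry twice.
names = unique_provenance_names([
    ("sample-X", "pBC17Kpn036", "freebayes"),
    ("sample-X", "pBC17Kpn036", "freebayes"),
])
```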

Align against multiple candidate reference plasmids

Our current approach is to use the single reference plasmid with the lowest mash distance to our plasmid reconstruction as a reference for alignment and SNP-calling. But there are often cases where several suitable reference plasmids are closely related. We may get better alignments from those plasmids, even if their mash distances are not the lowest available.

Align & call SNPs against a set of candidate reference plasmids.
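
Candidate selection might look like the Python sketch below. The cut-offs `max_candidates` and `max_distance`, their default values, and the function name are placeholders chosen for illustration, not values taken from the pipeline.

```python
def candidate_references(mash_hits, max_candidates=3, max_distance=0.05):
    """Pick the closest reference plasmids from (reference_id, mash_distance) pairs."""
    eligible = [hit for hit in mash_hits if hit[1] <= max_distance]
    ranked = sorted(eligible, key=lambda hit: hit[1])
    return [reference_id for reference_id, _ in ranked[:max_candidates]]


# Made-up reference names and distances, for illustration only.
hits = [("ref_A", 0.001), ("ref_B", 0.2), ("ref_C", 0.002), ("ref_D", 0.01), ("ref_E", 0.03)]
candidates = candidate_references(hits)
```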

Collect provenance info

For each sample, collect info on which tool versions were used, which input files were used (including hashes) and which pipeline version was used for the analysis.
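
As a rough illustration, a per-sample provenance record could be assembled as follows in Python. The key names used here (`pipeline_name`, `tools`, `inputs`, `sha256`) and both function names are hypothetical and are not taken from the pipeline's actual provenance format.

```python
import hashlib
import io


def sha256_of(handle, chunk_size=65536):
    """Hash a binary file-like object in chunks, so large FASTQ files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def provenance_record(pipeline_name, pipeline_version, tool_versions, inputs):
    """Assemble a provenance dict; `inputs` is a list of (filename, binary handle) pairs."""
    return {
        "pipeline_name": pipeline_name,
        "pipeline_version": pipeline_version,
        "tools": [
            {"tool_name": name, "tool_version": version}
            for name, version in sorted(tool_versions.items())
        ],
        "inputs": [
            {"filename": filename, "sha256": sha256_of(handle)}
            for filename, handle in inputs
        ],
    }


# Hypothetical tool version and an in-memory stand-in for an input file.
record = provenance_record(
    "plasmid-screen", "0.1",
    {"example-tool": "1.0.0"},
    [("sample-01_R1.fastq.gz", io.BytesIO(b"ACGT"))],
)
```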

Remove support for --versioned_outdir flag

The --versioned_outdir flag hasn't proven to be useful in practice. Remove it.

Choose reference plasmid for each resistance plasmid reconstruction, not for each sample

Some samples harbour multiple resistance plasmids. We're currently selecting a single reference plasmid for each sample, but we should be choosing a reference plasmid for each resistance plasmid reconstruction.
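
The change amounts to grouping by (sample, plasmid reconstruction) instead of by sample alone. A Python sketch, assuming the mash results are available as (sample_id, plasmid_id, reference_id, distance) tuples; that layout and the function name are hypothetical.

```python
def best_reference_per_reconstruction(mash_rows):
    """Choose the lowest-distance reference for each (sample_id, plasmid_id) pair."""
    best = {}
    for sample_id, plasmid_id, reference_id, distance in mash_rows:
        key = (sample_id, plasmid_id)
        if key not in best or distance < best[key][1]:
            best[key] = (reference_id, distance)
    return {key: reference_id for key, (reference_id, _) in best.items()}


# Made-up identifiers and distances, for illustration only.
mash_rows = [
    ("sample-01", "plasmid_A", "ref_1", 0.010),
    ("sample-01", "plasmid_A", "ref_2", 0.002),
    ("sample-01", "plasmid_B", "ref_3", 0.005),
]
references = best_reference_per_reconstruction(mash_rows)
```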

Look for replicon genes using abricate with plasmidfinder database

We currently get some information about replicon genes from our mob-typer plasmid reports, but we could get more detailed information about exactly which contig the gene is located on, where within the contig, and the accuracy of the match if we were to collect that info using abricate with the plasmidfinder database.

Add another abricate process similar to the one we have, but using the plasmidfinder database.
