cmkobel / assemblycomparator2 Goto Github PK


🦠📇 Genomes to report pipeline - for Bacteria and Archaea

Home Page: https://assemblycomparator2.readthedocs.io

License: GNU General Public License v3.0

Shell 2.26% Python 78.26% R 5.94% Dockerfile 13.55%

microbial-genomics snakemake genomes-to-report hpc conda apptainer bioinformatics pbs slurm bioconda

assemblycomparator2's Issues

Report v3 progress

Almost everything is done, but some things are still missing.

Weird KeyError when using --until

assemblycomparator (shell) and workflow (py) don't agree on the files included

See "Warning: the file: .Rhistory doesn't look like a fasta file. Consider its inclusion." The file .Rhistory is not part of the list made by the shell script.

┌─┐┌─┐┌─┐┌─┐┌┬┐┌┐ ┬ ┬ ┬  ┌─┐┌─┐┌┬┐┌─┐┌─┐┬─┐┌─┐┌┬┐┌─┐┬─┐KMA
├─┤└─┐└─┐├┤ │││├┴┐│ └┬┘  │  │ ││││├─┘├─┤├┬┘├─┤ │ │ │├┬┘
┴ ┴└─┘└─┘└─┘┴ ┴└─┘┴─┘┴   └─┘└─┘┴ ┴┴  ┴ ┴┴└─┴ ┴ ┴ └─┘┴└─

Report issues at
    https://github.com/cmkobel/assemblycomparator/issues

Info: The blastp-identity threshold is set to 95 (default).
        (can be changed with the --blastp argument)

These are the 10 assemblies considered for project 2020_07_Ecoli_test_iqtree:
 B18_236667.fa
 B18_241039.fa
 B18_309150.fa
 B18_312563.fa
 B18_343222.fa
 B18_390375.fa
 B18_412827.fa
 B18_558476.fa
 B18_576661.fa
 B18_630107.fa

Do you wish to proceed? [y/n] y
 proceeding...
 activating environment...
 validating assembly files...
 backing up old content...
 archiving content...

These are the jobs:
BLASTP: 95
Warning: the file: .Rhistory doesn't look like a fasta file. Consider its inclusion.
cmp_copy_2020_07_Ecoli_test_iqtree_             shouldrun       0.00% [1/0/0/0]
cmp_kraken2_                                    shouldrun       0.00% [2/0/0/0]
cmp_abricate_                                   shouldrun       0.00% [2/0/0/0]
cmp_prokka_2020_07_Ecoli_test_iqtree_           shouldrun       0.00% [2/0/0/0]
cmp_summary_tables_2020_07_Ecoli_test_iqtree    shouldrun      88.24% [4/0/0/30]
cmp_mlst_2020_07_Ecoli_test_iqtree              shouldrun      83.33% [2/0/0/10]
cmp_roary_95_2020_07_Ecoli_test_iqtree          shouldrun      86.96% [3/0/0/20]
cmp_fasttree_2020_07_Ecoli_test_iqtree          shouldrun      83.33% [4/0/0/20]
cmp_iqtree_2020_07_Ecoli_test_iqtree            shouldrun      83.33% [4/0/0/20]
cmp_roary_plots_2020_07_Ecoli_test_iqtree       shouldrun      80.00% [5/0/0/20]
cmp_panito_2020_07_Ecoli_test_iqtree            shouldrun      80.00% [5/0/0/20]
cmp_mail_2020_07_Ecoli_test_iqtree              shouldrun      78.43% [11/0/0/40]

Do you wish to submit this job list? [y/n]

Roary hack

https://github.com/cmkobel/assemblycomparator2/blob/62d1518c59a421532cac5c8e6b812ff987970754/conda_envs/roary.yaml#L3

.. only works if you do:

export PERL5LIB=/home/cmkobel/assemblycomparator2/conda_base/b51707c89e77d1771344e5a65ab516a5_/lib/perl5/site_perl/5.22.0

But that is never gonna be a good solution.

Press ctrl-c to skip validation

GTDBtk

https://ecogenomics.github.io/GTDBTk/installing/index.html

kraken2 access on genomedk

export ASSCOM2_KRAKEN2_DB='/project/ClinicalMicrobio/faststorage/database/kraken2/k2_pluspf_20210127'

roary dir overwrites on change of thresholds

The blastp threshold should be part of the roary directory name.
That way, roary analyses run with different thresholds would not overwrite each other and could be compared.
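
A minimal sketch of the naming idea, assuming a hypothetical helper `roary_output_dir` and a hypothetical `roary_<threshold>` naming scheme (neither is taken from the pipeline):

```python
from pathlib import Path


def roary_output_dir(base_dir: str, blastp_identity: int) -> Path:
    """Return an output directory whose name encodes the blastp threshold.

    The "roary_<threshold>" naming scheme is an assumption for illustration.
    """
    if not 0 <= blastp_identity <= 100:
        raise ValueError("blastp_identity must be between 0 and 100")
    return Path(base_dir) / f"roary_{blastp_identity}"
```

With this, a run at 95 and a run at 90 end up in separate directories instead of sharing one.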

Considering removing singularity support and going with conda only.

If the pipeline is to work with conda for local setups anyway, I might as well ditch singularity, and focus all the debugging effort on the conda envs instead.

pseudo reads-ize the genomes for kraken2

Idea inspired by tseemann's shred-reads command for snippy.

The idea is that if you cut the genomes into reads, kraken will get much higher resolution. If I cut them into the size that the kraken database is made up of, the species recognition will become more fine-grained.
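
A rough sketch of the shredding step, not the snippy implementation. The function name `shred_sequence` and the default `read_length` and `step` values are assumptions; the appropriate read size for a given kraken database is not stated here.

```python
def shred_sequence(sequence: str, read_length: int = 150, step: int = 75) -> list[str]:
    """Cut one contig into overlapping pseudo-reads of a fixed length.

    read_length and step are placeholder values, not recommendations.
    """
    if read_length <= 0 or step <= 0:
        raise ValueError("read_length and step must be positive")
    if not sequence:
        return []
    if len(sequence) <= read_length:
        # Contig shorter than one read: keep it whole.
        return [sequence]
    last_start = len(sequence) - read_length
    reads = [sequence[i:i + read_length] for i in range(0, last_start + 1, step)]
    if last_start % step != 0:
        # Add a final read so the end of the contig is also covered.
        reads.append(sequence[-read_length:])
    return reads
```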

Visualization of GC content throughout contigs

Use tabseq for development.
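
For reference, the quantity to visualize can be sketched in plain Python without tabseq. The function name `gc_content_windows` and the window and step sizes are assumptions for illustration only.

```python
def gc_content_windows(sequence: str, window: int = 1000, step: int = 500) -> list[tuple[int, float]]:
    """Return (start position, GC fraction) for sliding windows over one contig."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    seq = sequence.upper()
    results = []
    # A contig shorter than one window still yields a single, shorter window.
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        if not chunk:
            break
        gc_count = chunk.count("G") + chunk.count("C")
        results.append((start, gc_count / len(chunk)))
    return results
```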

perl root and perl moo fails when the snakemake conda prefix variable is set

When I set the snakemake conda prefix variable to something, prokka (and possibly also mlst) fail because of problems with perl root and perl moo.

During installation, the variable is set as follows:
echo "export SNAKEMAKE_CONDA_PREFIX=${ASSCOM2_BASE}/conda_base" >> ~/.bashrc

A workaround is to delete the variable from ~/.bashrc and/or ~/.zshrc.

Note: this issue only pertains to using conda (--use-conda)

Remove Card and Plasmidfinder from report

Results are not so relevant I think.

Some rules should not run when there is only one sample.

This applies to many of the comparison tools: mashtree, roary, iqtree, etc.

The tools will just fail, which isn't a big deal, but it would be nice if this were handled more gracefully.
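
One possible way to handle this, sketched under assumptions: the set `COMPARISON_RULES` only lists the tools named above, and `rules_to_run` is a hypothetical helper, not part of the pipeline.

```python
# Rules that only make sense with two or more samples (taken from the list above;
# the real pipeline may have more).
COMPARISON_RULES = {"mashtree", "roary", "iqtree"}


def rules_to_run(requested: list[str], n_samples: int) -> list[str]:
    """Drop comparison rules when there are fewer than two samples."""
    if n_samples >= 2:
        return list(requested)
    return [rule for rule in requested if rule not in COMPARISON_RULES]
```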

Add some form of clustering

.. based on the minhash distances.

Update: or on gene absence/presence

Parametrize databases path

Consider if database setup can be made more parametric. Caveat with docker images is that it may be harder to acquire access to the databases. But maybe there is a way of doing that.

checkm2_download fails when using apptainer

Because it can't write the path setting.

The report will fail if the metadata has not been created

When running assemblycomparator2 --until <rule> on an uninitialized directory, for any rule other than metadata, the report will fail because the metadata file does not exist.

One solution is to have the metadata output as input in every single rule in the pipeline, so that metadata is forced to be created. Update: this solution might be suboptimal, as it means that all jobs will have to run again if you update the metadata?

Bootstrap the tree

Idea for solving the report caveat

Right now the report runs as a rule. But what about installing R and all the packages directly in the assemblycomparator2 conda environment? In that case it would be possible to generate the report as a final script call. I believe there is an option to define a script that runs when the pipeline has finished, but I simply cannot find it.
Alternatively, it could be defined as an extra step in the alias, so that when snakemake finishes (exit 0) a new call can be made. If rule report is not in the output list, one could simply write ... && snakemake <blabla> --until report and the report would then be produced.

Hope this makes sense when I read it in 2 years.

versioning of packages

Pin the versions of everything in the conda environments.

header should write the version number

`read` in shell is confusing

Pressing a mouse button or arrow keys emulates a "no"-answer.

It would be better if the user had to confirm the entered value by pressing Enter.
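
A sketch of a stricter prompt, written in Python rather than shell. The names `parse_confirmation` and `confirm` are hypothetical. Anything that is not an explicit yes or no, including stray escape sequences from arrow keys, leads to a new prompt instead of counting as "no".

```python
from typing import Callable, Optional


def parse_confirmation(answer: str) -> Optional[bool]:
    """Return True for yes, False for no, and None for anything else."""
    normalized = answer.strip().lower()
    if normalized in {"y", "yes"}:
        return True
    if normalized in {"n", "no"}:
        return False
    return None


def confirm(prompt: str, ask: Callable[[str], str] = input) -> bool:
    """Ask until the user enters an explicit answer and confirms it with Enter."""
    while True:
        decision = parse_confirmation(ask(f"{prompt} [y/n] "))
        if decision is not None:
            return decision
```

The `ask` parameter only exists so the prompt can be tested without a terminal.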

Use a hash table to check if an assembly has already been annotated (prokka).

And then prefill the annotation.
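
A minimal sketch of the lookup, assuming the cache is a plain dictionary from file digest to an existing annotation location. The helper names `assembly_digest` and `lookup_annotation` and the choice of SHA-256 are assumptions.

```python
import hashlib
from typing import Optional


def assembly_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of an assembly file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def lookup_annotation(path: str, cache: dict[str, str]) -> Optional[str]:
    """Return the cached annotation location for this assembly, if any."""
    return cache.get(assembly_digest(path))
```

Hashing the file content rather than the filename means a renamed but identical assembly would still hit the cache.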

Resistance tables are much too wide

Solution: Pivot the table, so that the samples are columns and the resistance gene calls are rows.
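
The pivot can be sketched with the standard library only. The input shape (a list of (sample, gene) pairs), the "+" marker, and the function name `pivot_resistance_calls` are assumptions; the report itself is written in R and may do this differently.

```python
def pivot_resistance_calls(calls: list[tuple[str, str]]) -> list[list[str]]:
    """Turn (sample, gene) pairs into rows per gene with one column per sample."""
    present = set(calls)
    samples = sorted({sample for sample, _gene in present})
    genes = sorted({gene for _sample, gene in present})
    table = [["gene"] + samples]
    for gene in genes:
        # "+" marks a call, an empty string marks absence.
        table.append([gene] + ["+" if (sample, gene) in present else "" for sample in samples])
    return table
```

The table then grows downwards with the number of genes and only grows sideways with the number of samples.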

gtdbtk running time should be scaled to number of samples

https://github.com/cmkobel/assemblycomparator2/blob/2d1a2b0cd131f67306c68d35dfb582592955549c/snakefile#L537

For that matter, all batch-running rules should be scaled accordingly.
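
A sketch of a linear scaling rule. The function name and both constants are placeholders, not measured values for gtdbtk or any other tool.

```python
def scaled_runtime_minutes(n_samples: int, base_minutes: int = 60, minutes_per_sample: int = 5) -> int:
    """Return a time limit that grows linearly with the number of samples.

    base_minutes and minutes_per_sample are placeholder values.
    """
    if n_samples < 0:
        raise ValueError("n_samples must not be negative")
    return base_minutes + minutes_per_sample * n_samples
```

Snakemake accepts callables for resource values, so a function like this could in principle be wrapped and used in a rule's resources.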

Roary doesn't create a unique file on the blastp_identity

It should, because otherwise it might not rerun if the user changes the setting between runs.

Sliding k-mer frequency visualization.

Use tabseq to calculate frequencies over the contigs. Visualize with some beautiful colors.

Why is there a || ?

https://github.com/cmkobel/assemblycomparator/blob/98713e7c490f6ce7e0a084c07a772fe90bafe49c/workflow_templates.py#L246

Should only pass if the dir exists already.

assemblycomparator2/snakefile

Line 92 in 59b3eef

except:

CheckM

Confusing dirnames

Rename docker_imgs to docker_files.

Makes more sense.

Use python script instead of bash alias

Make a real install script that configures that python script.

The python script should use argparse and call the subsequent snakemake pipeline.

With default inputs/outputs, the pipeline can still be started the same way as before (as with the alias).
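
A sketch of what such a launcher could look like. The option names `--input-dir` and `--cores` and the default snakefile path are assumptions; `--snakefile`, `--cores`, `--directory` and `--until` are existing snakemake options. The sketch only builds the command and does not run it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command line interface with defaults, so a bare call works like the alias."""
    parser = argparse.ArgumentParser(prog="assemblycomparator2")
    parser.add_argument("--input-dir", default=".", help="directory with the assemblies")
    parser.add_argument("--cores", type=int, default=1, help="cores passed on to snakemake")
    parser.add_argument("--until", default=None, help="stop after this rule")
    return parser


def build_snakemake_command(args: argparse.Namespace, snakefile: str = "snakefile") -> list[str]:
    """Translate the parsed arguments into a snakemake call (as an argument list)."""
    command = [
        "snakemake",
        "--snakefile", snakefile,
        "--cores", str(args.cores),
        "--directory", args.input_dir,
    ]
    if args.until:
        command += ["--until", args.until]
    return command
```

The resulting list could then be handed to `subprocess.run(command, check=True)`.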

add option to skip any2fasta

Often the user already knows that the genomes are valid fasta files, so any2fasta can be skipped. This matters for datasets of ~1000 genomes.

Resume analysis on error

If a .gwf directory already exists in the run folder, assemblycomparator can skip everything before running gwf.

pseudotargets

Make a pseudotarget named "cheap" or "fast" that just runs the quick stuff like:

sequence_lengths
assembly_stats
mashtree
?something more?

spaces

It seems that asscom2 cannot handle spaces in the parent path of the working directory,
nor in the filenames of the assemblies.
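
The cause is not identified here. One common source of this kind of failure is unquoted paths in generated shell commands; assuming that is the case, quoting with `shlex.quote` is one way to guard against it. The helper `copy_command` is hypothetical.

```python
import shlex


def copy_command(source: str, destination: str) -> str:
    """Build a shell copy command that survives spaces in either path."""
    return f"cp {shlex.quote(source)} {shlex.quote(destination)}"
```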

Check that kraken path exists before starting the workflow

This gives the user a chance to fix the problem ahead of time.
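
A sketch of such a pre-flight check, assuming the database location is read from the ASSCOM2_KRAKEN2_DB environment variable. The function name `kraken2_db_problem` is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def kraken2_db_problem(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a description of the problem, or None if the database path looks usable."""
    value = environ.get("ASSCOM2_KRAKEN2_DB")
    if not value:
        return "ASSCOM2_KRAKEN2_DB is not set"
    if not Path(value).is_dir():
        return f"ASSCOM2_KRAKEN2_DB points to a missing directory: {value}"
    return None
```

Calling this before the workflow starts lets the launcher print the message and exit early instead of failing inside the kraken2 job.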

Continuous integration with some testing?

Running a few tests.

- Installing, both with conda and apptainer
- Downloading databases.
- Running a few tests and checking that everything is produced correctly:
  - Single sample
  - Handful of samples
- Set up on thylakoid with cron
  - Enable error reporting through email or something..

Pseudo rules like interproscan and prokka could be removed

since --until would work just as well.

I wonder why I never thought about that, as if all repeated jobs must end in a single??

Pan/core legend is way too big

Solution: Make it into a continuous color scale legend bar.

Support for different sub analyses

--cpo runs the CPO-relevant analyses (plasmids, resistance, etc)

--agg runs something relevant to Aggregatibacter

etc..
