
cmkobel / assemblycomparator2


🦠📇 Genomes to report pipeline - for Bacteria and Archaea

Home Page: https://assemblycomparator2.readthedocs.io

License: GNU General Public License v3.0

Shell 2.26% Python 78.26% R 5.94% Dockerfile 13.55%
microbial-genomics snakemake genomes-to-report hpc conda apptainer bioinformatics pbs slurm bioconda

assemblycomparator2's Introduction

                                      🦠  🧫  🔬  👾  🧪  💉  ☣

Hello. I'm a bioinformatician working in microbial ecology. I have a focus on software engineering and holo-omics. I research the interaction between gut microbes and their hosts. I try to keep my pipelines lean and share my workflows here on GitHub, mostly to train myself to make my work portable and well documented. I largely follow people on GitHub who are into microbiology and omic-integrative methods.

My opinion on holo-omics is that we need to do more fundamental research and basic bioinformatic tool sharpening before we can truly dive in and understand the holistic interactions in the holobiont super-organisms. That is why I am currently also focusing on developing basic stuff like annotation and pipeline tools.

🚀 My current flagship project is the Assemblycomparator2 genomes-to-report pipeline which has just been published on bioconda.

I'm looking for a postdoctoral fellowship position in Copenhagen or online from spring 2025. Let me know if you have a lead.

                                   🦾   🔬   💻   🔣   💾   🚲   🧬

assemblycomparator2's People

Contributors

cmkobel, oliverkjhansen

assemblycomparator2's Issues

Resume analysis on error

If a .gwf directory already exists in the run folder, assemblycomparator can skip everything before running gwf.

Use python script instead of bash alias

Make a real install script that configures that python script.

The python script should use argparse and call the subsequent snakemake pipeline.

Default inputs/outputs mean that the pipeline can still be started the same way as before (as with the alias).
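
A minimal sketch of such a launcher, under stated assumptions: the option names `--input-dir` and `--output-dir`, the config keys `input_dir` and `output_dir`, and the Snakefile path are hypothetical and not taken from the project. Only `--snakefile`, `--cores`, `--config` and `--until` are existing snakemake options. The function builds the command without running it, so it can be inspected first.

```python
import argparse

# Hypothetical location of the pipeline's Snakefile; adjust to the real install.
DEFAULT_SNAKEFILE = "snakefile"


def build_parser():
    """Create the argument parser; every option has a default."""
    parser = argparse.ArgumentParser(prog="assemblycomparator2")
    # Defaults let the launcher be called with no arguments, like the alias.
    parser.add_argument("--input-dir", default=".", help="directory with assemblies")
    parser.add_argument("--output-dir", default="output", help="where results go")
    parser.add_argument("--until", default=None, help="stop after this rule")
    parser.add_argument("--cores", type=int, default=1, help="cores for snakemake")
    return parser


def build_snakemake_command(args):
    """Translate parsed arguments into a snakemake command (list of strings)."""
    cmd = [
        "snakemake",
        "--snakefile", DEFAULT_SNAKEFILE,
        "--cores", str(args.cores),
        "--config",
        f"input_dir={args.input_dir}",    # hypothetical config key
        f"output_dir={args.output_dir}",  # hypothetical config key
    ]
    if args.until:
        cmd += ["--until", args.until]
    return cmd
```

The resulting list could then be handed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single string avoids shell word splitting.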

pseudotargets

Make a pseudotarget named "cheap" or "fast" that just runs the quick stuff like:

  • sequence_lengths
  • assembly_stats
  • mashtree
  • ?something more?
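
In a Snakefile this would typically be a rule that only lists inputs. As a rough illustration of the same idea on the launcher side, the sketch below maps a pseudotarget name to the rule names listed above. The helper `expand_targets` and the choice of the name "fast" are assumptions made for illustration.

```python
# Pseudotarget names mapped to the rules they stand for.
# Only the three rules named in this issue are included.
PSEUDOTARGETS = {
    "fast": ["sequence_lengths", "assembly_stats", "mashtree"],
}


def expand_targets(requested):
    """Replace pseudotarget names with their member rules.

    Order is preserved and duplicates are dropped. Names that are not
    pseudotargets are passed through unchanged.
    """
    expanded = []
    for name in requested:
        for rule in PSEUDOTARGETS.get(name, [name]):
            if rule not in expanded:
                expanded.append(rule)
    return expanded
```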

add option to skip any2fasta

Often the user already knows that the genomes are in good shape, so any2fasta can be skipped. Think of datasets with ~1000 genomes.

The report will fail if the metadata has not been created

When running assemblycomparator2 --until <rule> for any rule other than metadata in an uninitialized directory, the report fails because the metadata file does not exist.

One solution is to have the metadata output as an input to every single rule in the pipeline, so that metadata is forced to be created. Update: this solution might be suboptimal, as it means that all jobs will have to run again if you update the metadata?

kraken2 access on genomedk

export ASSCOM2_KRAKEN2_DB='/project/ClinicalMicrobio/faststorage/database/kraken2/k2_pluspf_20210127'

assemblycomparator (shell) and workflow (py) don't agree on the files included

See "Warning: the file: .Rhistory doesn't look like a fasta file. Consider its inclusion." in the output below. That file is not part of the list made by the shell script.

┌─┐┌─┐┌─┐┌─┐┌┬┐┌┐ ┬ ┬ ┬  ┌─┐┌─┐┌┬┐┌─┐┌─┐┬─┐┌─┐┌┬┐┌─┐┬─┐KMA
├─┤└─┐└─┐├┤ │││├┴┐│ └┬┘  │  │ ││││├─┘├─┤├┬┘├─┤ │ │ │├┬┘
┴ ┴└─┘└─┘└─┘┴ ┴└─┘┴─┘┴   └─┘└─┘┴ ┴┴  ┴ ┴┴└─┴ ┴ ┴ └─┘┴└─

Report issues at
    https://github.com/cmkobel/assemblycomparator/issues

Info: The blastp-identity threshold is set to 95 (default).
        (can be changed with the --blastp argument)

These are the 10 assemblies considered for project 2020_07_Ecoli_test_iqtree:
 B18_236667.fa
 B18_241039.fa
 B18_309150.fa
 B18_312563.fa
 B18_343222.fa
 B18_390375.fa
 B18_412827.fa
 B18_558476.fa
 B18_576661.fa
 B18_630107.fa

Do you wish to proceed? [y/n] y
 proceeding...
 activating environment...
 validating assembly files...
 backing up old content...
 archiving content...

These are the jobs:
BLASTP: 95
Warning: the file: .Rhistory doesn't look like a fasta file. Consider its inclusion.
cmp_copy_2020_07_Ecoli_test_iqtree_             shouldrun       0.00% [1/0/0/0]
cmp_kraken2_                                    shouldrun       0.00% [2/0/0/0]
cmp_abricate_                                   shouldrun       0.00% [2/0/0/0]
cmp_prokka_2020_07_Ecoli_test_iqtree_           shouldrun       0.00% [2/0/0/0]
cmp_summary_tables_2020_07_Ecoli_test_iqtree    shouldrun      88.24% [4/0/0/30]
cmp_mlst_2020_07_Ecoli_test_iqtree              shouldrun      83.33% [2/0/0/10]
cmp_roary_95_2020_07_Ecoli_test_iqtree          shouldrun      86.96% [3/0/0/20]
cmp_fasttree_2020_07_Ecoli_test_iqtree          shouldrun      83.33% [4/0/0/20]
cmp_iqtree_2020_07_Ecoli_test_iqtree            shouldrun      83.33% [4/0/0/20]
cmp_roary_plots_2020_07_Ecoli_test_iqtree       shouldrun      80.00% [5/0/0/20]
cmp_panito_2020_07_Ecoli_test_iqtree            shouldrun      80.00% [5/0/0/20]
cmp_mail_2020_07_Ecoli_test_iqtree              shouldrun      78.43% [11/0/0/40]

Do you wish to submit this job list? [y/n]
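
One way to keep the two lists in agreement would be a single selection function that both sides use. The sketch below is an assumption about how that could look: the set of accepted extensions is hypothetical, and the rule that hidden files are skipped is inferred from the `.Rhistory` warning above.

```python
from pathlib import PurePath

# Assumed set of accepted extensions; the real pipeline may accept others.
FASTA_EXTENSIONS = {".fa", ".fasta", ".fna"}


def select_assemblies(filenames):
    """Return the sorted names that look like fasta assemblies.

    Hidden files (leading dot) and files with other extensions are left out.
    """
    selected = []
    for name in filenames:
        path = PurePath(name)
        if path.name.startswith("."):
            continue  # e.g. .Rhistory
        if path.suffix.lower() in FASTA_EXTENSIONS:
            selected.append(name)
    return sorted(selected)
```

With the directory content from the log above, `.Rhistory` would be dropped before the job list is built.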

spaces

It seems that asscom2 cannot handle spaces in the parent path of the working directory, nor in the filenames of the assemblies.
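
Problems like this usually come from unquoted paths being split on whitespace by the shell. A small sketch of quoting with the standard library, where the helper name `quoted_command` is hypothetical and the program name is only a placeholder:

```python
import shlex


def quoted_command(program, paths):
    """Build a shell command string in which every path is safely quoted."""
    return " ".join([program] + [shlex.quote(str(p)) for p in paths])


# shlex.split shows how a POSIX shell would tokenize the string again.
command = quoted_command("any2fasta", ["my genomes/sample 1.fa"])
tokens = shlex.split(command)
```

Where no shell is needed at all, passing the arguments as a list to `subprocess.run` avoids the quoting question entirely.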

`read` in shell is confusing

Pressing a mouse button or an arrow key is interpreted as a "no" answer.

It would be better if the user had to confirm the entered value by pressing Enter.
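
If the prompt moves into the python launcher, a line-based prompt would give this behavior, because `input` only returns after Enter is pressed. A sketch under that assumption; the function name `confirm` and the `read_line` parameter (there only to make testing possible) are hypothetical.

```python
def confirm(prompt, read_line=input):
    """Ask until the user types y or n and presses Enter.

    Anything else, including empty input or stray escape sequences from
    arrow keys, is ignored and the question is asked again.
    """
    while True:
        answer = read_line(f"{prompt} [y/n] ").strip().lower()
        if answer in ("y", "yes"):
            return True
        if answer in ("n", "no"):
            return False
```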

Neat conda package

Instead of setting up aliases and system variables, and installing snakemake+mamba, maybe there could just be a conda package that would do all that for you.
Setting up the slurm/pbs stuff should still be manual, but that is a different topic.

Report v3 progress

Almost all is done, but some things are missing

  • mlst
  • abricate
  • fasttree/quicktree on core (consider skipping)
  • bakta (just copy-paste prokka)
  • eggnog (not so urgent)
  • gapseq (Should be easy, just make a heatmap of pathways)
  • dbcan results
  • panaroo

Idea for solving the report caveat

  1. Right now the report runs as a rule. But what about installing R and all the packages directly in the assemblycomparator2 conda environment? In that case it would be possible to produce the report as a final script call. I believe there is a way to define a script that runs when the pipeline has finished, but I simply cannot find it.

  2. Alternatively, it could be defined as an extra step in the alias, so that when snakemake finishes (exit 0) a new call can be made. If rule report is not in the output list, one could simply write ... && snakemake <blabla> --until report, and the report would come out.

I hope this makes sense when I read it in 2 years.
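
The second option, chaining a report call after a successful pipeline run, could look roughly like this if the launcher is written in python. The function name and the `run` parameter are hypothetical; `run` defaults to `subprocess.run` and is only replaceable so the logic can be tested without starting snakemake. Snakemake also documents `onsuccess` handlers in the Snakefile, which may be the mechanism the first option is looking for.

```python
import subprocess


def run_then_report(pipeline_cmd, report_cmd, run=subprocess.run):
    """Run the pipeline; only if it exits with 0, run the report command.

    Returns the exit code of the last command that was run.
    """
    first = run(pipeline_cmd)
    if first.returncode != 0:
        return first.returncode  # same short-circuit as `a && b` in the shell
    return run(report_cmd).returncode
```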

pseudo reads-ize the genomes for kraken2

Idea inspired by tseemann's shred-reads command for snippy.

The idea is that if you cut the genomes into reads, kraken will get much higher resolution. If I cut into the size that the kraken database is made up of, the species recognition will become more granular.
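
A rough sketch of the shredding step, not the shred-reads implementation from snippy. The default `read_length` and `step` values are arbitrary assumptions. A trailing remainder shorter than one step is dropped in this version.

```python
def shred(sequence, read_length=150, step=75):
    """Cut a sequence into overlapping pseudo-reads of fixed length.

    Windows start every `step` bases. Sequences no longer than
    `read_length` are returned as a single read.
    """
    if read_length <= 0 or step <= 0:
        raise ValueError("read_length and step must be positive")
    if not sequence:
        return []
    if len(sequence) <= read_length:
        return [sequence]
    last_start = len(sequence) - read_length
    return [sequence[i:i + read_length] for i in range(0, last_start + 1, step)]
```

Each pseudo-read could then be written as its own fasta record and classified separately, so that a genome yields a distribution of hits instead of a single call.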

Continuous integration with some testing?

Running a few tests.

  • Installing, both with conda and apptainer
  • Downloading databases.
  • Running a few tests and checking that everything is produced correctly:
    • Single sample
    • Handful of samples
  • Set up on thylakoid with cron
    • Enable error reporting through email or similar.

Parametrize databases path

Consider whether the database setup can be made more parametric. A caveat with docker images is that it may be harder to get access to the databases. But maybe there is a way of doing that.
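
One possible shape for this, sketched under assumptions: each database gets an environment variable following the pattern of `ASSCOM2_KRAKEN2_DB`, with a fallback below a common root directory. Generalizing that pattern to other databases, the helper name `resolve_database` and the fallback layout are all hypothetical.

```python
import os


def resolve_database(name, default_root, environ=None):
    """Return the path for a database.

    An ASSCOM2_<NAME>_DB environment variable wins if it is set and
    non-empty; otherwise the path falls back to <default_root>/<name>.
    """
    environ = os.environ if environ is None else environ
    variable = f"ASSCOM2_{name.upper()}_DB"
    return environ.get(variable) or os.path.join(default_root, name)
```

With docker or apptainer images, the resolved host path would still have to be bind-mounted into the container, which this sketch does not address.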

perl root and perl moo fails when the snakemake conda prefix variable is set

When I set the snakemake conda prefix variable to something, prokka (and possibly also mlst) fail because of problems with perl root and perl moo.

In the installation, the variable is set as such.
echo "export SNAKEMAKE_CONDA_PREFIX=${ASSCOM2_BASE}/conda_base" >> ~/.bashrc

A workaround is to delete the variable from ~/.bashrc and/or ~/.zshrc.

Note: this issue only pertains to using conda (--use-conda)
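
If the pipeline is started from a python launcher, the variable could also be left out for a single run instead of being deleted from the shell configuration. This is only a sketch; whether it avoids the perl problems has not been tested here, and the helper name is hypothetical.

```python
import os


def environment_without(variable, environ=None):
    """Return a copy of the environment with one variable left out."""
    source = os.environ if environ is None else environ
    return {key: value for key, value in source.items() if key != variable}
```

The result can be passed as the `env` argument of `subprocess.run` when calling snakemake, for example `env=environment_without("SNAKEMAKE_CONDA_PREFIX")`.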

Timed report

After 24 hours, or when everything is done:
write a report, with the results that have been created so far.
