cvua-rrw / foodme

A reproducible and scalable snakemake workflow for the analysis of DNA metabarcoding experiments, with a special focus on food and feed samples.

Home Page: https://cvua-rrw.github.io/FooDMe

License: BSD 3-Clause "New" or "Revised" License

Python 76.04% Shell 16.44% R 7.52%
ngs ngs-pipeline metabarcoding targeted-sequencing snakemake food-monitoring public-health food-authenticity workflow pipeline

foodme's People

Contributors

actions-user, gregdenay


Forkers

gregdenay

foodme's Issues

Snakemake version

Version 7 now released.
Check compatibility and update where necessary.

How can I use this workflow for Illumina single-end data?

Thank you for developing this workflow.

I really want to analyze my data using this workflow, but I only have Illumina single-end data. How can I use this workflow for Illumina single-end data?

I created the samplesheet.tsv for only fq1, but it gives me this error:

InputFunctionException in rule primer_trimming_stats in file /home/karta0/pbs-files/food-auth/FooDMe/workflow/rules/trimming.smk, line 89:
Error:
  KeyError: "None of [Index(['fq2'], dtype='object')] are in the [index]"
Wildcards:
  sample=1
Traceback:
  File "/home/karta0/pbs-files/food-auth/FooDMe/workflow/rules/trimming.smk", line 93, in <lambda>
  File "/home/karta0/pbs-files/food-auth/FooDMe/workflow/rules/common.smk", line 30, in get_fastq
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexing.py", line 1067, in __getitem__
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexing.py", line 1247, in _getitem_tuple
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexing.py", line 991, in _getitem_lowerdim
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexing.py", line 1301, in _getitem_axis
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexing.py", line 1239, in _getitem_iterable
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexing.py", line 1432, in _get_listlike_indexer
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 6070, in _get_indexer_strict
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 6130, in _raise_if_missing

I would greatly appreciate any recommendation or suggestion you can give me.
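For reference, the KeyError above comes from a hard-coded selection of the fq1 and fq2 columns in the samplesheet. A minimal sketch of a single-end aware lookup (illustrative only; the real get_fastq in workflow/rules/common.smk may look different):

```python
import pandas as pd

def get_fastq(samples: pd.DataFrame, sample: str):
    """Return the fastq paths for a sample, tolerating a missing fq2 column.

    Hypothetical variant of get_fastq: only select the read columns
    that actually exist in the samplesheet, and drop empty cells.
    """
    cols = [c for c in ("fq1", "fq2") if c in samples.columns]
    return samples.loc[sample, cols].dropna().tolist()

# A single-end samplesheet with only an fq1 column:
samples = pd.DataFrame({"fq1": ["reads/1_R1.fastq.gz"]}, index=["1"])
```

With this guard the lookup returns a one-element list for single-end data instead of raising a KeyError on the absent fq2 column.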

Implement organized and useful logging

  • All rules need a log block
  • Logging should provide useful and explicit information
  • Logging should be organized in fewer files

1 log file per sample (+1 common and 1 global)
Log rule name and rule stdout and stderr if any
Log verbose output but keep the console clean.
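The points above could translate into a rule convention along these lines (a sketch only; rule, tool, and path names are made up):

```snakemake
rule trim_reads:
    input:
        "results/{sample}/reads.fastq"
    output:
        "results/{sample}/trimmed.fastq"
    log:
        # one log file per sample, named after the rule
        "logs/{sample}/trim_reads.log"
    shell:
        # keep the console clean: redirect stdout and stderr to the log
        "trimming_tool {input} > {output} 2> {log}"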

[Request] Organism name in Benchmarking report

From Dr. Lina-Juana Dolch (CVUA-RRW):

Is your feature request related to a problem? Please describe.
No

Describe the solution you'd like
I would like to have the organism name appear in the benchmarking report (confusion matrix) next to the taxid.

Describe alternatives you've considered
Checking each taxid on www.ncbi.nlm.nih.gov/Taxonomy/Browser, which is time consuming.

Additional context
Please
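One possible way to resolve taxids to names without querying the NCBI website is to read them from the names.dmp file of the NCBI taxdump, which the pipeline already downloads. A sketch (the field layout of names.dmp is tab-pipe delimited):

```python
def load_scientific_names(lines):
    """Map taxid -> scientific name from NCBI names.dmp lines.

    names.dmp fields are separated by '\t|\t': taxid, name, unique name,
    name class. Only rows classed 'scientific name' are kept.
    """
    names = {}
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[fields[0]] = fields[1]
    return names

# Two example rows (taxid 9913 is Bos taurus):
lines = [
    "9913\t|\tBos taurus\t|\t\t|\tscientific name\t|",
    "9913\t|\tdomestic cattle\t|\t\t|\tcommon name\t|",
]
```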

Software versions missing

Version

Foodme Version: 1.5.0

Bug report

The Bash block for env file parsing is broken. Save the headache and replace it with a functioning Python script.

Add warnings for low quality input data

Should some warning be shown (e.g. colored table entries) when some quality values are low?

Problems:

  • How to decide what values are sufficient?
  • How to adapt them to different applications?

Rule prep_taxonomy fails in 1.4.5

rule prep_taxonomy fails for taxid_filter value None

taxonomy_prep.log:

Traceback (most recent call last):
  File "/home/warmann/foodme/FooDMe/.snakemake/scripts/tmpck1f1v1d.filter_taxonomy.py", line 23, in <module>
    main(snakemake.params['nodes'],
  File "/home/warmann/foodme/FooDMe/.snakemake/scripts/tmpck1f1v1d.filter_taxonomy.py", line 18, in main
    tax.prune(taxid)
  File "/home/warmann/foodme/FooDMe/.snakemake/conda/e5bea28993829f6373b74c29065f59fe/lib/python3.10/site-packages/taxidTools/Taxonomy.py", line 600, in prune
    nodes = self.getAncestry(taxid)
  File "/home/warmann/foodme/FooDMe/.snakemake/conda/e5bea28993829f6373b74c29065f59fe/lib/python3.10/site-packages/taxidTools/Taxonomy.py", line 313, in getAncestry
    return Lineage(self[str(taxid)])
  File "/home/warmann/foodme/FooDMe/.snakemake/conda/e5bea28993829f6373b74c29065f59fe/lib/python3.10/collections/__init__.py", line 1106, in __getitem__
    raise KeyError(key)
KeyError: 'None'
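The traceback suggests the unset config value reaches the script as the string 'None' and is passed straight to tax.prune. A possible guard (names are illustrative, not the actual fix):

```python
def should_prune(taxid_filter):
    """Return True only when a usable taxid filter was configured.

    Guards against both a real None and the string 'None' that an
    unset YAML value can turn into on its way through the workflow.
    """
    return taxid_filter not in (None, "None", "")
```

With such a check, prep_taxonomy could skip pruning entirely when taxid_filter is unset instead of crashing on KeyError: 'None'.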

Missing error values

Version

Foodme Version: 1.5.0

Bug report
Some error calculations give NaN as output. This shouldn't happen.
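NaN typically appears in such metrics when a denominator is zero (e.g. no predicted or no expected positives). One convention, sketched below, is to report 0.0 instead; whether 0.0 or an explicit "NA" is the right choice here is a design decision:

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall, guarding the zero-denominator
    cases that would otherwise yield NaN (or a ZeroDivisionError)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```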

Deploy with minimal env requirements

As of now taxidtools needs to be installed in the base environment, which prevents deployment with snakedeploy.
Taxidtools should instead be a per-rule requirement managed in conda environments.

May need to transform some run blocks into script or shell blocks.

[BUG] Ambiguous characters in Primer sequences are not treated as such

Version

Foodme Version: 1.6.5
OS: Debian buster
Snakemake version: 7.25.0

Bug report

What happened: Cutadapt is not aware of IUPAC encoding and treats primer sequences as-is. This inflates the error rate of primer matching when primers contain ambiguous characters, and many reads are lost to filtering.

What I expected to happen: I expect the workflow to be aware of IUPAC encoding for ambiguous characters.

Logs

Additional context

This behaviour is not clear from the FooDMe documentation.
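For illustration, an ambiguous primer can be pre-expanded into its concrete sequence variants before matching. A sketch (whether pre-expansion or a cutadapt option is the right fix is up to the maintainers):

```python
from itertools import product

# IUPAC ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_primer(primer):
    """Expand a primer with IUPAC ambiguity codes into all concrete variants."""
    pools = (IUPAC[base] for base in primer.upper())
    return ["".join(variant) for variant in product(*pools)]
```

For example, the primer "AR" expands to "AA" and "AG"; note that expansion grows multiplicatively with each ambiguous position.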

Documentation

  • Rework README
  • need a proper guide
  • FAQ
  • maybe some kind of tutorial for benchmarking (together with the benchmarking module) on different kinds of samples

Calculation of components concentration

Component concentrations are based on all reads making it through the analysis, including unassigned reads.
The expected behaviour is to not take these reads into account; however, outputting the proportion of unassigned reads can be useful for QC.

Solution: build two columns in the report:

  • one for quantification based on all reads
  • one for quantification of assigned reads only
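The two-column proposal boils down to normalizing each taxon's count by two different totals. A small sketch (names are illustrative):

```python
def quantification(assigned_counts, unassigned):
    """Per-taxon proportions computed two ways:
    over all reads (including unassigned) and over assigned reads only."""
    total_assigned = sum(assigned_counts.values())
    total_all = total_assigned + unassigned
    return {
        taxon: (
            n / total_all if total_all else 0.0,       # column 1: all reads
            n / total_assigned if total_assigned else 0.0,  # column 2: assigned only
        )
        for taxon, n in assigned_counts.items()
    }

# 100 assigned reads across two taxa, plus 100 unassigned reads:
props = quantification({"Bos taurus": 80, "Sus scrofa": 20}, unassigned=100)
```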

[BUG] Incorrect parsing of the `trim_primers_3end` parameter

Version

Foodme Version: 1.6.5
OS : Debian buster
Snakemake version: 7.25.0

Bug report

What happened: The above-mentioned parameter is parsed as a boolean (True/False), but the cutadapt shell block expects it as the string false, so the rule always defaults to trimming on both ends.

What I expected to happen: False should be evaluated as False

Logs

Additional context
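The underlying pitfall is that bool("False") is True in Python, so a flag that arrives as a string is always truthy. A defensive parser (a sketch; function name is made up):

```python
def parse_flag(value):
    """Coerce a config flag that may arrive as a bool or as a string.

    Note the trap this guards against: bool('False') evaluates to True,
    which is how a 'False' setting can silently behave as enabled.
    """
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("true", "1", "yes")
```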

Speed up taxdump parsing

Repeated parsing of the taxdumps files takes up a large part of runtime.

  • parse once and filter according to the "taxid_filter" argument + maybe a list of ranks?
  • save as json
  • load the json in subsequent steps
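The parse-once-then-reload idea could look roughly like this (a generic caching sketch; the actual taxdump parser and file names are placeholders):

```python
import json
import tempfile
from pathlib import Path

def cached_parse(source: Path, cache: Path, parse):
    """Parse `source` once, store the result as JSON, and serve the
    JSON cache on every subsequent call."""
    if cache.exists():
        return json.loads(cache.read_text())
    data = parse(source)
    cache.write_text(json.dumps(data))
    return data

# Demo with a stand-in parser; `calls` counts how often parsing really runs.
calls = []
def fake_parse(path):
    calls.append(path)
    return {"1": "root", "2": "Bacteria"}

workdir = Path(tempfile.mkdtemp())
src = workdir / "nodes.dmp"
src.write_text("stand-in taxdump content")
cache = workdir / "taxonomy.json"

first = cached_parse(src, cache, fake_parse)
second = cached_parse(src, cache, fake_parse)  # served from the JSON cache
```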

Better error catching for Dada

So far, any error raised in the Dada2 R script results in blank output.
This is needed to handle NTC samples, where no reads cluster, but it also makes debugging and problem finding harder and should be corrected.

Implement proper error catching and handling (as far as R allows it) in the script.
The analysis should go on if a sample does not lead to cluster formation (no/few reads, bad quality), and the pipeline should stop on error in all other cases.

Remove wrapper

Makes the Snakemake config file less messy.
Less maintenance.
More traceability.

Taxid Blocklist

Users may want to prevent a subset of taxids from being identified and showing up in the results (e.g. common contaminants, extinct species, ...).

These taxids could be listed in a file that is provided as optional argument to the pipeline and merged with other filters during the database masking step.
This list should also appear in the report for the audit trail.
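The blocklist file format could be as simple as one taxid per line. A parsing sketch (the format itself is an assumption, not the implemented one):

```python
def parse_blocklist(lines):
    """Collect taxids from a blocklist: one taxid per line,
    with blank lines and '#' comments ignored."""
    taxids = set()
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            taxids.add(entry)
    return taxids
```

The resulting set could then be merged with the other taxid filters during the database masking step, as described above.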

Ion torrent sequences

Hello, how are you?

I would like to know if I can use single-end sequences from Ion Torrent and, if so, what should I do?

Thank you very much.

Foodme environment error

Under Debian 11

Job 1: Generating global html report

Activating conda environment: .snakemake/conda/49331715801f124dd0f1cdc91c25109a
ERROR: This cross-compiler package contains no program /home/warmann/foodme/FooDMe/.snakemake/conda/49331715801f124dd0f1cdc91c25109a/bin/x86_64-conda_cos6-linux-gnu-gfortran
INFO: activate-gfortran_linux-64.sh made the following environmental changes:
+HOST=x86_64-conda_cos6-linux-gnu
-HOST=x86_64-conda-linux-gnu
[Fri May  6 10:33:09 2022]
Finished job 1.
27 of 28 steps (96%) done
Select jobs to execute...

[Fri May  6 10:33:09 2022]
localrule all:
    input: reports/report.html
    jobid: 0
    resources: tmpdir=/tmp

[Fri May  6 10:33:09 2022]
Finished job 0.
28 of 28 steps (100%) done
Complete log: .snakemake/log/2022-05-06T102756.305089.snakemake.log

Workflow finished, no error

Workflow finishes without errors BUT .html is not produced

Cluster assignment merging

Cluster assignments are currently merged by identical ranks:

Cluster     result            count
Cluster1    genus1 species1   10
Cluster2    genus1 species1   20
Cluster3    genus1            10

would be merged as:

result            count
genus1 species1   30
genus1            10

It may be advantageous to be able to merge cluster assignments up to a given rank.
For example, merging up to genus:

result    count
genus1    40

Check first:

  • Would this result in information loss?
  • How should it look in the report?

The default behaviour should remain merging on identical ranks only.
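The two merging modes from the example above can be sketched as one function with an optional rank depth (lineages modeled as tuples for illustration):

```python
from collections import defaultdict

def merge_assignments(rows, depth=None):
    """Merge (lineage, count) pairs by identical lineage; optionally
    truncate every lineage to `depth` ranks first (e.g. depth=1
    merges everything up to genus in the example below)."""
    merged = defaultdict(int)
    for lineage, count in rows:
        key = tuple(lineage) if depth is None else tuple(lineage[:depth])
        merged[key] += count
    return dict(merged)

# The example from the issue: three clusters, two distinct lineages.
rows = [(("genus1", "species1"), 10),
        (("genus1", "species1"), 20),
        (("genus1",), 10)]
```

The default (depth=None) reproduces the current same-rank merging; a truncation depth gives the proposed merge-up-to-rank behaviour, at the cost of losing species-level detail.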

"No match" reads

Reads from clusters with no BLAST result (shown as "no match" in the report) should not count towards the sample composition.
Likewise, the total number of reads used to determine the composition should be shown in the report.

Rework read trimming

Would like to use cutadapt to trim primers and feed the results into FIGARO to determine quality-trimming sites before DADA2.

Pandas SettingWithCopyWarning

The cause of the warning is concatenate_uniq() in workflow/rules/common.smk.

Needs to be fixed to avoid problems later.

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)
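The usual fix is to take an explicit copy before assigning, or to assign through .loc on the original frame. A generic illustration (not the actual concatenate_uniq() code):

```python
import pandas as pd

df = pd.DataFrame({"sample": ["a", "a", "b"],
                   "fq1": ["r1", "r2", "r3"]})

# A chained slice may be a view of df; assigning into it triggers
# SettingWithCopyWarning:
#   subset = df[df["sample"] == "a"]
#   subset["fq1"] = "trimmed"

# Taking an explicit copy makes the intent unambiguous and silences it:
subset = df[df["sample"] == "a"].copy()
subset.loc[:, "fq1"] = "trimmed"
```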

[Request] Exclude specific sequences from the nucleotide database based on SeqID

Is your feature request related to a problem? Please describe.
Many sequences in the BLAST database are wrongly annotated and show significant discrepancies with other sequences from the same taxon. This leads to difficulties with consensus determination and to results being annotated at the genus level or higher.

Describe the solution you'd like
Problematic sequences can be identified by careful examination of the results and could be marked as such and be excluded from the results, similarly to what is done for the taxid-blocklist process that is already implemented.

Describe alternatives you've considered
The blastn CLI only allows excluding taxa OR sequences, but not both simultaneously. The easiest implementation would be to filter sequence IDs on the BLAST results.

Additional context
A better solution would be to provide means to curate the database, e.g. with Spec4ID. This is a somewhat more complicated approach and doesn't preclude filtering specific sequences.
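Post-filtering BLAST tabular output by subject id could be as simple as the sketch below (assumes outfmt 6, where the subject id is the second column; the ids shown are made up):

```python
def drop_blocked_seqids(blast_lines, blocked):
    """Remove BLAST tabular (outfmt 6) hits whose subject id
    (second column, sseqid) is on the user-provided exclusion list."""
    blocked = set(blocked)
    return [line for line in blast_lines
            if line.split("\t")[1] not in blocked]

# Hypothetical hits: qseqid, sseqid, pident, length
hits = ["cluster1\tseqA.1\t99.2\t100",
        "cluster1\tseqB.1\t98.7\t100"]
```

This mirrors the taxid-blocklist mechanism already in place, but keyed on sequence IDs instead of taxids.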

Add disambiguation report for ranks > Species

Currently only the last common node is visible in the report when more than one taxid is found. Discovering what the actual hits were requires looking through the BLAST reports.
The actual hits should be displayed in the final report to help evaluate the results.
