cvua-rrw / foodme

A reproducible and scalable snakemake workflow for the analysis of DNA metabarcoding experiments, with a special focus on food and feed samples.

Home Page: https://cvua-rrw.github.io/FooDMe

License: BSD 3-Clause "New" or "Revised" License

Python 76.04% Shell 16.44% R 7.52%
ngs ngs-pipeline metabarcoding targeted-sequencing snakemake food-monitoring public-health food-authenticity workflow pipeline

foodme's People

Contributors

actions-user, gregdenay


Forkers

gregdenay

foodme's Issues

Snakemake version

Version 7 now released.
Check compatibility and update where necessary.

How can I use this workflow for Illumina single-end data?

Thank you for developing this workflow.

I really want to analyze my data using this workflow, but I only have Illumina single-end data. How can I use this workflow for Illumina single-end data?

I created the samplesheet.tsv for only fq1, but it gives me this error:

InputFunctionException in rule primer_trimming_stats in file /home/karta0/pbs-files/food-auth/FooDMe/workflow/rules/trimming.smk, line 89:
Error:
  KeyError: "None of [Index(['fq2'], dtype='object')] are in the [index]"
Wildcards:
  sample=1
Traceback:
  File "/home/karta0/pbs-files/food-auth/FooDMe/workflow/rules/trimming.smk", line 93, in <lambda>
  File "/home/karta0/pbs-files/food-auth/FooDMe/workflow/rules/common.smk", line 30, in get_fastq
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexing.py", line 1067, in __getitem__
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexing.py", line 1247, in _getitem_tuple
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexing.py", line 991, in _getitem_lowerdim
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexing.py", line 1073, in __getitem__
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexing.py", line 1301, in _getitem_axis
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexing.py", line 1239, in _getitem_iterable
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexing.py", line 1432, in _get_listlike_indexer
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 6070, in _get_indexer_strict
  File "/home/karta0/miniconda3/envs/snakemake/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 6130, in _raise_if_missing

I would greatly appreciate any recommendation or suggestion you can give me.
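For reference, the KeyError above comes from a hard-coded selection of the fq1 and fq2 columns in the samplesheet. A minimal sketch of a single-end aware lookup (illustrative only; the real get_fastq in workflow/rules/common.smk may look different):

```python
import pandas as pd

def get_fastq(samples: pd.DataFrame, sample: str):
    """Return the fastq paths for a sample, tolerating a missing fq2 column.

    Hypothetical variant of get_fastq: only select the read columns
    that actually exist in the samplesheet, and drop empty cells.
    """
    cols = [c for c in ("fq1", "fq2") if c in samples.columns]
    return samples.loc[sample, cols].dropna().tolist()

# A single-end samplesheet with only an fq1 column:
samples = pd.DataFrame({"fq1": ["reads/1_R1.fastq.gz"]}, index=["1"])
```

With this guard the lookup returns a one-element list for single-end data instead of raising a KeyError on the absent fq2 column.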

Implement organized and useful logging

  • All rules need a log block
  • Logging should provide useful and explicit information
  • Logging should be organized in fewer files

1 log file per sample (+1 common and 1 global)
Log rule name and rule stdout and stderr if any
Log verbose output but keep the console clean.
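The points above could translate into a rule convention along these lines (a sketch only; rule, tool, and path names are made up):

```snakemake
rule trim_reads:
    input:
        "results/{sample}/reads.fastq"
    output:
        "results/{sample}/trimmed.fastq"
    log:
        # one log file per sample, named after the rule
        "logs/{sample}/trim_reads.log"
    shell:
        # keep the console clean: redirect stdout and stderr to the log
        "trimming_tool {input} > {output} 2> {log}"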

[Request] Organism name in Benchmarking report

From Dr. Lina-Juana Dolch (CVUA-RRW):

Is your feature request related to a problem? Please describe.
No

Describe the solution you'd like
I would like to have the organism name appear in the benchmarking report (confusion matrix) next to the taxid.

Describe alternatives you've considered
Checking each taxid on www.ncbi.nlm.nih.gov/Taxonomy/Browser, which is time consuming.

Additional context
Please
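One possible way to resolve taxids to names without querying the NCBI website is to read them from the names.dmp file of the NCBI taxdump, which the pipeline already downloads. A sketch (the field layout of names.dmp is tab-pipe delimited):

```python
def load_scientific_names(lines):
    """Map taxid -> scientific name from NCBI names.dmp lines.

    names.dmp fields are separated by '\t|\t': taxid, name, unique name,
    name class. Only rows classed 'scientific name' are kept.
    """
    names = {}
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[fields[0]] = fields[1]
    return names

# Two example rows (taxid 9913 is Bos taurus):
lines = [
    "9913\t|\tBos taurus\t|\t\t|\tscientific name\t|",
    "9913\t|\tdomestic cattle\t|\t\t|\tcommon name\t|",
]
```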

Software versions missing

Version

Foodme Version: 1.5.0

Bug report

The Bash block for env file parsing is broken. Save the headache and replace it with a functioning Python script.

Add warnings for low quality input data

Should some warning be shown (e.g. colored table entries) when some quality values are low?

Problems:

  • How to decide what values are sufficient?
  • How to adapt them to different applications?

Rule prep_taxonomy fails in 1.4.5

rule prep_taxonomy fails for taxid_filter value None

taxonomy_prep.log:

Traceback (most recent call last):
  File "/home/warmann/foodme/FooDMe/.snakemake/scripts/tmpck1f1v1d.filter_taxonomy.py", line 23, in <module>
    main(snakemake.params['nodes'],
  File "/home/warmann/foodme/FooDMe/.snakemake/scripts/tmpck1f1v1d.filter_taxonomy.py", line 18, in main
    tax.prune(taxid)
  File "/home/warmann/foodme/FooDMe/.snakemake/conda/e5bea28993829f6373b74c29065f59fe/lib/python3.10/site-packages/taxidTools/Taxonomy.py", line 600, in prune
    nodes = self.getAncestry(taxid)
  File "/home/warmann/foodme/FooDMe/.snakemake/conda/e5bea28993829f6373b74c29065f59fe/lib/python3.10/site-packages/taxidTools/Taxonomy.py", line 313, in getAncestry
    return Lineage(self[str(taxid)])
  File "/home/warmann/foodme/FooDMe/.snakemake/conda/e5bea28993829f6373b74c29065f59fe/lib/python3.10/collections/__init__.py", line 1106, in __getitem__
    raise KeyError(key)
KeyError: 'None'
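The traceback suggests the unset config value reaches the script as the string 'None' and is passed straight to tax.prune. A possible guard (names are illustrative, not the actual fix):

```python
def should_prune(taxid_filter):
    """Return True only when a usable taxid filter was configured.

    Guards against both a real None and the string 'None' that an
    unset YAML value can turn into on its way through the workflow.
    """
    return taxid_filter not in (None, "None", "")
```

With such a check, prep_taxonomy could skip pruning entirely when taxid_filter is unset instead of crashing on KeyError: 'None'.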

Missing error values

Version

Foodme Version: 1.5.0

Bug report
Some error calculations give NaN as output. This shouldn't happen.
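NaN typically appears in such metrics when a denominator is zero (e.g. no predicted or no expected positives). One convention, sketched below, is to report 0.0 instead; whether 0.0 or an explicit "NA" is the right choice here is a design decision:

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall, guarding the zero-denominator
    cases that would otherwise yield NaN (or a ZeroDivisionError)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```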

Deploy with minimal env requirements

As of now taxidtools needs to be installed in the base environment, which prevents deployment with snakedeploy.
Taxidtools should instead be a per-rule requirement managed in conda environments.

May need to transform some run blocks into script or shell blocks.

[BUG] Ambiguous characters in Primer sequences are not treated as such

Version

Foodme Version: 1.6.5
OS: Debian buster
Snakemake version: 7.25.0

Bug report

What happened: Cutadapt is not aware of IUPAC encoding and treats primer sequences as-is. This inflates the error rate of primer matching when primers contain ambiguous characters, and many reads are lost to filtering.

What I expected to happen: I expect the workflow to be aware of IUPAC encoding for ambiguous characters.

Logs

Additional context

This behaviour is not clear from the FooDMe documentation.
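For illustration, an ambiguous primer can be pre-expanded into its concrete sequence variants before matching. A sketch (whether pre-expansion or a cutadapt option is the right fix is up to the maintainers):

```python
from itertools import product

# IUPAC ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_primer(primer):
    """Expand a primer with IUPAC ambiguity codes into all concrete variants."""
    pools = (IUPAC[base] for base in primer.upper())
    return ["".join(variant) for variant in product(*pools)]
```

For example, the primer "AR" expands to "AA" and "AG"; note that expansion grows multiplicatively with each ambiguous position.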

Documentation

  • Rework README
  • need a proper guide
  • FAQ
  • maybe some kind of tutorial for benchmarking (together with the benchmarking module) on different kinds of samples

Calculation of components concentration

Component concentrations are based on all reads making it through the analysis, including unassigned reads.
The expected behaviour is to not take these reads into account; however, outputting the proportion of unassigned reads can be useful for QC.

Solution: build two columns in the report:

  • one for quantification based on all reads
  • one for quantification of assigned reads only
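The two-column proposal boils down to normalizing each taxon's count by two different totals. A small sketch (names are illustrative):

```python
def quantification(assigned_counts, unassigned):
    """Per-taxon proportions computed two ways:
    over all reads (including unassigned) and over assigned reads only."""
    total_assigned = sum(assigned_counts.values())
    total_all = total_assigned + unassigned
    return {
        taxon: (
            n / total_all if total_all else 0.0,       # column 1: all reads
            n / total_assigned if total_assigned else 0.0,  # column 2: assigned only
        )
        for taxon, n in assigned_counts.items()
    }

# 100 assigned reads across two taxa, plus 100 unassigned reads:
props = quantification({"Bos taurus": 80, "Sus scrofa": 20}, unassigned=100)
```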

[BUG] Incorrect parsing of the `trim_primers_3end` parameter

Version

Foodme Version: 1.6.5
OS : Debian buster
Snakemake version: 7.25.0

Bug report

What happened: The above-mentioned parameter is parsed as a boolean (True/False), but the cutadapt shell block expects it as the string false, so the rule always defaults to trimming on both ends.

What I expected to happen: False should be evaluated as False

Logs

Additional context
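The underlying pitfall is that bool("False") is True in Python, so a flag that arrives as a string is always truthy. A defensive parser (a sketch; function name is made up):

```python
def parse_flag(value):
    """Coerce a config flag that may arrive as a bool or as a string.

    Note the trap this guards against: bool('False') evaluates to True,
    which is how a 'False' setting can silently behave as enabled.
    """
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("true", "1", "yes")
```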

Speed up taxdump parsing

Repeated parsing of the taxdumps files takes up a large part of runtime.

  • parse once and filter according to the "taxid_filter" argument + maybe a list of ranks?
  • save as json
  • load the json in subsequent steps
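The parse-once-then-reload idea could look roughly like this (a generic caching sketch; the actual taxdump parser and file names are placeholders):

```python
import json
import tempfile
from pathlib import Path

def cached_parse(source: Path, cache: Path, parse):
    """Parse `source` once, store the result as JSON, and serve the
    JSON cache on every subsequent call."""
    if cache.exists():
        return json.loads(cache.read_text())
    data = parse(source)
    cache.write_text(json.dumps(data))
    return data

# Demo with a stand-in parser; `calls` counts how often parsing really runs.
calls = []
def fake_parse(path):
    calls.append(path)
    return {"1": "root", "2": "Bacteria"}

workdir = Path(tempfile.mkdtemp())
src = workdir / "nodes.dmp"
src.write_text("stand-in taxdump content")
cache = workdir / "taxonomy.json"

first = cached_parse(src, cache, fake_parse)
second = cached_parse(src, cache, fake_parse)  # served from the JSON cache
```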

Better error catching for Dada

So far, any error raised in the Dada2 R script results in blank output.
This is needed to handle NTC samples, where no reads cluster, but it also makes debugging and problem finding harder and should be corrected.

Implement proper error catching and handling (as far as R allows it) in the script.
The analysis should go on if a sample does not lead to cluster formation (no/few reads, bad quality), and the pipeline should stop on error in all other cases.

Remove wrapper

Makes the Snakemake config file less messy.
Less maintenance.
More traceability.

Taxid Blocklist

Users may want to prevent a subset of taxids from being identified and showing up in the results (e.g. common contaminants, extinct species, ...).

These taxids could be listed in a file that is provided as optional argument to the pipeline and merged with other filters during the database masking step.
This list should also appear in the report for the audit trail.
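The blocklist file format could be as simple as one taxid per line. A parsing sketch (the format itself is an assumption, not the implemented one):

```python
def parse_blocklist(lines):
    """Collect taxids from a blocklist: one taxid per line,
    with blank lines and '#' comments ignored."""
    taxids = set()
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            taxids.add(entry)
    return taxids
```

The resulting set could then be merged with the other taxid filters during the database masking step, as described above.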

Ion torrent sequences

Hello, how are you?

I would like to know if I can use single-end sequences from Ion Torrent and, if so, what should I do?

Thank you very much.

Foodme environment error

Under Debian 11

Job 1: Generating global html report

Activating conda environment: .snakemake/conda/49331715801f124dd0f1cdc91c25109a
ERROR: This cross-compiler package contains no program /home/warmann/foodme/FooDMe/.snakemake/conda/49331715801f124dd0f1cdc91c25109a/bin/x86_64-conda_cos6-linux-gnu-gfortran
INFO: activate-gfortran_linux-64.sh made the following environmental changes:
+HOST=x86_64-conda_cos6-linux-gnu
-HOST=x86_64-conda-linux-gnu
[Fri May  6 10:33:09 2022]
Finished job 1.
27 of 28 steps (96%) done
Select jobs to execute...

[Fri May  6 10:33:09 2022]
localrule all:
    input: reports/report.html
    jobid: 0
    resources: tmpdir=/tmp

[Fri May  6 10:33:09 2022]
Finished job 0.
28 of 28 steps (100%) done
Complete log: .snakemake/log/2022-05-06T102756.305089.snakemake.log

Workflow finished, no error

Workflow finishes without errors BUT .html is not produced

Cluster assignment merging

Cluster assignments are currently merged by identical ranks:

Cluster     result            count
Cluster1    genus1 species1   10
Cluster2    genus1 species1   20
Cluster3    genus1            10

would be merged as:

result            count
genus1 species1   30
genus1            10

It may be advantageous to be able to merge cluster assignments up to a given rank.
For example, merging up to genus:

result    count
genus1    40

Check first:

  • Would this result in information loss?
  • How should it look in the report?

The default behaviour should remain merging on identical ranks only.
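The two merging modes from the example above can be sketched as one function with an optional rank depth (lineages modeled as tuples for illustration):

```python
from collections import defaultdict

def merge_assignments(rows, depth=None):
    """Merge (lineage, count) pairs by identical lineage; optionally
    truncate every lineage to `depth` ranks first (e.g. depth=1
    merges everything up to genus in the example below)."""
    merged = defaultdict(int)
    for lineage, count in rows:
        key = tuple(lineage) if depth is None else tuple(lineage[:depth])
        merged[key] += count
    return dict(merged)

# The example from the issue: three clusters, two distinct lineages.
rows = [(("genus1", "species1"), 10),
        (("genus1", "species1"), 20),
        (("genus1",), 10)]
```

The default (depth=None) reproduces the current same-rank merging; a truncation depth gives the proposed merge-up-to-rank behaviour, at the cost of losing species-level detail.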

"No match" reads

Reads from clusters with no BLAST result (shown as "no match" in the report) should not count towards the sample composition.
Likewise, the total number of reads used to determine the composition should be shown in the report.

Rework read trimming

Would like to use cutadapt to trim primers and feed the results into FIGARO to determine quality-trimming sites before DADA2.

Pandas SettingWithCopyWarning

The cause of the warning is concatenate_uniq() in workflow/rules/common.smk.

Needs to be fixed to avoid problems later.

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)
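The usual fix is to take an explicit copy before assigning, or to assign through .loc on the original frame. A generic illustration (not the actual concatenate_uniq() code):

```python
import pandas as pd

df = pd.DataFrame({"sample": ["a", "a", "b"],
                   "fq1": ["r1", "r2", "r3"]})

# A chained slice may be a view of df; assigning into it triggers
# SettingWithCopyWarning:
#   subset = df[df["sample"] == "a"]
#   subset["fq1"] = "trimmed"

# Taking an explicit copy makes the intent unambiguous and silences it:
subset = df[df["sample"] == "a"].copy()
subset.loc[:, "fq1"] = "trimmed"
```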

[Request] Exclude specific sequences from the nucleotide database based on SeqID

Is your feature request related to a problem? Please describe.
Many sequences in the BLAST database are wrongly annotated and show significant discrepancies with other sequences from the same taxon. This leads to difficulties with consensus determination and to results being annotated at the genus level or higher.

Describe the solution you'd like
Problematic sequences can be identified by careful examination of the results and could be marked as such and be excluded from the results, similarly to what is done for the taxid-blocklist process that is already implemented.

Describe alternatives you've considered
The blastn CLI only allows excluding taxa OR sequences, but not both simultaneously. The easiest implementation would be to filter sequence IDs on the BLAST results.

Additional context
A better solution would be to provide means to curate the database, e.g. with Spec4ID. This is a somewhat more complicated approach and doesn't preclude filtering specific sequences.
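Post-filtering BLAST tabular output by subject id could be as simple as the sketch below (assumes outfmt 6, where the subject id is the second column; the ids shown are made up):

```python
def drop_blocked_seqids(blast_lines, blocked):
    """Remove BLAST tabular (outfmt 6) hits whose subject id
    (second column, sseqid) is on the user-provided exclusion list."""
    blocked = set(blocked)
    return [line for line in blast_lines
            if line.split("\t")[1] not in blocked]

# Hypothetical hits: qseqid, sseqid, pident, length
hits = ["cluster1\tseqA.1\t99.2\t100",
        "cluster1\tseqB.1\t98.7\t100"]
```

This mirrors the taxid-blocklist mechanism already in place, but keyed on sequence IDs instead of taxids.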

Add disambiguation report for ranks > Species

Currently only the last common node is visible in the report when more than one taxid is found. Discovering what the actual hits were requires looking through the BLAST reports.
The actual hits should be displayed in the final report to help evaluate the results.
