nf-core / test-datasets

87.0 138.0 318.0 20.12 GB

Test data to be used for automated testing with the nf-core pipelines

Home Page: https://nf-co.re

License: MIT License

nextflow pipelines test-data testing test-datasets nf-core workflow

test-datasets's Introduction

Test data to be used for automated testing with the nf-core pipelines

⚠️ Do not merge your test data to master! Each pipeline has a dedicated branch (and a special one for modules)

Introduction

nf-core is a collection of high quality Nextflow pipelines. This repository contains various files for CI and unit testing of nf-core pipelines and infrastructure.

The principle for nf-core test data is as small as possible, as large as necessary. Please see the guidelines for more detailed information. Always ask for guidance on the nf-core slack before adding new test data.

Documentation

nf-core/test-datasets comes with documentation in the docs/ directory.

Downloading test data

Because this repository holds many large files for each pipeline, we highly recommend cloning only the branches you will use.

git clone <url> --single-branch --branch <pipeline/modules/branch_name>

To subsequently fetch other branches¹:

git remote set-branches --add origin [remote-branch]
git fetch

Support

For further information or help, don't hesitate to get in touch on our Slack organisation (a tool for instant messaging).

¹ From stackoverflow

test-datasets's People

Contributors

apeltzer ewels hadrieng chuan-wang remiolsen philpalmer kochtobi glormph likelet antunderwood andreas-wilm ignaciot sbilge nservant maxibor lpantano annasyme pilm-bioinformatics bixbeta propan2one pranathivemuri lwratten drpatelh biopsykold t-neumann jpfeuffer cying111 snafees ritaht annacprice jeremy1805 piotr-faba-ardigen felixkrueger thanhleviet skrakau bu-isciii joseespinosa barquistlab cgpu thejacksonlaboratory sc13-bioinf pappewaio wkang0 reganhayward fasterius rikenbit pjonsson-ctx subwaystation heuermh kevinmenden huipengl emnilsson zxclovezby jonathanbader presta-hub luslab juke34 lkuchenb phue barrydigby d4straub akv3001 weinformatics friederikehanssen yuukiiwa oist oisinmccaffrey charles-plessy combiz mgordon09 mjmansfi immcantation praveenraj2018 jtangrot ronfinn dhwani07 kevbrick mtmcgowan gongyh sdelgadoa jianhong daniel-vm erikrikarddaniel dladd bwlang luantunez sguizard jemten gcjmackenzie abhi18av fbdtemme daisyhan97 letovesnoi rpetit3 tuberinfo midnighter edmundmiller goja288 ggabernet avantonder

test-datasets's Issues

Split modules vs pipeline test module

This repository is getting very large and having to shallow clone/single-branch clone can be tricky, so we could move the modules test data branch to a separate repository.

Use git-lfs

I'd like to make the argument for moving everything to git-lfs, so that cloning the repo locally isn't multiple GBs. It might also clean up the history a bit. But that's a larger discussion because of bandwidth.

So we could use something like Hugging Face, but that ruins the beauty of all the collaboration going on in that repo.

Couple of broken links in docs

https://github.com/nf-core/test-datasets/blob/master/docs/ADD_NEW_DATA.md
link not working for "nf-core/test-datasets repository"

https://github.com/nf-core/test-datasets/blob/master/docs/USE_EXISTING_DATA.md
link not working for ".travis.yml template"

error in gff file (rnaseq branch test-datasets/reference/genes.gff)

We found a problem in the gff file you have as test.

rnaseq
test-datasets/reference/genes.gff

I	ensembl	transcript	335	649	.	+	.	ID=YAL069W;Parent=YAL069W;geneID=YAL069W;gene_biotype=protein_coding;gene_name=YAL069W;gene_source=ensembl;gene_version=1;p_id=P3634;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS1129
I	ensembl	exon	335	649	.	+	.	Parent=YAL069W;exon_id=YAL069W.1;exon_number=1;exon_version=1;gene_biotype=protein_coding;gene_name=YAL069W;gene_source=ensembl;gene_version=1;p_id=P3634;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS1129
I	ensembl	CDS	335	649	.	+	0	Parent=YAL069W;exon_number=1;gene_biotype=protein_coding;gene_name=YAL069W;gene_source=ensembl;gene_version=1;p_id=P3634;protein_id=YAL069W;protein_version=1;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS1129

The ID and Parent attributes of the transcript features have the same value. This is not allowed by the GFF3 specification.

We use AGAT, which deals with that problem by automatically updating the parent ID to be unique.
Using this file to test/build pipelines might be problematic. It should be updated.
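
The problem reported above can be checked for mechanically. The following is a minimal sketch, not AGAT's method: it assumes a plain tab-separated GFF3 file with `key=value;` attributes in column 9, and the function names `parse_attributes` and `self_parented_features` are made up for illustration.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = value
    return attributes


def self_parented_features(gff_lines):
    """Return (line_number, feature_type, ID) for features listing their own ID as Parent."""
    problems = []
    for line_number, line in enumerate(gff_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a feature line
        attributes = parse_attributes(columns[8])
        feature_id = attributes.get("ID")
        parents = attributes.get("Parent", "").split(",")
        if feature_id is not None and feature_id in parents:
            problems.append((line_number, columns[2], feature_id))
    return problems


# Attribute columns shortened from the lines quoted in the issue.
example = [
    "I\tensembl\ttranscript\t335\t649\t.\t+\t.\tID=YAL069W;Parent=YAL069W;geneID=YAL069W",
    "I\tensembl\texon\t335\t649\t.\t+\t.\tParent=YAL069W;exon_id=YAL069W.1",
]
print(self_parented_features(example))
```

Only the transcript line is flagged, because the exon line carries a Parent but no ID of its own.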

Add cyano nucleotide taxonomy to phyloplace test data

Add database subsampled data for snpSift module (dbnsnpf)

The data was downloaded from the SnpEff website. The original file was reduced by extracting the chr 21 records; the resulting chr21 file was then subsampled by keeping the first 100K lines.
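
The two reduction steps could be sketched roughly as follows. This is not the command that was actually used: it assumes a tab-separated file whose first column is the chromosome and whose header lines start with `#`, and it counts header lines toward the line limit. The function name `subsample_chromosome` is hypothetical.

```python
def subsample_chromosome(lines, chromosome="21", max_lines=100_000):
    """Keep header lines and rows for one chromosome, stopping at max_lines lines."""
    kept = []
    for line in lines:
        if len(kept) >= max_lines:
            break
        # Assumption: column 1 holds the chromosome name without a "chr" prefix.
        if line.startswith("#") or line.split("\t", 1)[0] == chromosome:
            kept.append(line)
    return kept


rows = ["#chr\tpos\n", "20\t5\n", "21\t7\n", "21\t9\n", "22\t1\n"]
print(subsample_chromosome(rows, max_lines=2))
```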

fix missing `cellranger` data summary info

#878 did a major reshuffle of Cellranger test data for 10X Genomics analyses, but failed to document the new directory structure and data files in the repo README for module test data

this should be a pretty simple fix

Remove ambiguous bases from MAG test data

sample2 had IUPAC ambiguity characters such as Y, R, etc. in its reads, which cause some tools (like AdapterRemoval) to fail. This PR replaces such characters with N.
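
A replacement like the one described could look roughly like this. It is only a sketch of the idea, not the script used in the PR, and it assumes uncompressed FASTQ with strict four-line records (no wrapped sequences). The name `mask_ambiguous_bases` is invented for the example.

```python
import re

# IUPAC ambiguity codes other than N (R, Y, S, W, K, M, B, D, H, V), either case.
AMBIGUOUS = re.compile(r"[RYSWKMBDHVryswkmbdhv]")


def mask_ambiguous_bases(fastq_lines):
    """Yield FASTQ lines with ambiguity codes in the sequence lines replaced by N."""
    for index, line in enumerate(fastq_lines):
        if index % 4 == 1:  # second line of each 4-line record is the sequence
            yield AMBIGUOUS.sub("N", line)
        else:
            yield line  # header, separator and quality lines stay untouched


record = ["@read1\n", "ACGYTR\n", "+\n", "IIIIII\n"]
print("".join(mask_ambiguous_bases(record)), end="")
```

Restricting the substitution to sequence lines matters because quality strings and read names can legitimately contain the same letters.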

Add sets of species nucleotide.fasta or protein.fasta for orthology analysis

Hey,

I need a dataset to run orthology software (primarily OrthoFinder). It could be, for example, ~10 species with ~50 genes of a particular gene family (say, 10 mammal species with Hox gene family genes), either as nucleotide FASTA (which I will translate within the process) or as protein sequences.

I couldn't find a suitable dataset already present.

Pipedreams

Sync the test data repo to an s3 bucket
Sync the repo to a bucket on a push to the repo
Redo the test datasets repo with a refgenie git-ops magical rewrite. (which I guess could also be Nextflow)
#992

Homo sapiens Genome.gtf test-dataset is invalid

The genome.gtf file is not sorted and contains features whose parent feature is absent. Surprisingly, this has not caused issues until now.
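
Both symptoms can be detected with a short script. The sketch below rests on assumptions the issue does not spell out: "sorted" is taken to mean that each sequence forms one contiguous block with non-decreasing start coordinates, and a "parent" is taken to be a `transcript` line carrying the `transcript_id` that other features reference. The helper names are hypothetical.

```python
import re


def gtf_attribute(attributes, key):
    """Return the quoted value of `key` in a GTF attribute column, or None."""
    match = re.search(rf'(?:^|[;\s]){re.escape(key)} "([^"]*)"', attributes)
    return match.group(1) if match else None


def check_gtf(lines):
    """Return (out-of-order line numbers, transcript IDs that have no transcript line)."""
    out_of_order = []
    seen_seqnames = set()
    previous_seqname, previous_start = None, 0
    declared, referenced = set(), set()
    for line_number, line in enumerate(lines, start=1):
        columns = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(columns) != 9:
            continue  # skip comments and anything that is not a feature line
        seqname, feature, start = columns[0], columns[2], int(columns[3])
        if seqname == previous_seqname:
            if start < previous_start:
                out_of_order.append(line_number)
        elif seqname in seen_seqnames:
            out_of_order.append(line_number)  # this sequence's block is split
        seen_seqnames.add(seqname)
        previous_seqname, previous_start = seqname, start
        transcript_id = gtf_attribute(columns[8], "transcript_id")
        if feature == "transcript":
            declared.add(transcript_id)
        elif feature != "gene" and transcript_id:
            referenced.add(transcript_id)
    return out_of_order, sorted(referenced - declared)


toy_gtf = [
    'chr1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "g1";',
    'chr1\tsrc\ttranscript\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsrc\texon\t50\t80\t.\t+\t.\tgene_id "g2"; transcript_id "t2";',
]
print(check_gtf(toy_gtf))
```

In the toy input, the last exon starts before the preceding feature and refers to a transcript that is never declared, so it shows up in both results.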

Issue noticed when writing a module for RiboCode

Paired modules issue: #4188

add testdata for variantbenchmarking pipeline

Testdata needed for variantbenchmarking pipeline will be added.

cellranger data refactor broke tests for non-cellranger modules

PR #878 dramatically reorganized 10X Genomics single cell data for Cellranger modules

but I unknowingly failed to update module tests for other modules that used the 10X Genomics data

affected modules:

kallistobustools/count
universc

Rename spatialomics to imaging for modules branch

As discussed on slack, @FloWuenne and @jmuhlich have proposed to rename the spatialomics directory in the modules branch to imaging. This better reflects the specific use cases for image registration, segmentation and quantification.

This will impact only two modules that are currently in development and will need to point towards this new directory

Need testdata for mcmicro-celesta

see title

Re-run tests of all affected modules when a given test-data file is updated/modified

Notes from meeting:

Can we reverse map test data 👉 modules?
- If so, we can test all affected modules when updating test data
- Don't need to keep multiple similar copies of test data 👍
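
One way to picture the reverse mapping from the meeting notes: start from a table of which data files each module's tests use, invert it, and look up the changed files. This is a sketch under the assumption that such a module-to-data table can be extracted from the module tests. The file paths below are invented placeholders, and both function names are hypothetical.

```python
from collections import defaultdict


def reverse_map(module_to_data):
    """Invert {module: [data paths]} into {data path: sorted list of modules}."""
    data_to_modules = defaultdict(set)
    for module, paths in module_to_data.items():
        for path in paths:
            data_to_modules[path].add(module)
    return {path: sorted(modules) for path, modules in data_to_modules.items()}


def affected_modules(changed_paths, data_to_modules):
    """Return every module whose tests use at least one changed data file."""
    affected = set()
    for path in changed_paths:
        affected.update(data_to_modules.get(path, []))
    return sorted(affected)


# Placeholder paths, for illustration only.
usage = {
    "kallistobustools/count": ["data/example/10x_R1.fastq.gz"],
    "universc": ["data/example/10x_R1.fastq.gz", "data/example/ref.fasta"],
}
index = reverse_map(usage)
print(affected_modules(["data/example/10x_R1.fastq.gz"], index))
```

With an index like this, a CI job triggered by a test-data change would only need the list of changed files to decide which module tests to re-run.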

What should we do with `docs/ADD_NEW_DATA.md` and `docs/USE_EXISTING_DATA.md` on pipeline branches?

It seems that not all branches handle ADD_NEW_DATA.md and USE_EXISTING_DATA.md in the same way. For instance, rnaseq and methylseq have deleted them, while sarek and demultiplex still have them. Is this something we want to standardize across all the branches? Or should we keep these files only on the default branch?

refactor test data for `cellranger` modules

in writing a cellranger multi module (nf-core/modules#3229), it became evident that the corresponding test data PR (#848) doesn't fully test the module

there are two options:

simply adding more data (easiest)
refactoring tests for the extant cellranger modules (count, mkgtf, mkref, vdj, mkvdjref, mkfastq, and multi) to rely on a single data store

the problem with simply adding data is that it bloats the repo

10X furnishes a few multiomic datasets that we could incorporate:

in a refactor, we could pick and choose which data to use for each module, e.g. cellranger count would rely on one of the GEX datasets, cellranger vdj on one of the immune profiling datasets, etc.

cellranger multi would test all datasets

the original datasets include full FASTQs, but previous cellranger tests have successfully downsampled them for nf-core module testing

Including indexing files for new module verifybamid

Including premixed SVD index and Generated SVD index for new module testing.

nf-core/modules#1957

Request to add ONT-Amplicon seq for SARS-CoV-2:Wastewater

Here is the BioProject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1031245.
Please add 2 samples from this Bioproject.
Platform: Oxford Nanopore
Strategy: Amplicon
Source: ViralRNA
Selection: PCR
Layout: Single-end
Primer Scheme: Artic or VSSA

git clone not working

Hello, and thank you for sharing the datasets. I am not able to download the resources and testdata folders with the command
git clone https://github.com/nf-core/test-datasets.git
Could you help me? Thank you in advance.

Add new test data for modkit_pileup

New data is needed for modkit_pileup module.
The data needed is a bam file and a bai file.

[`taxprofiler` branch] Add new taxprofiler `sourmash` db

A tiny sourmash database is required for testing in taxprofiler.
It will be based on the same input data as used for the other test databases in taxprofiler.
Related with nf-core/taxprofiler#112

add a new dataset for est-sfs

Add new test data for sammyseq

Check here that there isn't already a branch containing data that could be used
Fork the nf-core/test-datasets repository to your GitHub account
Create a new branch on your fork
Check your proposed test data follows the guidelines
Add your test dataset
- If you clone it locally use git clone <url> --branch <branch> --single-branch
Make a PR on a new branch with a relevant name
Wait for the PR to be merged
Use this newly created branch for your tests

Move unsorted data into the generic data dir

I submitted data into the wrong directory, this PR fixes that.

Use version tags on branches

Instead of having directories with versions and then referencing them like:
https://raw.githubusercontent.com/nf-core/test-datasets/demultiplex/samplesheet/1.3.0/SampleSheet.csv

I'm proposing we tag them whenever we do a release on the pipeline to "sync" them.

https://raw.githubusercontent.com/nf-core/test-datasets/1.3.2/samplesheet/samplesheet.csv

We should probably do the tags as demultiplex/1.3.2 instead of just 1.3.2

Is there any RNAseq data to test transcriptome Assembly pipeline?

Hi !
Everything is in the title...

CAT - Contig Annotation Tool custom database

As the CAT database is quite large, a custom one is needed to test Nextflow pipelines/workflows.

It is apparently feasible to generate a database small enough for CI purposes (MGXlab/CAT_pack#65), and one would already be needed for nf-core/mag#501.

Update & complete documentation on modules test-data for homo-sapiens

Documentation on how the different files for the Homo sapiens module tests were generated is spread across different places and/or incomplete. We should fix this, because figuring out later how to regenerate a file, as is sometimes necessary, will be trickier.

Test data required for testing scimap/spatiallda module

Requires .csv with X_centroid, Y_centroid, phenotype and sample_id columns.

Add bed file for ONCOCNV

I need an annotated BED file to run tests in ONCOCNV.

Structuring this repository

Can we maybe have a quick discussion on whether such a structure would be appropriate?:

extra/<pipeline-name>/...
reference_data/<pipeline-name>/...
testdata/<pipeline-name>/...

Extra would keep e.g. BED/PED files for certain pipelines which should be kind of unique per pipeline.
Reference_data could keep reference genome data, and can be shared in my opinion (does that make sense?)
testdata could keep per-pipeline testdata

What do others think about these points? @nf-core/admin ?

Mismatch in test .csv file

In the provided test https://github.com/nf-core/test-datasets/blob/bacass/bacass_short.csv there is a mismatch between the second supplied ID and the second supplied test file. Probably not a fatal error.

ID	R1	R2	LongFastQ	Fast5	GenomeSize
ERR044595	https://github.com/nf-core/test-datasets/raw/bacass/ERR044595_1M_1.fastq.gz	https://github.com/nf-core/test-datasets/raw/bacass/ERR044595_1M_2.fastq.gz	NA	NA	2.8m
ER064912	https://github.com/nf-core/test-datasets/raw/bacass/ERR064912_1M_1.fastq.gz	https://github.com/nf-core/test-datasets/raw/bacass/ERR064912_1M_2.fastq.gz	NA	NA	2.8m
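
A mismatch like this one can be caught with a small consistency check. The sketch below assumes that each read file name is expected to start with the sample ID, which is a convention inferred from the first row and not a documented rule. The delimiter is a parameter because the sheet is shown tab-separated here while the file itself has a .csv extension. The name `mismatched_ids` is hypothetical.

```python
import csv
import io
from urllib.parse import urlparse


def mismatched_ids(samplesheet_text, delimiter="\t"):
    """Return (ID, column, file name) where a read file name does not start with the ID."""
    problems = []
    for row in csv.DictReader(io.StringIO(samplesheet_text), delimiter=delimiter):
        for column in ("R1", "R2"):
            value = row.get(column) or "NA"
            if value == "NA":
                continue  # no file supplied for this column
            filename = urlparse(value).path.rsplit("/", 1)[-1]
            if not filename.startswith(row["ID"]):
                problems.append((row["ID"], column, filename))
    return problems


base = "https://github.com/nf-core/test-datasets/raw/bacass/"
example_sheet = "\n".join([
    "\t".join(["ID", "R1", "R2", "LongFastQ", "Fast5", "GenomeSize"]),
    "\t".join(["ERR044595", base + "ERR044595_1M_1.fastq.gz",
               base + "ERR044595_1M_2.fastq.gz", "NA", "NA", "2.8m"]),
    "\t".join(["ER064912", base + "ERR064912_1M_1.fastq.gz",
               base + "ERR064912_1M_2.fastq.gz", "NA", "NA", "2.8m"]),
])
for problem in mismatched_ids(example_sheet):
    print(problem)
```

Run on the two rows quoted above, the check reports both read files of the second sample, whose ID is missing an R.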

[modules] Dataset at `data/genomics/homo_sapiens/illumina/bcl` is too simple

There is a recent bug that went unnoticed in the bcl2fastq and bclconvert modules (see nf-core/modules#3794). The issue was that output files were not always captured by the output glob depending on the sample name they were associated with.

To prevent this bug from occurring again it would be nice to add a wider range of sample names (instead of just Sample1) to the dataset at data/genomics/homo_sapiens/illumina/bcl.

Add Unsorted text data

I'm adding the GNU_SORT module, and there aren't any unsorted bed-like files in the repo.

Make a README on master to explain how it works

Add test datasets for imaging modules

Clean up bam/cram homo_sapiens files for modules tests data

The BAM and CRAM files were (partially) not generated with the same reference FASTA. It is not clear from the names which file used which reference, and this defeats the purpose of having mapped, duplicate-marked, recalibrated BAM/CRAM files all based on the same original dataset. This probably happened when we faced repeated issues with the then-existing files not having enough coverage for some variant calling tools. I think it should be cleaned up and replaced though.

Documentation for applying sarek to nf-core test data

Hi,
I have some serious problems applying sarek to the nf-core test data. I am looking for the specific nextflow command that generates the sarek test outputs from the nf-core test data. I could not find any explicit command in the readme.

Thanks!
Ben
