nf-core / test-datasets

Test data to be used for automated testing with the nf-core pipelines

Home Page: https://nf-co.re

License: MIT License

nextflow pipelines test-data testing test-datasets nf-core workflow

test-datasets's Introduction

nfcore/test-datasets

Test data to be used for automated testing with the nf-core pipelines

⚠️ Do not merge your test data to master! Each pipeline has a dedicated branch (and a special one for modules)

Introduction

nf-core is a collection of high quality Nextflow pipelines. This repository contains various files for CI and unit testing of nf-core pipelines and infrastructure.

The principle for nf-core test data is as small as possible, as large as necessary. Please see the guidelines for more detailed information. Always ask for guidance on the nf-core slack before adding new test data.

Documentation

nf-core/test-datasets comes with documentation in the docs/ directory:

  1. Add a new test dataset
  2. Use an existing test dataset

Downloading test data

Due to the large number of large files in this repository for each pipeline, we highly recommend cloning only the branches you will use.

git clone <url> --single-branch --branch <pipeline/modules/branch_name>

To subsequently clone other branches [1]:

git remote set-branches --add origin [remote-branch]
git fetch

Support

For further information or help, don't hesitate to get in touch on our Slack organisation (a tool for instant messaging).

Footnotes

  1. From Stack Overflow

test-datasets's Issues

Split modules vs pipeline test module

This repository is getting very large and having to shallow clone/single-branch clone can be tricky, so we could move the modules test data branch to a separate repository.

Use git-lfs

Making an argument for moving everything to git-lfs, so that downloading the repo locally isn't multiple GBs; it might also clean up the history a bit. But that's a larger discussion because of bandwidth.

So we could use something like Hugging Face, but that ruins the beauty of all the collaboration going on in that repo.

error in gff file (rnaseq branch test-datasets/reference/genes.gff)

We found a problem in the gff file you have as test.

  • rnaseq
  • test-datasets/reference/genes.gff
I	ensembl	transcript	335	649	.	+	.	ID=YAL069W;Parent=YAL069W;geneID=YAL069W;gene_biotype=protein_coding;gene_name=YAL069W;gene_source=ensembl;gene_version=1;p_id=P3634;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS1129
I	ensembl	exon	335	649	.	+	.	Parent=YAL069W;exon_id=YAL069W.1;exon_number=1;exon_version=1;gene_biotype=protein_coding;gene_name=YAL069W;gene_source=ensembl;gene_version=1;p_id=P3634;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS1129
I	ensembl	CDS	335	649	.	+	0	Parent=YAL069W;exon_number=1;gene_biotype=protein_coding;gene_name=YAL069W;gene_source=ensembl;gene_version=1;p_id=P3634;protein_id=YAL069W;protein_version=1;transcript_biotype=protein_coding;transcript_source=ensembl;transcript_version=1;tss_id=TSS1129

The ID and Parent attributes of the transcript features have the same value. This is not allowed by the GFF3 specification.

We use AGAT, which deals with this problem by automatically updating the parent ID to be unique.
Using this file to test/build pipelines might be problematic. This should be updated.
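
The problem described above can be checked for mechanically. The following is a minimal Python sketch, not AGAT's method: the function names are hypothetical, and the two example records are abbreviated versions of the transcript and exon lines quoted in this issue (most attributes dropped for brevity). It reports every feature whose ID also appears in its own Parent attribute.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def self_parented_features(lines):
    """Yield (line_number, feature_type, ID) for features listing their own ID as Parent."""
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        attrs = parse_attributes(columns[8])
        feature_id = attrs.get("ID")
        parents = attrs.get("Parent", "").split(",")  # Parent may hold several IDs
        if feature_id is not None and feature_id in parents:
            yield number, columns[2], feature_id


# Abbreviated copies of the records quoted in the issue (assumption: attributes trimmed).
example = [
    "I\tensembl\ttranscript\t335\t649\t.\t+\t.\tID=YAL069W;Parent=YAL069W;geneID=YAL069W",
    "I\tensembl\texon\t335\t649\t.\t+\t.\tParent=YAL069W;exon_id=YAL069W.1",
]
problems = list(self_parented_features(example))
print(problems)
```

To scan a real file, the same function could be given an open file handle instead of the example list, since it only iterates over lines.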

fix missing `cellranger` data summary info

#878 did a major reshuffle of Cellranger test data for 10X Genomics analyses, but failed to document the new directory structure and data files in the repo README for module test data

this should be a pretty simple fix

Add sets of species nucleotide.fasta or protein.fasta for orthology analysis

Hey,

I need a dataset to run orthology software (primarily OrthoFinder). It could be, for example, ~10 species with ~50 genes of a particular gene family (say, 10 mammal species with Hox gene family genes), either as nucleotide FASTA (which I would translate within the process) or as protein sequences.

I couldn't find a suitable dataset already present.

Pipedreams

  • Sync the test data repo to an s3 bucket
  • Sync the repo to a bucket on a push to the repo
  • Redo the test datasets repo with a refgenie git-ops magical rewrite. (which I guess could also be Nextflow)
  • #992

Homo sapiens Genome.gtf test-dataset is invalid

The genome.gtf file is not sorted and contains features whose parent feature is absent. Surprisingly, this has not caused issues until now.

Issue noticed when writing a module for RiboCode

Paired modules issue: #4188
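
Both symptoms reported here can be detected with a short script. The sketch below is only an illustration under stated assumptions: the issue does not say which sort order or which parent relationship is violated, so "sorted" is taken to mean non-decreasing start coordinates per sequence, and "parent absent" is taken to mean a transcript_id used by a child feature without a matching transcript record. The function names and the three example records are hypothetical and are not taken from the actual genome.gtf.

```python
import re

# GTF attributes look like: gene_id "g1"; transcript_id "t1";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def gtf_records(lines):
    """Yield (seqname, feature, start, attributes) for each well-formed GTF line."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        yield cols[0], cols[2], int(cols[3]), dict(ATTRIBUTE.findall(cols[8]))


def orphan_transcript_ids(lines):
    """Return transcript_ids used by child features but never declared by a transcript record."""
    declared, referenced = set(), set()
    for _, feature, _, attrs in gtf_records(lines):
        transcript_id = attrs.get("transcript_id")
        if not transcript_id:
            continue
        if feature == "transcript":
            declared.add(transcript_id)
        elif feature != "gene":
            referenced.add(transcript_id)
    return sorted(referenced - declared)


def is_start_sorted(lines):
    """True if start coordinates never decrease within each sequence (assumed sort order)."""
    last_start = {}
    for seqname, _, start, _ in gtf_records(lines):
        if start < last_start.get(seqname, 0):
            return False
        last_start[seqname] = start
    return True


# Hypothetical records for illustration only.
example = [
    'chr1\ttest\ttranscript\t100\t500\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\ttest\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\ttest\texon\t50\t80\t.\t+\t.\tgene_id "g2"; transcript_id "t2";',
]
orphans = orphan_transcript_ids(example)
sorted_ok = is_start_sorted(example)
print(orphans, sorted_ok)
```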

refactor test data for `cellranger` modules

in writing a cellranger multi module (nf-core/modules#3229), it became evident that the corresponding test data PR (#848) doesn't fully test the module

there are two options:

  • simply adding more data (easiest)
  • refactoring tests for the extant cellranger modules (count, mkgtf, mkref, vdj, mkvdjref, mkfastq, and multi) to rely on a single data store

the problem with simply adding data is that it bloats the repo

10X furnishes a few multiomic datasets that we could incorporate:

in a refactor, we could pick and choose which data to use for each module, e.g. cellranger count would rely on one of the GEX datasets, cellranger vdj on one of the immune profiling datasets, etc.

cellranger multi would test all datasets

the original datasets include full FASTQs, but previous cellranger tests have successfully downsampled them for nf-core module testing

add a new dataset for est-sfs

Add new test data for sammyseq

  • Check here that there isn't already a branch containing data that could be used
  • Fork the nf-core/test-datasets repository to your GitHub account
  • Create a new branch on your fork
  • Check your proposed test data follows the guidelines
  • Add your test dataset
    • If you clone it locally use git clone <url> --branch <branch> --single-branch
  • Make a PR on a new branch with a relevant name
  • Wait for the PR to be merged
  • Use this newly created branch for your tests

Structuring this repository

Can we maybe have a quick discussion on whether such a structure would be appropriate?:

extra/<pipeline-name>/...
reference_data/<pipeline-name>/...
testdata/<pipeline-name>/...

Extra would keep e.g. BED/PED files for certain pipelines which should be kind of unique per pipeline.
Reference_data could keep reference genome data, and can be shared in my opinion (does that make sense?)
testdata could keep per-pipeline testdata

What do others think about these points? @nf-core/admin ?

Mismatch in test .csv file

In the provided test https://github.com/nf-core/test-datasets/blob/bacass/bacass_short.csv there is a mismatch between the second supplied ID and the second supplied test file. Probably not a fatal error.

ID	R1	R2	LongFastQ	Fast5	GenomeSize
ERR044595	https://github.com/nf-core/test-datasets/raw/bacass/ERR044595_1M_1.fastq.gz	https://github.com/nf-core/test-datasets/raw/bacass/ERR044595_1M_2.fastq.gz	NA	NA	2.8m
ER064912	https://github.com/nf-core/test-datasets/raw/bacass/ERR064912_1M_1.fastq.gz	https://github.com/nf-core/test-datasets/raw/bacass/ERR064912_1M_2.fastq.gz	NA	NA	2.8m
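
A mismatch like this one can be caught automatically. The sketch below is a hypothetical helper, not part of bacass or nf-core tooling. It assumes the sheet is tab-separated, as displayed above, and that each read file name should start with the sample ID; it then lists every R1/R2 entry that breaks that assumption. The rows are copied from the sample sheet quoted in this issue.

```python
import csv
import io
from pathlib import PurePosixPath
from urllib.parse import urlparse


def mismatched_ids(sheet_text, delimiter="\t"):
    """Return (ID, column, file name) for read files whose name does not start with the ID."""
    reader = csv.DictReader(io.StringIO(sheet_text), delimiter=delimiter)
    mismatches = []
    for row in reader:
        for column in ("R1", "R2"):
            value = row[column]
            if value == "NA":
                continue  # no file supplied for this column
            filename = PurePosixPath(urlparse(value).path).name
            if not filename.startswith(row["ID"]):
                mismatches.append((row["ID"], column, filename))
    return mismatches


base = "https://github.com/nf-core/test-datasets/raw/bacass/"
sheet = "\n".join([
    "ID\tR1\tR2\tLongFastQ\tFast5\tGenomeSize",
    f"ERR044595\t{base}ERR044595_1M_1.fastq.gz\t{base}ERR044595_1M_2.fastq.gz\tNA\tNA\t2.8m",
    f"ER064912\t{base}ERR064912_1M_1.fastq.gz\t{base}ERR064912_1M_2.fastq.gz\tNA\tNA\t2.8m",
])
result = mismatched_ids(sheet)
print(result)
```

The prefix rule is a deliberate simplification; a sheet whose file names legitimately differ from the sample ID would need a looser comparison.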

Add Unsorted text data

I'm adding the GNU_SORT module, and there aren't any unsorted BED-like files in the repo.

Clean up bam/cram homo_sapiens files for modules tests data

The BAM and CRAM files were (partially) not generated with the same reference FASTA. It is not clear from the file names which file used which reference, and this defeats the purpose of having mapped, duplicate-marked, and recalibrated BAM/CRAM files all based on the same original dataset. This probably happened when we faced repeated issues with the then-existing files not having enough coverage for some variant calling tools. I think it should be cleaned up and replaced, though.

Documentation for applying sarek to nf-core test data

Hi,
I am having serious problems applying sarek to the nf-core test data. I am looking for the specific nextflow command that generates the sarek test outputs from the nf-core test data. I could not find an explicit command in the readme.

Thanks!
Ben
