bhattlab / bhattlab_workflows
Computational workflows for metagenomics tasks, by the Bhatt lab
Home Page: http://www.bhattlab.com
Assembly is expecting a table of the following format:
sample_foo reads_1.fq[.gz][,reads_2.fq[.gz][,orphans.fq[.gz]]]
...
Can the preprocessing workflow output this for convenience?
Sorry for the deeply nested square brackets. They indicate that each file can optionally be gzipped, and that there can be one file (single-end or interleaved), two files (paired reads), or three files (paired plus orphans).
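A minimal sketch of how one row of such a table could be written. The function name `assembly_table_line` and the tab separator between the sample name and the file list are assumptions for illustration, not something the assembly workflow is confirmed to require.

```python
def assembly_table_line(sample, read_files):
    """Format one row: sample name, a tab, then 1-3 comma-separated read files.

    Hypothetical helper: the tab separator is an assumption about the table format.
    """
    if not 1 <= len(read_files) <= 3:
        raise ValueError(
            "expected 1 (single-end or interleaved), 2 (paired), "
            "or 3 (paired plus orphans) files"
        )
    return sample + "\t" + ",".join(read_files)


if __name__ == "__main__":
    print(assembly_table_line("sample_foo", ["reads_1.fq.gz", "reads_2.fq.gz", "orphans.fq.gz"]))
```

The preprocessing workflow could emit one such line per sample at the end of a run.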
Hello! Preprocessing flow is getting closer.
(1) my access for some files was denied (which might be a good thing)
(2) the work-around gave me this:
My Input:
snakemake --configfile ~/asbhatt/chris/project/gutdecon/testpoop/testpoop1.yaml.txt -s ~/asbhatt/chris/tools/bhattlab_workflows/preprocessing/preprocessing.snakefile
My Output:
[fread] Cannot allocate memory
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 0 reads
Error in rule rm_host_reads:
jobid: 1
output: /labs/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput/01_processing/04_host_align/sample_1_rmHost_2.fq, /labs/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput/01_processing/04_host_align/sample_1_rmHost_1.fq, /labs/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput/01_processing/04_host_align/sample_1_rmHost_orphan.fq
RuleException:
CalledProcessError in line 136 of /home/cseveryn/asbhatt/chris/tools/bhattlab_workflows/preprocessing/preprocessing.snakefile:
Command ' set -euo pipefail;
mkdir -p /labs/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput/01_processing/04_host_align/
# if an index needs to be built, use bwa index ref.fa
# run first on the paired reads
bwa mem -t 1 /labs/asbhatt/data/host_reference_genomes/hg19/hg19.fa /labs/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput/01_processing/03_sync/sample_1_1.fastq /labs/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput/01_processing/03_sync/sample_1_2.fastq | samtools view -bS - | samtools bam2fq -f 4 -1 /labs/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput/01_processing/04_host_align/sample_1_rmHost_1.fq -2 /labs/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput/01_processing/04_host_align/sample_1_rmHost_2.fq - ;
# run on the orphan reads
#bwa mem -t 1 /labs/asbhatt/data/host_reference_genomes/hg19/hg19.fa /labs/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput/01_processing/03_sync/sample_1_orphans.fastq | samtools view -bS - | samtools bam2fq -f 4 - > /labs/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput/01_processing/04_host_align/sample_1_rmHost_orphan.fq ' returned non-zero exit status 1
File "/home/cseveryn/asbhatt/chris/tools/bhattlab_workflows/preprocessing/preprocessing.snakefile", line 136, in __rule_rm_host_reads
File "/home/cseveryn/miniconda3/envs/preprocessing/lib/python3.5/concurrent/futures/thread.py", line 55, in run
Removing output files of failed job rm_host_reads since they might be corrupted:
/labs/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput/01_processing/04_host_align/sample_1_rmHost_2.fq, /labs/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput/01_processing/04_host_align/sample_1_rmHost_1.fq
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /labs/asbhatt/chris/project/gutdecon/testpoop/.snakemake/log/2018-11-08T165736.470978.snakemake.log
In preprocessing/preprocessing.snakefile, line 17, the extensions used are hardcoded. Shouldn't tuple(['fastq.gz', 'fq.gz']) be replaced with EXTENSION if this is going to be pulled from the config to begin with? I encountered this by providing some *.fq files, specifying '.fq' as my extension, and still getting no result until I made this substitution. Just making sure, as I'm not too familiar with this code.
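A minimal sketch of the substitution described above, assuming the configured extension is either a single string or a list of strings. The function name `select_read_files` is hypothetical and is not taken from the snakefile.

```python
def select_read_files(filenames, extension):
    """Keep only the files that end in the configured extension(s).

    Hypothetical helper: `extension` stands in for the EXTENSION value
    read from the config, instead of a hardcoded tuple.
    """
    # str.endswith accepts either a single string or a tuple of strings.
    extensions = (extension,) if isinstance(extension, str) else tuple(extension)
    return sorted(f for f in filenames if f.endswith(extensions))


if __name__ == "__main__":
    files = ["a_1.fq", "a_2.fq", "b_1.fastq.gz", "notes.txt"]
    print(select_read_files(files, ".fq"))
```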
Hi all! I ran the bin_das_tool_manysamp.snakefile workflow recently and noticed a few things that might be helpful to update:
docker://quay.io/biocontainers/prokka:1.14.6--pl5262hdfd78af_1 due to a tbl2asn error.
This workflow is so useful and awesome, a huge thanks to everyone who worked so hard to build and maintain it!
Example with 3 input files (paths removed for clarity) -- leads to misplaced reads by name
sync.py FS102_T15_1_val_1.fq.gz FS102_T15_nodup_PE1.fastq FS102_T15_nodup_PE2.fastq
FS102_T15_2.fastq FS102_T15_orphans.fastq FS102_T15_1.fastq
Total 18956 reads from forward read file.
Total 0 reads from reverse read file.
Synced read files contain 0 reads.
Put 18956 forward reads in the orphans file.
Put 0 reverse reads in the orphans file.
This was my command:
snakemake --configfile ~/asbhatt/chris/project/gutdecon/testpoop/testpoop1.yaml.txt -s ~/asbhatt/chris/tools/bhattlab_workflows/preprocessing/preprocessing.snakefile
Here is my .yaml file
specify directories
raw_reads_directory: ~/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/2.test_subset
output_directory: ~/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput
specify parameters for TrimGalore -- automatically checks the adaptor type
trim_galore:
quality: 30
start_trim: 15
end_trim: 0
min_read_length: 60
rm_host_reads:
host_pre: bowtie_index_prefix (i.e. hg19 or mm10)
host_genome: /path/to/genome/fastq/prefix
Here's my Output
SyntaxError in line 35 of /home/cseveryn/asbhatt/chris/tools/bhattlab_workflows/preprocessing/preprocessing.snakefile:
invalid syntax
Thank you so much. :^)
There is a bug where a temporary contig file is created in the output directory, and the program overwrites or fails to delete it when multiple input files share the same name (such as contigs.fasta), even if they come from different directories. A temporary workaround is to symlink them all into one folder under unique names. A permanent fix will be added to the pipeline shortly.
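One way the symlink workaround could be scripted is sketched below. Prefixing each file with its parent directory name is an assumption about how to make names unique, and `unique_link_names` is a hypothetical helper, not part of the pipeline.

```python
from pathlib import Path


def unique_link_names(paths):
    """Map each input path to a link name prefixed by its parent directory.

    Hypothetical helper: raises if two inputs would still collide.
    """
    names = {}
    for raw in paths:
        p = Path(raw)
        name = f"{p.parent.name}_{p.name}"
        if name in names.values():
            raise ValueError(f"link name collision: {name}")
        names[raw] = name
    return names


if __name__ == "__main__":
    mapping = unique_link_names(["runA/contigs.fasta", "runB/contigs.fasta"])
    for source, link_name in mapping.items():
        # Creating the links is left to the caller, for example:
        # Path("links", link_name).symlink_to(Path(source).resolve())
        print(source, "->", link_name)
```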
In the event that the orphans output .fq.gz is empty, fastqc will fail and the pipeline will stop.
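A guard along these lines could let the pipeline skip fastqc for an empty orphans file. This is only a sketch: `gzip_has_reads` is a hypothetical name, and how the snakefile would act on the result is not shown.

```python
import gzip


def gzip_has_reads(path):
    """Return True if the gzipped file decompresses to at least one byte."""
    with gzip.open(path, "rb") as handle:
        return handle.read(1) != b""
```

A rule could call this before invoking fastqc and write a placeholder report when it returns False.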
I can see that Ben updated the sync.py script and corresponding snakefile in Nov 2018, however the current version of the snakefile has a hardcoded script to Jessy's SCG account in the sync step. What happened here? Did someone push an outdated version?
We have been getting data back as a giant fastq file of undetermined reads (instead of bcl) with the barcode in the read name. Most tools that demultiplex from fastq were very slow, could not be parallelized, and/or failed. This is just a pre-preprocessing tip.
You need two files (a file that lists your barcodes, and a script)
barcodes.txt:
samplenameA GGACTCCT+AGAGGATA
samplenameB TAGGCATG+AGAGGATA
samplenameC CTCTCTAC+AGAGGATA
...all your samples
demultiplex.sh
#!/bin/bash
# usage: demultiplex.sh <samplename> <barcode>
module load sickle/1.33
# demultiplex samples: keep each 4-line record whose header contains the barcode
grep -A3 --no-group-separator -i "$2" {giant_UndeterminedFile_1.fq} | gzip > "$1"_1.fq.gz &
grep -A3 --no-group-separator -i "$2" {giant_UndeterminedFile_2.fq} | gzip > "$1"_2.fq.gz &
# wait for both background greps to finish
wait
# remove instances that do not have pairs (trimming will fail if you do not)
sickle pe -f "$1"_1.fq.gz -r "$1"_2.fq.gz -t sanger -o paired_"$1"_1.fq -p paired_"$1"_2.fq -s "$1"_single.fq
Run:
cat barcodes.txt | xargs -l bash -c 'sbatch ..... demultiplex.sh $0 $1'
This will save you a lot of time compared with trying existing tools.
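For comparison, the same idea can be sketched in Python as a single pass over the file. This assumes Illumina-style headers in which the barcode is the last colon-separated field, which may not match your data. `demultiplex_fastq` is a hypothetical name, and holding the records in memory is only reasonable for small inputs.

```python
def demultiplex_fastq(lines, barcodes):
    """Group 4-line FASTQ records by the barcode at the end of the header.

    Assumption: the barcode is the last ':'-separated field of the header.
    `barcodes` maps sample name -> barcode, as in barcodes.txt.
    """
    by_barcode = {barcode.upper(): sample for sample, barcode in barcodes.items()}
    records = {sample: [] for sample in barcodes}
    it = iter(lines)
    # zip over the same iterator four times yields one record per step.
    for header, seq, plus, qual in zip(it, it, it, it):
        barcode = header.strip().rsplit(":", 1)[-1].upper()
        sample = by_barcode.get(barcode)
        if sample is not None:
            records[sample].append((header, seq, plus, qual))
    return records
```

Reads whose barcode is not listed are dropped, which matches what the grep version does.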
This was my command:
(preprocessing) [02:34:19] hppsl230s-rcf-412-01-l:~/.config/snakemake > snakemake -np --configfile ~/asbhatt/chris/project/gutdecon/testpoop/testpoop1.yaml.txt -s ~/asbhatt/chris/tools/bhattlab_workflows/preprocessing/preprocessing.snakefile --profile scg
/labs/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/3.links
['sample_2_R2.fastq.gz', 'sample_5_R1.fastq.gz', 'sample_10_R1.fastq.gz', 'sample_1_R1.fastq.gz', 'sample_8_R2.fastq.gz', 'sample_7_R1.fastq.gz', 'sample_UNKNOWN_R1.fastq.gz', 'sample_9_R2.fastq.gz', 'sample_UNKNOWN_R2.fastq.gz', 'sample_11_R2.fastq.gz', 'sample_13_R1.fastq.gz', 'sample_11_R1.fastq.gz', 'sample_9_R1.fastq.gz', 'sample_10_R2.fastq.gz', 'sample_6_R2.fastq.gz', 'sample_7_R2.fastq.gz', 'sample_1_R2.fastq.gz', 'sample_3_R1.fastq.gz', 'sample_8_R1.fastq.gz', 'sample_3_R2.fastq.gz', 'sample_6_R1.fastq.gz', 'sample_4_R1.fastq.gz', 'sample_12_R1.fastq.gz', 'sample_14_R2.fastq.gz', 'sample_4_R2.fastq.gz', 'sample_2_R1.fastq.gz', 'sample_5_R2.fastq.gz', 'sample_13_R2.fastq.gz', 'sample_12_R2.fastq.gz', 'sample_14_R1.fastq.gz']
Here's my Output
[Tue Nov 6 14:51:26 2018] Waiting at most 5 seconds for missing files.
[Tue Nov 6 14:51:31 2018] MissingOutputException in line 58 of /home/cseveryn/asbhatt/chris/tools/bhattlab_workflows/preprocessing/preprocessing.snakefile:
[Tue Nov 6 14:51:31 2018] Missing files after 5 seconds:
[Tue Nov 6 14:51:31 2018] /labs/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput/qc/01_trimmed/sample_13_2_val_2.fq.gz
[Tue Nov 6 14:51:31 2018] /labs/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput/qc/01_trimmed/sample_13_1_val_1.fq.gz
[Tue Nov 6 14:51:31 2018] This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
[Tue Nov 6 14:51:31 2018] Shutting down, this might take some time.
[Tue Nov 6 14:51:31 2018] Exiting because a job execution failed. Look above for error message
[Tue Nov 6 14:51:31 2018] Complete log: /home/cseveryn/.config/snakemake/.snakemake/log/2018-11-06T145030.120889.snakemake.log
/labs/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/3.links
['sample_2_R2.fastq.gz', 'sample_5_R1.fastq.gz', 'sample_10_R1.fastq.gz', 'sample_1_R1.fastq.gz', 'sample_8_R2.fastq.gz', 'sample_7_R1.fastq.gz', 'sample_UNKNOWN_R1.fastq.gz', 'sample_9_R2.fastq.gz', 'sample_UNKNOWN_R2.fastq.gz', 'sample_11_R2.fastq.gz', 'sample_13_R1.fastq.gz', 'sample_11_R1.fastq.gz', 'sample_9_R1.fastq.gz', 'sample_10_R2.fastq.gz', 'sample_6_R2.fastq.gz', 'sample_7_R2.fastq.gz', 'sample_1_R2.fastq.gz', 'sample_3_R1.fastq.gz', 'sample_8_R1.fastq.gz', 'sample_3_R2.fastq.gz', 'sample_6_R1.fastq.gz', 'sample_4_R1.fastq.gz', 'sample_12_R1.fastq.gz', 'sample_14_R2.fastq.gz', 'sample_4_R2.fastq.gz', 'sample_2_R1.fastq.gz', 'sample_5_R2.fastq.gz', 'sample_13_R2.fastq.gz', 'sample_12_R2.fastq.gz', 'sample_14_R1.fastq.gz']
Error in rule plot_R_k21:
jobid: 214
output: /oak/stanford/scg/lab_asbhatt/alvinhan/projects/hotmice_sourmash/sourmash/04_sourmash_compare/compare_k21_heatmap_complete.pdf
RuleException:
CalledProcessError in line 151 of /oak/stanford/scg/lab_asbhatt/alvinhan/projects/hotmice_sourmash/sourmash_tools/sourmash.snakefile:
Command 'set -euo pipefail; Rscript --vanilla /oak/stanford/scg/lab_asbhatt/alvinhan/projects/hotmice_sourmash/.snakemake/scripts/tmpxg8sobsl.heatmaps.R' returned non-zero exit status 1.
File "/oak/stanford/scg/lab_asbhatt/alvinhan/projects/hotmice_sourmash/sourmash_tools/sourmash.snakefile", line 151, in __rule_plot_R_k21
File "/labs/asbhatt/alvinhan/miniconda3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
[Wed Feb 5 12:07:41 2020]
localrule plot_R_k51:
input: /oak/stanford/scg/lab_asbhatt/alvinhan/projects/hotmice_sourmash/sourmash/04_sourmash_compare/compare_k51.csv
output: /oak/stanford/scg/lab_asbhatt/alvinhan/projects/hotmice_sourmash/sourmash/04_sourmash_compare/compare_k51_heatmap_complete.pdf
jobid: 216
Error in library(gplots) : there is no package called ‘gplots’
Execution halted
Doing things locally resolves this problem, not sure if this is just me being dumb.
Use pigz -p 8 to parallelize this.
In compare_bins_references.snakefile, rule concat_top_nucmer throws an error: "No values given for wildcard choice" for the input line.
This happens in about 5% of cases, looping away and wasting compute with no error. See ablab/spades#152
A solution is to split the read correction and the assembler, and skip the read correction if it fails to complete after some reasonable amount of time. spades-hammer can be run alone, then spades with the --only-assembler option.
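The timeout part of this idea can be sketched with the standard library. `run_with_timeout` is a hypothetical helper, and the actual read-correction and assembler command lines are deliberately left to the caller.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return True on success, False if it fails or exceeds timeout_s.

    subprocess.run kills the child process when the timeout expires.
    """
    try:
        subprocess.run(cmd, check=True, timeout=timeout_s)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False
```

A wrapper could call this on the read-correction command first. It would then run the assembler on the corrected reads if the call returned True, and on the uncorrected reads otherwise.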
I'm not ambitious enough but we should think about making preprocessing work with single end data. Surprisingly, there's enough of it out there that it would come in handy.
Specific Scenario:
I had 41 samples. 40 of them were named like this:
1_1.fq.gz 1_2.fq.gz
2_1.fq.gz 2_2.fq.gz
...
One of them was named:
aerobicmixture_1.fq.gz aerobicmixture_2.fq.gz
The pipeline only worked for aerobicmixture. The rest of the samples did not get submitted for trimming. The pipeline was still running but not submitting these jobs.
If I change the first 40 files to:
AD1_1.fq.gz AD1_2.fq.gz
AD2_1.fq.gz AD2_2.fq.gz
It seems to work fine for them all.
Is there a possibility of integrating a method to merge bins that are comparable? CheckM has a merge method to match bins that would increase the total completeness without increasing contamination. We could filter those on GC, coverage, and Kraken classification, and then merge them for more complete bins.
Error message attached. Problem resolved by deleting the bin.tooShort file and rerunning.
slurm-14172059.out.txt
Hi,
Thank you for implementing the workflow. When I tried to run the drep pipeline with the fastANI algorithm, it reported that fastANI is not installed. Indeed, that docker image does not contain fastANI, so the pipeline couldn't run correctly. It would be a good idea to fix this, or to just run with the default algorithm.
A
This was my command:
snakemake --configfile ~/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/19-03-19_chris_gutdecon1/chris_demux/configpreprocessing.yaml -s ~/asbhatt/chris/tools/bhattlab_workflows/preprocessing/preprocessing.snakefile --profile scg --jobs 400
Here is my .yaml file:
raw_reads_directory: /labs/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/19-03-19_chris_gutdecon1/chris_demux/L*
output_directory: /labs/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/19-03-19_chris_gutdecon1/chris_demux/gutdecon2
read_specification: ['1', '2']
extension: .fastq.gz
trim_galore:
quality: 30
start_trim: 15
end_trim: 0
min_read_length: 60
rm_host_reads:
host_genome: /labs/asbhatt/data/host_reference_genomes/hg19/hg19.fa
This was my output:
Trying to restart job 184.
Error in rule trim_galore:
jobid: 166
output: /labs/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/19-03-19_chris_gutdecon1/chris_demux/gutdecon2/gutdecon2_output/01_processing/01_trimmed/CS_ICt103018_1_val_1.fq.gz, /labs/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/19-03-19_chris_gutdecon1/chris_demux/gutdecon2/gutdecon2_output/01_processing/01_trimmed/CS_ICt103018_2_val_2.fq.gz, /labs/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/19-03-19_chris_gutdecon1/chris_demux/gutdecon2/gutdecon2_output/01_processing/01_trimmed/CS_ICt103018_unpaired.fq.gz
cluster_jobid: 8592599
Trying to restart job 166.
Error in rule trim_galore:
jobid: 182
output: /labs/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/19-03-19_chris_gutdecon1/chris_demux/gutdecon2/gutdecon2_output/01_processing/01_trimmed/CS_Zym103018_1_val_1.fq.gz, /labs/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/19-03-19_chris_gutdecon1/chris_demux/gutdecon2/gutdecon2_output/01_processing/01_trimmed/CS_Zym103018_2_val_2.fq.gz, /labs/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/19-03-19_chris_gutdecon1/chris_demux/gutdecon2/gutdecon2_output/01_processing/01_trimmed/CS_Zym103018_unpaired.fq.gz
cluster_jobid: 8592582
Trying to restart job 182.
Error in rule trim_galore:
jobid: 179
output: /labs/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/19-03-19_chris_gutdecon1/chris_demux/gutdecon2/gutdecon2_output/01_processing/01_trimmed/CS_81_dn6_1_val_1.fq.gz, /labs/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/19-03-19_chris_gutdecon1/chris_demux/gutdecon2/gutdecon2_output/01_processing/01_trimmed/CS_81_dn6_2_val_2.fq.gz, /labs/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/19-03-19_chris_gutdecon1/chris_demux/gutdecon2/gutdecon2_output/01_processing/01_trimmed/CS_81_dn6_unpaired.fq.gz
cluster_jobid: 8592624
Trying to restart job 179.
This was my command:
snakemake --configfile ~/asbhatt/chris/project/gutdecon/testpoop/testpoop1.yaml.txt -s ~/asbhatt/chris/tools/bhattlab_workflows/preprocessing/preprocessing.snakefile
Here is my .yaml file
specify directories
raw_reads_directory: ~/asbhatt/chris/project/gutdecon/18-11-1_chris_gutdecon1/2.test_subset
output_directory: ~/asbhatt/chris/project/gutdecon/testpoop/testpoopoutput
specify parameters for TrimGalore -- automatically checks the adaptor type
trim_galore:
quality: 30
start_trim: 15
end_trim: 0
min_read_length: 60
rm_host_reads:
host_pre: bowtie_index_prefix (i.e. hg19 or mm10)
host_genome: /path/to/genome/fastq/prefix
Here's my Output
SyntaxError in line 78 of /home/cseveryn/asbhatt/chris/tools/bhattlab_workflows/preprocessing/preprocessing.snakefile:
invalid syntax
Thank you! =^)
Hi developers,
I recently ran the binning step using bin_das_tool_manysamp.snakefile and hit a "bash: Fasta_to_Scaffolds2Bin.sh: command not found" error.
This is because the dastool team recently (March 2022) released version 1.1.4, which no longer includes Fasta_to_Scaffolds2Bin.sh.
I would suggest adding das_tool=1.1.3 in the das_tool.yaml file, so that Fasta_to_Scaffolds2Bin.sh still exists and the pipeline can finish.
Regards.
Angel
I wonder if we might be better off using super deduper https://ibest.github.io/HTStream/#hts_SuperDeduper since to my knowledge it produces synced paired-end files and having two syncing steps in our pipeline slows things a bit.
It appears that the syncing script cannot handle fastq headers that are not in standard Illumina format and puts all reads into the orphans file. This commonly affects data downloaded from SRA, for example. We definitely need to fix this asap.
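One possible direction for a fix is to derive a pairing key that does not depend on the header style. The sketch below is an assumption about what would work, not the logic in sync.py, and `read_key` is a hypothetical name. It keeps the first whitespace-delimited token of the header and strips a trailing /1 or /2.

```python
def read_key(header):
    """Return a key shared by both mates of a read pair.

    Assumption: mates differ only in text after the first whitespace
    or in a trailing '/1' or '/2' on the read name.
    """
    name = header.strip()
    if name.startswith("@"):
        name = name[1:]
    name = name.split()[0]  # drop any comment after the read name
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name
```

Headers with other mate markers would need extra rules, so this should be checked against real SRA files before relying on it.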
Activating singularity image /oak/stanford/scg/lab_asbhatt/alvinhan/projects/hotmice_sourmash/.snakemake/singularity/74b361ef183a5d4349a9285ba3a1a819.simg
FATAL: container creation failed: mount /proc/self/fd/10->/var/singularity/mnt/session/rootfs error: can't mount image /proc/self/fd/10: kernel reported a bad superblock for squashfs image partition, possible causes are that your kernel doesn't support the compression algorithm or the image is corrupted
[Thu Feb 6 14:11:15 2020]
Error in rule plot_R_k31:
jobid: 191
output: /oak/stanford/scg/lab_asbhatt/alvinhan/projects/hotmice_sourmash/sourmash/04_sourmash_compare/compare_k31_heatmap_complete.pdf
RuleException:
CalledProcessError in line 162 of /oak/stanford/scg/lab_asbhatt/alvinhan/projects/hotmice_sourmash/sourmash_tools/sourmash.snakefile:
Command ' singularity exec --home /oak/stanford/scg/lab_asbhatt/alvinhan/projects/hotmice_sourmash --bind /labs/,/oak/ --bind /labs/asbhatt/alvinhan/miniconda3/lib/python3.7/site-packages:/mnt/snakemake /oak/stanford/scg/lab_asbhatt/alvinhan/projects/hotmice_sourmash/.snakemake/singularity/74b361ef183a5d4349a9285ba3a1a819.simg bash -c 'set -euo pipefail; Rscript --vanilla /oak/stanford/scg/lab_asbhatt/alvinhan/projects/hotmice_sourmash/.snakemake/scripts/tmphm7c91pp.heatmaps.R'' returned non-zero exit status 255.
File "/oak/stanford/scg/lab_asbhatt/alvinhan/projects/hotmice_sourmash/sourmash_tools/sourmash.snakefile", line 162, in __rule_plot_R_k31
File "/labs/asbhatt/alvinhan/miniconda3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
@tamburinif Your commit added some garbage hidden files from mac. Can you add a .gitignore file and fix this?
Line 237 if statement does not have a corresponding fi command. When we added this it ran successfully. I will be submitting a pull request but wanted to start an issue first.
I see a commit "getting rid of old scripts notation" for preprocessing yet the snakefile still looks for 'scripts_dir' in the config, what's up with that?
@tamburinif @elimoss @jribado We're getting to the point where this repo changes enough and is used enough that I think we need to keep a stable "master" branch and only change things on a dev branch. We should also come up with a standardized test dataset (could be automatically run through a continuous integration tool).