felixkrueger / sherman

A simple Bisulfite FastQ Read Simulator (BiQRS)

License: GNU General Public License v3.0

Perl 100.00%

sherman's Issues

error when using min and max fragment size options

Sherman version 0.1.8

command:
perl ~/tools/Sherman_v0.1.8/Sherman -l 150 -n 15000 --genome_folder cho_horizon_test/ -pe -I 300 -X 400

error:
Please select a fragment length that is longer than the read length

When the -I and -X options are removed, the run finishes successfully.

Is it possible to simulate amplicon methylation reads, or could this be added as a new feature in Sherman?

Incorrect paired-end read ID

Hello,

Single-end read IDs are generated correctly, but when generating paired-end reads with

(bioitools_dev) gzynda-mbpr:BSMAPz gzynda$ perl Sherman -l 100 -n 2 --genome_folder ./genome/ -cr 95 -e 1 -pe
Genome folder was specified as ./genome/

Selected general parameters:
----------------------------------------------------------------------------------------------------
Paired-end reads selected. Fragment length will be 70-400 bp
sequence length:	100 bp
number of sequences being generated:	2

Possible sources of contamination:
----------------------------------------------------------------------------------------------------
overall error rate:	1%
bisulfite conversion rate:	95%


Now reading in and storing sequence information of the genome specified in: ./genome/

chr chr1 (3000 bp)
chr chr2 (3000 bp)

Generating quality values with a user defined decaying per-bp error rate of 1%
Starting to work out the slope of the error curve
Error rates per bp will be modelled according to the formula:
default base quality - 0.034286*position[bp] + 0.0009263*(position[bp]**2)) - 0.00001*(position[bp]**3)*3.7568)


Final report:
----------------------------------------------------------------------------------------------------
2 genomic sequences were successfully generated in total (+ strand: 0	 - strand: 2)
Cytosines bisulfite converted in any context: 94.78%
Random sequencing errors introduced in total: 4 (of 400 bp in total) (percentage: 1.00)

Looking at the 2 reads I generated, I would expect the first pair to exist in chr1:586-713

(bioitools_dev) gzynda-mbpr:BSMAPz gzynda$ head simulated_*
==> simulated_1.fastq <==
@1_chr1:586-713_R1
TTTGTTTATAGAATTGAGTATACTATTTAAGGGTGGGGGGTTAATGATTTTTTTAATTAGAATATTAAAGGTATTTGGATTGATGATTTTTGTTTATCAG
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGFFFFFFFEEEEEDDDDCCCCBBBAAA@@???>>==<<;;::98877655432210/.--,+*)
@2_chr2:2295-2622_R1
TATGTGTTATTATGTTTTGTAATTTTTGATTTGATTTGTGTGGAATGCGGGCAAGATGTATTTTGTTTGAGTTATTTTGTTATGTTTTTGGGTGCAAGTT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGFFFFFFFEEEEEDDDDCCCCBBBAAA@@???>>==<<;;::98877655432210/.--,+*)

==> simulated_2.fastq <==
@1_chr1:586-713_R2
ATACAACAAAATTTTTTTTCTATAATCCTTAAAAAAAAAAATCATCAATCCAAATACCTTTAATATTCTGATTAAAAAAATCATTAACCCCCCACCCTTA
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGFFFFFFFEEEEEDDDDCCCCBBBAAA@@???>>==<<;;::98877655432210/.--,+*)
@2_chr2:2295-2622_R2
CAACACCAAAAAATAAAACAACCTTAAATCAAAAACAATATTAATCAAAAAAAATATTCAACGCAATAAAACCACAAAATTAGCATAAAAATAATACCTC
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGFFFFFFFEEEEEDDDDCCCCBBBAAA@@???>>==<<;;::98877655432210/.--,+*)

However, it does not seem so after extracting that region with samtools

(bioitools_dev) gzynda-mbpr:BSMAPz gzynda$ samtools faidx genome/test.fasta chr1:586-713
>chr1:586-713
CTTACATAAAGGAGCTATTAGTATTATCCTGCGAAGATTCAAAAAGGTGAGCCAATTCGG
CCGATCCGGAAAGACGGACTTCAAAGTTACGTGACGACGGTTGTGGGTCCGTAACAAAAT
CCTCATAA

A discontiguous blast shows that R1 of the first fragment is the reverse complement of chr1:2455-2540

ALIGNMENTS
>Query_86114 chr1
Length=3000

 Score = 84.7 bits (56),  Expect = 1e-20
 Identities = 71/86 (83%), Gaps = 0/86 (0%)
 Strand=Plus/Minus

Query  4     GTTTATAGAATTGAGTATACTATTTAAGGGTGGGGGGTTAATGAtttttttAATTAGAAT  63
             |||||||||| |||||||||||| |||||||||||||  || |||  ||  ||| |||| 
Sbjct  2540  GTTTATAGAACTGAGTATACTATCTAAGGGTGGGGGGCCAACGATCCTTCCAATCAGAAC  2481

Query  64    ATTAAAGGTATTTGGATTGATGATTT  89
             || |||||||| |||| ||| |||||
Sbjct  2480  ATCAAAGGTATCTGGACTGACGATTT  2455

which can be confirmed with samtools

(bioitools_dev) gzynda-mbpr:BSMAPz gzynda$ samtools faidx genome/test.fasta chr1:2455-2540
>chr1:2455-2540
AAATCGTCAGTCCAGATACCTTTGATGTTCTGATTGGAAGGATCGTTGGCCCCCCACCCT
TAGATAGTATACTCAGTTCTATAAAC

Are you able to replicate this issue?
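
The BLAST result can be cross-checked without BLAST by comparing the read against the reverse complement of chr1:2455-2540, treating a reference C that is read as T as a bisulfite conversion rather than a mismatch. This is a minimal sketch. The function names are illustrative, the sequences are copied from this report, and only the read-strand C-to-T case is handled.

```python
# Sequences copied from the report above.
READ_R1 = (
    "TTTGTTTATAGAATTGAGTATACTATTTAAGGGTGGGGGGTTAATGATTTTTTTAATTAGAATATTAAAGG"
    "TATTTGGATTGATGATTTTTGTTTATCAG"
)
REF_2455_2540 = (
    "AAATCGTCAGTCCAGATACCTTTGATGTTCTGATTGGAAGGATCGTTGGCCCCCCACCCT"
    "TAGATAGTATACTCAGTTCTATAAAC"
)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def mismatches(read: str, ref: str, bisulfite: bool = False) -> int:
    """Count mismatching positions between two equal-length sequences.

    With bisulfite=True, a reference C observed as T in the read is
    treated as a conversion event and not counted.
    """
    if len(read) != len(ref):
        raise ValueError("sequences must have the same length")
    count = 0
    for r, g in zip(read.upper(), ref.upper()):
        if r == g:
            continue
        if bisulfite and g == "C" and r == "T":
            continue
        count += 1
    return count


# Read positions 4-89 (1-based), the span covered by the BLAST hit.
segment = READ_R1[3:89]
minus_strand = reverse_complement(REF_2455_2540)

print("plain mismatches:", mismatches(segment, minus_strand))
print("bisulfite-aware mismatches:", mismatches(segment, minus_strand, bisulfite=True))
```

If the plain count equals the 15 differences implied by BLAST's 71/86 identities and the bisulfite-aware count drops to zero, the read matches the minus strand at 2455-2540 rather than the region named in its ID.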

Sequence length inequality warning - please investigate

I'm getting the following warning message:

The length of seq1 or seq2 were not equal to the sequence length of 101

This was the command line:

Sherman -l 100 -n 100000 --paired --genome /bi/scratch/Genomes/Yeast/Saccharomyces_cerevisiae/SGD1/ -CG 23 -CH 6

Uniform Distribution of the insert size?

Hi Felix,
I'd like to make sure of one thing: the insert size is uniformly distributed, right?
I'm wondering whether it's OK for me to add a feature that generates PE reads with normally distributed insert sizes.

Thanks,
Shaojun
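
To make the two options concrete, here is a minimal sketch of both sampling schemes. It is not Sherman's code: the function names are hypothetical, the 70-400 bp bounds are taken from the logs elsewhere on this page, and the mean and standard deviation are arbitrary illustration values. The normal variant is truncated to the allowed range by rejection sampling.

```python
import random


def uniform_fragment_length(min_len: int, max_len: int, rng: random.Random) -> int:
    """Draw a fragment length uniformly from [min_len, max_len]."""
    return rng.randint(min_len, max_len)


def normal_fragment_length(
    mean: float, sd: float, min_len: int, max_len: int, rng: random.Random
) -> int:
    """Draw a fragment length from a normal distribution truncated to
    [min_len, max_len], redrawing until the value falls inside the range."""
    while True:
        length = round(rng.gauss(mean, sd))
        if min_len <= length <= max_len:
            return length


rng = random.Random(42)  # fixed seed so runs are repeatable
uniform = [uniform_fragment_length(70, 400, rng) for _ in range(10_000)]
normal = [normal_fragment_length(235, 50, 70, 400, rng) for _ in range(10_000)]

print("uniform mean:", sum(uniform) / len(uniform))
print("normal mean:", sum(normal) / len(normal))
```

Rejection sampling keeps the code short, but it becomes slow if the mean sits far outside the allowed range.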

Problems of read ID generated by Sherman

Hi Felix,

The command line I used for Sherman (v0.1.7) was:
Sherman -l 100 -n 1000000 --genome_folder dir4ref -pe -CG 20.00 -CH 4

Then I got two files: simulated_1.fastq and simulated_2.fastq
Here is one example from read 1:
@1_chr4:8813479-8813651_R1
TTTGACAAGAGGAGAAGATGGAAGAGACAAATGGATCAGAGATGTGTCCTTTTAAACTGGTAGTTGAATTGTTGATACAAGTTTCAACTCCTATCCATGT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

One example from read 2:
@1_chr4:8813479-8813651_R2
TATGAATGAATTAAATAAACATTGTTAGCAGACTTTATTTTTTCTTAAATTAAATTTTGTTCCTAGTGTGGACACATGAATAGGAGTTGAAACTTGTATC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

For the alignment, the command line I used for Bismark (0.16.3) was:

bismark dir4ref -1 simulated_1.fastq -2 simulated_1.fastq

Here is the error message:

Now starting a Bowtie 2 paired-end alignment for CTread1GAread2CTgenome (reading in sequences from simulated_1.fastq_C_to_T.fastq and simulated_1.fastq_G_to_A.fastq, with the options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --maxins 500 --norc))
Either the first or the second id need to be read 1! ID1 was: 1_chr4:8813479-8813651_R1/2; ID2 was: 1_chr4:8813479-8813651_R1/2
Error while flushing and closing output
terminate called after throwing an instance of 'int'
(ERR): bowtie2-align died with signal 6 (ABRT) (core dumped)

When I checked the intermediate converted fastq file, the read ID for read 1 was:
@1_chr4:8813479-8813651_R1/1/1

The read ID for read 2 was:
@1_chr4:8813479-8813651_R1/2/2

I hope this information helps you find the problem. Thanks.
Please let me know if you need any other information.

Best!
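
Note that the posted command passes simulated_1.fastq to both -1 and -2, and the Bowtie 2 log names simulated_1.fastq as the source of both converted files, so both mates would carry an _R1 ID. A small pre-flight check along these lines catches that before alignment. It is only a sketch: the helper name is hypothetical and the ID layout (`<n>_<chr>:<start>-<end>_R1` or `_R2`) is inferred from the examples in this report.

```python
def mate_ids_consistent(id1: str, id2: str) -> bool:
    """Return True if id1 ends in _R1, id2 ends in _R2, and the part
    before the mate suffix is identical in both IDs."""
    return (
        id1.endswith("_R1")
        and id2.endswith("_R2")
        and id1[:-3] == id2[:-3]
    )


# IDs from simulated_1.fastq and simulated_2.fastq, as shown above.
print(mate_ids_consistent("1_chr4:8813479-8813651_R1", "1_chr4:8813479-8813651_R2"))

# What the aligner sees when simulated_1.fastq is given for both mates.
print(mate_ids_consistent("1_chr4:8813479-8813651_R1", "1_chr4:8813479-8813651_R1"))
```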

Error rate and runtime

Hi,

I'm using Sherman to simulate some BS reads with the following parameters:

    - 80 million reads   
    - 80% conversion rate (20% methylation)   
    - 125bp length for all reads  
    - Paired end reads  
    - Constant Phred quality score of 40  
    - Error rate of 0.01%

The first time I tried to simulate I let Sherman run for a couple of days before giving up and stopping it. I tried again this week and now it's been running for 71h and it's still stuck on the same process, which is the error curve:

Genome folder was specified as /path/to/genome/

Selected general parameters:
----------------------------------------------------------------------------------------------------
Paired-end reads selected. Fragment length will be 70-400 bp
sequence length: 125 bp
number of sequences being generated: 80000000

Possible sources of contamination:
----------------------------------------------------------------------------------------------------
overall error rate: 0.01%
bisulfite conversion rate: 80%

Generating quality values with a user defined decaying per-bp error rate of 0.01%
Starting to work out the slope of the error curve

By setting the error rate to zero, the simulation runs quite smoothly, so the culprit is clearly the error rate. My question is: is this normal?

Thank you in advance for your answer!

Stefan
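
For context on the parameter list in this report: a Phred score Q corresponds to a per-base error probability of 10^(-Q/10), so a constant Phred score of 40 and an error rate of 0.01% describe the same probability. The sketch below only shows that standard conversion. The function names are illustrative and nothing here reflects how Sherman fits its error curve.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert a per-base error probability (0 < p <= 1) to a Phred score."""
    return -10 * math.log10(p)


p = phred_to_error_prob(40)
print(f"Phred 40 -> error probability {p:.6f} ({p * 100:.2f}%)")
print(f"0.01% error -> Phred {error_prob_to_phred(0.0001):.1f}")
```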

Active Project?

Is Sherman still being actively developed?
I have some ideas for a couple of features I'd like to add. I have created a fork and will add my code there. So I'm just checking whether this is of interest to you.
New features are top secret until I work out exactly what I'm doing... but related to bs/oxbs subtractions

Feature request: output directory and output prefix

Hi, I'm wondering whether options for an output directory and an output prefix could be added, so that it's easier to integrate Sherman into a pipeline that runs jobs in parallel. Thanks in advance.
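
Until such options exist, one possible workaround is to launch each job from its own working directory. The sketch below assumes that Sherman writes simulated_1.fastq and simulated_2.fastq into the current working directory, as the other reports on this page suggest. The helper names and the chunk directory layout are hypothetical, and only flags that appear in commands on this page are used.

```python
import subprocess
from pathlib import Path


def build_jobs(n_jobs, reads_per_job, genome_folder, base_dir, read_len=100):
    """Return a list of (working_directory, argv) pairs, one per job.

    The genome path is made absolute because each job runs from a
    different working directory.
    """
    genome = str(Path(genome_folder).resolve())
    jobs = []
    for i in range(n_jobs):
        workdir = Path(base_dir) / f"chunk_{i:03d}"
        argv = [
            "Sherman",
            "-l", str(read_len),
            "-n", str(reads_per_job),
            "--genome_folder", genome,
            "-pe",
        ]
        jobs.append((workdir, argv))
    return jobs


def run_job(workdir, argv):
    """Create the working directory and run one Sherman job inside it."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(argv, cwd=workdir, check=True)


# Show what would be launched; call run_job(workdir, argv) to execute.
for workdir, argv in build_jobs(4, 250_000, "genome", "sherman_runs"):
    print(workdir, " ".join(argv))
```

The jobs can then be dispatched with any scheduler or with `concurrent.futures`. Each job still loads the genome separately, so memory use grows with the number of concurrent jobs.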

Error when using Sherman 0.1.9, but working with 0.1.8

Hello,
I managed to run Sherman v0.1.8 on a small arthropod genome without issues with the following command line:

 ~/Software/Sherman-0.1.8/Sherman -l 125 -n 1000000 -pe -CG 50 -CH 5 --genome_folder ${PWD}/GENOME/

I then decided to try the most recent version because it can save the truth set directly. However, when I tried to use the latest version with the following command line:

~/Software/Sherman-0.1.9/Sherman -l 125 -n 1000000 -pe -CG 50 -CH 5 --genome_folder ${PWD}/GENOMES -o ${PWD}/SIMULATED --truth-seq

I get the following error:

Output will be written into the directory: /PATH/TO/METHYL/SIMULATED/
Writing a 'truth_set' of positions that underwent bisulfite conversion to the file 'positional_changes.txt' (not available in random mode)
Genome folder was specified as /PATH/TO/METHYL/GENOMES/

Selected general parameters:
----------------------------------------------------------------------------------------------------
Paired-end reads selected. Fragment length will be 70-400 bp
sequence length:        125 bp
number of sequences being generated:    1000000
Generating additional truth set file ('positional_changes.txt')

Possible sources of contamination:
----------------------------------------------------------------------------------------------------
overall error rate:     0%
bisulfite conversion rate in CG-context:        50%
bisulfite conversion rate in CH-context:        5%
default Phred quality value:    40


Now reading in and storing sequence information of the genome specified in: /PATH/TO/METHYL/SIMULATED/

chr CMA.1 (27628197 bp)
...
chr SEQN.1 (13038 bp)
chr SEQM.1 (20264 bp)
chr SEQQ.1 (10396 bp)

Quality values will be constant throughout with a Phred score of 40
Argument "" isn't numeric in addition (+) at /home/atalent/Software/Sherman-0.1.9/Sherman line 860, <CHR_IN> line 2260693.

Not sure what the issue might be. I tried both the release version and cloning the repo.
Any suggestions?

Thanks in advance
Andrea

Parallelization

First of all, congratulations on the excellent work in developing this tool.

If I may, I would suggest making it parallelizable.
I am trying to create a FASTQ file with 80 million reads from a 70 GB FASTA file, and it is taking a really long time. I tried to parallelize by creating several smaller FASTQ files and then joining them, but apart from the RAM problem (the database is loaded into RAM every time), I could get duplicate reads by chance.

Any advice or new features are welcome!
Best,
Francesco
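
As a stopgap for the join step, chunks can be merged while dropping repeated fragments and renumbering the reads. The sketch below is not part of Sherman. It assumes read IDs of the form `@<n>_<chr>:<start>-<end>_R1` as seen in other reports on this page, and it treats two reads with the same coordinates and mate suffix as duplicates, which ignores strand and conversion differences. The function name is hypothetical.

```python
import io


def merge_fastq_chunks(chunks):
    """Merge FASTQ chunks from an iterable of text handles.

    Yields (header, sequence, separator, quality) tuples with the leading
    read counter renumbered across chunks. A record is skipped when its
    '<chr>:<start>-<end>_R<mate>' key has been seen before.
    """
    seen = set()
    counter = 0
    for handle in chunks:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of this chunk
            seq = handle.readline().rstrip("\n")
            sep = handle.readline().rstrip("\n")
            qual = handle.readline().rstrip("\n")
            # '@12_chr1:586-713_R1' -> 'chr1:586-713_R1'
            key = header.split("_", 1)[1]
            if key in seen:
                continue
            seen.add(key)
            counter += 1
            yield f"@{counter}_{key}", seq, sep, qual


# Tiny in-memory demo with one fragment repeated across two chunks.
chunk_a = io.StringIO("@1_chr1:5-10_R1\nACGT\n+\nIIII\n@2_chr2:1-4_R1\nTTTT\n+\nIIII\n")
chunk_b = io.StringIO("@1_chr1:5-10_R1\nACGT\n+\nIIII\n@2_chr1:7-12_R1\nGGGG\n+\nIIII\n")
merged = list(merge_fastq_chunks([chunk_a, chunk_b]))
for record in merged:
    print("\n".join(record))
```

For paired-end data, the R1 and R2 files need to be merged in the same chunk order so that the renumbered IDs stay in sync.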
