heathsc / gembs

gemBS is a bioinformatics pipeline designed for high-throughput analysis of DNA methylation from Whole Genome Bisulfite Sequencing (WGBS) data.

License: GNU General Public License v3.0

gembs's Issues

gemBS extract does not produce bedmethyl or snps output files

With the latest version of gemBS, gemBS extract only produces the gemBS output (even if not set). Moreover, it does not produce the ENCODE style output files or the SNP extraction files...

This is the relevant part of my config

[extract]
# there is a jobs arg but no threads arg
jobs=70
# strand_specific = True
phred_threshold = 10
# make_cpg = True
# make_non_cpg = True
make_bedmethyl = True
# make_bigwig = True
make_snps = True

And my ./.gemBS/gemBS.json:

    "extract": {
      "make_snps": "True",
      "phred_threshold": "10",
      "make_bedmethyl": "True",
      "jobs": "70"
    },

TypeError: sequence item 15: expected str instance, NoneType found

I see this error when trying to run gemBS map:

:
: Command map started at 2018-09-29 19:18:30.885616
:
: ------------ Mapping Parameters ------------
: Sample barcode   : pgp1
: Data set         : pgp01
: No. threads      : 16
: Index            : /data/pgp_data/pgp_wgbs/analysis/ref_indexes/hg38.BS.gem
: Paired           : True
: Read non stranded: False
: Type             : PAIRED
: Input Files      : /data/pgp_data/pgp_wgbs/analysis/fastq/pgp01_R1.fastq.gz,/data/pgp_data/pgp_wgbs/analysis/fastq/pgp01_R2.fastq.gz
: Output dir       : /data/pgp_data/pgp_wgbs/analysis/mapping/pgp1
:
: Bisulfite Mapping...
TypeError: sequence item 15: expected str instance, NoneType found

Looking at this in more detail, the error is raised at the following location (specifically at L165):

gemBS/gemBS/utils.py

Lines 161 to 166 in 87a6657

    def to_bash(self):
        """Returns the bash command representation
        """
        if isinstance(self.commands, (list, tuple)):
            return " ".join(self.commands)
        return str(self.commands)

In my case, self.commands is equal to the following:

>>> ['/usr/local/lib/python3.6/dist-packages/gemBS/gemBSbinaries/gem-mapper', '-I', '/data/pgp_data/pgp_wgbs/analysis/ref_indexes/hg38.BS.gem', '--i1', '/data/pgp_data/pgp_wgbs/analysis/fastq/pgp01_R1.fastq.gz', '--i2', '/data/pgp_data/pgp_wgbs/analysis/fastq/pgp01_R2.fastq.gz', '-p', '-t', '16', '--report-file', '/data/pgp_data/pgp_wgbs/analysis/mapping/pgp1/pgp01.json', '-r', '@RG\\tID:pgp01\\tSM:\\tBC:pgp1\\tPU:pgp01', '--underconversion_sequence', None, '--overconversion_sequence', None]

So the issue seems to be that the values of --underconversion_sequence and --overconversion_sequence are set to None.

There should therefore be a default value (an empty string) for both of these parameters, or rather they should not be added to this list when their value is None.
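
The second option (leaving out options whose value is None) could look roughly like the sketch below. This is only an illustration, not the actual gemBS fix: `build_mapper_command` and its parameters are hypothetical names, and only the two option flags are taken from the command list shown in this issue.

```python
def build_mapper_command(base_command, under_seq=None, over_seq=None):
    """Append the conversion-control options only when a value is set.

    Hypothetical helper for illustration; not part of gemBS.
    """
    command = list(base_command)
    optional = [
        ("--underconversion_sequence", under_seq),
        ("--overconversion_sequence", over_seq),
    ]
    for flag, value in optional:
        if value is not None:
            command.extend([flag, str(value)])
    return command


# With both values unset, no None ends up in the list, so joining works.
command = build_mapper_command(["gem-mapper", "-p", "-t", "16"])
bash_line = " ".join(command)
```

Because every element of the resulting list is a string, `" ".join(...)` in `to_bash` no longer fails.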

Typo in help section

gemBS/src/production.py

Line 386 in 6dc1084

 parser.add_argument('-i', '--input-dir', dest="input_dir",metavar="PATH", help='Path were are located the BAM aligned files.', required=True) 

Please update to something like: Path where the BAM aligned files are located (using where instead of were)

i.e.

 parser.add_argument('-i', '--input-dir', dest="input_dir",metavar="PATH", help='Path where the BAM aligned files are located.', required=True)

ValueError: too many values to unpack

gemBS/src/production.py

Lines 406 to 422 in 9b7d158

    #Create Dictionary of samples and bam file
    self.samplesBams = {}
    self.records = 0
    for k,v in FLIdata(args.json_file).sampleData.iteritems():
        fileBam = "%s/%s.bam" %(args.input_dir,v.getFli())
        self.records = self.records + 1
        if os.path.isfile(fileBam):
            if v.sample_barcode not in self.samplesBams:
                self.samplesBams[v.sample_barcode] = [fileBam]
            else:
                self.samplesBams[v.sample_barcode].append(fileBam)
    #Check list of file
    self.totalFiles = 0
    for sample,listBams in self.samplesBams:
        self.totalFiles += len(listBams)

Error:

  File "...../python2.7/site-packages/src/production.py", line 421, in run
    for sample,listBams in self.samplesBams:
ValueError: too many values to unpack

self.samplesBams is a dict, so I'm guessing you need iteritems() on line 421:

 for sample,listBams in self.samplesBams.iteritems():
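
The failure can be reproduced in isolation. Iterating over a dict yields only its keys, so each key string is what gets unpacked into the two loop variables. The sketch below uses Python 3, where the equivalent of `iteritems()` is `items()`; the sample names and file names are made up for illustration.

```python
samples_bams = {
    "sampleA": ["a_1.bam", "a_2.bam"],
    "sampleB": ["b_1.bam"],
}

# Iterating the dict directly yields keys; unpacking the 7-character key
# "sampleA" into two names raises ValueError.
try:
    for sample, list_bams in samples_bams:
        pass
    unpack_failed = False
except ValueError:
    unpack_failed = True

# Iterating over items() yields (key, value) pairs, as intended.
total_files = 0
for sample, list_bams in samples_bams.items():
    total_files += len(list_bams)
```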

getting several install errors

Hi,

I was really enthusiastic about your new bioRxiv paper! On installing, I get several errors when I run python setup.py install --user. Most of these errors seem to be related to bs_call.c:

bs_call.c:284: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
bs_call.c:357: error: expected ‘)’ before ‘*’ token
bs_call.c:373: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
bs_call.c: In function ‘do_calling’:
bs_call.c:417: error: ‘gsl_integration_glfixed_table’ undeclared (first use in this function)
bs_call.c:417: error: (Each undeclared identifier is reported only once
bs_call.c:417: error: for each function it appears in.)
bs_call.c:417: error: ‘tab1’ undeclared (first use in this function)
bs_call.c:417: error: ‘tab2’ undeclared (first use in this function)
bs_call.c:417: warning: left-hand operand of comma expression has no effect
bs_call.c:420: warning: implicit declaration of function ‘get_gl_tab’
bs_call.c: In function ‘call_genotypes_ML’:
bs_call.c:1485: warning: array subscript has type ‘char’
bs_call.c:1433: warning: unused variable ‘fisher_p’
bs_call.c:1432: warning: unused variable ‘nr’

I am using Python 2.6.6 as per the instructions. I think I have properly configured gsl-2.4 according to the GSL website instructions; however, I was unsure of the difference between include and lib, so that could be a source of the C-related errors.

Any advice you would have here is greatly appreciated!

Thanks,
Christa

Failed to run gemBS extract

Hi,

When I run gemBS extract, it reports:

: No BCF files are available for methylation extraction.

However, gemBS call seems to have worked successfully; it reported:

: Command call started at 2019-03-18 10:15:43.098498
:
: ----------- Methylation Calling --------
: Reference       : /home/nhpcc502/ssd/gemBS_reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa
: Species         : Homo_sapiens
: Right Trim      : 0
: Left Trim       : 0
: Chromosomes     : ['KI270303.1', 'GL000225.1', 'KI270396.1', 'KI270713.1', '3', '5', 'GL000195.1', 'KI270747.1', 'KI270710.1', 'KI270394.1', 'KI270708.1', 'KI270727.1', '9', 'KI270539.1', 'KI270310.1', 'KI270371.1', 'KI270311.1', 'KI270519.1', 'KI270374.1', '8', 'KI270723.1', 'GL000226.1', 'KI270751.1', '7', 'GL000216.2', 'KI270438.1', 'KI270518.1', 'KI270510.1', 'KI270590.1', 'KI270384.1', 'KI270724.1', 'KI270728.1', 'KI270362.1', 'KI270411.1', 'KI270337.1', 'KI270390.1', 'KI270583.1', 'KI270378.1', 'KI270517.1', 'KI270364.1', '15', 'KI270740.1', 'KI270516.1', 'KI270717.1', 'KI270333.1', 'KI270722.1', 'KI270754.1', 'KI270731.1', 'KI270742.1', '16', 'GL000008.2', 'KI270393.1', 'KI270584.1', 'KI270580.1', 'KI270363.1', 'KI270389.1', 'KI270373.1', 'KI270442.1', 'GL000213.1', 'KI270733.1', 'KI270329.1', 'KI270316.1', 'X', 'KI270423.1', 'KI270320.1', 'KI270755.1', 'KI270746.1', 'KI270741.1', 'KI270709.1', 'KI270712.1', 'KI270528.1', 'KI270544.1', 'GL000220.1', 'KI270757.1', 'GL000214.1', '1', 'KI270465.1', '11', 'KI270418.1', 'KI270756.1', 'KI270530.1', 'KI270511.1', 'KI270512.1', 'GL000221.1', 'KI270467.1', 'KI270302.1', 'KI270334.1', 'KI270739.1', 'KI270417.1', 'KI270330.1', 'KI270735.1', 'KI270340.1', 'KI270507.1', 'KI270714.1', 'GL000194.1', 'KI270422.1', 'KI270752.1', 'KI270468.1', '13', 'KI270748.1', 'KI270737.1', 'KI270429.1', 'KI270381.1', 'KI270375.1', 'KI270386.1', 'KI270749.1', 'KI270412.1', 'KI270305.1', 'KI270726.1', 'KI270391.1', 'KI270508.1', '22', 'KI270466.1', 'KI270732.1', 'KI270338.1', 'KI270315.1', 'KI270716.1', 'KI270322.1', '14', 'KI270383.1', 'KI270448.1', 'KI270425.1', 'KI270424.1', 'KI270715.1', 'KI270372.1', 'GL000219.1', 'KI270743.1', 'KI270515.1', 'KI270335.1', '19', 'KI270711.1', 'KI270317.1', 'KI270582.1', 'KI270366.1', 'KI270589.1', '10', 'KI270304.1', 'GL000009.2', 'KI270385.1', 'KI270387.1', 'KI270379.1', 'KI270509.1', 'GL000224.1', 'KI270753.1', 'KI270729.1', 'KI270312.1', 'GL000208.1', 'KI270738.1', 'KI270336.1', 'KI270435.1', 
'KI270419.1', '12', 'KI270376.1', 'KI270587.1', 'KI270395.1', 'KI270579.1', 'GL000218.1', 'KI270581.1', 'KI270521.1', '2', 'KI270721.1', 'KI270744.1', 'Y', 'KI270414.1', 'KI270420.1', '20', 'KI270736.1', 'MT', '6', 'KI270593.1', 'KI270730.1', 'KI270725.1', 'KI270720.1', '21', 'KI270734.1', 'KI270591.1', 'KI270718.1', 'KI270522.1', 'GL000205.2', 'KI270548.1', 'KI270750.1', 'KI270529.1', 'KI270706.1', 'KI270707.1', 'KI270388.1', '17', 'KI270382.1', '18', 'KI270538.1', 'KI270588.1', '4', 'KI270392.1', 'KI270719.1', 'KI270745.1']
: Threads         : 4
: Sample: SRR5453774    Bam: ./mapping/SRR5453774/SRR5453774.bam
:
: Methylation Calling...
: Methylation call done, samples performed: SRR5453774

And the bcf folder has the following files:

bs_call_SRR5453774_10.err  bs_call_SRR5453774_4.err   contigs_SRR5453774_18.bed  SRR5453774_10.bcf   SRR5453774_18.bcf   SRR5453774_4.bcf
bs_call_SRR5453774_11.err  bs_call_SRR5453774_5.err   contigs_SRR5453774_19.bed  SRR5453774_10.json  SRR5453774_18.json  SRR5453774_4.json
bs_call_SRR5453774_12.err  bs_call_SRR5453774_6.err   contigs_SRR5453774_1.bed   SRR5453774_11.bcf   SRR5453774_19.bcf   SRR5453774_5.bcf
bs_call_SRR5453774_13.err  bs_call_SRR5453774_7.err   contigs_SRR5453774_20.bed  SRR5453774_11.json  SRR5453774_19.json  SRR5453774_5.json
bs_call_SRR5453774_14.err  bs_call_SRR5453774_8.err   contigs_SRR5453774_21.bed  SRR5453774_12.bcf   SRR5453774_1.bcf    SRR5453774_6.bcf
bs_call_SRR5453774_15.err  bs_call_SRR5453774_9.err   contigs_SRR5453774_22.bed  SRR5453774_12.json  SRR5453774_1.json   SRR5453774_6.json
bs_call_SRR5453774_16.err  bs_call_SRR5453774_X.err   contigs_SRR5453774_2.bed   SRR5453774_13.bcf   SRR5453774_20.bcf   SRR5453774_7.bcf
bs_call_SRR5453774_17.err  bs_call_SRR5453774_Y.err   contigs_SRR5453774_3.bed   SRR5453774_13.json  SRR5453774_20.json  SRR5453774_7.json
bs_call_SRR5453774_18.err  contigs_SRR5453774_10.bed  contigs_SRR5453774_4.bed   SRR5453774_14.bcf   SRR5453774_21.bcf   SRR5453774_8.bcf
bs_call_SRR5453774_19.err  contigs_SRR5453774_11.bed  contigs_SRR5453774_5.bed   SRR5453774_14.json  SRR5453774_21.json  SRR5453774_8.json
bs_call_SRR5453774_1.err   contigs_SRR5453774_12.bed  contigs_SRR5453774_6.bed   SRR5453774_15.bcf   SRR5453774_22.bcf   SRR5453774_9.bcf
bs_call_SRR5453774_20.err  contigs_SRR5453774_13.bed  contigs_SRR5453774_7.bed   SRR5453774_15.json  SRR5453774_22.json  SRR5453774_9.json
bs_call_SRR5453774_21.err  contigs_SRR5453774_14.bed  contigs_SRR5453774_8.bed   SRR5453774_16.bcf   SRR5453774_2.bcf    SRR5453774_X.bcf
bs_call_SRR5453774_22.err  contigs_SRR5453774_15.bed  contigs_SRR5453774_9.bed   SRR5453774_16.json  SRR5453774_2.json   SRR5453774_X.json
bs_call_SRR5453774_2.err   contigs_SRR5453774_16.bed  contigs_SRR5453774_X.bed   SRR5453774_17.bcf   SRR5453774_3.bcf    SRR5453774_Y.bcf
bs_call_SRR5453774_3.err   contigs_SRR5453774_17.bed  contigs_SRR5453774_Y.bed   SRR5453774_17.json  SRR5453774_3.json   SRR5453774_Y.json

Could you help me to solve this problem? Thank you in advance.

about parallelization when calling and extracting

I have read the gemBS documentation, but I am confused about the parallelization parameters. If the pipeline configuration file contains:

.....
threads = 8
jobs = 3
...

I guess that 3*8=24 CPU threads are used when mapping, calling and extracting. Is this right?

Can it be used for amplification based targeted methylation analysis?

Hi,

gemBS seems to be a very interesting toolkit. My wet lab colleagues use a targeted approach: they amplify specific regions of bisulfite-converted DNA and analyze them using amplicon sequencing on an Illumina MiSeq. I came to their data rather by accident because they were unable to use their methods.

I've tried to analyze their data using gemBS. I can map the data without any problem: paired-end data are mapped to a sequence used as a reference. Almost all reads are mapped and are reported as correct pairs in the gemBS mapping report. Samtools flagstat reports them as properly paired too. But I get a lot of warnings and errors during methylation and variant calling. All reads are reported as follows in the calling err file:

[W::vcf_parse] Contig 'Warning not found: M03647:172:000000000-D4RY5:1:1102:4243:8025 4858 4863 +' is not defined in the header. (Quick workaround: index the file with tabix.)
...
[E::bcf_write] Broken VCF record, the number of columns at Warning not found: M03647:172:000000000-D4RY5:1:1102:4243:8025 4858 4863 +:1 does not match the number of samples (0 vs 1)
...

I can avoid the messages by changing the keep_improper_pairs option to False, but then all reads are reported as PairNotFound in the calling json log.

Is there any way to analyze such data?

Thank you very much.

My config is:

# Required section
#
# Note that the index and contig_sizes files are generated from the
# reference file if they do not already exist
#

reference = reference/ctnnd2.fa

#
# This is for the control sequences.  The contigs here will
# be used for mapping, but will not be passed to the caller
#
# extra_references = reference/conversion_control.fa.gz

index_dir = indexes

#
# The variables below define the directory structure for the results files
# This structure should not be changed after the analysis has started
#

base = .
sequence_dir = ${base}/fastq
bam_dir = ${base}/mapping/@BARCODE
bcf_dir = ${base}/calls/@BARCODE
extract_dir = ${base}/extract/@BARCODE
report_dir = ${base}/report

#
# End of required section
#


# The following are optional

project = gembs_test_ctnnd2
species = human

threads = 20
jobs = 6

[index]

sampling_rate = 4

[mapping]

non_stranded = False
remove_individual_bams = True

#underconversion_sequence = NC_001416.1 
#overconversion_sequence = NC_001604.1

[calling]

mapq_threshold = 10
qual_threshold = 13
reference_bias = 2
left_trim = 5
right_trim = 0
keep_improper_pairs = True
keep_duplicates = True
haploid = False
conversion = auto
remove_individual_bcfs = True

# Contigs smaller than contig_pool_limit will be called together
contig_pool_limit = 25000000

[extract]

strand_specific = True
phred_threshold = 10
make_cpg = True
make_non_cpg = True
make_bedmethyl = True
make_bigwig = True

Running `gemBS map-report` on an external bam file

Hi,

Is it possible to generate just the JSON files required for the map-report from a bam file generated elsewhere (in this case a downsampled version of a BAM file generated by gemBS)?

gemBS merging-all - when there is only one input bam file

In order to move from gemBS mapping to gemBS bscall, gemBS expects the bam filename to have changed from flowcell_lane_index.bam to sample.bam.

The documentation does not make it clear that gemBS merging-all needs to be run even if there is only a single bam (in order to essentially rename files)

However, when there is only a single BAM file, the bam and index should simply be copied and not re-calculated?

gemBS/src/__init__.py

Lines 400 to 423 in 9b7d158

    if len(listBams) > 1 :
        bammerging.extend(["samtools","merge","--threads",threads,"-f",bam_filename])
        for bamFile in listBams:
            bammerging.append(bamFile)
    else:
        bammerging.extend(["cp",listBams[0],bam_filename])
    #Check output directory
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    logging.debug("Merging sample: %s" % sample)
    process = utils.run_tools([bammerging], name="bisulphite-merging",output=bam_filename)
    if process.wait() != 0:
        raise ValueError("Error while executing the Bisulphite merging")
    return_info[sample] = os.path.abspath("%s" % bam_filename)
    #Samtools index
    indexing = ["samtools","index","%s"%(bam_filename)]
    processIndex = utils.run_tools([indexing],name="Indexing")

So with the above code:

On L406, the bam file is copied if there is only one file, which is great.
However, as shown on L421, the merged bam is always re-indexed (even if the bam file was copied and an index had already been computed).
Hence, I would suggest moving the indexing into the if statement on L400, and copying and renaming the existing bam index when there is only a single bam file.
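
A rough sketch of the suggested restructuring, in which the index is only recomputed after a real merge and is otherwise copied alongside the BAM. The function name `plan_merge_commands` is hypothetical, and the assumption that an existing index sits next to the input as `<input>.bai` is mine, not something confirmed by the gemBS source.

```python
def plan_merge_commands(list_bams, bam_filename, threads):
    """Return the commands needed to produce bam_filename and its index.

    Hypothetical helper; assumes a single input BAM already has an index
    named '<input>.bai' next to it.
    """
    if len(list_bams) > 1:
        # Several inputs: merge them, then index the merged file.
        merge = ["samtools", "merge", "--threads", str(threads), "-f", bam_filename]
        merge.extend(list_bams)
        return [merge, ["samtools", "index", bam_filename]]
    # Single input: copy the BAM and reuse its existing index.
    source = list_bams[0]
    return [
        ["cp", source, bam_filename],
        ["cp", source + ".bai", bam_filename + ".bai"],
    ]
```

A real implementation would presumably fall back to `samtools index` when no `.bai` file exists for the single input.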

gemBS merging-sample does not work with multiple threads

Hello,

The gemBS sample-merging command does not use -t option for setting the number of threads. It runs single threaded even if the -t option was set to higher values. Can you please check this issue?

Best,
Bekir

gembs prepare command

I have been running gemBS prepare for nearly a day. It is not generating error messages, but it is also not creating or changing any files on my system. It generated the following hidden directory almost immediately:

393330691    4 drwxr-xr-x   3 christacaggiano lsd          4096 Dec 12 14:44 ./gembs_test
393330692    4 drwxr-xr-x   3 christacaggiano lsd          4096 Dec 12 14:23 ./gembs_test/.gemBS
393330693    4 drwxr-xr-x   2 christacaggiano lsd          4096 Dec 12 14:23 ./gembs_test/.gemBS/gemBS_inputs
393322921    4 -rw-r--r--   1 christacaggiano lsd           243 Dec 12 14:44 ./gembs_test/.gemBS/gemBS_inputs/metadata-1-8.csv
393322920    4 -rw-r--r--   1 christacaggiano lsd          1070 Dec 12 14:44 ./gembs_test/.gemBS/gemBS_inputs/1-8.conf
393322922    0 -rw-r--r--   1 christacaggiano lsd             0 Dec 12 14:23 ./gembs_test/.gemBS/gemBS.db

but no changes have been made since then. I am running it on a human sample, hg38 genome. I had a similar problem when I was running the worked example given in the documentation. Is this normal? Should this command take many hours to run?

Alignment mode for RRBS

Hello,

The RRBS protocol relies on digestion-based enrichment using certain restriction enzymes such as TaqI and MspI. These enzymes recognize and cut at certain DNA patterns (T-CGA, C-CGG), so we expect most of the reads to map to the corresponding genomic regions. Is it possible to implement a feature in the gem-mapper to prioritize these regions? I think it is crucial for gemBS to have such a feature to support the claim that it supports RRBS analysis.

Best,
Bekir

gemBS map error

Hello,

I installed gemBS from the current master branch. I am getting the following error during the mapping stage:

: ------------ Mapping Parameters ------------
: Sample barcode   : A001XS1
: Data set         : sample1_data_a
: No. threads      : 8
: Index            : indexes/sacCer3.BS.gem
: Paired           : True
: Read non stranded: False
: Type             : PAIRED
: Input Files      : ./fastq/sample1/sample1_data_a_1.fastq.gz,./fastq/sample1/sample1_data_a_2.fastq.gz
: Output dir       : ./mapping/A001XS1
:
: Bisulfite Mapping...
2019-04-10 16:43:12,782 ERROR: Process '/scratch/users/berguener/pyvirtual/gembs/lib/python3.6/site-packages/gemBS/gemBSbinaries/gem-mapper' finished with 1
2019-04-10 16:43:12,783 ERROR: 2019/4/10 16:43:12 -- [Opening input file './fastq/sample1/sample1_data_a_1.fastq.gz']
2019-04-10 16:43:12,783 ERROR: 2019/4/10 16:43:12 -- [Opening input file './fastq/sample1/sample1_data_a_2.fastq.gz']
2019-04-10 16:43:12,783 ERROR: 2019/4/10 16:43:12 -- [Outputting to stdout]
2019-04-10 16:43:12,783 ERROR: 2019/4/10 16:43:12 -- [Loading GEM index 'indexes/sacCer3.BS.gem']
2019-04-10 16:43:12,783 ERROR: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2019-04-10 16:43:12,783 ERROR: >> GEM.System.Error::Signal raised (no=11) [errno=0,Success]
2019-04-10 16:43:12,783 ERROR: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
ValueError: Error while executing the Bisulfite bisulphite-mapping

I got the same error while running it for another in-house dataset.

I tried running the gem3-mapper with only index and fastq files and still got the same error. Unfortunately I could not generate a more verbose error message.

Typo in Docs - bscall-reports section

On http://statgen.cnag.cat/GEMBS/UserGuide/_build/html/qualityControl.html#bscall-reports:

At the very bottom, in the last table, I believe intermediate is supposed to be between 0.3 and 0.7 (not 0 and 0.7)...

What is BadOrientation ?

Hi,

I'm looking at my BScall report.

Below are the stats we got from the report.
I used the IHEC standard configuration for the analysis.

Type                 #reads      %
Total                1329357671  100%
Mapped               1323152968  99.53 %
Passed               434824972   32.71 %
NotPassed            888327996   66.82 %

LowMAPQ              123367328   9.28 %
MateUnmapped         6204703     0.47 %
Duplicate            247767436   18.64 %
BadOrientation       487678356   36.69 %
LargeInsertSize      13315864    1.00 %

NotCorrectlyAligned  16199012    1.22 %
Unmapped             6204703     0.47 %

I tried to find a definition of BadOrientation, but I couldn't find one in the gemBS documentation.

Could you explain what BadOrientation means?

Thanks,
Nak

Adapter trimming

Hello,

Is adapter trimming required or recommended for WGBS analysis using gemBS?

missing `Uniqueness Reads` slot from mapping json ?

Hi Simon,
I am trying to parse the json files generated by the map command, but the Uniqueness Reads information seems to be missing from them. However, it is reported in the html file generated by map-report. Is it calculated on the fly while generating the html reports? I have attached sample files for your reference.
files.zip

You can see that Unique Fragments (479609) reported in html is missing in json. Is there any way to get this value ?

Thanks.

GemBS version

I downloaded GemBS version 3.2.0 but gemBS --version says I have 3.0.0

This is because the following two places say different things:

gemBS/gemBS/commands.py

Line 13 in 80d1288

__VERSION__ = "3.0.0"

gemBS/setup.py

Lines 19 to 22 in 80d1288

    __VERSION_MAJOR = "3"
    __VERSION_MINOR = "2"
    __VERSION_SUBMINOR = "0"
    __VERSION__ = "%s.%s.%s" % (__VERSION_MAJOR, __VERSION_MINOR,__VERSION_SUBMINOR)

Would be nice if both of these say the same thing...
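
One common way to keep the two in sync is to define the version in a single file and have the other read it. A minimal sketch, assuming the `__VERSION__ = "..."` line format shown above; `read_version` is a hypothetical helper, not existing gemBS code.

```python
import re


def read_version(source_text):
    """Extract the value of a `__VERSION__ = "x.y.z"` assignment from source text."""
    match = re.search(
        r'^__VERSION__\s*=\s*["\']([^"\']+)["\']', source_text, re.MULTILINE
    )
    if match is None:
        raise ValueError("no __VERSION__ assignment found")
    return match.group(1)


# In setup.py this text would come from reading gemBS/commands.py.
example_source = 'import sys\n__VERSION__ = "3.2.0"\n'
version = read_version(example_source)
```

Reading the text rather than importing the module avoids needing the package's dependencies at install time.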

gem-indexer error

When following section 14, Worked example, in http://statgen.cnag.cat/gemBS/UserGuide/_build/html/example.html, I get:

:
: Command index started at 2018-07-02 16:06:22.354822
:
: Creating index
2018-07-02 16:06:22,824 ERROR: Process '~/.local/lib/python3.6/site-packages/gemBS/gemBSbinaries/gem-indexer' finished with 1
2018-07-02 16:06:22,825 ERROR: 2018/7/2 16:06:22 -- [Inspecting MultiFASTA]
2018-07-02 16:06:22,825 ERROR: 2018/7/2 16:06:22 -- 100% ... done [0.020 s]
2018-07-02 16:06:22,825 ERROR: 2018/7/2 16:06:22 -- Inspected text 15268125 characters (index_complement=yes). Requesting 14 MB (encoded text)
2018-07-02 16:06:22,825 ERROR: 2018/7/2 16:06:22 -- [Reading MultiFASTA]
2018-07-02 16:06:22,825 ERROR: GEM::Error (archive_builder_text.c:225,archive_builder_text_generate_forward)
2018-07-02 16:06:22,825 ERROR: MultiFASTA parsing (indexes/sacCer3.BS_gemBS.tmp.gz:1). must start with a tag '>'
ValueError: Error while executing the Bisulphite gem-indexer

nonbs_index disrupts example data processing

Hi,
I have a problem running the example data. It fails at the index step with the error:

KeyError: 'nonbs_index'

I can bypass this by inserting nonbs_index = False in example.conf. Then the BS index is generated and I can continue with mapping. There is still a FileNotFoundError, and the indexing does not generate a contig.sizes file, which stops me from continuing to the calling step.

I have tried to stick to the guidelines, but maybe I'm missing something obvious. Anyhow, I would appreciate some help.

GemBS output: Methylation estimate and base count columns

Hi,

We have had some issues in understanding the numbers in columns 7-11 of the GEMBS output and so would appreciate further explanation on the following columns:

Methylation estimate (model based)
Non-converted base count
Converted base count
Total bases supporting genotype call
Total bases

Specifically:

The sum of the non-converted and converted base counts does not always equal the number in the 'Total bases supporting genotype call' or 'Total bases' columns (i.e. it is different in 95+% of the rows).
Base counts can be quite different from those seen in IGV, even when taking into account the base quality and the read quality. Is anything else (like cytosine conversion over- or under-correction) also taken into account here?

Overlapping Bases info missing in the report

Hello,

With the 2.0 release the Overlapping Bases info is removed from the mapping report. Can you please check this issue?

Best,
Bekir

Failed to compile bs_call

I could not install gemBS because bs_call failed to compile. I checked the tools/bs_call/src/Makefile.mk file and found that the CC variable is set to /usr/bin/gcc. This prevents the use of a newer gcc version loaded in the user's environment. Our system is pretty old and the version of /usr/bin/gcc is 3.4.0. I worked around the problem by editing tools/bs_call/src/Makefile.mk and commenting out "CC=/usr/bin/gcc".

What is a dbSNP_index?

Hi,

I'm trying to create a dbSNP index file:

I have the following in my config file:

reference = /home/ucbtmog/d/wgbs/reference/hg38.fa.gz
dbSNP_dir = /home/ucbtmog/d/wgbs/reference/dbsnp
dbSNP_files = ${dbSNP_dir}/*.bed.gz

However, when I run gemBS index -t 80, I get the following:

Bisulphite Index /home/ucbtmog/a/75/EGAZ00001016575/ref_indexes/hg38.BS.gem already exists, skipping indexing
The dbSNP Index file must be specified using the configuration parameter dbSNP_index.

As such, what is the dbSNP_index? Is this a tabix index?

Failed to open -: unknown file type

Hi,

When I run gemBS call, I get the following error: Failed to open -: unknown file type

: ----------- Methylation Calling --------
: Reference       : reference/sacCer3.fa.gz
: Species         : Yeast
: Right Trim      : 0
: Left Trim       : 5
: Chromosomes     : ['@pool_1', '@pool_2', '@pool_3']
: Threads         : 1
: Sample: A001XS1    Bam: ./mapping/A001XS1/A001XS1.bam
: Sample: A001XS2    Bam: ./mapping/A001XS2/A001XS2.bam
: Sample: A001XS3    Bam: ./mapping/A001XS3/A001XS3.bam
: Sample: A001XS4    Bam: ./mapping/A001XS4/A001XS4.bam
: Sample: A001XS5    Bam: ./mapping/A001XS5/A001XS5.bam
:
: Methylation Calling...
2018-12-17 18:20:53,695 ERROR: Process '/ye/zaitlenlabstore/christacaggiano/miniconda3/envs/gembs/lib/python3.6/site-packages/gemBS/bin/bcftools' finished with 255
2018-12-17 18:20:53,696 ERROR: Failed to open -: unknown file type
Exception in thread Thread-1:
Traceback (most recent call last):
ValueError: Error while executing the bscall process.

This seems to be specifically a bscall issue, but I can't find what would be causing it. The same error comes up with the sample files and with my own data. Do you have any insight into what may be causing this error?

gemBS mapping does not limit number of threads

Hello,

I tried running the gemBS mapping command with various numbers of threads set via the -t argument, but gem-mapper always used all the cores in the computing node. Can you please check this issue?

Best,
Bekir

error while loading shared libraries: libmysqlclient.so.18

I'm getting the following error running gemBS extract:

2019-03-27 15:15:48,683 ERROR: Process '/home/ucbtmog/.pyenv/versions/3.5.3/lib/python3.5/site-packages/gemBS/gemBSbinaries/bedToBigBed' finished with 127
2019-03-27 15:15:48,684 ERROR: /home/ucbtmog/.pyenv/versions/3.5.3/lib/python3.5/site-packages/gemBS/gemBSbinaries/bedToBigBed: error while loading shared libraries: libmysqlclient.so.18: cannot open shared object file: No such file or directory
Exception in thread Thread-23:
Traceback (most recent call last):
ValueError: Error while making bigBed files.

bedToBigBed seems to be unable to see libmysqlclient.so.18...

This is despite the fact that libmysqlclient.so.18 is available on my system at this location: /usr/lib64/mysql/libmysqlclient.so.18

I would appreciate any thoughts you may have on fixing this. I have reinstalled the latest version of gemBS, but that did not fix the issue...

IHEC_standard.conf `mode` is incorrect

In the documentation, for the IHEC_Standard.conf, GemBS prepare says:

 Warning: variable 'mode' in section [extract] not used

I think this is supposed to be: strand_specific = True

Can you update the docs accordingly?

Use threads argument for Samtools

E.g.

gemBS/src/__init__.py

Line 303 in 9b7d158

bamToFastq.extend(["samtools","bam2fq", file_bam])

gemBS/src/__init__.py

Line 362 in 9b7d158

bamIndex = ["samtools","index","%s"%(nameOutput)]

gemBS/src/__init__.py

Line 421 in 9b7d158

indexing = ["samtools","index","%s"%(bam_filename)]

Add ,"-@",threads, in each case (but for the first one (L303), you might need to move L318 further up so that threads is defined).
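
A small sketch of what that change could look like, assuming the installed samtools accepts `-@` for the subcommand in question (support varies by samtools version and subcommand). `samtools_command` is a hypothetical helper, not a gemBS function.

```python
def samtools_command(subcommand, args, threads=None):
    """Build a samtools command list, adding '-@ <threads>' when threads is given."""
    command = ["samtools", subcommand]
    if threads is not None:
        # Command lists are later joined as strings, so convert the number.
        command.extend(["-@", str(threads)])
    command.extend(args)
    return command


indexing = samtools_command("index", ["sample.bam"], threads=8)
```

Building the list in one place means the thread option is added consistently for every samtools call instead of being repeated at each call site.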

Is it possible to call methylation starting from BAM files

Hi,

I would like to run gemBS call on the BAM files. I ran gemBS mapping on a previous version and do not wish to rerun the mapping...

When I now run gemBS call I get the following:

: Sample BAM file '/home/ucbtmog/a/75/EGAZ00001016575/mapping/original/original.bam' not ready
: Sample BAM file '/home/ucbtmog/a/75/EGAZ00001016575/mapping/orininal_sub50/orininal_sub50.bam' not ready
: Sample BAM file '/home/ucbtmog/a/75/EGAZ00001016575/mapping/orininal_sub25/orininal_sub25.bam' not ready
: Sample BAM file '/home/ucbtmog/a/75/EGAZ00001016575/mapping/orininal_sub10/orininal_sub10.bam' not ready
: Sample BAM file '/home/ucbtmog/a/75/EGAZ00001016575/mapping/orininal_sub05/orininal_sub05.bam' not ready
: Sample BAM file '/home/ucbtmog/a/75/EGAZ00001016575/mapping/orininal_sub025/orininal_sub025.bam' not ready
No available BAM files for calling

This is despite the fact that these files do exist. Looking at the code, it seems that it looks at the SQL db to see whether the bam file has been produced yet...

problem in compilation

Hi,

I have made some changes in my Makefile.mk.in due to some problems in compilation.
The error is due to the fact that some compilers don't accept the -Ofast option. I have changed it to the more general -O2 option; I have also removed the '-march=native -std=c99 -Wall' parameters, and it works for now.

Best,
Laurent

Creating an index

Hello,

I was interested in testing out gemBS, but I'm a bit confused about how the indices are built. From the documentation, I can't figure out if there is a "stand-alone" way to build a genome index without having sample files. For example, bowtie2 has bowtie2-build and bismark has bismark_genome_preparation.

Forgive me if I'm being especially dense, but in the worked out example it seems that you need a file describing the samples (i.e. example.csv) to be able to invoke the indexer. Meaning you need to build your first index in the context of a project. I suppose one could create a dummy example.csv that doesn't refer to any files?

Thanks in advance,
Raymond

Unique alignment rates in BScall and Mapping reports differ

Hello,

I noticed that the Uniquely Aligned values in the BSCall (variants) report and the Average Unique values in the mapping report are different. Do these values represent different metrics?

Best,
Bekir

Installation outside home directory

Using python setup.py install --prefix=$MYPREFIX I can install outside my home directory. But gemBS still looks for some binaries at ~/.local/lib/python2.7/site-packages/src/gemBSbinaries, whereas it should look at $MYPREFIX/lib/python2.7/site-packages/src/gemBSbinaries.

Is it possible to adjust gemBS behavior accordingly?
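
One way to make the lookup independent of the install location is to resolve the binaries directory relative to the installed module itself rather than a fixed path under the home directory. A minimal sketch: `binaries_dir` is a hypothetical helper, and the `gemBSbinaries` directory name is taken from the paths above. Whether gemBS resolves its binaries this way is not something this issue confirms.

```python
import os


def binaries_dir(module_file):
    """Return the gemBSbinaries directory located next to the given module file."""
    package_dir = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(package_dir, "gemBSbinaries")


# Inside the package this would be called as binaries_dir(__file__).
example = binaries_dir("/opt/prefix/lib/python2.7/site-packages/src/__init__.py")
```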

metadata.csv

Hi,
Sorry, I have a question about the .csv file I need to use. I have fastq.gz files and I don't understand what 'flowcell,lane,index' means. Can you give me an explicit example, please?
Thank you

AttributeError: 'Namespace' object has no attribute 'fileBam'

gemBS/src/production.py

Line 574 in 9b7d158

raise CommandException("Sorry path %s was not found!!" %(args.fileBam))

Change that to:

 raise CommandException("Sorry path %s was not found!!" %(fileBam))

no calling to be performed

Hi Simon,

I have this issue with gemBS call command, which fails to call or merge bcf files.

$gemBS -j AS-287569-LR-39945.json call --no-merge
: No calling to be performed
$gemBS -j AS-287569-LR-39945.json merge-bcfs -r
: No merging to be performed

It works again if I clear the output directory. Is there any way to force it?

gemBS bsMap-report ZeroDivisionError: float division by zero

When running with single-end reads:

gemBS bsMap-report -j "pgp2/gembs.json" -i "pgp2/gembs_mapping1" -n "pgp2" -o "out"

I get the following error:

------------ Input Parameter ------------
Input File(s)    : None
Output File(s)   : None
Output Directory : /mnt/254b78b9-76b4-422d-84b1-cc632bff60f7/IsmailM/wgbs/PGP2/gembs_mapping1/report
TMP Directory    : /tmp/
Threads          : 1

------- Mapping Report ----------
Name            : ANALYSIS_PGP2_1

Building html reports...
Traceback (most recent call last):
  File "/home/ucbtmog/.local/bin/gemBS", line 9, in <module>
    load_entry_point('gemBS==1.7', 'console_scripts', 'gemBS')()
  File "/home/ucbtmog/.local/lib/python2.7/site-packages/src/commands.py", line 54, in gemBS
    instances[args.command].run(args)
  File "/home/ucbtmog/.local/lib/python2.7/site-packages/src/production.py", line 836, in run
    report.buildReport(inputs=self.sample_lane_files,output_dir=self.output_dir,name=self.name)
  File "/home/ucbtmog/.local/lib/python2.7/site-packages/src/report.py", line 925, in buildReport
    vector_samples.append(SampleStats(name=sample,list_lane_stats=list_stats_lanes))
  File "/home/ucbtmog/.local/lib/python2.7/site-packages/src/reportStats.py", line 563, in __init__
    self.averageSampleOverlappedBases = (float(self.totalSampleOverlappedBases)/float(self.totalSampleBases)) * 100
ZeroDivisionError: float division by zero

The same command works with paired end reads...

The issue is on the last line (L563) of the snippet below:

gemBS/src/reportStats.py

Lines 545 to 564 in c7db033

    #Sum of all statistics
    self.totalSampleUniqueReads = 0
    self.totalSampleReads = 0
    self.averageSampleUniqueReads = 0
    self.totalSampleOverlappedBases = 0
    self.totalSampleBases = 0
    self.averageSampleOverlappedBases = 0

    for lane_stats in list_lane_stats:
        self.sum_values(lane_stats)
        self.totalSampleUniqueReads += lane_stats.getUniqueMappedReads()
        self.totalSampleReads += lane_stats.getTotalMappedReads()
        listOverlappingBases = lane_stats.getOverlappingBases()
        if (len(listOverlappingBases) == 2):
            self.totalSampleOverlappedBases += listOverlappingBases[0]
            self.totalSampleBases += listOverlappingBases[1]

    self.averageSampleUniqueReads = float(self.totalSampleUniqueReads)/float(self.totalSampleReads) * 100
    self.averageSampleOverlappedBases = (float(self.totalSampleOverlappedBases)/float(self.totalSampleBases)) * 100

Adding simple print messages here shows:

self.totalSampleOverlappedBases = 0.0
self.totalSampleBases = 0.0

If I replace this line with the following, the report is generated:

        self.averageSampleOverlappedBases = 1

So I think an if statement is needed here so that this isn't run with single end reads?
i.e. something like:

if self.is_paired:
  self.averageSampleOverlappedBases = (float(self.totalSampleOverlappedBases)/float(self.totalSampleBases)) * 100
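An alternative to checking for paired-end data is to guard the division itself, which would also cover any other case where no overlapping bases are counted. This is only a sketch: safe_percentage is a hypothetical helper, not an existing gemBS function, and returning 0.0 for an empty denominator is an assumption about what the report should show.

```python
def safe_percentage(numerator, denominator):
    """Return numerator / denominator as a percentage.

    Falls back to 0.0 when the denominator is zero, as happens for the
    overlapped-bases totals with single-end reads.
    """
    if not denominator:
        return 0.0
    return float(numerator) / float(denominator) * 100


# Single-end case from the report above: both totals are 0.0.
single_end = safe_percentage(0.0, 0.0)
# Illustrative paired-end numbers (made up for the example).
paired_end = safe_percentage(50, 200)
```

The same helper could presumably be used for averageSampleUniqueReads, which divides by totalSampleReads in the same way.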

gem3-mapper soft clips the methylated first base

Hello,

In our lab we use Taq1 and Msp1 endonucleases restriction/enrichment for RRBS. These enzymes cut at {T/C}GA and {T/C}GG sites so that most of the reads start with a CpG site. It seems that gem3-mapper soft clips the left-most (reads on forward strand) or right-most (reads on reverse strand) base of these reads if these CpG sites were methylated and not bisulfite converted. As a result, methylation rates at these sites are not called accurately. Please check the IGV screenshot for a comparison between bsmap (top) and Gem3-mapper (bottom) alignments. bscall did not detect any methylation at this site even though the methylation rate is 12/12.

Best,
Bekir

Error while executing the Bisulfite bisulphite-mapping

I'm analysing single-ended reads and I get the following error:

Level 30:
Level 30: Command map started at 2019-03-14 14:21:38.988822
Level 30:
Level 30: ------------ Mapping Parameters ------------
Level 30: Sample barcode   : PCa_45_01b_S19
Level 30: Data set         : PCa_45_01b_S19
Level 30: No. threads      : 70
Level 30: Index            : /home/ucbtmog/d/nugen/t/ref_indexes/hg37.BS.gem
Level 30: Paired           : False
Level 30: Read non stranded: False
Level 30: Type             : SINGLE
Level 30: Input Files      : PCa_45_01b_S19_R1.fastq.gz
Level 30: Output dir       : /home/ucbtmog/d/nugen/t/mapping/PCa_45_01b_S19
Level 30:
Level 30: Bisulfite Mapping...
2019-03-14 14:21:39,017 ERROR: Process '/home/ucbtmog/.pyenv/versions/3.5.3/lib/python3.5/site-packages/gemBS/gemBSbinaries/gem-mapper' finished with 1
Traceback (most recent call last):
  File "/home/ucbtmog/.pyenv/versions/3.5.3/bin/gemBS", line 13, in <module>
    load_entry_point('gemBS==3.2.6', 'console_scripts', 'gemBS')()
  File "/home/ucbtmog/.pyenv/versions/3.5.3/lib/python3.5/site-packages/gemBS/commands.py", line 157, in gemBS_main
    instances[args.command].run(args)
  File "/home/ucbtmog/.pyenv/versions/3.5.3/lib/python3.5/site-packages/gemBS/production.py", line 367, in run
    self.do_mapping(fl)
  File "/home/ucbtmog/.pyenv/versions/3.5.3/lib/python3.5/site-packages/gemBS/production.py", line 559, in do_mapping
    under_conversion=self.underconversion_sequence,over_conversion=self.overconversion_sequence)
  File "/home/ucbtmog/.pyenv/versions/3.5.3/lib/python3.5/site-packages/gemBS/__init__.py", line 739, in mapping
    raise ValueError("Error while executing the Bisulfite bisulphite-mapping")
ValueError: Error while executing the Bisulfite bisulphite-mapping

The error logs for the mapping are empty.

updating the regex for finding files so it can find single-ended files

Current Regex used for finding files is:

gemBS/gemBS/production.py

Line 471 in fca06d5

 reg = re.compile("(.*){}(.*)(.)[.](fastq|fq|fasta|fa|bam|sam)([.][^.]+)?$".format(fli, re.I)) 

At the moment, there needs to be at least one character between the fli and the file ending.

For example, in the case where 'sample1' is the fli:

sample1.fastq.gz will not be matched
sample1_.fastq.gz will be matched
sample1_R1.fastq.gz will be matched
sample1_R2.fastq.gz will be matched

See https://regex101.com/r/1rIGRq/2/

I would vote for the removal of this requirement - as single ended reads won't necessarily have anything after the fli

Suggested Fix:

reg = re.compile("(.*){}(.*)[.](fastq|fq|fasta|fa|bam|sam)([.][^.]+)?$".format(fli, re.I))
# I removed a single `(.)`

The above will match all the following
sample1.fastq.gz
sample1_.fastq.gz
sample1_R1.fastq.gz
sample1_R2.fastq.gz

see https://regex101.com/r/Yj7yxl/1

Just a suggestion - feel free to close the issue, if there are other reasons for this.
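The difference between the two patterns can be checked with a short script. This is a sketch rather than a drop-in patch: build_patterns is a hypothetical helper, and it deliberately differs from the quoted line in two ways. It escapes the fli with re.escape, and it passes re.I to re.compile. In the quoted line re.I appears to be an argument of format(), where it would be silently ignored because the string has only one placeholder.

```python
import re

EXTENSIONS = "fastq|fq|fasta|fa|bam|sam"


def build_patterns(fli):
    """Return (current, proposed) compiled patterns for a given fli."""
    prefix = "(.*)" + re.escape(fli)
    suffix = "[.](" + EXTENSIONS + ")([.][^.]+)?$"
    # Current behaviour: at least one character is required after the fli.
    current = re.compile(prefix + "(.*)(.)" + suffix, re.I)
    # Proposed behaviour: the extra single-character group is removed.
    proposed = re.compile(prefix + "(.*)" + suffix, re.I)
    return current, proposed


names = [
    "sample1.fastq.gz",
    "sample1_.fastq.gz",
    "sample1_R1.fastq.gz",
    "sample1_R2.fastq.gz",
]
current, proposed = build_patterns("sample1")
current_hits = [n for n in names if current.search(n)]
proposed_hits = [n for n in names if proposed.search(n)]
```

With these four names, current_hits omits sample1.fastq.gz while proposed_hits contains all of them.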

Error: [E::bgzf_compress] Call to deflateInit2 failed: insufficient memory

Running gemBS extract (jobs = 16) on a smaller node (16 threads and 64 GB RAM) fails with the following error:

:
: Command extract started at 2018-10-17 13:27:31.752542
:
: Methylation Extraction...
2018-10-17 17:07:39,543 ERROR: Process '/usr/local/lib/python3.6/dist-packages/gemBS/bin/bcftools' finished with -13
2018-10-17 17:07:40,243 ERROR: Process '/usr/local/lib/python3.6/dist-packages/gemBS/bin/bcftools' finished with -13
2018-10-17 17:07:39,619 ERROR: Process '/usr/local/lib/python3.6/dist-packages/gemBS/bin/bcftools' finished with -13
2018-10-17 17:07:39,522 ERROR: Process '/usr/local/lib/python3.6/dist-packages/gemBS/bin/bcftools' finished with -13
2018-10-17 17:07:39,486 ERROR: Process '/usr/local/lib/python3.6/dist-packages/gemBS/bin/bcftools' finished with -13
2018-10-17 17:07:39,599 ERROR: Process '/usr/local/lib/python3.6/dist-packages/gemBS/bin/bcftools' finished with -13
2018-10-17 17:07:39,511 ERROR: Process '/usr/local/lib/python3.6/dist-packages/gemBS/bin/bcftools' finished with -13
2018-10-17 17:07:39,586 ERROR: Process '/usr/local/lib/python3.6/dist-packages/gemBS/bin/bcftools' finished with -13
2018-10-17 17:07:39,570 ERROR: Process '/usr/local/lib/python3.6/dist-packages/gemBS/bin/bcftools' finished with -13
2018-10-17 17:07:40,155 ERROR: Process '/usr/local/lib/python3.6/dist-packages/gemBS/bin/bcftools' finished with -13
2018-10-17 17:07:40,276 ERROR: Process '/usr/local/lib/python3.6/dist-packages/gemBS/bin/bcftools' finished with -13
2018-10-17 17:07:41,024 ERROR: [E::bgzf_compress] Call to deflateInit2 failed: insufficient memory
2018-10-17 17:07:41,045 ERROR: [E::bgzf_compress] Call to deflateInit2 failed: insufficient memory
2018-10-17 17:07:41,045 ERROR: [E::bgzf_compress] Call to deflateInit2 failed: insufficient memory
2018-10-17 17:07:41,045 ERROR: [E::bgzf_compress] Call to deflateInit2 failed: insufficient memory
2018-10-17 17:07:41,088 ERROR: [E::bgzf_compress] Call to deflateInit2 failed: insufficient memory
2018-10-17 17:07:41,155 ERROR: [E::bgzf_compress] Call to deflateInit2 failed: insufficient memory
2018-10-17 17:07:41,155 ERROR: [E::bgzf_compress] Call to deflateInit2 failed: insufficient memory
2018-10-17 17:07:41,156 ERROR: [E::bgzf_compress] Call to deflateInit2 failed: insufficient memory
2018-10-17 17:07:41,227 ERROR: [E::bgzf_compress] Call to deflateInit2 failed: insufficient memory
2018-10-17 17:07:41,242 ERROR: [E::bgzf_compress] Call to deflateInit2 failed: insufficient memory
2018-10-17 17:07:41,242 ERROR: Could not write 4096 bytes: Error 1
2018-10-17 17:07:41,262 ERROR: [E::bgzf_compress] Call to deflateInit2 failed: insufficient memory
2018-10-17 17:07:41,262 ERROR: Could not write 4096 bytes: Error 1
2018-10-17 17:07:41,272 ERROR: Could not write 4096 bytes: Error 1
2018-10-17 17:07:41,272 ERROR: Could not write 4096 bytes: Error 1
2018-10-17 17:07:41,272 ERROR: Could not write 4096 bytes: Error 1
2018-10-17 17:07:41,272 ERROR: Could not write 4096 bytes: Error 1
Exception in thread Thread-7:
Traceback (most recent call last):
ValueError: Error while extracting SNP calls.

Exception in thread Thread-6:
Traceback (most recent call last):
ValueError: Error while extracting SNP calls.

Exception in thread Thread-4:
Traceback (most recent call last):
ValueError: Error while extracting SNP calls.

Exception in thread Thread-2:
Traceback (most recent call last):
ValueError: Error while extracting SNP calls.

Exception in thread Thread-3:
Traceback (most recent call last):
ValueError: Error while extracting SNP calls.

2018-10-17 17:07:41,286 ERROR: Could not write 4096 bytes: Error 1
Exception in thread Thread-5:
Traceback (most recent call last):
ValueError: Error while extracting SNP calls.

2018-10-17 17:07:41,302 ERROR: Could not write 4096 bytes: Error 1
Exception in thread Thread-11:
Traceback (most recent call last):
ValueError: Error while extracting SNP calls.

Exception in thread Thread-1:
Traceback (most recent call last):
ValueError: Error while extracting SNP calls.

2018-10-17 17:07:41,312 ERROR: Could not write 4096 bytes: Error 1
Exception in thread Thread-9:
Traceback (most recent call last):
ValueError: Error while extracting SNP calls.

2018-10-17 17:07:41,313 ERROR: Could not write 4096 bytes: Error 1
2018-10-17 17:07:41,372 ERROR: Could not write 4096 bytes: Error 1
Exception in thread Thread-10:
Traceback (most recent call last):
ValueError: Error while extracting SNP calls.

Exception in thread Thread-8:
Traceback (most recent call last):
ValueError: Error while extracting SNP calls.

The relevant lines from the config:

[extract]
# there is a jobs arg but no threads arg
jobs=16
# strand_specific = True
phred_threshold = 10
make_cpg = True
make_non_cpg = True
make_bedmethyl = True
# make_bigwig = True
make_snps = True

-v and -n arguments are not working in gemBS mapping

Hello,

When I use -n and -v parameters I get the error:
ERROR: .../src/gemBSbinaries/gem-mapper: unrecognized option '--underconversion_sequence'

Can you please check the issue?

Thanks,
Bekir

Skipping the first 5 bases in bscall

Hello,

The -L argument for bscall is set to 5 by default:
parameters_bscall = ['%s' %(executables["bs_call"]),'-r',reference,'-L5','-n',sample,'--report-file',report_file]

Does this mean the first (left) 5 bases of reads are neglected for bisulfite calling? We would like to use gemBS for RRBS analysis and first bases of the reads are critically important for RRBS. Can you make this an optional setting?

Best,
Bekir
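Making the value optional could be as small as turning the hard-coded '-L5' into a parameter with the current value as its default. The sketch below only shows that idea. The function build_bscall_parameters and its left_trim argument are hypothetical names, and whether bs_call accepts every value (for example 0) is an assumption that would need checking against bs_call itself.

```python
def build_bscall_parameters(bs_call, reference, sample, report_file, left_trim=5):
    """Build the bs_call argument list with a configurable -L value.

    The default of 5 reproduces the currently hard-coded '-L5'.
    """
    if left_trim < 0:
        raise ValueError("left_trim must not be negative")
    return [
        "%s" % bs_call,
        "-r", reference,
        "-L%d" % left_trim,
        "-n", sample,
        "--report-file", report_file,
    ]


# Placeholder paths and sample name, for illustration only.
default_params = build_bscall_parameters("bs_call", "ref.fa", "sampleA", "report.json")
rrbs_params = build_bscall_parameters("bs_call", "ref.fa", "sampleA", "report.json", left_trim=0)
```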

Typo in production.py

I think the following should be skip = False instead of skip = false

from:

gemBS/gemBS/production.py

Lines 481 to 484 in fca06d5

    if len(mlist) == 1:
        (file, m) = mlist[0]
        skip = false
        if ftype is None:

At the moment, I get the following error because of this:

Traceback (most recent call last):
  File "/home/ucbtmog/.pyenv/versions/3.5.3/bin/gemBS", line 13, in <module>
    load_entry_point('gemBS==3.2.6', 'console_scripts', 'gemBS')()
  File "/home/ucbtmog/.pyenv/versions/3.5.3/lib/python3.5/site-packages/gemBS/commands.py", line 157, in gemBS_main
    instances[args.command].run(args)
  File "/home/ucbtmog/.pyenv/versions/3.5.3/lib/python3.5/site-packages/gemBS/production.py", line 367, in run
    self.do_mapping(fl)
  File "/home/ucbtmog/.pyenv/versions/3.5.3/lib/python3.5/site-packages/gemBS/production.py", line 483, in do_mapping
    skip = false
NameError: name 'false' is not defined

Over/under conversion rates

Hello,

Can you please explain how over/under conversion rates are calculated?

Thanks,
Bekir

Mapping requires high memory

Hello,

I have realized that the gemBS mapping command requires quite a lot of memory. As far as I can understand, gem-mapper requires about 20 GB of memory for the human genome. However, the total memory footprint of gemBS mapping increases with the size of the input fastq file. I suppose this is because the gem-mapper, readNameClean, bamView and bamSort commands are piped with Python's subprocess.PIPE, which causes the sorting to be done in memory. Is there a workaround to overcome this issue?

Best,
Bekir

GemBS call - Multiple Errors & Error handlings

Here are a few suggestions with regard to error reporting that I think would be useful to implement. The output of a few errors displayed when running gemBS call is shown below as an example.

Print which subcommand (or which sample) failed.
- I'm running gemBS call with 40 jobs (each with 2 threads), so multiple samples are running at the same time and I have no idea which sample an error in the stdout refers to (I can guess the chromosome from the err log).
Add the word ERROR: before the error line in the bs_call*.err file, which will make it easier to run grep -rn ... and see where errors are happening.
Rename *.err files to *.log (which is more correct).
Hide the log files and the contig files (contig*.bed files), i.e. maybe move them into a tmp dir in the same dir, e.g. ${bcf_dir}/tmp.
- Further to this, it would be nicer if the JSONs were merged when the BCFs are merged.
Add a sentence like "successfully completed processing chr*". The current "Processing chromosome chr16 (OK)" is cryptic, especially using "Processing" with "OK".
It seems (based on timestamps and the fact that parts of the (dbsnp/ref) loading log lines are replaced with error lines) that two or more threads (log lines and bscall/other sub-thread stderr) are attempting to write to the console/log file at the same time.
- How important is this? Is it worth adding a mutex?
- Looking at the chr20 err log, there is no error output from bscall or any other sub-thread. Is this the reason? (The *.err log file looks fine despite the failed exit code according to the stdout.)
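The first and the last suggestions (naming the failing sample and serialising console output) could be combined along these lines. This is a sketch under assumptions, not how gemBS currently reports errors: report_failure, the lock, and the sample and contig labels are all hypothetical, and an in-memory buffer stands in for the console.

```python
import io
import threading

# One lock shared by all worker threads, so a block of stderr lines from
# one job is written as a unit and cannot be interleaved with another job.
_console_lock = threading.Lock()


def report_failure(stream, sample, contig, stderr_lines):
    """Write all stderr lines of one failed job, prefixed with its origin."""
    with _console_lock:
        for line in stderr_lines:
            stream.write("ERROR [%s %s]: %s\n" % (sample, contig, line))


buffer = io.StringIO()
captured = ["Loading reference sequences", "bs_call finished with -9"]
workers = [
    threading.Thread(
        target=report_failure,
        args=(buffer, "sampleA", "chr%d" % i, captured),
    )
    for i in range(1, 5)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
lines = buffer.getvalue().splitlines()
```

Each pair of consecutive lines in the buffer then belongs to the same sample and contig, in whatever order the threads happened to acquire the lock.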

: Methylation Calling...
2018-08-22 16:25:57,198 ERROR: Process '/home/ucbtmog/.local/lib/python3.6/site-packages/gemBS/gemBSbinaries/bs_call' finished with -9
2018-08-22 16:25:57,199 ERROR: [E::bcf_write] Broken VCF record, the number of columns at chr1:1672448 does not match the number of samples (0 vs 1)
2018-08-22 16:25:57,199 ERROR: rence sequences
2018-08-22 16:25:57,199 ERROR: Completed loading dbSNP (no. contigs 25, no. bins 45003271, no. SNPs 609438362
2018-08-22 16:25:57,199 ERROR: Processing chromosome chr1 (OK)
Exception in thread Thread-1:
Traceback (most recent call last):
ValueError: Error while executing the bscall process.

...

2018-08-22 18:15:44,584 ERROR: Process '/home/ucbtmog/.local/lib/python3.6/site-packages/gemBS/gemBSbinaries/bs_call' finished with -9
2018-08-22 18:15:44,632 ERROR: Loading reference sequences
2018-08-22 18:15:44,632 ERROR: Loading dbSNP from /home/ucbtmog/a/analysis/ref_indexes/dbsnp.index
2018-08-22 18:15:44,632 ERROR: Completed loading reference sequences
2018-08-22 18:15:44,633 ERROR: Completed loading dbSNP (no. contigs 25, no. bins 45003271, no. SNPs 609438362
2018-08-22 18:15:44,633 ERROR: Processing chromosome chr20 (OK)
Exception in thread Thread-20:
Traceback (most recent call last):
ValueError: Error while executing the bscall process.

Add to bioconda

Hi,
I want to add gemBS to bioconda.

Although I think it is nice how gemBS pulls down all software it needs during installation, my suggestion would be to allow conda to handle the dependencies. It is possible to fix the version of the dependencies. Most of the software is already available (samtools, htslib, wigToBigWig etc). Besides gem-mapper and bs_call, it is only bcftools that needs special treatment. This can be done.

In addition, compilation of external software needs to be optional. I have made an attempt here:
https://github.com/karl616/gemBS/blob/60de304e32c30bc81c62e5109fff1d71053e17a0/setup.py#L146-L181

I lift the building out to build_ext and add a try/except around _install_bundle. A setuptools option would be nicer.
Second is a new make target in the tools directory to compile repository internal software:
https://github.com/karl616/gemBS/blob/60de304e32c30bc81c62e5109fff1d71053e17a0/tools/Makefile#L49-L52

Software is installed into the bin folder of the conda installation/environment.

What is needed from you is to make releases for gem-mapper and bs_call for each gemBS release. (This would be the best solution for a bundled version as well.) For testing I make releases in my forks, but it would be nicer to point to the main repositories.

As I see it, there are pros and cons with this solution:
pros:

Relies on bioconda for dependencies. Avoids multiple installations of the same software. (In my opinion, more in the conda mentality)
Assigning versions to bs_call and gem-mapper makes the dependencies more transparent. Perhaps bring harmony between master and cnag versions.
Successfully processes example data. (neutral)

cons:

Updating bioconda requires up to four recipes to be updated simultaneously.
Potentially easier for users to overwrite software. A messed-up path could also pose problems.
Fringe cases. One example is libgen. It is included in the conda installation, but I'm not sure when this is needed during processing.

As mentioned I'm for separating the software. One installation for each software. But writing this I realize that reproducible processing is key here and that might be easier to guarantee with the bundled setup. I'm open for both solutions... What are your thoughts?
