secastel / phaser
phasing and Allele Specific Expression from RNA-seq
License: GNU General Public License v3.0
Dear Stephane.
I am trying to run phaser on my server (they made the installation), but unfortunately I always get this error:
/bin/sh: python2.7: command not found
/bin/sh: python2.7: command not found
/bin/sh: python2.7: command not found
/bin/sh: python2.7: command not found
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
samtools view: writing to standard output failed: Broken pipe
I have tried setting up an alias for python2.7, but it doesn't work. Any hints?
Thanks
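One thing worth checking, offered as a guess rather than a diagnosis: the `/bin/sh: python2.7: command not found` lines come from a non-interactive shell, and shell aliases are generally not expanded there, so an alias alone would not help. An executable actually named `python2.7` has to be reachable through `PATH`. The sketch below (Python 3, standard library only; `find_on_path` is a hypothetical helper name, not part of phaser) reports whether the executables can be found.

```python
import shutil

def find_on_path(name, path=None):
    """Return the full path of an executable called `name`, or None if absent."""
    return shutil.which(name, path=path)

if __name__ == "__main__":
    # Tools mentioned in the error output above.
    for tool in ("python2.7", "samtools"):
        location = find_on_path(tool)
        print(tool, "->", location if location else "NOT FOUND on PATH")
```

If `python2.7` is reported as missing, a symlink or a conda environment that provides a real `python2.7` binary is more likely to work than an alias.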
Hi Stephane,
I have WGS and RNA-seq from the same donor (cell line), and I'd like to get haplotypes that are as long as possible. In the docs you say that population-based phasing prior to phaser helps a lot. However, what about combining population-based phasing with read-based phasing that uses only the WGS reads (as SHAPEIT2 does) prior to phaser? I'm not sure if you've tested this or have a sense of whether it would be better to include WGS at both steps, or only in the phaser step.
Thanks,
Sahin
Hi,
I am trying to run phaser using HiC data on a ~3 Mb region and ran into the "no alignment score value found" error.
The command I am using:
python phaser.py --vcf ~/GT_myc_output.recalibrated.filtered.vcf.gz --bam ~/SRR6251266_chr8pairs_only_bwa_sorted.bam --paired_end 1 --o ~/phaser_case --sample 20 --mapq 60 --baseq 20
(I have tried restricting the interval with --chr chr8 but get the same error.)
The output:
STARTED "Read backed phasing and ASE/haplotype analyses" ...
DATE, TIME : 2019-11-07, 10:22:17
#1. Loading heterozygous variants into intervals...
Processing sample named 20
using all the chromosomes ...
processing VCF...
Memory efficient mode is deactivated...
If RAM is limited, activate memory efficient mode using the flag "--process_slow = 1"...
creating variant mapping table...
1059 heterozygous sites being used for phasing (1243 filtered, 0 indels excluded, 988 unphased)
#2. Retrieving reads that overlap heterozygous sites...
file: ~/SRR6251266_chr8pairs_only_bwa_sorted.bam
minimum mapq: 10
mapping reads to variants...
completed chromosome chr8...
processing mapped reads...
no alignment score value found in reads, cannot use cutoff
retrieved 0 reads
#3. Identifying connected variants...
calculating sequencing noise level...
FATAL ERROR: No reads could be matched to variants. Please double check your settings and input files. Common reasons for this occurring include: 1) MAPQ or BASEQ set too conservatively 2) BAM and VCF have different chromosome names (IE 'chr1' vs '1').
After inspecting the BAM I can see that column 5 has the correct MAPQ scores. Example:
SRR6251266.247070450 81 chr8 10025 54 42M = 10860 795 CAGTGCAGACTGATATATAAATCAAAACAAATGTCCTTTACA AEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEAAAAA NM:i:0 MD:Z:42 MC:Z:36M6S AS:i:42 XS:i:33
SRR6251266.167696392 97 chr8 10051 60 42M = 149413 139404 ACAAATGTCCTTTACATGTTTTCTGTTACAGTAGTAACAATA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:42 MC:Z:42M AS:i:42 XS:i:29
SRR6251266.100942691 97 chr8 10052 60 42M = 18805 8795 CAAATGTCCTTTACATGTTTTCTGTTACAGTAGTAACAATAT AAAAAEE/AEEEEEE/EEEEEEAEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:42 MC:Z:42M AS:i:42 XS:i:28
SRR6251266.173849305 97 chr8 10056 60 42M = 77718 67697 TGTCCTTTACATGTTTTCTGTTACAGTAGTAACAATATGTGT /AAAAEEEEEEEEEEEEEAEEEAEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:42 MC:Z:7S35M AS:i:42 XS:i:28
SRR6251266.217291158 177 chr8 10057 60 42M = 162736 152680 GTCCTTTACATGTTTTCTGTTACAGTAGTAACAATATGTGTA EAEEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEA6AAA NM:i:0 MD:Z:42 MC:Z:42M AS:i:42 XS:i:29
SRR6251266.119745571 161 chr8 10063 39 42M = 11197 1176 TACATGTTTTCTGTTACAGTAGTAACAATATGTGTAAACTTA AAAAAEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:42 MC:Z:42M AS:i:42 XS:i:35 XA:Z:chr4,+21281,27M1D15M,1;
I would appreciate any help :)
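To rule out the BAM itself, a small check like the following may help. It is only a sketch and does not reproduce how phASER reads alignments: it splits one SAM text line (for example from `samtools view`) and pulls out MAPQ and the optional tags, so you can confirm that `AS:i:` is present on the reads you expect. `parse_sam_record` is a hypothetical helper name.

```python
def parse_sam_record(line):
    """Return the read name, MAPQ and optional tags of one SAM text line."""
    fields = line.split()              # whitespace split also copes with pasted text
    tags = {}
    for item in fields[11:]:           # optional TAG:TYPE:VALUE fields start at column 12
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return {"name": fields[0], "mapq": int(fields[4]), "tags": tags}

# One of the records quoted above.
record = parse_sam_record(
    "SRR6251266.167696392 97 chr8 10051 60 42M = 149413 139404 "
    "ACAAATGTCCTTTACATGTTTTCTGTTACAGTAGTAACAATA "
    "AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE "
    "NM:i:0 MD:Z:42 MC:Z:42M AS:i:42 XS:i:29"
)
print(record["mapq"], record["tags"].get("AS"))  # prints: 60 42
```

For the quoted reads the alignment score tag is there, so the message about a missing alignment score presumably has another cause, for instance no reads surviving an earlier filter.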
Have you seen this error? It seems most files were produced, but the VCF is empty. Could it have something to do with the lack of IDs in the VCF? I am rerunning with "--unique_ids 1 --unphased_vars 1", as the goal is gene-level ASE.
Output:
out.allele_config.txt
out.allelic_counts.txt
out.haplotypes.txt
out.haplotypic_counts.txt
out.variant_connections.txt
out.vcf
out.vcf only contains a header, no data
"
"
``python phaser.py --bam in.bam --vcf in.vcf --o test1_out --sample testsample --threads 48 --mapq 10 --baseq 10 --pass_only 0
Welcome to phASER v0.2
Author: Stephane Castel ([email protected])
#1. Loading heterozygous variants into intervals...
loading VCF into memory...
parsing VCF...
371607 total heterozygous variants, 0 indels excluded, 0 blacklisted variants
creating genomic intervals...
#2. Retrieving reads that overlap heterozygous sites...
Reads are being written to disk, this will impact performance.
file: K564_2DB4008_Rep1_6.bam
minimum mapq: 10
retrieved 8418748 reads
using alignment score cutoff of 188
splitting reads into 84 files with 100000 reads
assigning reads to variants...
#3. Identifying connected variants...
sequencing noise level estimated at 0.006877
24323 variant connections dropped because of conflicting configurations (threshold = 0.010000)
68195 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
phasing large (>15 variants) blocks...
identifying haplotypes with most support...
#6. Outputting haplotypes...
#7. Outputting phased VCF...
Traceback (most recent call last):
File "/home/bioinformatics/NAS01/programs/phaser/phaser/phaser/phaser.py", line 1756, in
main();
File "/home/bioinformatics/NAS01/programs/phaser/phaser/phaser/phaser.py", line 1012, in main
write_vcf();
File "/home/bioinformatics/NAS01/programs/phaser/phaser/phaser/phaser.py", line 1123, in write_vcf
genotype = list(vcf_columns[sample_column].split(":")[gt_index]);
UnboundLocalError: local variable 'sample_column' referenced before assignment
``
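The traceback shows that `sample_column` was never assigned in `write_vcf`. One plausible cause, and this is an assumption rather than something confirmed from the code, is that the value given to `--sample` does not match any sample name on the `#CHROM` header line of the VCF. A quick way to check this outside phASER (`find_sample_column` is a hypothetical helper):

```python
def find_sample_column(header_line, sample):
    """Return the 0-based column index of `sample` in a VCF #CHROM header line."""
    columns = header_line.rstrip("\n").split("\t")
    samples = columns[9:]              # sample names follow the FORMAT column
    if sample not in samples:
        raise ValueError(
            "sample %r not in VCF header; available: %s"
            % (sample, ", ".join(samples) or "none")
        )
    return columns.index(sample, 9)

header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ttestsample"
print(find_sample_column(header, "testsample"))  # prints: 9
```

If the lookup raises, the name passed to `--sample` needs to be changed to one of the listed names.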
Hi,
Sometimes a chromosome or contig may not have any het variants to analyse. In these cases it would be useful if phaser.py stopped and exited instead of carrying on.
This is an example:
##################################################
Welcome to phASER v0.2
Author: Stephane Castel ([email protected])
##################################################
#1. Loading heterozygous variants into intervals...
loading VCF into memory...
parsing VCF...
0 total heterozygous variants, 0 indels excluded, 0 blacklisted variants
creating genomic intervals...
#2. Retrieving reads that overlap heterozygous sites...
Reads are being written to disk, this will impact performance.
file: mapping/samples/gsnap/BELA.sorted.bam
minimum mapq: 30
retrieved 0 reads
Traceback (most recent call last):
File "/home/shared/app/phaser/phaser/phaser.py", line 1756, in <module>
main();
File "/home/shared/app/phaser/phaser/phaser.py", line 301, in main
use_as_cutoff, as_cutoff = calculate_as_cutoff(reads_in);
File "/home/shared/app/phaser/phaser/phaser.py", line 1706, in calculate_as_cutoff
percentile_cutoff = numpy.percentile(alignment_scores,args.as_q_cutoff*100);
File "/home/ipedroso/anaconda/envs/phaser/lib/python2.7/site-packages/numpy/lib/function_base.py", line 3268, in percentile
interpolation=interpolation)
File "/home/ipedroso/anaconda/envs/phaser/lib/python2.7/site-packages/numpy/lib/function_base.py", line 2997, in _ureduce
r = func(a, **kwargs)
File "/home/ipedroso/anaconda/envs/phaser/lib/python2.7/site-packages/numpy/lib/function_base.py", line 3385, in _percentile
x1 = take(ap, indices_below, axis=axis) * weights_below
File "/home/ipedroso/anaconda/envs/phaser/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 124, in take
return take(indices, axis, out, mode)
IndexError: cannot do a non-empty take from an empty axes.
cheers,
inti
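A guard of the following kind would cover this case. This is a minimal sketch, not the code in phaser.py: the function names are hypothetical, and the quantile is computed by nearest rank instead of the linear interpolation that `numpy.percentile` uses by default, purely to keep the example dependency-free.

```python
import sys

def require_het_variants(n_het):
    """Stop early with a clear message when there is nothing to phase."""
    if n_het == 0:
        sys.exit("No heterozygous variants found; nothing to phase.")

def alignment_score_cutoff(scores, quantile):
    """Nearest-rank quantile of `scores`, or None when there are no scores."""
    if not scores:
        return None                    # caller can then skip the cutoff entirely
    ordered = sorted(scores)
    rank = int(round(quantile * (len(ordered) - 1)))
    return ordered[rank]

print(alignment_score_cutoff([], 0.05))                    # prints: None
print(alignment_score_cutoff([10, 20, 30, 40, 50], 0.5))   # prints: 30
```

With either check in place, a contig without het variants would end with a readable message instead of the `IndexError` from the empty percentile call.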
Since the reference genome has been updated, I would like to use hg38 for my analysis. Can this tool be used with hg38? If so, I notice that you only provide hg19 files (under 'Useful files': hg19_hla.bed.gz and hg19_haplo_count_blacklist.bed.gz). Could you please provide hg38 versions of these files?
Hi Secastel,
When I try to run phaser_gene_ae.py on my haplotypic counts I get the error
ERROR - this version of phaser_gene_ae is only compatible with results from phASER v1.0.0+
However, I am using https://github.com/secastel/phaser/archive/cd7daba.zip for the code, and when I look in the log-files of the jobs that got the phASER output it says
##################################################
Welcome to phASER v1.0.0
Author: Stephane Castel ([email protected])
##################################################
This is the command I used for phASER:
python /apps/software/phASER/20170714-cd7daba/phaser/phaser.py
--paired_end 1
--bam $BAM
--vcf $VCF
--mapq 255
--sample $SAMPLE
--baseq 10
--o $phaserOutPrefix
--temp_dir $TMPDIR
--threads 1
--gw_phase_method 1
--gw_phase_vcf 1
--show_warning 1
--debug 1
--unphased_vars 1
and the command for phaser_ae:
python /apps/software/phASER/20170714-cd7daba/phaser_gene_ae/phaser_gene_ae.py
--haplotypic_counts $haplotype_count
--features hg19_ensembl.bed
--o results/gene_ae/$SAMPLENAME.geneAE.txt
Any ideas what could cause this?
Thanks!
Hi Stephane,
Many thanks for sharing phaser, great tool! I was wondering if there is any way to get all possible haplotypes for a particular gene, when this is compatible with the input data.
Error:
...
completed chromosome X...
processing mapped reads...
using alignment score cutoff of 122
Traceback (most recent call last):
File "tools/phaser-v1.1.0/phaser/phaser.py", line 2359, in
main();
File "tools/phaser-v1.1.0/phaser/phaser.py", line 170, in main
parse_sample(sample_name, map_sample_column, args.bam, args.o, contig_ban)
File "tools/phaser-v1.1.0/phaser/phaser.py", line 262, in parse_sample
start_time, vcf_out, sample_out_path, last_chr=True, pi_block_value = 0)
File "tools/phaser-v1.1.0/phaser/phaser.py", line 555, in process_vcf
pool_output = parallelize(process_mapping_result, result_files);
File "tools/phaser-v1.1.0/phaser/phaser.py", line 2090, in parallelize
pool_output.append(function(input));
File "tools/phaser-v1.1.0/phaser/phaser.py", line 1303, in process_mapping_result
if use_as_cutoff == False or int(fields[4]) >= as_cutoff:
ValueError: invalid literal for int() with base 10: ''
Command exited with non-zero status 1
We have a VCF file with many thousand samples that we would like to phase. Currently using phASER we can only phase one sample at a time, which means that for every sample we need to write out a new VCF file and merge them once all are finished.
Would it be possible to change it so that multiple samples in one VCF can be phased at the same time?
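Until something like that exists, the per-sample runs can at least be generated from the VCF header. The sketch below only builds the command lines and runs nothing. The flags are ones that appear in commands elsewhere in these issues, the values for `--paired_end`, `--mapq` and `--baseq` are placeholders, and `bam_for_sample` is a hypothetical mapping from sample name to BAM path.

```python
def sample_names(header_line):
    """Sample names from a VCF #CHROM header line."""
    return header_line.rstrip("\n").split("\t")[9:]

def phaser_commands(vcf, bam_for_sample, header_line, out_dir):
    """Build one phaser.py command (as an argument list) per sample."""
    commands = []
    for sample in sample_names(header_line):
        commands.append([
            "python", "phaser.py",
            "--vcf", vcf,
            "--bam", bam_for_sample[sample],
            "--sample", sample,
            "--paired_end", "1", "--mapq", "255", "--baseq", "10",  # placeholders
            "--o", out_dir + "/" + sample,
        ])
    return commands

header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2"
bams = {"s1": "s1.bam", "s2": "s2.bam"}
for command in phaser_commands("cohort.vcf.gz", bams, header, "phased"):
    print(" ".join(command))
```

Each list can be handed to `subprocess.run(command, check=True)` or written to a job script, so the samples can be processed in parallel and merged afterwards.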
Dear Stephane,
phASER has been quite helpful for me. However, I'm trying to assess allele-specific gene expression in single-cell RNA-Seq data from human lung cancer cells, and although I have tried many times, every run failed (aCount=0 and bCount=0). Is there a problem with my VCF file?
vcf file:
chr1 89923 . A T 68 . DP=3;VDB=4.340713e-02;AF1=1;AC1=2;DP4=0,0,2,1;MQ=50;FQ=-36 GT:PL:GQ 1/1:100,9,0:16
chr1 90311 . T C 8.64 . DP=17;VDB=6.089532e-02;RPB=-2.152553e+00;AF1=0.5;AC1=1;DP4=6,7,1,2;MQ=48;FQ=11.3;PV4=1,0.15,1, 0.059 GT:PL:GQ 0/1:38,0,229:40
chr1 134223 . G C 87 . DP=5;VDB=2.649457e-02;RPB=8.293682e-01;AF1=0.5013;AC1=1;DP4=1,0,3,1;MQ=50;FQ=-5.45;PV4=1,0.42,1,1 GT:PL:GQ 0/1:117,0,23:26
chr1 134667 . A G 158 . DP=6;VDB=7.655903e-02;AF1=1;AC1=2;DP4=0,0,3,3;MQ=50;FQ=-45 GT:PL:GQ 1/1:191,18,0:33
haplotypic_counts.txt:
- chr15 60422193 60422224 chr15_60422193_T_C,chr15_60422224_T_C 2 C,C T,T 0 0 0 0/1 0.5
- chr15 29719135 29719136 chr15_29719135_A_G,chr15_29719136_C_T 2 G,T A,C 0 0 0 0/1 0.5
- chr15 85748317 85748331 chr15_85748317_T_G,chr15_85748331_G_A 2 T,A G,G 0 0 0 0/1 0.5
- chr15 84202077 84202144 chr15_84202077_C_G,chr15_84202144_G_A 2 C,G G,A 0 0 0 0/1 0.5
python phaser.py --pass_only 0 --vcf var.flt.vcf.gz --bam $bam --paired_end 1 --mapq 10 --baseq 10 --sample $bam --blacklist hg19_hla.bed --haplo_count_blacklist hg19_haplo_count_blacklist.bed --threads 6 --o ase
I do not know much about this. Can you tell me where the problem is?
thanks
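As a first sanity check that is independent of phASER, it can help to count how many heterozygous genotypes the VCF actually contains for the sample, since only those sites can carry allelic counts. The sketch below assumes a tab-separated, single-sample VCF. The records are the four quoted above with the INFO column shortened, and `count_het_sites` is a hypothetical helper.

```python
def is_het(genotype):
    """True for a diploid genotype with two different, non-missing alleles."""
    alleles = genotype.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]

def count_het_sites(vcf_lines, sample_column=9):
    """Count heterozygous genotypes in the given 0-based sample column."""
    n_het = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        gt_index = fields[8].split(":").index("GT")
        if is_het(fields[sample_column].split(":")[gt_index]):
            n_het += 1
    return n_het

records = [  # INFO shortened for readability
    "chr1\t89923\t.\tA\tT\t68\t.\tDP=3\tGT:PL:GQ\t1/1:100,9,0:16",
    "chr1\t90311\t.\tT\tC\t8.64\t.\tDP=17\tGT:PL:GQ\t0/1:38,0,229:40",
    "chr1\t134223\t.\tG\tC\t87\t.\tDP=5\tGT:PL:GQ\t0/1:117,0,23:26",
    "chr1\t134667\t.\tA\tG\t158\t.\tDP=6\tGT:PL:GQ\t1/1:191,18,0:33",
]
print(count_het_sites(records))  # prints: 2
```

It may also be worth confirming that the value passed to `--sample` matches the sample name in the VCF header; in the command above it is set to `$bam`.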
Hi @secastel
As I understand it, PC is a confidence estimate for the GW phase. But I am interested in checking how well adjacent alleles are phased within a haplotype block (PI). In the data below, is there a way to estimate the confidence level for
1860 1|0 vs. 1860 1|0
1879 0|1 1879 1|0
2 1860 . T G 22718.54 PASS AC=6;AF=0.231;AN=26;BaseQRankSum=0.548;ClippingRankSum=0.00;DP=2218;ExcessHet=6.2249;FS=5.236;InbreedingCoeff=-0.3001;MQ=47.32;MQRankSum=0.00;QD=20.43;ReadPosRankSum=-5.100e-02;SOR=0.399;set=Intersection GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/1:204,329:533:99:10442,0,7734:1|0:.,.,.,.,.:1036:|:0.5
2 1879 . G A 11022.23 PASS AC=8;AF=0.286;AN=28;BaseQRankSum=-4.150e-01;ClippingRankSum=0.00;DP=2210;ExcessHet=0.0577;FS=5.183;InbreedingCoeff=0.6497;MQ=60.00;MQRankSum=0.00;QD=10.93;ReadPosRankSum=0.049;SOR=1.060;set=Intersection GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/1:326,208:534:99:5861,0,13140:0|1:.,.,.,.,.:1036:|:0.5
Are the confidence values (PC, which is actually the GW confidence) good for estimating the confidence of PG at the haplotype-block level? My thought is no, because almost all my sites have a PC of 0.5, since I didn't supply a phased VCF (I don't have one). Is there any other way to estimate this?
Thanks,
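This does not answer the confidence question, but for inspecting the local phase it can be convenient to pull PG, PI and PC out of each record and group the variants by block index. A minimal sketch using the two records above (`sample_fields` and `blocks_by_index` are hypothetical helpers, and the tag meanings are taken from the FORMAT header descriptions phASER writes):

```python
from collections import defaultdict

def sample_fields(format_column, sample_column):
    """Pair FORMAT keys with the values of one sample column."""
    return dict(zip(format_column.split(":"), sample_column.split(":")))

def blocks_by_index(records):
    """Group locally phased variants by PI; records are (pos, FORMAT, sample)."""
    blocks = defaultdict(list)
    for pos, format_column, sample_column in records:
        fields = sample_fields(format_column, sample_column)
        if "|" in fields.get("PG", ""):        # only variants with a local phase
            blocks[fields["PI"]].append((pos, fields["PG"], fields.get("PC")))
    return dict(blocks)

fmt = "GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC"
records = [
    (1860, fmt, "0/1:204,329:533:99:10442,0,7734:1|0:.,.,.,.,.:1036:|:0.5"),
    (1879, fmt, "0/1:326,208:534:99:5861,0,13140:0|1:.,.,.,.,.:1036:|:0.5"),
]
print(blocks_by_index(records))
```

Both variants end up in block 1036 with opposite local genotypes, which is the pairing the question is about.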
Hi Stephane,
I looked at the structure of the out_prefix.vcf.gz output and I'd like you to enlighten me on a few things:
Thanks for your answer in advance!
Cheers,
John Ma
Department of Lymphoma/Myeloma
UT MD Anderson Cancer Center
Dear Stephane,
I am experiencing some problems with Phaser results.
I need to obtain allelic counts for hybrid yeast, and I know the reference sequence of one of its parents.
My pipeline is following:
Generate bam files using STAR (sorted by coordinates) by mapping reads from hybrid to known reference parental. Then generate vcf file on RNAseq data using GATK pipeline (I skip base recalibration step), then I do filtering of the vcf file with parameters mentioned in GATK pipeline (http://gatkforums.broadinstitute.org/gatk/discussion/3891/calling-variants-in-rnaseq).
Then I bgzip and tabix the filtered VCF file.
Here are some lines of the final VCF file:
NC_018292.1 1078 . G A 100.28 PASS AC=2;AF=1.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=33.43;SOR=1.179 GT:AD:DP:GQ:PL 1/1:0,3:3:9:128,9,0
NC_018292.1 1130 . T C 100.28 PASS AC=2;AF=1.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=33.43;SOR=1.179 GT:AD:DP:GQ:PL 1/1:0,3:3:9:128,9,0
NC_018292.1 1249 . A T 178.90 PASS AC=2;AF=1.00;AN=2;DP=5;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.24;SOR=1.022 GT:AD:DP:GQ:PL 1/1:0,5:5:15:207,15,0
NC_018292.1 1369 . T C 107.28 SnpCluster AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.09;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:9:135,9,0
My command line for phaser is:
python phaser/phaser/phaser.py --vcf filt_merged.vcf.gz --bam Aligned_sorted.bam
--paired_end 1 --mapq 255 --baseq 10 --sample sample --id_separator "="
--threads 10 --o phaser_out
Then I do:
python phaser_gene_ae.py --haplotypic_counts phaser_out.haplotypic_counts.txt
--features orth.bed --o phaser_gene_ae.txt --no_gw_phase 1
I generated the .bed file from a GFF file just by selecting the corresponding columns.
In the gene output file I see strange behaviour from phaser:
many SNPs are assigned to incorrect genes (you can see it from the coordinates)
NC_018292.1 1048416 1052312 CORT_0A04800 3763 3200 6963 0.233811384293 61 NC_018292.1=948682=C=T,NC_018292.1=948822=T=C,NC_018292.1=948823=A=G,
NC_018292.1 1054556 1055605 CORT_0A04810 3763 3200 6963 0.233811384293 61 NC_018292.1=948682=C=T,NC_018292.1=948822=T=C,NC_018292.1=948823=A=G,
NC_018292.1 1056216 1056737 CORT_0A04820 3763 3200 6963 0.233811384293 61 NC_018292.1=948682=C=T,NC_018292.1=948822=T=C,NC_018292.1=948823=A=G,
I'd appreciate very much if you can give a hint on how to troubleshoot this issue.
Thank you,
Hrant
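To quantify how many variants fall outside the gene they are reported for, an overlap check along these lines could be run on the output. It is a sketch under two assumptions: that the first three columns of the gene output are the feature's contig, start and end, and that the feature coordinates follow the BED convention (0-based start, end exclusive). Both helper names are hypothetical.

```python
def parse_variant_id(variant_id, separator="="):
    """Split a contig<sep>pos<sep>ref<sep>alt ID into contig and 1-based position."""
    contig, pos, _ref, _alt = variant_id.rsplit(separator, 3)
    return contig, int(pos)

def variant_in_feature(variant_id, contig, start, end, separator="="):
    """True if the variant lies inside a BED-style (0-based, half-open) feature."""
    variant_contig, pos = parse_variant_id(variant_id, separator)
    return variant_contig == contig and start < pos <= end

# First output row quoted above: feature 1048416-1052312, variant at 948682.
print(variant_in_feature("NC_018292.1=948682=C=T", "NC_018292.1", 1048416, 1052312))
# prints: False
```

Separately, GFF coordinates are 1-based and inclusive while BED starts are 0-based, so a BED made by copying GFF columns is shifted by one base at the start. That cannot explain a 100 kb discrepancy, but it is worth fixing while checking the file.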
Hi,
Congrats on the interesting preprint. I'm quite surprised that RNA-seq is apparently so effective for phasing.
On another note: I haven't found a license so far in the repo or the preprint. Which one is applicable?
Thanks,
Colin
Hello,
I ran phASER as suggested here by running Sanger imputation service on my VCFs.
I managed to run it and then obtain phASER results with no runtime problem but the VCF I obtain at the end of the pipeline has lost the AN and DP information.
After Sanger imputation all variants have AN=2 and DP=2
Example:
After imputing with Sanger
TYPED;RefPanelAF=0.489683;AN=2;AC=1;INFO=1 GT:ADS:DS:GP:PS:PG:PB:PI:PW:PC:PM 1|0:1,0:1:0,1,0:13380:0/1:.:.:1|0:.:.
Same variant before:
GT:AD:DP:GQ:PL 0/1:11,7:18:99:244,0,264
Is this a correct behavior for the analysis pipeline or is there a better way to keep the DP and AD information throughout the process ?
Thanks,
Mattia
Hi,
I am running phaser on a bam of around 6Gb and it is completing step 2 but never making it to step 3. I have been able to successfully run the example in the tutorial. This is the end of the log file:
[samopen] SAM header is present: 25 sequences.
completed chromosome Y...
[samopen] SAM header is present: 25 sequences.
completed chromosome 3...
[samopen] SAM header is present: 25 sequences.
completed chromosome 2...
[samopen] SAM header is present: 25 sequences.
completed chromosome 5...
[samopen] SAM header is present: 25 sequences.
completed chromosome 9...
[samopen] SAM header is present: 25 sequences.
completed chromosome X...
[samopen] SAM header is present: 25 sequences.
completed chromosome 6...
[samopen] SAM header is present: 25 sequences.
completed chromosome 8...
[samopen] SAM header is present: 25 sequences.
completed chromosome 13...
[samopen] SAM header is present: 25 sequences.
completed chromosome 10...
[samopen] SAM header is present: 25 sequences.
completed chromosome 15...
[samopen] SAM header is present: 25 sequences.
completed chromosome 14...
completed chromosome 1...
completed chromosome 18...
completed chromosome 11...
completed chromosome 16...
completed chromosome 17...
completed chromosome 19...
completed chromosome 12...
It has been running for over an hour (and on a larger file over 2 days still at this point). Any ideas as to what might be causing this would be great!
Thanks,
Clare
Dear Stephane,
With great interest I have read your recent publication in Genome Biology and seen phASER, which I would really like to use for my own work. So I was wondering if I could ask your advice on some points.
I'm trying to assess allele-specific gene expression in single-cell RNA-Seq data from human blood cells. From each individual, I have a fairly large number of single-cell transcriptomes (approximately 200-300). In addition, I have one exome per individual. My aim is to obtain the most accurate ASE counts possible from each individual single-cell transcriptome.
My strategy so far is:
Call variants from the exome. (I have used a bcbio pipeline for this [https://bcbio-nextgen.readthedocs.io/en/latest/contents/testing.html#exome-with-validation-against-reference-materials].)
Map the sc-RNA-Seq reads to the reference genome. (I have used STAR for the mapping and then WASP to eliminate mapping biases.)
Next, I would like to use phASER to obtain gene-level haplotypic read counts for each individual transcriptome using the variant file from step 1 and the bam file(s) generated in step 2.
In step 3, of course, I would like to use the reads from all the sc-transcriptomes of a given individual for the read backed phasing but obtain separate read counts for each individual single cell. Do I understand you correctly that this in principle can be achieved using --bam <list of all the bam files> --haplo_count_bam 1 to output haplotypic read counts for bam file 1 only, while using the information from all the bam files for phasing? If yes, is it possible to obtain separate read counts for multiple bam files from the same run (e.g. specifying --haplo_count_bam 1, 2, ...) or would I have to run the script repeatedly, once for each bam file?
I also meant to ask another question: In your tutorial you emphasize that the variant file should be pre-phased using a method like population phasing. I was wondering how important that is in the above situation (i.e. with a vcf that has been constructed from a single exome and with a really large number of RNA-Seq reads that can be used for read backed phasing). Do you think pre-phasing would still improve the quality of allelic counts here? I'm asking this because the various servers for phasing and imputation all require GRCh37-based coordinates, while I would prefer to work with hg38. (I have noticed that the bed files you provide are also based on hg19 but I suppose these could be lifted over to hg38 pretty easily?) Are you aware of any resources for population-based (pre-)phasing based on hg38 coordinates?
Would be great if you could have a look at this and give me some feedback.
Thanks a lot in advance,
Maik
The phaser.py script appears to duplicate the VCF meta-information FORMAT lines in the output VCF file. For instance, if I had the following test.vcf.gz VCF file:
##fileformat=VCFv4.1
##INFO=<ID=mut_type,Number=1,Type=String,Description="Mutation type">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT amsim_pcawg_phaseable
chr1 877831 . T C 40 PASS mut_type=germline GT 1/1
chr1 877868 . G T 40 PASS mut_type=somatic GT 0/1
chr1 881019 . G T 40 PASS mut_type=somatic GT 0/1
chr1 881025 . G T 40 PASS mut_type=somatic GT 0/1
chr1 881602 . C T 40 PASS mut_type=somatic GT 0/1
chr2 45812 . C A 40 PASS mut_type=somatic GT 0/1
chr2 45862 . G T 40 PASS mut_type=somatic GT 0/1
chr2 45875 . C T 40 PASS mut_type=somatic GT 0/1
chr2 45895 . A G 40 PASS mut_type=germline GT 1/1
chr2 45946 . C T 40 PASS mut_type=somatic GT 0/1
Running the following command:
python lib/phaser/phaser/phaser.py \
--vcf test.vcf.gz \
--bam data/bams/amsim_pcawg_phaseable.sorted.cleaned.dups_removed.bam \
--paired_end 1 \
--mapq 20 \
--baseq 20 \
--sample amsim_pcawg_phaseable \
--include_indels 1 \
--blacklist data/phaser_test_data/hg19_hla.bed \
--haplo_count_blacklist data/phaser_test_data/hg19_haplo_count_blacklist.bed \
--threads 1 \
--o test/phaser
Produces the output test/phaser.vcf.gz VCF file with the GT meta-information line duplicated:
##fileformat=VCFv4.1
##INFO=<ID=mut_type,Number=1,Type=String,Description="Mutation type">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PG,Number=1,Type=String,Description="phASER Local Genotype">
##FORMAT=<ID=PB,Number=1,Type=String,Description="phASER Local Block">
##FORMAT=<ID=PI,Number=1,Type=String,Description="phASER Local Block Index (unique for each block)">
##FORMAT=<ID=PM,Number=1,Type=String,Description="phASER Local Block Maximum Variant MAF">
##FORMAT=<ID=PW,Number=1,Type=String,Description="phASER Genome Wide Genotype">
##FORMAT=<ID=PC,Number=1,Type=String,Description="phASER Genome Wide Confidence">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT amsim_pcawg_phaseable
chr1 877831 . T C 40 PASS mut_type=germline GT:PG:PB:PI:PW:PC:PM 1/1:1/1:.:.:1/1:.:.
chr1 877868 . G T 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:0/1:.:.:0/1:.:.
chr1 881019 . G T 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:0|1:chr1_881019_G_T,chr1_881025_G_T:1:|:0.5:0
chr1 881025 . G T 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:0|1:chr1_881019_G_T,chr1_881025_G_T:1:|:0.5:0
chr1 881602 . C T 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:0/1:.:.:0/1:.:.
chr2 45812 . C A 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:0|1:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0
chr2 45862 . G T 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:1|0:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0
chr2 45875 . C T 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:1|0:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0
chr2 45895 . A G 40 PASS mut_type=germline GT:PG:PB:PI:PW:PC:PM 1/1:1/1:.:.:1/1:.:.
chr2 45946 . C T 40 PASS mut_type=somatic GT:PG:PB:PI:PW:PC:PM 0/1:1|0:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0
This causes issues in downstream VCF processing tools (e.g. vcfR) that expect there to be no duplicate meta-information lines.
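As a workaround until the header is fixed upstream, the duplicated meta-information lines can be dropped after the fact. A minimal sketch (the function name is hypothetical) that keeps the first copy of every `##` line and leaves all other lines untouched:

```python
def dedupe_meta_lines(lines):
    """Remove exact duplicates among '##' meta-information lines, keeping order."""
    seen = set()
    kept = []
    for line in lines:
        if line.startswith("##"):
            if line in seen:
                continue               # skip a repeated meta-information line
            seen.add(line)
        kept.append(line)
    return kept

header = [
    '##fileformat=VCFv4.1',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=PG,Number=1,Type=String,Description="phASER Local Genotype">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tamsim_pcawg_phaseable',
]
for line in dedupe_meta_lines(header):
    print(line)
```

This only removes byte-identical lines; two lines with the same ID but different descriptions would both be kept and need a manual decision.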
Hi,
quick question : should read alignments in BAM format and SNP calls in VCF be from alignments to pseudogenomes ? By pseudogenomes I mean genomes in which SNPs have been inserted to avoid too much reference bias, which is a known problem for these types of analyses.
Or is it OK if the BAM and VCF are from alignments to the reference sequence ?
Thanks!
Colin
Hi Castel,
I raised this issue before, but I had lots of other issues in that thread and it might have been buried. The problem is with how the tags and their values get updated in the FORMAT field when you run phASER on one sample first and then again on that output VCF to phase another sample.
Say I have a VCF file with several samples: sample1, sample2, sample3
and these are the tags in the FORMAT field: GT:AD:DP:GQ:PL
I run phaser on sample1, which updates the FORMAT field to GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW
Since I now have to run phASER on another sample, I take the output from the previous run, and the FORMAT field is then updated to GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW:PB:PC:PG:PI:PW
leading to two tags and two values (in some cases) for a sample.
I am still using the old phASER due to issues with Cython, but wanted to ask whether this has been fixed in updated phASER versions, even if it is not a priority.
Thanks,
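In the meantime the repeated tags can be collapsed with a small post-processing step. The sketch below is one possible policy, not phASER behaviour: each tag is kept once, in order of first appearance, and for every sample the last non-missing value of a repeated tag wins. `dedupe_format` is a hypothetical helper.

```python
def dedupe_format(format_column, sample_columns):
    """Collapse repeated FORMAT tags; the last non-missing value per sample wins."""
    keys = format_column.split(":")
    order = []
    for key in keys:
        if key not in order:
            order.append(key)
    new_samples = []
    for sample in sample_columns:
        chosen = {}
        for key, value in zip(keys, sample.split(":")):
            if key not in chosen or value not in (".", ""):
                chosen[key] = value
        new_samples.append(":".join(chosen.get(key, ".") for key in order))
    return ":".join(order), new_samples

new_format, new_samples = dedupe_format(
    "GT:PG:PI:PG:PI", ["0/1:0|1:1:.:.", "0/1:.:.:1|0:2"]
)
print(new_format, new_samples)
# prints: GT:PG:PI ['0/1:0|1:1', '0/1:1|0:2']
```

Applied to the FORMAT string from the example above, this yields `GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW` again.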
Hi,
Can phaser phase indels?
I find that ReadBackedPhasing can only phase SNVs, but I want to phase both indels and SNVs.
My data is human DNA array sequencing data.
Hi,
Thanks for the work on this!
It seems that phaser.py ignores variants in the --vcf file which do not have the PASS flag. This is somewhat restrictive, since a VCF might have no FILTER values at all or use other flags. I think it is worth stating in the documentation that only variants with the PASS flag are considered.
Cheers
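To see how many variants are affected in a given file, the FILTER column can be tallied before running phASER. This is a minimal sketch assuming a tab-separated VCF; the records are shortened versions of ones quoted in another issue here, and `filter_counts` is a hypothetical helper.

```python
from collections import Counter

def filter_counts(vcf_lines):
    """Tally the values of the FILTER column (7th column) of VCF records."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        counts[line.split("\t")[6]] += 1
    return counts

records = [  # INFO shortened for readability
    "NC_018292.1\t1078\t.\tG\tA\t100.28\tPASS\tDP=3\tGT\t1/1",
    "NC_018292.1\t1130\t.\tT\tC\t100.28\tPASS\tDP=3\tGT\t1/1",
    "NC_018292.1\t1249\t.\tA\tT\t178.90\tPASS\tDP=5\tGT\t1/1",
    "NC_018292.1\t1369\t.\tT\tC\t107.28\tSnpCluster\tDP=2\tGT\t1/1",
]
print(filter_counts(records))
```

Several commands in these issues pass `--pass_only 0`, which suggests the filter can be switched off; the tool's help output is the place to confirm that.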
Hi @secastel
I may have raised this issue before, but the problem with VCF merging still exists. I am not sure whether VCFs from other callers run into this issue, but with GATK (HaplotypeCaller) generated VCFs I can run phaser; the resulting single-sample VCFs, however, cannot be merged back.
This is the error message if it helps:
$ java -jar -Xmx16g /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T CombineVariants -R lyrata_genome.fa -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -o F1.phased_variants.Final.vcf
INFO 19:25:42,458 HelpFormatter - --------------------------------------------------------------------------------
INFO 19:25:42,460 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
INFO 19:25:42,461 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 19:25:42,461 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 19:25:42,461 HelpFormatter - [Thu Jan 04 19:25:42 EST 2018] Executing on Linux 4.10.0-42-generic amd64
INFO 19:25:42,461 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12
INFO 19:25:42,465 HelpFormatter - Program Args: -T CombineVariants -R lyrata_genome.fa -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -o F1.phased_variants.Final.vcf
INFO 19:25:42,469 HelpFormatter - Executing as everestial007@everestial007-Inspiron-3647 on Linux 4.10.0-42-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12.
INFO 19:25:42,469 HelpFormatter - Date/Time: 2018/01/04 19:25:42
INFO 19:25:42,469 HelpFormatter - --------------------------------------------------------------------------------
INFO 19:25:42,469 HelpFormatter - --------------------------------------------------------------------------------
INFO 19:25:42,492 GenomeAnalysisEngine - Strictness is SILENT
INFO 19:25:42,741 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 19:25:43,091 GenomeAnalysisEngine - Preparing for traversal
INFO 19:25:43,094 GenomeAnalysisEngine - Done preparing for traversal
INFO 19:25:43,094 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 19:25:43,095 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 19:25:43,095 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
##### ERROR --
##### ERROR stack trace
java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:592)
at java.lang.Integer.valueOf(Integer.java:766)
at htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:717)
at htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:129)
at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:158)
at htsjdk.variant.variantcontext.LazyGenotypesContext.getGenotypes(LazyGenotypesContext.java:148)
at htsjdk.variant.variantcontext.GenotypesContext.iterator(GenotypesContext.java:465)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.mergeGenotypes(GATKVariantContextUtils.java:1556)
at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.simpleMerge(GATKVariantContextUtils.java:1224)
at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:361)
at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:143)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: For input string: ""
##### ERROR ------------------------------------------------------------------------------------------
The issue at hand is that even if I can merge the data using other tools, I will still have problems, because I have to use GATK for my downstream analyses. Could you please look into the issue if time permits?
Here are my files.
ms01e_phased.vcf.gz
ms02g_phased.vcf.gz
ms03g_phased.vcf.gz
ms04h_phased.vcf.gz
Do I understand it correctly that phaser.py will at maximum detect two different haplotypes per contig?
This is reasonable for perfect unambiguous mapping in diploids, but a severe limitation if (a) the genome / transcriptome is indeed polyploid (or a pool of individuals), and (b) reads from paralog loci were mapped to the same reference contig. In both cases, >2 haplotypes can be possible from a single sample.
I am using this tool for my data analyses and finding it quite helpful, although I have to manipulate the output VCF before I can use it with other software downstream.
I think adding the following options would help, and it probably wouldn't require changing the algorithms at all.
1) An option to add the HP (haplotype phase) tag to the FORMAT field, like the one produced by GATK ReadBackedPhasing.
https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_phasing_ReadBackedPhasing.php
2) An option to add the PS (phase set) tag to the FORMAT field, as specified in the VCF specs.
http://samtools.github.io/hts-specs/VCFv4.2.pdf
https://samtools.github.io/bcftools/bcftools.html
3) An option to use a pipe (|) in the GT field when updating the GT field with PG.
At the moment I have to take the output VCF file and curate it with custom scripts, which also changes the formatting of the file most of the time. I think this tool would be much more useful if the output were made more compatible with downstream applications.
Thanks,
After omitting the blacklist arguments, phaser ran for a long time, but then ...
completed chromosome 16...
completed chromosome 17...
completed chromosome 22...
completed chromosome 19...
completed chromosome 6...
processing mapped reads...
Traceback (most recent call last):
File "REDO/phaser/phaser/phaser.py", line 2358, in <module>
main();
File "REDO/phaser/phaser/phaser.py", line 170, in main
parse_sample(sample_name, map_sample_column, args.bam, args.o, contig_ban)
File "REDO/phaser/phaser/phaser.py", line 262, in parse_sample
start_time, vcf_out, sample_out_path, last_chr=True, pi_block_value = 0)
File "REDO/phaser/phaser/phaser.py", line 546, in process_vcf
alignment_scores = map(int,[x for x in subprocess.check_output("set -euo pipefail && "+"cut -f 5 "+" ".join(result_files), shell=True, executable='/bin/bash').split("\n") if x != ""]);
TypeError: a bytes-like object is required, not 'str'
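For anyone hitting this under Python 3: `subprocess.check_output` returns bytes there, so calling `.split("\n")` on its result fails as in the traceback above. Below is a minimal sketch of a workaround, assuming the output is plain UTF-8 text. The helper name `parse_scores` is hypothetical and not part of phaser.py, and a small Python child process stands in for the real `cut -f 5` call.

```python
import subprocess
import sys


def parse_scores(raw):
    """Convert raw subprocess output into a list of integer scores.

    `raw` is bytes on Python 3 and str on Python 2; decode only when needed.
    """
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    # Skip empty lines, as the original list comprehension does.
    return [int(x) for x in raw.split("\n") if x.strip() != ""]


# Stand-in for the `cut -f 5 <result files>` call (assumption for the demo).
raw = subprocess.check_output([sys.executable, "-c", "print(60); print(255)"])
scores = parse_scores(raw)
print(scores)
```

Using a list comprehension rather than `map(int, ...)` also avoids getting a lazy iterator on Python 3, which matters if the scores are later sorted or indexed.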
Hi,
I am running phaser and getting an error that there is no alignment score in the RNA-seq BAM file. Is there a way to run phaser without considering the alignment score?
Many thanks,
Rahul
Hello,
I use Trinity de novo assembled transcriptome contigs as the reference, and Trinity names its contigs e.g. "TR72377|c0_g1_i1". Unfortunately, "samtools view -h X.bam TR72377|c0_g1_i1:" will choke on this because the shell interprets the "|" symbol in the contig name as a pipe.
The solution is to wrap the contig ID in single quotes, which alleviates the issue and does not affect contig names that do not contain the pipe.
I suggest modifying phaser.py:
orig. line 1070:
error_code = subprocess.call("samtools view -h "+bam+" "+chrom+": | samtools view -Sh "+samtools_arg+" -L "+bed_out+" -q "+mapq+" - | "+args.python_string+" "+return_script_path()+"/call_read_variant_map.py --baseq "+str(args.baseq)+" --splice 1 --isize_cutoff "+str(isize)+" --variant_table "+mapper_out+" --o "+mapping_result.name, stderr=devnull, stdout=devnull, shell=True);
replace this line with the quoted version:
error_code = subprocess.call("samtools view -h "+bam+" '"+chrom+"': | samtools view -Sh "+samtools_arg+" -L "+bed_out+" -q "+mapq+" - | "+args.python_string+" "+return_script_path()+"/call_read_variant_map.py --baseq "+str(args.baseq)+" --splice 1 --isize_cutoff "+str(isize)+" --variant_table "+mapper_out+" --o "+mapping_result.name, stderr=devnull, stdout=devnull, shell=True);
Thanks!
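A more general variant of the same idea is to let the standard library do the quoting instead of adding quote characters by hand. This is only a sketch: the helper `samtools_region_cmd` is hypothetical, it builds just the first stage of the pipeline, and it uses `shlex.quote`, which is available on Python 3 (Python 2.7 offers the equivalent `pipes.quote`).

```python
import shlex


def samtools_region_cmd(bam, chrom):
    """Build the first `samtools view` stage with shell-safe arguments.

    Hypothetical helper for illustration; not part of phaser.py.
    """
    # Quote the whole region string so "|" is never seen by the shell.
    region = shlex.quote(chrom + ":")
    return "samtools view -h " + shlex.quote(bam) + " " + region


cmd_trinity = samtools_region_cmd("X.bam", "TR72377|c0_g1_i1")
cmd_plain = samtools_region_cmd("X.bam", "chr1")
print(cmd_trinity)
print(cmd_plain)
```

Names without special characters are left untouched, so ordinary contigs such as "chr1" produce the same command as before.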
I just wanted to highlight a suggestion that could be incorporated into phaser, if your time allows.
phaser is designed mainly to generate haplotypes only for the heterozygous sites. But from a biological standpoint I think it would be more helpful to include an option to extract the haplotypes even where the genotype is homozygous, provided the read-backed phasing supports it. I think most cases are like that.
An example would be something like this:
chr  pos  id  ref  alt  GT   PG   Phase_Block_index
2    35   .   A    G    0/1  0|1  50
2    71   .   C    A    1/1  1/1  .
2    85   .   C    G    0/1  1|0  50
2    97   .   G    T,A  1/2  2|1  50
2    103  .   T    C    0/0  0/0  .
2    107  .   C    A,G  0/1  0|1  50
So, for the above example phaser is designed to generate haplotypes only from the heterozygous loci, which will be A-G-A-C & G-C-T-A. But if a homozygous genotype (1/1 or 0/0) is supported by the read-backed phased BAM file, there should be an option to include it, i.e. haplotypes A-A-G-A-T-C & G-A-C-T-T-A, along with an update of Phase_Block_index and PG. Haplotypes of this kind are going to be more helpful in population genetics analyses. I think it would make phaser a more helpful tool.
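The two sets of haplotypes quoted above can be reproduced from the example table with a few lines of Python. This is a sketch of the requested behaviour only, not of how phaser works internally. The function `build_haplotypes` and its `include_homozygous` switch are hypothetical names, and read support for the homozygous sites is simply assumed here.

```python
# (pos, ref, alt, PG) taken from the example table above.
rows = [
    (35, "A", "G", "0|1"),
    (71, "C", "A", "1/1"),
    (85, "C", "G", "1|0"),
    (97, "G", "T,A", "2|1"),
    (103, "T", "C", "0/0"),
    (107, "C", "A,G", "0|1"),
]


def build_haplotypes(rows, include_homozygous=False):
    """Return the two haplotype strings for one phase block."""
    hap_a, hap_b = [], []
    for pos, ref, alt, pg in rows:
        alleles = [ref] + alt.split(",")
        if "|" in pg:
            a, b = pg.split("|")
        else:
            a, b = pg.split("/")
            # Unphased sites are kept only when homozygous and requested.
            if a != b or not include_homozygous:
                continue
        hap_a.append(alleles[int(a)])
        hap_b.append(alleles[int(b)])
    return "-".join(hap_a), "-".join(hap_b)


het_only = build_haplotypes(rows)
with_hom = build_haplotypes(rows, include_homozygous=True)
print(het_only)
print(with_hom)
```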
I have noticed that phaser seems to use a great deal of RAM. Of the several samples I tried, there were two where phaser did not finish even after 2 days and several attempts. I set the number of threads to 1, but still no luck.
Is there a way to fix memory handling in phaser?
Hi Stephane,
This tool is great and very well explained for a non-expert such as myself, many thanks indeed!!!
I have a problem which I hope you can help me with. Running your example data set, I tried the following command after having succeeded with the first one:
$python /scratch/ev250/bin/phaser/phaser_gene_ae/phaser_gene_ae.py --haplotypic_counts phaser_test_case.haplotypic_counts.txt --features hg19_ensembl.bed --o phaser_test_case_gene_ae.txt --no_gw_phase 0
However, I get the following error
##################################################
Welcome to phASER Gene AE v1.1
Author: Stephane Castel ([email protected])
##################################################
#1 Loading features...
#2 Loading haplotype counts...
Traceback (most recent call last):
File "/scratch/ev250/bin/phaser/phaser_gene_ae/phaser_gene_ae.py", line 127, in <module>
main();
File "/scratch/ev250/bin/phaser/phaser_gene_ae/phaser_gene_ae.py", line 65, in main
df_haplo_counts = pandas.DataFrame.from_csv(args.haplotypic_counts, sep="\t", index_col=False);
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/core/frame.py", line 1231, in from_csv
infer_datetime_format=infer_datetime_format)
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/io/parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/io/parsers.py", line 389, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/io/parsers.py", line 730, in __init__
self._make_engine(self.engine)
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/io/parsers.py", line 923, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/io/parsers.py", line 1436, in __init__
self._set_noconvert_columns()
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/io/parsers.py", line 1501, in _set_noconvert_columns
_set(self.index_col)
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/io/parsers.py", line 1476, in _set
x = names.index(x)
ValueError: False is not in list
which I cannot figure out. Could you please help me troubleshoot this? Many thanks in advance.
Elena
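One possible workaround, assuming the failure comes from how this pandas version handles `index_col=False` inside `DataFrame.from_csv`, is to read the table with `pandas.read_csv`, which accepts the same `sep` and `index_col` arguments. The sketch below uses a made-up two-line table in place of the real haplotypic counts file, and the column names are only illustrative.

```python
import io

import pandas

# Toy stand-in for phaser_test_case.haplotypic_counts.txt (assumed layout).
toy = io.StringIO(
    "contig\tstart\tstop\taCount\tbCount\n"
    "1\t100\t200\t10\t12\n"
)

# index_col=False tells pandas not to use the first column as the index.
df_haplo_counts = pandas.read_csv(toy, sep="\t", index_col=False)
print(df_haplo_counts)
```

`DataFrame.from_csv` was later deprecated and removed from pandas, so `read_csv` is the safer call on current installations as well.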
Hi, I've been using phASER the past couple of days and have been super happy with the results, but after trying to incorporate DNAseq reads in addition to RNAseq reads into our haplotype calling with phASER we are constantly getting out of memory errors in thread 13. phASER works fine with the example data you provided in your tutorial, as well as with just our RNAseq data, but as soon as we use RNAseq reads and DNAseq reads we get out of memory errors.
phASER manages to map reads to variants for both RNAseq and DNAseq reads, but during the processing step it always throws the following error:
processing mapped reads...
using alignment score cutoff of 115
Exception in thread Thread-13:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 764, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 325, in _handle_workers
pool._maintain_pool()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 229, in _maintain_pool
self._repopulate_pool()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 222, in _repopulate_pool
w.start()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 130, in start
self._popen = Popen(self)
File "/usr/lib64/python2.7/multiprocessing/forking.py", line 121, in __init__
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
At first I thought it was just my local computer not having enough memory (only 6GB free), but even after letting it run on our cluster which has around 50 GB memory free, it still threw that error. The command I used to call phASER was:
python /home/ibis/paul.hager/phaser/phaser/phaser.py --vcf /home/ibis/paul.hager/SCN306/Phasing/GATKPhaseByTransmission/SL154375.GATKTransmissionPhased.vcf.gz --bam /home/ibis/paul.hager/SCN306/Phasing/RNAseq/rnanz1_1-2_S13_R1_001_val_1.fq.Aligned.out.sort.bam,/home/ibis/paul.hager/SCN306/IBIS_WGS/SCN306pat/HMNNHCCXX_s1_1_GSLv3-7_49_SL154375.filtered.fastq_mem.sorted.mergedAll.realigned.recal.bam --paired_end 1,1 --mapq 255,60 --baseq 10 --sample SL154375 --blacklist /home/ibis/paul.hager/phaser/testData/hg19_hla.bed --haplo_count_blacklist /home/ibis/paul.hager/phaser/testData/hg19_haplo_count_blacklist.bed --threads 4 --o /home/ibis/paul.hager/SCN306/Phasing/Phaser/rnanz1_2/phaserSL154375_trioPhase_r12 --id_separator - --unique_ids 1
Do you have any idea what might be causing this error?
Thanks for your help!
Dear Stephane,
I am working on an RNA-seq project with multiple chicken populations and multiple tissues per population, for a total of around 700 samples.
Our wish is to do ASE analysis per population and tissue combination (POP_tissue).
I called variants at the POP_tissue level on my RNA-seq data, but I can easily call variants at the POP level.
For now I run the phaser.py and phaser_gene_ae.py commands for each sample individually (if a sample was sequenced multiple times, I merged the BAMs, so I have one sample = one BAM), and then I want to launch phaser_expr_matrix.py.
My understanding is that this is only possible if my input VCF file is phased, but is this not partially the goal of phaser?
What choices do I have for using phaser in this context?
Would it be better to:
I am not sure I understand well how everything interacts and what the impact is of running it on an unphased input VCF file.
Kind regards
Maria
Dear Stephane,
I used the phASER software using two input BAM files, one containing RNA-seq data, the other containing Whole Genome Sequencing DNA data. I used the --haplo_count_bam_exclude option to exclude the WGS BAM data from my output data.
The numbers of reads overlapping, for example, chr1 position 100547994 in both of my BAM files are:
#WGS BAM
$ samtools mpileup -r 1:100547994-100547994 DNA/AC1JV9ACXX-2-20.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
1 100547994 N 17 cctcTcTTCcTctCtTT ;;D;E<EDC;D9DCDEE
#RNA BAM
samtools mpileup -r 1:100547994-100547994 RNA/AC1JV9ACXX-2-20.mdup.sorted.readGroupsAdded.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
1 100547994 N 23 >t$CTCtCtTTcCtCcccccctt^~t D@IIJCGADGDGJHI@FJJGIIH
I noticed that the *.allelic_counts file contains the summed counts from both RNA and WGS data:
grep 1_100547994_T_C allelic_counts/AC1JV9ACXX-2-20.chr1.allelic_counts.txt
1 100547994 1_100547994_T_C T C 19 20 39
While in the *.haplotypic_counts data just 22 reads for this same SNP are used:
grep 1_100547994_T_C haplotypic_counts/AC1JV9ACXX-2-20.chr1.haplotypic_counts.txt
1 100547994 100547994 1_100547994_T_C 1 0 T C 10 12 22 0|1 1 0.394 AC1JV9ACXX-2-20.mdup.sorted.readGroupsAdded
Is this intended behaviour for *.allelic_counts output when using a BAM file containing DNA data and the --haplo_count_bam_exclude option?
Regards,
Freerk
Hello,
I ran phASER for multiple samples and extracted allele specific expression for each sample independently.
Since these samples are split into case/control, I would like to process these expression data to see if there is anything interesting.
My question is whether the allele specific expression data from multiple samples are directly compatible or not.
Can I compare aCount and bCount across multiple samples straight away or is there a risk that what is measured on aCount for sample X, ends up in bCount for sample Y ?
Do you have a suggested protocol for this task?
Thanks a lot
Mattia
It's obviously an enhancement, but is it possible to develop a read-backed allele counter--without the phasing--with phASER code?
Hi Secastel,
This isn't an issue, but an update. I recently finished writing a python3 script to improve the phased GT of the F1 hybrid. I wanted to connect with people who might be having similar problems and let them know about this tool. This script and details aren't completely clean yet, but should be able to run given the python3 and required modules are available.
https://github.com/everestial/pHASE-Stitcher
Thanks,
The --output_orphans option isn't working. I looked into the Python file and that option isn't there. I am not using it at the moment, just pointing out something that might be useful to others.
Also, the updated version of phaser includes 'setup.py'. I have the Cython module installed, but I am getting the following error:
Traceback (most recent call last):
File "setup.py", line 2, in <module>
from Cython.Build import cythonize
ImportError: No module named 'Cython'
The old version didn't need that.
Dear Stephane,
I noticed that your tool phaser can generate ASE data without a VCF file, but I don't know how to do it. Can you help me? Thank you very much!
Hi,
I would really recommend the --chr option for RNA-seq datasets from plant genomes - this one is ca. 750MB so relatively small.
On 512 GB and 2 TB RAM servers I was never getting past step 5 for whole genomes:
"5. Phasing blocks...
phasing large (>15 variants) blocks...
"
It would just get killed by the server after apparently spiralling memory usage after 12-24 hrs.
If I pass the --chr option, most chromosomes are done after 2-6 hours.
There may be a bug with --chr, by the way. It seems I get a VCF for chromosome 5 called e.g. "test21_chr5.vcf", but it contains SNPs from all chromosomes, e.g. 1-9, not just phased SNPs for chr5.
Is this a known bug?
Also - can I just combine the *haplotypic_counts.txt files from each chromosome and use them as input for the next step, "phaser_gene_ae.py"? I hope this will work.
Thanks,
Colin
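If combining the per-chromosome files turns out to be acceptable, the concatenation itself could look like the sketch below. It assumes every *haplotypic_counts.txt file starts with the same single header line. Whether phaser_gene_ae.py accepts the combined file is not confirmed here. The function name `combine_counts` and the toy columns are hypothetical.

```python
import os
import tempfile


def combine_counts(paths, out_path):
    """Concatenate tab-delimited count files, keeping one header line."""
    header = None
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                first = handle.readline()
                if header is None:
                    header = first
                    out.write(header)
                elif first != header:
                    raise ValueError("header mismatch in " + path)
                for line in handle:
                    out.write(line)


# Toy demonstration with two made-up per-chromosome files.
tmpdir = tempfile.mkdtemp()
paths = []
for name, row in [("chr1", "1\t100\t10\t12\n"), ("chr2", "2\t200\t5\t7\n")]:
    path = os.path.join(tmpdir, name + ".haplotypic_counts.txt")
    with open(path, "w") as handle:
        handle.write("contig\tstart\taCount\tbCount\n" + row)
    paths.append(path)

combined = os.path.join(tmpdir, "all.haplotypic_counts.txt")
combine_counts(paths, combined)
with open(combined) as handle:
    combined_lines = handle.read().splitlines()
print(combined_lines)
```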
in phaser.py
fatal_error("Allele frequency VCF (--gw_af_vcf) specified does not exit.");
Should likely be
fatal_error("Allele frequency VCF (--gw_af_vcf) specified does not exist.");
Cheers!
Colin
Hi!
I aligned HG00096 from 1000GP with STAR and Tophat separately, and then ran phASER using the same parameters you gave in the tutorial.
The error message for running phASER with the BAM aligned with STAR is:
#2. Retrieving reads that overlap heterozygous sites...
file: /data/reddylab/scarlett/1000G/data/StarOutput/HG00096/Aligned.sortedByCoord.out.bam
minimum mapq: 255
mapping reads to variants...
[bam_parse_region] fail to determine the sequence name.
[main_samview] region "1:" specifies an unknown reference name. Continue anyway.
[samopen] SAM header is present: 25 sequences.
[sam_read1] reference 'user command line: STAR --runMode alignReads --genomeDir /data/reddylab/scarlett/1000G/data/STARIndex --runThreadN 24 --readFilesCommand zcat --readFilesIn /gpfs/fs1/data/reddylab/scarlett/1000G/data/fastq/HG00096/ERR188040_R1.fastq.gz /gpfs/fs1/data/reddylab/scarlett/1000G/data/fastq/HG00096/ERR188040_R2.fastq.gz --outSAMtype BAM Unsorted SortedByCoordinate --outReadsUnmapped Fastx --outFileNamePrefix /data/reddylab/scarlett/1000G/data/StarOutput/HG00096/
AMtype BAM Unsorted SortedByCoordinate
' is recognized as '*'.
[main_samview] truncated file.
Exception in thread Thread-6:
Traceback (most recent call last):
File "/nfs/software/helmod/apps/Core/Anaconda/2.5.0-fasrc01/x/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/nfs/software/helmod/apps/Core/Anaconda/2.5.0-fasrc01/x/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/nfs/software/helmod/apps/Core/Anaconda/2.5.0-fasrc01/x/lib/python2.7/multiprocessing/pool.py", line 389, in _handle_results
task = get()
TypeError: ('__init__() takes at least 3 arguments (1 given)', <class 'subprocess.CalledProcessError'>, ())
The problem when using phASER with the BAM aligned with Tophat is that it gets stuck at the first step forever ...
STARTED "Read backed phasing and ASE/haplotype analyses" ...
DATE, TIME : 2019-11-01, 10:44:04
#1. Loading heterozygous variants into intervals...
Processing sample named HG00096
using all the chromosomes ...
processing VCF...
Does phASER work with Tophat-aligned BAM files? And what parameters do you specify for STAR alignment? I wonder whether that is what is causing these errors.
Could you help me with this?
I'd appreciate your help!
Thanks,
Scarlett
Hello,
I am trying to phase RNA-seq PE 150 mapped to its own denovo-reference. However, after running for a few hours and displaying many "completed chromosome X..." messages, it breaks with the following message:
.
.
completed chromosome TR43716|c0_g2_i1...
processing mapped reads...
Traceback (most recent call last):
File "/cluster/project/gdc/people/schamath/tools/phaser.py", line 2004, in <module>
main();
File "/cluster/project/gdc/people/schamath/tools/phaser.py", line 334, in main
alignment_scores = map(int,[x for x in subprocess.check_output("cut -f 5 "+" ".join(result_files), shell=True).split("\n") if x != ""]);
File "/cluster/apps/python/2.7.6/x86_64/lib64/python2.7/subprocess.py", line 566, in check_output
process = Popen(stdout=PIPE, *popenargs, **kwargs)
File "/cluster/apps/python/2.7.6/x86_64/lib64/python2.7/subprocess.py", line 709, in __init__
errread, errwrite)
File "/cluster/apps/python/2.7.6/x86_64/lib64/python2.7/subprocess.py", line 1326, in _execute_child
raise child_exception
OSError: [Errno 7] Argument list too long
What can I do to solve this problem?
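The "Argument list too long" error appears because every result file name is pasted onto a single `cut` command line, and a de novo assembly can have a very large number of contigs. One way around the limit, sketched here under the assumption that the result files are tab-delimited with the alignment score in the fifth column (which is what `cut -f 5` selects), is to read the files directly in Python. The helper `read_alignment_scores` is hypothetical and not part of phaser.py.

```python
import os
import tempfile


def read_alignment_scores(result_files, column=4):
    """Collect the integer in the given 0-based column from each file."""
    scores = []
    for path in result_files:
        with open(path) as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                if len(fields) > column and fields[column] != "":
                    scores.append(int(fields[column]))
    return scores


# Toy demonstration with two made-up result files.
tmpdir = tempfile.mkdtemp()
result_files = []
for i, score in enumerate(["115", "98"]):
    path = os.path.join(tmpdir, "result_%d.txt" % i)
    with open(path, "w") as handle:
        handle.write("a\tb\tc\td\t" + score + "\n")
    result_files.append(path)

alignment_scores = read_alignment_scores(result_files)
print(alignment_scores)
```

Because no shell command is built, the number of files is limited only by what the process can open one at a time.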
Trying out phaser at the NHGRI Computational Medicine hackathon.
Is the example at https://stephanecastel.wordpress.com/2017/02/15/how-to-generate-ase-data-with-phaser/ known to work with the current code?
##################################################
Welcome to phASER v1.1.1
Author: Stephane Castel ([email protected])
Updated by: Bishwa K. Giri ([email protected])
##################################################
Completed the check of dependencies and input files availability...
STARTED "Read backed phasing and ASE/haplotype analyses" ...
DATE, TIME : 2019-06-10, 18:17:09
#1. Loading heterozygous variants into intervals...
Processing sample named NA06986
using all the chromosomes ...
removing blacklisted variants and processing VCF...
#1b. Loading haplotypic count blacklist intervals...
Traceback (most recent call last):
File "REDO/phaser/phaser/phaser.py", line 2358, in <module>
main();
File "REDO/phaser/phaser/phaser.py", line 170, in main
parse_sample(sample_name, map_sample_column, args.bam, args.o, contig_ban)
File "REDO/phaser/phaser/phaser.py", line 235, in parse_sample
for line in raw_interval.split("\n"):
TypeError: a bytes-like object is required, not 'str'
Hi @secastel 👍 pHASER has been quite helpful for me. However, I have encountered several issues that would need attention to make this tool better.
1) The VCF file output by pHASER isn't quite compatible with GATK. I have not been able to figure out what the problem actually is, but I suspect it lies in the structure of the output *.vcf as written by pHASER. I can give you more details if need be.
2) GW (genome wide) phasing is quite useful, but I would like a choice to update the GT field with PB rather than with PG. In some instances we would want to get the haplotype states of the particular block.
Note: I am actually working with F1 hybrid data, and in that situation GW phasing with pHASER led to switch errors (verified by comparing several BAM and VCF files). The solution to the haplotyping problem (especially in F1 hybrids) is to get haplotypes that are as long as possible, each represented by a unique PI. So each PI has two haplotypes; now we can test statistically whether Haplotype-A vs. Haplotype-B belong to Population_X vs. Population_Y using an odds ratio or a Markov model. I have actually just finished writing a Python script to do this using the odds ratio. Currently I am writing another part that uses a Markov model, but being a biologist it's taking me some time. In a nutshell, my program uses the PB and PI generated by your program to stitch the haplotypes.
So, if you could add an option to update the GT with PB values, it would be helpful.
3) Another add-on: GATK generates the haplotypes in the following format:
GT:AD:DP:GQ:PGT:PID:PL 0/1:80,25:105:99:0|1:5398_A_G:780,0,3463
GT:AD:DP:GQ:PGT:PID:PL 0/1:14,4:18:83:1|0:47883_G_A:83,0,472
I transferred the phase state from PGT to GT using awk and supplied that VCF as phased input. The output file extended the haplotypes quite significantly for several blocks. I think this capability would be helpful.
4) Phasing of indels: the phasing of indels can also be improved by phasing the SNPs first, then transferring the phase state in PB to GT (which I did using awk again). The phased state of the indels was much improved. I think this can be incorporated. But the issue that came up was the following:
The FORMAT field would look like this: GT:AD:DP:PL:PG:PB:PI:PW:PC:PG:PB:PI:PW:PC
So, when the phased output from pHASER is supplied as phased input for phasing indels, the fields PG:PB:PI:PW:PC get appended again, along with the corresponding values in the SAMPLE field. This duplication can probably be removed.
5) This is the issue I previously reported: a choice to output the phase of homozygous alternate alleles (1|1, 2|2, 3|3) and the haplotype if they are connected with a heterozygous allele. I actually tried to write something within pHASER but couldn't keep track of the several def function() definitions.
Hope these comments don't bother you too much.
And, finally "Happy New Year 2017 ! " :) 👍
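The PGT-to-GT transfer described in point 3 was done with awk. A rough Python equivalent for a single FORMAT/SAMPLE pair might look like the sketch below. The function `transfer_pgt_to_gt` is hypothetical, and it ignores the fact that GATK's PGT phase is only meaningful relative to other variants sharing the same PID, so it is an illustration rather than a drop-in tool.

```python
def transfer_pgt_to_gt(format_field, sample_field):
    """Copy a phased PGT value into GT for one sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    if "GT" in keys and "PGT" in keys:
        gt_idx, pgt_idx = keys.index("GT"), keys.index("PGT")
        # Trailing fields may be dropped in a VCF, so check the length.
        if pgt_idx < len(values):
            pgt = values[pgt_idx]
            # Only copy genuinely phased, non-missing genotypes.
            if "|" in pgt and "." not in pgt:
                values[gt_idx] = pgt
    return ":".join(values)


phased = transfer_pgt_to_gt(
    "GT:AD:DP:GQ:PGT:PID:PL",
    "0/1:80,25:105:99:0|1:5398_A_G:780,0,3463",
)
untouched = transfer_pgt_to_gt("GT:AD:DP", "0/1:14,4:18")
print(phased)
print(untouched)
```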
When I run phaser_expr_matrix.py with the suggested python 2.7, I get this error for every line of the BED file:
[E::get_intv] Failed to parse TBX_GENERIC, was wrong -p [type] used?
The offending line was: "10 ENSRNOG00000033508 57185346 57238531 1|0 0|0 [...]
The column order is wrong because the pandas DataFrame is initialized with a dictionary. The pandas documentation says "column order follows insertion-order for Python 3.6 and later." When I run phaser_expr_matrix.py with python3, the column order is correct and it runs fine.
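A way to make the column order independent of the Python version is to pass an explicit `columns` list when building the DataFrame. The sketch below uses made-up column names and the values from the offending line above. The names actually used in phaser_expr_matrix.py may differ.

```python
import pandas

# Dictionary written in an arbitrary order, as Python 2.7 might iterate it.
data = {
    "start": [57185346],
    "name": ["ENSRNOG00000033508"],
    "contig": ["10"],
    "stop": [57238531],
}

# Explicit BED-style order (hypothetical column names).
column_order = ["contig", "start", "stop", "name"]
df = pandas.DataFrame(data, columns=column_order)
print(df.to_csv(sep="\t", index=False))
```

With the order fixed this way, the coordinates land in the second and third columns, which is where an indexer expecting BED input looks for them.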
Hi Stephane,
trying 0.9.2 for the first time today I came across this new bug
Best wishes,
Colin
#3. Identifying connected variants...
calculating sequencing noise level...
sequencing noise level estimated at 0.003902
creating read sets...
generating read connectivity map...
testing variant connections versus noise...
25551 variant connections dropped because of conflicting configurations (threshold = 0.010000)
104943 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
Traceback (most recent call last):
File "/home/bioinformatics/NAS01/programs/phaser/phaser/phaser/phaser.py", line 2009, in <module>
main();
File "/home/bioinformatics/NAS01/programs/phaser/phaser/phaser/phaser.py", line 779, in main
phase_support[0] += maf;
TypeError: unsupported operand type(s) for +=: 'int' and 'str'
I got the following error while using the phaser repo.
##################################################
Welcome to phASER v1.1.1
Author: Stephane Castel ([email protected])
Updated by: Bishwa K. Giri ([email protected])
##################################################
Completed the check of dependencies and input files availability...
STARTED "Read backed phasing and ASE/haplotype analyses" ...
DATE, TIME : 2019-04-24, 17:21:25
#1. Loading heterozygous variants into intervals...
Processing sample named 10780_wgs_FromBam
using all the chromosomes ...
processing VCF...
Memory efficient mode is deactivated...
If RAM is limited, activate memory efficient mode using the flag "--process_slow = 1"...
creating variant mapping table...
2441858 heterozygous sites being used for phasing (0 filtered, 0 indels excluded, 2268598 unphased)
#2. Retrieving reads that overlap heterozygous sites...
file: /imppc/labs/lplab/share/marc/epimutations/processed/bam/hg38/atacBam/10780_ATAC.bam
minimum mapq: 255
mapping reads to variants...
Traceback (most recent call last):
File "/imppc/labs/lplab/share/marc/repos/phaser/phaser/call_read_variant_map.py", line 2, in <module>
import read_variant_map;
ImportError: /imppc/labs/lplab/share/marc/repos/phaser/phaser/read_variant_map.so: undefined symbol: PyUnicodeUCS2_DecodeUTF8
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1
Traceback (most recent call last):
File "/imppc/labs/lplab/share/marc/repos/phaser/phaser/phaser.py", line 2358, in <module>
main();
File "/imppc/labs/lplab/share/marc/repos/phaser/phaser/phaser.py", line 170, in main
parse_sample(sample_name, map_sample_column, args.bam, args.o, contig_ban)
File "/imppc/labs/lplab/share/marc/repos/phaser/phaser/phaser.py", line 262, in parse_sample
start_time, vcf_out, sample_out_path, last_chr=True, pi_block_value = 0)
File "/imppc/labs/lplab/share/marc/repos/phaser/phaser/phaser.py", line 533, in process_vcf
result_files = parallelize(call_mapping_script, pool_input);
File "/imppc/labs/lplab/share/marc/repos/phaser/phaser/phaser.py", line 2089, in parallelize
pool_output.append(function(input));
File "/imppc/labs/lplab/share/marc/repos/phaser/phaser/phaser.py", line 1346, in call_mapping_script
error_code = subprocess.check_call("set -euo pipefail && "+run_cmd, stdout=devnull, shell=True, executable='/bin/bash')
File "/soft/general/python/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail && samtools view -h /imppc/labs/lplab/share/marc/epimutations/processed/bam/hg38/atacBam/10780_ATAC.bam 'chr1': | samtools view -Sh -F 0x400 -L /tmp/tmpnWNPVv -q 255 - | python2.7 /imppc/labs/lplab/share/marc/repos/phaser/phaser/call_read_variant_map.py --baseq 10 --splice 1 --isize_cutoff 0.0 --variant_table /tmp/tmpy1yJfS --o /tmp/tmppUwFEs' returned non-zero exit status 1
Code:
python /imppc/labs/lplab/share/marc/repos/phaser/phaser/phaser.py --vcf $vcf --bam $bam --mapq 255 --baseq 10 --sample $(bcftools query -l $vcf) --o prove --paired_end 0 --id_separator "@"
I used the id_separator "@" because of this error:
FATAL ERROR: Character '_' must not be present in contig name. Please change id separtor using --id_separator to a character not found in the contig names and try again.
Any suggestions?
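The `undefined symbol: PyUnicodeUCS2_DecodeUTF8` message usually indicates that read_variant_map.so was compiled against a Python 2 build with a different internal Unicode width than the interpreter that is now loading it. A quick way to see which kind of build a given interpreter is, offered as a diagnostic sketch only (the function name `unicode_build` is made up):

```python
import sys


def unicode_build():
    """Report whether this interpreter uses narrow or wide Unicode."""
    # Narrow (UCS2) Python 2 builds report 0xFFFF; wide builds and all
    # recent Python 3 versions report 0x10FFFF.
    if sys.maxunicode > 0xFFFF:
        return "UCS4 (wide)"
    return "UCS2 (narrow)"


build = unicode_build()
print(sys.executable, build)
```

If the interpreter invoked as python2.7 inside the pipeline reports a different width than the one used to compile the extension, rebuilding the extension with that same interpreter is the usual remedy.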