secastel / phaser

phasing and Allele Specific Expression from RNA-seq

License: GNU General Public License v3.0

Python 99.65% Shell 0.35%

phaser's Issues

problem when running phaser

Dear Stephane.

I am trying to run phaser on my server (they did the installation), but unfortunately I always get this error:
/bin/sh: python2.7: command not found
/bin/sh: python2.7: command not found
/bin/sh: python2.7: command not found
/bin/sh: python2.7: command not found
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1samtools view: writing to standard output failed
: Broken pipe

I have tried setting up an alias for python2.7, but it doesn't work. Any hints?

Thanks
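
A note on the alias: shell aliases are generally not visible to the non-interactive `/bin/sh` that a script spawns, so the interpreter has to be an executable found on `PATH`. Below is a minimal diagnostic sketch in Python 3 (not part of phASER; the helper name `check_tools` is hypothetical) that reports whether the commands named in the error output can be found:

```python
import shutil


def check_tools(names):
    """Map each command name to its full path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


def report(names=("python2.7", "samtools")):
    """Print one line per tool so missing executables are easy to spot."""
    for name, path in check_tools(names).items():
        print(name, "->", path if path else "NOT FOUND on PATH")
```

If `python2.7` is reported as missing, a symlink or small wrapper script named `python2.7` in a directory on `PATH` is one possible workaround.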

Prior population phasing with or without WGS?

Hi Stephane,

I have WGS and RNA-seq from the same donor (a cell line), and I'd like to get haplotypes that are as long as possible. In the docs you say that population-based phasing prior to phaser helps a lot. What about population-based phasing combined with read-based phasing from the WGS reads only (as SHAPEIT2 does) prior to phaser? I'm not sure whether you've tested, or have a sense of, whether it is better to include WGS at both steps or only in the phaser step.

Thanks,
Sahin

no alignment score value found in reads

Hi,

I am trying to run phaser using HiC data on a ~3 Mb region and ran into the "no alignment score value found" error.

The command I am using:

python phaser.py --vcf ~/GT_myc_output.recalibrated.filtered.vcf.gz --bam ~/SRR6251266_chr8pairs_only_bwa_sorted.bam --paired_end 1 --o ~/phaser_case --sample 20 --mapq 60 --baseq 20

(I have tried restricting the interval --chr chr8 but get the same error)

The output:

STARTED "Read backed phasing and ASE/haplotype analyses" ...
DATE, TIME : 2019-11-07, 10:22:17
#1. Loading heterozygous variants into intervals...
Processing sample named 20
using all the chromosomes ...
processing VCF...

Memory efficient mode is deactivated...
If RAM is limited, activate memory efficient mode using the flag "--process_slow = 1"...

 creating variant mapping table...
      1059 heterozygous sites being used for phasing (1243 filtered, 0 indels excluded, 988 unphased)

#2. Retrieving reads that overlap heterozygous sites...
file: ~/SRR6251266_chr8pairs_only_bwa_sorted.bam
minimum mapq: 10
mapping reads to variants...
completed chromosome chr8...
processing mapped reads...
no alignment score value found in reads, cannot use cutoff
retrieved 0 reads
#3. Identifying connected variants...
calculating sequencing noise level...
FATAL ERROR: No reads could be matched to variants. Please double check your settings and input files. Common reasons for this occurring include: 1) MAPQ or BASEQ set too conservatively 2) BAM and VCF have different chromosome names (IE 'chr1' vs '1').

After inspecting the BAM I can see that column 5 has the correct MAPQ scores. Example:

SRR6251266.247070450 81 chr8 10025 54 42M = 10860 795 CAGTGCAGACTGATATATAAATCAAAACAAATGTCCTTTACA AEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEAAAAA NM:i:0 MD:Z:42 MC:Z:36M6S AS:i:42 XS:i:33
SRR6251266.167696392 97 chr8 10051 60 42M = 149413 139404 ACAAATGTCCTTTACATGTTTTCTGTTACAGTAGTAACAATA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:42 MC:Z:42M AS:i:42 XS:i:29
SRR6251266.100942691 97 chr8 10052 60 42M = 18805 8795 CAAATGTCCTTTACATGTTTTCTGTTACAGTAGTAACAATAT AAAAAEE/AEEEEEE/EEEEEEAEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:42 MC:Z:42M AS:i:42 XS:i:28
SRR6251266.173849305 97 chr8 10056 60 42M = 77718 67697 TGTCCTTTACATGTTTTCTGTTACAGTAGTAACAATATGTGT /AAAAEEEEEEEEEEEEEAEEEAEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:42 MC:Z:7S35M AS:i:42 XS:i:28
SRR6251266.217291158 177 chr8 10057 60 42M = 162736 152680 GTCCTTTACATGTTTTCTGTTACAGTAGTAACAATATGTGTA EAEEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEA6AAA NM:i:0 MD:Z:42 MC:Z:42M AS:i:42 XS:i:29
SRR6251266.119745571 161 chr8 10063 39 42M = 11197 1176 TACATGTTTTCTGTTACAGTAGTAACAATATGTGTAAACTTA AAAAAEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEE NM:i:0 MD:Z:42 MC:Z:42M AS:i:42 XS:i:35 XA:Z:chr4,+21281,27M1D15M,1;

I would appreciate any help :)
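
The excerpt above shows an `AS:i:42` tag on every record, so it may be worth confirming whether any of the reads lack it. This is a sketch that assumes plain tab-separated SAM text (for example the output of `samtools view`). The function names are illustrative, and this is not how phASER itself reads the tag:

```python
def alignment_score(sam_line):
    """Return the integer value of the AS tag in a SAM record, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start at the 12th column of a SAM record.
    for tag in fields[11:]:
        if tag.startswith("AS:i:"):
            return int(tag[len("AS:i:"):])
    return None


def count_missing_as(sam_lines):
    """Count alignment records (header lines skipped) that carry no AS tag."""
    return sum(
        1
        for line in sam_lines
        if line.strip() and not line.startswith("@") and alignment_score(line) is None
    )
```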

UnboundLocalError local variable 'sample_column' referenced before assignment

Have you seen this error ? Seems most files were produced, but the VCF is empty. Could it have something to do with the lack of IDs in the VCF ? I am rerunning with "--unique_ids 1 --unphased_vars 1", as the goal is gene level ASE.

Output:
out.allele_config.txt
out.allelic_counts.txt
out.haplotypes.txt
out.haplotypic_counts.txt
out.variant_connections.txt
out.vcf

out.vcf only contains a header, no data

##FORMAT=<ID=PB,NUMBER=1,TYPE=String,Description="phASER Local Block">
##FORMAT=<ID=PI,NUMBER=1,TYPE=String,Description="phASER Local Block Index (unique for each block)">
##FORMAT=<ID=PW,NUMBER=1,TYPE=String,Description="phASER Genome Wide Genotype">
##FORMAT=<ID=PC,NUMBER=1,TYPE=String,Description="phASER Genome Wide Confidence">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT K564

``python phaser.py --bam in.bam --vcf in.vcf --o test1_out --sample testsample --threads 48 --mapq 10 --baseq 10 --pass_only 0

          Welcome to phASER v0.2

Author: Stephane Castel ([email protected])

#1. Loading heterozygous variants into intervals...

 loading VCF into memory...
 parsing VCF...
      371607 total heterozygous variants, 0 indels excluded, 0 blacklisted variants
 creating genomic intervals...

#2. Retrieving reads that overlap heterozygous sites...

 Reads are being written to disk, this will impact performance.
 file: K564_2DB4008_Rep1_6.bam
      minimum mapq: 10
      retrieved 8418748 reads

      using alignment score cutoff of 188
      splitting reads into 84 files with 100000 reads
      assigning reads to variants...

#3. Identifying connected variants...

 sequencing noise level estimated at 0.006877
 24323 variant connections dropped because of conflicting configurations (threshold = 0.010000)
 68195 variants covered by at least 1 read

#4. Identifying haplotype blocks...
#5. Phasing blocks...

 phasing large (>15 variants) blocks...
 identifying haplotypes with most support...

#6. Outputting haplotypes...
#7. Outputting phased VCF...

Traceback (most recent call last):
File "/home/bioinformatics/NAS01/programs/phaser/phaser/phaser/phaser.py", line 1756, in <module>
main();
File "/home/bioinformatics/NAS01/programs/phaser/phaser/phaser/phaser.py", line 1012, in main
write_vcf();
File "/home/bioinformatics/NAS01/programs/phaser/phaser/phaser/phaser.py", line 1123, in write_vcf
genotype = list(vcf_columns[sample_column].split(":")[gt_index]);
UnboundLocalError: local variable 'sample_column' referenced before assignment
``
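
In the log above the command passes `--sample testsample`, while the only sample column in the VCF header is `K564`. One possible cause (an assumption, not confirmed by the author) is that the sample name never matches a header column, so `sample_column` is never assigned. Here is a sketch of an explicit check that fails early with a readable message; `find_sample_column` is a hypothetical helper, not phASER's own code:

```python
def find_sample_column(header_line, sample):
    """Return the 0-based column index of `sample` in a VCF #CHROM header line.

    Raises ValueError listing the available samples if the name is not found.
    """
    columns = header_line.rstrip("\n").split("\t")
    fixed = 9  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT
    samples = columns[fixed:]
    if sample not in samples:
        raise ValueError(
            "sample %r not found in VCF header; available samples: %s"
            % (sample, ", ".join(samples) or "(none)")
        )
    return fixed + samples.index(sample)
```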

stop processing if there are no het variants

Hi,
Sometimes a chromosome or contig may not have any het variants to analyse. In these cases it would be useful if phaser.py stopped and exited instead of carrying on.
This is an example:

##################################################
              Welcome to phASER v0.2
  Author: Stephane Castel ([email protected])
##################################################

#1. Loading heterozygous variants into intervals...
     loading VCF into memory...
     parsing VCF...
          0 total heterozygous variants, 0 indels excluded, 0 blacklisted variants
     creating genomic intervals...
#2. Retrieving reads that overlap heterozygous sites...
     Reads are being written to disk, this will impact performance.
     file: mapping/samples/gsnap/BELA.sorted.bam
          minimum mapq: 30
          retrieved 0 reads
Traceback (most recent call last):
  File "/home/shared/app/phaser/phaser/phaser.py", line 1756, in <module>
    main();
  File "/home/shared/app/phaser/phaser/phaser.py", line 301, in main
    use_as_cutoff, as_cutoff = calculate_as_cutoff(reads_in);
  File "/home/shared/app/phaser/phaser/phaser.py", line 1706, in calculate_as_cutoff
    percentile_cutoff = numpy.percentile(alignment_scores,args.as_q_cutoff*100);
  File "/home/ipedroso/anaconda/envs/phaser/lib/python2.7/site-packages/numpy/lib/function_base.py", line 3268, in percentile
    interpolation=interpolation)
  File "/home/ipedroso/anaconda/envs/phaser/lib/python2.7/site-packages/numpy/lib/function_base.py", line 2997, in _ureduce
    r = func(a, **kwargs)
  File "/home/ipedroso/anaconda/envs/phaser/lib/python2.7/site-packages/numpy/lib/function_base.py", line 3385, in _percentile
    x1 = take(ap, indices_below, axis=axis) * weights_below
  File "/home/ipedroso/anaconda/envs/phaser/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 124, in take
    return take(indices, axis, out, mode)
IndexError: cannot do a non-empty take from an empty axes.

cheers,
inti
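
The traceback ends with `numpy.percentile` being called on an empty list of alignment scores. The following sketch shows the kind of guard requested. It uses a plain nearest-rank percentile as a stand-in for `numpy.percentile`, so values can differ from NumPy's interpolated result, and the function name is hypothetical:

```python
import math


def alignment_score_cutoff(scores, quantile):
    """Return the nearest-rank `quantile` (0..1) of `scores`, or None if empty.

    Returning None lets the caller stop cleanly when no reads were retrieved,
    instead of failing inside the percentile calculation.
    """
    if not scores:
        return None
    ordered = sorted(scores)
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1]
```

A caller could then exit with a short message such as "no heterozygous variants or reads for this contig" when the result is `None`.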

Issues with the reference genome update (hg19 to hg38)

Now that the reference genome has been updated, I want to use hg38 for my analysis. Can this tool be used with hg38? If so, I notice that you only provide files for hg19 (under 'Useful files': hg19_hla.bed.gz and hg19_haplo_count_blacklist.bed.gz). Could you please provide hg38 versions of these files?

phaser_gene_ae only compatible with v1.0.0+, I'm using v1.0.0

Hi Secastel,

When I try to run phaser_gene_ae.py on my haplotypic counts I get the error

ERROR - this version of phaser_gene_ae is only compatible with results from phASER v1.0.0+

However, I am using https://github.com/secastel/phaser/archive/cd7daba.zip for the code, and when I look in the log-files of the jobs that got the phASER output it says

##################################################
Welcome to phASER v1.0.0
Author: Stephane Castel ([email protected])
##################################################

This is the command I used for phASER:

python /apps/software/phASER/20170714-cd7daba/phaser/phaser.py
--paired_end 1
--bam $BAM
--vcf $VCF
--mapq 255
--sample $SAMPLE
--baseq 10
--o $phaserOutPrefix
--temp_dir $TMPDIR
--threads 1
--gw_phase_method 1
--gw_phase_vcf 1
--show_warning 1
--debug 1
--unphased_vars 1

and the command for phaser_gene_ae:

python /apps/software/phASER/20170714-cd7daba/phaser_gene_ae/phaser_gene_ae.py
--haplotypic_counts $haplotype_count
--features hg19_ensembl.bed
--o results/gene_ae/$SAMPLENAME.geneAE.txt

Any ideas what could cause this?

Thanks!

Get all possible haplotypes for a particular gene

Hi Stephane,

Many thanks for sharing phaser, great tool! I was wondering if there is any way to get all possible haplotypes for a particular gene, when this is compatible with the input data.

RSLVD?: Invalid literal during cutoff: " "

Error:
...
completed chromosome X...
processing mapped reads...
using alignment score cutoff of 122
Traceback (most recent call last):
File "tools/phaser-v1.1.0/phaser/phaser.py", line 2359, in <module>
main();
File "tools/phaser-v1.1.0/phaser/phaser.py", line 170, in main
parse_sample(sample_name, map_sample_column, args.bam, args.o, contig_ban)
File "tools/phaser-v1.1.0/phaser/phaser.py", line 262, in parse_sample
start_time, vcf_out, sample_out_path, last_chr=True, pi_block_value = 0)
File "tools/phaser-v1.1.0/phaser/phaser.py", line 555, in process_vcf
pool_output = parallelize(process_mapping_result, result_files);
File "tools/phaser-v1.1.0/phaser/phaser.py", line 2090, in parallelize
pool_output.append(function(input));
File "tools/phaser-v1.1.0/phaser/phaser.py", line 1303, in process_mapping_result
if use_as_cutoff == False or int(fields[4]) >= as_cutoff:
ValueError: invalid literal for int() with base 10: ''
Command exited with non-zero status 1
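
The failing expression is `int(fields[4])`, with an empty string in that column. Below is a sketch of a tolerant version of that comparison. It is an illustration rather than the fix applied upstream, and treating a read with no recorded score as failing the cutoff is an assumption (such reads could equally be kept):

```python
def passes_as_cutoff(fields, use_as_cutoff, as_cutoff, as_index=4):
    """Return True if the record's alignment score meets the cutoff.

    `fields` is one split line of the intermediate mapping result; the score
    is expected at `as_index` (4 in the traceback above).
    """
    if not use_as_cutoff:
        return True
    raw = fields[as_index].strip() if len(fields) > as_index else ""
    if not raw:
        # No score recorded for this read: drop it rather than crash (assumption).
        return False
    return int(raw) >= as_cutoff
```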

Phasing multiple samples in the same VCF file at the same time

We have a VCF file with many thousand samples that we would like to phase. Currently using phASER we can only phase one sample at a time, which means that for every sample we need to write out a new VCF file and merge them once all are finished.

Would it be possible to change it so that multiple samples in one VCF can be phased at the same time?
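
Until something like this is supported, the per-sample runs can at least be generated automatically. This sketch reads the sample names from the `#CHROM` header line and builds one command line per sample. It only uses flags that appear in other commands in this thread; the `phaser.py` path, the output directory, and the `bam_for_sample` mapping are placeholders:

```python
import shlex


def sample_names(header_line):
    """Return the sample names from a VCF #CHROM header line."""
    return header_line.rstrip("\n").split("\t")[9:]


def build_commands(header_line, vcf, bam_for_sample, out_dir="phased"):
    """Build one phaser.py command string per sample that has a BAM file."""
    commands = []
    for sample in sample_names(header_line):
        bam = bam_for_sample.get(sample)
        if bam is None:
            continue  # no reads for this sample, nothing to phase
        args = [
            "python", "phaser.py",
            "--vcf", vcf, "--bam", bam, "--sample", sample,
            "--paired_end", "1", "--mapq", "10", "--baseq", "10",
            "--o", "%s/%s" % (out_dir, sample),
        ]
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands
```

The resulting strings can be written to a job file or passed to a scheduler; merging the per-sample VCFs afterwards is still a separate step.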

Question about problems encountered when using phASER

Dear Stephane,
phASER has been quite helpful for me. However, I'm trying to assess allele-specific gene expression in single-cell RNA-Seq data from human lung cancer cells, and I have tried many times, but every attempt failed (aCount=0 and bCount=0). Is there a problem with my VCF file?

vcf file:

chr1    89923   .       A       T       68      .       DP=3;VDB=4.340713e-02;AF1=1;AC1=2;DP4=0,0,2,1;MQ=50;FQ=-36      GT:PL:GQ        1/1:100,9,0:16
chr1    90311   .       T       C       8.64    .       DP=17;VDB=6.089532e-02;RPB=-2.152553e+00;AF1=0.5;AC1=1;DP4=6,7,1,2;MQ=48;FQ=11.3;PV4=1,0.15,1, 0.059     GT:PL:GQ        0/1:38,0,229:40
chr1    134223  .       G       C       87      .       DP=5;VDB=2.649457e-02;RPB=8.293682e-01;AF1=0.5013;AC1=1;DP4=1,0,3,1;MQ=50;FQ=-5.45;PV4=1,0.42,1,1       GT:PL:GQ        0/1:117,0,23:26
chr1    134667  .       A       G       158     .       DP=6;VDB=7.655903e-02;AF1=1;AC1=2;DP4=0,0,3,3;MQ=50;FQ=-45      GT:PL:GQ        1/1:191,18,0:33

haplotypic_counts.txt:

- chr15   60422193        60422224        chr15_60422193_T_C,chr15_60422224_T_C   2       C,C     T,T     0       0       0       0/1     0.5
- chr15   29719135        29719136        chr15_29719135_A_G,chr15_29719136_C_T   2       G,T     A,C     0       0       0       0/1     0.5
- chr15   85748317        85748331        chr15_85748317_T_G,chr15_85748331_G_A   2       T,A     G,G     0       0       0       0/1     0.5
- chr15   84202077        84202144        chr15_84202077_C_G,chr15_84202144_G_A   2       C,G     G,A     0       0       0       0/1     0.5

python phaser.py --pass_only 0 --vcf var.flt.vcf.gz --bam $bam --paired_end 1 --mapq 10 --baseq 10 --sample $bam --blacklist hg19_hla.bed --haplo_count_blacklist hg19_haplo_count_blacklist.bed --threads 6 --o ase

I do not know much about this; can you tell me where the problem is?
thanks

Confidence estimate for phased state between two adjacent sites within a haplotype block?

Hi @secastel

As I understand it, PC is a confidence estimate for the GW phase. But I am interested in checking how well adjacent alleles are phased within a haplotype block (PI). In the data below, is there a way to estimate the confidence level for

1860    1|0         vs.         1860    1|0 
1879    0|1                     1879    1|0

2	1860	.	T	G	22718.54	PASS	AC=6;AF=0.231;AN=26;BaseQRankSum=0.548;ClippingRankSum=0.00;DP=2218;ExcessHet=6.2249;FS=5.236;InbreedingCoeff=-0.3001;MQ=47.32;MQRankSum=0.00;QD=20.43;ReadPosRankSum=-5.100e-02;SOR=0.399;set=Intersection	GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC	0/1:204,329:533:99:10442,0,7734:1|0:.,.,.,.,.:1036:|:0.5
2	1879	.	G	A	11022.23	PASS	AC=8;AF=0.286;AN=28;BaseQRankSum=-4.150e-01;ClippingRankSum=0.00;DP=2210;ExcessHet=0.0577;FS=5.183;InbreedingCoeff=0.6497;MQ=60.00;MQRankSum=0.00;QD=10.93;ReadPosRankSum=0.049;SOR=1.060;set=Intersection	GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC	0/1:326,208:534:99:5861,0,13140:0|1:.,.,.,.,.:1036:|:0.5

Are the confidence values (PC, which is actually the GW confidence) good for estimating the confidence of PG at the haplotype block level? My thought is no, because almost all my sites have a PC of 0.5, since I didn't supply a phased VCF (which I don't have). Is there any other way I can estimate this?

Thanks,

Question on out_prefix.vcf.gz

Hi Stephane,

I looked at the structure of the out_prefix.vcf.gz output and I'd like you to enlighten me on a few things:

Do blacklisted variants within a haplotype (i.e. variants listed as variantsBlacklisted in haplotypic_counts.txt) get their, say, PB field populated in the vcf?
Does the PI field in the vcf have anything to do with the line index in haplotypic_counts.txt?

Thanks for your answer in advance!
Cheers,
John Ma
Department of Lymphoma/Myeloma
UT MD Anderson Cancer Center

Phaser assigns incorrect SNVs to genes

Dear Stephane,

I am experiencing some problems with Phaser results.
I need to obtain allelic counts for hybrid yeast, and I know the reference sequence of one of its parents.
My pipeline is as follows:
Generate BAM files using STAR (sorted by coordinate) by mapping reads from the hybrid to the known parental reference. Then generate a VCF file from the RNA-seq data using the GATK pipeline (I skip the base recalibration step), and filter the VCF file with the parameters mentioned in the GATK pipeline (http://gatkforums.broadinstitute.org/gatk/discussion/3891/calling-variants-in-rnaseq).

Then I bgzip and tabix the filtered VCF file.

Here are some lines of final vcf file
NC_018292.1 1078 . G A 100.28 PASS AC=2;AF=1.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=33.43;SOR=1.179 GT:AD:DP:GQ:PL 1/1:0,3:3:9:128,9,0
NC_018292.1 1130 . T C 100.28 PASS AC=2;AF=1.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=33.43;SOR=1.179 GT:AD:DP:GQ:PL 1/1:0,3:3:9:128,9,0
NC_018292.1 1249 . A T 178.90 PASS AC=2;AF=1.00;AN=2;DP=5;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.24;SOR=1.022 GT:AD:DP:GQ:PL 1/1:0,5:5:15:207,15,0
NC_018292.1 1369 . T C 107.28 SnpCluster AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.09;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:9:135,9,0

My command line for phaser is:
python phaser/phaser/phaser.py --vcf filt_merged.vcf.gz --bam Aligned_sorted.bam
--paired_end 1 --mapq 255 --baseq 10 --sample sample --id_separator "="
--threads 10 --o phaser_out

Then I do:
python phaser_gene_ae.py --haplotypic_counts phaser_out.haplotypic_counts.txt
--features orth.bed --o phaser_gene_ae.txt --no_gw_phase 1

I generated the .bed file from a GFF file just by selecting the corresponding columns.

In the gene output file I see strange behaviour from phaser:
Many SNPs are assigned to incorrect genes (you can see it from the coordinates):
NC_018292.1 1048416 1052312 CORT_0A04800 3763 3200 6963 0.233811384293 61 NC_018292.1=948682=C=T,NC_018292.1=948822=T=C,NC_018292.1=948823=A=G,
NC_018292.1 1054556 1055605 CORT_0A04810 3763 3200 6963 0.233811384293 61 NC_018292.1=948682=C=T,NC_018292.1=948822=T=C,NC_018292.1=948823=A=G,
NC_018292.1 1056216 1056737 CORT_0A04820 3763 3200 6963 0.233811384293 61 NC_018292.1=948682=C=T,NC_018292.1=948822=T=C,NC_018292.1=948823=A=G,

I'd appreciate it very much if you could give a hint on how to troubleshoot this issue.
Thank you,
Hrant
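
In the output lines above, the listed variants sit near position 948,682 while the features start at 1,048,416 or later, and all three genes report identical counts. The sketch below flags such rows automatically. It assumes the `=` ID separator from the command above and BED-style feature coordinates (0-based start, exclusive end); the function names are hypothetical:

```python
def variant_position(variant_id, separator="="):
    """Split an ID such as 'NC_018292.1=948682=C=T' into (contig, 1-based position)."""
    contig, pos, _ref, _alt = variant_id.split(separator)
    return contig, int(pos)


def variants_outside_feature(contig, start, end, variant_ids, separator="="):
    """Return the variant IDs that do not fall inside the BED interval."""
    outside = []
    for vid in variant_ids:
        if not vid:
            continue  # tolerate the trailing comma in the variant list
        v_contig, pos = variant_position(vid, separator)
        # BED start is 0-based and end is exclusive; VCF positions are 1-based.
        if v_contig != contig or not (start < pos <= end):
            outside.append(vid)
    return outside
```

Separately, BED starts are 0-based while GFF starts are 1-based, so copying GFF columns unchanged shifts each feature start by one base. That alone would not explain an offset of this size.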

License?

Hi,

congrats on the interesting preprint. Quite surprised RNA-seq is apparently so effective in phasing.

But on another note : I didn't find a license so far in the repo or the preprint. Which one is applicable ?

Thanks,
Colin

Loss of AD/DP info after phasing

Hello,

I ran phASER as suggested here by running Sanger imputation service on my VCFs.

I managed to run it and then obtain phASER results with no runtime problem but the VCF I obtain at the end of the pipeline has lost the AN and DP information.
After Sanger imputation all variants have AN=2 and DP=2
Example:
After imputing with Sanger

TYPED;RefPanelAF=0.489683;AN=2;AC=1;INFO=1 GT:ADS:DS:GP:PS:PG:PB:PI:PW:PC:PM 1|0:1,0:1:0,1,0:13380:0/1:.:.:1|0:.:.

Same variant before:

GT:AD:DP:GQ:PL 0/1:11,7:18:99:244,0,264

Is this a correct behavior for the analysis pipeline or is there a better way to keep the DP and AD information throughout the process ?

Thanks,

Mattia

Phaser freezing after retrieving reads

Hi,

I am running phaser on a bam of around 6Gb and it is completing step 2 but never making it to step 3. I have been able to successfully run the example in the tutorial. This is the end of the log file:
[samopen] SAM header is present: 25 sequences.
completed chromosome Y...
[samopen] SAM header is present: 25 sequences.
completed chromosome 3...
[samopen] SAM header is present: 25 sequences.
completed chromosome 2...
[samopen] SAM header is present: 25 sequences.
completed chromosome 5...
[samopen] SAM header is present: 25 sequences.
completed chromosome 9...
[samopen] SAM header is present: 25 sequences.
completed chromosome X...
[samopen] SAM header is present: 25 sequences.
completed chromosome 6...
[samopen] SAM header is present: 25 sequences.
completed chromosome 8...
[samopen] SAM header is present: 25 sequences.
completed chromosome 13...
[samopen] SAM header is present: 25 sequences.
completed chromosome 10...
[samopen] SAM header is present: 25 sequences.
completed chromosome 15...
[samopen] SAM header is present: 25 sequences.
completed chromosome 14...
completed chromosome 1...
completed chromosome 18...
completed chromosome 11...
completed chromosome 16...
completed chromosome 17...
completed chromosome 19...
completed chromosome 12...

It has been running for over an hour (and on a larger file over 2 days still at this point). Any ideas as to what might be causing this would be great!

Thanks,
Clare

separate read counts from multiple bam files from one individual/importance of pre-phasing?

Dear Stephane,

With great interest I have read your recent publication in Genome Biology and seen phASER, which I would really like to use for my own work. So I was wondering if I could ask your advice on some points.

I'm trying to assess allele-specific gene expression in single-cell RNA-Seq data from human blood cells. From each individual, I have a fairly large number of single-cell transcriptomes (approximately 200-300). In addition, I have one exome per individual. My aim is to obtain the most accurate ASE counts possible from each individual single-cell transcriptome.

My strategy so far is:

Call variants from the exome. (I have used a bcbio pipeline for this [https://bcbio-nextgen.readthedocs.io/en/latest/contents/testing.html#exome-with-validation-against-reference-materials].)
Map the sc-RNA-Seq reads to the reference genome. (I have used STAR for the mapping and then WASP to eliminate mapping biases.)
Next, I would like to use phASER to obtain gene-level haplotypic read counts for each individual transcriptome using the variant file from step 1 and the bam file(s) generated in step 2.

In step 3, of course, I would like to use the reads from all the sc-transcriptomes of a given individual for the read backed phasing but obtain separate read counts for each individual single cell. Do I understand you correctly that this in principle can be achieved using --bam <list of all the bam files> --haplo_count_bam 1 to output haplotypic read counts for bam file 1 only, while using the information from all the bam files for phasing? If yes, is it possible to obtain separate read counts for multiple bam files from the same run (e.g. specifying --haplo_count_bam 1, 2, ...) or would I have to run the script repeatedly, once for each bam file?

I also meant to ask another question: In your tutorial you emphasize that the variant file should be pre-phased using a method like population phasing. I was wondering how important that is in the above situation (i.e. with a vcf that has been constructed from a single exome and with a really large number of RNA-Seq reads that can be used for read backed phasing). Do you think pre-phasing would still improve the quality of allelic counts here? I'm asking this because the various servers for phasing and imputation all require GRCh37-based coordinates, while I would prefer to work with hg38. (I have noticed that the bed files you provide are also based on hg19 but I suppose these could be lifted over to hg38 pretty easily?) Are you aware of any resources for population-based (pre-)phasing based on hg38 coordinates?

Would be great if you could have a look at this and give me some feedback.

Thanks a lot in advance,

Maik

phASER duplicates VCF meta-information FORMAT lines in output VCF

The phaser.py script appears to duplicate the VCF meta-information FORMAT lines in the output VCF file. For instance, if I had the following test.vcf.gz VCF file:

##fileformat=VCFv4.1
##INFO=<ID=mut_type,Number=1,Type=String,Description="Mutation type">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL  FILTER  INFO  FORMAT  amsim_pcawg_phaseable
chr1  877831  . T C 40  PASS  mut_type=germline GT  1/1
chr1  877868  . G T 40  PASS  mut_type=somatic  GT  0/1
chr1  881019  . G T 40  PASS  mut_type=somatic  GT  0/1
chr1  881025  . G T 40  PASS  mut_type=somatic  GT  0/1
chr1  881602  . C T 40  PASS  mut_type=somatic  GT  0/1
chr2  45812 . C A 40  PASS  mut_type=somatic  GT  0/1
chr2  45862 . G T 40  PASS  mut_type=somatic  GT  0/1
chr2  45875 . C T 40  PASS  mut_type=somatic  GT  0/1
chr2  45895 . A G 40  PASS  mut_type=germline GT  1/1
chr2  45946 . C T 40  PASS  mut_type=somatic  GT  0/1

Running the following command:

python lib/phaser/phaser/phaser.py \
        --vcf test.vcf.gz \
        --bam data/bams/amsim_pcawg_phaseable.sorted.cleaned.dups_removed.bam \
        --paired_end 1 \
        --mapq 20 \
        --baseq 20 \
        --sample amsim_pcawg_phaseable \
        --include_indels 1 \
        --blacklist data/phaser_test_data/hg19_hla.bed \
        --haplo_count_blacklist data/phaser_test_data/hg19_haplo_count_blacklist.bed \
        --threads 1 \
        --o test/phaser

Produces the output test/phaser.vcf.gz VCF file with the GT meta-information line duplicated.

##fileformat=VCFv4.1
##INFO=<ID=mut_type,Number=1,Type=String,Description="Mutation type">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PG,Number=1,Type=String,Description="phASER Local Genotype">
##FORMAT=<ID=PB,Number=1,Type=String,Description="phASER Local Block">
##FORMAT=<ID=PI,Number=1,Type=String,Description="phASER Local Block Index (unique for each block)">
##FORMAT=<ID=PM,Number=1,Type=String,Description="phASER Local Block Maximum Variant MAF">
##FORMAT=<ID=PW,Number=1,Type=String,Description="phASER Genome Wide Genotype">
##FORMAT=<ID=PC,Number=1,Type=String,Description="phASER Genome Wide Confidence">
#CHROM  POS ID  REF ALT QUAL  FILTER  INFO  FORMAT  amsim_pcawg_phaseable
chr1  877831  . T C 40  PASS  mut_type=germline GT:PG:PB:PI:PW:PC:PM  1/1:1/1:.:.:1/1:.:.
chr1  877868  . G T 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:0/1:.:.:0/1:.:.
chr1  881019  . G T 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:0|1:chr1_881019_G_T,chr1_881025_G_T:1:|:0.5:0
chr1  881025  . G T 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:0|1:chr1_881019_G_T,chr1_881025_G_T:1:|:0.5:0
chr1  881602  . C T 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:0/1:.:.:0/1:.:.
chr2  45812 . C A 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:0|1:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0
chr2  45862 . G T 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:1|0:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0
chr2  45875 . C T 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:1|0:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0
chr2  45895 . A G 40  PASS  mut_type=germline GT:PG:PB:PI:PW:PC:PM  1/1:1/1:.:.:1/1:.:.
chr2  45946 . C T 40  PASS  mut_type=somatic  GT:PG:PB:PI:PW:PC:PM  0/1:1|0:chr2_45812_C_A,chr2_45862_G_T,chr2_45875_C_T,chr2_45946_C_T:2:|:0.5:0

This causes issues in downstream VCF processing tools (e.g. vcfR) that expect there to be no duplicate meta-information lines.
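
As a stopgap on the user side, duplicated meta-information lines can be dropped after the fact. This is a minimal sketch that keeps the first copy of each `##` line and leaves everything else untouched; it is a post-processing workaround, not a change to phaser.py, and the function name is illustrative:

```python
def dedupe_vcf_header(lines):
    """Remove exact duplicate '##' meta-information lines, preserving order."""
    seen = set()
    cleaned = []
    for line in lines:
        if line.startswith("##"):
            if line in seen:
                continue  # an identical meta line was already emitted
            seen.add(line)
        cleaned.append(line)
    return cleaned
```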

Pseudogenomes

Hi,

quick question : should read alignments in BAM format and SNP calls in VCF be from alignments to pseudogenomes ? By pseudogenomes I mean genomes in which SNPs have been inserted to avoid too much reference bias, which is a known problem for these types of analyses.

Or is it OK if the BAM and VCF are from alignments to the reference sequence ?

Thanks!
Colin

Possible problem with the update of the tags in the FORMAT field

Hi Castel,

I raised this issue before, but I think I had lots of other issues in that thread and it might have been buried. The problem is with how the tags and their values get updated in the FORMAT field when you run phASER on one sample first and then again on that output VCF to phase another sample.

Say I have a VCF file with several samples: sample1, sample2, sample3
And these are the tags in the FORMAT field: GT:AD:DP:GQ:PL
I run phaser on sample1, which updates the FORMAT field to GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW

Since I now have to run phASER on another sample, I take the output from the previous run, and the FORMAT field is updated to GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW:PB:PC:PG:PI:PW, leading to two tags and two values (in some cases) for a sample.

I am still using the old phASER due to issues with Cython, but wanted to check whether this fix has been made in updated phASER versions (even if it is not a priority).

Thanks,
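
As a workaround on the user side, repeated tags can be collapsed after each run. This sketch keeps each tag once, in order of first appearance, and takes the value from the last occurrence. Keeping the last value is an assumption and would discard the earlier run's values, so it only suits cases where the later value is the one wanted. The function name is hypothetical:

```python
def dedupe_format(format_field, sample_fields):
    """Collapse repeated FORMAT tags for one VCF record.

    Returns the new FORMAT string and the rewritten sample columns.
    """
    tags = format_field.split(":")
    last_index = {}
    for i, tag in enumerate(tags):
        last_index[tag] = i  # position of first occurrence is kept, index is updated
    new_samples = []
    for sample in sample_fields:
        values = sample.split(":")
        # Trailing fields may be omitted in VCF; fill those with the missing value.
        picked = [values[i] if i < len(values) else "." for i in last_index.values()]
        new_samples.append(":".join(picked))
    return ":".join(last_index), new_samples
```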

Can phaser phase indels?

Hi,
Does phaser support phasing indels?
I find that ReadBackedPhasing can only phase SNVs, but I want to phase both indels and SNVs.
My data is human DNA array sequencing data.

ignoring variants without PASS flag

Hi,
Thanks for the work on this!

It seems that phaser.py ignores variants in the --vcf file that do not have the PASS flag. This is somewhat restrictive, since a VCF may have no FILTER values at all or may use other flags. I think it is worth stating in the documentation that only variants with the PASS flag are considered.

Cheers
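
To see how many variants this affects in a given file, the FILTER column can be tallied before running phaser. This is a small sketch for uncompressed, tab-separated VCF text; the function name is illustrative. Other commands in this thread pass `--pass_only 0`, which presumably relaxes the filter, but its exact behaviour is not described here, so treat that as an assumption:

```python
from collections import Counter


def filter_counts(vcf_lines):
    """Count the values of the FILTER column (7th field) over all records."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        counts[line.rstrip("\n").split("\t")[6]] += 1
    return counts
```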

Issues with VCF merge after running phaser

Hi @secastel

I may have raised this issue before, but the problem with VCF merging still exists. I am not sure whether VCFs from other callers run into this issue, but with VCFs generated by GATK (HaplotypeCaller) I can run phaser. However, the single-sample VCFs that are now generated cannot be merged back.

This is the error message if it helps:

$ java -jar -Xmx16g /home/everestial007/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T CombineVariants -R lyrata_genome.fa -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -o F1.phased_variants.Final.vcf

INFO  19:25:42,458 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  19:25:42,460 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 
INFO  19:25:42,461 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute 
INFO  19:25:42,461 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk 
INFO  19:25:42,461 HelpFormatter - [Thu Jan 04 19:25:42 EST 2018] Executing on Linux 4.10.0-42-generic amd64 
INFO  19:25:42,461 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 
INFO  19:25:42,465 HelpFormatter - Program Args: -T CombineVariants -R lyrata_genome.fa -V ms01e_phased.vcf -V ms02g_phased.vcf -V ms03g_phased.vcf -V ms04h_phased.vcf -o F1.phased_variants.Final.vcf 
INFO  19:25:42,469 HelpFormatter - Executing as everestial007@everestial007-Inspiron-3647 on Linux 4.10.0-42-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12. 
INFO  19:25:42,469 HelpFormatter - Date/Time: 2018/01/04 19:25:42 
INFO  19:25:42,469 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  19:25:42,469 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  19:25:42,492 GenomeAnalysisEngine - Strictness is SILENT 
INFO  19:25:42,741 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  19:25:43,091 GenomeAnalysisEngine - Preparing for traversal 
INFO  19:25:43,094 GenomeAnalysisEngine - Done preparing for traversal 
INFO  19:25:43,094 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  19:25:43,095 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  19:25:43,095 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
##### ERROR --
##### ERROR stack trace 
java.lang.NumberFormatException: For input string: ""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:592)
    at java.lang.Integer.valueOf(Integer.java:766)
    at htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:717)
    at htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:129)
    at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:158)
    at htsjdk.variant.variantcontext.LazyGenotypesContext.getGenotypes(LazyGenotypesContext.java:148)
    at htsjdk.variant.variantcontext.GenotypesContext.iterator(GenotypesContext.java:465)
    at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.mergeGenotypes(GATKVariantContextUtils.java:1556)
    at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.simpleMerge(GATKVariantContextUtils.java:1224)
    at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:361)
    at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:143)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: For input string: ""
##### ERROR ------------------------------------------------------------------------------------------

The issue at hand is that even if I can merge the data using other tools, I will still have problems because I further have to use GATK to do my downstream analyses. Can you please check the issue if time permits?

Here are my files.

ms01e_phased.vcf.gz
ms02g_phased.vcf.gz
ms03g_phased.vcf.gz
ms04h_phased.vcf.gz
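
The stack trace above shows `Integer.valueOf` failing on an empty string inside `AbstractVCFCodec.createGenotypeMap`, which suggests that some sample column holds an empty value where a number is expected. A minimal sketch for locating such records before merging, assuming uncompressed VCF text; `find_empty_genotype_fields` is a hypothetical helper, not part of phASER or GATK:

```python
def find_empty_genotype_fields(vcf_lines):
    """Yield (line_number, sample_index, format_key) for every empty
    per-sample value found in the body lines of a VCF."""
    for line_number, line in enumerate(vcf_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this line
        keys = fields[8].split(":")
        for sample_index, sample in enumerate(fields[9:]):
            for key, value in zip(keys, sample.split(":")):
                # an empty value, or an empty item in a comma list such as "10,"
                if "" in value.split(","):
                    yield line_number, sample_index, key
```

Running it over each of the four input files would show whether any record carries an empty sub-field; whether that is the actual cause of this error is an assumption.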

Phaser limited to maximum two haplotypes per contig?

Do I understand it correctly that phaser.py will at maximum detect two different haplotypes per contig?
This is reasonable for perfect unambiguous mapping in diploids, but a severe limitation if (a) the genome / transcriptome is indeed polyploid (or a pool of individuals), and (b) reads from paralog loci were mapped to the same reference contig. In both cases, >2 haplotypes can be possible from a single sample.

Add the PS or HP tag to the format field, enable pipe in GT field

I am using this tool for my data analyses and finding it quite helpful. However, I have to manipulate the output VCF before I can use it with other software downstream.

I think the following additions would help, and they probably do not require any change to the algorithms.

1) Option to add the HP (haplotype phase) tag to the FORMAT field, like the one produced by GATK ReadBackedPhasing.
https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_phasing_ReadBackedPhasing.php

2) Option to add the PS (phase set) tag to the FORMAT field, as defined in the VCF specification.
http://samtools.github.io/hts-specs/VCFv4.2.pdf
https://samtools.github.io/bcftools/bcftools.html

3) Option to add a pipe (|) to the GT field while updating GT field with PG.

At the moment I have to take the output VCF and curate it with custom scripts, which often changes the formatting of the file. I think this tool would be much more useful if its output were made more compatible with downstream applications.
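
The third request (writing the phased value into GT with a pipe) can be sketched as a post-processing step. This is a minimal illustration, assuming tab-separated VCF body lines with GT and PG in the FORMAT column; `promote_phase_to_gt` is a hypothetical helper and not an existing phASER option:

```python
def promote_phase_to_gt(vcf_line, source_key="PG"):
    """Return the VCF line with each sample's GT replaced by the value
    stored under source_key, but only where that value is phased ('|')."""
    if vcf_line.startswith("#"):
        return vcf_line  # header lines are left untouched
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return vcf_line  # no sample columns
    keys = fields[8].split(":")
    if "GT" not in keys or source_key not in keys:
        return vcf_line
    gt_index = keys.index("GT")
    src_index = keys.index(source_key)
    for i in range(9, len(fields)):
        values = fields[i].split(":")
        if max(gt_index, src_index) < len(values) and "|" in values[src_index]:
            values[gt_index] = values[src_index]
            fields[i] = ":".join(values)
    newline = "\n" if vcf_line.endswith("\n") else ""
    return "\t".join(fields) + newline
```

Unphased records (no pipe in the source value) pass through unchanged, so the file layout is preserved apart from the GT values that were rewritten.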

Thanks,

fails after omitting blacklist arg to allow progress

after omitting blacklist arguments, phaser ran for a long time but then ...
completed chromosome 16...
completed chromosome 17...
completed chromosome 22...
completed chromosome 19...
completed chromosome 6...
processing mapped reads...
Traceback (most recent call last):
File "REDO/phaser/phaser/phaser.py", line 2358, in <module>
main();
File "REDO/phaser/phaser/phaser.py", line 170, in main
parse_sample(sample_name, map_sample_column, args.bam, args.o, contig_ban)
File "REDO/phaser/phaser/phaser.py", line 262, in parse_sample
start_time, vcf_out, sample_out_path, last_chr=True, pi_block_value = 0)
File "REDO/phaser/phaser/phaser.py", line 546, in process_vcf
alignment_scores = map(int,[x for x in subprocess.check_output("set -euo pipefail && "+"cut -f 5 "+" ".join(result_files), shell=True, executable='/bin/bash').split("\n") if x != ""]);
TypeError: a bytes-like object is required, not 'str'
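
This `TypeError` is what Python 3 raises when the `bytes` returned by `subprocess.check_output` are split with a `str` separator. A minimal sketch of a version-tolerant parse, assuming the output is UTF-8 text; `to_text` and `parse_scores` are hypothetical helper names, not functions in phaser.py:

```python
def to_text(output, encoding="utf-8"):
    """Decode subprocess output to text when it arrives as bytes."""
    if isinstance(output, bytes):
        return output.decode(encoding)
    return output


def parse_scores(output):
    """Turn newline-separated integers (bytes or text) into a list of ints."""
    return [int(x) for x in to_text(output).split("\n") if x != ""]
```

It would be used as `parse_scores(subprocess.check_output(...))`. Passing `universal_newlines=True` to `check_output` is another way to get text back instead of bytes.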

No alignment score in BAM file

Hi,

I am running phaser and getting an error that there is no alignment score in the RNA-seq BAM file. Is there a way to run phaser without considering the alignment score?

Many thanks,
Rahul

"|" pipe symbol in contig /chromosome ID causes crash: Solution

Hello,
I use Trinity de novo assembled transcriptome contigs as the reference, and Trinity names its contigs e.g. "TR72377|c0_g1_i1". Unfortunately, "samtools view -h X.bam TR72377|c0_g1_i1: "
will choke on this because the shell interprets the "|" symbol in the contig name as a pipe.
The solution is to wrap the contig ID in single quotes, which alleviates the issue and does not affect contig names that do not contain the pipe.

I suggest to modify phaser.py:

orig. line 1070:
error_code = subprocess.call("samtools view -h "+bam+" "+chrom+": | samtools view -Sh "+samtools_arg+" -L "+bed_out+" -q "+mapq+" - | "+args.python_string+" "+return_script_path()+"/call_read_variant_map.py --baseq "+str(args.baseq)+" --splice 1 --isize_cutoff "+str(isize)+" --variant_table "+mapper_out+" --o "+mapping_result.name, stderr=devnull, stdout=devnull, shell=True);

replace this line by the quoted version:
error_code = subprocess.call("samtools view -h "+bam+" '"+chrom+"': | samtools view -Sh "+samtools_arg+" -L "+bed_out+" -q "+mapq+" - | "+args.python_string+" "+return_script_path()+"/call_read_variant_map.py --baseq "+str(args.baseq)+" --splice 1 --isize_cutoff "+str(isize)+" --variant_table "+mapper_out+" --o "+mapping_result.name, stderr=devnull, stdout=devnull, shell=True);
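
A more general form of the same fix is to let the standard library do the quoting, which also covers spaces and other shell metacharacters. This is only a sketch of the first pipeline stage; `samtools_region_command` is a hypothetical helper, not a function in phaser.py:

```python
try:
    from shlex import quote  # Python 3
except ImportError:
    from pipes import quote  # Python 2.7


def samtools_region_command(bam, chrom):
    """Build 'samtools view -h <bam> <chrom>:' with shell-safe arguments."""
    return "samtools view -h " + quote(bam) + " " + quote(chrom + ":")
```

Names without special characters are left unquoted, so existing contig names behave as before.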

Thanks!

Haplotype phase of the homozygous allele

@secastel

I just wanted to highlight a suggestion that could be incorporated into phaser, if your time allows.
phaser is designed mainly to generate haplotypes only for heterozygous sites. But from a biological standpoint, I think it would be more helpful to include an option to extract the haplotypes even where the genotype is a homozygous variant, provided the read-backed phased reads support it. I think most cases are like that.

An example something like this:

chr       pos        id        ref        alt          GT        PG      Phase_Block_index
2          35          .          A         G            0/1       0|1          50
2          71          .          C         A            1/1       1/1          .
2          85          .          C         G            0/1       1|0          50
2          97          .          G        T,A          1/2       2|1          50
2          103        .          T         C             0/0       0/0          .
2          107        .          C        A,G          0/1       0|1          50

So, for the above example, phaser is designed to generate haplotypes from the heterozygous loci only, which gives A-G-A-C & G-C-T-A. But if a homozygous genotype (1/1 or 0/0) is supported by the read-backed phased BAM file, there should be an option to include it, i.e. haplotypes A-A-G-A-T-C & G-A-C-T-T-A, along with updated phase_block_index and PG values. Haplotypes of this kind are going to be more helpful in population genetics analyses. I think it would make phaser a more helpful tool.
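
The two haplotype pairs in the example can be reproduced with a short sketch. It assumes every row belongs to the same block and, as a simplification, treats every homozygous call as read-supported; `build_haplotypes` is a hypothetical helper, not phASER code:

```python
def build_haplotypes(records, include_homozygous=False):
    """records: iterable of (ref, alt_list, gt, pg) tuples for one block.
    Returns the two haplotypes as dash-joined allele strings."""
    hap_a, hap_b = [], []
    for ref, alts, gt, pg in records:
        alleles = [ref] + list(alts)  # index 0 = REF, 1.. = ALT alleles
        if "|" in pg:
            first, second = pg.split("|")
        elif include_homozygous:
            called = gt.split("/")
            if len(set(called)) != 1 or not called[0].isdigit():
                continue  # unphased heterozygous or missing call
            first = second = called[0]
        else:
            continue
        hap_a.append(alleles[int(first)])
        hap_b.append(alleles[int(second)])
    return "-".join(hap_a), "-".join(hap_b)


# The rows of the example table: (ref, alts, GT, PG)
records = [
    ("A", ["G"], "0/1", "0|1"),
    ("C", ["A"], "1/1", "1/1"),
    ("C", ["G"], "0/1", "1|0"),
    ("G", ["T", "A"], "1/2", "2|1"),
    ("T", ["C"], "0/0", "0/0"),
    ("C", ["A", "G"], "0/1", "0|1"),
]
print(build_haplotypes(records))                           # heterozygous sites only
print(build_haplotypes(records, include_homozygous=True))  # homozygous sites included
```

A real implementation would also need to check that reads actually link each homozygous site to the block, which this sketch does not attempt.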

Memory issue on phaser

I have noticed that phaser seems to use a lot of RAM. Of the several samples I tried, there were two where phaser did not finish even after 2 days and several attempts. I set the number of threads to 1, but still no luck.

I was using a desktop with 16 GB memory and 20 GB swap, and all of it was taken up.
I am working with the A. lyrata genome, with VCF and BAM files that are relatively small compared to the human genome, so with the human genome the problem should be even bigger.

Is there a way to fix memory handling in phaser?

error running phaser, sorry...

Hi Stephane,
This tool is great and very well explained for a non-expert such as myself, many thanks indeed!!!
I have a problem which I hope you can help me with. Running your example data set, I tried the following command after having succeeded with the first one

$python /scratch/ev250/bin/phaser/phaser_gene_ae/phaser_gene_ae.py --haplotypic_counts phaser_test_case.haplotypic_counts.txt --features hg19_ensembl.bed --o phaser_test_case_gene_ae.txt --no_gw_phase 0

However, I get the following error

##################################################
Welcome to phASER Gene AE v1.1
Author: Stephane Castel ([email protected])
##################################################

#1 Loading features...
#2 Loading haplotype counts...
Traceback (most recent call last):
File "/scratch/ev250/bin/phaser/phaser_gene_ae/phaser_gene_ae.py", line 127, in <module>
main();
File "/scratch/ev250/bin/phaser/phaser_gene_ae/phaser_gene_ae.py", line 65, in main
df_haplo_counts = pandas.DataFrame.from_csv(args.haplotypic_counts, sep="\t", index_col=False);
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/core/frame.py", line 1231, in from_csv
infer_datetime_format=infer_datetime_format)
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/io/parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/io/parsers.py", line 389, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/io/parsers.py", line 730, in __init__
self._make_engine(self.engine)
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/io/parsers.py", line 923, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/io/parsers.py", line 1436, in __init__
self._set_noconvert_columns()
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/io/parsers.py", line 1501, in _set_noconvert_columns
_set(self.index_col)
File "/scratch/ev250/bin/PYTHON/lib/python2.7/site-packages/pandas/io/parsers.py", line 1476, in _set
x = names.index(x)
ValueError: False is not in list

which I cannot figure out, could you please help me to troubleshoot this? many thanks in advance

Elena

OSError: [Errno 12] Cannot allocate memory

Hi, I've been using phASER the past couple of days and have been super happy with the results, but after trying to incorporate DNAseq reads in addition to RNAseq reads into our haplotype calling with phASER we are constantly getting out of memory errors in thread 13. phASER works fine with the example data you provided in your tutorial, as well as with just our RNAseq data, but as soon as we use RNAseq reads and DNAseq reads we get out of memory errors.

phASER manages to map reads to variants for both RNAseq and DNAseq reads, but during the processing step it always throws the following error:

processing mapped reads...
using alignment score cutoff of 115
Exception in thread Thread-13:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 764, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 325, in _handle_workers
pool._maintain_pool()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 229, in _maintain_pool
self._repopulate_pool()
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 222, in _repopulate_pool
w.start()
File "/usr/lib64/python2.7/multiprocessing/process.py", line 130, in start
self._popen = Popen(self)
File "/usr/lib64/python2.7/multiprocessing/forking.py", line 121, in __init__
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

At first I thought it was just my local computer not having enough memory (only 6GB free), but even after letting it run on our cluster which has around 50 GB memory free, it still threw that error. The command I used to call phASER was:

python /home/ibis/paul.hager/phaser/phaser/phaser.py --vcf /home/ibis/paul.hager/SCN306/Phasing/GATKPhaseByTransmission/SL154375.GATKTransmissionPhased.vcf.gz --bam /home/ibis/paul.hager/SCN306/Phasing/RNAseq/rnanz1_1-2_S13_R1_001_val_1.fq.Aligned.out.sort.bam,/home/ibis/paul.hager/SCN306/IBIS_WGS/SCN306pat/HMNNHCCXX_s1_1_GSLv3-7_49_SL154375.filtered.fastq_mem.sorted.mergedAll.realigned.recal.bam --paired_end 1,1 --mapq 255,60 --baseq 10 --sample SL154375 --blacklist /home/ibis/paul.hager/phaser/testData/hg19_hla.bed --haplo_count_blacklist /home/ibis/paul.hager/phaser/testData/hg19_haplo_count_blacklist.bed --threads 4 --o /home/ibis/paul.hager/SCN306/Phasing/Phaser/rnanz1_2/phaserSL154375_trioPhase_r12 --id_separator - --unique_ids 1

Do you have any idea what might be causing this error?

Thanks for your help!

Advice on phasing protocol with RNA-seq only

Dear Stephane,

I am working on an RNA-seq project with multiple chicken populations and multiple tissues per population, for a total of around 700 samples.
We want to do ASE analysis per population and tissue combination (POP_tissue).

I called variants at the POP_tissue level on my RNA-seq data, but I can easily call variants at the POP level.
For now I run the phaser.py and phaser_gene_ae.py commands for each sample individually (if a sample was sequenced multiple times, I merged the BAMs, so one sample = one BAM), and then I want to launch phaser_expr_matrix.py.

My understanding is that this is only possible if my input VCF file is phased, but isn't phasing partially the goal of phaser?
What choices do I have for using phaser in this context?
Would it be better to:

run phaser once with all the tissues corresponding to the same individuals (so on one VCF called at the POP level)? Will this VCF file contain all my tissue samples?
and use this (tissue?) phased VCF file to generate the allelic_count, then run gene_ae.py for each sample individually, and finally pop_expr_matrix on all samples from the same tissue?

I am not sure I understand how everything interacts, or what the impact is of running it on an unphased input VCF file.

Kind regards

Maria

Summed DNA and RNA counts in allelic_counts output file?

Dear Stephane,

I used the phASER software using two input BAM files, one containing RNA-seq data, the other containing Whole Genome Sequencing DNA data. I used the --haplo_count_bam_exclude option to exclude the WGS BAM data from my output data.

The numbers of reads overlapping, for example, chr1 position 100547994 in my two BAM files are:

#WGS BAM
$ samtools mpileup -r 1:100547994-100547994 DNA/AC1JV9ACXX-2-20.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
1	100547994	N	17	cctcTcTTCcTctCtTT	;;D;E<EDC;D9DCDEE

#RNA BAM
samtools mpileup -r 1:100547994-100547994 RNA/AC1JV9ACXX-2-20.mdup.sorted.readGroupsAdded.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
1	100547994	N	23	>t$CTCtCtTTcCtCcccccctt^~t	D@IIJCGADGDGJHI@FJJGIIH

I noticed that the *.allelic_counts file contains the summed counts from both RNA and WGS data:

grep 1_100547994_T_C allelic_counts/AC1JV9ACXX-2-20.chr1.allelic_counts.txt 
1	100547994	1_100547994_T_C	T	C	19	20	39

While in the *.haplotypic_counts data just 22 reads for this same SNP are used:

grep 1_100547994_T_C haplotypic_counts/AC1JV9ACXX-2-20.chr1.haplotypic_counts.txt 
1	100547994	100547994	1_100547994_T_C	1		0	T	C	10	12	22	0|1	1	0.394	AC1JV9ACXX-2-20.mdup.sorted.readGroupsAdded

Is this intended behaviour for *.allelic_counts output when using a BAM file containing DNA data and the --haplo_count_bam_exclude option?

Regards,

Freerk

Allele specific expression on multiple samples

Hello,

I ran phASER for multiple samples and extracted allele specific expression for each sample independently.
Since these samples are split into case/control, I would like to process these expression data to see if there is anything interesting.
My question is if the allele specific expression data from multiple samples are directly compatible or not.
Can I compare aCount and bCount across multiple samples straight away or is there a risk that what is measured on aCount for sample X, ends up in bCount for sample Y ?

Do you have a suggested protocol for this task?

Thanks a lot

Mattia

phASER based replacement for allelecounter?

It's obviously an enhancement, but is it possible to develop a read-backed allele counter--without the phasing--with phASER code?

Improve phased GT in F1 hybrids using partially phased data (PG) generated using pHASER

Hi Secastel,

This isn't an issue, but an update. I recently finished writing a python3 script to improve the phased GT of the F1 hybrid. I wanted to connect with people who might be having similar problems and let them know about this tool. This script and details aren't completely clean yet, but should be able to run given the python3 and required modules are available.

https://github.com/everestial/pHASE-Stitcher

Thanks,

There is a problem connecting setup.py with cython module in the updated version of phaser.r.

@secastel

The --output_orphans option isn't working. I looked into the Python file and that option isn't there. I am not using it at the moment, but I am flagging it because it might be useful to others.

Also, with the updated version of phaser you have included 'setup.py'. I have my cython module installed but I am getting the following error:
Traceback (most recent call last):
File "setup.py", line 2, in <module>
from Cython.Build import cythonize
ImportError: No module named 'Cython'

The old version didn't need that.

How to use phaser without vcf file

Dear Stephane,
I noticed that your tool phaser can generate ASE data without a VCF file, but I don't know how to do it. Can you help me? Thank you very much!

--chr option, crashing and a possible bug

Hi,

I would really recommend the --chr option for RNA-seq datasets from plant genomes - this one is ca. 750MB so relatively small.

On 512gb and 2TB RAM servers I was never getting past step 5 for whole genomes
"5. Phasing blocks...
phasing large (>15 variants) blocks...
"

It would just be killed by the server after 12-24 hrs, apparently because of spiralling memory usage.

If I pass the --chr option most chromosomes are done after 2-6 hours.

There may be a bug with --chr by the way. It seems I get a VCF for chromosome5 called eg "test21_chr5.vcf", but it contains SNPs from all chromosomes eg 1-9, eg, not just phased SNPs for chr5.

Is this a known bug ?

Also - can I just combine the *haplotypic_counts.txt files from each chromosome and use them as input for the next step "phaser_gene_ae.py" ? I hope this will work.
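
Combining the per-chromosome files could look like the sketch below. It assumes each *haplotypic_counts.txt file starts with the same header line, and whether phaser_gene_ae.py accepts the merged file is not confirmed here; `concatenate_with_single_header` is a hypothetical helper:

```python
def concatenate_with_single_header(input_paths, output_path):
    """Concatenate tab-separated files that share a header line, writing
    the header once. Returns the number of data lines written."""
    header = None
    written = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as handle:
                first = handle.readline().rstrip("\n")
                if first == "":
                    continue  # empty file, nothing to copy
                if header is None:
                    header = first
                    out.write(header + "\n")
                elif first != header:
                    raise ValueError("header mismatch in " + path)
                for line in handle:
                    line = line.rstrip("\n")
                    if line:
                        out.write(line + "\n")
                        written += 1
    return written
```

The header check stops the merge early if one of the files was produced with a different column layout.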

Thanks,
Colin

Typo

in phaser.py

fatal_error("Allele frequency VCF (--gw_af_vcf) specified does not exit.");

Should likely be

fatal_error("Allele frequency VCF (--gw_af_vcf) specified does not exist.");

Cheers!
Colin

cannot run phASER to my aligned data

Hi!
I aligned HG00096 from 1000GP with STAR and Tophat separately. And then I ran phASER using the same parameter you gave in the tutorial.

The error message for running phASER with the BAM aligned with STAR is:

#2. Retrieving reads that overlap heterozygous sites...
     file: /data/reddylab/scarlett/1000G/data/StarOutput/HG00096/Aligned.sortedByCoord.out.bam
          minimum mapq: 255
          mapping reads to variants...
[bam_parse_region] fail to determine the sequence name.
[main_samview] region "1:" specifies an unknown reference name. Continue anyway.
[samopen] SAM header is present: 25 sequences.
[sam_read1] reference 'user command line: STAR --runMode alignReads --genomeDir /data/reddylab/scarlett/1000G/data/STARIndex --runThreadN 24 --readFilesCommand zcat --readFilesIn /gpfs/fs1/data/reddylab/scarlett/1000G/data/fastq/HG00096/ERR188040_R1.fastq.gz /gpfs/fs1/data/reddylab/scarlett/1000G/data/fastq/HG00096/ERR188040_R2.fastq.gz --outSAMtype BAM Unsorted SortedByCoordinate --outReadsUnmapped Fastx --outFileNamePrefix /data/reddylab/scarlett/1000G/data/StarOutput/HG00096/
AMtype BAM   Unsorted   SortedByCoordinate
' is recognized as '*'.
[main_samview] truncated file.
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/nfs/software/helmod/apps/Core/Anaconda/2.5.0-fasrc01/x/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/nfs/software/helmod/apps/Core/Anaconda/2.5.0-fasrc01/x/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/nfs/software/helmod/apps/Core/Anaconda/2.5.0-fasrc01/x/lib/python2.7/multiprocessing/pool.py", line 389, in _handle_results
    task = get()
TypeError: ('__init__() takes at least 3 arguments (1 given)', <class 'subprocess.CalledProcessError'>, ())

The problem with running phASER on the BAM aligned with Tophat is that it gets stuck at the first step forever ...

STARTED "Read backed phasing and ASE/haplotype analyses" ...
    DATE, TIME : 2019-11-01, 10:44:04
#1. Loading heterozygous variants into intervals...
Processing sample named HG00096
    using all the chromosomes ...
    processing VCF...

Does phASER work with Tophat-aligned BAM files? And what parameters do you specify for STAR alignment? I wonder whether that is what is causing these errors.
Could you help me with this?
I'd appreciate your help!
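
The samtools message above says that the region "1:" is an unknown reference name, so the BAM header apparently has no contig called "1". One thing worth checking is whether the VCF and the BAM use different naming schemes (for example "1" versus "chr1"); that this is the cause here is an assumption. A minimal sketch of the comparison, with `unmatched_contigs` as a hypothetical helper that takes the two name lists however they were obtained:

```python
def unmatched_contigs(vcf_contigs, bam_contigs):
    """Return (vcf_name, likely_bam_name_or_None) for every VCF contig
    that is missing from the BAM header."""
    bam_set = set(bam_contigs)
    report = []
    for name in vcf_contigs:
        if name in bam_set:
            continue
        # try the same name with the 'chr' prefix toggled
        alternative = name[3:] if name.startswith("chr") else "chr" + name
        report.append((name, alternative if alternative in bam_set else None))
    return report
```

An empty result means the names agree, and the problem lies elsewhere.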

Thanks,
Scarlett

crash in subprocess.py: OSError: [Errno 7] Argument list too long

Hello,
I am trying to phase RNA-seq PE 150 mapped to its own denovo-reference. However, after running for a few hours and displaying many "completed chromosome X..." messages, it breaks with the following message:

           .
           .
           completed chromosome TR43716|c0_g2_i1...
      processing mapped reads...

Traceback (most recent call last):
File "/cluster/project/gdc/people/schamath/tools/phaser.py", line 2004, in <module>
main();
File "/cluster/project/gdc/people/schamath/tools/phaser.py", line 334, in main
alignment_scores = map(int,[x for x in subprocess.check_output("cut -f 5 "+" ".join(result_files), shell=True).split("\n") if x != ""]);
File "/cluster/apps/python/2.7.6/x86_64/lib64/python2.7/subprocess.py", line 566, in check_output
process = Popen(stdout=PIPE, *popenargs, **kwargs)
File "/cluster/apps/python/2.7.6/x86_64/lib64/python2.7/subprocess.py", line 709, in __init__
errread, errwrite)
File "/cluster/apps/python/2.7.6/x86_64/lib64/python2.7/subprocess.py", line 1326, in _execute_child
raise child_exception
OSError: [Errno 7] Argument list too long

What can I do to solve this problem?
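
The traceback shows every result file name being joined into a single `cut -f 5 ...` shell command, which exceeds the operating system's argument length limit when there are very many contigs. One way around it, sketched here rather than taken from phaser.py, is to read the column in Python so that no file names pass through a command line; `read_alignment_scores` is a hypothetical helper:

```python
def read_alignment_scores(result_files, column=5):
    """Collect one integer column (1-based) from many tab-separated
    files without building a shell command."""
    scores = []
    for path in result_files:
        with open(path) as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= column and fields[column - 1] != "":
                    scores.append(int(fields[column - 1]))
    return scores
```

This trades one external `cut` call for a plain loop, so the number of result files no longer matters.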

example fails on blacklist intervals

trying out phaser at the NHGRI Computational Medicine hackathon

is the example at https://stephanecastel.wordpress.com/2017/02/15/how-to-generate-ase-data-with-phaser/ known to work with current code?

##################################################
Welcome to phASER v1.1.1
Author: Stephane Castel ([email protected])
Updated by: Bishwa K. Giri ([email protected])
##################################################

Completed the check of dependencies and input files availability...

STARTED "Read backed phasing and ASE/haplotype analyses" ...
DATE, TIME : 2019-06-10, 18:17:09
#1. Loading heterozygous variants into intervals...
Processing sample named NA06986
using all the chromosomes ...
removing blacklisted variants and processing VCF...
#1b. Loading haplotypic count blacklist intervals...
Traceback (most recent call last):
File "REDO/phaser/phaser/phaser.py", line 2358, in <module>
main();
File "REDO/phaser/phaser/phaser.py", line 170, in main
parse_sample(sample_name, map_sample_column, args.bam, args.o, contig_ban)
File "REDO/phaser/phaser/phaser.py", line 235, in parse_sample
for line in raw_interval.split("\n"):
TypeError: a bytes-like object is required, not 'str'

Several possible improvements recommendation for pHASER.

Hi @secastel 👍 pHASER has been quite helpful for me. However, I have encountered several issues that would need attention to make this tool better.

1) The vcf file output by pHASER isn't quite compatible with the GATK tool. I am not able to figure out what actually is the problem but I suspect there is some problem with the structure of the output *.vcf when written by pHASER. I can give you more details if need be.

2) GW (genome wide) phasing is quite a useful thing, but I think it would help to have the choice of updating the GT field with PG instead. In some instances we would want to get the haplotype states of a particular block.
Note: I am actually working with F1 hybrid data, and in that situation GW phasing with pHASER led to switch errors (verified by comparing several BAM and VCF files). The solution to the haplotyping problem (especially in F1 hybrids) is to get haplotypes that are as long as possible, each represented by a unique PI. So, each PI has two haplotypes; now we can test statistically whether Haplotype-A vs. Haplotype-B belong to Population_X vs. Population_Y using an odds ratio or a Markov model. I have actually just finished writing a Python script to do this using the odds ratio. Currently I am writing another part that uses a Markov model, but being a biologist, it's taking me some time. In a nutshell, my program uses the PB and PI generated by your program to stitch the haplotypes.
So, if you could add an option to update the GT by PB values it would be helpful.

3) Another addon: GATK generates the haplotypes in the given format
GT:AD:DP:GQ:PGT:PID:PL 0/1:80,25:105:99:0|1:5398_A_G:780,0,3463
GT:AD:DP:GQ:PGT:PID:PL 0/1:14,4:18:83:1|0:47883_G_A:83,0,472

I transferred the phase state from PGT to GT using awk and supplied that VCF as phased input. The output file extended the haplotypes quite significantly for several blocks. I think this capability would be helpful.

4) Phasing of Indels: Also, the phasing of InDels can be improved by phasing the SNPs first, then transferring the phase state in PB to GT (which I did using awk again). The phased state of the InDels was quite improved. I think this can be incorporated. But, the issue that came up was the following:
The FORMAT field would look like following: GT:AD:DP:PL:PG:PB:PI:PW:PC:PG:PB:PI:PW:PC
So, when the phase output from pHASER is supplied as phased input for phasing of InDels, the fields PG:PB:PI:PW:PC get appended again, along with their values in the SAMPLE column. This duplication could probably be removed.

5) This is the issue I previously reported: a choice to output the phase of homozygous alternate alleles (1|1, 2|2, 3|3) and the haplotype if they are connected with a heterozygous allele. I actually tried to write something within pHASER, but couldn't keep track of the several function definitions.
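
The duplicated FORMAT keys described in point 4 can be collapsed after the second run with a small post-processing step. This is a minimal sketch, assuming the later copy of each key is the one to keep (pass `keep="first"` for the opposite); `drop_duplicate_format_keys` is a hypothetical helper, not part of pHASER:

```python
from collections import OrderedDict


def drop_duplicate_format_keys(format_field, sample_field, keep="last"):
    """Collapse repeated FORMAT keys, keeping either the first or the
    last value seen for each key. Key order follows first appearance."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # trailing sample values may be omitted; pad them with "."
    values += ["."] * (len(keys) - len(values))
    merged = OrderedDict()
    for key, value in zip(keys, values):
        if key in merged and keep == "first":
            continue
        merged[key] = value
    return ":".join(merged.keys()), ":".join(merged.values())
```

Applied to columns 9 and 10 of each record, this would turn `GT:AD:DP:PL:PG:PB:PI:PW:PC:PG:PB:PI:PW:PC` back into a single set of phASER fields.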

Hope these comments don't bother you too much.

And, finally "Happy New Year 2017 ! " :) 👍

tabix error due to pandas behavior in python 2.7

When I run phaser_expr_matrix.py with the suggested python 2.7, I get this error for every line of the BED file:

[E::get_intv] Failed to parse TBX_GENERIC, was wrong -p [type] used?
The offending line was: "10	ENSRNOG00000033508	57185346	57238531	1|0	0|0 [...]

The column order is wrong because the pandas DataFrame is initialized with a dictionary. Pandas documentation says "column order follows insertion-order for Python 3.6 and later." When I run phaser_expr_matrix.py with python3, column order is correct and it runs fine.
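
On Python 2.7 a plain `dict` does not preserve insertion order, which is why the columns come out shuffled. A minimal sketch of an order-preserving alternative, using only the standard library; the column names here are illustrative and not taken from phaser_expr_matrix.py:

```python
from collections import OrderedDict


def ordered_columns(pairs):
    """Keep (column_name, values) pairs in the order they are given,
    on Python 2.7 as well as Python 3."""
    return OrderedDict(pairs)


columns = ordered_columns([
    ("contig", ["10"]),
    ("start", [57185346]),
    ("stop", [57238531]),
    ("name", ["ENSRNOG00000033508"]),
])
print(list(columns.keys()))
```

Passing an explicit `columns=[...]` list to the `pandas.DataFrame` constructor is another way to fix the order regardless of Python version.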

check_dependency("bedtools") == False

unsupported operand type(s) for +=: 'int' and 'str'

Hi Stephane,

trying 0.9.2 for the first time today I came across this new bug

Best wishes,
Colin

#3. Identifying connected variants...
calculating sequencing noise level...
sequencing noise level estimated at 0.003902
creating read sets...
generating read connectivity map...
testing variant connections versus noise...
25551 variant connections dropped because of conflicting configurations (threshold = 0.010000)
104943 variants covered by at least 1 read
#4. Identifying haplotype blocks...
#5. Phasing blocks...
#6. Outputting haplotypes...
Traceback (most recent call last):
File "/home/bioinformatics/NAS01/programs/phaser/phaser/phaser/phaser.py", line 2009, in <module>
main();
File "/home/bioinformatics/NAS01/programs/phaser/phaser/phaser/phaser.py", line 779, in main
phase_support[0] += maf;
TypeError: unsupported operand type(s) for +=: 'int' and 'str'

subprocess.CalledProcessError

I got the following error while using the repo phaser.

##################################################
              Welcome to phASER v1.1.1
  Author: Stephane Castel ([email protected])
  Updated by: Bishwa K. Giri ([email protected])
##################################################

Completed the check of dependencies and input files availability... 

STARTED "Read backed phasing and ASE/haplotype analyses" ... 
    DATE, TIME : 2019-04-24, 17:21:25
#1. Loading heterozygous variants into intervals...
Processing sample named 10780_wgs_FromBam
    using all the chromosomes ...
    processing VCF...

    Memory efficient mode is deactivated...
    If RAM is limited, activate memory efficient mode using the flag "--process_slow = 1"...

     creating variant mapping table...
          2441858 heterozygous sites being used for phasing (0 filtered, 0 indels excluded, 2268598 unphased)

#2. Retrieving reads that overlap heterozygous sites...
     file: /imppc/labs/lplab/share/marc/epimutations/processed/bam/hg38/atacBam/10780_ATAC.bam
          minimum mapq: 255
          mapping reads to variants...
Traceback (most recent call last):
  File "/imppc/labs/lplab/share/marc/repos/phaser/phaser/call_read_variant_map.py", line 2, in <module>
    import read_variant_map;
ImportError: /imppc/labs/lplab/share/marc/repos/phaser/phaser/read_variant_map.so: undefined symbol: PyUnicodeUCS2_DecodeUTF8
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1
Traceback (most recent call last):
  File "/imppc/labs/lplab/share/marc/repos/phaser/phaser/phaser.py", line 2358, in <module>
    main();
  File "/imppc/labs/lplab/share/marc/repos/phaser/phaser/phaser.py", line 170, in main
    parse_sample(sample_name, map_sample_column, args.bam, args.o, contig_ban)
  File "/imppc/labs/lplab/share/marc/repos/phaser/phaser/phaser.py", line 262, in parse_sample
    start_time, vcf_out, sample_out_path, last_chr=True, pi_block_value = 0)
  File "/imppc/labs/lplab/share/marc/repos/phaser/phaser/phaser.py", line 533, in process_vcf
    result_files = parallelize(call_mapping_script, pool_input);
  File "/imppc/labs/lplab/share/marc/repos/phaser/phaser/phaser.py", line 2089, in parallelize
    pool_output.append(function(input));
  File "/imppc/labs/lplab/share/marc/repos/phaser/phaser/phaser.py", line 1346, in call_mapping_script
    error_code = subprocess.check_call("set -euo pipefail && "+run_cmd, stdout=devnull, shell=True, executable='/bin/bash')
  File "/soft/general/python/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail && samtools view -h /imppc/labs/lplab/share/marc/epimutations/processed/bam/hg38/atacBam/10780_ATAC.bam 'chr1': | samtools view -Sh -F 0x400  -L /tmp/tmpnWNPVv -q 255 - | python2.7 /imppc/labs/lplab/share/marc/repos/phaser/phaser/call_read_variant_map.py --baseq 10 --splice 1 --isize_cutoff 0.0 --variant_table /tmp/tmpy1yJfS --o /tmp/tmppUwFEs' returned non-zero exit status 1

Command:

python /imppc/labs/lplab/share/marc/repos/phaser/phaser/phaser.py --vcf $vcf --bam $bam --mapq 255 --baseq 10 --sample $(bcftools query -l $vcf) --o prove --paired_end 0 --id_separator "@"

I used --id_separator "@" because without it I got this error:

FATAL ERROR: Character '_' must not be present in contig name. Please change id separtor using --id_separator to a character not found in the contig names and try again.

Any suggestions?
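
A note on reading the log above: the first failure is the ImportError, not the CalledProcessError. The "Broken pipe" messages from samtools and the non-zero exit status look like downstream symptoms of call_read_variant_map.py dying at import. "undefined symbol: PyUnicodeUCS2_DecodeUTF8" usually means read_variant_map.so was compiled against a narrow-Unicode (UCS2) Python 2.7 build, while the python2.7 that the pipeline launches is a wide (UCS4) build. A minimal sketch for checking which kind of build an interpreter is; the helper name unicode_build is made up for illustration and is not part of phASER:

```python
import sys


def unicode_build(maxunicode=None):
    """Classify an interpreter as a narrow (UCS2) or wide (UCS4) build.

    Hypothetical helper for illustration only; not part of phASER.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # Narrow Python 2 builds report 0xFFFF; wide builds report 0x10FFFF.
    return "UCS2" if maxunicode == 0xFFFF else "UCS4"


if __name__ == "__main__":
    # Show which interpreter is running and how it was built.
    print(sys.executable)
    print(unicode_build())
```

Run it with the same python2.7 that appears in the failing command. If it reports UCS4, the compiled module probably needs to be rebuilt with that interpreter, following the repo's install instructions, so that both sides agree.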


