sbslee / pypgx

A Python package for pharmacogenomics (PGx) research

Home Page: https://pypgx.readthedocs.io

License: MIT License

api cli pgx pharmacogenomics pharmacogenetics star-alleles structural-variation cyp2d6 genotype phenotype


Introduction

The main purpose of the PyPGx package is to provide a unified platform for pharmacogenomics (PGx) research. PyPGx is and always will be completely free and open source.

The package is written in Python and supports both a command line interface (CLI) and an application programming interface (API), whose documentation is available at Read the Docs.


PyPGx can predict PGx genotypes (e.g. *4/*5) and phenotypes (e.g. Poor Metabolizer) using various genomic data, including data from next-generation sequencing (NGS), single nucleotide polymorphism (SNP) array, and long-read sequencing. Importantly, for NGS data the package can detect structural variation (SV) using a machine learning-based approach. Finally, note that PyPGx is compatible with both of the Genome Reference Consortium Human (GRCh) builds, GRCh37 (hg19) and GRCh38 (hg38).

There are currently 61 pharmacogenes in PyPGx:

ABCB1 ABCG2 CACNA1S CFTR COMT
CYP1A1 CYP1A2 CYP1B1 CYP2A6/CYP2A7 CYP2A13
CYP2B6/CYP2B7 CYP2C8 CYP2C9 CYP2C19 CYP2D6/CYP2D7
CYP2E1 CYP2F1 CYP2J2 CYP2R1 CYP2S1
CYP2W1 CYP3A4 CYP3A5 CYP3A7 CYP3A43
CYP4A11 CYP4A22 CYP4B1 CYP4F2 CYP17A1
CYP19A1 CYP26A1 DPYD F5 G6PD
GSTM1 GSTP1 GSTT1 IFNL3 MTHFR
NAT1 NAT2 NUDT15 POR PTGIS
RYR1 SLC15A2 SLC22A2 SLCO1B1 SLCO1B3
SLCO2B1 SULT1A1 TBXAS1 TPMT UGT1A1
UGT1A4 UGT2B7 UGT2B15 UGT2B17 VKORC1
XPC

Your contributions (e.g. feature ideas, pull requests) are most welcome.

Author: Seung-been "Steven" Lee
License: MIT License

Citation

If you use PyPGx in a published analysis, please report the program version and cite the following article:

In this article, PyPGx was used to call star alleles for genomic DNA reference materials from the Centers for Disease Control and Prevention–based Genetic Testing Reference Materials Coordination Program (GeT-RM), where it showed almost 100% concordance with genotype results from previous works.

The development of PyPGx was heavily inspired by Stargazer, another star-allele calling tool developed by Steven when he was in his PhD program at the University of Washington. Therefore, please also cite the following articles:

Below is an incomplete list of publications which have used PyPGx:

Support PyPGx

If you find my work useful, please consider becoming a sponsor.

Installation

The following packages are required to run PyPGx:

Package        Anaconda  PyPI
fuc            yes       yes
scikit-learn   yes       yes
openjdk        yes       no

There are various ways you can install PyPGx. The recommended way is via conda (Anaconda):

$ conda install -c bioconda pypgx

The above command will automatically download and install all the dependencies as well. Alternatively, you can use pip (PyPI) to install PyPGx and all of its dependencies except openjdk (i.e. the Java JDK must be installed separately):

$ pip install pypgx

Finally, you can clone the GitHub repository and then install PyPGx locally:

$ git clone https://github.com/sbslee/pypgx
$ cd pypgx
$ pip install .

The nice thing about this approach is that you will have access to development versions that are not available in Anaconda or PyPI. For example, you can access a development branch with the git checkout command. When you do this, please make sure your environment already has all the dependencies installed.

Note

Beagle is one of the default software tools used by PyPGx for haplotype phasing SNVs and indels. The program is freely available and published under the GNU General Public License. Users do not need to download Beagle separately because a copy of the software (beagle.22Jul22.46e.jar) is already included in PyPGx.

Warning

You're not done yet! Keep scrolling down to obtain the resource bundle for PyPGx, which is essential for running the package.

Resource bundle

Starting with version 0.12.0, the reference haplotype panel files and structural variant classifier files in PyPGx have been moved to the pypgx-bundle repository (only those files were moved; other files such as allele-table.csv and variant-table.csv are intact). Therefore, you must clone the pypgx-bundle repository with the matching PyPGx version to your home directory in order for PyPGx to correctly access the moved files (i.e. replace x.x.x with the version number of PyPGx you're using, such as 0.18.0):

$ cd ~
$ git clone --branch x.x.x --depth 1 https://github.com/sbslee/pypgx-bundle

This is undoubtedly annoying, but absolutely necessary for portability: PyPGx has been growing rapidly in file size due to the increasing number of supported genes and their variation complexity, to the point where it exceeded the upload size limit for PyPI (100 MB). After removal of those files, the size of PyPGx dropped from over 100 MB to under 1 MB.

Starting with version 0.22.0, you can now specify a custom location for the pypgx-bundle directory instead of using the home directory. This can be achieved by setting the bundle location using the PYPGX_BUNDLE environment variable:

$ export PYPGX_BUNDLE=/path/to/pypgx-bundle
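The bundle lookup described above amounts to an environment-variable override with a home-directory fallback. Below is a minimal sketch of that resolution logic (illustrative only, not PyPGx's actual code; the function name is made up):

```python
import os

def resolve_bundle_path() -> str:
    """Return the pypgx-bundle location: the PYPGX_BUNDLE environment
    variable if set, otherwise ~/pypgx-bundle (the default)."""
    return os.environ.get(
        "PYPGX_BUNDLE", os.path.join(os.path.expanduser("~"), "pypgx-bundle")
    )

# With the variable set, the custom location wins:
os.environ["PYPGX_BUNDLE"] = "/data/pypgx-bundle"
print(resolve_bundle_path())  # -> /data/pypgx-bundle

# Without it, the home directory is used:
del os.environ["PYPGX_BUNDLE"]
print(resolve_bundle_path())  # -> e.g. /home/user/pypgx-bundle
```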

Structural variation detection

Many pharmacogenes are known to have structural variation (SV) such as gene deletions, duplications, and hybrids. You can visit the Genes page to see the list of genes with SV.

Some of the SV events can be quite challenging to detect accurately with NGS data due to misalignment of sequence reads caused by sequence homology with other gene family members (e.g. CYP2D6 and CYP2D7). PyPGx attempts to address this issue by training a support vector machine (SVM)-based multiclass classifier using the one-vs-rest strategy for each gene for each GRCh build. Each classifier is trained using copy number profiles of real NGS samples as well as simulated ones, including those from 1KGP and GeT-RM.
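PyPGx's actual classifiers are trained per gene and per GRCh build on per-base copy number profiles of real and simulated samples. The toy sketch below merely illustrates the one-vs-rest SVM idea with invented three-bin "profiles" and class labels borrowed from the CYP2D6 SV names shown later:

```python
# Toy illustration of one-vs-rest SVM classification of copy number
# profiles. The feature vectors below are invented for this sketch;
# PyPGx trains on full per-base copy number profiles.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Each "profile" is a crude 3-bin copy number summary (invented data).
X = [
    [2.0, 2.0, 2.0],  # Normal: two gene copies throughout
    [1.0, 1.0, 1.0],  # WholeDel1: heterozygous whole-gene deletion
    [3.0, 3.0, 3.0],  # WholeDup1: whole-gene duplication
    [2.1, 1.9, 2.0],
    [0.9, 1.1, 1.0],
    [2.9, 3.1, 3.0],
]
y = ["Normal", "WholeDel1", "WholeDup1"] * 2

clf = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(clf.predict([[1.05, 0.95, 1.0]]))  # -> ['WholeDel1']
```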

You can plot copy number profile and allele fraction profile with PyPGx to visually inspect SV calls. Below are CYP2D6 examples:

SV Name Gene Model Profile
Normal https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-1.png https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-8.png
WholeDel1 https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-2.png https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-1.png
WholeDel1Hom https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-3.png https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-6.png
WholeDup1 https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-4.png https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-2.png
Tandem3 https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-11.png https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-9.png
Tandem2C https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/gene-model-CYP2D6-10.png https://raw.githubusercontent.com/sbslee/pypgx-data/main/dpsv/GRCh37-CYP2D6-7.png

PyPGx was recently applied to the entire high-coverage WGS dataset from 1KGP (N=2,504). Click here to see individual SV calls, and corresponding copy number profiles and allele fraction profiles.

GRCh37 vs. GRCh38

When working with PGx data, it's not uncommon to encounter a situation where you are handling GRCh37 data in one project but GRCh38 in another. You may be tempted to use tools like LiftOver to convert GRCh37 to GRCh38, or vice versa, but deep down you know it's going to be a mess (and please don't do this). The good news is, PyPGx supports both of the builds!

In many PyPGx actions, you can simply indicate which genome build to use. For example, for GRCh38 data you can use --assembly GRCh38 in CLI and assembly='GRCh38' in API. Note that GRCh37 will always be the default. Below is an example of using the API:

>>> import pypgx
>>> pypgx.list_variants('CYP2D6', alleles=['*4'], assembly='GRCh37')
['22-42524947-C-T']
>>> pypgx.list_variants('CYP2D6', alleles=['*4'], assembly='GRCh38')
['22-42128945-C-T']

However, there is one important caveat to consider if your sequencing data is GRCh38. That is, sequence reads must be aligned only to the main contigs (i.e. chr1, chr2, ..., chrX, chrY), and not to the alternative (ALT) contigs such as chr1_KI270762v1_alt. This is because the presence of ALT contigs reduces the sensitivity of variant calling and many other analyses including SV detection. Therefore, if you have sequencing data in GRCh38, make sure it's aligned to the main contigs only.

The only exception to the above rule is the GSTT1 gene, which is located on chr22 for GRCh37 but on chr22_KI270879v1_alt for GRCh38. This gene is known to have an extremely high rate of gene deletion polymorphism in the population and thus requires SV analysis. Therefore, if you are interested in genotyping this gene with GRCh38 data, then you must include that contig when performing read alignment. To this end, you can easily filter your reference FASTA file before read alignment so that it only contains the main contigs plus the ALT contig. If you don't know how to do this, here's one way using the fuc program (which should have already been installed along with PyPGx):

$ cat contigs.list
chr1
chr2
...
chrX
chrY
chr22_KI270879v1_alt
$ fuc fa-filter in.fa --contigs contigs.list > out.fa

Archive file, semantic type, and metadata

In order to efficiently store and transfer data, PyPGx uses the ZIP archive file format (.zip) which supports lossless data compression. Each archive file created by PyPGx has a metadata file (metadata.txt) and a data file (e.g. data.tsv, data.vcf). A metadata file contains important information about the data file within the same archive, which is expressed as pairs of =-separated keys and values (e.g. Assembly=GRCh37):

Metadata Description Examples
Assembly Reference genome assembly. GRCh37, GRCh38
Control Control gene. VDR, chr1:10000-20000
Gene Target gene. CYP2D6, GSTT1
Platform Genotyping platform. WGS, Targeted, Chip, LongRead
Program Name of the phasing program. Beagle, SHAPEIT
Samples Samples used for inter-sample normalization. NA07000,NA10854,NA11993
SemanticType Semantic type of the archive. CovFrame[CopyNumber], Model[CNV]
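Because an archive is just a ZIP file holding metadata.txt and a data file, it can be inspected with the Python standard library alone. The sketch below builds such an archive in memory and parses its =-separated metadata (illustrative only, not PyPGx's own reader):

```python
import io
import zipfile

# Build a PyPGx-style archive in memory: a folder containing
# metadata.txt (key=value pairs) and a data file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("example/metadata.txt",
                "Gene=CYP2D6\nAssembly=GRCh37\nSemanticType=SampleTable[Results]\n")
    zf.writestr("example/data.tsv", "\tGenotype\nNA12006\t*4/*41\n")

def read_metadata(fileobj) -> dict:
    """Parse metadata.txt inside the archive into a dict of key/value pairs."""
    with zipfile.ZipFile(fileobj) as zf:
        name = next(n for n in zf.namelist() if n.endswith("metadata.txt"))
        lines = zf.read(name).decode().splitlines()
    return dict(line.split("=", 1) for line in lines if line)

print(read_metadata(buf))
# {'Gene': 'CYP2D6', 'Assembly': 'GRCh37', 'SemanticType': 'SampleTable[Results]'}
```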

Semantic types

Notably, all archive files have defined semantic types, which allows us to ensure that the data that is passed to a PyPGx command (CLI) or method (API) is meaningful for the operation that will be performed. Below is a list of currently defined semantic types:

  • CovFrame[CopyNumber]
    • CovFrame for storing the target gene's per-base copy number, which is computed from read depth with control statistics.
    • Requires the following metadata: Gene, Assembly, SemanticType, Platform, Control, Samples.
  • CovFrame[DepthOfCoverage]
    • CovFrame for storing read depth for all target genes with SV.
    • Requires the following metadata: Assembly, SemanticType, Platform.
  • CovFrame[ReadDepth]
    • CovFrame for storing read depth for a single target gene.
    • Requires the following metadata: Gene, Assembly, SemanticType, Platform.
  • Model[CNV]
    • Model for calling CNV in the target gene.
    • Requires the following metadata: Gene, Assembly, SemanticType, Control.
  • SampleTable[Alleles]
    • TSV file for storing the target gene's candidate star alleles for each sample.
    • Requires the following metadata: Platform, Gene, Assembly, SemanticType, Program.
  • SampleTable[CNVCalls]
    • TSV file for storing the target gene's CNV call for each sample.
    • Requires the following metadata: Gene, Assembly, SemanticType, Control.
  • SampleTable[Genotypes]
    • TSV file for storing the target gene's genotype call for each sample.
    • Requires the following metadata: Gene, Assembly, SemanticType.
  • SampleTable[Phenotypes]
    • TSV file for storing the target gene's phenotype call for each sample.
    • Requires the following metadata: Gene, SemanticType.
  • SampleTable[Results]
    • TSV file for storing various results for each sample.
    • Requires the following metadata: Gene, Assembly, SemanticType.
  • SampleTable[Statistics]
    • TSV file for storing the control gene's various read depth statistics for each sample. Used for converting the target gene's read depth to copy number.
    • Requires the following metadata: Control, Assembly, SemanticType, Platform.
  • VcfFrame[Consolidated]
    • VcfFrame for storing the target gene's consolidated variant data.
    • Requires the following metadata: Platform, Gene, Assembly, SemanticType, Program.
  • VcfFrame[Imported]
    • VcfFrame for storing the target gene's raw variant data.
    • Requires the following metadata: Platform, Gene, Assembly, SemanticType.
  • VcfFrame[Phased]
    • VcfFrame for storing the target gene's phased variant data.
    • Requires the following metadata: Platform, Gene, Assembly, SemanticType, Program.
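The semantic-type checks above boil down to asking whether an archive's metadata carries the keys its type requires. A simplified sketch (the key sets are copied from the list above; the function itself is illustrative, not PyPGx's implementation):

```python
# Required metadata keys per semantic type (a subset of the list above).
REQUIRED_KEYS = {
    "CovFrame[CopyNumber]": {"Gene", "Assembly", "SemanticType", "Platform",
                             "Control", "Samples"},
    "CovFrame[ReadDepth]": {"Gene", "Assembly", "SemanticType", "Platform"},
    "SampleTable[Results]": {"Gene", "Assembly", "SemanticType"},
}

def check_metadata(metadata: dict) -> list:
    """Return the required keys missing from an archive's metadata, sorted."""
    required = REQUIRED_KEYS[metadata["SemanticType"]]
    return sorted(required - metadata.keys())

meta = {"Gene": "CYP2D6", "Assembly": "GRCh37",
        "SemanticType": "SampleTable[Results]"}
print(check_metadata(meta))  # [] -> nothing missing

meta["SemanticType"] = "CovFrame[ReadDepth]"
print(check_metadata(meta))  # ['Platform']
```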

Working with archive files

To demonstrate how easy it is to work with PyPGx archive files, below are some examples. First, download an archive to play with, which has SampleTable[Results] as its semantic type:

$ wget https://raw.githubusercontent.com/sbslee/pypgx-data/main/getrm-wgs-tutorial/grch37-CYP2D6-results.zip

Let's print its metadata:

$ pypgx print-metadata grch37-CYP2D6-results.zip
Gene=CYP2D6
Assembly=GRCh37
SemanticType=SampleTable[Results]

Now print its main data (but display first sample only):

$ pypgx print-data grch37-CYP2D6-results.zip | head -n 2
    Genotype        Phenotype       Haplotype1      Haplotype2      AlternativePhase        VariantData     CNV
HG00276_PyPGx       *4/*5   Poor Metabolizer        *4;*10;*74;*2;  *10;*74;*2;     ;       *4:22-42524947-C-T:0.913;*10:22-42526694-G-A,22-42523943-A-G:1.0,1.0;*74:22-42525821-G-T:1.0;*2:default;        DeletionHet

We can unzip it to extract files inside (note that tmpcty4c_cr is the original folder name):

$ unzip grch37-CYP2D6-results.zip
Archive:  grch37-CYP2D6-results.zip
  inflating: tmpcty4c_cr/metadata.txt
  inflating: tmpcty4c_cr/data.tsv

We can now directly interact with the files:

$ cat tmpcty4c_cr/metadata.txt
Gene=CYP2D6
Assembly=GRCh37
SemanticType=SampleTable[Results]
$ head -n 2 tmpcty4c_cr/data.tsv
    Genotype        Phenotype       Haplotype1      Haplotype2      AlternativePhase        VariantData     CNV
HG00276_PyPGx       *4/*5   Poor Metabolizer        *4;*10;*74;*2;  *10;*74;*2;     ;       *4:22-42524947-C-T:0.913;*10:22-42526694-G-A,22-42523943-A-G:1.0,1.0;*74:22-42525821-G-T:1.0;*2:default;        DeletionHet

We can easily create a new archive:

$ zip -r grch37-CYP2D6-results-new.zip tmpcty4c_cr
  adding: tmpcty4c_cr/ (stored 0%)
  adding: tmpcty4c_cr/metadata.txt (stored 0%)
  adding: tmpcty4c_cr/data.tsv (deflated 84%)
$ pypgx print-metadata grch37-CYP2D6-results-new.zip
Gene=CYP2D6
Assembly=GRCh37
SemanticType=SampleTable[Results]

Phenotype prediction

Many genes in PyPGx have a genotype-phenotype table available from the Clinical Pharmacogenetics Implementation Consortium (CPIC) or the Pharmacogenomics Knowledgebase (PharmGKB). PyPGx uses these tables to perform phenotype prediction using one of two methods:

  • Method 1. Simple diplotype-phenotype mapping: This method directly uses the diplotype-phenotype mapping as defined by CPIC or PharmGKB. Using the CYP2B6 gene as an example, the diplotypes *6/*6, *1/*29, *1/*2, *1/*4, and *4/*4 correspond to Poor Metabolizer, Intermediate Metabolizer, Normal Metabolizer, Rapid Metabolizer, and Ultrarapid Metabolizer.
  • Method 2. Summation of haplotype activity scores: This method uses a standard unit of enzyme activity known as an activity score. Using the CYP2D6 gene as an example, the fully functional reference *1 allele is assigned a value of 1, decreased-function alleles such as *9 and *17 receive a value of 0.5, and nonfunctional alleles including *4 and *5 have a value of 0. The sum of values assigned to both alleles constitutes the activity score of a diplotype. Consequently, subjects with *1/*1, *1/*4, and *4/*5 diplotypes have an activity score of 2 (Normal Metabolizer), 1 (Intermediate Metabolizer), and 0 (Poor Metabolizer), respectively.
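Method 2 can be sketched in a few lines. The activity values below are taken from the CYP2D6 examples in the text; the score-to-phenotype cutoffs are illustrative simplifications, not CPIC's exact boundaries:

```python
# Haplotype activity values taken from the CYP2D6 examples above.
ACTIVITY = {"*1": 1.0, "*9": 0.5, "*17": 0.5, "*4": 0.0, "*5": 0.0}

def activity_score(allele1: str, allele2: str) -> float:
    """Sum the activity values of the two alleles (Method 2)."""
    return ACTIVITY[allele1] + ACTIVITY[allele2]

def phenotype(score: float) -> str:
    """Map an activity score to a phenotype. Cutoffs are illustrative only."""
    if score == 0:
        return "Poor Metabolizer"
    if score < 1.25:
        return "Intermediate Metabolizer"
    return "Normal Metabolizer"

for diplotype in [("*1", "*1"), ("*1", "*4"), ("*4", "*5")]:
    score = activity_score(*diplotype)
    print(diplotype, score, phenotype(score))
```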

Please visit the Genes page to see the list of genes with a genotype-phenotype table and the prediction method used for each.

To perform phenotype prediction with the API, you can use the pypgx.predict_phenotype method:

>>> import pypgx
>>> pypgx.predict_phenotype('CYP2D6', '*4', '*5')   # Both alleles have no function
'Poor Metabolizer'
>>> pypgx.predict_phenotype('CYP2D6', '*5', '*4')   # The order of alleles does not matter
'Poor Metabolizer'
>>> pypgx.predict_phenotype('CYP2D6', '*1', '*22')  # *22 has uncertain function
'Indeterminate'
>>> pypgx.predict_phenotype('CYP2D6', '*1', '*1x2') # Gene duplication
'Ultrarapid Metabolizer'

To perform phenotype prediction with the CLI, you can use the call-phenotypes command. It takes a SampleTable[Genotypes] file as input and outputs a SampleTable[Phenotypes] file:

$ pypgx call-phenotypes genotypes.zip phenotypes.zip

Pipelines

PyPGx currently provides three pipelines for performing PGx genotype analysis of a single gene for one or multiple samples: the NGS pipeline, the chip pipeline, and the long-read pipeline. In addition to genotyping, each pipeline will perform phenotype prediction based on genotype results. All pipelines are compatible with both GRCh37 and GRCh38 (e.g. for GRCh38 use --assembly GRCh38 in CLI and assembly='GRCh38' in API).

NGS pipeline

https://raw.githubusercontent.com/sbslee/pypgx-data/main/flowchart-ngs-pipeline.png

Implemented as pypgx run-ngs-pipeline in CLI and pypgx.pipeline.run_ngs_pipeline in API, this pipeline is designed for processing short-read data (e.g. Illumina). Users must specify whether the input data is from whole genome sequencing (WGS) or targeted sequencing (custom targeted panel sequencing or whole exome sequencing).

This pipeline supports SV detection based on copy number analysis for genes that are known to have SV. Therefore, if the target gene is associated with SV (e.g. CYP2D6) it's strongly recommended to provide a CovFrame[DepthOfCoverage] file and a SampleTable[Statistics] file in addition to a VCF file containing SNVs/indels. If the target gene is not associated with SV (e.g. CYP3A5) providing a VCF file alone is enough. You can visit the Genes page to see the full list of genes with SV. For details on the SV detection algorithm, please see the Structural variation detection section.

When creating a VCF file (containing SNVs/indels) from BAM files, users have a choice to either use the pypgx create-input-vcf command (strongly recommended) or a variant caller of their choice (e.g. GATK4 HaplotypeCaller). See the Variant caller choice section for detailed discussion on when to use either option.

Check out the GeT-RM WGS tutorial to see this pipeline in action.

Chip pipeline

https://raw.githubusercontent.com/sbslee/pypgx-data/main/flowchart-chip-pipeline.png

Implemented as pypgx run-chip-pipeline in CLI and pypgx.pipeline.run_chip_pipeline in API, this pipeline is designed for DNA chip data (e.g. Global Screening Array from Illumina). It's recommended to perform variant imputation on the input VCF prior to feeding it to the pipeline using a large reference haplotype panel (e.g. TOPMed Imputation Server). Alternatively, it's possible to perform variant imputation with the 1000 Genomes Project (1KGP) data as reference within PyPGx using --impute in CLI and impute=True in API.

The pipeline currently does not support SV detection. Please post a GitHub issue if you want to contribute your development skills and/or data for devising an SV detection algorithm.

Check out the Coriell Affy tutorial to see this pipeline in action.

Long-read pipeline

https://raw.githubusercontent.com/sbslee/pypgx-data/main/flowchart-long-read-pipeline.png

Implemented as pypgx run-long-read-pipeline in CLI and pypgx.pipeline.run_long_read_pipeline in API, this pipeline is designed for long-read data (e.g. Pacific Biosciences and Oxford Nanopore Technologies). The input VCF must be phased using a read-backed haplotype phasing tool such as WhatsHap.

The pipeline currently does not support SV detection. Please post a GitHub issue if you want to contribute your development skills and/or data for devising an SV detection algorithm.

Results interpretation

PyPGx outputs per-sample genotype results in a table, which is stored in an archive file with the semantic type SampleTable[Results]. Below, we will use the CYP2D6 gene with GRCh37 as an example to illustrate how to interpret genotype results from PyPGx.

  Sample Genotype Phenotype Haplotype1 Haplotype2 AlternativePhase VariantData CNV
NA11839 *1/*2 Normal Metabolizer *1; *2; ; *1:22-42522613-G-C,22-42523943-A-G:0.5,0.488;*2:default Normal
NA12006 *4/*41 Intermediate Metabolizer *41;*2; *4;*10;*2; *69; *69:22-42526694-G-A,22-42523805-C-T:0.5,0.551;*4:22-42524947-C-T:0.444;*10:22-42523943-A-G,22-42526694-G-A:0.55,0.5;*41:22-42523805-C-T:0.551;*2:default; Normal
HG00276 *4/*5 Poor Metabolizer *4;*10;*74;*2; *10;*74;*2; ; *4:22-42524947-C-T:0.913;*10:22-42523943-A-G,22-42526694-G-A:1.0,1.0;*74:22-42525821-G-T:1.0;*2:default; WholeDel1
NA19207 *2x2/*10 Normal Metabolizer *10;*2; *2; ; *10:22-42523943-A-G,22-42526694-G-A:0.361,0.25;*2:default; WholeDup1

This list explains each of the columns in the example results.

  • Genotype: Diplotype call. When there is no SV this simply combines the two top-ranked star alleles from Haplotype1 and Haplotype2 with the delimiter '/'. In the presence of SV the final diplotype is determined using one of the genotypers in the pypgx.api.genotype module (e.g. CYP2D6Genotyper).
  • Phenotype: Phenotype call.
  • Haplotype1, Haplotype2: List of candidate star alleles for each haplotype. For example, if a given haplotype contains three variants 22-42523943-A-G, 22-42524947-C-T, and 22-42526694-G-A, then it will get assigned *4;*10; because the haplotype pattern can fit both *4 (22-42524947-C-T) and *10 (22-42523943-A-G and 22-42526694-G-A). Note that *4 comes before *10 because it has higher priority for reporting purposes (see the pypgx.sort_alleles method for detailed implementation).
  • AlternativePhase: List of star alleles that could be missed due to potentially incorrect statistical phasing. For example, let's assume that statistical phasing has put 22-42526694-G-A for Haplotype1 and 22-42523805-C-T for Haplotype2. Even though the two variants are in trans orientation, PyPGx will also consider alternative phase in case the two variants are actually in cis orientation, resulting in *69; as AlternativePhase because *69 is defined by 22-42526694-G-A and 22-42523805-C-T.
  • VariantData: Information for SNVs/indels used to define observed star alleles, including allele fraction which is important for allelic decomposition after identifying CNV (e.g. the sample NA19207). In some situations, there will not be any variants for a given star allele because the allele itself is "default" allele for the selected reference assembly (e.g. GRCh37 has *2 as default while GRCh38 has *1).
  • CNV: Structural variation call. See the Structural variation detection section for more details.
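The VariantData column is itself structured: semicolon-separated alleles, each with colon-separated variant lists and allele fractions. It can therefore be unpacked programmatically, as in this sketch using the HG00276 row above (the parsing logic is illustrative, not PyPGx's own):

```python
def parse_variant_data(field: str) -> dict:
    """Unpack a VariantData field into {allele: [(variant, fraction), ...]}.
    Alleles marked 'default' carry no defining variants for the assembly."""
    result = {}
    for entry in filter(None, field.split(";")):
        parts = entry.split(":")
        allele = parts[0]
        if parts[1] == "default":
            result[allele] = []
            continue
        variants = parts[1].split(",")
        fractions = [float(x) for x in parts[2].split(",")]
        result[allele] = list(zip(variants, fractions))
    return result

field = ("*4:22-42524947-C-T:0.913;"
         "*10:22-42523943-A-G,22-42526694-G-A:1.0,1.0;"
         "*74:22-42525821-G-T:1.0;*2:default;")
parsed = parse_variant_data(field)
print(parsed["*4"])   # [('22-42524947-C-T', 0.913)]
print(parsed["*2"])   # [] (default allele for GRCh37)
```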

Getting help

For detailed documentation on the CLI and API, please refer to Read the Docs.

For getting help on the CLI:

$ pypgx -h

usage: pypgx [-h] [-v] COMMAND ...

positional arguments:
  COMMAND
    call-genotypes      Call genotypes for target gene.
    call-phenotypes     Call phenotypes for target gene.
    combine-results     Combine various results for target gene.
    compare-genotypes   Calculate concordance between two genotype results.
    compute-control-statistics
                        Compute summary statistics for control gene from BAM
                        files.
    compute-copy-number
                        Compute copy number from read depth for target gene.
    compute-target-depth
                        Compute read depth for target gene from BAM files.
    create-consolidated-vcf
                        Create a consolidated VCF file.
    create-input-vcf    Call SNVs/indels from BAM files for all target genes.
    create-regions-bed  Create a BED file which contains all regions used by
                        PyPGx.
    estimate-phase-beagle
                        Estimate haplotype phase of observed variants with
                        the Beagle program.
    filter-samples      Filter Archive file for specified samples.
    import-read-depth   Import read depth data for target gene.
    import-variants     Import SNV/indel data for target gene.
    plot-bam-copy-number
                        Plot copy number profile from CovFrame[CopyNumber].
    plot-bam-read-depth
                        Plot read depth profile with BAM data.
    plot-cn-af          Plot both copy number profile and allele fraction
                        profile in one figure.
    plot-vcf-allele-fraction
                        Plot allele fraction profile with VCF data.
    plot-vcf-read-depth
                        Plot read depth profile with VCF data.
    predict-alleles     Predict candidate star alleles based on observed
                        variants.
    predict-cnv         Predict CNV from copy number data for target gene.
    prepare-depth-of-coverage
                        Prepare a depth of coverage file for all target
                        genes with SV from BAM files.
    print-data          Print the main data of specified archive.
    print-metadata      Print the metadata of specified archive.
    run-chip-pipeline   Run genotyping pipeline for chip data.
    run-long-read-pipeline
                        Run genotyping pipeline for long-read sequencing data.
    run-ngs-pipeline    Run genotyping pipeline for NGS data.
    slice-bam           Slice BAM file for all genes used by PyPGx.
    test-cnv-caller     Test CNV caller for target gene.
    train-cnv-caller    Train CNV caller for target gene.

options:
  -h, --help            Show this help message and exit.
  -v, --version         Show the version number and exit.

For getting help on a specific command (e.g. call-genotypes):

$ pypgx call-genotypes -h

Below is the list of submodules available in the API:

  • core : The core submodule is the main suite of tools for PGx research.
  • genotype : The genotype submodule is primarily used to make final diplotype calls by interpreting candidate star alleles and/or detected structural variants.
  • pipeline : The pipeline submodule is used to provide convenient methods that combine multiple PyPGx actions and automatically handle semantic types.
  • plot : The plot submodule is used to plot various kinds of profiles such as read depth, copy number, and allele fraction.
  • utils : The utils submodule contains main actions of PyPGx.

For getting help on a specific submodule (e.g. utils):

>>> from pypgx.api import utils
>>> help(utils)

For getting help on a specific method (e.g. pypgx.predict_phenotype):

>>> import pypgx
>>> help(pypgx.predict_phenotype)

In Jupyter Notebook and Lab, you can see the documentation for a Python function by hitting SHIFT + TAB. Hit it twice to expand the view.

CLI examples

We can print the metadata of an archive file:

$ pypgx print-metadata grch37-depth-of-coverage.zip

The above command will print:

Assembly=GRCh37
SemanticType=CovFrame[DepthOfCoverage]
Platform=WGS

We can run the NGS pipeline for the CYP2D6 gene:

$ pypgx run-ngs-pipeline \
CYP2D6 \
grch37-CYP2D6-pipeline \
--variants grch37-variants.vcf.gz \
--depth-of-coverage grch37-depth-of-coverage.zip \
--control-statistics grch37-control-statistics-VDR.zip

The above command will create a number of archive files:

Saved VcfFrame[Imported] to: grch37-CYP2D6-pipeline/imported-variants.zip
Saved VcfFrame[Phased] to: grch37-CYP2D6-pipeline/phased-variants.zip
Saved VcfFrame[Consolidated] to: grch37-CYP2D6-pipeline/consolidated-variants.zip
Saved SampleTable[Alleles] to: grch37-CYP2D6-pipeline/alleles.zip
Saved CovFrame[ReadDepth] to: grch37-CYP2D6-pipeline/read-depth.zip
Saved CovFrame[CopyNumber] to: grch37-CYP2D6-pipeline/copy-number.zip
Saved SampleTable[CNVCalls] to: grch37-CYP2D6-pipeline/cnv-calls.zip
Saved SampleTable[Genotypes] to: grch37-CYP2D6-pipeline/genotypes.zip
Saved SampleTable[Phenotypes] to: grch37-CYP2D6-pipeline/phenotypes.zip
Saved SampleTable[Results] to: grch37-CYP2D6-pipeline/results.zip

API examples

We can obtain allele function for the CYP2D6 gene:

>>> import pypgx
>>> pypgx.get_function('CYP2D6', '*1')
'Normal Function'
>>> pypgx.get_function('CYP2D6', '*4')
'No Function'
>>> pypgx.get_function('CYP2D6', '*22')
'Uncertain Function'
>>> pypgx.get_function('CYP2D6', '*140')
'Unknown Function'

We can predict phenotype for CYP2D6 based on two haplotype calls:

>>> import pypgx
>>> pypgx.predict_phenotype('CYP2D6', '*4', '*5')   # Both alleles have no function
'Poor Metabolizer'
>>> pypgx.predict_phenotype('CYP2D6', '*5', '*4')   # The order of alleles does not matter
'Poor Metabolizer'
>>> pypgx.predict_phenotype('CYP2D6', '*1', '*22')  # *22 has uncertain function
'Indeterminate'
>>> pypgx.predict_phenotype('CYP2D6', '*1', '*1x2') # Gene duplication
'Ultrarapid Metabolizer'

We can also obtain recommendation (e.g. CPIC) for certain drug-phenotype combination:

>>> import pypgx
>>> # Codeine, an opiate and prodrug of morphine, is metabolized by CYP2D6
>>> pypgx.get_recommendation('codeine', 'CYP2D6', 'Normal Metabolizer')
'Use codeine label recommended age- or weight-specific dosing.'
>>> pypgx.get_recommendation('codeine', 'CYP2D6', 'Ultrarapid Metabolizer')
'Avoid codeine use because of potential for serious toxicity. If opioid use is warranted, consider a non-tramadol opioid.'
>>> pypgx.get_recommendation('codeine', 'CYP2D6', 'Poor Metabolizer')
'Avoid codeine use because of possibility of diminished analgesia. If opioid use is warranted, consider a non-tramadol opioid.'
>>> pypgx.get_recommendation('codeine', 'CYP2D6', 'Indeterminate')
'None'

pypgx's People

Contributors: kokyriakidis, ntnguyen13, sbslee
pypgx's Issues

CYP2B6*17 calling issue caused by having multiple variant synonyms

CYP2B6*17 is currently defined with three GRCh37 variants 19-41497286-A-T, 19-41497292-G-GGC, and 19-41497294-CCG-C. If you look closely, you will note that the latter two variants actually overlap such that:

GRCh37:          GRCh38:
   * **             * **
999999999        888889999
012345678        567890123
ATGACCGCC        ATGACCGCC
ATGgCacCC        ATGgCacCC

Therefore, another way to represent 19-41497292-G-GGC and 19-41497294-CCG-C is to have three separate SNVs instead: 19-41497293-A-G, 19-41497295-C-A, and 19-41497296-G-C.

In fact, different genotype callers use different variant representations -- e.g. GATK4 HaplotypeCaller will output two indels (and gnomAD too: 19-41497292-G-GGC and 19-41497294-CCG-C) while the newly introduced create-input-vcf command in PyPGx (which uses bcftools internally) will output three separate SNVs.

This poses a problem for PyPGx because it needs to be able to call the same CYP2B6*17 allele but with different variant representations depending on the input VCF.

Usually, this type of problem is relatively easily handled by using the idea of variant "synonyms". For example, if you look at the variant-table.csv file, there is a GRCh37Synonym column (e.g. 2-234668879-C-CAT in the UGT1A1 gene has 2-234668879-CAT-CATAT as a synonym).

However, in the case of CYP2B6*17 we have two indels vs. three SNVs, which breaks the paradigm of one synonym per variant. Therefore, starting with version 0.14.0-dev, we will abandon this paradigm such that 19-41497292-G-GGC will have 19-41497293-A-G,19-41497295-C-A as synonyms while 19-41497294-CCG-C will have 19-41497296-G-C as a synonym:

>>> import pypgx
>>> pypgx.get_variant_synonyms('CYP2B6')
{'19-41497293-A-G': '19-41497292-G-GGC', '19-41497295-C-A': '19-41497292-G-GGC', '19-41497296-G-C': '19-41497294-CCG-C'}

Note that this means both 19-41497293-A-G and 19-41497295-C-A will point to 19-41497292-G-GGC. Therefore, when it comes to star allele calling, technically, having either 19-41497293-A-G or 19-41497295-C-A is equivalent to having 19-41497292-G-GGC.

Obviously this is not ideal, but given how rare this issue is, I think this is a reasonable solution that has minimal impact on the overall data structure in PyPGx.
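
The synonym lookup described above could be applied during star-allele calling along these lines (a minimal sketch; `normalize` is a hypothetical helper, not part of the PyPGx API, and the mapping is copied from the `get_variant_synonyms` output shown earlier):

```python
# Mapping copied from the get_variant_synonyms('CYP2B6') output above.
SYNONYMS = {
    '19-41497293-A-G': '19-41497292-G-GGC',
    '19-41497295-C-A': '19-41497292-G-GGC',
    '19-41497296-G-C': '19-41497294-CCG-C',
}

def normalize(observed):
    """Map observed variants to their canonical representation, deduplicated.
    Hypothetical helper for illustration only."""
    canonical = {SYNONYMS.get(v, v) for v in observed}
    return sorted(canonical)

# The three SNVs from bcftools collapse to the two canonical indels:
print(normalize(['19-41497293-A-G', '19-41497295-C-A', '19-41497296-G-C']))
# → ['19-41497292-G-GGC', '19-41497294-CCG-C']
```

With this normalization, either SNV representation becomes equivalent to the indel representation for genotyping purposes, exactly as described above.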

Impact of rs2032582A>C variant

The impact of the rs2032582 A>C variant in the PyPGx database is "A893S". However, according to dbSNP, the impact is "Ser893Ala" or "S893A". Could you please check it again?
Thanks a lot,
Trang

Recommendations for comparing pypgx calls to commercial tests?

Hi Steven,

I have some folks genotyped for DPYD via a commercial test that evaluates *2A, *7, *8, *10, *13, and HapB3 (and reports *1 when none of these variants are identified).

These folks are also genotyped by array on GRCh38 and I was planning to compare the concordance between pypgx on array and the commercial test.

I noticed the GRCh38Default for DPYD is Reference, so I am running into cases where commercial test calls things like *1/HapB3 and pypgx calls Reference/HapB3.

I am thinking these would be concordant diplotypes if Reference is the allele provided when none of the named alleles are detected by pypgx (similar to *1 for the commercial test), but thought I would get your take on it.

Please let me know what you think. Thanks!

Error using bed

Hi @sbslee,
Congrats, nice work!
I was testing with some targeted sequencing data when I got the error below. Could you help me?

pypgx compute-control-statistics chr12:48235319-48298814 grch38-control-statistics-VDR.zip --assembly GRCh38 ../data/20210819_AHJLFMDRXY_CUSTOM_20210382/bam/*.bam --bed ../genes_pypgx.bed
Traceback (most recent call last):
  File "/Users/wilsonjunior/miniconda3/envs/pypgx/bin/pypgx", line 10, in <module>
    sys.exit(main())
  File "/Users/wilsonjunior/miniconda3/envs/pypgx/lib/python3.9/site-packages/pypgx/__main__.py", line 33, in main
    commands[args.command].main(args)
  File "/Users/wilsonjunior/miniconda3/envs/pypgx/lib/python3.9/site-packages/pypgx/cli/compute_control_statistics.py", line 81, in main
    result = utils.compute_control_statistics(
  File "/Users/wilsonjunior/miniconda3/envs/pypgx/lib/python3.9/site-packages/pypgx/api/utils.py", line 394, in compute_control_statistics
    if bam_prefix and bed_prefix:
NameError: name 'bam_prefix' is not defined

Thanks!!!
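
A defensive sketch of the kind of fix this NameError suggests, assuming the missing `bam_prefix`/`bed_prefix` variables were meant to track whether each input uses the 'chr' contig-naming style (this is an assumption; the actual fix in PyPGx may differ):

```python
def has_chr_prefix(contig):
    """Return True if a contig name uses the 'chr' style (e.g. 'chr12')."""
    return contig.startswith('chr')

def check_prefix_consistency(bam_contig, bed_contig):
    """Raise a clear error when BAM and BED contig naming styles disagree.
    Hypothetical helper; defines both variables before they are compared."""
    bam_prefix = has_chr_prefix(bam_contig)
    bed_prefix = has_chr_prefix(bed_contig)
    if bam_prefix != bed_prefix:
        raise ValueError("BAM and BED files use different contig naming "
                         "styles (e.g. 'chr12' vs '12')")
```

The key point is simply that both variables are assigned on every code path before the comparison runs.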

Error in bam2gdf

I am getting the following error when I run bam2gdf in gstt1:

[INFO] PyPGx v0.1.22
[INFO] Command:
[INFO]     /Users/kokyriakidis/Documents/GitHub/Pharmakon/.snakemake/conda/3247b051/bin/pypgx bam2gdf hg38 gstp1 vdr /Users/kokyriakidis/Desktop/test_folder/KOSTAS/genes/gstp1/gdf/gstp1.gdf /Users/kokyriakidis/Desktop/test_folder/KOSTAS/applybqsr/KOSTAS.bam
[INFO] Sample IDs: ['KOSTAS']
[INFO] Contigs: ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '3', '4', '5', '6', '7', '8', '9', 'MT', 'X', 'Y', 'KI270728.1', 'KI270727.1', 'KI270442.1', 'KI270729.1', 'GL000225.1', 'KI270743.1', 'GL000008.2', 'GL000009.2', 'KI270747.1', 'KI270722.1', 'GL000194.1', 'KI270742.1', 'GL000205.2', 'GL000195.1', 'KI270736.1', 'KI270733.1', 'GL000224.1', 'GL000219.1', 'KI270719.1', 'GL000216.2', 'KI270712.1', 'KI270706.1', 'KI270725.1', 'KI270744.1', 'KI270734.1', 'GL000213.1', 'GL000220.1', 'KI270715.1', 'GL000218.1', 'KI270749.1', 'KI270741.1', 'GL000221.1', 'KI270716.1', 'KI270731.1', 'KI270751.1', 'KI270750.1', 'KI270519.1', 'GL000214.1', 'KI270708.1', 'KI270730.1', 'KI270438.1', 'KI270737.1', 'KI270721.1', 'KI270738.1', 'KI270748.1', 'KI270435.1', 'GL000208.1', 'KI270538.1', 'KI270756.1', 'KI270739.1', 'KI270757.1', 'KI270709.1', 'KI270746.1', 'KI270753.1', 'KI270589.1', 'KI270726.1', 'KI270735.1', 'KI270711.1', 'KI270745.1', 'KI270714.1', 'KI270732.1', 'KI270713.1', 'KI270754.1', 'KI270710.1', 'KI270717.1', 'KI270724.1', 'KI270720.1', 'KI270723.1', 'KI270718.1', 'KI270317.1', 'KI270740.1', 'KI270755.1', 'KI270707.1', 'KI270579.1', 'KI270752.1', 'KI270512.1', 'KI270322.1', 'GL000226.1', 'KI270311.1', 'KI270366.1', 'KI270511.1', 'KI270448.1', 'KI270521.1', 'KI270581.1', 'KI270582.1', 'KI270515.1', 'KI270588.1', 'KI270591.1', 'KI270522.1', 'KI270507.1', 'KI270590.1', 'KI270584.1', 'KI270320.1', 'KI270382.1', 'KI270468.1', 'KI270467.1', 'KI270362.1', 'KI270517.1', 'KI270593.1', 'KI270528.1', 'KI270587.1', 'KI270364.1', 'KI270371.1', 'KI270333.1', 'KI270374.1', 'KI270411.1', 'KI270414.1', 'KI270510.1', 'KI270390.1', 'KI270375.1', 'KI270420.1', 'KI270509.1', 'KI270315.1', 'KI270302.1', 'KI270518.1', 'KI270530.1', 'KI270304.1', 'KI270418.1', 'KI270424.1', 'KI270417.1', 'KI270508.1', 'KI270303.1', 'KI270381.1', 'KI270529.1', 'KI270425.1', 'KI270396.1', 'KI270363.1', 'KI270386.1', 'KI270465.1', 'KI270383.1', 'KI270384.1', 
'KI270330.1', 'KI270372.1', 'KI270548.1', 'KI270580.1', 'KI270387.1', 'KI270391.1', 'KI270305.1', 'KI270373.1', 'KI270422.1', 'KI270316.1', 'KI270340.1', 'KI270338.1', 'KI270583.1', 'KI270334.1', 'KI270429.1', 'KI270393.1', 'KI270516.1', 'KI270389.1', 'KI270466.1', 'KI270388.1', 'KI270544.1', 'KI270310.1', 'KI270412.1', 'KI270395.1', 'KI270376.1', 'KI270337.1', 'KI270335.1', 'KI270378.1', 'KI270379.1', 'KI270329.1', 'KI270419.1', 'KI270336.1', 'KI270312.1', 'KI270539.1', 'KI270385.1', 'KI270423.1', 'KI270392.1', 'KI270394.1']
[INFO] Elapsed time: 0:00:01
[INFO] PyPGx finished
[INFO] PyPGx v0.1.22
[INFO] Command:
[INFO]     /Users/kokyriakidis/Documents/GitHub/Pharmakon/.snakemake/conda/3247b051/bin/pypgx bam2gdf hg38 gstt1 vdr /Users/kokyriakidis/Desktop/test_folder/KOSTAS/genes/gstt1/gdf/gstt1.gdf /Users/kokyriakidis/Desktop/test_folder/KOSTAS/applybqsr/KOSTAS.bam
Traceback (most recent call last):
  File "/Users/kokyriakidis/Documents/GitHub/Pharmakon/.snakemake/conda/3247b051/bin/pypgx", line 8, in <module>
    sys.exit(main())
  File "/Users/kokyriakidis/Documents/GitHub/Pharmakon/.snakemake/conda/3247b051/lib/python3.6/site-packages/pypgx/__main__.py", line 598, in main
    result = PYPGX_TOOLS[args.tool](**vars(args))
  File "/Users/kokyriakidis/Documents/GitHub/Pharmakon/.snakemake/conda/3247b051/lib/python3.6/site-packages/pypgx/common.py", line 233, in wrapper
    func(*args, **kwargs, input_files=input_files)
  File "/Users/kokyriakidis/Documents/GitHub/Pharmakon/.snakemake/conda/3247b051/lib/python3.6/site-packages/pypgx/bam2gdf.py", line 56, in bam2gdf
    sdf = bam2sdf(genome_build, target_gene, control_gene, input_files)
  File "/Users/kokyriakidis/Documents/GitHub/Pharmakon/.snakemake/conda/3247b051/lib/python3.6/site-packages/pypgx/bam2sdf.py", line 51, in bam2sdf
    regions = sort_regions([tr, cr])
  File "/Users/kokyriakidis/Documents/GitHub/Pharmakon/.snakemake/conda/3247b051/lib/python3.6/site-packages/pypgx/sglib.py", line 578, in sort_regions
    return sorted(regions, key = f)
  File "/Users/kokyriakidis/Documents/GitHub/Pharmakon/.snakemake/conda/3247b051/lib/python3.6/site-packages/pypgx/sglib.py", line 576, in f
    chr = int(r[0].replace("chr", ""))
ValueError: invalid literal for int() with base 10: '22_KI270879v1_alt'

I am using the GRCh38 reference genome from Ensembl (release 100).
Any clue?
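
The traceback shows `sort_regions` assumes every contig name becomes an integer after stripping "chr". One way to tolerate alt/unplaced contigs is a two-tier sort key, sketched below (a hypothetical helper; the real function takes region tuples, and this sketch also sorts X/Y after the numeric chromosomes):

```python
def contig_sort_key(contig):
    """Sort numeric chromosomes first, then all other contigs
    (alt, random, unplaced, X, Y, MT) lexicographically."""
    name = contig.replace('chr', '')
    try:
        return (0, int(name), '')
    except ValueError:
        return (1, 0, name)  # non-numeric contigs go to the second tier

print(sorted(['chr3', 'chrX', 'chr22_KI270879v1_alt', 'chr1'],
             key=contig_sort_key))
# → ['chr1', 'chr3', 'chr22_KI270879v1_alt', 'chrX']
```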

Plotting allele fraction with empty VCF raises error

A user reported a potential bug in the run-ngs-pipeline command: if the user provides an empty VCF, the command throws an error complaining the VCF is empty when it should handle this type of situation gracefully.

I haven't had a chance to reproduce the issue yet, but it's most likely due to plot_vcf_allele_fraction().
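
A minimal sketch of the kind of guard the plotting step could add, assuming the desired behavior is to warn and skip rather than raise (`plot_allele_fraction` here is a stand-in, not the real PyPGx function):

```python
import warnings

def plot_allele_fraction(records):
    """Skip plotting when the VCF has no variant records instead of raising."""
    if not records:
        warnings.warn('VCF contains no variants; skipping allele-fraction plot')
        return None
    # ... actual plotting logic would go here ...
    return len(records)  # stand-in for a figure object
```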

Improve PyPGx's handling of genes on X chromosome (i.e. G6PD)

A PyPGx user recently reported the following via email:

G6PD x-chr single allele case?

For this one, I'm also attaching a sample G6PD vcf I tried running with run_ngs_pipeline()
The error I get is something like this:

❯ pypgx run-ngs-pipeline --variants G6PD.ref.fixed.vcf.gz --panel /mnt/data/pypgx-bundle/1kgp/GRCh37/G6PD.vcf.gz G6PD /tmp/pypgx-g6pd-ukb

Saved VcfFrame[Imported] to: /tmp/pypgx-g6pd-ukb/imported-variants.zip
Saved VcfFrame[Phased] to: /tmp/pypgx-g6pd-ukb/phased-variants.zip
Saved VcfFrame[Consolidated] to: /tmp/pypgx-g6pd-ukb/consolidated-variants.zip
Traceback (most recent call last):
  File "/home/min/venv/mg/bin/pypgx", line 8, in <module>
    sys.exit(main())
  File "/home/min/venv/mg/lib/python3.8/site-packages/pypgx/__main__.py", line 33, in main
    commands[args.command].main(args)
  File "/home/min/venv/mg/lib/python3.8/site-packages/pypgx/cli/run_ngs_pipeline.py", line 159, in main
    pipeline.run_ngs_pipeline(
  File "/home/min/venv/mg/lib/python3.8/site-packages/pypgx/api/pipeline.py", line 247, in run_ngs_pipeline
    alleles = utils.predict_alleles(consolidated_variants)
  File "/home/min/venv/mg/lib/python3.8/site-packages/pypgx/api/utils.py", line 1109, in predict_alleles
    observed = consolidated_variants.data.df.apply(one_row, args=(sample, i), axis=1)
  File "/home/min/venv/mg/lib/python3.8/site-packages/pandas/core/frame.py", line 8740, in apply
    return op.apply()
  File "/home/min/venv/mg/lib/python3.8/site-packages/pandas/core/apply.py", line 688, in apply
    return self.apply_standard()
  File "/home/min/venv/mg/lib/python3.8/site-packages/pandas/core/apply.py", line 812, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/min/venv/mg/lib/python3.8/site-packages/pandas/core/apply.py", line 828, in apply_series_generator
    results[i] = self.f(v)
  File "/home/min/venv/mg/lib/python3.8/site-packages/pandas/core/apply.py", line 131, in f
    return func(x, *args, **kwargs)
  File "/home/min/venv/mg/lib/python3.8/site-packages/pypgx/api/utils.py", line 1086, in one_row
    j = int(gt.split('|')[i])
IndexError: list index out of range

I ran this using the CLI version, just to rule out any API integration mistakes. Looking at utils.py, it does assume that gt has two alleles. In the case of G6PD X-chromosome data, doesn't that break the assumption? Not my area, so correct me if I'm wrong about this. Either way, I'm hoping you can reproduce the errors using the sample files I'm sending over.
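
The IndexError comes from `int(gt.split('|')[i])` when the GT field is haploid. A sketch of how the parsing could tolerate hemizygous calls, assuming the fix is to reuse the single allele index (`allele_index` is a hypothetical helper, not PyPGx code):

```python
def allele_index(gt, i):
    """Return the i-th allele index from a phased GT field, tolerating
    haploid calls such as '0' seen on the male X chromosome."""
    fields = gt.split('|')
    if len(fields) == 1:
        return int(fields[0])  # hemizygous: reuse the single allele
    return int(fields[i])

print(allele_index('0|1', 1))  # → 1
print(allele_index('1', 1))    # → 1 (haploid call, no second field)
```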

Updating phenotype data for several genes

Very interesting! I really admire how you already planned ahead for all of these situations.

I'm planning to add the functional impact of variants in some other genes, such as CACNA1S, RYR1, CFTR, CYP4F2, G6PD, and IFNL3. I see the phenotype prediction page, as well as the categories in the PhenotypeMethod column of gene-table.tsv. I want to add this case to our discussion as well:

G551D Het/Hom > F508del Homozygous > Some other variants Het/Hom.

However, the meanings of those variant functions have not been defined in the FUNCTION_ORDER list yet. Do you think it's OK to add more entries for the CFTR variant functionality score?

Originally posted by @NTNguyen13 in #35 (comment)

Interpreting Outputs

Hello,
I'm trying to understand how to interpret some of the Indeterminate outputs from pypgx for CYP2D6 on GRCh38. It is not clear to me why an Indeterminate call is made. A few examples are below; explaining these might help me understand the rest.

Unnamed: 0 | Genotype | Phenotype | Haplotype1 | Haplotype2 | AlternativePhase | VariantData | CNV
123 | Indeterminate | Indeterminate | *17;*2; | *29;*2; | ; | *29:22-42129132-C-T,22-42127608-C-T:0.675,0.631;*17:22-42129770-G-A:0.294;*2:22-42126611-C-G,22-42127941-G-A:1.0,1.0; | Tandem2A
1234 | Indeterminate | Indeterminate | *2; | *2; | ; | *2:22-42127941-G-A,22-42126611-C-G:1.0,1.0; | Tandem3
12345 | Indeterminate | Indeterminate | *28;*2; | *2; | ; | *28:22-42129087-G-C,22-42130773-C-T:0.525,0.589;*2:22-42126611-C-G,22-42127941-G-A:1.0,1.0; | Tandem1A
123456 | Indeterminate | Indeterminate | *2; | *4;*10; | *65; | *4:22-42128945-C-T:0.483;*10:22-42130692-G-A,22-42126611-C-G:0.448,1.0;*65:22-42130692-G-A,22-42127941-G-A,22-42126611-C-G:0.448,0.457,1.0;*2:22-42127941-G-A,22-42126611-C-G:0.457,1.0; | WholeMultip1
1234567 | Indeterminate | Indeterminate | *2; | *17;*2; | ; | *17:22-42129770-G-A:0.527;*2:22-42127941-G-A,22-42126611-C-G:1.0,1.0; | Tandem2B

Why does `phenotyper` not default to providing phenotypes for drug metabolizing enzymes for UGT1A1, DPYD etc.?

Hi @sbslee Steven,

PharmGKB's "diplotype to phenotype" tables for UGT1A1, DPYD, and other enzymes seem to use phenotype nomenclature consistent with the CPIC consensus terms for drug-metabolizing enzymes.

pypgx phenotyper seems to default to providing allele functional status for some enzymes. Is there a reason for this?

Example:

from pypgx.phenotyper import phenotyper
phenotyper("tpmt", "*1", "*2")

gives decreased_function, but based on PharmGKB "diplotype to phenotype", I would have expected something like Intermediate Metabolizer

If I wanted phenotypes more comparable with PharmGKB reports, I realize that I could just change the code in my copy of phenotyper.py to output metabolizing phenotypes for these genes, but thought I would get your input before doing so (I'm not a PGx expert).

On a side note, I wonder if there is an issue with the webform for downloading the most recent version of stargazer from UW. Unless I am missing something, I have not been able to download using chrome, edge, firefox or brave browsers.

I am a big fan of pypgx and stargazer, thanks for developing. :)

-Brett

Regarding multiple alleles on the same strand

Hi, I found that in our population data, sometimes multiple alleles are detected on the same strand in one individual. This will affect two things:

  1. How should we represent that PGx gene?
  2. How should we assign the activity score for that case?

This issue had me scratching my head for a few days; then I talked to some experts and came across this paper: https://ascpt.onlinelibrary.wiley.com/doi/full/10.1002/cpt.2122

So I get the general idea:

  1. If the variants from those alleles do not overlap each other, we can represent them with the + notation. If they overlap, we can represent one allele plus the additional variants from the other allele.
  2. We can choose the more 'severe' score among the two or more alleles.

However, this raises other concerns as well, especially about complex CNV. As I am working with PyPGx, I propose the ideas below to enhance it; please correct me if they are not appropriate.
[diagram: SimplerGenotyper.drawio]

At this moment, we just deal with PGx genes without CNV first. As PyPGx currently reports all detected alleles in the Haplotype columns, it's already easy to report multiple alleles, but I will add one more column to allele-table.tsv to indicate which variants overlap each other.

In the later call-phenotype step, because our alleles are already sorted to prioritize the more severe allele, I can just use the first element of .split('+') as input for the call-phenotype function.

Please comment if my idea aligns with PyPGx development direction.
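
The scoring idea in point 2 above could be sketched as follows (the ACTIVITY table is illustrative only, not real activity scores, and `haplotype_activity` is a hypothetical helper):

```python
# Illustrative activity-score table; real scores come from CPIC/PharmGKB.
ACTIVITY = {'*1': 1.0, '*2': 1.0, '*9': 0.5, '*4': 0.0}

def haplotype_activity(haplotype):
    """Score a '+'-notation haplotype by its most severe (lowest-activity)
    component, as proposed above."""
    return min(ACTIVITY[a] for a in haplotype.split('+'))

print(haplotype_activity('*1+*4'))  # → 0.0 (the no-function allele dominates)
print(haplotype_activity('*2'))     # → 1.0
```

If the allele list is already sorted most-severe first, taking `split('+')[0]` is equivalent to taking the minimum.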

Issue while running snp chip tutorial.

Hi,
I am new to this and was trying to run the tutorial for the SNP chip pipeline, but the attached error occurred. Can you look into this? Thanks in advance.

pysam.utils.SamtoolsError cannot parse region "chr22_KI270879v1_alt:267307-281486"

Hello @sbslee, I hope you are well.
I'm trying to run a WGS sample that was run using Dragen 3.9 for the prepare-depth-of-coverage and have the below error:

$ docker run -v "$PWD":/data pypgx:v0.15.0 pypgx prepare-depth-of-coverage /data/sample/sample_WGS-depth-of-coverage.zip /data/sample_WGS/sample_WGS-contigs.bam --assembly GRCh38
ERROR conda.cli.main_run:execute(33): Subprocess for 'conda run ['pypgx', 'prepare-depth-of-coverage', '/data/sample_WGS/sample_WGS-depth-of-coverage.zip', '/data/sample_WGS/sample_WGS-contigs.bam', '--assembly', 'GRCh38']' command failed.  (See above for error)
Traceback (most recent call last):
  File "/opt/conda/envs/myenv/bin/pypgx", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/envs/myenv/lib/python3.8/site-packages/pypgx/__main__.py", line 33, in main
    commands[args.command].main(args)
  File "/opt/conda/envs/myenv/lib/python3.8/site-packages/pypgx/cli/prepare_depth_of_coverage.py", line 90, in main
    archive = utils.prepare_depth_of_coverage(
  File "/opt/conda/envs/myenv/lib/python3.8/site-packages/pypgx/api/utils.py", line 1232, in prepare_depth_of_coverage
    cf = pycov.CovFrame.from_bam(bams, regions=regions, zero=True)
  File "/opt/conda/envs/myenv/lib/python3.8/site-packages/fuc/api/pycov.py", line 261, in from_bam
    results += pysam.depth(*(bams + args + ['-r', region]))
  File "/opt/conda/envs/myenv/lib/python3.8/site-packages/pysam/utils.py", line 69, in __call__
    raise SamtoolsError(
pysam.utils.SamtoolsError: 'samtools returned with error 1: stdout=, stderr=samtools depth: cannot parse region "chr22_KI270879v1_alt:267307-281486"\n'

I think it is happening because the contig name is different:

$ samtools idxstats sample_WGS-contigs.bam | grep chr22
chr22
chr22_KI270731v1_random
chr22_KI270732v1_random
chr22_KI270733v1_random
chr22_KI270734v1_random
chr22_KI270735v1_random
chr22_KI270736v1_random
chr22_KI270737v1_random
chr22_KI270738v1_random
chr22_KI270739v1_random

I have removed the reads in these random contigs because I'm trying to use a sample that is already aligned instead of aligning it again without the contigs. Do you think there is any other way of solving it? Maybe creating the sample_WGS-depth-of-coverage.zip file outside of PyPGx?

Thank you.
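
One possible workaround is to drop target regions whose contig is absent from the BAM header before calling samtools depth. A sketch (`filter_regions` is a hypothetical helper; the contig set below mirrors the idxstats output above):

```python
def filter_regions(regions, bam_contigs):
    """Keep only regions (e.g. 'chr22:267307-281486') whose contig
    actually appears in the BAM header."""
    return [r for r in regions if r.split(':')[0] in bam_contigs]

# bam_contigs would normally come from `samtools idxstats` or pysam.
bam_contigs = {'chr22', 'chr22_KI270731v1_random'}
print(filter_regions(
    ['chr22:42512500-42551883', 'chr22_KI270879v1_alt:267307-281486'],
    bam_contigs))
# → ['chr22:42512500-42551883']
```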

Most recent stargazer package

Hi, thanks for this package.

I am currently trying to properly install Stargazer to use with this package. I noticed here that the command to run Stargazer is different from the package that I have. Where do I get the most up-to-date Stargazer distribution?

ImportError: cannot import name 'pycov' from 'fuc'

Hello @sbslee, thanks for making the project available.
I have tried to install pypgx using your instructions, but when I try to run it I still have problems with the fuc package.

$ pypgx -h
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/bin/pypgx", line 33, in <module>
    sys.exit(load_entry_point('pypgx==0.10.1', 'console_scripts', 'pypgx')())
  File "/home/ubuntu/miniconda3/bin/pypgx", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/home/ubuntu/miniconda3/lib/python3.8/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/home/ubuntu/miniconda3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/pypgx/__init__.py", line 37, in <module>
    from .api.utils import (
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/pypgx/api/utils.py", line 14, in <module>
    from .. import sdk
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/pypgx/sdk/__init__.py", line 1, in <module>
    from .utils import (Archive, parse_input_bams, compare_metadata, simulate_copy_number)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/pypgx/sdk/utils.py", line 10, in <module>
    from fuc import pyvcf, pycov, common, pybam
ImportError: cannot import name 'pycov' from 'fuc' (/home/ubuntu/miniconda3/lib/python3.8/site-packages/fuc/__init__.py)

Not sure what the problem with fuc is.

Best,
George.

CPIC/PharamGKB recommendations

Hi Steven,

Thanks a lot for this great tool.

Do you have any plans in the future to automatically generate CPIC/PharmGKB recommendations based on the predicted phenotype from pypgx?

Thanks,

Regards,
Dheeraj.

BAM Identifiers/Filenames Are Alphanumeric Strings When Viewing Results

Hello,

For reference I am using GeT-RM BAM files directly from the ENA:
https://www.ebi.ac.uk/ebisearch/search?query=PRJEB19931&requestFrom=ebi_index&db=allebi

I have the data stored, as an example, as such:
/data/bam_files/ERR195/ERR1955341/NA11993.bam

I am calling the pipeline and all other commands building up to it as such:

pypgx run-ngs-pipeline CYP2D6 grch37-CYP2D6-pipeline_get_rm_1 --variants grch37-variants_get_rm_1.vcf.gz --depth-of-coverage grch37-depth-of-coverage_get_rm_1.zip --control-statistics grch37-control-statistics-RYR1_get_rm_1.zip

I have followed the tutorial and no warnings/issues seem to be present, but when viewing the output of pypgx print-data grch37-CYP2D6-pipeline_get_rm_1/results.zip | head, I get the following:

Genotype        Phenotype       Haplotype1      Haplotype2      AlternativePhase        VariantData     CNV
20b87673c1224e9db8bdbbe82899309c        *4/*5   Poor Metabolizer        *4;*10;*74;*2;  *4;*10;*74;*2;  ;       *4:22-42524947-C-T:0.95;*10:22-42526694-G-A,22-42523943-A-G:1.0,1.0;*74:22-42525821-G-T:1.0;*2:default;     WholeDel1

Could you please tell me how I can get these names to show up in an informative/consistent way?

Thank you.

Regarding RefAllele and Default Allele in gene-table.csv

Hi, this question is mostly for better understanding of pypgx data structure.

I tried to figure out the meaning of RefAllele, but my guess wasn't quite right. I thought RefAllele was the allele represented in the human reference genome (the FASTA file), but GRCh37Default and GRCh38Default already represent that. I also saw cases where GRCh37Default and GRCh38Default flip (I think because of changes between GRCh37 and GRCh38), but I found five cases where GRCh37Default and GRCh38Default are the same yet different from RefAllele:

Gene     RefAllele  GRCh37Default  GRCh38Default
ABCB1    *1         *2             *2
NAT2     *4         *12            *12
SLC22A2  *1         *3             *3
UGT2B7   *1         *2             *2
UGT2B15  *1         *2             *2

I found this logic check to assign an allele when no candidate is found, but I still don't fully understand the role of RefAllele:

if ref_allele != default_allele and ref_allele not in candidates and default_allele not in candidates:
    candidates.append(default_allele)
if not candidates:
    candidates.append(default_allele)

Could you please explain what RefAllele is, and how it should be assigned in the gene table? Thank you very much.

Installation error?

Hello Steven,
I was trying to install pypgx in a new system and get the following error

$ pypgx -h
Traceback (most recent call last):
  File "/home/smaggo/.conda/envs/pypgx/bin/pypgx", line 6, in <module>
    from pypgx.__main__ import main
  File "/home/smaggo/.conda/envs/pypgx/lib/python3.7/site-packages/pypgx/__main__.py", line 4, in <module>
    from .cli import commands
  File "/home/smaggo/.conda/envs/pypgx/lib/python3.7/site-packages/pypgx/cli/__init__.py", line 9, in <module>
    commands[f.stem.replace('_', '-')] = import_module(f'.{f.stem}', __package__)
  File "/home/smaggo/.conda/envs/pypgx/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/smaggo/.conda/envs/pypgx/lib/python3.7/site-packages/pypgx/cli/compute_control_statistics.py", line 29, in <module>
AttributeError: module 'fuc.api.common' has no attribute '_script_name'

Error for pypgx installed by conda

Hello team, thank you for making this nice project available.
I have installed pypgx using conda, but when I try to run the tool I get the error below:

$ pypgx -h
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/bin/pypgx", line 6, in <module>
    from pypgx.__main__ import main
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/pypgx/__main__.py", line 4, in <module>
    from .cli import commands
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/pypgx/cli/__init__.py", line 9, in <module>
    commands[f.stem.replace('_', '-')] = import_module(f'.{f.stem}', __package__)
  File "/home/ubuntu/miniconda3/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/pypgx/cli/compute_control_statistics.py", line 18, in <module>
    $ pypgx {fuc.api.common._script_name()} \\
AttributeError: module 'fuc.api.common' has no attribute '_script_name'

Not sure how to solve it, could you help?

Minor bug in pipeline.py all functions

If output exists and force=False, line 227 generates FileExistsError.

Could modify lines 224-227 to be:

if os.path.exists(output):
    if force:
        shutil.rmtree(output)
        os.mkdir(output)
else:
    os.mkdir(output)

I gave run_ngs_pipeline as an example, but the same bug exists for other functions in the file as well.
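
A variant of the proposed fix that fails loudly when the output exists and force is False, in case silently reusing the directory is undesirable (a sketch only; the real code and signature in pipeline.py differ):

```python
import os
import shutil

def prepare_output(output, force=False):
    """Create a fresh output directory, removing an existing one only
    when force=True; otherwise fail with a clear message."""
    if os.path.exists(output):
        if not force:
            raise SystemExit(f"Output directory '{output}' already exists; "
                             "rerun with force=True to overwrite")
        shutil.rmtree(output)
    os.mkdir(output)
```

Either approach avoids the uncaught FileExistsError; the trade-off is whether an existing directory should be an error or a no-op.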

Installation error

Hello there, I am working in a new compute environment and get the following error after installation.

(PYPGX) pypgx --help
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/PYPGX/bin/pypgx", line 6, in <module>
    from pypgx.__main__ import main
  File "/home/ec2-user/anaconda3/envs/PYPGX/lib/python3.7/site-packages/pypgx/__init__.py", line 1, in <module>
    from .api.core import (
  File "/home/ec2-user/anaconda3/envs/PYPGX/lib/python3.7/site-packages/pypgx/api/core.py", line 10, in <module>
    from .. import sdk
  File "/home/ec2-user/anaconda3/envs/PYPGX/lib/python3.7/site-packages/pypgx/sdk/__init__.py", line 1, in <module>
    from .utils import (Archive, add_cn_samples, compare_metadata, simulate_copy_number)
  File "/home/ec2-user/anaconda3/envs/PYPGX/lib/python3.7/site-packages/pypgx/sdk/utils.py", line 10, in <module>
    from fuc import pyvcf, pycov, common, pybam
  File "/home/ec2-user/anaconda3/envs/PYPGX/lib/python3.7/site-packages/fuc/__init__.py", line 1, in <module>
    from .api import *
  File "/home/ec2-user/anaconda3/envs/PYPGX/lib/python3.7/site-packages/fuc/api/common.py", line 19, in <module>
    from . import pyvcf
  File "/home/ec2-user/anaconda3/envs/PYPGX/lib/python3.7/site-packages/fuc/api/pyvcf.py", line 146, in <module>
    from . import pybed, common, pymaf, pybam
  File "/home/ec2-user/anaconda3/envs/PYPGX/lib/python3.7/site-packages/fuc/api/pymaf.py", line 61, in <module>
    import statsmodels.formula.api as smf
  File "/home/ec2-user/anaconda3/envs/PYPGX/lib/python3.7/site-packages/statsmodels/formula/api.py", line 15, in <module>
    from statsmodels.discrete.discrete_model import MNLogit
  File "/home/ec2-user/anaconda3/envs/PYPGX/lib/python3.7/site-packages/statsmodels/discrete/discrete_model.py", line 45, in <module>
    from statsmodels.distributions import genpoisson_p
  File "/home/ec2-user/anaconda3/envs/PYPGX/lib/python3.7/site-packages/statsmodels/distributions/__init__.py", line 2, in <module>
    from .edgeworth import ExpandedNormal
  File "/home/ec2-user/anaconda3/envs/PYPGX/lib/python3.7/site-packages/statsmodels/distributions/edgeworth.py", line 7, in <module>
    from scipy.misc import factorial
ImportError: cannot import name 'factorial' from 'scipy.misc' (/home/ec2-user/anaconda3/envs/PYPGX/lib/python3.7/site-packages/scipy/misc/__init__.py)

Help with nanopore data

Dear Steven, firstly thank you for such a useful tool!

We have been dabbling in generating nanopore data to evaluate CYP2D6 and hopefully other PGX genes in the future. I was hoping to use the CNV tools as well as some of your graphing tools. I realise I will need a depth of coverage file, and I was wondering how we could go about generating one for our nanopore data? Also, assuming we might be able to use the control-statistics which you provided or generate one from our own files?

Happy to share some vcf and or BAM files if required.

FYI - I have been able to call variants/phenotypes from nanopore data by using the run-chip-pipeline.

Why running my WGS data stops in the middle?

[screenshot of the error]

Hello,

As seen in the attached image, I ran the pipeline on my WGS data and it stopped with the error below.

Here is my code:

/usr/local/packages/python-3.9.13/bin/pypgx run-ngs-pipeline CYP2C19 /data4/sbargal/Amish/results/WGS/pypgx/CYP2C19 --variants /data4/sbargal/Amish/amish_input_data/build38/TOPMed6a_build38_889_subject_vcf.vcf.gz --assembly GRCh38 --panel /data4/sbargal/Amish/amish_input_data/build38/resource_bundle/pypgx-bundle/1kgp/GRCh38/CYP2C19.vcf.gz

Here is the error:

Matplotlib created a temporary config/cache directory at /tmp/matplotlib-lmw_btrn because the default path (/home/sbargal/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Saved VcfFrame[Imported] to: /data4/sbargal/Amish/results/WGS/pypgx/CYP2C19/imported-variants.zip
Traceback (most recent call last):
  File "/usr/local/packages/python-3.9.13/bin/pypgx", line 8, in <module>
    sys.exit(main())
  File "/usr/local/packages/python-3.9.13/lib/python3.9/site-packages/pypgx/__main__.py", line 33, in main
    commands[args.command].main(args)
  File "/usr/local/packages/python-3.9.13/lib/python3.9/site-packages/pypgx/cli/run_ngs_pipeline.py", line 159, in main
    pipeline.run_ngs_pipeline(
  File "/usr/local/packages/python-3.9.13/lib/python3.9/site-packages/pypgx/api/pipeline.py", line 239, in run_ngs_pipeline
    phased_variants = utils.estimate_phase_beagle(
  File "/usr/local/packages/python-3.9.13/lib/python3.9/site-packages/pypgx/api/utils.py", line 851, in estimate_phase_beagle
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
  File "/usr/local/packages/python-3.9.13/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['java', '-Xmx2g', '-jar', '/usr/local/packages/python-3.9.13/lib/python3.9/site-packages/pypgx/api/beagle.28Jun21.220.jar', 'gt=/tmp/tmp42k2isct/input.vcf', 'chrom=chr10:94759680-94858547', 'ref=/data4/sbargal/Amish/amish_input_data/build38/resource_bundle/pypgx-bundle/1kgp/GRCh38/CYP2C19.vcf.gz', 'out=/tmp/tmp42k2isct/output', 'impute=false']' returned non-zero exit status 1.

I would appreciate it if you could help me handle this error so I can proceed with the next step.

Thank you

Best Regards,

Salma

Introducing `create-input-vcf` command

This new command can be used to create a VCF file (containing SNVs/indels) from BAM files.

Before this command, users were supposed to create a VCF file from BAM files on their own in order to run the NGS pipeline. This raises several potential problems:

  1. PyPGx results depend on which germline variant caller is used (e.g. GATK4, bcftools, DRAGEN, DeepVariant), which in turn can negatively impact the reproducibility of PyPGx. Even if the same variant caller is used, there is no guarantee that the same version of the tool is used.
  2. It assumes that PyPGx users are already familiar with the ins and outs of variant calling (e.g. know how to control the balance between sensitivity and specificity). PyPGx strongly recommends producing an input VCF that contains all possible SNVs/indels to achieve maximum sensitivity, because it will only use known variants for genotyping purposes anyway (i.e. variants used to define star alleles). Therefore, PyPGx is actually quite robust against false positives.
  3. In the case of genes with structural variation, PyPGx strongly recommends producing an input VCF with allelic depth data (i.e. the AD tag). See here for why. However, the AD tag is not something that every variant caller produces by default (it almost always needs to be requested by the user).
  4. In the case of WGS data, the input VCF file can be quite large, so in many cases users are forced to slice the VCF to contain PyPGx target genes only. This task can be tiresome -- users would need to first extract PyPGx target regions (e.g. pypgx create-regions-bed) and then actually do the slicing using a software tool. Of course, users could restrict variant calling to PyPGx target regions from the beginning, but they would still need to extract the PyPGx regions first.
  5. Many of the popular variant callers such as GATK4 are very slow and require a huge amount of computing resources. This is because they all use fancy algorithms for maximizing sensitivity and specificity (e.g. GATK4 uses what's known as "local re-assembly of haplotypes"). However, as illustrated in point 2, PyPGx is less concerned about specificity, so it does not require these highly sophisticated algorithms.

That's why I've introduced the create-input-vcf command in 0.14.0-dev, which calls SNVs/indels from BAM files using the bcftools program internally. If you didn't know, bcftools is very "classic" in that it uses a less sophisticated, likelihood-based algorithm for calling variants than, say, GATK4, but it's still very reliable and fast. Therefore, by using bcftools with maximum-sensitivity settings, we can create an input VCF that's suitable for PyPGx.

Even with the help of bcftools, one would normally need to construct a variant calling pipeline to create a VCF file, which can be quite complex. Therefore, when developing pypgx create-input-vcf, I made sure the command line is as simple as possible, with a sensible choice of parameters for variant calling:

$ pypgx create-input-vcf out.vcf.gz ref.fa bam.list

The out.vcf.gz file will already be indexed for random access (i.e. out.vcf.gz.tbi) and can therefore be used directly with pypgx run-ngs-pipeline.

Below is the full help message for the command:

$ pypgx create-input-vcf -h
usage: pypgx create-input-vcf [-h] [--assembly TEXT] [--genes TEXT [TEXT ...]]
                              [--exclude] [--dir-path PATH]
                              vcf fasta bams [bams ...]

Call SNVs/indels from BAM files for all target genes.

To save computing resources, this method will call variants only for target
genes whose at least one star allele is defined by SNVs/indels. Therefore,
variants will not be called for target genes that have star alleles defined
only by structural variation (e.g. UGT2B17).

Positional arguments:
  vcf                   Output VCF file. It must have .vcf.gz as suffix.
  fasta                 Reference FASTA file.
  bams                  One or more input BAM files. Alternatively, you can
                        provide a text file (.txt, .tsv, .csv, or .list)
                        containing one BAM file per line.

Optional arguments:
  -h, --help            Show this help message and exit.
  --assembly TEXT       Reference genome assembly (default: 'GRCh37')
                        (choices: 'GRCh37', 'GRCh38').
  --genes TEXT [TEXT ...]
                        List of genes to include.
  --exclude             Exclude specified genes. Ignored when --genes is not
                        used.
  --dir-path PATH       By default, intermediate files (likelihoods.bcf,
                        calls.bcf, and calls.normalized.bcf) will be stored
                        in a temporary directory, which is automatically
                        deleted after creating final VCF. If you provide a
                        directory path, intermediate files will be stored
                        there.

I hope this will help standardize the NGS pipeline even further.

Disclaimer: There will be cases where more sophisticated variant callers are preferred, or even required, to generate the input VCF for PyPGx. For example, if your samples are ancient DNA, then you would probably want to use callers like ATLAS to correct for ancient DNA damage. Also, if you've already created an input VCF for purposes other than running PyPGx, and you want/need to be consistent with your other variant-level analyses, you may just use the same VCF for PyPGx. The bottom line is: if you are going to create your own input VCF, then you need to know what you are doing. Otherwise, it's probably safer to use pypgx create-input-vcf.

Depth of coverage command.

I am using some exome data to see what kind of 2D6 data I can pull. Following the tutorial, I guess I need a depth-of-coverage file, which I tried to create, but I am getting the following impressive error.

(test) pypgx prepare-depth-of-coverage depth-of-coverage.zip --assembly GRCh38 "/home/ec2-user/bams.txt" --bed "/home/ec2-user/DATA/agilentV6_design_hg38.bed"
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/test/bin/pypgx", line 10, in <module>
    sys.exit(main())
  File "/home/ec2-user/anaconda3/envs/test/lib/python3.7/site-packages/pypgx/__main__.py", line 33, in main
    commands[args.command].main(args)
  File "/home/ec2-user/anaconda3/envs/test/lib/python3.7/site-packages/pypgx/cli/prepare_depth_of_coverage.py", line 92, in main
    exclude=args.exclude
  File "/home/ec2-user/anaconda3/envs/test/lib/python3.7/site-packages/pypgx/api/utils.py", line 1275, in prepare_depth_of_coverage
    cf = pycov.CovFrame.from_bam(bams, regions=regions, zero=True)
  File "/home/ec2-user/anaconda3/envs/test/lib/python3.7/site-packages/fuc/api/pycov.py", line 315, in from_bam
    if all([pybam.has_chr_prefix(x) for x in bams]):
  File "/home/ec2-user/anaconda3/envs/test/lib/python3.7/site-packages/fuc/api/pycov.py", line 315, in <listcomp>
    if all([pybam.has_chr_prefix(x) for x in bams]):
  File "/home/ec2-user/anaconda3/envs/test/lib/python3.7/site-packages/fuc/api/pybam.py", line 187, in has_chr_prefix
    contigs = tag_sn(fn)
  File "/home/ec2-user/anaconda3/envs/test/lib/python3.7/site-packages/fuc/api/pybam.py", line 163, in tag_sn
    lines = pysam.view('-H', fn, '--no-PG').strip().split('\n')
  File "/home/ec2-user/anaconda3/envs/test/lib/python3.7/site-packages/pysam/utils.py", line 75, in __call__
    stderr))
pysam.utils.SamtoolsError: "samtools returned with error 1: stdout=, stderr=samtools view: unrecognised option '--no-PG'\n\nUsage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]\n\nOptions:\n  -b       output BAM\n  -C       output CRAM (requires -T)\n  -1       use fast BAM compression (implies -b)\n  -u       uncompressed BAM output (implies -b)\n  -h       include header in SAM output\n  -H       print SAM header only (no alignments)\n  -c       print only the count of matching records\n  -o FILE  output file name [samtools_stdout]\n  -U FILE  output reads not selected by filters to FILE [null]\n  -t FILE  FILE listing reference names and lengths (see long help) [null]\n  -L FILE  only include reads overlapping this BED FILE [null]\n  -r STR   only include reads in read group STR [null]\n  -R FILE  only include reads with read group listed in FILE [null]\n  -q INT   only include reads with mapping quality >= INT [0]\n  -l STR   only include reads in library STR [null]\n  -m INT   only include reads with number of CIGAR operations consuming\n           query sequence >= INT [0]\n  -f INT   only include reads with all  of the FLAGs in INT present [0]\n  -F INT   only include reads with none of the FLAGS in INT present [0]\n  -G INT   only EXCLUDE reads with all  of the FLAGs in INT present [0]\n  -s FLOAT subsample reads (given INT.FRAC option value, 0.FRAC is the\n           fraction of templates/read pairs to keep; INT part sets seed)\n  -M       use the multi-region iterator (increases the speed, removes\n           duplicates and outputs the reads as they are ordered in the file)\n  -x STR   read tag to strip (repeatable) [null]\n  -B       collapse the backward CIGAR operation\n  -?       
print long help, including note about region specification\n  -S       ignored (input format is auto-detected)\n      --input-fmt-option OPT[=VAL]\n               Specify a single input file format option in the form\n               of OPTION or OPTION=VALUE\n  -O, --output-fmt FORMAT[,OPT[=VAL]]...\n               Specify output format (SAM, BAM, CRAM)\n      --output-fmt-option OPT[=VAL]\n               Specify a single output file format option in the form\n               of OPTION or OPTION=VALUE\n  -T, --reference FILE\n               Reference sequence FASTA FILE [null]\n  -@, --threads INT\n               Number of additional threads to use [0]\n\n"

CYP2B6 phenotyping/genotyping

Hello folks,
I think there may be an issue with the way *6 and *9 are being assigned. At the end of the day they are both decreased-function alleles, so it should not impact the phenotype, but it will mess with any allele frequency comparisons.

As per https://www.pharmvar.org/gene/CYP2B6

*4 = 18053A>G
*6 = 15631G>T AND 18053A>G
*9 = 15631G>T

When putting our GSA imputed data through PyPGx, we got our previously assigned (by Sanger sequencing) *6 being called as *9. Possibly something to follow up on.
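The PharmVar definitions above reduce to a small decision table. A minimal sketch of that logic (a hypothetical helper for illustration, not PyPGx's actual implementation):

```python
def cyp2b6_core_allele(has_15631G_T, has_18053A_G):
    """Assign the CYP2B6 core allele from the two defining SNVs above."""
    if has_15631G_T and has_18053A_G:
        return "*6"   # both 15631G>T and 18053A>G present
    if has_18053A_G:
        return "*4"   # 18053A>G only
    if has_15631G_T:
        return "*9"   # 15631G>T only
    return "*1"       # neither variant

# A haplotype carrying only 15631G>T must be *9, not *6:
print(cyp2b6_core_allele(True, False))  # *9
```

In other words, *6 requires both variants on the same haplotype; a *6 vs. *9 mix-up suggests the two SNVs are not being considered jointly.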

Future Plans

Hi @sbslee !

I am currently working on a new PGx pipeline and wanted to incorporate pypgx into it. I saw your development branch and wondered whether pypgx will support genotyping. Will it have the same functionality as Stargazer?

Beagle throws error when there is only one variant in input VCF

A PyPGx user recently reported the following via email:

I have a couple of questions/issues while integrating with the latest version (1.5) of pypgx. I was hoping you could help me understand what's going on here.

Possible Beagle compatibility issue?

I'm attaching a sample vcf file to this email so you can check it out. It's GRCh37, ngs, CYP26A1
I get this error message when I'm running run_ngs_pipeline()

Captured stdout call

Saved VcfFrame[Imported] to: /tmp/tmpqbzlh8l4/out-CYP26A1/imported-variants.zip

Captured stderr call

Exception in thread "main" java.lang.IllegalArgumentException: Window has only one position: CHROM=10 POS=94833639
        at vcf.MarkerMap.meanSingleBaseGenDist(MarkerMap.java:98)
        at phase.FixedPhaseData.markerMap(FixedPhaseData.java:170)
        at phase.FixedPhaseData.<init>(FixedPhaseData.java:117)
        at main.Main.phaseAndImpute(Main.java:140)
        at main.Main.main(Main.java:110)

While looking up what this error means, I saw some folks saying the Beagle version may be incompatible (the input file format changed between v4 and v5). Without really understanding what's going on =P, I tried replacing site-packages/pypgx/api/beagle.28Jun21.220.jar with another version (e.g. Beagle v5.1, beagle.18May20.d20.jar), which seems to work around the issue... I'm hoping this means something to you =P

create-input-vcf error using numpy 1.24.1

Hi sbslee,

When I attempted to create an input vcf file, it would return

AttributeError: module 'numpy' has no attribute 'long'

I ended up installing numpy 1.23.0 (instead of the version 1.24.1 that I originally had), and that seemed to work.
I just wanted to let you know.

Thank you!
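For context, NumPy 1.24 removed the long-deprecated np.long alias (it was simply Python's built-in int), which is what triggers the AttributeError above. A minimal illustration of the replacement (the variable name is made up for the example):

```python
import numpy as np

# np.long was deprecated in NumPy 1.20 and removed in 1.24;
# code that used it can switch to int, or np.int64 for a fixed width.
max_depth = np.int64(250)   # instead of np.long(250)
print(int(max_depth))       # 250
```

Pinning numpy to 1.23.x, as described above, works around the problem until the affected code drops the removed alias.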

Is it possible to replace Depth of Coverage file with DP field?

Hi, I am new to this field, and trying to grasp the genotyping and this wonderful pypgx.

I am currently working with VCF files, and generating depth of coverage from BAM reads is not possible for now.

I know my vcf files contain DP field, do you think it is reasonable to use this instead?

Add sphinx-issues extension

The sphinx-issues extension provides a simple way to link to a GitHub project's issues, pull requests, user profiles, etc. For more details, visit the extension's website.
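A minimal conf.py fragment for enabling it (the repository path below is an assumption based on this project):

```python
# docs/conf.py -- enable the sphinx-issues extension
extensions = [
    "sphinx_issues",  # add alongside the existing extensions
]

# GitHub path used to resolve roles such as :issue:`123`, :pr:`456`, :user:`name`
issues_github_path = "sbslee/pypgx"
```

With this in place, writing ``:issue:`42``` in the docs renders as a link to that issue on GitHub.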

"Window has only one position" error when running estimate-phase-beagle

Dear sbslee,

I found that the "Window has only one position" error occurred with the variants below when I ran estimate-phase-beagle for the CYP2W1 gene.

*metadata.txt
Platform=WGS
Gene=CYP2W1
Assembly=GRCh37
SemanticType=VcfFrame[Imported]

*data.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 971
7 1020320 . CT C . . . GT:AD:DP:AF 0/1:3,13:16:0.188,0.812
7 1028279 . G A . . . GT:AD:DP:AF 0/1:16,16:32:0.500,0.500
7 1029860 . T C . . . GT:AD:DP:AF 0/1:6,4:10:0.600,0.400
7 1032032 . C G . . . GT:AD:DP:AF 0/1:20,17:37:0.541,0.459

As you can see, the two variants (7:1028279G>A and 7:1032032C>G) are included in the window ('chrom=7:1019834-1032276') and overlap with 1kgp/GRCh37/CYP2W1.vcf.gz.

However, the following error message was printed.

Exception in thread "main" java.lang.IllegalArgumentException: Window has only one position: CHROM=7 POS=1028279
at vcf.MarkerMap.meanSingleBaseGenDist(MarkerMap.java:98)
at phase.FixedPhaseData.markerMap(FixedPhaseData.java:170)
at phase.FixedPhaseData.<init>(FixedPhaseData.java:117)
at main.Main.phaseAndImpute(Main.java:140)
at main.Main.main(Main.java:110)

File "/storage/home/leefall2/anaconda3/envs/CKD/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['java', '-Xmx2g', '-jar', '/storage/home/leefall2/anaconda3/envs/CKD/lib/python3.7/site-packages/pypgx/api/beagle.22Jul22.46e.jar', 'gt=/tmp/tmp1zfkzd9c/input.vcf', 'chrom=7:1019834-1032276', 'ref=/storage/home/leefall2/pypgx-bundle/1kgp/GRCh37/CYP2W1.vcf.gz', 'out=/tmp/tmp1zfkzd9c/output', 'impute=false']' returned non-zero exit status 1.

Is there any problem in my data?

I found this kind of case only three times among 7,695 cases.

Best regards,

pypgx on anaconda

Hi!

It seems that pypgx is missing from Anaconda and I cannot install it. Have you noticed this?

Citing PyPGx

Hi, I'm currently writing a manuscript on the Vietnamese PGx landscape. In that study, for generating star allele genotypes and phenotypes, I formerly used Stargazer, but now I'm using PyPGx, and the final manuscript is expected to use results from PyPGx too.

I would like to cite PyPGx; however, I can't find the citation information anywhere, so I seek your advice on this.

Thank you.

Segmentation fault when calling create_input_vcf through API or CLI

Dear sbslee,
When I try to use the create_input_vcf function, the program crashes with 'Segmentation fault'. I have tried both the Python API and the CLI, with standard parameters and my own test BAM and reference FASTA. This also happens when calling the function in the GeT-RM WGS tutorial of your documentation. Is there a way to figure out what the actual underlying problem could be here? Many thanks!

>>> pypgx.api.utils.create_input_vcf(out_vcf, fasta, bams, assembly='GRCh37', genes=None, exclude=False, dir_path=None, max_depth=250)
Segmentation fault

$ pypgx create-input-vcf out.vcf.gz bwa06_1KGRef_PhiX/hs37d5_PhiX.fa blood_*_merged.mdup.bam
Segmentation fault

$ pypgx create-input-vcf grch37-variants.vcf.gz genome.fa grch37-bam/*.bam
Segmentation fault

Could not load index for bam list having for more than 1018 lines

Hi, I'm trying those 2 functions to calculate control and sample depth of coverage:

bamlist=$1
out_dir=$2

control_gene='VDR'

pypgx compute-control-statistics \
    $out_dir/control-statistcs-${control_gene}.zip \
    --gene VDR \
    --fn $bamlist \
    --assembly GRCh38    

pypgx prepare-depth-of-coverage \
    $out_dir/depth-of-coverage.zip \
    --fn $bamlist \
    --assembly GRCh38

I successfully ran it on a BAM list with around 250 samples, but when I tried it with a list of 1,050 samples, it failed with this message:

pysam.utils.SamtoolsError: 'samtools returned with error 1: stdout=, stderr=samtools depth: cannot load index for "file.bam"\n'

I tried shuffling the sample list, and it always fails at the 1018th sample, no matter which file is at that position. I'm using whole-genome sequencing files.

Could you please help me with this? I'm using branch 0.90-dev.

Thank you very much
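One thing worth checking (an assumption based on the failure at exactly the 1018th file, not a confirmed diagnosis): a common per-process open-file limit is 1024 descriptors, and after stdin/stdout/stderr and a few interpreter files, roughly 1018 remain for BAM and index handles. The limit can be inspected and raised from within Python:

```python
import resource  # POSIX only

# Inspect the current soft/hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)

# Raise the soft limit to the hard limit (same effect as `ulimit -n`).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

If the soft limit is indeed 1024, raising it (or running `ulimit -n` before pypgx) would be a quick way to test this hypothesis.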

can CRAM be used as input?

Hello,

Thank you for developing pypgx. Can CRAM be used as input in place of BAM?

EDIT: I am also interested in whether there are any processing steps applied to the CRAMs that I should be on the lookout for when evaluating whether these CRAMs are suitable as input for pypgx.

If the input sample name consists only of numbers, "ValueError: Different sample sets found" occurs when running compute-copy-number.

If the input sample name consists only of numbers, read_depth.data.samples stores the sample name as a string.

However, control_statistics.data.index stores the sample name as an integer.

This causes ValueError('Different sample sets found') and KeyError: "None of [Index(['1642'], dtype='object')] are in the [index]" when running compute-copy-number.

Is there any solution except attaching some string to the sample name?

Best regards,
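Until this is handled upstream, the dtype mismatch described above can be worked around by forcing string sample names when reading tabular data. A minimal pandas illustration (the column names here are made up for the example, not the actual PyPGx internals):

```python
import pandas as pd
from io import StringIO

csv = "sample,mean_depth\n1642,35.2\n"

# Default parsing turns the numeric-looking sample name into an integer...
stats = pd.read_csv(StringIO(csv), index_col=0)
print(stats.index.dtype)  # int64

# ...so force it to stay a string so it matches str-typed sample lists.
stats = pd.read_csv(StringIO(csv), index_col=0, dtype={"sample": str})
print("1642" in stats.index)  # True
```

The same idea applies on the caller's side: converting the index with `.astype(str)` before comparing sample sets avoids the mismatch without renaming samples.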

Update algorithm for whole gene duplication detection

It has come to my attention recently that a sample with gene duplication (i.e. Duplication2 for the CNV column) was called Indeterminate for CYP2A6, while it was clear that the sample should have been called *1/*2x2.

The sample had *2;*1;, *1;, *2:19-41354533-A-T:0.695;*1:19-41350664-A-T:1.0; for the columns Haplotype1, Haplotype2, and VariantData, respectively -- both AF=0.695 of *2 and AF=1.0 of *1 are evidence of the presence of gene duplication.

The issue is that, currently, PyPGx only compares the "best" alleles from each haplotype to determine the duplicated allele (i.e. *2 from Haplotype1 vs. *1 from Haplotype2); therefore, it does not see that *1 is actually present in both haplotypes. Had PyPGx accounted for the homozygosity of *1, it would have output *1/*2x2.

While this kind of situation is rare, it should still be dealt with properly.

The plan is to update the detection algorithm to account for homozygosity.
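The planned update can be sketched roughly as follows (a hypothetical helper for illustration, not PyPGx's actual implementation): when the evidence shows an allele on both haplotypes, the other haplotype's best allele is treated as the duplicated one.

```python
def call_duplication(hap1, hap2, af):
    """Sketch: pick the duplicated allele for a sample with one gene
    duplication, accounting for homozygosity instead of only comparing
    each haplotype's best allele.

    hap1/hap2: candidate star alleles per haplotype, best first.
    af: allele fraction of each allele's defining variant(s);
        1.0 means the allele is present on every gene copy.
    """
    best1, best2 = hap1[0], hap2[0]
    if best2 in hap1 and af.get(best2, 0) == 1.0:
        return f"{best2}/{best1}x2"  # best2 is homozygous; duplicate best1
    if best1 in hap2 and af.get(best1, 0) == 1.0:
        return f"{best1}/{best2}x2"
    return f"{best2}/{best1}x2"  # fall back to duplicating Haplotype1's best

# The CYP2A6 example above: *1 is homozygous (AF=1.0), so *2 is duplicated.
print(call_duplication(["*2", "*1"], ["*1"], {"*2": 0.695, "*1": 1.0}))  # *1/*2x2
```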

Speeding up VCF loading for large files

I found that the current method for loading and processing VCF files is quite memory-demanding and time-consuming, especially for cohort WGS. I'm thinking about using cyvcf for the region slicing and then loading the data into a pandas DataFrame; do you think that's appropriate for PyPGx?

About CNV: currently I cannot open the CNV definition files in pypgx/pypgx/api/cnv/GRCh38 because of the *.sav extension.

Originally posted by @NTNguyen13 in #32 (comment)

Update CNV labels

In the upcoming release (0.16.0), CNV labels will be updated so that they are:

  1. more systematic
  2. more consistent, both within and between genes
  3. more concise

Below I will list changes in labels for each gene.

CYP2A6

| Current Label | New Label |
| --- | --- |
| Normal | Normal |
| Deletion1Het | WholeDel1 |
| Deletion1Hom | WholeDel1Hom |
| Deletion2Het | WholeDel2 |
| Deletion2Hom | WholeDel2Hom |
| Deletion3Het | WholeDel3 |
| Duplication1 | WholeDup1 |
| Duplication2 | WholeDup2 |
| Duplication3 | WholeDup3 |
| Hybrid1 | Hybrid1 |
| Hybrid2 | Hybrid2 |
| Hybrid2Hom | Hybrid2Hom |
| Hybrid3 | Hybrid3 |
| Hybrid4 | Hybrid4 |
| Hybrid5 | Hybrid5 |
| Hybrid6 | Hybrid6 |
| N/A (added) | Hybrid7 |
| Tandem | Tandem1 |
| N/A (added) | Tandem2 |
| PseudogeneDeletion | ParalogWholeDel1 |
| PseudogeneDuplication | ParalogWholeDup1 |
| N/A (added) | Unknown1 |

CYP2B6

| Current Label | New Label |
| --- | --- |
| Normal | Normal |
| Duplication | WholeDup1 |
| Hybrid | Hybrid1 |
| N/A (added) | Tandem1 |
| N/A (added) | PartialDup1 |
| N/A (added) | PartialDup2 |
| N/A (added) | ParalogWholeDel1 |

CYP2D6

| Current Label | New Label |
| --- | --- |
| Normal | Normal |
| DeletionHet | WholeDel1 |
| DeletionHom | WholeDel1Hom |
| Duplication | WholeDup1 |
| Multiplication | WholeMultip1 |
| Tandem1A | Tandem1A |
| Tandem1B | Tandem1B |
| Tandem2A | Tandem2A |
| Tandem2B | Tandem2B |
| Tandem2C | Tandem2C |
| Tandem2F | Tandem2F |
| Tandem3 | Tandem3 |
| Tandem4 | Tandem4 |
| DeletionHet,Tandem1A | WholeDel1+Tandem1A |
| Duplication,Tandem1A | WholeDup1+Tandem1A |
| PseudogeneDeletion | ParalogPartialDel1 |
| N/A (added) | WholeDel1+Tandem3 |
| Unknown1 | Unknown1 |
| Unknown2 | Unknown2 |
| PseudogeneDownstreamDel | N/A (removed) |

CYP2E1

| Current Label | New Label |
| --- | --- |
| Normal | Normal |
| N/A (added) | WholeDel1 |
| Duplication1 | WholeDup1 |
| Duplication2 | WholeDup2 |
| PartialDuplicationHet | PartialDup1 |
| PartialDuplicationHom | PartialDup1Hom |
| Multiplication | WholeMultip1 |
| Multiplication2 | WholeMultip2 |
| N/A (added) | WholeDup1+PartialDup1 |

CYP4F2

| Current Label | New Label |
| --- | --- |
| Normal | Normal |
| DeletionHet | WholeDel1 |

G6PD

| Current Label | New Label |
| --- | --- |
| Female | Female |
| Male | Male |

GSTM1

| Current Label | New Label |
| --- | --- |
| Normal | Normal |
| DeletionHet | WholeDel1 |
| DeletionHom | WholeDel1Hom |
| Normal,Deletion2 | WholeDel2 |
| Duplication | WholeDup1 |
| UpstreamDeletionHet | NoncodingDel1 |
| DeletionHet,UpstreamDeletionHet | WholeDel1+NoncodingDel1 |
| PartialDuplication | PartialDup1 |
| DeletionHet,Deletion2 | WholeDel1+WholeDel2 |

GSTT1

| Current Label | New Label |
| --- | --- |
| Normal | Normal |
| DeletionHet | WholeDel1 |
| DeletionHom | WholeDel1Hom |

SLC22A2

| Current Label | New Label |
| --- | --- |
| Normal | Normal |
| Intron9Deletion | NoncodingDel1 |
| N/A (added) | NoncodingDel1Hom |
| Exon11Deletion | PartialDel1 |
| Intron9Deletion,Exon11Deletion | NoncodingDel1+PartialDel1 |
| PartialDuplication | PartialDup1 |

SULT1A1

| Current Label | New Label |
| --- | --- |
| Normal | Normal |
| DeletionHet | WholeDel1 |
| DeletionHom | WholeDel1Hom |
| Duplication | WholeDup1 |
| Multiplication1 | WholeMultip1 |
| Multiplication2 | WholeMultip2 |
| Unknown1 | Unknown1 |
| N/A (added) | Unknown2 |
| N/A (added) | Unknown3 |
| N/A (added) | Unknown4 |

UGT1A4

| Current Label | New Label |
| --- | --- |
| Normal | Normal |
| Intron1DeletionA | NoncodingDel1 |
| N/A (added) | NoncodingDel1Hom |
| Intron1DeletionB | NoncodingDel2 |
| Intron1PartialDup | NoncodingDup1 |

UGT2B15

| Current Label | New Label |
| --- | --- |
| Normal | Normal |
| Deletion | WholeDel1 |
| Deletion2 | WholeDel2 |
| Duplication | WholeDup1 |
| PartialDeletion1 | PartialDel1 |
| PartialDeletion2 | PartialDel2 |
| PartialDeletion3 | PartialDel3 |
| PartialDuplication | PartialDup1 |
| N/A (added) | PartialDup2 |

UGT2B17

| Current Label | New Label |
| --- | --- |
| Normal,Normal | Normal |
| Normal,Deletion | WholeDel1 |
| Deletion,Deletion | WholeDel1Hom |
| N/A (added) | PartialDel2 |
| Normal,PartialDeletion3 | PartialDel3 |
| Deletion,PartialDeletion1 | WholeDel1+PartialDel1 |
| Deletion,PartialDeletion2 | WholeDel1+PartialDel2 |
| Deletion,PartialDeletion3 | WholeDel1+PartialDel3 |

Running Pypgx with Phased variants and Predicted CNV

Hi, I have read the manual and the code, and I found that PyPGx always needs to run Beagle to phase the variants before genotyping, with CNV detected from read depth. However, in some cases our variants are phased by another workflow, and so are our CNVs! How can I input a phased VCF and predicted CNVs (say, in some common format) into predict_allele and call_genotypes?

In the past, when I used Stargazer, I saw in the code that it detects the GT separator ('|' or '/') before doing phasing, but PyPGx always unphases all variants first, then phases them, and finally consolidates both files. I tried to tinker with the new PyPGx code, but it turned out to be harder than I thought, and I'm not sure I understand everything.

Thank you very much.
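For what it's worth, the separator check mentioned above is simple to express. A minimal sketch (not Stargazer's or PyPGx's actual code):

```python
def is_phased(gt):
    """Return True if a VCF GT value uses the phased separator '|'."""
    return "|" in gt

# A VCF could be treated as already phased only if every genotype is:
genotypes = ["0|1", "1/1", "0|0"]
print(all(is_phased(gt) for gt in genotypes))  # False: '1/1' is unphased
```

A real implementation would apply this check to every sample in every record before deciding whether to skip the Beagle step.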

Integration CNV/SV with PyPGx pipeline

Hi, as discussed in #32, I want to integrate CNV/SV calls from other tools with the PyPGx pipeline. Most of the time, those CNV/SV calls are stored as VCF files, while PyPGx needs them to be labelled as in cnv-table.csv. In most cases the labels are easy to understand, such as DeletionHet and DeletionHom. However, some are not as clear, e.g. Tandem2A (CYP2D6) or Intron1DeletionA (UGT1A4). For me, most names with Tandem and A, B, C are quite difficult to understand.
I have some questions:

  • How should I interpret the cases above?
  • What are the criteria/thresholds for labeling a CNV/SV as in cnv-table.csv? For example, if I have a DEL SV spanning half the length of the CYP2D6 gene, can I call it Deletion?
  • Once I have fully understood the labels and criteria, is it OK for me to add CNVs as an additional argument to the ngs and chip pipelines of PyPGx?

UserWarning: Trying to unpickle estimator SVC from version 0.24.2 when using version 0.24.1

I was just going through the tutorial and got the following "error messages". Is this something I should be worried about?

$ pypgx run-ngs-pipeline CYP2D6 grch37-CYP2D6-pipeline --variants "/home/ec2-user/PYPX/getrm-wgs-tutorial/grch37-variants.vcf.gz" --depth-of-coverage "/home/ec2-user/PYPX/getrm-wgs-tutorial/grch37-depth-of-coverage.zip" --control-statistics "/home/ec2-user/PYPX/getrm-wgs-tutorial/grch37-control-statistics-VDR.zip"
Saved VcfFrame[Imported] to: grch37-CYP2D6-pipeline/imported-variants.zip
Saved VcfFrame[Phased] to: grch37-CYP2D6-pipeline/phased-variants.zip
Saved VcfFrame[Consolidated] to: grch37-CYP2D6-pipeline/consolidated-variants.zip
Saved SampleTable[Alleles] to: grch37-CYP2D6-pipeline/alleles.zip
Saved CovFrame[ReadDepth] to: grch37-CYP2D6-pipeline/read-depth.zip
Saved CovFrame[CopyNumber] to: grch37-CYP2D6-pipeline/copy-number.zip
/home/ec2-user/anaconda3/lib/python3.8/site-packages/sklearn/base.py:310: UserWarning: Trying to unpickle estimator SVC from version 0.24.2 when using version 0.24.1. This might lead to breaking code or invalid results. Use at your own risk.
  warnings.warn(
/home/ec2-user/anaconda3/lib/python3.8/site-packages/sklearn/base.py:310: UserWarning: Trying to unpickle estimator LabelBinarizer from version 0.24.2 when using version 0.24.1. This might lead to breaking code or invalid results. Use at your own risk.
  warnings.warn(
/home/ec2-user/anaconda3/lib/python3.8/site-packages/sklearn/base.py:310: UserWarning: Trying to unpickle estimator OneVsRestClassifier from version 0.24.2 when using version 0.24.1. This might lead to breaking code or invalid results. Use at your own risk.
  warnings.warn(
Saved SampleTable[CNVCalls] to: grch37-CYP2D6-pipeline/cnv-calls.zip
Saved SampleTable[Genotypes] to: grch37-CYP2D6-pipeline/genotypes.zip
Saved SampleTable[Phenotypes] to: grch37-CYP2D6-pipeline/phenotypes.zip
Saved SampleTable[Results] to: grch37-CYP2D6-pipeline/results.zip
