cbg-ethz / bnpc

Bayesian non-parametric clustering (BnpC) of binary data with missing values and uneven error rates

License: MIT License

Python 100.00%

clustering binary-data mcmc split-merge genotyping

bnpc's Introduction

BnpC

Bayesian non-parametric clustering (BnpC) of binary data with missing values and uneven error rates.

BnpC is a novel non-parametric method to cluster individual cells into clones and infer their genotypes based on their noisy mutation profiles. BnpC employs a Chinese Restaurant Process prior to handle the unknown number of clonal populations. The model introduces a combination of Gibbs sampling, a modified non-conjugate split-merge move and Metropolis-Hastings updates to explore the joint posterior space of all parameters. Furthermore, it employs a novel estimator, which accounts for the shape of the posterior distribution, to predict the clones and genotypes.

The corresponding paper can be found in Bioinformatics.

Installation
Usage
Example data

Requirements

Python 3.X

Installation

Clone repository

First, download BnpC from GitHub and change to the directory:

git clone https://github.com/cbg-ethz/BnpC
cd BnpC

Create conda environment (optional)

First, create a new environment named "BnpC":

conda create --name BnpC python=3

Second, activate it:

conda activate BnpC

Install requirements

Use pip to install the requirements:

python -m pip install -r requirements.txt

Now you are ready to run BnpC!

Usage

The BnpC wrapper script run_BnpC.py can be run with the following shell command:

python run_BnpC.py <INPUT_DATA> [-t] [-FN] [-FP] [-FN_m] [-FN_sd] [-FP_m] [-FP_sd] [-dpa] [-pp] [-n] [-s] [-r] [-ls] [-b] [-smp] [-cup] [-e] [-sc] [--seed] [-o] [-v] [-np] [-tr] [-tc] [-td]

Input

BnpC requires a binary matrix as input, where each row corresponds to a mutation and each column to a cell. All matrix entries must be one of the following: 0|1|3/" ", where 0 indicates the absence of a mutation, 1 the presence, and 3 or an empty element a missing value.

Note

If your data is arranged in the transposed way (rows = cells, columns = mutations), use the -t argument.
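
As a minimal sketch of the value encoding, the following Python snippet builds a small matrix with the three allowed entries and serializes it as text. The comma separator and the absence of row or column labels are assumptions made for illustration; the files in example_data show the layout BnpC actually reads.

```python
import csv
import io

# Rows = mutations, columns = cells.
# 0 = mutation absent, 1 = mutation present, 3 = missing value.
matrix = [
    [1, 1, 0, 3],  # mutation 0 across cells 0..3
    [0, 1, 3, 0],  # mutation 1
    [3, 0, 0, 1],  # mutation 2
]

# Reject anything outside the allowed alphabet before running the clustering.
allowed = {0, 1, 3}
assert all(value in allowed for row in matrix for value in row)

# Serialize to CSV text (separator is an assumption, see lead-in).
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(matrix)
csv_text = buffer.getvalue()
print(csv_text)

# The opposite orientation (rows = cells), i.e. the case that needs -t.
transposed = [list(column) for column in zip(*matrix)]
```

Writing `csv_text` to a file and passing that path as the input data would be the next step.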

Arguments

Input Data Arguments

<str>, Path to the input data.
-t <flag>, If set, the input matrix is transposed.

Optional input arguments (for simulated data)

-tr <str>, Path to the mutation tree file (in .gv format) used for data generation.
-tc <str>, Path to the true clusters assignments to compare clustering methods.
-td <str>, Path to the true/raw data/genotypes.

Model Arguments

-FN <float>, Replace <float> with the fixed error rate for false negatives.
-FP <float>, Replace <float> with the fixed error rate for false positives.
-FN_m <float>, Replace <float> with the mean for the prior for the false negative rate.
-FN_sd <float>, Replace <float> with the standard deviation for the prior for the false negative rate.
-FP_m <float>, Replace <float> with the mean for the prior for the false positive rate.
-FP_sd <float>, Replace <float> with the standard deviation for the prior for the false positive rate.
-ap <float>, Alpha value of the Beta function used as prior for the concentration parameter of the CRP.
-pp <float> <float>, Beta function shape parameters used for the cluster parameter prior.

Note

If you run BnpC on panel data with only a few mutations or on error-free data, we recommend changing the -pp argument to a beta distribution closer to uniform, like -pp 0.75 0.75 or even -pp 1 1. Otherwise, BnpC will incorrectly report many singleton clusters.
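
To see what "closer to uniform" means, the sketch below evaluates the Beta density near the boundary and at the center for a few symmetric shape parameters. Shape values below 1 concentrate mass near 0 and 1, and Beta(1, 1) is exactly uniform. The pair (0.25, 0.25) is a hypothetical contrast case and not a documented BnpC setting.

```python
import math


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / norm


# (0.75, 0.75) and (1, 1) come from the note above; (0.25, 0.25) is a
# hypothetical, more extreme shape added only for contrast.
for a, b in [(0.25, 0.25), (0.75, 0.75), (1.0, 1.0)]:
    edge = beta_pdf(0.01, a, b)
    center = beta_pdf(0.5, a, b)
    print(f"Beta({a}, {b}): edge/center density ratio = {edge / center:.2f}")
```

The ratio shrinks toward 1 as the shape parameters approach 1, which is the flattening the recommendation relies on.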

MCMC Arguments

-n <int>, Number of MCMC chains to run in parallel (1 chain per thread).
-s <int>, Number of MCMC steps.
-r <int>, Runtime in minutes. If set, the steps argument is overridden.
-ls <float>, Lugsail batch means estimator as convergence diagnostics [Vats and Flegal, 2018].
-b <float>, Ratio of MCMC steps discarded as burn-in.
-cup <float>, Probability of updating the CRP concentration parameter.
-eup <float>, Probability of updating the error rates in an MCMC step.
-smp <float>, Probability of doing a split/merge step instead of Gibbs sampling.
-sms <int>, Number of intermediate, restricted Gibbs steps in the split-merge move.
-smr <float, float>, Ratio of splits/merges in the split merge move.
-e +<str>, Estimator(s) for inference. If more than one, separate by space. Options = posterior|ML|MAP.
-sc <flag>, If set, infer a result for each chain individually (instead of from all chains together).
--seed <int>, Seed used for random number generation.

Output Arguments

-o <str>, Path to an output directory.
-np <flag>, If set, no plots are generated.
-v <int>, Stdout verbosity level. Options = 0|1|2.

Example data

Let's use the toy dataset in the example_data folder (data.csv) to understand what the different arguments do. First go to the folder and activate the environment:

    cd /path/to/crp_clustering
    conda activate environment_name

BnpC can run in three different settings:

Number of steps. Runs for the given number of MCMC steps. Argument: -s
Running time limit. The elapsed time is tracked after every MCMC step, and the method stops once the given time limit is reached. Argument: -r
Lugsail for convergence diagnosis. The chain is terminated if the estimator undercuts a threshold defined by a significance level of 0.05 and a user-defined float in [0, 1], comparable to the half-width of the confidence interval in the sample size calculation for a one-sample t-test. Reasonable values = 0.1, 0.2, 0.3. Argument: -ls
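
The first two settings can be sketched as a simple loop. This is an illustration of the stopping logic only, not BnpC's code: `run_chain` and `step_fn` are hypothetical names, and the lugsail criterion is left out because it depends on the sampled values.

```python
import time


def run_chain(step_fn, max_steps=1000, max_minutes=None):
    """Call step_fn until a step or time limit is reached; return steps done.

    Hypothetical helper: a runtime limit, if given, takes precedence over
    the step count, as described for the -r argument.
    """
    deadline = None
    if max_minutes is not None:
        deadline = time.monotonic() + 60 * max_minutes
    steps_done = 0
    while True:
        if deadline is not None:
            # Time-limited run: check the clock before every step.
            if time.monotonic() >= deadline:
                break
        elif steps_done >= max_steps:
            # Step-limited run.
            break
        step_fn()
        steps_done += 1
    return steps_done


# Step-limited run with a no-op step function.
print(run_chain(lambda: None, max_steps=50))
```

With a time limit, the number of completed steps depends on how expensive each step is, so runs on different machines can yield different numbers of posterior samples.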

The simplest way to run BnpC is to leave every argument at its default, so that only the path to the data needs to be given. In this case BnpC runs in setting 1 (number of steps).

python run_BnpC.py example_data/data.csv

If the error rates are known for a particular sequencing dataset (e.g. FP = 0.0001 and FN = 0.3), one can run BnpC with fixed error rates:

python run_BnpC.py example_data/data.csv -FP 0.0001 -FN 0.3
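
To illustrate how fixed false positive and false negative rates enter a model of noisy binary data, here is a minimal sketch of a per-cell log-likelihood. It assumes a hard binary genotype per clone and skips missing entries; this is a common simplified error model and not necessarily the exact likelihood BnpC implements. The names `log_likelihood`, `cell`, `clone_a` and `clone_b` are hypothetical.

```python
import math

MISSING = 3


def log_likelihood(observed, genotype, fp, fn):
    """Log-probability of one cell's observed profile given a binary genotype.

    Assumed error model (a sketch, not BnpC's code):
      P(obs=1 | geno=0) = fp,   P(obs=0 | geno=0) = 1 - fp
      P(obs=0 | geno=1) = fn,   P(obs=1 | geno=1) = 1 - fn
    Missing entries are skipped, so they contribute nothing.
    """
    total = 0.0
    for obs, geno in zip(observed, genotype):
        if obs == MISSING:
            continue
        if geno == 1:
            prob = 1 - fn if obs == 1 else fn
        else:
            prob = fp if obs == 1 else 1 - fp
        total += math.log(prob)
    return total


cell = [1, 0, MISSING, 1]
clone_a = [1, 1, 0, 1]  # explains the cell with one false negative
clone_b = [0, 0, 0, 0]  # would require two false positives
ll_a = log_likelihood(cell, clone_a, fp=0.0001, fn=0.3)
ll_b = log_likelihood(cell, clone_b, fp=0.0001, fn=0.3)
print(ll_a, ll_b)
```

Because the two rates are so uneven, a missed mutation costs far less than a spurious one, so `clone_a` is strongly preferred here.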

If, on the other hand, the error rates are not known, one can either leave them unset as in the first case or, given some intuition, specify the mean and standard deviation of their priors so that the method learns them:

python run_BnpC.py example_data/data.csv -FP_m 0.0001 -FN_m 0.3 -FP_sd 0.000001 -FN_sd 0.05

Additional MCMC arguments can be employed to allow faster convergence. Among other options:

Reduce the burn-in to include more posterior samples in the estimation. Example: -b 0.2 discards 20 % of the total MCMC steps.
Adapt the split-merge probability to better explore the posterior landscape. Example: -smp 0.33, so that on average 1 out of every 3 steps is a split-merge move.
Adjust the Dirichlet Process alpha, which controls the probability of starting a new cluster. Example: -dpa 10. Increasing the value leads to a larger probability of starting a new cluster in the cell assignment step.
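
The effect of alpha can be made concrete with the standard Chinese Restaurant Process prior: with n cells already assigned, the next cell opens a new cluster with probability alpha / (n + alpha) and joins an existing cluster k with probability n_k / (n + alpha). The sketch below shows the prior term only; in a Gibbs step it would be combined with the data likelihood. The cluster sizes are hypothetical.

```python
def crp_assignment_probs(cluster_sizes, alpha):
    """Prior probabilities for assigning the next cell under a CRP.

    Returns (probs_existing, prob_new): joining existing cluster k has
    probability n_k / (n + alpha); opening a new one has alpha / (n + alpha).
    """
    n = sum(cluster_sizes)
    denom = n + alpha
    return [size / denom for size in cluster_sizes], alpha / denom


sizes = [50, 30, 20]  # hypothetical current cluster sizes
for alpha in (1, 10):
    _, p_new = crp_assignment_probs(sizes, alpha)
    print(f"alpha = {alpha}: P(new cluster) = {p_new:.3f}")
```

A larger alpha therefore favors more clusters a priori, which also means it can aggravate the singleton-cluster problem on sparse data.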

bnpc's Issues

how to deal with "too many missing data"

Hi BnpC support

According to your published paper, your tool is the best at dealing with missing data (more than 20% missing).

Now, we have targeted-sequencing snDNA samples from FFPE, which have lots of missing data due to the random fragmentation of DNA. (By the way, we cannot use regular filter thresholds to reduce the % of missing data; otherwise, we will not get any cells.) With such a large % of missing data (~50%), we get too many singleton clusters (which certainly does not make sense). What are the best parameter settings to avoid such an issue?

how can I generate tree file

According to the README, BnpC can generate an output tree file (did I misunderstand?),
but when I run this:
run_BnpC.py snp.input.txt -tr test.gv

I get the error below; it seems that BnpC is trying to read "test.gv":
Traceback (most recent call last):
  File "BnpC/run_BnpC.py", line 296, in <module>
    main(args)
  File "BnpC/run_BnpC.py", line 291, in main
    generate_output(args, results, data, data_names)
  File "BnpC/run_BnpC.py", line 232, in generate_output
    io.save_tree_plots(
  File "BnpC/libs/dpmmIO.py", line 226, in save_tree_plots
    pl.color_tree_nodes(
  File "BnpC/libs/plotting.py", line 324, in color_tree_nodes
    with open(tree_file, 'r') as f_in:
FileNotFoundError: [Errno 2] No such file or directory: 'test.gv'

Heatmap with the phylogeny tree

Dear BnpC team,

Thank you so much for the quick fix. I am wondering if there is a way to also display the phylogeny tree with the heatmap (genoCluster_posterior_mean.png) or to export the tree in Newick format. Thanks a lot in advance.

Monica.

Extraction of row order from genoCluster_posterior_mean.png?

Hi Nico,

Thank you so much for your reply and help with the interpretation.
I just realized that the row order shown in the cluster_posterior_mean image does not match the row order of the input data. Is there a way to extract this information? I would like to take a closer look at the mutations behind the formation of each clone (clustering), and I think that is only possible by tracing the rows back to the input data.

Thank you very much again and I look forward to your reply.

Best,
Monica.

How to determine mutations responsible for different clones?

Hello BnpC team,

Thank you so much for the wonderful tool. I have a small question. I am wondering how I could determine the mutations that are solely responsible for the formation of the respective clones. This would help us in further analysis.

Thanks again and I look forward to your reply.

Monica.

how can I map my cells to an previously generated cluster?

Hi
There is a publication with clusters generated by BnpC, and each cluster is associated with a function. How can I map my data to such existing clusters? (I don't want to use my data to generate any new clusters because I want to map my data to their functions.)

This is equivalent to the situation below:
I have 11 samples, and I want to use 10 of them to generate the clusters, and then map the 11th sample to the clusters generated from those 10 samples.

If the current version of BnpC does not support such a request, could you let me know which part of your code needs to be changed?

IndexError: list index out of range

Hello,

I tried to run the tool with Python 3.9 using the command python run_BnpC.py example_data/data.csv,
which returned the following error. I would appreciate any help with running this tool. Thank you very much and I look forward to your reply.

Traceback (most recent call last):
  File "/Volumes/Monica_data_folder/bnpc_software/BnpC/run_BnpC.py", line 295, in <module>
    main(args)
  File "/Volumes/Monica_data_folder/bnpc_software/BnpC/run_BnpC.py", line 287, in main
    results = mcmc.get_results()
  File "/Volumes/Monica_data_folder/bnpc_software/BnpC/libs/MCMC.py", line 65, in get_results
    if not 'burn_in' in results[0]:
IndexError: list index out of range

genoCluster_posterior_mean.png does not display the correct row ids

Hi Bnpc support,

I used row IDs and cell IDs in the input, and the resulting heatmap "genoCluster_posterior_mean.png" shows the row IDs in exactly the same order as the input. However, the image makes it clear that the rows themselves were sorted, so their order has changed. In other words, the rows are reordered, but the row ID labels are still displayed in the input order. Is there something that I am doing wrong? Do you happen to have a solution for this?

Thanks a lot in advance.

Best,
Monica.

Understanding the output of BnpC

Hi,
I am using BnpC to cluster a single-cell dataset. After running it, I want to know the cluster number of each cell, which I believe is provided in the "assignment.txt" file, and the mutations of each cluster.
Can you please help me understand which file contains the mutations of each cluster? Also, please confirm whether "assignment.txt" is the file that indicates the cluster number of each cell. I used the following command to run BnpC:

python ../BnpC/run_BnpC.py filename.tsv -pp 0.75 0.75 -o ./bnpc_results/

Thank you,
Ritu

Support for heterozygous/homozygous genotype categories?

Greetings,

I found BnpC while testing out infSCITE and think it might help us with deciphering our SCS data. I have an initial question. According to the docs: "All matrix entries must be of the following: 0|1|3/" ", where 0 indicates the absence of a mutation, 1 the presence, and a 3 or empty element a missing value."

I'm interested in running our categorical genotype data, which is very similar to your input requirements:

Our input	BnpC input
0-reference	0 indicates the absence of a mutation
1-heterozygous mutation	1 the presence
2-homozygous mutation	1 the presence
3-unknown	3 or empty element a missing value

Is there any facility for, or are there plans to include, a heterozygous/homozygous genotype distinction in BnpC?

Thanks!
JP

heatmap genoCluster_posterior_mean.png loses track of original mutation and cell ID?

Hi BnpC support
genoCluster_posterior_mean.png only shows sequential numbers on the x-axis and y-axis. How can I trace these back to the original cell IDs and mutation IDs? In other words, do the sequential numbers follow the order of the mutations and cells in the input?
Thanks
Xianfeng Chen

Help with the interpretation of heatmap