yoseflab / cassiopeia

A Package for Cas9-Enabled Single Cell Lineage Tracing Tree Reconstruction

Home Page: https://cassiopeia-lineage.readthedocs.io/en/latest/

License: MIT License

Python 97.74% Makefile 0.03% Cython 1.19% C++ 1.03% Shell 0.01%

single-cell-rna-seq lineage-tracing single-cell-lineage-tracing computational-phylogenetics computational-biology crispr-cas9

cassiopeia's Issues

Key error when calling cassiopeia.pp.call_lineage_groups

Hello, I am following the preprocessing notebook in the refactor branch with my own GESTALT data. I believe it has been working well (or at least has not errored out; I haven't had a chance to dig into the various outputs to see if things make sense with my experiment), up until the last step in the notebook. When I ran allele_table = cassiopeia.pp.call_lineage_groups(umi_table, output_dir), I got the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897             try:
-> 2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Sample'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.6/site-packages/pandas/core/generic.py in _set_item(self, key, value)
   3575         try:
-> 3576             loc = self._info_axis.get_loc(key)
   3577         except KeyError:

~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2899             except KeyError as err:
-> 2900                 raise KeyError(key) from err
   2901 

KeyError: 'Sample'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-10-2b5fa699a3d8> in <module>
----> 1 allele_table = cassiopeia.pp.call_lineage_groups(umi_table, output_dir)

~/Desktop/code/Cassiopeia/cassiopeia/preprocess/pipeline.py in call_lineage_groups(input_df, output_directory, min_umi_per_cell, min_avg_reads_per_umi, min_cluster_prop, min_intbc_thresh, inter_doublet_threshold, kinship_thresh, verbose, plot)
    997     )
    998 
--> 999     allele_table = l_utils.filtered_lineage_group_to_allele_table(filtered_lgs)
   1000 
   1001     if verbose:

~/Desktop/code/Cassiopeia/cassiopeia/preprocess/lineage_utils.py in filtered_lineage_group_to_allele_table(filtered_lgs)
    408 
    409     final_df["Sample"] = final_df.apply(
--> 410         lambda x: x.cellBC.split(".")[0], axis=1
    411     )
    412 

~/.local/lib/python3.6/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3042         else:
   3043             # set column
-> 3044             self._set_item(key, value)
   3045 
   3046     def _setitem_slice(self, key: slice, value):

~/.local/lib/python3.6/site-packages/pandas/core/frame.py in _set_item(self, key, value)
   3119         self._ensure_valid_index(value)
   3120         value = self._sanitize_column(key, value)
-> 3121         NDFrame._set_item(self, key, value)
   3122 
   3123         # check if we are modifying a copy

~/.local/lib/python3.6/site-packages/pandas/core/generic.py in _set_item(self, key, value)
   3577         except KeyError:
   3578             # This item wasn't present, just insert at end
-> 3579             self._mgr.insert(len(self._info_axis), key, value)
   3580             return
   3581 

~/.local/lib/python3.6/site-packages/pandas/core/internals/managers.py in insert(self, loc, item, value, allow_duplicates)
   1196             value = _safe_reshape(value, (1,) + value.shape)
   1197 
-> 1198         block = make_block(values=value, ndim=self.ndim, placement=slice(loc, loc + 1))
   1199 
   1200         for blkno, count in _fast_count_smallints(self.blknos[loc:]):

~/.local/lib/python3.6/site-packages/pandas/core/internals/blocks.py in make_block(values, placement, klass, ndim, dtype)
   2742         values = DatetimeArray._simple_new(values, dtype=dtype)
   2743 
-> 2744     return klass(values, ndim=ndim, placement=placement)
   2745 
   2746 

~/.local/lib/python3.6/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
   2398             values = np.array(values, dtype=object)
   2399 
-> 2400         super().__init__(values, ndim=ndim, placement=placement)
   2401 
   2402     @property

~/.local/lib/python3.6/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
    129         if self._validate_ndim and self.ndim and len(self.mgr_locs) != len(self.values):
    130             raise ValueError(
--> 131                 f"Wrong number of items passed {len(self.values)}, "
    132                 f"placement implies {len(self.mgr_locs)}"
    133             )

ValueError: Wrong number of items passed 9, placement implies 1

Any insights as to what might have happened? I haven't changed any of the options or arguments for any of the steps in the preprocessing notebook. The only difference is that I started with a different bam file. Thanks for any help!
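
One plausible explanation, not confirmed in this thread: if the filtered table has no rows, `DataFrame.apply(..., axis=1)` can return a DataFrame instead of a Series, and assigning that to the single column "Sample" fails. The sketch below shows a vectorized way to derive the prefix before the first "." in cellBC that also tolerates an empty table. The helper name `add_sample_column` and the barcodes are made up for illustration; this is not Cassiopeia's implementation.

```python
import pandas as pd


def add_sample_column(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a 'Sample' column taken from the cellBC prefix."""
    out = df.copy()
    # Vectorized string split: yields a Series even when the frame has no rows.
    out["Sample"] = out["cellBC"].astype(str).str.split(".").str[0]
    return out


# Hypothetical barcodes of the form "<sample>.<barcode>".
cells = pd.DataFrame({"cellBC": ["S1.AAAC-1", "S2.TTTG-1"], "UMI": [3, 5]})
print(add_sample_column(cells)["Sample"].tolist())

# An empty table no longer raises; it simply gains an empty column.
empty = pd.DataFrame(columns=["cellBC", "UMI"])
print(len(add_sample_column(empty)))
```

If the table really is empty at this point, the underlying question is why every lineage group was filtered out, which this sketch does not address.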

pip in Makefile

In the Makefile, the install rule doesn't use the pip variable.

reconstruct-lineage error: IndexError: index 2 is out of bounds for axis 0 with size 2

Hello, I'm trying to run reconstruct_lineages.ipynb with the test data data/test_at.txt as the allele table. The allele table seemed to be output fine, but when I ran reconstruct-lineage I got an error. What is the exact input format for the reconstruct-lineage step? The first 3 lines of my test_lg4_character_matrix.txt are shown below, followed by the error I encountered. Thank you.

cellBC	r0	r1	r2	r3	r4	r5	r6	r7	r8	r9	r10	r11	r12	r13	r14	r15	r16	r17	r18	r19	r20	r21	r22	r23	r24	r25	r26	r27	r28	r29	r30	r31	r32	r33	r34	r35	r36	r37	r38	r39	r40	r41	r42	r43	r44	r45	r46	r47	r48	r49	r50

IVLT-2B_00.AAACCTGGTCTGGTCG-1	0	0	0	2	2	0	2	0	0	0	2	2	0	2	0	0	0	0	2	2	0	0	2	2	0	2	0	2	0	0	0	2	0	0	0	0	2	0	0	-	-	-	-	-	-	-	-	-	-	-	-
IVLT-2B_00.AAACCTGTCCACTCCA-1	0	0	0	2	0	0	2	0	2	-	-	-	0	2	0	2	0	0	2	2	0	0	2	2	2	0	0	2	0	2	-	-	-	2	2	0	0	2	2	0	0	0	-	-	-	-	-	-	-	-	-

The command and error message

reconstruct-lineage test_lg4_character_matrix.txt test_lg4_tree.pkl --hybrid
running algorithm...
Using 1 threads, 48 available.
Sending off Target Sets: 1
Started new thread for: 0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0 (num targets = 10) , pid = ea70d39f9eb88a0a5b0739128f47b4d5
(2, 51)
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/isshamie/software/Cassiopeia/cassiopeia/TreeSolver/lineage_solver/lineage_solver.py", line 421, in wrapped
    return func(*args, **kwargs)
  File "/home/isshamie/software/Cassiopeia/cassiopeia/TreeSolver/lineage_solver/lineage_solver.py", line 445, in prune_unique_alleles
    counts,
  File "/home/isshamie/software/Cassiopeia/cassiopeia/TreeSolver/lineage_solver/lineage_solver.py", line 443, in <lambda>
    if len(np.where(x[1] == 1)) > 0
IndexError: index 2 is out of bounds for axis 0 with size 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/isshamie/software/Cassiopeia/cassiopeia/TreeSolver/lineage_solver/lineage_solver.py", line 421, in wrapped
    return func(*args, **kwargs)
  File "/home/isshamie/software/Cassiopeia/cassiopeia/TreeSolver/lineage_solver/lineage_solver.py", line 572, in find_good_gurobi_subgraph
    proot, targets_pruned, pruned_to_orig = prune_unique_alleles(root, targets)
  File "/home/isshamie/software/Cassiopeia/cassiopeia/TreeSolver/lineage_solver/lineage_solver.py", line 423, in wrapped
    traceback_str = traceback.format_exc(e)
  File "/data/isshamie/software/anaconda3/envs/cass/lib/python3.6/traceback.py", line 167, in format_exc
    return "".join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
  File "/data/isshamie/software/anaconda3/envs/cass/lib/python3.6/traceback.py", line 121, in format_exception
    type(value), value, tb, limit=limit).format(chain=chain))
  File "/data/isshamie/software/anaconda3/envs/cass/lib/python3.6/traceback.py", line 509, in __init__
    capture_locals=capture_locals)
  File "/data/isshamie/software/anaconda3/envs/cass/lib/python3.6/traceback.py", line 338, in extract
    if limit >= 0:
TypeError: '>=' not supported between instances of 'IndexError' and 'int'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/isshamie/software/anaconda3/envs/cass/lib/python3.6/concurrent/futures/process.py", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/isshamie/software/Cassiopeia/cassiopeia/TreeSolver/lineage_solver/lineage_solver.py", line 423, in wrapped
    traceback_str = traceback.format_exc(e)
  File "/data/isshamie/software/anaconda3/envs/cass/lib/python3.6/traceback.py", line 167, in format_exc
    return "".join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
  File "/data/isshamie/software/anaconda3/envs/cass/lib/python3.6/traceback.py", line 121, in format_exception
    type(value), value, tb, limit=limit).format(chain=chain))
  File "/data/isshamie/software/anaconda3/envs/cass/lib/python3.6/traceback.py", line 498, in __init__
    _seen=_seen)
  File "/data/isshamie/software/anaconda3/envs/cass/lib/python3.6/traceback.py", line 509, in __init__
    capture_locals=capture_locals)
  File "/data/isshamie/software/anaconda3/envs/cass/lib/python3.6/traceback.py", line 338, in extract
    if limit >= 0:
TypeError: '>=' not supported between instances of 'TypeError' and 'int'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/isshamie/software/anaconda3/envs/cass/bin/reconstruct-lineage", line 33, in <module>
    sys.exit(load_entry_point('cassiopeia-lineage', 'console_scripts', 'reconstruct-lineage')())
  File "/home/isshamie/software/Cassiopeia/cassiopeia/TreeSolver/reconstruct_tree.py", line 245, in main
    lookahead_depth=lookahead_depth,
  File "/home/isshamie/software/Cassiopeia/cassiopeia/TreeSolver/lineage_solver/lineage_solver.py", line 228, in solve_lineage_instance
    results, r, pid, graph_sizes = future.result()
  File "/data/isshamie/software/anaconda3/envs/cass/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/data/isshamie/software/anaconda3/envs/cass/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
TypeError: '>=' not supported between instances of 'TypeError' and 'int'
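
The final TypeError in the traceback above hides the original IndexError. The first positional parameter of `traceback.format_exc` is `limit`, so calling `traceback.format_exc(e)` makes the standard library compare an exception object with an integer. A minimal sketch of a wrapper that keeps the original traceback text follows. The decorator name `with_traceback` and the toy function are illustrative assumptions, not the code in lineage_solver.py.

```python
import functools
import traceback


def with_traceback(func):
    """Re-raise worker errors with the original traceback text attached."""

    @functools.wraps(func)
    def wrapped(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            # format_exc() takes no exception argument; it formats the
            # exception that is currently being handled.
            traceback_str = traceback.format_exc()
            raise RuntimeError(traceback_str) from e

    return wrapped


@with_traceback
def pick_third(values):
    return values[2]


try:
    pick_third([0, 1])
except RuntimeError as err:
    message = str(err)
    print(message.splitlines()[-1])
```

With this change the worker process would report the IndexError itself, which is the error that actually needs debugging.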

Error in resolve_umi_sequence

Hello,

I was following the preprocessing tutorial, and when I get to resolve_umi_sequence the following error is thrown:
__init__() got an unexpected keyword argument 'basey'
I believe this is caused by the deprecation of this parameter in newer matplotlib versions.

Best,
Chang
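
For context, Matplotlib 3.3 renamed the log-scale keywords `basex`/`basey` to `base`, and later releases removed the old names. Below is a small sketch of choosing the keyword from a version string. The helper `log_scale_kwargs` is hypothetical, and it is an assumption that the failing call is a log-scale call such as `set_yscale`.

```python
def log_scale_kwargs(matplotlib_version: str, base: float = 10) -> dict:
    """Pick the log-base keyword accepted by a given Matplotlib version."""
    major, minor = (int(part) for part in matplotlib_version.split(".")[:2])
    # Matplotlib 3.3 renamed basex/basey to base.
    key = "base" if (major, minor) >= (3, 3) else "basey"
    return {key: base}


print(log_scale_kwargs("3.5.1"))
print(log_scale_kwargs("3.1.0"))
```

With Matplotlib installed, this could be used as `ax.set_yscale("log", **log_scale_kwargs(matplotlib.__version__))`.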

Triplets Correct and RF functions both mutate the underlying tree

From the following code snippet, we see that even though the cached edges do not change, the underlying backend networkx object has changed, with the singleton edge at the top of the tree being collapsed. Hence we believe that the behavior of the collapse_unifurcations function that is called on the (supposed) copy of the trees passed into triplets_correct and robinson_foulds somehow persists.

import pickle as pic

import cassiopeia as cas

simulated_tree_dir = "/data/yosef2/users/richardz/projects/CassiopeiaV2-Reproducibility/topologies/exponential_plus_c/400cells/no_fit/"
reconstructed_tree_dir = '/data/yosef2/users/richardz/projects/CassiopeiaV2-Reproducibility/reconstructed/exponential_plus_c/400cells/no_priors/no_fit/char40/'

ind = 0
tree = pic.load(open(f'{simulated_tree_dir}/topology{ind}.pkl', 'rb'))
print(len(tree.edges))
print(tree._CassiopeiaTree__network.number_of_edges())
print(tree.root, tree.children(tree.root))

recon_tree = cas.data.CassiopeiaTree(tree=f'{reconstructed_tree_dir}/greedy/recon{ind}')
triplets = cas.critique.triplets_correct(tree, recon_tree, min_triplets_at_depth = 50)[0]  # or rf = cas.critique.robinson_foulds(tree, recon_tree)
print(len(tree.edges))
print(tree._CassiopeiaTree__network.number_of_edges())
print(tree.root, tree.children(tree.root))
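
A toy illustration of the behavior one would expect from a comparison function: collapse unifurcations on a deep copy and leave the caller's tree untouched. The tree here is a plain dict of child lists rather than a CassiopeiaTree, and both function names are hypothetical stand-ins, not the library's internals.

```python
import copy
from typing import Dict, List


def collapse_unifurcations(children: Dict[str, List[str]], root: str) -> None:
    """Remove, in place, every non-root node that has exactly one child."""
    stack = [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        for i, kid in enumerate(kids):
            # Skip down chains of single-child nodes.
            while len(children.get(kid, [])) == 1:
                kid = children.pop(kid)[0]
            kids[i] = kid
        stack.extend(kids)


def count_edges(children: Dict[str, List[str]]) -> int:
    return sum(len(kids) for kids in children.values())


def edges_after_collapse(children: Dict[str, List[str]], root: str) -> int:
    """Work on a deep copy so the caller's tree is never mutated."""
    working = copy.deepcopy(children)
    collapse_unifurcations(working, root)
    return count_edges(working)


tree = {"root": ["x"], "x": ["y"], "y": ["a", "b"]}
print(edges_after_collapse(tree, "root"))  # edges in the collapsed copy
print(count_edges(tree))                   # the original is unchanged
```

A shallow copy would not be enough in this toy case, because the child lists themselves are modified in place.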

Deal with empty target site alleles appropriately

Some target sites end up getting an empty string '' as their allele. Still not entirely sure where/why this is happening, but the pipeline should deal with these more appropriately.
Currently, if a table has such empty-string target site entries, reading the table with pd.read_csv causes these entries to be replaced with a floating-point NaN, which causes problems when we try to use the target site indels to construct a string. Specifically, if we try to join the target sites of each row into a single allele string, like so:

molecule_table[['r1', 'r2', 'r3']].apply(lambda x: '_'.join(x), axis=1)

this will cause an error because all the arguments to join must be strings, but NaN is a float.

One (hacky?) way to get around this is to simply provide na_filter=False to pd.read_csv, which reads the empty strings literally. But it is unclear if this is the behavior we want. Or do we want to simply filter out all UMIs with missing target site indels?

Here is an example of a UMI that has a missing target site indel.

AAAAAACCGTCGTA_GTGGTCGGG_1      777.0   110M1D52M1D55M  0.0     0.0     AATCCAGCTAGCTGTGCAGCCGTGAGTCTCTGATATTCAACTGCAGTAATGCTACCTAGTACTCACGCTTTCCAAGTGCTTGGCGTCGCATCACGGTCCTTTGTACGCCGAAAATCGCCAGACAACTAAGCTACGGCACGCTGCCATGTTGGGTCATACCCAAAATCAGGTTACTCCTGGGCCGCACAAGTCATGGAGAAGTCGAGACTATTAATTCATGATTCCCAATCTGGACTTACTACATGTATACCCC   GTGGTCGGG       [111:1D][164:1D]AAAAAACCGTCGTA  CGTGAGTCTCTGAT  [111:1D]        [164:1D]                1.0

After discussing with @mattjones315, this behavior is mainly due to the low sequencing quality of this dataset, compounded by technical artifacts in local alignment (no alignment for the particular target site)
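
The two options discussed above can be compared on a tiny table. This is a sketch with a made-up one-row TSV whose column names follow the r1/r2/r3 convention; it does not settle which behavior the pipeline should adopt.

```python
import io

import pandas as pd

# One UMI whose third target site (r3) is an empty string.
tsv = "UMI\tr1\tr2\tr3\nAAAC\t[111:1D]\t[164:1D]\t\n"

default = pd.read_csv(io.StringIO(tsv), sep="\t")
literal = pd.read_csv(io.StringIO(tsv), sep="\t", na_filter=False)

# Default parsing turns the empty field into NaN.
print(default["r3"].isna().tolist())

# Option 1: na_filter=False keeps the empty string, so the join works.
alleles = literal[["r1", "r2", "r3"]].apply(lambda row: "_".join(row), axis=1)
print(alleles.tolist())

# Option 2: drop UMIs that are missing any target site.
complete = default.dropna(subset=["r1", "r2", "r3"])
print(len(complete))
```

Option 1 keeps the UMI but yields an allele string with a trailing underscore; option 2 discards the UMI entirely.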

results produced by ilp_solver

Hi,

I used the ILP solver on a dataset of around 20 taxa. The output from the ILP solver is a DiGraph. When I try to export it to Newick format, it sometimes succeeds and sometimes fails. Why would this happen?
Another question: when I have the output as a Newick string, some subtrees/groups of taxa appear more than once, and the iTOL view displays them randomly. How should I choose the best tree?

Thank you for your answer.

Optional Priors Usage in Reconstruction

Make an argument use_priors in CassiopeiaSolvers that dictates whether or not to use priors during reconstruction of a tree.

The current behavior is that if priors exist within a CassiopeiaTree, then they are used. This is a bit presumptuous of the user's intentions.

'cassiopeia' has no attribute 'pp'

Hi @mattjones315, I am trying out the preprocess.ipynb notebook you mentioned in issue #79 from the refactor branch on some of my recently acquired GESTALT data. I am getting the following error: AttributeError: module 'cassiopeia' has no attribute 'pp' when trying to run any command that contains cassiopeia.pp. I'm not sure why; I was able to import cassiopeia no problem into my jupyter notebook (and as an installation check, I confirmed that running reconstruct-lineage -h in the command line gave me usage details). I remember when I first started using your tool back in March I was able to (mostly) run through the process_fastq.ipynb notebook and cassiopeia was working fine. Was there a recent update to cassiopeia that would be throwing things off, or did I perhaps miss something in the install?

Support character matrix colorstrips in plotting

Currently, there is no option to plot indel heatmaps only using the character matrix. As of now, the full allele table is required. Extend the plotting functionality to support this use case. #159

understanding the character matrix

You mentioned that the character matrix:

This simple data structure is an N x M matrix, where we represent each of the N cells in a population by a vector of M characters that can take on a mutation. In the context of Cas9-based lineage tracers, each of these M characters is a specific cut-site that can take on one of several possible indels. The entry n_{i,j} represents the mutation observed in the i-th cell at the j-th cut-site. For simplicity, we abstract away actual identities and represent each unique mutation as an integer, so that these character matrices are filled with integers. Importantly, Cassiopeia represents missing data with the integer -1, though users can change this as long as they specify this to the CassiopeiaTree downstream.

I don't fully understand this structure. Does it mean that n_{ij} represents a particular mutation from cell i that occurs at the j-th cut site? What if a particular mutation spans several cut targets: will n_{ij} > 0 then hold for consecutive columns?

A more important question: I am now analyzing Cas9-based lineage tracing data from a different experimental design, and I have run the preprocessing using a different pipeline. Now, I have a (Cell ID, mutation) table. A given cell may have several independent mutations (like cuts at different targets), and a given mutation may be observed across several different cells. I wonder if it is more natural to just convert the data into a cell-by-mutation matrix, where each column is a different mutation, and the entry n_{ij} indicates whether a particular mutation is observed in a given cell or not. Can I just pass this matrix to your pipeline as the character matrix, and run it?

Will be very happy to discuss more :)
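
To make the quoted description concrete, here is a small sketch that encodes (cell, cut site, indel) observations as integer character vectors, with -1 for missing data. The function name, the toy indel strings, and the choice of 0 for an unedited site are assumptions for illustration; this is not Cassiopeia's own conversion code.

```python
from typing import Dict, List, Optional, Tuple

MISSING = -1  # missing-state indicator named in the quoted documentation


def build_character_matrix(
    records: List[Tuple[str, str, Optional[str]]],
    sites: List[str],
    uncut: str = "None",
) -> Dict[str, List[int]]:
    """Encode (cell, site, indel) observations as integer character vectors."""
    codes: Dict[str, Dict[str, int]] = {site: {} for site in sites}
    matrix: Dict[str, List[int]] = {}
    for cell, site, indel in records:
        row = matrix.setdefault(cell, [MISSING] * len(sites))
        if indel is None:
            continue  # no observation: keep the missing indicator
        if indel == uncut:
            state = 0  # assumption: 0 marks an unedited site
        else:
            # Each distinct indel at a site gets the next positive integer.
            site_codes = codes[site]
            state = site_codes.setdefault(indel, len(site_codes) + 1)
        row[sites.index(site)] = state
    return matrix


records = [
    ("cellA", "r1", "[111:1D]"),
    ("cellA", "r2", "None"),
    ("cellB", "r1", "[111:1D]"),
    ("cellB", "r2", "[164:2I]"),
    ("cellC", "r1", "[90:3D]"),
]
matrix = build_character_matrix(records, sites=["r1", "r2"])
for cell, states in matrix.items():
    print(cell, states)
```

In this encoding each column is a cut site and the integers distinguish indels at that site, which differs from a binary cell-by-mutation matrix where each column is one specific mutation.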

Cophenetic Distance

Implement utility to compute cophenetic distance for a tree. The cophenetic distance is the correlation between phylogenetic distance and some dissimilarity map between samples' character information.
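
A rough sketch of the correlation described above (often called the cophenetic correlation): compute pairwise path lengths between leaves and correlate them with a dissimilarity map. The parent-map tree representation, the function names, and the toy numbers are all assumptions made for this example; a real utility would operate on a CassiopeiaTree.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple

Parent = Dict[str, Tuple[str, float]]  # child -> (parent, branch length)


def distances_to_ancestors(node: str, parent: Parent) -> Dict[str, float]:
    """Distance from node to itself and to each of its ancestors."""
    out, dist = {node: 0.0}, 0.0
    while node in parent:
        node, length = parent[node]
        dist += length
        out[node] = dist
    return out


def tree_distance(u: str, v: str, parent: Parent) -> float:
    up_u = distances_to_ancestors(u, parent)
    up_v = distances_to_ancestors(v, parent)
    # With non-negative branch lengths, the closest shared ancestor is the LCA.
    return min(up_u[a] + up_v[a] for a in up_u if a in up_v)


def pearson(xs: List[float], ys: List[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cophenetic_correlation(
    leaves: List[str], parent: Parent, dissimilarity: Dict[Tuple[str, str], float]
) -> float:
    pairs = list(combinations(sorted(leaves), 2))
    tree_d = [tree_distance(u, v, parent) for u, v in pairs]
    diss = [dissimilarity[pair] for pair in pairs]
    return pearson(tree_d, diss)


parent = {"a": ("x", 1.0), "b": ("x", 1.0), "x": ("root", 1.0), "c": ("root", 2.0)}
dissimilarity = {("a", "b"): 0.2, ("a", "c"): 0.8, ("b", "c"): 0.8}
r = cophenetic_correlation(["a", "b", "c"], parent, dissimilarity)
print(round(r, 6))
```

The toy dissimilarities are an exact linear function of the tree distances (2, 4, 4), so the correlation here is 1 up to floating-point error.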

Cleanup pandas operations in preprocessing pipeline

The preprocessing pipeline performs a lot of grouped iterations of the following form:

for cellBC, group in moleculetable.groupby("cellBC"):
    ...

Wherever possible, we should update these instances to use groupby operations instead of iterating over each group, for the following reasons:

Pandas groupby operations are usually much faster than iterating over each group and performing operations separately for each group.
The code usually becomes inflated when iterating through each group manually, and thus increases code complexity.

This applies to several functions in pp.utilities, pp.UMI_utils, pp.lineage_utils, pp.doublet_utils.
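
As an illustration of the kind of rewrite proposed here, the sketch below computes two per-cell summaries once with a manual loop and once with a single groupby aggregation. The column names `UMI` and `readCount` and the toy values are assumptions for this example rather than a specific function from the pipeline.

```python
import pandas as pd

molecule_table = pd.DataFrame(
    {
        "cellBC": ["A", "A", "B", "B", "B"],
        "UMI": ["u1", "u2", "u3", "u4", "u5"],
        "readCount": [10, 4, 7, 1, 2],
    }
)

# Manual iteration over each group, as in the pattern quoted above.
loop_result = {}
for cellBC, group in molecule_table.groupby("cellBC"):
    loop_result[cellBC] = (group["UMI"].nunique(), group["readCount"].sum())

# Equivalent single groupby aggregation.
agg_result = molecule_table.groupby("cellBC").agg(
    n_umis=("UMI", "nunique"),
    total_reads=("readCount", "sum"),
)
print(agg_result)
```

Both versions give the same numbers; the aggregated form is shorter and leaves the per-group work to pandas.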

Hybrid Solver: logging for the bottom_solver does not work as intended

The bottom_solver doesn't see the logfile passed into the solve method in the HybridSolver

Cassiopeia for time-point data

Hello,
I wonder whether the Cassiopeia pipeline can be used for scRNA-seq data obtained at different time points during the process of cancer transformation. Please let me know.

Thank you,
Shikha

File names, import issues

File (module) names should be all lowercase (see https://www.python.org/dev/peps/pep-0008/#package-and-module-names). The current names are causing an issue with autosummary for nice class pages in the docs.
For an import like this

Cassiopeia/cassiopeia/data/CassiopeiaTree.py

Line 27 in 4a4ba1a

from cassiopeia.data import utilities

utilities should be a subpackage in that directory.
I would avoid cross package imports like this

Cassiopeia/cassiopeia/data/CassiopeiaTree.py

Line 34 in 4a4ba1a

from cassiopeia.solver import solver_utilities

as I believe some parts of solver depend on the tree
For an import like this

Cassiopeia/cassiopeia/solver/NeighborJoiningSolver.py

Lines 16 to 20 in 4a4ba1a

from cassiopeia.solver import (
    DistanceSolver,
    dissimilarity_functions,
    solver_utilities,
)

I feel it gets kind of weird because solver_utilities is a file in solver but the other two are imported into that namespace via the init file.
Imports like the first and third here

Cassiopeia/cassiopeia/simulator/CompleteBinarySimulator.py

Lines 11 to 13 in 4a4ba1a

from cassiopeia.data.CassiopeiaTree import CassiopeiaTree
from cassiopeia.mixins import TreeSimulatorError
from cassiopeia.simulator.TreeSimulator import TreeSimulator

should not have the file name in them, as these are imported via the init of that directory
I would rename the github repo to cassiopeia to remove the capital C so that the name reflects the python package name. All old urls will work still.

Conversion to newick in the reconstruct_lineages notebook

I was just trying out some examples, and in the 'Post-Process Tree & Add Redundant Leaves' section, the command:

g.newick = data_pipeline.convert_network_to_newick_format(g)

Gives me a

'TypeError: 'Cassiopeia_Tree' object is not iterable' error.

Looking at the tree class, it seems class function get_newick(self) calls this same function with the network, not the full object. Is the example notebook out of date? Thanks!

simulation data availability

Hi, dear developer,
First, I want to thank you for your tool and the shared code :).
I am wondering whether you could share a link to some simulation data, like the test_possorted_genome_bam.bam used in process_fastq.ipynb. After all, running Cell Ranger is time-consuming, and I believe some prepared data would help people get acquainted with the tool faster.

## first specify the home directory, and possorted genome bam
home_dir = "."
genome_bam = "data/test_possorted_genome_bam.bam"

I have another question about test_possorted_genome_bam.bam. It looks like this BAM is the output of Cell Ranger, but according to the Methods section of the paper there is another step between "Assigning cell barcode and UMIs to each read" and "Aligning to the target site reference":

Finding the consensus sequence for each UMI. To potentially increase the speed of consensus sequence finding, we attempt to trim reads to the same length for each UMI.

However, I could not find this step in Cassiopeia.
Please forgive me if I have misunderstood something :)

Best wishes
Guandong Shang

Global Alignment in Cassiopeia Preprocessing

We've found that the local alignment strategy used in our Cassiopeia preprocessing pipeline is not ideal for some technologies like GESTALT. We'd like to implement a global alignment option in cas.pp.align_sequences as was described in the original GESTALT paper (McKenna et al, 2016).
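
For readers unfamiliar with the distinction, here is a minimal sketch of global (Needleman-Wunsch) alignment scoring with a linear gap penalty. Unlike local alignment, every base of both sequences must be accounted for, so end gaps are penalized. The scoring values are arbitrary placeholders, not those of McKenna et al. or of cas.pp.align_sequences.

```python
from typing import List


def global_alignment_score(
    a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2
) -> int:
    """Needleman-Wunsch score with a linear gap penalty."""
    # Row 0: aligning an empty prefix of a against prefixes of b costs only gaps.
    prev: List[int] = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))  # 4 matches
print(global_alignment_score("ACGT", "AGT"))   # 3 matches and 1 gap
```

A practical implementation would also need affine gap penalties and a traceback to recover the indels, both of which are omitted here.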

Possible bug in `cas.pl.upload_and_export_itol`

It seems that the itol_utilities module requires a Path object. Passing a string results in an error:

itol.py:38, in Itol.add_file(self, file_path)
     34 def add_file(self, file_path: Path) -> None:
     35     """
     36     Add a file to be uploaded, tree or dataset
     37     """
---> 38     if not file_path.is_file():
     39         raise IOError('%s is not a file' % file_path)
     40     self.files.append(file_path)

AttributeError: 'str' object has no attribute 'is_file'

By passing a Path object this gets solved, e.g.

# in itol_utilities.py
# from pathlib import Path

itol_uploader = Itol()
itol_uploader.add_file(
    Path(os.path.join(temporary_directory, "tree_to_plot.tree"))
)

Thanks
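
A more general variant of the same idea is to coerce whatever the caller passes into a `Path` before checking it, so both strings and Path objects work. This is a sketch only; `as_existing_file` is a hypothetical helper, not part of the itol package or of Cassiopeia.

```python
import os
import tempfile
from pathlib import Path
from typing import Union


def as_existing_file(file_path: Union[str, "os.PathLike[str]"]) -> Path:
    """Coerce str or PathLike input to Path and check that the file exists."""
    path = Path(file_path)
    if not path.is_file():
        raise IOError("%s is not a file" % path)
    return path


with tempfile.TemporaryDirectory() as temporary_directory:
    # Build the path as a plain string, as in the failing call.
    tree_file = os.path.join(temporary_directory, "tree_to_plot.tree")
    with open(tree_file, "w") as handle:
        handle.write("(a,b);")
    checked = as_existing_file(tree_file)
    print(type(checked).__name__, checked.name)
```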

"to_newick" utility function does not transfer branch lengths

Show legend for colorstrip when using cassiopeia.pl.plot_matplotlib

Hi Matt,
I have a quick question about showing a legend for the colorstrip when using cassiopeia.pl.plot_matplotlib. For example, when the colorstrip represents the organ locations where the cells in the tree were collected, I would like the plot produced by cassiopeia.pl.plot_matplotlib to include a legend showing which color corresponds to which location. Is this possible? I tried something like:
cas.pl.plot_matplotlib(cas_tree, add_root=True, meta_data=['sampleID'], colorstrip_kwargs=dict(showlegend=True))
but it doesn't work and returns an error. Thanks very much!

Failed Numbaization of Distance Function is Not Caught

When creating a wrapper around a dissimilarity function from cas.solver.dissimilarity and applying it to a DistanceSolver's dissimilarity_function argument, I get a numba error.

In order to recreate the issue, I used the pip install command from the repo's readme, and ran this python script:

## From cass.py
from typing import Dict, List, Optional
import cassiopeia as cas
import pandas as pd
import pickle as pic
import os

gt_tree_dir = "/data/yosef2/users/richardz/projects/CassiopeiaV2-Reproducibility/trees/exponential_plus_c/400cells/no_fit/char40/"
gt_tree_file = os.path.join(gt_tree_dir, "tree0.pkl")
gt_tree = pic.load(open(gt_tree_file, "rb"))

cm_file = os.path.join(gt_tree_dir, f"cm0.txt")
cm = pd.read_table(cm_file, index_col = 0)

recon_tree = cas.data.CassiopeiaTree(
    character_matrix=cm, 
    missing_state_indicator = -1
    )

def my_distance_function(
    s1: List[int],
    s2: List[int],
    missing_state_indicator=-1,
    weights: Optional[Dict[int, Dict[int, float]]] = None,
) -> float:

    return cas.solver.dissimilarity.weighted_hamming_distance(
        s1,
        s2,
        missing_state_indicator=missing_state_indicator,
        weights=weights,
    )

solver = cas.solver.NeighborJoiningSolver(
    add_root = True, 
    dissimilarity_function=my_distance_function
    )

solver.solve(recon_tree)

Upon running the script above, the following error pops up:

## From stderr
Traceback (most recent call last):
  File "cass.py", line 38, in <module>
    solver.solve(recon_tree)
  File "/home/eecs/ivalexander13/datadir/miniconda3/envs/fake_cass/lib/python3.7/site-packages/cassiopeia/solver/DistanceSolver.py", line 140, in solve
    dissimilarity_map = self.get_dissimilarity_map(cassiopeia_tree, layer)
  File "/home/eecs/ivalexander13/datadir/miniconda3/envs/fake_cass/lib/python3.7/site-packages/cassiopeia/solver/DistanceSolver.py", line 106, in get_dissimilarity_map
    self.setup_dissimilarity_map(cassiopeia_tree, layer)
  File "/home/eecs/ivalexander13/datadir/miniconda3/envs/fake_cass/lib/python3.7/site-packages/cassiopeia/solver/DistanceSolver.py", line 227, in setup_dissimilarity_map
    self.setup_root_finder(cassiopeia_tree)
  File "/home/eecs/ivalexander13/datadir/miniconda3/envs/fake_cass/lib/python3.7/site-packages/cassiopeia/solver/NeighborJoiningSolver.py", line 264, in setup_root_finder
    self.dissimilarity_function, self.prior_transformation
  File "/home/eecs/ivalexander13/datadir/miniconda3/envs/fake_cass/lib/python3.7/site-packages/cassiopeia/data/CassiopeiaTree.py", line 1855, in compute_dissimilarity_map
    self.missing_state_indicator,
  File "/home/eecs/ivalexander13/datadir/miniconda3/envs/fake_cass/lib/python3.7/site-packages/cassiopeia/data/utilities.py", line 214, in compute_dissimilarity_map
    cm, C, missing_state_indicator, nb_weights
  File "/home/eecs/ivalexander13/datadir/miniconda3/envs/fake_cass/lib/python3.7/site-packages/numba/core/dispatcher.py", line 468, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/home/eecs/ivalexander13/datadir/miniconda3/envs/fake_cass/lib/python3.7/site-packages/numba/core/dispatcher.py", line 409, in error_rewrite
    raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Unknown attribute 'weighted_hamming_distance' of type Module(<module 'cassiopeia.solver.dissimilarity_functions' from '/home/eecs/ivalexander13/datadir/miniconda3/envs/fake_cass/lib/python3.7/site-packages/cassiopeia/solver/dissimilarity_functions.py'>)

File "cass.py", line 27:
def my_distance_function(
    <source elided>

    return cas.solver.dissimilarity.weighted_hamming_distance(
    ^

During: typing of get attribute at cass.py (27)

File "cass.py", line 27:
def my_distance_function(
    <source elided>

    return cas.solver.dissimilarity.weighted_hamming_distance(
    ^

During: resolving callee type: type(CPUDispatcher(<function my_distance_function at 0x7f1322269f80>))
During: typing of call at /home/eecs/ivalexander13/datadir/miniconda3/envs/fake_cass/lib/python3.7/site-packages/cassiopeia/data/utilities.py (197)

During: resolving callee type: type(CPUDispatcher(<function my_distance_function at 0x7f1322269f80>))
During: typing of call at /home/eecs/ivalexander13/datadir/miniconda3/envs/fake_cass/lib/python3.7/site-packages/cassiopeia/data/utilities.py (197)

During: resolving callee type: type(CPUDispatcher(<function my_distance_function at 0x7f1322269f80>))
During: typing of call at /home/eecs/ivalexander13/datadir/miniconda3/envs/fake_cass/lib/python3.7/site-packages/cassiopeia/data/utilities.py (197)

During: resolving callee type: type(CPUDispatcher(<function my_distance_function at 0x7f1322269f80>))
During: typing of call at /home/eecs/ivalexander13/datadir/miniconda3/envs/fake_cass/lib/python3.7/site-packages/cassiopeia/data/utilities.py (197)


File "../../../../../home/eecs/ivalexander13/datadir/miniconda3/envs/fake_cass/lib/python3.7/site-packages/cassiopeia/data/utilities.py", line 197:
    def _compute_dissimilarity_map(cm, C, missing_state_indicator, nb_weights):
        <source elided>
                dm[k] = dissimilarity_func(
                    s1, s2, missing_state_indicator, nb_weights
                    ^

When inspecting the source code, I noticed that in /home/eecs/ivalexander13/datadir/Cassiopeia/cassiopeia/data/utilities.py, there seem to be safeguards that are supposed to catch numba failures, as follows:

## From utilities.py at lines 159 to 171
numbaize = True
try:
    dissimilarity_func = numba.jit(dissimilarity_function, nopython=True)
except TypeError:
    warnings.warn(
        "Failed to numbaize dissimilarity function. Falling back to Python.",
        CassiopeiaTreeWarning,
    )
    numbaize = False
    dissimilarity_func = dissimilarity_function

## From utilities.py at lines 206 to 215
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=numba.NumbaDeprecationWarning)
    warnings.simplefilter("ignore", category=numba.NumbaWarning)
    _compute_dissimilarity_map = numba.jit(
        _compute_dissimilarity_map, nopython=numbaize
    )

    return _compute_dissimilarity_map(
        cm, C, missing_state_indicator, nb_weights
    )

When these two snippets are changed to completely avoid using numba, the bug disappears. So I think the bug is due to the numbaization functions not working properly, and somehow bypassing the try-catch.
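
One possible reading, which is an assumption and not confirmed by the maintainers here: `numba.jit` without a signature compiles lazily, at the first call with concrete arguments, so a try/except around the `numba.jit(...)` call itself never sees the TypingError. The sketch below forces compilation on sample arguments inside the try block and falls back to plain Python on any numba error. The helper `numbaize_or_fallback` and the toy distance function are hypothetical, and the code also runs (always falling back) when numba is not installed.

```python
import warnings

try:
    import numba
    from numba.core.errors import NumbaError
except ImportError:  # numba not installed: always fall back to Python
    numba = None


def numbaize_or_fallback(func, *sample_args):
    """Return (callable, was_jitted) after forcing compilation on sample_args."""
    if numba is None:
        return func, False
    jitted = numba.jit(func, nopython=True)
    try:
        # The first call triggers type inference and compilation.
        jitted(*sample_args)
    except NumbaError:
        warnings.warn("Failed to numbaize function. Falling back to Python.")
        return func, False
    return jitted, True


def hamming(s1, s2):
    """Count positions at which two equal-length state vectors differ."""
    total = 0
    for i in range(len(s1)):
        if s1[i] != s2[i]:
            total += 1
    return total


dist_func, was_jitted = numbaize_or_fallback(hamming, (0, 1, 2), (0, 2, 2))
print(dist_func((0, 1, 2), (0, 2, 2)), was_jitted)
```

Only the sample argument types are checked this way; a call with different argument types would compile again on its own first use.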

ILPSolver internal nodes naming convention prevents the trees from being converted to newick format

Currently, at the end of the solve procedure, the ILPSolver maintains the names of the internal nodes as tuples representing the character vector used in the Steiner solution. These should be changed to the "cassiopeia_internal_node" naming convention adopted by the other solvers.
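
A sketch of what such a renaming step might look like on a plain edge list, where internal nodes are tuples and leaves are cell names. The function name and the numeric suffix appended to the prefix are assumptions; only the "cassiopeia_internal_node" prefix comes from the issue text.

```python
from typing import Hashable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def rename_internal_nodes(
    edges: List[Edge], leaves: Set[str], prefix: str = "cassiopeia_internal_node"
) -> List[Tuple[str, str]]:
    """Give every non-leaf node a string name; leaves keep their names."""
    mapping = {}
    for edge in edges:
        for node in edge:
            if node not in leaves and node not in mapping:
                # Assumed convention: prefix plus a running integer.
                mapping[node] = f"{prefix}{len(mapping)}"
    return [(mapping.get(u, u), mapping.get(v, v)) for u, v in edges]


edges = [
    ((0, 0), (1, 0)),
    ((1, 0), "cellA"),
    ((1, 0), "cellB"),
    ((0, 0), "cellC"),
]
renamed = rename_internal_nodes(edges, leaves={"cellA", "cellB", "cellC"})
for parent_name, child_name in renamed:
    print(parent_name, "->", child_name)
```

String names avoid the commas and parentheses of tuple representations, which are structural characters in Newick.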

CassiopeiaTree doesn't trigger error if the character matrix indices don't correspond to tree cells on initialization
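
A minimal sketch of the kind of check this title asks for, comparing the set of leaf names with the character matrix index and reporting both directions of mismatch. The function name and error wording are illustrative assumptions, not the CassiopeiaTree API.

```python
from typing import Iterable


def check_leaves_match_matrix(
    leaves: Iterable[str], matrix_index: Iterable[str]
) -> None:
    """Raise ValueError if tree leaves and character matrix rows differ."""
    leaf_set, index_set = set(leaves), set(matrix_index)
    missing = sorted(leaf_set - index_set)
    extra = sorted(index_set - leaf_set)
    if missing or extra:
        raise ValueError(
            f"Leaves without a character matrix row: {missing}; "
            f"matrix rows not in the tree: {extra}"
        )


try:
    check_leaves_match_matrix(
        ["cellA", "cellB", "cellC"], ["cellA", "cellB", "cellD"]
    )
except ValueError as err:
    message = str(err)
    print(message)
```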

Implement Utilities from KP-Tracer MS

Implement utilities from recent KP-Tracer manuscript into the Cassiopeia codebase. Specifically:

Expansion index
Phylogenetic coupling
Phylotime

Failed to install

Hi Developers,

Installing Cassiopeia fails for me at "python3 setup.py build". It looks like the arguments for PyCode_New are not in the correct format.

Please see the log below. Do you have any suggestions? Thanks!

running build
running build_py
running egg_info
writing cassiopeia_lineage.egg-info/PKG-INFO
writing dependency_links to cassiopeia_lineage.egg-info/dependency_links.txt
writing entry points to cassiopeia_lineage.egg-info/entry_points.txt
writing requirements to cassiopeia_lineage.egg-info/requires.txt
writing top-level names to cassiopeia_lineage.egg-info/top_level.txt
reading manifest file 'cassiopeia_lineage.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '' under directory 'cython'
writing manifest file 'cassiopeia_lineage.egg-info/SOURCES.txt'
running build_ext
building 'cassiopeia.TreeSolver.lineage_solver.solver_utils' extension
gcc -pthread -B /data/xliu23/binaries/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/data/xliu23/binaries/anaconda3/include/python3.8 -c cassiopeia/TreeSolver/lineage_solver/solver_utils.c -o build/temp.linux-x86_64-3.8/cassiopeia/TreeSolver/lineage_solver/solver_utils.o
cassiopeia/TreeSolver/lineage_solver/solver_utils.c: In function ‘__Pyx_InitCachedConstants’:
cassiopeia/TreeSolver/lineage_solver/solver_utils.c:5934:3: warning: passing argument 6 of ‘PyCode_New’ makes pointer from integer without a cast [enabled by default]
__pyx_codeobj__11 = (PyObject)__Pyx_PyCode_New(2, 0, 6, 0, CO_OPTIMIZED|CO_NEWLOCALS, __pyx_empty_bytes, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_tuple__10, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_kp_s_cassiopeia_TreeSolver_lineage_so_2, __pyx_n_s_node_parent, 6, __pyx_empty_bytes); if (unlikely(!__pyx_codeobj__11)) __PYX_ERR(0, 6, __pyx_L1_error)
^
In file included from /data/xliu23/binaries/anaconda3/include/python3.8/compile.h:5:0,
from /data/xliu23/binaries/anaconda3/include/python3.8/Python.h:138,
from cassiopeia/TreeSolver/lineage_solver/solver_utils.c:16:
/data/xliu23/binaries/anaconda3/include/python3.8/code.h:122:28: note: expected ‘struct PyObject *’ but argument is of type ‘int’
PyAPI_FUNC(PyCodeObject ) PyCode_New(
^
cassiopeia/TreeSolver/lineage_solver/solver_utils.c:5934:3: warning: passing argument 14 of ‘PyCode_New’ makes integer from pointer without a cast [enabled by default]
__pyx_codeobj__11 = (PyObject)__Pyx_PyCode_New(2, 0, 6, 0, CO_OPTIMIZED|CO_NEWLOCALS, __pyx_empty_bytes, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_tuple__10, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_kp_s_cassiopeia_TreeSolver_lineage_so_2, __pyx_n_s_node_parent, 6, __pyx_empty_bytes); if (unlikely(!__pyx_codeobj__11)) __PYX_ERR(0, 6, __pyx_L1_error)
^
In file included from /data/xliu23/binaries/anaconda3/include/python3.8/compile.h:5:0,
from /data/xliu23/binaries/anaconda3/include/python3.8/Python.h:138,
from cassiopeia/TreeSolver/lineage_solver/solver_utils.c:16:
/data/xliu23/binaries/anaconda3/include/python3.8/code.h:122:28: note: expected ‘int’ but argument is of type ‘struct PyObject *’
PyAPI_FUNC(PyCodeObject ) PyCode_New(
^
cassiopeia/TreeSolver/lineage_solver/solver_utils.c:5934:3: warning: passing argument 15 of ‘PyCode_New’ makes pointer from integer without a cast [enabled by default]
__pyx_codeobj__11 = (PyObject)__Pyx_PyCode_New(2, 0, 6, 0, CO_OPTIMIZED|CO_NEWLOCALS, __pyx_empty_bytes, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_tuple__10, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_kp_s_cassiopeia_TreeSolver_lineage_so_2, __pyx_n_s_node_parent, 6, __pyx_empty_bytes); if (unlikely(!__pyx_codeobj__11)) __PYX_ERR(0, 6, __pyx_L1_error)
^
In file included from /data/xliu23/binaries/anaconda3/include/python3.8/compile.h:5:0,
from /data/xliu23/binaries/anaconda3/include/python3.8/Python.h:138,
from cassiopeia/TreeSolver/lineage_solver/solver_utils.c:16:
/data/xliu23/binaries/anaconda3/include/python3.8/code.h:122:28: note: expected ‘struct PyObject *’ but argument is of type ‘int’
PyAPI_FUNC(PyCodeObject ) PyCode_New(
^
cassiopeia/TreeSolver/lineage_solver/solver_utils.c:5934:3: error: too many arguments to function ‘PyCode_New’
__pyx_codeobj__11 = (PyObject)__Pyx_PyCode_New(2, 0, 6, 0, CO_OPTIMIZED|CO_NEWLOCALS, __pyx_empty_bytes, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_tuple__10, __pyx_empty_tuple, __pyx_empty_tuple, __pyx_kp_s_cassiopeia_TreeSolver_lineage_so_2, __pyx_n_s_node_parent, 6, __pyx_empty_bytes); if (unlikely(!__pyx_codeobj__11)) __PYX_ERR(0, 6, __pyx_L1_error)
^
In file included from /data/xliu23/binaries/anaconda/include/python3.8/compile.h:5:0,
from /data/xliu23/binaries/anaconda3/include/python3.8/Python.h:138,
from cassiopeia/TreeSolver/lineage_solver/solver_utils.c:16:
/data/xliu23/binaries/anaconda3/include/python3.8/code.h:122:28: note: declared here
PyAPI_FUNC(PyCodeObject *) PyCode_New(

AttributeError calling cassiopeia.pp.align_sequences

Hi @mattjones315 and @Lioscro! I am running through your preprocessing pipeline with my GESTALT data, this time trying out the new global alignment option in cassiopeia.pp.align_sequences (thanks so much for including that!). Reading the API, it looks like for my GESTALT data that I should leave the gap_open_penalty and gap_extend_penalty as the default values (since those are what were originally used for the GESTALT technology) and set method = "global". I did this and got the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-14-84b24ae4ce70> in <module>
      4 umi_table = cs.pp.align_sequences(umi_table, 
      5                                   ref_filepath = target_site_reference,
----> 6                                   method = "global")

~/.local/lib/python3.7/site-packages/ngs_tools/logging.py in inner(*args, **kwargs)
     60                 try:
     61                     self.namespace = namespace
---> 62                     return func(*args, **kwargs)
     63                 finally:
     64                     self.namespace = previous

~/Desktop/code/Cassiopeia/cassiopeia/preprocess/utilities.py in wrapper(*args, **kwargs)
     84     def wrapper(*args, **kwargs):
     85         logger.debug(f"Keyword arguments: {kwargs}")
---> 86         return wrapped(*args, **kwargs)
     87 
     88     return wrapper

~/Desktop/code/Cassiopeia/cassiopeia/preprocess/utilities.py in wrapper(*args, **kwargs)
     62         logger.info("Starting...")
     63         try:
---> 64             return wrapped(*args, **kwargs)
     65         finally:
     66             logger.info(f"Finished in {time.time() - t0} s.")

~/Desktop/code/Cassiopeia/cassiopeia/preprocess/pipeline.py in align_sequences(queries, ref_filepath, ref, gap_open_penalty, gap_extend_penalty, method, n_threads)
    528         )(
    529             delayed(align_partial)(ref, queries.loc[umi].seq)
--> 530             for umi in queries.index
    531         ),
    532     ):

~/.local/lib/python3.7/site-packages/ngs_tools/utils.py in __call__(self, *args, **kwargs)
    221     def __call__(self, *args, **kwargs):
    222         try:
--> 223             return Parallel.__call__(self, *args, **kwargs)
    224         finally:
    225             self._pbar.close()

~/.local/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
   1039             # remaining jobs.
   1040             self._iterating = False
-> 1041             if self.dispatch_one_batch(iterator):
   1042                 self._iterating = self._original_iterator is not None
   1043 

~/.local/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    857                 return False
    858             else:
--> 859                 self._dispatch(tasks)
    860                 return True
    861 

~/.local/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
    775         with self._lock:
    776             job_idx = len(self._jobs)
--> 777             job = self._backend.apply_async(batch, callback=cb)
    778             # A job can complete so quickly than its callback is
    779             # called before we get here, causing self._jobs to

~/.local/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    206     def apply_async(self, func, callback=None):
    207         """Schedule a func to be run"""
--> 208         result = ImmediateResult(func)
    209         if callback:
    210             callback(result)

~/.local/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    570         # Don't delay the application, to avoid keeping the input
    571         # arguments in memory
--> 572         self.results = batch()
    573 
    574     def get(self):

~/.local/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    262             return [func(*args, **kwargs)
--> 263                     for func, args, kwargs in self.items]
    264 
    265     def __reduce__(self):

~/.local/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    262             return [func(*args, **kwargs)
--> 263                     for func, args, kwargs in self.items]
    264 
    265     def __reduce__(self):

~/Desktop/code/Cassiopeia/cassiopeia/preprocess/alignment_utilities.py in align_global(ref, seq, substitution_matrix, gap_open_penalty, gap_extend_penalty)
     78     aln = aligner.align(ref, seq)
     79     return (
---> 80         ngs.sequence.alignment_to_cigar(aln.result_a, aln.result_b),
     81         aln.pos_b,
     82         aln.pos_a,

AttributeError: module 'ngs_tools.sequence' has no attribute 'alignment_to_cigar'

I just pulled the latest updates to the Master branch this morning, so I am not sure whether it is an issue on my end with a package being out of date. Thanks so much for any help!
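One way to check the "package out of date" hypothesis is to look at the installed version and test for the attribute directly. The helper below is a sketch using only the standard library; its name is hypothetical, and the distribution name "ngs-tools" in the usage comment is an assumption.

```python
import importlib
from importlib import metadata


def describe_dependency(module_name, attribute, dist_name=None):
    """Return (installed version or None, whether the module has the attribute)."""
    try:
        version = metadata.version(dist_name or module_name.split(".")[0])
    except metadata.PackageNotFoundError:
        version = None
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return version, False
    return version, hasattr(module, attribute)


# Usage for the traceback above (distribution name assumed):
# describe_dependency("ngs_tools.sequence", "alignment_to_cigar", dist_name="ngs-tools")
```

If the second element is False, the installed release probably predates the function, and upgrading that dependency would be the first thing to try.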

Command line tools not found

Hi! I followed the installation instructions, and make test runs successfully. While I can run the command cassiopeia-preprocess --help, I could not run reconstruct-lineage --help. Any suggestions?

Here is my script for installation

conda create -n lineage_tree python=3.6 --yes
conda activate lineage_tree
pip install pytest
conda install cython --yes
make install
pip install --user ipykernel
python -m ipykernel install --user --name=lineage_tree

PCT48 hardcoded into processing pipeline?

Hello! Thanks for creating this great tool to process data from single cell lineage tracing experiments. I am going through the process_fastq.ipynb notebook with my own example data (from Raj et al. 2020's scGESTALT paper). Things seemed to be working well until I hit the process.call_indels step, which returned an output .sam file that contained a header but was otherwise blank (i.e., no called indels for each sequence). Digging into this a little more, it looks like the input .sam file (generated from process.align_sequences) is not valid. The reference listed in the @SQ line is PCT48.ref, which is the reference used in the process_fastq.ipynb notebook. However, the alignments all show alignments to the dsRed reference, which is what I was using for my example data. This seems to be hardcoded in the pipeline_utils.py script (see line 18 and line 210). I am wondering if there is some issue with the PCT48.ref being hardcoded that is causing the process.call_indels step to fail on my end with different example data?

I've included the following files from my example run: 1) sw_aligned.sam (output .sam file from process.align_sequences), 2) DsRed.fa (FASTA reference sequence), and 3) umi_table.sam (output .sam file from process.call_indels).
gestalt_ex.zip

Versions:
Python 3.7.4
Emboss 6.6.0.0
Gurobi 9.1.1
Mac OS 10.15.7

I don't see any version numbers for Cassiopeia, but I just finished downloading and installing Cassiopeia yesterday so I should be working with the most current versions of everything.

Thank you for any help or insight!

Issue loading in previously saved allele_table

Hi @mattjones315, I am working through the new Reconstructing trees with Cassiopeia tutorial with my GESTALT data, and so far mostly good (I think). However, I am having a problem with cassiopeia.pp.compute_empirical_indel_priors if I open a new jupyter notebook and load in an allele_table that was saved in my preprocessing notebook:

My code:

indel_priors = cassiopeia.pp.compute_empirical_indel_priors(allele_table, grouping_variables=['intBC', 'lineageGrp'])

The error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-2d54866e2893> in <module>
----> 1 indel_priors = cassiopeia.pp.compute_empirical_indel_priors(allele_table, grouping_variables=['intBC', 'lineageGrp'])

~/Desktop/code/Cassiopeia/cassiopeia/preprocess/utilities.py in compute_empirical_indel_priors(allele_table, grouping_variables, cut_sites)
    689     for g in groups.index:
    690 
--> 691         alleles = np.unique(np.concatenate(groups.loc[g].values))
    692         for a in alleles:
    693             if "none" not in a.lower():

<__array_function__ internals> in unique(*args, **kwargs)

~/.pyenv/versions/3.6.12/lib/python3.6/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
    259     ar = np.asanyarray(ar)
    260     if axis is None:
--> 261         ret = _unique1d(ar, return_index, return_inverse, return_counts)
    262         return _unpack_tuple(ret)
    263 

~/.pyenv/versions/3.6.12/lib/python3.6/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
    320         aux = ar[perm]
    321     else:
--> 322         ar.sort()
    323         aux = ar
    324     mask = np.empty(aux.shape, dtype=np.bool_)

TypeError: '<' not supported between instances of 'float' and 'str'

I am not sure why this is happening. Using the same allele_table in my preprocessing notebook, I am able to run cassiopeia.pp.compute_empirical_indel_priors with no issue. Related to this, is it appropriate to use compute_empirical_indel_priors for GESTALT data? Based on the tutorial, it sounds like this might be specific to your system with intBCs, and yet if I don't run compute_empirical_indel_priors then I have issues downstream generating the character_matrix. If I change indel_priors = cassiopeia.pp.compute_empirical_indel_priors(allele_table, grouping_variables=['intBC', 'lineageGrp']) to indel_priors = cassiopeia.pp.compute_empirical_indel_priors(allele_table, grouping_variables=['lineageGrp']) (removing intBC, since the GESTALT system doesn't have them, so perhaps I wouldn't want to group by them), then I get a tree that looks weird (almost no branches, nodes, or leaves). This is using the vanilla_greedy and 'nj' neighbor-joining solvers; I can't currently use the ilp-solver because my Gurobi license has expired, but I will look into renewing it if the ilp-solver turns out to be the most useful to me.

All that said, I ran through the rest of your tutorial as a continuation of my preprocessing steps in the same notebook and everything seems to be working for the most part. I am getting one other error and I have a few questions, and was not sure if I should open them all as separate issues, or keep them running in this thread as a potential singular resource to troubleshoot GESTALT data at the reconstruction stage.

Thanks!
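On the TypeError above: a plausible cause, stated as an assumption because the saved file is not shown, is that some allele entries come back as float NaN after the table is written to disk and read again, so the sort inside np.unique mixes floats and strings. The sketch reproduces the same error with plain Python and shows a hypothetical helper for locating such entries; the column names and values are illustrative.

```python
def find_non_string_entries(rows, columns):
    """Return (row_index, column, value) for every entry that is not a string.

    `rows` is a list of dicts standing in for allele-table rows.
    """
    return [
        (i, col, row[col])
        for i, row in enumerate(rows)
        for col in columns
        if not isinstance(row[col], str)
    ]


rows = [
    {"r1": "None", "r2": "indel_A"},
    {"r1": float("nan"), "r2": "None"},  # a missing value after re-reading
]

# Sorting a mix of str and float fails the same way np.unique does.
try:
    sorted(row["r1"] for row in rows)
    sort_error = None
except TypeError as err:
    sort_error = str(err)

bad_entries = find_non_string_entries(rows, ["r1", "r2"])
```

If such entries turn up in the reloaded table, comparing the column dtypes before saving and after loading should show where the strings were lost.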

make test error

Hi, I have a problem at the make test step of the installation. I have tried some fixes suggested online, but they failed, and I wonder why the package was not installed correctly. I appreciate that you wrote a guide for your package, but I may need some extra help.

Best wishes

unify importing in codebase

Should be a simple fix, but just posting as an issue so that we get to it some time.
We should unify local importing to use relative imports (i.e. from .data import CassiopeiaTree) instead of specifying the full package name (i.e. from cassiopeia.data import CassiopeiaTree).
As of right now, most submodules use the absolute importing scheme.

issues about assign_lineage_groups() function

Thanks for developing such a useful tool for reconstructing phylogenetic trees. When looking into the source code that assigns lineage groups to cells, I was not sure whether several details are correct. Two main questions are confusing me:

https://github.com/YosefLab/Cassiopeia/blob/master/cassiopeia/preprocess/lineage_utils.py
1.

In the third-to-last line, should it be 'prev_clust_size = piv_nolg.shape[0]', since we only need to assign lineage groups to the undefined cells while their number exceeds 'min_clust_size'? Is there something important I've missed or misunderstood?

The upper intBC set is used in the iterative assignment of lineage groups, and the lower one, master_intBC, is used to calculate kinship_score. However, these two intBC sets are not actually the same, because they are filtered by different criteria. So I am wondering whether this is more efficient and accurate than using the same criteria to select both the intBC set and master_intBC?

I hope I have understood your code correctly, and I look forward to your reply.

Sequencing module

Trying to run process_fastq.ipynb, I get an error regarding the Sequencing module:

ModuleNotFoundError: No module named 'cassiopeia.ProcessingPipeline.process.sequencing'

It seems like this module is loaded in cassiopeia/ProcessingPipeline/process/collapseUMIReadsByMSALargeFile.py from a local directory:

sys.path.append("/home/jah/projects/sequencing/code")

Could you please point me to where I can access / install that module?

DistanceSolver duplicates handling is inconsistent with other solvers

Other solvers explicitly group duplicates together. At the beginning of the solving process, they remove the duplicates and solve on the duplicate-cleaned character matrix, appending duplicates at the end. The DistanceSolver should adopt this convention, instead of just solving on the character matrix with the duplicates still included.
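The convention described above could be sketched as follows, using plain dicts in place of the character matrix and the solved tree. How the other solvers actually re-attach duplicates is not shown in this issue, so the function names and the "_duplicates" parent label are assumptions.

```python
from collections import defaultdict


def collapse_duplicates(character_matrix):
    """Group cells with identical character vectors.

    `character_matrix` maps cell name -> tuple of character states.
    Returns (unique_matrix, duplicates), where `duplicates` maps each
    representative cell to the other cells that share its states.
    """
    groups = defaultdict(list)
    for cell, states in character_matrix.items():
        groups[tuple(states)].append(cell)
    unique_matrix = {cells[0]: states for states, cells in groups.items()}
    duplicates = {cells[0]: cells[1:] for cells in groups.values() if len(cells) > 1}
    return unique_matrix, duplicates


def append_duplicates(children, duplicates):
    """Re-attach duplicates after solving: each representative leaf is replaced
    by a new parent whose children are the representative and its copies."""
    tree = {parent: list(kids) for parent, kids in children.items()}
    for rep, copies in duplicates.items():
        group = f"{rep}_duplicates"  # hypothetical label for the new parent
        for kids in tree.values():
            kids[:] = [group if k == rep else k for k in kids]
        tree[group] = [rep, *copies]
    return tree


matrix = {"a": (1, 0), "b": (1, 0), "c": (0, 2)}
unique_matrix, duplicates = collapse_duplicates(matrix)
solved = {"root": ["a", "c"]}  # stands in for the distance solver's output
full_tree = append_duplicates(solved, duplicates)
```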

Enquiries about data availability

Hello,

Congratulations on your publication! I’m so impressed by the article and was hoping to use the gastrulation compendium data as a reference for my study.

I used the accession number GSE122187 and got all the 10x data. Although you have provided the CellStates file and the CellStatesKernel file, I'm wondering if you could share a metadata file that includes the cell-by-cell annotation you used in the paper?

Really appreciate all your help!

CassiopeiaTree character matrix has inconsistencies in the naming conventions

#106
In the CassiopeiaTree, the character matrix attribute is accessed as self._character_matrix in some methods and as self.character_matrix in others.

Sorry I didn't catch this earlier!

ERROR in align

Hello,
I was following the tutorial for the preprocessing, and when I get to the "align" step, I get an error thrown.
IndexError: list index out of range

Looking forward to your answers!
Best,
xinyi

How many intBCs are in the intbc_whitelist?

Hi,
When I ran error_correct_intbcs_to_whitelist with 6 intBCs in the intbc_whitelist, I got an error. In general, how many intBCs are in the intbc_whitelist? How can I get the intbc_whitelist used in the paper "Single-cell lineages reveal the rates, routes, and drivers of metastasis in cancer xenografts"?
Looking forward to your reply

Issues with ILP solution post-processing

2 Issues with ILP

Some samples are being removed in the tree in post-processing
Some spurious (non-sample) leaves remain in the tree, despite our explicit removing of these nodes
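For the second item, one reason a single removal pass can leave spurious leaves behind is that deleting a leaf may turn its parent into a new leaf. The sketch below repeats the pass until nothing changes; it uses a plain dict tree rather than the solver's graph object, and the function name is hypothetical. It does not address the first item (samples being removed).

```python
def prune_non_sample_leaves(children, root, samples):
    """Iteratively remove leaves that are not in `samples`.

    `children` maps node -> list of children. The pass is repeated because
    removing a leaf can turn its parent into a new spurious leaf.
    """
    tree = {node: list(kids) for node, kids in children.items()}
    changed = True
    while changed:
        changed = False
        for node, kids in tree.items():
            # Keep a child if it still has children of its own or is a sample.
            kept = [k for k in kids if tree.get(k) or k in samples]
            if len(kept) != len(kids):
                tree[node] = kept
                changed = True
    # Drop the entries of internal nodes that ended up with no children.
    return {node: kids for node, kids in tree.items() if kids or node == root}


raw = {"root": ["i1", "i2"], "i1": ["A", "x"], "i2": ["y"]}
pruned = prune_non_sample_leaves(raw, "root", samples={"A"})
```

Nodes left with a single child are kept here; collapsing them would be a separate step.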

Score character-parsimony and likelihood.

Add a method to the CassiopeiaTree class for computing:

the parsimony of the tree
the likelihood of the tree (using Felsenstein's alg)

This should take advantage of the layers functionality in CassiopeiaTree.
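A minimal sketch of the parsimony half only, assuming character states are already available for every node, internal nodes included. The function name, the use of -1 as the missing-state indicator, and the choice to skip missing positions are assumptions of this sketch; the Felsenstein likelihood is not covered.

```python
def parsimony_score(edges, states, missing=-1):
    """Count character-state changes summed over all edges.

    `edges` is a list of (parent, child) pairs; `states` maps every node to
    its character vector. Positions where either end is missing are skipped.
    """
    score = 0
    for parent, child in edges:
        for p, c in zip(states[parent], states[child]):
            if p != c and missing not in (p, c):
                score += 1
    return score


edges = [("root", "a"), ("root", "b")]
states = {"root": (0, 0, 0), "a": (1, 0, -1), "b": (1, 2, 0)}
score = parsimony_score(edges, states)
```

In this example the root-to-a edge contributes one change and the root-to-b edge two, so the score is 3.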

Installation instructions in readme

Cassiopeia/README.md

Line 34 in 4a4ba1a

2. Ensure that you have python3.6 installed. You can install this via pip.

This isn't required and you can't install python from pip?!

Implement custom copy & deepcopy methods in CassiopeiaTree

Currently we are invoking copy.deepcopy(self) when trying to copy a CassiopeiaTree. It would be better to implement a custom version of both copy and deepcopy for users. See https://docs.python.org/3/library/copy.html , specifically:

In order for a class to define its own copy implementation, it can define special methods __copy__() and __deepcopy__(). The former is called to implement the shallow copy operation; no additional arguments are passed. The latter is called to implement the deep copy operation; it is passed one argument, the memo dictionary. If the __deepcopy__() implementation needs to make a deep copy of a component, it should call the deepcopy() function with the component as first argument and the memo dictionary as second argument.
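The protocol quoted above can be sketched as follows. `Tree` is a hypothetical stand-in with made-up attributes, not the CassiopeiaTree API; dropping the cache on copy is a design choice assumed for illustration.

```python
import copy


class Tree:
    """Hypothetical stand-in for a tree container."""

    def __init__(self, edges, character_matrix, cache=None):
        self.edges = edges
        self.character_matrix = character_matrix
        self._cache = cache if cache is not None else {}

    def __copy__(self):
        # Shallow copy: share the underlying data, start with an empty cache.
        return type(self)(self.edges, self.character_matrix)

    def __deepcopy__(self, memo):
        new = type(self).__new__(type(self))
        memo[id(self)] = new  # register first so cyclic references resolve
        # Pass `memo` through so shared components are copied only once.
        new.edges = copy.deepcopy(self.edges, memo)
        new.character_matrix = copy.deepcopy(self.character_matrix, memo)
        new._cache = {}
        return new


tree = Tree([("root", "a")], {"a": [1, 0]}, cache={"distances": 1})
shallow = copy.copy(tree)
deep = copy.deepcopy(tree)
```

Callers keep using copy.copy() and copy.deepcopy(); the custom methods only control what is shared and what is duplicated.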

Segmentation Fault in error_correct_cellbcs_to_whitelist

I have a dataset with scarring-arrays and am trying to follow the preprocessing user guide here. When correcting the Cell UMIs to the whitelist, the program crashes very early with a segmentation fault during the error_correct_cellbcs_to_whitelist function call.

Code:

 bam_fp = cas.pp.error_correct_cellbcs_to_whitelist(
     bam_fp,
     whitelist='data/3M-february-2018.txt',
     output_directory=output_dir,
     n_threads=n_threads,
 )

stderr:

[3/4] Finding mismatches:   1%|1         | 2776/252224 [01:52<2:48:23, 24.69it/s]
/software/sge-2011.11/default/spool/fermat/job_scripts/9617950:
line 14: 11249 Segmentation fault      python3 main.py

When rerunning the same code, the program crashes reproducibly at the same index (±2).

Segmentation faults in Python code are rare, so I suspect that compiled code is the origin of the segfault.
I traced the logging message back to ngs-tools, which uses numba to speed up some private functions. Contrary to my expectation, disabling numba compilation with NUMBA_DISABLE_JIT=1 did not prevent the segfault, so it seems that function is not the origin of the segfault.

I'd be glad for pointers on how to make Cassiopeia work on my dataset. I looked into the bam file at the appropriate index, but the read looks like any other read. I'm currently running the pipeline on another sample to see if that runs through. I thought about tracing with gdb to see where the segfault comes from, or using a separate package to do the cell-UMI correction, but I hoped that you might have some insight into how this problem could be fixed.
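Before reaching for gdb, the standard-library faulthandler module can dump the Python traceback at the moment of the crash, which at least shows which Python call entered the compiled code. The wrapper below is a small sketch; the function name and log path are illustrative.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Dump the tracebacks of all threads to `log_path` if the process
    receives a fatal signal such as SIGSEGV.

    The returned file object must stay open while the handler is enabled.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# At the top of main.py, before the preprocessing calls:
# crash_log = enable_crash_traceback("segfault_traceback.log")
```

Running the script as `python3 -X faulthandler main.py` enables the same handler without code changes, writing to stderr. The dump contains Python frames only, so gdb is still needed to see inside the native code.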

Converting FASTQs into an unmapped BAM: Index out of range

Hi all,

Firstly, thank you for creating an excellent pipeline and such a versatile set of tools! I have used your reconstruct.ipynb notebook extensively and it has been great!

However, I am having some issues with the preprocess.ipynb notebook, specifically with the initial conversion from fastqs to an unmapped .bam. The fastq files that I am using are from your previous Quinn et al. paper, but unfortunately when I try to use the convert_fastqs_to_unmapped_bam() function I am met with the following error:

File "/home/george/anaconda3/envs/lineage_tracing/lib/python3.7/site-packages/ngs_tools/chemistry/Chemistry.py", line 129, in parse
raise IndexError('string index out of range')
IndexError: string index out of range

I know that this is probably just a stupid mistake on my part but I can't work out where I am going wrong.

If you have any suggestions it would be greatly appreciated!

Thanks
George

Missing ~/.itolconfig

Hi, it would be really useful if you could provide an example iTol config file ~/.itolconfig, so that we know how to make our own file!

Unable to read pickled files due to renaming of module?

Thanks for the well-documented software package!

I just wanted to check in about reading the data sets published on Zenodo (https://zenodo.org/record/3706351) with pickle. I am getting the error message below, potentially because the module was renamed from Cassiopeia to cassiopeia at some point. I was wondering which version of the Cassiopeia code on GitHub I should be using to read this data. Thank you!

>>> import cassiopeia
dir_path = /nfshomes/ekmolloy/.local/lib/python3.10/site-packages/cassiopeia/tools/fitness_estimator
>>> import pickle
>>> with open('true_network_characters_20_run_9.pkl', 'rb') as f:
...     x = pickle.load(f)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
ModuleNotFoundError: No module named 'Cassiopeia'
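If the only obstacle is the renamed top-level package, pickle allows the module path to be rewritten at load time by overriding Unpickler.find_class. The sketch below does that; the class and function names are hypothetical, and it will still fail if the classes referenced in the file were moved or removed inside the package, which this thread does not establish either way.

```python
import pickle


class RenamingUnpickler(pickle.Unpickler):
    """Unpickler that rewrites a top-level package name while loading."""

    def __init__(self, file, old="Cassiopeia", new="cassiopeia"):
        super().__init__(file)
        self._old, self._new = old, new

    def find_class(self, module, name):
        # Rewrite "Cassiopeia" and "Cassiopeia.sub.module", nothing else.
        if module == self._old or module.startswith(self._old + "."):
            module = self._new + module[len(self._old):]
        return super().find_class(module, name)


def load_renamed(path, old="Cassiopeia", new="cassiopeia"):
    """Load a pickle file, mapping the old package name to the new one."""
    with open(path, "rb") as handle:
        return RenamingUnpickler(handle, old, new).load()


# x = load_renamed("true_network_characters_20_run_9.pkl")
```

As with any pickle, this should only be used on files from a trusted source.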

Target site reference fasta in the Science article "Single-cell lineages reveal the rates, routes, and drivers of metastasis in cancer xenografts"

Hi! I want to know whether the target site reference fasta in the Science article "Single-cell lineages reveal the rates, routes, and drivers of metastasis in cancer xenografts" is the same as the file "PCT48.ref.fasta" in the data directory or not. Looking forward to your reply. Thank you very much.

Missing data

Hi again @mattjones315, I wanted to follow up on the missing data issue that we discussed in issues #110 and #126. I went through the allele_table generated from my GESTALT data and counted the number of times I had a "Missing" entry/missing allele identity for each of my cut sites (r1, r2, r3, ... r10 for GESTALT). For cut sites r1-r8 there were no "Missing" entries. For cut site r9, ~24% of the entries were "Missing" and for cut site r10, ~36% of the entries were "Missing". I know you mentioned here that I can probably get away with missing data for now so long as it is less than ~30%, which is the case for r9 but not for r10. It seems interesting to me that the first 8 cut sites have no issues, and that missing data only increases at the end of my barcode array. Does this mean that in general the alignment is probably working, or do you think that we'll need to implement the global-alignment strategy described in the original GESTALT paper? Thanks!

from cassiopeia.solver import (
    DistanceSolver,
    dissimilarity_functions,
    solver_utilities,
)

from cassiopeia.data.CassiopeiaTree import CassiopeiaTree
from cassiopeia.mixins import TreeSimulatorError
from cassiopeia.simulator.TreeSimulator import TreeSimulator
