fraenkel-lab / omicsintegrator2

Prize-Collecting Steiner Forests for Interactomes

Home Page: https://fraenkel-lab.github.io/OmicsIntegrator2

License: BSD 3-Clause "New" or "Revised" License

Python 4.43% Jupyter Notebook 95.57%

omics proteomics prize-collecting steiner-tree

omicsintegrator2's Issues

Randomizations

@AmandaKedaigle when people go to run OmicsIntegrator2, how do they do randomizations? Do they specify a number of times to randomize using each randomization strategy and merge them all? I'm still not sure I understand merging too well either... but maybe I can read about that.

Edge data representation

pcsf returns edge_indices as a 1D np array. It would be nice to return as 2D array of vertex indices (since pcsf_exact uses this format), but I think this breaks a lot of downstream analyses, particularly _aggregate_pcsf and randomizations.

One solution would be to create a copy of the interactome and assign robustness/specificity as node and edge attributes.

So what's the best way to go from 2D array of vertex indices to 1D np array of edge_indices?
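One possible answer, as a minimal sketch: build a lookup from each (sorted) vertex pair to its row in the interactome's edge array. The names `edges` and `edge_pairs_to_indices` are hypothetical, and the sketch assumes undirected edges stored as an (n_edges, 2) integer array with no duplicates.

```python
import numpy as np

def edge_pairs_to_indices(edges, pairs):
    """Map (u, v) vertex pairs to row indices of `edges`.

    `edges` is assumed to be an (n_edges, 2) integer array of undirected,
    non-duplicated edges; `pairs` is a (k, 2) array of vertex pairs.
    """
    # Sort each pair so that (u, v) and (v, u) share one dictionary key.
    lookup = {tuple(sorted(map(int, e))): i for i, e in enumerate(edges)}
    return np.array([lookup[tuple(sorted(map(int, p)))] for p in pairs], dtype=int)

# Toy interactome with four edges, and a result given as vertex pairs.
edges = np.array([[0, 1], [1, 2], [2, 3], [0, 3]])
pairs = np.array([[3, 2], [0, 1]])
edge_indices = edge_pairs_to_indices(edges, pairs)
```

A pair that is not in the interactome raises a `KeyError` here, which is probably what we want while debugging.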

--random_terminals flag doesn't work

python forest.py -e /nfs/latdata/iamjli/ALS/data/interactome/iRefIndex_v13_MIScore_interactome.txt -p /nfs/latdata/iamjli/ALS/data/iMNs/proteomics/20170323_ALS_CTR_iMNs_protein_log2FC1_FDR0.01.tsv -o ../output/ --random_terminals=5

run from /nfs/latdata/iamjli/packages/OmicsIntegrator2/src

Error:

Traceback (most recent call last):
  File "forest.py", line 477, in <module>
    forest, augmented_forest = graph.randomizations(prizes, terminals, args.noisy_edges_repetitions, args.random_terminals_repetitions)
  File "forest.py", line 329, in randomizations
    for random_prizes, terminals in [self._random_terminals(prizes, terminals) for rep in range(random_terminals_reps)]:
  File "forest.py", line 329, in <listcomp>
    for random_prizes, terminals in [self._random_terminals(prizes, terminals) for rep in range(random_terminals_reps)]:
  File "forest.py", line 296, in _random_terminals
    new_prizes[new_terminal] = prizes[old_terminal]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

OI web app

Some notes from our convo on Friday, Mandy, as well as a few other things I just thought of.

What is the mRNA input for? Colorization, or penalizing lowly expressed nodes? If the former, we run into some consistency issues since we don’t use our entire proteomics file as our prize file. If the latter, we don’t have this functionality implemented, so maybe we should leave that box out for now.
In the algorithm parameters, “Number of Trees” is a bit misleading, but I can’t think of a concise rewording. Maybe “connectivity”? Even that’s not ideal.
Default for b should probably be in the range of 0-2, since we are no longer penalizing node prizes. Default 1 seems to work pretty well for me.
Degree penalty (a) should now be edge penalty (g). Default 10 works fine for me.
iRef14 also worked well for me, so maybe update that.
Visualization: with new penalization scheme, high degree nodes are included, leading to outputs like the one below. Would it be too difficult to make non-robust edges transparent? Low robust edges could still be visualized, but at a very low visibility.
Output for randomized experiments should auto return the top 400 robust nodes as a subnetwork, or some other heuristic. The former is already implemented in graph.py.

Add feature to remove edges between the 1-hop neighbors of a protein of interest

@brycehwang's idea.

May force the network to choose a greater diversity of paths around your protein of interest.
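A minimal sketch of what this could look like, assuming the interactome is available as a networkx graph. The function name `remove_edges_among_neighbors` is hypothetical and not part of the package.

```python
import itertools
import networkx as nx

def remove_edges_among_neighbors(graph, protein):
    """Return a copy of `graph` without edges joining two 1-hop neighbors of `protein`."""
    pruned = graph.copy()
    neighbors = set(graph.neighbors(protein)) - {protein}
    # Check every pair of neighbors and drop the edge between them if it exists.
    for u, v in itertools.combinations(neighbors, 2):
        if pruned.has_edge(u, v):
            pruned.remove_edge(u, v)
    return pruned

# Toy graph: A and B are both neighbors of P, so the A-B edge is removed.
g = nx.Graph([("P", "A"), ("P", "B"), ("A", "B"), ("B", "C")])
pruned = remove_edges_among_neighbors(g, "P")
```

Edges from the protein of interest to its neighbors, and edges leaving the neighborhood, are left untouched. For hub proteins the pairwise loop is quadratic in the degree, so a real implementation might want something smarter.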

Directed Edges

@AmandaKedaigle this thread is for discussing whether we need to support directed edges, and if so, how we might go about doing so. I imagine Ludwig will need to be part of these discussions. In the short term, I plan on releasing this without support for directed edges, because I have a heavy bias here of "good is better than perfect", unless you think that's a bad idea.

Write cytoscape files

Currently, we have a file called write_cyto_file.py in the repo, but instead of doing this ourselves, we should have networkx do this, I think.

Cytoscape can take in these formats including:

Graph Markup Language (GML or .gml format)

networkx has documentation about reading and writing here, including the functionality of writing GML files.

This seems like a nice way to go. However, we need to make sure we don't run into any problems involving issues like this one: http://stackoverflow.com/questions/5828045/transfer-layout-from-networkx-to-cytoscape
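For reference, a small sketch of the networkx GML round trip. The node and edge attribute names (`prize`, `cost`) are made up for illustration, and this says nothing about whether layout information survives the trip into Cytoscape.

```python
import networkx as nx

# Toy forest with one node attribute and one edge attribute.
forest = nx.Graph()
forest.add_node("A", prize=1.5)
forest.add_node("B", prize=0.0)
forest.add_edge("A", "B", cost=0.3)

# generate_gml yields the GML text line by line;
# nx.write_gml(forest, path) writes the same content to a file.
gml_text = "\n".join(nx.generate_gml(forest))

# Parse it back to confirm that attributes survive.
round_trip = nx.parse_gml(gml_text)
```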

Quick final note, we may want to use graph-tool instead of networkx

Duplicate edges in edge file

Hi @AmandaKedaigle ,

Will there ever be duplicate edges in the edge file? I.e.

A   B   prize1
B   A   prize1

or worse

A   B   prize1
B   A   prize2

Specifically I'm asking whether I should check for that in forest, and what I should do if it happens...
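If we do decide to check, here is a rough pandas sketch. The column names are hypothetical, and keeping the lowest-cost copy is only one possible policy, not something decided in this thread.

```python
import numpy as np
import pandas as pd

edges = pd.DataFrame(
    {"protein1": ["A", "B", "C"], "protein2": ["B", "A", "D"], "cost": [0.2, 0.5, 0.1]}
)

# Order each pair alphabetically so that A-B and B-A compare equal.
pair = np.sort(edges[["protein1", "protein2"]].to_numpy(), axis=1)
canonical = edges.assign(protein1=pair[:, 0], protein2=pair[:, 1])

# Rows that share an (unordered) pair with at least one other row.
duplicated = canonical[canonical.duplicated(["protein1", "protein2"], keep=False)]

# Assumed policy: keep the copy with the lowest cost.
deduplicated = canonical.sort_values("cost").drop_duplicates(["protein1", "protein2"])
```

The `duplicated` frame can be logged as a warning so the user sees which pairs were collapsed.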

Post processing graphs

@AmandaKedaigle What are the post-processing steps for graphs, and which of those do we want to support under which circumstances? In my notes, I see that there's been discussion of betweenness, Louvain clustering, and community clustering. I imagine some of these are supported by networkx, others we might need to implement on our own.

@iamjli would you comment on this for us?

Update to networkx 2.0

This was released a couple days ago.

Write informative README

Similarly to GarNet, it would be really nice to have a beautiful readme explaining this project. I no longer believe we'll have OmicsIntegrator2 be a fork of OmicsIntegrator(1), so I think it makes sense to duplicate important documents from OI1 here. I imagine a lot of the overview can be borrowed from OI1 (though it might be nice to add a little). The technical stuff should be easier this time, since installing is a single command.

Variable attributes discussion

I still think some variables would be better as class attributes. For instance, it seems arbitrary that self.edge_penalties is an attribute despite not being used anywhere else in the class. On the other hand, terminals in the _prepare_prizes is redefined again in pcsf, and will be in pcsf_exact.

I think I would be satisfied if prizes, terminals, and (maybe) node_attributes were their own attributes. Especially since prizes and terminals are both arrays that are meaningless to the user outside of the context of the class. So the current format essentially forces users to save prizes/terminals as local vars, just to pass it back into pcsf.

Arguments against?

Post Processing: Clustering and Enriching

I've compiled a list of what seems to be the most useful clustering and enrichment methods for the output of OmicsIntegrator. Below are the methods that are most frequently used, and should be implemented in OmicsIntegrator2 (listed in order of priority)

Clustering:

Louvain Clustering
Edge-betweenness Clustering - already implemented by Alex (?)
K-means clustering / t-SNE
Consensus Clustering
Pathway clustering

As well as clustering, it would also be helpful to have enrichment methods.

Output cluster membership
Integration with Enrichr API

If there are any others that you think of, please let me know!

Negative Prizes for Hubs

@AmandaKedaigle With respect to negative prizes, how are they computed? I imagine it isn't all too complicated, just having to do with a mu parameter...?

Remind me again why we need to keep the original, negative and result prizes?

Add ability to penalize or remove nodes

Which allows for #35

THE ROADMAP: Everything we want to change for this release:

@divyaramamoorthy @AmandaKedaigle

New in this version

Safe to Remove:

Cross validation code
Shuffle Prizes
Support for Cytoscape 2.8

Add mRNA nodes & interactions to forest

The web tool has support for new kinds of interactions we'd like to include eventually:

The output of Garnet should be changed to indicate what genes were used to predict a TF was important (i.e. these genes are predicted targets of that TF). Then, those genes should be added to the network as new nodes (mRNA nodes, not protein nodes), and edges should be drawn:

"pd" or protein-DNA interaction, from TFs to their target mRNA nodes
"tp" or mRNA-protein interactions, from mRNA nodes to their protein products

What to do when Prize file contains nodes missing from interactome

@AmandaKedaigle What does the old omicsintegrator do?
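Whatever the old behavior was, one option is to drop the missing nodes and report them. A minimal sketch with hypothetical column names; this is not a description of what OmicsIntegrator 1 does.

```python
import pandas as pd

interactome_nodes = {"A", "B", "C"}
prizes = pd.DataFrame({"name": ["A", "C", "X"], "prize": [1.0, 2.5, 0.7]})

# Split the prize table into nodes we can use and nodes we have to drop.
in_interactome = prizes["name"].isin(interactome_nodes)
missing = prizes.loc[~in_interactome, "name"].tolist()
usable = prizes[in_interactome]

if missing:
    print(f"{len(missing)} prize node(s) not found in the interactome: {missing}")
```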

pcst_fast function not working on cluster

Running forest from /nfs/latdata/iamjli/ALS/OmicsIntegrator2/src:

python forest.py -e ../data/interactome/iRefIndex_v13_MIScore_interactome.txt -p ../data/iMNs/proteomics/20170323_ALS_CTR_iMNs_protein_log2FC1_FDR0.01.tsv -o ../output/test/

am getting this error:

terminate called after throwing an instance of 'std::invalid_argument'
  what():  In the rooted case, only one output cluster is supported.
Aborted

Traced error back to pcsf_fast function.

Seg fault

This command run from /nfs/latdata/iamjli/packages/OmicsIntegrator2/src produces a segfault:

python __main__.py -e ../PCSF_compare/data/iRefIndex_v13_MIScore_interactome_COSTS.txt -p ../PCSF_compare/data/20170323_ALS_CTR_iMNs_protein_log2FC1_FDR0.01.tsv -o ../output/

PCST runs fine if I use a shortened interactome file (../PCSF_compare/data/iRefIndex_v13_MIScore_interactome_COSTS.short.txt), or one whose scores have not been converted to costs by taking 1 minus the score (../PCSF_compare/data/iRefIndex_v13_MIScore_interactome.txt)

Node attributes output

-Gene symbol
-Prize value
-Robustness
-Specificity
-Protein type (steiner, TF, terminal)

Network summarization

Provide summary statistics for single run forests, provide warnings when network structure is off (large star structures/many singleton nodes, etc). Write to file.

Ideally also do this for randomizations, but the code structure has to be modified significantly.

Switch all soon-to-be-deprecated .loc's to .reindex's in graph.py

running on local machine

Python 3.5.1 (default, Mar  1 2017, 15:20:57)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pcst_fast import pcst_fast
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: dlopen(/Users/jonathanli/Documents/research/packages/OmicsIntegrator2/venv/lib/python3.5/site-packages/pcst_fast.cpython-35m-darwin.so, 2): Symbol not found: __PyThreadState_UncheckedGet
  Referenced from: /Users/jonathanli/Documents/research/packages/OmicsIntegrator2/venv/lib/python3.5/site-packages/pcst_fast.cpython-35m-darwin.so
  Expected in: flat namespace
 in /Users/jonathanli/Documents/research/packages/OmicsIntegrator2/venv/lib/python3.5/site-packages/pcst_fast.cpython-35m-darwin.so

When I try to install pcst_fast:

(venv) jonathanli (master *) OmicsIntegrator2 $ pip install pcst_fast
Requirement already satisfied (use --upgrade to upgrade): pcst-fast in ./venv/lib/python3.5/site-packages
Requirement already satisfied (use --upgrade to upgrade): pybind11>=2.1.0 in ./venv/lib/python3.5/site-packages (from pcst-fast)
You are using pip version 7.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

Ernest: Ranking subnetworks

Ranking subnetworks based on something Anthony introduced in his most recent manuscript:

rank scores for these sub-networks according to their prize densities (sum of prizes multiplied by the fractional size of sub-network)
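A sketch of how that score could be computed, under two assumptions that are mine and not from the manuscript: sub-networks are the connected components of the output network, and "fractional size" means the component's node count divided by the total node count. The attribute name `prize` is also assumed.

```python
import networkx as nx

def rank_subnetworks(network, prize_attr="prize"):
    """Score each connected component as (sum of prizes) * (component size / total size)."""
    total = network.number_of_nodes()
    scores = []
    for component in nx.connected_components(network):
        prize_sum = sum(network.nodes[n].get(prize_attr, 0.0) for n in component)
        scores.append((prize_sum * len(component) / total, sorted(component)))
    # Highest score first.
    return sorted(scores, reverse=True)

# Toy network: a three-node component and a singleton.
network = nx.Graph([("A", "B"), ("B", "C")])
network.add_node("D")
nx.set_node_attributes(network, {"A": 1.0, "B": 1.0, "C": 1.0, "D": 2.0}, "prize")
ranked = rank_subnetworks(network)
```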

create style code for visualization

Originally from @AmandaKedaigle, copied here.

Specific things create_style_code_for_visualization.py did that we'll need to replace in some manner.

-Attributes for coloring should be named based on what kind of data: proteinChange (for both proteomic and metabolomic data) and geneChange (for expression data)

-Add the "TerminalType" attribute based on what input file you were looking at. For visualization code, these should be "Proteomic", "Metabolite", "mRNA", "TF" (from Garnet)

-It also inferred which one should be the actual "prize" column - for Proteomic or Metabolite nodes, it should be abs(data value), for TF nodes, it should be the data value (which does not change proteinChange or geneChange attr values), and for mRNA data, if prize is not already assigned to a protein node from this mRNA name, it gets prize zero (and geneChange from the data value)

Low(er) priority: support for TF lists

Ernest: randomizations heatmap should show node degree information

Generate a version of the heatmaps for parameter selection that simultaneously capture node degree, robustness and specificity. Since degree is fixed for a node, it could be plotted at the margin of the heatmap. Robustness and specificity depend on the parameters. Perhaps we could have two versions of the plot: (A) Color scale for robustness for each matrix element with average specificity of nodes shown in the margin and (B) color scale for specificity for each matrix element with average robustness of nodes shown in the margin.

Web tool not working

@AmandaKedaigle @zfrenchee Can you guys take a look at this soon? Error in browser:

A problem occurred in a Python script.

/var/www/htdocs/omicsintegrator/log/tmpVAf0YI.html contains the description of this error.

Same error in Safari and Firefox. Likely data-dependent error. He can't share the raw data though.

random terminals: how random?

@AmandaKedaigle

Here are the top couple nodes (left column, their numerical ID) and their edge degrees (right column)

4400      407
9466      409
6347      413
629       416
4309      419
12140     424
430       457
12015     475
75        478
6877      481
10772     516
11477     542
9358      569
4517      597
11171     617
12162     626
6397      647
5146      652
6058      653
4308      655
6433      662
4565      688
7598      746
4767      750
10141     766
6225      836
4310     1113
4770     1239
275      2008
6400     9289

The way random terminals used to work (I think) was that it would look within a fixed window around the current terminal's position in the above list and select a replacement by sampling an offset from a Gaussian.

But let's say node 6400 is your terminal, and you get an offset of 10, then you're selecting a node with degree 662 instead of 9289 which is a big change in node degree. The window size was set to 100 up or down (200 total I think) meaning you could really get a hugely different value.

Is that how we want to keep doing things? If so do we want to shrink the window? Or maybe we should do something else? What about a uniform sampling from the 5 above and 5 below?

Let me know what you think...
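To make the last option concrete, here is a minimal sketch of uniform sampling from the 5 nodes above and 5 below in the degree-sorted list. The function name and signature are hypothetical; `degrees` is assumed to be a 1D array indexed by numerical node ID.

```python
import numpy as np

def sample_degree_matched(degrees, terminal, half_window=5, rng=None):
    """Pick a replacement for `terminal` uniformly from its neighbors in degree rank."""
    rng = np.random.default_rng() if rng is None else rng
    order = np.argsort(degrees, kind="stable")         # node IDs sorted by degree
    rank = int(np.flatnonzero(order == terminal)[0])   # position of the terminal
    lo = max(0, rank - half_window)
    hi = min(len(order), rank + half_window + 1)
    candidates = order[lo:hi]
    candidates = candidates[candidates != terminal]    # never return the terminal itself
    return int(rng.choice(candidates))

# Toy example: node i has degree i, so ranks and node IDs coincide.
degrees = np.arange(100)
replacement = sample_degree_matched(degrees, 50, rng=np.random.default_rng(0))
```

Note that the window is clipped at the ends of the list, so the highest-degree node (6400 above) can only be swapped for nodes of lower degree. A window defined in ranks also does nothing about the large degree gaps at the top of the list.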

Write docstrings for clustering techniques

The docstrings are the multi-line comments below the function definitions. The functions that need annotations are louvain_clustering, edge_betweenness_clustering, k_clique_clustering.

Adding negative prizes for very low mRNA expression

@AmandaKedaigle's idea. Max's work. Further details coming.

Ernest: Avoid parameter selection by randomize & merge

Avoiding parameter selection. This is much more speculative, but what if we chose parameter combinations at random, and then created a merged network by taking the average weighted by the probability the network is not random (mean specificity score)? This is similar to Importance sampling, but less rigorous since specificity is not the same as prob_true.

Louvain clustering output

OmicsIntegrator2/src/graph.py

Line 366 in 11a24e1

louvain_clustering(augmented_forest)

Getting the following error:

Refactor randomizations and output_forest_as_networkx

The randomizations function has gotten a little bloated. Refactor it to make it a little more atomic. Other functions which might be impacted / implicated:

- _aggregate_pcsf
- get_networkx_graph_as_dataframe_of_nodes
- get_networkx_graph_as_dataframe_of_edges

NetworkXError: GraphML writer does not support <class 'list'> as data values.

The otherLocations node attribute is a list of extracellular locations, but networkx cannot write lists to GraphML, apparently.
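One possible workaround, sketched below: join list-valued attributes into a delimited string before writing. The helper name and the `|` separator are arbitrary choices for illustration.

```python
import networkx as nx

def stringify_list_attributes(graph, sep="|"):
    """Return a copy of `graph` whose list-valued node attributes are joined into strings."""
    clean = graph.copy()
    for _, attrs in clean.nodes(data=True):
        for key, value in attrs.items():
            if isinstance(value, (list, tuple, set)):
                attrs[key] = sep.join(str(v) for v in value)
    return clean

g = nx.Graph()
g.add_node("A", otherLocations=["nucleus", "cytoplasm"])
g.add_node("B", otherLocations=[])
g.add_edge("A", "B")

clean = stringify_list_attributes(g)
# generate_graphml yields the document line by line;
# nx.write_graphml(clean, path) writes it to a file instead.
graphml = "\n".join(nx.generate_graphml(clean))
```

Anything reading the file back has to split the string again, so the separator should not occur in location names.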

Incorrect output filename

Small one, but shouldn't it be filename="graph_edgelist.txt" per the docstring?

OmicsIntegrator2/src/graph.py

Line 948 in 71d5c59

 def output_networkx_graph_as_edgelist(nxgraph, output_dir, filename="graph_json.json"): 

Bug with main.py

The imports on line 11 of main.py need to be fixed. In particular, get_networkx_graph_as_node_edge_dataframes needs to be replaced with the two separate functions available in graph, and the place where it is called needs to be replaced with two lines.

In addition, get_networkx_subgraph_from_randomizations doesn't currently exist in graph.py. It needs to be implemented, or the right function needs to be imported instead.

knockout proteins

Right now, the knockout functionality is not implemented in OmicsIntegrator2.

parser.add_argument("--knockout", dest='knockout', nargs='*', default=[],
	help='Protein(s) you would like to "knock out" of the interactome to simulate a knockout experiment. [default: %(default)s]')
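A minimal sketch of what the knockout step itself might do, assuming the interactome is held as a networkx graph: remove the listed proteins and report any that were not found. The function name `knock_out` is hypothetical.

```python
import networkx as nx

def knock_out(interactome, proteins):
    """Return a copy of `interactome` without `proteins`, plus the names that were not found."""
    present = [p for p in proteins if p in interactome]
    knocked = interactome.copy()
    knocked.remove_nodes_from(present)  # also removes every edge touching these nodes
    not_found = sorted(set(proteins) - set(present))
    return knocked, not_found

interactome = nx.Graph([("A", "B"), ("B", "C"), ("C", "D")])
knocked, not_found = knock_out(interactome, ["B", "Z"])
```

Returning the unmatched names makes it easy to warn the user about typos in the `--knockout` list.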

Can we name alpha something else?

Tony's implementation of multi-PCSF (which Azim and I are making a modified/new version of for OI2) already uses a parameter called alpha in calculating new prizes.

If there's a good argument for calling the hub edge penalty alpha, we can name ours something different, but to avoid confusion it would be better to give similar parameters names similar to those in the published algorithm.

Grid Search over parameter space

This is an important feature a lot of the time. How would we like it to be implemented?

Ideally we would want to be able to run each of these in parallel (e.g. with multiprocessing).
There exist a couple "out of the box" solutions for this (e.g. https://github.com/fraenkel-lab/OmicsIntegrator2/blob/dev/src/grid_search.ipynb) which we may be able to use.
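As a starting point for discussion, a sketch of the grid itself plus a parallel map. `evaluate` is a placeholder, not the real PCSF call, and the score it returns is meaningless. A thread pool is used here only to keep the example self-contained; `multiprocessing.Pool` has the same `map` interface but needs the usual `if __name__ == "__main__":` guard.

```python
import itertools
from multiprocessing.pool import ThreadPool

def parameter_grid(**values):
    """Expand keyword lists into a list of parameter dicts (one per combination)."""
    keys = sorted(values)
    combos = itertools.product(*(values[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]

def evaluate(params):
    # Placeholder for a real PCSF run with these parameters.
    return {**params, "score": params["w"] * params["b"] + params["g"]}

grid = parameter_grid(w=[3, 6], b=[0.5, 1], g=[5, 10])
with ThreadPool(4) as pool:
    results = pool.map(evaluate, grid)
```

Returning one dict per run makes the output easy to collect into a table, which may help with the output-format question.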

What output format would we like?

@iamjli @AmandaKedaigle

exclude_terminals

rewrite _random_terminals

Right now it's just badly written.

Scaling for parameters

Log transform for "g"
Have "w" scale as a function of prize weight

Subcellular localization -- "no location" should be referred to as NaN

Web-tool suggestions

run_visualization_for_user_saved_JSON_file.sh creates two redundant folders: ../visualize_results_rundir and ../visualize_saved_<name>
Currently only recognizes Steiner node if TerminalType field is 0. Please change to recognize anything that's not RNA/TF/Proteomic
Node names need to be text searchable
Allow user to switch between viewing Protein and mRNA level change. Both could be represented by node color...looks cluttered as is.

_check_validity_of_instance

We need to think deeply about what the potential malformed inputs to OmicsIntegrator might be. This thread is for that. Beyond making sure every function is passed the right type, what else do we need to check? @divyaramamoorthy

Wrapper scripts TODO list

Data Input
- Wrapper to format prize/interactome files
Parameter sweep
- Grid search
  - Input: parameter file with param grid and I/O paths (/nfs/latdata/iamjli/packages/PCSF_analysis/specification_sheet.yaml)
  - Output: parameter summary node matrix (/nfs/latdata/iamjli/ALS/network_analysis/iMNs_ALS_CTR/new_proteomics_051917/param_search/summary/PCSF_JLI_networkNodeMatrix.tsv)
- Visualize node membership as heatmap (DONE)
  - /nfs/latdata/iamjli/packages/PCSF_analysis/bin/heatmapFromNodeFrequencyMatrix.R
Randomizations
- Randomization experiments
  - Input: parameter set
  - Output: file summarizing all runs
    - node_summary.tsv: protein, specificity, robustness, prize, type (/nfs/latdata/iamjli/ALS/network_analysis/iMNs_ALS_CTR/new_proteomics_051917/W_2_BETA_9_D_7_mu_5e-05/summary/summary_nodes.txt)
    - edge_summary.tsv (not urgent)
Post-processing (optional)
- clustering
  - Input: node_summary.tsv, robustness threshold, clustering method
  - Output: networkx object with nodes that have robustness greater than threshold.
    - Edge weights from interactome
    - Node attributes found in node_summary.tsv
    - Cluster assignments

Error during randomizations

Using this command:

python /home/nlpm/packages/OmicsIntegrator2/src/parameter_sweep.py -e /home/nlpm/OI2_Networks/interactome/iref14_Recon2_Htt_mzMatchedMet_OI2_interactome.txt -p /home/nlpm/OI2_Networks/terminals/Cypro_all_terminals.txt -o /home/nlpm/OI2_Networks/networks/Cypro/randomizations/ -w 3 6 9 12 -b 0.25 0.5 0.75 1 2 -g 5 10 20 --noisy_edges=100 --random_terminals=100 -noise=0.04567861 --seed=1

I got this error:

10:59:30 - Graph: INFO - {'w': 12.0, 'b': 2.0, 'g': 20.0, 'noise': 0.04567861, 'exclude_terminals': False, 'seed': 1, 'noisy_edges_repetitions': 100, 'random_terminals_repetitions': 100}
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/net/dorsal/apps/python361/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/net/dorsal/apps/python361/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/nlpm/packages/OmicsIntegrator2/src/graph.py", line 525, in _eval_randomizations
    forest, augmented_forest = self.randomizations(params["noisy_edges_repetitions"], params["random_terminals_repetitions"])
  File "/home/nlpm/packages/OmicsIntegrator2/src/graph.py", line 492, in randomizations
    forest, augmented_forest = self.output_forest_as_networkx(vertex_indices.node_index.values, edge_indices.edge_index.values)
  File "/home/nlpm/packages/OmicsIntegrator2/src/graph.py", line 322, in output_forest_as_networkx
    forest.add_nodes_from(list(set(self.nodes[vertex_indices]) - set(forest.nodes())))
  File "/home/nlpm/packages/OmicsIntegrator2/venv/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 1700, in __getitem__
    result = getitem(key)
IndexError: arrays used as indices must be of integer (or boolean) type
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nlpm/packages/OmicsIntegrator2/src/parameter_sweep.py", line 116, in <module>
    main()
  File "/home/nlpm/packages/OmicsIntegrator2/src/parameter_sweep.py", line 102, in main
    results = graph.grid_search_randomizations(args.prize_file, params)
  File "/home/nlpm/packages/OmicsIntegrator2/src/graph.py", line 561, in grid_search_randomizations
    results = pool.map(self._eval_randomizations, all_params)
  File "/net/dorsal/apps/python361/lib/python3.6/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/net/dorsal/apps/python361/lib/python3.6/multiprocessing/pool.py", line 608, in get
    raise self._value
IndexError: arrays used as indices must be of integer (or boolean) type
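The traceback only says that the index array was not integer-typed. One guess, not confirmed from the data, is that the index column became float somewhere upstream. The sketch below reproduces the same message with plain numpy and shows a cast as a possible fix; the variable names are illustrative.

```python
import numpy as np

nodes = np.array(["A", "B", "C", "D"])
vertex_indices = np.array([0.0, 2.0])  # float dtype, as if produced by an upstream step

# Indexing with a float array fails with the IndexError reported above.
try:
    nodes[vertex_indices]
    error = None
except IndexError as exc:
    error = str(exc)

# Casting to int restores normal fancy indexing.
selected = nodes[vertex_indices.astype(int)]
```

If the column is float because it contains NaN, the NaN rows need to be dropped before the cast, since casting them to int does not give meaningful indices.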

Terminals not in results included in final network

Terminals that were not included in results are left in the final network as nodes (though no edges are attached to them). The inner join to get rid of the dummy edges still leaves the unattached terminals.

This is not a huge problem, though these nodes don't get assigned attributes and thus caused an error with the web tool visualization. To be cleaner, they should probably be left out. I wanted to leave a note here in case they cause any other issues in the future.
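If we do decide to leave them out, a short sketch of one way to do it with networkx, assuming the final network is a networkx graph:

```python
import networkx as nx

# Toy forest with one terminal ("T") that ended up with no edges.
forest = nx.Graph([("A", "B"), ("B", "C")])
forest.add_node("T")

# nx.isolates yields nodes of degree zero; collect them before mutating the graph.
isolated = list(nx.isolates(forest))
forest.remove_nodes_from(isolated)
```

This would also drop a terminal that forms a legitimate single-node tree, so whether that is acceptable depends on how singletons should be reported.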
