fraenkel-lab / omicsintegrator2
Prize-Collecting Steiner Forests for Interactomes
Home Page: https://fraenkel-lab.github.io/OmicsIntegrator2
License: BSD 3-Clause "New" or "Revised" License
@AmandaKedaigle when people go to run OmicsIntegrator2, how do they do randomizations? Do they specify a number of times to randomize using each randomization strategy and merge them all? I'm still not sure I understand merging too well either... but maybe I can read about that.
pcsf returns edge_indices as a 1D np array. It would be nice to return a 2D array of vertex indices instead (since pcsf_exact uses this format), but I think this breaks a lot of downstream analyses, particularly _aggregate_pcsf and randomizations.
One solution would be to create a copy of the interactome, and assign robustness/specificity as node and edge attributes.
So what's the best way to go from 2D array of vertex indices to 1D np array of edge_indices?
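One possible answer, as a minimal sketch rather than the project's actual approach: build a lookup from each unordered vertex pair to its row in the full edge list, then translate the 2D output through it. The function name and the toy data are hypothetical, and the sketch assumes an undirected graph with no duplicate edges.

```python
def edge_indices_from_vertex_pairs(edges, selected_pairs):
    """Map (u, v) vertex-index pairs back to their row indices in `edges`.

    `edges` is the full edge list, where row i holds the two endpoints of
    edge index i. `selected_pairs` is the 2D result to convert.
    """
    # Undirected graph: sort each pair so that (u, v) and (v, u) match.
    lookup = {tuple(sorted(pair)): i for i, pair in enumerate(edges)}
    return [lookup[tuple(sorted(pair))] for pair in selected_pairs]


edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
selected = [(3, 2), (0, 1)]
indices = edge_indices_from_vertex_pairs(edges, selected)
```

The same function accepts rows of a 2D numpy array, and the returned list can be wrapped in np.asarray to get the 1D array the downstream code expects.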
python forest.py -e /nfs/latdata/iamjli/ALS/data/interactome/iRefIndex_v13_MIScore_interactome.txt -p /nfs/latdata/iamjli/ALS/data/iMNs/proteomics/20170323_ALS_CTR_iMNs_protein_log2FC1_FDR0.01.tsv -o ../output/ --random_terminals=5
run from /nfs/latdata/iamjli/packages/OmicsIntegrator2/src
Error:
Traceback (most recent call last):
File "forest.py", line 477, in <module>
forest, augmented_forest = graph.randomizations(prizes, terminals, args.noisy_edges_repetitions, args.random_terminals_repetitions)
File "forest.py", line 329, in randomizations
for random_prizes, terminals in [self._random_terminals(prizes, terminals) for rep in range(random_terminals_reps)]:
File "forest.py", line 329, in <listcomp>
for random_prizes, terminals in [self._random_terminals(prizes, terminals) for rep in range(random_terminals_reps)]:
File "forest.py", line 296, in _random_terminals
new_prizes[new_terminal] = prizes[old_terminal]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Some notes from our convo on Friday, Mandy, as well as a few other things I just thought of.
@brycehwang's idea.
May force the network to choose a greater diversity of paths around your protein of interest.
@AmandaKedaigle this thread is for discussing whether we need to support directed edges, and if so, how we might go about doing so. I imagine Ludwig will need to be part of these discussions. In the short term, I plan on releasing this without support for directed edges, because I have a heavy bias here of "good is better than perfect", unless you think that's a bad idea.
Currently, we have a file called write_cyto_file.py in the repo, but instead of doing this ourselves, we should have networkx do this, I think.
Cytoscape can take in these formats including:
Graph Markup Language (GML or .gml format)
networkx has documentation about reading and writing here, including the functionality of writing GML files.
This seems like a nice way to go. However, we need to make sure we don't run into any problems involving issues like this one: http://stackoverflow.com/questions/5828045/transfer-layout-from-networkx-to-cytoscape
Quick final note: we may want to use graph-tool instead of networkx.
Hi @AmandaKedaigle ,
Will there ever be duplicate edges in the edge file? I.e.
A B prize1
B A prize1
or worse
A B prize1
B A prize2
Specifically I'm asking whether I should check for that in forest, and what I should do if it happens...
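A check along these lines could be sketched as follows. This is only an illustration: the function name and the row format (node, node, value) are assumptions, and what to do with conflicting duplicates (raise, keep the first, average) is left open, as in the question above.

```python
from collections import defaultdict


def find_duplicate_edges(rows):
    """Group (node_a, node_b, value) rows by unordered node pair.

    Returns {pair: [values]} for every pair that appears more than once.
    """
    seen = defaultdict(list)
    for a, b, value in rows:
        # frozenset makes "A B" and "B A" the same key.
        seen[frozenset((a, b))].append(value)
    return {
        tuple(sorted(pair)): values
        for pair, values in seen.items()
        if len(values) > 1
    }


rows = [("A", "B", 0.4), ("B", "A", 0.7), ("A", "C", 0.2)]
duplicates = find_duplicate_edges(rows)
# Pairs whose repeated rows disagree on the value (the "or worse" case).
conflicting = {pair for pair, values in duplicates.items() if len(set(values)) > 1}
```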
@AmandaKedaigle What are the post-processing steps for graphs, and which of those do we want to support under which circumstances? In my notes, I see that there's been discussion of betweenness, Louvain clustering, and community clustering. I imagine some of these are supported by networkx, others we might need to implement on our own.
@iamjli would you comment on this for us?
This was released a couple days ago.
Similarly to GarNet, it would be really nice to have a beautiful readme explaining this project. I no longer believe we'll have OmicsIntegrator2 be a fork of OmicsIntegrator(1), so I think it makes sense to duplicate important documents from OI1 here. I imagine a lot of the overview can be borrowed from OI1 (though it might be nice to add a little). The technical stuff should be easier this time, since installing is a single command.
I still think some variables would be better as class attributes. For instance, it seems arbitrary that self.edge_penalties is an attribute despite not being used anywhere else in the class. On the other hand, terminals, defined in _prepare_prizes, is redefined in pcsf, and will be again in pcsf_exact.
I think I would be satisfied if prizes, terminals, and (maybe) node_attributes were their own attributes, especially since prizes and terminals are both arrays that are meaningless to the user outside of the context of the class. The current format essentially forces users to save prizes/terminals as local vars, just to pass them back into pcsf.
Arguments against?
I've compiled a list of what seems to be the most useful clustering and enrichment methods for the output of OmicsIntegrator. Below are the methods that are most frequently used, and should be implemented in OmicsIntegrator2 (listed in order of priority)
Clustering:
As well as clustering, it would also be helpful to have enrichment methods.
If there are any others that you think of, please let me know!
@AmandaKedaigle With respect to negative prizes, how are they computed? I imagine it isn't all too complicated, just having to do with a mu parameter...?
Remind me again why we need to keep original, negative and result prizes?
Which allows for #35
@divyaramamoorthy @AmandaKedaigle
The web tool has support for new kinds of interactions we'd like to include eventually:
The output of Garnet should be changed to indicate what genes were used to predict a TF was important (i.e. these genes are predicted targets of that TF). Then, those genes should be added to the network as new nodes (mRNA nodes, not protein nodes), and edges should be drawn:
"pd" or protein-DNA interaction, from TFs to their target mRNA nodes
"tp" or mRNA-protein interactions, from mRNA nodes to their protein products
@AmandaKedaigle What does the old omicsintegrator do?
Running forest from /nfs/latdata/iamjli/ALS/OmicsIntegrator2/src:
python forest.py -e ../data/interactome/iRefIndex_v13_MIScore_interactome.txt -p ../data/iMNs/proteomics/20170323_ALS_CTR_iMNs_protein_log2FC1_FDR0.01.tsv -o ../output/test/
I am getting this error:
terminate called after throwing an instance of 'std::invalid_argument'
what(): In the rooted case, only one output cluster is supported.
Aborted
Traced the error back to the pcst_fast function.
This command, run from /nfs/latdata/iamjli/packages/OmicsIntegrator2/src, produces a segfault:
python __main__.py -e ../PCSF_compare/data/iRefIndex_v13_MIScore_interactome_COSTS.txt -p ../PCSF_compare/data/20170323_ALS_CTR_iMNs_protein_log2FC1_FDR0.01.tsv -o ../output/
PCST runs fine if I use a shortened interactome file (../PCSF_compare/data/iRefIndex_v13_MIScore_interactome_COSTS.short.txt), or one whose scores have not been subtracted from 1 (../PCSF_compare/data/iRefIndex_v13_MIScore_interactome.txt).
-Gene symbol
-Prize value
-Robustness
-Specificity
-Protein type (steiner, TF, terminal)
Provide summary statistics for single run forests, provide warnings when network structure is off (large star structures/many singleton nodes, etc). Write to file.
Ideally also do this for randomizations, but the code structure has to be modified significantly.
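For the single-run case, the summary and warnings could be sketched roughly like this. Everything here is hypothetical: the function name, the returned fields, and especially the star_fraction rule, which is just a placeholder for whatever definition of "large star structure" we settle on.

```python
from collections import Counter


def summarize_forest(nodes, edges, star_fraction=0.5):
    """Return basic statistics and warnings for a forest given as lists."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Nodes that never appear in an edge are singletons.
    singletons = [node for node in nodes if degree[node] == 0]
    hub, hub_degree = max(degree.items(), key=lambda item: item[1], default=(None, 0))
    warnings = []
    if singletons:
        warnings.append(f"{len(singletons)} singleton node(s)")
    # Placeholder rule: flag a star when one node touches most of the edges.
    if edges and hub_degree / len(edges) >= star_fraction:
        warnings.append(f"possible star around {hub} (degree {hub_degree})")
    return {
        "n_nodes": len(nodes),
        "n_edges": len(edges),
        "singletons": singletons,
        "warnings": warnings,
    }


nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("A", "D")]
summary = summarize_forest(nodes, edges)
```

The threshold would need tuning on real forests, since very small trees trip this rule trivially.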
Python 3.5.1 (default, Mar 1 2017, 15:20:57)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pcst_fast import pcst_fast
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: dlopen(/Users/jonathanli/Documents/research/packages/OmicsIntegrator2/venv/lib/python3.5/site-packages/pcst_fast.cpython-35m-darwin.so, 2): Symbol not found: __PyThreadState_UncheckedGet
Referenced from: /Users/jonathanli/Documents/research/packages/OmicsIntegrator2/venv/lib/python3.5/site-packages/pcst_fast.cpython-35m-darwin.so
Expected in: flat namespace
in /Users/jonathanli/Documents/research/packages/OmicsIntegrator2/venv/lib/python3.5/site-packages/pcst_fast.cpython-35m-darwin.so
When I try to install pcst_fast:
(venv) jonathanli (master *) OmicsIntegrator2 $ pip install pcst_fast
Requirement already satisfied (use --upgrade to upgrade): pcst-fast in ./venv/lib/python3.5/site-packages
Requirement already satisfied (use --upgrade to upgrade): pybind11>=2.1.0 in ./venv/lib/python3.5/site-packages (from pcst-fast)
You are using pip version 7.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Ranking subnetworks based on something Anthony introduced in his most recent manuscript:
rank scores for these sub-networks according to their prize densities (sum of prizes multiplied by the fractional size of the sub-network).
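A rough sketch of that score, under one loudly stated assumption: "fractional size" is read here as the sub-network's node count divided by the total node count over all sub-networks. The manuscript may define it differently, and the function and variable names are made up for illustration.

```python
def rank_by_prize_density(subnetworks, prizes):
    """Rank sub-networks by (sum of prizes) * (fraction of all nodes).

    `subnetworks` maps a label to its list of nodes; `prizes` maps a node
    to its prize. Nodes without a prize (Steiner nodes) count as zero.
    """
    total_nodes = sum(len(nodes) for nodes in subnetworks.values())
    scores = {}
    for label, nodes in subnetworks.items():
        prize_sum = sum(prizes.get(node, 0.0) for node in nodes)
        scores[label] = prize_sum * len(nodes) / total_nodes
    # Highest density first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


subnetworks = {"c1": ["A", "B", "C"], "c2": ["D"]}
prizes = {"A": 1.0, "B": 2.0, "D": 5.0}
ranking = rank_by_prize_density(subnetworks, prizes)
```

In this toy case c1 outranks c2 even though c2 holds the single largest prize, because the size factor rewards larger sub-networks.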
Originally from @AmandaKedaigle, copied here.
Specific things create_style_code_for_visualization.py did that we'll need to replace in some manner.
-Attributes for coloring should be named based on what kind of data: proteinChange (for both proteomic and metabolomic data) and geneChange (for expression data)
-Add the "TerminalType" attribute based on what input file you were looking at. For visualization code, these should be "Proteomic", "Metabolite", "mRNA", "TF" (from Garnet)
-It also inferred which one should be the actual "prize" column - for Proteomic or Metabolite nodes, it should be abs(data value), for TF nodes, it should be the data value (which does not change proteinChange or geneChange attr values), and for mRNA data, if prize is not already assigned to a protein node from this mRNA name, it gets prize zero (and geneChange from the data value)
Generate a version of the heatmaps for parameter selection that simultaneously capture node degree, robustness and specificity. Since degree is fixed for a node, it could be plotted at the margin of the heatmap. Robustness and specificity depend on the parameters. Perhaps we could have two versions of the plot: (A) Color scale for robustness for each matrix element with average specificity of nodes shown in the margin and (B) color scale for specificity for each matrix element with average robustness of nodes shown in the margin.
@AmandaKedaigle @zfrenchee Can you guys take a look at this soon? Error in browser:
A problem occurred in a Python script.
/var/www/htdocs/omicsintegrator/log/tmpVAf0YI.html contains the description of this error.
Same error in Safari and Firefox. Likely data-dependent error. He can't share the raw data though.
Here are the 30 highest-degree nodes (left column: numerical node ID; right column: edge degree), in ascending order of degree:
4400 407
9466 409
6347 413
629 416
4309 419
12140 424
430 457
12015 475
75 478
6877 481
10772 516
11477 542
9358 569
4517 597
11171 617
12162 626
6397 647
5146 652
6058 653
4308 655
6433 662
4565 688
7598 746
4767 750
10141 766
6225 836
4310 1113
4770 1239
275 2008
6400 9289
The way random terminals used to work (I think) was that it would look within a fixed window around the current terminal (in the above list) and select a replacement by sampling an offset from a gaussian.
But let's say node 6400 is your terminal, and you get an offset of 10; then you're selecting a node with degree 662 instead of 9289, which is a big change in node degree. The window size was set to 100 up or down (200 total, I think), meaning you could really get a hugely different value.
Is that how we want to keep doing things? If so do we want to shrink the window? Or maybe we should do something else? What about a uniform sampling from the 5 above and 5 below?
Let me know what you think...
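To make the "uniform sampling from the 5 above and 5 below" option concrete, here is a minimal sketch. It is not the existing _random_terminals code; the function name, the window parameter, and the list-based representation are all assumptions.

```python
import random


def sample_similar_degree_node(degree_sorted_nodes, terminal, window=5, rng=random):
    """Pick a replacement uniformly from the nodes within `window` positions
    of `terminal` in a list sorted by degree, excluding the terminal itself."""
    position = degree_sorted_nodes.index(terminal)
    low = max(0, position - window)
    high = min(len(degree_sorted_nodes), position + window + 1)
    candidates = degree_sorted_nodes[low:position] + degree_sorted_nodes[position + 1:high]
    return rng.choice(candidates)


# Tail of the degree-sorted list above (node IDs, ascending degree).
nodes = [4308, 6433, 4565, 7598, 4767, 10141, 6225, 4310, 4770, 275, 6400]
rng = random.Random(1)  # seeded only to make the example repeatable
replacement = sample_similar_degree_node(nodes, 6400, window=5, rng=rng)
```

Note that at either end of the list the window is truncated, so the highest-degree node 6400 only ever draws from the five nodes below it. Even then the replacement's degree (766 to 2008 in the list above) is far from 9289, so a window in rank space may still not be enough for the biggest hubs.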
The docstrings are the multi-line comments below the function definitions. The functions that need annotations are louvain_clustering, edge_betweenness_clustering, k_clique_clustering.
@AmandaKedaigle's idea. Max's work. Further details coming.
Avoiding parameter selection. This is much more speculative, but what if we chose parameter combinations at random, and then created a merged network by taking the average weighted by the probability the network is not random (mean specificity score)? This is similar to Importance sampling, but less rigorous since specificity is not the same as prob_true.
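One way to picture the merge, purely as a speculative sketch: treat each parameter combination as a run with a weight and a set of selected nodes, and score each node by its weighted inclusion frequency. The weight is whatever proxy for "probability the network is not random" we pick (mean specificity in the idea above); the names and the data layout are hypothetical.

```python
def merge_runs(runs):
    """Merge runs into {node: weighted inclusion frequency}.

    `runs` is a list of (weight, set_of_nodes) pairs, one per parameter
    combination. Weights are normalized so that scores fall in [0, 1].
    """
    total_weight = sum(weight for weight, _ in runs)
    merged = {}
    for weight, nodes in runs:
        for node in nodes:
            merged[node] = merged.get(node, 0.0) + weight / total_weight
    return merged


runs = [(0.5, {"A", "B"}), (0.25, {"B", "C"}), (0.25, {"A"})]
merged = merge_runs(runs)
```

A node present in every run scores 1.0 regardless of the weights, and a node that only shows up in low-weight runs is pushed toward 0.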
Line 366 in 11a24e1
The randomizations function has gotten a little bloated. Refactor it to make it a little more atomic. Other functions which might be impacted / implicated:
- _aggregate_pcsf
- get_networkx_graph_as_dataframe_of_nodes
- get_networkx_graph_as_dataframe_of_edges
The otherLocations node attribute is a list of extracellular locations, but networkx cannot write lists to GraphML, apparently.
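A possible workaround, sketched on plain dicts so it does not depend on any networkx call: join list-valued attributes into a delimited string before export. The function name and the "|" separator are arbitrary choices, not something the repo already does.

```python
def flatten_list_attributes(node_attributes, separator="|"):
    """Return a copy of {node: {attr: value}} in which every list or tuple
    value is replaced by a single separator-joined string."""
    flattened = {}
    for node, attrs in node_attributes.items():
        flattened[node] = {
            key: separator.join(map(str, value)) if isinstance(value, (list, tuple)) else value
            for key, value in attrs.items()
        }
    return flattened


attributes = {
    "EGFR": {"prize": 1.2, "otherLocations": ["membrane", "secreted"]},
    "TP53": {"prize": 0.8, "otherLocations": []},
}
clean = flatten_list_attributes(attributes)
```

The flattened values could then be written back onto the graph's nodes before saving. An empty list becomes an empty string, which keeps the attribute present on every node.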
Small one, but shouldn't it be filename="graph_edgelist.txt" per the docstring?
Line 948 in 71d5c59
The imports for main.py in line 11 need to be fixed a little bit. In particular, get_networkx_graph_as_node_edge_dataframes needs to be replaced with the two separate functions available in graph (and also once again where it is called, needs to be replaced with two lines)
In addition, get_networkx_subgraph_from_randomizations doesn't currently exist in graph.py. Either it needs to be implemented, or the correct function needs to be imported instead.
Right now, the knockout functionality is not implemented in OmicsIntegrator2.
parser.add_argument("--knockout", dest='knockout', nargs='*', default=[],
    help='Protein(s) you would like to "knock out" of the interactome to simulate a knockout experiment. [default: %(default)s]')
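As a starting point for the missing functionality, a knockout could be as simple as dropping every interactome edge that touches a knocked-out protein before the graph is built. This is a sketch under that assumption; the function name and the (protein, protein, cost) row format are hypothetical.

```python
def knock_out(edges, knockout):
    """Return the edges that do not touch any knocked-out protein."""
    removed = set(knockout)
    return [
        (a, b, cost)
        for a, b, cost in edges
        if a not in removed and b not in removed
    ]


edges = [("A", "B", 0.3), ("B", "C", 0.5), ("C", "D", 0.1)]
remaining = knock_out(edges, ["B"])
```

With the default empty list the interactome is returned unchanged. Whether a knocked-out protein that is also a terminal should keep its prize is a separate decision this sketch does not make.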
Tony's implementation of multi-PCSF (which Azim and I are making a modified/new version of for OI2) already uses a parameter called alpha in calculating new prizes.
If there's a good argument for calling the hub edge penalty alpha, we can name ours something different, but to avoid confusion it would be better to give similar parameters the same names as in the published algorithm.
This is an important feature a lot of the time. How would we like it to be implemented?
What output format would we like?
Right now it's just badly written.
Log transform for "g"
Have "w" scale a function of prize weight
run_visualization_for_user_saved_JSON_file.sh creates two redundant folders: ../visualize_results_rundir and ../visualize_saved_<name>
Currently only recognizes Steiner node if TerminalType field is 0. Please change to recognize anything that's not RNA/TF/Proteomic
Node names need to be text searchable
Allow user to switch between viewing Protein and mRNA level change. Both could be represented by node color...looks cluttered as is.
We need to think deeply about what the potential malformed inputs to OmicsIntegrator might be. This thread is for that. Beyond making sure every function is passed the right type, what else do we need to check? @divyaramamoorthy
Using this command:
python /home/nlpm/packages/OmicsIntegrator2/src/parameter_sweep.py -e /home/nlpm/OI2_Networks/interactome/iref14_Recon2_Htt_mzMatchedMet_OI2_interactome.txt -p /home/nlpm/OI2_Networks/terminals/Cypro_all_terminals.txt -o /home/nlpm/OI2_Networks/networks/Cypro/randomizations/ -w 3 6 9 12 -b 0.25 0.5 0.75 1 2 -g 5 10 20 --noisy_edges=100 --random_terminals=100 -noise=0.04567861 --seed=1
I got this error:
10:59:30 - Graph: INFO - {'w': 12.0, 'b': 2.0, 'g': 20.0, 'noise': 0.04567861, 'exclude_terminals': False, 'seed': 1, 'noisy_edges_repetitions': 100, 'random_terminals_repetitions': 100}
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/net/dorsal/apps/python361/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/net/dorsal/apps/python361/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/home/nlpm/packages/OmicsIntegrator2/src/graph.py", line 525, in _eval_randomizations
forest, augmented_forest = self.randomizations(params["noisy_edges_repetitions"], params["random_terminals_repetitions"])
File "/home/nlpm/packages/OmicsIntegrator2/src/graph.py", line 492, in randomizations
forest, augmented_forest = self.output_forest_as_networkx(vertex_indices.node_index.values, edge_indices.edge_index.values)
File "/home/nlpm/packages/OmicsIntegrator2/src/graph.py", line 322, in output_forest_as_networkx
forest.add_nodes_from(list(set(self.nodes[vertex_indices]) - set(forest.nodes())))
File "/home/nlpm/packages/OmicsIntegrator2/venv/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 1700, in __getitem__
result = getitem(key)
IndexError: arrays used as indices must be of integer (or boolean) type
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/nlpm/packages/OmicsIntegrator2/src/parameter_sweep.py", line 116, in <module>
main()
File "/home/nlpm/packages/OmicsIntegrator2/src/parameter_sweep.py", line 102, in main
results = graph.grid_search_randomizations(args.prize_file, params)
File "/home/nlpm/packages/OmicsIntegrator2/src/graph.py", line 561, in grid_search_randomizations
results = pool.map(self._eval_randomizations, all_params)
File "/net/dorsal/apps/python361/lib/python3.6/multiprocessing/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/net/dorsal/apps/python361/lib/python3.6/multiprocessing/pool.py", line 608, in get
raise self._value
IndexError: arrays used as indices must be of integer (or boolean) type
Terminals that were not included in the results are left in the final network as nodes (though no edges are attached to them). The inner join that gets rid of the dummy edges still leaves the unattached terminals.
This is not a huge problem, though these nodes don't get assigned attributes and thus caused an error with the web tool visualization. To be cleaner, they should probably be left out. I wanted to leave a note here in case they cause any other issues in the future.
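Leaving them out could look something like the sketch below, which keeps only nodes that appear in at least one remaining edge. It works on plain lists for illustration; the function name is made up, and the real fix would live wherever the forest is assembled.

```python
def drop_unattached_nodes(nodes, edges):
    """Keep only the nodes that are an endpoint of at least one edge."""
    connected = {node for edge in edges for node in edge}
    return [node for node in nodes if node in connected]


nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C")]
kept = drop_unattached_nodes(nodes, edges)
```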