Rework FEMap API about cinnabar HOT 14 OPEN

openfreeenergy commented on August 15, 2024
Rework FEMap API

from cinnabar.

Comments (14)

IAlibay commented on August 15, 2024

I think all three would be great, but as a priority CSV + network object since that's how we'll interact with it short term

IAlibay commented on August 15, 2024

> Like could there be a case where we have experimental data for ligand A and ligand C but we also simulate ligand B which is part of the A->B->C network but we don't have experimental data for it?

Yes 100% we should expect that we have fewer experimental points than computed points. That's pretty much the standard use case - i.e. you have affinity data for a few ligands and then you do FEPs based on these initial ligands to attempt to find a better binder before the next round of synthesis.
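
A minimal sketch of that use case, using plain dictionaries rather than any FEMap API. All values are made up for illustration: ligand B is a computed-only intermediate, so the comparison with experiment has to go through the A->B->C path.

```python
# Hypothetical data (kcal/mol): experimental dG exists for A and C only.
experimental_dg = {"ligand_A": -8.93, "ligand_C": -9.50}
calculated_ddg = {("ligand_A", "ligand_B"): 0.36, ("ligand_B", "ligand_C"): -0.80}

# Nodes that appear in calculations but have no experimental value.
nodes = {label for edge in calculated_ddg for label in edge}
missing = sorted(nodes - set(experimental_dg))

# Sum the calculated ddG along A->B->C and compare with the experimental A->C difference.
path = [("ligand_A", "ligand_B"), ("ligand_B", "ligand_C")]
path_calc = sum(calculated_ddg[edge] for edge in path)
path_expt = experimental_dg["ligand_C"] - experimental_dg["ligand_A"]

print(missing)
print(round(path_calc, 2), round(path_expt, 2))
```

The point is only that the node set for calculations can be a superset of the node set with experimental data, so any analysis step has to tolerate nodes without a measured value.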

richardjgowers commented on August 15, 2024

So you're using string labels, but I'm assuming any hashable works, i.e. we could be dropping in gufe systems, but also that's not required.

I'm guessing that under the hood there's an nx.MultiDiGraph actually doing the data management.

I'm tempted to say that maybe the measurements (calculated or exptl) should be an object themselves (somewhere in this repo there's ExperimentalResults and RelativeResult objects). These could handle a few quirky cases like:

  • asymmetric errors (see also experimental measurement limits for non actives)
  • storing the individual legs of a calculation

So something like g.add_edge(systemA, systemB, RelativeResult(...)).

I would say you could be clever and use __float__ to make these fancy objects quack like floats, but you'll still always want the uncertainty, so probably better to make them quack like a pint.Measurement
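
A rough sketch of what such a measurement object could look like. The class name, fields and units below are hypothetical and are not the ExperimentalResults/RelativeResult classes in the repo; it only illustrates the `__float__` idea and why it loses information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelativeResultSketch:
    """Hypothetical container for one calculated ddG (kcal/mol assumed)."""

    ddg: float
    err_minus: float   # lower uncertainty bound
    err_plus: float    # upper uncertainty bound (may differ: asymmetric errors)
    legs: tuple = ()   # optional per-leg values, e.g. (complex, solvent)

    def __float__(self) -> float:
        # Lets float(result) work, but silently drops the uncertainty.
        return self.ddg

    @property
    def symmetric_error(self) -> float:
        # One simple convention: report the larger of the two bounds.
        return max(self.err_minus, self.err_plus)


result = RelativeResultSketch(ddg=0.36, err_minus=0.11, err_plus=0.20, legs=(-12.40, -12.76))
print(float(result), result.symmetric_error)
```

Because `float(result)` discards both error bounds, an interface that always carries value and uncertainty together (as pint.Measurement does) is probably the safer thing to imitate.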

ijpulidos commented on August 15, 2024

I wonder if we also want a way to have the information of different experiments (simulations) in the same FEMap for comparison. As in, I run the same set of transformations using one set of parameters and I want to compare it with the same set of transformations using a different set of parameters (or maybe even different engines). For this, we might want to see the results of both experiments in the same plot.

This may be out of the scope of this issue, but I am now faced with this situation where I want to see results from two different experiments in the same plot, to check which dots/transformations are really different between both. Does that make any sense?

mikemhenry commented on August 15, 2024

What kinds of inputs do we want to support? I was thinking that we would want to support csv, some sort of dictionary that has all the bits, and maybe a networkX graph?

  • csv
    Legacy format, people probably would want to use their old csv to check to see if there are any regressions in our changes.

  • dictionary
    Python dictionary with all the bits we need to build the networkx graph. This could be useful for users that don't want to write out a csv/want to use the API directly

  • networkX graph
    Could be useful for users that really know what they are doing but want to use our analysis tooling
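
One way the csv and dictionary inputs could converge on the same internal representation is sketched below. The CSV column names and the dictionary layout are assumptions made for this example; they are not the legacy arsenic CSV format.

```python
import csv
import io

# Hypothetical CSV layout: one calculated edge per row.
CSV_TEXT = """ligand_a,ligand_b,calc_ddg,calc_dddg
ligand_A,ligand_B,0.36,0.11
ligand_B,ligand_C,-1.20,0.15
"""


def edges_from_csv(handle):
    """Read rows into (source, target, attributes) tuples."""
    edges = []
    for row in csv.DictReader(handle):
        attrs = {"calc_DDG": float(row["calc_ddg"]), "calc_dDDG": float(row["calc_dddg"])}
        edges.append((row["ligand_a"], row["ligand_b"], attrs))
    return edges


def edges_from_dict(data):
    """Produce the same tuples from a dict keyed by (source, target)."""
    return [(a, b, dict(attrs)) for (a, b), attrs in data.items()]


csv_edges = edges_from_csv(io.StringIO(CSV_TEXT))
dict_edges = edges_from_dict({("ligand_A", "ligand_B"): {"calc_DDG": 0.36, "calc_dDDG": 0.11}})
print(csv_edges[0] == dict_edges[0])
```

Tuples of the form (source, target, attribute dict) are also what networkx's `add_edges_from` accepts, so both readers could feed a graph-based FEMap without further conversion.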

mikemhenry commented on August 15, 2024

I also think the dictionary will take some iteration/time to figure out the best structure, and it may be best to wait until after we get the rest of the API figured out.

mikemhenry commented on August 15, 2024

RE: Using networkx directed graph

DiGraphs hold directed edges. Self loops are allowed but multiple (parallel) edges are not.
We have talked about parallel edges being allowed, will we need to use a different networkx object to store our network?
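
For reference, a short networkx example of the difference. The attribute name `calc_DDG` and the values are just placeholders.

```python
import networkx as nx

# DiGraph: a second A->B edge overwrites the attributes of the first.
dg = nx.DiGraph()
dg.add_edge("ligand_A", "ligand_B", calc_DDG=0.36)
dg.add_edge("ligand_A", "ligand_B", calc_DDG=0.41)
print(dg.number_of_edges())
print(dg["ligand_A"]["ligand_B"]["calc_DDG"])

# MultiDiGraph: both measurements are kept as parallel edges.
mdg = nx.MultiDiGraph()
mdg.add_edge("ligand_A", "ligand_B", calc_DDG=0.36)
mdg.add_edge("ligand_A", "ligand_B", calc_DDG=0.41)
print(mdg.number_of_edges("ligand_A", "ligand_B"))
```

So if repeated measurements of the same transformation need to coexist, `nx.MultiDiGraph` is the networkx class that supports it.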

mikemhenry commented on August 15, 2024

Will we always have a 1-1 mapping of simulated data + experimental data? Or will we have a case where one set is a super or subset of the other? Like could there be a case where we have experimental data for ligand A and ligand C but we also simulate ligand B which is part of the A->B->C network but we don't have experimental data for it?

mikemhenry commented on August 15, 2024

This is what I am thinking for the FEMap API. I included a bit for plotting and stats, with the important part being that someone could just use part of arsenic and then use another tool of their choice. For example, they may just want to use our stat methods, so being able to pull out a pandas data frame that they can use in another tool would be good for interoperability.

from openff.units import unit
import arsenic
# Three 'main' module name spaces 
import arsenic.network
import arsenic.stats
import arsenic.viz 

# Main python entry point for making the FE Network
fe_map = arsenic.network.FEMap()

# Python API
# Add some experimental data
# Do we need a "expt" namespace on these values?
fe_map.add_node("ligand_A", DeltaG=-8.93*unit.kilocalories_per_mole, dDeltaG=0.10*unit.kilocalories_per_mole)
fe_map.add_node("ligand_B", DeltaG=-9.11*unit.kilocalories_per_mole, dDeltaG=0.10*unit.kilocalories_per_mole)

# Add some simulated data
# Dropping the units to save some space here
fe_map.add_edge("ligand_A", "ligand_B", calc_DDG=0.36, calc_dDDG_MBAR=0.11, calc_dDDG_additional=0.0)

# Loops work, not sure what data we want to store
fe_map.add_edge("ligand_A", "ligand_A")

# network is directional, so we can do A->B, B->A checks
fe_map.add_edge("ligand_B", "ligand_A", calc_DDG=0.36, calc_dDDG_MBAR=0.11, calc_dDDG_additional=0.0)

# For absolute free energy, the environment is needed for bookkeeping 
# Under the hood we will name the node ligand_H_Complex, and ligand_H_Solvent
fe_map.add_node("ligand_H", DDG=-8.83, dDDG=0.10, environment="Complex")
fe_map.add_node("ligand_H", DDG=-8.83, dDDG=0.10, environment="Solvent")



# We can also create a network from a CSV file
fe_map = arsenic.network.FEMap.from_csv("my_data.csv")

# Calculate statistics 
fe_stats = arsenic.stats.analyze_map(fe_map)
fe_stats.calculate("all") # ['RMSE', 'MUE', 'R2', 'rho','RAE','KTAU']
my_data_frame = fe_stats.to_pandas() # returns pandas data frame

# Plotting
plot = arsenic.viz.make_plot(backend="matplotlib")
# Add plotting options
options = {'method_name': 'softwarename',
 'target_name': 'made up protein',
 'color': 'hotpink',
 'guidelines': False}
                
# Make plot
plot.draw(plot, **options)

@richardjgowers let me know what you think!

I'm not sure if we should just use edge/node or do something like add_experiment and add_calculation.
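
To make the add_experiment/add_calculation alternative concrete, here is a toy sketch. `FEMapSketch` is a hypothetical stand-in that stores data in plain containers; it is not the proposed FEMap and says nothing about how the real class would be backed.

```python
class FEMapSketch:
    """Hypothetical stand-in for FEMap with domain-named methods."""

    def __init__(self):
        self.experiments = {}    # label -> (dG, uncertainty)
        self.calculations = []   # (label_a, label_b, ddG, uncertainty)

    def add_experiment(self, label, dg, uncertainty):
        # One experimental value per ligand; a repeat call replaces it.
        self.experiments[label] = (dg, uncertainty)

    def add_calculation(self, label_a, label_b, ddg, uncertainty):
        # A list keeps repeated A->B measurements rather than overwriting.
        self.calculations.append((label_a, label_b, ddg, uncertainty))


fe_map = FEMapSketch()
fe_map.add_experiment("ligand_A", -8.93, 0.10)
fe_map.add_experiment("ligand_B", -9.11, 0.10)
fe_map.add_calculation("ligand_A", "ligand_B", 0.36, 0.11)
print(len(fe_map.experiments), len(fe_map.calculations))
```

The domain names make the experimental/calculated split explicit at the call site, at the cost of hiding the graph vocabulary that networkx users would already know.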

orbeckst commented on August 15, 2024

Outsider (TAC) comment:

  • networkx graph as data structure looks very sensible (and should be easy to convert to a dict if really needed)
  • I'd look into quantities that can carry uncertainties (see e.g. uncertainties – perhaps pint.Measurement is doing the same?); having the opportunity to do automatic error propagation is nice
  • From our experience with https://github.com/Becksteinlab/multibind , I would pay attention to cycles in the graphs, as they are valuable both for error checking and for strong constraints on the data.
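
The cycle point can be illustrated with a small closure check. The numbers are invented, and the quadrature sum assumes the edge errors are independent, which may not hold in practice.

```python
import math

# Hypothetical calculated ddG values (kcal/mol) and uncertainties around a closed cycle.
cycle_edges = [
    ("ligand_A", "ligand_B", 0.36, 0.11),
    ("ligand_B", "ligand_C", -0.80, 0.15),
    ("ligand_C", "ligand_A", 0.50, 0.12),
]

# Thermodynamic consistency: ddG values around a closed cycle should sum to zero.
closure = sum(ddg for _, _, ddg, _ in cycle_edges)

# Assuming independent errors, uncertainties add in quadrature.
closure_err = math.sqrt(sum(err ** 2 for _, _, _, err in cycle_edges))

print(f"closure = {closure:.2f} +/- {closure_err:.2f} kcal/mol")
```

A closure value that is large compared with its propagated uncertainty suggests that at least one edge in the cycle is problematic.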

mikemhenry commented on August 15, 2024

Okay, minor changes: using a class for the results, and using deltaG, deltadeltaG and variance instead of the old convention here https://github.com/OpenFreeEnergy/arsenic#terminology

from openff.units import unit
import arsenic
# Three 'main' module name spaces 
import arsenic.network
import arsenic.stats
import arsenic.viz 

# Main python entry point for making the FE Network
fe_map = arsenic.network.FEMap()

# Python API
# Add some experimental data
# Do we need a "expt" namespace on these values?

fe_map.add_node("ligand_A", ExperimentalResult(deltaG=-8.93*unit.kilocalories_per_mole, 
                                               variance=0.10*unit.kilocalories_per_mole))
fe_map.add_node("ligand_B", ExperimentalResult(deltaG=-9.11*unit.kilocalories_per_mole, 
                                               variance=0.10*unit.kilocalories_per_mole))

# Add some simulated data
fe_map.add_edge("ligand_A", "ligand_B", RelativeResult(deltadeltaG=0.36*unit.kilocalories_per_mole, 
                                                       variance=0.11*unit.kilocalories_per_mole))

# We can also create a network from a CSV file
fe_map = arsenic.network.FEMap.from_csv("my_data.csv")

# We can save the network to disk as a GraphML XML
fe_map.save("my_network")

# Calculate statistics and maximum likelihood estimate 
fe_stats = arsenic.stats.analyze_map(fe_map)
fe_stats.calculate("all") # ['RMSE', 'MUE', 'R2', 'rho','RAE','KTAU']

# returns pandas data frame
my_data_frame = fe_stats.to_pandas() 

# Plotting
plot = arsenic.viz.make_plot(backend="matplotlib")
# Add plotting options
options = {'method_name': 'softwarename',
 'target_name': 'made up protein',
 'color': 'hotpink',
 'guidelines': False}
                
# Make plot
plot.draw(plot, **options)

richardjgowers commented on August 15, 2024

@ijpulidos yeah I think allowing multiple edges (measurements) between the same nodes (chemical states) is definitely a good idea. It does imply there's also some way to tag edges with an annotation to differentiate/group them later on.
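
With networkx, one possible way to tag parallel edges is the `key` argument of `MultiDiGraph.add_edge`. The tag names below are made up, and using the key as the annotation is only one option (an edge attribute would work too).

```python
import networkx as nx

g = nx.MultiDiGraph()
# The key labels each parallel edge between the same pair of nodes.
g.add_edge("ligand_A", "ligand_B", key="forcefield_1", calc_DDG=0.36)
g.add_edge("ligand_A", "ligand_B", key="forcefield_2", calc_DDG=0.41)

# Select one group of measurements by its tag.
subset = [
    (a, b, data["calc_DDG"])
    for a, b, key, data in g.edges(keys=True, data=True)
    if key == "forcefield_2"
]
print(subset)
```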

IAlibay commented on August 15, 2024

re: multiple calculation types on one graph, one of the questions I would ask here is what advantage do we provide by having a single graph rather than a graph per experiment? The main reason for mixing calculated and experimental data on the same graph is so that we can potentially normalise one to the other, however with two separate sets of calculations I'm not sure what advantage you get. Playing devil's advocate, would it not be more suitable to have separate graphs and then let users combine data for further analysis?

ijpulidos commented on August 15, 2024

@IAlibay While it's mostly just a convenience for bookkeeping and wrangling data, I guess another case where this might come in handy is when comparing different replicates from simulations using the same methods or protocol. This can also be used to check some kind of convergence between them. I have been using the _master_plot function to compare different replicates of simulations this way. I guess having them in the FEMap is just going to make it even more straightforward.
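
As a sketch of the replicate comparison, independent of any FEMap or plotting code: group the per-replicate values by edge and look at the spread. The data layout and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical replicate results: (ligand_a, ligand_b, replicate_id, ddG in kcal/mol).
replicates = [
    ("ligand_A", "ligand_B", 0, 0.36),
    ("ligand_A", "ligand_B", 1, 0.41),
    ("ligand_A", "ligand_B", 2, 0.31),
    ("ligand_B", "ligand_C", 0, -0.80),
    ("ligand_B", "ligand_C", 1, -0.20),
]

# Collect all replicate values for each directed edge.
by_edge = defaultdict(list)
for a, b, _, ddg in replicates:
    by_edge[(a, b)].append(ddg)

# A large spread between replicates flags an edge that may not be converged.
summary = {edge: (mean(vals), stdev(vals)) for edge, vals in by_edge.items()}
for edge, (avg, spread) in summary.items():
    print(edge, round(avg, 2), round(spread, 2))
```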
