blaugroup / mrnet Goto Github PK
View Code? Open in Web Editor NEWMr. Network is a python reaction-network for molecular systems
License: Other
Mr. Network is a python reaction-network for molecular systems
License: Other
Hetal and I made some key improvements to the path characterization function over the summer when we identified that paths from A to e.g. B + PR_C -> D vs to C + PR_B -> D could be inequivalent if C is made as a byproduct of the shortest path to B, but B is not made as a byproduct of the shortest path to C. In that case, we do not need to pay the cost of PR_C since it has already been made on the way to B, thus we perform path replacement to ensure we're taking the cheaper route, and we subtract the unnecessary PR cost out during path characterization. Here is the flow chart of path characterization that we came up with at that time:
However, the fact that post processing paths can make them shorter than Dijkstra's realized due to this BP/PR cancelation introduces the possibility that a path besides the shortest could become shorter than the shortest once BP/PR cancelation is accounted for. Unfortunately, since we only examine the one shortest path to each specie during PR solving, we will miss this entirely. Let me clarify with the following example:
We are attempting to find the shortest path to D during PR solving. Like above, D is made from the reaction of B + C, and we'll define the cost of that reaction to be 3. Let's say the shortest path to B costs 5 and the shortest path to C costs 10, and neither creates the other as a byproduct. Thus, the shortest path to D will cost 5 + 10 + 3 = 18, thereby defining the cost of PR_D for other reactions in the network, and we will have two equivalent paths with that cost going through either B + PR_C -> D or C + PR_B -> D. However, what if the 2nd shortest path to C costs 11 and makes B as a byproduct? That would mean that we could actually make D for a cost of 11 + 3 = 14, substantially less than 18, which would thus modify all network paths including PR_D.
In principle, we could catch this scenario by examining all paths to B (or C) which cost less than 15 and checking for C (or B) as a byproduct, as these are the only paths which could allow BP/PR cancelation to yield an overall shorter path to D via the B + C reaction. However, even that isn't sufficient, because there could be other routes to D entirely that could become shorter than the B + C route due to BP/PR cancelation. Thus, to ensure that we find the shortest path to D, we would have to also examine all paths up to the point that BP/PR cancelation could not bring their cost below that of the shortest path.
I expect us to continue to ignore this problem for the moment - after all, PR solving is already time consuming enough without adding another loop at the lowest level - but I wanted to open an issue so that we remember that it exists.
Pickle is version dependent. Msgpak is a binary format that is instead protocol version dependent, so as long as the major version is the same, it will be compatible. We should make this change while revamping testing on the way to V1
By limiting ourselves to isomorphism checking, we are currently unable to account for differences between stereoisomers and cis/trans isomers, which are distinct non-superimposable molecules. We previously sought to use Yu-Hang's graph kernel M3 metric to distinguish between conformers, and had some success with it, but we never got to the point where we distinguished between reactions involving different isomers. In the medium term, this could be important for TFSI, which has a cis/trans isomer, or FEC, which contains stereo centers.
We currently ignore the charge of a metal ion when deducing reactions. This can be a problem for lithium systems when Li0 is present in the network, which allows you to take a Li+ starting molecule and reduce it to Li0 and then combine that Li0 with an EC0 to obtain Li+EC-, since only the total molecular charge is checked. But by just removing the Li0 from the network, this can mostly be avoided. However, for magnesium systems the problem becomes more prevalent because Mg2+ and Mg1+ are both legitimate species. In principle, we could track the metal ion charge via partial charges or Critic charges and then disallow reactions in which the metal charge is changing in order to avoid reactions like e.g. Li0 + EC0 -> Li+EC-. However, for the Mg system, maybe we don't want to disallow those types of reactions because internal charge transfer can and does happen, e.g. Mg+TFSI- -> CF3 + Mg2+TFSI-(without CF3). I don't know the solution, but I wanted to open this issue so that we don't forget the underlying problem.
Currently, environmental factors like the temperature and the electron free energy are provided when a ReactionNetwork is created. These variables then inform the Reaction classes. While it is facile to change the attributes of the ReactionNetwork class itself, propagating those changes through all of the constituent objects is less straightforward. This functionality should be added as methods in the ReactionNetwork class.
Need new cost functions that take into account kinetics. Might be a good space for a cost function interface.
PR solving is (currently) the most time-intensive algorithm involved in using reaction networks. But, it's also currently an all-of-nothing operation - if something happens during PR solving and the method fails, all of your time is wasted.
It would be nice if, after each PR iteration, the network made a checkpoint file so that restarting was painless and (relatively) time-efficient.
We need abstract core interfaces rather than implementations. This would be a good place to include basic graph functionality. Or an abstract interface to the graph so we can switch in @khanwale's implementations easily.
Right now, when we pathfind, we get some pathways. But the work of visualizing these pathways - including the linkages but also the associated free energy changes - must be done by hand. A framework - based on matplotlib, plotly, etc. - to do this automatically would save a ton of human time.
I just noticed that @hpatel1567 isn't even listed as a contributor to the code despite the fact that she's had multiple substantial PRs merged in. I'd like this to be fixed to ensure that she gets proper credit in an eventual JOS paper and so that her GitHub accurately reflects her coding skills/contributions. @shyamd, thoughts?
An exponential-based cost function is the most physically motivated and easily justifiable form, but it will inherently yield a wide range of costs that span many orders of magnitude which makes PR solving very difficult and time-consuming. Our PR solving procedure depends on being able to iteratively add costs via PR cost updates in order to eventually get path ordering to switch. However, if the shortest and 2nd shortest paths differ in cost by e.g. 10,000 (where the shortest path has an unsolved nested PR, and the 2nd shortest does not) and each iteration only increases the cost of the shortest path by e.g. 10, then it will require 1000 iterations for the 2nd shortest path to become the shortest and thus to solve the principle PR in question.
Perhaps there could be a way to accelerate PR solving in this instance by checking if a PR cost is increasing by the same amount, let's call it dx, for e.g. 5 or 10 iterations in a row, and in that case then starting to increase it by 10*dx for a certain number of iterations instead, and then perhaps continuing to increase dx by some factor after waiting some number of iterations until eventually a path ordering switch occurs, resulting in the principle PR being solved.
This is definitely a half-baked idea, and I can easily imagine corner cases that it would encounter problems with, but I this should serve as the general issue of how we accelerate PR solving with physically motivated exponential cost functions rather than simply trying to use cost functions that bring costs closer together, which makes PR solving easier but is kinda the easy way out and is not physically justified.
Concerted reactions only really make sense when the bonds being formed/broken are all connected to the same reaction center. This can also help with unphysical "robbing Peter to pay Paul" type concerteds where one reaction which is very exergonic can be unphysically combined with an unrelated endergonic reaction in order to facilitate a very uphill reaction which in reality will not occur.
Currently, concerted reaction addition tests only check the number of concerted reactions identified / added. However, we recently missed a bug which incorrectly modified the free energy of concerted reactions without changing the number of concerted reactions that are generated. A new test should be added or a current test should be extended to also check the properties of the generated concerted reactions in addition to the number of concerted reactions.
If a user asks for the 50 shortest paths from [A,B] to X, they should get 50 unique paths and not 50 paths including a whole bunch of duplicates. Every PR implicitly duplicates a path, so if a shortest path has multiple PRs, we'll get a lot of duplicates which all have the same cost.
A recent PR changed several function names in mol_entry.py for the MoleculeEntry class. This inlcudes:
free_energy -> get_free_energy
Nbonds -> num_bonds
edges -> bonds
These are currently causing an aggressive quantity of deprecation warnings, which makes reading output difficult. It would be immensely helpful if someone would go through and change these outdated function calls throughout the repo.
Many of our current tests skip if OpenBabel is not present, but OpenBabel is not currently being installed during testing.
This was/is also a problem in pymatgen. Basically, installing OpenBabel from pip will not work because you also need the C/C++ bindings, and I couldn't figure out a way to get it to install from conda.
This is probably going to be a huge pain, but if someone can figure it out, it'll make our testing suite far more robust than it currently is.
I guess we could also rewrite our tests to be less reliant on OpenBabel. That might be easier, but also less desirable since we use the Babel-based NearNeighbor class everywhere in production.
While it was easiest to conceptualize the introduction of prerequisites by any A + B -> ??? reaction node into two, A + PR_B -> ??? and B + PR_A -> ???, technically it is unnecessary since the costs are on the edges, not the nodes. Thus, we can dramatically reduce the size of our networks, which in practice are mostly composed of A + B -> ??? reactions, by replacing the two nodes A + PR_B -> ??? and B + PR_A -> ??? with a single A + B -> ??? node, where the edge from A to that reaction node will include the cost of PR_B and the edge from B to that reaction node will include the cost of PR_A. This should approximately halve the size of our networks (thus halving the amount of memory) and approximately halve the time required to solve prerequisite costs thanks to the scaling of Dijkstra's algorithm.
We need a Mr. Rodger- esque logo
Current tests were taken from pymatgen, which are based on unittest (and custom test classes). Pytest should allow for more facile test-writing and standardization of tests going forwards.
As part of this effort, we should add automatic testing as part of our build cycle.
Reaction networks have algebraic structure (reactions can be thought of as vectors in the space spanned by the species) which we may be able to use as a tool for understanding huge examples. So far there are two concepts which seem useful:
Deficiency: This is a non negative integer associated to any reaction network independent of rates. It counts the number of loops in the network which leave the number of each species fixed, but are non trivial at the level of complexes. Deficiency is closely related to the dynamics of the network with mass action kinetics, but it is hard to make quantifiable statements except in the deficiency zero case. It is explored thoroughly in the book Quantum techniques for Stochastic mechanics: https://math.ucr.edu/home/baez/stoch_stable.pdf
Concordance: This is a binary value associated to any reaction network independent of rates: A reaction network is either concordant or discordant. In https://arxiv.org/abs/1109.2923, it is proved that concordant systems cannot display many of the exotic phenomena which make reaction networks interesting. In his book Foundations of Chemical Reaction Network Theory https://www.springer.com/gp/book/9783030038571, Feinberg explains that most reaction networks are concordant and this is a reasonable mathematical explanation for the the dull behavior across vast regions of the reaction network landscape with occasional eruptions of exotic behavior.
Each of these quantities can be computed using linear algebra, so we can expect them to be computable even for large networks using GPUs. There are probably more things like deficiency and concordance which I am missing and will add here as I find
Because of prerequisite solving and the use of split reaction nodes, often, there are multiple chemically equivalent pathways that can be taken from the same reactants to the same products. When performing pathfinding, one will often encounter these duplicate pathways, which are uninteresting and redundant.
Ideally, there would be some procedure to prune the shortest path list to return only unique pathways. This shouldn't require any major changes to how MR.Net conducts pathfinding; rather, this can just be a postprocessing step, where the reactions in each path are compared.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.