mehmetazizyirik / maygen

MAYGEN is an open source chemical structure generator based on the orderly graph generation method.

License: MIT License

Java 100.00%

maygen's Introduction

MAYGEN - A chemical structure generator for constitutional isomers based on the orderly generation principle

Copyright 2021 Mehmet Aziz Yirik

Introduction

MAYGEN is an open source chemical structure generator based on the orderly graph generation method. The principles of this method are outlined in the MAYGEN article[1]. MAYGEN takes a molecular formula (such as C10H16O) as input and generates all constitutional isomers of this formula, i.e. all non-isomorphic molecules that can be constructed from the set of atoms in the input formula. For C10H16O, for example, there are 452,458 non-identical molecules. Here are 12 of them.

As can be seen from these examples, MAYGEN makes no assumptions about chemical stability. Particularly in small ring systems, this may lead to unlikely structures such as C=1C=C1.

We benchmarked MAYGEN V.1.4 against the current state-of-the-art but closed-source structure generator MOLGEN 5.0 from the University of Bayreuth, as well as against the Parallel Molecule Generator (PMG)[2], the fastest available open source structure generator. Although PMG can be run in multi-threaded mode, the benchmark was performed in single-threaded mode for algorithmic comparability. For 50 randomly selected formulae, MAYGEN was on average 3 times slower than MOLGEN but 47 times faster than PMG. For some formulae, PMG could not generate isomers. These are shown as gaps in its plot.

Download jar File

Executable JAR files can be downloaded from the release page

Download Source Code

You can download the source code as a ZIP file from the landing page of this repository. Alternatively, you can clone the repository using Git. For more information, see set-up-git.

To download MAYGEN source code:

$ git clone https://github.com/MehmetAzizYirik/MAYGEN.git

Compiling

To compile MAYGEN, Apache Maven and Java 1.8 (or later) are required.

MAYGEN/$ mvn package

This command creates a JAR file named "MAYGEN-1.8" in the target folder.

Usage

MAYGEN-1.8.jar can be run from the command line with the arguments listed below. The definitions of the arguments are given first, followed by example commands.

usage: java -jar MAYGEN-1.8.jar [-f <arg>] [-fuzzy <arg>] [-setElements]
       [-v] [-t] [-o <arg>] [-b] [-m] [-smi] [-sdf] [-sdfCoord] [-h]

Generates molecular structures for a given molecular formula.
The input is a molecular formula string.

For example 'C2OH4'.

If user wants to store output file in a specific directory, that is needed
to be specified. It is also possible to generate SMILES instead of an SDF
file, but it slows down the generation time. For this, use the '-smi'
option.

 -f,--formula <arg>               formula

 -fuzzy,--fuzzyFormula <arg>      fuzzy formula

 -setElements,--settingElements   User defined valences

 -v,--verbose                     print message

 -t,--tsvoutput                   Output formula, number of structures and
                                  execution time in CSV format. In
                                  multithread, the 4th column in the
                                  output is the number of threads.

 -o,--outputFile <arg>            Store output file

 -b,--boundaryConditions          Setting the boundary conditions option

 -m,--multithread                 Use multi thread

 -smi,--SMILES                    Output in SMILES format

 -sdf,--SDF                       Output in SDF format

 -sdfCoord,--coordinates          Output in SDF format with atom
                                  coordinates

 -h,--help                        Displays help message

Please report issues at https://github.com/MehmetAzizYirik/MAYGEN

java -jar MAYGEN-1.8.jar -f C2OH4 -v -t -o C:\Users\UserName\Desktop\

java -jar MAYGEN-1.8.jar -fuzzy C[2-5]O2H[4-8] -v -t -o C:\Users\UserName\Desktop\

java -jar MAYGEN-1.8.jar -f N(val=4)6H6 -setElements -v -t -o C:\Users\UserName\Desktop\
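
To make the fuzzy formula notation in the second command concrete, here is a minimal Java sketch that expands a string such as C[2-5]O2H[4-8] into the plain formulae it covers. It assumes that a bracketed range is inclusive at both ends and that an element without a count means one atom; both are assumptions about the notation, not statements about MAYGEN's internals. The class `FuzzyFormulaExpander` and its `expand` method are hypothetical names used only for illustration, and MAYGEN itself accepts the fuzzy formula directly through `-fuzzy`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of MAYGEN.
class FuzzyFormulaExpander {
    // Element symbol followed by "[min-max]", a plain count, or nothing.
    private static final Pattern TOKEN =
            Pattern.compile("([A-Z][a-z]?)(?:\\[(\\d+)-(\\d+)\\]|(\\d+))?");

    static List<String> expand(String fuzzy) {
        List<String> results = new ArrayList<>();
        results.add("");
        Matcher m = TOKEN.matcher(fuzzy);
        while (m.find()) {
            String symbol = m.group(1);
            int min;
            int max;
            if (m.group(2) != null) {          // range, assumed inclusive
                min = Integer.parseInt(m.group(2));
                max = Integer.parseInt(m.group(3));
            } else if (m.group(4) != null) {   // fixed count
                min = max = Integer.parseInt(m.group(4));
            } else {                           // no count: assumed to mean 1
                min = max = 1;
            }
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (int n = min; n <= max; n++) {
                    if (n == 0) {
                        next.add(prefix);      // zero atoms: omit the element
                    } else {
                        next.add(prefix + symbol + (n == 1 ? "" : String.valueOf(n)));
                    }
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        for (String formula : expand("C[2-5]O2H[4-8]")) {
            System.out.println(formula);
        }
    }
}
```

Under these assumptions the example expands to 4 x 5 = 20 plain formulae, from C2O2H4 to C5O2H8. Not every combination necessarily corresponds to a valid molecule.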

Webservice

A web service has also been developed for MAYGEN, for ease of use and for educational purposes. The documentation for the web service is given here.

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Authors

Acknowledgements

YourKit

The developer uses YourKit to profile and optimise code.

YourKit supports open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of YourKit Java Profiler and YourKit .NET Profiler, innovative and intelligent tools for profiling Java and .NET applications.

cdk

This project relies on the Chemistry Development Kit (CDK), hosted under CDK GitHub. Please refer to these pages for updated information and the latest version of the CDK. CDK's API documentation is available through our GitHub site.

References

1- Yirik, M.A., Sorokina, M. & Steinbeck, C. MAYGEN: an open-source chemical structure generator for constitutional isomers based on the orderly generation principle. J Cheminform 13, 48 (2021). https://doi.org/10.1186/s13321-021-00529-9

2- Jaghoori MM, Jongmans SS, De Boer F, Peironcely J, Faulon JL, Reijmers T, Hankemeier T. PMG: multi-core metabolite identification. Electronic Notes in Theoretical Computer Science. 2013 Dec 25;299:53-60.

maygen's Issues

MAYGEN (mostly or sometimes?!) cannot handle zeros in MF

java -jar MAYGEN.jar -f C0N0O3H2
MAYGEN is generating isomers of C0N0O3H2...

java -jar MAYGEN.jar -f O3H2
MAYGEN is generating isomers of O3H2...
The number of structures is: 1
Time: .052 seconds

java -jar MAYGEN.jar -f O3H2C0N0
MAYGEN is generating isomers of O3H2C0N0...
The number of structures is: 1
Time: .054 seconds
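
Until zero counts are handled by the generator itself, one possible workaround is to strip zero-count elements from the formula before passing it to `-f`. The following Java sketch shows the idea. `FormulaNormalizer` and `dropZeroCounts` are hypothetical names, not part of MAYGEN, and the sketch only handles plain formulae (element symbols followed by optional counts), not the `N(val=4)6` syntax used with `-setElements` or fuzzy ranges.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processing helper, not part of MAYGEN.
class FormulaNormalizer {
    // One element symbol followed by an optional count.
    private static final Pattern TOKEN = Pattern.compile("([A-Z][a-z]?)(\\d*)");

    // Returns the formula with every zero-count element removed.
    static String dropZeroCounts(String formula) {
        Matcher m = TOKEN.matcher(formula);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String count = m.group(2);
            boolean isZero = !count.isEmpty() && Integer.parseInt(count) == 0;
            if (!isZero) {
                out.append(m.group(1)).append(count);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dropZeroCounts("C0N0O3H2")); // prints O3H2
        System.out.println(dropZeroCounts("O3H2C0N0")); // prints O3H2
    }
}
```

Both inputs from the report normalise to O3H2, the form that produced a result above.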

Add structure filters

Excellent work! Thank you.

One of the use-cases is to use the generator to create a small(ish) set of potential structures that fit key structural criteria. It would be great to be able to filter output through "good lists" and "bad lists" (structural features that must be present and those that mustn't be present). Likewise, filtering based on max- min- ring sizes.

The easy way is to do this post generation and prior to output, the performant way is to prevent generation (although I suspect that this is much more difficult). The post generation filter could be done with a SMARTS filter in CDK...

[enhancement] MAYGEN output as SMILES

Hi, any possibility to get SMILES as alternative output to SDF? This way we would get more compact output files and would be easier to plot the molecules using standard toolkits. Thanks!

MAYGEN generates wrong valences

This was found with MAYGEN version 1.6. It generates 582423 isomers for C7H8O3 (Molgen makes 582387 isomers).
Something went wrong with the valence of oxygen atoms. Attached are two examples SDF files (zipped, since GitHub does not support SDF in issue attachments) where oxygen atoms in the result set have more than two bonds (atom 3 in the first example, atoms 8 and 10 in the second example).

first.sdf.zip
last.sdf.zip
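
A check of this kind can be automated. The Java sketch below sums the bond orders per atom in a single V2000 mol block and reports atoms that exceed a fixed limit. It is an illustration under stated assumptions, not MAYGEN or CDK code: `ValenceCheck`, `overValentAtoms` and the valence table are hypothetical, fields are split on whitespace as in the excerpts quoted in these issues (real V2000 files use fixed-width columns), and charges and aromatic bond types are ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical post-generation check, not part of MAYGEN.
class ValenceCheck {
    // Assumed maximum valences for neutral atoms; extend as needed.
    private static final Map<String, Integer> MAX_VALENCE = new HashMap<>();
    static {
        MAX_VALENCE.put("H", 1);
        MAX_VALENCE.put("C", 4);
        MAX_VALENCE.put("N", 3);
        MAX_VALENCE.put("O", 2);
    }

    // Returns the 1-based indices of atoms whose bond-order sum exceeds the limit.
    static List<Integer> overValentAtoms(String molBlock) {
        String[] lines = molBlock.split("\\R");
        int countsIdx = -1;
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].trim().endsWith("V2000")) { // counts line
                countsIdx = i;
                break;
            }
        }
        if (countsIdx < 0) {
            throw new IllegalArgumentException("no V2000 counts line found");
        }
        String[] counts = lines[countsIdx].trim().split("\\s+");
        int atomCount = Integer.parseInt(counts[0]);
        int bondCount = Integer.parseInt(counts[1]);

        String[] symbols = new String[atomCount];
        for (int i = 0; i < atomCount; i++) {
            // Atom line: x y z symbol ...
            symbols[i] = lines[countsIdx + 1 + i].trim().split("\\s+")[3];
        }
        int[] valence = new int[atomCount];
        for (int i = 0; i < bondCount; i++) {
            // Bond line: firstAtom secondAtom bondOrder ...
            String[] f = lines[countsIdx + 1 + atomCount + i].trim().split("\\s+");
            int a = Integer.parseInt(f[0]) - 1;
            int b = Integer.parseInt(f[1]) - 1;
            int order = Integer.parseInt(f[2]);
            valence[a] += order;
            valence[b] += order;
        }
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < atomCount; i++) {
            Integer limit = MAX_VALENCE.get(symbols[i]);
            if (limit != null && valence[i] > limit) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        String mol = String.join("\n",
                "Molecule 1", "MAYGEN 20210615", "4 3 0 0 0 0 0 0 0999 V2000",
                "0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0",
                "0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0",
                "0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0",
                "0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0",
                "1 2 1 0 0 0 0", "1 3 1 0 0 0 0", "1 4 1 0 0 0 0", "M END");
        System.out.println("Over-valent atoms: " + overValentAtoms(mol));
    }
}
```

Running such a check over every record of an output file would locate structures like the two attached examples without manual inspection.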

feature suggestion: move version information, add help option

The present executable of the application is distributed as MAYGEN-1.7.jar. For future
releases of the application, I suggest the name stays fixed, i.e. MAYGEN.jar as used,
e.g. in the documentation.

As a result, it is easier to use the application with a moderating script, regardless of the
release of MAYGEN currently deployed. Thus, if there are improvements of MAYGEN,
there is no need to open/edit/update these scripts moderating MAYGEN.

Similar to other applications (e.g. ps2eps in Linux Debian, or date2name
moderated by Python), I suggest

  • copy the usage information into a help menu on the CLI, accessed by -h, and/or --help
  • offer access to the release information of the program by -v, and/or --version.

[enhancement] Output only total # structures without generating sdf

Hi, would it be possible to just output the total number of structures MAYGEN would produce for a given formula without actually generating the sdf? Knowing this number would help design downstream workflows without having to actually generate and store all structures. Thank you!

number of threads -> 79

When I run the same formula on my Windows machine and on our Linux cluster, I get 3 for the number of threads: "C2OH4 3 .017 3". I did not get an output with 79, so I need more information about the issue you faced to understand why you got such an output. Can you please open a new issue for that and give us more information about how you run the MAYGEN-1.5.jar file? Did you download the jar file or build the Maven project yourself? In any case, the jar was built and released from the latest source code we have in the repository.

I downloaded MAYGEN-1.5.jar:
wget https://github.com/MehmetAzizYirik/MAYGEN/releases/download/V1.5/MAYGEN-1.5.jar

and ran like this:

java -jar MAYGEN-1.5.jar -f C2OH4 -v -t
MAYGEN is generating isomers of C2OH4...
The number of structures is: 3
Time: .015 seconds
C2OH4 3 .015 79

I ran the program in Linux release 7.9.2009

I also noticed a problem with the SDF file the program generates when I use a more complex formula:

java -jar MAYGEN-1.5.jar -f C10H9N3 -v -t -d temp -m 10

I copied and pasted the problem area. It looks like each thread is writing to the file at the same time.

Molecule 2
MAYGEN 20210615
22 22 0 0 0 0 0 0 0999 V2000

Molecule 3
MAYGEN 20210615
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
22 22 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0

Molecule 6
MAYGEN 20210615
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
22 22 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
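
If the interleaving is indeed caused by several threads writing to the same file, one common remedy is to let each thread assemble a complete record and hand it to a writer that holds a lock for the whole record. The Java sketch below illustrates that pattern. It is an assumption about the cause and a generic technique, not MAYGEN's actual implementation; `RecordWriter`, `writeRecord` and `demo` are hypothetical names.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

// Hypothetical writer that serialises whole records, not part of MAYGEN.
class RecordWriter {
    private final Writer out;

    RecordWriter(Writer out) {
        this.out = out;
    }

    // Writes all lines of one record while holding the lock, so that lines
    // from different threads cannot be interleaved within a record.
    synchronized void writeRecord(List<String> lines) {
        try {
            for (String line : lines) {
                out.write(line);
                out.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Runs several writer threads against one in-memory sink and returns the text.
    static String demo(int threadCount, int recordsPerThread) {
        StringWriter sink = new StringWriter();
        RecordWriter writer = new RecordWriter(sink);
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            final String tag = "T" + t;
            threads[t] = new Thread(() -> {
                for (int i = 0; i < recordsPerThread; i++) {
                    writer.writeRecord(Arrays.asList("Molecule " + tag, "atoms " + tag, "$$$$"));
                }
            });
            threads[t].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        String text = demo(4, 3);
        System.out.print(text);
    }
}
```

The order of records still depends on thread scheduling, but each record stays contiguous. Removing `synchronized` from `writeRecord` allows the kind of mixed-up header and atom lines shown above.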

feature suggestion: extend consideration of stereo chemistry

MAYGEN (release 1.7) was run on the sum formula C5H10O, with the results written to a single .sdf file. Initial observation suggests the list of isomers is incomplete with respect to stereogenic centres* and (E)/(Z) double bonds.

Please improve MAYGEN such that the generation of isomers is eventually exhaustive.

The generated .sdf was read by DataWarrior. On launch, DataWarrior warns about the .sdf it reads (cf. the attached .zip archive with the .sdf and screen photos taken). For example, entries 11 (2-methyl cyclopropanol) and 24 (2-methyl cyclobutanol), each with two stereogenic centres, appear as only one entry each, skipping the permutation of (R)/(S) configuration. In an exhaustive generation of isomers, entry 32 (pent-2-enol) would be expected to yield both an explicit (E) and an explicit (Z) isomer, but it is depicted as "undetermined/mixture". This detail possibly hurts less in mass spectroscopy than in the generation of substrates for synthesis.

maygen_stereochemistry.zip

*) What about chiral sulfoxides? Perhaps a similar limitation.

Readme instructions list wrong option

The instructions on the readme file say

java -jar MAYGEN.jar -f C2OH4 -v -t -d C:\Users\UserName\Desktop\

but the option to specify the output directory is -o and not -d

The number of threads is not as expected.

Steps to reproduce:

  1. Execute the jar file with these keys: -f formula -v -t

Actual result:

The output contains wrong number of threads ranging from 5 to 79.

Expected result:

The output contains 1 for the number of threads in sequence mode.
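
For scripts that consume the `-t` summary line, the four columns can be parsed and the thread count checked against the expectation stated here. The Java sketch below assumes the column order given in the help text (formula, number of structures, execution time, number of threads) and splits on any whitespace. `TsvSummary` and its members are hypothetical names for illustration.

```java
// Hypothetical parser for the one-line summary printed with -t, not part of MAYGEN.
class TsvSummary {
    final String formula;
    final long structures;
    final double seconds;
    final int threads;

    TsvSummary(String formula, long structures, double seconds, int threads) {
        this.formula = formula;
        this.structures = structures;
        this.seconds = seconds;
        this.threads = threads;
    }

    // Parses a line such as "C2OH4 3 .015 79" (columns separated by tabs or spaces).
    static TsvSummary parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 4) {
            throw new IllegalArgumentException("expected 4 columns, got " + f.length);
        }
        return new TsvSummary(f[0], Long.parseLong(f[1]),
                Double.parseDouble(f[2]), Integer.parseInt(f[3]));
    }

    // True if the reported thread count matches the expectation for a run without -m.
    boolean looksSequential() {
        return threads == 1;
    }

    public static void main(String[] args) {
        TsvSummary s = parse("C2OH4 3 .015 79");
        System.out.println(s.formula + ": " + s.structures + " structures, "
                + s.threads + " threads reported, sequential=" + s.looksSequential());
    }
}
```

Applied to the line reported in this issue, the check returns false, which is the discrepancy described above.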

Option to constrain generation to contain specific substructures

Thanks for the great repo!

Would you be adding the functionality of constraining the generation to contain specific substructures? A simple way is to generate all possibilities, and discard those that do not contain the substructure. However, a more efficient way would be to restrict the generation so that no time is wasted in generating invalid answers.

One approach could be to initialize the adjacency matrix with the substructure before generation. However, your algorithm first distributes hydrogens to atoms, which does not allow us to do that easily.

If you have any plans or thoughts in this direction, would be great if you can share them.

Thanks!

SDF generation issue

Recently I tried to generate different molecules from a chemical formula.
I compiled MAYGEN using mvn, and it compiled successfully. However, the coordinates in the output sdf are all zero.
The same is true for the pre-compiled version 1.6. The command I am using under Linux is

java -jar MAYGEN-1.7.jar -f C2OH4 -v -t -d conformer

output sdf is shown like
Molecule 1
MAYGEN 20210615
7 6 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
1 5 1 0 0 0 0
2 6 1 0 0 0 0
2 7 1 0 0 0 0
3 1 1 0 0 0 0
3 4 1 0 0 0 0
M END

$$$$

Molecule 2
MAYGEN 20210615
7 7 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 4 1 0 0 0 0
1 5 1 0 0 0 0
2 6 1 0 0 0 0
2 7 1 0 0 0 0
3 1 1 0 0 0 0
3 2 1 0 0 0 0
M END

$$$$

Molecule 3
MAYGEN 20210615
7 6 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 4 1 0 0 0 0
2 5 1 0 0 0 0
2 6 1 0 0 0 0
2 7 1 0 0 0 0
3 1 2 0 0 0 0
M END

$$$$

Am I making any mistake?

feature suggest: extend export of .sdf

For an initial generation of isomers matching sum formula C5H10O, I observe the .sdf generated by MAYGEN (release 1.7) lacks the export of explicit hydrogens. Apparently, not only those attached to carbon, but equally the ones attached to heteroatoms are missing.

Please add these atoms (like in Fortran, implicit none). Just think about, e.g. bromochlorofluoromethane.

Equally, since many viewers accept .sdf as a file format, please replace the current export of 2D molecules with 3D objects. At present, all exported atoms share z = 0.00.

incomplete

Add an option to filter invalid valences

When I run the input

java -jar MAYGEN-1.7.jar -f C22H32O3N2 -v -t -m -smi -o /Users/jasonb/Desktop/temp

the first SMILES string that comes back is "[CH]C([C][CH2])([OH2])C1([CH])CCCC(C=CC=CC=C[OH]=CC=CC2C=C2)C[NH2]C1", which looks like

image

If the CDK is being used to generate the SMILES strings, would it be possible to filter out molecules with such nonstandard valences? I see a pentavalent oxygen, plenty of divalent carbons, etc. Currently I am filtering them afterwards but perhaps it could be integrated into the system.

generating output sdf

Hi,

Thank you for sharing your work. I downloaded MAYGEN-1.5.jar and did the following:

java -jar MAYGEN-1.5.jar -f C2OH4 -v -t -d /tmp/

and I got the following standard output:

MAYGEN is generating isomers of C2OH4...
The number of structures is: 3
Time: .015 seconds
C2OH4 3 .015 79

But I didn't get an SDF file in the /tmp folder. I am using CentOS 7.9.

Thanks,

Jin

MAYGEN.jar fails to run due to class not found

I can clone the repo and build the jar with mvn package, but when I then do

java -jar target/MAYGEN-jar-with-dependencies.jar

I get

Error: Could not find or load main class MAYGEN.MAYGEN

Suggestion: clone the repo to an empty location and reproduce the problem.
