SpecTRUM: Spectral Translator for the Reconstruction of Unknown Molecules

This project trains a Transformer model to tackle the task of de novo GC-MS spectra analysis.

Environment setting

The conda environment files are in the env_specification folder. BARTtrainH100 is the main environment, used for data preprocessing, training, and evaluation. NEIMSpy3_environment is used only for NEIMS spectra generation; a separate environment was necessary because of package incompatibilities.

Data preprocessing

Because of size constraints and licensing, we cannot provide the datasets we used for training. However, we provide the scripts used to obtain, filter, and preprocess the ZINC SMILES dataset, as well as all the preprocessing scripts for the NIST GC-MS dataset.

Every dataset in the data/datasets folder has a README file that describes the dataset in more detail and explains how it was obtained.

Pretraining & Finetuning

Pretraining and finetuning are both run with the train_bart.py script. The script needs a couple of arguments to run, most importantly config_file, a YAML file that contains all the hyperparameters needed for training.

All the run scripts we used for our experiments are in the run_scripts folder and don't need any additional parameters. The scripts are named run_pretrain* and run_finetune*. Their corresponding config files are in the configs folder, again named train_config_pretrain* and train_config_finetune*.
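The naming convention above makes it straightforward to check which run scripts and config files are present in a checkout. The following is a minimal sketch, assuming the folder names and file name prefixes described in this section; the helper name list_training_files is hypothetical and not part of the repository.

```python
from pathlib import Path


def list_training_files(repo_root="."):
    """Collect run scripts and config files by the naming patterns in this README.

    Hypothetical helper: it only matches file names and does not run anything.
    """
    root = Path(repo_root)
    found = {}
    for stage in ("pretrain", "finetune"):
        found[stage] = {
            # matches run_scripts/run_pretrain* or run_scripts/run_finetune*
            "run_scripts": sorted(
                p.name for p in (root / "run_scripts").glob(f"run_{stage}*")
            ),
            # matches configs/train_config_pretrain* or configs/train_config_finetune*
            "configs": sorted(
                p.name for p in (root / "configs").glob(f"train_config_{stage}*")
            ),
        }
    return found


if __name__ == "__main__":
    for stage, files in list_training_files().items():
        print(stage, files)
```

Run from the repository root, this prints the matching file names for each stage, which helps when deciding which run script belongs to which config.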

Prediction & Evaluation

Prediction and evaluation are two separate steps. Depending on the hardware, prediction on the NIST valid/test splits takes from 4 hours upward. Once you have the predictions, you can run multiple evaluation runs, each taking around a minute.

The prediction script, predict.py, has its runner in the run_scripts folder (run_predict.sh) and its config files in the configs folder (predict_config*). The evaluation script, evaluate_predicitons.py, likewise has its runner in the run_scripts folder (run_eval.sh) and its config files in the configs folder (eval_config*).
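Because prediction is slow and evaluation is fast, it can help to keep the two steps as separate commands and rerun only the second one. The sketch below shows one way to sequence the two runners. It assumes the runners are invoked with bash from the repository root and take no extra arguments; build_commands and run_pipeline are hypothetical names, not part of the repository.

```python
import subprocess
from pathlib import Path


def build_commands(repo_root="."):
    """Return the two runner commands in the order they should be executed."""
    scripts = Path(repo_root) / "run_scripts"
    return [
        ["bash", str(scripts / "run_predict.sh")],  # step 1: slow, run once
        ["bash", str(scripts / "run_eval.sh")],     # step 2: fast, can be repeated
    ]


def run_pipeline(repo_root=".", dry_run=True):
    """Run prediction, then evaluation; with dry_run=True only print the commands."""
    for cmd in build_commands(repo_root):
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=True stops the pipeline if a step exits with an error
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

The dry run default is a deliberate choice here: given how long prediction takes, printing the commands first avoids starting a multi-hour job by accident.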

------------------------- Other folders ------------------------

predictions

The predictions computed by our models are in the predictions folder. Along with the predictions, each folder contains a log_file.yaml with all the evaluation results (sometimes from multiple evaluation runs with different settings) and the figures generated by the latest evaluation.
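To see which prediction folders already have evaluation results, one can search for the log files. This is a small sketch that assumes only the file name log_file.yaml stated above; the exact nesting of folders is not assumed, and find_evaluation_logs is a hypothetical helper name.

```python
from pathlib import Path


def find_evaluation_logs(predictions_dir="predictions"):
    """Map each folder under predictions_dir to the log_file.yaml it contains."""
    root = Path(predictions_dir)
    # rglob searches every nesting level, so no particular layout is assumed
    return {
        str(log.parent.relative_to(root)): log
        for log in sorted(root.rglob("log_file.yaml"))
    }


if __name__ == "__main__":
    for folder, log in find_evaluation_logs().items():
        print(folder, "->", log)
```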

tokenizer

The tokenizer folder contains all the tokenizers used during the experiments and the final training. It also contains the training data for the BBPE tokenizers.

bart_spektro

This folder contains the custom implementation of the BART model used for the experiments. The implementation is based on the transformers library and is a modification of the BartForConditionalGeneration class.

notebooks

This folder contains a lot of things. Some of them are useful and nice, some of them you better not look at. I leave it in the repository as a memento of the hard work and the struggle we went through.

That's it. :)

Contributors

hejjack
