
bio-chem-lm's Introduction

High level goals

This repository is dedicated to developing (large) language models and language/structure models in the bio-chem space.

Further details can be found here

Bio-LM PubChem Selfies

We are training an Electra-style model on the PubChem dataset using SELFIES representations. SELFIES is a string representation of molecules that builds on SMILES but is more robust: every SELFIES string corresponds to a valid molecule. More info about SELFIES can be found here.
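To make the representation concrete, here is a minimal sketch of how a SELFIES string might be split into tokens before it is fed to a language model. It uses only the standard library; the function name `tokenize_selfies`, the toy vocabulary, and the benzene example string are assumptions for illustration, not this repository's tokenizer. The `selfies` package provides its own `split_selfies` helper for the same job.

```python
import re

# A SELFIES string is a sequence of bracketed symbols such as "[C]" or "[=C]";
# a "." separates disconnected fragments.
_SELFIES_SYMBOL = re.compile(r"\[[^\[\]]+\]|\.")


def tokenize_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into its symbols (hypothetical helper)."""
    tokens = _SELFIES_SYMBOL.findall(selfies)
    # Fail loudly if any characters fall outside bracketed symbols.
    if "".join(tokens) != selfies:
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return tokens


# Example string for benzene, assumed here rather than generated by this code.
benzene = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"
tokens = tokenize_selfies(benzene)

# Toy vocabulary built from this one molecule; a real vocabulary would be
# built over the whole training corpus.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]
print(tokens)
print(ids)
```

Because every symbol is delimited by brackets, tokenization needs no chemistry-aware parsing, which is one practical reason SELFIES is convenient for language modeling.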

We have released the dataset, which contains ~110M compounds in total, on HuggingFace Datasets.

We will perform a hyperparameter search using Maximal Update Parameterization (muP) on a small model to find a good set of hyperparameters that transfer to a larger model. To launch a sweep on the cluster, run the following command, replacing N with the number of jobs in the array:

sbatch --array=1-N mup_train.sh
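As a rough sketch of how each job in the array could pick its own hyperparameters: SLURM sets the `SLURM_ARRAY_TASK_ID` environment variable for every task of an array job, and a training script can map that index to one point of a grid. The grid values, the parameter names, and the `config_for_task` helper below are hypothetical; the contents of `mup_train.sh` are not shown in this document.

```python
import itertools
import os

# Hypothetical search space; the sweep values actually used are not shown here.
LEARNING_RATES = [1e-4, 3e-4, 1e-3, 3e-3]
INIT_STDS = [0.01, 0.02, 0.05]


def config_for_task(task_id: int) -> dict:
    """Map a 1-based SLURM array index to one point of the grid."""
    grid = list(itertools.product(LEARNING_RATES, INIT_STDS))
    if not 1 <= task_id <= len(grid):
        raise ValueError(f"task_id must be in 1..{len(grid)}, got {task_id}")
    learning_rate, init_std = grid[task_id - 1]
    return {"learning_rate": learning_rate, "init_std": init_std}


if __name__ == "__main__":
    # Fall back to the first grid point when run outside of SLURM.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
    print(config_for_task(task_id))
```

With this hypothetical grid of 4 x 3 points, the sweep would be launched with `--array=1-12`, so that each task index selects exactly one configuration.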

bio-chem-lm's People

Contributors

zanussbaum, loftusa, daniel-z-kaplan

bio-chem-lm's Issues

Add BACE dataset to HuggingFace

The dataset is available in DeepChem. Let's see if we can add it to HuggingFace so that we can:

  • Add SELFIES
  • Make it easy to use for finetuning

Add PCBA Dataset to HuggingFace

Add dataset from this paper: https://par.nsf.gov/servlets/purl/10168888

We need to extract the relevant part from DeepChem: https://github.com/deepchem/deepchem/blob/master/deepchem/molnet/load_function/pcba_datasets.py

We select one of the largest datasets, the dataset with ID 686978 among the 128 datasets to evaluate our method. PCBA-93 contains 302,175 samples. For each of the three datasets, we randomly select 80% for training, 10% as the validation set and the rest 10% for evaluation.
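The random 80/10/10 split described above could be reproduced along these lines. This is a sketch under assumptions: the function name `split_80_10_10` and the fixed seed are hypothetical, and it splits row indices rather than the actual PCBA records.

```python
import random


def split_80_10_10(items, seed=0):
    """Randomly split items into train, validation, and test lists."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding on the boundaries.
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder, roughly 10%
    return train, valid, test


# Split the indices of a dataset with 302,175 samples.
train, valid, test = split_80_10_10(range(302_175))
print(len(train), len(valid), len(test))  # 241740 30217 30218
```

Splitting indices instead of records keeps the sketch independent of how the dataset is stored; the resulting index lists can then select rows from whatever container holds the data.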
