
This project forked from sacdallago/bio_embeddings

Get protein embeddings from protein sequences

Home Page: http://docs.bioembeddings.com

License: MIT License

Bio Embeddings

Resources to learn about bio_embeddings:

Project aims:

  • Facilitate the use of language-model-based biological sequence representations for transfer learning by providing a single, consistent interface with close-to-zero friction
  • Reproducible workflows
  • Depth of representation (different models from different labs, trained on different datasets for different purposes)
  • Extensive examples, complexity handled for users (e.g. CUDA out-of-memory abstraction), and well-documented warnings and error messages

The project includes:

  • General purpose python embedders based on open models trained on biological sequences (SeqVec, ProtTrans, UniRep, ...)
  • A pipeline which:
    • embeds sequences into matrix-representations (per-amino-acid) or vector-representations (per-sequence) that can be used to train learning models or for analytical purposes
    • projects per-sequence embeddings into lower-dimensional representations using UMAP or t-SNE (for lightweight data handling and visualizations)
    • visualizes low-dimensional sets of per-sequence embeddings onto 2D and 3D interactive plots (with and without annotations)
    • extracts annotations from per-sequence and per-amino-acid embeddings using supervised (when available) and unsupervised approaches (e.g. by network analysis)
  • A webserver that wraps the pipeline into a distributed API for scalable and consistent workflows
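The matrix vs. vector distinction above can be sketched in plain NumPy. The shapes and the mean-pooling reduction here are illustrative assumptions, not the pipeline's exact internals:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim = 42, 1024  # hypothetical sequence length and embedding dimension

# per-amino-acid (matrix) representation: one row per residue
per_residue = rng.normal(size=(seq_len, emb_dim))

# per-sequence (vector) representation: here via mean-pooling over residues,
# one common way to collapse a matrix into a fixed-size vector
per_sequence = per_residue.mean(axis=0)

print(per_residue.shape)   # (42, 1024)
print(per_sequence.shape)  # (1024,)
```

The fixed-size per-sequence vector is what downstream classifiers and the UMAP/t-SNE projection stages consume, since those methods expect one vector per data point regardless of sequence length.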

Installation

You can install bio_embeddings via pip or use it via docker. Mind the additional dependencies for align.

Pip

Install the pipeline like so:

pip install bio-embeddings[all]

To install the unstable version from the latest commit:

pip install -U "bio-embeddings[all] @ git+https://github.com/sacdallago/bio_embeddings.git"

Docker

We provide a docker image at ghcr.io/bioembeddings/bio_embeddings. Simple usage example:

docker run --rm --gpus all \
    -v "$(pwd)/examples/docker":/mnt \
    -v bio_embeddings_weights_cache:/root/.cache/bio_embeddings \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    ghcr.io/bioembeddings/bio_embeddings:v0.1.6 /mnt/config.yml

See the docker example in the examples folder for instructions. You can also use ghcr.io/bioembeddings/bio_embeddings:latest which is built from the latest commit.

Dependencies

To use the mmseqs_search protocol, or the mmseqs2 functions in align, you additionally need to have mmseqs2 in your PATH.
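A quick way to verify this requirement from Python, assuming the binary is named `mmseqs` (the upstream default; adjust if your installation differs):

```python
import shutil

# mmseqs2 must be discoverable on PATH before running the mmseqs_search protocol
mmseqs_path = shutil.which("mmseqs")
if mmseqs_path is None:
    print("mmseqs2 not found on PATH -- install it before using mmseqs_search")
else:
    print(f"found mmseqs2 at {mmseqs_path}")
```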

Installation notes

bio_embeddings was developed for unix machines with GPU capabilities and CUDA installed. If your setup diverges from this, you may encounter some inconsistencies (e.g. speed is significantly affected by the absence of a GPU and CUDA). For Windows users, we strongly recommend the use of Windows Subsystem for Linux.

What model is right for you?

Each model has its strengths and weaknesses (speed, specificity, memory footprint, ...). There is no one-size-fits-all model, and we encourage you to try at least two different models when attempting a new exploratory project.

The models prottrans_bert_bfd, prottrans_albert_bfd, seqvec and prottrans_xlnet_uniref100 were all trained with the goal of systematic predictions. From this pool, we believe the optimal model to be prottrans_bert_bfd, followed by seqvec, which has been established for longer and uses a different principle (LSTM vs Transformer).

Usage and examples

We highly recommend checking out the examples folder for pipeline examples, and the notebooks folder for post-processing pipeline runs and general-purpose use of the embedders.

After having installed the package, you can:

  1. Use the pipeline like:

    bio_embeddings config.yml

    A blueprint of the configuration file, and an example setup can be found in the examples directory of this repository.

  2. Use the general purpose embedder objects via python, e.g.:

    from bio_embeddings.embed import SeqVecEmbedder
    
    embedder = SeqVecEmbedder()
    
    embedding = embedder.embed("SEQVENCE")

    More examples can be found in the notebooks folder of this repository.
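The configuration file driving the pipeline (step 1 above) is YAML. A rough, illustrative sketch is below; the stage name and option values are assumptions for illustration, so consult the blueprint in the examples directory for the authoritative keys:

```yaml
global:
  sequences_file: sequences.fasta   # input sequences
  prefix: my_run                    # output directory prefix

# each further top-level key names a pipeline stage
seqvec_embeddings:
  type: embed
  protocol: seqvec
```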

Cite

If you use bio_embeddings for your research, we would appreciate it if you could cite the following paper:

Dallago, C., Schütze, K., Heinzinger, M., Olenyi, T., Littmann, M., Lu, A. X., Yang, K. K., Min, S., Yoon, S., Morton, J. T., & Rost, B. (2021). Learned embeddings from deep learning to visualize and predict protein sets. Current Protocols, 1, e113. doi: 10.1002/cpz1.113

The corresponding bibtex:

@article{dallago2021embeddings,
  author = {Dallago, Christian and Schütze, Konstantin and Heinzinger, Michael and Olenyi, Tobias and Littmann, Maria and Lu, Amy X. and Yang, Kevin K. and Min, Seonwoo and Yoon, Sungroh and Morton, James T. and Rost, Burkhard},
  title = {Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets},
  journal = {Current Protocols},
  volume = {1},
  number = {5},
  pages = {e113},
  keywords = {deep learning embeddings, machine learning, protein annotation pipeline, protein representations, protein visualization},
  doi = {10.1002/cpz1.113},
  url = {https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpz1.113},
  eprint = {https://currentprotocols.onlinelibrary.wiley.com/doi/pdf/10.1002/cpz1.113},
  year = {2021}
}

Additionally, we invite you to cite the work from others that was collected in `bio_embeddings` (see section _"Tools by category"_ below). We are working on an enhanced user guide which will include proper references to all citable work collected in `bio_embeddings`.

Contributors

  • Christian Dallago (lead)
  • Konstantin Schütze
  • Tobias Olenyi
  • Michael Heinzinger

Want to add your own model? See contributing for instructions.

Non-exhaustive list of tools available (see following section for more details):

Datasets

  • prottrans_t5_xl_u50 residue and sequence embeddings of the Human proteome at full precision + secondary structure predictions + sub-cellular localisation predictions: DOI

Tools by category

Pipeline
General purpose embedders
