
efficient-evolution's Introduction

Efficient evolution from general protein language models

Scripts for running the analysis described in the paper "Efficient evolution of human antibodies from general protein language models".

Running the model

To evaluate the model on a new sequence, clone this repository and run

python bin/recommend.py [sequence]

where [sequence] is the wildtype protein sequence you want to evolve. The script outputs a list of recommended substitutions, each with the number of language models that recommend it.
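If you post-process the output programmatically, a common step is to keep only substitutions recommended by several models. A minimal sketch, assuming each output line holds a substitution followed by its model count (the exact format printed by recommend.py may differ between versions):

```python
def filter_by_consensus(lines, min_models=2):
    """Keep substitutions recommended by at least min_models language models."""
    kept = []
    for line in lines:
        mutation, n_models = line.split()
        if int(n_models) >= min_models:
            kept.append((mutation, int(n_models)))
    return kept

# Hypothetical output lines: substitution, then number of recommending models.
example_output = ["T30A\t5", "Y57F\t1", "S103N\t3"]
print(filter_by_consensus(example_output))  # [('T30A', 5), ('S103N', 3)]
```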

To recommend mutations to antibody variable domain sequences, we have simply run the above script separately on the heavy and light chain sequences.

We have also made a Google Colab notebook available. However, the notebook downloads and installs the language models from scratch on each run, and it requires a Colab Pro instance because its memory requirement exceeds what the free tier of Colab provides. When making many predictions, we recommend the local installation above, which lets you cache and reuse the models.

Paper analysis scripts

To reproduce the analysis in the paper, first download and extract data with the commands:

wget https://zenodo.org/record/6968342/files/data.tar.gz
tar xvf data.tar.gz

To acquire the recommended mutations for a given antibody, run the command

bash bin/eval_models.sh [antibody_name]

where [antibody_name] is one of medi8852, medi_uca, mab114, mab114_uca, s309, regn10987, or c143.

DMS experiments can be run with the command

bash bin/dms.sh

efficient-evolution's People

Contributors

brianhie · cynicjon


efficient-evolution's Issues

dl.fbaipublicfiles.com download links error

Hello. I'm trying to run your great model, but the job keeps stopping with the error below.
Should I download each model manually and then run the job?

Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm1v_t33_650M_UR90S_1.pt" to /home/jisun/.cache/torch/hub/checkpoints/esm1v_t33_650M_UR90S_1.pt
Traceback (most recent call last):
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/urllib/request.py", line 1354, in do_open
h.request(req.get_method(), req.selector, req.data, headers,
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/http/client.py", line 1252, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/http/client.py", line 1298, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/http/client.py", line 1247, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/http/client.py", line 1007, in _send_output
self.send(msg)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/http/client.py", line 947, in send
self.connect()
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/http/client.py", line 1414, in connect
super().connect()
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/http/client.py", line 918, in connect
self.sock = self._create_connection(
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/socket.py", line 808, in create_connection
raise err
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/socket.py", line 796, in create_connection
sock.connect(sa)
OSError: [Errno 101] Network is unreachable

Question about usage

/home/user/anaconda3/envs/py3.6/lib/python3.6/site-packages/esm/pretrained.py:216:
UserWarning: Regression weights not found, predicting contacts will not produce correct results.

Can this warning be ignored, as mentioned in facebookresearch/esm#170?
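Per the linked ESM issue, the warning only matters for contact prediction, which recommend.py does not use (it calls the model with return_contacts=False). If desired, just that message can be silenced with the standard warnings module; a sketch (the commented-out load call is illustrative):

```python
import warnings
from contextlib import contextmanager

@contextmanager
def ignore_regression_warning():
    """Suppress only the 'Regression weights not found' UserWarning."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore", message="Regression weights not found.*"
        )
        yield

# Usage: wrap the model-loading call that triggers the warning.
with ignore_regression_warning():
    # model, alphabet = pretrained.load_model_and_alphabet(name)
    warnings.warn("Regression weights not found, ...")  # demo: suppressed
```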

interpreting the recommended mutations in a one-by-one context

Hi,

I've been using this software for a while now, and I thought of a scheme by which I could attempt to assign a "validated" or "rejected" tag to each somatic hypermutation of a monoclonal antibody, in the following way:

Given a monoclonal antibody observed in a single cell, I list each of its mutations with respect to the germline V genes and produce mAb protein sequences identical to the observed one except for a single amino acid, where the mutation has been reverted to the germline (a.k.a. the 'step-1 mAb').

Then, in parallel, I:
(1) submit this step-1 mAb to efficient-evolution/bin/recommend.py;
(2) submit the mAb as it was observed in the same way.

Then I count how many times the software marks the mutation as:
(1) 'rejected': when it suggests that the original mAb should mutate toward the step-1 mAb;
(2) 'validated': when it suggests that the step-1 mAb should mutate toward the original mAb.

Doing this for a collection of mAbs, I obtain many more 'rejected' mutations than 'validated' ones, at roughly a 10-to-1 ratio, and I wonder if this is expected.
Should I limit which of the recommended mutations I take into account, e.g. by using the numbers in the output? Any recommendations?
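The tagging scheme above can be sketched as a self-contained toy, with the recommend.py calls replaced by precomputed mutation sets; the "A57G" mutation-string format (wildtype residue, position, mutant residue) is an assumption:

```python
def invert(mutation):
    """Reverse a substitution string: 'A57G' -> 'G57A'."""
    return mutation[-1] + mutation[1:-1] + mutation[0]

def tag_mutation(mutation, recs_for_observed, recs_for_step1):
    """Tag one somatic hypermutation as validated, rejected, or untagged.

    recs_for_observed / recs_for_step1 stand in for the substitution sets
    that recommend.py returns for the observed mAb and the step-1 mAb.
    """
    if mutation in recs_for_step1:
        return "validated"   # step-1 mAb is pushed toward the observed mAb
    if invert(mutation) in recs_for_observed:
        return "rejected"    # observed mAb is pushed back toward germline
    return "untagged"

print(tag_mutation("A57G", recs_for_observed={"G57A"}, recs_for_step1=set()))
# rejected
```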

How should I understand the function reconstruct_multi_models(), and what does it do with the models?

Hi, Developer!

How should I understand the function reconstruct_multi_models(), and what does it do with the models?

Best,
Jamie

Note:

amis.py

def reconstruct_multi_models(
        wt_seq,
        model_names=[
            'esm1b',
            'esm1v1',
            'esm1v2',
            'esm1v3',
            'esm1v4',
            'esm1v5',
        ],
        alpha=None,
        return_names=False,
):
    # Count, across the ensemble of language models, how many models
    # recommend each substitution to the wildtype sequence.
    mutations_models, mutations_model_names = {}, {}
    for model_name in model_names:
        model = get_model_name(model_name)
        if alpha is None:
            # Greedy reconstruction: decode the most likely residue at
            # each position, then diff the result against the wildtype.
            wt_new = reconstruct(
                wt_seq, model, decode_kwargs={ 'exclude': 'unnatural' }
            )
            mutations_model = diff(wt_seq, wt_new)
        else:
            # Soft reconstruction: threshold substitutions by model
            # likelihood, with the cutoff controlled by alpha.
            mutations_model = soft_reconstruct(
                wt_seq, model, alpha=alpha,
            )
        for mutation in mutations_model:
            if mutation not in mutations_models:
                mutations_models[mutation] = 0
                mutations_model_names[mutation] = []
            mutations_models[mutation] += 1  # one vote per recommending model
            mutations_model_names[mutation].append(model.name_)
        del model  # release the model before loading the next one

    if return_names:
        return mutations_models, mutations_model_names

    return mutations_models
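To illustrate what reconstruct_multi_models() does conceptually: each model proposes a reconstruction of the wildtype sequence, diff() extracts the substitutions, and one vote is tallied per model that proposes each substitution. A self-contained mock, with stub sequences standing in for real ESM reconstructions:

```python
def diff(wt_seq, new_seq):
    """List substitutions between two equal-length sequences, e.g. 'A4V'."""
    return [
        f"{wt}{i + 1}{new}"
        for i, (wt, new) in enumerate(zip(wt_seq, new_seq))
        if wt != new
    ]

wt = "MKTA"
model_outputs = ["MKTV", "MKTV", "MRTA"]  # stand-ins for model reconstructions

# Tally votes the same way the loop in reconstruct_multi_models() does.
votes = {}
for out in model_outputs:
    for mutation in diff(wt, out):
        votes[mutation] = votes.get(mutation, 0) + 1

print(votes)  # {'A4V': 2, 'K2R': 1}
```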

RuntimeError: unexpected EOF, expected 1056959984 more bytes. The file might be corrupted.

Hi, I would like to use your excellent code, but I encountered an error!

 CUDA_VISIBLE_DEVICES=3 python bin/recommend.py MNHDQEFDPPKVYPPVPAEKRKPIRVLSLFDGIATGLLVLK                                                                                                                                         
Traceback (most recent call last):
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/esm/pretrained.py", line 33, in load_hub_workaround
    data = torch.hub.load_state_dict_from_url(url, progress=False, map_location="cpu")
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/torch/hub.py", line 731, in load_state_dict_from_url
    return torch.load(cached_file, map_location=map_location)
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/torch/serialization.py", line 938, in _legacy_load
    typed_storage._storage._set_from_file(
RuntimeError: unexpected EOF, expected 1056959984 more bytes. The file might be corrupted.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/training/nong/app/efficient-evolution/bin/recommend.py", line 39, in <module>
    mutations_models = reconstruct_multi_models(
  File "/training/nong/app/efficient-evolution/bin/amis.py", line 221, in reconstruct_multi_models
    model = get_model_name(model_name)
  File "/training/nong/app/efficient-evolution/bin/amis.py", line 39, in get_model_name
    model = FBModel(
  File "/training/nong/app/efficient-evolution/bin/fb_model.py", line 10, in __init__
    model, alphabet = pretrained.load_model_and_alphabet(name)
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/esm/pretrained.py", line 28, in load_model_and_alphabet
    return load_model_and_alphabet_hub(model_name)
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/esm/pretrained.py", line 63, in load_model_and_alphabet_hub
    model_data, regression_data = _download_model_and_regression_data(model_name)
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/esm/pretrained.py", line 54, in _download_model_and_regression_data
    model_data = load_hub_workaround(url)
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/esm/pretrained.py", line 37, in load_hub_workaround
    data = torch.load(
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/torch/serialization.py", line 938, in _legacy_load
    typed_storage._storage._set_from_file(
RuntimeError: unexpected EOF, expected 1056959984 more bytes. The file might be corrupted.
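This error usually means an interrupted download left a truncated checkpoint in the torch hub cache; deleting the cached file forces a clean re-download on the next run. A minimal sketch (the default cache path is assumed to match torch.hub's default):

```python
from pathlib import Path

def remove_cached_checkpoint(name, cache_dir=None):
    """Delete a cached checkpoint so torch.hub re-downloads it next run."""
    cache_dir = Path(cache_dir or Path.home() / ".cache/torch/hub/checkpoints")
    target = cache_dir / name
    if target.exists():
        target.unlink()
        return True
    return False

# Example, using one of the ESM-1v checkpoint file names:
# remove_cached_checkpoint("esm1v_t33_650M_UR90S_1.pt")
```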

nanobodies

Thank you for your code. I was wondering whether it could be trained on antibodies from other species, such as nanobodies?

Thank you

How do I use bin/predict_esm.py? Do I need to change seqs_abs, and under what circumstances should I modify it?

Hi, Developer!

How do I use bin/predict_esm.py? Do I need to change seqs_abs, and under what circumstances should I modify it?

Best,
Jamie

Note:

predict_esm.py

    seqs_abs = {
        'medi_vh': 'QVQLQQSGPGLVKPSQTLSLTCAISGDSVSSYNAVWNWIRQSPSRGLEWLGRTYYRSGWYNDYAESVKSRITINPDTSKNQFSLQLNSVTPEDTAVYYCARSGHITVFGVNVDAFDMWGQGTMVTVSS',
        'uca_vh': 'QVQLQQSGPGLVKPSQTLSLTCAISGDSVSSNSAAWNWIRQSPSRGLEWLGRTYYRSKWYNDYAVSVKSRITINPDTSKNQFSLQLNSVTPEDTAVYYCARGGHITIFGVNIDAFDIWGQGTMVTVSS',
        'mab114_vh': 'EVQLVESGGGLIQPGGSLRLSCAASGFALRMYDMHWVRQTIDKRLEWVSAVGPSGDTYYADSVKGRFAVSRENAKNSLSLQMNSLTAGDTAIYYCVRSDRGVAGLFDSWGQGILVTVSS',
        'mU_vh': 'EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYDMHWVRQATGKGLEWVSAIGTAGDTYYPGSVKGRFTISRENAKNSLYLQMNSLRAGDTAVYYCVRSDRGVAGLFDSWGQGTLVTVSS',
        's309_vh': 'QVQLVQSGAEVKKPGASVKVSCKASGYPFTSYGISWVRQAPGQGLEWMGWISTYNGNTNYAQKFQGRVTMTTDTSTTTGYMELRRLRSDDTAVYYCARDYTRGAWFGESLIGGFDNWGQGTLVTVSS',
        'r7_vh': 'QVQLVESGGGVVQPGRSLRLSCAASGFTFSNYAMYWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRFTISRDNSKNTLYLQMNSLRTEDTAVYYCASGSDYGDYLLVYWGQGTLVTVSS',
        'c143_vh': 'EVQLVESGGGLVQPGGSLRLSCAASGFSVSTKYMTWVRQAPGKGLEWVSVLYSGGSDYYADSVKGRFTISRDNSKNALYLQMNSLRVEDTGVYYCARDSSEVRDHPGHPGRSVGAFDIWGQGTMVTVSS',
        
        'medi_vl': 'DIQMTQSPSSLSASVGDRVTITCRTSQSLSSYTHWYQQKPGKAPKLLIYAASSRGSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSRTFGQGTKVEIK',
        'uca_vl': 'DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSRTFGQGTKVEIK',
        'mab114_vl': 'DIQMTQSPSSLSASVGDRITITCRASQAFDNYVAWYQQRPGKVPKLLISAASALHAGVPSRFSGSGSGTHFTLTISSLQPEDVATYYCQNYNSAPLTFGGGTKVEIK',
        'mU_vl': 'DIQMTQSPSSLSASVGDRVTITCRASQGISNYLAWYQQKPGKVPKLLIYAASTLQSGVPSRFSGSGSGTDFTLTISSLQPEDVATYYCQKYNSAPLTFGGGTKVEIK',
        's309_vl': 'EIVLTQSPGTLSLSPGERATLSCRASQTVSSTSLAWYQQKPGQAPRLLIYGASSRATGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQHDTSLTFGGGTKVEIK',
        'r7_vl': 'QSALTQPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSKRPSGVSNRFSGSKSGNTASLTISGLQSEDEADYYCNSLTSISTWVFGGGTKLTVL',
        'c143_vl': 'QSALTQPASVSGSPGQSITISCTGTSNDVGSYTLVSWYQQYPGKAPKLLIFEGTKRSSGISNRFSGSKSGNTASLTISGLQGEDEADYYCCSYAGASTFVFGGGTKLTVL',
    }

querying heavy and light chain separately or together?

Dear developers,

I've just found out about this tool, successfully installed it on an Ubuntu Linux computer with an NVIDIA GPU with 16 GB of VRAM, and ran the prediction script (recommend.py) successfully.

I would like to know the recommended way of predicting mutations for the heavy and light chains (Fv) of a mAb. Should I run the heavy and light chains separately?

To predict the structure, we use heavy+{GGGGS}x4+light as a single-chain input, with the linker being long enough not to interfere with the prediction.

Would that single chain also be a good input here, or is it better to predict the heavy and light chains separately?

Thanks in advance

Reproducing results from DMS datasets

I was able to run the first part of bash bin/dms.sh locally (which calls dms.py), but I'm struggling to understand how to reproduce the manuscript results. It seems that alpha was set to 1 for all of these datasets (except infA, for which 0.5 was used). However, how can I know what value of k is being used, and what is the best way to specify k when calling dms.py / reconstruct_multi_models? Thank you in advance!
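One plausible reading of k, sketched below, is the number of top-voted substitutions kept from the {mutation: vote count} dictionary that reconstruct_multi_models() returns; check the paper's methods for the exact definition (the helper name here is hypothetical):

```python
def top_k_mutations(votes, k):
    """Return the k substitutions with the most model votes."""
    return sorted(votes, key=votes.get, reverse=True)[:k]

print(top_k_mutations({"A4V": 5, "K2R": 2, "T3S": 3}, k=2))  # ['A4V', 'T3S']
```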

extra fine-tuning on dataset of dog/cat antibodies

Hi,

We have a set of dog and cat antibodies (about 200,000) that we would like to use for some extra fine-tuning of the models in this software, and we are wondering about the feasibility of doing so, considering the Nature Biotech paper describes evolving human antibodies rather than dog or cat antibodies.

What would be a good starting point, and what computational requirements (NVIDIA GPUs) should we expect for this mini-project? Thanks in advance.

Commercial License + combined mutations

Hello,

CC 4.0 is a very restrictive license for a project. I was wondering whether you are sticking with that license (even though most of the code is a wrapper around ESM-1b/v) or whether any simple commercial licenses are available through your tech transfer office.

I was also wondering why the subsequent mutations in your paper were combined manually. It seems it would be possible to do combinatorial scoring and order the best-scoring hits from the manifold. Did I miss something there?

Thanks,
-Jared

ESM-2 implementation?

Any chance there will be an update to use ESM-2 models such as facebook/esm2_t6_8M_UR50D? This would make the code easier to run locally for people with hardware constraints, and using some of the larger ESM-2 models would likely improve performance as well, no?

cuda device option pass-through

Hi,
I've run efficient-evolution recommend.py using the 'cuda:0' device on a Linux machine that has more than one Nvidia card.

Would it be possible to add a parameter to the recommend.py script that points to the cuda device to use in the system where there are many?

AFAICS, the relevant statement in the fb_model.py code is:

toks = toks.to(device='cuda', non_blocking=True)

below:

            with torch.no_grad():
                for batch_idx, (labels, strs, toks) in enumerate(data_loader):
                    if torch.cuda.is_available():
                        toks = toks.to(device='cuda', non_blocking=True)
                    out = self.model_(
                        toks,
                        repr_layers=self.repr_layers_,
                        return_contacts=False
                    )
                    logits = out['logits'].to(device='cpu').numpy()
                    output.append(logits[0])

I guess it could be something like:

bin/recommend.py --cuda cuda:2 ...
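The suggested pass-through could look like the following sketch; the flag name (--device) and the wiring are assumptions, not the repo's actual interface:

```python
import argparse

def parse_args(argv=None):
    """Parse a torch device option for recommend.py (hypothetical flag)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--device", default="cuda:0",
        help="torch device to run on, e.g. cpu, cuda:0, cuda:2",
    )
    return parser.parse_args(argv)

args = parse_args(["--device", "cuda:2"])
print(args.device)  # cuda:2
# In fb_model.py, the hard-coded line would then become:
# toks = toks.to(device=args.device, non_blocking=True)
```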

As an extra option,

Referring to the CUDA semantics notes in the PyTorch documentation:
https://pytorch.org/docs/stable/notes/cuda.html

Thanks in advance.

Colab notebook getting killed

Hey all, I tried to use the Colab notebook to run the example snippet, but it's getting killed, showing a ^C in the output.

I didn't kill the process myself. I took a look at the resource usage, and while it's close to the max for colab, it doesn't look like it's quite maxed out yet?

[Screenshot: Colab resource-usage panel, close to but not at the maximum]

Have you confirmed that this runs on the free tier colab instance without issues?

Excited to try out your tool!

Regression weight not found issue.

Hi!

I'm trying to run this on my MacBook Pro (Intel).

I just cloned the repo and ran bin/recommend.py.

I got results, but this warning also appeared:

"...../python3.9/site-packages/esm/pretrained.py:134: Regression weights not found, predicting contacts will not produce correct results."

Some say it is okay, but I just want to get confirmation from you!

Thank you!

Environment configuration

I had a lot of issues with the environment configuration, such as "module 'torch.utils.data' has no attribute 'Dataloader'", "module 'torch' has no attribute 'jit'", and "ModuleNotFoundError: No module named 'esm.model'", when running python bin/recommend.py. This appears to be due to an incorrect environment configuration. Could you suggest a conda environment configuration?
