
efficient-evolution's Introduction

Efficient evolution from general protein language models

Scripts for running the analysis described in the paper "Efficient evolution of human antibodies from general protein language models".

Running the model

To evaluate the model on a new sequence, clone this repository and run

python bin/recommend.py [sequence]

where [sequence] is the wildtype protein sequence you want to evolve. The script outputs a list of recommended substitutions, each with the number of language models that recommend it.
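If you post-process the output programmatically, a common step is to keep only substitutions recommended by several models. A minimal sketch, assuming each output line holds a substitution followed by its model count (the exact format printed by recommend.py may differ between versions):

```python
def filter_by_consensus(lines, min_models=2):
    """Keep substitutions recommended by at least min_models language models."""
    kept = []
    for line in lines:
        mutation, n_models = line.split()
        if int(n_models) >= min_models:
            kept.append((mutation, int(n_models)))
    return kept

# Hypothetical output lines: substitution, then number of recommending models.
example_output = ["T30A\t5", "Y57F\t1", "S103N\t3"]
print(filter_by_consensus(example_output))  # [('T30A', 5), ('S103N', 3)]
```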

To recommend mutations to antibody variable domain sequences, we have simply run the above script separately on the heavy and light chain sequences.

We have also made a Google Colab notebook available. However, the notebook downloads and installs the language models from scratch on each run, and it requires a Colab Pro instance because its memory requirement exceeds what the free tier of Colab provides. When making many predictions, we recommend the local installation above, which lets you cache and reuse the models.

Paper analysis scripts

To reproduce the analysis in the paper, first download and extract data with the commands:

wget https://zenodo.org/record/6968342/files/data.tar.gz
tar xvf data.tar.gz

To acquire the recommended mutations for a given antibody, run the command

bash bin/eval_models.sh [antibody_name]

where [antibody_name] is one of medi8852, medi_uca, mab114, mab114_uca, s309, regn10987, or c143.

DMS experiments can be run with the command

bash bin/dms.sh

efficient-evolution's People

Contributors

brianhie · cynicjon


efficient-evolution's Issues

dl.fbaipublicfiles.com download links error

Hello. I'm trying to run your great model, but the job keeps stopping with the error below.
Should I download each model manually and then run the job?

Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm1v_t33_650M_UR90S_1.pt" to /home/jisun/.cache/torch/hub/checkpoints/esm1v_t33_650M_UR90S_1.pt
Traceback (most recent call last):
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/urllib/request.py", line 1354, in do_open
h.request(req.get_method(), req.selector, req.data, headers,
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/http/client.py", line 1252, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/http/client.py", line 1298, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/http/client.py", line 1247, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/http/client.py", line 1007, in _send_output
self.send(msg)
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/http/client.py", line 947, in send
self.connect()
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/http/client.py", line 1414, in connect
super().connect()
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/http/client.py", line 918, in connect
self.sock = self._create_connection(
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/socket.py", line 808, in create_connection
raise err
File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.8.10/lib/python3.8/socket.py", line 796, in create_connection
sock.connect(sa)
OSError: [Errno 101] Network is unreachable

Question about usage

/home/user/anaconda3/envs/py3.6/lib/python3.6/site-packages/esm/pretrained.py:216:
UserWarning: Regression weights not found, predicting contacts will not produce correct results.

Can this warning be ignored, as mentioned in facebookresearch/esm#170?
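Per the linked ESM issue, the warning only matters for contact prediction, which recommend.py does not use (it calls the model with return_contacts=False). If desired, just that message can be silenced with the standard warnings module; a sketch (the commented-out load call is illustrative):

```python
import warnings
from contextlib import contextmanager

@contextmanager
def ignore_regression_warning():
    """Suppress only the 'Regression weights not found' UserWarning."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore", message="Regression weights not found.*"
        )
        yield

# Usage: wrap the model-loading call that triggers the warning.
with ignore_regression_warning():
    # model, alphabet = pretrained.load_model_and_alphabet(name)
    warnings.warn("Regression weights not found, ...")  # demo: suppressed
```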

interpreting the recommended mutations in a one-by-one context

Hi,

I've been using this software for a while now, and I thought of a scheme by which I could attempt to assign a "validated" or "rejected" tag to each somatic hypermutation of a monoclonal antibody, in the following way:

Given a monoclonal antibody observed in a single cell, I list each of its mutations with respect to the germline V genes and produce mAb protein sequences identical to the observed one except for a single amino acid, where the mutation has been reverted to the germline (a.k.a. the 'step-1 mAb').

Then, in parallel, I:
(1) submit this step-1 mAb to efficient-evolution/bin/recommend.py;
(2) submit the mAb as it was observed in the same way.

Then I count how many times the software marks the mutation as:
(1) 'rejected': when it suggests that the original mAb should mutate toward the step-1 mAb;
(2) 'validated': when it suggests that the step-1 mAb should mutate toward the original mAb.

Doing this for a collection of mAbs, I obtain many more 'rejected' mutations than 'validated' ones, at roughly a 10-to-1 ratio, and I wonder if this is expected.
Should I limit which of the recommended mutations I take into account, e.g. by using the numbers in the output? Any recommendations?
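The tagging scheme above can be sketched as a self-contained toy, with the recommend.py calls replaced by precomputed mutation sets; the "A57G" mutation-string format (wildtype residue, position, mutant residue) is an assumption:

```python
def invert(mutation):
    """Reverse a substitution string: 'A57G' -> 'G57A'."""
    return mutation[-1] + mutation[1:-1] + mutation[0]

def tag_mutation(mutation, recs_for_observed, recs_for_step1):
    """Tag one somatic hypermutation as validated, rejected, or untagged.

    recs_for_observed / recs_for_step1 stand in for the substitution sets
    that recommend.py returns for the observed mAb and the step-1 mAb.
    """
    if mutation in recs_for_step1:
        return "validated"   # step-1 mAb is pushed toward the observed mAb
    if invert(mutation) in recs_for_observed:
        return "rejected"    # observed mAb is pushed back toward germline
    return "untagged"

print(tag_mutation("A57G", recs_for_observed={"G57A"}, recs_for_step1=set()))
# rejected
```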

How should I understand the function reconstruct_multi_models(), and what does it do with the models?

Hi, Developer!

How should I understand the function reconstruct_multi_models(), and what does it do with the models?

Best,
Jamie

Note:

amis.py

def reconstruct_multi_models(
        wt_seq,
        model_names=[
            'esm1b',
            'esm1v1',
            'esm1v2',
            'esm1v3',
            'esm1v4',
            'esm1v5',
        ],
        alpha=None,
        return_names=False,
):
    # Count, across the ensemble of language models, how many models
    # recommend each substitution to the wildtype sequence.
    mutations_models, mutations_model_names = {}, {}
    for model_name in model_names:
        model = get_model_name(model_name)
        if alpha is None:
            # Greedy reconstruction: decode the most likely residue at
            # each position, then diff the result against the wildtype.
            wt_new = reconstruct(
                wt_seq, model, decode_kwargs={ 'exclude': 'unnatural' }
            )
            mutations_model = diff(wt_seq, wt_new)
        else:
            # Soft reconstruction: threshold substitutions by model
            # likelihood, with the cutoff controlled by alpha.
            mutations_model = soft_reconstruct(
                wt_seq, model, alpha=alpha,
            )
        for mutation in mutations_model:
            if mutation not in mutations_models:
                mutations_models[mutation] = 0
                mutations_model_names[mutation] = []
            mutations_models[mutation] += 1  # one vote per recommending model
            mutations_model_names[mutation].append(model.name_)
        del model  # release the model before loading the next one

    if return_names:
        return mutations_models, mutations_model_names

    return mutations_models
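To illustrate what reconstruct_multi_models() does conceptually: each model proposes a reconstruction of the wildtype sequence, diff() extracts the substitutions, and one vote is tallied per model that proposes each substitution. A self-contained mock, with stub sequences standing in for real ESM reconstructions:

```python
def diff(wt_seq, new_seq):
    """List substitutions between two equal-length sequences, e.g. 'A4V'."""
    return [
        f"{wt}{i + 1}{new}"
        for i, (wt, new) in enumerate(zip(wt_seq, new_seq))
        if wt != new
    ]

wt = "MKTA"
model_outputs = ["MKTV", "MKTV", "MRTA"]  # stand-ins for model reconstructions

# Tally votes the same way the loop in reconstruct_multi_models() does.
votes = {}
for out in model_outputs:
    for mutation in diff(wt, out):
        votes[mutation] = votes.get(mutation, 0) + 1

print(votes)  # {'A4V': 2, 'K2R': 1}
```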

RuntimeError: unexpected EOF, expected 1056959984 more bytes. The file might be corrupted.

Hi, I would like to use your excellent code, but I encountered an error!

 CUDA_VISIBLE_DEVICES=3 python bin/recommend.py MNHDQEFDPPKVYPPVPAEKRKPIRVLSLFDGIATGLLVLK                                                                                                                                         
Traceback (most recent call last):
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/esm/pretrained.py", line 33, in load_hub_workaround
    data = torch.hub.load_state_dict_from_url(url, progress=False, map_location="cpu")
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/torch/hub.py", line 731, in load_state_dict_from_url
    return torch.load(cached_file, map_location=map_location)
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/torch/serialization.py", line 938, in _legacy_load
    typed_storage._storage._set_from_file(
RuntimeError: unexpected EOF, expected 1056959984 more bytes. The file might be corrupted.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/training/nong/app/efficient-evolution/bin/recommend.py", line 39, in <module>
    mutations_models = reconstruct_multi_models(
  File "/training/nong/app/efficient-evolution/bin/amis.py", line 221, in reconstruct_multi_models
    model = get_model_name(model_name)
  File "/training/nong/app/efficient-evolution/bin/amis.py", line 39, in get_model_name
    model = FBModel(
  File "/training/nong/app/efficient-evolution/bin/fb_model.py", line 10, in __init__
    model, alphabet = pretrained.load_model_and_alphabet(name)
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/esm/pretrained.py", line 28, in load_model_and_alphabet
    return load_model_and_alphabet_hub(model_name)
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/esm/pretrained.py", line 63, in load_model_and_alphabet_hub
    model_data, regression_data = _download_model_and_regression_data(model_name)
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/esm/pretrained.py", line 54, in _download_model_and_regression_data
    model_data = load_hub_workaround(url)
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/esm/pretrained.py", line 37, in load_hub_workaround
    data = torch.load(
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/training/nong/app/miniconda3/envs/efficient-evolution/lib/python3.9/site-packages/torch/serialization.py", line 938, in _legacy_load
    typed_storage._storage._set_from_file(
RuntimeError: unexpected EOF, expected 1056959984 more bytes. The file might be corrupted.
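This error usually means an interrupted download left a truncated checkpoint in the torch hub cache; deleting the cached file forces a clean re-download on the next run. A minimal sketch (the default cache path is assumed to match torch.hub's default):

```python
from pathlib import Path

def remove_cached_checkpoint(name, cache_dir=None):
    """Delete a cached checkpoint so torch.hub re-downloads it next run."""
    cache_dir = Path(cache_dir or Path.home() / ".cache/torch/hub/checkpoints")
    target = cache_dir / name
    if target.exists():
        target.unlink()
        return True
    return False

# Example, using one of the ESM-1v checkpoint file names:
# remove_cached_checkpoint("esm1v_t33_650M_UR90S_1.pt")
```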

nanobodies

Thank you for your code. I was wondering whether it could be trained on antibodies from other species, such as nanobodies?

Thank you

How do I use bin/predict_esm.py? Do I need to change seqs_abs, and under what circumstances should I modify it?

Hi, Developer!

How do I use bin/predict_esm.py? Do I need to change seqs_abs, and under what circumstances should I modify it?

Best,
Jamie

Note:

predict_esm.py

    seqs_abs = {
        'medi_vh': 'QVQLQQSGPGLVKPSQTLSLTCAISGDSVSSYNAVWNWIRQSPSRGLEWLGRTYYRSGWYNDYAESVKSRITINPDTSKNQFSLQLNSVTPEDTAVYYCARSGHITVFGVNVDAFDMWGQGTMVTVSS',
        'uca_vh': 'QVQLQQSGPGLVKPSQTLSLTCAISGDSVSSNSAAWNWIRQSPSRGLEWLGRTYYRSKWYNDYAVSVKSRITINPDTSKNQFSLQLNSVTPEDTAVYYCARGGHITIFGVNIDAFDIWGQGTMVTVSS',
        'mab114_vh': 'EVQLVESGGGLIQPGGSLRLSCAASGFALRMYDMHWVRQTIDKRLEWVSAVGPSGDTYYADSVKGRFAVSRENAKNSLSLQMNSLTAGDTAIYYCVRSDRGVAGLFDSWGQGILVTVSS',
        'mU_vh': 'EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYDMHWVRQATGKGLEWVSAIGTAGDTYYPGSVKGRFTISRENAKNSLYLQMNSLRAGDTAVYYCVRSDRGVAGLFDSWGQGTLVTVSS',
        's309_vh': 'QVQLVQSGAEVKKPGASVKVSCKASGYPFTSYGISWVRQAPGQGLEWMGWISTYNGNTNYAQKFQGRVTMTTDTSTTTGYMELRRLRSDDTAVYYCARDYTRGAWFGESLIGGFDNWGQGTLVTVSS',
        'r7_vh': 'QVQLVESGGGVVQPGRSLRLSCAASGFTFSNYAMYWVRQAPGKGLEWVAVISYDGSNKYYADSVKGRFTISRDNSKNTLYLQMNSLRTEDTAVYYCASGSDYGDYLLVYWGQGTLVTVSS',
        'c143_vh': 'EVQLVESGGGLVQPGGSLRLSCAASGFSVSTKYMTWVRQAPGKGLEWVSVLYSGGSDYYADSVKGRFTISRDNSKNALYLQMNSLRVEDTGVYYCARDSSEVRDHPGHPGRSVGAFDIWGQGTMVTVSS',
        
        'medi_vl': 'DIQMTQSPSSLSASVGDRVTITCRTSQSLSSYTHWYQQKPGKAPKLLIYAASSRGSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSRTFGQGTKVEIK',
        'uca_vl': 'DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSRTFGQGTKVEIK',
        'mab114_vl': 'DIQMTQSPSSLSASVGDRITITCRASQAFDNYVAWYQQRPGKVPKLLISAASALHAGVPSRFSGSGSGTHFTLTISSLQPEDVATYYCQNYNSAPLTFGGGTKVEIK',
        'mU_vl': 'DIQMTQSPSSLSASVGDRVTITCRASQGISNYLAWYQQKPGKVPKLLIYAASTLQSGVPSRFSGSGSGTDFTLTISSLQPEDVATYYCQKYNSAPLTFGGGTKVEIK',
        's309_vl': 'EIVLTQSPGTLSLSPGERATLSCRASQTVSSTSLAWYQQKPGQAPRLLIYGASSRATGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQHDTSLTFGGGTKVEIK',
        'r7_vl': 'QSALTQPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSKRPSGVSNRFSGSKSGNTASLTISGLQSEDEADYYCNSLTSISTWVFGGGTKLTVL',
        'c143_vl': 'QSALTQPASVSGSPGQSITISCTGTSNDVGSYTLVSWYQQYPGKAPKLLIFEGTKRSSGISNRFSGSKSGNTASLTISGLQGEDEADYYCCSYAGASTFVFGGGTKLTVL',
    }

querying heavy and light chain separately or together?

Dear developers,

I've just found out about this tool, successfully installed it on an Ubuntu Linux computer with an NVIDIA GPU with 16 GB of VRAM, and ran the prediction script (recommend.py) successfully.

I would like to know the recommended way of predicting mutations for the heavy and light chains (Fv) of a mAb. Should I run the heavy and light chains separately?

To predict the structure, we use heavy+{GGGGS}x4+light as a single-chain input, with the linker being long enough not to interfere with the prediction.

Would that single chain also be a good input here, or is it better to predict the heavy and light chains separately?

Thanks in advance

Reproducing results from DMS datasets

I was able to run the first part of bash bin/dms.sh locally (which calls dms.py), but I'm struggling to understand how to reproduce the manuscript results. It seems that alpha was set to 1 for all of these datasets (except infA, for which 0.5 was used). However, how can I know what value of k is being used, and what is the best way to specify k when calling dms.py / reconstruct_multi_models? Thank you in advance!
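One plausible reading of k, sketched below, is the number of top-voted substitutions kept from the {mutation: vote count} dictionary that reconstruct_multi_models() returns; check the paper's methods for the exact definition (the helper name here is hypothetical):

```python
def top_k_mutations(votes, k):
    """Return the k substitutions with the most model votes."""
    return sorted(votes, key=votes.get, reverse=True)[:k]

print(top_k_mutations({"A4V": 5, "K2R": 2, "T3S": 3}, k=2))  # ['A4V', 'T3S']
```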

extra fine-tuning on dataset of dog/cat antibodies

Hi,

We have a set of dog and cat antibodies (about 200,000) that we would like to use for some extra fine-tuning of the models in this software, and we are wondering about the feasibility of doing so, considering the Nature Biotech paper describes evolving human antibodies rather than dog or cat antibodies.

What would be a good starting point, and what computational requirements (NVIDIA GPUs) should we expect for this mini-project? Thanks in advance.

Commercial License + combined mutations

Hello,

CC 4.0 is a very restrictive license for a project. I was wondering whether you are sticking with that license (even though most of the code is a wrapper around ESM-1b/v) or whether any simple commercial licenses are available through your tech transfer office.

I was also wondering why the subsequent mutations in your paper were combined manually. It seems it would be possible to do combinatorial scoring and order the best-scoring hits from the manifold. Did I miss something there?

Thanks,
-Jared

ESM-2 implementation?

Any chance there will be an update to use ESM-2 models such as facebook/esm2_t6_8M_UR50D? This would make the code easier to run locally for people with hardware constraints, and using some of the larger ESM-2 models would likely improve performance as well, no?

cuda device option pass-through

Hi,
I've run efficient-evolution recommend.py using the 'cuda:0' device on a Linux machine that has more than one Nvidia card.

Would it be possible to add a parameter to the recommend.py script that points to the cuda device to use in the system where there are many?

AFAICS, the relevant statement in the fb_model.py code is:

toks = toks.to(device='cuda', non_blocking=True)

below:

            with torch.no_grad():
                for batch_idx, (labels, strs, toks) in enumerate(data_loader):
                    if torch.cuda.is_available():
                        toks = toks.to(device='cuda', non_blocking=True)
                    out = self.model_(
                        toks,
                        repr_layers=self.repr_layers_,
                        return_contacts=False
                    )
                    logits = out['logits'].to(device='cpu').numpy()
                    output.append(logits[0])

I guess it could be something like:

bin/recommend.py --cuda cuda:2 ...
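The suggested pass-through could look like the following sketch; the flag name (--device) and the wiring are assumptions, not the repo's actual interface:

```python
import argparse

def parse_args(argv=None):
    """Parse a torch device option for recommend.py (hypothetical flag)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--device", default="cuda:0",
        help="torch device to run on, e.g. cpu, cuda:0, cuda:2",
    )
    return parser.parse_args(argv)

args = parse_args(["--device", "cuda:2"])
print(args.device)  # cuda:2
# In fb_model.py, the hard-coded line would then become:
# toks = toks.to(device=args.device, non_blocking=True)
```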

As an extra option,

Referring to the CUDA semantics notes in the PyTorch documentation:
https://pytorch.org/docs/stable/notes/cuda.html

Thanks in advance.

Colab notebook getting killed

Hey all, I tried to use the Colab notebook to run the example snippet, but it's getting killed, showing a ^C in the output.

I didn't kill the process myself. I took a look at the resource usage, and while it's close to the max for colab, it doesn't look like it's quite maxed out yet?

[Screenshot: Colab resource-usage panel, close to but not at the maximum]

Have you confirmed that this runs on the free tier colab instance without issues?

Excited to try out your tool!

Regression weight not found issue.

Hi!

I'm trying to run this on my MacBook Pro (Intel).

I just cloned the repo and ran bin/recommend.py.

I got results, but this warning also appeared:

"...../python3.9/site-packages/esm/pretrained.py:134: Regression weights not found, predicting contacts will not produce correct results."

Some say it is okay, but I just want to get confirmation from you!

Thank you!

Environment configuration

I had a lot of issues with the environment configuration, such as "module 'torch.utils.data' has no attribute 'Dataloader'", "module 'torch' has no attribute 'jit'", and "ModuleNotFoundError: No module named 'esm.model'", when running python bin/recommend.py. This appears to be due to an incorrect environment configuration. Could you suggest a conda environment configuration?
