facebookresearch / esm
Evolutionary Scale Modeling (esm): Pretrained language models for proteins
License: MIT License
This is not an issue, so I apologize for putting it here, but I didn't really know where else to ask.
I have been testing out the various pretrained networks you have trained in this repository. They seem very interesting, and I might use them in a paper I'm working on, so I would like to understand them in detail.
One thing I do not understand about the networks is why they include so many special tokens.
I get that you need the masking token, and similarly the padding token for handling proteins of various sizes batched together.
The cls and eos tokens are placed just before and after a protein, but they seem unnecessary for proteins unless I'm missing something?
The unk token should signal that an amino acid is unknown, if I understand correctly, but isn't X generally the catch-all in protein notation for unknown amino acids? So what is the use case here?
And similarly for the last few tokens, for which I have no good guess.
Following the Quick Start instructions, I load the model through PyTorch Hub.
model, alphabet = torch.hub.load("facebookresearch/esm", "esm1b_t33_650M_UR50S")
Exception experienced:
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'
Double-checked that torch.__version__ is 1.7.0. Can provide more info if needed.
Hi
Thanks so much for sharing your great work!
I have two questions; I was wondering if you could answer both:
1: After downloading esm1b and esm-MSA, I had no problem loading esm1b, but I hit the error below when trying to load esm-MSA:
in esm_msa1_t12_100M_UR50S
return load_model_and_alphabet_hub("esm_msa1_t12_100M_UR50S")
in load_model_and_alphabet_hub
model_data = load_hub_workaround(url)
in load_hub_workaround
f"{torch.hub.get_dir()}/checkpoints/{fn}",
AttributeError: module 'torch.hub' has no attribute 'get_dir'
import itertools
from typing import List, Tuple
from Bio import SeqIO  # biopython

def read_msa(filename: str, nseq: int) -> List[Tuple[str, str]]:
    """Read the first nseq sequences from an MSA file, removing insertions."""
    return [(record.description, remove_insertions(str(record.seq)))
            for record in itertools.islice(SeqIO.parse(filename, "fasta"), nseq)]
Thanks so much
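The read_msa snippet above assumes a remove_insertions helper; a minimal sketch, assuming a3m conventions in which insertions relative to the query are written as lowercase letters and "." characters:

```python
import string

# Hypothetical helper assumed by read_msa: delete lowercase letters and
# "." / "*" characters so only the aligned columns of the query remain.
deletekeys = dict.fromkeys(string.ascii_lowercase)
deletekeys["."] = None
deletekeys["*"] = None
translation = str.maketrans(deletekeys)

def remove_insertions(sequence: str) -> str:
    """Remove insertion columns (lowercase letters and dots) from a sequence."""
    return sequence.translate(translation)
```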
Thank you for sharing these excellent results. I'm hoping to clarify the number of sequences you sample when training the MSA Transformer (as in the subsample strategy section).
As I understand it (please correct me if I'm wrong), you use 1024 as the sequence length L, so the number of sequences you sample is only N/L = 2^14/1024 = 16? And you keep this number for the later supervised contact prediction as well (you mentioned 256 sequences are used in unsupervised prediction)?
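The arithmetic in the question, written out (numbers taken from the question itself):

```python
# Token budget N = 2**14 and sequence length L = 1024, as stated above,
# give the number of MSA rows sampled per input.
N = 2 ** 14
L = 1024
rows_sampled = N // L
```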
Lines 205 to 225 in 5680ba7
If I understand the einsum correctly, attn_weights is of shape [head_size, seq_len, batch_size, msa_row_size, msa_row_size], or [H, C, B, R, R].
If the above is true, shouldn't we take the softmax over the 'C' axis? i.e.
attn_probs = attn_weights.softmax(1)
Thank you!
Hi
I have been working with some of the models from the Hugging Face library, and their tokenizers have a .decode() method associated with them. It seems that batch_converter is the ESM equivalent of Hugging Face's tokenizer, but I can't find a similar method to map back from ids to strings. Is there one?
Thanks
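For what it's worth, a minimal sketch of such a decode, assuming access to the alphabet's id-to-token list (alphabet.all_toks in fair-esm). The function name and special-token set here are my own, not part of the library:

```python
def decode_tokens(token_ids, all_toks,
                  special=("<cls>", "<eos>", "<pad>", "<mask>", "<unk>")):
    """Map one row of token ids back to a sequence string, dropping special tokens."""
    return "".join(all_toks[i] for i in token_ids if all_toks[i] not in special)
```

With the real alphabet this would be called as decode_tokens(batch_tokens[0].tolist(), alphabet.all_toks).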
Thank you for this resource! I am currently trying to predict a protein sequence from an embedding.
Is it possible to load the decoder of the ESM model? Is there an example snippet I could use?
Bug overview
Unsupervised contact map prediction from ESM-MSA-1 seems bugged.
Bug description
When I generate the contact map, the model seems unable to predict any long-range or even medium-range contacts. I tested it on CASP14 targets, and the performance seems much worse than expected based on the results reported in the manuscript.
Additional information
The following code was used to generate the output:
model, alphabet = torch.hub.load("facebookresearch/esm", "esm_msa1_t12_100M_UR50S")
batch_converter = alphabet.get_batch_converter()
batch_labels, batch_strs, batch_tokens = batch_converter(data)
with torch.no_grad():
    contact = model.predict_contacts(batch_tokens)
Hi!
Thanks for making this model public! I have had a good amount of success using it to predict function, similar to your example Colab. I was wondering whether you have any insight into how I could fine-tune the model as-is on more specific data. I think I could reasonably set up a masking function by setting some values to 0 and using the logits and softmax to get predictions, but if I wanted to do next-token prediction, would I have to add a lot to the current model, or is there an output I could use? E.g., if a sequence is MYA, I want to use only the information before the predicted amino acid, not the entire sequence: to predict M I would use only the fact that it follows the start token; then to predict Y I would use the previous hidden state that knows there is a start token and an M, and so on for the whole sequence.
The example code in the Quick Start section of the github readme page shows this excerpt:
sequence_representations = []
for i, (_, seq) in enumerate(data):
sequence_representations.append(token_representations[i, 1 : len(seq) + 1].mean(0))
The sequence_representations will then include the last position of token_representations, which appears to be the stop token.
Is this intended?
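One way to reason about what the slice covers, assuming ESM-1b's <cls> … <eos> token layout (a sketch, not the library's code):

```python
# For a sequence of length L, tokens are laid out as:
#   position 0      -> <cls>
#   positions 1..L  -> residues
#   position L+1    -> <eos>
seq = "MKTV"
tokens = ["<cls>"] + list(seq) + ["<eos>"]
residue_span = tokens[1 : len(seq) + 1]  # the slice used in the readme excerpt
```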
When trying to run
python extract.py esm1_t34_670M_UR50S examples/P62593.fasta examples/P62593_reprs/ --repr_layers 34 --include mean
I get:
tcmalloc: large alloc 2676842496 bytes == 0x548b6000 @ 0x7f739765cb6b 0x7f739767c379 0x7f734850974e 0x7f734850b7b6 0x7f7382f74ba5 0x7f7392c2f1d9 0x551555 0x5a9dac 0x50a433 0x50beb4 0x507be4 0x508ec2 0x5a4c61 0x5a4fb8 0x4e012e 0x50a461 0x50beb4 0x507be4 0x588e5c 0x59fd0e 0x50d256 0x507be4 0x509900 0x50a2fd 0x50cc96 0x507be4 0x509900 0x50a2fd 0x50cc96 0x5095c8 0x50a2fd
tcmalloc: large alloc 2676842496 bytes == 0xf418c000 @ 0x7f739765cb6b 0x7f739767c379 0x7f734850974e 0x7f734850b7b6 0x7f7382f74ba5 0x7f7392c2f1d9 0x551555 0x5a9dac 0x50a433 0x50beb4 0x507be4 0x508ec2 0x5a4c61 0x5a4fb8 0x4e012e 0x50a461 0x50beb4 0x507be4 0x588e5c 0x59fd0e 0x50d256 0x507be4 0x509900 0x50a2fd 0x50cc96 0x507be4 0x509900 0x50a2fd 0x50cc96 0x5095c8 0x50a2fd
I'm guessing that the Colab GPU (a T4 with 15 GB of memory in my case) is unable to pull the entire model into memory? Is anybody else running into this?
Hi
I would like to fine-tune esm with a language model head. I tried
import torch
model = torch.hub.load("facebookresearch/esm", "modelWithLMHead", "esm1_t34_670M_UR50S")
but I got
RuntimeError: Cannot find callable modelWithLMHead in hubconf
Is there a simple way to do this? Thanks
I'm looking to use ONNX and ONNX Runtime to speed up our ESM-1b workloads.
I converted the serialized model esm1b_t33_650M_UR50S.pt to a .onnx graph using torch.onnx, then explicitly applied extended optimizations (including conversion to float16) using onnxruntime_tools.
When trying to run an ONNXRuntime inference session with the following inputs:
data = [
("protein1", "VLAGG"),
("protein2", "KALTARQ"),
]
I get:
InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Got invalid dimensions for input: input.1 for the following indices
index: 1 Got: 8 Expected: 9
Before the ONNX Runtime inference code, I'd like to share my PyTorch-to-ONNX conversion process first, in case the issue resides there:
export MODEL_PATH=/mnt/models/esm1b/esm1b_t33_650M_UR50S.pt # model file downloaded locally
export CONVERTED_GRAPH_PATH=/tmp/models/onnx_esm/graph.onnx # intermediate storage for the graph and the external data binaries ("/tmp/models/onnx_esm" must be created beforehand)
export OPTIMIZED_GRAPH_PATH=/mnt/models/onnx_esm/graph.onnx # final form of the graph, encapsulated within a single 1.3G file ("/mnt/models/onnx_esm" must be created beforehand)
python convert_onnx_esm.py --model-path $MODEL_PATH --converted-model-path $CONVERTED_GRAPH_PATH
python -m onnxruntime_tools.optimizer_cli --float16 --opt_level 99 --use_gpu --model_type bert --hidden_size 1024 --num_heads 16 --input $CONVERTED_GRAPH_PATH --output $OPTIMIZED_GRAPH_PATH
This is the source code for convert_onnx_esm.py:
import os
import torch
import torch.onnx
import argparse
from esm.pretrained import load_model_and_alphabet_local
parser = argparse.ArgumentParser()
parser.add_argument("--model-path", type=str, required=True)
parser.add_argument("--converted-model-path", type=str, required=True)
args = parser.parse_args()
model, alphabet = load_model_and_alphabet_local(args.model_path)
batch_converter = alphabet.get_batch_converter()
data = [
("protein1", "VLAGG"),
("protein2", "KALTARQ"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
with torch.no_grad():
    torch.onnx.export(
        model,
        batch_tokens,
        args.converted_model_path,
        use_external_data_format=True,
        opset_version=12,
        do_constant_folding=True,
    )
Now for the inference code which produced the exception:
import os
import numpy as np
import argparse
from onnxruntime import (
GraphOptimizationLevel,
InferenceSession,
SessionOptions,
get_device,
)
from esm.data import Alphabet
provider = "CUDAExecutionProvider" if get_device() == "GPU" else "CPUExecutionProvider"
parser = argparse.ArgumentParser()
parser.add_argument("--optimized-model-path", type=str, required=True)
args = parser.parse_args()
options = SessionOptions()
options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
model = InferenceSession(args.optimized_model_path, options)
model.set_providers([provider])
alphabet = Alphabet.from_architecture("protein_bert_base")
batch_converter = alphabet.get_batch_converter()
data = [
("protein1", "VLAGG"),
("protein2", "KALTARQ"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
output = model.run(None, {"input.1": batch_tokens.numpy()}) # input name "input.1" should be the default when exporting with `torch.onnx.export()`
CUDA 11.2, cuDNN 8.1.1.33, and Python 3.8.5 with packages:
fair-esm==0.3.0
onnx==1.8.1
onnxconverter-common==1.6.0
onnxruntime-gpu==1.7.0
onnxruntime-tools==1.6.0
Note: if we can find a solution for this issue, I'd be happy to clean up and contribute the model conversion and inference example to the repo.
Hello,
I was trying to get the embedding representations from the MSA transformer.
The code:
import torch
import esm
# Load MSA Transformer model
model, alphabet = esm.pretrained.esm_msa1_t12_100M_UR50S()
batch_converter = alphabet.get_batch_converter()
# Prepare data (first 2 sequences from ESMStructuralSplitDataset superfamily / 4)
data = [
("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
("protein2", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYIIVATPRGYVLAGG"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
# Extract per-residue representations (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[12], return_contacts=True)
token_representations = results["representations"][12]
Each time that I run the code, token_representations is different. Is this the expected behavior?
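One possible cause (an assumption on my part, not a confirmed diagnosis): if the model is still in training mode, dropout makes each forward pass stochastic, and calling model.eval() before inference makes the outputs deterministic. The effect can be seen on a toy module:

```python
import torch

# Toy module with dropout: in train mode repeated forward passes differ,
# in eval mode dropout is disabled and repeated passes are identical.
toy = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Dropout(p=0.5))
x = torch.ones(1, 4)

toy.eval()  # the call that may be missing before running the ESM model
out1, out2 = toy(x), toy(x)
```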
I have a small question. In my understanding, predicting long-range residue contacts is harder than predicting short-range ones, but the numbers reported in this paper suggest the opposite. If anyone can explain this, I would be very grateful.
Hello,
Are you planning on releasing the model (architecture and weights) that you have trained in a supervised fashion for predicting contact maps and secondary structure in "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences"?
Thank you for making all this awesome work available!
Hi,
I want to fine-tune the model on my own dataset. Is there a way to run your model for this task?
Hello, is there a way to reproduce the pretraining procedure?
I want to pretrain the model on my own data.
Thank you.
Hi there!
I'm trying to compare ESM to UniRep, the embedding from the Church lab, for variant function prediction. Eventually, there are a few proteins our lab would like to optimize, and ESM has some advantages over UniRep. I need to "evolutionarily fine-tune" ESM, as the Church lab does for UniRep: refine the global model's weights by continuing training on a small neighborhood (~100k sequences) around the target protein.
Could y'all provide any of the code you used in the pre-training task? E.g., your implementations of noising/masking, your loss function, or your training loop?
Thank you, I think ESM is super cool!
Best,
Jacob
Bug description
A sequence longer than 1022 residues causes an unspecific exception on both CPU and GPU. On CPU, it says IndexError: index out of range in self. On the Quadro RTX 8000 I tested, it causes a CUDA error: device-side assert triggered, after which all further attempts to embed sequences of any length fail with the same error.
I'm aware that esm was trained with sequences of less than 1024 amino acids; I'm opening this issue because this does not seem to be mentioned in the repo nor the paper, and from the error message in the exception it's hard to figure out what is wrong. I'd also be interested in how you'd suggest handling user input with longer sequences: Should this simply error with a message that this is not supported or do you suggest another way of handling this (I've seen #21 (comment) listing some strategies)?
Reproduction steps
Pretty much the readme example, only with a longer sequence added:
c="MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAPHGHVMVELVAELEGIQYGRSGETLGVLVPHVGEIPVAYRKVLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQENWNTKHSSGVTRELMRELNGGAYTRYVDNNFCGPDGYPLECIKDLLARAGKASCTLSEQLDFIDTKRGVYCCREHEHEIAWYTERSEKSYELQTPFEIKLAKKFDTFNGECPNFVFPLNSIIKTIQPRVEKKKLDGFMGRIRSVYPVASPNECNQMCLSTLMKCDHCGETSWQTGDFVKATCEFCGTENLTKEGATTCGYLPQNAVVKIYCPACHNSEVGPEHSLAEYHNESGLKTILRKGGRTIAFGGCVFSYVGCHNKCAYWVPRASANIGCNHTGVVGEGSEGLNDNLLEILQKEKVNINIVGDFKLNEEIAIILASFSASTSAFVETVKGLDYKAFKQIVESCGNFKVTKGKAKKGAWNIGEQKSILSPLYAFASEAARVVRSIFSRTLETAQNSVRVLQKAAITILDGISQYSLRLIDAMMFTSDLATNNLVVMAYITGGVVQLTSQWLTNIFGTVYEKLKPVLDWLEEKFKEGVEFLRDGWEIVKFISTCACEIVGGQIVTCAKEIKESVQTFFKLVNKFLALCADSIIIGGAKLKALNLGETFVTHSKGLYRKCVKSREETGLLMPLKAPKEIIFLEGETLPTEVLTEEVVLKTGDLQPLEQPTSEAVEAPLVGTPVCINGLMLLEIKDTEKYCALAPNMMVTNNTFTLKGGAPTKVTFGDDTVIEVQGYKSVNITFELDERIDKVLNEKCSAYTVELGTEVNEFACVVADAVIKTLQPVSELLTPLGIDLDEWSMATYYLFDESGEFKLASHMYCSFYPPDEDEEEGDCEEEEFEPSTQYEYGTEDDYQGKPLEFGATSAALQPEEEQEEDWLDDDSQQTVGQQDGSEDNQTTTIQTIVEVQPQLEMELTPVVQTIEVNSFSGYLKLTDNVYIKNADIVEEAKKVKPTVVVNAANVYLKHGGGVAGALNKATNNAMQVESDDYIATNGPLKVGGSCVLSGHNLAKHCLHVVGPNVNKGEDIQLLKSAYENFNQHEVLLAPLLSAGIFGADPIHSLRVCVDTVRTNVYLAVFDKNLYDKLVSSFLEMKSEKQVEQKIAEIPKEEVKPFITESKPSVEQRKQDDKKIKACVEEVTTTLEETKFLTENLLLYIDINGNLHPDSATLVSDIDITFLKKDAPYIVGDVVQEGVLTAVVIPTKKAGGTTEMLAKALRKVPTDNYITTYPGQGLNGYTVEEAKTVLKKCKSAFYILPSIISNEKQEILGTVSWNLREMLAHAEETRKLMPVCVETKAIVSTIQRKYKGIKIQEGVVDYGARFYFYTSKTTVASLINTLNDLNETLVTMPLGYVTHGLNLEEAARYMRSLKVPATVSVSSPDAVTAYNGYLTSSSKTPEEHFIETISLAGSYKDWSYSGQSTQLGIEFLKRGDKSVYYTSNPTTFHLDGEVITFDNLKTLLSLREVRTIKVFTTVDNINLHTQVVDMSMTYGQQFGPTYLDGADVTKIKPHNSHEGKTFYVLPNDDTLRVEAFEYYHTTDPSFLGRYMSALNHTKKWKYPQVNGLTSIKWADNNCYLATALLTLQQIELKFNPPALQDAYYRARAGEAANFCALILAYCNKTVGELGDVRETMSYLFQHANLDSCKRVLNVVCKTCGQQQTTLKGVEAVMYMGTLSYEQFKKGVQIPCTCGKQATKYLVQQESPFVMMSAPPAQYELKHGTFTCASEYTGNYQCGHYKHITSKETLYCIDGALLTKSSEYKGPITDVFYKENSYTTTIKPVTYKLDGVVCTEIDPKLDNYYKKDNSYFTEQPIDLVPNQPYPNASFDNFKFVCDNIKFADDLNQLTGYKKPASRELKVTFFPDLNGDVVAIDYKHYTPSFKKGAKLLHKPIVWHVNNA
TNKATYKPNTWCIRCLWSTKPVETSNSFDVLKSEDAQGMDNLACEDLKPVSEEVVENPTIQKDVLECNVKTTEVVGDIILKPANNSLKITEEVGHTDLMAAYVDNSSLTIKKPNELSRVLGLKTLATHGLAAVNSVPWDTIANYAKPFLNKVVSTTTNIVTRCLNRVCTNYMPYFFTLLLQLCTFTRSTNSRIKASMPTTIAKNTVKSVGKFCLEASFNYLKSPNFSKLINIIIWFLLLSVCLGSLIYSTAALGVLMSNLGMPSYCTGYREGYLNSTNVTIATYCTGSIPCSVCLSGLDSLDTYPSLETIQITISSFKWDLTAFGLVAEWFLAYILFTRFFYVLGLAAIMQLFFSYFAVHFISNSWLMWLIINLVQMAPISAMVRMYIFFASFYYVWKSYVHVVDGCNSSTCMMCYKRNRATRVECTTIVNGVRRSFYVYANGGKGFCKLHNWNCVNCDTFCAGSTFISDEVARDLSLQFKRPINPTDQSSYIVDSVTVKNGSIHLYFDKAGQKTYERHSLSHFVNLDNLRANNTKGSLPINVIVFDGKSKCEESSAKSASVYYSQLMCQPILLLDQALVSDVGDSAEVAVKMFDAYVNTFSSTFNVPMEKLKTLVATAEAELAKNVSLDNVLSTFISAARQGFVDSDVETKDVVECLKLSHQSDIEVTGDSCNNYMLTYNKVENMTPRDLGACIDCSARHINAQVAKSHNIALIWNVKDFMSLSEQLRKQIRSAAKKNNLPFKLTCATTRQVVNVVTTKIALKGGKIVNNWLKQLIKVTLVFLFVAAIFYLITPVHVMSKHTDFSSEIIGYKAIDGGVTRDIASTDTCFANKHADFDTWFSQRGGSYTNDKACPLIAAVITREVGFVVPGLPGTILRTTNGDFLHFLPRVFSAVGNICYTPSKLIEYTDFATSACVLAAECTIFKDASGKPVPYCYDTNVLEGSVAYESLRPDTRYVLMDGSIIQFPNTYLEGSVRVVTTFDSEYCRHGTCERSEAGVCVSTSGRWVLNNDYYRSLPGVFCGVDAVNLLTNMFTPLIQPIGALDISASIVAGGIVAIVVTCLAYYFMRFRRAFGEYSHVVAFNTLLFLMSFTVLCLTPVYSFLPGVYSVIYLYLTFYLTNDVSFLAHIQWMVMFTPLVPFWITIAYIICISTKHFYWFFSNYLKRRVVFNGVSFSTFEEAALCTFLLNKEMYLKLRSDVLLPLTQYNRYLALYNKYKYFSGAMDTTSYREAACCHLAKALNDFSNSGSDVLYQPPQTSITSAVLQSGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRVIGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQSAVKRTIKGTHHWLLLTILTSLLVLVQSTQWSLFFFLYENAFLPFAMGIIAMSAFAMMFVKHKHAFLCLFLLPSLATVAYFNMVYMPASWVMRIMTWLDMVDTSLSGFKLKDCVMYASAVVLLILMTARTVYDDGARRVWTLMNVLTLVYKVYYGNALDQAISMWALIISVTSNYSGVVTTVMFLARGIVFMCVEYCPIFFITGNTLQCIMLVYCFLGYFCTCYFGLFCLLNRYFRLTLGVYDYLVSTQEFRYMNSQGLLPPKNSIDAFKLNIKLLGVGGKPCIKVATVQSKMSDVKCTSVVLLSVLQQLRVESSSKLWAQCVQLHNDILLAKDTTEAFEKMVSLLSVLLSMQGAVDINKLCEEMLDNRATLQAIASEFSSLPSYAAFATAQEAYEQAVANGDSEVVLKKLKKSLNVAKSEFDRDAAM
QRKLEKMADQAMTQMYKQARSEDKRAKVTSAMQTMLFTMLRKLDNDALNNIINNARDGCVPLNIIPLTTAAKLMVVIPDYNTYKNTCDGTTFTYASALWEIQQVVDADSKIVQLSEISMDNSPNLAWPLIVTALRANSAVKLQNNELSPVALRQMSCAAGTTQTACTDDNALAYYNTTKGGRFVLALLSDLQDLKWARFPKSDGTGTIYTELEPPCRFVTDTPKGPKVKYLYFIKGLNNLNRGMVLGSLAATVRLQAGNATEVPANSTVLSFCAFAVDAAKAYKDYLASGGQPITNCVKMLCTHTGTGQAITVTPEANMDQESFGGASCCLYCRCHIDHPNPKGFCDLKGKYVQIPTTCANDPVGFTLKNTVCTVCGMWKGYGCSCDQLREPMLQSADAQSFLNGFAV"
import torch
import esm
# File downloaded with `wget https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt`
model, alphabet = esm.pretrained.load_model_and_alphabet_local("../main/esm1b_t33_650M_UR50S.pt")
batch_converter = alphabet.get_batch_converter()
# Prepare data (first 2 sequences from ESMStructuralSplitDataset superfamily / 4)
data = [
("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
("protein3", c[:1025]), # <== Setting this to 1022 makes the code pass
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)  # CPU error occurs here
model = model.to(torch.device("cuda"))
with torch.no_grad():
    results = model(batch_tokens.to(torch.device("cuda")), repr_layers=[33], return_contacts=True)  # GPU error occurs here
# This was initially working, but now doesn't anymore:
data = [
    ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
with torch.no_grad():
    results = model(batch_tokens.to(torch.device("cuda")), repr_layers=[33], return_contacts=True)  # GPU error occurs here
Expected behavior
Either a way to handle long sequences, or an error message that explains the length limit, ideally with a note in the readme. Also, the error really shouldn't poison the GPU to the point that I need to restart the process before I can do any proper computation again, but I'm not sure whether you can do anything about that or whether it's an issue with torch and/or CUDA.
Logs
CPU:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/mnt/project/seqvec-search/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/project/seqvec-search/.venv/lib/python3.8/site-packages/esm/model.py", line 131, in forward
x = x + self.embed_positions(tokens)
File "/mnt/project/seqvec-search/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/project/seqvec-search/.venv/lib/python3.8/site-packages/esm/modules.py", line 225, in forward
return F.embedding(
File "/mnt/project/seqvec-search/.venv/lib/python3.8/site-packages/torch/nn/functional.py", line 1852, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
GPU:
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [252,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [252,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [252,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [252,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[...]
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [252,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [252,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [252,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/mnt/project/seqvec-search/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/project/seqvec-search/.venv/lib/python3.8/site-packages/esm/model.py", line 149, in forward
if not padding_mask.any():
RuntimeError: CUDA error: device-side assert triggered
Trying sequences shorter than 1024 afterwards:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
RuntimeError: CUDA error: device-side assert triggered
Additional context
Ubuntu 18.04, Python 3.8, torch 1.7.1, CUDA 10.2, driver version 455.23.05, pip install -U git+https://github.com/facebookresearch/esm
with 537ad6a
I am running into an error when trying to reload a fine-tuned version of the models. After further training, models were saved using the code below:
# Load model
model, alphabet = torch.hub.load("facebookresearch/esm", "esm1_t12_85M_UR50S")
# Training code
# Save model
torch.save(model.state_dict(), BEST_MODEL)
Upon trying to reload the model using the below
# Load model
model, alphabet = torch.hub.load("facebookresearch/esm", "esm1_t12_85M_UR50S")
model.load_state_dict(torch.load(BEST_MODEL))
I run into the error
RuntimeError Traceback (most recent call last)
<ipython-input-4-b6ed0ebd023b> in <module>
----> 1 model.load_state_dict(torch.load("esm1_t12_85M_UR50S-Best.pt"))
~/anaconda3/envs/c10_stability/lib/python3.8/site-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
1049
1050 if len(error_msgs) > 0:
-> 1051 raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
1052 self.__class__.__name__, "\n\t".join(error_msgs)))
1053 return _IncompatibleKeys(missing_keys, unexpected_keys)
This is a fairly standard pytorch error that you get when trying to load the wrong state_dict into a model. Upon further inspection, it looks like all keys in saved state_dict have a "model." prefix on their names while the keys in the model itself do not (I attached a file of the full error which shows this mismatch in key names).
What do you recommend for bypassing this problem? Is there a specific GitHub repo tag that I should be trying to work off of? It looks like there might have been some change to the torch.hub models between my original training script and trying to load now. Would it be reasonable to just remove the "model." prefix in the saved state_dict?
Thank you!
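Removing the prefix seems like a reasonable workaround, assuming the prefix is the only mismatch between the saved and expected key names; a sketch (the helper name is mine):

```python
def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with `prefix` removed from key names."""
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }
```

The load call would then be model.load_state_dict(strip_prefix(torch.load(BEST_MODEL))).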
Hi guys,
Congrats on the excellent work and great results. It's great to see that sequence embeddings indeed work.
May I ask, in your MSA Transformer supervised contact prediction network, did you use the outer concatenation of the query sequence embedding (or together with symmetrized row self-attention maps) as the only input, demonstrating the superior information content of sequence embeddings in place of traditional MSA-derived features, or did you still include all the RaptorX features as input to the ResNet, as stated in Rives et al. 2020? If the latter, did you conduct an ablation study like that in Rives et al. 2020, to see how much the sequence embedding contributes to the improved contact precision?
Thanks in advance.
First of all, thank you for your fantastic work! I tried your embeddings on some basic protein family tasks and the results are amazing!
However, I have some issues when I try to generate contact map predictions. 1. From your code, the batch_converter function pads sequences to the maximum length, which in my case is 600; however, the generated contact maps have dimension [batch_size, 599 or 600, 599 or 600]. I am wondering why this happens. 2. Another issue concerns the maximum sequence length the model can process. I am working with protein families whose sequences can be over 30,000 residues long, and when I try to use the model on such a protein, the positional encoding seems to fail. What is the maximum protein length this model can handle, so I can set a reasonable cutoff?
Hi
Thanks again for your great repositories;
I have a question: how did you calculate the contact map in your pretrained model? Did you use any supervised datasets, or just the embedded representations of amino acids and pairwise scoring?
Thanks
Nasser
Hey,
Thank you for doing research that is needed for so many biotech problems.
Is there any plan to add support for extracting per-residue embeddings on GPU (multi-GPU)?
...
# Extract per-residue embeddings (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[34])
...
I have another question: how can I apply ESM embeddings to get a per-protein vector? Is it enough to apply mean(dim=0)?
Thanks,
Piotr
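The readme's Quick Start does exactly that: average the per-residue embeddings over the sequence dimension (excluding the special-token positions) to get one fixed-size vector per protein. A sketch with dummy numbers:

```python
import numpy as np

# Dummy per-residue embeddings: (seq_len, embed_dim). With the real model this
# would be token_representations[i, 1 : len(seq) + 1] for sequence i.
per_residue = np.ones((7, 1280))
per_protein = per_residue.mean(axis=0)  # one fixed-size vector per protein
```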
Hi there,
Thanks for making this work public.
I was trying to load the MSA-1 model from PyTorch Hub, but it seems it is not on the list yet.
Could you please add the MSA-1 model to hubconf.py?
Thanks a lot again.
Hello, I was trying to use
python extract.py esm1b_t33_650M_UR50S ../test/test.fasta my_reprs/ --repr_layers 0 32 33 --include mean per_tok
And it returns:
Traceback (most recent call last):
File "extract.py", line 134, in <module>
main(args)
File "extract.py", line 66, in main
dataset = FastaBatchedDataset.from_file(args.fasta_file)
File "/home/jsun/msa_transformer/esm/esm/data.py", line 52, in from_file
assert len(set(sequence_labels)) == len(sequence_labels)
AssertionError
I guess there are problems with my FASTA file (about 350 sequences, roughly 450 amino acids each). Are there any restrictions on the FASTA file? Can anyone please help me with this problem?
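The assertion in data.py checks that the FASTA headers are unique, so duplicated sequence labels are a likely cause; a quick way to find them (a sketch, not part of the repo):

```python
from collections import Counter

def duplicated_labels(labels):
    """Return labels that occur more than once, in first-seen order."""
    counts = Counter(labels)
    return [label for label, n in counts.items() if n > 1]
```

Running this over the FASTA headers would show which ones need renaming before extract.py can proceed.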
When running the ESM model in CPU mode, I found that the GPU was occupied (about 1.7 GB):
# Load ESM-1b model
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
# Prepare data (first 2 sequences from ESMStructuralSplitDataset superfamily / 4)
data = [
("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
# Extract per-residue representations (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)
token_representations = results["representations"][33]
Could I split the sequence into 1024 length chunks, run each separately with the BOS and EOS tokens occurring in the first and last chunks, concatenate the resulting embeddings, then take the average?
It seems that since the model was trained on random crops of sequences longer than 1024, this should work, but I want to make sure.
Also, a warning that the sequence is too long might be helpful, since as of now, trying to embed a sequence longer than 1024 on GPU results in the unhelpful "device-side assert triggered" CUDA runtime error.
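The chunking idea described above can be sketched as follows (the 1022-residue window assumes ESM-1b's 1024-position limit minus the <cls> and <eos> tokens; whether averaging across chunks is statistically sound is exactly the open question here):

```python
def chunk_sequence(seq, max_len=1022):
    """Split a sequence into consecutive windows no longer than max_len."""
    return [seq[i : i + max_len] for i in range(0, len(seq), max_len)]
```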
In the Jupyter notebook example of [contact prediction using the MSA Transformer](https://github.com/facebookresearch/esm/blob/0097c27290c7c88c4bc0fc4df3692d2c23a6aa54/examples/contact_prediction.ipynb) there is no mention of the threshold used to define a contact.
I'm assuming I'm doing something wrong in the following, but I can't really see what it could be so hopefully you guys can help point out what it is!
I'm hoping to use your model in my research, but first I wanted to validate how it works. So rather than looking at the embedding layer, I'm looking at the actual model output, which I assume can be passed through a softmax function to obtain token probabilities, from which I can get the predicted amino acids by taking the argmax.
However, when I do that, I find that the returned probabilities make no sense. What I'm doing is given in the code below. What am I doing wrong?
import torch
import esm
import numpy as np
# Load 34 layer model
model, alphabet = esm.pretrained.esm1_t34_670M_UR50S()
batch_converter = alphabet.get_batch_converter()
# Prepare data (two protein sequences)
data = [("protein1", "MYLYQKIKN"), ("protein2", "MNAKYD")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
aa = alphabet.all_toks
# model.model_version
# Extract per-residue embeddings (on CPU)
with torch.no_grad():
    results = model(batch_tokens)
logits = results['logits']
prob = torch.softmax(logits,dim=2)
pred = torch.argmax(prob,dim=2)
pred_str = aa[pred[0,0,]]
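The generic softmax-then-argmax recipe the question describes, on dummy logits (shapes only; nothing here is ESM-specific):

```python
import numpy as np

# Dummy logits: (batch=1, positions=2, vocab=5)
logits = np.array([[[0.1, 2.0, 0.3, 0.0, -1.0],
                    [1.5, 0.2, 0.1, 0.0, 0.4]]])
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax over vocab
pred = probs.argmax(axis=-1)  # most probable token id per position
```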
Great work! I find no model pre-training and downstream task fine-tuning scripts in the repository. Could you provide them?
Thanks.
Bug description
I am trying to embed 10k+ protein sequences (about 250 aa residues each, all stored in a single FASTA file foo). This errors out. Running the same command on head -n1000 foo > bar works as expected. Is there a limit here?
Reproduction steps
Try to run the model on 10k sequences.
Expected behavior
Embed them.
Logs
python esm/extract.py esm1b_t33_650M_UR50S foo my_reprs/ --repr_layers 33 --include mean
Traceback (most recent call last):
File "esm/extract.py", line 144, in <module>
main(args)
File "esm/extract.py", line 71, in main
dataset = FastaBatchedDataset.from_file(args.fasta_file)
File ".../tmp/fair-esm/esm/esm/data.py", line 52, in from_file
assert len(set(sequence_labels)) == len(sequence_labels)
AssertionError
In your quick start example you have an example for cpu:
# Extract per-residue embeddings (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[34])
token_embeddings = results["representations"][34]
If I wanted to extract per-residue embeddings on a GPU, how would I change the above lines?
Thanks
Hi,
Thanks for making the model public! I would like to fine-tune the model for a downstream task; however, in your Colab you only run inference with the pre-trained model, without any fine-tuning, it seems. Could you advise me on the best way to go about fine-tuning this model, or provide an example script?
Thanks so much.
Hi,
For people working in the field of protein science, it'd be useful to find the sequences/structures in UniParc that are similar to a given sequence in the ESM embedding space, as a great alternative to existing protein sequence-based search tools and methods.
Is the easiest way to do that 1) download uniparc 2) get esm-1b embeddings 3) build a kNN index via Faiss or pynndescent, or are you planning to release a script and/or a kNN/faiss index to facilitate that somehow?
I can imagine that you have been already using a Faiss instance to do that internally, but the question is whether you'd like to release it or not 😄
Cheers.
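For context, the three steps above reduce to nearest-neighbour search over embedding vectors. A brute-force NumPy sketch of the query step (the function name is hypothetical; at UniParc scale you would replace this with an approximate index such as Faiss or PyNNDescent):

```python
import numpy as np

def knn_query(index_embeddings, query, k=3):
    """Brute-force k-nearest neighbours by cosine similarity.

    index_embeddings: (n, d) array of per-sequence embeddings
    (e.g. mean-pooled esm-1b representations).
    query: (d,) array. Returns indices of the k most similar rows.
    """
    norms = np.linalg.norm(index_embeddings, axis=1) * np.linalg.norm(query)
    sims = index_embeddings @ query / norms
    return np.argsort(-sims)[:k]
```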
Hi there;
I hope you are well;
I have a question: given a sequence as input to esm1b, how can I extract the 660 attention maps, one for each head in each layer (33 layers × 20 heads)?
Thanks so much
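An assumption worth checking against the repo: fair-esm's forward pass can return per-head attention when called with need_head_weights=True, with results["attentions"] shaped (batch, layers, heads, seq, seq). The bookkeeping from that tensor to 660 individual maps can be sketched with a random stand-in:

```python
import torch

# Stand-in for results["attentions"][0] from an esm1b forward pass with
# per-head weights; esm1b has 33 layers x 20 heads = 660 maps per sequence.
num_layers, num_heads, seq_len = 33, 20, 7
attentions = torch.rand(1, num_layers, num_heads, seq_len, seq_len)

# One L x L attention map per (layer, head) pair.
maps = attentions[0].reshape(num_layers * num_heads, seq_len, seq_len)
```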
See title.
That would be useful for the conda-forge package
Does the ESM model deal with special symbols for proteins?
Does it deal with input sequences with gaps, for example sequence = ---AB----C?
Does it deal with ambiguous residues like B, Z, J, X?
Thank you!
Would it be possible to add a git tag to the repo to version the library, so we can make a package of it on conda-forge or similar?
Hello,
thank you for the amazing work!
I just wanted to let you know that the latest modification of the README has broken pip install: setup.py still lists README.rst among data_files, but the repo now has README.md instead.
Best regards,
Raman
Hi again.
Thank you for adding the MSA-1 model.
I am experimenting to see whether the MSA model gives a better representation of proteins for some downstream tasks.
Do you have any suggestions on how to prepare a proper MSA file (a3m) for a specific protein that fits the model? For example, which tools you used and the procedure.
Thanks.
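Whatever tool produces the a3m (e.g. an HHblits search), the rows must end up equal-length before being fed to the MSA model. By the a3m convention, lowercase letters mark insertions relative to the query, and stripping them restores the alignment; a sketch of that preprocessing step (following the pattern used in the fair-esm examples, which is worth verifying there):

```python
import string

# a3m convention: lowercase letters are insertions relative to the query;
# removing them (plus '.' and '*') leaves aligned rows of equal length.
deletekeys = dict.fromkeys(string.ascii_lowercase)
deletekeys["."] = None
deletekeys["*"] = None
translation = str.maketrans(deletekeys)

def remove_insertions(sequence: str) -> str:
    """Strip insertion columns from one a3m row."""
    return sequence.translate(translation)
```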
Hi,
Congratulations on your great work and for releasing the pretrained models.
In your work you compared your results against SeqVec, our earlier work, and I was wondering if you have plans to compare against our new work, ProtTrans?
https://github.com/agemagician/ProtTrans
From a quick comparison, it seems the ProtBert-BFD model performs better than Transformer-34 on SS8 with only 63% of Transformer-34's capacity.
I believe more analysis is needed here, especially to see how RoBERTa-style models compare to other transformers, including XLNet, ALBERT, BERT, Electra, Transformer-XL, etc.
I took a look at the SSP labels that come with the Structural Split dataset, and those include (T, E, X, B, H, G, S, I, -). The paper says these labels were pulled from Joosten et al. 2010, where the labels correspond to (B, E, G, H, I, T, S). What does the X character represent?
Hi, I read the MSA Transformer paper, which achieves exciting results.
However, I don't know how to reproduce the results shown in the paper.
Could somebody help me with this? Thank you ~
Hey folks :)
Great work! As I mentioned on Twitter, it'd be nice to add your models to bio_embeddings. The purpose of the pipeline is to make it easy for less tech-savvy bio/informaticians to use protein LMs. Since you use torch, this should be quite straightforward, as we already have some transformer models.
Out of the box, you get the whole "read FASTA in, make run reproducible" project, viz & embedding annotation transfer (goPredSim) pipelines. Edit: oh, and the auto-batching of large sequence files between GPU/CPU (which is not at all intuitive for the average user), plus per-sequence vs. per-AA representations (looking through closed issues, #2 )
I noticed you have some variant prediction code, maybe it makes sense to include that as a pipeline step if it is sensible?
I'll link this to our issue for integration so that we can cross-follow the status: sacdallago/bio_embeddings#62