sokrypton / colabfold


Making Protein folding accessible to all!

License: MIT License

Jupyter Notebook 91.96% Python 7.68% Shell 0.33% Dockerfile 0.02%

colabfold's People

Contributors

abhishaike, chasooyoung, dalekreitler-bnl, diegozea, dimamolod, dingquanyu, enzoandree, gieses, huhlim, jabard89, jkosinski, kaczmarj, kmdalton, konstin, lucidrains, magnushhoie, martin-steinegger, milot-mirdita, mjag1898, pankev-in, pasqm, rachelse, rcrehuet, sokrypton, speleo3, teodorpopescuqb, thachung, tsteternlieb, yoshitakamo, zozo123

colabfold's Issues

multiple conformations?

As has been pointed out, AlphaFold will generally only give one conformation of a protein or complex. Is this simply because it tries to maximize contacts of coevolving residue pairs? In cases where we have prior structural knowledge, it might be helpful to have the option to suppress a predicted contact, for instance if we would like to visualize a ligand-activated complex and we know that a contact is present only in the unactivated/ligand-free state. I may see if simply replacing residues of this sort with U gives the desired results. Of course it may be easier to simply use templates in this case. Are you considering any better ways to increase the number of predicted conformations?

Amber-relax fails

Amber-relax fails on some structures. For example:

/usr/local/lib/python3.7/dist-packages/jax/interpreters/xla.py in _execute_compiled_primitive(prim, compiled, result_handler, *args)
388 device, = compiled.local_devices()
389 input_bufs = list(it.chain.from_iterable(device_put(x, device) for x in args if x is not token))
--> 390 out_bufs = compiled.execute(input_bufs)
391 check_special(prim.name, out_bufs)
392 return result_handler(*out_bufs)

RuntimeError: Internal: Failed to load in-memory CUBIN: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Inferencing many proteins

Thanks for the nice scripts.
Do you have any idea how to implement the "inferencing-many-proteins" function mentioned here:
https://github.com/deepmind/alphafold#inferencing-many-proteins

In my test case of folding a 489-AA protein with an MSA of 225 sequences, the compilation takes about the same time as the prediction step.
It would be great to have an AlphaFold2_manyMSA_noTemplates_noMD script for making predictions on a large number of pre-computed MSAs.
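
A minimal numpy sketch of one way to amortize the compilation cost (an assumption about the approach, not the notebook's actual code; pad_msa and the gap token id 21 are hypothetical): JAX/XLA recompiles whenever input shapes change, so padding every pre-computed MSA to one fixed (num_seqs, crop_len) shape lets a single compiled model be reused across many proteins.

import numpy as np

def pad_msa(msa: np.ndarray, num_seqs: int, crop_len: int, gap_id: int = 21) -> np.ndarray:
    """Pad an integer-encoded MSA of shape [n, L] with gap tokens to a fixed shape."""
    out = np.full((num_seqs, crop_len), gap_id, dtype=msa.dtype)
    n, L = msa.shape
    out[:min(n, num_seqs), :min(L, crop_len)] = msa[:num_seqs, :crop_len]
    return out

# Two MSAs of different sizes mapped onto the same fixed shape, so the jit-compiled
# model sees identical input shapes and only compiles once.
msa_a = np.random.randint(0, 21, size=(225, 489))
msa_b = np.random.randint(0, 21, size=(512, 302))
padded = [pad_msa(m, num_seqs=512, crop_len=512) for m in (msa_a, msa_b)]
print([p.shape for p in padded])   # [(512, 512), (512, 512)]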

AlphaFold2_advanced use templates

Hi!

I would like to test ColabFold on multi-chain protein structures. I found that only AlphaFold2_advanced can handle this, but the advanced notebook doesn't support templates. Is there an easy way to add template support to the advanced notebook?

Bug report - 'bool' object is not subscriptable

I am trying the new pairing feature for the MSAs and get the following error:

found 0 pairs
47155 Sequences Found in Total
merging/filtering MSA using mmseqs2
7082 Sequences Found in Total (after filtering)
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:239: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:240: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-ef6f7c7fd665> in <module>()
    239     gap_ = msa_ != "-"
    240     qid_ = msa_ == np.array(list(sequence))
--> 241     gapid = np.stack([gap_[:,Ln[i]:Ln[i+1]].max(-1) for i in range(len(seqs))],-1)
    242     seqid = np.stack([qid_[:,Ln[i]:Ln[i+1]].mean(-1) for i in range(len(seqs))],-1).sum(-1) / gapid.sum(-1)
    243     non_gaps = gap_.astype(np.float)

<ipython-input-3-ef6f7c7fd665> in <listcomp>(.0)
    239     gap_ = msa_ != "-"
    240     qid_ = msa_ == np.array(list(sequence))
--> 241     gapid = np.stack([gap_[:,Ln[i]:Ln[i+1]].max(-1) for i in range(len(seqs))],-1)
    242     seqid = np.stack([qid_[:,Ln[i]:Ln[i+1]].mean(-1) for i in range(len(seqs))],-1).sum(-1) / gapid.sum(-1)
    243     non_gaps = gap_.astype(np.float)

TypeError: 'bool' object is not subscriptable

Big thanks for the amazing work by the way :)

Residue with no atoms, causes error with amber (MMseqs2)

While running UniProt ID P02744 (sequence pasted below) in the AlphaFold2 w/ MMseqs2 notebook with templates and amber selected, I get an error from amber saying that at least one residue in the protein has no atoms and it can't relax it.

Not sure why this is occurring. I don't HAVE to use the amber minimization (although it would be nice), but I'm worried about what is causing this empty residue in the first place.

Error message:

ValueError Traceback (most recent call last)
in ()
49 Ls=[len(query_sequence)]*homooligomer,
50 model_params=model_params, use_model=use_model,
---> 51 do_relax=use_amber)

3 frames
/content/alphafold/relax/amber_minimize.py in _check_residues_are_well_defined(prot)
139 """Checks that all residues contain non-empty atom sets."""
140 if (prot.atom_mask.sum(axis=-1) == 0).any():
--> 141 raise ValueError("Amber minimization can only be performed on proteins with"
142 " well-defined residues. This protein contains at least"
143 " one residue with no atoms.")

ValueError: Amber minimization can only be performed on proteins with well-defined residues. This protein contains at least one residue with no atoms.

P02744 sequence:

LEEGEITSKVKFPPSSSPSFPRLVMVGTLPDLQEITLCYWFKVNQLKGTLGHSRVLBMFSYATAKKDNELLTFLDEQGDFLFNV
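
A quick check (plain Python, not part of the notebook) for characters outside the 20 standard amino acids: the P02744 sequence above contains a "B" (Asx), which the pipeline likely treats as an unknown residue; a resolved issue further below traced the identical Amber error to an "X" in the sequence.

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
seq = ("LEEGEITSKVKFPPSSSPSFPRLVMVGTLPDLQEITLCYWFK"
       "VNQLKGTLGHSRVLBMFSYATAKKDNELLTFLDEQGDFLFNV")
bad = [(i + 1, aa) for i, aa in enumerate(seq) if aa not in STANDARD_AA]
print(bad)   # [(57, 'B')]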

About GPU in jupyter

Hi, I wonder how to use the GPU during inference in Jupyter. It seems that only the CPU is being used, and I don't know how to set it up.
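
A quick diagnostic (assuming the Jupyter kernel uses the same jax installation as the notebook): if this prints only CPU devices, jaxlib was installed without CUDA support or the GPU is not visible to the process.

import jax
print(jax.devices())           # should list a GPU device when one is usable
print(jax.default_backend())   # 'gpu' or 'cpu'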

AlphaFold2 protein embedding Notebook as example

Good morning,

I would like to propose that you share another extremely useful example of AF2 usage. Many scientists use protein embeddings for downstream tasks (e.g. function prediction). An AF2 issue described the parts of the codebase that give access to the protein embedding vector, but many users are not able to handle it by themselves.

I hope you will consider my idea: demonstrate how to load and prepare a minimal AF2 setup that executes the embedding part of the workflow on Colab or a local machine. The most useful example would take an AA sequence as input and return a fixed-length numerical vector (an averaged per-residue vector) as output.
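
A minimal sketch of the requested output format (assuming a per-residue representation has already been extracted from the model, e.g. the "single" representation of shape [L, 384]; the variable names here are hypothetical): mean-pooling over residues gives the fixed-length vector described above.

import numpy as np

per_residue_embedding = np.random.rand(127, 384)         # stand-in for a real [L, 384] array
protein_embedding = per_residue_embedding.mean(axis=0)   # fixed-length [384] vector
print(protein_embedding.shape)                           # (384,)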

Warm regards,
Piotr

AlphaFold2_complexes for 3 or more proteins?

For complexes of 2 proteins, the current notebook works really nicely!
Is there a way to compute 3 protein complexes? I know this would be more complicated for ABC than just AB.
Great thanks!

MMseqs returns no hits.

The following sequence returns no hits when submitted to either af_mmseqs2 or af_advanced:

NVEPLNGQSEVTGMLDKDITLQWQITFLKGEMLQSHDIYLPNRTKIVSNQPPELTPVGKRMYGTRLVPVFDADAAVFKLTLKNVKFTDSSHNFTLVVAFERKDDFNRRTGVADINIVNVE

However, if I truncate it by one aa, I get 70 hits. This is the case with both af_mmseqs2 and af_advanced:
NVEPLNGQSEVTGMLDKDITLQWQITFLKGEMLQSHDIYLPNRTKIVSNQPPELTPVGKRMYGTRLVPVFDADAAVFKLTLKNVKFTDSSHNFTLVVAFERKDDFNRRTGVADINIVNV

I am unable to tell if this is the same bug as reported in issue #49

Notebook loading error

Hi -

Trying to load AlphaFold2_complexes.ipynb, I get the message

There was an error loading this notebook. Ensure that the file is accessible and try again.
Check dependency list! Synchronous require cannot resolve module 'vs/platform/quickinput/common/quickInput'. This is the first mention of this module!
https://github.com/sokrypton/ColabFold/blob/main/beta/AlphaFold2_advanced.ipynb

Memory error at 14Gb with 25Gb memory

Hello,

I am having a memory error while trying to analyze my protein (sequence and error message attached below). I am using Colab Pro with 25 GB memory. The prediction works if I cut my protein in half, but I would like to analyze the full-length protein if possible. Is there a way to get this to work?

Thanks!
Danny

Protein sequence:

EESAAPQVHLSILATTDIHANMMDYDYYSDKETADFGLARTAQLIQKHREQNPNTLLVDNGDLIQGNPLGEYAVKYQKDDIISGTKTHPIISVMNALKYDAGTLGNHEFNYGLDFLDGTIKGADFPIVNANVKTTSGENRYTPYVINEKTLIDENGNEQKVKVGYIGFVPPQIMTWDKKNLEGQVQVQDIVESANETIPKMKAEGADVIIALAHTGIEKQAQSSGAENAVFDLATKTKGIDAIISGHQHGLFPSAEYAGVAQFNVEKGTINGIPVVMPSSWGKYLGVIDLKLEKADGSWKVADSKGSIESIAGNVTSRNETVTNTIQQTHQNTLEYVRKPVGKTEADINSFFAQVKDDPSIQIVTDAQKWYAEKEMKDTEYKNLPILSAGAPFKAGGRNGANYYTNIPAGDLAIKNVGDLYLYDNTVQIVKLTGSEVKDWLEMSAGQFNQIDPAKGGDQALLNENFRSYNFDVIDGVTYQVDVTKPAKYNENGKVINADSSRIINLSYEGKPISPSQEFLVVTNNYRASGGGGFPHLTSDKIVHGSAVENRQVLMDYIIEQKTVNPKADNNWSIAPVSGTNLTFESSLLAKPFADKADDVAYVGKSANEGYGVYKLQFDDDSNPDPPKDGLWDLTVMHTNDTHAHLDDAARRMTKINEVRSETNHNILLDAGDVFSGDLYFTKWNGLADLKMMNMMGYDAMTFGNHEFDKGPTVLSDFLSGNSATVDPANRYHFEAPEFPIVSANVDVSNEPKLKSFVKKPQTFTAGEKKEAGIHPYILLDVDGEKVAVFGLTTEDTATTSSPGKSIVFNDAFETAQNTVKAIQEEEKVNKIIALTHIGHNRDLELAKKVKGIDLIIGGHTHTLVDKMEVVNNEEPTIVAQAKEYGQFLGRVDVAFDEKGVVQTDKSNLSVLPIDEHTEENPEAKQELDQFKNELEDVKNEKVGYTDVALDGQREHVRTKETNLGNFIADGMLAKAKEAAGARIAITNGGGIRAGIDKGDITLGEVLNVMPFGNTLYVADLTGKQIKEALEQGLSNVENGGGAFPQVAGIEYTFTLNNKPGHRVLEVKIESPNGDKVAINTDDTYRVATNNFVGAGGDGYSVFTEASHGEDLGYVDYEIFTEQLKKLGNKVSPKVEGRIKEVFLPTKQKDGSWTLDEDKFAIYAKNANTPFVYYGIHEGSQEKPINLKVKKDQVKLLKERESDPSLTMFNYWYSMKMPMANLKTADTAIGIKSTGELDVSLSDVYDFTVKQKGKEIKSFKEPVQLSLRMFDIEEAHNPAIYHVDRKKKAFTKTGHGSVDDDMVTGYTNHFSEYTILNSGSNNKPPAFPSDQPTGGDDGNHGGGSDKPGGKQPTDGNGGNDTPPGTQPTNGSGGNGSGGSGTDGPAGGLLPDT

Error message:

running model_1

UnfilteredStackTrace Traceback (most recent call last)
in ()
50 model_params=model_params, use_model=use_model,
---> 51 do_relax=use_amber)

13 frames
UnfilteredStackTrace: RuntimeError: Resource exhausted: Out of memory while trying to allocate 14268435552 bytes.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.


The above exception was the direct cause of the following exception:

RuntimeError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/jax/interpreters/xla.py in _execute_compiled(compiled, avals, handlers, kept_var_idx, *args)
910 for i, x in enumerate(args)
911 if x is not token and i in kept_var_idx))
--> 912 out_bufs = compiled.execute(input_bufs)
913 check_special(xla_call_p.name, out_bufs)
914 return [handler(*bs) for handler, bs in zip(handlers, _partition_outputs(avals, out_bufs))]

RuntimeError: Resource exhausted: Out of memory while trying to allocate 14268435552 bytes.

Download cell error, num_relax not defined

This line was throwing an error in the Download cell - when I commented it out the cell ran successfully:
text_file.write(f"num_relax={num_relax}\n")

NameError: name 'num_relax' is not defined
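
A hedged workaround (not an official fix): define a fallback before the download cell runs, so the f-string can be written even when no earlier cell set the variable.

if "num_relax" not in dir():
    num_relax = 0   # assumed default: no Amber relaxation was requested
# text_file.write(f"num_relax={num_relax}\n")   # the original line then works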

Databases used in the mmseq2 search, local version

Hello,
I would like to run locally the msa building step of the colab notebook and use the exact same set of databases to do some comparison with other databases.
Is it possible to get access to the set of databases the mmseq2 server is using as well as the version of mmseqs2 and the specific command lines executed on the server?
In the slides you presented (awesome presentation!), you mentioned you are using a 30% id clustered DB built from SMAG, MGNIFY, BFD, and MetaEuk. Do you provide a downloadable version of the master 30% seq-id DB somewhere?

Thanks a lot!

Session crashed after using all available RAM (AlphaFold Colab) for homooligomer

Hi, I'm trying to get the oligomeric structure of a protein. I'm able to get the monomer through AlphaFold Colab, but when I try to use the oligomeric feature it crashes.
Error: # Your session crashed after using all available RAM. If you are interested in access to high-RAM runtimes, you may want to check out Colab Pro.

NameError Traceback (most recent call last)
in ()
5 use_model = {}
6 if "model_params" not in dir(): model_params = {}
----> 7 for model_name in ["model_1","model_2","model_3","model_4","model_5"][:num_models]:
8 use_model[model_name] = True
9 if model_name not in model_params:
NameError: name 'num_models' is not defined

When I try to use a local runtime, I get another error: a ModuleNotFoundError keeps appearing.
Please help, if you have any solution for this.
Thanks
Pankaj

Specify GPU(s) to use with local runtime?

Hi,

I have some longer sequences I would like to try so I have switched to using a local runtime. Is there an easy way to restrict which GPUs are selected for processing? Currently it is trying to allocate memory on a GPU that is already maxed out by an unrelated process.

Thanks
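
One common approach (standard CUDA/JAX behavior, not specific to ColabFold) is to restrict which GPUs the process can see before jax is imported:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"               # e.g. only expose the second GPU
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"   # optional: don't preallocate most of its memory

Setting CUDA_VISIBLE_DEVICES in the shell before launching the local Jupyter runtime works as well.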

Format of custom MSA for AAA:BBB complex?

When inputting a complex in AAA:BBB format, and custom MSA, I keep getting "ERROR: the length of msa does not match input sequence". My MSA has hyphens in it because some homologs have insertions; I have tried including the appropriate hyphens in the input sequence, but it looks like they are being ignored for the length calculation. Is it possible to use an MSA where my target is shorter than some of the homologs?
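
One way to sidestep the length mismatch (a plain-Python sketch with a hypothetical helper, not the notebook's parser) is to drop every alignment column in which the query has a gap, so homolog insertions no longer change the alignment length:

def drop_query_gap_columns(seqs):
    """seqs: list of equal-length aligned strings, query first."""
    keep = [i for i, c in enumerate(seqs[0]) if c != "-"]
    return ["".join(s[i] for i in keep) for s in seqs]

aln = ["MK--VLT", "MKAAVLS", "MR--ILT"]   # toy alignment; the query has two insertion columns
print(drop_query_gap_columns(aln))        # ['MKVLT', 'MKVLS', 'MRILT']

(In a3m format such insertions would normally be kept as lowercase letters; removing them entirely is a simpler approximation.)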

Amber works in AlphaFold2_mmseqs2 but not in AlphaFold2_batch

When I try to run the predefined example sequence (PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK) with templates, one model and amber it works in AlphaFold2_mmseqs2, but fails in AlphaFold2_batch with the following error:

ValueError                                Traceback (most recent call last)

<ipython-input-3-a76dac23e0b1> in <module>()
    391                            Ls=[len(query_sequence)], crop_len=crop_len,
    392                            model_params=model_params, use_model=use_model,
--> 393                            do_relax=use_amber)
    394 
    395   # gather MSA info

<ipython-input-3-a76dac23e0b1> in predict_structure(prefix, feature_dict, Ls, crop_len, model_params, use_model, do_relax, random_seed)
    276                                               stiffness=10.0,exclude_residues=[],
    277                                               max_outer_iterations=20)      
--> 278         relaxed_pdb_str, _, _ = amber_relaxer.process(prot=unrelaxed_protein)
    279         relaxed_pdb_lines.append(relaxed_pdb_str)
    280 

/content/alphafold/relax/relax.py in process(self, prot)
     62         tolerance=self._tolerance, stiffness=self._stiffness,
     63         exclude_residues=self._exclude_residues,
---> 64         max_outer_iterations=self._max_outer_iterations)
     65     min_pos = out['pos']
     66     start_pos = out['posinit']

/content/alphafold/relax/amber_minimize.py in run_pipeline(prot, stiffness, max_outer_iterations, place_hydrogens_every_iteration, max_iterations, tolerance, restraint_set, max_attempts, checks, exclude_residues)
    459   # `protein.to_pdb` will strip any poorly-defined residues so we need to
    460   # perform this check before `clean_protein`.
--> 461   _check_residues_are_well_defined(prot)
    462   pdb_string = clean_protein(prot, checks=checks)
    463 

/content/alphafold/relax/amber_minimize.py in _check_residues_are_well_defined(prot)
    139   """Checks that all residues contain non-empty atom sets."""
    140   if (prot.atom_mask.sum(axis=-1) == 0).any():
--> 141     raise ValueError("Amber minimization can only be performed on proteins with"
    142                      " well-defined residues. This protein contains at least"
    143                      " one residue with no atoms.")

ValueError: Amber minimization can only be performed on proteins with well-defined residues. This protein contains at least one residue with no atoms.

Out of memory

When I submitted a long sequence, I got an 'out of memory' error.

Singularity installations?

Hi,

I was wondering if the project could be integrated into a singularity installation of AlphaFold2. If yes, how would one go about achieving it?

Best,
Pranav

How to predict the structure of cyclic peptides?

I have a cyclic peptide sequence. I put it into Alphafold2 Colab, but I didn't get a cyclic peptide structure. What should I do to connect the C-terminal and N-terminal for the next dynamic simulation (GROMACS)?

Should I do some processing on the structure obtained from AlphaFold2 Colab before running GROMACS dynamics? Does the input sequence to AlphaFold2 Colab need some pre-processing? Or can AlphaFold2 simply not predict structures for cyclic peptides?

It's really important for my research, thanks for any help.

Cannot reshape a tensor with 2705220 elements to shape [6441,127,1]

Hello, I am using this ipynb file on Colab:
https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb

When I ran the "Gather input features, predict structure" step, I got an error.

=============================================
Running model_1
InvalidArgumentError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in _create_c_op(graph, node_def, inputs, control_inputs, op_def)
1879 try:
-> 1880 c_op = pywrap_tf_session.TF_FinishOperation(op_desc)
1881 except errors.InvalidArgumentError as e:

InvalidArgumentError: Cannot reshape a tensor with 2705220 elements to shape [6441,127,1] (818007 elements) for '{{node reshape_msa}} = Reshape[T=DT_INT32, Tshape=DT_INT32](Const_6, reshape_msa/shape)' with input shapes: [6441,420], [3] and with input tensors computed as partial shapes: input[1] = [6441,127,1].

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
12 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in _create_c_op(graph, node_def, inputs, control_inputs, op_def)
1881 except errors.InvalidArgumentError as e:
1882 # Convert to ValueError for backwards compatibility.
-> 1883 raise ValueError(str(e))
1884
1885 return c_op
ValueError: Cannot reshape a tensor with 2705220 elements to shape [6441,127,1] (818007 elements) for '{{node reshape_msa}} = Reshape[T=DT_INT32, Tshape=DT_INT32](Const_6, reshape_msa/shape)' with input shapes: [6441,420], [3] and with input tensors computed as partial shapes: input[1] = [6441,127,1].

============================================================

My protein sequence is:

SMNPPPPETSNPNKPKRQTNQLQYLLRVVLKTLWKHQFAWPFQQPVDAVKLNLPDYYKIIKTPMDMGTIKKRLENNYYWNAQECIQDFNTMFTNCYIYNKPGDDIVLMAEALEKLFLQKINELPTEE

Amber_relax_fails_2

Summary: The Amber relaxation step fails because the number of atoms in one or more residues is zero (?).

The error message:

ValueError: Amber minimization can only be performed on proteins with well-defined residues. This protein contains at least one residue with no atoms.

[Resolved] There was an X in the sequence.
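
A small pre-processing sketch (an assumption, not an official fix): detect or replace the offending "X" before submitting, since Amber cannot build atoms for an unknown residue. The query_sequence value here is a hypothetical example.

query_sequence = "MKTAYIAKQRXQISFVKSHFSRQLEERLGLIEVQ"   # hypothetical input containing an X
cleaned = query_sequence.replace("X", "G")              # or drop it: .replace("X", "")
print(cleaned)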

tricking alphafold2 docker to use the same reference sequences as mmseq2

Hi, I've been running the MMseqs2 ColabFold for a specific type of protein sequence, and I always get the same set of ~4000 sequences in the .a3m file. Would it be possible to put these sequences from the .a3m file into one of the folders of a local AlphaFold2 installation, and trick AlphaFold2 into always using them when running?

AlphaFold2_advanced: New/recent IndexError upon running 'run Alphafold'

I have been playing with the new AlphaFold2_advanced rollout and have gotten through the whole pipeline for several heterologous protein-protein interactions. Suddenly this afternoon I am receiving this error, but I don't know where in the code the problem could be coming from or why it has suddenly changed. I have been keeping all of my parameters the same and only changing the second amino-acid sequence in the sequence input.

(screenshot of the error attached in the original issue)

Prediction for FAD binding?

Many thanks for setting this up! I've found it really useful in my research. I'm looking into some FAD-dependent oxidases, but it seems that cofactors like FAD are not included in the predicted model. Is there a way to do this? Or do I have to dock the cofactor after generating the apo protein model?

Thanks!

cublas status execution failed error

Edit: this doesn't appear if use_turbo is unchecked.

In the run alphafold section of the advanced notebook, this error appears while the first model is being run:


UnfilteredStackTrace Traceback (most recent call last)

in ()
206 # predict
--> 207 prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed),"cpu")
208

11 frames

UnfilteredStackTrace: RuntimeError: Internal: CUBLAS_STATUS_EXECUTION_FAILED

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.


The above exception was the direct cause of the following exception:

RuntimeError Traceback (most recent call last)

/usr/local/lib/python3.7/dist-packages/jax/interpreters/xla.py in _execute_compiled(compiled, avals, handlers, kept_var_idx, *args)
910 for i, x in enumerate(args)
911 if x is not token and i in kept_var_idx))
--> 912 out_bufs = compiled.execute(input_bufs)
913 check_special(xla_call_p.name, out_bufs)
914 return [handler(*bs) for handler, bs in zip(handlers, _partition_outputs(avals, out_bufs))]

RuntimeError: Internal: CUBLAS_STATUS_EXECUTION_FAILED

Expected shape error in AlphaFold step, advanced notebook

Got this error twice today on the AlphaFold step with different input sequences - can you help?
Advanced notebook, input was a protein plus peptide (formatted as AAAAAAA:BBBBBB), genetic search succeeded.
Thanks for any help!

Running model_1_ptm_seed_0: 0%
0/5 [elapsed: 10:06 remaining: ?]

IndexError Traceback (most recent call last)
in ()
188
189 prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed),"cpu")
--> 190 outs[key] = parse_results(prediction_result, processed_feature_dict)
191
192 # report

3 frames
/usr/local/lib/python3.7/dist-packages/jax/_src/numpy/lax_numpy.py in _expand_bool_indices(idx, shape)
5400 expected_shape = shape[len(out): len(out) + _ndim(i)]
5401 if i_shape != expected_shape:
-> 5402 raise IndexError("boolean index did not match shape of indexed array in index "
5403 f"{dim_number}: got {i_shape}, expected {expected_shape}")
5404 out.extend(np.where(i))

IndexError: boolean index did not match shape of indexed array in index 2: got (63,), expected (64,)

Error Amber-relax

Hello,

I am trying to refine my structure predictions using Amber relax (in the AlphaFold2_advanced notebook). However, I am getting the following error, both for my own structure and for the test sequence/structure:

UnfilteredStackTrace                      Traceback (most recent call last)

<ipython-input-16-404fe963ee1d> in <module>()
     63               max_outer_iterations=20)
---> 64           relaxed_pdb_lines, _, _ = amber_relaxer.process(prot=outs[key]["unrelaxed_protein"])
     65           with open(pred_output_path, 'w') as f:

33 frames

UnfilteredStackTrace: TypeError: take requires ndarray or scalar arguments, got <class 'list'> at position 0.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------


The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)

/usr/local/lib/python3.7/dist-packages/jax/_src/numpy/lax_numpy.py in _check_arraylike(fun_name, *args)
    557                     if not _arraylike(arg))
    558     msg = "{} requires ndarray or scalar arguments, got {} at position {}."
--> 559     raise TypeError(msg.format(fun_name, type(arg), pos))
    560 
    561 def _check_no_float0s(fun_name, *args):

TypeError: take requires ndarray or scalar arguments, got <class 'list'> at position 0.

Could you maybe help me figure out what's going on? Thank you very much!

fasta database

If I choose mmseqs2 (uniref + environmental), it should include the SMAG and MetaEuk databases. Could you tell me the exact links to the two databases? Thank you!

alphafold2.ipynb throws an error with custom a3m alignments

The alphafold2_mmseqs2 notebook throws an error if I give it a custom alignment in a3m format where the sequences are broken over several lines. It works if I manually edit the alignment so each sequence is on one line.

(This doesn't happen on the alphafold2_advanced notebook, as far as I can tell)

I can reproduce this behavior with the following alignments:

Works: alignment_single-line-seqs.a3m.txt

Throws error: alignment_multi-line-seqs.a3m.txt

Error text:

running model_1
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in _create_c_op(graph, node_def, inputs, control_inputs, op_def)
   1879   try:
-> 1880     c_op = pywrap_tf_session.TF_FinishOperation(op_desc)
   1881   except errors.InvalidArgumentError as e:

InvalidArgumentError: Cannot reshape a tensor with 1710 elements to shape [15,14,1] (210 elements) for '{{node reshape_msa}} = Reshape[T=DT_INT32, Tshape=DT_INT32](Const_6, reshape_msa/shape)' with input shapes: [15,114], [3] and with input tensors computed as partial shapes: input[1] = [15,14,1].

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
12 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in _create_c_op(graph, node_def, inputs, control_inputs, op_def)
   1881   except errors.InvalidArgumentError as e:
   1882     # Convert to ValueError for backwards compatibility.
-> 1883     raise ValueError(str(e))
   1884 
   1885   return c_op

ValueError: Cannot reshape a tensor with 1710 elements to shape [15,14,1] (210 elements) for '{{node reshape_msa}} = Reshape[T=DT_INT32, Tshape=DT_INT32](Const_6, reshape_msa/shape)' with input shapes: [15,114], [3] and with input tensors computed as partial shapes: input[1] = [15,14,1].
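
A small pre-processing sketch (a hypothetical helper, not part of the notebook): rewrite the file so each sequence occupies a single line, which is the layout the custom-MSA parser appears to expect.

def single_line_fasta(path_in, path_out):
    with open(path_in) as f, open(path_out, "w") as out:
        seq = []
        for line in f:
            line = line.rstrip("\n")
            if line.startswith((">", "#")):   # header or a3m comment line
                if seq:
                    out.write("".join(seq) + "\n")
                    seq = []
                out.write(line + "\n")
            else:
                seq.append(line)
        if seq:
            out.write("".join(seq) + "\n")

single_line_fasta("alignment_multi-line-seqs.a3m", "alignment_single-line-seqs.a3m")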

Error at input features and predict structure stage

running model_1

UnfilteredStackTrace Traceback (most recent call last)
in ()
50 model_params=model_params, use_model=use_model,
---> 51 do_relax=use_amber)

35 frames
UnfilteredStackTrace: TypeError: take requires ndarray or scalar arguments, got <class 'list'> at position 0.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.


The above exception was the direct cause of the following exception:

TypeError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/jax/_src/numpy/lax_numpy.py in _check_arraylike(fun_name, *args)
557 if not _arraylike(arg))
558 msg = "{} requires ndarray or scalar arguments, got {} at position {}."
--> 559 raise TypeError(msg.format(fun_name, type(arg), pos))
560
561 def _check_no_float0s(fun_name, *args):

TypeError: take requires ndarray or scalar arguments, got <class 'list'> at position 0.

Color bar legend for py3Dmol lDDT scores

Great tool and interface, many thanks to everyone involved!

One small suggestion to help folks interpret and judge model quality quickly from browser:
Would it be possible to add a color bar to the py3Dmol structure preview for the lDDT colors? It would go a long way towards judging model quality, especially at regions of particular interest, without requiring users to have PyMOL expertise or to try to match residue numbers between the lDDT plot and the structure.

Thanks again!
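
In the meantime, a standalone color bar can be drawn with matplotlib (a sketch assuming a rainbow-style scale over roughly 50-90 predicted lDDT; the exact colormap and range used by the notebook may differ):

import matplotlib as mpl
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 0.4))
norm = mpl.colors.Normalize(vmin=50, vmax=90)
fig.colorbar(mpl.cm.ScalarMappable(norm=norm, cmap="rainbow"),
             cax=ax, orientation="horizontal", label="predicted lDDT")
plt.show()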

Out of Memory

Hello!

I was getting an out-of-memory issue. I get the error in step 5 of "Gather input features, predict structure". I was originally running a sequence of ~480 amino acids as a homooligomer of 3. Thinking it was a sequence-length issue, I then truncated the sequence to ~360 amino acids as a homooligomer of 1. However, I am still getting the issue. I have tried "Factory reset runtime" to see if that would help, but I still get the same error.

Hanging at running mmseqs2

Starting today, after the progress bar fills completely for this step, we're hanging here (output after interrupting the kernel)

KeyboardInterrupt
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
/tmp/ipykernel_57/1853154669.py in <module>
     53     prefix = os.path.join('tmp',prefix)
     54     print(f"running mmseqs2")
---> 55     A3M_LINES = cf.run_mmseqs2(seqs, prefix, filter=True)
     56 
     57     # filter sequences to 10K\n",

/app/alphafold/colabfold.py in run_mmseqs2(x, prefix, use_env, filter)
    111         while out["status"] in ["UNKNOWN","RUNNING","PENDING"]:
    112           t = 5 + random.randint(0,5)
--> 113           time.sleep(t)
    114           out = status(ID)
    115           pbar.set_description(out["status"])

Is there a way to find the max GPU memory watermark? How to run locally with minimal setup

According to the README.md, the memory limits are as follows:

Maximum length limits depends on free GPU provided by Google-Colab fingers-crossed

For GPU: Tesla T4 or Tesla P100 with ~16G the max length is ~1400
For GPU: Tesla K80 with ~12G the max length is ~1000
To check what GPU you got, open a new code cell and type !nvidia-smi

I am interested in structures of around either (a) one single chain of 240-280aa or around (b) 2 different chains of ~120 + ~140aa. What would be the minimal GPU that would allow us to run this locally?

I am thinking that, given our own custom MSAs, it wouldn't need to connect to MMseqs2 or download the 2 TB of sequence data, and could go straight to running the prediction based on the MSA of internal data inside the Docker container. Is that right?

Or am I missing something obvious that would still require Colab or something else remote?
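
One rough way to find the watermark (assuming nvidia-smi is available; this is not part of ColabFold) is to poll GPU memory in a background thread and keep the maximum seen during a run:

import subprocess, threading, time

peak_mib = 0
watching = True

def watch(interval=1.0):
    global peak_mib
    while watching:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            capture_output=True, text=True).stdout
        peak_mib = max([peak_mib] + [int(x) for x in out.split()])
        time.sleep(interval)

t = threading.Thread(target=watch, daemon=True)
t.start()
# ... run the prediction here ...
watching = False
t.join()
print(f"peak GPU memory: {peak_mib} MiB")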

Failed jobs and name error

Hi,

I submitted hExoI with variant residues found in one of the samples.
It gets terminated with no error message at the "Gather input features, predict structure" step.
I also get name errors even after factory reset and re-run.

NameError                                 Traceback (most recent call last)
<ipython-input-2-1d8cadd9b758> in <module>()
      1 #@title Gather input features, predict structure
      2 # parse TEMPLATES
----> 3 if use_templates: template_features = mk_template(jobname)
      4 else: template_features = mk_mock_template(query_sequence)
      5 

NameError: name 'use_templates' is not defined

Can you please let me know how to fix it?

MSA query stuck at "PENDING 0%" for a particular sequence.

Not sure if I'm doing something incorrectly.
A particular sequence seems to cause the MMseqs2 query step to hang, whether in a batch run with the alphafold2_batch notebook or in a single run with the AlphaFold2_mmseqs2 notebook.
My current run has been stuck at 0% on the MSA (MMseqs2 (UniRef+environmental)) step for the past 1 h 43 min, and it has been stuck for longer in the past before I killed the process.
The raw sequence is: MAQVQLVESGGGLVQAGGSLRLSCAVSGRPFSEYNLGWFRQAPGKEREFVARIRSSGTTVYTDSVKGRFSASRDNFLATTLERIEKNFVITDPRLPDNPIIFASDSFLQLTEYSREEILGRNCRFLQGPETDRATVRKIRDAIDNQTEVTVQLINYTKSGKKFWNLFHLQPMRDQKGDVQYFIGVQLDGTEHVRDAAEREGVMLIKKTAENIDEAAKELAKNMGYLQLNSLEPEDTAVYYCAMSRVDTDSPAFYDYWGQGTQVTVSTPRS

Other variations on this sequence have worked flawlessly with these settings, so I'm not sure what is wrong with this sequence.
Running the notebook off a colab instance, not locally.

LaM-8_NA73-O.txt
file extension changed from fasta to txt so that github would allow the upload

MMSeqs version

Hello,
I would like to build a pipeline to search MSAs using a local MMseqs2. However, there are some problems when I follow the script 'msa.sh' returned from the online MMseqs2 service. For example, when I run the command

"${MMSEQS}" expandaln "${BASE}/qdb" "${DBBASE}/${DB1}.idx" "${BASE}/res" "${DBBASE}/${DB1}.idx" "${BASE}/res_exp" --db-load-mode 2 ${EXPAND_PARAM}

Then the following error is returned:

Input database "/nfs/database/uniref30_mmseqs/uniref30.idx" has the wrong type (Generic)
Allowed input:
- Alignment
- Prefilter
- Bi-directional prefilter
- Clustering

I just want to figure out whether the MMseqs2 used by the service is consistent with the latest one from GitHub.
If not, how can we get the version used by the service?

AlphaFold2 fails with RuntimeError: Internal: CUBLAS_STATUS_EXECUTION_FAILED

I am trying to use AlphaFold2_advanced.ipynb Colab with default parameters to predict the structure of the following sequence:

MQFSTVASVAFVALANFVAAESAAAISQITDGQIQATTTATTEATTTAAPSSTVETVSPSSTETISQQTENGAAKAAVGMGAGALAAAAMLL

However, as the model_1_ptm_seed_0 runs, AlphaFold2 fails with error:

UnfilteredStackTrace                      Traceback (most recent call last)
<ipython-input-4-0d880bbb1ecf> in <module>()
    188 
--> 189       prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed),"cpu")
    190       outs[key] = parse_results(prediction_result, processed_feature_dict)

11 frames
UnfilteredStackTrace: RuntimeError: Internal: CUBLAS_STATUS_EXECUTION_FAILED

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/jax/interpreters/xla.py in _execute_compiled(compiled, avals, handlers, kept_var_idx, *args)
    910           for i, x in enumerate(args)
    911           if x is not token and i in kept_var_idx))
--> 912   out_bufs = compiled.execute(input_bufs)
    913   check_special(xla_call_p.name, out_bufs)
    914   return [handler(*bs) for handler, bs in zip(handlers, _partition_outputs(avals, out_bufs))]

RuntimeError: Internal: CUBLAS_STATUS_EXECUTION_FAILED

Any insight is highly appreciated!
Best regards,

Error while creating the prediction directory in the AlphaFold2_advanced.ipynb

Looks like there is a difference in the mmseq2/complexes notebooks compared to the advanced notebook. The mmseq2/complex notebooks define the following function for creating the hash for a given job:

import hashlib
def add_hash(x,y):
    return x+"_"+hashlib.sha1(y.encode()).hexdigest()[:5]

However, in the advanced notebook, this function is absent and there is instead a cf.get_hash call where the cf object doesn't seem to be defined. Here is the relevant traceback:

---------------------------------------------------------------------------

NameError                                 Traceback (most recent call last)

<ipython-input-1-f214cb658cff> in <module>()
     12 
     13 # prediction directory
---> 14 output_dir = 'prediction_' + cf.get_hash(full_sequence)[:5]
     15 os.makedirs(output_dir, exist_ok=True)
     16 print(f"working directory: {output_dir}")

NameError: name 'cf' is not defined
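
A hedged workaround based on the add_hash helper quoted above: compute the short hash directly with hashlib when cf.get_hash is unavailable (full_sequence here is just an example value).

import hashlib, os

def get_hash(x):
    return hashlib.sha1(x.encode()).hexdigest()

full_sequence = "PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK"
output_dir = "prediction_" + get_hash(full_sequence)[:5]
os.makedirs(output_dir, exist_ok=True)
print(f"working directory: {output_dir}")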

Thanks

MMSeq2 doesn't return an msa

Hi guys,
First of all, thank you very much for a terrific job. Incredibly useful in facilitating AlphaFold.

I encountered a problem with the MSA construction. When I send a request to the API, I get "MMseqs2 server did not return a valid result" and an empty MSA file.
I thought I might have abused the API, but even after 24 h, I cannot submit a single request.

Thanks in advance,

prediction_result, processed_feature_dict suffered a error

Hi,
I always get a crash when the program starts the prediction (calling parse_results(prediction_result, processed_feature_dict)) after the search.
Is there any suggestion or solution for this? Many thanks!
See the error below:
..../python3.7/site-packages/jax/_src/numpy/lax_numpy.py in _expand_bool_indices(idx, shape)
5400 expected_shape = shape[len(out): len(out) + _ndim(i)]
5401 if i_shape != expected_shape:
-> 5402 raise IndexError("boolean index did not match shape of indexed array in index "
5403 f"{dim_number}: got {i_shape}, expected {expected_shape}")
5404 out.extend(np.where(i))

IndexError: boolean index did not match shape of indexed array in index 2: got (63,), expected (64,)

AlphaFold2_complexes out of memory

Hi there,

I was trying to model a bacterial protein complex based on the suggestions at the top of the notebook, i.e. with the pair_msa and disable_mmseqs2_filter options on, when it ran out of memory at the "Gather input features, predict structure" step.
The two input proteins were 901 and 351 amino acids.

I thought it could be a good idea to report the error:

pairs found: 417
running model_1_ptm

---------------------------------------------------------------------------

UnfilteredStackTrace                      Traceback (most recent call last)

<ipython-input-7-0be664d1f433> in <module>()
     57 }
---> 58 plddts, paes = predict_structure(jobname, feature_dict, Ls=Ls)

13 frames

UnfilteredStackTrace: RuntimeError: Resource exhausted: Out of memory while trying to allocate 12341783136 bytes.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------


The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)

/usr/local/lib/python3.7/dist-packages/jax/interpreters/xla.py in _execute_compiled(compiled, avals, handlers, kept_var_idx, *args)
    891           for i, x in enumerate(args)
    892           if x is not token and i in kept_var_idx))
--> 893   out_bufs = compiled.execute(input_bufs)
    894   check_special(xla_call_p.name, out_bufs)
    895   return [handler(*bs) for handler, bs in zip(handlers, _partition_outputs(avals, out_bufs))]

RuntimeError: Resource exhausted: Out of memory while trying to allocate 12341783136 bytes.
