steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.

License: GNU General Public License v3.0

CMake 1.40% Shell 1.55% C++ 31.37% C 59.14% Dockerfile 0.02% Perl 0.01% Makefile 0.44% Batchfile 0.03% Python 0.33% Meson 0.13% Lua 0.01% Starlark 0.05% HTML 0.41% Roff 0.11% R 0.01% Rust 2.02% Jupyter Notebook 1.63% JavaScript 1.33%

protein-structure alignments bioinformatics clustering

foldseek's Issues

Implement --alt-ali in foldseek

See https://github.com/soedinglab/MMseqs2/blob/fcf52600801a73e95fd74068e1bb1afb437d719d/src/alignment/Alignment.cpp#L399 in MMseqs2

Generate alignments/give rotation&translation matrix

Hi,

It is a very nice software and extremely fast, thank you! One question: would it be possible that the server provides the superimposed structures and/or the translation-rotation matrix for the query vs hits? That would be super useful!

Thank you for your great work!

Dataset problem

Context

I found some problems with the data at the link you give for benchmark data and Foldseek databases: http://wwwuser.gwdg.de/~compbiol/foldseek/

pdb.tar.gz and scp40pdb.tar.gz appear to be identical.
scop.fasta contains only 11209 sequences, while there are 11211 structures (PDB files), so the counts do not match.

inconsistency between 3Di sequence length and Aminoacid sequence length for some PDB ids

Hello! I was playing around with Foldseek and mapped the encoded structure sequences from the PDB database downloaded from you (pdb_ss.fasta) to their respective amino acid FASTA sequences from PDB. I found inconsistent lengths for 25,000 IDs, for example:

pdb_id_chain  ss_seq_len  aa_seq_len
11as_B        327         328
148l_E        163         162
155c_A        134         121
1a16_A        439         441
1a7a_A        416         431

There could be several reasons, for example a missing CA atom in the structure or an amino acid or side chain that is not properly defined.

I was wondering whether you could tell me where you retrieved the amino acid sequences from and whether you applied any filtering or cleaning.

Thank you very much !
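
The comparison described above can be reproduced with a short script. This is a minimal sketch, assuming both files are plain FASTA and that the first word of each header is the shared identifier; the helper names and the toy records are made up for illustration and are not real PDB entries.

```python
def read_fasta(text):
    """Parse FASTA text into {identifier: sequence}, using the first header word as id."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def length_mismatches(ss_records, aa_records):
    """Return (id, 3Di length, amino acid length) for shared ids whose lengths differ."""
    rows = []
    for name in sorted(ss_records.keys() & aa_records.keys()):
        ss_len, aa_len = len(ss_records[name]), len(aa_records[name])
        if ss_len != aa_len:
            rows.append((name, ss_len, aa_len))
    return rows


# Toy records, invented for this example only.
ss_fasta = ">toy_A\nDDDPV\n>toy_B\nVVLLD\n"
aa_fasta = ">toy_A\nMKTAYI\n>toy_B\nGSHMA\n"

for row in length_mismatches(read_fasta(ss_fasta), read_fasta(aa_fasta)):
    print(*row, sep="\t")
```

Only identifiers present in both files are compared, so entries missing from one side would need a separate check.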

Access to states

If I understand correctly, the VQ-VAE used by foldseek translates each amino acid into one of 20 "states". Do we have access to these, i.e. is it possible to get the "state sequence"? Like:

AVGAI -> states 1, 5, 7, 1, 13

Thanks!

Command line has too many arguments

With a large database containing many files the base build step dies due to the command line character limit

For example, when trying to build a database of complete AlphaFold output.

Might be nice to enable an input settings file containing .cif / .pdb structures.

Hits are not sorted by e_value, and multiple copies of some records exist.

Dear foldseek developers,

When I search 2ekj_A against the PDB database with the default settings (3Di/AA), the HTML output shows 2ekj_A as the first hit and nowhere else. But when I download the search results by selecting "Download All", 2ekj_A is reported multiple times in the table, with different e-values.
1st hit: "job.pdb_A 2ekj_A 100.000 105 0 0 1 105 1 105 3.285E-16 807 105 105"
569th hit: "job.pdb_A 2ekj_A 100.000 105 0 0 1 105 1 105 9.501E-16 785 105 105"
1121st hit: "job.pdb_A 2ekj_A 100.000 105 0 0 1 105 1 105 4.406E-16 802 105 105"

Why does it happen? I have renamed 2ekj.pdb to 2ekj.txt and attached it.

2ekj.txt
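
Until the cause is known, the downloaded table can be cleaned up after the fact. The sketch below is a workaround, not an explanation of the behavior: it keeps the lowest e-value row per query/target pair and sorts by e-value. It assumes whitespace-separated rows with the e-value in the 11th column, as in the rows quoted above; `best_hits` is a hypothetical helper name.

```python
rows = """\
job.pdb_A 2ekj_A 100.000 105 0 0 1 105 1 105 3.285E-16 807 105 105
job.pdb_A 2ekj_A 100.000 105 0 0 1 105 1 105 9.501E-16 785 105 105
job.pdb_A 2ekj_A 100.000 105 0 0 1 105 1 105 4.406E-16 802 105 105
"""

EVALUE_COLUMN = 10  # assumption: 0-based index of the e-value field


def best_hits(text):
    """Keep the lowest e-value row per (query, target) and sort rows by e-value."""
    best = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) <= EVALUE_COLUMN:
            continue  # skip blank or truncated lines
        key = (fields[0], fields[1])
        evalue = float(fields[EVALUE_COLUMN])
        if key not in best or evalue < best[key][0]:
            best[key] = (evalue, fields)
    ordered = sorted(best.values(), key=lambda item: item[0])
    return [fields for _, fields in ordered]


for fields in best_hits(rows):
    print("\t".join(fields))
```

Collapsing to one row per pair would also discard genuine alternative alignments to different regions of the same target; in the rows above the coordinates are identical, so nothing is lost.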


Add rmsd to server

Add RMSD to server output

link target ids to real name

Hi,
Is there a way to request NCBI or UniProt accessions and/or the functions of the target proteins using the Foldseek API?

With the Python command: result = get('https://search.foldseek.com/api/result/' + ticket['id'] + '/0').json()
I get the target accession only.
If not, is there a way to batch-request functions using AlphaFold database accessions?

Thank you for this great tool by the way
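
As a first step towards a batch lookup, the UniProt accession can be pulled out of AlphaFold database target names. This is a minimal sketch that assumes names follow the pattern seen in result tables such as AF-Q58952-F1-model_v1.pdb.gz; the pattern and the helper name are assumptions, and the lookup of protein function in an external service is left out.

```python
import re

# Assumed naming scheme: AF-<accession>-F<fragment>-model_v<version>
AFDB_NAME = re.compile(
    r"^AF-(?P<accession>[A-Z0-9]+)-F(?P<fragment>\d+)-model_v(?P<version>\d+)"
)


def accession_from_target(target):
    """Return the accession embedded in an AFDB-style target name, or None."""
    match = AFDB_NAME.match(target)
    return match.group("accession") if match else None


targets = ["AF-Q58952-F1-model_v1.pdb.gz", "AF-Q58484-F1-model_v1.pdb.gz", "2ekj_A"]
accessions = [accession_from_target(name) for name in targets]
print(accessions)
```

Names that do not match the pattern, such as PDB chain identifiers, yield None and can be handled separately.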

Error running example

Expected Behavior

Run example without error

foldseek createdb example/ targetDB
foldseek createdb example/ queryDB
foldseek search queryDB targetDB aln tmpFolder -a
foldseek aln2tmscore queryDB targetDB aln aln_tmscore
foldseek createtsv queryDB targetDB aln_tmscore aln_tmscore.tsv

Current Behavior

Input database is not compatible with aln2tmscore.

aln2tmscore queryDB targetDB aln aln_tmscore 

MMseqs Version: 188e5299e4d4e47b94614c4e8c67032f78f3ed21
Threads         96
Compressed      0
Verbosity       3

Input database "queryDB" has the wrong type (Aminoacid)
Allowed input:
- Unknown

Steps to Reproduce (for bugs)

Here is a bash script that downloads the foldseek binary and the git repo (to get the example data), then runs the provided example, which produces the error.

#Create working directory
WORK_DIR=./foldseek_linux_avx

if [ -d $WORK_DIR ]; then
     rm -rf $WORK_DIR
fi
mkdir $WORK_DIR
cd $WORK_DIR

#Download binary
wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz --no-check-certificate
tar xvzf foldseek-linux-avx2.tar.gz
foldseek_bin=./foldseek/bin/foldseek

#Get test data
git clone https://github.com/steineggerlab/foldseek.git ./foldseek_git_src
EXAMPLE_DATA=./foldseek_git_src/example/


#Run foldseek; Generate error
$foldseek_bin createdb $EXAMPLE_DATA targetDB
$foldseek_bin createdb $EXAMPLE_DATA queryDB
$foldseek_bin search queryDB targetDB aln tmpFolder -a
$foldseek_bin aln2tmscore queryDB targetDB aln aln_tmscore
$foldseek_bin createtsv queryDB targetDB aln_tmscore aln_tmscore.tsv

Context

I've also run the example with all 3 available compiled versions (avx, sse and conda) and got the same error: "Input database "queryDB" has the wrong type".
It would be nice to have a function that checks whether an input database is compatible with a given command.

Thank you

Foldseek-MPI crashes while running on more than one node

Hi,
I am trying to perform an all vs all search on the Alphafold/UniProt50 dataset.
Given the size of the job, my plan is to use Foldseek-MPI on a computer cluster.
The problem I am currently facing is that the job crashes whenever I try to use more than one node. It even crashes when testing it on a smaller dataset (~200k pdbs)
Since an all vs all search on the same smaller dataset works correctly on a non MPI build of foldseek (single node) the problem must be related to MPI.
Maybe the problem is in the way I am setting up the mpi runner?

Steps to Reproduce (for bugs)

(...)
#PBS -l nodes=2:ppn=24
(...)

foldseek search data_200k  data_200k alignments tmpFolder \
                -s 7.5 --max-seqs 20000 --mpi-runner "mpirun -map-by ppr:1:node:pe=24" \
                --split 2 --split-mode 1

Output

This link contains the stderr and stdout of the job.

The specific error is:

Error: Alignment step died
*** Error in `foldseek': double free or corruption (out): 0x0000000002e9a050 ***

Your Environment

Git commit used (The string after "MMseqs Version:" when you execute foldseek without any parameters):
a8eb681-MPI
Which foldseek version was used (Statically-compiled, self-compiled, Conda, etc.):
self-compiled - a8eb681-MPI
For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
openmpi/4.1.2, gnu/9.3.0, cmake/3.19.1
Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
AVX2/SSE support, 750GB of RAM per node
Operating system and version:
Centos 7.9, pbspro is the scheduler

Problem converting aln data to Tmscore

Expected Behavior

Running foldseek aln2tmscore converted_full/ converted_full/ aln tmscore should give me TM-scores.

Current Behavior

I get this error,
Input database "converted_full/" has the wrong type (Generic)
Allowed input:

Unknown

Steps to Reproduce (for bugs)

Run foldseek easy-search, then run the command above.

Clustering AFDB using foldseek e-values?

I am looking to reduce the Alphafold indices further. I was wondering if the cluster mode in foldseek can be used to cluster structures further down than X% sequence identities? For example, using the default AA+3Di mode and evalues provided by the pairwise alignments?

Thanks for an awesome tool 👍

Segmentation fault when running createdb on alphafold mmcif_files directory

Expected Behavior

The database is created successfully for all 180207 files in the directory.

Current Behavior

Database builds up to a certain point and then fails:

[=========================================================>       ] 88.05% 158.66K eta 20m 25s
Segmentation fault (core dumped)

The example commands provided on the repo run without error:

foldseek createdb example/ targetDB
foldseek easy-search example/d1asha_ targetDB aln.m8 tmpFolder

Output from example:

createdb example/ targetDB 

MMseqs Version:	797a5a3ab5e2b6ba7104ac5fb20cfe4f817c88d5
Threads  	32
Verbosity	3

Output file: targetDB
[=================================================================] 100.00% 26 0s 2ms      
Time for merging to targetDB_ss: 0h 0m 0s 0ms
Time for merging to targetDB_h: 0h 0m 0s 0ms
Time for merging to targetDB_ca: 0h 0m 0s 0ms
Time for merging to targetDB: 0h 0m 0s 0ms
Ignore 0 out of 26.
Too short: 0, incorrect  0.
Time for processing: 0h 0m 0s 22ms

foldseek easy-search example/d1asha_ targetDB aln.m8 tmpFolder
aln.m8 exists and will be overwritten
Create directory tmpFolder
easy-search example/d1asha_ targetDB aln.m8 tmpFolder 

MMseqs Version:           	797a5a3ab5e2b6ba7104ac5fb20cfe4f817c88d5
Seq. id. threshold        	0
Coverage threshold        	0
Coverage mode             	0
Max reject                	2147483647
Max accept                	2147483647
Add backtrace             	false
Include identical seq. id.	false
TMscore threshold         	0.5
Threads                   	32
Verbosity                 	3
Substitution matrix       	nucl:3di.out,aa:3di.out
Alignment mode            	3
Alignment mode            	0
Allow wrapped scoring     	false
E-value threshold         	0.001
Min alignment length      	0
Seq. id. mode             	0
Alternative alignments    	0
Max sequence length       	65535
Compositional bias        	1
Preload mode              	0
Pseudo count a            	1
Pseudo count b            	1.5
Score bias                	0
Realign hits              	false
Realign score bias        	-0.2
Realign max seqs          	2147483647
Gap open cost             	nucl:5,aa:11
Gap extension cost        	nucl:2,aa:1
Zdrop                     	40
Compressed                	0
Seed substitution matrix  	nucl:3di.out,aa:3di.out
Sensitivity               	5.7
k-mer length              	0
k-score                   	2147483647
Alphabet size             	nucl:5,aa:21
Max results per query     	300
Split database            	0
Split mode                	2
Split memory limit        	0
Diagonal scoring          	true
Exact k-mer matching      	0
Mask residues             	1
Mask lower case residues  	0
Minimum diagonal score    	15
Spaced k-mers             	1
Spaced k-mer pattern      	
Local temporary path      	
Alignment type            	0

createdb example/d1asha_ tmpFolder/2764274086546556879/query --threads 32 -v 3 

Output file: tmpFolder/2764274086546556879/query
[=================================================================] 100.00% 1 eta -
Time for merging to query_ss: 0h 0m 0s 0ms
Time for merging to query_h: 0h 0m 0s 0ms
Time for merging to query_ca: 0h 0m 0s 0ms
Time for merging to query: 0h 0m 0s 0ms
Ignore 0 out of 1.
Too short: 0, incorrect  0.
Time for processing: 0h 0m 0s 17ms
Create directory tmpFolder/2764274086546556879/search_tmp
search tmpFolder/2764274086546556879/query targetDB tmpFolder/2764274086546556879/result tmpFolder/2764274086546556879/search_tmp 

prefilter tmpFolder/2764274086546556879/query_ss targetDB_ss tmpFolder/2764274086546556879/search_tmp/8562932631619137899/pref --sub-mat nucl:3di.out,aa:3di.out --seed-sub-mat nucl:3di.out,aa:3di.out -s 7.5 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 0 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 32 --compressed 0 -v 3 

Query database size: 1 type: Aminoacid
Estimated memory consumption: 256M
Target database size: 26 type: Aminoacid
Index table k-mer threshold: 96 at k-mer size 6 
Index table: counting k-mers
[=================================================================] 100.00% 26 0s 1ms      
Index table: Masked residues: 0
Index table: fill
[=================================================================] 100.00% 26 0s 0ms      
Index statistics
Entries:          3491
DB size:          128 MB
Avg k-mer size:   0.000208
Top 10 k-mers
    FFGFFF	8
    FFLFFF	6
    DHQFFF	6
    GDQQQQ	5
    EEDDQE	4
    FFGFDF	4
    DFQFDF	4
    FFDFFF	4
    GEFHFF	4
    RQQREG	4
Time for index table init: 0h 0m 0s 73ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 96
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 1
Target db start 1 to 26
[=================================================================] 100.00% 1 eta -

149.986395 k-mers per position
611 DB matches per sequence
0 overflows
0 queries produce too many hits (truncated result)
24 sequences passed prefiltering per query sequence
24 median result list length
0 sequences with 0 size result lists
Time for merging to pref: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 127ms
align tmpFolder/2764274086546556879/query_ss targetDB_ss tmpFolder/2764274086546556879/search_tmp/8562932631619137899/pref tmpFolder/2764274086546556879/search_tmp/8562932631619137899/aln --sub-mat nucl:3di.out,aa:3di.out -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 32 --compressed 0 -v 3 

Compute score, coverage and sequence identity
Query database size: 1 type: Aminoacid
Target database size: 26 type: Aminoacid
Calculation of alignments
[=================================================================] 100.00% 1 eta -
Time for merging to aln: 0h 0m 0s 0ms
24 alignments calculated
24 sequence pairs passed the thresholds (1.000000 of overall calculated)
24.000000 hits per query sequence
Time for processing: 0h 0m 0s 79ms
mvdb tmpFolder/2764274086546556879/search_tmp/8562932631619137899/aln tmpFolder/2764274086546556879/result 

Time for processing: 0h 0m 0s 0ms
Removing temporary files
rmdb tmpFolder/2764274086546556879/search_tmp/8562932631619137899/pref 

Time for processing: 0h 0m 0s 0ms
aln.m8 exists and will be overwritten
convertalis tmpFolder/2764274086546556879/query targetDB tmpFolder/2764274086546556879/result aln.m8 --sub-mat nucl:3di.out,aa:3di.out --format-mode 0 --format-output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits --translation-table 1 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --db-output 0 --db-load-mode 0 --search-type 0 --threads 32 --compressed 0 -v 3 

[=================================================================] 100.00% 1 eta -
Time for merging to aln.m8: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 6ms
rmdb tmpFolder/2764274086546556879/result -v 3 

Time for processing: 0h 0m 0s 0ms
rmdb tmpFolder/2764274086546556879/query -v 3 

Time for processing: 0h 0m 0s 0ms
rmdb tmpFolder/2764274086546556879/query_h -v 3 

Time for processing: 0h 0m 0s 0ms
rmdb tmpFolder/2764274086546556879/query_ca -v 3 

Time for processing: 0h 0m 0s 0ms
rmdb tmpFolder/2764274086546556879/query_ss -v 3 

Time for processing: 0h 0m 0s 0ms

Steps to Reproduce (for bugs)

foldseek createdb --threads 20 mmcif_files/ pdb_db

Context

Command is run from within the pdb_mmcif directory of the alphafold database download

Your Environment

Git commit used (The string after "MMseqs Version:" when you execute foldseek without any parameters): 797a5a3
Which foldseek version was used (Statically-compiled, self-compiled, Conda, etc.): self-compiled version
For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
- cmake version 3.16.3
- GNU Make 4.2.1
- g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
Running cat /proc/cpuinfo | grep sse4_1 and cat /proc/cpuinfo | grep avx2 return 32 lines each. System has 16 cores/32 threads and 128 GB RAM.
Operating system and version: Ubuntu 20.04.2 LTS

Perhaps my system doesn't have the specs for a DB of this size? I tried running the SSE4.1 version and it had the same behavior.

Thank you for your time!

How to do local search and report TM-scores

Dear foldseek developers,

I want to search many structures against many structures using foldseek, and I want to do local structure searching. I need to know the TM-score and the coordinates of the part of the query that aligns with the target i.e. qstart, qend, tstart, and tend.
As I need to do local alignment, I didn't use "--alignment-type 1" which is for the global alignment.

When I use the following code, it doesn't show qstart, qend, tstart, and tend.

foldseek createdb example/ targetDB
foldseek createdb example/ queryDB
foldseek search queryDB targetDB aln tmpFolder -a
foldseek aln2tmscore queryDB targetDB aln aln_tmscore
foldseek createtsv queryDB targetDB aln_tmscore aln_tmscore.tsv

How can I do local alignment and also have the qstart, qend, tstart, and tend?
I was thinking of running the search with default mode (that reports hit based on e-value) and then merging the data from this step with the table I get from converting alignments with e-value to tm-score. However, if a query protein has multiple domains that all align with the same template protein, it would fail.

tmscore-threshold doesn't work

Hi,
first, thanks for this amazing tool :)
I tried the argument '--tmscore-threshold' with several cutoffs (0.25, 0.5, 0.6, 0.75, 0.9), and I seem to get the same results every time, including hits with a TM-score below the threshold.
For example, I ran this command:
foldseek easy-search quary_ranked_0.pdb my_db_archaea_db out_archea.txt /tmp/ --format-mode 4 --alignment-type 2 -c 0.7 --cov-mode 0 --format-output query,target,fident,alnlen,alntmscore,evalue,bits,mismatch,gapopen,qstart,qend,tstart,tend,qcov,tcov --tmscore-threshold 0.6

and I get this:
query target fident alnlen alntmscore evalue bits mismatch gapopen qstart qend tstart tend qcov tcov
ranked_0.pdb AF-Q58952-F1-model_v1.pdb.gz 0.083 274 4.084E-01 2.546E-06 280 157 18 33 258 16 243 0.873 0.916
ranked_0.pdb AF-Q58484-F1-model_v1.pdb.gz 0.115 233 2.404E-01 8.810E-06 260 135 13 3 195 6 207 0.745 0.716
...

The values in the 'alntmscore' column are smaller than 0.6: 0.408, 0.24...

Do you know why this is happening?
I am using the linux-sse41 version
Thanks!
itai
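
Independently of how the option behaves, the table can be post-filtered on the alntmscore column. The sketch below assumes the output starts with a header row (as in the table above) and that fields are whitespace-separated without embedded spaces; `filter_by_column` is a hypothetical helper name.

```python
table = """\
query target fident alnlen alntmscore evalue bits mismatch gapopen qstart qend tstart tend qcov tcov
ranked_0.pdb AF-Q58952-F1-model_v1.pdb.gz 0.083 274 4.084E-01 2.546E-06 280 157 18 33 258 16 243 0.873 0.916
ranked_0.pdb AF-Q58484-F1-model_v1.pdb.gz 0.115 233 2.404E-01 8.810E-06 260 135 13 3 195 6 207 0.745 0.716
"""


def filter_by_column(text, column, minimum):
    """Return rows (as dicts keyed by header name) whose numeric column is >= minimum."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    index = header.index(column)
    kept = []
    for line in lines[1:]:
        fields = line.split()
        if float(fields[index]) >= minimum:
            kept.append(dict(zip(header, fields)))
    return kept


print(len(filter_by_column(table, "alntmscore", 0.6)))  # neither row reaches 0.6
print([row["target"] for row in filter_by_column(table, "alntmscore", 0.4)])
```

With a cutoff of 0.6 neither of the two rows shown survives, which matches the observation that both fall below the requested threshold.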

foldseek afdb50_best database

Hello,

Do you provide the foldseek AF50_Best database (selecting the best pLDDT for each cluster, or taking into account the AF corrections for the 4% misfolded proteins) for the standalone version? I can see it is available on the foldseek server (thanks so much), and you mentioned on Twitter that it will be available to download (foldseek databases), so I was wondering whether it is somewhere I did not find (I looked at https://foldseek.steineggerlab.workers.dev/ but did not find it; afdb50 seems to be the previous version uploaded in August 2022) or whether you still plan to do it. That would be awesome!

Thanks a lot in advance!

Foldseek server TM-score

Hi,

I was running a PDB file with foldseek against the Swissprot AF models, using the TM-align option. I noticed something odd in the listed TM-scores.

(all results: https://search.foldseek.com/result/aZf9k2cw6BPxvcZSqkVczCDixXzPjNd-hhuY8A/0#result-0-0)

As I understood, the 5th column should be the TM-score, however it is not the same as the TM-score written in light blue above the structure. Is this a bug? If not, what is the difference between the 2 TM-scores?

Thank you for looking into this issue!

.m8 file - ?

I cannot find much information on how to view or process .m8 files. Please advise?
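
An .m8 file is a plain-text table with one hit per line. As a minimal sketch, the parser below assumes the 12 default columns listed in the convertalis call logged elsewhere on this page (query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits) and whitespace-separated fields; any further columns are kept untouched under a made-up "extra" key.

```python
M8_COLUMNS = ["query", "target", "fident", "alnlen", "mismatch", "gapopen",
              "qstart", "qend", "tstart", "tend", "evalue", "bits"]
INT_COLUMNS = {"alnlen", "mismatch", "gapopen", "qstart", "qend", "tstart", "tend"}
FLOAT_COLUMNS = {"fident", "evalue", "bits"}


def parse_m8_line(line):
    """Convert one .m8 line into a dict with typed values."""
    fields = line.split()
    hit = {}
    for name, value in zip(M8_COLUMNS, fields):
        if name in INT_COLUMNS:
            hit[name] = int(value)
        elif name in FLOAT_COLUMNS:
            hit[name] = float(value)
        else:
            hit[name] = value
    hit["extra"] = fields[len(M8_COLUMNS):]  # columns beyond the assumed 12
    return hit


# Row quoted in another issue on this page; it carries two extra columns.
line = "job.pdb_A 2ekj_A 100.000 105 0 0 1 105 1 105 3.285E-16 807 105 105"
hit = parse_m8_line(line)
print(hit["query"], hit["target"], hit["evalue"], hit["bits"])
```

If a custom --format-output list was used for the search, M8_COLUMNS has to be changed to match it.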

Extracting residue-by-residue distances between query and target

Hello,

I am wondering if you could advise on if I am able to extract a tabular output indicating, for each residue in the query, how far it is from the aligned residue in the target.

In the HTML output, this is represented by the blue arrows between the target and query structure (see below). I am looking for the length of those arrows.

Is there a way I can extract this information from the alignment information or even the HTML file?

Thank you very much! I am a huge fan of foldseek and am very grateful for the time you've spent putting this together.
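
If superposed coordinates and the residue pairing are available, the arrow lengths are plain Euclidean distances between paired C-alpha atoms. The sketch below is written under two loud assumptions: the target coordinates have already been transformed into the query frame, and the aligned residue index pairs are known. How to obtain either from foldseek output is not covered here, and all values are toy numbers.

```python
import math


def aligned_ca_distances(query_ca, target_ca, aligned_pairs):
    """Return (query_index, target_index, distance) for each aligned residue pair."""
    return [
        (qi, ti, math.dist(query_ca[qi], target_ca[ti]))
        for qi, ti in aligned_pairs
    ]


# Toy C-alpha coordinates, invented for illustration.
query_ca = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target_ca = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
pairs = [(0, 0), (1, 1)]  # 0-based residue indices, query then target

for qi, ti, distance in aligned_ca_distances(query_ca, target_ca, pairs):
    print(qi, ti, round(distance, 3))
```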

200 million uniprot database

Hello team,

Is the newest 200 million UniProt structure set from AlphaFold available?

Thanks,

Jianshu

Unrecognized parameter "--dbtype" when running easy-cluster

Expected Behavior

Cluster structures in a given database made using foldseek createdb.

Current Behavior

When running easy-cluster on a very small set of 10 PDB structures, running the following command

foldseek easy-cluster ./target_db ./output_cluster ./tmp

results in this error message:

usage: foldseek createdb <i:PDB|mmCIF[.gz]> ... <i:PDB|mmCIF[.gz]> <o:sequenceDB> [options]
 By Martin Steinegger <[email protected]>
options: common:
 --threads INT   Number of CPU-cores used (all by default) [4]
 -v INT          Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

examples:
 Convert PDB/mmCIF files to an db.

references:
 - Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)
Unrecognized parameter "--dbtype". Did you mean "--threads" (Threads)?

Looks like perhaps foldseek easy-cluster is trying to run foldseek createdb with an outdated option.

Environment

I used the precompiled binary for osx from https://mmseqs.com/foldseek/

foldseek-osx-universal.tar.gz 09-Jul-2021 07:55 6998109

foldseek Version: 75bb2e3

Operating system version: macOS Catalina 10.15.7

aln2tmscore freezes

Hi,

Thanks for your amazing program. I have noticed if the following structure is in my query database, after I search it against PDB, aln2tmscore freezes.

The problem happens if this protein is in my list of queries: https://alphafold.ebi.ac.uk/entry/Q381H2
My foldseek version: 2dd3b2f
I use a statically compiled binary (avx2).

Best,

[Web service] "Error loading search result"

Hi, I got this error from the Foldseek webserver. I tried both modes, and there was no difference. Could you check this problem?

foldseek all-vs-all mode fails?

I'm running a simple easy-search for a folder against itself (to get the pairwise tm-scores) using the following:

foldseek easy-search --alignment-type 1 --tmscore-threshold 0.0 /some/folder /some/folder aln.m8 tmpFolder

/some/folder has 3 pdb files in it (all simple, single chain pdb files for proteins of L=100).

Thus, I expected aln.m8 to have 3 * 3 (or 3*2/2) rows, one TM-score for each possible pair. However, aln.m8 had only 6 rows in my case, so some pairs are missing. Setting the sensitivity (-s) flag to 10 increased the number of rows to 8, but some rows are still missing.

What's the reason not all the pairs appear in aln.m8? What configuration am I missing to enable that?
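
To see exactly which pairs are absent, the expected ordered pairs can be compared against the first two columns of aln.m8. This is a small diagnostic sketch, not a fix; the file names and rows are hypothetical stand-ins for the three structures in the folder.

```python
from itertools import product


def missing_pairs(names, m8_text):
    """List ordered (query, target) pairs that never appear in the .m8 text."""
    reported = set()
    for line in m8_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            reported.add((fields[0], fields[1]))
    return [pair for pair in product(names, repeat=2) if pair not in reported]


# Hypothetical names and a truncated result table (only query and target shown).
names = ["a.pdb", "b.pdb", "c.pdb"]
m8_text = """\
a.pdb a.pdb
a.pdb b.pdb
b.pdb a.pdb
b.pdb b.pdb
c.pdb c.pdb
c.pdb a.pdb
"""

print(missing_pairs(names, m8_text))
```

For three structures there are 3 * 3 = 9 ordered pairs including self-hits, so six reported rows leave three pairs to investigate.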

createdb ignores all PDB files

Command
foldseek createdb pfamprocessed/ foldseekDB
directory pfamprocessed contains PDB files in format:

ATOM 7 CG LEU 1 41.236 47.777 43.697
ATOM 9 CD1 LEU 1 42.032 47.298 42.478
ATOM 13 CD2 LEU 1 39.833 48.166 43.219
ATOM 17 C LEU 1 40.159 45.638 46.822
ATOM 18 O LEU 1 40.806 45.323 47.828
ATOM 19 N SER 2 39.248 44.822 46.297
ATOM 21 CA SER 2 38.875 43.515 46.837
ATOM 23 CB SER 2 37.869 42.860 45.885

they come from C-I-Tasser pre-computed database of models
https://zhanglab.ccmb.med.umich.edu/C-I-TASSER/pfam/
ls -1 pfamprocessed/|head

PF00094.pdb
PF00242.pdb
PF00257.pdb
PF00260.pdb
PF00324.pdb
PF00336.pdb
PF00363.pdb
PF00379.pdb

Program output:

createdb pfamprocessed/ foldseekDB

MMseqs Version: GITDIR-NOTFOUND
Threads         128
Verbosity       3


Output file: foldseekDB
[=================================================================] 100.00% 8.27K 0s 209ms
Time for merging to foldseekDB_ss: 0h 0m 0s 2ms
Time for merging to foldseekDB_h: 0h 0m 0s 2ms
Time for merging to foldseekDB_ca: 0h 0m 0s 2ms
Time for merging to foldseekDB: 0h 0m 0s 2ms
Ignore 8266 out of 8266.
Too short: 0, incorrect  8266.
Time for processing: 0h 0m 0s 267ms
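
One thing worth ruling out is whether the ATOM records follow the fixed-width PDB layout, in which x, y and z occupy columns 31-54. The sketch below checks only that; it is an assumption that column layout is related to the rejection, and the spacing of the lines shown above may simply have been collapsed when the issue was rendered. The chain identifier "A" in the well-formed example is made up.

```python
def has_fixed_width_coordinates(line):
    """Check that columns 31-54 of an ATOM/HETATM record hold three parseable floats."""
    if not line.startswith(("ATOM", "HETATM")) or len(line) < 54:
        return False
    try:
        for start in (30, 38, 46):  # 0-based starts of the x, y, z fields
            float(line[start:start + 8])
    except ValueError:
        return False
    return True


# The record as displayed above, with single spaces between fields.
collapsed = "ATOM 21 CA SER 2 38.875 43.515 46.837"

# The same atom laid out in fixed-width columns (chain "A" is an assumption).
well_formed = "ATOM  {:>5} {:<4}{}{:>3} {}{:>4}{}   {:8.3f}{:8.3f}{:8.3f}".format(
    21, " CA ", " ", "SER", "A", 2, " ", 38.875, 43.515, 46.837)

print(has_fixed_width_coordinates(collapsed))
print(has_fixed_width_coordinates(well_formed))
```

Running the check over a few of the original files would show whether they are whitespace-delimited on disk or only appear so here.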

--db-output

Hello,

Could you please give a couple of examples using the --db-output option? I thought it would create an output database containing the search results, but my output consists of only two files, outDB.dbtype and outDB.index, and the index file contains only one line. Did I misunderstand this option, or is something wrong with it?
My idea was to get a searchable output database to run a new search with another query or parameters. Also, it would be helpful to have search restriction options like --gilist or -seqidlist in blast.

Thanks,
Harut

ERROR "Format code alntmscore does not exist."

When I run the command proposed under "Rescore aligments using TMscore" (with my own input data):
foldseek easy-search example/ example/ aln tmp --format-output query,target,alntmscore,u,t
I get the error "Format code alntmscore does not exist." Why?

How to generate the AlphaFold Swissprot v3 taxdb?

I generated the taxdb for Swissprot v3, but using the taxid option causes the following error:
/remote/foldseek_db/swiss-prot/swissprot_mapping is empty. Rerun createtaxdb to recreate taxonomy mapping.

How can I generate the AlphaFold Swissprot v3 taxdb?

Kmer matching step died with Alphafold/UniProt-NO-CA

Expected Behavior

First of all, thanks for making the new Uniprot structures available as indices so quickly! Having a 70Gb is a lot less to download than 23Tb - amazing :)

First test: I am having issues using easy-search of the Alphafold/UniProt-NO-CA databases with the latest foldseek version (downloaded Aug 4 release - avx2 binaries). Search dies right after createdb (full log at bottom).

Potential explanation: Could my afdb database be faulty? I could not get the foldseek download of UniProt-NO-CA (named afdb.tar.gz) to work, so I downloaded the file separately and untarred it. Untar worked fine.

Current Behavior

I tested using random .cif.gz files from AFDB. When running, I get the following error (see the full log below):

foldseek easy-search AF-Q5G6D8-F1-model_v3.cif.gz afdb --alignment-type 2 res.m8 tmp

...
Index table k-mer threshold: 78 at k-mer size 6
Index table: counting k-mers
Illegal instruction (core dumped)
Error: Kmer matching step died
Error: Search died
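
An "Illegal instruction" crash often means the binary uses CPU instructions the machine does not support, although that is not confirmed as the cause here. As a quick check on Linux, the sketch below parses the "flags" lines of /proc/cpuinfo and tests for avx2 and sse4_1; the function name and the sample text are illustrative.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


# Sample text for illustration; on a real machine, read /proc/cpuinfo instead:
#     with open("/proc/cpuinfo") as handle:
#         flags = cpu_flags(handle.read())
sample = "processor\t: 0\nflags\t\t: fpu sse2 sse4_1 sse4_2\n"
flags = cpu_flags(sample)

for wanted in ("avx2", "sse4_1"):
    print(wanted, "supported" if wanted in flags else "missing")
```

If avx2 turns out to be missing, trying the SSE4.1 build mentioned in other issues would be the obvious next experiment.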

Steps to Reproduce (for bugs)

foldseek easy-search AF-Q5G6D8-F1-model_v3.cif.gz afdb --alignment-type 2 res.m8 tmp

Your Environment

Git commit used (The string after "MMseqs Version:" when you execute foldseek without any parameters): 4002f69
Which foldseek version was used (Statically-compiled, self-compiled, Conda, etc.): Statically compiled avx2 binaries from the Aug 4 release (4/8-22).
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): Swissprot foldseek jobs work just fine.
Operating system and version: Ubuntu 18.04

foldseek easy-search AF-Q5G6D8-F1-model_v3.cif.gz afdb --alignment-type 2 res.m8 tmp
easy-search AF-Q5G6D8-F1-model_v3.cif.gz afdb --alignment-type 2 res.m8 tmp

MMseqs Version:                 4002f69c92a99b129a667b7399bb9d185a43a61b
Seq. id. threshold              0
Coverage threshold              0
Coverage mode                   0
Max reject                      2147483647
Max accept                      2147483647
Add backtrace                   false
TMscore threshold               0.5
TMalign fast                    1
Preload mode                    0
Threads                         32
Verbosity                       3
Substitution matrix             aa:3di.out,nucl:3di.out
Alignment mode                  3
Alignment mode                  0
E-value threshold               0.001
Min alignment length            0
Seq. id. mode                   0
Alternative alignments          0
Max sequence length             65535
Compositional bias              1
Compositional bias              1
Gap open cost                   aa:10,nucl:10
Gap extension cost              aa:1,nucl:1
Compressed                      0
Seed substitution matrix        aa:3di.out,nucl:3di.out
Sensitivity                     9.5
k-mer length                    6
k-score                         seq:2147483647,prof:2147483647
Max results per query           1000
Split database                  0
Split mode                      2
Split memory limit              0
Diagonal scoring                true
Exact k-mer matching            0
Mask residues                   0
Mask residues probability       0.99995
Mask lower case residues        0
Minimum diagonal score          15
Spaced k-mers                   1
Spaced k-mer pattern
Local temporary path
Alignment type                  2
Remove temporary files          true
MPI runner
Force restart with latest tmp   false
Chain name mode                 0
Write lookup file               1
Tar Inclusion Regex             .*
Tar Exclusion Regex             ^$
Alignment format                0
Format alignment output         query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output                 false

createdb AF-Q5G6D8-F1-model_v3.cif.gz tmp/3246194427880033517/query --chain-name-mode 0 --write-lookup 1 --tar-include '.*' --tar-exclude '^$' --threads 32 -v 3

Output file: tmp/3246194427880033517/query

[=================================================================] 100.00% 1 eta -
Time for merging to query_ss: 0h 0m 0s 649ms
Time for merging to query_h: 0h 0m 0s 734ms
Time for merging to query_ca: 0h 0m 0s 874ms
Time for merging to query: 0h 0m 1s 5ms
Ignore 0 out of 1.
Too short: 0, incorrect  0.
Time for processing: 0h 0m 7s 165ms
Create directory tmp/3246194427880033517/search_tmp
search tmp/3246194427880033517/query afdb tmp/3246194427880033517/result tmp/3246194427880033517/search_tmp --alignment-mode 3 --comp-bias-corr 1 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 -s 9.5 -k 6 --mask 0 --mask-prob 0.99995 --alignment-type 2 --remove-tmp-files 1

prefilter tmp/3246194427880033517/query_ss afdb_ss tmp/3246194427880033517/search_tmp/1200766418368934916/pref --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 9.5 -k 6 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.99995 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 32 --compressed 0 -v 3

Query database size: 1 type: Aminoacid
Target split mode. Searching through 9 splits
Estimated memory consumption: 94G
Target database size: 214684311 type: Aminoacid
Process prefiltering step 1 of 9

Index table k-mer threshold: 78 at k-mer size 6
Index table: counting k-mers
Illegal instruction (core dumped)                                 ] 0.00% 1 eta -
Error: Kmer matching step died
Error: Search died
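
An "Illegal instruction" crash can happen when a binary uses CPU instructions that the machine running it does not support. Whether that is the cause here is not established (Swissprot searches reportedly work with the same binary), but it is cheap to rule out. The following is a minimal sketch, assuming a Linux host that exposes /proc/cpuinfo; the helper names and the list of required features are illustrative assumptions, not part of foldseek.

```python
import re


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        # x86 Linux kernels label the line "flags"; some ARM kernels use "Features".
        if re.match(r"^(flags|Features)\s*:", line):
            return set(line.split(":", 1)[1].split())
    return set()


def missing_features(flags, required=("avx2", "sse4_1")):
    """List the required features (an assumed set) that are absent from flags."""
    return [feature for feature in required if feature not in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            flags = cpu_flags(handle.read())
    except OSError:
        flags = set()  # not Linux, or /proc is unavailable
    print("missing:", missing_features(flags))
```

On a cluster, the check is only meaningful when run on the same node that executes the search, since compute nodes may differ from the login node.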

Guidance for masking query protein structures

Hi,

I'm interested in finding matches for a local substructure. To prevent aligning to other domains and partial overlaps, I tried removing everything from the query structure that I'm not interested in, introducing chain breaks. Essentially reducing the protein to its core only.

I observe that I stop getting any hits if I use a structure that was trimmed down too aggressively - also missing hits that have great overlap with the region to which I trimmed down, which I recover well when using the full structure.

Are there general guidelines for doing this? Should trimming structures be avoided altogether, or does a minimum number of residues need to be kept so that the alignment does not fail?

Thanks!

3Di substitution matrix

Is the matrix found in foldseek/data/mat3di.out the one used to compute the Smith-Waterman structural alignment mentioned in the paper?

I'm using your 3Di structural sequence to align structures, but the common BLOSUM62 matrix creates erroneous results, therefore I would like to use the one you employ in foldseek and see if different sequence alignment methods produce different structural alignments.

Thanks in advance, awesome tool and work as always!
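
For anyone wanting to plug such a matrix into their own alignment code, here is a minimal sketch of loading it and scoring an ungapped pairing. It assumes the file is a whitespace-separated table with "#" comment lines, a header row of state letters, and one row per state; that layout is an assumption and should be checked against the actual mat3di.out. The three-letter matrix in the example contains made-up toy values, not the real 3Di scores.

```python
def parse_matrix(text):
    """Parse a whitespace-separated substitution matrix with a header row."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.startswith("#")]
    letters = rows[0]                      # column labels
    matrix = {}
    for row in rows[1:]:
        row_letter, scores = row[0], row[1:]
        for col_letter, score in zip(letters, scores):
            matrix[(row_letter, col_letter)] = float(score)
    return matrix


def ungapped_score(matrix, seq_a, seq_b):
    """Sum the substitution scores of two equal-length state strings."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


# Toy values for illustration only, NOT the real 3Di matrix.
TOY_MATRIX = """\
# toy example
   A  C  D
A  6 -3 -1
C -3  6 -2
D -1 -2  4
"""

toy = parse_matrix(TOY_MATRIX)
print(ungapped_score(toy, "ACD", "ACC"))
```

A general-purpose aligner that accepts custom matrices could be given the parsed table in the same way, in place of BLOSUM62.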

Resuming downloads

DB Download ERROR

While downloading big databases, would it be possible to resume the download after an ERROR 500 or another connection error?
I have been trying for several days to complete the download of the full Alphafold/UniProt db. It seems that the server does not accept the continue-download options, so the download always fails, both with the foldseek program itself and externally with wget.

Last server error was just now at 91%, after downloading 554.82G !
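
For reference, resuming over HTTP relies on the client sending a Range header and the server answering with status 206. The sketch below shows that mechanism with the Python standard library. It only helps if the server honours range requests, which the report above suggests it may not; the function names and any URL passed in are placeholders.

```python
import os
import urllib.request


def build_resume_request(url, partial_path):
    """Build a request that asks only for the bytes not yet on disk."""
    have = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    request = urllib.request.Request(url)
    if have:
        request.add_header("Range", f"bytes={have}-")
    return request, have


def resume_download(url, partial_path, chunk_size=1 << 20):
    """Append to partial_path if the server honours the range, else restart."""
    request, have = build_resume_request(url, partial_path)
    with urllib.request.urlopen(request) as response:
        # 206 means the range was honoured; 200 means the body starts at byte 0.
        mode = "ab" if have and response.status == 206 else "wb"
        with open(partial_path, mode) as out:
            while True:
                block = response.read(chunk_size)
                if not block:
                    break
                out.write(block)
```

If the file on disk is already complete, a server will typically answer the range request with an error status, which urlopen raises as an HTTPError.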

search.foldseek.com output columns

Hi,

Could you help me understand the .m8 output format provided by the server? There are 20 columns and I am wondering which is which. I could not find it documented anywhere; in the local version I would control this myself.

Thanks!

Default sensitivity is 5.7, not 7.5

Expected Behavior

The default threshold for sensitivity should be 7.5 as specified in the documentation.

Current Behavior

Easy-search by default runs with a -s value of 5.7. Is this intended? Or should this be switched to 7.5 as specified in the documentation?

Thank you!

Error: target createdb died

While running foldseek easy-search I ran into the following error.

##installed foldseek
conda install -c conda-forge -c bioconda foldseek

##installed database
foldseek databases Alphafold/Proteome afdb tmp

##try to run easysearch
foldseek easy-search /home/z76r142/AlphaFold/models/SLA-04_01093-RD/ranked_1.pdb /home/z76r142/condaenv/ 1093.m8 tmpfolder

##easy search results in following error

MMseqs2 was not compiled with zlib support. Cannot read compressed input.
Error: target createdb died
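
If compressed structure files are among the inputs and a build with zlib support is not at hand, one possible workaround is to decompress them before running createdb. This is a minimal sketch using only the Python standard library; the function name and directory layout are illustrative, and it does not address whether the target path in the command above points at the intended database.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_copy(src, dst_dir):
    """Write an uncompressed copy of a .gz file into dst_dir and return its path."""
    src, dst_dir = Path(src), Path(dst_dir)
    if src.suffix != ".gz":
        raise ValueError(f"not a .gz file: {src}")
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.stem               # "model.pdb.gz" -> "model.pdb"
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

The original compressed file is left untouched, so the step can be repeated safely.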

Your Environment

foldseek Version: 2.8bd520
MMseqs2 Version: 13.45111

Cannot create databases

Expected Behavior

foldseek databases PDB pdb tmp should setup PDB database

Current Behavior

Returns:
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Downloaded pdb.tar.gz is empty. It looks like the target URL (http://wwwuser.gwdg.de/~compbiol/foldseek/) no longer has uploaded databases.

Your Environment

Git commit used: 1c40553
Which foldseek version was used (Statically-compiled, self-compiled, Conda, etc.): foldseek-linux-sse41.tar.gz
Operating system and version: Ubuntu 18.04 run on WSL v1, Windows 10

Segmentation Fault when comparing two databases

Expected Behavior

I'm currently trying to align a database with 23391 structures against a database of CATH s35 PDB files in an all vs all fashion.

Current Behavior

Segmentation fault at line 17 of structuresearch.sh

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders. - Done
Generated with createdb successfully, empty tmp folder.
Launched as programs/foldseek/bin/foldseek search species/foldseek_db/homo_sapiens cath_s35_scans/s35_database cath_s35_scans/results/homo_sapiens_s35.m8 tmp

The two databases show as complete, although the program fails at
tmp/11743319497277053650/structuresearch.sh: line 17: 56940 Segmentation fault $RUNNER "$MMSEQS" prefilter "${QUERY_PREFILTER}" "${TARGET_PREFILTER}${INDEXEXT}" "${TMP_PATH}/pref" ${PREFILTER_PAR}
Error: Kmer matching step died

Here is the full output

programs/foldseek/bin/foldseek search species/foldseek_db/homo_sapiens cath_s35_scans/s35_database cath_s35_scans/results/homo_sapiens_s35.m8 tmp
cath_s35_scans/results/homo_sapiens_s35.m8 exists and will be overwritten
Create directory tmp
search species/foldseek_db/homo_sapiens cath_s35_scans/s35_database cath_s35_scans/results/homo_sapiens_s35.m8 tmp

MMseqs Version:                 75bb2e3a2718903f47008d7d8cc3be099e35d1e9
Seq. id. threshold              0
Coverage threshold              0
Coverage mode                   0
Max reject                      2147483647
Max accept                      2147483647
Add backtrace                   false
Include identical seq. id.      false
TMscore threshold               0.5
Threads                         48
Verbosity                       3
Substitution matrix             nucl:3di.out,aa:3di.out
Alignment mode                  3
Alignment mode                  0
Allow wrapped scoring           false
E-value threshold               0.001
Min alignment length            0
Seq. id. mode                   0
Alternative alignments          0
Max sequence length             65535
Compositional bias              0
Preload mode                    0
Pseudo count a                  1
Pseudo count b                  1.5
Score bias                      0
Realign hits                    false
Realign score bias              -0.2
Realign max seqs                2147483647
Gap open cost                   nucl:5,aa:11
Gap extension cost              nucl:2,aa:1
Zdrop                           40
Compressed                      0
Seed substitution matrix        nucl:3di.out,aa:3di.out
Sensitivity                     7.5
k-mer length                    0
k-score                         2147483647
Alphabet size                   nucl:5,aa:21
Max results per query           300
Split database                  0
Split mode                      2
Split memory limit              0
Diagonal scoring                true
Exact k-mer matching            0
Mask residues                   0
Mask lower case residues        0
Minimum diagonal score          15
Spaced k-mers                   1
Spaced k-mer pattern
Local temporary path
Alignment type                  0

prefilter species/foldseek_db/homo_sapiens_ss cath_s35_scans/s35_database_ss tmp/11743319497277053650/pref --sub-mat nucl:3di.out,aa:3di.out --seed-sub-mat nucl:3di.out,aa:3di.out -s 7.5 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 0 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 48 --compressed 0 -v 3

Query database size: 23391 type: Aminoacid
Estimated memory consumption: 359M
Target database size: 32388 type: Aminoacid
Index table k-mer threshold: 96 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 32.39K 0s 225ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 100.00% 32.39K 0s 261ms
Index statistics
Entries:          4390456
DB size:          153 MB
Avg k-mer size:   0.261692
Top 10 k-mers
    CCCCCC      1391
    PCCCKS      1141
    PCCCGI      1027
    CPPPCP      913
    PPPPPP      833
    IPPPPP      656
    MCCCKS      612
    CCCCCS      557
    CKKIPP      504
    PCCCHS      480
Time for index table init: 0h 0m 0s 965ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 96
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 23391
Target db start 1 to 32388
tmp/11743319497277053650/structuresearch.sh: line 17: 56940 Segmentation fault      $RUNNER "$MMSEQS" prefilter "${QUERY_PREFILTER}" "${TARGET_PREFILTER}${INDEXEXT}" "${TMP_PATH}/pref" ${PREFILTER_PAR}
Error: Kmer matching step died

Context

If I use easy-search instead of search the first database shows a Query database size: 0

Your Environment

Include as many relevant details about the environment you experienced the bug in.

Git commit used (The string after "MMseqs Version:" when you execute foldseek without any parameters): 75bb2e3
Which foldseek version was used (Statically-compiled, self-compiled, Conda, etc.): statically-compiled (AVX2)
For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation: N/A
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): CPU supports AVX2 and SSE2
I ran this with 32G of RAM and 8 Threads
Operating system and version: CentOS 7

Alternatively, can Foldseek run on all the structures contained within a folder?

Thank you for all your great tools and work!

What is the difference between the three types of afdb?

Dear foldseek developers,

I'm glad to hear that the newly updated AlphaFold 210M database is now available for foldseek, but there are three types of afdb (afdb.tar.gz, afdb50.tar.gz, afdb_ca.tar.gz).
I learned that afdb50 is the AlphaFold 210M database clustered with MMseqs, but what is the difference between the other two databases? Please help me understand this. Thanks!

add --format-output for both html and table

(Not a bug, just asking for a new feature :) )

Hi, I started using the tool, which looks very useful and efficient! thanks for all the hard work
It will be very useful for me if I could have more than one output format in one run;
To be more specific: the pretty HTML format + the output table (mode 0/4) so I could choose the columns.

If it isn't too much of a bother, could you add this option?
thanks :)
Itai

toggle full query button does not work

Expected Behavior

Hitting the "toggle full query" button hides the portion of the query structure that does not align to the target.

Current Behavior

The button has no effect. (The button for the toggling full target does work.)

Steps to Reproduce (for bugs)

Expand 1nqg_A (the top pdb100 hit) in
https://search.foldseek.com/result/CkjKZC8aEOSkQc-KtPlzKZQco_oRGaK2mQb5Bg/0#result-1-0

Your Environment

Using Chrome 104.0.5112.101 (64-bit) on windows

m8 Format fields are different when using TMalign or 3di

Expected Behavior

A default run of foldseek search followed by convertalis (or, alternatively, easy-search) outputs an m8 file with columns as follows:

query target identity alignment_length mismatches gap_openings query_start query_end target_start target_end evalue bitscore
i.e.
3pvlA_A 3pvlA01_A 0.951 207 10 0 1 207 1 207 1.675E-142 487

Current Behavior

Running the same two structures with the --alignment-type 1 flag turned on to use TMalign returns values that are inconsistent in position and format.

i.e.
3pvlA_A 3pvlA01_A 0.048 597 197 0 1 597 1 208 1.000E+00 100

It seems that the third column is no longer fident but something different (RMSD?), while the second to last is no longer an evalue but the TMalign score of 1 (in this case) reported in scientific notation.
Regarding the third column, I can't seem to find an equivalent result on the standalone TMalign output

TMalign

-bash-4.2$ ./TMalign ../data/cath.S20.v4_3_0.chainpdb/3pvlA ../data/cath.S20.v4_3_0.domainpdb/3pvlA01

 *********************************************************************
 * TM-align (Version 20210224): protein structure alignment          *
 * References: Y Zhang, J Skolnick. Nucl Acids Res 33, 2302-9 (2005) *
 * Please email comments and suggestions to [email protected]   *
 *********************************************************************

Name of Chain_1: ../data/cath.S20.v4_3_0.chainpdb/3pvlA (to be superimposed onto Chain_2)
Name of Chain_2: ../data/cath.S20.v4_3_0.domainpdb/3pvlA01
Length of Chain_1: 597 residues
Length of Chain_2: 208 residues

Aligned length= 208, RMSD=   0.00, Seq_ID=n_identical/n_aligned= 1.000
TM-score= 0.34841 (if normalized by length of Chain_1, i.e., LN=597, d0=8.55)
TM-score= 1.00000 (if normalized by length of Chain_2, i.e., LN=208, d0=5.37)
(You should use TM-score normalized by length of the reference structure)

(":" denotes residue pairs of d <  5.0 Angstrom, "." denotes other aligned residues)
EEDLSEYKFAKFAATYFQGTTTHSYTRRPLKQPLLYHDDEGDQLAALAVWITILRFMGDLPEPKYHKIPVMTKIYETLGKKTYKRELQALQQGNSMLEDRPTSNLEKLHFIIGNGILRPALRDEIYCQISKQLTHNPSKSSYARGWILVSLCVGCFAPSEKFVKYLRNFIHGGPPGYAPYCEERLRRTFVNGTRTQPPSWLELQATKSKKPIMLPVTFMDGTTKTLLTDSATTARELCNALADKISLKDRFGFSLYIALFDKVSSLGSGSDHVMDAISQCEQYAKEQGAQERNAPWRLFFRKEVFTPWHNPSEDNVATNLIYQQVVRGVKFGEYRCEKEDDLAELASQQYFVDYGSEMILERLLSLVPTYIPDREITPLKNLEKWAQLAIAAHKKGIYAQRRTDSQKVKEDVVNYARFKWPLLFSRFYEAYKFSGPPLPKSDVIVAVNWTGVYFVDEQEQVLLELSFPEIMAVSSSRGTKMMAPSFTLATIKGDEYTFTSSNAEDIRDLVVTFLEGLRKRSKYVVALQDNPNSGFLSFAKGDLIILDHDTGEQVMNSGWANGINERTKQRGDFPTDCVYVMPTVTLPPREIVALVTM
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
EEDLSEYKFAKFAATYFQGTTTHSYTRRPLKQPLLYHDDEGDQLAALAVWITILRFMGDLPEPKYHKIPVMTKIYETLGKKTYKRELQALQQGNSMLEDRPTSNLEKLHFIIGNGILRPALRDEIYCQISKQLTHNPSKSSYARGWILVSLCVGCFAPSEKFVKYLRNFIHGGPPGYAPYCEERLRRTFVNGTRTQPPSWLELQATKS-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Total CPU time is  0.57 seconds
-bash-4.2$
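
The two TM-scores in the TM-align output above can be reproduced by hand, which may help when cross-checking values. The sketch below uses the published TM-score definition, d0 = 1.24 * (L - 15)^(1/3) - 1.8 and TM = (1/L) * sum of 1 / (1 + (d_i/d0)^2) over aligned pairs, and assumes all 208 aligned residue pairs superimpose exactly, as the reported RMSD of 0.00 suggests. It says nothing about how Foldseek fills its own output columns.

```python
def d0(norm_length):
    """TM-score distance scale for a given normalisation length."""
    return 1.24 * (norm_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, norm_length):
    """TM-score from per-pair distances (in Angstrom) of the aligned residues."""
    scale = d0(norm_length)
    return sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances) / norm_length


# 208 aligned pairs at zero distance, as in the 3pvlA vs 3pvlA01 example.
perfect = [0.0] * 208

print(round(d0(597), 2), round(d0(208), 2))        # 8.55 5.37
print(round(tm_score(perfect, 597), 5))            # 0.34841
print(round(tm_score(perfect, 208), 5))            # 1.0
```

With a perfect superposition the score reduces to aligned length divided by normalisation length, so 208/597 gives 0.34841 for the full chain and 208/208 gives 1.0 for the domain.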

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.

Foldseek-TMalign
./foldseek/bin/foldseek easy-search data/cath.S20.v4_3_0.chainpdb/3pvlA data/cath.S20.v4_3_0.domainpdb/3pvlA01 3pvlA.m8 tmp --alignment-type 1 -c 0.6 --cov-mode 1 -s 9
Foldseek-3di
./foldseek/bin/foldseek easy-search data/cath.S20.v4_3_0.chainpdb/3pvlA data/cath.S20.v4_3_0.domainpdb/3pvlA01 3pvlA.m8 tmp -c 0.6 --cov-mode 1 -s 9
TMalign
TMalign ../data/cath.S20.v4_3_0.chainpdb/3pvlA ../data/cath.S20.v4_3_0.domainpdb/3pvlA01

I'm attaching both structures for reference.
Structures.zip

What are the TMalign parameters used by Foldseek so I can crosscheck the results?

Thanks again for all your work on this, it's an amazing piece of software.

Making Database from a tar file

Dear Foldseek developers,

I wondered if making a foldseek database from a tar file is possible. The total number of structures in the latest alphafold release is more than 200M. My computational server has a limitation that doesn't allow me to have more than 1M files. It would be great if foldseek could make a database from the concatenated files (like tarred files).
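
The createdb logs quoted in other reports on this page list "Tar Inclusion Regex" (.*) and "Tar Exclusion Regex" (^$) parameters, which suggests that some builds can read tar archives directly. As an illustration of what such include/exclude filters select, here is a minimal sketch with Python's tarfile module. It assumes the regexes are applied with search semantics to member names; how foldseek actually applies them is not confirmed here.

```python
import io
import re
import tarfile


def select_members(tar, include=".*", exclude="^$"):
    """Names of regular files matching include and not matching exclude."""
    inc, exc = re.compile(include), re.compile(exclude)
    return [member.name for member in tar.getmembers()
            if member.isfile()
            and inc.search(member.name)
            and not exc.search(member.name)]


def make_demo_tar(names):
    """Build a small in-memory tar archive with one-byte files (demo only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in names:
            info = tarfile.TarInfo(name)
            info.size = 1
            tar.addfile(info, io.BytesIO(b"x"))
    buffer.seek(0)
    return buffer


demo = make_demo_tar(["a.cif", "b.cif.gz", "notes.txt"])
with tarfile.open(fileobj=demo, mode="r") as tar:
    print(select_members(tar, include=r"\.cif(\.gz)?$"))
```

With the default patterns every regular file is kept, because "^$" only matches an empty name.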

build from src fails: header file missing

Expected Behavior

Should compile on linux without errors

Current Behavior

Throws error about missing header file encoder_weights_3di.kerasify.h during make

Steps to Reproduce (for bugs)

normal build procedure:
clone git repo, untar, create build dir, cmake, make

Foldseek Output (for bugs)

[ 87%] Building CXX object lib/3di/CMakeFiles/3di.dir/structureto3di.cpp.o
/home/user/packages/foldseek/foldseek/lib/3di/structureto3di.cpp:7:10: fatal error: encoder_weights_3di.kerasify.h: No such file or directory
    7 | #include "encoder_weights_3di.kerasify.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make[2]: *** [lib/3di/CMakeFiles/3di.dir/build.make:76: lib/3di/CMakeFiles/3di.dir/structureto3di.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:880: lib/3di/CMakeFiles/3di.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

Environment

self-compiled from git source
commit 90b2545
RebornOS (ArchLinux)

The aligned part is one amino acid shorter than expected

Hi,

Thanks for your great program.
I am using foldseek to search some cif files against a database of pdb files. I see foldseek reports the alignment "one amino acid shorter than what it is supposed to report". For instance, I chopped the structure of A0A017S8D7 from position 17 to 229, and searched the whole structure against the chopped part. I saw the alignment is one amino acid shorter than what was expected.

Expected Behavior

I expected to see qend to be 229 and tend to be 213.

Current Behavior

qend = 228, and tend = 212
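
The expected values follow from inclusive residue numbering: a fragment covering positions 17 to 229 has 229 - 17 + 1 = 213 residues. A small sketch of that arithmetic, assuming 1-based inclusive coordinates in the output (the function name is illustrative):

```python
def expected_ends(chop_start, chop_end):
    """Expected alignment ends when a full-length query is aligned against a
    fragment cut from positions chop_start..chop_end (1-based, inclusive)."""
    fragment_length = chop_end - chop_start + 1
    return {"qend": chop_end, "tend": fragment_length}


expected = expected_ends(17, 229)
print(expected)                                   # {'qend': 229, 'tend': 213}

# Difference between expected and reported values (228 and 212).
print(expected["qend"] - 228, expected["tend"] - 212)   # 1 1
```

Both ends are off by exactly one, so the same final residue is missing from the query and target coordinates.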

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.
A0A017S8D7_17_229.pdb.gz
I assume you have downloaded the attached pdb.gz file and it is in your current working directory.

mkdir query
mkdir target
mv A0A017S8D7_17_229.pdb.gz target
wget https://alphafold.ebi.ac.uk/files/AF-A0A017S8D7-F1-model_v3.cif
mv AF-A0A017S8D7-F1-model_v3.cif query

foldseek createdb target/ Tdb --threads 1
foldseek createdb query/ Qdb --threads 1

foldseek search Qdb Tdb aln tmpFolder -a --threads 1
foldseek convertalis Qdb Tdb aln aln_tm \
--format-output "query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,qlen,tlen,evalue,bits,alntmscore" \
--threads 1

#Please take a look at the aln_tm file

3Di sequences which file after database command

Hello! I just downloaded the PDB databases. I received those files:
pdb
pdb_ca
pdb_ca.dbtype
pdb_ca.index
pdb_h
pdb_h.dbtype
pdb_h.index
pdb_mapping
pdb_ss
pdb_ss.dbtype
pdb_ss.index
pdb_taxonomy
pdb.dbtype
pdb.index
pdb.lookup
pdb.md5sum

I was wondering which one contains the 3Di sequences.
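
Judging from the search logs quoted elsewhere on this page, where the prefilter runs on the *_ss databases with the 3di.out matrix, pdb_ss is the likely home of the 3Di strings. The sketch below reads one entry from such a database. It assumes the usual MMseqs2 layout: an uncompressed data file plus a tab-separated .index file of key, offset, and length, with each entry ending in a newline and a NUL byte. These are assumptions to verify against your own files.

```python
def read_index(index_path):
    """Map each key in a .index file to its (offset, length) in the data file."""
    entries = {}
    with open(index_path) as handle:
        for line in handle:
            key, offset, length = line.split("\t")[:3]
            entries[key] = (int(offset), int(length))
    return entries


def read_entry(data_path, offset, length):
    """Return one entry as text, without the trailing newline and NUL byte."""
    with open(data_path, "rb") as handle:
        handle.seek(offset)
        raw = handle.read(length)
    return raw.rstrip(b"\x00\n").decode("ascii")


def first_entry(db_path):
    """Convenience helper: read the entry with the smallest offset."""
    index = read_index(db_path + ".index")
    offset, length = min(index.values())
    return read_entry(db_path, offset, length)
```

Calling first_entry("pdb_ss") would then print a single 3Di string if the assumptions hold, while the same call on "pdb" would be expected to return the amino acid sequence.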
Thank you

Brief description of output columns?

Thanks a lot for this wonderful tool.

Unfortunately, I couldn't find a brief description of the output format. The output I have downloaded misses column headers. So, I don't know which data belongs to which column. I tried to guess some of them (seqident, ...) but without success. Looking at mmseqs2 site didn't help me. It would be very nice if the README contains a brief explanation of the output format by giving an example.

Thanks in advance,
Reza

Web Service Fails/Hangs when Using TM-align mode

When searching any protein on the web server (https://search.foldseek.com/) I continue to get failures when using the TM-align mode. I tried this with a number of proteins and loaded accessions.

Error during easy-search

Expected Behavior

Aligning human AlphaFold proteome to the Mycobacterium tuberculosis proteome

Current Behavior

Index table: counting k-mers
/data/local/vfranke/VFranke_Structures/TMP/16012876887949026468/search_tmp/8683525511783924040/structuresearch.sh: line 17: 49775 Illegal instruction $RUNNER "$MMSEQS" prefilter "${QUERY_PREFILTER}" "${TARGET_PREFILTER}${INDEXEXT}" "${TMP_PATH}/pref" ${PREFILTER_PAR}
Error: Kmer matching step died
Error: Search died

Steps to Reproduce (for bugs)

Data is downloaded

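# "~" is not expanded inside quotes, so use $HOME instead
foldseek="$HOME/bin/Software/Proteins/foldseek/bin/foldseek"

inpath="$HOME/Base/AlphaFold"
outpath='/data/local/VFranke_Structures'
tmpdir="$outpath/TMP"; mkdir -p "$tmpdir"

# build the target database from all Mycobacterium tuberculosis cif files
db_mt="$outpath/foldseek_myctu_db"
myctu=($(find "$inpath/MYCTU" | grep cif))
"$foldseek" createdb "${myctu[@]}" "$db_mt" --threads 32

outdir="$outpath/HUMAN-MYCTU-align"; mkdir -p "$outdir"
hgfiles=($(find "$inpath/HUMAN" | grep cif))

# search all human structures against the target database
"$foldseek" easy-search "${hgfiles[@]}" "$db_mt" "$outdir/hits.m8" "$tmpdir" --threads 32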

Foldseek Output (for bugs)

MMseqs Version: 75bb2e3
Seq. id. threshold 0
Coverage threshold 0
Coverage mode 0
Max reject 2147483647
Max accept 2147483647
Add backtrace false
Include identical seq. id. false
TMscore threshold 0.5
Threads 32
Verbosity 3
Substitution matrix nucl:3di.out,aa:3di.out
Alignment mode 3
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Max sequence length 65535
Compositional bias 1
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Gap open cost nucl:5,aa:11
Gap extension cost nucl:2,aa:1
Zdrop 40
Compressed 0
Seed substitution matrix nucl:3di.out,aa:3di.out
Sensitivity 5.7
k-mer length 0
k-score 2147483647
Alphabet size nucl:5,aa:21
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask lower case residues 0
Minimum diagonal score 15
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Alignment type 0

search /data/local/vfranke/VFranke_Structures/TMP/16012876887949026468/query /data/local/vfranke/VFranke_Structures/foldseek_myctu_db /data/local/vfranke/VFranke_Structures/TMP/16012876887949026468/result /data/local/vfranke/VFranke_Structures/TMP/16012876887949026468/search_tmp --threads 32

/data/local/vfranke/VFranke_Structures/TMP/16012876887949026468/search_tmp/8683525511783924040/structuresearch.sh: line 17: 51199 Illegal instruction $RUNNER "$MMSEQS" prefilter "${QUERY_PREFILTER}" "${TARGET_PREFILTER}${INDEXEXT}" "${TMP_PATH}/pref" ${PREFILTER_PAR}
Error: Kmer matching step died
Error: Search died

Context

Trying to align the human proteome to the Mycobacterium tuberculosis proteome

Your Environment

I downloaded the pre-compiled binary. Test run executed perfectly.

Different search results between webserver and local search

The reason for this is a difference in parameters.

Foldseek default parameters: --max-seqs 300 -e 0.001
Webserver --max-seqs 1000 -e 0.1
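
To get local results closer to the web server, the two parameters can be passed explicitly. The sketch below only assembles the command line; the query, database, output, and tmp names are placeholders, and other differences between the server and a local installation (for example database versions) may remain.

```python
import shlex


def easy_search_cmd(query, target_db, out_file, tmp_dir, max_seqs=1000, evalue=0.1):
    """Argument list for an easy-search run using the web server's settings."""
    return ["foldseek", "easy-search", query, target_db, out_file, tmp_dir,
            "--max-seqs", str(max_seqs), "-e", str(evalue)]


cmd = easy_search_cmd("query.pdb", "pdb", "res.m8", "tmp")
print(shlex.join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) once the placeholder paths are replaced with real ones.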