
KA-Search: Rapid and exhaustive sequence identity search of known antibodies

License: BSD 3-Clause "New" or "Revised" License




KA-Search, a method for rapid and exhaustive sequence identity search of known antibodies


by Tobias H. Olsen $^{1,\dagger}$, Brennan A. Kenyon $^{1,\dagger}$, Iain H. Moal $^{2}$ and Charlotte M. Deane $^{1,3}$

$^{1}$ Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford, United Kingdom
$^{2}$ GSK Medicines Research Centre, GlaxoSmithKline plc, Stevenage, United Kingdom
$^{3}$ Exscientia plc, Oxford, United Kingdom
$^{\dagger}$ These authors contributed equally to this work and share first authorship

Abstract

Antibodies with similar amino acid sequences, especially across their complementarity-determining regions, often share properties. Finding that an antibody of interest has a similar sequence to naturally expressed antibodies in healthy or diseased repertoires is a powerful approach for the prediction of antibody properties, such as immunogenicity or antigen specificity. However, as the number of available antibody sequences is now in the billions and continuing to grow, repertoire mining for similar sequences has become increasingly computationally expensive. Existing approaches are limited by being low-throughput, non-exhaustive, not antibody-specific, or only able to search against entire chain sequences. Therefore, there is a need for a specialized tool, optimized for a rapid and exhaustive search of any antibody region against all known antibodies, to better utilize the full breadth of available repertoire sequences.

We introduce Known Antibody Search (KA-Search), a tool that allows for rapid search of billions of antibody sequences by sequence identity across either the whole chain, the CDRs, or a user defined antibody region. We show KA-Search in operation on the ~2.4 billion antibody sequences available in the OAS database. KA-Search can be used to find the most similar sequences from OAS within 30 minutes using 5 CPUs. We give examples of how KA-Search can be used to obtain new insights about an antibody of interest. KA-Search is freely available at https://github.com/oxpig/kasearch.


Software implementation

KA-Search is a freely available Python package.

The latest stable version can be installed with pip.

    pip install kasearch

The latest development version can be installed directly from GitHub.

    pip install -U git+https://github.com/oxpig/kasearch

NB: You need to manually install a version of ANARCI in the same environment. ANARCI can also be installed using bioconda; however, this version is maintained by a third party.

    conda install -c bioconda anarci
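
Because ANARCI has to be installed separately, it can be worth confirming that both packages are importable in the active environment before starting a search. The following is a minimal sketch using only the standard library; the helper name `check_installed` is hypothetical, and it assumes the packages are importable under the module names `kasearch` and `anarci`.

```python
import importlib.util


def check_installed(modules=("kasearch", "anarci")):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    # Report which of the required packages are visible in this environment.
    for name, found in check_installed().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If either package is reported as missing, install it into the same environment with one of the commands above.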

Download pre-aligned data to search against

The following list contains the download links for the paper version of the pre-aligned OAS and any future releases, ready for KA-Search.

NB: Some of the datasets are quite large; ensure you have enough disk space before downloading them.
NB: For convenience, OAS-aligned-small and OAS-aligned-tiny can be downloaded automatically when initiating KA-Search.

| Dataset | Size | Date | Comments |
|---|---|---|---|
| OAS-aligned (Paper version) | 63 GB | January 2023 | A pre-aligned version of OAS with 2.4 billion sequences |
| OAS-aligned-small (Paper version) | 2.8 GB | January 2023 | A pre-aligned version of OAS with 86 million sequences |
| OAS-aligned-tiny (Paper version) | 260 MB | January 2023 | A pre-aligned version of OAS with 10 million sequences |

After downloading, extract the pre-aligned dataset with "tar -xf downloaded_file.tar". Give the path to the extracted dataset when initiating KA-Search to search against it. See how to do this by following the KA-Search notebook guide below.
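
For scripted setups, the extraction step can also be done from Python. The sketch below is one possible equivalent of the `tar -xf` command, with a rough free-space check added because of the dataset sizes. The function name `extract_dataset` and the `safety_factor` parameter are hypothetical and not part of KA-Search.

```python
import shutil
import tarfile
from pathlib import Path


def extract_dataset(archive_path, target_dir, safety_factor=2.0):
    """Extract a .tar archive into target_dir after a rough free-space check."""
    archive_path = Path(archive_path)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: require free space of safety_factor times the archive size.
    needed = archive_path.stat().st_size * safety_factor
    free = shutil.disk_usage(target_dir).free
    if free < needed:
        raise OSError(f"Need about {needed:.0f} bytes free, only {free} available")

    with tarfile.open(archive_path) as archive:
        archive.extractall(target_dir)
    return target_dir
```

Which folder to then pass as the dataset path depends on the layout inside the archive, so inspect the extracted directory first.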


KA-Search guide

KA-Search is designed to be downloaded and run locally.

NB: Out of the box, KA-Search requires an internet connection to retrieve metadata; see below for how to use KA-Search offline.

As a demo, we have set up a reduced version of KA-Search in a Colab notebook that can be run remotely. KA-Search, as set up on the Colab, uses the OAS-aligned-tiny version of OAS to reduce the time and memory required to download the database. The Colab demo is composed of two parts:

  • Quick and easy use of KA-Search: Here the user can try out KA-Search with minimal configuration. Simply paste in your antibody variable domain sequence and run the search.

  • KA-Search with more configuration: Here we expose the KA-Search API and go through a more in-depth tutorial of how it can be set up for your particular use case. We explain how to preprocess the query sequence, the possible search configurations, how to extract the metadata after finding the most identical sequences and how to preprocess your own database so it can be used with KA-Search.

If you want to follow this tutorial locally, we also provide a Jupyter notebook showcasing KA-Search. Its content is the same as the "KA-Search with more configuration" section of the Colab. Running it locally also lets you search against the whole of OAS-aligned.


Description of the returned results

The returned output from KA-Search contains all columns and metadata in the pre-aligned datasets searched as well as a column named "Identity", which contains the calculated sequence identity. The returned output is always sorted by highest identity.

For the OAS-aligned datasets these columns are:

  • Each column from AIRR's rearrangement schema (see here for exact description).
  • Additional sequence specific information derived by OAS processing, i.e. nucleotides for the constant region if present, ANARCI numbering and ANARCI status. For more information see the OAS paper.
  • Metadata from the OAS data unit the sequence was derived from, i.e. author, species, experimental run and unique sequences in run. For more information see the OAS help page.
  • Lastly, the column "Identity", which contains the calculated sequence identity between the query and target sequence.

NB: Some returned columns contain NaNs. These columns could not be populated when the data was originally processed, so the NaNs are not a side effect of KA-Search. The only column populated by KA-Search is the "Identity" column.
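
As an illustration of working with such output, the sketch below filters a table by the "Identity" column and drops columns that are entirely NaN. It assumes the results behave like a pandas DataFrame and that identities are on a 0 to 1 scale; the stand-in table, every column name other than "Identity", and the helper `tidy_hits` are made up for this example.

```python
import numpy as np
import pandas as pd

# Stand-in for the table returned by a search. Only "Identity" is a column
# name taken from the description above; the others are hypothetical.
results = pd.DataFrame({
    "sequence_alignment_aa": ["EVQLV", "QVQLQ", "EVQLL"],
    "Species": ["Human", "Human", "Mouse"],
    "Identity": [0.95, 0.81, 0.62],
    "some_unpopulated_column": [np.nan, np.nan, np.nan],
})


def tidy_hits(df, min_identity=0.8):
    """Keep hits at or above min_identity and drop columns that are all NaN."""
    hits = df[df["Identity"] >= min_identity]
    return hits.dropna(axis="columns", how="all").reset_index(drop=True)


top_hits = tidy_hits(results)
print(top_hits)
```

Adjust `min_identity` if the identities in your output turn out to be on a different scale.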


Description of the main arguments and examples of different types of search

The main arguments for your search are:

  • database_path: Path to the database to search. If not specified, the OAS-aligned-tiny dataset (~260MB, 10 million human heavy chain sequences) will be downloaded and searched against.
  • allowed_chain: Which chain to search: only heavy (Heavy), only light (Light) or any chain (Any).
  • allowed_species: Which species to search against (this depends on which species are in the pre-aligned data used). For OAS-aligned this includes Human, Mouse, Camel and Humanized.
  • regions: A list of regions to search, either the provided ones ('whole', 'cdrs' or 'cdr3') or user-defined ones. An example of a user-defined region is ['111 ', '111A', '112A', '112 '].
  • length_matched: A list of booleans, one per region, indicating whether to only compare sequences whose searched region has the same length as the query's. Example: [False, True, True]
  • local_oas_path: For offline use, the path to a local version of OAS.

NB: The regions list and the length_matched list need to have the same length.
NB: For offline use, a local version of OAS is needed for the metadata extraction. OAS currently takes up ~1.1 TB. It is therefore recommended to run KA-Search locally, but with internet access.
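
The requirement that regions and length_matched have the same length can be checked before launching a long search. The helper below, `check_search_args`, is a hypothetical convenience function and not part of the KA-Search API; it only pairs each region with its flag and fails early on a mismatch.

```python
def check_search_args(regions, length_matched):
    """Pair each region with its length flag, raising ValueError on a mismatch."""
    if len(regions) != len(length_matched):
        raise ValueError(
            f"regions has {len(regions)} entries but "
            f"length_matched has {len(length_matched)}"
        )
    for flag in length_matched:
        if not isinstance(flag, bool):
            raise ValueError(f"length_matched entries must be booleans, got {flag!r}")
    return list(zip(regions, length_matched))


# Three regions, each with its own length flag.
pairs = check_search_args(["whole", "cdrs", "cdr3"], [False, True, True])
for region, matched in pairs:
    print(f"{region}: length_matched={matched}")
```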

1. Example of searching against whole variable heavy domains from humans.

In this example, we search for similar human heavy chains across the whole variable domain, while also allowing sequences which might differ in length.

from kasearch import EasySearch

raw_queries = [
    'VKLLEQSGAEVKKPGASVKVSCKASGYSFTSYGLHWVRQAPGQRLEWMGWISAGTGNTKYSQKFRGRVTFTRDTSATTAYMGLSSLRPEDTAVYYCARDPYGGGKSEFDYWGQGTLVTVSS',
]

results = EasySearch(
    raw_queries, 
    allowed_chain='Heavy',  
    allowed_species='Human', 
    regions=['whole'],  
    length_matched=[False], 
)

2. Example of searching for similar CDRH3s from any species, but only return CDRH3s with an exact length match.

In this example, we search for sequences whose CDRH3 has exactly the same length as the query's, from any species. To also find sequences whose CDR3 differs in length, set the corresponding length_matched entry to False.

raw_queries = [
    'VKLLEQSGAEVKKPGASVKVSCKASGYSFTSYGLHWVRQAPGQRLEWMGWISAGTGNTKYSQKFRGRVTFTRDTSATTAYMGLSSLRPEDTAVYYCARDPYGGGKSEFDYWGQGTLVTVSS',
]

results = EasySearch(
    raw_queries, 
    allowed_chain='Heavy', 
    allowed_species='Any', 
    regions=['cdr3'],  
    length_matched=[True], 
)
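
To make the effect of length matching concrete, here is a small stand-alone sketch of an identity calculation between two aligned regions. This is not KA-Search's implementation: the function `region_identity` is hypothetical, '-' is assumed to mark an empty alignment position, and dividing by the longer of the two regions is an assumption made for illustration only.

```python
def region_identity(query, target, length_matched=False):
    """Fraction of identical residues between two aligned regions.

    Both strings must cover the same alignment positions, with '-' for gaps.
    Returns None when length_matched is True and the residue counts differ.
    """
    if len(query) != len(target):
        raise ValueError("aligned regions must cover the same positions")
    query_len = sum(aa != "-" for aa in query)
    target_len = sum(aa != "-" for aa in target)
    if length_matched and query_len != target_len:
        return None  # skipped: region lengths differ
    matches = sum(q == t and q != "-" for q, t in zip(query, target))
    longest = max(query_len, target_len)  # assumed normalisation
    return matches / longest if longest else 0.0


# The target has one extra residue, so it is only scored without length matching.
print(region_identity("ARDPY-GGFDY", "ARDPYAGGFDY"))
print(region_identity("ARDPY-GGFDY", "ARDPYAGGFDY", length_matched=True))
```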

3. Example of searching with a user-defined region (i.e. the paratope).

In this example, we search for sequences with a similar paratope. The positions of the paratope need to follow the IMGT numbering scheme, and each must be one of the 200 allowed positions in the canonical alignment introduced in the KA-Search paper.

raw_queries = [
    'VKLLEQSGAEVKKPGASVKVSCKASGYSFTSYGLHWVRQAPGQRLEWMGWISAGTGNTKYSQKFRGRVTFTRDTSATTAYMGLSSLRPEDTAVYYCARDPYGGGKSEFDYWGQGTLVTVSS',
]

paratope = ["107 ", "108 ","111C", "114 ","115 "]

results = EasySearch(
    raw_queries, 
    allowed_chain='Heavy',  
    allowed_species='Any', 
    regions=[paratope],   
    length_matched=[True], 
)
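
The position strings in the examples above all consist of a number followed by either an insertion letter or a single space. Assuming that pattern holds, a small helper can build such a list and avoid typos like a missing trailing space. The function `imgt_position` is hypothetical, and the format is inferred only from the three-digit examples shown here; whether every resulting string is one of the allowed canonical positions is not checked.

```python
def imgt_position(number, insertion=""):
    """Format a position as the number plus an insertion letter or a space."""
    if len(insertion) > 1:
        raise ValueError("insertion code must be a single letter or empty")
    return f"{number}{insertion.upper() or ' '}"


# Rebuild the paratope used in the example above.
paratope = [
    imgt_position(107),
    imgt_position(108),
    imgt_position(111, "C"),
    imgt_position(114),
    imgt_position(115),
]
print(paratope)
```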

4. Example of searching with KA-Search offline.

In this example, we specify the path to a local version of OAS. This allows us to extract metadata for the returned sequences offline.

raw_queries = [
    'VKLLEQSGAEVKKPGASVKVSCKASGYSFTSYGLHWVRQAPGQRLEWMGWISAGTGNTKYSQKFRGRVTFTRDTSATTAYMGLSSLRPEDTAVYYCARDPYGGGKSEFDYWGQGTLVTVSS',
]

results = EasySearch(
    raw_queries, 
    allowed_chain='Heavy',  
    allowed_species='Any',
    regions=['cdr3'],  
    length_matched=[True], 
    local_oas_path='/path/to/local/oas/'
)

5. Example of searching custom data.

In this example, we pre-align a set of custom sequences, and subsequently search them with KA-Search.

First we format the data as an OAS data unit file. The minimal format requires a metadata line at the top of the file and a single column containing the variable domain of each antibody sequence. The file can also contain any number of additional columns with sequence-specific information, which will be retrieved when extracting metadata.

import json, os, shutil
import pandas as pd

custom_data_file = "custom-data-examples.csv"

seq_df = pd.DataFrame([
    ["EVQLVESGGGLAKPGGSLRLHCAASGFAFSSYWMNWVRQAPGKRLEWVSAINLGGGLTYYAASVKGRFTISRDNSKNTLSLQMNSLRAEDTAVYYCATDYCSSTYCSPVGDYWGQGVLVTVSS"],
    ["EVQLVQSGAEVKRPGESLKISCKTSGYSFTSYWISWVRQMPGKGLEWMGAIDPSDSDTRYNPSFQGQVTISADKSISTAYLQWSRLKASDTATYYCAIKKYCTGSGCRRWYFDLWGPGT"],
    ['QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARGEPRYDYAWFAYWGQGTLVTVS'],
    ['QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARGPATAWFAYWGQGTLVTVS'],
    ['QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARSAWFAYWGQGTLVTVS'],
    ['QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARGGYWGQGTTLTVSS'],
    ['QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARGGLRRGAWFAYWGQGTLVTVS']
], columns = ['heavy_sequences'])

meta_data = pd.Series(name=json.dumps({"Species":"Human", "Chain":"Heavy"}), dtype='object')

meta_data.to_csv(custom_data_file, index=False)
seq_df.to_csv(custom_data_file, index=False, mode='a')
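
To show what the resulting file looks like on disk, the sketch below writes and reads the same two-part layout using only the standard library: one line of JSON metadata, then an ordinary CSV table. The function names `write_data_unit` and `read_data_unit` are hypothetical and the sequences are shortened placeholders; the layout is inferred from the pandas code above rather than from a format specification.

```python
import csv
import json
import os
import tempfile


def write_data_unit(path, metadata, column, sequences):
    """Write a JSON metadata line followed by a one-column CSV table."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow([json.dumps(metadata)])  # line 1: metadata as JSON
        writer.writerow([column])                # line 2: column header
        writer.writerows([seq] for seq in sequences)


def read_data_unit(path):
    """Return (metadata dict, list of row dicts) from a data unit file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        # Re-join in case the metadata line was split on its commas.
        metadata = json.loads(",".join(next(reader)))
        header = next(reader)
        rows = [dict(zip(header, row)) for row in reader]
    return metadata, rows


with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, "demo-data-unit.csv")
    write_data_unit(
        demo_file,
        {"Species": "Human", "Chain": "Heavy"},
        "heavy_sequences",
        ["EVQLVESGGG", "QVQLQQSGAE"],  # shortened placeholder sequences
    )
    metadata, rows = read_data_unit(demo_file)

print(metadata)
print(rows)
```

Reading a file back this way is a quick check that the metadata line and the sequence column name are what the pre-alignment step will be told to expect.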

Once the custom data files have been created, they need to be pre-aligned and combined into a single dataset.

from kasearch import PrepareDB

path_to_custom_db = "my_kasearch_db"
many_custom_data_files = [custom_data_file]

customDB = PrepareDB(db_path=path_to_custom_db, n_jobs=2, from_scratch=True)

for num, data_file in enumerate(many_custom_data_files):
    
    customDB.prepare_sequences(
        data_file,
        file_id=num, 
        chain='Heavy', # This needs to change depending on the custom data file
        species='Human', # This needs to change depending on the custom data file
        seq_column_name = 'heavy_sequences', # This needs to change depending on the custom data file
    )
    shutil.copy(data_file, os.path.join(path_to_custom_db, 'extra_data'))
    
customDB.finalize_prepared_files()

Finally, the pre-aligned custom dataset can be searched by providing its path when initiating the search.

raw_queries = [
    'VKLLEQSGAEVKKPGASVKVSCKASGYSFTSYGLHWVRQAPGQRLEWMGWISAGTGNTKYSQKFRGRVTFTRDTSATTAYMGLSSLRPEDTAVYYCARDPYGGGKSEFDYWGQGTLVTVSS',
]

results = EasySearch(
    raw_queries, 
    database_path=path_to_custom_db, 
    allowed_chain='Any', 
    allowed_species='Any',
    regions=['whole'],
    length_matched=[False],
)

Citation

@article{Olsen2023,
  title={KA-Search, a method for rapid and exhaustive sequence identity search of known antibodies},
  author={Olsen, Tobias H. and Kenyon, Brennan A. and Moal, Iain H. and Deane, Charlotte M.},
  journal={Scientific Reports},
  doi={10.1038/s41598-023-38108-7},
  year={2023}
}

kasearch's Issues

wifi traffic

Dear Developer,

I downloaded oasdb_small to my computer and tried to run a search locally, but noticed heavy Wi-Fi traffic. Is this normal? Is the search run on a remote server or on my local computer? Thanks!

Error when running EasySearch: only results for "Identity" column

Dear kasearch team,

First of all, thanks for all your work, kasearch is really promising!! I'm really hoping I can get it running soon.

I'm trying to run EasySearch on the sample sequence. I downloaded the publication dataset into this folder: /researchers/laura.twomey/Tools/omics_tools/kasearch/oasdb_20230111/

from kasearch import EasySearch
# Run ka search
results = EasySearch('QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS',
    allowed_chain='Heavy',  
    allowed_species='Human', 
    regions=['whole'],  
    length_matched=[False], 
    database_path='/researchers/laura.twomey/Tools/omics_tools/kasearch/oasdb_20230111/'
)

But I get this error:

Traceback (most recent call last):
  File "/home/ltwomey/src/Analysis/scRNAseq/run_kasearch.py", line 15, in <module>
    results = EasySearch(
              ^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/kasearch/easy_search.py", line 56, in EasySearch
    return targetdb.get_meta(n_query=0, n_region=0, n_sequences='all', n_jobs=n_jobs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/kasearch/kasearch.py", line 150, in get_meta
    metadf = self._extract_meta(self.current_best_ids[n_query, :n_sequences, n_region], n_jobs=n_jobs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/kasearch/meta_extract.py", line 87, in _extract_meta
    fetched_metadata = pd.concat(Parallel(n_jobs=n_jobs)(delayed(self._get_single_study_meta)(group) for group in groups))
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/joblib/parallel.py", line 1918, in __call__
    return output if self.return_generator else list(output)
                                                ^^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/joblib/parallel.py", line 1847, in _get_sequential_output
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/kasearch/meta_extract.py", line 47, in _get_single_study_meta
    study_file = self.id_to_study[study_id]
                 ~~~~~~~~~~~~~~~~^^^^^^^^^^
KeyError: np.int64(3595)

I'm using:

  • biopython=1.83
  • hmmer=3.4
  • muscle=3.8.1551-6
  • anarci=1.3 (commit 79f6c575056dedef86cb8f405ebb039197923eec)
  • kasearch (commit fb0ebc7)

Create custom database example returns KeyError: -1 when running EasySearch on the new DB

Hi!
I've run your example notebooks both locally and on Google Colab, and I always get a KeyError: -1 when running EasySearch on the new database.
I think the error may be due to how the database is generated, because I'm able to run EasySearch when using oas-aligned-tiny.

Here is the traceback that I get on Google Colab:

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-60-5ac3149581ac> in <cell line: 5>()
      3 query = 'QVQLQQSGAELARPGASVKLSCKASGYTFTSYWMQWVKQRPGQGLEWIGAIYPGDGDTRYTQKFKGKATLTADKSSSTAYMQLSSLASEDSAVYYCARGGLRRGAWFAYWGQGTLVTVS'
      4 
----> 5 results = EasySearch(query, 
      6                keep_best_n=10,
      7                database_path=path_to_save_new_db,

5 frames

/usr/local/lib/python3.10/site-packages/kasearch/easy_search.py in EasySearch(query, keep_best_n, database_path, allowed_chain, allowed_species, regions, length_matched, include_ends, local_oas_path, n_jobs)
     54     targetdb.search(querydb[:1], keep_best_n=keep_best_n)
     55 
---> 56     return targetdb.get_meta(n_query=0, n_region=0, n_sequences='all', n_jobs=n_jobs)

/usr/local/lib/python3.10/site-packages/kasearch/kasearch.py in get_meta(self, n_query, n_region, n_sequences, n_jobs)
    147         assert n_sequences > 0
    148 
--> 149         metadf = self._extract_meta(self.current_best_ids[n_query, :n_sequences, n_region], n_jobs=n_jobs)
    150         metadf['Identity'] = self.current_best_identities[n_query, :n_sequences, n_region]
    151         return metadf

/usr/local/lib/python3.10/site-packages/kasearch/meta_extract.py in _extract_meta(self, idxs, n_jobs)
     77         n_jobs = n_groups if n_groups <  n_jobs else n_jobs
     78         chunksize= n_groups // n_jobs
---> 79 
     80         fetched_metadata = pd.concat(Parallel(n_jobs=n_jobs)(delayed(self._get_single_study_meta)(group) for group in groups))
     81 

/usr/local/lib/python3.10/site-packages/joblib/parallel.py in __call__(self, iterable)
   1853             output = self._get_sequential_output(iterable)
   1854             next(output)
-> 1855             return output if self.return_generator else list(output)
   1856 
   1857         # Let's create an ID that uniquely identifies the current call. If the

/usr/local/lib/python3.10/site-packages/joblib/parallel.py in _get_sequential_output(self, iterable)
   1782                 self.n_dispatched_batches += 1
   1783                 self.n_dispatched_tasks += 1
-> 1784                 res = func(*args, **kwargs)
   1785                 self.n_completed_tasks += 1
   1786                 self.print_progress()

/usr/local/lib/python3.10/site-packages/kasearch/meta_extract.py in _get_single_study_meta(self, idxs)
     44         """  
     45         print('hi')
---> 46         study_id, line_ids = idxs[0,0], idxs[:,1]
     47         study_file = self.id_to_study[study_id]
     48 

KeyError: -1

Thanks!
