
chainsaw's Introduction

Chainsaw

Chainsaw is a deep learning method for predicting protein domain boundaries for a given protein structure.

If you find Chainsaw useful in your research, please cite:

Chainsaw: protein domain segmentation with fully convolutional neural networks

Jude Wells, Alex Hawkins-Hooker, Nicola Bordin, Brooks Paige and Christine Orengo

bioRxiv

Installation

  1. Install stride: source code and instructions are packaged in this repository in the stride directory. You will need to compile stride and put the executable in your path, then update the STRIDE_EXE variable in src/constants.py to point to the stride executable.

  2. Install the Python dependencies: pip install -r requirements.txt

  3. Test that it is working by running python get_predictions.py --structure_file example_files/AF-A0A1W2PQ64-F1-model_v4.pdb --output results/test.tsv. By default the output will be saved in the results directory.

Optional: to visualise the domain assignments, ensure that you have PyMOL installed and update the PYMOL_EXE variable in src/constants.py to point to the PyMOL executable.

Usage

python get_predictions.py --structure_file /path/to/file.pdb

or

python get_predictions.py --structure_directory /path/to/pdb_or_mmcif_directory

Note that the output predicted boundaries use consecutive residue indexing starting from 1 (not PDB auth numbers).
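
If you need to relate these consecutive indices back to the author numbering of the original file, the sketch below (using Biopython, which Chainsaw already depends on) shows one way to build the mapping. The helper is illustrative, not part of the repository, and only approximates Chainsaw's internal indexing (filtering of non-standard residues can shift it; see the issues below).

# Illustrative helper (not part of Chainsaw): map 1-based consecutive
# residue indices to the PDB author numbering of a given chain.
from Bio.PDB import PDBParser

def consecutive_to_auth(pdb_path, chain_id="A"):
    """Return {consecutive_index: author_residue_label} for one chain."""
    structure = PDBParser(QUIET=True).get_structure("s", pdb_path)
    chain = structure[0][chain_id]
    # hetfield " " keeps standard residues and excludes waters/ligands
    residues = [r for r in chain.get_residues() if r.get_id()[0] == " "]
    mapping = {}
    for i, res in enumerate(residues, start=1):
        _, resseq, icode = res.get_id()
        mapping[i] = f"{resseq}{icode.strip()}"
    return mapping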

chainsaw's People

Contributors

judewells, sillitoe, alex-hh, bordin89


chainsaw's Issues

Allow for chain identifiers different from A

Hi!

I have been using chainsaw to process some AFDB and ESM models, where the chain is always 'A'. But I wanted to use it on some chains extracted from PDB structures, where the chain may not be 'A'. However, it seems chainsaw does not work in these cases because it always expects the chain identifier to be 'A'. It may be worth changing this behaviour...

Best wishes
Joana

Cluster warning

[W NNPACK.cpp:64] Could not initialize NNPACK! Reason: Unsupported hardware.

Does this matter?

Chainsaw Software Distribution License

Hi, Jude.

Cool work! I was taking a look at the chainsaw preprint and code and I didn't see a license for the project. Is this project fully open source?

If so, I really recommend adding a permissive license for your project. I would highly suggest the MIT license or a similarly permissive license. See GitHub's documentation for details on adding a license to the project.

Please let me know.

Thanks in advance!

Add chain specifier

Currently Chainsaw has chain 'A' hardcoded, as AlphaFold models only have the A chain.

def predict(model, pdb_path, renumber_pdbs=True, ss_mod=False, pdbchain="A") -> List[PredictionResult]:
    """
    Makes the prediction and returns a list of PredictionResult objects
    """
    start = time.time()

    # get model structure metadata
    model_structure = featurisers.get_model_structure(pdb_path)
    model_structure_seq = featurisers.get_model_structure_sequence(model_structure, chain=pdbchain)
    model_structure_md5 = hashlib.md5(model_structure_seq.encode('utf-8')).hexdigest()

If we want to make this generalisable to PDB files that have multiple chains (e.g. 4wgvC), we should be able to specify the chain to avoid issues like:

Traceback (most recent call last):
  File "get_predictions.py", line 292, in <module>
    main(parse_args())
  File "get_predictions.py", line 223, in main
    result = predict(model, pdb_path, ss_mod=args.ss_mod)
  File "get_predictions.py", line 105, in predict
    model_structure_seq = featurisers.get_model_structure_sequence(model_structure, chain=pdbchain)
  File "/SAN/orengolab/af_esm/tools/chainsaw/src/featurisers.py", line 34, in get_model_structure_sequence
    residues = [c for c in structure_model[chain].child_list]
  File "/SAN/orengolab/af_esm/tools/chainsaw/venv/lib/python3.8/site-packages/Bio/PDB/Entity.py", line 45, in __getitem__
    return self.child_dict[id]
KeyError: 'A'
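
A minimal sketch of the kind of chain resolution that would avoid this, assuming model_structure is a Biopython Structure/Model (the function name, the pdbchain=None convention and the warning text are assumptions, not the repository's actual fix):

import logging

def resolve_chain(model_structure, pdbchain=None):
    """Pick the requested chain, or fall back to the first chain present."""
    available = [c.id for c in model_structure.get_chains()]
    if pdbchain is None:
        logging.warning("No chain specified, using first chain '%s'", available[0])
        return available[0]
    if pdbchain not in available:
        raise ValueError(f"Chain '{pdbchain}' not found; available: {available}")
    return pdbchain

predict() could then look up model_structure[resolve_chain(model_structure, pdbchain)] instead of assuming 'A'.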

Residues with non-sequential index

We've run into an error (ValueError: Index 70 not in model_res_label_by_index) when attempting to make a prediction for the PDB file 5yclA:

10/16/2023 03:21:54 PM | INFO | Making prediction for file 5yclA.pdb (chain '5yclA')
10/16/2023 03:21:54 PM | WARNING | No chain specified for /scratch0/nbordin/chainsaw-3158440-9/holdingpen_008/pdb/5yclA.pdb, using first chain
10/16/2023 03:21:54 PM | INFO | Running command: /SAN/orengolab/af_esm/tools/chainsaw/stride/stride /scratch0/nbordin/chainsaw-3158440-9/holdingpen_008/pdb/5yclA_renum.pdb -rA
10/16/2023 03:21:54 PM | INFO | Distance matrix shape: (1, 131, 131), SS matrix shape: (131, 131)
10/16/2023 03:21:55 PM | INFO | Segments (index to label): ['4-68'] -> ['7-71']
Traceback (most recent call last):
  File "get_predictions.py", line 332, in <module>
    main(parse_args())
  File "get_predictions.py", line 264, in main
    result = predict(model, pdb_path, ss_mod=args.ss_mod, pdbchain=pdb_chain_id)
  File "get_predictions.py", line 175, in predict
    segs_str = [f"{seg.start_label}-{seg.end_label}" for seg in dom.segs]
  File "get_predictions.py", line 175, in <listcomp>
    segs_str = [f"{seg.start_label}-{seg.end_label}" for seg in dom.segs]
  File "get_predictions.py", line 143, in start_label
    return self.res_label_of_index(self.start_index)
  File "get_predictions.py", line 138, in res_label_of_index
    raise ValueError(f"Index {index} not in model_res_label_by_index ({model_res_label_by_index})")
ValueError: Index 70 not in model_res_label_by_index ({1: '4', 2: '5', 3: '6', 4: '7', 5: '8', 6: '9', 7: '10', 8: '11', 9: '12', 10: '13', 11: '14', 12: '15', 13: '16', 14: '17', 15: '18', 16: '19', 17: '20', 18: '21', 19: '22', 20: '23', 21: '24', 22: '25', 23: '26', 24: '27', 25: '28', 26: '29', 27: '30', 28: '31', 29: '32', 30: '33', 31: '34', 32: '35', 33: '36', 34: '37', 35: '38', 36: '39', 37: '40', 38: '41', 39: '42', 40: '43', 41: '44', 42: '45', 43: '46', 44: '47', 45: '48', 46: '49', 47: '50', 48: '51', 49: '52', 50: '53', 51: '54', 52: '55', 53: '56', 54: '57', 55: '58', 56: '59', 57: '60', 58: '61', 59: '62', 60: '63', 61: '64', 62: '65', 63: '66', 64: '67', 65: '68', 66: '69', 67: '70', 68: '71', 69: '72', 71: '74', 72: '75', 73: '76', 75: '78', 76: '79', 77: '80', 78: '81', 79: '82', 80: '83', 81: '84', 82: '85', 83: '86', 84: '87', 86: '89', 87: '90', 88: '91', 89: '92', 90: '93', 91: '94', 92: '95', 93: '96', 94: '97', 95: '98', 96: '99', 97: '100', 98: '101', 99: '102', 100: '103', 101: '104', 102: '105', 104: '107', 105: '108', 106: '109', 107: '110', 108: '111', 109: '112', 110: '113', 111: '117', 112: '118', 113: '119', 114: '120', 115: '121', 116: '122', 117: '123', 118: '124', 119: '125', 120: '126', 121: '127', 122: '128', 123: '129', 124: '130', 125: '131', 126: '132', 127: '133', 128: '134', 129: '135', 130: '136', 131: '137', 132: '138', 133: '139', 134: '140', 135: '141'})

Note: the index skips from 69 to 71.

It seems likely that this is due to the index being created before the residues are filtered for non-standard amino acids. If so, the ideal solution would be for the index to be sequential with respect to the final, filtered sequence, while maintaining the integrity of the PDB labels.
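
A sketch of that ordering, assuming Biopython residue objects (the function name is illustrative): filter first, then index, so the indices are contiguous over the final sequence while the values preserve the original author labels.

from Bio.PDB.Polypeptide import is_aa

def build_res_label_by_index(residues):
    """Index residues 1..n AFTER filtering out non-standard amino acids.

    Keys are consecutive indices into the filtered sequence; values are
    the original PDB author labels (resseq plus optional insertion code).
    """
    standard = [r for r in residues if is_aa(r, standard=True)]
    return {
        i: f"{res.get_id()[1]}{res.get_id()[2].strip()}"
        for i, res in enumerate(standard, start=1)
    }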

Remove hard-coded python environment variable in renum_pdb_file()

The following code block will not run if, for example, the user's environment resolves python to Python 2:

def renum_pdb_file(pdb_path, output_pdb_path):
    pdb_reres_path = REPO_ROOT / 'src/utils/pdb_reres.py'
    with open(output_pdb_path, "w") as output_file:
        subprocess.run(['python', str(pdb_reres_path), pdb_path],
                       stdout=output_file,
                       check=True,
                       text=True)
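
One portable fix is to invoke the same interpreter that is running Chainsaw via sys.executable, rather than whatever python happens to resolve to on the user's PATH (REPO_ROOT as in the original module):

import subprocess
import sys

def renum_pdb_file(pdb_path, output_pdb_path):
    pdb_reres_path = REPO_ROOT / 'src/utils/pdb_reres.py'
    with open(output_pdb_path, "w") as output_file:
        # sys.executable is the Python 3 interpreter running this process,
        # so the call no longer depends on the PATH's 'python'
        subprocess.run([sys.executable, str(pdb_reres_path), pdb_path],
                       stdout=output_file,
                       check=True,
                       text=True)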

Reproducibility: Datasets required

Hello!

Great work on this project!

In order to reproduce the performance evaluation, I would like to know which datasets were used for training and validating your approach. While the methodology in your paper is quite thorough, it does not offer a deterministic approach to generate these, and I can't find anything in your repo either.

Could you please either publish the identifiers of your data splits or provide some other means to reproduce the datasets?

Best regards,
Nicole

Submit script should loop around archive files

Currently the submit script will work on a file containing a list of archive files.

The general idea is:

  • extract all the PDB files from the archives into a given directory
  • process all the PDB files

This means that there is heavy network IO at the start of the job. It would be better to spread the network load out to points throughout the job, so:

for each archive file:

  • extract all PDB files
  • process all PDB files

One extra bonus is that this provides the opportunity to record a milestone once a given archive file has been processed correctly (so it can be skipped on future runs).
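
A sketch of that per-archive loop, with a simple marker file as the milestone (the run_prediction wrapper, the tar archive format and the .done convention are all assumptions):

from pathlib import Path
import tarfile
import tempfile

def process_archives(archive_list_file, results_dir):
    for archive in Path(archive_list_file).read_text().split():
        milestone = Path(results_dir) / (Path(archive).name + ".done")
        if milestone.exists():
            continue  # archive already processed on a previous run
        with tempfile.TemporaryDirectory() as tmpdir:
            with tarfile.open(archive) as tar:
                tar.extractall(tmpdir)  # network IO happens per archive
            for pdb_file in sorted(Path(tmpdir).glob("**/*.pdb")):
                run_prediction(pdb_file)  # assumed wrapper around get_predictions.py
        milestone.touch()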

Change zero-indexed output behaviour

Current behaviour is to output domain predictions based on 0-indexed sequential residue positions in the PDB file.
Better would be 1-indexed, to match AlphaFold structures.
Best would be indexing that matches the author numbers in the provided PDB file.

Zip extraction performance

Would operating on proteome files be faster, since we could in that case simply use unzip? If so, by how much?

The partitioned zip index files have 10000 AF structures each (there are 18897 partitions). If we ran one prediction per second, this would be 166 minutes per partition, which would mean that unzipping for 30 minutes is a substantial fraction of the overall cost.

Are these partitioned index files processed in important ways? (e.g. I guess they might have unique chains only?)

Output chopping by residue labels (as well as sequential numbering)

We have started using chainsaw to identify domain boundaries in PDB files.

The residue labels in AlphaFold structures are numeric and sequential (1-n). The residue labels in PDB files are much less predictable. They do not always start at 1, they can skip numbers, can include negative numbers, can have optional "insert code" characters, etc, etc.

At the moment, chainsaw outputs chopping boundaries that correspond to the sequential numbering of the amino acid sequence in the PDB file, i.e. it completely ignores the residue labels in the PDB file. This makes it problematic to map domain boundaries back onto the structural data.

However, when parsing the PDB file, Chainsaw will have access to both the residue labels AND the sequential numbering. So it would be relatively easy to output the same chopping in two forms: numeric and PDB labels.
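
Given the model_res_label_by_index mapping that Chainsaw already builds (visible in the traceback of the issue above), emitting both forms could look something like this sketch; the segment tuple format is an assumption:

def format_chopping(segments, res_label_by_index):
    """Render a chopping in sequential-numbering and PDB-label forms.

    segments: list of (start, end) tuples in 1-based sequential indexing.
    res_label_by_index: {sequential_index: pdb_author_label}.
    """
    numeric = ",".join(f"{s}-{e}" for s, e in segments)
    labelled = ",".join(
        f"{res_label_by_index[s]}-{res_label_by_index[e]}" for s, e in segments
    )
    return numeric, labelled

# e.g. format_chopping([(4, 68)], {...}) -> ("4-68", "7-71"),
# matching the "Segments (index to label)" log line above.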
