
chainsaw's Introduction

Chainsaw

Chainsaw is a deep learning method for predicting protein domain boundaries for a given protein structure.

If you find Chainsaw useful in your research, please cite:

Chainsaw: protein domain segmentation with fully convolutional neural networks

Jude Wells, Alex Hawkins-Hooker, Nicola Bordin, Brooks Paige and Christine Orengo

bioRxiv

Installation

  1. Install stride: source code and instructions are packaged in this repository in the stride directory. You will need to compile stride and put the executable in your path, then update the STRIDE_EXE variable in src/constants.py to point to the stride executable.

  2. Install the Python dependencies: pip install -r requirements.txt

  3. Test that it is working by running python get_predictions.py --structure_file example_files/AF-A0A1W2PQ64-F1-model_v4.pdb --output results/test.tsv. By default the output will be saved in the results directory.

Optional: to visualise the domain assignments, ensure that you have PyMOL installed and update the PYMOL_EXE variable in src/constants.py to point to the PyMOL executable.

Usage

python get_predictions.py --structure_file /path/to/file.pdb

or

python get_predictions.py --structure_directory /path/to/pdb_or_mmcif_directory

Note that the output predicted boundaries use consecutive residue indexing starting from 1 (not PDB auth numbers).
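
If you need to relate these consecutive indices back to the author numbering of the original file, the sketch below (using Biopython, which Chainsaw already depends on) shows one way to build the mapping. The helper is illustrative, not part of the repository, and only approximates Chainsaw's internal indexing (filtering of non-standard residues can shift it; see the issues below).

# Illustrative helper (not part of Chainsaw): map 1-based consecutive
# residue indices to the PDB author numbering of a given chain.
from Bio.PDB import PDBParser

def consecutive_to_auth(pdb_path, chain_id="A"):
    """Return {consecutive_index: author_residue_label} for one chain."""
    structure = PDBParser(QUIET=True).get_structure("s", pdb_path)
    chain = structure[0][chain_id]
    # hetfield " " keeps standard residues and excludes waters/ligands
    residues = [r for r in chain.get_residues() if r.get_id()[0] == " "]
    mapping = {}
    for i, res in enumerate(residues, start=1):
        _, resseq, icode = res.get_id()
        mapping[i] = f"{resseq}{icode.strip()}"
    return mapping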

chainsaw's People

Contributors

judewells, sillitoe, alex-hh, bordin89


chainsaw's Issues

Allow for chain identifiers different from A

Hi!

I have been using chainsaw to process some AFDB and ESM models, where the chain is always 'A'. But I wanted to use it on some chains extracted from PDB structures, where the chain may not be 'A'. However, it seems chainsaw does not work in these cases because it always expects the chain identifier to be 'A'. It may be worth changing this behaviour...

Best wishes
Joana

Cluster warning

[W NNPACK.cpp:64] Could not initialize NNPACK! Reason: Unsupported hardware.

Does this matter?

Chainsaw Software Distribution License

Hi, Jude.

Cool work! I was taking a look at the chainsaw preprint and code and I didn't see a license for the project. Is this project fully open source?

If so, I really recommend adding a permissive license for your project. I would highly suggest the MIT license or a similarly permissive license. See GitHub's documentation for details on adding a license to the project.

Please let me know.

Thanks in advance!

Add chain specifier

Currently Chainsaw has chain 'A' hardcoded, as AlphaFold models only have the A chain.

def predict(model, pdb_path, renumber_pdbs=True, ss_mod=False, pdbchain="A") -> List[PredictionResult]:
    """
    Makes the prediction and returns a list of PredictionResult objects
    """
    start = time.time()

    # get model structure metadata
    model_structure = featurisers.get_model_structure(pdb_path)
    model_structure_seq = featurisers.get_model_structure_sequence(model_structure, chain=pdbchain)
    model_structure_md5 = hashlib.md5(model_structure_seq.encode('utf-8')).hexdigest()

If we want to make this generalisable to PDB files that have multiple chains (e.g. 4wgvC), we should be able to specify the chain to avoid issues like:

Traceback (most recent call last):
  File "get_predictions.py", line 292, in <module>
    main(parse_args())
  File "get_predictions.py", line 223, in main
    result = predict(model, pdb_path, ss_mod=args.ss_mod)
  File "get_predictions.py", line 105, in predict
    model_structure_seq = featurisers.get_model_structure_sequence(model_structure, chain=pdbchain)
  File "/SAN/orengolab/af_esm/tools/chainsaw/src/featurisers.py", line 34, in get_model_structure_sequence
    residues = [c for c in structure_model[chain].child_list]
  File "/SAN/orengolab/af_esm/tools/chainsaw/venv/lib/python3.8/site-packages/Bio/PDB/Entity.py", line 45, in __getitem__
    return self.child_dict[id]
KeyError: 'A'
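
A minimal sketch of the kind of chain resolution that would avoid this, assuming model_structure is a Biopython Structure/Model (the function name, the pdbchain=None convention and the warning text are assumptions, not the repository's actual fix):

import logging

def resolve_chain(model_structure, pdbchain=None):
    """Pick the requested chain, or fall back to the first chain present."""
    available = [c.id for c in model_structure.get_chains()]
    if pdbchain is None:
        logging.warning("No chain specified, using first chain '%s'", available[0])
        return available[0]
    if pdbchain not in available:
        raise ValueError(f"Chain '{pdbchain}' not found; available: {available}")
    return pdbchain

predict() could then look up model_structure[resolve_chain(model_structure, pdbchain)] instead of assuming 'A'.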

Residues with non-sequential index

We've run into an error (ValueError: Index 70 not in model_res_label_by_index) when attempting to make a prediction for the PDB file 5yclA:

10/16/2023 03:21:54 PM | INFO | Making prediction for file 5yclA.pdb (chain '5yclA')
10/16/2023 03:21:54 PM | WARNING | No chain specified for /scratch0/nbordin/chainsaw-3158440-9/holdingpen_008/pdb/5yclA.pdb, using first chain
10/16/2023 03:21:54 PM | INFO | Running command: /SAN/orengolab/af_esm/tools/chainsaw/stride/stride /scratch0/nbordin/chainsaw-3158440-9/holdingpen_008/pdb/5yclA_renum.pdb -rA
10/16/2023 03:21:54 PM | INFO | Distance matrix shape: (1, 131, 131), SS matrix shape: (131, 131)
10/16/2023 03:21:55 PM | INFO | Segments (index to label): ['4-68'] -> ['7-71']
Traceback (most recent call last):
  File "get_predictions.py", line 332, in <module>
    main(parse_args())
  File "get_predictions.py", line 264, in main
    result = predict(model, pdb_path, ss_mod=args.ss_mod, pdbchain=pdb_chain_id)
  File "get_predictions.py", line 175, in predict
    segs_str = [f"{seg.start_label}-{seg.end_label}" for seg in dom.segs]
  File "get_predictions.py", line 175, in <listcomp>
    segs_str = [f"{seg.start_label}-{seg.end_label}" for seg in dom.segs]
  File "get_predictions.py", line 143, in start_label
    return self.res_label_of_index(self.start_index)
  File "get_predictions.py", line 138, in res_label_of_index
    raise ValueError(f"Index {index} not in model_res_label_by_index ({model_res_label_by_index})")
ValueError: Index 70 not in model_res_label_by_index ({1: '4', 2: '5', 3: '6', 4: '7', 5: '8', 6: '9', 7: '10', 8: '11', 9: '12', 10: '13', 11: '14', 12: '15', 13: '16', 14: '17', 15: '18', 16: '19', 17: '20', 18: '21', 19: '22', 20: '23', 21: '24', 22: '25', 23: '26', 24: '27', 25: '28', 26: '29', 27: '30', 28: '31', 29: '32', 30: '33', 31: '34', 32: '35', 33: '36', 34: '37', 35: '38', 36: '39', 37: '40', 38: '41', 39: '42', 40: '43', 41: '44', 42: '45', 43: '46', 44: '47', 45: '48', 46: '49', 47: '50', 48: '51', 49: '52', 50: '53', 51: '54', 52: '55', 53: '56', 54: '57', 55: '58', 56: '59', 57: '60', 58: '61', 59: '62', 60: '63', 61: '64', 62: '65', 63: '66', 64: '67', 65: '68', 66: '69', 67: '70', 68: '71', 69: '72', 71: '74', 72: '75', 73: '76', 75: '78', 76: '79', 77: '80', 78: '81', 79: '82', 80: '83', 81: '84', 82: '85', 83: '86', 84: '87', 86: '89', 87: '90', 88: '91', 89: '92', 90: '93', 91: '94', 92: '95', 93: '96', 94: '97', 95: '98', 96: '99', 97: '100', 98: '101', 99: '102', 100: '103', 101: '104', 102: '105', 104: '107', 105: '108', 106: '109', 107: '110', 108: '111', 109: '112', 110: '113', 111: '117', 112: '118', 113: '119', 114: '120', 115: '121', 116: '122', 117: '123', 118: '124', 119: '125', 120: '126', 121: '127', 122: '128', 123: '129', 124: '130', 125: '131', 126: '132', 127: '133', 128: '134', 129: '135', 130: '136', 131: '137', 132: '138', 133: '139', 134: '140', 135: '141'})

Note: the index skips from 69 to 71.

It seems likely that this is due to the index being created before the residues are filtered for non-standard amino acids. If so, the ideal solution would be for the index to be sequential with respect to the final, filtered sequence, while maintaining the integrity of the PDB labels.
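
A sketch of that ordering, assuming Biopython residue objects (the function name is illustrative): filter first, then index, so the indices are contiguous over the final sequence while the values preserve the original author labels.

from Bio.PDB.Polypeptide import is_aa

def build_res_label_by_index(residues):
    """Index residues 1..n AFTER filtering out non-standard amino acids.

    Keys are consecutive indices into the filtered sequence; values are
    the original PDB author labels (resseq plus optional insertion code).
    """
    standard = [r for r in residues if is_aa(r, standard=True)]
    return {
        i: f"{res.get_id()[1]}{res.get_id()[2].strip()}"
        for i, res in enumerate(standard, start=1)
    }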

Remove hard-coded python environment variable in renum_pdb_file()

The following code block will not run if, for example, the user's environment resolves python to Python 2:

def renum_pdb_file(pdb_path, output_pdb_path):
    pdb_reres_path = REPO_ROOT / 'src/utils/pdb_reres.py'
    with open(output_pdb_path, "w") as output_file:
        subprocess.run(['python', str(pdb_reres_path), pdb_path],
                       stdout=output_file,
                       check=True,
                       text=True)
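
One portable fix is to invoke the same interpreter that is running Chainsaw via sys.executable, rather than whatever python happens to resolve to on the user's PATH (REPO_ROOT as in the original module):

import subprocess
import sys

def renum_pdb_file(pdb_path, output_pdb_path):
    pdb_reres_path = REPO_ROOT / 'src/utils/pdb_reres.py'
    with open(output_pdb_path, "w") as output_file:
        # sys.executable is the Python 3 interpreter running this process,
        # so the call no longer depends on the PATH's 'python'
        subprocess.run([sys.executable, str(pdb_reres_path), pdb_path],
                       stdout=output_file,
                       check=True,
                       text=True)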

Reproducibility: Datasets required

Hello!

Great work on this project!

In order to reproduce the performance evaluation, I would like to know which datasets were used for training and validating your approach. While the methodology in your paper is quite thorough, it does not offer a deterministic approach to generate these, and I can't find anything in your repo either.

Could you please either publish the identifiers of your data splits or provide some other means to reproduce the datasets?

Best regards,
Nicole

Submit script should loop around archive files

Currently the submit script will work on a file containing a list of archive files.

The general idea is:

  • extract all the PDB files from the archives into a given directory
  • process all the PDB files

This means that there is heavy network IO at the start of the job. It would be better to spread the network load out to points throughout the job, so:

for each archive file:

  • extract all PDB files
  • process all PDB files

One extra bonus is that this provides the opportunity to record a milestone once a given archive file has been processed correctly (so it can be skipped on future runs).
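
A sketch of that per-archive loop, with a simple marker file as the milestone (the run_prediction wrapper, the tar archive format and the .done convention are all assumptions):

from pathlib import Path
import tarfile
import tempfile

def process_archives(archive_list_file, results_dir):
    for archive in Path(archive_list_file).read_text().split():
        milestone = Path(results_dir) / (Path(archive).name + ".done")
        if milestone.exists():
            continue  # archive already processed on a previous run
        with tempfile.TemporaryDirectory() as tmpdir:
            with tarfile.open(archive) as tar:
                tar.extractall(tmpdir)  # network IO happens per archive
            for pdb_file in sorted(Path(tmpdir).glob("**/*.pdb")):
                run_prediction(pdb_file)  # assumed wrapper around get_predictions.py
        milestone.touch()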

Change zero-indexed output behaviour

Current behaviour is to output domain predictions based on 0-indexed sequential residue positions in the PDB file.
Better would be 1-indexed, to match AlphaFold structures.
Best would be indexing that matches the author numbers in the provided PDB file.

Zip extraction performance

Would operating on proteome files be faster, since we could in that case simply use unzip? If so, by how much?

The partitioned zip index files have 10000 AF structures each (there are 18897 partitions). If we ran one prediction per second, this would be 166 minutes per partition, which would mean that unzipping for 30 minutes is a substantial fraction of the overall cost.

Are these partitioned index files processed in important ways? (e.g. I guess they might have unique chains only?)

Output chopping by residue labels (as well as sequential numbering)

We have started using chainsaw to identify domain boundaries in PDB files.

The residue labels in AlphaFold structures are numeric and sequential (1-n). The residue labels in PDB files are much less predictable. They do not always start at 1, they can skip numbers, can include negative numbers, can have optional "insert code" characters, etc, etc.

At the moment, chainsaw outputs chopping boundaries that correspond to the sequential numbering of the amino acid sequence in the PDB file, i.e. it completely ignores the residue labels in the PDB file. This makes it problematic to map domain boundaries back onto the structural data.

However, when parsing the PDB file, Chainsaw will have access to both the residue labels AND the sequential numbering. So it would be relatively easy to output the same chopping in two forms: numeric and PDB labels.
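
Given the model_res_label_by_index mapping that Chainsaw already builds (visible in the traceback of the issue above), emitting both forms could look something like this sketch; the segment tuple format is an assumption:

def format_chopping(segments, res_label_by_index):
    """Render a chopping in sequential-numbering and PDB-label forms.

    segments: list of (start, end) tuples in 1-based sequential indexing.
    res_label_by_index: {sequential_index: pdb_author_label}.
    """
    numeric = ",".join(f"{s}-{e}" for s, e in segments)
    labelled = ",".join(
        f"{res_label_by_index[s]}-{res_label_by_index[e]}" for s, e in segments
    )
    return numeric, labelled

# e.g. format_chopping([(4, 68)], {...}) -> ("4-68", "7-71"),
# matching the "Segments (index to label)" log line above.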
