mpc-bioinformatics / protgraph

ProtGraph - A Graph-Generator for Proteins

License: Other

Python 98.98% Shell 1.02%
graph peptide fasta protein python3 pypi bioconda uniprot

protgraph's People

Contributors

antonneubauer, luxxii

protgraph's Issues

PyPI Integration and Documentation

  • PyPI integration (automatically via action)
  • Documentation cleanup
  • Add a cheat sheet for arguments (since we have a lot of them!) to README.md
  • Available via CLI (protgraph

INIT_MET has errors on Proteins: A0A4X1VEZ3 and F1SN05

Here are the actual error messages:

Accession: F1SN05, Aminoacid(s): None, Position: 1
Additional Context: No M found to skip for the feature INIT_MET for the given cases
Message: type: INIT_MET
location: [0:1]
qualifiers:
    Key: evidence, Value: ECO:0000256|HAMAP-Rule:MF_03009
    Key: note, Value: Removed
Accession: A0A4X1VEZ3, Aminoacid(s): None, Position: 1
Additional Context: No M found to skip for the feature INIT_MET for the given cases
Message: type: INIT_MET
location: [0:1]
qualifiers:
    Key: evidence, Value: ECO:0000256|HAMAP-Rule:MF_03009
    Key: note, Value: Removed

Outsource Parsing into Processes

ProtGraph currently has a single reading process, which also parses each entry via Biopython. We could move this parsing (as well as the blacklist check) into the graph-generating processes.

This could speed up graph generation even further: with a high number of processes, we currently observe that the reader is not fast enough to keep the consumers busy.
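
The idea can be sketched as follows, assuming the input is a UniProt flat file in which a line containing only "//" ends an entry. The reader then only splits the file into raw text blocks, and the expensive parsing happens later in each consumer. The names iter_raw_entries and parse_in_consumer are hypothetical and not part of ProtGraph; the parse step is a stand-in for the real Biopython call.

```python
import io


def iter_raw_entries(handle):
    """Yield each flat-file entry as one raw text block, without parsing it."""
    lines = []
    for line in handle:
        lines.append(line)
        # A line containing only "//" terminates an entry in the flat-file format.
        if line.rstrip() == "//":
            yield "".join(lines)
            lines = []


def parse_in_consumer(raw_entry):
    """Stand-in for the expensive parse step that would run in a consumer.

    Hypothetical placeholder: a real consumer would hand the raw text to
    Biopython here. This version only extracts the first accession.
    """
    for line in raw_entry.splitlines():
        if line.startswith("AC "):
            return line[2:].strip().rstrip(";").split(";")[0].strip()
    return None


# Two made-up, heavily shortened entries for illustration only.
example = io.StringIO(
    "ID   A_TEST\nAC   P00001;\n//\n"
    "ID   B_TEST\nAC   P00002; Q00002;\n//\n"
)
raw_entries = list(iter_raw_entries(example))
accessions = [parse_in_consumer(raw) for raw in raw_entries]
```

With this split, the reader does nothing but string concatenation, so the per-entry cost in the single reading process stays small.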

Protein Q9R1E6 generates parallel edges, how do we want to handle parallel edges?

FT   MUTAGEN         12..27
FT                   /note="Missing: Complete inhibition of secretion."
FT                   /evidence="ECO:0000269|PubMed:17208043"
FT   MUTAGEN         12..22
FT                   /note="Missing: Complete inhibition of secretion."
FT                   /evidence="ECO:0000269|PubMed:17208043"
FT   MUTAGEN         23..27
FT                   /note="Missing: No effect on secretion."
FT                   /evidence="ECO:0000269|PubMed:17208043"

These entries were selected specifically from this protein because they cause ProtGraph to generate parallel edges from 12 to 27, resulting from the three features.

What should we do about parallel edges in general in ProtGraph?
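
One possible answer, sketched under assumptions: collapse all edges that share the same source and target into a single edge and keep every feature that produced them as a list of qualifiers. The edge representation (plain tuples) and the name merge_parallel_edges are hypothetical and chosen only for illustration; they do not reflect ProtGraph's internal graph structure.

```python
from collections import defaultdict


def merge_parallel_edges(edges):
    """Merge edges sharing (source, target) into one edge carrying all qualifiers.

    `edges` is an iterable of (source, target, qualifier) tuples.
    """
    grouped = defaultdict(list)
    for source, target, qualifier in edges:
        grouped[(source, target)].append(qualifier)
    # Dictionaries keep insertion order, so the first-seen edge order is preserved.
    return [(s, t, quals) for (s, t), quals in grouped.items()]


# Made-up nodes and feature labels for illustration only.
edges = [
    ("a", "b", "feature_1"),
    ("a", "b", "feature_2"),
    ("b", "c", "feature_3"),
]
merged = merge_parallel_edges(edges)
```

Keeping the qualifiers as a list means no feature information is lost, at the cost of exporters having to handle edges with more than one qualifier.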

ProtGraph PR 0.30 consumes too much RAM

The newer, unreleased version of ProtGraph consumes roughly 5-7 GB of RAM per process. This can be an issue on servers with many cores and little available RAM.

However, there is a workaround: simply reduce the number of threads via -np.
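
As a rough aid for picking that number, the following sketch estimates how many processes fit into the available RAM, assuming the upper figure of 7 GB per process from above. The function name and its parameters are hypothetical; the result is only an estimate and ignores memory used by the reader and by other programs.

```python
import os


def max_processes_for_ram(available_gb, per_process_gb=7.0, cpu_count=None):
    """Estimate how many processes fit into RAM, capped by the number of cores."""
    if per_process_gb <= 0:
        raise ValueError("per_process_gb must be positive")
    cpu_count = cpu_count or os.cpu_count() or 1
    by_ram = int(available_gb // per_process_gb)
    # Always allow at least one process, never more than there are cores.
    return max(1, min(cpu_count, by_ram))


# Example: a server with 32 cores but only 64 GB of RAM.
suggested = max_processes_for_ram(available_gb=64, cpu_count=32)
```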

Bioconda: add license to recipe

The license file is currently missing from the Bioconda recipe as well as from our PyPI package. We need to add it.

Error in PepFasta, when retrieving substitution information

An error occurs for variants when we try to retrieve the substitution:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "../../Luxxii/ProtGraph/utilities/generate_fasta_from_peppg_pickle_export.py", line 121, in execute
    peptide, part_header = get_pep_and_header_def(row[0], row[1], base_folder)
  File "../../Luxxii/ProtGraph/utilities/generate_fasta_from_peppg_pickle_export.py", line 141, in get_pep_and_header_def
    l_str_qualifiers = PF._get_qualifiers(graph, edges)
  File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/export/peptides/pep_fasta.py", line 100, in _get_qualifiers
    + str(f.location.end) + "," + self._get_variant_qualifier(f) + "]"
  File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/export/peptides/pep_fasta.py", line 118, in _get_variant_qualifier
    message = message[:message.index("(")-1]
ValueError: substring not found
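
The traceback shows that message.index("(") raises when the note contains no opening parenthesis. A minimal sketch of a tolerant variant, assuming the intent is to keep only the text before the first "(": str.find returns -1 instead of raising, so the whole message can be kept in that case. The function name strip_parenthetical is hypothetical, and unlike the original slice (which drops exactly one character before the parenthesis) this version strips any trailing whitespace.

```python
def strip_parenthetical(message):
    """Return the text before the first "(", or the whole message if there is none."""
    idx = message.find("(")  # -1 if "(" is absent, unlike str.index, which raises
    if idx == -1:
        return message.strip()
    return message[:idx].rstrip()


# Made-up notes for illustration only.
with_parenthesis = strip_parenthetical("A -> G (some additional remark)")
without_parenthesis = strip_parenthetical("Missing")
```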

Potential Bottlenecks in ProtGraph

I ran a line profiler on the generate_graph_consumer method using the human_review dataset (20k proteins).

It gives me the following output:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   114                                           def generate_graph_consumer(entry_queue, graph_queue, common_out_queue, proc_id, **kwargs):
   115                                               """
   116                                               TODO
   117                                               describe kwargs and consumer until a graph is generated and digested etc ...
   118                                               """
   119                                               # Set proc id
   120         1          3.0      3.0      0.0      kwargs["proc_id"] = proc_id
   121                                           
   122                                               # Set feature_table dict boolean table
   123         1          1.0      1.0      0.0      ft_dict = dict()
   124         1          1.0      1.0      0.0      if kwargs["feature_table"] is None or len(kwargs["feature_table"]) == 0 or "ALL" in kwargs["feature_table"]:
   125         1          6.0      6.0      0.0          ft_dict = dict(VARIANT=True, VAR_SEQ=True, SIGNAL=True, INIT_MET=True, MUTAGEN=True, CONFLICT=True)
   126                                               else:
   127                                                   for i in kwargs["feature_table"]:
   128                                                       ft_dict[i] = True
   129                                           
   130                                               # Initialize the exporters for graphs
   131         1         58.0     58.0      0.0      graph_exporters = Exporters(**kwargs)
   132                                           
   133                                               while True:
   134                                                   # Get next entry
   135     20387    7568810.0    371.3      0.7          entry = entry_queue.get()
   136                                           
   137                                                   # Stop if entry is None
   138     20387      18156.0      0.9      0.0          if entry is None:
   139                                                       # --> Stop Condition of Process
   140         1          3.0      3.0      0.0              break
   141                                           
   142                                                   # Beginning of Graph-Generation
   143                                                   # We also collect interesting information here!
   144                                           
   145                                                   # Generate canonical graph (initialization of the graph)
   146     20386    4416910.0    216.7      0.4          graph = _generate_canonical_graph(entry.sequence, entry.accessions[0])
   147                                           
   148                                                   # FT parsing and appending of Nodes and Edges into the graph
   149                                                   # The amount of isoforms, etc.. can be retrieved on the fly
   150     20386      22188.0      1.1      0.0          num_isoforms, num_initm, num_signal, num_variant, num_mutagens, num_conficts =\
   151     20386  312577641.0  15333.0     29.4              _include_ft_information(entry, graph, ft_dict)
   152                                           
   153                                                   # Replace Amino Acids based on user defined rules: E.G.: "X -> A,B,C"
   154     20386      83272.0      4.1      0.0          replace_aa(graph, kwargs["replace_aa"])
   155                                           
   156                                                   # Digest graph with enzyme (unlimited miscleavages)
   157     20386  457306111.0  22432.4     43.0          num_of_cleavages = digest(graph, kwargs["digestion"])
   158                                           
   159                                                   # Merge (summarize) graph if wanted
   160     20386      29893.0      1.5      0.0          if not kwargs["no_merge"]:
   161     20386  268518281.0  13171.7     25.3              merge_aminoacids(graph)
   162                                           
   163                                                   # Collapse parallel edges in a graph
   164     20386      29694.0      1.5      0.0          if not kwargs["no_collapsing_edges"]:
   165     20386   10804029.0    530.0      1.0              collapse_parallel_edges(graph)
   166                                           
   167                                                   # Annotate weights for edges and nodes (maybe even the smallest weight possible to get to the end node)
   168     20386     948172.0     46.5      0.1          annotate_weights(graph, **kwargs)
   169                                           
   170                                                   # Calculate statistics on the graph:
   171     20386      11921.0      0.6      0.0          (
   172     20386      12094.0      0.6      0.0              num_nodes, num_edges, num_paths, num_paths_miscleavages, num_paths_hops,
   173     20386       9768.0      0.5      0.0              num_paths_var, num_path_mut, num_path_con
   174     20386     297176.0     14.6      0.0          ) = get_statistics(graph, **kwargs)
   175                                           
   176                                                   # Verify graphs if wanted:
   177     20386      11624.0      0.6      0.0          if kwargs["verify_graph"]:
   178                                                       verify_graph(graph)
   179                                           
   180                                                   # Persist or export graphs with speicified exporters
   181     20386      38415.0      1.9      0.0          graph_exporters.export_graph(graph, common_out_queue)
   182                                           
   183                                                   # Output statistics we gathered during processing
   184     20386      10500.0      0.5      0.0          if kwargs["no_description"]:
   185                                                       entry_protein_desc = None
   186                                                   else:
   187     20386      37338.0      1.8      0.0              entry_protein_desc = entry.description.split(";", 1)[0]
   188     20386      37142.0      1.8      0.0              entry_protein_desc = entry_protein_desc[entry_protein_desc.index("=") + 1:]
   189                                           
   190     40772     312422.0      7.7      0.0          graph_queue.put(
   191     20386      12337.0      0.6      0.0              (
   192     20386      11818.0      0.6      0.0                  entry.accessions[0],  # Protein Accesion
   193     20386      10432.0      0.5      0.0                  entry.entry_name,  # Protein displayed name
   194     20386       9196.0      0.5      0.0                  num_isoforms,  # Number of Isoforms
   195     20386       9231.0      0.5      0.0                  num_initm,  # Number of Init_M (either 0 or 1)
   196     20386       9244.0      0.5      0.0                  num_signal,  # Number of Signal Peptides used (either 0 or 1)
   197     20386       9232.0      0.5      0.0                  num_variant,  # Number of Variants applied to this protein
   198     20386       9227.0      0.5      0.0                  num_mutagens,  # Number of applied mutagens on the graph
   199     20386       9231.0      0.5      0.0                  num_conficts,  # Number of applied conflicts on the graph
   200     20386       9274.0      0.5      0.0                  num_of_cleavages,  # Number of cleavages (marked edges) this protein has
   201     20386       9240.0      0.5      0.0                  num_nodes,  # Number of nodes for the Protein/Peptide Graph
   202     20386       9269.0      0.5      0.0                  num_edges,  # Number of edges for the Protein/Peptide Graph
   203     20386       9311.0      0.5      0.0                  num_paths,  # Possible (non repeating paths) to the end of a graph. (may conatin repeating peptides)
   204     20386       9318.0      0.5      0.0                  num_paths_miscleavages,  # As num_paths, but binned to the number of miscleavages (by list idx, at 0)
   205     20386       9288.0      0.5      0.0                  num_paths_hops,  # As num_paths, only that we bin by hops (E.G. useful for determine DFS or BFS depths)
   206     20386       9363.0      0.5      0.0                  num_paths_var,  # Num paths of feture variant
   207     20386       9519.0      0.5      0.0                  num_path_mut,  # Num paths of feture mutagen
   208     20386       9508.0      0.5      0.0                  num_path_con,  # Num paths of feture conflict
   209     20386       9476.0      0.5      0.0                  entry_protein_desc,  # Description name of the Protein (can be lenghty)
   210                                                       )
   211                                                   )
   212                                           
   213                                               # Close exporters (maybe opened files, database connections, etc... )
   214         1         13.0     13.0      0.0      graph_exporters.close()

Bottlenecks are:

  • Merge Aminoacids (~25%)
  • Apply Features (~29%)
  • Digestion (~43%)
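
The figures can be cross-checked with a few lines of arithmetic. The sketch below copies the Time and % Time values of the three calls from the profiler output above and computes their combined share and their cost per protein. Times are left in the profiler's timer units, since the unit is not shown in the output.

```python
# (total time in timer units, % of total time), copied from the profiler output.
hot_lines = {
    "_include_ft_information": (312577641.0, 29.4),
    "digest": (457306111.0, 43.0),
    "merge_aminoacids": (268518281.0, 25.3),
}
hits = 20386  # number of proteins processed, i.e. hits per profiled line

# Share of the total runtime spent in the three calls together.
combined_share = round(sum(share for _, share in hot_lines.values()), 1)

# Average cost per protein, in timer units.
per_protein = {name: time / hits for name, (time, _) in hot_lines.items()}

# Calls ordered from most to least expensive.
ranking = sorted(hot_lines, key=lambda name: hot_lines[name][0], reverse=True)
```

Together the three calls account for nearly all of the measured time, so optimizations elsewhere in the consumer loop are unlikely to matter much.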

Split ProtGraph into possibly two projects?

Currently we have one large project, which reads SwissProt-EMBL entries and generates graphs from them.

Currently it is not possible to retrieve the graphs directly from Python. It is only possible via Pickle, by saving and loading the graphs separately.

Currently there are many consumer threads and one producer thread, which perform very basic operations. These operations could be separated into another project, so that ProtGraph imports them as a library.

Pep Export depending on mass

Instead of using the number of AAs, we could (more precisely) use a range of allowed masses.

This could be a new CLI parameter in ProtGraph.
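
A minimal sketch of such a filter, assuming unmodified peptides and monoisotopic masses: the peptide mass is the sum of its residue masses plus one water. The function names and the example range are hypothetical and do not correspond to an existing ProtGraph parameter.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
MONOISOTOPIC_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056  # one H2O is added for the free N- and C-terminus


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide; raises KeyError on unknown letters."""
    return sum(MONOISOTOPIC_RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS


def within_mass_range(peptide, min_mass, max_mass):
    """True if the peptide mass lies within the closed interval [min_mass, max_mass]."""
    return min_mass <= peptide_mass(peptide) <= max_mass


# Keep only peptides between 500 and 1000 Da (range chosen for illustration).
peptides = ["GG", "PEPTIDE", "W"]
kept = [p for p in peptides if within_mass_range(p, 500.0, 1000.0)]
```

Unlike a length limit, this treats a short peptide of heavy residues and a long peptide of light residues according to what a mass spectrometer would actually see.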

Export: JanusGraph

We currently only have exports to files and databases (Redis and Postgres).

Those are not well suited to traversing and processing graphs (Postgres is actually very performant if the recursion depth is not large).

We need to use dedicated graph-processing algorithms/databases for such large graphs. Here, JanusGraph is tested.

Error occurs when digesting via full

An uncaught error occurs when digesting via the digestion method full.

Here is the stack trace:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/graph_generator.py", line 119, in generate_graph_consumer
    num_of_cleavages = digest(graph, kwargs["digestion"])
  File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/digestion.py", line 11, in digest
    return dict(
  File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/digestion.py", line 114, in _digest_via_full
    end_out.remove(i)
ValueError: list.remove(x): x not in list

This happens with the recent version of ProtGraph, 0.1.0. It seems that a case is not considered here.
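
The exception itself comes from list.remove, which raises ValueError when the element is not in the list. The sketch below shows a guarded removal that reports whether anything was removed. It only illustrates the failure mode: the surrounding logic of _digest_via_full is not shown in the traceback, so whether skipping the missing element is the correct fix is an open question. The helper name and the example values are hypothetical.

```python
def remove_if_present(items, value):
    """Remove value from the list if present; return True if something was removed."""
    try:
        items.remove(value)
    except ValueError:
        # list.remove raises ValueError when the value is not in the list.
        return False
    return True


# Made-up values for illustration only.
end_out = [3, 5, 8]
removed_existing = remove_if_present(end_out, 5)
removed_missing = remove_if_present(end_out, 7)
```

The return value makes it possible to log the unexpected case instead of silently ignoring it.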

Conda channel priority problems

At least in my (clean) miniconda environment, I needed to run
conda install -c bioconda --no-channel-priority protgraph
to get protgraph installed, with Python 3.9.16 as the only additional package installed beforehand.
Maybe add a hint about this to the documentation.

Only apply variants which are significant (or unknown or other)

When looking at the Feature Viewer in UniProt for a single protein, there are options to select between "Likely Disease", "Predicted Consequences", etc.

This information is parsed from UniProt via the note= information.

If we want to apply only significant (or otherwise interesting) variants to a protein, we should also implement such filtering.

As a general rule: every variant whose note uses "in XXX" is a variant that likely causes the disease XXX.

Everything that contains "Unknown" or something similar is then categorized separately.

Maybe we should send a message to the UniProt team asking how they bin those variants (do they have specific keywords?).
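
The heuristic described above could be sketched like this. It is only an illustration of the two rules ("unknown" anywhere in the note, otherwise "in XXX" as a disease hint); the category names, the function name and the example notes are made up, and the actual keywords UniProt uses would still need to be confirmed.

```python
import re

# Captures the text following "in " up to the next ";" or ")".
IN_DISEASE = re.compile(r"\bin ([^;)]+)")


def classify_variant_note(note):
    """Classify a variant note as ("unknown" | "disease" | "other", disease or None)."""
    if "unknown" in note.lower():
        return ("unknown", None)
    match = IN_DISEASE.search(note)
    if match:
        return ("disease", match.group(1).strip())
    return ("other", None)


# Made-up notes for illustration only.
disease = classify_variant_note("R -> Q (in XXX)")
unknown = classify_variant_note("A -> T (Unknown significance)")
other = classify_variant_note("L -> P")
```

A filter could then keep only the variants whose category was requested by the user.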

Optimization of Merge Aminoacids

The current implementation needs a lot of time to process huge graphs (e.g. for the protein Titin).

We should at some point look into the implementation of merging nodes/amino acids and optimize it further.

Updating PTMs

Currently we only provide N-terminal (and C-terminal) peptide PTMs as well as PTMs on amino acids. There are also other kinds of PTMs that ProtGraph should be able to map.

PEPTIDE and PROPEP (CHAIN, maybe even SIGNAL peptides) may have a unique identifier
