mpc-bioinformatics / protgraph

ProtGraph - A Graph-Generator for Proteins

License: Other

Python 98.98% Shell 1.02%

graph peptide fasta protein python3 pypi bioconda uniprot

protgraph's People

luxxii ipark2021 tabeasays antonneubauer enmingguo

protgraph's Issues

Protein P20729 seems to be generated incorrectly

Generating a graph from protein P20729 yields an edge that connects the start and the end node directly.

This is something that should not happen.
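A check for this situation could look roughly like the sketch below. It assumes the graph is available as a plain list of (source, target) pairs and that the start and end node ids are known; the helper name and the toy graph are hypothetical and not part of ProtGraph.

```python
def has_direct_start_end_edge(edges, start, end):
    """Return True if any edge connects the start node directly to the end node."""
    return any(src == start and dst == end for src, dst in edges)


# Hypothetical toy graph: node 0 is the start node, node 3 is the end node.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
print(has_direct_start_end_edge(edges, start=0, end=3))
```

Such a check could be run over all generated graphs to collect the affected accessions.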

Extend VC-Count in BPCSR Output to consider not only VARIANTs but also user-chosen Features

See title.

Specifically: we want to parameterize the following line:

https://github.com/mpc-bioinformatics/ProtGraph/blob/master/protgraph/export/pcsr.py#L82

PyPi Integration and Documentation

PyPi Integration (automatically via action)
Documentation cleanup
Add a cheat sheet for the arguments (since we have a lot of them!) to README.md
Available via CLI (protgraph

INIT_MET has errors on Proteins: A0A4X1VEZ3 and F1SN05

Here are the actual error messages:

Accession: F1SN05, Aminoacid(s): None, Position: 1
Additional Context: No M found to skip for the feature INIT_MET for the given cases
Message: type: INIT_MET
location: [0:1]
qualifiers:
    Key: evidence, Value: ECO:0000256|HAMAP-Rule:MF_03009
    Key: note, Value: Removed

Accession: A0A4X1VEZ3, Aminoacid(s): None, Position: 1
Additional Context: No M found to skip for the feature INIT_MET for the given cases
Message: type: INIT_MET
location: [0:1]
qualifiers:
    Key: evidence, Value: ECO:0000256|HAMAP-Rule:MF_03009
    Key: note, Value: Removed

Outsource Parsing into Processes

ProtGraph currently has a single reading process, which also parses each entry via biopython. We could move the parsing (as well as the blacklist handling) into the graph-generating processes.

This could speed up graph generation even further: with a high number of processes, we currently observe that the reader is not fast enough to keep the consumers busy.
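The idea can be sketched as follows: the reader only cuts the file into raw entry strings (cheap), and each consumer parses its own entries (expensive). This is a minimal illustration, not ProtGraph's code; `split_raw_entries` and `consume` are hypothetical names, and the toy "parse" merely stands in for the biopython call.

```python
def split_raw_entries(lines):
    """Yield each flat-file entry as one raw string, without parsing it.

    Entries in the UniProt text format end with a line containing only "//".
    """
    buffer = []
    for line in lines:
        buffer.append(line)
        if line.rstrip("\r\n") == "//":
            yield "".join(buffer)
            buffer = []


def consume(raw_entry):
    """Hypothetical per-process work: parse the raw entry here, then build the graph."""
    # Toy stand-in for the real parsing: return the entry name from the ID line.
    return raw_entry.split(None, 2)[1]


lines = ["ID   A_TEST\n", "SQ   ...\n", "//\n", "ID   B_TEST\n", "//\n"]
raw_entries = list(split_raw_entries(lines))
print([consume(entry) for entry in raw_entries])
```

In a real setup the raw strings would be put on the entry queue, so the costly parsing runs in parallel across the consumer processes.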

Protein Q9R1E6 generates parallel edges, how do we want to handle parallel edges?

FT   MUTAGEN         12..27
FT                   /note="Missing: Complete inhibition of secretion."
FT                   /evidence="ECO:0000269|PubMed:17208043"
FT   MUTAGEN         12..22
FT                   /note="Missing: Complete inhibition of secretion."
FT                   /evidence="ECO:0000269|PubMed:17208043"
FT   MUTAGEN         23..27
FT                   /note="Missing: No effect on secretion."
FT                   /evidence="ECO:0000269|PubMed:17208043"

These entries were selected specifically from this protein: together, the three features cause ProtGraph to generate parallel edges from 12 to 27.

What should we do about parallel edges in general in ProtGraph?
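One option is to collapse parallel edges into a single edge that keeps all qualifiers. The sketch below shows this on a plain edge list; the function name, the tuple layout and the qualifier strings are assumptions made for illustration and do not reflect ProtGraph's internal graph representation.

```python
from collections import defaultdict


def group_parallel_edges(edges):
    """Merge edges sharing the same (source, target) pair and collect their qualifiers."""
    merged = defaultdict(list)
    for source, target, qualifier in edges:
        merged[(source, target)].append(qualifier)
    return dict(merged)


# Hypothetical edge list: two edges connect the same pair of nodes.
edges = [
    (12, 27, "MUTAGEN 12..27"),
    (12, 27, "MUTAGEN 12..22 + MUTAGEN 23..27"),
    (12, 13, "canonical"),
]
grouped = group_parallel_edges(edges)
print(grouped[(12, 27)])
```

The alternative is to keep parallel edges as they are, which preserves one path per feature combination at the cost of a larger graph and more paths.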

Protgraph PR 0.30 consumes too much RAM

The newer, unreleased version of ProtGraph consumes roughly 5-7 GB of RAM per process. This can be an issue on servers with many cores and little available RAM.

However, there is a workaround: simply reduce the number of threads (-np).

Dependency apsw is not provided in PyPI

It looks like apsw is not officially in PyPI.

Last working version in ProtGraph: pip install apsw==3.8.11.1-r1

No Isoform generation for Q9QXS1 possible

It is currently not possible to generate isoforms for the protein https://www.uniprot.org/uniprot/Q9QXS1.txt

This should be fixed. Instead of handcrafted parsing (using ,) we should use other mechanisms to check whether the variational sequences are present in FT and in CC.

Generated FASTAs contain OR[|None] entries, which contain no information

This is probably due to cleavages directly at the ending or beginning of a protein. Removing such entries could slightly reduce the exported FASTA.

Bioconda add License to recipe

The license file is currently missing in the recipe for BioConda as well as in our PyPI package. We need to add it.

Error in PepFasta, when retrieving substitution information

An error exists in variants, where we actually try to retrieve the substitution

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "../../Luxxii/ProtGraph/utilities/generate_fasta_from_peppg_pickle_export.py", line 121, in execute
    peptide, part_header = get_pep_and_header_def(row[0], row[1], base_folder)
  File "../../Luxxii/ProtGraph/utilities/generate_fasta_from_peppg_pickle_export.py", line 141, in get_pep_and_header_def
    l_str_qualifiers = PF._get_qualifiers(graph, edges)
  File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/export/peptides/pep_fasta.py", line 100, in _get_qualifiers
    + str(f.location.end) + "," + self._get_variant_qualifier(f) + "]"
  File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/export/peptides/pep_fasta.py", line 118, in _get_variant_qualifier
    message = message[:message.index("(")-1]
ValueError: substring not found
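The traceback shows that `message.index("(")` raises when the note contains no opening parenthesis. A defensive variant could look like the sketch below; the helper name and the example notes are hypothetical, and whether returning the whole message is the right fallback for ProtGraph is an assumption.

```python
def strip_parenthesized_suffix(message):
    """Return the text before the first "(", or the whole message if there is none."""
    head, _sep, _tail = message.partition("(")
    return head.rstrip()


# Hypothetical note strings, one with and one without a parenthesized part.
print(strip_parenthesized_suffix("A -> T (in a sample)"))
print(strip_parenthesized_suffix("Missing"))
```

`str.partition` never raises, so notes without a parenthesized part pass through unchanged.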

Potential Bottlenecks in ProtGraph

I ran a line profiler on the generate_graph_consumer method on the human_review dataset (20k proteins)

It gives me the following output:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   114                                           def generate_graph_consumer(entry_queue, graph_queue, common_out_queue, proc_id, **kwargs):
   115                                               """
   116                                               TODO
   117                                               describe kwargs and consumer until a graph is generated and digested etc ...
   118                                               """
   119                                               # Set proc id
   120         1          3.0      3.0      0.0      kwargs["proc_id"] = proc_id
   121                                           
   122                                               # Set feature_table dict boolean table
   123         1          1.0      1.0      0.0      ft_dict = dict()
   124         1          1.0      1.0      0.0      if kwargs["feature_table"] is None or len(kwargs["feature_table"]) == 0 or "ALL" in kwargs["feature_table"]:
   125         1          6.0      6.0      0.0          ft_dict = dict(VARIANT=True, VAR_SEQ=True, SIGNAL=True, INIT_MET=True, MUTAGEN=True, CONFLICT=True)
   126                                               else:
   127                                                   for i in kwargs["feature_table"]:
   128                                                       ft_dict[i] = True
   129                                           
   130                                               # Initialize the exporters for graphs
   131         1         58.0     58.0      0.0      graph_exporters = Exporters(**kwargs)
   132                                           
   133                                               while True:
   134                                                   # Get next entry
   135     20387    7568810.0    371.3      0.7          entry = entry_queue.get()
   136                                           
   137                                                   # Stop if entry is None
   138     20387      18156.0      0.9      0.0          if entry is None:
   139                                                       # --> Stop Condition of Process
   140         1          3.0      3.0      0.0              break
   141                                           
   142                                                   # Beginning of Graph-Generation
   143                                                   # We also collect interesting information here!
   144                                           
   145                                                   # Generate canonical graph (initialization of the graph)
   146     20386    4416910.0    216.7      0.4          graph = _generate_canonical_graph(entry.sequence, entry.accessions[0])
   147                                           
   148                                                   # FT parsing and appending of Nodes and Edges into the graph
   149                                                   # The amount of isoforms, etc.. can be retrieved on the fly
   150     20386      22188.0      1.1      0.0          num_isoforms, num_initm, num_signal, num_variant, num_mutagens, num_conficts =\
   151     20386  312577641.0  15333.0     29.4              _include_ft_information(entry, graph, ft_dict)
   152                                           
   153                                                   # Replace Amino Acids based on user defined rules: E.G.: "X -> A,B,C"
   154     20386      83272.0      4.1      0.0          replace_aa(graph, kwargs["replace_aa"])
   155                                           
   156                                                   # Digest graph with enzyme (unlimited miscleavages)
   157     20386  457306111.0  22432.4     43.0          num_of_cleavages = digest(graph, kwargs["digestion"])
   158                                           
   159                                                   # Merge (summarize) graph if wanted
   160     20386      29893.0      1.5      0.0          if not kwargs["no_merge"]:
   161     20386  268518281.0  13171.7     25.3              merge_aminoacids(graph)
   162                                           
   163                                                   # Collapse parallel edges in a graph
   164     20386      29694.0      1.5      0.0          if not kwargs["no_collapsing_edges"]:
   165     20386   10804029.0    530.0      1.0              collapse_parallel_edges(graph)
   166                                           
   167                                                   # Annotate weights for edges and nodes (maybe even the smallest weight possible to get to the end node)
   168     20386     948172.0     46.5      0.1          annotate_weights(graph, **kwargs)
   169                                           
   170                                                   # Calculate statistics on the graph:
   171     20386      11921.0      0.6      0.0          (
   172     20386      12094.0      0.6      0.0              num_nodes, num_edges, num_paths, num_paths_miscleavages, num_paths_hops,
   173     20386       9768.0      0.5      0.0              num_paths_var, num_path_mut, num_path_con
   174     20386     297176.0     14.6      0.0          ) = get_statistics(graph, **kwargs)
   175                                           
   176                                                   # Verify graphs if wanted:
   177     20386      11624.0      0.6      0.0          if kwargs["verify_graph"]:
   178                                                       verify_graph(graph)
   179                                           
   180                                                   # Persist or export graphs with speicified exporters
   181     20386      38415.0      1.9      0.0          graph_exporters.export_graph(graph, common_out_queue)
   182                                           
   183                                                   # Output statistics we gathered during processing
   184     20386      10500.0      0.5      0.0          if kwargs["no_description"]:
   185                                                       entry_protein_desc = None
   186                                                   else:
   187     20386      37338.0      1.8      0.0              entry_protein_desc = entry.description.split(";", 1)[0]
   188     20386      37142.0      1.8      0.0              entry_protein_desc = entry_protein_desc[entry_protein_desc.index("=") + 1:]
   189                                           
   190     40772     312422.0      7.7      0.0          graph_queue.put(
   191     20386      12337.0      0.6      0.0              (
   192     20386      11818.0      0.6      0.0                  entry.accessions[0],  # Protein Accesion
   193     20386      10432.0      0.5      0.0                  entry.entry_name,  # Protein displayed name
   194     20386       9196.0      0.5      0.0                  num_isoforms,  # Number of Isoforms
   195     20386       9231.0      0.5      0.0                  num_initm,  # Number of Init_M (either 0 or 1)
   196     20386       9244.0      0.5      0.0                  num_signal,  # Number of Signal Peptides used (either 0 or 1)
   197     20386       9232.0      0.5      0.0                  num_variant,  # Number of Variants applied to this protein
   198     20386       9227.0      0.5      0.0                  num_mutagens,  # Number of applied mutagens on the graph
   199     20386       9231.0      0.5      0.0                  num_conficts,  # Number of applied conflicts on the graph
   200     20386       9274.0      0.5      0.0                  num_of_cleavages,  # Number of cleavages (marked edges) this protein has
   201     20386       9240.0      0.5      0.0                  num_nodes,  # Number of nodes for the Protein/Peptide Graph
   202     20386       9269.0      0.5      0.0                  num_edges,  # Number of edges for the Protein/Peptide Graph
   203     20386       9311.0      0.5      0.0                  num_paths,  # Possible (non repeating paths) to the end of a graph. (may conatin repeating peptides)
   204     20386       9318.0      0.5      0.0                  num_paths_miscleavages,  # As num_paths, but binned to the number of miscleavages (by list idx, at 0)
   205     20386       9288.0      0.5      0.0                  num_paths_hops,  # As num_paths, only that we bin by hops (E.G. useful for determine DFS or BFS depths)
   206     20386       9363.0      0.5      0.0                  num_paths_var,  # Num paths of feture variant
   207     20386       9519.0      0.5      0.0                  num_path_mut,  # Num paths of feture mutagen
   208     20386       9508.0      0.5      0.0                  num_path_con,  # Num paths of feture conflict
   209     20386       9476.0      0.5      0.0                  entry_protein_desc,  # Description name of the Protein (can be lenghty)
   210                                                       )
   211                                                   )
   212                                           
   213                                               # Close exporters (maybe opened files, database connections, etc... )
   214         1         13.0     13.0      0.0      graph_exporters.close()

Bottlenecks are:

Merge Aminoacids (~25%)
Apply Features (~29%)
Digestion (~43%)

Include Feature CHAIN

The CHAIN feature-information can also be "cleaved" as stated in the documentation: https://www.uniprot.org/help/chain

ProtGraph should therefore also set those points as specific cleavage points (similar to PEPTIDE and PROPEP).

It was first noticed in https://www.uniprot.org/uniprotkb/P05067/entry (for Amyloid-Beta 40/42)

New Export Format PEFF

If we export FASTA, we should also support PEFF in some form.

Split Protgraph into possibly two Projects?

Currently we have one large project that reads SwissProt-EMBL files and generates graphs from them.

Currently it is not possible to retrieve the graphs via Python directly. It is only possible through Pickle, by saving and loading graphs separately.

Currently there are many consumers and one producer thread, which do very basic operations. These operations could be separated into another project, so that ProtGraph imports them as a library.

New Statistic: Get all possible Weights from a Protein Graph

We could use dynamic programming over a reverse topological sort, with sets, to propagate the possible weights to the start node.

According to some test scripts, the final set for proteins should be small (around 1/60).
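The proposal can be sketched as below. This is a minimal illustration under several assumptions: the graph is a DAG given as a successor dictionary, weights sit on nodes and are integers, and all names plus the toy graph are hypothetical. It uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def possible_weights(successors, node_weight, end):
    """Map every node to the set of path weights from that node to `end`."""
    # TopologicalSorter expects node -> predecessors. Passing successors instead
    # yields an order in which every node comes after all of its successors,
    # which is a reverse topological order of the original graph.
    order = TopologicalSorter(successors).static_order()
    weights = {}
    for node in order:
        if node == end:
            weights[node] = {node_weight[node]}
        else:
            # Combine this node's weight with every weight reachable via a successor.
            weights[node] = {
                node_weight[node] + w
                for succ in successors.get(node, ())
                for w in weights[succ]
            }
    return weights


# Hypothetical toy graph with integer node weights.
successors = {"start": ["A", "G"], "A": ["end"], "G": ["A", "end"], "end": []}
node_weight = {"start": 0, "A": 71, "G": 57, "end": 0}
weights = possible_weights(successors, node_weight, end="end")
print(sorted(weights["start"]))
```

Using sets means equal weights reached via different paths are stored only once, which is why the final set can be much smaller than the number of paths.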

FT: INIT_MET currently ignores ref (isoforms) and duplicates edges from the start node to the second canonical node

This is not correct and should be fixed. INIT_MET references the amino acid M explicitly from either an isoform (via the reference information) or from the canonical sequence (empty reference).

Project Name Suggestions

Suggestions:

Prograph (ProGraph)
Protgraph (ProtGraph)

Feel free to add other suggestions!

Pep Export depending on mass

Instead of using the number of AAs, we could (more precisely) use a range of allowed masses.

This could be a new CLI Parameter in ProtGraph
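A mass-based filter could look roughly like the following sketch. The residue table is only a small subset of monoisotopic masses (in Dalton), and the function names as well as the range bounds are assumptions for illustration, not existing ProtGraph parameters.

```python
# Monoisotopic residue masses in Dalton (subset for brevity).
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "K": 128.09496,
    "R": 156.10111,
}
WATER = 18.01056  # added once per peptide for the free termini


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def within_mass_range(peptide, min_mass, max_mass):
    """Return True if the peptide mass lies inside the allowed range (inclusive)."""
    return min_mass <= peptide_mass(peptide) <= max_mass


# Hypothetical range of 200 to 300 Da.
print([p for p in ["G", "GAK", "GAKR"] if within_mass_range(p, 200.0, 300.0)])
```

On the graph itself, the same bounds could be used to stop a traversal early once the accumulated mass exceeds the upper limit.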

FASTA Export & Full Cleavage

Bug? Edges can go directly from start to end

There are some proteins where such a case happens.

Example is needed!

Split the AminoAcids (B, Z maybe even X or others) into the concrete AminoAcids

We should or could add a small script which adds/changes features/sequences to expand letters that refer to more than one amino acid.

E.g.: B -> D or N, J -> I or L (and X -> A, C, ...)
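On plain sequences, the expansion can be illustrated as below. This is only a sketch: in ProtGraph the expansion would more likely be expressed as alternative nodes in the graph, and the function name is hypothetical. The mapping uses the standard ambiguity codes (Z stands for E or Q); X is left out here because it would expand to every amino acid.

```python
from itertools import product

# Ambiguity codes and the concrete amino acids they stand for.
AMBIGUOUS = {"B": "DN", "Z": "EQ", "J": "IL"}


def expand_ambiguous(sequence):
    """Return all concrete sequences obtained by resolving ambiguous letters."""
    choices = [AMBIGUOUS.get(aa, aa) for aa in sequence]
    return ["".join(combo) for combo in product(*choices)]


print(expand_ambiguous("ABJ"))
```

The number of results doubles with every ambiguous letter, which is one reason to represent the alternatives in the graph rather than as separate sequences.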

Error reading SP-EMBL files on Windows

It seems that the new reader is not able to read files from Windows. This may be due to the \r\n line endings.
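If the line endings are indeed the cause, normalizing them while decoding would be one way around it. The sketch below assumes the reader works on a binary stream; the function name is hypothetical and this is not how ProtGraph's reader is known to be implemented.

```python
import io


def read_lines_universal(binary_stream):
    """Decode a binary stream and normalize CRLF, CR and LF line endings to LF."""
    # newline=None enables universal newlines mode when reading.
    text = io.TextIOWrapper(binary_stream, encoding="utf-8", newline=None)
    return list(text)


# Simulated file content with Windows line endings.
windows_bytes = io.BytesIO(b"ID   X_TEST\r\n//\r\n")
print(read_lines_universal(windows_bytes))
```

With this, a comparison such as `line == "//\n"` behaves the same for files written on Windows and on Linux.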

Cannot parse Isoform Mutagens and Conflicts for specific Proteins

Some proteins with isoform mutagens and conflicts are currently skipped.

Example for Conflict: https://www.uniprot.org/uniprot/P52744.txt

Example for Mutagen: https://www.uniprot.org/uniprot/P35613.txt

Update Readme with new available Options and changing CLI

This should also include:

MUTAGEN
CONFLICT
Replacement Syntax
FT and Digestion new CLI options

Export: JanusGraph

We currently only have exports to files and databases (Redis and Postgres).

Those are not well suited to traversing and processing graphs (Postgres is actually very performant if the recursion depth is not large).

We need to use dedicated graph processing algorithms/databases for such large graphs. Here, JanusGraph is tested.

Error occurs when digesting via full

An uncaught error occurs when digesting via the digestion method full.

Here is the stack trace:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/graph_generator.py", line 119, in generate_graph_consumer
    num_of_cleavages = digest(graph, kwargs["digestion"])
  File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/digestion.py", line 11, in digest
    return dict(
  File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/digestion.py", line 114, in _digest_via_full
    end_out.remove(i)
ValueError: list.remove(x): x not in list

This happens with the recent version of ProtGraph, 0.1.0. It seems that a case is not considered here.

ProtGraph might not generate files while called through Python

The calls (like in the functional tests) might not work properly.

Here is an example:

        args = protgraph.parse_args([] + self.procs_num + self.example_files)
        protgraph.prot_graph(**args)

Conda channel priority problems

At least in my (clean) miniconda environment, I needed to run
conda install -c bioconda --no-channel-priority protgraph
to get protgraph installed, with Python 3.9.16 as the only additional package installed before.
Maybe mention this in the documentation.

Reimplementing (Refactoring) of Signal Peptides

The Protein which causes a Problem: P20729 (can produce "null"-Peptides/Proteins)

This should be easily fixable by using the new implementation of PEPTIDE or PROPEP.

Utilities should import Protgraph if methods are used from them

The scripts in the utilities folder should import the methods they use from ProtGraph instead of duplicating them. This reduces redundant code.

Add new Feature: CONFLICT

Only apply variants which are significant (or unknown or other)

When looking at the Feature Viewer in UniProt for a single protein, options appear that let you select between "Likely Disease", "Predicted Consequences", etc.

This information is parsed from UniProt via the note= information.

If we want to apply only significant (or otherwise interesting) variants to a protein, we should also implement such filtering.

As a general rule: everything that uses "in XXX" is a variant which likely causes the disease XXX.

Everything that contains Unknown or something similar is then categorized specifically.

Maybe we should send a message to the UniProt team asking how they bin those variants (do they have some specific keywords?).
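The heuristic described above could be prototyped as follows. This is a rough sketch of that rule only: the category names, the regular expression and the example notes are assumptions, and UniProt's actual binning is not known here.

```python
import re

# Matches "in <name>" up to the next ";" or ",".
DISEASE_PATTERN = re.compile(r"\bin\s+([^;,]+)")


def categorize_variant_note(note):
    """Bin a VARIANT note into 'unknown', 'disease:<name>' or 'other'."""
    if "unknown" in note.lower():
        return "unknown"
    match = DISEASE_PATTERN.search(note)
    if match:
        return "disease:" + match.group(1).strip()
    return "other"


# Hypothetical note strings.
for note in ["in XXX", "in XXX; unknown pathological significance", "No effect"]:
    print(note, "->", categorize_variant_note(note))
```

Note that the simple "in XXX" rule also matches notes that only reference a database identifier after "in", so a real filter would need additional keywords or an exclusion list.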

(SIGNAL)
https://github.com/mpc-bioinformatics/ProtGraph/blob/master/protgraph/export/peptides/pep_fasta.py#L103

(PEPTIDE)
https://github.com/mpc-bioinformatics/ProtGraph/blob/master/protgraph/export/peptides/pep_fasta.py#L116

(PROPEP)
https://github.com/mpc-bioinformatics/ProtGraph/blob/master/protgraph/export/peptides/pep_fasta.py#L112
