mpc-bioinformatics / protgraph
ProtGraph - A Graph-Generator for Proteins
License: Other
Generating a graph from protein P20729 yields an edge that connects the start and the end node directly.
This is something that should not happen.
See title.
Specifically: we want to parameterize the following line:
https://github.com/mpc-bioinformatics/ProtGraph/blob/master/protgraph/export/pcsr.py#L82
protgraph
Here are the actual error messages:
Accession: F1SN05, Aminoacid(s): None, Position: 1
Additional Context: No M found to skip for the feature INIT_MET for the given cases
Message: type: INIT_MET
location: [0:1]
qualifiers:
Key: evidence, Value: ECO:0000256|HAMAP-Rule:MF_03009
Key: note, Value: Removed
Accession: A0A4X1VEZ3, Aminoacid(s): None, Position: 1
Additional Context: No M found to skip for the feature INIT_MET for the given cases
Message: type: INIT_MET
location: [0:1]
qualifiers:
Key: evidence, Value: ECO:0000256|HAMAP-Rule:MF_03009
Key: note, Value: Removed
ProtGraph currently has a reading process, which also parses each entry via Biopython. We could move this parsing (as well as the blacklist check) into the graph-generating processes.
This could speed up the graph generation even further: with a high number of processes, we currently observe that the reader is not fast enough to keep the consumers busy.
FT MUTAGEN 12..27
FT /note="Missing: Complete inhibition of secretion."
FT /evidence="ECO:0000269|PubMed:17208043"
FT MUTAGEN 12..22
FT /note="Missing: Complete inhibition of secretion."
FT /evidence="ECO:0000269|PubMed:17208043"
FT MUTAGEN 23..27
FT /note="Missing: No effect on secretion."
FT /evidence="ECO:0000269|PubMed:17208043"
These entries are selected specifically in this protein, so that ProtGraph is going to generate a parallel edge from 12 to 27 by three features.
What should we do about parallel edges in general in ProtGraph?
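One option would be to merge parallel edges into a single edge that remembers every feature it came from. This is only a minimal sketch: it assumes edges are plain `(source, target, qualifier)` tuples instead of ProtGraph's real graph objects, and `merge_parallel_edges` as well as the node ids are hypothetical.

```python
from collections import defaultdict


def merge_parallel_edges(edges):
    """Merge edges sharing (source, target); keep each qualifier once."""
    merged = defaultdict(list)
    for source, target, qualifier in edges:
        if qualifier not in merged[(source, target)]:
            merged[(source, target)].append(qualifier)
    # One edge per (source, target) pair, carrying all qualifiers
    return [(s, t, quals) for (s, t), quals in merged.items()]


# Illustrative node ids only, not taken from a real graph
edges = [
    (11, 28, "MUTAGEN 12..27"),
    (11, 28, "MUTAGEN 12..22 + MUTAGEN 23..27"),
    (11, 12, "canonical"),
]
collapsed = merge_parallel_edges(edges)
```

The information about which features produced the edge survives the merge, so exporters could still report it.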
The newer, unreleased version of ProtGraph consumes roughly 5-7 GB of RAM per process. This can be an issue on servers with many cores and little available RAM.
However, there is a workaround: simply reduce the number of processes via -np.
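A rough way to pick a value for -np, assuming the upper bound of about 7 GB per process reported above. The helper `max_processes` is hypothetical and not part of ProtGraph; the available RAM and core count have to be supplied by the caller.

```python
def max_processes(available_ram_gb, cpu_count, ram_per_process_gb=7.0):
    """Return how many processes fit into RAM, capped by the core count."""
    by_ram = int(available_ram_gb // ram_per_process_gb)
    # Always allow at least one process, even on very small machines
    return max(1, min(cpu_count, by_ram))


# Example: 32 cores but only 64 GB of RAM
suggested_np = max_processes(available_ram_gb=64, cpu_count=32)
```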
It looks like apsw is not officially on PyPI.
Last working version in ProtGraph: pip install apsw==3.8.11.1-r1
It is currently not possible to generate isoforms for the protein https://www.uniprot.org/uniprot/Q9QXS1.txt
This should be fixed. Instead of handcrafted parsing (using ,) we should use other mechanisms to check whether the variational sequences are present in FT and in CC.
This is probably due to cleavages directly at the ending or beginning of a protein. Removing such entries could slightly reduce the exported FASTA.
The license file is currently missing in the recipe for BioConda as well as in our PyPI package. We need to add it.
An error occurs for variants when we try to retrieve the substitution:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "../../Luxxii/ProtGraph/utilities/generate_fasta_from_peppg_pickle_export.py", line 121, in execute
peptide, part_header = get_pep_and_header_def(row[0], row[1], base_folder)
File "../../Luxxii/ProtGraph/utilities/generate_fasta_from_peppg_pickle_export.py", line 141, in get_pep_and_header_def
l_str_qualifiers = PF._get_qualifiers(graph, edges)
File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/export/peptides/pep_fasta.py", line 100, in _get_qualifiers
+ str(f.location.end) + "," + self._get_variant_qualifier(f) + "]"
File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/export/peptides/pep_fasta.py", line 118, in _get_variant_qualifier
message = message[:message.index("(")-1]
ValueError: substring not found
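The traceback shows `str.index` raising because the note contains no "(". A minimal sketch of a non-raising alternative using `str.partition`; the helper name `strip_parenthetical` is hypothetical, and this is not necessarily how `_get_variant_qualifier` should finally be fixed.

```python
def strip_parenthetical(note):
    """Return the part of a note before the first '(' (or the whole note)."""
    # partition never raises: without "(" the whole string ends up in head
    head, _sep, _tail = note.partition("(")
    return head.rstrip()


with_paren = strip_parenthetical("E -> G (in AD1)")
without_paren = strip_parenthetical("Missing")
```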
I ran a line profiler on the generate_graph_consumer method on the human_review dataset (20k proteins).
It gives me the following output:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
114 def generate_graph_consumer(entry_queue, graph_queue, common_out_queue, proc_id, **kwargs):
115 """
116 TODO
117 describe kwargs and consumer until a graph is generated and digested etc ...
118 """
119 # Set proc id
120 1 3.0 3.0 0.0 kwargs["proc_id"] = proc_id
121
122 # Set feature_table dict boolean table
123 1 1.0 1.0 0.0 ft_dict = dict()
124 1 1.0 1.0 0.0 if kwargs["feature_table"] is None or len(kwargs["feature_table"]) == 0 or "ALL" in kwargs["feature_table"]:
125 1 6.0 6.0 0.0 ft_dict = dict(VARIANT=True, VAR_SEQ=True, SIGNAL=True, INIT_MET=True, MUTAGEN=True, CONFLICT=True)
126 else:
127 for i in kwargs["feature_table"]:
128 ft_dict[i] = True
129
130 # Initialize the exporters for graphs
131 1 58.0 58.0 0.0 graph_exporters = Exporters(**kwargs)
132
133 while True:
134 # Get next entry
135 20387 7568810.0 371.3 0.7 entry = entry_queue.get()
136
137 # Stop if entry is None
138 20387 18156.0 0.9 0.0 if entry is None:
139 # --> Stop Condition of Process
140 1 3.0 3.0 0.0 break
141
142 # Beginning of Graph-Generation
143 # We also collect interesting information here!
144
145 # Generate canonical graph (initialization of the graph)
146 20386 4416910.0 216.7 0.4 graph = _generate_canonical_graph(entry.sequence, entry.accessions[0])
147
148 # FT parsing and appending of Nodes and Edges into the graph
149 # The amount of isoforms, etc.. can be retrieved on the fly
150 20386 22188.0 1.1 0.0 num_isoforms, num_initm, num_signal, num_variant, num_mutagens, num_conficts =\
151 20386 312577641.0 15333.0 29.4 _include_ft_information(entry, graph, ft_dict)
152
153 # Replace Amino Acids based on user defined rules: E.G.: "X -> A,B,C"
154 20386 83272.0 4.1 0.0 replace_aa(graph, kwargs["replace_aa"])
155
156 # Digest graph with enzyme (unlimited miscleavages)
157 20386 457306111.0 22432.4 43.0 num_of_cleavages = digest(graph, kwargs["digestion"])
158
159 # Merge (summarize) graph if wanted
160 20386 29893.0 1.5 0.0 if not kwargs["no_merge"]:
161 20386 268518281.0 13171.7 25.3 merge_aminoacids(graph)
162
163 # Collapse parallel edges in a graph
164 20386 29694.0 1.5 0.0 if not kwargs["no_collapsing_edges"]:
165 20386 10804029.0 530.0 1.0 collapse_parallel_edges(graph)
166
167 # Annotate weights for edges and nodes (maybe even the smallest weight possible to get to the end node)
168 20386 948172.0 46.5 0.1 annotate_weights(graph, **kwargs)
169
170 # Calculate statistics on the graph:
171 20386 11921.0 0.6 0.0 (
172 20386 12094.0 0.6 0.0 num_nodes, num_edges, num_paths, num_paths_miscleavages, num_paths_hops,
173 20386 9768.0 0.5 0.0 num_paths_var, num_path_mut, num_path_con
174 20386 297176.0 14.6 0.0 ) = get_statistics(graph, **kwargs)
175
176 # Verify graphs if wanted:
177 20386 11624.0 0.6 0.0 if kwargs["verify_graph"]:
178 verify_graph(graph)
179
180 # Persist or export graphs with speicified exporters
181 20386 38415.0 1.9 0.0 graph_exporters.export_graph(graph, common_out_queue)
182
183 # Output statistics we gathered during processing
184 20386 10500.0 0.5 0.0 if kwargs["no_description"]:
185 entry_protein_desc = None
186 else:
187 20386 37338.0 1.8 0.0 entry_protein_desc = entry.description.split(";", 1)[0]
188 20386 37142.0 1.8 0.0 entry_protein_desc = entry_protein_desc[entry_protein_desc.index("=") + 1:]
189
190 40772 312422.0 7.7 0.0 graph_queue.put(
191 20386 12337.0 0.6 0.0 (
192 20386 11818.0 0.6 0.0 entry.accessions[0], # Protein Accesion
193 20386 10432.0 0.5 0.0 entry.entry_name, # Protein displayed name
194 20386 9196.0 0.5 0.0 num_isoforms, # Number of Isoforms
195 20386 9231.0 0.5 0.0 num_initm, # Number of Init_M (either 0 or 1)
196 20386 9244.0 0.5 0.0 num_signal, # Number of Signal Peptides used (either 0 or 1)
197 20386 9232.0 0.5 0.0 num_variant, # Number of Variants applied to this protein
198 20386 9227.0 0.5 0.0 num_mutagens, # Number of applied mutagens on the graph
199 20386 9231.0 0.5 0.0 num_conficts, # Number of applied conflicts on the graph
200 20386 9274.0 0.5 0.0 num_of_cleavages, # Number of cleavages (marked edges) this protein has
201 20386 9240.0 0.5 0.0 num_nodes, # Number of nodes for the Protein/Peptide Graph
202 20386 9269.0 0.5 0.0 num_edges, # Number of edges for the Protein/Peptide Graph
203 20386 9311.0 0.5 0.0 num_paths, # Possible (non repeating paths) to the end of a graph. (may conatin repeating peptides)
204 20386 9318.0 0.5 0.0 num_paths_miscleavages, # As num_paths, but binned to the number of miscleavages (by list idx, at 0)
205 20386 9288.0 0.5 0.0 num_paths_hops, # As num_paths, only that we bin by hops (E.G. useful for determine DFS or BFS depths)
206 20386 9363.0 0.5 0.0 num_paths_var, # Num paths of feture variant
207 20386 9519.0 0.5 0.0 num_path_mut, # Num paths of feture mutagen
208 20386 9508.0 0.5 0.0 num_path_con, # Num paths of feture conflict
209 20386 9476.0 0.5 0.0 entry_protein_desc, # Description name of the Protein (can be lenghty)
210 )
211 )
212
213 # Close exporters (maybe opened files, database connections, etc... )
214 1 13.0 13.0 0.0 graph_exporters.close()
Bottlenecks are: digest (43.0 % of the time), _include_ft_information (29.4 %) and merge_aminoacids (25.3 %).
The CHAIN feature-information can also be "cleaved" as stated in the documentation: https://www.uniprot.org/help/chain
ProtGraph should therefore also set those points as specific cleavage points (similar to PEPTIDE and PROPEP).
It was first noticed in https://www.uniprot.org/uniprotkb/P05067/entry (for Amyloid-Beta 40/42)
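A minimal sketch of how such cleavage points could be collected, assuming features are given as `(type, start, end)` tuples with 1-based inclusive positions. The function `cleavage_points` and the coordinates in the example are hypothetical, not taken from P05067.

```python
def cleavage_points(features, sequence_length,
                    types=("CHAIN", "PEPTIDE", "PROPEP")):
    """Return sorted positions p, meaning 'cleave between p and p + 1'."""
    points = set()
    for ftype, start, end in features:
        if ftype not in types:
            continue
        # Skip boundaries that coincide with the protein termini
        if start > 1:
            points.add(start - 1)
        if end < sequence_length:
            points.add(end)
    return sorted(points)


# Toy protein of length 100 with two overlapping chains
features = [
    ("CHAIN", 1, 100),
    ("CHAIN", 20, 60),
    ("CHAIN", 20, 58),
    ("SIGNAL", 1, 19),  # ignored here, since only CHAIN-like types are used
]
points = cleavage_points(features, sequence_length=100)
```

Boundaries at the very beginning or end of the protein are skipped, since they would not cut anything.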
If we export FASTA, then we should also somehow include PEFF.
Currently we have one large project, which reads SwissProt-EMBL entries and generates graphs out of them.
It is currently not possible to retrieve the graphs directly via Python. This only works through Pickle, by saving and loading graphs separately.
Currently there are many consumers and one producer thread, which do very basic operations. These operations could be separated into another project, so that ProtGraph imports them as a library.
We could use dynamic programming via a reverse topological sort and sets to propagate the possible weights to the start node.
The final set for proteins should be small, as seen with some test scripts (around 1/60).
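A minimal sketch of this idea with the standard library only (`graphlib`, Python 3.9+). It assumes the graph is a DAG given as a list of `(source, target)` edges with one weight per node. The function `propagate_weights`, the node names and the integer weights are hypothetical; ProtGraph's real graphs and weights look different.

```python
from graphlib import TopologicalSorter


def propagate_weights(edges, node_weight, end):
    """Return, per node, the set of total weights of all paths to `end`."""
    successors = {}
    for source, target in edges:
        successors.setdefault(source, set()).add(target)
        successors.setdefault(target, set())
    # Passing successors as "dependencies" makes static_order() yield a node
    # only after all of its successors, i.e. a reverse topological order.
    reachable = {}
    for node in TopologicalSorter(successors).static_order():
        if node == end:
            reachable[node] = {node_weight[node]}
        else:
            reachable[node] = {
                node_weight[node] + weight
                for nxt in successors[node]
                for weight in reachable[nxt]
            }
    return reachable


# Toy graph: start -> (A | G) -> K -> end, with integer weights for clarity
edges = [("start", "A"), ("start", "G"), ("A", "K"), ("G", "K"), ("K", "end")]
node_weight = {"start": 0, "A": 71, "G": 57, "K": 128, "end": 0}
weights = propagate_weights(edges, node_weight, end="end")
```

Because equal sums collapse in a set, the set at the start node can stay much smaller than the number of paths.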
This is not correct and should be fixed. INIT_MET references the amino acid M explicitly, either from an isoform (via the reference information) or from the canonical sequence (empty reference).
Suggestions:
Feel free to add other suggestions!
Instead of using the number of AAs, we could (more precisely) use a range of allowed masses.
This could be a new CLI Parameter in ProtGraph
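A minimal sketch of such a filter, assuming monoisotopic residue masses and unmodified peptides. The function names and the example bounds are hypothetical; how the CLI parameter would be named and where the check would sit in ProtGraph is left open.

```python
# Monoisotopic residue masses in Da (standard values)
MONO_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(MONO_MASS[aa] for aa in peptide) + WATER


def in_mass_range(peptide, min_mass, max_mass):
    """True if the peptide mass lies within [min_mass, max_mass]."""
    return min_mass <= peptide_mass(peptide) <= max_mass


peptides = ["GA", "PEPTIDEK", "W"]
kept = [p for p in peptides if in_mass_range(p, 500.0, 5000.0)]
```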
There are some proteins where such a case happens.
Example is needed!
We should or could add a small script which adds/changes features/sequences, to expand letters which refer to more than one amino acid.
E.g.: B -> D or N
or J -> I or L
(or X -> A, C, ...)
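A minimal sketch of such an expansion on plain sequence strings, using only the two mappings named above. The names `AMBIGUOUS` and `expand_sequence` are hypothetical; inside ProtGraph one would more likely add alternative nodes to the graph than enumerate strings.

```python
from itertools import product

# Only the mappings mentioned above; X could be added in the same way
AMBIGUOUS = {"B": "DN", "J": "IL"}


def expand_sequence(sequence, mapping=AMBIGUOUS):
    """Return all sequences obtained by resolving ambiguous letters."""
    choices = [mapping.get(aa, aa) for aa in sequence]
    return ["".join(combo) for combo in product(*choices)]


expanded = expand_sequence("ABJ")
```

The number of results doubles with each B or J, so this is only practical for sequences with few ambiguous letters.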
It seems that the new reader is not able to read files from Windows. This may be due to the \r\n line endings.
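A minimal sketch of an entry splitter that tolerates both \n and \r\n, assuming the file is read in text mode and entries end with a `//` line as in the UniProt text format. The function `iter_entries` is hypothetical and not the actual reader.

```python
import io


def iter_entries(handle):
    """Yield entries as lists of lines, split at '//' terminator lines."""
    entry = []
    for line in handle:
        # Strip both Unix and Windows line endings
        line = line.rstrip("\r\n")
        if line == "//":
            yield entry
            entry = []
        else:
            entry.append(line)
    if entry:
        # Trailing entry without a terminator line
        yield entry


windows_text = "ID   A_TEST\r\nSQ   SEQUENCE\r\n//\r\nID   B_TEST\r\n//\r\n"
entries = list(iter_entries(io.StringIO(windows_text)))
```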
Currently, some proteins with isoform mutagens and conflicts are skipped.
Example for Conflict: https://www.uniprot.org/uniprot/P52744.txt
Example for Mutagen: https://www.uniprot.org/uniprot/P35613.txt
This should also include
We currently only have exports to files and databases (Redis and Postgres).
Those are not well suited to traverse and process graphs (Postgres is actually very performant if the recursion depth is not large).
We need to use some dedicated graph processing algorithms/databases for such large graphs. Here, JanusGraph is tested.
An uncaught error occurs when digesting via the digestion method full.
Here is the stack trace:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/graph_generator.py", line 119, in generate_graph_consumer
num_of_cleavages = digest(graph, kwargs["digestion"])
File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/digestion.py", line 11, in digest
return dict(
File "/home/luxii/.local/share/virtualenvs/ProtGraph-x7sicAnP/lib/python3.8/site-packages/protgraph-0.1.0-py3.8.egg/protgraph/digestion.py", line 114, in _digest_via_full
end_out.remove(i)
ValueError: list.remove(x): x not in list
This happens with the recent version of ProtGraph, 0.1.0. It seems like a case is not considered here.
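`list.remove` raises exactly this ValueError when the element is absent. A minimal sketch of a tolerant removal follows; the helper `remove_if_present` is hypothetical. It would only stop the crash, so the unconsidered case in `_digest_via_full` would still need to be understood.

```python
def remove_if_present(items, value):
    """Remove value from the list if present; report whether it was there."""
    try:
        items.remove(value)
    except ValueError:
        return False
    return True


end_out = [3, 5, 8]
removed_existing = remove_if_present(end_out, 5)
removed_missing = remove_if_present(end_out, 42)
```

Returning a flag instead of silently ignoring the miss makes it possible to log the affected protein.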
The calls (like in the functional tests) might not work properly.
Here is an example:
args = protgraph.parse_args([] + self.procs_num + self.example_files)
protgraph.prot_graph(**args)
At least in my (clean) miniconda environment, I needed to run
conda install -c bioconda --no-channel-priority protgraph
to get protgraph installed, with Python 3.9.16 as only additional package installed before.
Maybe mention this in the documentation.
The Protein which causes a Problem: P20729
(can produce "null"-Peptides/Proteins)
This should be easily fixable when using the new implementation of PEPTIDE or PROPEP.
We should import the methods, which are used by ProtGraph from the utilities folder. This reduces redundant code
When looking at the Feature Viewer in UniProt for a single protein, options appear that let you select between "Likely Disease", "Predicted Consequences", etc.
This information is parsed from UniProt via the note= information.
If we want to apply only significant variants (or other interesting variants) to a protein, we should also implement such a filtering.
As a general consensus: everything that uses "in XXX" is a variant which likely causes the disease XXX.
Everything that contains "Unknown" or something similar is then categorized specifically.
Maybe we should send a message to the UniProt team asking how they bin those variants (do they have some specific keywords?).
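A minimal sketch of such a binning, based only on the two rules of thumb above. The function `categorize_variant`, the category labels and the example notes are hypothetical; the exclusion of `in dbSNP:...` is an assumption about how database cross-references appear in notes, and UniProt's real criteria may differ.

```python
import re

# "in XXX" up to the next ';' or ')'; "in dbSNP:..." is assumed to be a
# database cross-reference rather than a disease and is therefore skipped.
IN_DISEASE = re.compile(r"\bin (?!dbSNP)([^;)]+)")


def categorize_variant(note):
    """Bin a VARIANT note into 'unknown', 'disease: XXX' or 'other'."""
    if "unknown" in note.lower():
        return "unknown"
    match = IN_DISEASE.search(note)
    if match:
        return "disease: " + match.group(1).strip()
    return "other"


categories = [
    categorize_variant("E -> G (in AD1; dbSNP:rs0000)"),
    categorize_variant("K -> R (unknown pathological significance)"),
    categorize_variant("A -> T (in dbSNP:rs0000)"),
    categorize_variant("Missing"),
]
```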
The current implementation needs a lot of time to process huge graphs (e.g. for the protein Titin).
We should at some point look into the implementation of merging nodes/aminoacids here, and further optimize it.
Currently we only provide n-terminal (and c-terminal) peptide PTMs as well as PTMs on aminoacids. There are also other and different PTMs which ProtGraph should be able to map
It would be interesting to see the distribution of the number of possible paths for the exact number of miscleavages (cumulative?).
Currently, these features (except for CHAINs) do not provide these IDs in the FASTA headers. We should expand this!
It may be enough to change the following lines: