salilab / IHMValidation
Validation software for integrative models deposited to PDB
License: MIT License
I believe that with the 1.10 update of the IHM dictionary, explicit setting of _ihm_derived_distance_restraint.restraint_type became mandatory, so the following record now produces an error:
loop_
_ihm_derived_distance_restraint.id
_ihm_derived_distance_restraint.group_id
_ihm_derived_distance_restraint.feature_id_1
_ihm_derived_distance_restraint.feature_id_2
_ihm_derived_distance_restraint.restraint_type
_ihm_derived_distance_restraint.dataset_list_id
1 1 1 2 . 2
2 1 1 3 . 2
/usr/local/lib/python3.8/dist-packages/ihm/reader.py in __call__(self, id, group_id, dataset_list_id, feature_id_1, feature_id_2, restraint_type, group_conditionality, probability, mic_value, distance_lower_limit, distance_upper_limit)
2105 r.feature2 = self.sysr.features.get_by_id(feature_id_2)
2106 print(restraint_type)
-> 2107 r.distance = _handle_distance[restraint_type](distance_lower_limit,
2108 distance_upper_limit,
2109 self.get_float)
KeyError: None
I guess it should be fixed in the entry? @brindakv
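If the fix is indeed on the entry side, a corrected record would spell out the restraint type explicitly. A sketch only — the 'upper bound' value here is a placeholder, and the real type has to come from the depositors (the dictionary enumerates values such as 'lower bound', 'upper bound', 'lower and upper bound', and 'harmonic'):

```
loop_
_ihm_derived_distance_restraint.id
_ihm_derived_distance_restraint.group_id
_ihm_derived_distance_restraint.feature_id_1
_ihm_derived_distance_restraint.feature_id_2
_ihm_derived_distance_restraint.restraint_type
_ihm_derived_distance_restraint.dataset_list_id
1 1 1 2 'upper bound' 2
2 1 1 3 'upper bound' 2
```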
According to the dictionary, the template_seq_id_begin and template_seq_id_end fields have to be integer values. However, they are set with a ? symbol:
#
loop_
_ihm_starting_comparative_models.id
_ihm_starting_comparative_models.starting_model_id
_ihm_starting_comparative_models.starting_model_auth_asym_id
_ihm_starting_comparative_models.starting_model_seq_id_begin
_ihm_starting_comparative_models.starting_model_seq_id_end
_ihm_starting_comparative_models.template_auth_asym_id
_ihm_starting_comparative_models.template_seq_id_begin
_ihm_starting_comparative_models.template_seq_id_end
_ihm_starting_comparative_models.template_sequence_identity
_ihm_starting_comparative_models.template_sequence_identity_denominator
_ihm_starting_comparative_models.template_dataset_list_id
_ihm_starting_comparative_models.alignment_file_id
1 3 A 1 51 C ? ? ? ? 3 .
2 4 A 1 51 D ? ? ? ? 3 .
#
python-ihm follows the spec and fails with the error:
/usr/local/lib/python3.8/dist-packages/ihm/reader.py in __call__(self, starting_model_id, template_dataset_list_id, alignment_file_id, template_auth_asym_id, starting_model_seq_id_begin, starting_model_seq_id_end, template_seq_id_begin, template_seq_id_end, template_sequence_identity, template_sequence_identity_denominator)
1538 seq_id_range = (int(starting_model_seq_id_begin),
1539 int(starting_model_seq_id_end))
-> 1540 template_seq_id_range = (int(template_seq_id_begin),
1541 int(template_seq_id_end))
1542 identity = ihm.startmodel.SequenceIdentity(
TypeError: int() argument must be a string, a bytes-like object or a number, not '__UnknownValue'
@brindakv what is the best course of action in this case?
How to interpret the report? What are good and bad values? Add information to the user guide (wherever possible) regarding which values are good and which are bad. Good places to start would be validation_help.html, particularly the "Model Quality Assessment" and "Fit to Data Used for Modeling Assessment" sections. For example, state for each score whether higher or lower values are "better".
Some problems in #23 were caused by inaccessible paths to static js/css resources. Looks like there are some duplications and version mismatches that require refactoring. Overall it would be better to simplify the set of resources and sync it with layout.html. E.g. about_validation.html and validation_help.html have a somewhat mixed set of links.
<!-- add JavaScript files from js dir -->
<script type="text/javascript" src="js/jquery.min.js"></script>
<script type="text/javascript" src="js/bootstrap.min.js"></script>
<script type="text/javascript" src="js/main.js"></script>
<script type="text/javascript" src="js/jquery-3.3.1.min.js"></script>
<script type="text/javascript" src="js/popper1.12.9.min.js"></script>
<script type="text/javascript" src=".js/bootstrap4.1.3.min.js"></script>
<script type="text/javascript" src="js/bootstrap3-typeahead.min.js"></script>
wkhtmltopdf settings seem to conflict: despite JavaScript being disabled, the timeout is still applied and causes a considerable delay during PDF generation. Related to #38.
IHMValidation/example/Execute.py
Lines 78 to 79 in 92b1d37
IHMValidation/example/Execute.py
Lines 95 to 96 in 92b1d37
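A minimal sketch of how the two settings could be reconciled via pdfkit. The option names map to real wkhtmltopdf flags (--disable-javascript, --javascript-delay); whether the delay is honored when JS is disabled is exactly the behavior under question here, so zeroing it explicitly is the safe bet:

```python
# Hedged sketch: pass both flags so no JS delay is applied even if
# wkhtmltopdf does not skip the delay when JS is disabled.
options = {
    'disable-javascript': '',  # maps to --disable-javascript
    'javascript-delay': '0',   # maps to --javascript-delay 0 (ms)
    'quiet': '',
}
# pdfkit.from_file('report.html', 'report.pdf', options=options)
```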
A mini-review is below. I'll try to comment on each class of issue only once.
Tests will rot if they're not run periodically. It's pretty straightforward to set them up to run on each push, using GitHub Actions. See for example https://github.com/ihmwg/python-ihm/blob/main/.github/workflows/testpy.yml
Docs can also be auto-built with readthedocs.io. See e.g. https://python-ihm.readthedocs.io/. I can set that up for you if you like.
Exclude __pycache__ with a .gitignore file in https://github.com/salilab/IHMValidation/tree/master/master/pyext/src/validation. There's no point in tracking .pyc files in source control.
You might consider renaming your main branch from 'master' to 'main'. The latter is quickly becoming the standard on GitHub.
Never ever use tabs in Python code. Tabs are evil. (Perhaps configure your editor to insert 4 spaces for each tab keypress.) Run your code through a PEP-8 formatter like autopep8, reindent.py, or black.
(Run flake8 over your code; it will pick up a lot of issues like this, and can be quite educational.)
Use PEP-8 naming (e.g. Template_Dict -> template_dict).
The list() here is unnecessary and perhaps inefficient.
Hardcoding 'static/results/' here kind of defeats the point of using path.join in the first place. Use 'static', 'results' instead. The path.abspath is unnecessary too, since you are not changing directory between when you construct the filename and when you use it to open the file.
Why global? You should always carefully check any global usage.
Avoid a bare except. Always explicitly list the exceptions you want to catch. Otherwise you may catch exceptions that should be fixed (e.g. a syntax error in the try block).
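A minimal illustration of the point above (the function and inputs are hypothetical):

```python
# Catch only the exceptions you expect; let programming errors propagate.
def read_int(text):
    try:
        return int(text)
    except ValueError:  # not a bare "except:"
        return None
```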
Open the file with with (a context manager) and put the read() inside the with body. This ensures the file is closed at the end of the scope.
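For example, a sketch of the pattern being suggested (helper name is hypothetical):

```python
# The "with" block guarantees the file is closed even if read() raises.
def read_text(path):
    with open(path, encoding='utf-8') as fh:
        return fh.read()
```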
_entry.id and _struct.entry_id are different. This shouldn't happen (pretty sure the dictionary requires them to be the same) but if it does, you could ask the python-ihm developer to handle this ;)
IHMValidation/master/pyext/src/validation/__init__.py
Lines 70 to 76 in ddf1a08
Perhaps return "; ".join(aut for aut in cit[0].authors)?
entities[0].description, perhaps?
Avoid map in modern Python code; it's largely been replaced with comprehensions, e.g. assembly_id = [int(x) for x in self.get_assembly_ID_of_models()].
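A quick sketch of the suggested style (the list of IDs is a stand-in for the real method's return value):

```python
ids = ['1', '2', '3']  # stand-in for get_assembly_ID_of_models()

# map() still works, but a comprehension reads better and avoids
# an extra function object:
assembly_id = [int(x) for x in ids]
```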
IHMValidation/master/pyext/src/validation/__init__.py
Lines 223 to 225 in ddf1a08
chain = [el._id for el in ass] would be more concise.
This is hard to follow (e.g. the used.append()). If it's something clever, add a comment to help the poor reader.
Why enumerate if you're not using the count? Just use for el in self.system.asym_units: instead.
_ is usually used as an "I'm not using this value" placeholder, so this looks weird (and is also hard to read). Use a real variable name, e.g. software, instead.
mmCIF distinguishes between . and ? values, a distinction you're losing here. It would be more correct to say if _.version == ihm.unknown:.
except ... as error implies you're going to use the error object... and then you don't.
Perhaps loc = 'Not listed'?
Use isinstance instead.
IHMValidation/master/pyext/src/validation/__init__.py
Lines 454 to 457 in ddf1a08
return 'SAS' in str(data_type) and 'SAS' in str(database) would be more concise here.
Just use self.nos = self.get_number_of_models() here, unless you have overridden the method in this class but really want to still call the base class method (which would be confusing).
IHMValidation/master/pyext/src/validation/cx.py
Lines 223 to 232 in ddf1a08
linkers = {'DSS': 30, 'EDC': 20}
return dist <= linkers.get(linker, 30)
That dict could go in a utility module somewhere so you can use it elsewhere, e.g. in cx_plots.py.
IHMValidation/master/pyext/src/validation/excludedvolume.py
Lines 56 to 57 in ddf1a08
The data is built twice here (once in the model_spheres object, and then again in the DataFrame). Assuming you're using pandas 0.13 or later, you can avoid the intermediate model_spheres object by passing a generator to the DataFrame constructor instead of a dict.
os.getcwd() isn't needed here. If you're trying to make the path absolute (although that's not needed here either), os.path.abspath is the way to do it.
IHMValidation/master/pyext/src/validation/molprobity.py
Lines 67 to 68 in ddf1a08
with open(f_name, 'w+') as outfile: would be more normal.
You compute j.replace(',','').replace(':','').split() a bunch of times here. Store the result in a variable to make it more efficient and easier to read.
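A sketch of that suggestion; the sample line is hypothetical, standing in for whatever j actually holds:

```python
# Hypothetical input line; clean it once, reuse the result.
j = 'A  12  LYS: 1.23, 4.56'
cleaned = j.replace(',', '').replace(':', '').split()
```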
(Using the Å entity would be simpler.)
Wouldn't raise be more appropriate if this is an error?
The list() is unnecessary here.
Instead of lambda x: x[1], use operator.itemgetter(1).
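For instance (sample data is hypothetical):

```python
from operator import itemgetter

pairs = [('DSS', 30), ('EDC', 20)]
# itemgetter(1) is clearer (and a bit faster) than key=lambda x: x[1]
by_cutoff = sorted(pairs, key=itemgetter(1))
```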
IHMValidation/master/pyext/src/validation/sas.py
Lines 418 to 419 in ddf1a08
None doesn't have an append method.
0 < len(sascifline) < 3
Use if not val_m[1].empty:, or perhaps if val_m[1].empty is False: if you also need to check it's not something non-boolean (e.g. None).
IHMValidation/master/pyext/src/validation/utility.py
Lines 50 to 58 in ddf1a08
Each time val += foo runs, val is destroyed and replaced with a new, longer string object. Whenever you have val += foo inside a for loop, consider replacing it with something like val = ''.join(something).
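The two patterns side by side (toy data):

```python
parts = ['a', 'b', 'c']

# Quadratic: each += builds a brand-new string object.
val = ''
for p in parts:
    val += p

# Linear and clearer:
val = ''.join(parts)
```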
A range start of 0 is redundant - that's the default.
IHMValidation/master/pyext/src/validation/utility.py
Lines 208 to 210 in ddf1a08
new_restraints = {key: list(set(val)) for key,val in restraints.items()}
os.listdir('.') would be more concise.

E.g. at https://pdb-dev-beta.wwpdb.org/Validation/PDBDEV_00000016/htmls/main.html clicking on the "Data Quality" tab in the table under "Overall quality" gives a 404. This is likely because this entry uses SAXS data that is not available in SASBDB. A page to that effect should be generated instead.
Fix typo on https://pdb-dev-beta.wwpdb.org/about_validation.html ("data quality assesments").
create a docker image with the right dependencies
Gerado suggests:
Yes I would just add a sentence in the "Overall quality" section that mentions where to find the more detailed reports (i.e. the pdf and the submenus available further up on the page).
Personally I scrolled down the page, saw the "This validation report contains model quality assessments for the structure" sentence and assumed that what followed was the whole report...
In Model Quality: add some measure of connectivity restraint satisfaction. This could be useful especially for bead models of the kind in IMP that represent regions of unknown structure.
PDB-Dev logo pic: images/logon.png has a size of 1.1M, which is about the same size as the whole report for some entries. Optimization with optipng reduces the size by only 8%; the JPEG version is ~400K.
In case of migration to JPEG, the following files have to be modified:
about_validation.html: <img src="images/logon.png" class="float-left" alt="PDBDEV.org" height="100" width="110" style="margin-top: 0px; margin-bottom: 0px" />
templates/layout.html: <img src="../../../images/logon.png" class="float-left" alt="PDBDEV.org" height="100" width="110" style="margin-top: 0px; margin-bottom: 0px" />
templates/notformodeling.html: <img src="../../static/webimages/logon.png" class="float-left" alt="PDBDEV.org" height="100" width="110" style="margin-top: 10px; margin-bottom: 10px" />
templates/introduction.html: <img src="../../static/webimages/logon.png" class="float-left" alt="PDBDEV.org" height="100" width="110" style="margin-top: 10px; margin-bottom: 10px" />
validation_help.html: <img src="images/logon.png" class="float-left" alt="PDBDEV.org" height="100" width="110" style="margin-top: 0px; margin-bottom: 0px" />
Also, judging from the current code, the file size can be further reduced by decreasing the image dimensions.
Currently, only the raw numbers are reported for any outliers (clashes, Ramachandran and standard geometry outliers from Molprobity). This cannot be used as an indicator of model quality since it scales as the size of the model. Report the percentage as well to make this easier to interpret.
(Another suggestion would be to report outliers per 100 residues. A simple percentage is easier to calculate however and reports roughly the same thing. "Number of residues" is a slightly tricky quantity for integrative models, for example if there are parts of the model that are not atomic, or that don't report outliers, for example ligands.)
bokeh 3.x changes the API; for instance, code that works in 2.4.3 fails in 3.0.0.

There are multiple points in entry 88 where ihm fails during parsing:
Traceback (most recent call last):
File "/IHMValidation/example/../master/pyext/src/validation/__init__.py", line 74, in __init__
self.system, = ihm.reader.read(fh, model_class=self.model)
File "/root/miniforge/lib/python3.9/site-packages/ihm/reader.py", line 3298, in read
more_data = r.read_file()
File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 594, in read_file
return self._read_file_c()
File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 645, in _read_file_c
eof, more_data = _format.ihm_read_file(self._c_format)
File "/root/miniforge/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 470153: invalid start byte
is the result of the sentence:
Typically, 14<B7>106 to 20<B7>106 photons were recorded at TAC channel-width of 14.1\xa0ps (IBH-5000U) or 8\xa0ps (EasyTau300).
The other error:
File "/root/miniforge/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 471889: invalid start byte
originates from:
Sample conditions for the EPR experiments were 100 <B5>M protein in 100 mM NaCl, 50 mM Tris-HCl, 5 mM MgCl2, pH 7.4 dissolved in D2O with 12.5 % (v/v) glycerol-d8.
And finally, after deleting symbols causing previous errors:
Traceback (most recent call last):
File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 645, in _read_file_c
eof, more_data = _format.ihm_read_file(self._c_format)
_format.FileFormatError: Wrong number of data values in loop (should be an exact multiple of the number of keys) at line 1940098
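The offending bytes look like Latin-1 (0xb7 is the middle dot in "14·10^6", 0xb5 is the micro sign in "100 µM"), so a workaround on the reading side could be a Latin-1 fallback. This is a sketch under that assumption; whether the entries really are Latin-1 needs checking, and fixing the deposited files themselves may be preferable:

```python
# Try UTF-8 first; fall back to Latin-1 for legacy entries with
# raw 0xb7 / 0xb5 bytes. (Assumes the non-UTF-8 files are Latin-1.)
def read_cif_text(path):
    with open(path, 'rb') as fh:
        raw = fh.read()
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('latin-1')
```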
This is a continuation of #38.
Looks like utility.dict_to_JSlist is heavily used throughout the code and performs a lot of iterations and list comprehensions.
IHMValidation/master/pyext/src/validation/utility.py
Lines 32 to 43 in 4df2f34
Though a list comprehension is quite efficient on its own, it is overused here, causing a sizable delay.
Paths for tools from the molprobity and ATSAS suites are explicitly defined through environment variables. I think this is redundant, since the PATH environment variable is a generic and OS-independent way of locating them.
ATSAS=""
Molprobity_ramalyze=""
Molprobity_molprobity=""
Molprobity_clashscore=""
Molprobity_rotalyze=""
wkhtmltopdf=""
Moreover, in the case of ATSAS it's a little bit misleading, since only datcmp is used and not the whole ATSAS package.
IHMValidation/master/pyext/src/validation/sas.py
Lines 345 to 346 in 0bb8ad2
Currently, entries without SAS (or any other additional data) show:
Data quality and fit to model assessments for other datasets and model uncertainty are under development.
while entries with SAS have:
Data quality assessment for SAS datasets and fit to model assessments for SAS datasets is also included in this assessment. Data quality and fit to model assessments for other datasets and model uncertainty are under development.
We should sync/rework the text to explicitly show users what types of data there are and which types already have validation implemented.
The current way of handling time has an implicit prerequisite that the user's time is already in the America/Los_Angeles timezone. It would be better to first get the UTC time from the user, and later convert it to the proper timezone.
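A sketch of the suggested flow using the standard library (zoneinfo requires Python 3.9+):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# Record the timestamp in UTC; convert to a display timezone only at
# presentation time. Both objects denote the same instant.
now_utc = datetime.now(timezone.utc)
now_la = now_utc.astimezone(ZoneInfo('America/Los_Angeles'))
```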
It would be nice to have the ability to update PDB-Dev in parallel. Together with the recent updates in #38, this would allow rebuilding the whole repo (with recalculated values) in under 2 minutes on a modern 32-128 core node.
So far I identified several places which interfere with parallel execution:
IHMValidation/master/pyext/src/validation/__init__.py
Lines 737 to 739 in f1aec61
uses a hardcoded test.cif as a temporary filename;
IHMValidation/master/pyext/src/validation/utility.py
Lines 475 to 490 in f1aec61
removes any temp files by mask, including temp files generated for other structures. This specifically hits SASCIF processing (other files don't appear to be re-read, at least when MolProbity and excluded-volume results are already recalculated);
IHMValidation/master/pyext/src/validation/sas.py
Lines 370 to 372 in f1aec61
IHMValidation/master/pyext/src/validation/sas.py
Lines 381 to 382 in f1aec61
create temp files for SAS processing.
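The hardcoded test.cif issue can be sidestepped with per-process temporary names from the standard library; a minimal sketch (the helper name is hypothetical):

```python
import os
import tempfile

# mkstemp returns a unique name, so parallel runs never clobber
# each other, unlike a hardcoded 'test.cif'.
def make_temp_cif():
    fd, path = tempfile.mkstemp(suffix='.cif')
    os.close(fd)
    return path
```

Each worker then cleans up only its own file, rather than removing temp files by mask.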
There are several places where "None" is written into the output HTML. One example is at https://pdb-dev-beta.wwpdb.org/Validation/PDBDEV_00000009/htmls/data_quality.html, where the Dmax error is reported to be None nm. This is likely because the Python None value is being used as-is. "None nm" is obviously nonsensical; this should be reported instead as "0 nm" if there really is no error or, in the much more likely case that the error could not be calculated for some reason, that reason should be stated to the user.
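A hypothetical formatter illustrating the suggestion (the function name and wording are not from the codebase):

```python
# Never let a raw Python None reach the rendered HTML.
def format_dmax_error(err):
    if err is None:
        return 'could not be estimated from the deposited data'
    return f'{err} nm'
```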
We need to set up a set of milestones to prioritize issues, especially for the initial SAS release.
Looks like none of the bokeh plots currently have bounds set on their x/y ranges. This allows them to scroll away from the data or even in some cases results in odd-looking initial plots. For example see the excluded volume plot https://pdb-dev-beta.wwpdb.org/Validation/PDBDEV_00000012/htmls/main.html. There will never be a negative number of violations so it should not be possible to scroll the x range that way. This can be done with something like
p = bokeh.plotting.figure(..., x_range=Range1d(0, xmax, bounds=(0, None)))
Looks like there is a problem with the width settings of some sections. The beginning of the report looks OK; problems start from the Data quality section and continue down to the very end. The problem seems to have been there since the beginning; at least, it was already present in Ben's update, around commit 9591c5a.
Add more tests to the tests subdirectory to ensure that things that are fixed stay fixed. Add code coverage with codecov so that we can see where we're still lacking tests. Add these to GitHub Actions so that commits and pull requests are checked for breakage.
Even if no outliers were detected (which means that everything is ok), the following message is printed:
Standard geometry: bond outliers[?]
Bond length outliers can not be evaluated for this model
Also incorrect formatting for the number of angle outliers:
Standard geometry: angle outliers[?]
There are 628 angle outliers in this entry (62800.0% of all angles). A summary is provided below, and a detailed list of outliers can be found
To be fixed here
Entries for testing: 9, 55, 141
After the fix all reports have to be updated.
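The "62800.0% of all angles" figure looks like an already-scaled fraction multiplied by 100 again. A minimal sketch of the intended arithmetic (helper name and denominator are hypothetical):

```python
# 628 outliers out of, say, 8000 angles should read 7.85%, not 62800.0%.
def outlier_percent(n_outliers, n_total):
    if n_total == 0:
        return 0.0
    return 100.0 * n_outliers / n_total
```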
Report model precision and population of models in the cluster, if provided by the authors in the mmCIF file.
Plots should be smaller. Non-weighted residual can be dropped. Cluster Log(I) vs q plots with weighted residuals plots. P-values and chi squared values can be combined in one table.
The pipeline passes a Template_Dict around many parts of the code. Since the output HTML is static, step 3 is redundant. Jinja2 logic can be used instead to generate the final HTML directly in step 2. This would result in much less bulky HTML, and any errors would be detected at build time, rather than at runtime.
Full trace:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/tmp/ipykernel_2655681/2431891216.py in <module>
1 fname = '/home/domain/data/silwer/pdb_dev/IHMValidation_aozalevsky/example/PDBDEV_00000013.cif'
2 with open(fname, encoding='utf8') as f:
----> 3 m, = ihm.reader.read(f, model_class=ihm.model.Model)
/usr/local/lib/python3.8/dist-packages/ihm/reader.py in read(fh, model_class, format, handlers, warn_unknown_category, warn_unknown_keyword, read_starting_model_coord, starting_model_class, reject_old_file, variant)
3296 ukhandler.add_category_handlers(hs)
3297 r.category_handler = dict((h.category, h) for h in hs)
-> 3298 more_data = r.read_file()
3299 for h in hs:
3300 h.finalize()
/usr/local/lib/python3.8/dist-packages/ihm/format.py in read_file(self)
587
588 :exc:`CifParserError` will be raised if the file cannot be parsed.
--> 589
590 :return: True iff more data blocks are available to be read.
591 """
/usr/local/lib/python3.8/dist-packages/ihm/format.py in _read_file_c(self)
638 if self.unknown_category_handler is not None:
639 _format.add_unknown_category_handler(self._c_format,
--> 640 self.unknown_category_handler)
641 if self.unknown_keyword_handler is not None:
642 _format.add_unknown_keyword_handler(self._c_format,
/usr/local/lib/python3.8/dist-packages/ihm/reader.py in __call__(self, starting_model_id, asym_id, entity_poly_segment_id, dataset_list_id, starting_model_auth_asym_id, starting_model_sequence_offset, description)
1500 starting_model_sequence_offset, description):
1501 m = self.sysr.starting_models.get_by_id(starting_model_id)
-> 1502 asym = self.sysr.ranges.get(
1503 self.sysr.asym_units.get_by_id(asym_id), entity_poly_segment_id)
1504 m.asym_unit = asym
/usr/local/lib/python3.8/dist-packages/ihm/reader.py in get(self, asym_or_entity, range_id)
190 return asym_or_entity
191 else:
--> 192 return asym_or_entity(*self._id_map[range_id])
193
194
KeyError: '1'
I narrowed down the issue to the order of two sections. The code fails on
1409 loop_
1410 _ihm_starting_model_details.starting_model_id
1411 _ihm_starting_model_details.entity_id
1412 _ihm_starting_model_details.entity_description
1413 _ihm_starting_model_details.asym_id
1414 _ihm_starting_model_details.entity_poly_segment_id
1415 _ihm_starting_model_details.starting_model_source
1416 _ihm_starting_model_details.starting_model_auth_asym_id
1417 _ihm_starting_model_details.starting_model_sequence_offset
1418 _ihm_starting_model_details.dataset_list_id
1419 1 1 CYP199A2 A 1 'experimental model' A -13 1
1420 2 2 HaPux B 2 'experimental model' A 0 2
because the actual _ihm_entity_poly_segment records are defined ~40 lines below:
1455 loop_
1456 _ihm_entity_poly_segment.id
1457 _ihm_entity_poly_segment.entity_id
1458 _ihm_entity_poly_segment.seq_id_begin
1459 _ihm_entity_poly_segment.seq_id_end
1460 _ihm_entity_poly_segment.comp_id_begin
1461 _ihm_entity_poly_segment.comp_id_end
1462 1 1 1 399 SER ALA
1463 2 2 1 106 PRO THR
If I swap them with each other, parsing continues. Indeed, according to the schema, the _ihm_entity_poly_segment table should go first. @benmwebb can you check my analysis?
Would be helpful to add tooltips to at least some of the plots with actual values, e.g. the model quality plots at https://pdb-dev-beta.wwpdb.org/Validation/PDBDEV_00000092/htmls/main.html, using bokeh's HoverTool.
Report generation (with precalculated data) typically takes anything from several minutes up to several hours. That looks a bit unrealistic for a simple rendering task; there have to be some bottlenecks. Below is a sample profiling log for PDBDEV_00000004:
ncalls tottime percall cumtime percall filename:lineno(function)
1762/1 0.020 0.000 103.081 103.081 {built-in method builtins.exec}
1 0.005 0.005 103.081 103.081 Execute.py:7(<module>)
2 0.000 0.000 51.620 25.810 api.py:30(from_file)
2 0.000 0.000 51.602 25.801 pdfkit.py:160(to_pdf)
4 0.000 0.000 51.590 12.898 subprocess.py:1090(communicate)
2 0.000 0.000 51.589 25.794 subprocess.py:1926(_communicate)
195 51.588 0.265 51.588 0.265 {method 'poll' of 'select.poll' objects}
2 0.000 0.000 51.587 25.794 selectors.py:403(select)
1 0.000 0.000 51.281 51.281 Execute.py:157(write_pdf)
1 0.003 0.003 23.990 23.990 Report.py:117(run_model_quality)
20 0.310 0.015 23.882 1.194 utility.py:16(dict_to_JSlist)
31841 23.570 0.001 23.570 0.001 utility.py:40(<listcomp>)
220 0.003 0.000 18.939 0.086 connectionpool.py:518(urlopen)
220 0.002 0.000 18.899 0.086 connectionpool.py:357(_make_request)
1 0.000 0.000 17.385 17.385 Report.py:348(run_sas_validation_plots)
429 0.001 0.000 14.035 0.033 socket.py:690(readinto)
220 0.001 0.000 13.262 0.060 client.py:1327(getresponse)
220 0.001 0.000 13.259 0.060 client.py:312(begin)
220 0.001 0.000 13.235 0.060 client.py:279(_read_status)
1457 0.001 0.000 13.234 0.009 {method 'readline' of '_io.BufferedReader' objects}
193 0.001 0.000 10.929 0.057 webdriver.py:404(execute)
193 0.001 0.000 10.925 0.057 remote_connection.py:402(execute)
193 0.002 0.000 10.922 0.057 remote_connection.py:423(_request)
193 0.000 0.000 10.903 0.056 request.py:58(request)
193 0.001 0.000 10.902 0.056 poolmanager.py:352(urlopen)
242 10.803 0.045 10.803 0.045 {method 'recv_into' of '_socket.socket' objects}
181 0.000 0.000 10.062 0.056 request.py:98(request_encode_body)
After a brief analysis of the calls and code I identified several bottlenecks:
1. wkhtmltopdf calls:
2 0.000 0.000 51.602 25.801 pdfkit.py:160(to_pdf)
4 0.000 0.000 51.590 12.898 subprocess.py:1090(communicate)
2 0.000 0.000 51.589 25.794 subprocess.py:1926(_communicate)
2. utility.dict_to_JSlist:
20 0.310 0.015 23.882 1.194 utility.py:16(dict_to_JSlist)
3. various GET requests:
220 0.003 0.000 18.939 0.086 connectionpool.py:518(urlopen)
220 0.002 0.000 18.899 0.086 connectionpool.py:357(_make_request)
Let this issue be an umbrella issue. I'll open separate issues for individual bottlenecks.
example/Execute.py can take a very long time to run for some PDB-Dev entries. This is likely because it has to recalculate all the various SAS plots, which makes regenerating entries to fix minor typos rather time consuming. Consider caching the outputs of running ATSAS, perhaps in the Validation/results directory, in the same way that MolProbity outputs are cached. Care should be taken, though, to clear or invalidate the cache if part of the SAS pipeline itself changes.
There is a great deal of duplication in the HTML templates in the templates directory. This means that changes need to be made in multiple locations and things can get out of sync. More use of Jinja2 blocks, macros, and "extends" should be made to reduce this, following on from afb8b3f.
HTML template files need to be designed better.
Function get_restraints_info has to be refactored to:
avoid suboptimal formatting:
IHMValidation/master/pyext/src/validation/__init__.py
Lines 514 to 519 in c3bb0ef
update the if-tree to support the current IHM specs; for instance, ihm.restraint.PredictedContactRestraint can have multiple types ('lower bound', 'upper bound', 'lower upper bound').
It would be better to provide p-values in some form in the "Fit to Data used for modeling" tab on the "Overall quality" page.
The drop down menus, e.g. "Validation Overview" at https://pdb-dev-beta.wwpdb.org/Validation/PDBDEV_00000016/htmls/main.html, don't work, at least on some Firefox instances. They look like they're supposed to drop down on mouseover, but this happens only slowly, or only on mouse click, and JavaScript errors are seen in the console.
Execute.py fails for PDB-Dev entries 62 and 63 with:
Traceback (most recent call last):
File "/IHMValidation/example/Execute.py", line 208, in <module>
template_dict, molprobity_dict, exv_data = report.run_model_quality(
File "/IHMValidation/example/../master/pyext/src/validation/Report.py", line 240, in run_model_quality
clashscores, Template_Dict['tot'] = I_mp.clash_summary_table(
File "/IHMValidation/example/../master/pyext/src/validation/molprobity.py", line 574, in clash_summary_table
dict1 = self.orderclashdict(dict1)
File "/IHMValidation/example/../master/pyext/src/validation/molprobity.py", line 584, in orderclashdict
df = pd.DataFrame(modeldict)
File "/root/miniforge/lib/python3.9/site-packages/pandas/core/frame.py", line 636, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "/root/miniforge/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 502, in dict_to_mgr
return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
File "/root/miniforge/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 120, in arrays_to_mgr
index = _extract_index(arrays)
File "/root/miniforge/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 674, in _extract_index
raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length
Looks like for some reason the code is only finding clash scores for 20 models even though the PDB-Dev entry has 25. Either the parsing of the MolProbity output is deficient here, or there really are no clashes for some of the models (in which case empty lists likely need to be returned so that everything works).
It looks like the code is hardcoded for 1/A units, and thus fails on files with 1/nm units (related to #53). There are multiple places using a hardcoded A-to-nm conversion.
The information about units is stored in the sascif file:
_sas_scan.unit 1/A
_sas_scan.unit 1/nm
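A sketch of normalizing q based on _sas_scan.unit rather than assuming 1/A (helper name is hypothetical; since 1 nm = 10 Å, a q value in 1/nm is 10x the same q in 1/A):

```python
# Convert a scattering vector magnitude to 1/A based on the declared unit.
def q_to_inverse_angstrom(q, unit):
    if unit == '1/A':
        return q
    if unit == '1/nm':
        return q / 10.0   # 1 nm = 10 A, so q[1/A] = q[1/nm] / 10
    raise ValueError(f'Unsupported q unit: {unit}')
```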
Parsing of the PDB-Dev entries 8 and 66 fails with the following trace:
Traceback (most recent call last):
File "/IHMValidation/example/Execute.py", line 203, in <module>
report = WriteReport(args.f)
File "/IHMValidation/example/../master/pyext/src/validation/Report.py", line 26, in __init__
self.input = GetInputInformation(self.mmcif_file)
File "/IHMValidation/example/../master/pyext/src/validation/__init__.py", line 32, in __init__
self.system, = ihm.reader.read(fh, model_class=self.model)
File "/root/miniforge/lib/python3.9/site-packages/ihm/reader.py", line 3260, in read
more_data = r.read_file()
File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 589, in read_file
return self._read_file_c()
File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 640, in _read_file_c
eof, more_data = _format.ihm_read_file(self._c_format)
File "/root/miniforge/lib/python3.9/site-packages/ihm/reader.py", line 1247, in __call__
a.append(self.sysr.ranges.get(obj, entity_poly_segment_id))
File "/root/miniforge/lib/python3.9/site-packages/ihm/reader.py", line 191, in get
return asym_or_entity(*self._id_map[range_id])
File "/root/miniforge/lib/python3.9/site-packages/ihm/__init__.py", line 1289, in __call__
return AsymUnitRange(self, seq_id_begin, seq_id_end)
File "/root/miniforge/lib/python3.9/site-packages/ihm/__init__.py", line 1198, in __init__
raise TypeError("Can only create ranges for polymeric entities")
TypeError: Can only create ranges for polymeric entities
The parsing code seems to be quite generic:
IHMValidation/master/pyext/src/validation/__init__.py
Lines 29 to 35 in 9d767d1
So I presume the problem is indeed in the cif files. PDB-Dev 8 has this in the header:
#
loop_
_entity.id
_entity.type
_entity.src_method
_entity.pdbx_description
_entity.formula_weight
_entity.pdbx_number_of_molecules
_entity.details
1 polymer man chr2L_60-161 ? 1 ?
#
<...>
loop_
_struct_asym.id
_struct_asym.entity_id
_struct_asym.details
A 1 chr2L_60-161
#
Which I guess causes failure at the asym.entity check in python-ihm:
https://github.com/ihmwg/python-ihm/blob/0989b68412c01359e9f51aaf8413325532306737/ihm/__init__.py#L1196-L1201
@benmwebb I guess I need your advice on this: is this an actual artifact in the cif, or should it be handled in the code?
In
IHMValidation/master/pyext/src/validation/Report.py
Lines 267 to 270 in 9258c8f
and
IHMValidation/master/pyext/src/validation/Report.py
Lines 277 to 278 in 9258c8f
the number of keys in the exv_data dict will be returned instead of the actual number of models.
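A toy illustration of the bug pattern described above; the exv_data contents here are invented, assuming it maps column names to per-model lists:

```python
# Hypothetical shape of exv_data: column name -> one value per model.
exv_data = {'Models': ['1', '2', '3'], 'violations': [0, 2, 1]}

wrong_count = len(exv_data)                        # number of keys: 2
number_of_models = len(next(iter(exv_data.values())))  # 3
```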
The current HTML export code fails if the path to firefox/geckodriver does not point directly to the binary executable. This is exactly the case for the Conda installation used in the Docker recipe.
The issue was reported here: bokeh/bokeh#10108
As a workaround, the path to the conda firefox executable can be hardcoded like this:
export PATH=/root/miniforge/bin/FirefoxApp:${PATH}
I'll update docker and singularity recipes later.