salilab / IHMValidation

Validation software for integrative models deposited to PDB

License: MIT License

Python 0.36% HTML 99.62% CSS 0.02% JavaScript 0.01% Shell 0.01% Dockerfile 0.01%

IHMValidation's People

Contributors: aozalevsky, benmwebb, brindakv, saijananiganesan

IHMValidation's Issues

Missing restraint_types in _ihm_derived_distance_restraint in the PDBDEV_00000054

I believe that with the 1.10 update of the IHM dictionary, explicitly setting _ihm_derived_distance_restraint.restraint_type became mandatory, so the following record now produces an error:

loop_
_ihm_derived_distance_restraint.id
_ihm_derived_distance_restraint.group_id
_ihm_derived_distance_restraint.feature_id_1
_ihm_derived_distance_restraint.feature_id_2
_ihm_derived_distance_restraint.restraint_type
_ihm_derived_distance_restraint.dataset_list_id
1 1 1 2 . 2
2 1 1 3 . 2
/usr/local/lib/python3.8/dist-packages/ihm/reader.py in __call__(self, id, group_id, dataset_list_id, feature_id_1, feature_id_2, restraint_type, group_conditionality, probability, mic_value, distance_lower_limit, distance_upper_limit)
   2105         r.feature2 = self.sysr.features.get_by_id(feature_id_2)
   2106         print(restraint_type)
-> 2107         r.distance = _handle_distance[restraint_type](distance_lower_limit,
   2108                                                       distance_upper_limit,
   2109                                                       self.get_float)

KeyError: None

I guess it should be fixed in the entry? @brindakv

Type mismatches in _ihm_starting_comparative_models section in PDBDEV_00000059

According to the dictionary, the template_seq_id_begin and template_seq_id_end fields have to be integer values. However, they are set to the ? symbol:

#                                                                                                                                                                                                          
loop_                                                                                                                                                                                                      
_ihm_starting_comparative_models.id                                                                                                                                                                        
_ihm_starting_comparative_models.starting_model_id                                                                                                                                                         
_ihm_starting_comparative_models.starting_model_auth_asym_id                                                                                                                                               
_ihm_starting_comparative_models.starting_model_seq_id_begin                                                                                                                                               
_ihm_starting_comparative_models.starting_model_seq_id_end                                                                                                                                                 
_ihm_starting_comparative_models.template_auth_asym_id                                                                                                                                                     
_ihm_starting_comparative_models.template_seq_id_begin                                                                                                                                                     
_ihm_starting_comparative_models.template_seq_id_end                                                                                                                                                       
_ihm_starting_comparative_models.template_sequence_identity                                                                                                                                                
_ihm_starting_comparative_models.template_sequence_identity_denominator                                                                                                                                    
_ihm_starting_comparative_models.template_dataset_list_id                                                                                                                                                  
_ihm_starting_comparative_models.alignment_file_id                                                                                                                                                         
1 3 A 1 51 C ? ? ? ? 3 .                                                                                                                                                                                   
2 4 A 1 51 D ? ? ? ? 3 .                                                                                                                                                                                   
#                            

python-ihm follows the spec and fails with the error:

/usr/local/lib/python3.8/dist-packages/ihm/reader.py in __call__(self, starting_model_id, template_dataset_list_id, alignment_file_id, template_auth_asym_id, starting_model_seq_id_begin, starting_model_seq_id_end, template_seq_id_begin, template_seq_id_end, template_sequence_identity, template_sequence_identity_denominator)
   1538         seq_id_range = (int(starting_model_seq_id_begin),
   1539                         int(starting_model_seq_id_end))
-> 1540         template_seq_id_range = (int(template_seq_id_begin),
   1541                                  int(template_seq_id_end))
   1542         identity = ihm.startmodel.SequenceIdentity(

TypeError: int() argument must be a string, a bytes-like object or a number, not '__UnknownValue'

@brindakv what is the best course of action in this case?

Add more information on how to interpret the report

How should the report be interpreted? What are good and bad values? Add information to the user guide (wherever possible) about which values are good and which are bad.

Good places to start would be in validation_help.html, particularly in the "Model Quality Assessment" and "Fit to Data Used for Modeling Assessment" sections. For example, state for each score whether higher or lower values are "better".

Paths to js/css resources

Some problems in #23 were caused by inaccessible paths to static js/css resources. It looks like there are some duplications and version mismatches that require refactoring. Overall, it would be better to simplify the set of resources and sync it with layout.html.

E.g. about_validation.html and validation_help.html have a somewhat mixed set of links.

 <!-- add Javasscript file from js file -->

            <script type="text/javascript" src="js/jquery.min.js"></script>
            <script type="text/javascript" src="js/bootstrap.min.js"></script>
            <script type="text/javascript" src="js/main.js"></script>
            <script type="text/javascript" src="js/jquery-3.3.1.min.js"></script>
            <script type="text/javascript" src="js/popper1.12.9.min.js"></script>
            <script type="text/javascript" src=".js/bootstrap4.1.3.min.js"></script>
            <script type="text/javascript" src="js/bootstrap3-typeahead.min.js"></script>

Comments on the code so far

A mini-review is below. I'll try to comment on each class of issue only once.

Tests will rot if they're not run periodically. It's pretty straightforward to set them up to run on each push, using GitHub Actions. See for example https://github.com/ihmwg/python-ihm/blob/main/.github/workflows/testpy.yml

Docs can also be auto-built with readthedocs.io. See e.g. https://python-ihm.readthedocs.io/. I can set that up for you if you like.

Exclude __pycache__ with a .gitignore file in https://github.com/salilab/IHMValidation/tree/master/master/pyext/src/validation. There's no point in tracking .pyc files in source control.

You might consider renaming your main branch from 'master' to 'main'. The latter is quickly becoming the standard on GitHub.

Never ever use tabs in Python code. Tabs are evil. (Perhaps configure your editor to insert 4 spaces for each tab keypress.) Run your code through a PEP-8 formatter like autopep8, reindent.py, or black.


Bad practice to import multiple modules on one line (consider running flake8 over your code which will pick up a lot of issues like this; it can be quite educational).

def run_entry_composition(self,Template_Dict:dict)->dict:

Note that using type hints makes your code require a fairly recent version of Python 3. This may be OK. Function arguments should also usually be lowercase (Template_Dict -> template_dict).

Template_Dict['Data']=[i.upper() for i in list(set(self.I.get_dataset_comp()['Dataset type']).difference({'Experimental model','Comparative model'}))]

Sets should already be iterable; the list() here is unnecessary and perhaps inefficient.

filename = os.path.abspath(os.path.join(os.getcwd(), 'static/results/',str(Template_Dict['ID'])+'_temp_mp.txt'))

Using 'static/results/' here rather defeats the point of using os.path.join in the first place. Use 'static', 'results' instead. The os.path.abspath is unnecessary too, since you are not changing directory between when you construct the filename and when you use it to open the file.

global clashscore;global rama;global sidechain

Are you sure you need these variables to be global? You should always carefully check any global usage.


Never use bare except. Always explicitly list the exceptions you want to catch. Otherwise you may catch exceptions that should be fixed (e.g. a syntax error in the try block).

print ("Molprobity cannot be calculated...")

Consider using Python's logging module for these sorts of prints, so they can be turned off if desired.
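A minimal sketch of what that could look like (the message text is taken from the print above; the logger setup itself is an assumption):

import logging

logger = logging.getLogger(__name__)

# emit the same message through logging so callers can silence or redirect it
logger.warning("Molprobity cannot be calculated...")

# e.g. the CLI entry point could then control verbosity globally:
logging.basicConfig(level=logging.INFO)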

class get_input_information(object):

Normally classes are CamelCase, e.g. GetInputInformation.

self.system, = ihm.reader.read(open(self.mmcif_file),

Always better to open the file handle using with (a context manager) and put the read() inside the with body. This ensures the file is closed at the end of the scope.
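A minimal standalone sketch, assuming a single System per file (the filename is just an example):

import ihm.reader

mmcif_file = 'PDBDEV_00000001.cif'  # example input path
with open(mmcif_file, encoding='utf8') as fh:
    # the file handle is closed automatically at the end of the with block
    system, = ihm.reader.read(fh)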

def get_id_from_entry(self)->str:

This seems to be trying to handle the case where _entry.id and _struct.entry_id are different. This shouldn't happen (pretty sure the dictionary requires them to be the same) but if it does, you could ask the python-ihm developer to handle this ;)

aut=cit[0].authors
for ind in range(0,len(aut)):
    if ind==0:
        authors=str(aut[ind])
    else:
        authors+=';'+str(aut[ind])
return authors

This seems unnecessarily verbose. What about return "; ".join(aut for aut in cit[0].authors) ?

mol_name=entities.description

This doesn't look right. Are you sure you tested it? Shouldn't it be entities[0].description perhaps?

"""check resolution of structure,returns 0 if its atomic and 1 if the model is multires"""

Wouldn't True/False be more standard than 1/0 ?

assembly_id=map(int,self.get_assembly_ID_of_models())

It's unusual to see map in modern Python code; it's largely been replaced with comprehensions, e.g. assembly_id = [int(x) for x in self.get_assembly_ID_of_models()]

sampling_comp={'Step number':[], 'Protocol ID':[],'Method name':[],'Method type':[], \

Backslashes are unnecessary within brackets or parentheses.

RB=self.get_empty_chain_dict();RB_nos=[];all_nos=[];flex=self.get_empty_chain_dict()

Usually code is easier to read on multiple lines. Semicolons should generally be avoided.

chain=[]
for el in ass:
    chain.append(el._id)

chain = [el._id for el in ass] would be more concise.

unique=[used.append(x) for x in chain if x not in used]

I'm not sure what you're trying to do here. It's certainly unusual for the left side of a list comprehension to have side effects (used.append()). If it's something clever, add a comment to help the poor reader.
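If the intent is a list of unique chain IDs in order of first appearance, a minimal sketch (assuming that intent) would be:

chain = ['A', 'B', 'A', 'C']  # example input
unique = []
for x in chain:
    if x not in unique:
        unique.append(x)

# or, relying on dicts preserving insertion order (Python 3.7+):
unique = list(dict.fromkeys(chain))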

for _,el in enumerate(self.system.asym_units):

Why are you using enumerate if you're not using the count? Just use for el in self.system.asym_units: instead.


_ is usually used as a "I'm not using this value" placeholder, so this looks weird (and is also hard to read). Use a real variable name, e.g. software, instead.

if str(_.version) == '?':

There's a difference in mmCIF between ? and '?' which you're losing here. It would be more correct to say if _.version == ihm.unknown:

except AttributeError as error:

as error implies you're going to use the error object... and then you don't.


Huh? What's wrong with loc = 'Not listed' ?

if i.data_type =='unspecified':

"unspecified" isn't a valid value according to the IHM dictionary. Do you mean "Other" instead?

if 'CrossLink' in str(i.__class__.__name__):

This is not the right way to do this. Use isinstance instead.

if 'SAS' in str(data_type) and 'SAS' in str(database):
    return True
else:
    return False

return 'SAS' in str(data_type) and 'SAS' in str(database) would be more concise here.

self.nos=get_input_information.get_number_of_models(self)

You can just say self.nos = self.get_number_of_models() here unless you have overridden the method in this class but really want to still call the base class method (which would be confusing).

if linker=='DSS' and dist<=30:
    return 1
elif linker=='EDC' and dist<=20:
    return 1
elif linker=='EDC' and dist>20:
    return 0
elif dist<=30:
    return 1
else:
    return 0

Would perhaps be cleaner to use a dict here, e.g. something like

linkers = {'DSS': 30, 'EDC': 20}
return dist <= linkers.get(linker, 30)

That dict could go in a utility module somewhere so you can use it elsewhere, e.g. in cx_plots.py.

self.filename = os.path.join('Output/images//')

"Joining" one thing is weird.

model_spheres={i+1:[j.x,j.y,j.z,j.radius] for i,j in enumerate(spheres)}
model_spheres_df=pd.DataFrame(model_spheres, index=['X','Y','Z','R'])

You might use a lot of memory doing it this way (since you construct three copies of the coordinates - one in IHM, one in your model_spheres object, and then another in the DataFrame). Assuming you're using pandas 0.13 or later, you can avoid the intermediate model_spheres object by passing a generator to the DataFrame constructor instead of a dict.
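A minimal sketch of that approach, assuming spheres is the iterable of ihm sphere objects used above (note that spheres become rows here, i.e. the transpose of the original layout):

import pandas as pd

# build the DataFrame straight from a generator, skipping the intermediate dict
model_spheres_df = pd.DataFrame(
    ([s.x, s.y, s.z, s.radius] for s in spheres),
    columns=['X', 'Y', 'Z', 'R'])
# use model_spheres_df.T if the old column-per-sphere orientation is needed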

filename = open(os.path.join(os.getcwd(),self.resultpath,self.ID+'_temp_rama.txt'))

os.getcwd() isn't needed here. If you're trying to make the path absolute (although not needed here) os.path.abspath is the way to do it.

f_name_handle=open(f_name,'w+')
with f_name_handle as outfile:

This seems odd. with open(f_name,'w+') as outfile: would be more normal.

clashes_ordered=dict(sorted(clashes.items()))

Relying on dict ordering is fragile: plain dicts only guarantee to preserve insertion order from Python 3.7 onward, so sorting the items before building a plain dict only has the intended effect on recent Pythons.

if len(j.replace(',','').replace(':','').split()[0])>2:

Looks like you repeat j.replace(',','').replace(':','').split() a bunch of times here. Store it in a variable to make it more efficient and easier to read.
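For example, a minimal sketch (j is the line variable from the code above):

# clean and split the line once, then reuse the result
fields = j.replace(',', '').replace(':', '').split()
if len(fields[0]) > 2:
    ...  # same logic as before, now using `fields` everywhere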

dict1['Observed distance (&#8491)'].append(val)

Should add a comment for those of us who haven't memorized all of Unicode. I assume this is the Angstrom symbol (maybe the HTML &Aring; entity would be simpler).

print ("Error....unable to fetch data from SASBDB, please check the entry ID")

Wouldn't raise be more appropriate if this is an error?

for num,key in enumerate(list(data.keys())):

list() is unnecessary here.

list_sort=sorted(list_sub, key=lambda x: x[1])

Rather than lambda x: x[1] use operator.itemgetter(1).

if parameter_table['Estimated volume'] is None:
    parameter_table['Estimated volume'].append('N/A')

How could this ever work? None doesn't have an append method.

if len(sascifline)<3 and len(sascifline)>0 and '_sas_sample.specimen_concentration' in sascifline[0]:

Maybe more concise to say 0 < len(sascifline) < 3

if val_m[1].empty==False:

Would normally be written as if not val_m[1].empty: or perhaps if val_m[1].empty is False: if you also need to check it's not something non-boolean (e.g. None).

val=''
for el in tex:
    for subel in el:
        if subel==el[-1] and el==tex[-1]:
            val+=str(subel)+'. '
        elif subel==el[-1] and el!= tex[-1]:
            val+=str(subel)+', '
        else:
            val+=str(subel)+':'

Strings are immutable, so every time you say val += foo val is destroyed and replaced with a new, longer string object. Whenever you have val += foo inside a for loop, consider replacing with something like val = ''.join(something).
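A minimal join-based sketch of the loop above, assuming tex is non-empty (subelements joined by ':', elements by ', ', with a trailing '. '):

val = ', '.join(':'.join(str(subel) for subel in el) for el in tex) + '. '

This also avoids the by-value comparisons with el[-1] and tex[-1], which can misfire when elements repeat.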

def format_tupple(tex:list)->str:

"tupple" should be "tuple"

sublist=['%s: Chain %s (%d residues)' % (sub_dict['Subunit name'][i],sub_dict['Chain ID'][i],sub_dict['Total residues'][i]) for i in range(0,model_number)]

A first argument to range of 0 is redundant - that's the default.

new_restraints=dict()
for key,val in restraints.items():
    new_restraints[key]=list(set(val))

Seems a good candidate for a dict comprehension, e.g. new_restraints = {key: list(set(val)) for key,val in restraints.items()}


os.listdir('.') would be more concise.

Docker image

Create a Docker image with the right dependencies.

Add link to more detailed reports in "Overall quality" section

Gerado suggests:

Yes I would just add a sentence in the "Overall quality" section that mentions where to find the more detailed reports (i.e. the pdf and the submenus available further up on the page).

Personally I scrolled down the page, saw the "This validation report contains model quality assessments for the structure" sentence and assumed that what followed was the whole report...

Optimize static assets: logon.png

PDB-Dev logo: images/logon.png has a size of 1.1 MB, which is about the same as the whole report for some entries. Optimization with optipng reduces the size by only 8%. A JPEG version is ~400 KB.

If migrating to JPEG, the following files have to be modified:

about_validation.html:                            <img src="images/logon.png" class="float-left" alt="PDBDEV.org" height="100" width="110" style="margin-top: 0px; margin-bottom: 0px" />
templates/layout.html:                            <img src="../../../images/logon.png" class="float-left" alt="PDBDEV.org" height="100" width="110" style="margin-top: 0px; margin-bottom: 0px" />
templates/notformodeling.html:                            <img src="../../static/webimages/logon.png" class="float-left" alt="PDBDEV.org" height="100"  width="110"  style="margin-top: 10px; margin-bottom: 10px" />
templates/introduction.html:                            <img src="../../static/webimages/logon.png" class="float-left" alt="PDBDEV.org" height="100" width="110" style="margin-top: 10px; margin-bottom: 10px" />
validation_help.html:                            <img src="images/logon.png" class="float-left" alt="PDBDEV.org" height="100" width="110" style="margin-top: 0px; margin-bottom: 0px" />

Also, judging from the current code, the file size can be reduced further by shrinking the image dimensions.

Add percentages to all outlier reports

Currently, only the raw numbers are reported for any outliers (clashes, Ramachandran and standard geometry outliers from MolProbity). These cannot be used as an indicator of model quality, since they scale with the size of the model. Report the percentage as well to make this easier to interpret.

(Another suggestion would be to report outliers per 100 residues. A simple percentage is easier to calculate however and reports roughly the same thing. "Number of residues" is a slightly tricky quantity for integrative models, for example if there are parts of the model that are not atomic, or that don't report outliers, for example ligands.)

ihm fails on PDBDEV_00000088

There are multiple points in entry 88 where python-ihm fails during parsing:

Traceback (most recent call last):                                                                    
  File "/IHMValidation/example/../master/pyext/src/validation/__init__.py", line 74, in __init__
    self.system, = ihm.reader.read(fh, model_class=self.model)                                        
  File "/root/miniforge/lib/python3.9/site-packages/ihm/reader.py", line 3298, in read       
    more_data = r.read_file()                                                                         
  File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 594, in read_file
    return self._read_file_c()                                                                        
  File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 645, in _read_file_c  
    eof, more_data = _format.ihm_read_file(self._c_format)                                            
  File "/root/miniforge/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)                                
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 470153: invalid start byte

is the result of the sentence:

Typically, 14<B7>106 to 20<B7>106 photons were recorded at TAC channel-width of 14.1\xa0ps (IBH-5000U) or 8\xa0ps (EasyTau300).

The other error:

  File "/root/miniforge/lib/python3.9/codecs.py", line 322, in decode                                                                                                                                       
    (result, consumed) = self._buffer_decode(data, self.errors, final)                                
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 471889: invalid start byte  

originates from:

Sample conditions for the EPR experiments were 100 <B5>M protein in 100 mM NaCl, 50 mM Tris-HCl, 5 mM MgCl2, pH 7.4 dissolved in D2O with 12.5 % (v/v) glycerol-d8.

And finally, after deleting symbols causing previous errors:

Traceback (most recent call last):
  File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 645, in _read_file_c
    eof, more_data = _format.ihm_read_file(self._c_format)
_format.FileFormatError: Wrong number of data values in loop (should be an exact multiple of the number of keys) at line 1940098

@benmwebb @brindakv I need your help on that.

Refactor utility.dict_to_JSlist

This is a continuation of #38.

Looks like utility.dict_to_JSlist is heavily used throughout the code and performs a lot of iterations and list comprehensions.

if bool(d) and len(list(d.keys())) > 0:
    # add headers for table, which are the keys of the dict
    output_list.append(list(d.keys()))
    # add each row of the table as a list
    target = list(d.values())
    for ind in range(len(target[0])):
        sublist = []
        for el in target:
            el = ['_' if str(i) == '?' else str(i) for i in el]
            sublist.append(str(el[ind]))
        output_list.append(sublist)
return output_list

Although each list comprehension is individually efficient, they are used so heavily here that they cause a sizable delay.
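A minimal sketch of a possible refactor; the behaviour (header row first, then one row per index, with '?' replaced by '_') is assumed from the snippet above, and columns are assumed to be equal-length lists:

def dict_to_JSlist(d: dict) -> list:
    output_list = []
    if d:
        # header row: the dict keys
        output_list.append(list(d.keys()))
        # transpose the column lists into rows in a single pass
        for row in zip(*d.values()):
            output_list.append(['_' if str(i) == '?' else str(i) for i in row])
    return output_list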

Redundant paths definitions

Paths to tools from the MolProbity and ATSAS suites are explicitly defined through per-tool environment variables. I think this is redundant, since there is a generic and OS-independent alternative: relying on the PATH environment variable.

ATSAS="" 
Molprobity_ramalyze=""
Molprobity_molprobity=""
Molprobity_clashscore=""
Molprobity_rotalyze=""
wkhtmltopdf=""

run([config('Molprobity_ramalyze'), self.mmcif_file], stdout=outfile)

run([config('Molprobity_molprobity'), self.mmcif_file,

run([config('Molprobity_clashscore'), self.mmcif_file], stdout=outfile)

run([config('Molprobity_rotalyze'), self.mmcif_file], stdout=outfile)

Moreover, in the case of ATSAS it's a little misleading, since only datcmp is used, not the whole ATSAS package.

run([config('ATSAS'), 'fit1.csv',
     'fit2.csv'], stdout=outfile, shell=False)
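A minimal sketch of the PATH-based alternative; the executable names ('molprobity.clashscore', 'datcmp') are assumptions, not values taken from the current configuration:

import shutil

# look the tools up on PATH instead of per-tool environment variables
clashscore = shutil.which('molprobity.clashscore')
datcmp = shutil.which('datcmp')
if clashscore is None or datcmp is None:
    raise RuntimeError('MolProbity/ATSAS tools were not found on PATH')

# run([clashscore, mmcif_file], stdout=outfile)
# run([datcmp, 'fit1.csv', 'fit2.csv'], stdout=outfile)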

Sync text between entries with and without SAS data

Currently, entries without SAS (or any other additional data) show:

Data quality and fit to model assessments for other datasets and model uncertainty are under development.

while entries with SAS have:

Data quality assessment for SAS datasets and fit to model assessments for SAS datasets is also included in this assessment. Data quality and fit to model assessments for other datasets and model uncertainty are under development.

We should sync/rework text to explicitly show users what types of data there are and what types already have validation implemented.

Hardcoded paths prevent parallel execution

It would be nice to be able to update PDB-Dev in parallel. Together with the recent updates in #38, this would allow rebuilding the whole repo (with recalculated values) in under 2 minutes on a modern 32-128 core node.

So far I identified several places which interfere with parallel execution:

if os.path.isfile('test.cif'):
    os.remove('test.cif')
file_re = open('test.cif', 'w')

uses a hardcoded test.cif as a temporary filename
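A minimal sketch of one way around this, using the standard tempfile module (cleaned_text is a hypothetical variable standing in for whatever is currently written to test.cif):

import tempfile

cleaned_text = ''  # hypothetical content that currently goes into test.cif
with tempfile.NamedTemporaryFile(mode='w', suffix='.cif', delete=False) as file_re:
    file_re.write(cleaned_text)
temp_cif = file_re.name  # pass this unique path downstream; remove it when done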

def clean_all():
    '''
    delete all generated files
    '''
    # dirname_ed = os.getcwd()
    os.listdir('.')
    for item in os.listdir('.'):
        if item.endswith('.txt'):
            os.remove(item)
        if item.endswith('.csv'):
            os.remove(item)
        if item.endswith('.json'):
            os.remove(item)
        if item.endswith('.sascif'):
            os.remove(item)

removes any temp files matching these patterns, including temp files generated for other structures. It specifically hits SASCIF processing (other files do not appear to be reread, at least once MolProbity and excluded-volume results have already been calculated).

with open(code+'.json', 'w') as f:

with open(code+'.sascif', 'w') as f:

fname = key+str(fitnum)+'fit.csv'
with open(fname, 'w') as f:
    f.write(fit.text)

fit_1.to_csv('fit1.csv', header=False, index=False)

fit_2.to_csv('fit2.csv', header=False, index=False)
f1 = open('pval.txt', 'w+')

These are temporary files for SAS processing.

Don't emit "None" in output HTML

There are several places where "None" is written into the output HTML. One example is at https://pdb-dev-beta.wwpdb.org/Validation/PDBDEV_00000009/htmls/data_quality.html where the Dmax error is reported to be None nm. This is likely because the Python None value is being used as-is. "None nm" is obviously nonsensical; this should be reported instead as "0 nm" if there really is no error or, in the much more likely case that the error could not be calculated for some reason, that reason should be stated to the user.
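A minimal sketch of the kind of guard that could be added wherever such values are formatted; the key name and units follow the Dmax example above, and the lookup itself is hypothetical:

sas_metrics = {'Dmax error': None}  # hypothetical dict of SAS-derived values
dmax_error = sas_metrics.get('Dmax error')
if dmax_error is None:
    dmax_error_text = 'could not be estimated from the deposited data'
else:
    dmax_error_text = '%s nm' % dmax_error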

Create milestones

We need to set up a set of milestones to prioritize issues. Especially for the initial SAS release.

Set bounds on bokeh plot ranges

Looks like none of the bokeh plots currently have bounds set on their x/y ranges. This allows them to scroll away from the data or even in some cases results in odd-looking initial plots. For example see the excluded volume plot https://pdb-dev-beta.wwpdb.org/Validation/PDBDEV_00000012/htmls/main.html. There will never be a negative number of violations so it should not be possible to scroll the x range that way. This can be done with something like

p = bokeh.plotting.figure(..., x_range=Range1d(0, xmax, bounds=(0, None)))

Some pages in PDF reports are cut off

Looks like there is a problem with the width settings of some sections. The beginning of the report looks OK; problems start from the Data quality section and continue to the very end.

(Screenshot from 2022-05-09 showing cut-off pages in the PDF report.)

The problem seems to have been there since the beginning; at least it was already present in Ben's update, around the time of commit 9591c5a.

Add more test cases

Add more tests to the tests subdirectory to ensure that things that are fixed stay fixed. Add code coverage with codecov so that we can see where we're still lacking tests. Add these to GitHub Actions so that commits and pull requests are checked for breakage.

Misleading message about the number of bond outliers

Even if no outliers were detected (which means that everything is ok), the following message is printed:

Standard geometry: bond outliers[?]
Bond length outliers can not be evaluated for this model

Also incorrect formatting for the number of angle outliers:

Standard geometry: angle outliers[?]
There are 628 angle outliers in this entry (62800.0% of all angles). A summary is provided below, and a detailed list of outliers can be found

To be fixed here

Entries for testing: 9, 55, 141

After the fix all reports have to be updated.

Add timezone into PDF report

It is always better to state the timezone explicitly, especially when the service targets an international community:

(Screenshot from 2022-03-18 showing the report timestamp without a timezone.)
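A minimal sketch of a timezone-aware timestamp (UTC is chosen here purely as an example):

from datetime import datetime, timezone

timestamp = datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M %Z')
# e.g. '2022-03-18 21:33 UTC'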

Replace JavaScript with Jinja2

The pipeline

  1. extracts information from MolProbity/ATSAS into a Python dict (called Template_Dict in many parts of the code)
  2. uses Jinja2 to substitute this dict into JavaScript in the output HTML
  3. at runtime, relies on the user's browser to execute that JavaScript to fill in the page content, e.g. generating tables on the fly or disabling parts of the page that are not relevant.

Since the output HTML is static, step 3 is redundant. Jinja2 logic can be used instead to generate the final HTML directly in step 2. This would result in much less bulky HTML, and any errors would be detected at build time, rather than at runtime.
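A minimal sketch of the idea; the table key and data are made up for illustration, not taken from the actual Template_Dict:

import jinja2

template = jinja2.Template(
    '<table>\n'
    '{% for row in rows %}<tr>'
    '{% for cell in row %}<td>{{ cell }}</td>{% endfor %}'
    '</tr>\n{% endfor %}</table>')
Template_Dict = {'Subunits': [['A', 51], ['B', 106]]}  # example data
html = template.render(rows=Template_Dict['Subunits'])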

Parsing of PDBDEV_00000013 fails

Full trace:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_2655681/2431891216.py in <module>
      1 fname = '/home/domain/data/silwer/pdb_dev/IHMValidation_aozalevsky/example/PDBDEV_00000013.cif'
      2 with open(fname, encoding='utf8') as f:
----> 3     m, = ihm.reader.read(f, model_class=ihm.model.Model)

/usr/local/lib/python3.8/dist-packages/ihm/reader.py in read(fh, model_class, format, handlers, warn_unknown_category, warn_unknown_keyword, read_starting_model_coord, starting_model_class, reject_old_file, variant)
   3296             ukhandler.add_category_handlers(hs)
   3297         r.category_handler = dict((h.category, h) for h in hs)
-> 3298         more_data = r.read_file()
   3299         for h in hs:
   3300             h.finalize()

/usr/local/lib/python3.8/dist-packages/ihm/format.py in read_file(self)
    587 
    588            :exc:`CifParserError` will be raised if the file cannot be parsed.
--> 589 
    590            :return: True iff more data blocks are available to be read.
    591         """

/usr/local/lib/python3.8/dist-packages/ihm/format.py in _read_file_c(self)
    638         if self.unknown_category_handler is not None:
    639             _format.add_unknown_category_handler(self._c_format,
--> 640                                                  self.unknown_category_handler)
    641         if self.unknown_keyword_handler is not None:
    642             _format.add_unknown_keyword_handler(self._c_format,

/usr/local/lib/python3.8/dist-packages/ihm/reader.py in __call__(self, starting_model_id, asym_id, entity_poly_segment_id, dataset_list_id, starting_model_auth_asym_id, starting_model_sequence_offset, description)
   1500                  starting_model_sequence_offset, description):
   1501         m = self.sysr.starting_models.get_by_id(starting_model_id)
-> 1502         asym = self.sysr.ranges.get(
   1503             self.sysr.asym_units.get_by_id(asym_id), entity_poly_segment_id)
   1504         m.asym_unit = asym

/usr/local/lib/python3.8/dist-packages/ihm/reader.py in get(self, asym_or_entity, range_id)
    190             return asym_or_entity
    191         else:
--> 192             return asym_or_entity(*self._id_map[range_id])
    193 
    194 

KeyError: '1'

I narrowed down the issue to the order of two sections. The code fails on

 1409 loop_                                                                                                                                                                                                
 1410 _ihm_starting_model_details.starting_model_id                                                                                                                                                        
 1411 _ihm_starting_model_details.entity_id                                                                                                                                                                
 1412 _ihm_starting_model_details.entity_description                                                                                                                                                       
 1413 _ihm_starting_model_details.asym_id                                                                                                                                                                  
 1414 _ihm_starting_model_details.entity_poly_segment_id                                                                                                                                                   
 1415 _ihm_starting_model_details.starting_model_source                                                                                                                                                    
 1416 _ihm_starting_model_details.starting_model_auth_asym_id                                                                                                                                              
 1417 _ihm_starting_model_details.starting_model_sequence_offset                                                                                                                                           
 1418 _ihm_starting_model_details.dataset_list_id                                                                                                                                                          
 1419     1  1  CYP199A2    A    1   'experimental model'  A  -13  1                                                                                                                                       
 1420     2  2  HaPux       B    2   'experimental model'  A    0  2   

because the actual _ihm_entity_poly_segment records are defined ~40 lines below:

 1455 loop_                                                                                                                                                                                                
 1456 _ihm_entity_poly_segment.id                                                                                                                                                                          
 1457 _ihm_entity_poly_segment.entity_id                                                                                                                                                                   
 1458 _ihm_entity_poly_segment.seq_id_begin                                                                                                                                                                
 1459 _ihm_entity_poly_segment.seq_id_end                                                                                                                                                                  
 1460 _ihm_entity_poly_segment.comp_id_begin                                                                                                                                                               
 1461 _ihm_entity_poly_segment.comp_id_end                                                                                                                                                                 
 1462 1 1 1 399 SER ALA                                                                                                                                                                                    
 1463 2 2 1 106 PRO THR     

If I swap them with each other, parsing continues. Indeed, according to the schema, the _ihm_entity_poly_segment table should come first. @benmwebb can you check my analysis?

Reduce report generation time

Report generation (with precalculated data) typically takes anything from several minutes up to several hours. That looks unrealistic for a simple rendering task, so there must be some bottlenecks.

Below is a sample profiling log for PDBDEV_00000004:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   1762/1    0.020    0.000  103.081  103.081 {built-in method builtins.exec}
        1    0.005    0.005  103.081  103.081 Execute.py:7(<module>)
        2    0.000    0.000   51.620   25.810 api.py:30(from_file)
        2    0.000    0.000   51.602   25.801 pdfkit.py:160(to_pdf)
        4    0.000    0.000   51.590   12.898 subprocess.py:1090(communicate)
        2    0.000    0.000   51.589   25.794 subprocess.py:1926(_communicate)
      195   51.588    0.265   51.588    0.265 {method 'poll' of 'select.poll' objects}
        2    0.000    0.000   51.587   25.794 selectors.py:403(select)
        1    0.000    0.000   51.281   51.281 Execute.py:157(write_pdf)
        1    0.003    0.003   23.990   23.990 Report.py:117(run_model_quality)
       20    0.310    0.015   23.882    1.194 utility.py:16(dict_to_JSlist)
    31841   23.570    0.001   23.570    0.001 utility.py:40(<listcomp>)
      220    0.003    0.000   18.939    0.086 connectionpool.py:518(urlopen)
      220    0.002    0.000   18.899    0.086 connectionpool.py:357(_make_request)
        1    0.000    0.000   17.385   17.385 Report.py:348(run_sas_validation_plots)
      429    0.001    0.000   14.035    0.033 socket.py:690(readinto)
      220    0.001    0.000   13.262    0.060 client.py:1327(getresponse)
      220    0.001    0.000   13.259    0.060 client.py:312(begin)
      220    0.001    0.000   13.235    0.060 client.py:279(_read_status)
     1457    0.001    0.000   13.234    0.009 {method 'readline' of '_io.BufferedReader' objects}
      193    0.001    0.000   10.929    0.057 webdriver.py:404(execute)
      193    0.001    0.000   10.925    0.057 remote_connection.py:402(execute)
      193    0.002    0.000   10.922    0.057 remote_connection.py:423(_request)
      193    0.000    0.000   10.903    0.056 request.py:58(request)
      193    0.001    0.000   10.902    0.056 poolmanager.py:352(urlopen)
      242   10.803    0.045   10.803    0.045 {method 'recv_into' of '_socket.socket' objects}
      181    0.000    0.000   10.062    0.056 request.py:98(request_encode_body)

After a brief analysis of the calls and code I identified several bottlenecks:

  1. wkhtmltopdf calls:
        2    0.000    0.000   51.602   25.801 pdfkit.py:160(to_pdf)
        4    0.000    0.000   51.590   12.898 subprocess.py:1090(communicate)
        2    0.000    0.000   51.589   25.794 subprocess.py:1926(_communicate)
  2. utility.dict_to_JSlist:
       20    0.310    0.015   23.882    1.194 utility.py:16(dict_to_JSlist)
  3. various GET requests:
      220    0.003    0.000   18.939    0.086 connectionpool.py:518(urlopen)
      220    0.002    0.000   18.899    0.086 connectionpool.py:357(_make_request)

Let this issue be an umbrella issue. I'll open separate issues for individual bottlenecks.

Cache ATSAS outputs

example/Execute.py can take a very long time to run for some PDB-Dev entries. This is likely because it has to recalculate all the various SAS plots. This makes regenerating entries to fix minor typos rather time consuming. Consider caching the outputs of running ATSAS, perhaps in the Validation/results directory, in the same way that MolProbity outputs are cached. Care should be taken though to clear or invalidate the cache if part of the SAS pipeline itself changes.

Don't duplicate HTML templates

There is a great deal of duplication in the HTML templates in the templates directory. This means that changes need to be made in multiple locations and things can get out of sync. More use of Jinja2 blocks and macros and "extends" should be made to reduce this, following on from afb8b3f.

Improve handling of restraints in the summary table

Function get_restraints_info has to be refactored to:

  1. avoid suboptimal formatting:

    elif isinstance(i, ihm.restraint.PredictedContactRestraint):
        restraints_comp['Restraint info'].append('Distance: '+str(i.distance.distance)
        + ' between residues ' +
        str(i.resatom1.seq_id)
        + ' and ' + str(i.resatom2.seq_id))

  2. update the if-tree to support the current IHM specs; for instance, ihm.restraint.PredictedContactRestraint can have multiple types ('lower bound', 'upper bound', 'lower upper bound').

Fix parsing of clash scores for PDB-Dev 62, 63

Execute.py fails for PDB-Dev entries 62 and 63 with

Traceback (most recent call last):
  File "/IHMValidation/example/Execute.py", line 208, in <module>
    template_dict, molprobity_dict, exv_data = report.run_model_quality(
  File "/IHMValidation/example/../master/pyext/src/validation/Report.py", line 240, in run_model_quality
    clashscores, Template_Dict['tot'] = I_mp.clash_summary_table(
  File "/IHMValidation/example/../master/pyext/src/validation/molprobity.py", line 574, in clash_summary_table
    dict1 = self.orderclashdict(dict1)
  File "/IHMValidation/example/../master/pyext/src/validation/molprobity.py", line 584, in orderclashdict
    df = pd.DataFrame(modeldict)
  File "/root/miniforge/lib/python3.9/site-packages/pandas/core/frame.py", line 636, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/root/miniforge/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 502, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/root/miniforge/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 120, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/root/miniforge/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 674, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

Looks like for some reason the code is only finding clash scores for 20 models even though the PDB-Dev entry has 25. Either the parsing of the MolProbity output is deficient here, or there really are no clashes for some of the models (in which case empty lists likely need to be returned so that everything works).

SAS processing is unaware of measuring units

It looks like the code is hardcoded for 1/A units and thus fails on files with 1/nm units (related to #53).

There are multiple places with a hardcoded 1/A-to-1/nm conversion:

I_df['Q'] = I_df['Q']*10

I_df['Q'] = I_df['Q']*10

pdf_re['Q'] = pdf_re['Q']*10

G_df_range['Q'] = G_df['Q']*10

G_df_range['Q2A'] = G_df_range['Q2']*100

The information about units is stored in the sascif file:

SASDC29

_sas_scan.unit                      1/A

SASDDD6

_sas_scan.unit                      1/nm
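A minimal sketch of a unit-aware conversion; scan_unit is assumed to hold the _sas_scan.unit value read from the SASCIF file, and the target unit of 1/nm is inferred from the existing *10 factors:

def q_to_inverse_nm(q_values, scan_unit):
    # convert momentum-transfer values to 1/nm based on the declared unit
    if scan_unit == '1/A':
        return q_values * 10
    if scan_unit == '1/nm':
        return q_values
    raise ValueError('Unexpected _sas_scan.unit: %r' % scan_unit)

# I_df['Q'] = q_to_inverse_nm(I_df['Q'], scan_unit)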

Validation pipeline fails on PDB-Dev 8 and 66

Parsing of the PDB-Dev entries 8 and 66 fails with the following trace:

Traceback (most recent call last):
  File "/IHMValidation/example/Execute.py", line 203, in <module>
    report = WriteReport(args.f)
  File "/IHMValidation/example/../master/pyext/src/validation/Report.py", line 26, in __init__
    self.input = GetInputInformation(self.mmcif_file)
  File "/IHMValidation/example/../master/pyext/src/validation/__init__.py", line 32, in __init__
    self.system, = ihm.reader.read(fh, model_class=self.model)
  File "/root/miniforge/lib/python3.9/site-packages/ihm/reader.py", line 3260, in read
    more_data = r.read_file()
  File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 589, in read_file
    return self._read_file_c()
  File "/root/miniforge/lib/python3.9/site-packages/ihm/format.py", line 640, in _read_file_c
    eof, more_data = _format.ihm_read_file(self._c_format)
  File "/root/miniforge/lib/python3.9/site-packages/ihm/reader.py", line 1247, in __call__
    a.append(self.sysr.ranges.get(obj, entity_poly_segment_id))
  File "/root/miniforge/lib/python3.9/site-packages/ihm/reader.py", line 191, in get
    return asym_or_entity(*self._id_map[range_id])
  File "/root/miniforge/lib/python3.9/site-packages/ihm/__init__.py", line 1289, in __call__
    return AsymUnitRange(self, seq_id_begin, seq_id_end)
  File "/root/miniforge/lib/python3.9/site-packages/ihm/__init__.py", line 1198, in __init__
    raise TypeError("Can only create ranges for polymeric entities")
TypeError: Can only create ranges for polymeric entities

The parsing code seems to be quite generic:

self.model = ihm.model.Model
try:
    with open(self.mmcif_file, encoding='utf8') as fh:
        self.system, = ihm.reader.read(fh, model_class=self.model)
except UnicodeDecodeError:
    with open(self.mmcif_file, encoding='ascii', errors='ignore') as fh:
        self.system, = ihm.reader.read(fh, model_class=self.model)

So I presume the problem is indeed in the cif files. PDB-Dev 8 has this in the header:

#
loop_
_entity.id
_entity.type
_entity.src_method
_entity.pdbx_description
_entity.formula_weight
_entity.pdbx_number_of_molecules
_entity.details
1 polymer man chr2L_60-161 ? 1 ?
#
<...>
loop_
_struct_asym.id
_struct_asym.entity_id
_struct_asym.details
A 1 chr2L_60-161
#

Which, I guess, causes the failure at the asym.entity check in python-ihm: https://github.com/ihmwg/python-ihm/blob/0989b68412c01359e9f51aaf8413325532306737/ihm/__init__.py#L1196-L1201

@benmwebb I guess I need your advice on this: is this an actual artifact in the cif, or should it be handled in the code?

Incorrect model number detection for excluded volume data

In

exv_data = {
    'Models': line[0], 'Excluded Volume Satisfaction (%)':
    line[1], 'Number of violations': line[2]}
Template_Dict['NumModels'] = len(exv_data)

and

exv_data = I_ev.run_exc_vol_parallel(model_dict)
Template_Dict['NumModels'] = len(exv_data)

the number of keys in the exv_data dict is returned instead of the actual number of models.
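A minimal sketch of the fix, assuming exv_data['Models'] is the per-model list built above:

# count the models themselves rather than the dict's keys
Template_Dict['NumModels'] = len(exv_data['Models'])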

Potential issue with webdriver

The current HTML export code fails if the path to firefox/geckodriver does not point directly to the binary executable. This is exactly the case for the Conda installation used in the Docker recipe.

The issue was reported here: bokeh/bokeh#10108
As a workaround, the path to conda firefox executable can be hardcoded like this:
export PATH=/root/miniforge/bin/FirefoxApp:${PATH}

I'll update docker and singularity recipes later.
