
pandda_2_gemmi's Introduction

PanDDA 2

New in version 0.1.0

  • Improved README.md
  • Phenix dependency removed
  • PanDDA now defaults to using cifs that are already present, with no rebuilding
  • Cleaner logs
  • Improved ranking for unbuildable events
  • Memory performance optimizations
  • The filters applied to datasets can be configured with --data_quality_filters and --reference_comparability_filters

Reporting Errors

PanDDA 2 is still in development and feedback is always appreciated!

If you have a problem with installation, it is most likely specific to your system, so it is best to email me directly.

If the program errors while it runs, it is most helpful if you include the command line output and the json log in a GitHub issue.

If you are uncertain about the correctness of the results, then a GitHub issue is appropriate if you can share information publicly, in particular screenshots of maps or ligand fits. If you cannot, then an email to me is the best way to raise your concerns. Either way, please include the program output, the json log and screenshots of the offending z maps/event maps/autobuilds.

Installation

It is recommended that you install PanDDA 2 in its own Python 3.9 Anaconda environment. If you do not have Anaconda installed, you can set it up by following the instructions at https://www.anaconda.com/products/distribution#linux.

Then:

conda create -n pandda2 python=3.9
conda activate pandda2
conda install -c conda-forge -y fire numpy scipy joblib scikit-learn umap-learn bokeh dask dask-jobqueue hdbscan matplotlib rich seaborn rdkit openbabel
pip install ray
git clone https://github.com/ConorFWild/pandda_2_gemmi.git
cd pandda_2_gemmi
pip install -e .
cd _gemmi
pip install .
pip install numpy==1.21.0

Installing PanDDA 2 this way will add various scripts to your path, but only while you are in this anaconda environment.

Running PanDDA 2

PanDDA 2 supports autobuilding of events and ranking them by autobuildability. All you need to do is ensure that BUSTER (and hence ana_pdbmaps and rhofit) is set up in your path.

Once you have installed PanDDA 2 in a conda environment, it can be run from that environment with autobuilding and automated ground state identification as follows:

python /path/to/analyse.py --data_dirs=<data directories> --out_dir=<output directory> --pdb_regex=<pdb regex> --mtz_regex=<mtz regex> <options>

Minimal Run

If you want to run the lightest possible PanDDA (no clustering of datasets, no autobuilding, etc.: basically PanDDA 1), then a command like the following is appropriate:

python /path/to/analyse.py --data_dirs=<data directories> --out_dir=<output directory> --pdb_regex=<pdb regex> --mtz_regex=<mtz regex> --autobuild=False --rank_method="size" --comparison_strategy="high_res_first" <options>

Running With Distributed Computing At Diamond

It is strongly recommended that, if you are qsub'ing a script that will run PanDDA 2, you set up your environment on the head node (by activating the Anaconda environment in which PanDDA 2 is installed) and use the "-V" option on qsub to copy your current environment to the job.

An example of how to run with distributed computing at Diamond Light Source is as follows:

# Ensure CCP4 and the Global Phasing code (BUSTER) needed for autobuilding are available
module load ccp4
module load buster

# Put the following in the file submit.sh
python /path/to/analyse.py --data_dirs=<data dirs> --out_dir=<output dirs> --pdb_regex=<pdb regex> --mtz_regex=<mtz regex> --global_processing="distributed" <options>

# Submitting
chmod 777 submit.sh
qsub -V -o submit.o -e submit.e -q medium.q -pe smp 12 -l m_mem_free=15G submit.sh

How PanDDA 2 works

PanDDA 2 differs from PanDDA 1 in two major methodological ways.

Firstly, it attempts to identify which sets of datasets should be used to produce a statistical model which has the optimal contrast (to each test dataset individually). This allows it to handle subtle heterogeneity between datasets.

Secondly, it attempts to autobuild the events it returns and then ranks them based on the quality of the fragment model that could be constructed. This improves the ranking of events.

Statistical Model Dataset Selection

PanDDA 2 selects the datasets used to construct each statistical model it tries by identifying the sets of datasets which are closest to each other. This is achieved by the following steps (sketched in code below):

  • Finding all pairwise distances between electron density maps
  • Finding the nearest neighbours of each dataset's electron density map
  • Finding the non-overlapping neighbourhoods with the minimum standard deviation between the maps they contain.
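
As a rough illustration of this neighbourhood search (not the actual PanDDA 2 implementation), the sketch below assumes each dataset's map has already been sampled onto a common grid and flattened into a numpy vector; all names are illustrative.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def select_characterisation_sets(maps, neighbourhood_size=30):
    """Pick non-overlapping groups of similar electron density maps.

    maps: array of shape (n_datasets, n_grid_points), one flattened
    map per dataset.
    """
    # All pairwise distances between electron density maps
    distances = squareform(pdist(maps))

    # The nearest neighbours of each dataset's map (including itself)
    neighbourhoods = np.argsort(distances, axis=1)[:, :neighbourhood_size]

    # Spread of each neighbourhood: mean per-voxel standard deviation
    # between the maps it contains
    spreads = np.array([maps[idxs].std(axis=0).mean() for idxs in neighbourhoods])

    # Greedily keep the tightest neighbourhoods that do not overlap
    selected, used = [], set()
    for i in np.argsort(spreads):
        members = set(neighbourhoods[i].tolist())
        if not members & used:
            selected.append(neighbourhoods[i])
            used |= members
    return selected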

Statistical Model Selection

Currently PanDDA 2 only chooses one statistical model to progress to event map generation and autobuilding. This is done by fitting several conformations of the expected fragment into each event from each statistical model by differential evolution.

The statistical model which has the best fragment fit is then selected for progression to event map generation and autobuilding.
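
The sketch below shows the general shape of such a scoring step, using scipy's differential evolution to optimise a rigid-body placement of the fragment against an event map; the interpolator callable, bounds and scoring function are assumptions for illustration, not the code PanDDA 2 actually uses.

import numpy as np
from scipy.optimize import differential_evolution
from scipy.spatial.transform import Rotation

def score_fragment_fit(event_map_interpolator, fragment_coords):
    """Return the best sum of event map values over rigid-body placements.

    event_map_interpolator: assumed callable mapping an (x, y, z) position
    to a map value; fragment_coords: (n_atoms, 3) array of atom positions.
    """
    centre = fragment_coords.mean(axis=0)

    def negative_fit(params):
        translation, angles = params[:3], params[3:]
        rotation = Rotation.from_euler("xyz", angles)
        placed = rotation.apply(fragment_coords - centre) + centre + translation
        # Negated so that differential_evolution (a minimiser) maximises the fit
        return -sum(event_map_interpolator(pos) for pos in placed)

    bounds = [(-3.0, 3.0)] * 3 + [(-np.pi, np.pi)] * 3
    result = differential_evolution(negative_fit, bounds, maxiter=50, seed=0)
    return -result.fun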

Autobuilding

PanDDA 2 autobuilds the events of each dataset. This is done with a custom Rhofit invocation to account for the differing distribution of values between event maps and normal crystallographic 2Fo-Fc maps.
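
For orientation only, an invocation of this kind boils down to writing the event map to disk and calling Rhofit in a subprocess; the flags below are placeholders to show the shape of the call, not the exact arguments PanDDA 2 passes (see the PanDDA 2 source or the Rhofit documentation for those).

import subprocess
from pathlib import Path

def autobuild_event(ligand_cif: Path, receptor_pdb: Path, event_mtz: Path, out_dir: Path):
    """Run Rhofit against an event map exported as an MTZ (illustrative)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    command = [
        "rhofit",
        "-l", str(ligand_cif),    # ligand restraints (placeholder flag)
        "-p", str(receptor_pdb),  # receptor model (placeholder flag)
        "-m", str(event_mtz),     # event map data (placeholder flag)
        "-d", str(out_dir),       # output directory (placeholder flag)
    ]
    return subprocess.run(command, check=True, capture_output=True, text=True)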

Autobuild Selection

Current limitations with the interaction between pandda.inspect and PanDDA 2 mean that it is only possible to show one autobuild in the GUI.

Therefore, the highest scoring autobuild by RSCC from any event in each dataset is selected and included in the initial model pandda.inspect shows the user.

This has the effect that users may open apparently good hit density with no autobuild present if another hit in the same dataset was better fitted by the autobuilding.
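
In effect the selection is just a maximum over the builds produced for a dataset; a minimal sketch, with the data structure assumed rather than taken from the code:

def select_autobuild_for_dataset(build_rsccs):
    """build_rsccs: {path_to_autobuilt_ligand: RSCC score} for one dataset.

    The highest-RSCC build from any event in the dataset is the one
    merged into the initial model that pandda.inspect shows.
    """
    return max(build_rsccs, key=build_rsccs.get)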


pandda_2_gemmi's Issues

Building wheel for gemmi (setup.py) fails

Error happens when building the env for pandda_2 (gemmi installation)

Initial and last lines of error:

(pandda2) [vrangel@uk-abi-svnc04 gemmi]$ pip install .
Processing /data/SB_Results/Results_Data/Z_test-Victor/PanDDA/PanDDA2/untouched/pandda_2_gemmi/_gemmi
Preparing metadata (setup.py) ... done
Building wheels for collected packages: gemmi
Building wheel for gemmi (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [107 lines of output]
/home/UK/vrangel/.conda/envs/pandda2/lib/python3.9/site-packages/setuptools/installer.py:27: SetuptoolsDeprecationWarning: setuptools.installer is deprecated. Requirements should be satisfied by a PEP 517 installer.

/data/SB_Results/Results_Data/Z_test-Victor/PanDDA/PanDDA2/untouched/pandda_2_gemmi/_gemmi/.eggs/pybind11-2.10.0-py3.9.egg/pybind11/include/pybind11/pybind11.h:2379:48: required from ‘pybind11::iterator pybind11::make_iterator(Iterator&&, Sentinel&&, Extra&& ...) [with pybind11::return_value_policy Policy = pybind11::return_value_policy::reference_internal; Iterator = const gemmi::SpaceGroup (&)[555]; Sentinel = const gemmi::SpaceGroup*; ValueType = const gemmi::SpaceGroup&; Extra = {}]’
python/sym.cpp:212:67: required from here
/data/SB_Results/Results_Data/Z_test-Victor/PanDDA/PanDDA2/untouched/pandda_2_gemmi/_gemmi/.eggs/pybind11-2.10.0-py3.9.egg/pybind11/include/pybind11/pybind11.h:2347:29: error: lvalue required as increment operand
2347 | ++s.it;
| ~~^~
error: command '/cm/local/apps/gcc/10.2.0/bin/gcc' failed with exit code 1
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for gemmi
Running setup.py clean for gemmi
Failed to build gemmi

Smoothing factor value not properly updated

  • What do we expect as a result for the "b-factor smoothing"? Should
    that adjust the "smoothing_factor" value of the Dataset
    structure? I can't see anything happening ... am I missing
    something?

<- this is a silly error

ModuleNotFoundError: No module named 'seaborn'

I installed this as instructed but am unable to run it. Could you help me understand what is missing from my setup?

The result of an attempt via analyse:

File "/home/derek/pandda_2_gemmi/pandda_gemmi/analyse.py", line 211, in process_pandda
structure_factors)

UnboundLocalError: local variable 'structure_factors' referenced before assignment

'exception': "local variable 'structure_factors' referenced before assignment",
'trace': 'Traceback (most recent call last):\n'
' File "/home/derek/pandda_2_gemmi/pandda_gemmi/analyse.py", line '
'211, in process_pandda\n'
' structure_factors)\n'
"UnboundLocalError: local variable 'structure_factors' referenced "
'before assignment\n'}

PanDDA default output is unclear on many stages

  • What are those messages

    Opened pool in 5.566833019256592, closed pool in
    0.7841935157775879, mapped in 265.9636478424072

    report? Timings? memory? Something else? Is that a useful
    precision here?

  • What is the meaning of "batches" in messages like

            total_sample_size= 502
            batch_size= 90
            num_batches= 6
            All batches larger than batch size, trying smaller split!
            All batches larger than batch size, trying smaller split!
            All batches larger than batch size, trying smaller split!
            All batches larger than batch size, trying smaller split!
            All batches larger than batch size, trying smaller split!
            Batches are: [array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
    

    Is that related to clusters? Or to general processing (local, SGE
    etc)? It seems to come in the "Deciding on the datasets to
    characterise the groundstate for each dataset to analyse" section
    ... which is a slightly confusing title: is that the part where
    the clustering is happening?

  • When I see

    Exception ignored in: <module 'threading' from '/home/vonrhein/Projects/PanDDA/20211104/Anaconda/Results.babinet/20211104/anaconda3/envs/pandda2-20220228/lib/python3.9/threading.py'>
    Traceback (most recent call last):
    File "/home/vonrhein/Projects/PanDDA/20211104/Anaconda/Results.babinet/20211104/anaconda3/envs/pandda2-20220228/lib/python3.9/threading.py", line 1408, in _shutdown
    Exception ignored in: <module 'threading' from '/home/vonrhein/Projects/PanDDA/20211104/Anaconda/Results.babinet/20211104/anaconda3/envs/pandda2-20220228/lib/python3.9/threading.py'>
    Traceback (most recent call last):
    File "/home/vonrhein/Projects/PanDDA/20211104/Anaconda/Results.babinet/20211104/anaconda3/envs/pandda2-20220228/lib/python3.9/threading.py", line 1408, in _shutdown
    def _shutdown():def _shutdown():

    File "/home/vonrhein/Projects/PanDDA/20211104/Anaconda/Results.babinet/20211104/anaconda3/envs/pandda2-20220228/lib/python3.9/site-packages/ray/worker.py", line 1056, in sigterm_handler
    File "/home/vonrhein/Projects/PanDDA/20211104/Anaconda/Results.babinet/20211104/anaconda3/envs/pandda2-20220228/lib/python3.9/site-packages/ray/worker.py", line 1056, in sigterm_handler
          sys.exit(signum)sys.exit(signum)
    

    SystemExit: 15
    SystemExit: 15

    why is pandda still running? Clearly, there is an error ... but it
    seems to ignore it (not very comforting).

Installation unsuccessful - Error: class GetAutobuildResultRhofit(GetAutobuildResultInterface): TypeError: __init__() takes 2 positional arguments but 4 were given

Command used with data from the Pandda tutorial:
python ~/pandda_2_gemmi/pandda_gemmi/analyse.py --data_dirs='./data/' --out_dir='./output/' --pdb_regex='.dimple.pdb' --mtz_regex='.data.mtz'
Same error for:
python ~/pandda_2_gemmi/pandda_gemmi/analyse.py --data_dirs='./data/' --out_dir='./output/' --pdb_regex='.dimple.pdb' --mtz_regex='.data.mtz' --autobuild=False --rank_method=size --comparison_strategy=high_res_first

Error message:
Traceback (most recent call last):
File "/nfs/home/f_beck26/pandda_2_gemmi/pandda_gemmi/autobuild_pandda_results.py", line 49, in
from pandda_gemmi.logs import (
File "/nfs/home/f_beck26/pandda_2_gemmi/pandda_gemmi/logs.py", line 13, in
from pandda_gemmi.event import Event
File "/nfs/home/f_beck26/pandda_2_gemmi/pandda_gemmi/event/init.py", line 3, in
from pandda_gemmi.event.event_scoring_inbuilt import GetEventScoreInbuilt
File "/nfs/home/f_beck26/pandda_2_gemmi/pandda_gemmi/event/event_scoring_inbuilt.py", line 21, in
from pandda_gemmi.autobuild.cif import generate_cif
File "/nfs/home/f_beck26/pandda_2_gemmi/pandda_gemmi/autobuild/init.py", line 1, in
from pandda_gemmi.autobuild.autobuild import (merge_ligand_into_structure_from_paths,
File "/nfs/home/f_beck26/pandda_2_gemmi/pandda_gemmi/autobuild/autobuild.py", line 1401, in
class GetAutobuildResultRhofit(GetAutobuildResultInterface):
TypeError: __init__() takes 2 positional arguments but 4 were given

Help message is out of date

  • -h message incorrect: doesn't contain "=" in arguments ... or
    maybe that is not needed anyway?

  • What does

    --structure_factors STRUCTURE_FACTORS
    A string which gives the structure factors to use. It
    should be of the form 'FWT,PHWT'.

    mean? --structure_factors='2FOFCWT,PH2FOFCWT' doesn't seem to
    work: I get at some point

    StructureFactors(f='2', phi='F')

<- needs fixing in message and also internally


  • The explanations to --local_processing are very opaque and
    technical. What should one use if sitting on a multi-core box?
    Maybe --local_processing=multiprocessing_spawn and
    --global_processing=serial? Furthermore, the default reported
    (local_processing=ray) is not even explained.

    Why is --local_cpus defaulting to 6 on a 32 thread machine? And
    why does it say memory_availability=low as a default? I would
    expect the default to be "run only on local machine, using all
    available threads" - anything finetuned towards cluster systems,
    DLS or other environment should require special arguments (and not
    defaults). Or maybe provide settings files (similar to our macros)
    that could hold specific collections of settings that could be
    loaded via a command-line flag.

<- needs clearer readme and an option internally

Error is deeply unclear when wrong structure factors / wrongly formatted ones are given to program

  • when running with incorrect
    --structure_factors='("2FOFCWT","PH2FOFCWT")' we get a

    Looking for common structure factors in datasets...
    Found structure factors!

    which is totally unhelpful. Afterwards we get

    Filtering datasets with invalid structure factors...
    Done!

    and then all datasets are being filtered. The argument parsing, error
    checking and messages to stdout are not very helpful: they don't
    tell me what is being done, what the result of a step is, what
    potential problems were encountered ... Clearly, a simple mistake
    like giving the incorrect column names is something that should be
    handled gracefully here ... ?

    It doesn't help that the various classes (like PanDDAFSModel etc)
    don't seem to provide a function for adequate pretty-printing via
    pprint ... or am I missing something?

<- should be straightforward to fix?
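
A sketch of the kind of graceful handling being asked for here, assuming the option keeps its documented 'FWT,PHWT' form (names are illustrative, not the existing PanDDA 2 code): parse the pair up front and fail with a clear message if either column is absent, rather than silently filtering every dataset.

def parse_structure_factors(arg):
    """Parse --structure_factors of the documented form 'FWT,PHWT'."""
    parts = [p.strip() for p in arg.split(",")]
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            f"--structure_factors must be two comma-separated column names "
            f"like 'FWT,PHWT', got: {arg!r}"
        )
    return parts[0], parts[1]

def check_dataset_columns(mtz_columns, f, phi, dtag):
    """Report exactly which columns are missing from which dataset."""
    missing = {f, phi} - set(mtz_columns)
    if missing:
        raise ValueError(
            f"Dataset {dtag} is missing structure factor columns {sorted(missing)}; "
            f"available columns are {sorted(mtz_columns)}"
        )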

Make readme clearer on various options to run the program

Fix the documentation in README.md (--data_dirs) and explain that the
argument needs to end in a "/". Or better: make the program aware
that the --data_dirs argument could have no trailing "/" (and add
it automatically), e.g. via:

<- can pull from Clemens
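
A minimal sketch of one way to do that normalisation (not the snippet the report originally pointed to), assuming pathlib is acceptable:

from pathlib import Path

def normalise_data_dirs(data_dirs):
    """Accept --data_dirs with or without a trailing '/'."""
    # pathlib drops any trailing separator; add exactly one back.
    return str(Path(data_dirs)) + "/"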


  • what is the set of commands to run in a normal way, i.e. on the
current machine (and not get messages like)

  E0228 14:03:18.298266714   54072 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
  E0228 14:03:18.323015215   54072 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
  E0228 14:03:18.343549771   54072 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies

  • when running with incorrect
    --structure_factors='("2FOFCWT","PH2FOFCWT")' we get a

    Looking for common structure factors in datasets...
    Found structure factors!

    which is totally unhelpful. Afterwards we get
    Filtering datasets with invalid structure factors...
    Done!

    and then all datasets are being filtered. The argument parsing, error
    checking and messages to stdout are not very helpful: they don't
    tell me what is being done, what the result of a step is, what
    potential problems were encountered ... Clearly, a simple mistake
    like giving the incorrect column names is something that should be
    handled gracefully here ... ?

<- needs fixing internally as well as fixing in readme


  • What does

    --structure_factors STRUCTURE_FACTORS
    A string which gives the structure factors to use. It
    should be of the form 'FWT,PHWT'.

    mean? --structure_factors='2FOFCWT,PH2FOFCWT' doesn't seem to
    work: I get at some point

    StructureFactors(f='2', phi='F')

    The README.md file also shows a different syntax.


  • How can I switch off that validator option? I don't want it to
    automatically test for Rfree values ...

    In general: I would like a PanDDA version that just does what it
    says on the tin: look for events within a given set of
    changed_state map/models knowing about a given set of ground_state
    map/models.

    On top of that basic/code mode, there can be a ground-state
    defining module, a clustering module and then some preparation
    modules (checking, job distribution setup etc).

    At the moment I feel overwhelmed by restrictions, checks and
    assumptions that prevent me from doing the actual analysis
    ...

<- needs implementation internally and updates to readme


  • The explanations to --local_processing are very opaque and
    technical. What should one use if sitting on a multi-core box?
    Maybe --local_processing=multiprocessing_spawn and
    --global_processing=serial? Furthermore, the default reported
    (local_processing=ray) is not even explained.

    Why is --local_cpus defaulting to 6 on a 32 thread machine? And
    why does it say memory_availability=low as a default? I would
    expect the default to be "run only on local machine, using all
    available threads" - anything finetuned towards cluster systems,
    DLS or other environment should require special arguments (and not
    defaults). Or maybe provide settings files (similar to our macros)
    that could hold specific collections of settings that could be
    loaded via a command-line flag.

<- once implemented needs detailing in readme


Dataset filtering due to structure factors results in unclear error

Clearer error when all datasets filtered due to structure factors and cannot get a reference structure:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/asharff/miniconda3a/envs/pandda2-centos7-develop/lib/python3.9/site-packages/pandda_gemmi/ │
│ analyse.py:475 in process_pandda │
│ │
│ 472 │ │ │
│ 473 │ │ # Select refernce │
│ 474 │ │ with STDOUTManager('Deciding on reference dataset...', f'\tDone!'): │
│ ❱ 475 │ │ │ reference: ReferenceInterface = GetReferenceDataset()( │
│ 476 │ │ │ │ datasets_wilson, │
│ 477 │ │ │ │ datasets_statistics, │
│ 478 │ │ │ ) │
│ │
│ /home/asharff/miniconda3a/envs/pandda2-centos7-develop/lib/python3.9/site-packages/pandda_gemmi/ │
│ dataset/dataset.py:747 in __call__
│ │
│ 744 │ datasets: DatasetsInterface, │
│ 745 │ dataset_statistics: DatasetsStatisticsInterface, │
│ 746 │ ) -> ReferenceInterface: │
│ ❱ 747 │ │ return get_reference_from_datasets(datasets, dataset_statistics) │
│ 748 │
│ 749 │
│ 750 @dataclasses.dataclass() │
│ │
│ /home/asharff/miniconda3a/envs/pandda2-centos7-develop/lib/python3.9/site-packages/pandda_gemmi/ │
│ dataset/dataset.py:737 in get_reference_from_datasets │
│ │
│ 734 │ │ # min_resolution_structure = datasets[min_resolution_dtag].structure │
│ 735 │ │ # min_resolution_reflections = datasets[min_resolution_dtag].reflections │
│ 736 │ │ │
│ ❱ 737 │ │ return Reference(min_resolution_dtag, │
│ 738 │ │ │ │ │ │ datasets[min_resolution_dtag] │
│ 739 │ │ │ │ │ │ ) │
│ 740 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
UnboundLocalError: local variable 'min_resolution_dtag' referenced before assignment
Saving PanDDA log to: results/BAZ2BA-21/pandda_log.json

Ensure there is a way to limit resolutions of data analysed

  • How can one limit the resolution of the data analysed? The
    explanation given

    --high_res_upper_limit HIGH_RES_UPPER_LIMIT
    A float that gives the highest resolution datasets to
    conside for processing.
    --high_res_lower_limit HIGH_RES_LOWER_LIMIT
    A float that gives the lowest resolution datasets to
    consider processing

    is not very clear. How can a float define a "resolution datasets"?
    And: I can't see where that parameter would be used anywhere
    anyway ... confused ...

separate analysis and auto-building step

It would be great if the analysis and auto-building steps could be run separately. Also, it would be good if the csv files for pandda.inspect could be written out after the analysis step finishes, even if auto-building with rhofit runs as part of the process. We had instances where an error during auto-building stopped the process, which essentially left us with no suitable input for pandda.inspect, even though 2500 event maps had been written out.

Improve and test readme

Program usage has developed significantly since the last time this was worked on seriously: it needs updating and testing by an external user.

Clean up pandda console output

A lot of things render poorly including:

  • Chain statistics
  • Ligand files
  • Print output of times (most of which should probably go behind debug)
  • Lack of printing progress through datasets (i.e. : x/y)
  • Autobuild scores

Exception raised during analysis

When I try to run the pandda analysis, after a while processing stops with:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│                                                                              │
│ /home/elmjag/pandda2/pandda_2_gemmi/analyse.py:609 in process_pandda         │
│                                                                              │
│   606 │   │   │   │   │   process_autobuilds = process_global                │
│   607 │   │   │   │                                                          │
│   608 │   │   │   │   time_autobuild_start = time.time()                     │
│ ❱ 609 │   │   │   │   autobuild_results_list: Dict[EventID, AutobuildResult] │
│   610 │   │   │   │   │   [                                                  │
│   611 │   │   │   │   │   │   Partial(                                       │
│   612 │   │   │   │   │   │   │   autobuild_func,                            │
│ /home/elmjag/pandda2/pandda_2_gemmi/pandda_gemmi/pandda_functions.py:40 in   │
│ process_local_serial                                                         │
│                                                                              │
│     37 def process_local_serial(funcs):                                      │
│     38 │   results = []                                                      │
│     39 │   for func in funcs:                                                │
│ ❱   40 │   │   results.append(func())                                        │
│     41 │                                                                     │
│     42 │   return results                                                    │
│     43                                                                       │
│                                                                              │
│ /home/elmjag/pandda2/pandda_2_gemmi/pandda_gemmi/common/partial_func.py:8 in │
│ __call__                                                                     │
│                                                                              │
│   5 │   │   self.kwargs = kwargs                                             │
│   6 │                                                                        │
│   7 │   def __call__(self, *args, **kwargs):                                 │
│ ❱ 8 │   │   return self.func(*args, *self.args, **kwargs, **self.kwargs)     │
╰──────────────────────────────────────────────────────────────────────────────╯

See attached files below for full log.
stdout.log
stderr.log

The analysis is launched with following arguments:

python analyse.py \
	--autobuild no \
	--cif_strategy default \
	--data_dirs /data/staff/biomax/elmir/pandda2/ \
	--mtz_regex=final_rfill.mtz \
	--pdb_regex=final.pdb \
	--out_dir out \
	--local_processing serial

Let me know if I should provide more details to diagnose this issue.

'tuple' object has no attribute 'items'

Syntax:
python analyse.py --data_dirs all_data --out_dir all_data_output --pdb_regex="final.pdb" --mtz_regex="final.mtz" --autobuild=False --rank_method="size" --comparison_strategy="high_res_random"

Getting this error during the "Processing resolution shells" step.

"trace": "Traceback (most recent call last):\n  File \"/media/hauptman/Data/x-ray_data/Pandda/../../../../../../home/hauptman/Desktop/pandda2/pandda_2_gemmi/pandda_gemmi/analyse.py\", line 604, in process_pandda\n    shells: ShellsInterface = get_shells(\n  File \"/home/hauptman/Desktop/pandda2/pandda_2_gemmi/pandda_gemmi/pandda_functions.py\", line 1812, in get_shells\n    for dtag, comparison_dtags in comparators.items():\nAttributeError: 'tuple' object has no attribute 'items'\n",
"exception": "'tuple' object has no attribute 'items'"

Full json log:
{
"Start time": 1650371914.5108826,
"The arguments to the main function and their values": {
"data_dirs": "all_data",
"out_dir": "all_data_output",
"pdb_regex": "final.pdb",
"mtz_regex": "final.mtz",
"ligand_dir_regex": "compound",
"ligand_cif_regex": ".cif",
"ligand_pdb_regex": "
.pdb",
"ligand_smiles_regex": "*.smiles",
"statmaps": "False",
"low_memory": "False",
"ground_state_datasets": "None",
"exclude_from_z_map_analysis": "None",
"exclude_from_characterisation": "None",
"only_datasets": "None",
"ignore_datasets": "None",
"dynamic_res_limits": "True",
"high_res_upper_limit": "1.0",
"high_res_lower_limit": "4.0",
"high_res_increment": "0.05",
"max_shell_datasets": "60",
"min_characterisation_datasets": "25",
"structure_factors": "None",
"all_data_are_valid_values": "True",
"low_resolution_completeness": "4.0",
"sample_rate": "3.0",
"max_rmsd_to_reference": "1.5",
"max_rfree": "0.4",
"max_wilson_plot_z_score": "0.4",
"same_space_group_only": "False",
"similar_models_only": "False",
"resolution_factor": "0.25",
"grid_spacing": "0.5",
"padding": "3.0",
"density_scaling": "True",
"outer_mask": "8.0",
"inner_mask": "2.0",
"inner_mask_symmetry": "2.0",
"contour_level": "2.5",
"negative_values": "False",
"min_blob_volume": "10.0",
"min_blob_z_peak": "3.0",
"clustering_cutoff": "1.5",
"cluster_cutoff_distance_multiplier": "1.5",
"max_site_distance_cutoff": "1.732",
"min_bdc": "0.0",
"max_bdc": "0.95",
"increment": "0.05",
"output_multiplier": "2.0",
"comparison_strategy": "high_res_random",
"comparison_res_cutoff": "0.5",
"comparison_min_comparators": "30",
"comparison_max_comparators": "30",
"known_apos": "None",
"exclude_local": "5",
"cluster_selection": "close",
"local_processing": "ray",
"local_cpus": "6",
"global_processing": "serial",
"memory_availability": "low",
"job_params_file": "None",
"distributed_scheduler": "SGE",
"distributed_queue": "medium.q",
"distributed_project": "labxchem",
"distributed_num_workers": "12",
"distributed_cores_per_worker": "12",
"distributed_mem_per_core": "10",
"distributed_resource_spec": "m_mem_free=10G",
"distributed_tmp": "/tmp",
"distributed_job_extra": "["--exclusive", ]",
"distributed_walltime": "30:00:00",
"distributed_watcher": "False",
"distributed_slurm_partition": "False",
"autobuild": "False",
"autobuild_strategy": "rhofit",
"rhofit_coord": "False",
"cif_strategy": "elbow",
"rank_method": "size",
"debug": "False"
},
"FS model building time": 2.484349489212036,
"Reference Dtag": "XXX-AU10-CPS-6888-pos7",
"Time to perform b factor smoothing": 44.83301091194153,
"trace": "Traceback (most recent call last):\n File "/media/hauptman/Data/x-ray_data/Pandda/../../../../../../home/hauptman/Desktop/pandda2/pandda_2_gemmi/pandda_gemmi/analyse.py", line 604, in process_pandda\n shells: ShellsInterface = get_shells(\n File "/home/hauptman/Desktop/pandda2/pandda_2_gemmi/pandda_gemmi/pandda_functions.py", line 1812, in get_shells\n for dtag, comparison_dtags in comparators.items():\nAttributeError: 'tuple' object has no attribute 'items'\n",
"exception": "'tuple' object has no attribute 'items'"
}

Argparse fails to parse literals correctly

  • why do we need a --cif_strategy=grade if we have no
    --autobuild=True flag? Ah: this seems to be the default and we
    have to set --autobuild=False explicitly (update documentation).

    Hmmm ... this doesn't seem to have any effect: I always get a
    report that autobuild=True (even after modifying constants.py)?
    What am I doing wrong?

    If we change the default to False in constants.py and then set
    --autobuild=False it reports it as autobuild=True. If I set
    nothing, it reports it correctly as autobuild=False.

    Ok: it seems to be a logic flaw in parsing booleans in argparse. These
    should follow
    https://blog.actorsfit.com/a?ID=01750-2899f071-6f57-4ef7-aa9c-82624ceb73a9
    and use

    type=ast.literal_eval

    instead of

    type=bool

    (in pandda_gemmi/args/args.py)

<- Simple fix, might even be possible to pull from Clemens'
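
The underlying problem is that argparse with type=bool calls bool() on the raw string, and bool("False") is True because the string is non-empty. A minimal sketch of the suggested fix (the flag and default are illustrative):

import argparse
import ast

parser = argparse.ArgumentParser()

# With type=bool, --autobuild=False would evaluate to True, because
# bool("False") is True for any non-empty string.
parser.add_argument("--autobuild", type=ast.literal_eval, default=True)

args = parser.parse_args(["--autobuild=False"])
print(args.autobuild)  # False, as intended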

Move to symlinking data for PDBs, MTZs, cifs, etc.

  • Why is PanDDA copying all those files over - even if they are
    fully accessible throughout the job? Not only does that take time,
    it also occupies disk space, creates confusion (files are being
    renamed and timestamps are lost) and seems totally unnecessary. At
    least this should be done via symbolic links ... but ideally, the
    program should access files in their original location (so that any
    log/json file has correct provenance tracking info). Only if
    absolutely necessary because of local setup (cluster nodes,
    networked filesystems etc) should copying happen, i.e. upon
    command-line flag.
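
A sketch of what the symlink-by-default behaviour could look like, with copying kept behind a flag for setups that genuinely need it (the function name and flag are illustrative assumptions):

import shutil
from pathlib import Path

def stage_file(source: Path, destination: Path, copy: bool = False):
    """Make source available at destination, symlinking by default."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    if destination.is_symlink() or destination.exists():
        destination.unlink()
    if copy:
        # Only copy when the local setup (cluster nodes, networked
        # filesystems, ...) really requires it.
        shutil.copy2(source, destination)
    else:
        destination.symlink_to(source.resolve())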

Look into reordering filters

  • The filtering seems to have a fixed order (the various dataset
    results have distinct names, not something like going from
    datasets to datasets_new and then a copying over for the next
    filter). Is there a reason for that? It seems that the order

    Removing datasets with dissimilar models ...

    Removing datasets whose models have large gaps ...

    Removing datasets with dissimilar spacegroups to the reference ...

    is a bit odd: if one wants to reject incorrect SG datasets, why
    waste time on those to check for gaps etc?
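
A sketch of how an explicitly ordered filter pipeline could look, so that cheap rejections (e.g. wrong space group) run before more expensive checks; the structure is an illustration, not the current PanDDA 2 code:

def apply_filters(datasets, filters):
    """Apply dataset filters in the order given.

    Each filter takes and returns a {dtag: dataset} mapping, so
    reordering (e.g. space group check before gap check) is just a
    matter of reordering the list passed in.
    """
    for dataset_filter in filters:
        datasets = dataset_filter(datasets)
    return datasets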

Crashing when processing shells

Hey Conor, I am experiencing that pandda2 crashes when processing the shells. I tried downloading the FALZA dataset just to see if it was specific to my data, but it doesn't appear to be so.

The error message is:
"Time to process all shells": 51183.45810365677,
"trace": "Traceback (most recent call last):\n File "/data1/x-ray_data/pandda2/pandda_2_gemmi/pandda_gemmi/analyse.py", line 732, in process_pandda\n console.summarise_shells(shell_results, all_events, event_scores)\n File "/data1/x-ray_data/pandda2/pandda_2_gemmi/pandda_gemmi/pandda_logging/pandda_console.py", line 262, in summarise_shells\n event_score = dataset_event_scores[event_id]\nKeyError: EventID(dtag=Dtag(dtag='FALZA-x0052'), event_idx=EventIDX(event_idx=329))\n",
"exception": "EventID(dtag=Dtag(dtag='FALZA-x0052'), event_idx=EventIDX(event_idx=329))"
}

Attached is the full log.json file (added .log so I could upload it). Let me know if you need anything else.
pandda_log.json.log

Make output readable

The program's main console output is unreadable in any mode - it needs massive restructuring.
