
pandda_2_gemmi's Introduction

PanDDA 2

New in version 0.1.0

  • Improved README.md
  • Phenix dependency removed
  • PanDDA now defaults to using cifs that are already present, with no rebuilding
  • Cleaner logs
  • Improved ranking for unbuildable events
  • Memory performance optimizations
  • The filters applied to datasets can be configured with --data_quality_filters and --reference_comparability_filters

Reporting Errors

PanDDA 2 is still in development and feedback is always appreciated!

If you have a problem with installation, it is most likely specific to your system, so it is best to email me directly.

If the program errors while it runs, it is most helpful if you include the command line output and the json log in a GitHub issue.

If you are uncertain about the correctness of the results, then a GitHub issue is appropriate if you can share information publicly, in particular screenshots of maps or ligand fits. If you cannot, then an email to me is the best way to raise your concerns. Either way, please include the program output, the json log and screenshots of the offending z maps/event maps/autobuilds.

Installation

It is recommended that you install PanDDA 2 in its own Python 3.9 Anaconda environment. If you do not have Anaconda installed, you can set it up by following the instructions at https://www.anaconda.com/products/distribution#linux.

Then:

conda create -n pandda2 python=3.9
conda activate pandda2
conda install -c conda-forge -y fire numpy scipy joblib scikit-learn umap-learn bokeh dask dask-jobqueue hdbscan matplotlib rich seaborn rdkit openbabel
pip install ray
git clone https://github.com/ConorFWild/pandda_2_gemmi.git
cd pandda_2_gemmi
pip install -e .
cd _gemmi
pip install .
pip install numpy==1.21.0

Installing PanDDA 2 this way will add various scripts to your path, but only while you are in this anaconda environment.

Running PanDDA 2

PanDDA 2 supports autobuilding of events and ranking them by autobuildability. All you need to do is ensure that BUSTER (and hence ana_pdbmaps and rhofit) is set up in your path.

Once you have installed PanDDA 2 in a conda environment, it can be run from that environment with autobuilding and automated ground state identification as follows:

python /path/to/analyse.py --data_dirs=<data directories> --out_dir=<output directory> --pdb_regex=<pdb regex> --mtz_regex=<mtz regex> <options>

Minimal Run

If you want to run the lightest possible PanDDA (no clustering of datasets, no autobuilding, etc.: basically PanDDA 1), then a command like the following is appropriate:

python /path/to/analyse.py --data_dirs=<data directories> --out_dir=<output directory> --pdb_regex=<pdb regex> --mtz_regex=<mtz regex> --autobuild=False --rank_method="size" --comparison_strategy="high_res_first" <options>

Running With Distributed Computing At Diamond

It is strongly recommended that, if you are qsub'ing a script that will run PanDDA 2, you set up your environment on the head node (by activating the Anaconda environment in which PanDDA 2 is installed) and use the "-V" option on qsub to copy your current environment to the job.

An example of how to run with distributed computing at Diamond Light Source is as follows:

# Ensure CCP4 and the Global Phasing code (BUSTER) needed for autobuilding are available
module load ccp4
module load buster

# Put the following in the file submit.sh
python /path/to/analyse.py --data_dirs=<data dirs> --out_dir=<output dirs> --pdb_regex=<pdb regex> --mtz_regex=<mtz regex> --global_processing="distributed" <options>

# Submitting
chmod 777 submit.sh
qsub -V -o submit.o -e submit.e -q medium.q -pe smp 12 -l m_mem_free=15G submit.sh

How PanDDA 2 works

PanDDA 2 differs from PanDDA 1 in two major methodological ways.

Firstly, it attempts to identify which sets of datasets should be used to produce a statistical model which has the optimal contrast (to each test dataset individually). This allows it to handle subtle heterogeneity between datasets.

Secondly, it attempts to autobuild the events it returns and then ranks them based on the quality of the fragment model that could be constructed. This improves the ranking of events.

Statistical Model Dataset Selection

PanDDA 2 selects the datasets used to construct each statistical model it tries by identifying the sets of datasets which are closest to each other. This is achieved by the following steps (sketched in code below):

  • Finding all pairwise distances between electron density maps
  • Finding the nearest neighbours of each dataset's electron density map
  • Finding the non-overlapping neighbourhoods with the minimum standard deviation between the maps they contain.
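
As a rough illustration of this neighbourhood search (not the actual PanDDA 2 implementation), the sketch below assumes each dataset's map has already been sampled onto a common grid and flattened into a numpy vector; all names are illustrative.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def select_characterisation_sets(maps, neighbourhood_size=30):
    """Pick non-overlapping groups of similar electron density maps.

    maps: array of shape (n_datasets, n_grid_points), one flattened
    map per dataset.
    """
    # All pairwise distances between electron density maps
    distances = squareform(pdist(maps))

    # The nearest neighbours of each dataset's map (including itself)
    neighbourhoods = np.argsort(distances, axis=1)[:, :neighbourhood_size]

    # Spread of each neighbourhood: mean per-voxel standard deviation
    # between the maps it contains
    spreads = np.array([maps[idxs].std(axis=0).mean() for idxs in neighbourhoods])

    # Greedily keep the tightest neighbourhoods that do not overlap
    selected, used = [], set()
    for i in np.argsort(spreads):
        members = set(neighbourhoods[i].tolist())
        if not members & used:
            selected.append(neighbourhoods[i])
            used |= members
    return selected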

Statistical Model Selection

Currently PanDDA 2 only chooses one statistical model to progress to event map generation and autobuilding. This is done by fitting several conformations of the expected fragment into each event from each statistical model by differential evolution.

The statistical model which has the best fragment fit is then selected for progression to event map generation and autobuilding.
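
The sketch below shows the general shape of such a scoring step, using scipy's differential evolution to optimise a rigid-body placement of the fragment against an event map; the interpolator callable, bounds and scoring function are assumptions for illustration, not the code PanDDA 2 actually uses.

import numpy as np
from scipy.optimize import differential_evolution
from scipy.spatial.transform import Rotation

def score_fragment_fit(event_map_interpolator, fragment_coords):
    """Return the best sum of event map values over rigid-body placements.

    event_map_interpolator: assumed callable mapping an (x, y, z) position
    to a map value; fragment_coords: (n_atoms, 3) array of atom positions.
    """
    centre = fragment_coords.mean(axis=0)

    def negative_fit(params):
        translation, angles = params[:3], params[3:]
        rotation = Rotation.from_euler("xyz", angles)
        placed = rotation.apply(fragment_coords - centre) + centre + translation
        # Negated so that differential_evolution (a minimiser) maximises the fit
        return -sum(event_map_interpolator(pos) for pos in placed)

    bounds = [(-3.0, 3.0)] * 3 + [(-np.pi, np.pi)] * 3
    result = differential_evolution(negative_fit, bounds, maxiter=50, seed=0)
    return -result.fun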

Autobuilding

PanDDA 2 autobuilds the events of each dataset. This is done with a custom Rhofit invocation to account for the differing distribution of values between event maps and normal crystallographic 2Fo-Fc maps.
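
For orientation only, an invocation of this kind boils down to writing the event map to disk and calling Rhofit in a subprocess; the flags below are placeholders to show the shape of the call, not the exact arguments PanDDA 2 passes (see the PanDDA 2 source or the Rhofit documentation for those).

import subprocess
from pathlib import Path

def autobuild_event(ligand_cif: Path, receptor_pdb: Path, event_mtz: Path, out_dir: Path):
    """Run Rhofit against an event map exported as an MTZ (illustrative)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    command = [
        "rhofit",
        "-l", str(ligand_cif),    # ligand restraints (placeholder flag)
        "-p", str(receptor_pdb),  # receptor model (placeholder flag)
        "-m", str(event_mtz),     # event map data (placeholder flag)
        "-d", str(out_dir),       # output directory (placeholder flag)
    ]
    return subprocess.run(command, check=True, capture_output=True, text=True)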

Autobuild Selection

Current limitations with the interaction between pandda.inspect and PanDDA 2 mean that it is only possible to show one autobuild in the GUI.

Therefore, the highest scoring autobuild by RSCC from any event in each dataset is selected and included in the initial model pandda.inspect shows the user.

This has the effect that users may open apparently good hit density with no autobuild present if another hit in the same dataset was better fitted by the autobuilding.
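
In effect the selection is just a maximum over the builds produced for a dataset; a minimal sketch, with the data structure assumed rather than taken from the code:

def select_autobuild_for_dataset(build_rsccs):
    """build_rsccs: {path_to_autobuilt_ligand: RSCC score} for one dataset.

    The highest-RSCC build from any event in the dataset is the one
    merged into the initial model that pandda.inspect shows.
    """
    return max(build_rsccs, key=build_rsccs.get)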


pandda_2_gemmi's Issues

Building wheel for gemmi (setup.py) fails

Error happens when building the env for pandda_2 (gemmi installation)

Initial and last lines of error:

(pandda2) [vrangel@uk-abi-svnc04 gemmi]$ pip install .
Processing /data/SB_Results/Results_Data/Z_test-Victor/PanDDA/PanDDA2/untouched/pandda_2_gemmi/_gemmi
Preparing metadata (setup.py) ... done
Building wheels for collected packages: gemmi
Building wheel for gemmi (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [107 lines of output]
/home/UK/vrangel/.conda/envs/pandda2/lib/python3.9/site-packages/setuptools/installer.py:27: SetuptoolsDeprecationWarning: setuptools.installer is deprecated. Requirements should be satisfied by a PEP 517 installer.

/data/SB_Results/Results_Data/Z_test-Victor/PanDDA/PanDDA2/untouched/pandda_2_gemmi/_gemmi/.eggs/pybind11-2.10.0-py3.9.egg/pybind11/include/pybind11/pybind11.h:2379:48: required from ‘pybind11::iterator pybind11::make_iterator(Iterator&&, Sentinel&&, Extra&& ...) [with pybind11::return_value_policy Policy = pybind11::return_value_policy::reference_internal; Iterator = const gemmi::SpaceGroup (&)[555]; Sentinel = const gemmi::SpaceGroup*; ValueType = const gemmi::SpaceGroup&; Extra = {}]’
python/sym.cpp:212:67: required from here
/data/SB_Results/Results_Data/Z_test-Victor/PanDDA/PanDDA2/untouched/pandda_2_gemmi/_gemmi/.eggs/pybind11-2.10.0-py3.9.egg/pybind11/include/pybind11/pybind11.h:2347:29: error: lvalue required as increment operand
2347 | ++s.it;
| ~~^~
error: command '/cm/local/apps/gcc/10.2.0/bin/gcc' failed with exit code 1
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for gemmi
Running setup.py clean for gemmi
Failed to build gemmi

Smoothing factor value not properly updated

  • What do we expect as a result for the "b-factor smoothing"? Should
    that adjust the "smoothing_factor" value of the Dataset
    structure? I can't see anything happening ... am I missing
    something?

<- this is a silly error

ModuleNotFoundError: No module named 'seaborn'

I installed this as instructed but am unable to run it. Could you help me understand what is missing from my setup?

The result of an attempt via analyse:

File "/home/derek/pandda_2_gemmi/pandda_gemmi/analyse.py", line 211, in process_pandda
structure_factors)

UnboundLocalError: local variable 'structure_factors' referenced before assignment

'exception': "local variable 'structure_factors' referenced before assignment",
'trace': 'Traceback (most recent call last):\n'
' File "/home/derek/pandda_2_gemmi/pandda_gemmi/analyse.py", line '
'211, in process_pandda\n'
' structure_factors)\n'
"UnboundLocalError: local variable 'structure_factors' referenced "
'before assignment\n'}

PanDDA default output is unclear on many stages

  • What are those messages

    Opened pool in 5.566833019256592, closed pool in
    0.7841935157775879, mapped in 265.9636478424072

    report? Timings? memory? Something else? Is that a useful
    precision here?

  • What is the meaning of "batches" in messages like

            total_sample_size= 502
            batch_size= 90
            num_batches= 6
            All batches larger than batch size, trying smaller split!
            All batches larger than batch size, trying smaller split!
            All batches larger than batch size, trying smaller split!
            All batches larger than batch size, trying smaller split!
            All batches larger than batch size, trying smaller split!
            Batches are: [array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
    

    Is that related to clusters? Or to general processing (local, SGE
    etc)? It seems to come in the "Deciding on the datasets to
    characterise the groundstate for each dataset to analyse" section
    ... which is a slightly confusing title: is that the part where
    the clustering is happening?

  • When I see

    Exception ignored in: <module 'threading' from '/home/vonrhein/Projects/PanDDA/20211104/Anaconda/Results.babinet/20211104/anaconda3/envs/pandda2-20220228/lib/python3.9/threading.py'>
    Traceback (most recent call last):
    File "/home/vonrhein/Projects/PanDDA/20211104/Anaconda/Results.babinet/20211104/anaconda3/envs/pandda2-20220228/lib/python3.9/threading.py", line 1408, in _shutdown
    Exception ignored in: <module 'threading' from '/home/vonrhein/Projects/PanDDA/20211104/Anaconda/Results.babinet/20211104/anaconda3/envs/pandda2-20220228/lib/python3.9/threading.py'>
    Traceback (most recent call last):
    File "/home/vonrhein/Projects/PanDDA/20211104/Anaconda/Results.babinet/20211104/anaconda3/envs/pandda2-20220228/lib/python3.9/threading.py", line 1408, in _shutdown
    def _shutdown():def _shutdown():

    File "/home/vonrhein/Projects/PanDDA/20211104/Anaconda/Results.babinet/20211104/anaconda3/envs/pandda2-20220228/lib/python3.9/site-packages/ray/worker.py", line 1056, in sigterm_handler
    File "/home/vonrhein/Projects/PanDDA/20211104/Anaconda/Results.babinet/20211104/anaconda3/envs/pandda2-20220228/lib/python3.9/site-packages/ray/worker.py", line 1056, in sigterm_handler
          sys.exit(signum)sys.exit(signum)
    

    SystemExit: 15
    SystemExit: 15

    why is pandda still running? Clearly, there is an error ... but it
    seems to ignore it (not very comforting).

Installation unsuccessful - Error: class GetAutobuildResultRhofit(GetAutobuildResultInterface): TypeError: __init__() takes 2 positional arguments but 4 were given

Command used with data from the Pandda tutorial:
python ~/pandda_2_gemmi/pandda_gemmi/analyse.py --data_dirs='./data/' --out_dir='./output/' --pdb_regex='.dimple.pdb' --mtz_regex='.data.mtz'
Same error for:
python ~/pandda_2_gemmi/pandda_gemmi/analyse.py --data_dirs='./data/' --out_dir='./output/' --pdb_regex='.dimple.pdb' --mtz_regex='.data.mtz' --autobuild=False --rank_method=size --comparison_strategy=high_res_first

Error message:
Traceback (most recent call last):
File "/nfs/home/f_beck26/pandda_2_gemmi/pandda_gemmi/autobuild_pandda_results.py", line 49, in
from pandda_gemmi.logs import (
File "/nfs/home/f_beck26/pandda_2_gemmi/pandda_gemmi/logs.py", line 13, in
from pandda_gemmi.event import Event
File "/nfs/home/f_beck26/pandda_2_gemmi/pandda_gemmi/event/init.py", line 3, in
from pandda_gemmi.event.event_scoring_inbuilt import GetEventScoreInbuilt
File "/nfs/home/f_beck26/pandda_2_gemmi/pandda_gemmi/event/event_scoring_inbuilt.py", line 21, in
from pandda_gemmi.autobuild.cif import generate_cif
File "/nfs/home/f_beck26/pandda_2_gemmi/pandda_gemmi/autobuild/init.py", line 1, in
from pandda_gemmi.autobuild.autobuild import (merge_ligand_into_structure_from_paths,
File "/nfs/home/f_beck26/pandda_2_gemmi/pandda_gemmi/autobuild/autobuild.py", line 1401, in
class GetAutobuildResultRhofit(GetAutobuildResultInterface):
TypeError: __init__() takes 2 positional arguments but 4 were given

Help message is out of date

  • -h message incorrect: doesn't contain "=" in arguments ... or
    maybe that is not needed anyway?

  • What does

    --structure_factors STRUCTURE_FACTORS
    A string which gives the structure factors to use. It
    should be of the form 'FWT,PHWT'.

    mean? --structure_factors='2FOFCWT,PH2FOFCWT' doesn't seem to
    work: I get at some point

    StructureFactors(f='2', phi='F')

<- needs fixing in message and also internally


  • The explanations to --local_processing are very opaque and
    technical. What should one use if sitting on a multi-core box?
    Maybe --local_processing=multiprocessing_spawn and
    --global_processing=serial? Furthermore, the default reported
    (local_processing=ray) is not even explained.

    Why is --local_cpus defaulting to 6 on a 32 thread machine? And
    why does it say memory_availability=low as a default? I would
    expect the default to be "run only on local machine, using all
    available threads" - anything finetuned towards cluster systems,
    DLS or other environment should require special arguments (and not
    defaults). Or maybe provide settings files (similar to our macros)
    that could hold specific collections of settings that could be
    loaded via a command-line flag.

<- needs clearer readme and an option internally

Error is deeply unclear when wrong structure factors / wrongly formatted ones are given to program

  • when running with incorrect
    --structure_factors='("2FOFCWT","PH2FOFCWT")' we get a

    Looking for common structure factors in datasets...
    Found structure factors!

    which is totally unhelpful. Afterwards we get

    Filtering datasets with invalid structure factors...
    Done!

    and then all datasets are being filtered. The argument parsing, error
    checking and messages to stdout are not very helpful: they don't
    tell me what is being done, what the result of a step is, what
    potential problems were encountered ... Clearly, a simple mistake
    like giving the incorrect column names is something that should be
    handled gracefully here ... ?

    It doesn't help that the various classes (like PanDDAFSModel etc)
    don't seem to provide a function for adequate pretty-printing via
    pprint ... or am I missing something?

<- should be straightforward to fix?
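
A sketch of the kind of graceful handling being asked for here, assuming the option keeps its documented 'FWT,PHWT' form (names are illustrative, not the existing PanDDA 2 code): parse the pair up front and fail with a clear message if either column is absent, rather than silently filtering every dataset.

def parse_structure_factors(arg):
    """Parse --structure_factors of the documented form 'FWT,PHWT'."""
    parts = [p.strip() for p in arg.split(",")]
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            f"--structure_factors must be two comma-separated column names "
            f"like 'FWT,PHWT', got: {arg!r}"
        )
    return parts[0], parts[1]

def check_dataset_columns(mtz_columns, f, phi, dtag):
    """Report exactly which columns are missing from which dataset."""
    missing = {f, phi} - set(mtz_columns)
    if missing:
        raise ValueError(
            f"Dataset {dtag} is missing structure factor columns {sorted(missing)}; "
            f"available columns are {sorted(mtz_columns)}"
        )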

Make readme clearer on various options to run the program

Fix the documentation in README.md (--data_dirs) and explain that the
argument needs to end in a "/". Or better: make the program aware
that the --data_dirs argument could have no trailing "/" (and add
it automatically), e.g. via:

<- can pull from Clemens
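
A minimal sketch of one way to do that normalisation (not the snippet the report originally pointed to), assuming pathlib is acceptable:

from pathlib import Path

def normalise_data_dirs(data_dirs):
    """Accept --data_dirs with or without a trailing '/'."""
    # pathlib drops any trailing separator; add exactly one back.
    return str(Path(data_dirs)) + "/"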


  • what is the set of commands to run in a normal way, i.e. on the
current machine (and not get messages like)

  E0228 14:03:18.298266714   54072 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
  E0228 14:03:18.323015215   54072 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
  E0228 14:03:18.343549771   54072 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies

  • when running with incorrect
    --structure_factors='("2FOFCWT","PH2FOFCWT")' we get a

    Looking for common structure factors in datasets...
    Found structure factors!

    which is totally unhelpful. Afterwards we get
    Filtering datasets with invalid structure factors...
    Done!

    and then all datasets are being filtered. The argument parsing, error
    checking and messages to stdout are not very helpful: they don't
    tell me what is being done, what the result of a step is, what
    potential problems were encountered ... Clearly, a simple mistake
    like giving the incorrect column names is something that should be
    handled gracefully here ... ?

<- needs fixing internally as well as fixing in readme


  • What does

    --structure_factors STRUCTURE_FACTORS
    A string which gives the structure factors to use. It
    should be of the form 'FWT,PHWT'.

    mean? --structure_factors='2FOFCWT,PH2FOFCWT' doesn't seem to
    work: I get at some point

    StructureFactors(f='2', phi='F')

    The README.md file also shows a different syntax.


  • How can I switch off that validator option? I don't want it to
    automatically test for Rfree values ...

    In general: I would like a PanDDA version that just does what it
    says on the tin: look for events within a given set of
    changed_state map/models knowing about a given set of ground_state
    map/models.

    On top of that basic/code mode, there can be a ground-state
    defining module, a clustering module and then some preparation
    modules (checking, job distribution setup etc).

    At the moment I feel overwhelmed by restrictions, checks and
    assumptions that prevent me from doing the actual analysis
    ...

<- needs implementation internally and updates to readme


  • The explanations to --local_processing are very opaque and
    technical. What should one use if sitting on a multi-core box?
    Maybe --local_processing=multiprocessing_spawn and
    --global_processing=serial? Furthermore, the default reported
    (local_processing=ray) is not even explained.

    Why is --local_cpus defaulting to 6 on a 32 thread machine? And
    why does it say memory_availability=low as a default? I would
    expect the default to be "run only on local machine, using all
    available threads" - anything finetuned towards cluster systems,
    DLS or other environment should require special arguments (and not
    defaults). Or maybe provide settings files (similar to our macros)
    that could hold specific collections of settings that could be
    loaded via a command-line flag.

<- once implemented needs detailing in readme


Dataset filtering due to structure factors results in unclear error

Clearer error when all datasets filtered due to structure factors and cannot get a reference structure:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/asharff/miniconda3a/envs/pandda2-centos7-develop/lib/python3.9/site-packages/pandda_gemmi/ │
│ analyse.py:475 in process_pandda │
│ │
│ 472 │ │ │
│ 473 │ │ # Select refernce │
│ 474 │ │ with STDOUTManager('Deciding on reference dataset...', f'\tDone!'): │
│ ❱ 475 │ │ │ reference: ReferenceInterface = GetReferenceDataset()( │
│ 476 │ │ │ │ datasets_wilson, │
│ 477 │ │ │ │ datasets_statistics, │
│ 478 │ │ │ ) │
│ │
│ /home/asharff/miniconda3a/envs/pandda2-centos7-develop/lib/python3.9/site-packages/pandda_gemmi/ │
│ dataset/dataset.py:747 in __call__
│ │
│ 744 │ datasets: DatasetsInterface, │
│ 745 │ dataset_statistics: DatasetsStatisticsInterface, │
│ 746 │ ) -> ReferenceInterface: │
│ ❱ 747 │ │ return get_reference_from_datasets(datasets, dataset_statistics) │
│ 748 │
│ 749 │
│ 750 @dataclasses.dataclass() │
│ │
│ /home/asharff/miniconda3a/envs/pandda2-centos7-develop/lib/python3.9/site-packages/pandda_gemmi/ │
│ dataset/dataset.py:737 in get_reference_from_datasets │
│ │
│ 734 │ │ # min_resolution_structure = datasets[min_resolution_dtag].structure │
│ 735 │ │ # min_resolution_reflections = datasets[min_resolution_dtag].reflections │
│ 736 │ │ │
│ ❱ 737 │ │ return Reference(min_resolution_dtag, │
│ 738 │ │ │ │ │ │ datasets[min_resolution_dtag] │
│ 739 │ │ │ │ │ │ ) │
│ 740 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
UnboundLocalError: local variable 'min_resolution_dtag' referenced before assignment
Saving PanDDA log to: results/BAZ2BA-21/pandda_log.json

Ensure there is a way to limit resolutions of data analysed

  • How can one limit the resolution of the data analysed? The
    explanation given

    --high_res_upper_limit HIGH_RES_UPPER_LIMIT
    A float that gives the highest resolution datasets to
    conside for processing.
    --high_res_lower_limit HIGH_RES_LOWER_LIMIT
    A float that gives the lowest resolution datasets to
    consider processing

    is not very clear. How can a float define a "resolution datasets"?
    And: I can't see where that parameter would be used anywhere
    anyway ... confused ...

separate analysis and auto-building step

It would be great if the analysis and auto-building steps could be run separately. Also, it would be good if the csv files for pandda.inspect could be written out after the analysis step finishes, even if auto-building with rhofit runs as part of the process. We had instances where an error during auto-building stopped the process, which essentially left us with no suitable input for pandda.inspect, even though 2500 event maps had been written out.

Improve and test readme

Program usage has developed significantly since the last time this was worked on seriously: it needs updating and testing by an external user.

Clean up pandda console output

A lot of things render poorly including:

  • Chain statistics
  • Ligand files
  • Print output of times (most of which should probably go behind debug)
  • Lack of printing progress through datasets (i.e. : x/y)
  • Autobuild scores

Exception raised during analysis

When I try to run the pandda analysis, after a while processing stops with:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│                                                                              │
│ /home/elmjag/pandda2/pandda_2_gemmi/analyse.py:609 in process_pandda         │
│                                                                              │
│   606 │   │   │   │   │   process_autobuilds = process_global                │
│   607 │   │   │   │                                                          │
│   608 │   │   │   │   time_autobuild_start = time.time()                     │
│ ❱ 609 │   │   │   │   autobuild_results_list: Dict[EventID, AutobuildResult] │
│   610 │   │   │   │   │   [                                                  │
│   611 │   │   │   │   │   │   Partial(                                       │
│   612 │   │   │   │   │   │   │   autobuild_func,                            │
│ /home/elmjag/pandda2/pandda_2_gemmi/pandda_gemmi/pandda_functions.py:40 in   │
│ process_local_serial                                                         │
│                                                                              │
│     37 def process_local_serial(funcs):                                      │
│     38 │   results = []                                                      │
│     39 │   for func in funcs:                                                │
│ ❱   40 │   │   results.append(func())                                        │
│     41 │                                                                     │
│     42 │   return results                                                    │
│     43                                                                       │
│                                                                              │
│ /home/elmjag/pandda2/pandda_2_gemmi/pandda_gemmi/common/partial_func.py:8 in │
│ __call__                                                                     │
│                                                                              │
│   5 │   │   self.kwargs = kwargs                                             │
│   6 │                                                                        │
│   7 │   def __call__(self, *args, **kwargs):                                 │
│ ❱ 8 │   │   return self.func(*args, *self.args, **kwargs, **self.kwargs)     │
╰──────────────────────────────────────────────────────────────────────────────╯

See attached files below for full log.
stdout.log
stderr.log

The analysis is launched with following arguments:

python analyse.py \
	--autobuild no \
	--cif_strategy default \
	--data_dirs /data/staff/biomax/elmir/pandda2/ \
	--mtz_regex=final_rfill.mtz \
	--pdb_regex=final.pdb \
	--out_dir out \
	--local_processing serial

Let me know if I should provide more details to diagnose this issue.

'tuple' object has no attribute 'items'

Syntax:
python analyse.py --data_dirs all_data --out_dir all_data_output --pdb_regex="final.pdb" --mtz_regex="final.mtz" --autobuild=False --rank_method="size" --comparison_strategy="high_res_random"

Getting this error during the "Processing resolution shells" step.

"trace": "Traceback (most recent call last):\n  File \"/media/hauptman/Data/x-ray_data/Pandda/../../../../../../home/hauptman/Desktop/pandda2/pandda_2_gemmi/pandda_gemmi/analyse.py\", line 604, in process_pandda\n    shells: ShellsInterface = get_shells(\n  File \"/home/hauptman/Desktop/pandda2/pandda_2_gemmi/pandda_gemmi/pandda_functions.py\", line 1812, in get_shells\n    for dtag, comparison_dtags in comparators.items():\nAttributeError: 'tuple' object has no attribute 'items'\n",
"exception": "'tuple' object has no attribute 'items'"

Full json log:
{
"Start time": 1650371914.5108826,
"The arguments to the main function and their values": {
"data_dirs": "all_data",
"out_dir": "all_data_output",
"pdb_regex": "final.pdb",
"mtz_regex": "final.mtz",
"ligand_dir_regex": "compound",
"ligand_cif_regex": ".cif",
"ligand_pdb_regex": "
.pdb",
"ligand_smiles_regex": "*.smiles",
"statmaps": "False",
"low_memory": "False",
"ground_state_datasets": "None",
"exclude_from_z_map_analysis": "None",
"exclude_from_characterisation": "None",
"only_datasets": "None",
"ignore_datasets": "None",
"dynamic_res_limits": "True",
"high_res_upper_limit": "1.0",
"high_res_lower_limit": "4.0",
"high_res_increment": "0.05",
"max_shell_datasets": "60",
"min_characterisation_datasets": "25",
"structure_factors": "None",
"all_data_are_valid_values": "True",
"low_resolution_completeness": "4.0",
"sample_rate": "3.0",
"max_rmsd_to_reference": "1.5",
"max_rfree": "0.4",
"max_wilson_plot_z_score": "0.4",
"same_space_group_only": "False",
"similar_models_only": "False",
"resolution_factor": "0.25",
"grid_spacing": "0.5",
"padding": "3.0",
"density_scaling": "True",
"outer_mask": "8.0",
"inner_mask": "2.0",
"inner_mask_symmetry": "2.0",
"contour_level": "2.5",
"negative_values": "False",
"min_blob_volume": "10.0",
"min_blob_z_peak": "3.0",
"clustering_cutoff": "1.5",
"cluster_cutoff_distance_multiplier": "1.5",
"max_site_distance_cutoff": "1.732",
"min_bdc": "0.0",
"max_bdc": "0.95",
"increment": "0.05",
"output_multiplier": "2.0",
"comparison_strategy": "high_res_random",
"comparison_res_cutoff": "0.5",
"comparison_min_comparators": "30",
"comparison_max_comparators": "30",
"known_apos": "None",
"exclude_local": "5",
"cluster_selection": "close",
"local_processing": "ray",
"local_cpus": "6",
"global_processing": "serial",
"memory_availability": "low",
"job_params_file": "None",
"distributed_scheduler": "SGE",
"distributed_queue": "medium.q",
"distributed_project": "labxchem",
"distributed_num_workers": "12",
"distributed_cores_per_worker": "12",
"distributed_mem_per_core": "10",
"distributed_resource_spec": "m_mem_free=10G",
"distributed_tmp": "/tmp",
"distributed_job_extra": "["--exclusive", ]",
"distributed_walltime": "30:00:00",
"distributed_watcher": "False",
"distributed_slurm_partition": "False",
"autobuild": "False",
"autobuild_strategy": "rhofit",
"rhofit_coord": "False",
"cif_strategy": "elbow",
"rank_method": "size",
"debug": "False"
},
"FS model building time": 2.484349489212036,
"Reference Dtag": "XXX-AU10-CPS-6888-pos7",
"Time to perform b factor smoothing": 44.83301091194153,
"trace": "Traceback (most recent call last):\n File "/media/hauptman/Data/x-ray_data/Pandda/../../../../../../home/hauptman/Desktop/pandda2/pandda_2_gemmi/pandda_gemmi/analyse.py", line 604, in process_pandda\n shells: ShellsInterface = get_shells(\n File "/home/hauptman/Desktop/pandda2/pandda_2_gemmi/pandda_gemmi/pandda_functions.py", line 1812, in get_shells\n for dtag, comparison_dtags in comparators.items():\nAttributeError: 'tuple' object has no attribute 'items'\n",
"exception": "'tuple' object has no attribute 'items'"
}

Argparse fails to parse literals correctly

  • why do we need a --cif_strategy=grade if we have no
    --autobuild=True flag? Ah: this seems to be the default and we
    have to set --autobuild=False explicitly (update documentation).

    Hmmm ... this doesn't seem to have any effect: I always get a
    report that autobuild=True (even after modifying constants.py)?
    What am I doing wrong?

    If we change the default to False in constants.py and then set
    --autobuild=False it reports it as autobuild=True. If I set
    nothing, it reports it correctly as autobuild=False.

    Ok: it seems to be a logic flaw in parsing booleans in argparse. These
    should follow
    https://blog.actorsfit.com/a?ID=01750-2899f071-6f57-4ef7-aa9c-82624ceb73a9
    and use

    type=ast.literal_eval

    instead of

    type=bool

    (in pandda_gemmi/args/args.py)

<- Simple fix, might even be possible to pull from Clemens'
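
The underlying problem is that argparse with type=bool calls bool() on the raw string, and bool("False") is True because the string is non-empty. A minimal sketch of the suggested fix (the flag and default are illustrative):

import argparse
import ast

parser = argparse.ArgumentParser()

# With type=bool, --autobuild=False would evaluate to True, because
# bool("False") is True for any non-empty string.
parser.add_argument("--autobuild", type=ast.literal_eval, default=True)

args = parser.parse_args(["--autobuild=False"])
print(args.autobuild)  # False, as intended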

Move to symlinking data for PDBs, MTZs, cifs, etc.

  • Why is PanDDA copying all those files over - even if they are
    fully accessible throughout the job? Not only does that take time,
    it also occupies disk space, creates confusion (files are being
    renamed and timestamps are lost) and seems totally unnecessary. At
    least this should be done via symbolic links ... but ideally, the
    program should access files in their original location (so that any
    log/json file has correct provenance tracking info). Only if
    absolutely necessary because of local setup (cluster nodes,
    networked filesystems etc) should copying happen, i.e. upon
    command-line flag.
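
A sketch of what the symlink-by-default behaviour could look like, with copying kept behind a flag for setups that genuinely need it (the function name and flag are illustrative assumptions):

import shutil
from pathlib import Path

def stage_file(source: Path, destination: Path, copy: bool = False):
    """Make source available at destination, symlinking by default."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    if destination.is_symlink() or destination.exists():
        destination.unlink()
    if copy:
        # Only copy when the local setup (cluster nodes, networked
        # filesystems, ...) really requires it.
        shutil.copy2(source, destination)
    else:
        destination.symlink_to(source.resolve())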

Look into reordering filters

  • The filtering seems to have a fixed order (the various dataset
    results have distinct names, not something like going from
    datasets to datasets_new and then a copying over for the next
    filter). Is there a reason for that? It seems that the order

    Removing datasets with dissimilar models ...

    Removing datasets whose models have large gaps ...

    Removing datasets with dissimilar spacegroups to the reference ...

    is a bit odd: if one wants to reject incorrect SG datasets, why
    waste time on those to check for gaps etc?
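
A sketch of how an explicitly ordered filter pipeline could look, so that cheap rejections (e.g. wrong space group) run before more expensive checks; the structure is an illustration, not the current PanDDA 2 code:

def apply_filters(datasets, filters):
    """Apply dataset filters in the order given.

    Each filter takes and returns a {dtag: dataset} mapping, so
    reordering (e.g. space group check before gap check) is just a
    matter of reordering the list passed in.
    """
    for dataset_filter in filters:
        datasets = dataset_filter(datasets)
    return datasets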

Crashing when processing shells

Hey Conor, I am experiencing that pandda2 crashes when processing the shells. I tried downloading the FALZA dataset just to see if it was specific to my data, but it doesn't appear to be so.

The error message is:
"Time to process all shells": 51183.45810365677,
"trace": "Traceback (most recent call last):\n File "/data1/x-ray_data/pandda2/pandda_2_gemmi/pandda_gemmi/analyse.py", line 732, in process_pandda\n console.summarise_shells(shell_results, all_events, event_scores)\n File "/data1/x-ray_data/pandda2/pandda_2_gemmi/pandda_gemmi/pandda_logging/pandda_console.py", line 262, in summarise_shells\n event_score = dataset_event_scores[event_id]\nKeyError: EventID(dtag=Dtag(dtag='FALZA-x0052'), event_idx=EventIDX(event_idx=329))\n",
"exception": "EventID(dtag=Dtag(dtag='FALZA-x0052'), event_idx=EventIDX(event_idx=329))"
}

Attached is the full log.json file (added .log so I could upload it). Let me know if you need anything else.
pandda_log.json.log

Make output readable

The program's main console output is unreadable in any mode - it needs massive restructuring.
