ispyb / ispyb-database-modeling Goto Github PK

Shell 100.00%

ispyb-database-modeling's Introduction

ISPyB Project

Installing
Database creation and update
Database schema

Installing

Clone the ISPyB repository:

git clone https://github.com/ispyb/ISPyB.git

ISPyB needs the third-party libraries provided in the dependencies directory. These don't exist in a public repository, so install them to the local Maven repository so that Maven can find them:
```
cd dependencies && mvn initialize
```

Build ISPyB using Maven:

mvn clean install

By default, ISPyB builds for the GENERIC site and the development environment. These can be changed with the ispyb.site and ispyb.env system properties. For example, to build for the ESRF site and the test environment:

mvn -Dispyb.site=ESRF -Dispyb.env=test clean install

If the build succeeds, a summary message will be printed like this:

[INFO] Reactor Summary:
[INFO]
[INFO] ispyb-parent ...................................... SUCCESS [0.251s]
[INFO] ispyb-ejb3 ........................................ SUCCESS [10.243s]
[INFO] ispyb-ws .......................................... SUCCESS [1.751s]
[INFO] ispyb-ui .......................................... SUCCESS [7.212s]
[INFO] ispyb-ear ......................................... SUCCESS [5.048s]
[INFO] ispyb-bcr ......................................... SUCCESS [2.217s]
[INFO] ispyb-bcr-ear ..................................... SUCCESS [1.806s]

Database creation and update

Run the following creation scripts from the ispyb-ejb module (note that this requires the pxadmin database user to exist and have full permissions):

ispyb-ejb/db/scripts/pyconfig.sql: This corresponds to the menu options and contains both structure and data.
ispyb-ejb/db/scripts/pydb.sql: This corresponds to the ISPyB metadata and contains only the database structure.
ispyb-ejb/db/scripts/schemastatus.sql: This corresponds to the SchemaStatus table and contains both structure and data. The entries indicate which update scripts have been run.
ispyb-ejb/db/scripts/ispybAutoprocAttachment.sql: This corresponds to the type and names of different autoPROC attachments.

The creation scripts are normally updated for each tag, but if you are using the master branch, you may have to run the update scripts in ispyb-ejb/db/scripts/ahead.

Check the entries in the SchemaStatus table to know which scripts to execute. The scripts already run for the current tag are in ispyb-ejb/db/scripts/passed.

Creating an update script

The first line must be:

insert into SchemaStatus (scriptName, schemaStatus) values ('2017_06_06_blabla.sql','ONGOING');

Then comes the actual updates of the script:

-- SQL statements here

And the last line must be:

update SchemaStatus set schemaStatus = 'DONE' where scriptName = '2017_06_06_blabla.sql';

The last line updates the SchemaStatus table to mark the script as having been run.

Database schema

A patch or commit that changes the database schema must be accompanied by a corresponding change to the schema file to keep it up to date. This file can be edited using MySQL Workbench (a free tool from MySQL).

ispyb-database-modeling's People

Contributors

Stargazers

Watchers

Forkers

olofsvensson rhfogh karllevik

ispyb-database-modeling's Issues

Cleaning up Screening Tables

The goal of this issue is to check if some columns or tables could be simplified by, for instance, removing unused or deprecated columns.

This is the current schema:

Extend fileType enum in table DataCollectionFileAttachment

In DataCollectionFileAttachment table we would like to extend the fileType enum, and I've also included a comment to better document what the different options are for:

ALTER TABLE DataCollectionFileAttachment MODIFY fileType enum('snapshot', 'log', 'xy', 'recip') 
COMMENT 'snapshot: image file, usually of the sample. 
log: a text file with logging info. 
xy: x and y data in text format. 
recip: a compressed csv file with reciprocal space coordinates.';

Can this be merged into the official schema?

Add a new field in BLSession table to handle nb of reimbursed dewars

At ESRF we are asked to track in ISPyB the number of reimbursed dewars authorized in the User Portal software.
Therefore we will need an extra field in BLSession table:
nbReimbursedDewars

This number will be defined in the User Portal according to the type of experiment, to the nb of reimbursed users and to a maximum possible. This number is linked to a session and can change between 2 sessions of the same proposal (currently between 0 and 4).

Automatic scoring of drop images

We have been looking at implementing automatic scoring of drop images, and have come up with the following three tables which I hereby propose as changes to the ISPyB collaboration schema.

As you can see, BLSampleImageAutoScoreClass - the "thing" being scored (crystal, precipitant etc), is a child of BLSampleImageAutoScoreSchema which is basically just the name of the schema and whether it's currently enabled.

Then the last table, BLSampleImage_has_AutoScoreClass, enables a many-to-many relationship between BLSampleImage (drop images) and BLSampleImageAutoScoreClass (thing being scored), and at the same time contains the actual score value - the probability column - which gives us the probability that the drop image contains that thing.

CREATE TABLE BLSampleImageAutoScoreSchema (
  blSampleImageAutoScoreSchemaId tinyint(3) unsigned auto_increment PRIMARY KEY,
  schemaName varchar(25) NOT NULL COMMENT 'Name of the schema e.g. Hampton, MARCO',
  enabled tinyint(1) DEFAULT 1 COMMENT 'Whether this schema is enabled (could be configurable in the UI)'
) COMMENT 'Scoring schema name and whether it is enabled';

CREATE TABLE BLSampleImageAutoScoreClass (
  blSampleImageAutoScoreClassId tinyint(3) unsigned auto_increment PRIMARY KEY,
  blSampleImageAutoScoreSchemaId tinyint(3) unsigned,
  scoreClass varchar(15) NOT NULL COMMENT 'Thing being scored e.g. crystal, precipitant',
  CONSTRAINT BLSampleImageAutoScoreClass_fk1 FOREIGN KEY (blSampleImageAutoScoreSchemaId) REFERENCES BLSampleImageAutoScoreSchema(blSampleImageAutoScoreSchemaId) ON DELETE NO ACTION ON UPDATE CASCADE
) COMMENT 'The automated scoring classes - the thing being scored';

CREATE TABLE BLSampleImage_has_AutoScoreClass (
  blSampleImageId int(11) unsigned NOT NULL,
  blSampleImageAutoScoreClassId tinyint(3) unsigned,
  probability float,
  PRIMARY KEY (blSampleImageId, blSampleImageAutoScoreClassId),
  CONSTRAINT BLSampleImage_has_AutoScoreClass_fk1 FOREIGN KEY (blSampleImageId) REFERENCES BLSampleImage(blSampleImageId) ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT BLSampleImage_has_AutoScoreClass_fk2 FOREIGN KEY (blSampleImageAutoScoreClassId) REFERENCES BLSampleImageAutoScoreClass(blSampleImageAutoScoreClassId) ON DELETE CASCADE ON UPDATE CASCADE
) COMMENT 'Many-to-many relationship between drop images and thing being scored, as well as the actual probability (score) that the drop image contains that thing';

Puck database schema changes

As discussed at the ISPyB meeting at Soleil earlier this week, Diamond has made some database schema developments for supporting a "puck database" within ISPyB. We now also have production code for this in Synchweb.

However, we are happy to make changes to the schema if anyone finds a problem with it. I've pasted in the create script below, but sadly there doesn't seem to be an easy way to attach the diagram.

The main table is ContainerRegistry, and the other tables are referencing that. An entry in ContainerRegistry can be associated with multiple Proposals through ContainerRegistry_has_Proposal. ContainerHistory will show the history of a given Container (i.e. the beamlines it's been to). ContainerReport allows a Person to write one or more reports about a Container in the ContainerRegistry.

Statements to remove the tables and columns created

ALTER TABLE Container 
  DROP containerRegistryId,
  DROP FOREIGN KEY  Container_ibfk8;

ALTER TABLE ContainerHistory 
  DROP beamlineName;

DROP TABLE IF EXISTS ContainerReport;
DROP TABLE IF EXISTS ContainerRegistry_has_Proposal;
DROP TABLE IF EXISTS ContainerRegistry;

Create statements

CREATE TABLE ContainerRegistry (
  containerRegistryId int(11) unsigned AUTO_INCREMENT PRIMARY KEY,
  barcode varchar(20),
  comments varchar(255),
  recordTimestamp datetime DEFAULT current_timestamp
);

ALTER TABLE Container 
  ADD containerRegistryId int(11) unsigned NULL DEFAULT NULL,
  ADD CONSTRAINT Container_ibfk8 FOREIGN KEY (containerRegistryId) REFERENCES ContainerRegistry(containerRegistryId);

CREATE TABLE ContainerRegistry_has_Proposal (
  containerRegistryHasProposalId int(11) unsigned AUTO_INCREMENT PRIMARY KEY,
  containerRegistryId int(11) unsigned,
  proposalId int(10) unsigned,
  personId int(10) unsigned COMMENT 'Person registering the container',
  recordTimestamp datetime DEFAULT current_timestamp,
  UNIQUE KEY (containerRegistryId, proposalId),
  CONSTRAINT ContainerRegistry_has_Proposal_ibfk1 FOREIGN KEY (containerRegistryId) REFERENCES ContainerRegistry(containerRegistryId),
  CONSTRAINT ContainerRegistry_has_Proposal_ibfk2 FOREIGN KEY (proposalId) REFERENCES Proposal(proposalId),
  CONSTRAINT ContainerRegistry_has_Proposal_ibfk3 FOREIGN KEY (personId) REFERENCES Person(personId)
);

CREATE TABLE ContainerReport (
  containerReportId int(11) unsigned AUTO_INCREMENT PRIMARY KEY,
  containerRegistryId int(11) unsigned,
  personId int(10) unsigned COMMENT 'Person making report',
  report text,
  attachmentFilePath varchar(255),
  recordTimestamp datetime,
  CONSTRAINT ContainerReport_ibfk1 FOREIGN KEY (containerRegistryId) REFERENCES ContainerRegistry(containerRegistryId),
  CONSTRAINT ContainerReport_ibfk2 FOREIGN KEY (personId) REFERENCES Person(personId)
);

ALTER TABLE ContainerHistory 
  ADD beamlineName varchar(20);

Can this be merged into the official schema?

Adding fields for anisotropic diffraction data

Justification

The STARANISO program is providing a new approach to describing diffraction limits of reflection data, taking anisotropy into account. Apart from the general approach to anisotropy it also gives a simplified description of this anisotropy via an ellipsoid fitted to the anisotropic cut-off surface which in turn can be used to calculate well-known statistical data merging descriptors.

These data are not autoPROC-specific. Diffraction anisotropy is a general phenomenon, which by its nature makes traditional statistics like resolution and completeness difficult to apply consistently without modification, once diffraction anisotropy is present and accounted for. In order to consider these effects appropriately, and to make the data accessible to all programs that (will) wish to take them into account, the anisotropy-derived values for resolution and completeness should be stored in ISPyB as general data, and not quarantined to summary files or program-specific tables.

The proposed changes affect two tables:

AutoProcScalingStatistics

Field	Type	Null	Default
completenessSpherical	float	YES	NULL
completenessEllipsoidal	float	YES	NULL
anomalousCompletenessSpherical	float	YES	NULL
anomalousCompletenessEllipsoidal	float	YES	NULL

Comment: Completeness and anomalous completeness can be calculated in two different ways, either assuming isotropic data or taking into account anisotropy. Both approaches calculate the fraction of observed reflection within the 'measurable' volume. For spherical completeness this volume is assumed to be a sphere with a radius corresponding to the resolution of the data, whereas ellipsoidal completeness considers the ellipse defined by the diffraction limits. The new fields give both values, leaving the pre-existing fields ‘completeness’ and 'anomalousCompleteness’ to be filled with either value as considered appropriate, and to be used in existing applications. Ideally the overall ‘completeness’ fields would be removed and the various applications refactored to account for the new data available, but this does not seem realistic.

AutoProcScaling

Field	Type	Null	Default
resolutionEllipsoidAxis11	float	YES	NULL
resolutionEllipsoidAxis12	float	YES	NULL
resolutionEllipsoidAxis13	float	YES	NULL
resolutionEllipsoidAxis21	float	YES	NULL
resolutionEllipsoidAxis22	float	YES	NULL
resolutionEllipsoidAxis23	float	YES	NULL
resolutionEllipsoidAxis31	float	YES	NULL
resolutionEllipsoidAxis32	float	YES	NULL
resolutionEllipsoidAxis33	float	YES	NULL
resolutionEllipsoidValue1	float	YES	NULL
resolutionEllipsoidValue2	float	YES	NULL
resolutionEllipsoidValue3	float	YES	NULL

Comment: STARANISO fits an ellipsoid to the anisotropic cut-off surface, describing this in terms of three principal axes (vectors of unit length) and resolution limits (in Angstrom) along each axis. The proposed fields give the direction cosines of the three principal axes of the ellipsoid in the standard orthonormal Cartesian frame associated with the crystal frame (e.g. the first axis has a triplet of directional cosines resolutionEllipsoidAxis11, resolutionEllipsoidAxis12, resolutionEllipsoidAxis13 and a corresponding length (resolution value) of resolutionEllipsoidValue1).

BLSession.archived boolean

We'd like to have another column in the BLSession table: archived boolean DEFAULT False.

The purpose is to tell the application that the data is archived and no longer available on disk.

ALTER TABLE BLSession
    ADD archived boolean DEFAULT False 
        COMMENT 'The data for the session is archived and no longer available on disk';

sessionId and blSampleId columns on dataCollection table

Hi,

sessionId exists within dataCollectionGroup and it associates a data collection group to the experimental session or visit. It is not NULL so it is filled in for each data collection group.

By other hand, a data collection is always linked to a data collection group. I wonder if someone knows why sessionId is also existing in the dataCollection Table.

Likewise for blSampleId.

I think it is redundant and unnormalizes the data model. So, we propose to remove them.

Schema Fixing: Consolidate *program tables

I would like to suggest consolidating the program/programattachment tables. At the moment there is various duplication of these tables across the schema. I would suggest AutoProcProgram is used by any table that runs a "Process", and attachments are added to AutoProcProgramAttachment. This then makes the database look as follows:

We can deprecate phasingprogram/attachment, and make screening also link to this, and thus store attachments for any process in a central location = easy sql

Schema fixing: gridinfo should link to datacollection not datacollectiongroup

GridInfo currently links to DataCollectionGroup, and not DataCollection. This means that a datacollectiongroup cannot contain multiple grid scans. I realise at ESRF you group these things using workflow, but we should not make these sort of assumptions.

Add a new field in BLSample table : filePath

This is a request from CRIMS and ID30B, to be able to handle the plates repositioning.
It will be used for example to store an image of the drop.

Changes for automatic shipment creation

To aid our automatic shipping creation process i would like to propose the following changes:

CourierTermsAccepted

shippingid int (foreign key to shipping)

We should have had this in the beginning but at the time users accepted terms before the shipment was created. Terms are now accepted afterwards allowing us to use the shippingid

Shipping

deliveryAgent_flightcode_timestamp timestamp (date flight code created, if automatic)
deliveryAgent_label text (base64 encoded pdf of airway label)

readybytime datetime (time shipment will be ready)
closetime datetime (time after which shipment cannot be picked up)
physicallocation varchar(50) (where shipment can be picked up from: i.e. Stores)

deliveryAgent_pickupconfirmation_timestamp (date picked confirmed)
deliveryAgent_pickupconfirmation varchar(10) (confirmation number of requested pickup)
deliveryAgent_readybytime datetime (confirmed readyby time)
deliveryAgent_callintime datetime (confirmed courier call in time)

Dewar

weight number (dewar weight in kg)

Self explanatory

Laboratory

postcode varchar(15)

I would like to register postcode separately from the "Address" field so it can be automatically sent to the courier

@KarlLevik can weigh in with SQL alter statements

New enum options for DataCollectionGroup.experimentType to support serial crystallography

We (DLS) would like to add two new options to DCG experimentType in order to support serial crystallography: 'Serial Fixed' and 'Serial Jet'.

In the DLS fork of the database schema the following DDL will add the needed options:

ALTER TABLE DataCollectionGroup 
  MODIFY experimentType enum('SAD','SAD - Inverse Beam','OSC',
    'Collect - Multiwedge','MAD','Helical','Multi-positional','Mesh','Burn',
    'MAD - Inverse Beam','Characterization','Dehydration','tomo','experiment','EM','PDF',
    'PDF+Bragg','Bragg','single particle', 'Serial Fixed', 'Serial Jet'), 
ALGORITHM=INPLACE;

New column Detector.localName or .friendlyName

We have detectorSerialNumber, but that's not the name used by staff when they talk about the various detectors.

So it would be good if we could have a column to identify the detector using a more "friendly" name.

This is especially useful on beamlines with more than one detector.

New column ScreeningOutput.alignmentSuccess?

We were wondering if it would make sense to have a new column ScreeningOutput.alignmentSuccess? This would be similar to existing columns indexingSuccess and strategySuccess in the same table. Both these have datatype tinyint(1), so are basically booleans, and this is what we would like for alignmentSuccess as well.

This idea originated from @graeme-winter

What are your thoughts? @antolinos @stufisher @olofsvensson or others?

Add AxisStart to EnergyScan / XFESpectrum

We would like to record axis start position along with EnergyScans / MCAs.

Thus would like to add
AxisStart float to EnergyScan & XFEFluorescenceSpectrum

Schema fixing: We need a BLSession.rootPath column

I've recently been looking into "archiving" sessions: After moving the sessions' directory to a different location (different mount point), all the file paths stored in the ISPyB database need to be updated.

E.g. we would have a session in a directory /dls/i04/data/2018/mx12345-123/ and we move it to /arc/i04/data/2018/mx12345-123/.

I've written a stored procedure that executes these updates in the database, and it seems to do the job correctly for the two sessions we've tested so far. However, whenever we add more path columns to the database schema, I would have to update the stored procedure.

To me it would seem sensible if we had a root path for each BLSession, and all other paths were relative to this. (This should be backwards compatible since old BLSessions don't have a root path to which the other paths should be relative ... Does that make sense?)

At the moment we're duplicating a lot of data by storing the full path for everything inside a BLSession. It's wasteful and also isn't adhering to normalisation principles.

Add 'groupName' to the PhasingStep table and 'input' to PhasingStepAttachment

We would like to make to changes related to storing phasing results in ISPyB:

A new column 'groupName' (varchar(45)) in the PhasingStep table. The idea is to be able to link with group names entered in the Structure table.
A new column 'input' (boolean) in the PhasingProgramAttachment table. The idea is to be able to store the input PDBs used for phasing into the database / pyarch and to distinguish these PDBs from the result PDBs.

We are well aware that this proposition is in conflict with #32, however, we would like to do this change rapidly in order to get this functinality deployed before our long shutdown on December 10th. During our long shutdown we can then deal with #32.

Table for auto processing analysis

Diamond is now running xtraige on each autoprocessed data set, we would like to propose a table to store these results:

AutoProcProgramAnalysis
-------------------------------
autoprocprogramanalysisid primary key auto incremnet
autoprocprogramid int foreign key
severity int (0-2)
message varchar(200)
description text

Example:
Severity: 0 == good, 1 == alert, 2 == bad
Message

"The overall completeness in low-resolution shells is less than 90%"

Description:

The following table shows the completeness of the data to 5.0 A. Poor
low-resolution completeness often leads to map distortions and other
difficulties, and is typically caused by problems with the crystal orientation
during data collection, overexposure of frames, interference with the beamstop,
or omission of reflections by data-processing software.

| Resolution range | N(obs)/N(possible) | Completeness |

| 31.4594 - 10.6518 | [53/151] | 0.351 |
| 10.6518 - 8.5098 | [46/134] | 0.343 |
| 8.5098 - 7.4504 | [55/131] | 0.420 |
| 7.4504 - 6.7766 | [55/125] | 0.440 |
| 6.7766 - 6.2951 | [56/128] | 0.438 |
| 6.2951 - 5.9265 | [51/126] | 0.405 |
| 5.9265 - 5.6315 | [57/123] | 0.463 |
| 5.6315 - 5.3876 | [54/122] | 0.443 |
| 5.3876 - 5.1811 | [42/119] | 0.353 |
| 5.1811 - 5.0031 | [51/130] | 0.392 |

@KarlLevik can weigh in with sql statements again

Change of processingStatus column in table AutoProcProgram

We would like to distinguish when a process:

has started
has finished successfully
has failed

For doing so we propose to change processingStatus type from TINYINT (0,1) to enumeration: RUNNING, FAILED, SUCCESS

Add status for MR and SAD phasing

The AutoProcStatus table is used for tracking the status of the automatic data processing pipelines. We would now need a similar table / mechanism for tracking the status of MR and SAD phasing pipelines.

Add ISa to ISPyB strategy statistics

The following results is part of the XDS CORRECT.LP file:

 ******************************************************************************
    CORRECTION PARAMETERS FOR THE STANDARD ERROR OF REFLECTION INTENSITIES
 ******************************************************************************

 The variance v0(I) of the intensity I obtained from counting statistics is
 replaced by v(I)=a*(v0(I)+b*I^2). The model parameters a, b are chosen to
 minimize the discrepancies between v(I) and the variance estimated from
 sample statistics of symmetry related reflections. This model implicates
 an asymptotic limit ISa=1/SQRT(a*b) for the highest I/Sigma(I) that the
 experimental setup can produce (Diederichs (2010) Acta Cryst D66, 733-740).

     a        b          ISa
 7.512E-01  2.195E-03   24.62

The 'a', 'b' and 'ISa' values should be added as new columns for the ISPyB AutoProcScalingStatistics table. Although the 'ISa' value can easily be calculated given the 'a' and 'b' values we should store also the calculated 'ISa' value.

Other programs (e.g aimless) also calculates ISa values. These values should also be stored in the AutoProcScalingStatistics table.

A role column for ProposalHasPerson

At Diamond we need a role column for the ProposalHasPerson table, similar to how we have a role column in Session_has_Person.

ALTER TABLE ProposalHasPerson
  ADD role enum('Co-Investigator','Principal Investigator','Alternate Contact');

Does this look sensible? (NULL is implicitly allowed as an option in the enum.)

Tables for storing fluorescence maps

One of the Diamond beamlines is using Xspress3 to collect MCAs at all points in a grid scan. They want to be able to show the results from this in the web application. Below are the tables we think should be able to hold the information needed.

CREATE TABLE XRFFluorescenceMappingROI (
    xrfFluorescenceMappingROIId int(11) unsigned auto_increment PRIMARY KEY,
    startEnergy float NOT NULL,
    endEnergy float NOT NULL,
    element varchar(2),
    edge varchar(2) COMMENT 'In future may be changed to enum(K, L)',
    ADD r tinyint unsigned COMMENT 'R colour component', 
    ADD g tinyint unsigned COMMENT 'G colour component', 
    ADD b tinyint unsigned COMMENT 'B colour component'
);

CREATE TABLE XRFFluorescenceMapping (
    xrfFluorescenceMappingId int(11) unsigned auto_increment PRIMARY KEY,
    xrfFluorescenceMappingROIId int(11) unsigned NOT NULL,
    dataCollectionId int(11) unsigned NOT NULL,
    imageNumber int(10) unsigned NOT NULL,
    counts int(10) unsigned NOT NULL,
    CONSTRAINT XRFFluorescenceMapping_ibfk1 
        FOREIGN KEY (xrfFluorescenceMappingROIId) REFERENCES XRFFluorescenceMappingROI(xrfFluorescenceMappingROIId) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT XRFFluorescenceMapping_ibfk2
        FOREIGN KEY (dataCollectionId) REFERENCES DataCollection(dataCollectionId) ON DELETE CASCADE ON UPDATE CASCADE
);

Does this look reasonable? Can this be merged into the official database schema?

EM Data Model

Following up from antolinos/em-model#1, here is the latest EM Data Model that i have put together from Alex's input, DLS Scisoft & EM Staff, and EPN EM people:

Further changes for XPDF

As discussed at the ISPyB meeting at Soliel earlier this week, Diamond has made some recent database schema developments to support our XPDF (pair distribution function) beamline.

While these schema changes have been "live" at Diamond for a little while now, we are happy to make changes to this if anyone finds a problem with it.

A few words about the schema changes

In the tables and columns you'll see the words DataCollectionPlan and Component. These are what we would like to rename the current DiffractionPlan and Protein tables to, respectively, in the future. This is to make the schema less MX specific.

packingFraction was previously erroneously added to Crystal, so we've now moved it to BLSample.

The table DataCollectionPlan_has_Detector allows us to have multiple Detectors for each DC plan, and to use the same plan for multiple detectors.

The table BLSample_has_DataCollectionPlan allows us to have multiple DC plans for each BLSample and also to use the same plan on multiple samples.

The enums DiffractionPlan.experimentKind and DataCollectionGroup.experimentType have to be extended to allow for XPDF data collection types.

Each DiffractionPlan needs to have a name to make them more easily recognizable by humans.

DataCollection_has_ScanParametersModel allows us to say that a DC was done using a set of scan parameters.

DataCollectionPlan_has_Detector allows each DC to be associated with more than one Detector.

I would have attached a diagram to show the changes, but alas, it's not possible.

The SQL statements

ALTER TABLE DiffractionPlan drop FOREIGN KEY DataCollectionPlan_ibfk2;
ALTER TABLE DiffractionPlan drop dataCollectionPlanGroupId;
DROP TABLE DataCollectionPlanGroup;

ALTER TABLE Crystal DROP packingFraction;
ALTER TABLE Crystal ADD theoreticalDensity float;
ALTER TABLE BLSample ADD packingFraction float;
ALTER TABLE Protein CHANGE theoreticalDensity density float;

DROP TABLE IF EXISTS DiffractionPlan_has_Detector;
DROP TABLE IF EXISTS BLSample_has_DiffractionPlan;
DROP TABLE IF EXISTS DataCollectionPlan_has_Detector;
DROP TABLE IF EXISTS BLSample_has_DataCollectionPlan;

CREATE TABLE DataCollectionPlan_has_Detector (
	dataCollectionPlanId int(11) unsigned NOT NULL,
    detectorId int(11) NOT NULL,
    exposureTime double,
    distance double,
    orientation double,
    PRIMARY KEY (`dataCollectionPlanId`, `detectorId`),
	CONSTRAINT DataCollectionPlan_has_Detector_ibfk1 FOREIGN KEY (dataCollectionPlanId) REFERENCES DiffractionPlan (diffractionPlanId),
    CONSTRAINT DataCollectionPlan_has_Detector_ibfk2 FOREIGN KEY (detectorId) REFERENCES Detector (detectorId)
);

CREATE TABLE BLSample_has_DataCollectionPlan (
	blSampleId int(11) unsigned NOT NULL,
    dataCollectionPlanId int(11) unsigned NOT NULL,
    PRIMARY KEY (`blSampleId`, `dataCollectionPlanId`),    
	CONSTRAINT BLSample_has_DataCollectionPlan_ibfk1 FOREIGN KEY (blSampleId) REFERENCES BLSample (blSampleId),
    CONSTRAINT BLSample_has_DataCollectionPlan_ibfk2 FOREIGN KEY (dataCollectionPlanId) REFERENCES DiffractionPlan (diffractionPlanId)
);

ALTER TABLE DiffractionPlan
  MODIFY experimentKind enum('Default','MXPressE','MXPressO','MXPressE_SAD','MXScore','MXPressM','MAD','SAD','Fixed','Ligand binding','Refinement',
    'OSC','MAD - Inverse Beam','SAD - Inverse Beam','MESH','XFE', 'Bragg', 'PDF', 'PDF+Bragg');

DROP TABLE IF EXISTS Protein_has_Lattice;
DROP TABLE IF EXISTS ComponentLattice;

CREATE TABLE ComponentLattice (
    componentLatticeId int(11) unsigned auto_increment PRIMARY KEY,
    componentId int(10) unsigned,
    spaceGroup varchar(20),
    cell_a double,
    cell_b double,
    cell_c double,
    cell_alpha double,
    cell_beta double,
    cell_gamma double,
    CONSTRAINT ComponentLattice_ibfk1 FOREIGN KEY (componentId) REFERENCES Protein (proteinId)
);

ALTER TABLE DiffractionPlan 
  ADD `name` varchar(20) AFTER `diffractionPlanId`; 

ALTER TABLE DataCollectionGroup 
  MODIFY experimentType enum('SAD','SAD - Inverse Beam','OSC','Collect - Multiwedge','MAD','Helical','Multi-positional','Mesh','Burn','MAD - Inverse Beam','Characterization','Dehydration','tomo','experiment','EM','PDF', 'PDF+Bragg', 'Bragg');

CREATE TABLE DataCollection_has_ScanParametersModel (
    dataCollectionId int(11) unsigned NOT NULL,
    scanParametersModelId int(11) unsigned NOT NULL,
	PRIMARY KEY (dataCollectionId, scanParametersModelId),
	CONSTRAINT DataCollection_has_ScanParametersModel_ibfk1
      FOREIGN KEY (dataCollectionId) REFERENCES DataCollection(dataCollectionId) ON DELETE CASCADE ON UPDATE CASCADE,
    CONSTRAINT DataCollection_has_ScanParametersModel_ibfk2
      FOREIGN KEY (scanParametersModelId) REFERENCES ScanParametersModel(scanParametersModelId) ON DELETE CASCADE ON UPDATE CASCADE
);

ALTER TABLE DataCollectionPlan_has_Detector CHANGE `orientation` `roll` double;

ALTER TABLE ScanParametersModel CHANGE modelNumber sequenceNumber tinyint(3) unsigned;

CREATE UNIQUE INDEX Detector_ibuk1 ON Detector (detectorSerialNumber);

DROP TABLE IF EXISTS DataCollectionPlan_has_Detector;

CREATE TABLE DataCollectionPlan_has_Detector (
    dataCollectionPlanHasDetectorId int(11) unsigned auto_increment PRIMARY KEY,
	dataCollectionPlanId int(11) unsigned NOT NULL,
    detectorId int(11) NOT NULL,
    exposureTime double,
    distance double,
    roll double,
    UNIQUE KEY (`dataCollectionPlanId`, `detectorId`),
	CONSTRAINT DataCollectionPlan_has_Detector_ibfk1 FOREIGN KEY (dataCollectionPlanId) REFERENCES DiffractionPlan (diffractionPlanId),
    CONSTRAINT DataCollectionPlan_has_Detector_ibfk2 FOREIGN KEY (detectorId) REFERENCES Detector (detectorId)
);

Can these changes be merged into the official database schema?

BLSession: Unique constraint on (proposalId, visit_number)

We'd like to add the following UNIQUE constraint on the BLSession table:

ALTER TABLE BLSession
  ADD CONSTRAINT UNIQUE INDEX (proposalId, visit_number);

This way we can ensure we don't get any duplicated visits a.k.a. sessions.

Measure marked sample

I would like to re-propose a table to extend our LIMS system, allowing us to measure crystals in plate view. We had previously submitted this to the mailing list and had a response that it was not useful for ESRF. I would like to move forward with the UI for this. @KarlLevik does this match your scripts?

blsampleimagemeasurement
---------------------------------------
blsampleimagemeasurementid autoinc pk
blsampleimageid fk blsampleimageid
blsubsampleid fk subsampleid
bltimestamp timestamp
startPosX double,
startPosY double,
startPosZ double, – ?
endPosX double,
endPosY double,
endPosZ double, – ?

Schema fixing: modify DataCollection.runStatus to become an enum

Currently, DataCollection.runStatus is a varchar(45).
I think it would make sense if this was modified to become an enum.
Currently, we (DLS) populate the column with the below values:

NULL
DataCollection Successful
DataCollection Stopped

We're also discussing a fourth option:

DataCollection Unsuccessful

NULL means that the data collection is still running.
'DataCollection Successful' means that the data collection has completed successfully.
'DataCollection Stopped' means that the data collection was interrupted by the user.
'DataCollection Unsuccessful' means that the data collection has completed unsuccessfully, i.e. failed in some way, e.g. due to missing images.

New phasingDirectory column for PhasingProgramRun

We need a new column in PhasingProgramRun for storing the location of the phasing execution.

Add a new field storageLocation in the Container.

Allow the location of individual containers to be tracked on site.
Default value should be set to DewarStorageLocation value, but can be changed if needed.

Add new file type CIF for PhasingProgramAttachment

Max needs a new file type CIF for PhasingProgramAttachment.

Deleted column for Shipping, Dewar, Container and BLSample

We'd like the users to be able to tag a shipment, dewar, container or sample as 'deleted' without actually deleting the row from the database as there could be other things elsewhere in the database referring to that row. For example, there could have been data collected against a BLSample already.

The effect of tagging one of these as deleted should be that it is hidden from the UI, the BCM and anything else that might make use of it.

To be able to tag one of these as deleted, we think the best solution is to have a 'deleted' column in each of the tables:

ALTER TABLE Shipping 
  ADD deleted boolean DEFAULT False COMMENT 'Flag to indicate the user has discarded this Shipping';
  
ALTER TABLE Dewar 
  ADD deleted boolean DEFAULT False COMMENT 'Flag to indicate the user has discarded this Dewar';

ALTER TABLE Container
  ADD deleted boolean DEFAULT False COMMENT 'Flag to indicate the user has discarded this Container';

ALTER TABLE BLSample
  ADD deleted boolean DEFAULT False COMMENT 'Flag to indicate the user has discarded this BLSample';

AutoProcProgramAttachment: new column 'primary' + add enum option 'Debug'

We'd like to propose the following change to the AutoProcProgramAttachment table:

ALTER TABLE AutoProcProgramAttachment
    MODIFY `fileType` enum('Log','Result','Graph', 'Debug') DEFAULT NULL COMMENT 'Type of file Attachment',
    ADD `primary` tinyint(1) COMMENT 'Indicate whether the attachment is the primary one for the particular autoProcProgramId and fileType';

The idea is to have only one single 'primary' attachment for each autoProcProgramId and fileType.

The purpose of these changes is to make it possible to indicate to the user which attachments are the most important ones as some softwares now apparently produce up to 75 attachments per run.

New "Notifications" table

Today we use at the ESRF the DataCollection.comments and DataCollectionGroup.comments fields for adding notifications from the automatic pipelines (workflows and auto-processing). There are many disadvantages with this approach:

Automatically added notifications are mixed with user comments
There's no information about the level of severeness of the notification (error, warning, info etc)
There's no information about the origin of the notification

I therefore suggest that we add a new table called "Notifications" with the following columns:

notificationId (auto-incremented)
level (enumeration of 'importantInfo, 'info', 'warning', 'error' etc)
origin (varchar(100), for example "Dimple")
message (varchar(255))
sessionId
sampleId
dataCollectionGroupId
dataCollectionId
autoProcIntegrationId
workflowId
workflowStepId
energyScanId
phasingStepId
screeningId
robotActionId
xfeFluorescenceSpectrumId

The idea is that the corresponding web service will take the following input:

level
origin
message
One of the Id, for example autoProcIntegrationId if the origin is an auto-processing pipeline

The web service will then be responsible to fill all other "Id" columns, for example in the case of autoProcIntegrationId the sessionId, sampleId, dataCollectionGroupId and dataCollectionId will be automatically filled.

This new table will make it very easy to get a list of all notifications at different levels: session, sample, data collection group and individual tasks like auto-processing, and also for different levels of severeness.

Change the collation for proposal title and protein name to utf8mb4_unicode_ci

Recently we had several troubles while ingesting data from User Portal due to special characters.
This would be fixed if we change the collations for these columns:
Proposal.title
Protein.name

DataCollection column for identifying the internal data path in the HDF5 file

We (DLS) would like to have a new DataCollection column for indentifying the path inside the HDF5 file that points to the data related to the data collection. We need this because there can be data for more than one data collection in one single HDF5 file, or at least that is the case for our PDF beamline.

I don't know what a good name would be. hdf5Path? hdf5InternalPath? Something else?

Hereby inviting discussion - @stufisher @graeme-winter @olofsvensson @antolinos ...

creating a new protein from SMIS leads in multiple proteins in db

hello
it seems that something weird is going on since we just created safety sheets with samples starting W under MX415 and you can see that there are multiple proteins created in the database. does this happen because I created samples attached to this proteins? this should not happen since they have the same space group... and moreover the acronym is not present in all of them??
please have a look
Thanks

Tables for processing jobs

We would like to have some tables to help keep track of processing jobs. These can be triggered by users of the ISPyB web application, or from other sources, including automatically when a data collection is happening.

Here's what the tables for this might look like:

CREATE TABLE ProcessingJob (
  processingJobId int(11) unsigned AUTO_INCREMENT PRIMARY KEY,
  dataCollectionId int(11) unsigned,
  displayName varchar(80) COMMENT 'xia2, fast_dp, dimple, etc',
  comments varchar(255) COMMENT 'For users to annotate the job and see the motivation for the job' ,
  recordTimestamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'When job was submitted',
  recipe varchar(50) COMMENT 'What we want to run (xia, dimple, etc).',
  automatic boolean COMMENT 'Whether this processing job was triggered automatically or not',
  CONSTRAINT ProcessingJob_ibfk1 FOREIGN KEY (dataCollectionId) REFERENCES DataCollection(dataCollectionId)
) COMMENT 'From this we get both job times and lag times';

CREATE TABLE ProcessingJobParameter (
  processingJobParameterId int(11) unsigned AUTO_INCREMENT PRIMARY KEY,
  processingJobId int(11) unsigned,
  parameterKey varchar(80) COMMENT 'E.g. resolution, spacegroup, pipeline',
  parameterValue varchar(255),
  CONSTRAINT ProcessingJobParameter_ibfk1 FOREIGN KEY (processingJobId) REFERENCES ProcessingJob(processingJobId)
);

CREATE TABLE ProcessingJobImageSweep (
  processingJobImageSweepId int(11) unsigned AUTO_INCREMENT PRIMARY KEY,
  processingJobId int(11) unsigned,
  dataCollectionId int(11) unsigned,
  startImage mediumint unsigned,
  endImage mediumint unsigned,
  CONSTRAINT ProcessingJobImageSweep_ibfk1 FOREIGN KEY (processingJobId) REFERENCES ProcessingJob(processingJobId),
  CONSTRAINT ProcessingJobImageSweep_ibfk2 FOREIGN KEY (dataCollectionId) REFERENCES DataCollection(dataCollectionId)
) COMMENT 'This allows multiple sweeps per processing job for multi-xia2';

ALTER TABLE AutoProcProgram
   ADD processingJobId int(11) unsigned COMMENT 'Which processing job triggered this auto processing',
   ADD CONSTRAINT AutoProcProgram_FK2 FOREIGN KEY (processingJobId) REFERENCES ProcessingJob(processingJobId);

We had originally named these tables Reprocessing*, but after some thought concluded ProcessingJob* was better since these jobs could be re-processing as well as the initial processing. Also, the "Job" suffix makes it clearer what we're talking about.

New metrics for PhasingStatistics

We need the following new metrics for the PhasingStatistics metric table:

"Start R-work"
"Start R-free"
"Final R-work"
"Final R-free"

Pins with multiple samples

One of our beamlines is keen to start using multi-sample pins, i.e. pins with multiple samples on them. These samples can have different crystal and protein properties.

The question is then, how do we fit this into the existing database schema? We've discussed various options at Diamond, and have landed on the following proposal:

We add a new colunm to BLSample: multiSamplePosition smallint unsigned. In addition, we will populate the loopType varchar(45) column with a value to indicate it's a multi-sample pin. We then get data like the below in the BLSample table:

blSampleId	containerId	loopType	code	location	multiSamplePosition
1	1	multi-pin	aaa	1	1
2	1	multi-pin	bbb	1	2
3	1	multi-pin	ccc	1	3
4	1	multi-pin	ddd	1	4

Does this make sense? Is there a better solution for this?

Schema fixing: Consolidate DataCollection, EnergyScan and XFEFluorescenceSpectrum

Big Picture

As has been highlighted in the past, there are a lot of columns in EnergyScan and XFEFluorescenceSpectrum that also exist in DataCollection. Thinking about it, EnergyScan and XFEFluorescenceSpectrum really are types of DataCollection, so should therefore perhaps be merged into DataCollection.

Additionally, there are some columns in the DataCollection table that refer to "images". I think we should rename these so the table make more sense to disciplines that don't use images.

I hereby propose to do this work. Let's keep the name DataCollection and merge the columns from the two other tables into this. We will also need a DataCollection.experimentType enum to say whether the data is from an XRF spectrum, energy scan or other type of data collection as currently enumerated in the DataCollectionGroup.experimentType enum.

Columns moving to DataCollectionFileAttachment

We can use the DataCollectionFileAttachment table to store the following columns in EnergyScan:

scanFileFullPath
jpegChoochFileFullPath
filename
choochFileFullPath

... and also these columns in XFEFluorescenceSpectrum:

jpegScanFileFullPath
annotatedPymcaXfeSpectrum
fittedDataFileFullPath
scanFileFullPath

... and these columns in DataCollection:

xtalSnapshotFullPath1
xtalSnapshotFullPath2
xtalSnapshotFullPath3
xtalSnapshotFullPath4
processedDataFile
datFullPath
bestWilsonPlotPath

(This will probably require some extra values in the fileType enum in DataCollectionFileAttachment.)

Columns moving to DataCollection

We can use the following mapping:

DC	EnergyScan	Fluorescence
dataCollectionId (PK)	energyScanId (PK)	xfeFluorescenceSpectrumId (PK)
DCG.sessionId	sessionId	sessionId
blSampleId	blSampleId	blSampleId
blSubSampleId	blSubSampleId	blSubSampleId
detectorId	fluorescenceDetector
beamSizeAtSampleX	beamSizeHorizontal	beamSizeHorizontal
beamSizeAtSampleY	beamSizeVertical	beamSizeVertical
transmission	transmissionFactor	beamTransmission
comments	comments	comments
crystalClass	crystalClass	crystalClass
	edgeEnergy
	element
	startEnergy
	endEnergy
	peakEnergy
	inflectionEnergy
		energy
	synchrotronCurrent
startTime	startTime	startTime
endTime	endTime	endTime
exposureTime	exposureTime	exposureTime
fileTemplate	filename	filename
	inflectionFDoublePrime
	inflectionFPrime
	peakFDoublePrime
	peakFPrime
averageTemperature	temperature
wavelength		wavelength
totalAbsorbedDose or totalExposedDose? Or neither because EM / different units?	xrayDose
flux	flux	flux
flux_end	flux_end	flux_end
imageDirectory	workingDirectory	workingDirectory
axisStart, axisEnd		axisPosition

Renaming image columns in DataCollection

numberOfImages ➞ numberOfDataPoints
startImageNumber ➞ startDataPointNumber
imageDirectory ➞ dataDirectory
imagePrefix ➞ filePrefix
imageSuffix ➞ fileSuffix
imageContainerSubPath ➞ dataContainerSubPath

SQL (DML + DDL)

See the work-in-progress for the SQL statements needed for this issue here - please run them in the order given:

New processing tables for energy scans and MCAs.
Changes to the DataCollections tables.
The data migration script.
The statements to drop the old tables (clean-up)

Appendix: Wide tables with many NULLs

I know that having lots of columns that we know will be NULL for many of the rows feels instinctively wrong as we could have normalised and broken the table up into a parent table with multiple child tables. (E.g. in the case of DataCollection, maybe this could be done based on experimentType.)

So I want to assure you that at least from a database storage optimisation point of view, wide tables with lots of NULLs is fine. In MariaDB and MySQL, assuming we use the default storage engine, InnoDB, and either DYNAMIC (now default in MariaDB) or COMPACT row format, the following is true:

The variable-length part of the record header contains a bit vector for indicating NULL columns. If the number of columns in the index that can be NULL is N, the bit vector occupies CEILING(N/8) bytes. (For example, if there are anywhere from 9 to 16 columns that can be NULL, the bit vector uses two bytes.) Columns that are NULL do not occupy space other than the bit in this vector.

Source: https://dev.mysql.com/doc/refman/8.0/en/innodb-row-format.html#innodb-row-format-compact

So as long as the columns are NULL-able to start with, we're not wasting any space at all by populating them with NULL values. Whereas with a normalised approach we would need an INT for the primary key of each child table + another INT for the foreign key pointing to the parent table + indexes and other overhead that comes with extra tables.

I'm happy to provide a test case as evidence in support of this, if anyone is interested. E.g. we can create two DataCollection tables in a dev database: One where a lot of columns are populated with NULL values and another where the same columns do not exist. Then we can compare the difference in size of the tables on disk, and hopefully observe that the difference is very small.

Add new column SMILES to Structure table

We need a new column SMILES (varchar(400) like for BLSample) for the Structure table.

Schema fixing: Rename Protein, Crystal and DiffractionPlan

In order to make the database schema more universal and less MX specific, I'd like to propose the following set of changes:

Proposed solution v1.0

Rename these three tables:
Protein ➞ Component
Crystal ➞ BLSampleType
DiffractionPlan ➞ DataCollectionPlan

... and their primary key columns:
Protein.proteinId ➞ Component.componentId
Crystal.crystalId ➞ BLSampleType.blSampleTypeId
DiffractionPlan.diffractionPlanId ➞ DataCollectionPlan.dataCollectionPlanId

... and rename this column:
Protein.sequence ➞ Component.content

... as well as all foreign key columns referencing the renamed tables:
BLSample.diffractionPlanId ➞ BLSample.dataCollectionPlanId
Crystal.proteinId ➞ BLSampleType.componentId
Crystal.diffractionPlanId ➞ BLSampleType.dataCollectionPlanId

Changes in bold (not all columns shown):

$protein-crystal-diffractionplan$

Proposed solution v2.0

NOTE:

Consider merging of SampleMaterial with Macromolecule, and SampleInstance with Specimen.

Rename these tables:
Protein ➞ SampleMaterial
Crystal ➞ SampleProperties
BLSample ➞ SampleInstance
BLSampleGroup ➞ SampleInstanceGroup
ComponentType ➞ SampleMaterialType
ComponentSubType ➞ SampleMaterialSubType
Component_has_SubType ➞ SampleMaterial_has_SubType
BLSampleType_has_Component ➞ SampleProperties_has_Material
BLSampleGroup_has_BLSample ➞ SampleInstanceGroup_has_Instance
DiffractionPlan ➞ DataCollectionPlan

... and their primary key columns:
Protein.proteinId ➞ SampleMaterial.sampleMaterialId
Crystal.crystalId ➞ SampleProperties.samplePropertiesId
DiffractionPlan.diffractionPlanId ➞ DataCollectionPlan.dataCollectionPlanId

... and rename this column:
Protein.sequence ➞ SampleMaterial.content

... as well as renaming all foreign key columns referencing the renamed tables, but make SampleProperties an optional table, so we want a foreign key column sampleMaterialId in the SampleInstance table, and we want to remove the foreign key from SampleProperties pointing to SampleMaterial:

BLSample.diffractionPlanId ➞ SampleInstance.dataCollectionPlanId
Crystal.proteinId: remove
SampleProperties.sampleMaterialId: new column
Crystal.diffractionPlanId ➞ SampleProperties.dataCollectionPlanId

Changes in bold (not all columns shown):

Appendix: Examples of sample material types (non-exhaustive)

These values could be used to populate the SampleMaterialType table. (Perhaps another column could be added to indicate which "instrument type" the material type is valid for?)

proteins
RNA
DNA
viruses
small molecules
ribosomes?

proteins
viruses

X-ray Pair Distribution Function

"crystalline, semi-crystalline and amorphous solids and liquids"

Powder diffraction

Metal-organic-frameworks
Lithium-ion battery and Solid Oxide Fuel Cell materials
Alloys
Self-assembled nano-scale solids
High temperature superconductors
Bio-engineered materials and minerals

Small Angle Scattering (SAS)

Samples in solution
- Colloids
- Polymers
- Proteins
- Fibres
- Self-assembled systems (soft condensed matter)
- Coacervates / liposomes
Emulsions
- Aqueous-Oil
- Aqueous-Aqueous
- Oil-Oil
- Pickering
Liquid crystals
- Pure
- Mixed
- In solution
Biomaterials
- Larger scale
  - Bone
  - Tissue
  - Organs
- Smaller scale
  - Viruses
  - Protein-drug binding
Materials science
- Monocrystalline materials
- Polycrystalline materials
- Composite materials
  - i.e. fibres in polymers, weave and weft systems
- Powders
- Inorganic self-assembled systems
Surface / grazing incidence
- Thin films
  - Simple systems
    - Single layer
    - Multiple layer
  - Complex systems
  - Randomly or uniformly distributed particles
    - On their own
    - As part of a layered system
    - Intercalated into a layered system
  - Liquid crystals / self-assembled systems
  - Industrial / natural systems
    - Interacting / drying on a surface
Dynamic systems thereof (experiments)
- Transition from sample type a to b e.g.
  - Solid-solid transitions
    - Particles melting to form a thin film
  - Liquid-solid transitions
    - Monomer solution forming hard polymer
  - Solid-liquid transitions
    - Powder dispersion / dissolution studies
  - Colloid-solid transitions
    - Proteins self-assembling into fibres
  - Evaporation / drying
    - Inks on surfaces

Data collection limits

We have a requirement at Diamond to be able to specify that certain parameter values are restricted to certain ranges (so we need min and max values) for each beamline. These ranges can change over time. So far we have identified the following:

detector distance
beam size
wavelength
omega
kappa
phi
exposure time

For detector distance, we have min and max values in the Detector table. For the others, I'm not sure anything exists yet? Would the right place for these be the BeamLineSetup table?

Schema fixing: Optimise the structure of the ImageQualityIndicators table

At Diamond the ImageQualityIndicators table is the largest (in terms of records) in the whole database, approximately 114 million records.

At this size we need to make sure the table is optimally structured, that unused/unnecessary columns and indexes are removed etcetera.

In the first instance I'd like to propose the following change:

ALTER TABLE ImageQualityIndicators 
  MODIFY dataCollectionId int(11) unsigned NOT NULL FIRST,
  MODIFY imageNumber mediumint(8) unsigned NOT NULL AFTER dataCollectionId,
  DROP FOREIGN KEY _ImageQualityIndicators_ibfk3,
  DROP KEY ImageQualityIndicators_ibfk3,
  DROP PRIMARY KEY,
  DROP imageQualityIndicatorsId,
  ADD PRIMARY KEY (dataCollectionId, imageNumber);

As you can see, I'm removing the primary key (including the column), removing the explicit foreign key to DataCollection (but not the column itself), creating a new, compound primary key consisting of dataCollectionId and imageNumber, and (for cosmetic reasons) moving those two columns first in the table.

The advantages of this are:

To improve the performance of INSERTs, since we don't have an auto_incrementing primary key, we now have only one index to maintain (the primary key index), and we have no foreign key constraint that needs to be enforced
Smaller memory and disk space needed, again because of the smaller number of indexes, and one less column
Faster full database backup

Please note:

Your table may not look exactly like ours, so the query might need some tweaking
Assuming you're actually using this table and have a lot of records in it, it may not be a good idea to just run the query as-is as it will lock the table while the query is running, and this can take a very long time to complete, perhaps more than an hour. I'd suggest using tools such as pt-online-schema-change, which is what I use, or gh-ost.

I do realize there are other tables in the database that also need to be reviewed and probably modified, but we need to start somewhere, and this is as good a candidate as any.

Further changes that could be done would be to remove unused columns. My colleague Markus did a review and here's what he had to say:

We currently write meaningful information to:

datacollectionid
image_number

for dozor:

dozor_score

for in-house per-image-analysis:

spots_total
totalintegratedsignal
good_bragg_candidates
method1_res

and then we fill these values as well:

in_res_total # always set to same value as spots_total
method2_res # always same value as method1_res
programid # always 65228265
icerings # always 0
maxunitcell # always 0
pctsaturationtop50peaks # always 0
inresolutionovrlspots # always 0
binpopcutoffmethod2res # always 0

What do you think about the proposed changes? And are all the columns populated with meaningful data at other synchrotrons? Let's discuss ...

Add diffractionplanid to datacollection and containerqueuesample

I would like to add diffractionplanid to datacollection and containerqueuesample. This will allow us to know which diffraction plan has been queued and which diffractionplan a datacollection came from.

ALTER TABLE `DataCollection` 
	ADD `diffractionPlanId` INT UNSIGNED NULL DEFAULT NULL,
	ADD CONSTRAINT `DataCollection_ibfk9`
		FOREIGN KEY (`diffractionPlanId`)
			REFERENCES `DiffractionPlan`(`diffractionPlanId`)
				ON DELETE NO ACTION ON UPDATE NO ACTION;

ALTER TABLE `ContainerQueueSample` 
	ADD `diffractionPlanId` INT UNSIGNED NULL DEFAULT NULL,
	ADD CONSTRAINT `ContainerQueueSample_ibfk3`
		FOREIGN KEY (`diffractionPlanId`)
			REFERENCES `DiffractionPlan`(`diffractionPlanId`)
				ON DELETE NO ACTION ON UPDATE NO ACTION,
	ADD `blSampleId` INT(10) UNSIGNED NULL DEFAULT NULL,
	ADD CONSTRAINT `ContainerQueueSample_ibfk4`
		FOREIGN KEY (`blSampleId`)
			REFERENCES `BLSample`(`blSampleId`)
				ON DELETE NO ACTION ON UPDATE NO ACTION;

Recording the goniometer orientation of data collections

We'd like to record the goniometer orientation of data collections. I'd like to propose the following addition to the DataCollection table to cater for this:

ALTER TABLE DataCollection
  ADD goniometerOrientation enum('horizontal', 'vertical');

State of a proposal

I'd like to propose a state column for the Proposal table:

ALTER TABLE Proposal
    ADD state enum('Open', 'Closed', 'Cancelled') NULL DEFAULT 'Open';

This way the ISPyB applications can prevent users from e.g. uploading shipments to proposals that are not open.

New table BeamCalendar

We would like to have a table for the "beam calendar":

CREATE TABLE BeamCalendar (
    beamCalendarId int(10) unsigned auto_increment,
    run varchar(7) NOT NULL COMMENT 'e.g. "2016-04", same as the run column in v_run',
    beamStatus varchar(24) NOT NULL COMMENT 'e.g. "User Mode", "UM Special Beam", "Start up/Machine dev", "Shutdown", ...',
    startDate datetime NOT NULL,
    endDate dateTime NOT NULL,
    PRIMARY KEY (beamCalendarId)
);

I realize this probably isn't needed at synchrotrons that integrate the ISPyB web application with SMIS or similar, but we do need this at Diamond and elsewhere that don't do this.