royallgroup / tcc

The Topological Cluster Classification algorithm

Home Page: https://royallgroup.github.io/TCC/

License: GNU General Public License v3.0


tcc's Introduction

README

Latest version of the Topological Cluster Classification (TCC) code.

For documentation, open index.html in the docs folder or visit https://royallgroup.github.io/TCC/.

Citation

If this software is used in the preparation of published work, please cite:
Malins A, Williams SR, Eggers J & Royall CP, "Identification of Structure in Condensed Matter with the Topological Cluster Classification", J. Chem. Phys. 139, 234506 (2013).

Licenses

This software is distributed under the GNU General Public License v3. For more details see the LICENSE file.

This software makes use of libraries released under other licenses.

tcc's People

Contributors: franciturci, fturci, merrygoat

tcc's Issues

Additional file format: DynamO config.*.xml.bz2

The DynamO package has been extensively used to generate hard sphere data. Among its output is "config.*.xml.bz2"

It would be nice to be able to read this into the TCC directly, or perhaps a Python script is more appropriate?

Ben has the file format....
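A minimal sketch of what such a Python reader could look like, assuming DynamO stores positions as XML attributes; the tag names used here (ParticleData, Pt, P) are guesses and must be checked against a real sample file:

    import bz2
    import xml.etree.ElementTree as ET

    def read_dynamo_config(path):
        """Return a list of (x, y, z) positions from a config.*.xml.bz2 file."""
        with bz2.open(path, "rt") as f:
            tree = ET.parse(f)
        positions = []
        # Assumed layout: <ParticleData><Pt><P x=".." y=".." z=".."/></Pt>...
        for pt in tree.getroot().iter("Pt"):
            p = pt.find("P")
            positions.append((float(p.get("x")), float(p.get("y")), float(p.get("z"))))
        return positions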

Unit tests for detection of individual structures

Running the TCC on an isolated cluster geometry should return exactly one instance of the structure, plus possibly multiple subunits if it is a composite structure. Having unit tests would facilitate the implementation of new structures through test-driven design.
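A sketch of what one such test could look like, assuming a hypothetical run_tcc() helper that executes the TCC on an XYZ file and returns per-cluster counts:

    import unittest

    class TestIsolatedClusters(unittest.TestCase):
        def test_isolated_13A(self):
            # run_tcc() is a hypothetical wrapper around the tcc executable
            counts = run_tcc("test_geometries/13A.xyz")
            # An isolated 13A geometry should be detected exactly once
            self.assertEqual(counts["13A"], 1)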

11F performance improvement

When detecting 11F clusters, the function get_bonded_6As can loop over mmemsp4c[5a_common_particle] instead of over all 6As.

Standardisation of coordinate file format

It would be good to have a standard format for xyz type files.

The dinosaur approach of VMD is that the line after N (the comment line) can be anything. This is the situation as it stands, although an earlier version used the PDB format.

Ovito uses an enriched XYZ format, which is presumably compatible with the format above (and would let us read much of what is currently in the *.in files etc. from the coordinate files).
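For reference, the extended XYZ convention used by Ovito (and ASE) puts key-value metadata on the comment line, so a minimal compatible file looks like this (values illustrative):

    2
    Lattice="10.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 10.0" Properties=species:S:1:pos:R:3
    A 1.234 5.678 9.012
    B 3.141 2.718 1.618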

Create key for cluster output

Create a key which describes the order of particle IDs in the cluster output.

E.g. for 11C the cluster is output in the order [s_com, s_i, s_j, r_ca, r_cb, d_i, d_i, d_j, d_j, unc_i, unc_j]

This should take into account the sorting that goes on for each cluster type after it is written to the hc array.
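One possible form for the key is a machine-readable mapping, sketched here in Python using the 11C ordering quoted above (other entries would be filled in from the detection code):

    CLUSTER_OUTPUT_KEY = {
        "11C": ["s_com", "s_i", "s_j", "r_ca", "r_cb",
                "d_i", "d_i", "d_j", "d_j", "unc_i", "unc_j"],
        # one entry per cluster type, reflecting any post-detection sorting
    }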

Decide which new clusters to add to the TCC

Hi Josh

Peter has been refactoring the TCC to be able to take new clusters.

Could we test this by adding one from your HS-morphometric FEL minima?

And/or a cluster from the CuZr stuff that is not already in the TCC.

Output clusters to an XYZ file

This is a common procedure for visualization purposes. At the moment it requires a script to match the RAW files with the coordinates. This would be especially useful for undergrads, since they like to see the clusters they have produced.
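A sketch of the kind of script this would replace, assuming (hypothetically) that the RAW file carries one membership flag per particle per frame, with 'A'/'B' marking in-cluster particles; the real header layout and flag values should be checked against the documentation:

    def write_cluster_xyz(coords, species, flags, out_path, cluster_name):
        """Write only the particles flagged as cluster members to an XYZ file."""
        members = [i for i, f in enumerate(flags) if f in ("A", "B")]
        with open(out_path, "w") as out:
            out.write(f"{len(members)}\n{cluster_name} cluster particles\n")
            for i in members:
                x, y, z = coords[i]
                out.write(f"{species[i]} {x:.6f} {y:.6f} {z:.6f}\n")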

Improvement to 12K method

Symmetry can be used to reduce the number of calls to the bond method in the first part of the 12K identification.

It appears to be the case that the particles in the two rings are ordered sequentially anticlockwise (from the perspective of the center particle) with 0-3 in the first ring and 4-7 in the second ring. This is observed but not rigorously proved yet.

This means that only one pair of bonds (one particle bonded to two from the other ring) is needed to uniquely identify the configuration. This would decrease the calls to the bonds method from 24 per cluster to 3 per cluster.

Build test should fail when build fails

At the moment it is possible for the build to fail but the build test to pass.

This is probably because the return code from subprocess.run does not reflect the return code from cmake.
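A minimal fix sketch, assuming the build test drives CMake and make through subprocess: passing check=True makes any non-zero exit status raise CalledProcessError instead of passing silently.

    import subprocess

    def build(build_dir):
        # check=True raises subprocess.CalledProcessError on a non-zero exit
        subprocess.run(["cmake", ".."], cwd=build_dir, check=True)
        subprocess.run(["make"], cwd=build_dir, check=True)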

Python API to TCC

Many users have already implemented simple Python interfaces to handle the input/output with a Python front end, so it would be good to have a canonical version. A simple implementation would create a temporary working directory, execute the TCC within it, extract the data and then delete the directory.
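A minimal sketch of that implementation, assuming the tcc executable is on the path and reads its input from the working directory (file handling details are illustrative):

    import pathlib
    import shutil
    import subprocess
    import tempfile

    def run_tcc(xyz_path):
        """Run the TCC in a throwaway directory and collect its output files."""
        with tempfile.TemporaryDirectory() as tmp:
            shutil.copy(xyz_path, tmp)
            subprocess.run(["tcc"], cwd=tmp, check=True)
            # Read the outputs before the directory is deleted
            results = {p.name: p.read_text()
                       for p in pathlib.Path(tmp).iterdir() if p.is_file()}
        return results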

Consider removing net clusters

This bit of code is horribly implemented, and it is something that can be (and already is) done as a post-analysis step in Python.

Removing (or fixing) net clusters will also allow testing do_clust on memset in setup.c to improve performance.

Build should include install option

@FTurci comment in #50 thread:

Other remark: "make" should be followed by "make install", and we should not have plenty of executables copied everywhere. "cmake" also has a prefix option. This was a source of confusion when we tried to identify the issue.

Unclear output

If only a subset of particles is analysed, the static clust output file reports those not analysed as having a population of zero. This could be misleading. The static clust file should show when a cluster was not analysed.

Clusters can be detected over PBCs in small boxes

Clusters can be detected twice over the PBCs if the box is less than twice the cutoff length in any dimension.

This is because cluster A can be connected to cluster B twice, once over the PBC. This is exposed when the mem array is used to loop over clusters.
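A minimal pre-flight guard for this failure mode (names illustrative): refuse to run when any box dimension is under twice the cutoff.

    def check_box(box_lengths, r_cut):
        if min(box_lengths) < 2 * r_cut:
            raise ValueError("box is under twice the bond cutoff in some "
                             "dimension; clusters may be counted twice over the PBCs")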

Compilation fails on Mac OS

The latest development version does not compile on the Mac. The problem is that Mac OS is not handled in the make_directory function.

Here is the error during make:

/Users/ft14968/Documents/GitHub/TCC/tcc/src/tools.c: In function 'make_directory':
/Users/ft14968/Documents/GitHub/TCC/tcc/src/tools.c:105:12: warning: implicit declaration of function '_mkdir' [-Wimplicit-function-declaration]
         if(_mkdir(name) != 0) {
            ^~~~~~
[ 97%] Building C object tcc/src/CMakeFiles/tcc.dir/voronoi_bonds.c.o
[100%] Linking C executable ../../../bin/tcc
Undefined symbols for architecture x86_64:
  "__mkdir", referenced from:
      _make_directory in tools.c.o
ld: symbol(s) not found for architecture x86_64
collect2: error: ld returned 1 exit status
make[2]: *** [../bin/tcc] Error 1
make[1]: *** [tcc/src/CMakeFiles/tcc.dir/all] Error 2
make: *** [all] Error 2

Unit tests require tcc executable in path

Unit tests fail on @merrygoat's Windows machine because the unit test assumes the tcc executable is on the system path. A solution requires either an install feature (e.g. make install, as suggested in #59), instructions for adding the executable to the path, or a workaround for this requirement, e.g. directly locating the executable within the Python front end.
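A sketch of the workaround option, checking the path first and then falling back to the build tree (the bin/ location relative to the test file is an assumption):

    import shutil
    from pathlib import Path

    def find_tcc():
        """Locate the tcc executable on the path or in the local build tree."""
        return shutil.which("tcc") or str(Path(__file__).parent / ".." / "bin" / "tcc")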

Remove centers magic numbers

11A and 13A centers can be output to XYZ files. The variables to do this are set using magic numbers which are fragile with respect to changes in the cluster lists.

Only the centers of a limited number of clusters are interesting - not all clusters even have centers.

There are multiple ways of fixing this:

  • Output the centers of all clusters using the s_clust lists. This would produce a centers XYZ file for each of the cluster types. While this would be messy (many empty files), it would also be the most general: all clusters with centers, and all future clusters, would be output.
  • Iterate over the cluster name list to identify the positions of 11A and 13A and set the magic numbers at runtime. This reduces the number of files output and is most like the original behaviour. The disadvantage is that outputting the centers of other cluster types would not be supported without further changes.
  • Remove all center XYZ output and make sure that the centers are output as a separate species in the full cluster XYZ files. A Python script could be supplied to strip the centers from the XYZ and make center XYZs if required. This is the cleanest method and reduces the number of output types; however, for people processing cluster centers it increases the number of steps and the complexity required to get the data.

Which is best depends on how much the xyz centers are used and whether the centers of any other clusters are interesting.

Static cluster file broken

The BCC_15 row has one too many columns, which causes attempts to read the table with pandas to fail.

11A performance improvement

At the moment the 11A detection loops over all pairs of 6As to find linked spindles. Since a linked spindle should be stored in the mmem_sp4c array, it should be possible to loop over all 6A_i, loop over the two spindles, and then loop over the spindles in mmem. This is essentially the same as the detection used for the 9K.

13K clusters missed

Sometimes a 13K is detected but the independent particles not in 11F cannot be identified. In this case the cluster is not recorded. It is not currently known if this is an incorrect 13K identification or a correct 13K that is not reported correctly.

Linked to issue #2 which has a configuration which can produce these clusters.

Net cluster script input variables

At the moment the cluster priority list is set directly in the script. The priority list should either be read from the command line or from a file.
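A sketch of the command-line version using argparse (option names illustrative):

    import argparse

    parser = argparse.ArgumentParser(description="Net cluster analysis")
    parser.add_argument("--priority", nargs="+",
                        help="cluster names in priority order, e.g. 13A 12E 11F")
    parser.add_argument("--priority-file",
                        help="file containing one cluster name per line")
    args = parser.parse_args()
    priority = args.priority or open(args.priority_file).read().split()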

BCC15 Clusters not working

The detection of BCC clusters was turned off in commit 464b28f due to a segfault.

This is caused by an index out of range in the clusters.c function Clusters_GetBCC_15: the array "sj" is indexed by particle number even though it is a fixed-length array. It is not clear what purpose this variable serves.

We need to identify the function of sj and either remove it or correct the index.

Static cluster file annoying

The final two rows, containing nrows and other information, are very annoying, as one must know precisely how many structures to expect in order to read the central table correctly. These should be placed at the top, so the user can easily ignore them and reading the table becomes more robust when e.g. new structures are added.
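With the metadata at the top, the table could then be read without knowing the number of structures in advance, e.g. with pandas (filename and separator assumed):

    import pandas as pd

    # Skip the two metadata rows now at the top of the file
    table = pd.read_csv("static_clust", skiprows=2, sep=r"\s+")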

Create a generic XYZ output function

Combine the functions which output 11A centers and 13A centers into a generic function which can output any cluster type. This should be relatively easy given that all raw coordinates are stored in the s_clust name global arrays.

Too many files output

The TCC outputs raw cluster files for every cluster regardless of whether the cluster is selected for analysis. Only those clusters which are analysed should have an output file.

11C not correctly detected

The 11C detection algorithm does not check for bonds between the two pairs of non-common bonded particles in the 7A rings (labelled rd1, rd2, rd3 and rd4 in the methods paper). This allows detection of 11Cs with unbonded pairs. This is not a new issue; it has been present in the code for some years.

Enforcing this bond condition will decrease the number of detected 11C clusters.

Bond cutoff documentation

It should be made much clearer that the bond length cutoff still applies when the Voronoi construction is used to determine the bond network. One would naturally assume that this setting has no effect when the Voronoi method is used.

Perhaps the fixed-length cutoff should be turned off by default when Voronoi bond detection is turned on?

Implement 7T clusters

A 7Z cluster is a 6Z cluster with an extra particle.

There are two types of 7Z cluster depending on where the particle is attached.

The symmetric type is identified by an extra particle bonded to two ring particles and a non-common spindle; this equates to a bond to hc6z[0] and hc6z[2] and (hc6z[4] or hc6z[5]).

The asymmetric type is identified by an extra particle bonded to two ring particles and a common spindle; this equates to a bond to (hc6z[0] or hc6z[3]) and (hc6z[1] or hc6z[2]) and (hc6z[4] or hc6z[5]).

A 7A is created when the extra particle is bonded to both common ring particles and both distinct spindle particles; this is not a valid 7Z.
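A direct transcription of these conditions into a checker, assuming a bonded(i, j) predicate over the bond network and the hc6z ordering used above (both names follow the issue text rather than the current code):

    def classify_7z(extra, hc6z, bonded):
        """Classify the extra particle against a 6Z, per the conditions above."""
        # NB: the 7A exclusion described above would need to be tested before
        # accepting either type.
        spindle = bonded(extra, hc6z[4]) or bonded(extra, hc6z[5])
        if bonded(extra, hc6z[0]) and bonded(extra, hc6z[2]) and spindle:
            return "symmetric"
        if ((bonded(extra, hc6z[0]) or bonded(extra, hc6z[3]))
                and (bonded(extra, hc6z[1]) or bonded(extra, hc6z[2]))
                and spindle):
            return "asymmetric"
        return None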

Constant density box only valid when N is constant

Specifying a density via the ini file requires a constant number of particles from frame to frame, otherwise the density will change. A check for density mode = 0 should be added to the XYZ parse for the case where the number of particles varies.
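A minimal sketch of that check, assuming density mode 0 is the mode set when a density is given in the ini file:

    def check_constant_n(frame_particle_counts, density_mode):
        # Hypothetical flag: mode 0 = density specified in the ini file
        if density_mode == 0 and len(set(frame_particle_counts)) > 1:
            raise ValueError("density given in the ini file but the number of "
                             "particles varies between frames")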

12E performance improvement

Since the new 5A must have its two spindle particles in common with the uncommon spindles of the 6As in the parent, the search can loop over mmemsp4c[5a_spindle_1] and mmemsp4c[5a_spindle_2] instead of over all 5As.

Write a proper XYZ parser

  • Should be able to read XYZ files with varying numbers of particles in each frame
  • Should be able to parse an XYZ file to check that it is valid
  • Should automatically read in the number of particles and their types
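A minimal parser along these lines, which reads N from each frame header and fails fast on a malformed frame:

    def parse_xyz(path):
        """Read an XYZ file frame by frame, tolerating a varying particle count."""
        frames = []
        with open(path) as f:
            while True:
                header = f.readline()
                if not header:
                    break               # end of file
                n = int(header)         # raises ValueError on a bad count line
                f.readline()            # comment line: content is arbitrary
                species, coords = [], []
                for _ in range(n):
                    parts = f.readline().split()
                    if len(parts) < 4:
                        raise ValueError("malformed or truncated frame")
                    species.append(parts[0])
                    coords.append(tuple(float(x) for x in parts[1:4]))
                frames.append((species, coords))
        return frames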

mmem arrays only use spindles

Although all of a cluster is stored in the mmem array, it seems only the spindles are queried for the 6A and 7A clusters. Removing non-spindle particles from the mmem array would improve performance by reducing the looping on mmem access; however, before doing this it is essential to check that no other function uses the non-spindle particles as part of cluster creation.

Improve integration tests

At the moment the integration tests just directly compare the output files with sample output files. This is very vulnerable to any change in the output file format, the addition of new clusters, and floating-point differences between platforms.

An interpreter should be created to read in each output file type and parse the results comparing to the known values.
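A sketch of the tolerant comparison, assuming each output file can be parsed into a dict of cluster name to value; floats are compared to a tolerance rather than byte for byte:

    import math

    def compare_results(result, reference, rel_tol=1e-6):
        for name, expected in reference.items():
            assert name in result, f"missing cluster {name}"
            assert math.isclose(result[name], expected, rel_tol=rel_tol), \
                f"{name}: got {result[name]}, expected {expected}"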

Check validity of coordinates on XYZ parse, not frame read

At the moment the XYZ parser does not check the validity of the coordinates as it builds the list of frame offsets. If a long dataset has an invalid frame near the end, the TCC could run for a long time before failing when it reads the bad frame. Checking the coordinates as the frame offsets are determined would catch this right at the start of the analysis.

This is less important now that output files are written on a per frame basis since most data will still be output on an unexpected exit, just not averages.

It would also be good to have some logic that skips checking frames that will not be read: the sample_frequency parameter means not every frame is necessarily analysed.

13B performance improvement

The speed of 13B detection can be improved by looping over mmemsp5c[7A_i_spindle_1] and mmemsp5c[7A_i_spindle_2] instead of over all sp5c when selecting 7A_j.

Decide on license for TCC

Ideally the TCC should have a software license before it is released to the public properly. This is not a high-priority issue as the repository is private, but it is something that should be decided at some point. There are many options available which we should discuss, perhaps next time we meet with @chryswoods?

@ursacavebear @merrygoat @FTurci

Add a minimum distance cutoff

This would be useful for measuring systems such as ideal gases where we do not want to consider overlapping particles.

Distinguish between isomers of 6A

5A, 6A and 7A have been replaced by sp3c, sp4c and sp5c. While it is good to include these base structures as well, the 6A and 7A in previous versions contained useful point-group number conversions to avoid counting them multiple times due to their symmetries. I suggest they be put back in, in addition to the base structures.

Should be able to analyse a subset of clusters

For some analyses only a subset of clusters need to be found. This would speed up analysis.

This would require some sort of interface to select the clusters which are desired and some internal logic to determine which prerequisite clusters need to be calculated to find the selected cluster.
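A sketch of the prerequisite logic with a hypothetical dependency map; the real map would mirror how each detection routine builds on smaller clusters (the entries below are illustrative only):

    DEPENDS_ON = {
        "11F": ["6A"],   # e.g. 11F detection loops over bonded 6As
        "13B": ["7A"],
        "6A": [],
        "7A": [],
    }

    def clusters_to_compute(selected):
        """Return the selected clusters plus all their prerequisites."""
        needed = set()
        def visit(name):
            if name not in needed:
                needed.add(name)
                for dep in DEPENDS_ON.get(name, []):
                    visit(dep)
        for name in selected:
            visit(name)
        return needed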
