
wtsi-hgi / hgi-vault


Data retention policy tools

License: GNU General Public License v3.0

Languages: Python 98.87%, Jinja 0.74%, Shell 0.39%
Topics: data-retention, policy

hgi-vault's Introduction

HGI Vault


Data retention policy tools.

Installation

You will need to create an exclusive Python 3.8 (or later) virtual environment. For example:

python -m venv .venv
source .venv/bin/activate

Then, to install HGI Vault:

pip install git+https://github.com/wtsi-hgi/hgi-vault.git

It is not recommended to install HGI Vault globally or in a shared virtual environment due to the risk of namespace collision.

Usage

See the documentation directory for full instructions.

hgi-vault's People

Contributors

aidenneale, gn5, michael-grace, pavlos-pa10, piyushahuja, sb10, xophmeister


hgi-vault's Issues

Minimal number of owners required for a vault

It might be useful to specify, as part of the data retention policy, the minimum number of owners (i.e., those who will get actionable e-mails) that a group must have before a vault can be created.

Avoid namespace collision for installation

The structure of the project with the current setup.cfg installs a bunch of packages named after our top-level directories (core, api, bin, etc.). They should all be grouped together under a single installed package (vault, say).

The "obvious" way to achieve this would be to add that new top-level directory and then change every import statement appropriately. I suspect, however, that there's probably some trick to this -- to avoid such a disruptive restructuring -- to which the (incredibly vague) setuptools documentation does not avail itself...

Standard base64 alphabet includes /

The standard base64 alphabet includes a / character, which happens to be the POSIX path separator. The Vault path encoding must therefore use an alternative alphabet to avoid the potential clash.

n.b., The "URL safe" alphabet, which replaces + with - and / with _, is also problematic because - is used as a delimiter
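To make the clash concrete, a small illustration (not Vault's actual encoder) of both problematic alphabets and one possible custom translation:

import base64

print(base64.b64encode(b"\xff\xff\xff"))          # b'////' -- '/' is a standard-alphabet symbol
print(base64.urlsafe_b64encode(b"\xff\xff\xff"))  # b'____' -- fine here, but '+' maps to '-'

# One possible workaround (illustrative only): translate to an alphabet that
# avoids both '/' (the path separator) and '-' (the Vault delimiter).
key = base64.b64encode(b"\xff\xff\xff").translate(bytes.maketrans(b"+/", b"._"))
print(key)                                        # b'____'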

Improve the UX for the e-mail notification attachments

Currently, when notification e-mails are sent, they get several attachments of the form X.fofn.gz; where X is, say, deleted, staged, delete-24, etc. They are gzip'd text files, with one filename per line (modulo files that contain newlines in their names). There have been several suggestions to improve the UX of this:

From @vviyer:

Name the files X.txt.gz, rather than .fofn.gz -- perhaps even avoid gzip'ing -- so they can be opened more easily from the mail client.

From @gn5:

Provide Excel output, rather than gzip'd text files. Either that, or csv/tsv files so they can be opened easily in Excel.


The gzip'ing is done because -- while they haven't been during testing -- these files could potentially be very large. I don't think we can realistically get away from that. I would shy away from native-Excel output as it would require non-trivial dependencies and increased complexity. Renaming them to X.tsv.gz would be the simplest solution:

  • While filenames containing newlines will be incorrect, a FOFN is just a single-column TSV file.
  • Once ungzip'd, a TSV file should natively open in either a text editor or spreadsheet, depending on how one's desktop is set up.
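For reference, a minimal sketch of writing such an attachment as a gzip'd, single-column TSV (the paths and output filename are made up):

import gzip

paths = ["project/data/file-01", "project/data/file-02"]

with gzip.open("deleted.tsv.gz", "wt", encoding="utf-8") as fofn:
    for path in paths:
        fofn.write(f"{path}\n")   # one path per line: a single-column TSV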

Judge file age using the greater of atime, mtime & ctime

File age (and thus eligibility for deletion) is based on comparing the current time to the mtime of the file.

This was done because:

We use modification time, rather than change time, as it's a better indicator of usage. (Unfortunately, access time is not reliably available to us on all filesystems.)

However, there are scenarios where we can have unexpected, undesirable deletion of files without warning, due to the mtime of "new" files being too old:

  • unpacking archives or downloading from the web can leave mtimes unchanged
  • starting to use vault on an old project directory, where old files are constantly read, but not changed

Using the greater of atime, mtime & ctime will be no worse than the current implementation, but likely better in these scenarios.
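A sketch of the proposed age calculation, measuring against the most recent of the three timestamps:

import os
import time

def file_age(path: str) -> float:
    """Age in seconds, judged against the greater of atime, mtime and ctime."""
    st = os.stat(path)
    return time.time() - max(st.st_atime, st.st_mtime, st.st_ctime)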

Finer-grained identity management

Currently, users and groups are fetched from a specified DN tree and their attributes are mapped to their appropriate internal representations. This captures all entities in that subtree. It would be useful if LDAP entities could be restricted using a filter (e.g., users could be limited to non-service accounts, groups must have specific properties, etc.) specified in the configuration.

Configuration parsing should be more strict

Noticed by @piyushahuja: Currently the configuration parsing is type checked, but in some cases, value-constraints would also be appropriate (and necessary, to prevent catastrophic misconfiguration). For example:

  • Time durations should always be non-negative
  • Network ports must be in the range of 0-65535
  • etc.
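A minimal sketch of the kind of value checks meant here (the option names are illustrative, not the actual configuration schema):

def validate(config: dict) -> None:
    """Raise on catastrophic misconfiguration, beyond simple type checks."""
    for name, duration in config.get("durations", {}).items():
        if duration < 0:
            raise ValueError(f"{name}: time durations must be non-negative")

    port = config.get("port", 0)
    if not 0 <= port <= 65535:
        raise ValueError(f"port {port} is outside the valid range 0-65535")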

Improving Sandman Notifications (no deletion until user warned)

Sandman Notifications

This is a summary of the logic behind Sandman's notifications.

Files

For every file:

(note: there is other logic here, e.g. for detecting staged files, corruptions etc., but I'm only summarising what we care about)

  1. Is it in Limbo?

If it is, we check if it has passed the limbo threshold. If it has, we delete it. There are no notifications here. And we're done.

  2. Has it passed the deletion threshold?

If it has:

  • move the file to Limbo
  • delete the original
  • PERSIST: Status: Deleted, Notified: False

If it hasn't:

  • we look to see which warning thresholds the file has passed

for every passed warning threshold (based on the file age, doesn't touch DB):

  • persist: Status: Warned, Notified: False, Time: Warning Time
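Putting the per-file steps above together, a minimal, self-contained sketch (every name here is an illustrative stand-in, not Vault's actual API):

def process_file(file, config, persist, hard_delete, soft_delete):
    """Sketch of the per-file logic: limbo expiry, soft-deletion, warnings."""
    if file.in_limbo:
        if file.age >= config.limbo_threshold:
            hard_delete(file)                  # no notification for limbo expiry
        return

    if file.age >= config.deletion_threshold:
        soft_delete(file)                      # move to Limbo, delete the original
        persist(file, status="Deleted", notified=False)
        return

    # Otherwise, record every warning threshold the file has already passed
    for tminus in config.warnings:             # e.g. 72 and 24 hours before deletion
        if file.age >= config.deletion_threshold - tminus:
            persist(file, status="Warned", notified=False, tminus=tminus)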

Emails

For every stakeholder in the system:

  • Get anything with Deleted status, NOT notified to that stakeholder

  • For each warning threshold:

    • Get anything with Warning status of that time NOT notified to that stakeholder

Send the emails.

Mark everything as notified.
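A sketch of that notification pass. The persistence interface shown here (stakeholders, files, mark_notified) is an assumption, loosely modelled on the snippets further down; only config.deletion.warnings is taken from the real code:

def notify_stakeholders(persistence, config, send_email):
    """For each stakeholder: gather un-notified states, email, mark notified."""
    for stakeholder in persistence.stakeholders():
        deleted = persistence.files(state="Deleted", notified=False,
                                    stakeholder=stakeholder)
        warned = {tminus: persistence.files(state="Warned", tminus=tminus,
                                            notified=False, stakeholder=stakeholder)
                  for tminus in config.deletion.warnings}

        send_email(stakeholder, deleted, warned)
        persistence.mark_notified(stakeholder)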

Persisting

  • We persist the group
  • We persist the file
  • We persist the state.
    • This means adding the state to the status table
    • If it is a warning status, we add the time linked to that status to the warnings table

To mark something as notified, we add the status and the stakeholder to the notifications table.

Summary

It adds the required notifications to the DB marked as not notified. Then, it picks up all the non-notified stuff for that stakeholder, emails them, and then marks it notified.

Notes

  • The status is per file, so a Deletion status will replace a Warning status.
  • The times are per warning status, so a more recent one will replace an older one.
  • If the first time sandman sees a file is past its deletion threshold, it'll delete the file without ANY warning. Just an email to say it got deleted.
  • Sandman will send a notification based on when it should have been sent. i.e., if the "10 Days" notification was scheduled for Monday, and it didn't run until Thursday, it wouldn't say "6 Days", it would say "10 Days". But the deletion is still in 6 days.

Design

From the design doc:

These lists will need to be persisted to disk, rather than kept in memory, such that:

  • Importantly, the list of files that have been staged for archival will be used in the draining phase (the way in which they're stored must therefore take this into account).
  • When files have exceeded a warning checkpoint, they will only be included in that e-mail once (e.g., if the sweep is run daily and there are checkpoints at 72 and 24 hours, this will ensure the 72 hour warning won't also be sent during the 48 and 24 hour sweeps).
  • In the event of failure, actions that were performed before the failure can still be reported on retrospectively.

Note that the intended deletion time will also need to be stored, in case the mtime of a file changes. That is, if this happens, then it's possible that a previous warning will become re-eligible. For example:

  • A file's age exceeds the 72 hour warning checkpoint and an e-mail is sent.
  • The user updates the file, which changes its mtime, no longer making it eligible for automatic deletion.
  • Eventually, the 72 hour warning relative to the new mtime is again exceeded; another e-mail is expected, despite a 72 hour warning being previously sent.

Note that when files are completely unlinked, their inodes are recycled by the operating system. As such, they should not be used as unique keys in any persistence model, unless that recycling is taken into account.

There have been discussions about whether state needs to be stored. Simply, yes, because:

  1. we need to know whether a notification has already been sent for a time period
  2. any failure between an action happening and the email being sent (e.g. a failure deleting another file) must not affect the contents of the email, assuming there isn't a failure in persisting the state of a file when the file is originally actioned. In short, the email is written as it goes, so if something happens, the work already done is saved instead of being lost.

As this is relational data (files -> different statuses -> statuses notified to various stakeholders), a database is a good approach, even if the groups and group owners didn't need to be stored in it.

Timeline

    W1       W1             W1 W2       W2          W2 DEL          DEL
    V        V              V  V        V            V  V            V
    |__________________________|________________________|________________>
 warning                     warning                  deletion
threshold                   threshold                threshold
    1                           2

This shows a timeline with two warning thresholds for a file and the deletion point. Above the line is what will be sent if Sandman first sees the file at that point in the timeline. It'll send the most recent warning notification that it hasn't sent (tracked in the DB as whether the stakeholder has been notified). If Sandman first sees a file after its deletion point, it'll just delete it without warning. (Note: it's just adding it to Limbo; it still gets its full time in Limbo.)

This is Bad

We need to ensure at least one notification is sent, so when we come to check whether the file can be deleted, we should query the database to see if there is a Warning state that has been notified to at least one stakeholder. If not, Sandman will not move the file to Limbo, and will instead add a Warning status with the time being the Sandman running interval (i.e. one day). This means the user will get a one-day warning without us having to modify the times of the file. The next time Sandman runs, it'll see that the file is past its threshold and that a notification has been sent, so it'll happily delete the file.

Code Snippets

bin/sandman/sweep.py:323

This can have an extra check added, to see whether there is a warning marked as notified: True in the DB.

log.info(
    f"Deleting: {file.path} has passed the soft-deletion threshold")
if self.Yes_I_Really_Mean_It_This_Time:
    # 0. Instantiate the persisted file model before it's
    #    deleted so we don't lose its stat information
    to_persist = file.to_persistence()

    # 1. Move file to Limbo and delete source
    limboed = vault.add(Branch.Limbo, file.path)
    touch(limboed.path)
    assert hardlinks(file.path) > 1
    try:
        file.delete()  # DELETION WARNING
        log.info(f"Soft-deleted {file.path}")

    except PermissionError:
        log.error(
            f"Could not soft-delete {file.path}: Permission denied")
        return

    log.info(f"{file.path} has been soft-deleted")

    # 2. Persist to database
    self._persistence.persist(
        to_persist, State.Deleted(notified=False))
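A sketch of where that extra check could slot in, before step 1 in the snippet above. has_notified_warning is a hypothetical helper that would query the status/notifications tables for a Warned state with notified: True, and sandman_run_interval is the extra configuration option suggested further down:

if not has_notified_warning(self._persistence, file):
    # Nobody has been warned about this file yet: don't soft-delete it;
    # instead persist a short warning whose tminus is the Sandman run
    # interval (e.g. one day), so the next run can delete it safely.
    self._persistence.persist(
        file.to_persistence(),
        State.Warned(notified=False, tminus=config.sandman_run_interval))
    return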

bin/sandman/sweep.py:359

This is how we add warning notifications to be sent.

self._persistence.persist(
    to_persist, State.Warned(notified=False, tminus=tminus))

bin/sandman/sweep.py:131

Although we can filter these notifications, currently it is only done by the configured notification periods, so we can't trivially add the Sandman run interval as an extra period.

# Warned files that require notification
for tminus in config.deletion.warnings:
    to_warn = _files(State.Warned, tminus=tminus)

That being said, we could (assuming we don't want this interval to be a typical warning) add it as an extra configuration option, and have something like

for tminus in (*config.deletion.warnings, config.sandman_run_interval):
    ...

If the stakeholder doesn't have any notifications for the run_interval time period, it'll just be skipped, exactly the same as when the user doesn't have warnings for one of the normal warning time periods.


bin/sandman/sweep.py:108

This function is where the query is done; note that the query is made with notified: False.

def _files(
        state: T.Type[core.persistence.base.State], **kwargs) -> FileCollection.User:
    """
    Filtered file factory for the current stakeholder in
    this context management stack with the given state
    """
    state_args = {"notified": False, **kwargs}
    criteria = Filter(state=state(
        **state_args), stakeholder=stakeholder)
    return stack.enter_context(
        self._persistence.files(criteria))

Vault key name can exceed NAME_MAX

The vault keys are the base64 encoding of the annotated file's path, relative to the vault root. This can exceed 255 bytes, the default Linux NAME_MAX. (base64 carries 6 bits of data in every 8-bit character, so a 255-character key can represent a path of at most 191 bytes.)

(The default PATH_MAX of 4096 bytes is unlikely to be exceeded, but it should be noted.)

A new representation is needed that doesn't suffer from this problem.
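A quick illustration of the limit (the 192-character relative path is arbitrary):

import base64
import math

rel_path = b"x" * 192                  # a 192-byte path relative to the vault root
key = base64.b64encode(rel_path)

print(len(key))                        # 256 -- already over the 255-byte NAME_MAX
assert len(key) == 4 * math.ceil(len(rel_path) / 3)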

File recovery UX enhancements

The initial --view option to the file recovery CLI is very basic: it just shows all the files that are recoverable from the current vault. The only convenience it affords is to normalise the files relative to the current working directory (this lets you do, say, vault recover my-file instead of vault recover path/that/i/cannot/remember/to/my-file).

The following enhancements to this view would be useful:

  • An option that only shows the recoverable files from the current working directory (rather than for the entire vault);

  • Have the output tab-delimited with the path in the first column (as currently), and the time until permanent deletion in the second column (in, say, hours to 1dp):

    my-file        2.1 hr
    another-file   1.0 hr
    just-deleted   14.8 hr
    

    This output can then be sorted, if needed, by piping it through sort (e.g. sort -t$'\t' -k2g,2).

    (Note: This isn't the time until actual permanent deletion, but the "safe time" until permanent deletion can happen.)

Spontaneous file deletion

If a file encountered during Sandman's sweep pass has an age that exceeds the deletion threshold, but didn't exist in a previous sweep, then it will be deleted without notification. This could happen in a number of ways:

  • The file was copied/moved from elsewhere, with its timestamps preserved.
  • The file's timestamp was modified through some other process.
  • In the first run of Sandman over historical data (e.g., setting up a vault in an old directory).

(n.b., This edge case was identified during the design. It doesn't yet have a satisfactory solution and will hinder adoption against historical data, as significant preprocessing will be required to avoid unintended deletion.)

Only the file owner or root can set mtime

For soft-deletion, a file's mtime is updated to the current time to give it grace before being hard-deleted.

A file's mtime cannot be changed with the utime system call unless the caller is the file owner or root. As a workaround, it can be updated to the current time by anyone with write permission to the file (e.g., append a byte then truncate).
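A sketch of that append-then-truncate workaround (illustrative only, not necessarily how Vault implements it):

import os

def bump_mtime(path: str) -> None:
    """Set mtime to "now" without utime(), requiring only write permission."""
    size = os.path.getsize(path)
    with open(path, "ab") as f:
        f.write(b"\0")      # appending a byte updates mtime...
        f.truncate(size)    # ...and truncating restores the original content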

Non-existent kept files cause view crashes

Using keep --view on a directory where files have been kept can fail:

vault keep --view mine
2022-04-05T14:07:35Z+0100	CRITICAL	[Errno 2] No such file or directory: '/.../.git/objects/23/9de85972c940378e6d122e8be7731378fef4ad'
  File "/.../bin/vault", line 8, in <module>
    sys.exit(main())
  File "/.../vault/.venv/lib/python3.8/site-packages/bin/vault/__init__.py", line 209, in main
    view(Branch.Keep, _view_contexts[context], args.absolute)
  File "/.../vault/.venv/lib/python3.8/site-packages/bin/vault/__init__.py", line 72, in view
    elif view_mode == ViewContext.Mine and path.stat().st_uid != os.getuid():
  File ".../conda_envs/R.4/lib/python3.8/pathlib.py", line 1197, in stat
    return self._accessor.stat(self)

Likely due to the kept file having now been deleted.

Soft-deletion was attempted despite being in dry-run mode

During initial testing of the soft-deletion functionality, in dry-run mode, an attempt was made to soft-delete a file that was eligible. The process failed because of #11, but no attempt should have been made in dry-run mode in the first place.

Sea Trial Postmortem

Forensic analysis of the logs from the first trial run on valueless data leads to the following fixes and recommendations:

  1. There is a bug when retrieving the list of files in the staging queue; it returns duplicate entries, whereas the list should be distinct. This looks to be an error in the query when ignoring the stakeholder (i.e., a file can have multiple stakeholders). This should be an easy schema fix.

  2. The duplicate entries in the staging queue were causing a race condition between the downstream handler removing files and its pipe buffer filling. This is peculiar to the dummy .tar.gz archive handler, but it would be fixed by resolving the above.

  3. If a file is persisted without a vault path ("key") then if/when it gets a key, its record will never be updated. A file needs a key by the time it reaches the staging queue. Thus the persistence method needs to check for this and update the record if necessary, rather than ignoring it because it thinks it is complete. This is a relatively easy code fix.

  4. The bug discovered in 1. wouldn't be fixed by this, but it would nonetheless be a good idea to be more defensive about the list of files that are drained to the downstream handler: file existence should be checked first. This will avoid a similar problem should someone manually hack around in the vault. This is a relatively easy code enhancement.

  5. Some files (4 out of 27) are missing their persistence log. Two of these files were @piyushahuja's testing files; the other two files are completely unaccounted for. The cause of this is unknown and I can't see how it could happen. My best conjecture would be that their persistence logs were created when Sandman was manually operated (2020-11-12 15:26) and the logs not saved to disk.


Raw analysis

2020-11-14  12:00  Same failure mode as previously
                   This continues from hereon in

2020-11-14  00:00  DRAIN (!!!):
                     File (1) and (2) from yesterday, as expected
                     Then "None" followed by type conversion failure

                     Suspected failing file inode: 342274076523393086
                       Determined to be: pa11/file-09
                         2020-11-13  ~16:11
                           Added to archive branch by pa11
                         In this run:
                           Found in archival branch; moved to staged
                           Staged status logged; persistence not logged (!!!)
                           Retrieved multiple times (!!!)

                     BUG (!!!):
                       pa11/file-09 has a NULL key (vault path) in the
                       database. A file should never have a NULL key by
                       the time it gets to the drain phase. When setting
                       the new status, if the file already exists in the
                       database, the record won't be updated (because
                       keys aren't used to determine equality)

                   ACTION (!!!):
                     In the logs, this isn't the only file that hasn't
                     had its persistence logged. It affects four files,
                     out of 27, with inodes:
                     A. 342274076523392813
                     B. 342274076523393070
                     C. 342274076523393080
                     D. 342274076523393086

                     File A:
                       No logs found (!!!)

                     File B:
                       Determined to be pa11/file-01 (see below)

                     File C:
                       No logs found (!!!)

                     File D:
                       (Triggered error; see above)

2020-11-13  12:00  Unexpected handler failure

                   DRAIN:
                     Same as 00:00 (i.e., the failure was retried)

                   HANDLER:
                     File (1) and (2) (from 00:00) now reported to not
                     exist; which is true because they got deleted by
                     the previous handler invocation, despite it failing

2020-11-13  00:00  Unexpected handler failure

                   DRAIN:
                     /lustre/scratch119/realdata/mdt3/projects/cramtastic/.vault/.staged/04/c0/00/75/8b/00/70/30-cGExMS9maWxlLTAy (1)
                     /lustre/scratch119/realdata/mdt3/projects/cramtastic/.vault/.staged/04/c0/00/75/8b/00/70/2e-cGExMS9maWxlLTAx (2)
                     /lustre/scratch119/realdata/mdt3/projects/cramtastic/.vault/.staged/04/c0/00/75/8b/00/70/2e-cGExMS9maWxlLTAx (2)
                     /lustre/scratch119/realdata/mdt3/projects/cramtastic/.vault/.staged/04/c0/00/75/8b/00/70/30-cGExMS9maWxlLTAy (1)
                     /lustre/scratch119/realdata/mdt3/projects/cramtastic/.vault/.staged/04/c0/00/75/8b/00/70/2e-cGExMS9maWxlLTAx (2)
                     /lustre/scratch119/realdata/mdt3/projects/cramtastic/.vault/.staged/04/c0/00/75/8b/00/70/30-cGExMS9maWxlLTAy (1)

                   BUG (!!!):
                     File (1) and (2) are duplicated in the drain list
                     This list MUST contain distinct files
                     Likely suspect: Query returns the same file for
                       multiple stakeholders, rather than grouping

                     File (1): pa11/file-02  inode: 342274076523393072
                       2020-11-12 ~17:18
                         Added to archive branch by pa11
                       In this run:
                         Found in archival branch; moved to staged
                         Persisted once
                         Retrieved multiple times (!!!)

                     File (2): pa11/file-01  inode: 342274076523393070
                       2020-11-12 ~15:26:
                         Added to archive branch by pa11
                         Moved to staged branch by pa11 (manual invocation of Sandman)
                       In this run:
                         No log of persistence??? (!!!)
                         Retrieved multiple times (!!!)

                     FOFNs with duplicate lines:
                       2020-11-11 16:16
                       2020-11-12 00:00
                       2020-11-13 00:00 (this run)
                       2020-11-13 12:00
                       (fine thereafter)

                   HANDLER:
                     FoFN matches drain output; contains duplicates
                     File (1) does not exist, reported twice => Fail
                     How can it not exist?
                     No output regarding File (2)
                     Despite failure, archive file exists with both
                     files correctly archived (and spurious hardlinks
                     for two instances of File (2) from the duplicated
                     drain output).

                     Theory:
                       --remove-files, when the same file is specified
                       multiple times, causes this failure

                     Method:
                       touch foo bar
                       tar czf test.tar.gz --remove-files foo bar bar foo bar foo

                     Result:
                       No error; the file must have genuinely not
                       existed when the tar was attempted. How can this
                       be; Sandman staged it in this very run?

                   ACTION:
                     Check staging code for errors
                     i.e., How did a file for staging go missing between
                     being persisted and staged at 00:00:05 and drained
                     at 00:00:06 (delta 1 second)?

                     Theory:
                       Race condition: pipe buffer of long paths into
                       tar vs. --remove-files

                     Method:
                       declare FILE1="$(pwd)/$(dd if=/dev/urandom | tr -cd 'a-zA-Z0-9' | head -c100)"
                       declare FILE2="$(pwd)/$(dd if=/dev/urandom | tr -cd 'a-zA-Z0-9' | head -c100)"
                       touch "$FILE1" "FILE2"
                       printf "%s\0" "$FILE1" "$FILE2" "$FILE2" "$FILE1" "$FILE2" "$FILE1" \
                       | xargs -0 tar cPzf test.tar.gz --remove-files

                     Result (!!!):
                       Error replicated! This was the problem. Note that
                       this problem would only manifest itself when the
                       same file is fed into the handler more than once;
                       if that bug is fixed, it won't be a problem any
                       more.

                   ACTION (!!!):
                     Check for existence of persisted files when pushing
                     downstream (i.e., be defensive against manual Vault
                     intervention)

2020-11-12  12:00  No error

2020-11-12  00:00  No error
                   FOFN contains duplicates (see above for discovery)

                   Duplicates confirmed in drain:
                     /lustre/scratch119/realdata/mdt3/projects/cramtastic/.vault/.staged/04/c0/00/75/8b/00/6f/27-YWQ3L2ZpbGUtMDM=
                     /lustre/scratch119/realdata/mdt3/projects/cramtastic/.vault/.staged/04/c0/00/75/8b/00/6f/27-YWQ3L2ZpbGUtMDM=
                     /lustre/scratch119/realdata/mdt3/projects/cramtastic/.vault/.staged/04/c0/00/75/8b/00/6f/27-YWQ3L2ZpbGUtMDM=

                     File: ad7/file-03  inode: 342274076523392807
                       2020-11-11  ~17:50
                         Added to archival branch by ad7
                       In this run:
                         Found in archival branch; moved to staged
                         Persisted once
                         Retrieved multiple times (!!!)

2020-11-11  16:18  No error

2020-11-11  16:16  No error
                   FOFN contains duplicates (see above for discovery)

                   Duplicates confirmed in drain:
                     /lustre/scratch119/realdata/mdt3/projects/cramtastic/.vault/.staged/04/c0/00/75/8b/00/6e/b3-Y2gxMi9maWxlLTA0
                     /lustre/scratch119/realdata/mdt3/projects/cramtastic/.vault/.staged/04/c0/00/75/8b/00/6e/b3-Y2gxMi9maWxlLTA0

                     File: ch12/file-04  inode: 342274076523392691
                       2020-11-11  ~16:14
                         Added to archival branch by ch12
                       In this run:
                         Found in archival branch; moved to staged
                         Persisted once
                         Retrieved multiple times (!!!)

Scheduled downtime

Sandman is designed to be run periodically, say as a cron job. There could be instances, however, when this is not appropriate:

  • During routine hardware maintenance/"at risk" periods
  • During bank holidays or other times of office closure

It would be useful to codify this into the configuration, so even if Sandman is triggered by the cron job, it won't do anything potentially destructive while no one is around, or hardware is down, etc.

Note that, as this would cause its run to skip at least one period, when Sandman runs fully again, the deletion thresholds for files may have passed without warnings being sent (i.e., files get deleted without notification). This is a special case of a known and documented (but not resolved) wider problem (see #7).
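A minimal sketch of such a check, assuming downtime windows were added to the configuration as date ranges (the option name and shape are made up):

from datetime import date

# Hypothetical configuration value: periods in which Sandman must not act
DOWNTIME = [(date(2020, 12, 24), date(2020, 12, 28))]

def in_downtime(today: date) -> bool:
    return any(start <= today <= end for start, end in DOWNTIME)

# e.g. at the start of a sweep:
#   if in_downtime(date.today()):
#       log.info("Scheduled downtime; skipping destructive actions")
#       return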

Bug on 'Mine' View Context when source file has been deleted

Example:

vault keep --view mine

lists a few files and then gives

pipelines/Pilot_UKB/qc/ELGH_nfCore/work/06/94b5f115c398030ea983c6dc433e17/minimal_dataset/ELGH_VAL11509205.doublet.h5ad
pipelines/Pilot_UKB/qc/ELGH_nfCore/work/06/94b5f115c398030ea983c6dc433e17/minimal_dataset/ELGH_VAL11509206.donor5.h5ad
2022-04-01T14:51:54Z+0100	CRITICAL	[Errno 2] No such file or directory: '/lustre/scratch123/hgi/mdt1/projects/ukbb_scrna/pipelines/hp3_dev/elgh_yascp/yascp/.git/objects/23/9de85972c940378e6d122e8be7731378fef4ad'
  File "/nfs/users/nfs_p/pa11/vault/.venv/bin/vault", line 8, in <module>
    sys.exit(main())
  File "/nfs/users/nfs_p/pa11/vault/.venv/lib/python3.8/site-packages/bin/vault/__init__.py", line 209, in main
    view(Branch.Keep, _view_contexts[context], args.absolute)
  File "/nfs/users/nfs_p/pa11/vault/.venv/lib/python3.8/site-packages/bin/vault/__init__.py", line 72, in view
    elif view_mode == ViewContext.Mine and path.stat().st_uid != os.getuid():
  File "/lustre/scratch118/humgen/resources/conda_envs/R.4/lib/python3.8/pathlib.py", line 1197, in stat
    return self._accessor.stat(self)

This bug happens because this line

elif view_mode == ViewContext.Mine and path.stat().st_uid != os.getuid():
in the view function assumes that the source file is available to query its uid. But this assumption is problematic: the source file can be deleted after annotation, in which case the query would cause the software to crash.
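One possible guard (a sketch, not necessarily the adopted fix) is to treat a missing source file as not owned by the caller, rather than letting stat() raise:

import os
from pathlib import Path

def owned_by_me(path: Path) -> bool:
    """False for files the caller doesn't own, or whose source has gone."""
    try:
        return path.stat().st_uid == os.getuid()
    except FileNotFoundError:
        return False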

Critical failures should raise an e-mail notification, if possible

When an error occurs that causes the software to crash (e.g., uncaught exceptions), then an e-mail notification should be sent so we don't have to actively monitor the logs. This would be particularly useful for Sandman runs. (At the moment, this functionality is tacked on to the cron job script's wrapper.)

Some amount of configuration is going to need to be read for this to work, which becomes a chicken vs. egg problem if that's the source of the error.

Notification e-mail text is not as clear as it could be

The text for the notification e-mail was designed before the advent of the soft-deletion function. With the soft-deletion function, it reads somewhat awkwardly and is not very clear about the files it has soft-deleted (as opposed to hard-deleted). This can be reworded, potentially even including the hard-deletion threshold, for clarity.

symlink warnings

Current behaviour:

  • if sandman comes across a symlink, it won't act on it. this is what we want so sandman doesn't go wild attacking other project areas
  • if you run the vault command with a symlink, it'll resolve it and annotate the original. this is what we want so users aren't confused about why the actual file just went missing
  • when running the vault command on a symlink, it'll tell you that it has acted on the original filepath

Proposed behaviour:

  • when running the vault command on a symlink, in addition to telling you the original filepath it has acted on, it should display a warning, for example
WARNING   this_symlink is a symlink. Acting on the original file: /lustre/scratchXXX/YYY/original_file

Sandman should be extremely fussy about its arguments

Currently, the command line parameters to Sandman are:

sandman [--dry-run] [--force-drain] [--stats FILE] PATH...

Specifically, at least one PATH must be given, each of which must be covered by a Vault. As a convenience, this list is normalised into unique Vault root directories (i.e., child directories, under a Vault, are raised up; duplicate Vaults are collapsed into one).

While convenient, this can lead to catastrophic operational errors. Instead:

  1. Each PATH must be a Vault root directory, rather than an arbitrary child. (This would have prevented the incident in the first weaponised trial by refusing to proceed.)
  2. Any duplicated PATHs should be considered an input error. (e.g., While not necessarily wrong, this could be a sign of a script gone rogue and warrants double-checking.)

i.e., Safety is much more important than convenience.

It may be worthwhile to default to a dry run and instead have an option that "arms" Sandman. This would help prevent accidental usage at the command line, but Sandman will mostly be invoked by a script/scheduler.
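A sketch of the stricter validation in points 1 and 2 above (the is_vault_root helper is hypothetical):

def validate_paths(paths, is_vault_root):
    """Refuse duplicates and anything that isn't itself a Vault root."""
    if len(set(paths)) != len(paths):
        raise ValueError("duplicate PATHs given; refusing to proceed")
    for path in paths:
        if not is_vault_root(path):
            raise ValueError(f"{path} is not a Vault root directory")
    return paths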

Files owned by root can't be found in LDAP and stop Sandman

Files owned by root have a UID of 0, which Sandman won't find in the LDAP records, so it'll stop. Although stopping immediately if something goes wrong is what we want, this case should be dealt with individually (i.e., by skipping such files).
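A sketch of the proposed handling (lookup_uid stands in for the LDAP lookup):

import os

def file_owner(path: str, lookup_uid):
    """Return the owner's record, or None for root-owned files so they can be skipped."""
    uid = os.stat(path).st_uid
    if uid == 0:
        return None          # root-owned: skip this file rather than aborting the run
    return lookup_uid(uid)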
