scilifelab / taca

This project forked from guillermo-carrasco/taca


Tool for the Automation of Cleanup and Analyses: tools for projects and data management at NGI Stockholm

License: MIT License


taca's Introduction

Tool for the Automation of Cleanup and Analyses


This package contains several tools for project and data management at the National Genomics Infrastructure (NGI) in Stockholm, Sweden.

Run tests in docker

git clone https://github.com/SciLifeLab/TACA.git
cd TACA
docker build -t taca_testing --target testing .
docker run -it taca_testing

Installation

Inside the repo, run pip install .

Development

Run pip install -r requirements-dev.txt to install the packages used for development, and pip install -e . to make the installation editable.
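
That is, from the repo root:

# Install development dependencies, then TACA itself in editable mode
pip install -r requirements-dev.txt
pip install -e .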

Automated linting

This repo is configured for automated linting. Linter parameters are defined in pyproject.toml.

As of now, we use:

  • ruff to perform automated formatting and a variety of lint checks.

    • Run with ruff check . and ruff format .
  • mypy for static type checking and to prevent contradictory type annotations.

    • Run with mypy **/*.py
  • pipreqs to check that the requirement files are up-to-date with the code.

    • This is run with a custom Bash script in GitHub Actions, which only compares the lists of package names:

      # Compare two requirements listings ($1 and $2) by package name only

      # Extract and sort package names
      awk '{print $1}' "$1" | sort -u > "$1".compare
      awk -F'==' '{print $1}' "$2" | sort -u > "$2".compare

      # Compare package lists
      if cmp -s "$1".compare "$2".compare
      then
        echo "Requirements are the same"
        exit 0
      else
        echo "Requirements are different"
        exit 1
      fi
      
  • prettier to format common languages.

    • Run with prettier .
  • editorconfig-checker to enforce .editorconfig rules for all files not covered by the tools above.

    • Run with
      editorconfig-checker $(git ls-files | grep -v '.py\|.md\|.json\|.yml\|.yaml\|.html')
      

The GitHub Actions workflow is configured in .github/workflows/lint-code.yml. It checks all commits in pushes and pull requests, but does not change code or prevent merges.

Pre-commit hooks will prevent local commits that fail the linting checks. They are configured in .pre-commit-config.yaml.

To set up pre-commit checking:

  1. Run pip install pre-commit
  2. Navigate to the repo root
  3. Run pre-commit install

This can be disabled with pre-commit uninstall.
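
For reference, a minimal .pre-commit-config.yaml using the ruff hooks could look like the sketch below (the revision is illustrative, and the repo's actual config may include more hooks):

repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4  # illustrative revision
    hooks:
      - id: ruff
      - id: ruff-format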

VS Code automation

To enable automated linting in VS Code, go to the user settings.json and include the following lines:

"[python]": {
    "editor.defaultFormatter": "charliermarsh.ruff",
}

This will run the same ruff linting as GitHub Actions and pre-commit every time VS Code is used to format code in the repository.

To run formatting on save, include the lines:

"[python]": {
    "editor.formatOnSave": true,
}
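
Combined, the [python] block in the user settings.json would look like:

"[python]": {
    "editor.defaultFormatter": "charliermarsh.ruff",
    "editor.formatOnSave": true
}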

Git blame suppression

When a non-invasive tool is used to tidy up a lot of code, it is useful to suppress the Git blame for that particular commit, so the original author of each line can still be traced.

To do this, add the hash of the commit containing the changes to .git-blame-ignore-revs, headed by an explanatory comment.
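
For example (the commit hash below is hypothetical), and note that a local git blame only honours the file when pointed at it:

# .git-blame-ignore-revs
# Reformatted codebase with ruff
a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0

# One-time setup per clone, so git blame respects the file
git config blame.ignoreRevsFile .git-blame-ignore-revs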

Deliver command

There is also a plugin for the deliver command. To install this in the same development environment:

# Install taca delivery plugin for development
git clone https://github.com/<username>/taca-ngi-pipeline.git
cd taca-ngi-pipeline
python setup.py develop
pip install -r ./requirements-dev.txt

# Add required config files and env vars for the taca delivery plugin
mkdir -p ~/.ngipipeline
echo "foo:bar" >> ~/.ngipipeline/ngi_config.yaml
mkdir ~/.taca && cp tests/data/taca_test_cfg.yaml ~/.taca/taca.yaml
export CHARON_BASE_URL="http://tracking.database.org"
export CHARON_API_TOKEN="charonapitokengoeshere"

# Check that tests pass:
cd tests && nosetests -v -s

For more detailed documentation, please see the documentation page.

taca's People

Contributors

aanil, alneberg, b97pla, chuan-wang, ewels, franbonath, galithil, guillermo-carrasco, hammarn, ingkebil, jfnavarro, kate-v-stepanova, kedhammar, parlundin, pekrau, remiolsen, robinandeer, senthil10, ssjunnebo, sylvinite, vezzi


taca's Issues

archive functionality not looking at the days

Even though it is in the argument list

def archive_to_swestore(days, run=None)

days is not used in this method (it is in cleanup), so basically it will archive all the runs, regardless of what you specify as old.
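
A minimal sketch of the intended behaviour (run discovery and the archiving helper are placeholders, not taken from the codebase):

import os
import time

def _archive_run(run):
    # Placeholder for the real archiving logic
    print(f"Archiving {run}")

def archive_to_swestore(days, run=None):
    # Archive only runs whose last modification is older than `days` days
    cutoff = time.time() - days * 24 * 3600
    runs = [run] if run else os.listdir(".")
    for r in runs:
        if os.path.getmtime(r) < cutoff:
            _archive_run(r)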

Implement delivery routine

Delivery of analysis data, as outlined in the NGI delivery policies document, should be implemented and managed with TACA.

Remove contributors from README

What do you think? It is implicit in the commit history. Actually, it is available in the "Contributors" tab on the repository, so... one less thing to keep up to date.

Docs docs docs

Hmmm, this is just a question: do you think the package's built-in help is enough?

~/repos_and_code/TACA (master) ~> taca --help
Usage: taca [OPTIONS] COMMAND [ARGS]...

  Tool for the Automation of Storage and Analyses

Options:
  --version                   Show the version and exit.
  -c, --config-file FILENAME  Path to TACA configuration file
  --help                      Show this message and exit.

Commands:
  analysis  Analysis methods entry point
  storage   Storage management methods and utilities

etc. Or do you think we should add a page per subcommand in the documentation? Like one page for taca storage, one page for taca analysis, etc.

I don't want to over-document, that's the thing, but I also don't want subcommands or options to become forgotten. On the other hand... if a subcommand becomes forgotten, it is basically because it is not used, so it shouldn't be there anyway...

what do you think? @senthil10 @vezzi @ewels @mariogiov

PM - Check if run exists in Swestore

Now it will crash if the run already exists in Swestore:

ERROR: putUtil: put error for /ssUppnexZone/proj/a2010002/141120_M01548_0038_000000000-AB8D9.tar.bz2, status = -312000 status = -312000 OVERWRITE_WITHOUT_FORCE_FLAG
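
One possible fix, sketched with iRODS icommands (the path is illustrative): check for the object with ils before calling iput, and skip the upload if it is already there.

# Skip the upload if the tarball already exists in Swestore
if ils /ssUppnexZone/proj/a2010002/run.tar.bz2 >/dev/null 2>&1; then
    echo "Run already in Swestore, skipping"
else
    iput -K run.tar.bz2 /ssUppnexZone/proj/a2010002/
fi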

pm is not logging to a file

Even though it is specified in the configuration file:

# This section overrides the default logging parameters in Cement
log.logging:
    file: /home/hiseq.bioinfo/log/pm.log
    rotate: True

-r option not working properly

(master)hiseq.bioinfo@seq-nas-3:/srv/illumina/hiseq_data/nosync$ taca storage archive-to-swestore -r 150113_D00456_0058_AC6KUBANXX.tar.bz2
Traceback (most recent call last):
  File "/home/hiseq.bioinfo/.anaconda/envs/master/bin/taca", line 5, in <module>
    pkg_resources.run_script('taca==1.0', 'taca')
  File "/home/hiseq.bioinfo/.anaconda/envs/master/lib/python2.7/site-packages/setuptools-3.6-py2.7.egg/pkg_resources.py", line 534, in run_script
  File "/home/hiseq.bioinfo/.anaconda/envs/master/lib/python2.7/site-packages/setuptools-3.6-py2.7.egg/pkg_resources.py", line 1434, in run_script
  File "/home/hiseq.bioinfo/.anaconda/envs/master/lib/python2.7/site-packages/taca-1.0-py2.7.egg/EGG-INFO/scripts/taca", line 38, in <module>
    app.run()
  File "/home/hiseq.bioinfo/.anaconda/envs/master/lib/python2.7/site-packages/cement/core/foundation.py", line 694, in run
    self.controller._dispatch()
  File "/home/hiseq.bioinfo/.anaconda/envs/master/lib/python2.7/site-packages/cement/core/controller.py", line 455, in _dispatch
    return func()
  File "/home/hiseq.bioinfo/.anaconda/envs/master/lib/python2.7/site-packages/cement/core/controller.py", line 461, in _dispatch
    return func()
  File "/home/hiseq.bioinfo/.anaconda/envs/master/lib/python2.7/site-packages/taca-1.0-py2.7.egg/taca/controllers/storage.py", line 56, in archive_to_swestore
    self._archive_run(self.pargs.run)
AttributeError: 'StorageController' object has no attribute 'pargs'

Demultiplexing should be machine agnostic

Basically, taca analysis demultiplex -r <HiSeq run> should work just like taca analysis demultiplex -r <MiSeq run> and taca analysis demultiplex -r <XTen run>, without the user having to specify the run type.

Samplesheets for HAS

That might not be true for the latest versions, but to make the samplesheets HAS-compatible, you need a key named "Workflow" under the [Header] section, and possibly a [Settings] section before [Data].
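
For illustration, a HAS-compatible samplesheet skeleton might look like this (all values are placeholders):

[Header]
Workflow,GenerateFASTQ

[Settings]

[Data]
Sample_ID,Sample_Name,index
Sample_1,Sample_1,ATCACG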

Detach iput command

This command takes ages for a HiSeq/XTen run, and it only uses one core, so I think we could detach it and continue tarballing the next run. So at any given point we would have just one run being compressed (using several cores), but several being sent to Swestore at the same time.

If we don't do it like this, the risk of creating a queue of pm processes is high.
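
A rough sketch of the idea (paths and variables illustrative): compress one run at a time in the foreground, but background each upload so the next tarball can start immediately.

# Compress the current run in the foreground
tar -cjf "$RUN".tar.bz2 "$RUN"

# Detach the single-core upload and move on to the next run
nohup iput -K "$RUN".tar.bz2 /ssUppnexZone/proj/a2010002/ >> iput.log 2>&1 &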
