
nationalgenomicsinfrastructure / ngi_pipeline

This project forked from mariogiov/ngi_pipeline


Code driving the production pipeline at SciLifeLab

Python 99.63% Shell 0.27% Dockerfile 0.11%

ngi_pipeline's People

Contributors

aanil, alneberg, b97pla, chuan-wang, ewels, galithil, guillermo-carrasco, hammarn, kate-v-stepanova, mariogiov, matrulda, monikabrandt, parlundin, pekrau, remiolsen, robinandeer, senthil10, sofiahag, ssjunnebo, sylvinite, vezzi


ngi_pipeline's Issues

Celery - Configurable

Remove all the hardcoded configuration from the code and use a celeryconfig.py file properly.
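
A minimal sketch of what that could look like, assuming a RabbitMQ broker on localhost (the broker URL and queue name are placeholders, and the uppercase setting names follow the Celery 3.x style):

# celeryconfig.py -- placeholder values, not production settings
BROKER_URL = 'amqp://guest:guest@localhost:5672//'
CELERY_DEFAULT_QUEUE = 'ngi_pipeline'
CELERY_TASK_SERIALIZER = 'json'
CELERY_ACCEPT_CONTENT = ['json']

The app would then pick this up with app.config_from_object('celeryconfig') instead of hardcoded values.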

Celery server & task definition

Two things are mainly needed:

  • A server script that launches a Celery worker to read a RabbitMQ queue (a sketch follows the proposed layout below)
  • Celery task definitions

I propose to structure the module like this, for generality:

.
├── dummy_config.yaml
├── ngi_pipeline
│   ├── bcbio_sll
│   │   └── __init__.py
│   ├── common
│   │   ├── __init__.py
│   │   ├── parsers.py
│   ├── distributed
│   │   ├── celery.py
│   │   └── __init__.py
│   ├── __init__.py
│   ├── log
│   │   ├── __init__.py
│   ├── piper_sll
│   │   ├── __init__.py
│   │   └── workflows.py
│   └── utils
│       ├── config.py
│       ├── __init__.py
├── README.md
├── requirements.txt
├── scripts
│   └── ngi_server.py
└── setup.py
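
As a sketch of the server script mentioned above, scripts/ngi_server.py could be as small as this, assuming the Celery app object lives in ngi_pipeline/distributed/celery.py as in the layout (argument handling differs slightly between Celery versions):

#!/usr/bin/env python
"""Launch a Celery worker that consumes the ngi_pipeline RabbitMQ queue."""
from ngi_pipeline.distributed.celery import app  # hypothetical: app as proposed above

if __name__ == '__main__':
    # Equivalent to: celery -A ngi_pipeline.distributed.celery worker -Q ngi_pipeline
    app.worker_main(['worker', '--loglevel=INFO', '--queues=ngi_pipeline'])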

common.setup_analysis_directory_structure crashes if no restrict_to_projects passed

TypeError                                 Traceback (most recent call last)
<ipython-input-8-a89a040ed764> in <module>() ----> 1 common.setup_analysis_directory_structure('/pica/v3/a2010002/archive/131030_SN7001362_0103_BC2PUYACXX', '/home/guilc/ngi_pipeline_config.py', [])

/pica/h1/guilc/repos/ngi_pipeline/ngi_pipeline/common/__init__.py in setup_analysis_directory_structure(fc_dir, config_file_path, projects_to_analyze, restrict_to_projects, restrict_to_samples)

      259         # If specific projects are specified, skip those that do not match
      260         project_name = project['project_name']
--> 261         if len(restrict_to_projects) > 0 and project_name not in restrict_to_projects:
     262             LOG.debug("Skipping project {}".format(project_name))
     263             continue

TypeError: object of type 'NoneType' has no len()

Using an empty sequence as the default parameter instead of None would solve the problem (an empty tuple rather than an empty list, to avoid the mutable-default pitfall), as would replacing the len() call with a truthiness check.
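
A sketch of that fix against the snippet above, abbreviated to the relevant lines:

def setup_analysis_directory_structure(fc_dir, config_file_path,
                                       projects_to_analyze,
                                       restrict_to_projects=(),
                                       restrict_to_samples=()):
    for project in projects_to_analyze:
        project_name = project['project_name']
        # A truthiness check also copes with callers that explicitly pass None
        if restrict_to_projects and project_name not in restrict_to_projects:
            LOG.debug("Skipping project {}".format(project_name))
            continue
        # ... rest of the existing body unchanged ...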

Improve documentation

@vezzi, your slides/mockups explaining what an engine, trigger, helper, etc. are would be a great addition to the README.md, along with information about installation and functionality.

Better not to leave this one for the far future!

Uppsala projects can be added to Charon by parsing the filesystem

I was adding some Uppsala projects to Charon today so I could try to process them, and I was doing this just using the information from the filesystem (e.g. the names of directories and so on). It occurs to me that this may be a much easier way of doing things for Uppsala projects than to have them send us files to upload -- also it seems there would be less risk of introducing human error.

Does anyone have a reason why this isn't a good idea? The scripts are basically done, I would just need to fix them up a bit so they're less hacky.
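
As a rough illustration of the idea (the project/sample/libprep/seqrun directory layout is an assumption based on paths elsewhere in these issues, and the function name is hypothetical):

import os

def parse_project_from_filesystem(project_dir):
    """Build a project dict purely from directory names: sample/libprep/seqrun."""
    project = {'name': os.path.basename(project_dir.rstrip('/')), 'samples': {}}
    for sample in sorted(os.listdir(project_dir)):
        sample_path = os.path.join(project_dir, sample)
        if not os.path.isdir(sample_path):
            continue
        project['samples'][sample] = dict(
            (libprep, sorted(os.listdir(os.path.join(sample_path, libprep))))
            for libprep in sorted(os.listdir(sample_path))
            if os.path.isdir(os.path.join(sample_path, libprep)))
    return project

Each project parsed this way could then be uploaded to Charon.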

Improve logging

Logging is done quite badly at the moment: the script uses the minimal_logger, but at the same time the Celery worker spits everything the task generates to standard out, so everything is logged twice.

Improve this so that everything is logged only once, and nicely!
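
One likely culprit is Celery hijacking the root logger. A sketch of a fix, untested against this codebase: connecting any handler to Celery's setup_logging signal stops Celery from installing its own logging configuration, leaving the minimal_logger handlers as the only ones.

from celery.signals import setup_logging

@setup_logging.connect
def configure_logging(**kwargs):
    # The mere presence of a connected handler makes Celery skip its own
    # logging setup, so records pass only through the minimal_logger handlers.
    pass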

Verify Charon "RUNNING" status

At the moment, if we lose a record in the local jobs database, the corresponding Charon object will not be updated. For instance, if we mark a seqrun as RUNNING and the record is lost from the local jobs database (or the database is lost entirely), it will remain RUNNING eternally.
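
A periodic reconciliation pass could catch these. A sketch with placeholder names (charon_session, local_jobs_db, and their methods are all hypothetical):

def reconcile_running_seqruns(charon_session, local_jobs_db):
    """Reset Charon seqruns marked RUNNING that no local job is tracking."""
    for seqrun in charon_session.seqruns_with_status('RUNNING'):  # hypothetical API
        if not local_jobs_db.has_job(seqrun['seqrunid']):         # hypothetical API
            # No local record exists, so the RUNNING status is stale
            charon_session.update_seqrun(seqrun['seqrunid'], status='FAILED')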

sbatch scripts can't update Charon

because Charon is only accessible from milou-b.
This is a problem because we're trying to chain together the alignment and the variant calling, which requires updating Charon with the alignment data. I suppose one solution would be to use the Tornado server as a go-between, so perhaps we should define that as a task.

Define Tornado tasks

Or handlers, so to speak.

It would be nice to hear suggestions here about the "API" structure, parameters, etc.
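
To seed the discussion, here is a minimal sketch of one such handler, proxying seqrun status updates to Charon (the URL scheme, parameter names, and forwarding helper are all placeholders):

import json
import tornado.ioloop
import tornado.web

def update_charon_seqrun(seqrun_id, status):
    """Placeholder: would forward the update to Charon via its REST API."""

class SeqrunStatusHandler(tornado.web.RequestHandler):
    def put(self, seqrun_id):
        # Nodes that cannot reach Charon PUT their status here instead
        payload = json.loads(self.request.body)
        update_charon_seqrun(seqrun_id, payload['alignment_status'])
        self.write({'seqrun_id': seqrun_id, 'updated': True})

application = tornado.web.Application([
    (r'/api/v1/seqrun/([^/]+)/status', SeqrunStatusHandler),
])

if __name__ == '__main__':
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()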

project.base_path is analysis_ready

While working on symlink_convert_file_names I realised that the base_path field of project is the folder containing all the analyses.

This is not optimal, as it creates some confusion between the project name and the directory where the analysis is run.

Output folders starting with number

The pipeline_output folder contains the results of the pipeline.
I recently started to prepend a number to the folder names so that they are automatically sorted in chronological order. It is a really small detail, but it often gives, in a few seconds, an idea of the recipe that has been followed.

Something like this:

00_logs
01_preliminary_alignment_qc
02_raw_alignments
03_merged_aligments
04_processed_alignments
05_final_alignment_qc
06_variant_calls
misc

Running on 8 cores instead of 1 node (memory issue)

In ngi_pipeline/nestor_ngi_config.yaml, change threads to 8 when running on only 8 cores; otherwise you will run out of memory.

piper:
   # Also can be set as an environmental variable $PIPER_QSCRIPTS_DIR
   path_to_piper_qscripts: /proj/a2014205/software/piper/qscripts
   load_modules:
       - java/sun_jdk1.7.0_25
       - R/2.15.0
   sample:
       required_autosomal_coverage: 28.4
   threads: 16  # change to 8 when running on only 8 cores

Pick up genotyping data from INBOX

To make it as simple as possible for the genotyping team, the idea is that they deliver data to the IGN project just as if it were any other user project. This means dropping the vcf and idat files into the IGN project INBOX. I'm just dropping this here to make sure the information is recorded somewhere: a mechanism to pick up this data needs to be present somewhere in the code.

Fake issue

What happens if you close an issue that has the same issue number as one on your fork, but they are in fact different issues? @pekrau

Move workflow selection to ngi_pipeline.conductor submodule

In the future, a project may have multiple workflows which are handled by different analysis engines; thus the code will need to make its decisions on a workflow-by-workflow basis earlier in the code flow (in ngi_pipeline.conductor).

Daemonize engine jobs

I think it's worthwhile to daemonize the engine jobs, by which I mean decouple them from the main ngi_pipeline thread so that it doesn't have to stay running the entire time the analyses are ongoing.

Cons:

  • This will make analysis jobs slightly more complex to find and kill, but as this entire process will generally be non-interactive to begin with, I don't think this is a major problem.
  • I don't know how to do this yet so I'll have to learn.

Pros:

  • Analyses are more robust as they no longer depend on the parent Python thread remaining alive.
  • Individual analyses can be killed separately via their individual Python threads, which means the code can still wrap up logging, job status updates, etc.
  • After I learn to do this, then I will know how to do it.

I think the pros have it but I welcome other ideas.
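
For the record, a sketch of one standard-library approach, assuming engine jobs are launched as external commands (the example command is a placeholder): starting the child in its own session detaches it from the parent process.

import os
import subprocess

def launch_detached(command, logfile_path):
    """Launch an engine job that survives the parent ngi_pipeline process exiting."""
    with open(logfile_path, 'a') as logfile:
        process = subprocess.Popen(command,
                                   stdout=logfile,
                                   stderr=subprocess.STDOUT,
                                   preexec_fn=os.setsid)  # new session, no controlling tty
    # Record the pid in the local jobs database so the job can be found and killed later
    return process.pid

# e.g.: launch_detached(['piper', '--some-arg'], '/path/to/job.log')  # placeholder command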

Decouple from scilifelab package

For example:

from scilifelab.utils.config import load_yaml_config_expand_vars

IMHO, it would be better to decouple this pipeline from the scilifelab package, which is huge and needs a big refactor.
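
The function itself is small enough to vendor. A sketch of a local equivalent for ngi_pipeline.utils.config (the original's behaviour is assumed from its name):

import os
import yaml

def load_yaml_config_expand_vars(config_file_path):
    """Load a YAML config file, expanding $ENV_VAR references in the raw text."""
    with open(config_file_path) as config_file:
        return yaml.safe_load(os.path.expandvars(config_file.read()))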

no data in scratch

When running:

python start_pipeline_from_project.py -f /proj/a2014205/nobackup/denis/topdir/DATA/J.Taipale_14_01

I get this:

Traceback (most recent call last):
  File "start_pipeline_from_project.py", line 48, in <module>
    exec_mode=args_dict["exec_mode"])
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/conductor/launchers.py", line 60, in launch_analysis_for_samples
    config=config, config_file_path=config_file_path)
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/utils/classes.py", line 27, in __call__
    return self.f(**kwargs)
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/conductor/launchers.py", line 192, in launch_analysis
    exec_mode=exec_mode)
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/utils/classes.py", line 27, in __call__
    return self.f(**kwargs)
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/engines/piper_ngi/__init__.py", line 170, in analyze_sample
    local_scratch_mode=(exec_mode == "sbatch"))
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/engines/piper_ngi/__init__.py", line 634, in build_setup_xml
    for fastq_file_name in os.listdir(sample_run_directory):
OSError: [Errno 2] No such file or directory: '$SNIC_TMP/DATA/J.Taipale_14_01/P1371_101/A/141030_D00415_0061_AC5DL5ANXX' 

If I do:

>echo $SNIC_TMP
/scratch

I'm not sure: is this the part where the data is copied to the node?

Re-structure tests

Create a general tests directory and move all the tests there.

We could also study the possibility of mocking the DB to do CI in Travis, but let's put that last one on the wish list.
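
As a starting point, something like this could let the tests run in Travis without a live database (the patched module path and method are hypothetical):

import unittest
import mock  # the mock backport on Python 2.7; unittest.mock on Python 3

class TestSetupAnalysis(unittest.TestCase):
    @mock.patch('ngi_pipeline.database.classes.CharonSession')  # hypothetical path
    def test_uses_mocked_charon(self, mock_session_cls):
        # Any code that instantiates CharonSession now gets the mock instead
        mock_session_cls.return_value.project_get.return_value = {'status': 'OPEN'}
        # ... exercise the code under test here ...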

Move code out of __init__.py scripts

I've got whole gobs of code in the __init__.py files of the various modules/submodules, which as I understand it is not The Right Way to do it. Separate things into different, named files.
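
For example (names hypothetical), the move can keep old imports working with a thin re-export:

# ngi_pipeline/common/filesystem.py -- new, named home for the moved code
def setup_analysis_directory_structure(fc_dir, config_file_path, projects_to_analyze,
                                       restrict_to_projects=None, restrict_to_samples=None):
    """(body moved unchanged from common/__init__.py)"""

# ngi_pipeline/common/__init__.py -- now only re-exports, for backwards compatibility
from ngi_pipeline.common.filesystem import setup_analysis_directory_structure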
