
nationalgenomicsinfrastructure / ngi_pipeline

This project forked from mariogiov/ngi_pipeline


Code driving the production pipeline at SciLifeLab

Python 99.63% Shell 0.27% Dockerfile 0.11%

ngi_pipeline's People

Contributors

aanil, alneberg, b97pla, chuan-wang, ewels, galithil, guillermo-carrasco, hammarn, kate-v-stepanova, mariogiov, matrulda, monikabrandt, parlundin, pekrau, remiolsen, robinandeer, senthil10, sofiahag, ssjunnebo, sylvinite, vezzi


ngi_pipeline's Issues

Celery - Configurable

Remove all the hardcoded configuration from the code and use a celeryconfig.py file properly.
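
A minimal sketch of what that could look like, assuming a RabbitMQ broker on localhost (the broker URL and queue name are placeholders, and the uppercase setting names follow the Celery 3.x style):

# celeryconfig.py -- placeholder values, not production settings
BROKER_URL = 'amqp://guest:guest@localhost:5672//'
CELERY_DEFAULT_QUEUE = 'ngi_pipeline'
CELERY_TASK_SERIALIZER = 'json'
CELERY_ACCEPT_CONTENT = ['json']

The app would then pick this up with app.config_from_object('celeryconfig') instead of hardcoded values.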

Celery server & task definition

Two things are mainly needed:

  • A server script that launches a Celery worker to read a RabbitMQ queue (a sketch follows the proposed layout below)
  • Celery task definitions

I propose to structure the module like this, for generality:

.
├── dummy_config.yaml
├── ngi_pipeline
│   ├── bcbio_sll
│   │   └── __init__.py
│   ├── common
│   │   ├── __init__.py
│   │   ├── parsers.py
│   ├── distributed
│   │   ├── celery.py
│   │   └── __init__.py
│   ├── __init__.py
│   ├── log
│   │   ├── __init__.py
│   ├── piper_sll
│   │   ├── __init__.py
│   │   └── workflows.py
│   └── utils
│       ├── config.py
│       ├── __init__.py
├── README.md
├── requirements.txt
├── scripts
│   └── ngi_server.py
└── setup.py
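
As a sketch of the server script mentioned above, scripts/ngi_server.py could be as small as this, assuming the Celery app object lives in ngi_pipeline/distributed/celery.py as in the layout (argument handling differs slightly between Celery versions):

#!/usr/bin/env python
"""Launch a Celery worker that consumes the ngi_pipeline RabbitMQ queue."""
from ngi_pipeline.distributed.celery import app  # hypothetical: app as proposed above

if __name__ == '__main__':
    # Equivalent to: celery -A ngi_pipeline.distributed.celery worker -Q ngi_pipeline
    app.worker_main(['worker', '--loglevel=INFO', '--queues=ngi_pipeline'])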

common.setup_analysis_directory_structure crashes if no restrict_to_projects passed

TypeError                                 Traceback (most recent call last)
<ipython-input-8-a89a040ed764> in <module>() ----> 1 common.setup_analysis_directory_structure('/pica/v3/a2010002/archive/131030_SN7001362_0103_BC2PUYACXX', '/home/guilc/ngi_pipeline_config.py', [])

/pica/h1/guilc/repos/ngi_pipeline/ngi_pipeline/common/__init__.py in setup_analysis_directory_structure(fc_dir, config_file_path, projects_to_analyze, restrict_to_projects, restrict_to_samples)

      259         # If specific projects are specified, skip those that do not match
      260         project_name = project['project_name']
--> 261         if len(restrict_to_projects) > 0 and project_name not in restrict_to_projects:
     262             LOG.debug("Skipping project {}".format(project_name))
     263             continue

TypeError: object of type 'NoneType' has no len()

Using an empty sequence as the default parameter instead of None would solve the problem (an empty tuple rather than an empty list, to avoid the mutable-default pitfall), as would replacing the len() call with a truthiness check.
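
A sketch of that fix against the snippet above, abbreviated to the relevant lines:

def setup_analysis_directory_structure(fc_dir, config_file_path,
                                       projects_to_analyze,
                                       restrict_to_projects=(),
                                       restrict_to_samples=()):
    for project in projects_to_analyze:
        project_name = project['project_name']
        # A truthiness check also copes with callers that explicitly pass None
        if restrict_to_projects and project_name not in restrict_to_projects:
            LOG.debug("Skipping project {}".format(project_name))
            continue
        # ... rest of the existing body unchanged ...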

Improve documentation

@vezzi, your slides/mockups explaining what an engine, trigger, helper, etc. are would be a great addition to the README.md, along with information about installation and functionality.

Better not to leave this one for the far future!

Uppsala projects can be added to Charon by parsing the filesystem

I was adding some Uppsala projects to Charon today so I could try to process them, and I was doing this just using the information from the filesystem (e.g. the names of directories and so on). It occurs to me that this may be a much easier way of doing things for Uppsala projects than to have them send us files to upload -- also it seems there would be less risk of introducing human error.

Does anyone have a reason why this isn't a good idea? The scripts are basically done, I would just need to fix them up a bit so they're less hacky.
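
As a rough illustration of the idea (the project/sample/libprep/seqrun directory layout is an assumption based on paths elsewhere in these issues, and the function name is hypothetical):

import os

def parse_project_from_filesystem(project_dir):
    """Build a project dict purely from directory names: sample/libprep/seqrun."""
    project = {'name': os.path.basename(project_dir.rstrip('/')), 'samples': {}}
    for sample in sorted(os.listdir(project_dir)):
        sample_path = os.path.join(project_dir, sample)
        if not os.path.isdir(sample_path):
            continue
        project['samples'][sample] = dict(
            (libprep, sorted(os.listdir(os.path.join(sample_path, libprep))))
            for libprep in sorted(os.listdir(sample_path))
            if os.path.isdir(os.path.join(sample_path, libprep)))
    return project

Each project parsed this way could then be uploaded to Charon.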

Improve logging

Logging is done quite badly at the moment: the script uses the minimal_logger, but at the same time the Celery worker spits everything the task generates to standard out, so everything is logged twice.

Improve this so that everything is logged only once, and nicely!
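
One likely culprit is Celery hijacking the root logger. A sketch of a fix, untested against this codebase: connecting any handler to Celery's setup_logging signal stops Celery from installing its own logging configuration, leaving the minimal_logger handlers as the only ones.

from celery.signals import setup_logging

@setup_logging.connect
def configure_logging(**kwargs):
    # The mere presence of a connected handler makes Celery skip its own
    # logging setup, so records pass only through the minimal_logger handlers.
    pass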

Verify Charon "RUNNING" status

At the moment, if we lose a record in the local jobs database, the corresponding Charon object will not be updated. For instance, if we mark a seqrun as RUNNING and the record is lost from the local jobs database (or the database is lost entirely), it will remain RUNNING eternally.
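
A periodic reconciliation pass could catch these. A sketch with placeholder names (charon_session, local_jobs_db, and their methods are all hypothetical):

def reconcile_running_seqruns(charon_session, local_jobs_db):
    """Reset Charon seqruns marked RUNNING that no local job is tracking."""
    for seqrun in charon_session.seqruns_with_status('RUNNING'):  # hypothetical API
        if not local_jobs_db.has_job(seqrun['seqrunid']):         # hypothetical API
            # No local record exists, so the RUNNING status is stale
            charon_session.update_seqrun(seqrun['seqrunid'], status='FAILED')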

sbatch scripts can't update Charon

because Charon is only accessible from milou-b.
This is a problem because we're trying to chain together the alignment and the variant calling, which requires updating Charon with the alignment data. I suppose one solution would be to use the Tornado server as a go-between, so perhaps we should define that as a task.

Define Tornado tasks

Or handlers, so to speak.

It would be nice to hear suggestions here about the "API" structure, parameters, etc.
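
To seed the discussion, here is a minimal sketch of one such handler, proxying seqrun status updates to Charon (the URL scheme, parameter names, and forwarding helper are all placeholders):

import json
import tornado.ioloop
import tornado.web

def update_charon_seqrun(seqrun_id, status):
    """Placeholder: would forward the update to Charon via its REST API."""

class SeqrunStatusHandler(tornado.web.RequestHandler):
    def put(self, seqrun_id):
        # Nodes that cannot reach Charon PUT their status here instead
        payload = json.loads(self.request.body)
        update_charon_seqrun(seqrun_id, payload['alignment_status'])
        self.write({'seqrun_id': seqrun_id, 'updated': True})

application = tornado.web.Application([
    (r'/api/v1/seqrun/([^/]+)/status', SeqrunStatusHandler),
])

if __name__ == '__main__':
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()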

project.base_path is analysis_ready

While working on symlink_convert_file_names I realised that the base_path field of project is the folder containing all the analyses.

This is not optimal, as it creates some confusion between the project name and the directory where the analysis is run.

Output folders starting with number

The pipeline_output folder contains the results of the pipeline.
I recently started to prepend a number to the folder names so that they are automatically sorted in chronological order. It is a really small detail, but it often gives, in a few seconds, an idea of the recipe that has been followed.

Something like this:

00_logs
01_preliminary_alignment_qc
02_raw_alignments
03_merged_aligments
04_processed_alignments
05_final_alignment_qc
06_variant_calls
misc

Running on 8 cores instead of 1 node (memory issue)

In ngi_pipeline/nestor_ngi_config.yaml, change threads to 8 when running on only 8 cores; otherwise you will run out of memory.

piper:
   # Also can be set as an environmental variable $PIPER_QSCRIPTS_DIR
   path_to_piper_qscripts: /proj/a2014205/software/piper/qscripts
   load_modules:
       - java/sun_jdk1.7.0_25
       - R/2.15.0
   sample:
       required_autosomal_coverage: 28.4
   threads: 16  # change to 8 when running on only 8 cores

Pick up genotyping data from INBOX

To make it as simple as possible for the genotyping team, the idea is that they deliver data to the IGN project just as if it were any other user project. This means dropping the vcf and idat files into the IGN project INBOX. I'm just dropping this here to make sure the information is recorded somewhere: a mechanism to pick up this data needs to be present somewhere in the code.

Fake issue

What happens if you close an issue that has the same issue number as one on your fork, but they are in fact different issues? @pekrau

Move workflow selection to ngi_pipeline.conductor submodule

In the future, a project may have multiple workflows which are handled by different analysis engines; thus the code will need to make its decisions on a workflow-by-workflow basis earlier in the code flow (in ngi_pipeline.conductor).

Daemonize engine jobs

I think it's worthwhile to daemonize the engine jobs, by which I mean decouple them from the main ngi_pipeline thread so that it doesn't have to stay running the entire time the analyses are ongoing.

Cons:

  • This will make analysis jobs slightly more complex to find and kill, but as this entire process will generally be non-interactive to begin with, I don't think this is a major problem.
  • I don't know how to do this yet so I'll have to learn.

Pros:

  • Analyses are more robust as they no longer depend on the parent Python thread remaining alive.
  • Individual analyses can be killed separately via their individual Python threads, which means the code can still wrap up logging, job status updates, etc.
  • After I learn to do this, then I will know how to do it.

I think the pros have it but I welcome other ideas.
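
For the record, a sketch of one standard-library approach, assuming engine jobs are launched as external commands (the example command is a placeholder): starting the child in its own session detaches it from the parent process.

import os
import subprocess

def launch_detached(command, logfile_path):
    """Launch an engine job that survives the parent ngi_pipeline process exiting."""
    with open(logfile_path, 'a') as logfile:
        process = subprocess.Popen(command,
                                   stdout=logfile,
                                   stderr=subprocess.STDOUT,
                                   preexec_fn=os.setsid)  # new session, no controlling tty
    # Record the pid in the local jobs database so the job can be found and killed later
    return process.pid

# e.g.: launch_detached(['piper', '--some-arg'], '/path/to/job.log')  # placeholder command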

Decouple from scilifelab package

For example:

from scilifelab.utils.config import load_yaml_config_expand_vars

IMHO, it would be better to decouple this pipeline from the scilifelab package, which is huge and needs a big refactor.
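
The function itself is small enough to vendor. A sketch of a local equivalent for ngi_pipeline.utils.config (the original's behaviour is assumed from its name):

import os
import yaml

def load_yaml_config_expand_vars(config_file_path):
    """Load a YAML config file, expanding $ENV_VAR references in the raw text."""
    with open(config_file_path) as config_file:
        return yaml.safe_load(os.path.expandvars(config_file.read()))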

no data in scratch

When running:

python start_pipeline_from_project.py -f /proj/a2014205/nobackup/denis/topdir/DATA/J.Taipale_14_01

I get this:

Traceback (most recent call last):
  File "start_pipeline_from_project.py", line 48, in <module>
    exec_mode=args_dict["exec_mode"])
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/conductor/launchers.py", line 60, in launch_analysis_for_samples
    config=config, config_file_path=config_file_path)
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/utils/classes.py", line 27, in __call__
    return self.f(**kwargs)
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/conductor/launchers.py", line 192, in launch_analysis
    exec_mode=exec_mode)
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/utils/classes.py", line 27, in __call__
    return self.f(**kwargs)
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/engines/piper_ngi/__init__.py", line 170, in analyze_sample
    local_scratch_mode=(exec_mode == "sbatch"))
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/engines/piper_ngi/__init__.py", line 634, in build_setup_xml
    for fastq_file_name in os.listdir(sample_run_directory):
OSError: [Errno 2] No such file or directory: '$SNIC_TMP/DATA/J.Taipale_14_01/P1371_101/A/141030_D00415_0061_AC5DL5ANXX' 

If I do:

>echo $SNIC_TMP
/scratch

I'm not sure: is this the part where the data is copied to the node?

Re-structure tests

Create a general tests directory and move all the tests there.

We could also study the possibility of mocking the DB to do CI in Travis, but let's put that last one on the wish list.
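
As a starting point, something like this could let the tests run in Travis without a live database (the patched module path and method are hypothetical):

import unittest
import mock  # the mock backport on Python 2.7; unittest.mock on Python 3

class TestSetupAnalysis(unittest.TestCase):
    @mock.patch('ngi_pipeline.database.classes.CharonSession')  # hypothetical path
    def test_uses_mocked_charon(self, mock_session_cls):
        # Any code that instantiates CharonSession now gets the mock instead
        mock_session_cls.return_value.project_get.return_value = {'status': 'OPEN'}
        # ... exercise the code under test here ...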

Move code out of __init__.py scripts

I've got whole gobs of code in the __init__.py files of the various modules/submodules, which as I understand it is not The Right Way to do it. Separate things into different, named files.
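
For example (names hypothetical), the move can keep old imports working with a thin re-export:

# ngi_pipeline/common/filesystem.py -- new, named home for the moved code
def setup_analysis_directory_structure(fc_dir, config_file_path, projects_to_analyze,
                                       restrict_to_projects=None, restrict_to_samples=None):
    """(body moved unchanged from common/__init__.py)"""

# ngi_pipeline/common/__init__.py -- now only re-exports, for backwards compatibility
from ngi_pipeline.common.filesystem import setup_analysis_directory_structure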
