nationalgenomicsinfrastructure / ngi_pipeline
This project was forked from mariogiov/ngi_pipeline.
Code driving the production pipeline at SciLifeLab
Remove all the hardcoded configuration from the code and use a celeryconfig.py file properly.
Mario knows what I'm talking about.
Later we could maybe move to a real SQL database, e.g. PostgreSQL.
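A minimal sketch of what such an external celeryconfig.py could look like (Celery 3.x-style uppercase settings; the broker URL, backend and queue name below are placeholder assumptions, not our actual deployment):

# celeryconfig.py -- loaded via app.config_from_object('celeryconfig')
BROKER_URL = "amqp://guest:guest@localhost:5672//"
CELERY_RESULT_BACKEND = "amqp"
CELERY_TASK_SERIALIZER = "json"
CELERY_RESULT_SERIALIZER = "json"
CELERY_ACCEPT_CONTENT = ["json"]
CELERY_DEFAULT_QUEUE = "ngi_pipeline"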
Each engine will have a different way to track jobs; at least at the moment I think the best way to do this is within each individual engine_ngi submodule, although they can of course use common database functionality and so on.
Need to remove these files before copying over existing data if we're re-analyzing
Two things are mainly needed:
I propose to structure the module like this, for generality:
.
├── dummy_config.yaml
├── ngi_pipeline
│   ├── bcbio_sll
│   │   └── __init__.py
│   ├── common
│   │   ├── __init__.py
│   │   └── parsers.py
│   ├── distributed
│   │   ├── celery.py
│   │   └── __init__.py
│   ├── __init__.py
│   ├── log
│   │   └── __init__.py
│   ├── piper_sll
│   │   ├── __init__.py
│   │   └── workflows.py
│   └── utils
│       ├── config.py
│       └── __init__.py
├── README.md
├── requirements.txt
├── scripts
│   └── ngi_server.py
└── setup.py
In the event that we need their attention.
TypeError                                 Traceback (most recent call last)
<ipython-input-8-a89a040ed764> in <module>()
----> 1 common.setup_analysis_directory_structure('/pica/v3/a2010002/archive/131030_SN7001362_0103_BC2PUYACXX', '/home/guilc/ngi_pipeline_config.py', [])

/pica/h1/guilc/repos/ngi_pipeline/ngi_pipeline/common/__init__.py in setup_analysis_directory_structure(fc_dir, config_file_path, projects_to_analyze, restrict_to_projects, restrict_to_samples)
    259         # If specific projects are specified, skip those that do not match
    260         project_name = project['project_name']
--> 261         if len(restrict_to_projects) > 0 and project_name not in restrict_to_projects:
    262             LOG.debug("Skipping project {}".format(project_name))
    263             continue

TypeError: object of type 'NoneType' has no len()
Using an empty list as a default parameter instead of None would solve the problem.
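A small sketch of an alternative guard; since a mutable default argument ([]) is its own Python gotcha, a truthiness check that covers both None and an empty list may be the safer fix (should_skip is just an illustrative name, not a function in the codebase):

def should_skip(project_name, restrict_to_projects=None):
    # bool() handles both None and [] without needing a mutable default.
    return bool(restrict_to_projects) and project_name not in restrict_to_projects

print(should_skip("J.Doe_14_01", None))             # False -- no restriction given
print(should_skip("J.Doe_14_01", []))               # False -- empty restriction
print(should_skip("J.Doe_14_01", ["Other_14_02"]))  # True  -- not in the allowed list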
@vezzi your slides/mockups about what an engine, trigger, helper, etc. are would be a great addition to the README.md, as well as information about installation and functionality.
Better not to leave this one for the far future!
I was adding some Uppsala projects to Charon today so I could try to process them, and I was doing this just using the information from the filesystem (e.g. the names of directories and so on). It occurs to me that this may be a much easier way of doing things for Uppsala projects than to have them send us files to upload -- also it seems there would be less risk of introducing human error.
Does anyone have a reason why this isn't a good idea? The scripts are basically done, I would just need to fix them up a bit so they're less hacky.
Yeah well, that
Logging is done quite badly right now: the script uses the minimal_logger, but at the same time the Celery worker spits everything the task generates to standard out, so everything is logged twice.
Improve this so that everything is logged only once, and nicely!
At the moment, if we lose a record in the local jobs database, the corresponding Charon object will not be updated. For instance, if we mark a seqrun as RUNNING and the record is lost from the local jobs database (or the database is lost entirely), it will remain RUNNING eternally.
Charon is only accessible from milou-b. This is a problem because we're trying to chain together the alignment and the variant calling, which requires updating Charon with the alignment data. I suppose one solution would be to use the Tornado server as a go-between, so perhaps we should define that as a task.
Make sure e.g. reference files exist, we have permissions, etc. etc.
:-)
Or handlers, so to speak.
Here it would be nice to hear suggestions about the "API" structure, parameters, etc.
As a string, e.g.
-jobNative arg1 arg2 arg3
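A sketch of how a single-string value like that could be handled, assuming it comes straight out of the YAML config and has to be appended to the command line as separate arguments (the base command below is purely illustrative):

import shlex

# The -jobNative arguments arrive as one string and are split on whitespace
# (respecting quoting) before being appended to the command.
job_native_string = "-jobNative arg1 arg2 arg3"
base_command = ["some_launcher_command"]  # illustrative placeholder
full_command = base_command + shlex.split(job_native_string)
print(full_command)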
Working on symlink_convert_file_names i realised that the base_path field of project is the folder contaning all the analysis.
This is not optimal as there is some confusion between project name and the project where we run the analysis.....
We don't need to open a new connection every time; we can just return the existing one if it's already been instantiated.
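A minimal sketch of that reuse pattern, assuming the "connection" is an HTTP session towards Charon; the names here are illustrative, not the pipeline's actual API:

import requests

# Module-level cache: repeat callers get the same session object back.
_session = None

def get_connection():
    """Return the existing session if one has already been instantiated."""
    global _session
    if _session is None:
        _session = requests.Session()
    return _session

# Repeated calls hand back the same object:
assert get_connection() is get_connection()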
This will also change the directory tree structure. Conversation began in #34
pipeline_output folder contains the results of the pipeline.
I recently started to prepend a number to the folder names so that the folders sort automatically in chronological order; it is a really small detail, but it often gives you an idea of the recipe that has been followed within a few seconds.
Something like this:
00_logs
01_preliminary_alignment_qc
02_raw_alignments
03_merged_alignments
04_processed_alignments
05_final_alignment_qc
06_variant_calls
misc
This relates also to issue #34
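For what it's worth, a tiny sketch of creating that numbered layout (the parent directory name is just an example):

import os

# Create the numbered subdirectories so a plain "ls" shows them in the order
# the pipeline steps were run.
subdirs = ["00_logs", "01_preliminary_alignment_qc", "02_raw_alignments",
           "03_merged_alignments", "04_processed_alignments",
           "05_final_alignment_qc", "06_variant_calls", "misc"]
for subdir in subdirs:
    path = os.path.join("pipeline_output", subdir)
    if not os.path.isdir(path):
        os.makedirs(path)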
I didn't realize that the methods setUp and tearDown are run before and after every test, respectively -- I need to switch to using the @classmethod forms setUpClass and tearDownClass.
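A short sketch of the difference, using a made-up test class name; setUpClass/tearDownClass run once per class, while setUp/tearDown wrap every single test method:

import unittest

class TestAnalysisSetup(unittest.TestCase):  # hypothetical test class

    @classmethod
    def setUpClass(cls):
        # Runs once before all tests in this class, e.g. to build an
        # expensive temporary directory tree.
        cls.shared_resource = "created once"

    @classmethod
    def tearDownClass(cls):
        # Runs once after all tests in this class have finished.
        cls.shared_resource = None

    def test_resource_is_available(self):
        self.assertEqual(self.shared_resource, "created once")

if __name__ == "__main__":
    unittest.main()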
In ngi_pipeline/nestor_ngi_config.yaml, change threads to 8 when running on only 8 cores; otherwise you will run out of memory.
piper:
  # Can also be set as the environment variable $PIPER_QSCRIPTS_DIR
  path_to_piper_qscripts: /proj/a2014205/software/piper/qscripts
  load_modules:
    - java/sun_jdk1.7.0_25
    - R/2.15.0
  sample:
    required_autosomal_coverage: 28.4
  threads: 16
The PIPER_GLOB_CONF environment variable is colliding between piper and ngi_pipeline. Switch it to PIPER_GLOB_CONF_XML in ngi_pipeline.
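On the ngi_pipeline side the lookup would then read the new name; falling back to the old one is only a suggested transition aid, not confirmed behaviour of either tool:

import os

# Prefer the renamed variable, optionally falling back to the old one.
piper_glob_conf_xml = (os.environ.get("PIPER_GLOB_CONF_XML")
                       or os.environ.get("PIPER_GLOB_CONF"))
print(piper_glob_conf_xml)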
To make it as simple as possible for the genotyping people, the idea is that they deliver data to the IGN project just as if it were any other user project. This means dropping the vcf and idat files into the IGN project INBOX. I'm just dropping this here to make sure that there is a note somewhere that a mechanism to pick up this data needs to be present somewhere in the code.
What happens if you close an issue that has the same issue number as one on your fork, but they are in fact different issues? @pekrau
I haven't yet figured out how to redirect stdout and stderr to Logbook, but I'm hoping there is some way to do this.
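One possible approach (a hedged sketch, not an official Logbook feature): wrap a Logger in a file-like object and point sys.stdout/sys.stderr at it.

import sys
import logbook

class LoggerWriter(object):
    """File-like object that forwards writes to a Logbook logger method."""

    def __init__(self, log_method):
        self.log_method = log_method

    def write(self, message):
        message = message.strip()
        if message:
            self.log_method(message)

    def flush(self):
        pass

logger = logbook.Logger("ngi_pipeline")
# Keep the handler writing to the real stderr to avoid recursion.
logbook.StreamHandler(sys.__stderr__).push_application()
sys.stdout = LoggerWriter(logger.info)
sys.stderr = LoggerWriter(logger.error)

print("this line now goes through Logbook")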
In the future, a project may have multiple workflows which are handled by different analysis engines; thus the code will need to make its decisions on a workflow-by-workflow basis earlier in the code flow (in ngi_pipeline.conductor).
In server/handlers.py, the method run_ngi_pipeline can't find self.application because it is not defined on a class inheriting from tornado.web.RequestHandler, which is not ideal.
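A sketch of the shape the fix might take, assuming run_ngi_pipeline should live on a handler class (the handler name, route and setting below are made up): self.application is only injected by Tornado into RequestHandler subclasses.

import tornado.web

class RunPipelineHandler(tornado.web.RequestHandler):  # hypothetical handler

    def get(self):
        # self.application is available here because Tornado constructs the
        # handler with a reference to the Application instance.
        config_file = self.application.settings.get("config_file_path", "")
        self.write({"config_file": config_file})

application = tornado.web.Application(
    [(r"/run_pipeline", RunPipelineHandler)],
    config_file_path="/path/to/ngi_config.yaml",  # illustrative setting
)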
Clean up the remaining scilifelab labels and replace them with ngi_pipeline (i.e. in the setup.py file).
I think it's worthwhile to daemonize the engine jobs, by which I mean decouple them from the main ngi_pipeline thread so that it doesn't have to stay running the entire time the analyses are ongoing.
Cons:
Pros:
I think the pros have it but I welcome other ideas.
I'm gonna set up a cron job to run the script that updates Charon with running/finished/failed job statuses and also sends notification emails to operators.
For example:
from scilifelab.utils.config import load_yaml_config_expand_vars
IMHO, it would be better to decouple this pipeline from the scilifelab package, which is huge and needs a big refactor.
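If that helper is the only thing we use from there, a small standalone replacement sketch could look like this (assuming its behaviour is "load YAML and expand $VARS in string values" -- worth double-checking against the original):

import os
import yaml

def load_yaml_config_expand_vars(config_file_path):
    """Load a YAML config file and expand environment variables in its values."""
    with open(config_file_path) as config_file:
        config = yaml.safe_load(config_file)

    def expand(value):
        if isinstance(value, str):
            return os.path.expandvars(value)
        if isinstance(value, dict):
            return dict((key, expand(val)) for key, val in value.items())
        if isinstance(value, list):
            return [expand(item) for item in value]
        return value

    return expand(config)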
When running:
python start_pipeline_from_project.py -f /proj/a2014205/nobackup/denis/topdir/DATA/J.Taipale_14_01
I get this:
Traceback (most recent call last):
  File "start_pipeline_from_project.py", line 48, in <module>
    exec_mode=args_dict["exec_mode"])
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/conductor/launchers.py", line 60, in launch_analysis_for_samples
    config=config, config_file_path=config_file_path)
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/utils/classes.py", line 27, in __call__
    return self.f(**kwargs)
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/conductor/launchers.py", line 192, in launch_analysis
    exec_mode=exec_mode)
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/utils/classes.py", line 27, in __call__
    return self.f(**kwargs)
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/engines/piper_ngi/__init__.py", line 170, in analyze_sample
    local_scratch_mode=(exec_mode == "sbatch"))
  File "/home/denis/anaconda/envs/ngi/lib/python2.7/site-packages/ngi_pipeline-0.1.0-py2.7.egg/ngi_pipeline/engines/piper_ngi/__init__.py", line 634, in build_setup_xml
    for fastq_file_name in os.listdir(sample_run_directory):
OSError: [Errno 2] No such file or directory: '$SNIC_TMP/DATA/J.Taipale_14_01/P1371_101/A/141030_D00415_0061_AC5DL5ANXX'
If I do:
> echo $SNIC_TMP
/scratch
I'm not sure -- is this the part where the data is copied to the node?
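For what it's worth (a guess, not a confirmed diagnosis): Python never expands environment variables in paths, so os.listdir() is being handed the literal string '$SNIC_TMP/...', which does not exist as a directory. Something like the sketch below would at least make that failure explicit; whether this is also the step where data should already have been copied to the node is a separate question.

import os

sample_run_directory = "$SNIC_TMP/DATA/J.Taipale_14_01/P1371_101/A/141030_D00415_0061_AC5DL5ANXX"
# Expand $SNIC_TMP (and any other variables) before touching the filesystem.
expanded_directory = os.path.expandvars(sample_run_directory)
if "$SNIC_TMP" in expanded_directory:
    raise RuntimeError("SNIC_TMP is not set; this code must run inside a SLURM job")
for fastq_file_name in os.listdir(expanded_directory):
    print(fastq_file_name)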
Create a general tests directory and move all the tests there. We could also study the possibility of mocking the DB to do CI in Travis, but let's put that last one on the wish list.
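A rough sketch of the mocking idea, purely to show the mechanism; the helper and the Charon URL scheme below are hypothetical, not the project's real code:

import unittest
import mock  # the standalone "mock" package on Python 2.7; unittest.mock on Python 3
import requests

def get_project_status(project_id, charon_base_url):
    """Hypothetical helper: ask Charon for a project's status over HTTP."""
    response = requests.get("{}/api/v1/project/{}".format(charon_base_url, project_id))
    return response.json()["status"]

class TestWithoutDatabase(unittest.TestCase):

    @mock.patch("requests.get")
    def test_project_status_lookup(self, mock_get):
        # Pre-programme the fake HTTP response so no real Charon is needed in Travis.
        mock_get.return_value.json.return_value = {"status": "OPEN"}
        self.assertEqual(get_project_status("P1371", "http://charon.example"), "OPEN")

if __name__ == "__main__":
    unittest.main()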
I've got whole gobs of code in the __init__.py files of the various modules/submodules, which as I understand it is not The Right Way to do it. Separate things into different, named files.
This is obviously a problem; investigate why this is happening and how to solve it.
When calling a real handler, i.e. curl -X GET http://nestor1.uppmax.uu.se:6666/flowcell/130611_SN7001298_0148_AH0CCVADXX, a couple of wrong parameters like False are passed to ngi_pipeline_start.py and it crashes.