mdu-phl / bohra
A pipeline for bioinformatics analysis of bacterial genomes
License: GNU General Public License v3.0
A handful of isolates that we map to a reference, extracting reads that align to a small 5 kb region.
If a report directory is found, it should be renamed or removed and the run started again.
Swap in seqtk in place of fq.
Add in example reports for each type of pipeline
Modify the nextflow pipeline to allow each process to have customisable resources
Add in a setup command for the setup of databases and to check that all files and dependencies are installed. Check that all binary and database dependencies exist, and maybe their versions/names.
When a report.toml is too large, the writing of the report.html from this file is way too slow, to the point of not being possible (tested on >1000 isolates).
It would be useful to include one or more ref genomes in preview mode.
Add some docstrings to the top of these two files (at least; I would like to see them in all of them). That will help in self-documenting later. Ideally, all functions and class definitions would have one too. They don't have to be long. I generally try to start every function/class definition by writing down in the docstring what the element will do, what parameters it will take, and what output to expect. That helps in organising things, and you can quickly see if the function is trying to do too much.
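For example (a hypothetical function, just to illustrate the shape):

def read_isolates(path):
    '''
    Read an isolates.tab file and return a list of isolate IDs.

    Parameters:
        path (str): path to a tab-delimited file, one isolate per row.

    Returns:
        list of str: the isolate IDs found in the first column.
    '''
    with open(path) as handle:
        return [line.split('\t')[0].strip() for line in handle if line.strip()]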
You can follow the sphinx model and then add sphinx to the tasks.py to automatically generate some docs too: https://pythonhosted.org/an_example_pypi_project/sphinx.html
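A minimal sketch of such a task, assuming a docs/ directory created with sphinx-quickstart and building into docs/_build/html:

import invoke

@invoke.task
def docs(ctx):
    '''Build the HTML documentation with Sphinx.'''
    ctx.run("sphinx-build -b html docs docs/_build/html")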
Push to PyPI
Snakemake parameter
Bohra passes args.cpus to the SnpDetection object's self.cpus during initialisation. Then there is a function named set_snakemake_jobs() that double-checks whether self.cpus exceeds the machine's CPU count:
SnpDetection.py#L141
def set_snakemake_jobs(self):
    '''
    set the number of jobs to run in parallel based on the number of cpus from args
    '''
    if int(self.cpus) < int(psutil.cpu_count()):
        self.jobs = self.cpus
    else:
        self.jobs = 1
However, the final command line that runs the Snakemake file does not use self.jobs:
if self.cluster:
    cmd = f"{self.cluster_cmd()} -s {snake_name} -d {wd} {force} {singularity_string} --latency-wait 1200"
else:
    cmd = f"snakemake {dry} -s {snake_name} {singularity_string} -j {self.cpus} -d {self.job_id} {force} --verbose 2>&1"
Create a little invoke script to automatically version bump, generate the bundles, and push to PyPI. Example using bumpversion (to automatically bump the version) and twine (to upload to PyPI) below. You can add that to a file called tasks.py and then just run inv to push new versions to PyPI (after pip3 install invoke). You could also make it a bash script, of course. Read the twine README for some more background: https://github.com/pypa/twine. And invoke: http://www.pyinvoke.org/
'''
Automate deployment to PyPI
'''
import invoke

@invoke.task
def deploy_patch(ctx):
    '''
    Automate deployment:
    rm -rf build/* dist/*
    bumpversion patch --verbose
    python3 setup.py sdist bdist_wheel
    twine upload dist/*
    git push --tags
    '''
    ctx.run("rm -rf build/* dist/*")
    ctx.run("bumpversion patch --verbose")
    ctx.run("python3 setup.py sdist bdist_wheel")
    ctx.run("twine check dist/*")
    ctx.run("twine upload dist/*")
    ctx.run("git push --tags")
In the summary table, it would be useful to report the minaln, but I would suggest leaving responsibility with the user to think about what makes sense for them. Can instead state on GitHub / show in case studies the standard/recommended threshold for inclusion based on core min aln (if doing large-scale population genomics). Add an example config for snippy to GitHub.
File "/home/linuxbrew/.linuxbrew/bin/bohra", line 11, in <module>
load_entry_point('bohra==1.0.3', 'console_scripts', 'bohra')()
File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/bohra.py", line 96, in main
parser.print_help(sys.stderr)
NameError: name 'parser' is not defined
Ideally want these:
Maybe these
Hi Kristy,
My command to run bohra:
bohra run -c 8 -i ids.tab -j CRL_20210120_ -r GCF_001548355.1_JKo3.fna -p sa -mdu -ma 0 -mc 0
Josh
Add the keyword __version__ = "1.0.1" to __init__.py (note the version number is a string). You can then import bohra into setup.py and never have to modify the version in more than one place. Read the bumpversion docs too. This is good practice when deploying python packages. You can add __author__, __copyright__ and __license__ variables to __init__.py too.
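A minimal sketch of the two files (the metadata values are placeholders):

# bohra/__init__.py
__version__ = "1.0.1"
__author__ = "MDU-PHL"
__license__ = "GPL-3.0"

# setup.py -- import the package so the version lives in one place
from setuptools import setup
import bohra

setup(
    name="bohra",
    version=bohra.__version__,
    # ...
)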
You either get None or the full path to the executable. You could easily then allow the user to supply a path to the executable if it is not in $PATH for some reason. You will, of course, still need to use subprocess.run to get the exact version of the tool.
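A sketch of that pattern (user_path is a hypothetical, optionally user-supplied override):

import shutil
import subprocess

user_path = None  # hypothetical: could come from a CLI argument
# shutil.which returns None or the full path to the executable
snippy = user_path or shutil.which("snippy")
if snippy is None:
    raise SystemExit("snippy not found in $PATH; supply a path explicitly")

# subprocess.run is still needed to get the exact version of the tool
proc = subprocess.run([snippy, "--version"], capture_output=True, text=True)
version_string = (proc.stdout or proc.stderr).strip()  # some tools print the version to stderr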
You can then be a bit clever with regex to parse out versions and use the packaging package to compare:
import re
from packaging import version

version_pat = re.compile(r'\bv?(?P<major>[0-9]+)\.(?P<minor>[0-9]+)\.(?P<release>[0-9]+)(?:\.(?P<build>[0-9]+))?\b')
# note: don't name this variable `version`, or it shadows the imported module
tool_version = "snippy v3.2.1"
m = version_pat.search(tool_version)
# you can access individual components
m.group("major") # "3"
# the whole matching string
m.group() # "v3.2.1"
# as a dictionary
m.groupdict() # {'major': '3', 'minor': '2', 'release': '1', 'build': None}
# or tuple
m.groups() # ('3', '2', '1', None)
# the packaging package offers some comparison tools and it comes with
# setuptools, so no additional requirements needed
min_version = version.parse("v3.2.3")
version.parse(m.group()) >= min_version # False
min_version = version.parse("v3.2.0")
version.parse(m.group()) >= min_version # True
Ion Torrent reads are single-end (one fastq file) and normally need different settings in BWA MEM.
Do you really support them?
Make dependencies brew- and conda-installable
bohra run
[INFO:03/05/2020 05:13:47 PM] Starting bohra pipeline using /home/linuxbrew/.linuxbrew/bin/bohra run
[INFO:03/05/2020 05:13:47 PM] You are running bohra in preview mode.
Traceback (most recent call last):
File "/home/linuxbrew/.linuxbrew/bin/bohra", line 8, in <module>
sys.exit(main())
File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/bohra.py", line 136, in main
args.func(args)
File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/bohra.py", line 32, in run_pipeline
R = RunSnpDetection(args)
File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/SnpDetection.py", line 65, in __init__
self.log_messages('warning', 'Input file can not be empty, please set -i path_to_input to try again')
AttributeError: 'RunSnpDetection' object has no attribute 'log_messages'
It is a convention to have the tests folder outside the module folder. You can modify the tasks.py to run tests before packaging, and stop if one or more tests fail.
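A sketch of how tasks.py could enforce that, assuming pytest and a top-level tests/ folder:

import invoke

@invoke.task
def test(ctx):
    '''Run the test suite; a failure raises and aborts dependent tasks.'''
    ctx.run("python3 -m pytest tests/")

# making the deploy task depend on the tests means packaging
# only happens when the suite passes
@invoke.task(pre=[test])
def deploy_patch(ctx):
    ctx.run("python3 setup.py sdist bdist_wheel")
    ctx.run("twine upload dist/*")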
I find that if I run e.g.
bohra run --input_file isolates.tab --job_id PA_20200115_1 --reference ref.fa --mask phastaf/phage.bed -mdu -n
...and get the warning message:
[WARNING:01/15/2020 02:59:44 PM] This may be a re-run of an existing job. Please try again using rerun instead of run OR use -f to force an overwrite of the existing job.
[WARNING:01/15/2020 02:59:44 PM] Exiting....
...then running the same command with -f added only repeats the same error message.
line 693: main() should be short, and really just call other functions.
The Unix standard for a dry run is --dry-run, with -n as the synonym if needed.
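With argparse that would look something like this (option strings only; the rest is a sketch):

import argparse

parser = argparse.ArgumentParser()
# Unix convention: --dry-run as the long option, -n as the short synonym
parser.add_argument("-n", "--dry-run", action="store_true",
                    help="show what would be run without executing anything")
args = parser.parse_args()
# argparse exposes this as args.dry_run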
If input files are not accessible: prefill. If a path is not accessible for assembly and speciation, default to running assembly or speciation.
nulla2bohra: Ensure that bohra can be rerun over an existing nullarbor folder. Can also be used to update older bohra directories. Must supply the name of the nullarbor directory and your isolates.tab file.
Can you use nullarbor/input.tab so they only provide the folder? It is a copy of their original file from when they last ran it.
Update README.md for installation options
Include pip, singularity, conda and brew?
line 404: Skip gracefully over isolates where no mlst is found. Thanks to @willpitchers for finding this.
--workdir WORKDIR, -w WORKDIR
                      Working directory, default is current directory
                      (default: /home/linuxbrew)
If this is meant to be fast scratch, please use tempfile.gettempdir() instead, which will use $TMPDIR. This will allow Kraken2 to play nicer with the Snakemake scheduler.
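In Python that is a one-liner:

import tempfile

# honours $TMPDIR (and TEMP/TMP), falling back to a platform default
workdir = tempfile.gettempdir()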
File "/home/linuxbrew/.linuxbrew/bin/bohra", line 8, in <module>
sys.exit(main())
File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/bohra.py", line 136, in main
args.func(args)
File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/bohra.py", line 33, in run_pipeline
return(R.run_pipeline())
File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/SnpDetection.py", line 920, in run_pipeline
self.index_reference()
File "/home/linuxbrew/.linuxbrew/opt/python/lib/python3.7/site-packages/bohra/SnpDetection.py", line 612, in index_reference
if '.fa' not in self.ref:
TypeError: argument of type 'PosixPath' is not iterable
The offending line should be:
if not self.ref.match("*.fa*"):
bohra/bohra/utils/iqtree_generator.sh
Line 4 in 33671b8
double-N please :)
Add zoom functionality to the tree
A good convention is to put at the top of each *.py file or module:

import logging

logging.basicConfig(format='[%(asctime)s] %(message)s', datefmt='%m/%d/%Y %I:%M:%S %p', level=logging.INFO)
logger = logging.getLogger(__name__)

Then call logger wherever needed (e.g., logger.info("Starting up")).
Add in paths to singularity containers and recipes
Ideally ~SNPs and evolutionary distance.
Add in a cluster.json and a flag for cluster mode with an alternate snakemake command
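A sketch of what such a cluster.json might look like (Slurm-style resource keys; rule names, values, and the bohra.smk filename are illustrative), together with the alternate snakemake invocation:

{
    "__default__": {
        "cpus": 4,
        "mem": "8G",
        "time": "02:00:00"
    },
    "snippy": {
        "cpus": 8,
        "mem": "16G"
    }
}

# alternate snakemake command using the cluster config:
# snakemake -s bohra.smk --cluster-config cluster.json \
#   --cluster "sbatch --cpus-per-task={cluster.cpus} --mem={cluster.mem} --time={cluster.time}" -j 20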