churchlab / millstone

Genome engineering and analysis software

Home Page: http://churchlab.github.io/millstone/

License: MIT License

Python 84.07% CoffeeScript 0.19% JavaScript 9.59% HTML 4.90% CSS 0.72% Shell 0.08% Jupyter Notebook 0.45%

millstone's Issues

Support multiple ALTs

Add an is_primary key to the VariantAlternate
When the user is generating a ReferenceGenome from a VariantSet that contains Variants with multiple ALTs, the software must require the user to choose an ALT

Merge Variants that affect the same codon

Each of the Variants might not be consequential by itself, or may not capture the full effect of having both.

However, this might be caught by the flow where we generate new reference genomes and do iterative analysis.

Make JBrowse work in dev mode without nginx

Melted view fails for some genomes (I think ones with real sample data).

'None' key in every row of csv.DictReader

Every row seems to have an empty K/V pair {None:''}, not sure why. Here I remove it by hand: https://github.com/churchlab/genome-designer-v2/blob/master/genome_designer/scripts/import_util.py#L102
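
One common cause, which may or may not be what is happening in import_util.py: when a data row has more fields than the header (for example, a trailing delimiter), csv.DictReader collects the extra fields in a list under its restkey, which defaults to None. A minimal sketch of that behavior and of dropping the entry; the column names and the read_rows helper are made up for illustration.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text and drop the overflow entry DictReader stores under None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Fields beyond the header land in a list under restkey (None by default).
        row.pop(None, None)
        rows.append(dict(row))
    return rows


# The header has two columns, but each data row ends with a trailing comma,
# so every row carries one extra (empty) field.
SAMPLE = "sample,reads\nA,10,\nB,20,\n"
```

If the trailing delimiter turns out to be the cause, fixing the template that produces the file would remove the need for the manual cleanup.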

Concurrent alignment pipeline throws lock error when each alignment process tries to read reference genome

Structural variants

Identify a tool that works pretty well
Test it on test genomes with known deletions that we create
Integrate with our database representation

The temp filesystem is filling up with extra data during every test, even when the test db is torn down.

This can be remedied by overriding the delete() method for models that store filesystem data, so that the respective filesystem data is deleted along with the database row.

Also, it might make sense to make a test filesystem location.
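
A rough sketch of the delete() idea, written as a plain Python class so it runs without Django. FilesystemBackedRecord and its filesystem_location attribute are stand-ins, not the project's actual models; a real model would also call the parent delete() to remove the database row.

```python
import os
import shutil
import tempfile


class FilesystemBackedRecord:
    """Stand-in for a model that owns a directory on disk (not a Django model)."""

    def __init__(self, root):
        # Each record gets its own directory under the given root.
        self.filesystem_location = tempfile.mkdtemp(dir=root)

    def delete(self):
        # Remove the on-disk data; ignore_errors makes repeated calls harmless.
        shutil.rmtree(self.filesystem_location, ignore_errors=True)
        # A real Django model would call super().delete() here.
```

Pointing root at a dedicated test directory would also cover the second suggestion, since the whole directory can be removed at the end of a test run.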

In melted view, alternates should show per sample, not all combined

Allow creating new Variant Set from master checkbox action menu

Get rid of Datatables default search bar

Giving this to you Dan since you probably know how to do this after mucking around with datatables more than I have.

When creating an alignment, the select-all checkbox doesn't work

Implement hard_delete() for models that store files on the filesystem.

Allow doing ancestral comparisons with variant sets

For example, we want to trace the lineage of a particular SNP.

Implement a way to create "designed" variants

This should probably be some sort of spreadsheet template/upload flow like the sample upload.

Figure out less confusing UI

Probably want to separate the view into 3 tabs:

Import / Align / Analyze

Starting to sketch this in the Google Powerpoint Mocks:
https://docs.google.com/presentation/d/1xIEn89cz6Pw1r-Ap8Cn5Olr9Xz_0_S1cwYoSAR12myw/edit

test_remove_variant_from_set fails sporadically

For some reason this test fails sometimes when running all the tests. It seems to always pass when run alone.

Perhaps there's a lapse in our understanding of how the Django test framework uses databases.

SnpEff field (INFO.EFF) is currently a long string, and needs to be split into multiple fields.

The field is currently stored as one long string. We need to split it up into many fields, and add them all as separate info fields in the VCF, by adding new methods to vcf_parser.py.

We could just split them up in our data model only and not touch the snpeff VCF, but it might be annoying since other software might want a normalized VCF file.
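
A minimal sketch of the splitting step, assuming the classic SnpEff EFF layout of Effect(sub-field|sub-field|...) with multiple effects separated by commas. The sub-field order below follows the SnpEff documentation for that format but can differ between SnpEff versions, so it should be checked against the ##INFO header of the actual VCF. The function and key names are hypothetical and not part of vcf_parser.py.

```python
import re

# Sub-field order assumed from SnpEff's documented EFF format (version dependent).
EFF_SUBFIELDS = [
    "impact", "functional_class", "codon_change", "amino_acid_change",
    "amino_acid_length", "gene_name", "transcript_biotype", "gene_coding",
    "transcript_id", "exon_rank", "genotype_number",
]

_EFF_PATTERN = re.compile(r"^([^()]+)\((.*)\)$")


def parse_eff(eff_value):
    """Split an INFO.EFF value into one dict per annotated effect."""
    effects = []
    # Assumes sub-field values themselves contain no commas.
    for entry in eff_value.split(","):
        match = _EFF_PATTERN.match(entry.strip())
        if match is None:
            continue  # skip anything that does not look like Effect(...)
        parsed = {"effect": match.group(1)}
        # zip stops at the shorter sequence, so optional trailing sub-fields
        # (errors, warnings) are ignored in this sketch.
        parsed.update(zip(EFF_SUBFIELDS, match.group(2).split("|")))
        effects.append(parsed)
    return effects
```

Each dict could then be written back as separate INFO fields or stored only in the data model, depending on which of the two options above is chosen.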

Wherever variants show up, give them relevant JBrowse links

Make sure melted vs cast view works as expected

Also add more tests

Make the Data tab match the mock

Variant filtering tests broken when running them alone

As of 8f99e2f the tests are broken, even after Dan added missing files.

For some reason, the tests pass when run as part of the suite, but fail when running only the test_variant_filter.py module.

I am debugging now.

Data import actions should be asynchronous

Finish implementing changing which columns show up in the Variants view.

Three levels of implementation:

Show some set of default columns.
Allow the user to specify which columns to show.
When a column key is in the search query, show that column.

This should be implemented by using our application-level (in python) database view objects.
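
The three levels could be combined roughly as follows. This is only a sketch of the selection logic: it does not touch the database view objects, and DEFAULT_COLUMNS, columns_to_show, and the example keys are hypothetical names.

```python
import re

DEFAULT_COLUMNS = ["position", "ref", "alt"]  # hypothetical defaults


def columns_to_show(filter_string, known_keys, user_columns=None):
    """Return the ordered list of column keys for the Variants view."""
    # Levels 1 and 2: a user-specified list replaces the defaults when given.
    columns = list(user_columns) if user_columns else list(DEFAULT_COLUMNS)
    # Level 3: append any known key that appears as a token in the filter string.
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", filter_string))
    for key in known_keys:
        if key in tokens and key not in columns:
            columns.append(key)
    return columns
```

For example, a filter string mentioning gt_type would add a gt_type column after the defaults, provided gt_type is in known_keys.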

RScript required by Picard

Changping found out that his test was failing due to Rscript missing.

Per-Alt INFO fields are always 'undefined'

Add putative variants for freebayes to look for when creating alignment

Freebayes can take in putative snps and look for them regardless of their actual presence/absence. This will be useful with designed variants. We want to be able to input an existing variant set when performing a new alignment with samples, turn that into a VCF, and send it to Freebayes with:

-@ --variant-input variant_set.vcf
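
A sketch of how the command line might be assembled once the variant set has been written out as a VCF. Only the argument list is built here; the function name and file paths are hypothetical, and the long option names are the ones listed in the freebayes usage text.

```python
def build_freebayes_command(reference_fasta, bam_paths, output_vcf,
                            putative_vcf=None):
    """Build the freebayes argument list, optionally with putative variants."""
    cmd = ["freebayes", "--fasta-reference", reference_fasta]
    for bam in bam_paths:
        cmd += ["--bam", bam]
    if putative_vcf is not None:
        # Long form of -@: variants freebayes should report regardless of support.
        cmd += ["--variant-input", putative_vcf]
    cmd += ["--vcf", output_vcf]
    return cmd
```

The resulting list can be passed to subprocess.check_call() by the pipeline step that runs variant calling.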

Use mapping quality and non-unique reads to identify 'uncallable' or 'paralogous' regions of the genome, as an automatically generated region

Handle IN_SETS() and NOT_IN_SETS() for melted view.

Right now these filter operators are applied on the level of Variants without regard to the 'sample_variant_set_association' field of VariantToVariantSet.

Make deployable on AWS

Let's obliterate 2 birds with one stone here and get this to work straight on AWS.

Figure out how to handle the different reference genome IDs and name fields.

Each chromosome corresponds to a BioPython sequence record, which has many 'descriptive' fields, and different programs use different ones.

This has been causing problems. For instance, snpEff uses the genbank LOCUS (not sure which SeqRecord field this corresponds to), while FASTA uses record.id (which comes from genbank ACCESSION), and our Django code previously used record.name internally.

For now, I'm just setting all these to be equal, but we need to come up with a unified scheme, including tests and assertions, for making these all agree so the pipeline (genbank -> FastQ -> Bam -> VCF -> SnpEff) doesn't report missing chromosomes when a genbank LOCUS, a genbank ACCESSION, and a FASTA record.id are all different, for instance.
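
A sketch of the "set them all equal, then assert" approach, written against any object with id and name attributes (as Biopython's SeqRecord has) so it runs without Biopython. The helper names and the example identifiers are hypothetical, and choosing record.id as the canonical value is an assumption, not a decision recorded in this issue.

```python
from types import SimpleNamespace


def unify_record_identifiers(record, canonical=None):
    """Set id and name on a SeqRecord-like object to one canonical value."""
    canonical = canonical or record.id
    record.id = canonical
    record.name = canonical
    return record


def assert_identifiers_agree(records):
    """Raise ValueError listing every record whose id and name differ."""
    mismatched = [(r.id, r.name) for r in records if r.id != r.name]
    if mismatched:
        raise ValueError("id/name mismatch: %r" % mismatched)


# Stand-in for a parsed record whose identifier fields disagree.
EXAMPLE = SimpleNamespace(id="chrom_accession.1", name="chrom_locus")
```

Running the assertion right before each pipeline stage would surface a mismatch early instead of as a missing chromosome downstream.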

Implement cast and melted view for Variants

Variant filtering by region and genes

SnpEff test leaving behind files

These files are created and not cleaned up:

genome_designer/snpEff_genes.txt
genome_designer/snpEff_summary.html

Implement url/javascript sync for variants view

Thus you can save a link to a particular view and navigate there without having to go through the whole series of button clicks to recreate a filter.
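
On the server side, the view state could round-trip through the query string roughly like this. It is a sketch only: the parameter names filter and melt and both helper names are hypothetical, and the JavaScript half of the sync is not shown.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_view_url(base_path, filter_string, is_melted):
    """Encode the current filter and melted/cast state into a shareable URL."""
    query = urlencode({"filter": filter_string, "melt": int(is_melted)})
    return base_path + "?" + query


def parse_view_url(url):
    """Recover the view state from a URL produced by build_view_url."""
    params = parse_qs(urlsplit(url).query)
    return {
        # parse_qs omits blank values, so fall back to an empty filter.
        "filter": params.get("filter", [""])[0],
        "melt": params.get("melt", ["0"])[0] == "1",
    }
```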

Running all nosetests takes a long time, I think it might be looping.

Any request that kicks off an async job should return an error if Celery isn't running

Move alignment stuff to the Align tab

Also, I realized that this tab would also control variant calling, so perhaps the tab should be called something other than 'Align'?

Update add/remove Variant to/from VariantSet to account for whether a sample is associated (i.e. cast vs melted view)

Dynamic generate_filter_key_map() generation.

Different VCFs will have different fields, and so we will need to dynamically update the allowed VARIANT_CALLER_COMMON_MAPs and VARIANT_EVIDENCE_MAPs accordingly. We will add these as JSONFields to ReferenceGenomes, and update them whenever a new vcf is added, either by the user or by the pipeline.
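
One way to derive the per-genome maps is to read the ##INFO and ##FORMAT declarations from each VCF header. The sketch below assumes INFO keys feed the common-data map and FORMAT keys feed the evidence map, which is a guess about how the two maps divide. build_key_map and the output layout are hypothetical.

```python
import json
import re

# Matches the ID, Number, and Type attributes of a VCF meta-information line.
_FIELD_LINE = re.compile(
    r"^##(INFO|FORMAT)=<ID=([^,]+),Number=([^,]+),Type=([^,>]+)")


def build_key_map(vcf_header_lines):
    """Collect declared INFO and FORMAT keys with their type and number."""
    key_map = {"INFO": {}, "FORMAT": {}}
    for line in vcf_header_lines:
        match = _FIELD_LINE.match(line)
        if match:
            section, key, number, type_ = match.groups()
            key_map[section][key] = {"type": type_, "num": number}
    return key_map


HEADER = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
]
```

The result is plain dicts and strings, so json.dumps(build_key_map(HEADER)) yields something that can be stored in a JSONField and merged whenever a new VCF is added.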

Implement Variant pagination

Callable regions

Partitioning the genome into regions that can be called and those that can't based on:

Raw notes

how the reads fall (read depth, uniqueness)
investigate whether GATK or something has a tool to do this
add feature annotations to JBrowse for such regions

Without the actual data

read len
distance between paired end reads
genome sequence

Empirical calling

Based on bam
Regions without reads (borders on structural variation (deletion) calling )
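
For the empirical, BAM-based case, a first pass could simply scan per-base read depth and report runs that fall below a threshold. The sketch assumes the depths have already been extracted into a list (for example, from a depth tool's output); the function name and the default threshold are hypothetical, and mapping quality and read uniqueness are not considered here.

```python
def low_coverage_regions(depths, min_depth=1):
    """Return half-open (start, end) intervals where depth < min_depth."""
    regions = []
    start = None
    for pos, depth in enumerate(depths):
        if depth < min_depth:
            if start is None:
                start = pos  # a low-coverage run begins here
        elif start is not None:
            regions.append((start, pos))  # the run ended at the previous base
            start = None
    if start is not None:
        regions.append((start, len(depths)))  # run extends to the end
    return regions
```

The intervals could then be exported as feature annotations for JBrowse, as suggested in the raw notes.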

Figure out dt_bootstrap.js error with pagination

This happens only on the initial load of Variant data, but not on subsequent paginated server-side loads:

Uncaught TypeError: Cannot read property '_iDisplayStart' of null

Allow merging AlignmentGroups

An AlignmentGroup is defined as a set of related alignments that can all have Variants called together. Specifically, all samples are aligned to the same ReferenceGenome.

It's possible that we run some alignments at different times, but we may want to be able to add subsequent alignment runs to an existing AlignmentGroup so that Variants can be called all together.

When setting Dataset filesystem_location, create class method that ensures the location is relative to MEDIA_ROOT
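
A sketch of the check such a class method might perform, written as a standalone function. The name to_media_relative is hypothetical, and media_root is passed in explicitly here rather than read from Django settings.

```python
import os


def to_media_relative(path, media_root):
    """Return path relative to media_root, or raise if it lies outside it."""
    media_root = os.path.abspath(media_root)
    # os.path.join leaves an absolute path unchanged and anchors a relative one.
    absolute = os.path.abspath(os.path.join(media_root, path))
    if os.path.commonpath([absolute, media_root]) != media_root:
        raise ValueError("%r is not under %r" % (path, media_root))
    return os.path.relpath(absolute, media_root)
```

Routing every assignment of filesystem_location through one function like this keeps absolute paths from a developer's machine out of the database.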

Analyze / act on variants in regions of the genome

Regions could be duplications, etc. We may want to revert all SNPs in a region, or ignore SNPs in a region, etc.

Regions where SNPs cannot be accurately called due to paralogous sequence (non-unique mapping) or no/low read depth should also be marked.

Add support for not-equals operator (!=) in variant filtering

This is not trivial: Django has lookup suffixes for the other comparison operators (e.g. <= is __lte), but it has no __ne lookup. Instead, clients should negate a Q object.
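
A sketch of how a single filter condition could be translated so that != is carried as a negation flag. It runs without Django; the function name, the regex, and the operator table are hypothetical and not the project's filter parser. With Django available, the caller would build Q(**kwargs) and apply ~ when the flag is set.

```python
import re

# Django lookup suffix for each supported operator; != has no suffix.
LOOKUP_SUFFIX = {
    "=": "", "==": "", "<": "__lt", "<=": "__lte", ">": "__gt", ">=": "__gte",
}

# Two-character operators come first so "<=" is not read as "<".
_CONDITION = re.compile(r"^\s*(\w+)\s*(!=|<=|>=|==|=|<|>)\s*(.+?)\s*$")


def condition_to_lookup(condition):
    """Return (lookup_kwargs, negate) for a condition such as 'position != 5'."""
    match = _CONDITION.match(condition)
    if match is None:
        raise ValueError("cannot parse condition: %r" % condition)
    key, op, value = match.groups()
    if op == "!=":
        # Expressed as an equality lookup that the caller negates, e.g. ~Q(**kwargs).
        return {key: value}, True
    return {key + LOOKUP_SUFFIX[op]: value}, False
```

Values are left as strings here; type conversion would depend on the key map for the field.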

Support multiple VariantCallerCommonData objects in a view

Right now, when requesting the melted view with test data that includes a Variant that was both imported and called, I'm getting the error:

Traceback (most recent call last):
  File "/home/glebk/Projects/churchlab/genome-designer-v2/venv/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 115, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/home/glebk/Projects/churchlab/genome-designer-v2/venv/local/lib/python2.7/site-packages/django/contrib/auth/decorators.py", line 25, in _wrapped_view
    return view_func(request, *args, **kwargs)
  File "/home/glebk/Projects/churchlab/genome-designer-v2/genome_designer/../genome_designer/main/xhr_handlers.py", line 93, in get_variant_list
    combined_filter_string, is_melted),
  File "/home/glebk/Projects/churchlab/genome-designer-v2/genome_designer/main/data_util.py", line 32, in lookup_variants
    variant_id_to_metadata_dict))
  File "/home/glebk/Projects/churchlab/genome-designer-v2/genome_designer/main/melt_util.py", line 29, in variant_as_melted_list
    "objects not implemented yet.")
AssertionError: Support for multiple VariantCallerCommonData objects not implemented yet.