kipoi / kipoi-veff
Variant effect prediction plugin for Kipoi
Home Page: https://kipoi.org/veff-docs
License: MIT License
The following command fails silently:
kipoi veff score_variants DeepBind/Homo_sapiens/TF/D00328.018_ChIP-seq_CTCF \
    --dataloader_args='{"fasta_file": "input/hg19.chr22.fa"}' \
    -i input/clinvar_20180429.chr22.pathogenic.vcf.gz \
    -s ref alt diff \
    -o /tmp/annotated.vcf
I executed it in the root of https://github.com/kipoi/examples, where the file input/... doesn't exist.
In all the .py files, we should replace every occurrence of kipoi.postprocessing.variant_effects with kipoi_veff, i.e. the goal is to not depend on any code from kipoi.postprocessing.
Definition of a class that deals with generating mutated DNA sequences given an interval, a variant and an anchor point.
class VariantSeqExtractor:
    def __init__(self, fasta_file):
        ...
    def extract(self, interval, variant, anchor, keep_length=True):
        """
        interval: define a new class Interval()
        variant: define a new class Variant() in analogy to cyvcf2
        keep_length: returned sequence will have the same length as the interval
        ...[AGAC|AGATG]C...
           [ interval ]
                | anchor point
                  GA -> G variant
        If not keep_length:
            return ("AGAC|AGATG", "AGACAGTG")
        else:
            return ("AGAC|AGATG", "AGACAGTGC")
        Returns
            A tuple of sequences with ref and alt
        """
        pass
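The behaviour described in the docstring can be sketched on a plain Python string. This is only an illustration, not the proposed implementation: the function name extract_ref_alt, the 0-based half-open coordinates and the use of a string instead of a FASTA file and Interval/Variant classes are all assumptions made for the example. It only handles variants that lie inside the interval and entirely on one side of the anchor, and it returns the ref sequence without the "|" marker.

```python
def extract_ref_alt(reference, start, end, var_pos, ref_allele, alt_allele,
                    anchor, keep_length=True):
    """Return (ref_seq, alt_seq) for the 0-based, half-open interval [start, end).

    `anchor` is a 0-based coordinate between bases (start <= anchor <= end);
    the base at index `anchor` is the first base to the right of the anchor.
    All names and conventions here are hypothetical.
    """
    var_end = var_pos + len(ref_allele)
    if not start <= anchor <= end:
        raise ValueError("anchor must lie within the interval")
    if var_pos < start or var_end > end:
        raise ValueError("this sketch only handles variants inside the interval")
    if reference[var_pos:var_end] != ref_allele:
        raise ValueError("ref allele does not match the reference sequence")

    ref_seq = reference[start:end]
    left = reference[start:anchor]
    right = reference[anchor:end]

    if var_pos >= anchor:
        # Variant is right of the anchor: the left part stays untouched.
        if keep_length:
            # Pull in extra downstream bases to restore the original length.
            mutated = reference[anchor:var_pos] + alt_allele + reference[var_end:]
            right = mutated[:end - anchor]
        else:
            right = reference[anchor:var_pos] + alt_allele + reference[var_end:end]
    elif var_end <= anchor:
        # Variant is left of the anchor: the right part stays untouched.
        if keep_length:
            # Pull in extra upstream bases to restore the original length.
            mutated = reference[:var_pos] + alt_allele + reference[var_end:anchor]
            left = mutated[max(0, len(mutated) - (anchor - start)):]
        else:
            left = reference[start:var_pos] + alt_allele + reference[var_end:anchor]
    else:
        raise NotImplementedError("variants spanning the anchor are not handled")
    return ref_seq, left + right


# The docstring example: interval AGACAGATG, anchor after AGAC, GA -> G.
reference = "TTAGACAGATGCTT"
print(extract_ref_alt(reference, 2, 11, 7, "GA", "G", 6, keep_length=False))
# ('AGACAGATG', 'AGACAGTG')
print(extract_ref_alt(reference, 2, 11, 7, "GA", "G", 6, keep_length=True))
# ('AGACAGATG', 'AGACAGTGC')
```

Keeping the anchor fixed means a deletion right of the anchor is compensated with downstream bases, and a deletion left of it with upstream bases.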
The command kipoi veff score_variants fails silently if the input VCF passed via -i doesn't exist. I would expect it to throw an error saying the file doesn't exist.
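One possible way to get the expected behaviour is to validate the path before any scoring starts. The helper below is a hypothetical sketch (require_existing_file is not part of kipoi-veff); it only shows the kind of early check that would turn the silent failure into an explicit error.

```python
import os


def require_existing_file(path, description="input file"):
    """Return `path` unchanged, or raise if it is not an existing regular file.

    Hypothetical helper, not an existing kipoi-veff function.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError("{} does not exist: {}".format(description, path))
    return path


# Example: fail early instead of producing no output.
try:
    require_existing_file("input/missing.vcf.gz", "input VCF (-i)")
except FileNotFoundError as err:
    print(err)
```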
A mixin to kipoi.data.*Dataset classes, which defines a function that returns model input for both alleles (and reverse-complementation).
The difference for the MutationDatasetMixin methods is that the key inputs will be replaced with the keys inputs_ref and inputs_alt (optionally also inputs_ref_rc and inputs_alt_rc). These all have the identical structure, and their data correspond to the reference and alternative alleles (optionally also in reverse complement) of the model input data.
So in addition to the current Dataloader output schema:
{
"inputs": <some_obj>,
"targets": <some_obj>,
"metadata": {...}
}
there will be a method returning dictionaries of:
{
"inputs_ref": <some_obj>,
"inputs_alt": <some_obj>,
"inputs_ref_rc": <some_obj>,
"inputs_alt_rc": <some_obj>,
"targets": <some_obj>,
"metadata": {...}
}
All relationships between inputs and metadata etc. have to hold identically for the newly defined inputs_* keys.
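A minimal sketch of the proposed schema change, assuming the mutated (alt) inputs have already been generated elsewhere. The function make_mutation_item and the helper reverse_complement are hypothetical names used only for illustration; inputs are plain strings here, whereas real dataloaders would return arrays.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def make_mutation_item(item, alt_inputs, rc_fn=None):
    """Turn a regular dataloader item into the ref/alt schema.

    `item` follows the current schema ("inputs", "targets", "metadata").
    The "inputs" key is replaced by "inputs_ref" and "inputs_alt", plus the
    optional "*_rc" keys when a reverse-complement function is given.
    """
    out = {
        "inputs_ref": item["inputs"],
        "inputs_alt": alt_inputs,
        "metadata": item.get("metadata", {}),
    }
    if "targets" in item:
        out["targets"] = item["targets"]
    if rc_fn is not None:
        out["inputs_ref_rc"] = rc_fn(item["inputs"])
        out["inputs_alt_rc"] = rc_fn(alt_inputs)
    return out


item = {"inputs": "AGACAGATG", "targets": None, "metadata": {"id": "var1"}}
print(make_mutation_item(item, "AGACAGTGC", rc_fn=reverse_complement))
```

Targets and metadata are passed through untouched, so their relationship to the inputs stays the same for every inputs_* key.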
Core functionality
Interaction with Kipoi
Hi,
I am trying to test the kipoi-veff CLI using the command from the Kipoi paper as an example, but I get an error:
Traceback (most recent call last):
File "/Users/gisela/anaconda3/envs/kipoi-veff/bin/kipoi", line 10, in <module>
sys.exit(main())
File "/Users/gisela/anaconda3/envs/kipoi-veff/lib/python3.6/site-packages/kipoi/__main__.py", line 104, in main
command_fn(args.command, sys.argv[2:])
File "/Users/gisela/anaconda3/envs/kipoi-veff/lib/python3.6/site-packages/kipoi_veff/__main__.py", line 11, in cli_main
kipoi_veff.cli.cli_main(command, raw_args)
File "/Users/gisela/anaconda3/envs/kipoi-veff/lib/python3.6/site-packages/kipoi_veff/cli.py", line 458, in cli_main
command_fn(args.command, raw_args[1:])
File "/Users/gisela/anaconda3/envs/kipoi-veff/lib/python3.6/site-packages/kipoi_veff/cli.py", line 180, in cli_score_variants
dataloader_arguments = parse_json_file_str_or_arglist(args.dataloader_args)
File "/Users/gisela/anaconda3/envs/kipoi-veff/lib/python3.6/site-packages/kipoi_utils/utils.py", line 282, in parse_json_file_str_or_arglist
raise RuntimeError("wrong usage, dataloader_args must be a list")
RuntimeError: wrong usage, dataloader_args must be a list
My command:
kipoi veff score_variants DeepSEA --dataloader_args '{"fasta_file": "~/.vep/homo_sapiens/97_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa"}' -i 'input.vcf' -o 'output.vcf'
I've also tested:
kipoi veff score_variants DeepSEA --dataloader_args='{"fasta_file": "~/.vep/homo_sapiens/97_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa"}' -i 'input.vcf' -o 'output.vcf'
to no avail.
Have score_variants print the output VCF to stdout instead of to a file.
Start from this branch, https://github.com/kipoi/kipoi/tree/stdout, which removes the stdout output of all the other commands (logging, tqdm, etc.).
Currently, logit is confusing and is not consistent with ref, alt -> diff. We should call it logit_diff.
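To illustrate why the name logit_diff would be more consistent, here is a small sketch that computes the four scores side by side. It assumes that diff means alt minus ref and that the logit score is the difference of the logits of the alt and ref predictions; the function names and the clipping constant eps are choices made for this example only.

```python
import math


def logit(p, eps=1e-7):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def score_variant(ref_pred, alt_pred):
    """Return the scores for one variant and one model output (assumed definitions)."""
    return {
        "ref": ref_pred,
        "alt": alt_pred,
        "diff": alt_pred - ref_pred,
        "logit_diff": logit(alt_pred) - logit(ref_pred),
    }


print(score_variant(0.2, 0.8))
```

With this naming, diff and logit_diff are visibly the same operation applied on two different scales.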
Here is the main loop performing:
https://github.com/kipoi/kipoi-veff/blob/master/kipoi_veff/snv_predict.py#L620-L658
for i, batch in enumerate(tqdm(it)):
    ...
    # Step 1. load the data
    eval_kwargs = _generate_seq_sets(dataloader.output_schema, batch, vcf_fh, vcf_id_generator_fn,
                                     seq_to_mut=seq_to_mut, seq_to_meta=seq_to_meta,
                                     sample_counter=sample_counter, vcf_search_regions=vcf_search_regions,
                                     generate_rc=model_info_extractor.use_seq_only_rc,
                                     bed_id_conv_fh=bed_id_conv_fh)
    # Step 2. data preparation for the model
    if generated_seq_writer is not None:
        for writer in generated_seq_writer:
            writer(eval_kwargs)
        # Assume that we don't actually want the predictions to be calculated...
        continue
    if evaluation_function_kwargs is not None:
        assert isinstance(evaluation_function_kwargs, dict)
        for k in evaluation_function_kwargs:
            eval_kwargs[k] = evaluation_function_kwargs[k]
    eval_kwargs["out_annotation_all_outputs"] = model_out_annotation
    # Step 3. Make model prediction
    res_here = evaluation_function(model, output_reshaper=out_reshaper, **eval_kwargs)
    ...
    # Step 4. write the predictions
    if sync_pred_writer is not None:
        for writer in sync_pred_writer:
            writer(res_here, eval_kwargs["vcf_records"], eval_kwargs["line_id"])
Follow this notebook: https://github.com/kipoi/kipoi-veff/blob/write_buffer/notebooks/code-profiling.ipynb
Finish the code in the write buffer PR by speeding up the writing so that it takes a minimal amount of time.
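The general idea of a write buffer can be sketched as follows. This is not the code from the write_buffer branch; BufferedLineWriter and its max_lines parameter are hypothetical and only show how collecting lines and writing them in larger chunks reduces the number of write calls in Step 4 of the loop.

```python
import io


class BufferedLineWriter:
    """Collect lines in memory and write them to `handle` in chunks."""

    def __init__(self, handle, max_lines=1000):
        self.handle = handle
        self.max_lines = max_lines
        self._lines = []

    def write(self, line):
        self._lines.append(line)
        if len(self._lines) >= self.max_lines:
            self.flush()

    def flush(self):
        # One write call for the whole buffer instead of one per line.
        if self._lines:
            self.handle.write("".join(self._lines))
            self._lines = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.flush()


out = io.StringIO()
with BufferedLineWriter(out, max_lines=2) as writer:
    for i in range(5):
        writer.write("record {}\n".format(i))
print(out.getvalue(), end="")
```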
py.test is not executed in the repository root hence file paths cannot be found
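A common way to make tests independent of the working directory is to build data paths relative to the test file instead of the current directory. The helper below is a hypothetical sketch (path_next_to and the example file names are not from the repository).

```python
from pathlib import Path


def path_next_to(anchor_file, *parts):
    """Return an absolute path located relative to the directory of `anchor_file`."""
    return Path(anchor_file).resolve().parent.joinpath(*parts)


# Inside a test module one would typically pass __file__ as the anchor, e.g.
# vcf_path = path_next_to(__file__, "data", "test.vcf")
print(path_next_to("/repo/tests/test_snv_predict.py", "data", "test.vcf"))
```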
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi_veff/snv_predict.py", line 795, in score_variants
return_predictions=return_predictions)
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi_veff/snv_predict.py", line 620, in predict_snvs
for i, batch in enumerate(tqdm(it)):
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/tqdm/_tqdm.py", line 979, in __iter__
for obj in iterable:
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 175, in __next__
return self._process_next_batch(batch)
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 195, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
pyfaidx.FetchError: Traceback (most recent call last):
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 639, in from_file
i = self.index[rname]
KeyError: 'chr1'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 58, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 58, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/dataloaders/sequence.py", line 350, in __getitem__
ret = self.seq_dl[idx]
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/dataloaders/sequence.py", line 238, in __getitem__
seq = self.fasta_extractors.extract(interval)
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/extractors.py", line 50, in extract
seq = str(self.fasta.get_seq(interval.chrom, interval.start + 1, interval.stop, rc=rc).seq)
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 1032, in get_seq
seq = self.faidx.fetch(name, start, end)
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 624, in fetch
seq = self.from_file(name, start, end)
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 642, in from_file
"Please check your FASTA file.".format(rname))
pyfaidx.FetchError: Requested rname chr1 does not exist! Please check your FASTA file.
minimal.vcf
##fileformat=VCFv4.0
##fileDate=20181110
##source=UKBB/variants.tsv.bgz_V3
##reference=GRCh37
#CHROM POS ID REF ALT QUAL FILTER INFO
1 15791 1:15791_C_T C T . . .
1 69487 1:69487_G_A G A . . .
1 69569 1:69569_T_C T C . . .
1 139853 1:139853_C_T C T . . .
1 693731 1:693731_A_G A G . . .
The FASTA file contained the correct chromosome names, e.g. >1
...
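When debugging this kind of error it can help to compare the contig names that are requested with the ones actually present in the FASTA file. The sketch below uses plain text parsing only; the function names are hypothetical, and it does not explain where the chr prefix in the traceback comes from, it only makes the mismatch visible.

```python
def fasta_contigs(lines):
    """Collect sequence names from FASTA header lines (text after '>')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def vcf_contigs(lines):
    """Collect the CHROM column from the non-header lines of a VCF."""
    names = set()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split()[0])
    return names


def missing_contigs(requested, available):
    """Names that are requested but not present in the FASTA."""
    return sorted(set(requested) - set(available))


fasta = [">1 dna:chromosome", "ACGT", ">2", "GGCC"]
vcf = ["##fileformat=VCFv4.0", "#CHROM\tPOS\tID\tREF\tALT", "1\t15791\t1:15791_C_T\tC\tT"]
print(missing_contigs(vcf_contigs(vcf), fasta_contigs(fasta)))  # VCF vs. FASTA
print(missing_contigs(["chr1"], fasta_contigs(fasta)))          # name from the traceback
```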
I am trying to install kipoi-veff in conjunction with DeepSEA and seem to be having problems. I am using an AWS instance with the deep learning AMI. I simply activate the python3 + cuda 10.0 environment and then install kipoi-veff using conda. This seems to install kipoi and kipoi-veff, but when I then try to install DeepSEA into the same environment with
kipoi env install DeepSEA/predict --gpu
I get compatibility issues which eventually break things.
If I am trying to install everything from scratch, how would you recommend I install Kipoi, kipoi-veff, Basenji, and DeepSEA?
The refactoring of model dataloaders implies that the input sequence length and shape cannot be derived from the Dataloader class, but must be derived from the model object.
This refactoring may also mean that the dataloader only returns a numpy array in ['inputs'], while the model defines a list or dictionary of length 1 as input.
In analogy to predict_on_batch(), kipoi_veff should enable a predict_mutations_on_batch() function that will use the inputs_ref, inputs_alt (inputs_ref_rc, inputs_alt_rc) keys returned from the MutationDatasetMixin object to produce 2 (or 4) model prediction outputs.
This may be designed as a mixin for kipoi.model.Model classes:
class PredictMutationsMixin(object):
    def predict_mutations_on_batch(self, x):
        return [self.predict_on_batch(x[k]) for k in x if k.startswith("inputs_")]
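A runnable variation of the mixin, for illustration only. It returns a dictionary keyed by the input name instead of a list, so that the caller can tell the ref and alt predictions apart; that change, and the DummyModel class with its GC-content "prediction", are assumptions made for this sketch rather than part of the proposal.

```python
class PredictMutationsMixin(object):
    def predict_mutations_on_batch(self, x):
        # One prediction per "inputs_*" key, keyed by that name.
        return {k: self.predict_on_batch(x[k])
                for k in sorted(x) if k.startswith("inputs_")}


class DummyModel(PredictMutationsMixin):
    """Stand-in for a kipoi model: predicts the GC fraction of each sequence."""

    def predict_on_batch(self, batch):
        return [(seq.count("G") + seq.count("C")) / len(seq) for seq in batch]


batch = {
    "inputs_ref": ["GGCC", "ATAT"],
    "inputs_alt": ["GGCA", "ATAT"],
    "metadata": {},
}
print(DummyModel().predict_mutations_on_batch(batch))
# {'inputs_alt': [0.75, 0.0], 'inputs_ref': [1.0, 0.0]}
```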
We envision two different ways to implement variant effect prediction for indels:
I'm using the DeepSEA/variantEffects model with MutationMap to try and find the most impactful mutations for a set of sequences. However, I've noticed a discrepancy between the predictions I get for the wild-type sequences when using pipeline.predict vs. query_bed with the ref score.
The pipeline.predict scores are generally high probabilities for CTCF in cell type A549, which is what I'd expect given that my bed file consists of ChIP-seq peaks for that TF/cell line pair. The ref-scored predictions, on the other hand, tend to be really small. A few examples of the difference are:
Predictions from pipeline.predict: [0.96582216 0.03907571 0.09172686 0.19638198 0.13311383 0.64790106
0.67463433 0.94372396 0.64945625 0.98188872]
vs.
Predictions from `ref`-scored MutationMap.query_bed: [-9.167706593871117e-09, -3.85800376534462e-07, 2.0929292077198625e-08, 3.2794196158647537e-07, 3.2247044146060944e-08, 2.3366883397102356e-06, 0.0, 1.3096723705530167e-09, 1.1496013030409813e-09, 0.0]
I'm trying to understand whether this is expected, a result of something I'm doing wrong, or a bug. For context, my code is essentially the following (pared down to make it easier to see the essentials):
import kipoi
from kipoi_veff import MutationMap
random_seqs_fpath = "../dat/ChIPseq.A549.CTCF.100.random.narrowPeak.gz"
dl_kwargs = {'fasta_file': '../dat/hg19.fa'}
predict_dl_kwargs = {'fasta_file': '../dat/hg19.fa', 'intervals_file': random_seqs_fpath}
deepsea = kipoi.get_model("DeepSEA/variantEffects", source="kipoi")
# Predictions from MutationMap
mm = MutationMap(deepsea, deepsea.default_dataloader, dataloader_args=dl_kwargs)
mmp = mm.query_bed(random_seqs_fpath, scores=['ref', 'diff'])
mutation_map_predictions = [
    mmp.mutation_map[i]['seq']['ref']['A549_CTCF_None_720']['mutation_map']
    for i in range(len(mmp.mutation_map))
]
# I also do some post-processing to just get one of the non-zero values from the 4 x 1000 mutation map.
# Predictions from pipeline.predict
pipeline_predictions = deepsea_predict.pipeline.predict(predict_dl_kwargs, batch_size=100)
Note that both of these use the exact same bed file and therefore should be looking at the same sequences.
Am I missing some key reason why I should expect these two prediction arrays to differ dramatically? I am using the default rc merge settings for both (and that wouldn't account for the order-of-magnitude differences anyway). The two best ideas I have for why I might be getting such different results are:
MutationMap.query_bed is re-centering on each currently-being-tested variant.
ref doesn't mean what I think it does.
overview.md to the part relevant for contributing the model and the part relevant for using the models
The VCF reader class is designed to return variant objects compatible with VariantSeqExtractor.extract. It can be used either as an iterator starting from the beginning of the VCF file or as an iterator over variants in a genomic region defined by overlap. This functionality is all built into cyvcf2, so VcfReader is essentially a wrapper that converts VCF records to instances of our own variant class.
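A rough sketch of such a wrapper, under several assumptions: the Variant tuple and its fields are hypothetical, multi-allelic records are split into one Variant per alternative allele, and region queries rely on cyvcf2's region-string call, which needs a bgzipped and indexed VCF. cyvcf2 is imported lazily so that the conversion function can be tried without it.

```python
from collections import namedtuple

# Hypothetical variant class; positions are 1-based as in the VCF.
Variant = namedtuple("Variant", ["chrom", "pos", "ref", "alt", "id"])


def to_variants(record):
    """Convert one cyvcf2-like record into one Variant per alternative allele."""
    return [Variant(record.CHROM, record.POS, record.REF, alt, record.ID)
            for alt in record.ALT]


class VcfReader:
    """Thin wrapper around cyvcf2.VCF that yields Variant instances."""

    def __init__(self, vcf_file):
        from cyvcf2 import VCF  # imported here so the sketch loads without cyvcf2
        self.vcf = VCF(vcf_file)

    def __iter__(self):
        # Iterate over the whole file from the beginning.
        for record in self.vcf:
            yield from to_variants(record)

    def fetch(self, chrom, start, end):
        # Iterate over variants overlapping a region (requires an indexed VCF).
        region = "{}:{}-{}".format(chrom, start, end)
        for record in self.vcf(region):
            yield from to_variants(record)
```

Usage would look like `for variant in VcfReader("input.vcf.gz").fetch("1", 15000, 70000): ...`, with each variant then passed on to VariantSeqExtractor.extract.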
Hey there,
Pretty new to Kipoi and was exploring this particular plugin. Unfortunately, whenever I try to run the score_variants function, I get the following error. Not sure if this is something others have faced in the past, but I would be grateful if you could suggest a possible solution! Thank you!
Error:
NotImplementedError: "sortBed" does not appear to be installed or on the path, so this method is disabled. Please install a more recent version of BEDTools and re-import to use this method.