kipoi / kipoi-veff

Variant effect prediction plugin for Kipoi

Home Page: https://kipoi.org/veff-docs

License: MIT License

Python 47.23% Jupyter Notebook 52.25% Makefile 0.19% Shell 0.33%

kipoi-veff's Issues

Kipoi-veff fails silently when the input files don't exist

The following command fails silently:

kipoi veff score_variants DeepBind/Homo_sapiens/TF/D00328.018_ChIP-seq_CTCF \
    --dataloader_args='{"fasta_file": "input/hg19.chr22.fa"}' \
    -i input/clinvar_20180429.chr22.pathogenic.vcf.gz \
    -s ref alt diff \
    -o /tmp/annotated.vcf

I executed it in the root of https://github.com/kipoi/examples where the file input/... doesn't exist.

Migrate all `kipoi.postprocessing.variant_effects` to `kipoi_veff`

In all the .py files, we should replace every occurrence of kipoi.postprocessing.variant_effects with kipoi_veff

e.g. the goal is to not depend on any code from kipoi.postprocessing
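A minimal sketch of how the renaming could be automated, assuming a plain textual substitution is enough (it does not verify that every imported name actually exists in kipoi_veff). The helper name migrate_imports is made up for illustration.

```python
import re

# Match the old module path only as a whole dotted name.
OLD_MODULE = re.compile(r"\bkipoi\.postprocessing\.variant_effects\b")


def migrate_imports(source):
    """Return `source` with the old module path replaced by kipoi_veff."""
    return OLD_MODULE.sub("kipoi_veff", source)


example = "import kipoi.postprocessing.variant_effects as ve\n"
migrated = migrate_imports(example)
print(migrated, end="")
```

Submodule paths such as kipoi.postprocessing.variant_effects.utils are rewritten to kipoi_veff.utils by the same pattern; whether the submodule layout is identical in kipoi_veff would need to be checked separately.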

indel support: dataloader-utility: VariantSeqExtractor

Definition of a class that can deal with the generation of mutated DNA sequences given an interval, a variant, and an anchor point.

class VariantSeqExtractor:
    def __init__(self, fasta_file):
        ...

    def extract(self, interval, variant, anchor, keep_length=True):
        """
        interval: define a new class Interval()
        variant: define a new class Variant() in analogy to cyvcf2
        keep_length: returned sequence will have the same length as the interval

            ...[AGAC|AGATG]C...
               [ interval ]
                    | anchor point
                      GA -> G variant

        If not keep_length:
            return ("AGAC|AGATG", "AGACAGTG")
        else:
            return ("AGAC|AGATG", "AGACAGTGC")

        Returns
            A tuple of sequences with ref and alt
        """
        pass
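The docstring example can be reproduced with a small sketch on plain strings. This is not the proposed class: the function extract_ref_alt and its 0-based, half-open coordinates are assumptions for illustration, and it only handles a variant that lies inside the interval and downstream of the anchor, so that length is restored by extending on the right.

```python
def extract_ref_alt(genome, start, end, var_pos, ref, alt, keep_length=True):
    """Return (ref_seq, alt_seq) for the interval genome[start:end].

    var_pos is the 0-based position of the first base of the ref allele.
    """
    if genome[var_pos:var_pos + len(ref)] != ref:
        raise ValueError("ref allele does not match the reference sequence")

    ref_seq = genome[start:end]
    alt_seq = genome[start:var_pos] + alt + genome[var_pos + len(ref):end]

    if keep_length:
        target = end - start
        if len(alt_seq) < target:
            # Deletion: pull the missing bases from the right flank.
            alt_seq += genome[end:end + (target - len(alt_seq))]
        else:
            # Insertion: trim the surplus bases on the right.
            alt_seq = alt_seq[:target]
    return ref_seq, alt_seq


# Interval AGACAGATG followed by C, as in the docstring; GA -> G at index 7.
genome = "TT" + "AGACAGATG" + "C" + "TT"
short = extract_ref_alt(genome, 2, 11, 7, "GA", "G", keep_length=False)
padded = extract_ref_alt(genome, 2, 11, 7, "GA", "G", keep_length=True)
print(short, padded)
```

A variant upstream of the anchor would need the mirror-image treatment (extending or trimming on the left), which this sketch leaves out.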

No error message if the input file doesn't exist

The command kipoi veff score_variants fails silently if the input VCF passed via -i doesn't exist. I would expect it to throw an error saying the file doesn't exist.
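A minimal sketch of the kind of up-front check that would produce such an error. The helper check_files_exist is hypothetical and not part of kipoi_veff; it only illustrates validating the paths before any scoring starts.

```python
import os


def check_files_exist(paths):
    """Raise FileNotFoundError listing every path that is not a file."""
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError("Input file(s) not found: " + ", ".join(missing))


try:
    check_files_exist(["input/does-not-exist.vcf.gz"])
    message = None
except FileNotFoundError as exc:
    message = str(exc)
print(message)
```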

indel support: dataloader-utility: MutationDatasetMixin

A mixin to kipoi.data.*Dataset classes, which defines a function that returns model input for both alleles (and reverse-complementation).

The difference for the MutationDatasetMixin methods is that the key inputs is replaced by the keys inputs_ref and inputs_alt (optionally also inputs_ref_rc and inputs_alt_rc). These all have the identical structure, and their data correspond to the reference and alternative alleles (optionally also in reverse complement) of the model input data.

So, in addition to the current Dataloader output schema:

{ 
   "inputs": <some_obj>, 
   "targets": <some_obj>, 
   "metadata": {...}
}

there will be a method returning dictionaries of:

{ 
    "inputs_ref": <some_obj>, 
    "inputs_alt": <some_obj>, 
    "inputs_ref_rc": <some_obj>, 
    "inputs_alt_rc": <some_obj>, 
    "targets": <some_obj>, 
    "metadata": {...}
}

All relationships between inputs and metadata etc. have to hold identically for the newly defined inputs_* keys.
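A toy sketch of such a mixin, under loud assumptions: the method names get_mutation_item and make_alt are invented here, inputs are plain strings rather than encoded arrays, and the dataset applies one single-base substitution per item.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


class MutationDatasetMixin:
    """Hypothetical mixin: expects __getitem__ and make_alt on the host class."""

    def get_mutation_item(self, idx, with_rc=False):
        item = self[idx]
        # Carry over targets and metadata unchanged; only "inputs" is replaced.
        out = {k: v for k, v in item.items() if k != "inputs"}
        out["inputs_ref"] = item["inputs"]
        out["inputs_alt"] = self.make_alt(item["inputs"], idx)
        if with_rc:
            out["inputs_ref_rc"] = reverse_complement(out["inputs_ref"])
            out["inputs_alt_rc"] = reverse_complement(out["inputs_alt"])
        return out


class ToyDataset(MutationDatasetMixin):
    def __init__(self, seqs, variants):
        self.seqs = seqs
        self.variants = variants  # one (position, alt_base) pair per sequence

    def __getitem__(self, idx):
        return {"inputs": self.seqs[idx], "targets": None, "metadata": {"idx": idx}}

    def make_alt(self, inputs, idx):
        pos, alt = self.variants[idx]
        return inputs[:pos] + alt + inputs[pos + 1:]


dataset = ToyDataset(["ACGT"], [(1, "T")])
item = dataset.get_mutation_item(0, with_rc=True)
print(item)
```

Because targets and metadata are copied from the original item, their relationship to the inputs_* keys is the same as for inputs.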

Implementation timeline

Core functionality

Copy core variant effect prediction functionality from Kipoi
Resolve variant-effect related issues here (models with multidimensional outputs, output selection, etc.)
Implement different solutions of indel variant effect prediction

Interaction with Kipoi

CLI needs to be loadable from kipoi_veff package
yaml parsing classes need to be loadable from kipoi_veff package

dataloader_args

Hi,

I am trying to test the kipoi-veff CLI, using the command from the Kipoi paper as an example, but I get an error:

Traceback (most recent call last):
  File "/Users/gisela/anaconda3/envs/kipoi-veff/bin/kipoi", line 10, in <module>
    sys.exit(main())
  File "/Users/gisela/anaconda3/envs/kipoi-veff/lib/python3.6/site-packages/kipoi/__main__.py", line 104, in main
    command_fn(args.command, sys.argv[2:])
  File "/Users/gisela/anaconda3/envs/kipoi-veff/lib/python3.6/site-packages/kipoi_veff/__main__.py", line 11, in cli_main
    kipoi_veff.cli.cli_main(command, raw_args)
  File "/Users/gisela/anaconda3/envs/kipoi-veff/lib/python3.6/site-packages/kipoi_veff/cli.py", line 458, in cli_main
    command_fn(args.command, raw_args[1:])
  File "/Users/gisela/anaconda3/envs/kipoi-veff/lib/python3.6/site-packages/kipoi_veff/cli.py", line 180, in cli_score_variants
    dataloader_arguments = parse_json_file_str_or_arglist(args.dataloader_args)
  File "/Users/gisela/anaconda3/envs/kipoi-veff/lib/python3.6/site-packages/kipoi_utils/utils.py", line 282, in parse_json_file_str_or_arglist
    raise RuntimeError("wrong usage, dataloader_args must be a list")
RuntimeError: wrong usage, dataloader_args must be a list

My command:

kipoi veff score_variants DeepSEA --dataloader_args '{"fasta_file": "~/.vep/homo_sapiens/97_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa"}' -i 'input.vcf' -o 'output.vcf'

I've also tested:

kipoi veff score_variants DeepSEA --dataloader_args='{"fasta_file": "~/.vep/homo_sapiens/97_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa"}' -i 'input.vcf' -o 'output.vcf'

to no avail.
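One detail worth ruling out independently of the traceback: a ~ inside a quoted JSON string is not expanded by the shell, so the path reaches the dataloader literally. The sketch below builds the argument with the home directory already expanded. The helper build_dataloader_args and the short path are hypothetical, and this is not claimed to be the cause of the RuntimeError above.

```python
import json
import os
import shlex


def build_dataloader_args(**paths):
    """Return a JSON string with ~ expanded in every path value."""
    expanded = {key: os.path.expanduser(value) for key, value in paths.items()}
    return json.dumps(expanded)


arg = build_dataloader_args(fasta_file="~/genome.fa")
# shlex.quote makes the JSON safe to paste into a POSIX shell command line.
print("--dataloader_args=" + shlex.quote(arg))
```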

allow `stdout` filename on score_variants

Have score_variants print the output VCF to stdout instead of to a file.

Start from this branch, https://github.com/kipoi/kipoi/tree/stdout, which removes the stdout output of all the other commands (logging, tqdm, etc.).
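A minimal sketch of the filename handling, assuming the special value is the literal string stdout (accepting - as well is an extra assumption). The helper open_output is hypothetical and is not taken from the branch above.

```python
import os
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def open_output(path):
    """Yield sys.stdout for the special names, otherwise a file opened for writing."""
    if path in ("stdout", "-"):
        # Do not close sys.stdout when the block ends.
        yield sys.stdout
    else:
        with open(path, "w") as handle:
            yield handle


# The same writing code works for a regular file ...
tmp_path = os.path.join(tempfile.mkdtemp(), "annotated.vcf")
with open_output(tmp_path) as out:
    out.write("##fileformat=VCFv4.0\n")
with open(tmp_path) as handle:
    written = handle.read()

# ... and for stdout.
with open_output("stdout") as out:
    uses_stdout = out is sys.stdout
```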

Add logit_diff scoring function.

Currently, logit is confusing and is not consistent with: ref, alt -> diff. We should call it logit_diff
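To make the naming argument concrete, here is a sketch that assumes logit_diff means the difference of the logit-transformed alt and ref predictions, in the same way that diff is alt minus ref. The clipping constant eps is an assumption added to avoid log(0).

```python
import math


def logit(p, eps=1e-7):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def diff(ref, alt):
    return [a - r for r, a in zip(ref, alt)]


def logit_diff(ref, alt):
    return [logit(a) - logit(r) for r, a in zip(ref, alt)]


ref_pred = [0.5, 0.5]
alt_pred = [0.5, 0.75]
print(diff(ref_pred, alt_pred), logit_diff(ref_pred, alt_pred))
```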

Speedup writing to file

- [x] buffer writes - #21 (e.g. don't write predictions to disk on every batch but only every now and then)

this should be implemented in #21
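A generic sketch of write buffering, not the code from #21: BufferedWriter is a hypothetical wrapper that collects batches and hands them to the underlying write function only every flush_every calls.

```python
class BufferedWriter:
    """Collect items and pass them to write_fn in groups."""

    def __init__(self, write_fn, flush_every=10):
        self.write_fn = write_fn
        self.flush_every = flush_every
        self._buffer = []

    def __call__(self, batch):
        self._buffer.append(batch)
        if len(self._buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if self._buffer:
            self.write_fn(self._buffer)
            self._buffer = []

    def close(self):
        # Write whatever is left at the end of the run.
        self.flush()


written = []
writer = BufferedWriter(written.append, flush_every=3)
for batch_id in range(7):
    writer(batch_id)
before_close = [list(group) for group in written]
writer.close()
print(written)
```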

- [ ] use asynchronous writes

Here is the main loop performing:

data-loading
data preparation for the model
model prediction
Prediction writing to file

https://github.com/kipoi/kipoi-veff/blob/master/kipoi_veff/snv_predict.py#L620-L658

    for i, batch in enumerate(tqdm(it)):
        ...
        # Step 1. load the data
        eval_kwargs = _generate_seq_sets(dataloader.output_schema, batch, vcf_fh, vcf_id_generator_fn,
                                         seq_to_mut=seq_to_mut, seq_to_meta=seq_to_meta,
                                         sample_counter=sample_counter, vcf_search_regions=vcf_search_regions,
                                         generate_rc=model_info_extractor.use_seq_only_rc,
                                         bed_id_conv_fh=bed_id_conv_fh)

        # Step 2.  data preparation for the model
        if generated_seq_writer is not None:
            for writer in generated_seq_writer:
                writer(eval_kwargs)
            # Assume that we don't actually want the predictions to be calculated...
            continue

        if evaluation_function_kwargs is not None:
            assert isinstance(evaluation_function_kwargs, dict)
            for k in evaluation_function_kwargs:
                eval_kwargs[k] = evaluation_function_kwargs[k]

        eval_kwargs["out_annotation_all_outputs"] = model_out_annotation


        # Step 3. Make model prediction
        res_here = evaluation_function(model, output_reshaper=out_reshaper, **eval_kwargs)
     
        ....

        # Step 4. write the predictions
        if sync_pred_writer is not None:
            for writer in sync_pred_writer:
                writer(res_here, eval_kwargs["vcf_records"], eval_kwargs["line_id"])

this is the main loop performing model prediction
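For the asynchronous variant, one option is to move Step 4 onto a background thread fed by a bounded queue, so that the loop can continue with the next batch while the previous one is written. The AsyncWriter below is a hypothetical sketch and not part of kipoi_veff; it wraps any callable with the same arguments as the writers in the loop.

```python
import queue
import threading


class AsyncWriter:
    """Run write_fn(*args) on a worker thread, preserving call order."""

    _STOP = object()

    def __init__(self, write_fn, maxsize=8):
        self._write_fn = write_fn
        self._queue = queue.Queue(maxsize=maxsize)  # bounded: limits memory use
        self._error = None
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            args = self._queue.get()
            if args is self._STOP:
                break
            try:
                self._write_fn(*args)
            except Exception as exc:  # remembered and re-raised in the caller
                self._error = exc

    def __call__(self, *args):
        if self._error is not None:
            raise self._error
        self._queue.put(args)  # blocks when the queue is full

    def close(self):
        self._queue.put(self._STOP)
        self._thread.join()
        if self._error is not None:
            raise self._error


results = []
writer = AsyncWriter(lambda preds, line_id: results.append((preds, line_id)))
for i in range(5):
    writer([i * 0.1], i)
writer.close()
print(results)
```

A single worker thread keeps the records in the order they were produced, which matters for a VCF output. Whether this actually saves time depends on how much of the write is spent outside the interpreter lock, so it should be measured rather than assumed.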

- [ ] setup some standardized benchmarks to test the overhead
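A small sketch of how the overhead of each step could be measured, assuming wall-clock time per step is the quantity of interest. StepTimer is a hypothetical helper; the two timed bodies are placeholders for the real prediction and writing steps.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepTimer:
    """Accumulate wall-clock time per named step."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def fractions(self):
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: value / total for name, value in self.totals.items()}


timer = StepTimer()
for _ in range(3):
    with timer.step("predict"):
        sum(range(1000))      # placeholder for the model prediction
    with timer.step("write"):
        time.sleep(0.001)     # placeholder for writing to file
shares = timer.fractions()
print(shares)
```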

Tasks

Follow this notebook: https://github.com/kipoi/kipoi-veff/blob/write_buffer/notebooks/code-profiling.ipynb

Finish the code on the write buffer PR so that writing takes a minimal amount of time.

restrict version to kipoi >=0.4

Make a duplicate of variants.vcf when testing

py.test is not executed in the repository root, hence file paths cannot be found

Error when chromosome names don't start with `chr`

  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi_veff/snv_predict.py", line 795, in score_variants
    return_predictions=return_predictions)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi_veff/snv_predict.py", line 620, in predict_snvs
    for i, batch in enumerate(tqdm(it)):
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/tqdm/_tqdm.py", line 979, in __iter__
    for obj in iterable:
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 175, in __next__
    return self._process_next_batch(batch)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 195, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
pyfaidx.FetchError: Traceback (most recent call last):
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 639, in from_file
    i = self.index[rname]
KeyError: 'chr1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 58, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 58, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/dataloaders/sequence.py", line 350, in __getitem__
    ret = self.seq_dl[idx]
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/dataloaders/sequence.py", line 238, in __getitem__
    seq = self.fasta_extractors.extract(interval)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/extractors.py", line 50, in extract
    seq = str(self.fasta.get_seq(interval.chrom, interval.start + 1, interval.stop, rc=rc).seq)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 1032, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 624, in fetch
    seq = self.from_file(name, start, end)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 642, in from_file
    "Please check your FASTA file.".format(rname))
pyfaidx.FetchError: Requested rname chr1 does not exist! Please check your FASTA file.

minimal.vcf

##fileformat=VCFv4.0
##fileDate=20181110
##source=UKBB/variants.tsv.bgz_V3
##reference=GRCh37
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1    15791   1:15791_C_T     C       T       .       .       .
1    69487   1:69487_G_A     G       A       .       .       .
1    69569   1:69569_T_C     T       C       .       .       .
1    139853  1:139853_C_T    C       T       .       .       .
1    693731  1:693731_A_G    A       G       .       .       .

The FASTA file contained the correct chromosome names, e.g. >1...
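The traceback shows chr1 being requested from a FASTA file whose contigs are named without the prefix. A sketch of a tolerant lookup is below; match_chrom_name is a hypothetical helper, not the fix applied in kipoi_veff, and it only handles the presence or absence of the chr prefix (not cases such as MT versus chrM).

```python
def match_chrom_name(chrom, fasta_contigs):
    """Return the spelling of `chrom` that exists among the FASTA contig names."""
    if chrom in fasta_contigs:
        return chrom
    # Try the other naming convention: strip or add the "chr" prefix.
    other = chrom[3:] if chrom.startswith("chr") else "chr" + chrom
    if other in fasta_contigs:
        return other
    raise KeyError(
        "Chromosome {!r} not found in FASTA (also tried {!r})".format(chrom, other)
    )


contigs = {"1", "2", "X"}
resolved = match_chrom_name("chr1", contigs)
print(resolved)
```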

compatibility issues with installation of deepsea

I am trying to install Kipoi-Veff in conjunction with DeepSea and seem to be having problems. I am using an AWS instance with the deep learning AMI. I simply activate the python3 + cuda 10.0 environment and then install kipoi-veff using conda. This seems to install kipoi and kipoi-veff, but when I then try to install DeepSea into the same environment

kipoi env install DeepSEA/predict --gpu

I am getting compatibility issues which eventually break things.

If I am trying to install everything from scratch, how would you recommend I install Kipoi, Kipoi-veff, Basenji, and DeepSea?

update ModelInfoExtractor

The refactoring of model dataloaders means that the input sequence length and shape can no longer be derived from the Dataloader class; they must be derived from the model object.

The refactoring may also lead to the dataloader returning only a numpy array in ['inputs'] while the model defines a list or dictionary of length 1 as its input.
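The second point can be illustrated with a sketch that wraps a bare array so that it matches a length-1 list or dictionary schema. The function wrap_like_schema and the way the schema is represented (a plain list or dict) are assumptions for illustration, not the ModelInfoExtractor API.

```python
def wrap_like_schema(inputs, schema):
    """Wrap a bare input so its container type matches a length-1 schema."""
    if isinstance(schema, dict):
        if isinstance(inputs, dict):
            return inputs
        if len(schema) != 1:
            raise ValueError("cannot map a single array onto several inputs")
        (key,) = schema
        return {key: inputs}
    if isinstance(schema, (list, tuple)):
        if isinstance(inputs, (list, tuple)):
            return inputs
        if len(schema) != 1:
            raise ValueError("cannot map a single array onto several inputs")
        return [inputs]
    # Schema describes a single array: nothing to wrap.
    return inputs


array = object()  # stands in for a numpy array
as_dict = wrap_like_schema(array, {"seq": None})
as_list = wrap_like_schema(array, [None])
as_is = wrap_like_schema(array, None)
```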

indel support: predict_mutations_on_batch()

In analogy to predict_on_batch(), kipoi_veff should provide a predict_mutations_on_batch() function that uses the inputs_ref and inputs_alt (plus inputs_ref_rc and inputs_alt_rc) keys returned from the MutationDatasetMixin object to produce 2 (or 4) model prediction outputs.

This may be designed as a mixin for kipoi.model.Model classes:

class PredictMutationsMixin(object):
    def predict_mutations_on_batch(self, x):
        return [self.predict_on_batch(x[k]) for k in x if k.startswith("inputs_")]
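A possible variation, shown as a sketch: returning a dictionary keyed by the input name keeps track of which prediction belongs to which allele, whereas the list above relies on the key order of x. DummyModel is a stand-in whose predict_on_batch just returns sequence lengths.

```python
class PredictMutationsMixin:
    def predict_mutations_on_batch(self, x):
        # One prediction per inputs_* key, labelled with that key.
        return {
            key: self.predict_on_batch(value)
            for key, value in x.items()
            if key.startswith("inputs_")
        }


class DummyModel(PredictMutationsMixin):
    def predict_on_batch(self, x):
        return [len(seq) for seq in x]


batch = {
    "inputs_ref": ["ACGT", "AC"],
    "inputs_alt": ["ACG", "ACT"],
    "metadata": {"ids": ["v1", "v2"]},
}
predictions = DummyModel().predict_mutations_on_batch(batch)
print(predictions)
```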

indel support overview

We envision two different ways to implement variant effect prediction for indels:

A dataloader-based approach that relies on the implementation of a dataloader method that returns full sets of model inputs for the reference and the alternative (+ r.c.) allele of a variant.
A wrapper-based approach similar to how SNVs are handled. Indels will be handled by generating short reference sequences for the individual alleles and modifying all dataloader input files accordingly. This is envisioned to be performed using g2gtools (there have been recent efforts to make it work with python3).

Results differ unexpectedly between `pipeline.predict` and `query_bed` with `ref` score

I'm using the DeepSEA/variantEffects model with MutationMap to try and find the most impactful mutations for a set of sequences. However, I've noticed a discrepancy between the predictions I get for the wild-type sequences when using pipeline.predict vs. query_bed with the ref score.

The pipeline.predict scores are generally high probabilities for CTCF in cell type A549, which is what I'd expect given that my bed file consists of ChIP-seq peaks for that TF/cell line pair. The ref scored predictions, on the other hand, tend to be really small. A few examples of the difference are:

Predictions from pipeline.predict: [0.96582216 0.03907571 0.09172686 0.19638198 0.13311383 0.64790106
 0.67463433 0.94372396 0.64945625 0.98188872]

		vs.

Predictions from `ref`-scored MutationMap.query_bed: [-9.167706593871117e-09, -3.85800376534462e-07, 2.0929292077198625e-08, 3.2794196158647537e-07, 3.2247044146060944e-08, 2.3366883397102356e-06, 0.0, 1.3096723705530167e-09, 1.1496013030409813e-09, 0.0]

I'm trying to understand whether this is expected, a result of something I'm doing wrong, or a bug. For context, my code is essentially the following (pared down to make it easier to see the essentials):

import kipoi

random_seqs_fpath = "../dat/ChIPseq.A549.CTCF.100.random.narrowPeak.gz"
dl_kwargs = {'fasta_file': '../dat/hg19.fa'}
predict_dl_kwargs = {'fasta_file': '../dat/hg19.fa', 'intervals_file': random_seqs_fpath}
deepsea = kipoi.get_model("DeepSEA/variantEffects", source="kipoi")

# Predictions from MutationMap
mm = MutationMap(deepsea, deepsea.default_dataloader, dataloader_args=dl_kwargs)
mmp = mm.query_bed(random_seqs_fpath, scores=['ref', 'diff'])
mutation_map_predictions = [
    mmp.mutation_map[i]['seq']['ref']['A549_CTCF_None_720']['mutation_map']
    for i in range(len(mmp.mutation_map))
]
# I also do some post-processing to just get one of the non-zero values from the 4 x 1000 mutation map.

# Predictions from pipeline.predict
# (deepsea_predict is a second model object; its loading is not shown in this pared-down code)
pipeline_predictions = deepsea_predict.pipeline.predict(predict_dl_kwargs, batch_size=100)

Note that both of these use the exact same bed file and therefore should be looking at the same sequences.

Am I missing some key reason why I should expect these two prediction arrays to differ dramatically? I am using the default rc merge settings for both (and that wouldn't account for the order-of-magnitude differences anyway). The two best ideas I have for why I might be getting such different results are:

Contrary to my understanding of the docs, MutationMap.query_bed is re-centering on each currently-being-tested variant.
ref doesn't mean what I think it does.

Docs improvement

don't include long tqdm output in the .ipynb notebooks

03:30,  1.44it/s]
304it [03:31,  1.44it/s]
305it [03:31,  1.44it/s]
306it [03:32,  1.44it/s]
307it [03:33,  1.44it/s]
...

split the individual long documents into multiple logical sections / units
- split overview.md into a part relevant for contributing models and a part relevant for using the models
rename the .md documents (e.g. overview etc)

indel support: dataloader-utility:VcfReader

The VCF reader class is designed to return variant objects compatible with VariantSeqExtractor.extract. It can be used either as an iterator starting from the beginning of the VCF file or as an iterator over the variants in a genomic region, defined by overlap. This functionality is all built into cyvcf2, so VcfReader is essentially a wrapper that converts VCF records to instances of our own variant class.
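The two access patterns can be sketched without cyvcf2. The stand-in below parses plain VCF text lines itself, which is a deliberate simplification and not the proposed wrapper; the Variant fields and the fetch signature (0-based, half-open region) are assumptions for illustration. The records are taken from minimal.vcf above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int      # 1-based, as in the VCF file
    ref: str
    alt: str
    id: str = "."


class VcfReader:
    """Toy reader over VCF text lines; one Variant per alternative allele."""

    def __init__(self, lines):
        self._variants = []
        for line in lines:
            if not line.strip() or line.startswith("#"):
                continue
            chrom, pos, vid, ref, alts = line.split()[:5]
            for alt in alts.split(","):
                self._variants.append(Variant(chrom, int(pos), ref, alt, vid))

    def __iter__(self):
        return iter(self._variants)

    def fetch(self, chrom, start, end):
        """Yield variants whose ref allele overlaps the region [start, end)."""
        for variant in self._variants:
            v_start = variant.pos - 1
            v_end = v_start + len(variant.ref)
            if variant.chrom == chrom and v_start < end and v_end > start:
                yield variant


vcf_lines = [
    "##fileformat=VCFv4.0",
    "#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO",
    "1    15791   1:15791_C_T     C       T       .       .       .",
    "1    69487   1:69487_G_A     G       A       .       .       .",
    "1    69569   1:69569_T_C     T       C       .       .       .",
]
reader = VcfReader(vcf_lines)
all_variants = list(reader)
in_region = list(reader.fetch("1", 69000, 70000))
print(len(all_variants), [v.id for v in in_region])
```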

Use HDF5BatchWriter instead of deepdish when writing the results for hdf5 files

https://github.com/kipoi/kipoi-veff/blob/master/kipoi_veff/cli.py#L241

Getting a NotImplementedError: "sortBed" does not appear to be installed or on the path... error when running score_variants

Hey there,

I'm pretty new to Kipoi and was exploring this particular plugin. Unfortunately, whenever I try to run the score_variants function, I get the following error. Not sure if this is something others have faced in the past, but I would be grateful if you could suggest a possible solution! Thank you!

Error:

NotImplementedError: "sortBed" does not appear to be installed or on the path, so this method is disabled. Please install a more recent version of BEDTools and re-import to use this method.
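As the message says, the error points to the sortBed executable not being found on the path. A quick way to check this from within the same Python environment is sketched below; find_bedtools is a hypothetical helper, and which executable names a given BEDTools installation provides is an assumption.

```python
import shutil


def find_bedtools(programs=("bedtools", "sortBed")):
    """Map each program name to its location on PATH, or None if not found."""
    return {name: shutil.which(name) for name in programs}


locations = find_bedtools()
for name, location in locations.items():
    print(name, "->", location or "not found on PATH")
```

If both entries are reported as not found, BEDTools is either not installed in the active environment or its directory is missing from PATH.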
