fraenkel-lab / garnet

Home Page: https://fraenkel-lab.github.io/GarNet/

Python 12.50% HTML 75.56% Jupyter Notebook 11.94%
epigenetics transcriptome omics data-integration

garnet's People

Contributors

alexlenail, amandakedaigle

garnet's Issues

Idea for speed optimization.

Outlining one way to speed up vanilla Garnet. Won't implement now but probably worth looking into when we decide on a version of Garnet we like.


From our discussion last week, it appeared the only way to get accurate motif matches is to calculate the background from user-specified open chromatin regions, then run motif searching, all on the fly. We concluded we may have to compromise and just generate a motif file within windows around every TSS. But the motif matching algorithm is different from what I thought. Specifically, here are the steps:

  1. Calculate score threshold from user-defined p-value and background.
  2. Scan sequences for any subsequences whose scores exceed this threshold. Note that the score is determined only by the subsequence and the PWM, not the background.

This means we can do a lot of the preprocessing work beforehand. We can generate a motif file with a low threshold, and every time we run garnet, we only have to calculate the thresholds we want and then filter the motif file against them. This would be really fast, though the motif file would be very large (probably 10s of GB).
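A minimal sketch of the filtering step described above, assuming the precomputed motif file has already been parsed into records carrying a motif name and a score. The `MotifHit` record, the `filter_hits` function, and the per-motif `thresholds` mapping are hypothetical names, not part of GarNet, and computing the thresholds from a p-value and background is left out.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class MotifHit(NamedTuple):
    """One precomputed motif match (hypothetical record layout)."""
    chrom: str
    start: int
    end: int
    motif: str
    score: float


def filter_hits(hits: Iterable[MotifHit],
                thresholds: Dict[str, float]) -> Iterator[MotifHit]:
    """Yield only hits whose score exceeds the threshold for their motif.

    Motifs without an entry in `thresholds` are dropped.
    """
    for hit in hits:
        threshold = thresholds.get(hit.motif)
        if threshold is not None and hit.score > threshold:
            yield hit


# Example data is made up for illustration.
hits = [
    MotifHit("chr1", 100, 110, "TF_A", 8.2),
    MotifHit("chr1", 300, 310, "TF_A", 5.1),
    MotifHit("chr2", 50, 62, "TF_B", 9.0),
]
kept = list(filter_hits(hits, {"TF_A": 6.0, "TF_B": 9.5}))
print(kept)
```

Because the filter is a generator, a very large motif file could be streamed through it line by line rather than loaded at once.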

check files exist

Easy one :)

In map_peaks (and similar), can we run a check that all input files exist before attempting to open the garnet_file? Since that step takes a while to run, I'd rather realize I have a typo before waiting for that.
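One way such a check could look, as a sketch only. `check_files_exist` is a hypothetical helper name, and where it would be called inside map_peaks is an assumption.

```python
import os


def check_files_exist(*paths):
    """Raise FileNotFoundError naming every path that is not an existing file."""
    missing = [path for path in paths if not os.path.isfile(path)]
    if missing:
        raise FileNotFoundError("Missing input file(s): " + ", ".join(missing))
```

Calling this on all input paths at the top of the function reports every typo at once, before the slow garnet_file load starts.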

invalid syntax

Working out of this folder: /nfs/latdata/iamjli/ALS/GarNet/

Did the following on node19:

git clone <garnet repository>
virtualenv venv
virtualenv -p /usr/bin/python3 venv
source venv/bin/activate
pip install -r requirements.txt

All requirements downloaded successfully with no error. I loaded all relevant data and from src/ ran

python testing.py

Got this error:

Traceback (most recent call last):
  File "testing.py", line 3, in <module>
    from garnet import *
  File "/nfs/latdata/iamjli/ALS/GarNet/src/garnet.py", line 303
    motifs_and_genes = [{**motif, **gene, **peak} for peak, genes, motifs in peaks_with_associated_genes_and_motifs for gene in genes for motif in motifs]
                          ^
SyntaxError: invalid syntax

Then tried creating an instance of python3 and importing garnet, but got the same error.
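The caret points into the `{**motif, **gene, **peak}` expression. Dictionary unpacking inside a dict display was added in Python 3.5 (PEP 448), so one possible explanation, not confirmed in this thread, is that the interpreter actually running is older than 3.5. A quick check, plus a hypothetical `merge_dicts` helper that gives the same result on older interpreters:

```python
import sys


def merge_dicts(*dicts):
    """Merge dicts left to right; later keys win, like {**a, **b}."""
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged


# Confirm which interpreter is actually running inside the virtualenv.
print(sys.version_info[:3])
print(merge_dicts({"motif": "M1"}, {"gene": "G1"}, {"peak": "P1"}))
```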

Include TF "targets" in output

The output of Garnet right now is a file with three columns: "Transcription Factor Slope P-Value". I'd like to have an option to include a fourth, "Targets", which would have a list of the genes used to predict this TF, i.e. what genes were near that TF's motifs and fit the linear regression?

My use of "fit" here is broad and we'll have to figure out what we want that to mean. Just any gene included in the regression step, or do we want to exclude outliers somehow?
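A sketch of the broadest reading, "any gene included in the regression step", assuming the (TF, gene) pairs that enter the regression are available as an iterable. `collect_targets` is a hypothetical name, and outlier exclusion is not attempted.

```python
from collections import defaultdict


def collect_targets(tf_gene_pairs):
    """Map each TF to the sorted, de-duplicated list of genes paired with it."""
    targets = defaultdict(set)
    for tf, gene in tf_gene_pairs:
        targets[tf].add(gene)
    return {tf: sorted(genes) for tf, genes in targets.items()}


# Gene and TF names below are placeholders.
pairs = [("TF_A", "GENE2"), ("TF_A", "GENE1"), ("TF_B", "GENE1"), ("TF_A", "GENE2")]
print(collect_targets(pairs))
```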

Peak directionality

If group A has peak X and group B doesn't, this should be treated differently than if group B has peak X and group A doesn't.

TODO: update garnet construction function

construct garnet function cannot handle large motif files

Replace with

cat $reference \
| bedtools slop -b 10000 -g /nfs/genomes/human_gp_feb_09/hg19.chrom.sizes \
| sortBed \
| bedtools intersect -a - -b <( cat $motifs ) -wa -wb -sorted \
| awk 'BEGIN {FS="\t"; OFS="\t"} {print $7,$8,$9,$10,$11,$12,$4,$2,$3,int(($8+$9)/2-($2+$3)/2)}' \
> garnetDB.cisBP.hg19.normalized.10kb.tsv

Create 10kb windows, sort, intersect with motifs file, add distance column and reorder, output.
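For reference, the distance column from the awk step can be written in Python as below. This is a sketch that assumes columns 2-3 hold the window bounds after `bedtools slop` and columns 8-9 hold the motif bounds; `motif_gene_distance` is a hypothetical function name.

```python
def motif_gene_distance(window_start, window_end, motif_start, motif_end):
    """Motif midpoint minus window midpoint, truncated toward zero like awk's int()."""
    return int((motif_start + motif_end) / 2 - (window_start + window_end) / 2)


print(motif_gene_distance(0, 20000, 12000, 12010))
```

Note that the window midpoint only equals the original TSS midpoint when `bedtools slop` did not clip the window at a chromosome boundary.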

MotifMap has transfac IDs

@AmandaKedaigle As discussed, we want to make sure we have a coherent list of symbols in GarNet's default motif file. We're concerned that MotifMap seems to have both GeneSymbols of TFs and Transfac IDs. It's unclear to me whether Transfac IDs are a proper superset of TF gene symbols, a subset, or whether the two are just partially overlapping sets. Either way, we should use only one of them, and make sure we can match them to the names passed in with the expression file we receive for TF_regression.

  • Figure out the relationship of Transfac ID and GeneSymbol.
  • Email MotifMap people asking for rationale of including both Transfac ID and GeneSymbol.
  • "Repair" MotifMap file so that the TF_names are from a single Namespace.
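The first bullet could start from a simple set comparison. This sketch assumes both name lists have been extracted from the MotifMap file as plain strings that can be compared directly; `describe_relationship` is a hypothetical helper.

```python
def describe_relationship(transfac_ids, gene_symbols):
    """Classify how two collections of names relate as sets."""
    a, b = set(transfac_ids), set(gene_symbols)
    if a == b:
        return "identical"
    if a > b:
        return "proper superset"
    if a < b:
        return "proper subset"
    if a & b:
        return "partially overlapping"
    return "disjoint"


# Placeholder names, not real MotifMap entries.
print(describe_relationship({"X1", "X2", "X3"}, {"X2", "X4"}))
```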

Let me know whether you agree with this and if it makes any sense =)

figure out merging logic

Can I use a dataframe? If not, should I use a dict?

This also covers adding the new 'intergenic' vs 'promoter' and 'distance' columns.

Garnet file stuck loading

Tried loading /nfs/latdata/alex/garnet_data/hg19.garnet.pickle into map_peaks, and it's been running for a very long time (24h+). Not sure if the computer's even trying anymore :[

Submitted as a job: wqsub python map_peaks_job.py out of this directory: /nfs/latdata/iamjli/ALS/TF_prediction/src.

Also tried loading in a Python instance, loaded for 4+ hours before connection broke.

TF regression drop duplicates

I'm unclear why duplicates are being removed. It's possible that a gene has the same motif more than once nearby, and I believe these should be considered separately. Or at least keep the one that's closer to the TSS or something.

if 'geneName' in motifs_genes_and_expression_levels.columns:
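A sketch of the "keep the one that's closer to the TSS" alternative, assuming each row is a dict with `geneName`, `motifName`, and `motif_gene_distance` keys. `keep_closest_per_gene_motif` is a hypothetical name and this is not what GarNet currently does.

```python
def keep_closest_per_gene_motif(rows):
    """For each (gene, motif) pair keep the row with the smallest absolute distance."""
    closest = {}
    for row in rows:
        key = (row["geneName"], row["motifName"])
        best = closest.get(key)
        if best is None or abs(row["motif_gene_distance"]) < abs(best["motif_gene_distance"]):
            closest[key] = row
    return list(closest.values())


# Placeholder rows for illustration.
rows = [
    {"geneName": "G1", "motifName": "M1", "motif_gene_distance": -800},
    {"geneName": "G1", "motifName": "M1", "motif_gene_distance": 150},
    {"geneName": "G1", "motifName": "M2", "motif_gene_distance": 4000},
]
print(keep_closest_per_gene_motif(rows))
```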

Bedtools can perform map_peaks function in ~1 min

We should probably use bedtools for large genome manipulations since it's super fast. I'm pretty sure it doesn't even need to load the entire motif file in either, but I haven't explicitly tested memory requirements. There's even a python package.

Here's the workflow:

Bedtools intersect notes

Tutorial: bedtools Tutorial
Test directory: /nfs/latdata/iamjli/test_projects/bedtools/test
Visualize bed file in IGV

Command to find full overlap:
bedtools intersect -a test_motifs.bed -b test_exons.bed -wa -f 1
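As I understand it, `-f 1` requires the overlap to cover 100% of each A interval, so an A interval is only reported when it lies entirely inside a B interval. The same condition in plain Python, as a sketch using half-open BED coordinates (`fully_contained` is a hypothetical name, and this says nothing about how bedtools implements it):

```python
def fully_contained(a_start, a_end, b_start, b_end):
    """True when interval A lies entirely inside interval B (half-open coordinates)."""
    return b_start <= a_start and a_end <= b_end


print(fully_contained(120, 130, 100, 200))  # motif inside exon
print(fully_contained(195, 205, 100, 200))  # motif runs past the exon end
```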

Sorting will speed things up: add -sorted flag

If not sorted: sort -k1,1 -k2,2n foo.bed > foo.sort.bed

Intersecting GarNet motifs

Motif file: /nfs/latdata/iamjli/test_projects/bedtools/garnet_data/old/garnetDB.tsv (3.3 gb)

First sort (7 min):
sort -k1,1 -k2,2n garnet_data/old/garnetDB.tsv > garnetDB.sort.bed

Sort without header:
(tail -n +2 garnet_data/old/garnetDB.tsv | sort -k1,1 -k2,2n) > garnetDB.sorted.bed

Sort ATAC-seq DOS bed file from /nfs/latdata/iamjli/ALS/ATAC-seq/iMNs_ALS_SMA_CTR_from_iMPs/diffBind_042617:
(tail -n +2 diffSites_ALS_CTR.stripped.txt | sort -k1,1 -k2,2n) > diffSites_ALS_CTR.sorted.bed

Find intersection (73 s):
bedtools intersect -a garnetDB.sorted.bed -b diffSites_ALS_CTR.sorted.bed -wa -f 1 > overlapping_motifs.tsv

With -sorted flag (77 s):
bedtools intersect -a garnetDB.sorted.bed -b diffSites_ALS_CTR.sorted.bed -wa -f 1 -sorted > overlapping_motifs.fast.tsv

Window optional param

@zfrenchee Could you add a few lines that let this function take window size as an optional parameter? I'm not too familiar with the syntax myself. Thanks!

def construct_garnet_file(reference_file, motif_file_or_files, output_file, options=dict()):

Specifically, do 2kb window if window size is not specified.
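A sketch of how the default could be resolved, assuming the window size is passed through the `options` dict under a key named `window_size` (the key name and the `get_window_size` helper are hypothetical). It also uses `None` rather than `dict()` as the default argument, which avoids sharing one mutable dict across calls.

```python
DEFAULT_WINDOW_SIZE = 2000  # 2kb, used when no window size is given


def get_window_size(options=None):
    """Return the window size from options, falling back to the 2kb default."""
    options = options or {}
    return int(options.get("window_size", DEFAULT_WINDOW_SIZE))


print(get_window_size())
print(get_window_size({"window_size": 10000}))
```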

TF Regression Error

While trying to run Garnet on the cluster, I got this error for the TF regression part:

Traceback (most recent call last):
  File "run_new_garnet.py", line 14, in <module>
    df_reg = TF_regression(df, expression, options)
  File "/home/nlpm/.local/lib/python3.6/site-packages/GarNet/garnet.py", line 337, in TF_regression
    motifs_genes_and_expression_levels = motifs_and_genes_dataframe.merge(expression_dataframe, left_on='geneSymbol', right_on='name', how='inner')
AttributeError: 'str' object has no attribute 'merge'
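The traceback shows that `motifs_and_genes_dataframe` was a `str` when `.merge` was called. One plausible cause, not confirmed here, is that a file path was passed where a loaded DataFrame was expected. A hypothetical guard that would surface this earlier with a clearer message:

```python
def require_table(value, name):
    """Fail early when a path string is passed instead of a DataFrame-like object."""
    if isinstance(value, str):
        raise TypeError(
            "{} must be a loaded DataFrame, got the string {!r}; "
            "load the file first".format(name, value)
        )
    if not hasattr(value, "merge"):
        raise TypeError("{} has no .merge method: {!r}".format(name, type(value)))
    return value
```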

Map Motifs to Gene Symbols

Currently, it seems all the motifs' names are not in a format that is compatible with other analyses (forest, in particular). To address this, it'd be great if someone could create a mapping file for each motif; or map the motifs' names before they are fed into Garnet.

Comparison: Garnet file generation

I'm creating this just to keep track of some differences between the IntervalTree and BedTools methods of generating Garnet files. This is based off this notebook.

  • In dataframe motifs_and_genes, why are some of the distances > 10kb?
  • motif_to_gene_distance may not be calculated correctly, as the TSS varies depending on which strands the gene and motifs are on.
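On the second point, a strand-aware version might look like the sketch below. It takes the TSS as the gene start on the `+` strand and the gene end on the `-` strand, ignores off-by-one details of the coordinate system, and reports negative values for motifs upstream of the TSS. The function names and the sign convention are assumptions, not GarNet's current behavior.

```python
def tss_position(gene_start, gene_end, strand):
    """Return the TSS coordinate: gene start on '+', gene end on '-'."""
    if strand == "+":
        return gene_start
    if strand == "-":
        return gene_end
    raise ValueError("strand must be '+' or '-', got {!r}".format(strand))


def signed_distance_to_tss(motif_start, motif_end, gene_start, gene_end, strand):
    """Motif midpoint relative to the TSS; negative means upstream of the gene."""
    midpoint = (motif_start + motif_end) // 2
    offset = midpoint - tss_position(gene_start, gene_end, strand)
    return offset if strand == "+" else -offset


print(signed_distance_to_tss(900, 910, 1000, 5000, "+"))
print(signed_distance_to_tss(5090, 5100, 1000, 5000, "-"))
```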

`TF_regression` not working properly on server (or locally either)

Traceback (most recent call last):
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/indexes/base.py", line 2134, in get_loc
    return self._engine.get_loc(key)
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'expression'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/tools/plotting.py", line 3774, in __call__
    sort_columns=sort_columns, **kwds)
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/tools/plotting.py", line 2643, in plot_frame
    **kwds)
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/tools/plotting.py", line 2470, in _plot
    plot_obj.generate()
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/tools/plotting.py", line 1043, in generate
    self._make_plot()
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/tools/plotting.py", line 1619, in _make_plot
    scatter = ax.scatter(data[x].values, data[y].values, c=c_values,
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/core/frame.py", line 2059, in __getitem__
    return self._getitem_column(key)
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/core/frame.py", line 2066, in _getitem_column
    return self._get_item_cache(key)
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 1386, in _get_item_cache
    values = self._data.get(item)
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/core/internals.py", line 3543, in get
    loc = self.items.get_loc(item)
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/indexes/base.py", line 2136, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'expression'

What does motif score mean?

Is this the same as binding affinity? If so, shouldn't we filter out motif scores that are 0? About 1 out of 17.3 million motifs are 0.

@zfrenchee

AP-2rep motif

This motif is being included despite having scores of 0.

Compute additional fields

@AmandaKedaigle Right now, we're only passing through raw data from the original files, not computing new fields. Each new field should be about one line of code. Let's use this issue as an inventory of fields we want to add:

Map known genes to peaks:

currently: ["chrom", "peakStart", "peakEnd", "peakName", "peakScore", "geneName", "geneStart", "geneEnd"]

Want to add:

  • Dist (where this is bp distance between TSS and peak summit/halfway-point, I'd think)
  • MapType (this is promoter/upstream/downstream/intergenic)

map_known_genes_and_motifs_to_peaks:

currently: ["chrom", "motifStart", "motifEnd", "motifID", "motifName", "motifScore", "geneName", "geneSymbol", "geneStart", "geneEnd"]

map_motifs_to_peaks:

currently: ["chrom", "peakStart", "peakEnd", "peakName", "peakScore", "motifID", "motifName", "motifStart", "motifEnd", "motifScore"]

map_known_genes_to_motifs:

currently: ["chrom", "motifStart", "motifEnd", "motifID", "motifName", "motifScore", "geneName", "geneStart", "geneEnd"]

mRNA should inform TF activity

TFs are transcriptionally regulated themselves, so their mRNA levels should give information about TF activity. May involve some network modeling or something.

GarNet File has inconsistent naming convention for record:

@iamjli I was testing GarNet when I ran into a bug. I was following the notebook, and I created a GarNet file and then tried to map peaks using that GarNet file. I got this error:

***** WARNING: File  has inconsistent naming convention for record:
	motifChrom	motifStart	motifEnd	motifName	motifScore	motifStrand	geneName	tssStart	tssEnd	motif_gene_distance

***** WARNING: File  has inconsistent naming convention for record:
	motifChrom	motifStart	motifEnd	motifName	motifScore	motifStrand	geneName	tssStart	tssEnd	motif_gene_distance

and an empty dataframe. However, when I used garnetDB_cisBP.hg19.LOD.10kb.tsv instead of garnetDB_cisBP.hg19.LOD.10kb.chr1_SLIM.tsv the error disappeared.

This might point to an issue with construct_garnet_file. Let me know what you think.

Python version compatibility

I noticed in setup.py that GarNet is compatible with Python 3.5 and 3.6. Are there Python 3 features being used that prevent it from being compatible with Python 2.7?

I'm also curious if this rewrite is closer to being compatible with both Linux and Windows. Some path handling code in the original Omics Integrator made it Linux-only.
