fraenkel-lab / garnet
Home Page: https://fraenkel-lab.github.io/GarNet/
Not urgent, but definitely nice to have. cisDB has these files I think?
Without loading the huge motifs intervaltrees into memory each time.
Outlining one way to speed up vanilla Garnet. Won't implement now but probably worth looking into when we decide on a version of Garnet we like.
From our discussion last week, it appeared the only way to get accurate motif matches is to calculate background from user-specified open chromatin regions, then run motif searching, all on the fly. We concluded we may have to compromise and just generate a motif file within windows around every TSS. But the motif matching algorithm is different than what I thought. Specifically, here are the steps:
This means we can do a lot of the preprocessing work beforehand. We can generate a motif file with a low threshold, and every time we run garnet, we only have to calculate the thresholds we want, then filter the motif file based on those thresholds. This would be really fast, though the motif file would be very large (probably 10s of GB).
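A minimal sketch of that filtering step, assuming the precomputed motif file is loaded into a pandas DataFrame with motifName and motifScore columns (names borrowed from the garnet file header quoted elsewhere in these issues) and that the per-run thresholds arrive as a hypothetical dict of motif name to minimum score. How the thresholds themselves are computed is not covered here.

```python
import pandas as pd

def filter_motifs_by_threshold(motifs, thresholds, default=float("inf")):
    """Keep rows whose motifScore meets the threshold for their motifName.

    `thresholds` is a hypothetical {motifName: minimum score} mapping. Motifs
    without an entry use `default`, so they are dropped unless a finite
    default is supplied.
    """
    cutoffs = motifs["motifName"].map(thresholds).fillna(default)
    return motifs[motifs["motifScore"] >= cutoffs]

# Toy data standing in for the low-threshold, genome-wide motif file.
motifs = pd.DataFrame({
    "motifName": ["A", "A", "B", "C"],
    "motifScore": [5.0, 9.0, 7.0, 3.0],
})
kept = filter_motifs_by_threshold(motifs, {"A": 8.0, "B": 6.0})
```

For a file in the tens of GB, the same function could probably be applied chunk by chunk (pd.read_csv accepts a chunksize argument) instead of loading everything at once.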
This does the main logic of steps 1, 2, and 3 from the old garnet, as one function step
Creating our own genome-wide scan for all motifs
(or using MotifMap?)
Easy one :)
In map_peaks (and similar), can we run a check that all input files exist before attempting to open the garnet_file? Since that step takes a while to run, I'd rather realize I have a typo before waiting for that.
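Something along these lines might do it. This is only a sketch: check_files_exist is a hypothetical helper name, and where it gets called inside map_peaks is up to whoever wires it in.

```python
import os

def check_files_exist(*paths):
    """Raise FileNotFoundError naming every missing path, before any slow loading."""
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError("Missing input file(s): " + ", ".join(missing))
```

Calling it on the peaks file, expression file and garnet file at the top of the function means a typo fails in milliseconds rather than after the garnet file has loaded.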
Working out of this folder: /nfs/latdata/iamjli/ALS/GarNet/
Did the following on node19:
git clone <garnet repository>
virtualenv venv
virtualenv -p /usr/bin/python3 venv
source venv/bin/activate
pip install -r requirements.txt
All requirements downloaded successfully with no error. I loaded all relevant data and, from src/, ran:
python testing.py
Got this error:
Traceback (most recent call last):
File "testing.py", line 3, in <module>
from garnet import *
File "/nfs/latdata/iamjli/ALS/GarNet/src/garnet.py", line 303
motifs_and_genes = [{**motif, **gene, **peak} for peak, genes, motifs in peaks_with_associated_genes_and_motifs for gene in genes for motif in motifs]
^
SyntaxError: invalid syntax
Then tried creating an instance of python3 and importing garnet, but got the same error.
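The {**motif, **gene, **peak} dict-unpacking syntax on that line was added in Python 3.5 (PEP 448), so the most likely cause is that /usr/bin/python3 on that node is older than 3.5. Because a SyntaxError is raised while garnet.py is being compiled, a version check has to live in a separate file that runs before the import. A rough sketch, with interpreter_is_supported as a hypothetical helper name:

```python
import sys

MIN_VERSION = (3, 5)  # dict unpacking in literals ({**a, **b}) needs 3.5+

def interpreter_is_supported(version_info=sys.version_info):
    """Return True if this interpreter can parse the {**a, **b} syntax."""
    return tuple(version_info[:2]) >= MIN_VERSION

if not interpreter_is_supported():
    sys.exit("GarNet needs Python %d.%d or newer; found %s"
             % (MIN_VERSION + (sys.version.split()[0],)))
```

Running python3 --version inside the activated venv would confirm which interpreter is actually in use.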
The output of Garnet right now is a file with three columns: "Transcription Factor Slope P-Value". I'd like to have an option to include a fourth, "Targets", which would have a list of the genes used to predict this TF, i.e. what genes were near that TF's motifs and fit the linear regression?
My use of "fit" here is broad and we'll have to figure out what we want that to mean. Just any gene included in the regression step, or do we want to exclude outliers somehow?
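One possible shape for this, taking the broadest reading of "fit" (every gene that entered the regression). It assumes the regression output has the three columns named above and that the table fed to the regression has motifName and geneSymbol columns. add_targets_column is a hypothetical helper, not an existing GarNet function.

```python
import pandas as pd

def add_targets_column(regression_results, motifs_genes_expression):
    """Attach a comma-separated list of the genes behind each TF's regression."""
    targets = (motifs_genes_expression
               .groupby("motifName")["geneSymbol"]
               .agg(lambda genes: ",".join(sorted(set(genes))))
               .rename("Targets")
               .reset_index())
    return regression_results.merge(
        targets, left_on="Transcription Factor", right_on="motifName", how="left"
    ).drop(columns="motifName")

# Toy inputs for illustration only.
results = pd.DataFrame({
    "Transcription Factor": ["TF1", "TF2"],
    "Slope": [0.5, -0.2],
    "P-Value": [0.01, 0.3],
})
pairs = pd.DataFrame({
    "motifName": ["TF1", "TF1", "TF2", "TF1"],
    "geneSymbol": ["G2", "G1", "G3", "G1"],
})
with_targets = add_targets_column(results, pairs)
```

Excluding outliers would only change which rows of the second table are passed in, so that decision can be made separately.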
BP3 section here: https://github.com/arq5x/bedtools-protocols/blob/master/bedtools.md
If group A has peak X and group B doesn't, this should be treated differently than if group B has peak X and group A doesn't.
construct garnet function cannot handle large motif files
Replace with
cat $reference \
| bedtools slop -b 10000 -g /nfs/genomes/human_gp_feb_09/hg19.chrom.sizes \
| sortBed \
| bedtools intersect -a - -b <( cat $motifs ) -wa -wb -sorted \
| awk 'BEGIN {FS="\t"; OFS="\t"} {print $7,$8,$9,$10,$11,$12,$4,$2,$3,int(($8+$9)/2-($2+$3)/2)}' \
> garnetDB.cisBP.hg19.normalized.10kb.tsv
Create 10kb windows, sort, intersect with motifs file, add distance column and reorder, output.
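For reference, the distance expression in the awk step can be written out in Python as below. It is a sketch of that one column only: the argument names are mine, and the mapping of $8/$9 to the motif interval and $2/$3 to the reference window is read off the awk command above.

```python
def motif_gene_distance(motif_start, motif_end, ref_start, ref_end):
    """Signed distance between interval midpoints, truncated toward zero like awk's int()."""
    return int((motif_start + motif_end) / 2 - (ref_start + ref_end) / 2)
```

Note that $2 and $3 are the coordinates after bedtools slop, so the midpoint matches the original reference midpoint only when the window was not clipped at a chromosome end.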
@AmandaKedaigle As discussed, we want to make sure we have a coherent list of symbols in GarNet's default motif file. We're concerned that MotifMap seems to have both GeneSymbols of TFs and Transfac IDs. It's unclear to me whether Transfac IDs are a proper superset of TF gene symbols, subset, or whether they are just two partially overlapping sets. Either way, we should hopefully only use one of them, and make sure we can correlate them with the names we're passed in as part of the expression file we receive for TF_regression.
Let me know whether you agree with this and if it makes any sense =)
Can I use a dataframe? if not, should I use a dict?
Also in this is adding the new columns of 'intergenic' vs 'promoter' and 'distance'
GPS/GEM: https://groups.csail.mit.edu/cgs/onePageGPS/
MACS: there are just some headers our code might freak out about. MACS can output a bed file, but we'd like to support the tsv format MACS also occasionally outputs.
Tried loading /nfs/latdata/alex/garnet_data/hg19.garnet.pickle into map_peaks, and it's been running for a very long time (24h+). Not sure if the computer's even trying anymore :[
Submitted as a job: wqsub python map_peaks_job.py out of this directory: /nfs/latdata/iamjli/ALS/TF_prediction/src.
Also tried loading in a Python instance, loaded for 4+ hours before connection broke.
I'm unclear why duplicates are being removed. It's possible that a gene has the same motif more than once nearby, and I believe these should be considered separately. Or at least keep the one that's closer to the TSS or something.
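If we go with the "keep the closer one" option, a pandas version could look like this. It is a sketch that assumes geneName, motifName and motif_gene_distance columns and a unique row index; keep_closest_motif is a hypothetical name.

```python
import pandas as pd

def keep_closest_motif(motifs_and_genes):
    """For each (gene, motif) pair keep only the occurrence nearest the TSS."""
    order = (motifs_and_genes["motif_gene_distance"]
             .abs()
             .sort_values(kind="mergesort")  # stable sort: ties keep file order
             .index)
    return (motifs_and_genes.loc[order]
            .drop_duplicates(subset=["geneName", "motifName"], keep="first")
            .sort_index())

# Toy table: G1 has motif M1 twice, at -500 and +120 from the TSS.
df = pd.DataFrame({
    "geneName": ["G1", "G1", "G1", "G2"],
    "motifName": ["M1", "M1", "M2", "M1"],
    "motif_gene_distance": [-500, 120, 900, -30],
})
closest = keep_closest_motif(df)
```

Keeping every occurrence instead just means skipping the deduplication step altogether.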
Line 241 in 831a711
We should probably use bedtools for large genome manipulations since it's super fast. I'm pretty sure it doesn't even need to load the entire motif file in either, but I haven't explicitly tested memory requirements. There's even a python package.
Here's the workflow:
Tutorial: bedtools Tutorial
Test directory: /nfs/latdata/iamjli/test_projects/bedtools/test
Visualize bed file in IGV
Command to find full overlap:
bedtools intersect -a test_motifs.bed -b test_exons.bed -wa -f 1
Sorting will speed things up: add the -sorted flag
If not sorted: sort -k1,1 -k2,2n foo.bed > foo.sort.bed
Motif file: /nfs/latdata/iamjli/test_projects/bedtools/garnet_data/old/garnetDB.tsv (3.3 GB)
First sort (7 min):
sort -k1,1 -k2,2n garnet_data/old/garnetDB.tsv > garnetDB.sort.bed
Sort without header:
(tail -n +2 garnet_data/old/garnetDB.tsv | sort -k1,1 -k2,2n) > garnetDB.sorted.bed
Sort ATAC-seq DOS bed file from /nfs/latdata/iamjli/ALS/ATAC-seq/iMNs_ALS_SMA_CTR_from_iMPs/diffBind_042617:
(tail -n +2 diffSites_ALS_CTR.stripped.txt | sort -k1,1 -k2,2n) > diffSites_ALS_CTR.sorted.bed
Find intersection (73 s):
bedtools intersect -a garnetDB.sorted.bed -b diffSites_ALS_CTR.sorted.bed -wa -f 1 > overlapping_motifs.tsv
With -sorted flag (77 s):
bedtools intersect -a garnetDB.sorted.bed -b diffSites_ALS_CTR.sorted.bed -wa -f 1 -sorted > overlapping_motifs.fast.tsv
It would be nice if we had a webpage explaining what GarNet does. The page that should do that is docs/source/index.rst:
https://github.com/fraenkel-lab/GarNet2/blob/master/docs/source/index.rst
It's reStructuredText (sorry).
Besides testing it manually for correctness against the old version, it would be great to incorporate Travis or other tests in here. Let me know what you think!
@zfrenchee Could you add a few lines that let this function take window size as an optional parameter? Not too familiar with the syntax myself. Thanks!
Line 238 in 8656334
Specifically, do 2kb window if window size is not specified.
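The syntax is a keyword argument with a default. A sketch follows, with tss_window as a hypothetical stand-in for the function at that line, and assuming "2kb window" means 2 kb on either side of the TSS (if it means 2 kb in total, halve the default).

```python
DEFAULT_WINDOW = 2000  # 2kb, used when the caller does not specify a window

def tss_window(tss_position, window=None):
    """Return (start, end) of the search window centred on a TSS.

    `window` is the half-width in base pairs; None falls back to DEFAULT_WINDOW.
    """
    if window is None:
        window = DEFAULT_WINDOW
    # Clamp at 0 so windows near the start of a chromosome stay valid.
    return max(0, tss_position - window), tss_position + window
```

Callers that do not care can write tss_window(pos); anyone who wants 10 kb writes tss_window(pos, window=10000).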
Seems like the way is using statsmodels:
http://stackoverflow.com/questions/19991445/run-an-ols-regression-with-pandas-data-frame
@iamjli and @AmandaKedaigle report that loading MotifMap into RAM is just too expensive on a laptop. Could we use a database instead?
While trying to run Garnet on the cluster, I got this error for the TF regression part:
Traceback (most recent call last):
File "run_new_garnet.py", line 14, in <module>
df_reg = TF_regression(df, expression, options)
File "/home/nlpm/.local/lib/python3.6/site-packages/GarNet/garnet.py", line 337, in TF_regression
motifs_genes_and_expression_levels = motifs_and_genes_dataframe.merge(expression_dataframe, left_on='geneSymbol', right_on='name', how='inner')
AttributeError: 'str' object has no attribute 'merge'
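The traceback shows that motifs_and_genes_dataframe was a str when .merge was called, which suggests a file path reached TF_regression where a DataFrame was expected. One way to tolerate both is sketched below. as_dataframe is a hypothetical helper, and reading the path as a tab-separated file is an assumption.

```python
import pandas as pd

def as_dataframe(obj, **read_kwargs):
    """Return `obj` unchanged if it is a DataFrame, otherwise read it as a TSV path or file."""
    if isinstance(obj, pd.DataFrame):
        return obj
    return pd.read_csv(obj, sep="\t", **read_kwargs)
```

Calling this on each input at the top of the function would let the caller pass either a path or an already loaded table.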
How do you know if the user is passing you a pickled file instead of a datafile?
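One cheap heuristic is to sniff the first two bytes: pickles written with protocol 2 or later start with the byte 0x80 followed by the protocol number, and Python 3 uses protocol 3 or higher by default. A sketch, with looks_like_pickle as a hypothetical name:

```python
import pickle

def looks_like_pickle(path):
    """Heuristic check for the protocol header that pickle protocol 2+ writes.

    Protocol 0 and 1 pickles have no such header and will be reported as
    'not a pickle', so this only covers files written with newer protocols.
    """
    with open(path, "rb") as handle:
        header = handle.read(2)
    return (len(header) == 2
            and header[0] == 0x80
            and 2 <= header[1] <= pickle.HIGHEST_PROTOCOL)
```

This avoids actually unpickling a file just to find out what it is, which matters both for speed and because unpickling untrusted input is unsafe.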
Currently, it seems all the motifs' names are not in a format that is compatible with other analyses (forest, in particular). To address this, it'd be great if someone could create a mapping file for each motif; or map the motifs' names before they are fed into Garnet.
I'm creating this just to keep track of some differences between the IntervalTree and BedTools methods of generating Garnet files. This is based off this notebook.
In motifs_and_genes, why are some of the distances > 10kb?
Traceback (most recent call last):
File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/indexes/base.py", line 2134, in get_loc
return self._engine.get_loc(key)
File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'expression'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/tools/plotting.py", line 3774, in __call__
sort_columns=sort_columns, **kwds)
File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/tools/plotting.py", line 2643, in plot_frame
**kwds)
File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/tools/plotting.py", line 2470, in _plot
plot_obj.generate()
File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/tools/plotting.py", line 1043, in generate
self._make_plot()
File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/tools/plotting.py", line 1619, in _make_plot
scatter = ax.scatter(data[x].values, data[y].values, c=c_values,
File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/core/frame.py", line 2059, in __getitem__
return self._getitem_column(key)
File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/core/frame.py", line 2066, in _getitem_column
return self._get_item_cache(key)
File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 1386, in _get_item_cache
values = self._data.get(item)
File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/core/internals.py", line 3543, in get
loc = self.items.get_loc(item)
File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/indexes/base.py", line 2136, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'expression'
Is this the same as binding affinity? If so, shouldn't we filter out motif scores that are 0? About 1 out of 17.3 million motifs are 0.
@zfrenchee
@AmandaKedaigle Right now, we're only passing through raw data from the original files, not computing new fields. Each new field should be about one line of code. Let's use this issue as an inventory of fields we want to add:
currently: ["chrom", "peakStart", "peakEnd", "peakName", "peakScore", "geneName", "geneStart", "geneEnd"]
Want to add:
currently: ["chrom", "motifStart", "motifEnd", "motifID", "motifName", "motifScore", "geneName", "geneSymbol", "geneStart", "geneEnd"]
currently: ["chrom", "peakStart", "peakEnd", "peakName", "peakScore", "motifID", "motifName", "motifStart", "motifEnd", "motifScore"]
currently: ["chrom", "motifStart", "motifEnd", "motifID", "motifName", "motifScore", "geneName", "geneStart", "geneEnd"]
TFs are transcriptionally regulated themselves, so their mRNA levels should give information about TF activity. May involve some network modeling or something.
@iamjli I was testing GarNet when I ran into a bug. I was following the notebook, and I created a GarNet file and then tried to map peaks using that GarNet file. I got this error:
***** WARNING: File has inconsistent naming convention for record:
motifChrom motifStart motifEnd motifName motifScore motifStrand geneName tssStart tssEnd motif_gene_distance
***** WARNING: File has inconsistent naming convention for record:
motifChrom motifStart motifEnd motifName motifScore motifStrand geneName tssStart tssEnd motif_gene_distance
and an empty dataframe. However, when I used garnetDB_cisBP.hg19.LOD.10kb.tsv instead of garnetDB_cisBP.hg19.LOD.10kb.chr1_SLIM.tsv the error disappeared.
This might point to an issue with construct_garnet_file. Let me know what you think.
I noticed in setup.py that GarNet is compatible with Python 3.5 and 3.6. Are there Python 3 features being used that prevent it from being compatible with Python 2.7?
I'm also curious if this rewrite is closer to being compatible with both Linux and Windows. Some path handling code in the original Omics Integrator made it Linux-only.