fraenkel-lab / garnet

Home Page: https://fraenkel-lab.github.io/GarNet/

Python 12.50% HTML 75.56% Jupyter Notebook 11.94%

epigenetics transcriptome omics data-integration

garnet's Issues

mm9, mm10 garnet file construction notebooks

Not urgent, but definitely nice to have. cisDB has these files I think?

Finish writing the argument parser

description
help messages

It should be possible to analyze many peaks files at once.

Without loading the huge motifs intervaltrees into memory each time.

Idea for speed optimization.

Outlining one way to speed up vanilla Garnet. Won't implement now but probably worth looking into when we decide on a version of Garnet we like.

From our discussion last week, it appeared the only way to get accurate motif matches is to calculate the background from user-specified open chromatin regions and then run motif searching, all on the fly. We concluded we may have to compromise and just generate a motif file within windows around every TSS. But the motif matching algorithm is different from what I thought. Specifically, here are the steps:

Calculate score threshold from user-defined p-value and background.
Scan sequences for any subsequences whose score exceeds this threshold. Note that the score is determined only by the subsequence and the PWM, not the background.

This means we can do a lot of the preprocessing work beforehand. We can generate a motif file with a low threshold, and every time we run Garnet we only have to calculate the threshold we want and then filter the motif file against it. This would be really fast, though the motif file would be very large (probably tens of GB).
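The two steps above can be sketched roughly as follows. This is a toy illustration, not GarNet's implementation: `score_threshold` and `filter_hits` are hypothetical names, the PWM layout is assumed, and the exact enumeration over all k-mers is only feasible for short motifs.

```python
from itertools import product


def score_threshold(pwm, background, p_value):
    """Return the lowest score t with P(score >= t | background) <= p_value.

    pwm: list of per-position dicts mapping base -> score (assumed layout).
    background: dict mapping base -> probability.
    """
    distribution = {}
    for kmer in product("ACGT", repeat=len(pwm)):
        # The score depends only on the subsequence and the PWM.
        score = round(sum(col[base] for col, base in zip(pwm, kmer)), 6)
        # The background only affects how likely each subsequence is.
        prob = 1.0
        for base in kmer:
            prob *= background[base]
        distribution[score] = distribution.get(score, 0.0) + prob

    tail, threshold = 0.0, float("inf")
    for score in sorted(distribution, reverse=True):
        tail += distribution[score]
        if tail > p_value:
            break
        threshold = score
    return threshold


def filter_hits(hits, thresholds):
    """Keep precomputed (motif_name, score) hits that reach their motif's threshold."""
    return [(name, score) for name, score in hits if score >= thresholds[name]]


toy_pwm = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}] * 2
uniform = {base: 0.25 for base in "ACGT"}
thresholds = {"toy": score_threshold(toy_pwm, uniform, p_value=0.1)}
kept = filter_hits([("toy", 2.0), ("toy", 1.0)], thresholds)
```

Only `score_threshold` needs the background, so the expensive scan could be done once and `filter_hits` rerun per experiment.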

finish writing map_known_genes_and_TF_binding_motifs_to_peaks and intersection_of_three_dicts_of_intervaltrees

This does the main logic of steps 1, 2, and 3 from the old garnet, as one function step

Decide on motif data to draw from

Creating our own genome-wide scan for all motifs
(or using MotifMap?)

GUI output

check files exist

Easy one :)

In map_peaks (and similar), can we run a check that all input files exist before attempting to open the garnet_file? Since that step takes a while to run, I'd rather realize I have a typo before waiting for that.
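A minimal sketch of such a check, assuming it would be called at the top of map_peaks before the garnet_file is opened (`check_files_exist` is a hypothetical helper name):

```python
import os


def check_files_exist(*paths):
    """Raise FileNotFoundError listing every path that is not an existing file."""
    missing = [path for path in paths if not os.path.isfile(path)]
    if missing:
        raise FileNotFoundError("Missing input file(s): " + ", ".join(missing))
```

Reporting all missing paths at once avoids fixing one typo only to hit the next one after another long wait.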

invalid syntax

Working out of this folder: /nfs/latdata/iamjli/ALS/GarNet/

Did the following on node19:

git clone <garnet repository>
virtualenv venv
virtualenv -p /usr/bin/python3 venv
source venv/bin/activate
pip install -r requirements.txt

All requirements downloaded successfully with no error. I loaded all relevant data and from src/ ran

python testing.py

Got this error:

Traceback (most recent call last):
  File "testing.py", line 3, in <module>
    from garnet import *
  File "/nfs/latdata/iamjli/ALS/GarNet/src/garnet.py", line 303
    motifs_and_genes = [{**motif, **gene, **peak} for peak, genes, motifs in peaks_with_associated_genes_and_motifs for gene in genes for motif in motifs]
                          ^
SyntaxError: invalid syntax

Then tried creating an instance of python3 and importing garnet, but got the same error.
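The `{**motif, **gene, **peak}` expression is dictionary unpacking from PEP 448, which only parses on Python 3.5 and later, so this traceback suggests the interpreter that actually ran was older than that. A quick way to check, plus an equivalent merge that parses on older interpreters, might look like this (`merge_dicts` is a hypothetical helper, not part of GarNet):

```python
import sys


def merge_dicts(*dicts):
    """Merge dicts left to right; later keys win, like {**a, **b}."""
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged


# Confirm which interpreter is really running inside the virtualenv.
print(sys.executable, sys.version_info[:2])

motif = {"motifName": "toy_motif"}
gene = {"geneName": "toy_gene"}
peak = {"peakName": "toy_peak"}
row = merge_dicts(motif, gene, peak)
```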

Include TF "targets" in output

The output of Garnet right now is a file with three columns: "Transcription Factor Slope P-Value". I'd like to have an option to include a fourth, "Targets", which would have a list of the genes used to predict this TF, i.e. what genes were near that TF's motifs and fit the linear regression?

My use of "fit" here is broad and we'll have to figure out what we want that to mean. Just any gene included in the regression step, or do we want to exclude outliers somehow?
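One possible shape for the "Targets" column, taking the broad reading (any gene that entered the regression step). The input layout and the `collect_targets` name are assumptions for illustration only:

```python
from collections import defaultdict


def collect_targets(motif_gene_rows):
    """Map each TF to a comma-separated, sorted, de-duplicated list of genes.

    motif_gene_rows: iterable of (tf_name, gene_name) pairs (assumed layout).
    """
    targets = defaultdict(set)
    for tf_name, gene_name in motif_gene_rows:
        targets[tf_name].add(gene_name)
    return {tf: ",".join(sorted(genes)) for tf, genes in targets.items()}


rows = [("TF_A", "gene2"), ("TF_A", "gene1"), ("TF_B", "gene1"), ("TF_A", "gene2")]
targets_column = collect_targets(rows)
```

Excluding outliers would mean filtering `motif_gene_rows` before this step, once we decide what counts as an outlier.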

Figure out what to do with the kgXref_file

TODO: autofetch reference files

BP3 section here: https://github.com/arq5x/bedtools-protocols/blob/master/bedtools.md

Peak directionality

If group A has peak X and group B doesn't, this should be treated differently than if group B has peak X and group A doesn't.

TODO: update garnet construction function

construct garnet function cannot handle large motif files

Replace with

cat $reference \
| bedtools slop -b 10000 -g /nfs/genomes/human_gp_feb_09/hg19.chrom.sizes \
| sortBed \
| bedtools intersect -a - -b <( cat $motifs ) -wa -wb -sorted \
| awk 'BEGIN {FS="\t"; OFS="\t"} {print $7,$8,$9,$10,$11,$12,$4,$2,$3,int(($8+$9)/2-($2+$3)/2)}' \
> garnetDB.cisBP.hg19.normalized.10kb.tsv

Create 10kb windows, sort, intersect with motifs file, add distance column and reorder, output.
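For anyone porting the awk step into the Python code, here is a sketch of the same reorder-and-distance logic for a single intersect output line. It assumes, as the awk field numbers imply, that the reference contributes the first six columns and the motif the next six; `reorder_with_distance` is a hypothetical name.

```python
def reorder_with_distance(line):
    """Mirror the awk step: motif columns, gene name, reference bounds, distance."""
    f = line.rstrip("\n").split("\t")
    # awk fields are 1-based, Python lists are 0-based: $7 is f[6], $2 is f[1].
    ref_start, ref_end = int(f[1]), int(f[2])
    motif_start, motif_end = int(f[7]), int(f[8])
    # Motif midpoint minus reference midpoint, truncated toward zero like awk's int().
    distance = int((motif_start + motif_end) / 2 - (ref_start + ref_end) / 2)
    return "\t".join(f[6:12] + [f[3], f[1], f[2], str(distance)])


example = "chr1\t0\t20000\tGENE\t0\t+\tchr1\t500\t520\tMOTIF\t9.5\t-"
reordered = reorder_with_distance(example)
```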

MotifMap has transfac IDs

@AmandaKedaigle As discussed, we want to make sure we have a coherent list of symbols in GarNet's default motif file. We're concerned that MotifMap seems to have both GeneSymbols of TFs and Transfac IDs. It's unclear to me whether Transfac IDs are a proper superset of TF gene symbols, a subset, or whether the two are just partially overlapping sets. Either way, we should use only one of them, and make sure we can match them against the names passed to us in the expression file for TF_regression.

Figure out the relationship of Transfac ID and GeneSymbol.
Email MotifMap people asking for rationale of including both Transfac ID and GeneSymbol.
"Repair" MotifMap file so that the TF_names are from a single Namespace.

Let me know whether you agree with this and if it makes any sense =)

Filter out unreasonable regressions

like this:

figure out merging logic

Can I use a dataframe? if not, should I use a dict?

Also in this is adding the new columns of 'intergenic' vs 'promoter' and 'distance'

support MACS and GPS/GEM

GPS/GEM: https://groups.csail.mit.edu/cgs/onePageGPS/
MACS: there are just some headers our code might freak out about. MACS can output a bed file, but we'd like to support the tsv format MACS also occasionally outputs.
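A rough sketch of tolerating those headers. It assumes the tabular output may begin with `#` comment lines, blank lines, and one header row whose first field is `chr`; that layout and the `read_macs_table` name are assumptions to verify against real MACS output.

```python
import csv


def read_macs_table(lines):
    """Yield (chrom, start, end) from MACS-style tab-separated text.

    Skips '#' comment lines, blank lines, and a header row starting with 'chr'.
    """
    data_lines = (line for line in lines if line.strip() and not line.startswith("#"))
    for row in csv.reader(data_lines, delimiter="\t"):
        if row[0] == "chr":  # header row, not a chromosome name like 'chr1'
            continue
        yield row[0], int(row[1]), int(row[2])
```

Because it takes any iterable of lines, the same function works on an open file handle or on a list in a test.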

Get autodoc to work

Garnet file stuck loading

Tried loading /nfs/latdata/alex/garnet_data/hg19.garnet.pickle into map_peaks, and it's been running for a very long time (24h+). Not sure if the computer's even trying anymore :[

Submitted as a job: wqsub python map_peaks_job.py out of this directory: /nfs/latdata/iamjli/ALS/TF_prediction/src.

Also tried loading in a Python instance, loaded for 4+ hours before connection broke.

TF regression drop duplicates

I'm unclear why duplicates are being removed. It's possible that a gene has the same motif more than once nearby, and I believe these should be considered separately. Or at least keep the one that's closer to the TSS or something.

GarNet/GarNet/garnet.py

Line 241 in 831a711

if 'geneName' in motifs_genes_and_expression_levels.columns:
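If we go with the "keep the one that's closer to the TSS" option, a sketch could look like the following. The column names are assumed, and `keep_closest` is a hypothetical helper rather than what garnet.py currently does:

```python
def keep_closest(rows):
    """For each (gene, motif) pair keep the row with the smallest |distance|.

    rows: iterable of dicts with 'geneName', 'motifName' and 'distance' keys
    (assumed column names).
    """
    best = {}
    for row in rows:
        key = (row["geneName"], row["motifName"])
        if key not in best or abs(row["distance"]) < abs(best[key]["distance"]):
            best[key] = row
    return list(best.values())


rows = [
    {"geneName": "g1", "motifName": "m1", "distance": -900},
    {"geneName": "g1", "motifName": "m1", "distance": 150},
    {"geneName": "g1", "motifName": "m2", "distance": 4000},
]
closest = keep_closest(rows)
```

Keeping every occurrence instead would simply mean skipping this step.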

Bedtools can perform map_peaks function in ~1 min

We should probably use bedtools for large genome manipulations since it's super fast. I'm pretty sure it doesn't even need to load the entire motif file in either, but I haven't explicitly tested memory requirements. There's even a python package.

Here's the workflow:

Bedtools intersect notes

Tutorial: bedtools Tutorial
Test directory: /nfs/latdata/iamjli/test_projects/bedtools/test
Visualize bed file in IGV

Command to find full overlap:
bedtools intersect -a test_motifs.bed -b test_exons.bed -wa -f 1

Sorting will speed things up: add -sorted flag

If not sorted: sort -k1,1 -k2,2n foo.bed > foo.sort.bed

Intersecting GarNet motifs

Motif file: /nfs/latdata/iamjli/test_projects/bedtools/garnet_data/old/garnetDB.tsv (3.3 gb)

First sort (7 min):
sort -k1,1 -k2,2n garnet_data/old/garnetDB.tsv > garnetDB.sort.bed

Sort without header:
(tail -n +2 garnet_data/old/garnetDB.tsv | sort -k1,1 -k2,2n) > garnetDB.sorted.bed

Sort ATAC-seq DOS bed file from /nfs/latdata/iamjli/ALS/ATAC-seq/iMNs_ALS_SMA_CTR_from_iMPs/diffBind_042617:
(tail -n +2 diffSites_ALS_CTR.stripped.txt | sort -k1,1 -k2,2n) > diffSites_ALS_CTR.sorted.bed

Find intersection (73 s):
bedtools intersect -a garnetDB.sorted.bed -b diffSites_ALS_CTR.sorted.bed -wa -f 1 > overlapping_motifs.tsv

With -sorted flag (77 s):
bedtools intersect -a garnetDB.sorted.bed -b diffSites_ALS_CTR.sorted.bed -wa -f 1 -sorted > overlapping_motifs.fast.tsv
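For reference, `-f 1` requires that 100% of each `-a` feature be overlapped by a `-b` feature, and `-wa` reports the original `-a` entry. A naive pure-Python version of that containment test, only meant for checking small cases by hand, might look like this (`fully_contained` is a hypothetical helper; bedtools itself is far faster):

```python
def fully_contained(a_features, b_features):
    """Return a-features lying entirely inside some b-feature on the same chromosome.

    Features are (chrom, start, end) tuples in BED-style coordinates.
    """
    return [
        (chrom, start, end)
        for chrom, start, end in a_features
        if any(
            b_chrom == chrom and b_start <= start and end <= b_end
            for b_chrom, b_start, b_end in b_features
        )
    ]


motifs = [("chr1", 100, 110), ("chr1", 195, 205), ("chr2", 100, 110)]
peaks = [("chr1", 50, 200)]
inside = fully_contained(motifs, peaks)
```

One difference to keep in mind: this returns each motif once, whereas bedtools with `-wa` reports a motif once per overlapping peak.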

Write pretty docs.

It would be nice if we had a webpage explaining what GarNet does. The page that should do that is docs/source/index.rst:

https://github.com/fraenkel-lab/GarNet2/blob/master/docs/source/index.rst

It's reStructured Text (sorry).

Make sure GarNet2 recovers functionality of GarNet 1

Besides testing it manually for correctness against the old version, it would be great to incorporate Travis or other tests in here. Let me know what you think!

Window optional param

@zfrenchee Could you add a few lines that let this function take window size as an optional parameter? I'm not too familiar with the syntax myself. Thanks!

GarNet/GarNet/garnet.py

Line 238 in 8656334

 def construct_garnet_file(reference_file, motif_file_or_files, output_file, options=dict()): 

Specifically, do 2kb window if window size is not specified.
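One way this could look, assuming the window is passed through the existing `options` dict. The `'window'` key and the `resolve_window` helper are hypothetical names:

```python
DEFAULT_WINDOW = 2000  # 2kb, used when no window size is specified


def resolve_window(options=None):
    """Return the window size from options, falling back to the 2kb default."""
    options = options or {}
    return int(options.get("window", DEFAULT_WINDOW))
```

Inside construct_garnet_file this would be a single call such as `window = resolve_window(options)`. Defaulting `options` to `None` rather than `dict()` also avoids sharing one mutable default between calls.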

linear regression (the actual main function of garnet)

Seems like the way is using statsmodels:
http://stackoverflow.com/questions/19991445/run-an-ols-regression-with-pandas-data-frame

write output files

MotifMap takes too long to load into RAM -- could we use a database?

@iamjli and @AmandaKedaigle report that loading MotifMap into RAM is just too expensive on a laptop. Could we use a database instead?

TF Regression Error

While trying to run Garnet on the cluster, I got this error for the TF regression part:

Traceback (most recent call last):
  File "run_new_garnet.py", line 14, in <module>
    df_reg = TF_regression(df, expression, options)
  File "/home/nlpm/.local/lib/python3.6/site-packages/GarNet/garnet.py", line 337, in TF_regression
    motifs_genes_and_expression_levels = motifs_and_genes_dataframe.merge(expression_dataframe, left_on='geneSymbol', right_on='name', how='inner')
AttributeError: 'str' object has no attribute 'merge'

figure out the pickling strategy

How do you know if the user is passing you a pickled file instead of a datafile?
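One heuristic is to peek at the first bytes: pickles written with protocol 2 or later begin with the byte `0x80` followed by the protocol number. A sketch, with `looks_like_pickle` as a hypothetical helper; older protocol 0/1 pickles would not be detected this way, so a file-extension convention may still be worth having:

```python
import pickle


def looks_like_pickle(path):
    """Heuristic check for the protocol-2+ pickle header (0x80 + version byte)."""
    with open(path, "rb") as handle:
        head = handle.read(2)
    return len(head) == 2 and head[0] == 0x80 and 2 <= head[1] <= pickle.HIGHEST_PROTOCOL
```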

Map Motifs to Gene Symbols

Currently, it seems all the motifs' names are not in a format that is compatible with other analyses (forest, in particular). To address this, it'd be great if someone could create a mapping file for each motif; or map the motifs' names before they are fed into Garnet.

Comparison: Garnet file generation

I'm creating this just to keep track of some differences between the IntervalTree and BedTools methods of generating Garnet files. This is based off this notebook.

In dataframe motifs_and_genes, why are some of the distances > 10kb?
motif_to_gene_distance may not be calculated correctly, as the TSS varies depending on which strands the gene and motifs are on.
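A sketch of a strand-aware distance, assuming the TSS is the gene start on the `+` strand and the gene end on the `-` strand, and that negative values should mean "upstream". Both function names are illustrative, not the ones in the notebook:

```python
def tss_position(gene_start, gene_end, strand):
    """TSS is the gene start on '+' and the gene end on '-'."""
    return gene_start if strand == "+" else gene_end


def motif_to_gene_distance(motif_start, motif_end, gene_start, gene_end, strand):
    """Signed distance from the TSS to the motif midpoint; negative means upstream."""
    midpoint = (motif_start + motif_end) // 2
    offset = midpoint - tss_position(gene_start, gene_end, strand)
    # On the minus strand, larger coordinates are upstream, so flip the sign.
    return offset if strand == "+" else -offset
```

If the distance were instead measured from the gene start regardless of strand, minus-strand genes longer than the window would show distances well beyond 10kb, which might explain the first question.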

Alternative IntervalTree implementations

https://gist.github.com/shoyer/c939325f509d7c027949 (keep an eye on pandas-dev/pandas#8707 which has now moved to pandas-dev/pandas#15309)
https://github.com/ekg/intervaltree
https://github.com/cpcloud/banyan

`TF_regression` not working properly on server (or locally either)

Traceback (most recent call last):
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/indexes/base.py", line 2134, in get_loc
    return self._engine.get_loc(key)
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'expression'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/tools/plotting.py", line 3774, in __call__
    sort_columns=sort_columns, **kwds)
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/tools/plotting.py", line 2643, in plot_frame
    **kwds)
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/tools/plotting.py", line 2470, in _plot
    plot_obj.generate()
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/tools/plotting.py", line 1043, in generate
    self._make_plot()
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/tools/plotting.py", line 1619, in _make_plot
    scatter = ax.scatter(data[x].values, data[y].values, c=c_values,
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/core/frame.py", line 2059, in __getitem__
    return self._getitem_column(key)
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/core/frame.py", line 2066, in _getitem_column
    return self._get_item_cache(key)
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 1386, in _get_item_cache
    values = self._data.get(item)
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/core/internals.py", line 3543, in get
    loc = self.items.get_loc(item)
  File "/nfs/latdata/iamjli/packages/GarNet2/venv/lib/python3.6/site-packages/pandas/indexes/base.py", line 2136, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'expression'

What does motif score mean?

Is this the same as binding affinity? If so, shouldn't we filter out motif scores that are 0? About 1 out of 17.3 million motifs are 0.

@zfrenchee

AP-2rep motif

This motif is being included despite having scores of 0.

Compute additional fields

@AmandaKedaigle Right now, we're only passing through raw data from the original files, not computing new fields. Each new field should be about one line of code. Let's use this issue as an inventory of fields we want to add:

Map known genes to peaks:

currently: ["chrom", "peakStart", "peakEnd", "peakName", "peakScore", "geneName", "geneStart", "geneEnd"]

Want to add:

Dist (where this is bp distance between TSS and peak summit/halfway-point, I’d think)
MapType (this is promoter/upstream/downstream/intergenic)

map_known_genes_and_motifs_to_peaks:

currently: ["chrom", "motifStart", "motifEnd", "motifID", "motifName", "motifScore", "geneName", "geneSymbol", "geneStart", "geneEnd"]

map_motifs_to_peaks:

currently: ["chrom", "peakStart", "peakEnd", "peakName", "peakScore", "motifID", "motifName", "motifStart", "motifEnd", "motifScore"]

map_known_genes_to_motifs:

currently: ["chrom", "motifStart", "motifEnd", "motifID", "motifName", "motifScore", "geneName", "geneStart", "geneEnd"]
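As a starting point for the two fields listed under "Want to add", here is a sketch of Dist and MapType. The 2kb promoter cutoff, the 10kb outer cutoff, and the function names are all placeholder assumptions, and strand handling is left out for brevity:

```python
def peak_dist(peak_start, peak_end, tss):
    """Dist: bp from the TSS to the peak's halfway point (genomic coordinates)."""
    return (peak_start + peak_end) // 2 - tss


def map_type(dist, promoter_window=2000, max_window=10000):
    """MapType: classify a peak by its distance from the TSS (cutoffs are placeholders)."""
    if abs(dist) <= promoter_window:
        return "promoter"
    if abs(dist) > max_window:
        return "intergenic"
    return "upstream" if dist < 0 else "downstream"
```

Each field then really is about one line at the call site, for example `map_type(peak_dist(peakStart, peakEnd, tss))`.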

mRNA should inform TF activity

TFs are transcriptionally regulated themselves, so their mRNA levels should give information about TF activity. May involve some network modeling or something.

GarNet File has inconsistent naming convention for record:

@iamjli I was testing GarNet when I ran into a bug. I was following the notebook, and I created a GarNet file and then tried to map peaks using that GarNet file. I got this error:

***** WARNING: File  has inconsistent naming convention for record:
	motifChrom	motifStart	motifEnd	motifName	motifScore	motifStrand	geneName	tssStart	tssEnd	motif_gene_distance

***** WARNING: File  has inconsistent naming convention for record:
	motifChrom	motifStart	motifEnd	motifName	motifScore	motifStrand	geneName	tssStart	tssEnd	motif_gene_distance

and an empty dataframe. However, when I used garnetDB_cisBP.hg19.LOD.10kb.tsv instead of garnetDB_cisBP.hg19.LOD.10kb.chr1_SLIM.tsv the error disappeared.

This might point to an issue with construct_garnet_file. Let me know what you think.

Python version compatibility

I noticed in setup.py that GarNet is compatible with Python 3.5 and 3.6. Are there Python 3 features being used that prevent it from being compatible with Python 2.7?

I'm also curious if this rewrite is closer to being compatible with both Linux and Windows. Some path handling code in the original Omics Integrator made it Linux-only.