snystrom / memes

An R interface to the MEME Suite

Home Page: https://snystrom.github.io/memes/

License: Other

R 100.00%

memes's Issues

Release 1.1

Utils

MAST
- A note on use of MAST: https://groups.google.com/d/msg/meme-suite/ZDnELfCZ0Z8/aIjwreEbBQAJ
SpaMo (?)
Centrimo
MEME

Overhaul summary plot functions.

Give titles to plot_denovo_matches. Do this by abstracting the stack into a new function that allows titling, then use that inside plot_denovo_match.

Better stats than with grob table?

Replace all references to the old flyFactor file with the new flyFactorSurvey_cleaned.meme file.

AME Support

runAme
importAme
plot heatmap results
~~utilities for resolving redundant entries? (ie get_best_tfid for finding tf w/ lowest p-value)~~
database as universalmotif list
change environment/option variable names for database to be inclusive to ame
modify get_sequence to provide a way to add a score after the sequence position name (sometimes used by ame); take either a vector or a column name from the input regions?
option to import sequences.tsv when method = "fisher"

Add `as_universalmotif` method

Perhaps consider also adding as_universalmotif method which runs update_motifs, then returns motif columns only as list.

Originally posted by @snystrom in #31 (comment)

tomtom warnings & tests

Loaded/excluded database entries (print to terminal)

test:

more than 1 db produces correct output (need to left_join on db + idx cols, or is idx sufficient?)

Rename meme output columns from id/alt to name/altname (consistent w/ universalmotif)

add cols to dreme output with the 'm01', 'dreme-1', 'seq' values in addition to generating the name columns.

TomTom / Ame / FIMO runs indefinitely if `meme_db` is empty string

# Runs correctly
options(meme_db = system.file("extdata/flyFactorSurvey_cleaned.meme", package = "dremeR"))
tt_out <- runTomTom(dreme_out)

vs.

# Runs forever
options(meme_db = system.file("extdata/flyFactorSurvey_cleaned.meme"))
tt_out <- runTomTom(dreme_out)
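
The second call hangs because `system.file()` returns an empty string when it cannot find the file, which is what happens here once `package =` is omitted. One possible guard, as a minimal sketch: `validate_db_path` is a hypothetical helper name, not an existing function in the package, and it only checks the path before any MEME tool would be launched.

```r
# Hypothetical helper (not part of the package): fail fast on unusable database paths
# instead of handing an empty string to the command-line tool.
validate_db_path <- function(path) {
  if (!is.character(path) || length(path) == 0) {
    stop("database must be a non-empty character vector of file paths", call. = FALSE)
  }
  # system.file() returns "" when nothing is found, so catch that case explicitly.
  empty <- is.na(path) | !nzchar(path)
  if (any(empty)) {
    stop("database path is an empty string; was `package =` omitted from system.file()?",
         call. = FALSE)
  }
  not_found <- !file.exists(path)
  if (any(not_found)) {
    stop("database file(s) not found: ", paste(path[not_found], collapse = ", "),
         call. = FALSE)
  }
  invisible(path)
}
```

Calling a check like this at the top of each run* function would turn the indefinite hang into an immediate, readable error.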

ame heatmap internal group comparison

add option to rescale heatmap values within a group? Ie use rank but in case regions are filtered, rescale to rank 1:n
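
A rough illustration of the rescaling idea in base R, using made-up data: after filtering, the surviving ranks within a group are no longer 1:n, so they are re-ranked per group. The column names `group` and `rank` are assumptions for this toy example only.

```r
# Toy data: ranks left over after filtering, so they no longer run 1:n per group.
ame <- data.frame(
  group = c("a", "a", "a", "b", "b"),
  rank  = c(2, 5, 9, 1, 4)
)

# Re-rank within each group so the values run 1:n again.
ame$rank_rescaled <- ave(
  ame$rank, ame$group,
  FUN = function(r) rank(r, ties.method = "first")
)
```

Here group "a" becomes 1, 2, 3 and group "b" becomes 1, 2, which keeps heatmap color scales comparable across groups of different size.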

user-facing check install function

So end-user can troubleshoot whether their install is detected (and can run? Check with -h or --version?).
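
A minimal sketch of what such a check could look like, assuming the tools live as files in a single bin directory. `check_install_sketch` and its default tool list are illustrative assumptions. It only tests that the files exist and does not try to execute them, since the right flag for that (-h or --version) is still an open question above.

```r
# Hypothetical sketch: report which expected tools are present in a MEME bin directory.
check_install_sketch <- function(bin_dir,
                                 tools = c("dreme", "tomtom", "ame", "fimo", "meme")) {
  paths <- file.path(bin_dir, tools)
  status <- data.frame(
    tool = tools,
    path = paths,
    found = file.exists(paths),
    stringsAsFactors = FALSE
  )
  # Print one line per tool so the user can see what was (not) detected.
  for (i in seq_len(nrow(status))) {
    message(if (status$found[i]) "[ok]      " else "[missing] ", status$tool[i])
  }
  invisible(status)
}
```

Returning the status table invisibly lets the same function serve both interactive troubleshooting and internal checks.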

importAME needs a way to guess 'method'

Currently, importAME requires the user to say which method they ran. This isn't very helpful to beginners, and since we know which column names each method produces, it would be easy enough to look for the distinguishing ones and guess the method instead.
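
The guessing logic could be sketched as a lookup against a table of method-specific column names. Everything in this example is a placeholder: `guess_method_sketch` is a hypothetical name, and `col_x`, `col_y`, `col_z`, `method_a`, and `method_b` stand in for the real AME headers and method names, which would need to be filled in from actual output files.

```r
# Hypothetical sketch: guess the method from which signature columns are present.
# `signatures` maps each method name to the columns that only that method produces.
guess_method_sketch <- function(colnames_found, signatures) {
  hits <- vapply(signatures, function(sig) all(sig %in% colnames_found), logical(1))
  if (sum(hits) != 1) {
    stop("could not guess method unambiguously; please set `method` explicitly",
         call. = FALSE)
  }
  names(signatures)[hits]
}

# Placeholder signatures, NOT the real AME column names.
signatures <- list(
  method_a = c("col_x"),
  method_b = c("col_y", "col_z")
)
guessed <- guess_method_sketch(c("rank", "col_y", "col_z"), signatures)
```

Erroring when zero or several signatures match keeps the explicit `method` argument as a fallback rather than silently guessing wrong.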

distance from summit cumdist

change default motifs slot naming behavior

Try: m01-IUPAC

This way rank and sequence are encoded into the list object. Might make things a bit easier to deal with later.
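
As a small illustration of the proposed naming scheme, assuming the motifs are already ordered by rank and their consensus sequences are available as a character vector (`name_motifs_sketch` is a hypothetical helper):

```r
# Hypothetical helper: build "m01-IUPAC"-style names from rank order and consensus.
name_motifs_sketch <- function(consensus) {
  sprintf("m%02d-%s", seq_along(consensus), consensus)
}

motifs <- list("first motif", "second motif")  # stand-ins for motif objects
names(motifs) <- name_motifs_sketch(c("AATAAA", "CGCGNN"))
names(motifs)
# "m01-AATAAA" "m02-CGCGNN"
```

One trade-off: names containing a hyphen need backticks when accessed with `$`, for example motifs$`m01-AATAAA`.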

helper to convert ame sequences to GRanges

Need to think about this some more & get feedback. If not using shuffled input it becomes difficult to label the regions by whether they're input or control sequences.

Possible solution: modify get_sequences to add an optional ID label, which users would have to set in order to convert sequences easily? Seems too complicated.

# Attempt at writing sequence converter for AME results
ame_analysis_seq <- peaks %>% 
  resize(200, "center") %>% 
  get_sequence(dm.genome) %>% 
  runAme(evalue_report_threshold = 30, sequences = TRUE)

ame_analysis_seq$sequences[[1]] %>% 
  tidyr::separate(seq_id, c("pos", 'type'), sep = "_") %>% 
  # what about partitioning or background/control?
  # when using partitioning or a control fasta, there is no ID appended after the sequence info,
  # so there is no easy way to label them... need to think about this
  dplyr::mutate(type = dplyr::case_when(is.na(type) ~ "input",
                                        type == "shuf" ~ "shuffle")) %>% 
  {
    dat <- .
    ranges <- GRanges(.$pos)
    mcols(ranges) <- dat %>% 
      dplyr::select(-pos)
    return(ranges)
  }

universalmotif list input to runTomTom

method to check list is list of universalmotifs

S3 to deploy path vs list

also need to use tmpdir if using list input
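
The three points above could fit together roughly as follows. This is a sketch under assumptions: `motif_input_sketch` is a hypothetical generic, and `saveRDS` is only a stand-in writer so the example runs without extra packages. In practice the list method would write a MEME-format file (presumably with universalmotif's `write_meme`).

```r
# Hypothetical S3 generic: turn either a path or a motif list into a file path.
motif_input_sketch <- function(x, ...) UseMethod("motif_input_sketch")

# Character input: treat it as the path to an existing motif file.
motif_input_sketch.character <- function(x, ...) {
  stopifnot(length(x) == 1, file.exists(x))
  x
}

# List input: check every element has the expected class, then write to a temp file.
motif_input_sketch.list <- function(x, motif_class = "universalmotif",
                                    writer = saveRDS, ...) {
  ok <- vapply(x, inherits, logical(1), what = motif_class)
  if (length(x) == 0 || !all(ok)) {
    stop("input must be a non-empty list of ", motif_class, " objects", call. = FALSE)
  }
  path <- tempfile(fileext = ".meme")
  writer(x, path)  # stand-in: a real implementation would write MEME format here
  path
}
```

Both methods return a path, so the code that builds the command line does not need to know which input type the user supplied.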

Rename package from {dremeR} to {memes}

So shall it be.

Mask sequences?

utilities for masking sequences?

Use meme built-in dust?

Consider Containerizing meme versions

One solution for cross-platform support is to containerize the MEME layer, instead of using environment variables to point to a local install, which could cause issues with different MEME versions. The interface could be Docker or Singularity (likely Singularity, for HPC support).

R Interface to containers:
https://cran.r-project.org/web/packages/babelwhale/index.html

If you do go this route be sure to include the MEME-suite Copyright Notice:
http://meme-suite.org/doc/copyright.html?man_type=web

Experimental: mutate_motifs / update_motifs

dplyr-like mutate_motifs for manipulating motif column of data.frame and assigning to value from data.frame.

update_motifs updates motifs to values from data.frame matching universalmotif slot names (id/alt are used for name/altname).

mutate_motifs currently doesn't support NSE, just vector args. Needs more user testing to see what people like.

modify input_method.R names

dreme_input -> sequence_input??
tomtom_input -> motif_input??
update generics

rename `motifs` column in results to `motif`

This was a dumb holdover from before, and it will cause confusion, because $motif only holds 1 motif per row.

places to change

Ensure good unit tests in place before making this change

where does nsites in `getProbabilityMatrix` come from?

It's been so long since I looked at this piece of code I need to sort this out.

runTomTom `motif` list column doesn't have names

This might have to do with inheritance of previous named state of this column.

But check when running from .meme path also. Could require an additional modification to the import step.

Vignette: large jobs on cluster w/ rslurm or futures?

Fix shuffle algorithm

Shuffle dinucleotides by default.

Write test for shuffle using random seed (expect match with 2 iterations @ same seed).
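
The seed test could look something like the sketch below. Note that `shuffle_sketch` is a hypothetical stand-in that shuffles single nucleotides only. It does not implement the dinucleotide shuffle proposed above and is here just to show the shape of the reproducibility check.

```r
# Hypothetical stand-in shuffler: permutes single characters (NOT dinucleotide-preserving).
shuffle_sketch <- function(seq, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  chars <- strsplit(seq, "", fixed = TRUE)[[1]]
  paste(sample(chars), collapse = "")
}

input <- "ACGTACGTAACCGGTT"
a <- shuffle_sketch(input, seed = 42)
b <- shuffle_sketch(input, seed = 42)

# Two iterations at the same seed must match exactly.
stopifnot(identical(a, b))
# Shuffling must preserve base composition.
stopifnot(identical(table(strsplit(a, "")[[1]]), table(strsplit(input, "")[[1]])))
```

A test for the real dinucleotide shuffle would add one more assertion: that the dinucleotide counts of input and output agree.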

TOMTOM database as motif list (or vector of paths)

Should allow multiple database paths (since tomtom allows this), and using motif list as database.

Also implement for AME

user-facing import xml functions

Should expose some import functions to users so they can work with MEME-server data inside R as well.

Take stringset as input to runDreme

write a get_sequences function taking GRanges & genome as input

S3 method to deploy runDreme on stringset vs path.

use tmpdir if stringset input

Improve MEME Suite install process

Some ideas:

container (see #9)
conda environment? (basilisk?)
sneaky .config hook on pkg install? (see basilisk source code)

Function to compare all TOMTOM PWMs to discovered PWM

Also tool for selecting what appears to be the true "best match".

Good addition would be RNA count data so you can have plot like

Discovered PWM | PWM#1 | PWM#2 | ...
Spacer or stats | Expression barplot of all TFs

add runMeme input test to runTomTom unit tests

Refactor to use {{dotargs}}

dotargs will allow more flexibility to commandline interface.

rename functions

write_fasta_from_region

runFimoGenome

Release 1.0

Core Utilities

Helpers

Global environment variables/options
importXML (Dreme / TomTom)
importAME (for all results types)
get_sequence
user-facing check_meme_install #14
convert ame fasta_id column back to GRanges (hard problem, because AME appends information sometimes, but not always, hard to tell if region is the real region or the background region)

Experimental features
Unsure how to move forward with these ideas. Patch universalmotif? Force user to destroy object? Need external input & user testing before making a decision.

update tomtom best_match (drop best_* cols, grab top row from .$tomtom, update best_*)
mutate_motifs #31
update motifs (take data.frame cols, update .$motif entry to reflect new values) #31

Better Error Checking

Add dotargs::suggest_flag_name loop if program has nonzero exit status to check if arguments are wrong. Undo argsDict or other processing to flags??
runDreme
runTomTom
runAme
runFimo
runTomTom returns empty columns for matches (Partial implementation)

Input Types

File path
stringSet (and associated get_sequence functions)
Discovered motifs (universalmotif list)
run* output object (if applicable)
database with multiple entry types + vector input

Data Types

Change motifs to motif in output columns

Plots

Dreme/TomTom match plots
TomTom hits comparison
AME heatmap

Documentation

Simple README with core functionality
MEME installation instructions (also link MEME-suite.org) in README/vignette
dreme core
tomtom core
ame core
fimo core
Vignette repeating E93 paper (complex vignette)

Fixes

runTomTom currently requires id & altname, need to support id only. #37
warn_dreme call in motif_input fails (function does not exist)
revise motif_input to allow meme results data.frame also (relax is_dreme_results check)
add OS check to run* functions.
add ggplot2 check to ame_plot_heatmap functions.
move ggseqlogo to suggests?

Testing

Fix unit tests to work on remote systems (copy test data to /tmp to avoid version issues)
check on macOS, windows will always fail (MEME won't install).
Fimo tests

Bioconductor Submission

cmdfun accepted to CRAN
Bioc submission guidelines

Fimo text output to temp file

--text output can be very large and can overrun systems with limited memory if the whole file is read into R. The solution is to add a return_type argument taking 'data' or 'path'. If 'path', return the file path, which the user can then choose to import.

To Add:

fimo return_type
importFimo
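
A sketch of how the return_type switch might behave, assuming the text output has already been written to a tab-delimited file with a header row. `fimo_result_sketch` is a hypothetical name, and the reading step stands in for whatever importFimo ends up doing.

```r
# Hypothetical sketch: return either the parsed data or just the path to the output file.
fimo_result_sketch <- function(out_path, return_type = c("data", "path")) {
  return_type <- match.arg(return_type)
  if (return_type == "path") {
    # Leave the (possibly very large) file on disk for the user to import later.
    return(out_path)
  }
  # Assumption: tab-delimited text with a header line.
  utils::read.delim(out_path, stringsAsFactors = FALSE)
}
```

Using `match.arg` makes 'data' the default and rejects any other value with a clear error.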

Update runFimo to unlist list input.

running FIMO on sequence list input doesn't really make sense, so can either not support or unlist & warn.

Vignette: Discover similar motifs, combine, plot pearson cor, search genome w/ FIMO

Use example from Megan?

tryCatch processx process

If error, return stdError/stdOut to user
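
One way to surface the tool's own messages, sketched in base R. The input `res` is assumed to be a list with `status`, `stdout`, and `stderr` elements describing a finished process (similar in shape to what a process runner returns when it is told not to error on non-zero exit). `check_process_result` is a hypothetical helper.

```r
# Hypothetical helper: turn a failed process result into an informative R error.
check_process_result <- function(res) {
  if (!identical(as.integer(res$status), 0L)) {
    stop(
      "process exited with status ", res$status, "\n",
      "stderr:\n", paste(res$stderr, collapse = "\n"), "\n",
      "stdout:\n", paste(res$stdout, collapse = "\n"),
      call. = FALSE
    )
  }
  invisible(res)
}

# Usage: a failing result is caught and its message handed back to the caller.
bad <- list(status = 1L, stdout = "", stderr = "unknown option")
msg <- tryCatch(check_process_result(bad), error = function(e) conditionMessage(e))
```

Including both streams in the error means the user sees what the command-line tool itself complained about, not just that it failed.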

separate TOMTOM from DremeStats?

Allow modular running of just DREME followed by TOMTOM. Currently the process is too intimately linked and the internal functions need some refactoring.

use importTomTomXML backend for both import & internal parsing

All other utils use the user-facing import function internally. This should also work with tomtom (and reduce overhead/possibility for bugs in 1 interface vs the other). I forget my original reasoning for this separation, but it's time to merge these if possible.

tomtom fails without altid

Issues
Main reason for this feature is to avoid rare scenario where two motifs may have identical names (like if the user joined two data.frames).

Possible Solutions

It may be that tomtom doesn't allow identical named motifs as input (haven't checked this). If this is the case, then I can drop requirement of altname and instead check & error if names are non-unique?
add altname columns to everything initialized with NA_character_ if altname column does not exist.
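
Both possible solutions can be combined in one small input check, sketched here on a plain data.frame. `ensure_altname_sketch` is a hypothetical helper, and the `id` and `altname` column names follow the naming used in these notes.

```r
# Hypothetical helper: require unique ids and make sure an altname column exists.
ensure_altname_sketch <- function(df) {
  if (anyDuplicated(df$id) > 0) {
    stop("motif ids must be unique", call. = FALSE)
  }
  if (!"altname" %in% names(df)) {
    # Initialize with a typed NA so downstream code can rely on a character column.
    df$altname <- NA_character_
  }
  df
}

motifs_df <- data.frame(id = c("m01", "m02"), stringsAsFactors = FALSE)
motifs_df <- ensure_altname_sketch(motifs_df)
```

With this in place, code downstream can always refer to `altname` without first checking whether the column exists.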

Set default database location, update variable name

Change "tomtom_db" to "meme_db" for env & options.
default = /meme/db/

see install docs

Add session_info at the end of vignettes

import MEME needs rework

Currently importMeme has parse_sequences and combined_sequences flags. Both assume that the input fasta headers are genomic positions, which rules out using proteins with runMeme.

Refactor so parse_sequences controls whether to convert to GRanges->data.frame, otherwise use data.frame to start for everything.

Additional changes accompanying this fix:
Add a flag: if dna / rna = T, use parse_sequences; otherwise, if protein = T, don't parse sequences.

Add a user-override argument (parse_sequence = "auto") to enforce one way or the other (allows use with user fasta files without sequence headers).

Vignette idea: Motif DB query & use as input to runAme/runFimo, etc.

Also, filter expressed TFs from motifDB list, then use as tomtom/ame search, etc.

Use as_universalmotif_df on motifDb query to clean up motif entries before using as database. (Example: flyFactor Survey FBgn).

test tomtom database with no altname

Pretty sure the no altname issue carries over to the tomtom database entries as well. Need to fix this using any_of and dplyr::rename_all(recode, alt = "altname") where needed. Or initialize altname to NA_character_ as soon as possible.

add unit test for this

Cleanup Test data

Clean up unit tests

General

Set NOT_CRAN/meme_is_installed flags for remote systems

Tools

Tomtom output needs revision

Problem

everything works well when using dreme results as input, but things break down when using dreme.txt or universalmotif list

It is very useful to have a column of $motif and $best_match with $tomtom full results nested inside. This currently doesn't happen when not using dreme results.

Solution

Do not allow path input. Users should use motif list from import function.

Issues with dreme.txt input: dreme.txt leaves out extra information such as the p-value and the pos and neg counts, so using the xml instead of the txt gives consistent behavior. Build a dreme-results-like data.frame from the query entries in tomtom.xml. If users want to use dreme.txt, they can import it as a meme file? (This currently errors with read_meme; a PR has been submitted to universalmotif to fix it.)

For universalmotif input, build dreme-results like dataframe from coercion with as.data.frame and append the motif column w/ universalmotif object. rename 'name' to 'id' to be consistent with meme-suite identifiers. (d60b8bc)
better handling of tomtom_results = NULL
double check columns returned in tomtom_results, ensure nothing is being left out!

Tests to write

check that downstream utilities work with all types of input (shouldn't need multiple dispatch)