cnluzon / wigglescout
Explore and visualize genomics bigWig data.
License: GNU General Public License v3.0
This is to be done when repository is made public
Right now, the minimal flanking region that can be used in both heatmaps and profiles is one bin size. Setting upstream or downstream values to 0 exits with an error.
This should be adapted to also accept empty flanking regions and return just the inside matrix. Of course, this only makes sense in stretch mode.
Sometimes I get unexpected label order. I believe it's the factor sorting alphabetically. Labels are properly assigned, it's just the colors.
While adding new features I realised I completely messed up a figure by changing something in the aes function call of ggplot2. However, all the tests succeeded because I am not testing the actual look of the plots. It would be great to test for this as well.
This would apply to:
ggplot does not provide dendrogram functionality, so in order to do this I should use R base functionality or just draw the dendrogram somewhere else.
I believe plot_bw_profile should perform the normalization after aggregating values instead of before. Signal is otherwise too noisy.
This is a relevant parameter to show in the caption, which is not obvious from the plot itself.
Nice package!
From the help message, the genome argument seems to accept only mm9 and hg38. How can I define other genomes? Thanks.
There are a bunch of validation functions that could be factored out from bwtools.R. If refactor is done properly they can probably be reused in additional functionality.
Now that package-internal functions start with a . (dot), this should also apply to utils.R functions.
Summarized heatmaps and profiles could include more information on the number of points included in the plot and so on.
Since the ggplot implementation, plot_bw_summary_heatmap reorders rows, because they are factor type. The factor should be reordered to keep the initial order.
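The alphabetical reordering described above is the default behaviour of factor() in base R. A minimal sketch of how explicit levels keep the input order; the label names are made up for illustration and this is not the package's actual code:

```r
# Hypothetical row labels, in the order supplied by the user
labels <- c("sample_b", "sample_c", "sample_a")

# Default factor(): levels are sorted alphabetically
default_fct <- factor(labels)

# Explicit levels: order of first appearance is preserved
ordered_fct <- factor(labels, levels = unique(labels))

print(levels(default_fct))
print(levels(ordered_fct))
```

Since ggplot2 draws discrete axes and color scales in level order, setting the levels this way before plotting should keep rows (and colors) in the original order.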
I recently spotted an error where, if I try to run bw_loci directly on a narrowPeak file, I get this issue:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
scan() expected 'an integer', got 'NNNNNN'
Where NNNN is a floating point value that corresponds to the signalValue column of the narrowPeak file. Apparently this comes from the .loci_to_granges function, where there is a call:
bed <- import(loci, format = "BED")
Apparently BiocIO::import does not work with format = "BED" on narrowPeak files, although it parses the file correctly if this parameter is skipped.
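To illustrate why a strict BED parser trips on this file type: narrowPeak is BED6 plus four extra columns, and signalValue is floating point. The sketch below reads such a file with base R only, declaring column types explicitly. The helper name read_narrowpeak and the toy peaks are hypothetical; this is one possible workaround, not the fix used in the package.

```r
# narrowPeak columns (BED6 + 4) with explicit types, so that the
# floating-point signalValue column is never parsed as an integer.
narrowpeak_cols <- c(
  chrom = "character", start = "integer", end = "integer",
  name = "character", score = "integer", strand = "character",
  signalValue = "numeric", pValue = "numeric", qValue = "numeric",
  peak = "integer"
)

# Hypothetical helper: read a tab-separated narrowPeak file
read_narrowpeak <- function(path) {
  read.table(path, sep = "\t", header = FALSE,
             col.names = names(narrowpeak_cols),
             colClasses = unname(narrowpeak_cols),
             stringsAsFactors = FALSE)
}

# Toy file with two made-up peaks
tmp <- tempfile(fileext = ".narrowPeak")
writeLines(c(
  "chr1\t100\t200\tpeak1\t500\t.\t12.5\t8.1\t6.2\t50",
  "chr1\t300\t450\tpeak2\t300\t.\t7.25\t5.0\t3.9\t70"
), tmp)

peaks <- read_narrowpeak(tmp)
print(peaks$signalValue)
```

Note that BED-style starts are 0-based, so converting such a data frame to GRanges would need start + 1.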
With the ggrastr library it is possible to rasterize layers of a plot. This is useful for scatterplots with many points, when you want to save them in a vector format.
Running the full test set is still not very slow, but I noticed this test slows things down noticeably (by seconds). I think this line is the problem: future::plan(multisession, workers=2)
Wherever there is a BED file accepted now, one can provide a GRanges. This is not reflected in the parameter names and function names. I think it should.
This is a major change in the API so I think changes like this should be gathered together for the formal release.
Accepting GRanges objects wherever a BED file is accepted could be useful when processing BED files before plotting (i.e. expanding or shortening loci, filtering and so on).
Outlier removal: add a parameter to the functions that allows excluding a top percentile of the data.
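A minimal sketch of what such a filter could look like in base R. The function name remove_top_percentile is hypothetical; the remove_top parameter name follows the one mentioned elsewhere in these issues, and dropping NA values here is an assumption.

```r
# Hypothetical helper: drop values above the (1 - remove_top) quantile.
# NA values are ignored when computing the cutoff and dropped from the result.
remove_top_percentile <- function(values, remove_top = 0.01) {
  stopifnot(remove_top >= 0, remove_top < 1)
  cutoff <- quantile(values, probs = 1 - remove_top, na.rm = TRUE)
  values[!is.na(values) & values <= cutoff]
}

set.seed(1)
x <- c(rnorm(99), 1000)  # 99 ordinary values plus one extreme outlier
filtered <- remove_top_percentile(x, remove_top = 0.01)
print(length(filtered))
```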
This is a very specific and unusual case. If provided BED file has redundant loci, at some point in the merging the values are duplicated, creating spurious points. In the plot this does not matter but in the point-count it does.
This would be solved by deduplicating the BED file beforehand, which can be a bit costly but is perhaps worth it.
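As a rough illustration of the deduplication step, assuming the loci are available as a data frame with chrom, start and end columns (the column names and toy values are made up):

```r
# Toy loci table with one redundant entry (rows 1 and 2 are identical)
loci <- data.frame(
  chrom = c("chr1", "chr1", "chr2", "chr1"),
  start = c(100L, 100L, 50L, 300L),
  end   = c(200L, 200L, 80L, 400L),
  stringsAsFactors = FALSE
)

# Keep only the first occurrence of each (chrom, start, end) combination
key_cols <- c("chrom", "start", "end")
unique_loci <- loci[!duplicated(loci[, key_cols]), ]
print(nrow(unique_loci))
```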
Heatmap plots are now sorted by mean by default. We need an order parameter, so we can plot heatmaps side by side where loci correspond to the same line.
This happens in plot_bw_profile and plot_bw_heatmap.
Tiny issue, but needs to be fixed: make_norm_label returns log2(RPGC/background instead of log2(RPGC/background)
It seems that plot_bw_summary_heatmap aggregates without removing NAs, resulting in an NA value for any distribution that contains NAs.
We had not run into this issue before, as our bigWig files are always complete.
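The behaviour can be reproduced with base R alone: mean() returns NA as soon as one value is missing, unless na.rm = TRUE is passed. A small sketch with made-up values, not the package's aggregation code:

```r
# Toy per-bin values for two groups; group "a" contains a missing value
df <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, NA, 2, 4)
)

# Aggregating without NA removal: group "a" becomes NA
with_na <- tapply(df$value, df$group, mean)

# Aggregating with NA removal: group "a" is computed from the remaining value
without_na <- tapply(df$value, df$group, mean, na.rm = TRUE)

print(with_na)
print(without_na)
```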
In general it seems sensible to use the mean to remove the top percentile for multi dimensional plots, but in the case of scatterplot it creates an odd diagonal effect that is not nice.
So in this case I think we should separately remove top_percentile from x and y axis.
If one provides a norm_mode value that is not implemented, .process_norm_func returns NULL, and eventually you get an error because a function does not exist.
.process_norm_func should validate its input further and raise an appropriate error in such a case.
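A possible shape for such a validation, as a sketch only. The function name process_norm_mode is hypothetical, and the mapping of "fc" to identity and "log2fc" to log2 (both applied to signal / background) is an assumption based on the descriptions in these issues.

```r
# Hypothetical validator: returns the transformation for a norm_mode value,
# or stops with an informative error instead of silently returning NULL.
process_norm_mode <- function(norm_mode) {
  norm_funcs <- list(
    fc = identity,  # assumed: plain signal / background ratio
    log2fc = log2   # assumed: log2(signal / background)
  )
  if (!is.character(norm_mode) || length(norm_mode) != 1 ||
      !norm_mode %in% names(norm_funcs)) {
    stop("Unknown norm_mode: '", norm_mode, "'. Valid values: ",
         paste(names(norm_funcs), collapse = ", "), call. = FALSE)
  }
  norm_funcs[[norm_mode]]
}

# Usage: a valid mode returns a function; an invalid one raises an error
ratio <- 8
log2_ratio <- process_norm_mode("log2fc")(ratio)
err <- tryCatch(process_norm_mode("zscore"),
                error = function(e) conditionMessage(e))
print(err)
```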
Caption lines in verbose plots are sometimes longer than the plot, making some of the values not visible.
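One way to keep captions inside the plot width is to wrap them before passing them to the plotting function. A minimal sketch using base R strwrap(); the helper name wrap_caption, the width and the caption text are made up for illustration:

```r
# Hypothetical helper: break a long caption into lines of limited width
wrap_caption <- function(caption, width = 60) {
  paste(strwrap(caption, width = width), collapse = "\n")
}

caption <- paste(
  "bin_size: 5000, norm_mode: log2fc, remove_top: 0.001,",
  "genome: mm9, loci: 12345 regions"
)
wrapped <- wrap_caption(caption, width = 40)
cat(wrapped, "\n")
```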
Now BiocCheck is included in the CI workflow but it doesn't make the test fail.
I have noticed in an Input line on a ChromHMM plot that values were close to zero but not exactly zero for the input bigWig when I used remove_top = 0.001 and norm_mode = "log2fc".
I suspect this is because it is removing the elements from only the signal files but not the input ones.
Calculations run slowly regardless of whether a bigWig file is an example one with very few values, because all the bins will still be calculated. This can be fixed by passing on the selection parameter, which is already there.
This speeds up the generation of vignettes, which otherwise take some unnecessary extra time.
At the beginning I decided to pass norm_func as a parameter to be able to normalize either as signal / background or as log2(signal / background). This log2 function is what is passed to the function. This is confusing because it does not convey in any way that signal is divided by background, so users would often think that log2 means only that. It also does not generalize that much, because there are only so many ways in which we would like to transform the data, and it allows injecting strange functions, probably to no use.
So I propose changing this to normalization and allowing a set of string values: fc, log2fc and perhaps diff or whatever we can think of.
This has gotten lost in some recent update
I think some of the core bwtools functions (multi_bw_ranges_, bw_ranges) have some room for optimization.
This can probably be done better after package is public.
Some of the values don't look good by default in RStudio on laptop screens, which I find inconvenient. Things that look good on a laptop display are more likely to look good on a larger screen than vice versa, so I think defaults should be shifted towards smaller displays.
It is not easy to predict when the legend is going to be on top of the lines, so it's better to make it transparent.
The usual color palette that is not part of other ggplot default palettes should be specified somewhere instead of hard-coded where needed.
This is a very specific example from a RNA_seq bigwig file from another reference.
The error I got:
Error in quantile.default(rowMeans(full), probs = c(1 - remove_top)) :
missing values and NaN's not allowed if 'na.rm' is FALSE
Haven't been able to reproduce this easily but I will look into it.
Internal function granges_cbind sorts GRanges objects and merges them. This is OK because it is always called with objects of the same bins and ranges, but it could probably run faster using the merge function on data frames and converting back to GRanges.
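A rough sketch of the data-frame route, using base R merge() on toy tables. The function name merge_scores and the column names are hypothetical, and whether this is actually faster than the current approach is untested here.

```r
# Hypothetical helper: join two score tables on their genomic coordinates
merge_scores <- function(df_a, df_b) {
  merge(df_a, df_b, by = c("seqnames", "start", "end"), sort = FALSE)
}

# Two toy tables covering the same bins, given in different row orders
df_a <- data.frame(
  seqnames = c("chr1", "chr1", "chr2"),
  start = c(1L, 101L, 1L),
  end = c(100L, 200L, 100L),
  sample_a = c(0.5, 1.2, 3.3)
)
df_b <- data.frame(
  seqnames = c("chr2", "chr1", "chr1"),
  start = c(1L, 1L, 101L),
  end = c(100L, 100L, 200L),
  sample_b = c(9, 7, 8)
)

merged <- merge_scores(df_a, df_b)
print(merged)
```

The merged data frame could then be converted back with something like GenomicRanges::makeGRangesFromDataFrame(merged, keep.extra.columns = TRUE).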
The idea would be to add a parameter like subsample that randomly takes a subset of subsample bins or loci to perform the analysis. This is helpful for two things:
Probably also requires a seed parameter for testing.
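A minimal sketch of how subsample and seed could interact, assuming the loci are held in a data frame. The function name subsample_loci and the toy data are hypothetical.

```r
# Hypothetical helper: randomly keep `subsample` rows; a seed makes the
# selection reproducible, which is what tests would rely on.
subsample_loci <- function(loci, subsample = NULL, seed = NULL) {
  if (is.null(subsample) || subsample >= nrow(loci)) {
    return(loci)  # nothing to do: keep all loci
  }
  if (!is.null(seed)) {
    set.seed(seed)
  }
  keep <- sort(sample.int(nrow(loci), size = subsample))
  loci[keep, , drop = FALSE]
}

# Toy table of 100 consecutive 100 bp loci
toy_loci <- data.frame(
  seqnames = "chr1",
  start = seq(1L, by = 100L, length.out = 100),
  end = seq(100L, by = 100L, length.out = 100)
)

first_draw <- subsample_loci(toy_loci, subsample = 10, seed = 42)
second_draw <- subsample_loci(toy_loci, subsample = 10, seed = 42)
print(nrow(first_draw))
```

Sorting the sampled indices keeps the loci in their original genomic order, which may or may not be desirable depending on the plot.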
This may have slipped after reimplementing with ggplot. This is especially relevant for ChromHMM plots, where numbered regions should be sorted.
NA values are calculated within the filtering function, but they should be reported also if the data is unfiltered.
For traceability, package version should be displayed in the verbose version of plots.
Uses of aes_string should be replaced by .data, as aes_string is deprecated and will eventually be removed.
ggplot2 imports should be limited to the specific functions that each given function needs.
At some point in the past genome info data was stored as a sysdata object within the package, in order to make it a bit more lightweight, since we only needed really seq lengths info.
At this point, I think this decision is not really solving dependencies and it makes the package less general, as it can only handle mm9/10 and hg38 genomes, which is unreasonably restrictive.
I get the same result for norm_mode == "log2fc" as for norm_mode == "fc"