
open2c / bioframe

Genomic interval operations on Pandas DataFrames

License: MIT License

Python 100.00%
bioinformatics dataframes genomic-intervals genomic-ranges genomics ngs-analysis numpy pandas python spatial-join

bioframe's People

Contributors

aafkevandenberg, agalitsyna, dependabot[bot], gamazeps, gfudenberg, gokceneraslan, golobor, gspracklin, harshit148, itsameerkat, ivirshup, luisdiaz1997, mimakaev, nileshpatra, nvictus, phlya, pre-commit-ci[bot], sergpolly, smitkadvani


bioframe's Issues

bioframe.closest drops values if chromosome is missing in other

In bioframe.closest(a, b), the result usually has the same length as a. However, if some chromosome is present in a but missing in b, the result is shorter. This is dangerous!


Related but different issue: why don't we index the result by the index of a by default? This would prevent so many bugs! (I just had two in a row!)

Correct me if I'm wrong, but (one of) the most likely use case of closest would proceed as follows:

result = bioframe.closest(mydf, other_df)
# ... do something to result ...
mydf["new_column"] = result["some_column"]          # option 1, or ...
mydf["new_column"] = result["some_column"].values   # option 2

Right now both options usually work, but can silently fail for two reasons. The first is a missing chromosome, which breaks both options. The second is result being indexed differently from mydf (e.g. when mydf is a slice indexed with [1, 2, 4, 6, 9, 14, 15]): option 1 then fails silently through index misalignment, while option 2 still works unless chromosomes are also missing. And if mydf is a mostly-complete slice, you won't even notice the bug when you display the .head(). I just experienced both versions...

In my experience, pandas preserves the original index and length with .join(), but not with .merge(). So .join-like operations should behave like .join (same length, original index), while .merge()-like operations may return a different length and a fresh index.
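The pandas semantics referenced here can be illustrated with a toy example (hypothetical data, not bioframe output):

```python
import pandas as pd

# .join aligns on the index and preserves the caller's index and length,
# while .merge joins on columns and produces a fresh RangeIndex.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]}, index=[1, 2, 4])
right = pd.DataFrame({"y": [10, 30]}, index=[1, 4])

joined = left.join(right)   # keeps index [1, 2, 4], length 3, NaN where unmatched
merged = left.merge(right.assign(key=["a", "c"]), on="key")  # new index, length 2
```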

UCSC fetching functions

Hi,

I'm getting errors connecting to UCSC for both fetching centromeres and gene density, after upgrading to the most recent version of bioframe today. Is there a new way to call these functions based on the recent updates?

print(bioframe.__version__)
0.1.0-dev

#from cooltools example https://cooltools.readthedocs.io/en/stable/examples/pileups-example.html
#chromosome arms!
#Also specifying just good chromosomes 4, 14, 17, 18, 20, and 21

cens = bioframe.fetch_centromeres('hg19')
cens.set_index('chrom', inplace=True)
cens = cens.mid

GOOD_CHROMS = ['chr4', 'chr14', 'chr17', 'chr18', 'chr20', 'chr21']

arms = [arm
        for chrom in GOOD_CHROMS
        for arm in ((chrom, 0, cens.get(chrom,0)),
                    (chrom, cens.get(chrom,0), hg19.get(chrom,0)))
]

armsdf = pd.DataFrame(arms, columns=['chrom','start', 'end'])
arms

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-6971b191691f> in <module>
      3 #Also specifying just good chromosomes 4, 14, 17, 18, 20, and 21
      4 
----> 5 cens = bioframe.fetch_centromeres('hg19')
      6 cens.set_index('chrom', inplace=True)
      7 cens = cens.mid

~/bin/miniconda3/envs/cooler-env/lib/python3.6/site-packages/bioframe/resources.py in fetch_centromeres(db, provider, merge, verbose)
     77     client = UCSCClient(db)
     78     fetchers = [
---> 79         ('centromeres', client.fetch_centromeres),
     80         ('cytoband', client.fetch_cytoband),
     81         ('cytoband', partial(client.fetch_cytoband, ideo=True)),

AttributeError: 'UCSCClient' object has no attribute 'fetch_centromeres'

Also an issue with gene density

#gene density (whole genome)
#using new way of cooltools compartment calling using gene density instead of GC% to phase A vs B (in cooltools examples)
# Download and compute gene count per genomic bin 
bins = cooler.binnify(hg19, binsize)
genecov = bioframe.tools.frac_gene_coverage(bins, 'hg19')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-f5a95c5a96d8> in <module>
      3 # Download and compute gene count per genomic bin
      4 bins = cooler.binnify(hg19, binsize)
----> 5 genecov = bioframe.tools.frac_gene_coverage(bins, 'hg19')

AttributeError: module 'bioframe' has no attribute 'tools'

Find a way to ensure alignment of two interval tables.

We need a function to synchronize the indices of two tables with almost identical intervals. This is typically needed to enable safe transferring of columns between these tables. Alternatively, we can have a function that transfers columns between two tables of almost identical intervals.

@nvictus , is this a good summary of your request?
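One possible sketch of such a transfer helper (`transfer_columns` is a hypothetical name; it assumes both tables use identical (chrom, start, end) keys, unique in the source):

```python
import pandas as pd

def transfer_columns(target, source, columns, on=("chrom", "start", "end")):
    """Hypothetical sketch: copy `columns` from `source` into `target` by
    matching on the interval key, preserving target's index and length.
    Assumes the key columns uniquely identify rows in `source`."""
    key = list(on)
    looked_up = target[key].merge(source[key + list(columns)], on=key, how="left")
    out = target.copy()
    for col in columns:
        # .values sidesteps index alignment; lengths match by construction
        out[col] = looked_up[col].values
    return out

a = pd.DataFrame({"chrom": ["chr1", "chr1"], "start": [0, 10], "end": [5, 20]})
b = pd.DataFrame({"chrom": ["chr1"], "start": [10], "end": [20], "score": [7]})
result = transfer_columns(a, b, ["score"])  # score is NaN where no match exists
```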

is_contained limitations: cannot check if arms/regions are contained in the chromosomes

Currently is_contained only supports "containment" checking between two dataframes (viewframes?) that both have a "name" column (or similar), where those names are cataloged in view_df.

This might be limiting: for example, it prevents us from checking whether a set of regions (e.g. arms) is contained within the chromosomes:

make_viewframe(arms_df, check_bounds=chromsizes_df)

fails with the "not cataloged" error

PyPI package is different from GitHub repo

I was recently checking out the cooltools suite and found that some functions used parts of the bioframe package that are not part of the version installable via pip (i.e. from PyPI).

I double-checked the version numbers and I have the latest bioframe version.

In particular this affects the bioframe.bedslice function, which is used in cooltools.eigdecomp.cooler_cis_eigs to get the part of the phasing track matching the current region under decomposition. There it is used as defined in the repo here, and it also works as expected when using the cooltools function, with a signature of

bedslice(frame, region)

However, when trying to use bedslice directly, it is a completely different function with a signature of

bedslice(frame, chrom, start, end)

Is there any explanation for this?

frac_gc in tools.py throws error

When I try to run frac_gc, I get the following error:

/home/sameer/miniconda3/lib/python3.6/site-packages/bioframe/tools.py in _each(chrom_group)
149 for _, bin in chrom_group.iterrows():
150 s = seq[bin.start:bin.end]
--> 151 g = s.count('G')
152 g += s.count('g')
153 c = s.count('C')

AttributeError: 'Sequence' object has no attribute 'count'

I think this is because seq is loaded as an OrderedDict mapping <chrom_name> to its FastaRecord, so the slice s is a Sequence object rather than a plain string. It needs to be converted to a string before the count method can be used.
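A minimal sketch of the fix described above, using plain strings for illustration (the actual fix would cast the FastaRecord slice with str() before counting):

```python
def frac_gc_of_seq(s):
    """Fraction of G/C bases in a sequence, case-insensitive."""
    s = str(s)  # a FastaRecord/Sequence slice would need this cast to get .count
    if not s:
        return float("nan")
    gc = sum(s.upper().count(base) for base in ("G", "C"))
    return gc / len(s)

frac_gc_of_seq("ACGTgc")  # 4 of 6 bases are G or C
```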

0.3.0 roadmap

TODO:

API

  • decide if we want to expose more from bioframe.core.*

Cosmetic:

  • master branch name → main
  • old region.py split into core.stringops, core.construction
  • delete old util.py
  • delete dask.py
  • rename genomeops.py to utils.py
  • docstrings should all have one sentence, then a blank line, then a longer description
  • use arg for arguments in docstrings, to agree with https://numpydoc.readthedocs.io/en/latest/format.html#parameters. Internal links can be added with :func:`myfunc`

Changes to existing code:

  • update trim() to rename limits → regions
  • update complement() to rename chromsizes → view, and update behavior to accept a dataframe input instead of just dict of chromsizes.
  • update suffixes= to default ("", "_") across ops

New code:

  • add function bioframe.to_ucsc_region_string(). (maybe in io?) #50
  • create a new module that defines standards on bioframes, verifies existing dataframes and converts different inputs into bioframes: core.py? bioframe.py? standards.py? utils.py?
  • add functions to perform various checks on bioframes: is_sorted, is_overlapping, etc... #19 Which module should they go to? definitions.py? The constraint to keep in mind is that some of these checks may require ops (i.e. tests for overlapping intervals in the set).
  • add a universal constructor to make regions dataframe from: dict {str:int} or dict {str:(int,int)} or pd.Series(ints, chroms), etc. This can then be used for limits. (see https://gist.github.com/gfudenberg/9898023bf9c9f3fc0791d086e6875179#file-test_verifiers-ipynb)
  • synchronize handling of pd.NA in crucial columns (chrom, start, end, on). This is currently handled on a function-by-function basis to avoid casting to float.
  • solution: write a function that nans a part of a table AND casts numpy numeric types into pandas types bioframe.core.construction.sanitize_bioframe
  • delete split(), and update make_chromarms to use subtract.

Test:

  • tests for split
  • tests for make_chromarms
  • ensure that arrayops works with NaNs. (double check arrayops behavior for floats)

Docs:

  • standards and definitions (https://docs.google.com/document/d/10rnnz3TGcaR591Y33k5vPurJimq7vY_P_CHxXl5s0dE/edit)
  • an ipynb with performance evaluation and comparisons
  • fix links in docs to point to correct branch for ipynbs
  • need to use a different word for ‘name’ of an interval, vs. the ‘parent_region_name’ of an interval. E.g. a set of CTCF peaks could have names CTCFpeak1,…, CTCFpeakN, but they could all be on chr1p.
    →→ defaults are 'view_region' and 'name'
  • API docs for the rest of the library:
    • io
    • genomeops
    • core (specs, construction, stringops, checks)
  • add _verify_columns and _verify_column_dtypes to specs documentation (https://stackoverflow.com/a/7740295)
  • add discussion of construction/bedframes/concepts to the guide/interval_tutorial.ipynb.
  • add text to the guide/interval_tutorial.ipynb
  • update links in readme
  • add remaining ops to the guide
  • "how do I" aka cookbook/recipes (ideally, an ipynb)

split drops intervals that do not contain a splitting point

…first dataframe that overlaps an interval from the second dataframe.

@gfudenberg , I forgot if we discussed it, but is that really the desired behaviour? In the standard application, splitting of chromosomes into arms, this setting leads to an omission of chromosomes for which centromeres are not specified.
Overall, this seems counter-intuitive to me... I would suggest (a) adding an argument to retain non-split intervals and (b) setting it to True by default. What do you and others think?

2D interval operations

In light of Anton's PR, just want to start a discussion about future possible functions for 2D interval operations for a later release.

  1. I think we decided on Slack that they should just internally use 1D functions along each dimension and then combine the results. So they are more "sugar" than core functionality, but considering our focus on Hi-C analysis this seems important enough to implement: comparing dot calls seems like a frequent task (e.g. merging annotations called at different resolutions (might be used in the dotcaller?), or obviously finding differential dot calls).
  2. I think we need to basically implement all the same functions as we (will) have for 1D overlaps, but for 2D. Except I am not sure if there is any reason to have 2D complement, and it seems ill defined anyway.
  3. I think it would be useful to have 2D vs 1D overlaps too. This is even easier to achieve by directly using 1D functions, but I'd say again it's something quite frequently needed - e.g. to annotate dot calls with CTCF peaks (and their orientation), or other ChIP-seq/whatever-seq peaks.

Other thoughts?

Fractional expand

Would be nice to add fractional expand, to e.g. double the length of each interval, even if they have different lengths.
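A possible sketch of such an operation (`expand_frac` is a hypothetical name; it scales each interval about its midpoint and clips at zero):

```python
import pandas as pd

def expand_frac(df, scale=2.0):
    """Hypothetical sketch: scale each interval about its midpoint,
    so scale=2.0 doubles every interval's length regardless of size.
    Starts are clipped at 0, which can shorten intervals near the edge."""
    out = df.copy()
    mids = (df["start"] + df["end"]) / 2
    half = (df["end"] - df["start"]) * scale / 2
    out["start"] = (mids - half).round().astype(int).clip(lower=0)
    out["end"] = (mids + half).round().astype(int)
    return out

df = pd.DataFrame({"chrom": ["chr1", "chr1"], "start": [100, 0], "end": [200, 10]})
expanded = expand_frac(df, scale=2.0)  # [100,200] -> [50,250]; [0,10] -> [0,15]
```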

fetch_chromsizes(as_bed=True) doesn't reset index of returned DataFrame - should it?

bioframe.fetch_chromsizes pulls a filtered subset of chromosomes (ignoring contigs and such) in a "natural" order, but it does not reset the DataFrame index, so the index is out of order and non-contiguous (e.g. chrM is ~249 for hg19). Should we reset the index before returning the dataframe?

!!! true - only when using as_bed=True !!! - as_bed = False yields a chrom indexed DataFrame

would be as simple as adding .reset_index(drop=True) at the end of this line:

chromtable = chromtable[['name','start','length']].rename({'name':'chrom', 'length':'end'}, axis='columns')
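A minimal pandas illustration of the symptom and the proposed fix, using dummy data rather than the real UCSC table:

```python
import pandas as pd

# Filtering rows leaves the original row labels behind, so the returned
# frame ends up with a non-contiguous index (the symptom described above).
chromtable = pd.DataFrame({
    "name": ["chr1", "chrM", "chr2"],
    "start": [0, 0, 0],
    "length": [1000, 16571, 900],
})
filtered = chromtable[chromtable["name"] != "chrM"]   # index is now [0, 2]

fixed = filtered.rename(
    {"name": "chrom", "length": "end"}, axis="columns"
).reset_index(drop=True)                              # index is [0, 1] again
```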

bioframe.closest (and perhaps other functions) break when working with categorical chromosomes

cooler.bins() returns chrom column as categorical.

If one then filters chroms (e.g. bins = bins[bins["chrom"] != "chrM"]), "chrM" is actually left in the categorical variable and will be part of the .groupby(), even though the corresponding group has length 0.

Maybe this https://github.com/mirnylab/bioframe/blob/develop/bioframe/ops.py#L833 and similar places need a check for whether the group is empty?
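A pandas-only illustration of the behavior described (no bioframe involved):

```python
import pandas as pd

bins = pd.DataFrame({
    "chrom": pd.Categorical(["chr1", "chr1", "chrM"]),
    "start": [0, 10, 0],
})
bins = bins[bins["chrom"] != "chrM"]  # "chrM" stays a category of the dtype

# With observed=False, groupby emits an (empty) group for the unused category ...
n_groups_all = bins.groupby("chrom", observed=False).ngroups       # 2

# ... while observed=True (or .cat.remove_unused_categories()) skips it.
n_groups_observed = bins.groupby("chrom", observed=True).ngroups   # 1
```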

standardize functions to read various "bioframes" from file

related to "define bioframe standards" item of the current roadmap: #48

in the cooltools CLI (and when working with real data in a notebook) we have some recurring "read-from-file" patterns:

  • read "regions" dataframe (in a cooler-compatible fashion) - e.g. in compute-expected , soon in call-compartments, in compute-saddles, in call-dots ... also in dump-cworld apparently ... and of course - pile-ups ! maybe coolpuppy ?
  • read "reference-track" (bedGraph) in a cooler-compatible fashion for compartment flipping/sorting and for saddle calculation,
  • maybe reading some plain BED/BEDPE files e.g. for pile-ups ?!
  • reading "expected" in a cooler and regions-compatible fashion (does "expected" belong to "bioframe" standards ? - distance summary for cis, or BEDPE-like summary for trans ) in call-dots, in compute-saddles - something else (?)

We clearly need those functions in cooltools, but the question is if they need to be defined in bioframe ?
... Probably not, as making "read-from-file" functions that check for compatibility with cooler should be outside the scope of "bioframe" ...
However, implementing functions like that would require "bioframe" defined standards and checks for sure, like:

  • is reference-track "tiling" and "sorted" ?
  • are "regions" non-overlapping ? named or require naming ?
  • what else ?

Last point - "read-from-file" functions like these can maybe reside in bioframe only if they check for cooler-compatibility indirectly - e.g. via "chrom-sizes" and/or "binsize" in case we want or check alignment to bins

merge "breaks combinatorially" when input DataFrame has duplicate indices

minimal example:

x = pd.DataFrame({"chrom":["chr1","chr2"],"start":[100,400],"end":[110,410]})
x.index = [0,0]
bioframe.merge(x, min_dist=5)

"combinatorially" broken output - all chroms are paired with all the intervals:

	chrom	start	end	n_intervals
0	chr1	100	110	1
1	chr1	400	410	1
2	chr2	100	110	1
3	chr2	400	410	1

should have been:

	chrom	start	end	n_intervals
0	chr1	100	110	1
1	chr2	400	410	1

a more real-life use case would be:

df = pd.concat([peaks1,peaks2])
bioframe.merge(df,min_dist=1000)

resetting index of the input DataFrame fixes that ...
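A minimal sketch of the workaround, applied before calling bioframe.merge:

```python
import pandas as pd

peaks1 = pd.DataFrame({"chrom": ["chr1"], "start": [100], "end": [110]})
peaks2 = pd.DataFrame({"chrom": ["chr2"], "start": [400], "end": [410]})

df = pd.concat([peaks1, peaks2])  # both rows are labeled 0 after concat

df = df.reset_index(drop=True)    # workaround: make the index unique again
```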

bioframe select

bioframe.select is currently incompatible with passing cols that differ from the default (chrom, start, end).
Should be an easy fix.

fetch chromsizes

Only able to fetch chromsizes for organisms whose chromosomes are numbered with Arabic numerals (e.g. hg19), not Roman numerals (e.g. ce10).

Integration with cooltools

I'm getting errors when calling cooltools functions for calculating eigenvectors/eigenvalues. For example, slice_bed does not exist, and there is no tools module. Can the cooltools package be updated to use the proper bioframe functions?

EncodeClient, UCSCClient enhancements

it would be great to enhance the functionality of bioframe.EncodeClient

  • auto download the metadata.tsv if it's not there
  • auto build the needed directory structure if not there
  • verify that provided assembly matches encode's list

Pass explicit `header=False` to any Series.to_csv

Upcoming breaking change in pandas default:

FutureWarning: The signature of `Series.to_csv` was aligned to that of `DataFrame.to_csv`, and argument 'header' will change its default value from False to True: please pass an explicit value to suppress this warning.
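A minimal sketch of the suggested change, with a dummy chromsizes-like Series:

```python
import io
import pandas as pd

s = pd.Series([249250621, 243199373], index=["chr1", "chr2"], name="length")

buf = io.StringIO()
# Passing header explicitly keeps the output stable across pandas versions
# and silences the FutureWarning quoted above.
s.to_csv(buf, sep="\t", header=False)
```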

Trouble with to_bigwig

I am getting an error today trying to convert bedGraph-formatted pandas dataframes to bigwig. This code worked previously (about a month ago), but today gives an error. I have updated cooltools and bioframe to try to fix the issue, but it does not seem to have helped.

Here is the code:

#save to file
for cond in conditions:
    bioframe.to_bigwig(insul[cond], chromsizes, 
                       f'data/{long_names[cond]}.{binsize//1000}kb.insul_score_{window_bp}.bw', 
                       f'log2_insulation_score_{window_bp}')

Here is the error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-5-d6f9cc84595f> in <module>
      5     bioframe.to_bigwig(insul[cond], chromsizes, 
      6                        f'data/{long_names[cond]}.{binsize//1000}kb.insul_score_{window_bp}.bw',
----> 7                        f'log2_insulation_score_{window_bp}')

~/bin/miniconda3/envs/cooler-env/lib/python3.6/site-packages/bioframe/io/formats.py in to_bigwig(df, chromsizes, outpath, value_field)
    433 
    434         run(['bedGraphToBigWig', f.name, cs.name, outpath],
--> 435             print_cmd=True)
    436 
    437 

~/bin/miniconda3/envs/cooler-env/lib/python3.6/site-packages/bioframe/io/process.py in run(cmd, input, raises, print_cmd, max_msg_len)
     46         if len(out) > max_msg_len:
     47             out = out[:max_msg_len] + b'... [truncated]'
---> 48         raise OSError("process failed: %d\n%s\n%s" % (p.returncode,  out.decode('utf-8'), err.decode('utf-8')))
     49 
     50     return out.decode('utf-8')

OSError: process failed: 255

Expecting 2 words line 1 of /tmp/tmpb8jcb5z7.chrom.sizes got 1

Here is the first few lines of the insulation file:

chrom	start	end	is_bad_bin	log2_insulation_score_480000	n_valid_pixels_480000
0	chr1	0	40000	True	NaN	0.0
1	chr1	40000	80000	True	NaN	0.0
2	chr1	80000	120000	True	NaN	0.0
3	chr1	120000	160000	True	NaN	0.0
4	chr1	160000	200000	True	NaN	0.0

Here is what I am using for chromsizes - it is from bioframe.fetch_chromsizes

chr1     249250621
chr2     243199373
chr3     198022430
chr4     191154276
chr5     180915260
chr6     171115067
chr7     159138663
chr8     146364022
chr9     141213431
chr10    135534747
chr11    135006516
chr12    133851895
chr13    115169878
chr14    107349540
chr15    102531392
chr16     90354753
chr17     81195210
chr18     78077248
chr19     59128983
chr20     63025520
chr21     48129895
chr22     51304566
chrX     155270560
chrY      59373566
chrM         16571
Name: length, dtype: int64

Creating a more extensive and generative test suite?

pyranges has an extensive test suite based on generative testing with hypothesis.

It basically generates random genomic data and ensures that the pyranges solution is equal to the bedtools solution. Would you be interested in a similar test suite for bioframe?

I only run a hundred iterations on CI, but on my home server I often let it run many examples for each test.
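A rough stdlib-only sketch of the idea (hypothesis would be the real tool): generate random intervals and assert an invariant of a naive reference merge. Everything here, including `merge_intervals`, is illustrative rather than bioframe code:

```python
import random

def merge_intervals(ivs, min_dist=0):
    """Naive reference merge: combine intervals closer than min_dist."""
    out = []
    for start, end in sorted(ivs):
        if out and start - out[-1][1] < min_dist:
            out[-1] = (out[-1][0], max(out[-1][1], end))
        else:
            out.append((start, end))
    return out

random.seed(0)
for _ in range(100):
    ivs = [(s, s + random.randint(1, 50))
           for s in (random.randint(0, 1000) for _ in range(20))]
    merged = merge_intervals(ivs, min_dist=1)
    # invariant: output intervals are sorted and separated by at least min_dist
    assert all(b[0] - a[1] >= 1 for a, b in zip(merged, merged[1:]))
```

The real suite would additionally compare against an independent implementation (e.g. bedtools), as pyranges does.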

fetch_centromeres error

Hi,

Thanks for developing this tool. Recently I ran into a problem with fetch_centromeres. When I tried the command 'bioframe.fetch_centromeres('mm9')', it reported an error:
##########
Traceback (most recent call last):
File "", line 1, in
File "/home/dell/softwares/anaconda2/envs/py36/lib/python3.6/site-packages/bioframe/io/resources.py", line 122, in fetch_centromeres
raise ConnectionError("No internet connection!")
ConnectionError: No internet connection!
##########
Could you give me a hand?
Thanks

provide a unified method to fetch centromeres

Several cooltools need to know the locations of centromeres. Thus, we need a function that would automatically fetch them and bioframe seems like the best library for it.

The biggest obstacle is that there is no standard file in goldenPath whose purpose would be to store centromeres. Several files may or may not have centromeres (cytoband, gap, agp), but there is never a dedicated file that works for all assemblies. Moreover, for some assemblies, like sacCer3, we'd need to use other resources besides UCSC.

So, the solution that I propose is to query centromeres in three steps:
(1) include centromeres for the model organisms (humans, mice (acro), zebrafish, drosophila (none), C. elegans, baker's and fission yeast, arabidopsis) in bioframe itself.
(2) if (1) fails, fetch and parse the cytoband file from UCSC
(3) if (2) fails, fetch and parse the gap file

Also, since centromeres are extended regions, store them as (chrom, start, end, mid).
ping @nvictus
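Given the proposed (chrom, start, end, mid) layout, mid could be derived from the bounds; a sketch with illustrative coordinates:

```python
import pandas as pd

# Illustrative centromere-like intervals; mid is the integer midpoint.
cens = pd.DataFrame({
    "chrom": ["chr1", "chr2"],
    "start": [121535434, 92326171],
    "end": [124535434, 95326171],
})
cens["mid"] = (cens["start"] + cens["end"]) // 2
```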

io.to_bed()

it'd be super nice to have a function that saves a bedframe into a BED file: resort the columns, potentially drop those not compatible with the BED format, and format the comment line.
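A hypothetical sketch of what such an io.to_bed could look like (`to_bed` and `BED_FIELDS` are assumed names, not existing bioframe API):

```python
import io
import pandas as pd

BED_FIELDS = ["chrom", "start", "end", "name", "score", "strand"]

def to_bed(df, path_or_buf):
    """Hypothetical sketch: keep BED-compatible columns in BED order,
    sort by position, and write tab-separated with no header."""
    cols = [c for c in BED_FIELDS if c in df.columns]
    out = df[cols].sort_values(["chrom", "start"])
    out.to_csv(path_or_buf, sep="\t", header=False, index=False)

df = pd.DataFrame({
    "start": [100, 0], "end": [200, 50],
    "chrom": ["chr1", "chr1"], "extra": ["x", "y"],
})
buf = io.StringIO()
to_bed(df, buf)  # "extra" is dropped, columns reordered, rows sorted
```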

docstrings: assorted collection of typos and misleading statements

  1. clustering isn't only for overlapping intervals, right? I was wondering at first what's different between merge and cluster

    Cluster overlapping intervals.

  2. add something like "limit by chromosome or region", or maybe a quick inline example would suffice - otherwise it's not 100% obvious from the docstring that it has to be like: `{chr1: (0,100000), chr2: (400,8000000), ...}`

    limits : {str: int} or {str: (int, int)}

  3. "indepdendently" :))

    List of column names to perform clustering on indepdendently, passed as an argument

has been addressed by #67

Operations on signal tracks

As a reminder, once Anton's PR is merged we should see how/whether to implement operations with signal tracks, such as average profiles, at least as an example notebook at first.

Citing bioframe?

How would I cite bioframe?

I am writing up my Ph.D. and discussing the merits/downsides of pyranges and bioframe.

I know you do not have a paper on it but it would be nice to have something to cite. Is it correct to say that bioframe was originally made to support cooler and other Hi-C software? Then I can cite cooler.

can we have to_ucsc of some sort somewhere ?

in an effort to unify handling of regions throughout cooltools/bioframe we should have something like a to_ucsc formatter somewhere, shouldn't we?

e.g. to avoid code like that: regions.apply(lambda x: "{}:{}-{}".format(*x), axis=1) ...

There is something in cooler, I believe, but should it be in bioframe instead?
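A minimal sketch of such a formatter (`to_ucsc_string` is a hypothetical name), replacing the inline lambda shown above:

```python
import pandas as pd

def to_ucsc_string(chrom, start, end):
    """Hypothetical formatter for a UCSC-style region string."""
    return f"{chrom}:{start}-{end}"

regions = pd.DataFrame({"chrom": ["chr1"], "start": [0], "end": [100000]})
labels = regions.apply(
    lambda r: to_ucsc_string(r["chrom"], r["start"], r["end"]), axis=1
)
```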

value types casting issues, pd.NA and pd.Categorical ....

a little snippet below is an example of my issues with bioframe.ops.overlap :

  1. couldn't cast pd.NA to float
  2. chrom,start,end became object and pd.Categorical somehow - which made groupby without observed=True impossible

I'm not sure which of these is expected behavior and which isn't ...
I'll fill in this example with actual data later

from bioframe import io
timfname = "filename"
bins = clr.bins().fetch(chrom).copy()
val_rt = io.read_table(timfname, schema=bioframe.schemas.BEDGRAPH_FIELDS, skiprows=1)
val_rt = val_rt.rename({0:"chrom",1:"start",2:"end",3:"rt"}, axis=1)

# overlap RT signal with our bins ...
binned_rt = bioframe.ops.overlap(bins, val_rt, suffixes=['', '_rt'])
# freaking pd.NA !? can't cast them to float ... why ? why are they there in the first place ?!
binned_rt["value_rt"] = binned_rt["value_rt"].apply(lambda x: np.nan if x is pd.NA else x)

binned_rt = binned_rt \
                .drop(labels=["chrom_rt","start_rt","end_rt"],axis=1) \
                .astype({"start":int,"end":int,"weight":float})
# finally :
binned_rt = binned_rt.groupby(["chrom","start","end"],observed=True,sort=False).mean()
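One way to avoid the elementwise .apply above, assuming the column ends up with a nullable float dtype, is Series.to_numpy with an explicit na_value:

```python
import numpy as np
import pandas as pd

# Illustrative nullable-float column containing pd.NA after an overlap-like op.
s = pd.Series([1.5, pd.NA, 2.5], dtype="Float64")

# to_numpy with an explicit na_value converts pd.NA to np.nan in one step.
arr = s.to_numpy(dtype="float64", na_value=np.nan)
```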
