jwfoley / genomerator Goto Github PK

View Code? Open in Web Editor NEW

2.0 2.0 1.0 216 KB

Python generators for iterating over genome data

License: MIT License

Python 96.54% Jupyter Notebook 3.46%

genomerator's Introduction

genomerator

Python generators for iterating over genome data

Glossary

Reference: A single physical reference sequence for a genome, transcriptome, gene, etc. with locations described by integer coordinates (e.g. base positions). Each chromosome or contig is one reference. A genome may contain many references. For ease of data management, references are tracked by reference ID (index), a non-negative integer expressing each reference's rank in your ordering system. The first reference's index is 0, because these indexes are used to access the list of reference names and Python starts list numbering at 0. But the order is arbitrary; you can put chr10 before chr2 for alphabetical sorting or after it for numerical consistency, as long as you always keep the references in the same order.
Feature: Any sort of annotation with coordinates on a reference. Genes, exons, SNPs, enriched regions, etc. are all kinds of features. In this module, a feature can only be paired with one reference.
Left and right positions: The "left" position of a feature is the base position within it that has the lowest coordinate (the farthest left as you look at it on a genome browser), expressed as a positive integer, and "right" is the highest coordinate. Other software often refers to these as "start" and "stop"/"end" positions, but that is ambiguous depending on the orientation of the feature. Note: in this module, unlike some other software, both left and right positions are always 1-based, i.e. the first base of the reference is base #1. Thus if the feature is only one base long, like a SNP, its left and right positions are the same number. This means genome coordinates inside the module will match coordinates on the genome browser.
Strand orientation: A feature may be oriented in the same direction as the reference (forward strand, also called plus strand or Watson strand) or the opposite direction (reverse strand, minus strand, Crick strand). E.g. if a gene is on the forward strand, its transcription initiation site has a lower-numbered position than its transcription termination site. Some kinds of features do not have any orientation; for simplicity these are treated as forward-oriented.
Start and end positions: In this module, the "start" and "end" of a feature are designated relative to the orientation of the feature, not the orientation of the reference. So if the feature is forward-oriented, the start position is the left position and the end position is the right position. But if it is reverse-oriented, the start position is the right position and the end position is the left position. It is important to keep these terms distinct as each context is useful in different situations.

`GenomeFeature`

This is a general-purpose container for any kind of data with reference coordinates. It always contains several important attributes:

reference_id: The index number of the reference this feature is on (first reference is number 0).
left_pos, right_pos: The leftmost and rightmost position numbers of the feature on that reference (first base of the reference is always position 1 regardless of whether it's the left or right end of a feature).
is_reverse: A Boolean value describing the orientation of the feature relative to the reference. If True, it is reverse-oriented. If False, it is forward-oriented, or has no orientation.
data: This attribute can contain any kind of information regarding the feature: a count, a score, a name, even a large data structure containing an unlimited amount of information.

Constructor: `GenomeFeature(reference_id, left_pos, right_pos = None, is_reverse = False, data = None`

The reference_id and left_pos must always be specified. If no right_pos is specified, it is assumed to be the same as the left_pos for a single-base feature, e.g. a SNP.

Several alternative constructors exist for creating a GenomeFeature from certain kinds of data:

GenomeFeature.from_genomefeature(other, data = None): Create a new instance at the same location as the old one, but you can specify new data for it to contain (if you don't specify, it has the same data as the old instance).
GenomeFeature.from_alignedsegment(aligned_segment) and GenomeFeature.from_variantrecord(variant_record): These convert a pysam.AlignedSegment or pysam.VariantRecord, respectively, into a GenomeFeature. The original object is stored in the data attribute. E.g. you can still access the data of a converted AlignedSegment as my_feature.data.sequence.
GenomeFeature.from_bed(line, reference_id) and the .from_bedgraph, .from_gff methods with the same syntax: These create a new GenomeFeature from a single line of a UCSC BED, UCSC bedGraph, or GFF file, respectively. .from_bedgraph puts the score (as a float) in data. from_bed and from_gff have another argument, parse (default true): if enabled, all the data fields defined in the format are parsed by name into a dictionary stored in data, or if disabled to save computing time, all the delimiter-split fields (as a list of strings) are stored there instead.

Extracting positions

GenomeFeature.start_pos and GenomeFeature.end_pos each return an int of the feature's start or end position, respectively, according to its strand orientation.

GenomeFeature.left(), .right(), .start(), and .end() each return a new, complete GenomeFeature instance consisting only of the leftmost/rightmost/start/end position of the original instance, but containing the same data.

my_feature[index] returns a new instance with the same data at the specified position, where index is relative to the left position of the original instance. For example, if my_feature's left position is 2355 and right position is 2358, my_feature[0] returns a new instance with left 2355, right 2355; my_feature[2] is left 2357, right 2357. Reverse indexes and contiguous forward-oriented slices are also allowed: my_feature[-1] is left 2358, right 2358, and my_feature[2:] is left 2357, right 2358. GenomeFeature.get_pos(position) does the same thing but with a genome position rather than a relative offset index, e.g. my_feature[2] is the same as my_feature.get_pos(2357). But note that, consistent with Python syntax, the stop position of a slice is still one past the end of the desired section; so my_feature.get_pos(2355:2359) will return the entire feature, length 4.

len(my_feature) returns the number of positions covered by a feature, e.g. my_feature has a length of 4. A single-position feature, e.g. left 2355 and right 2355, has a length of 1.

Iterating over a GenomeFeature yields single-base features from left to right, e.g. for i in my_feature: gives you position 2355, then 2356, then 2357, then 2358.

Modifying positions

In addition to simply setting left_pos and right_pos, you can also perform operations to shift one or both of them a specific distance. GenomeFeature.shift_left(distance) shifts only the left position by the specified distance and .shift_right shifts only the right position. Note that both methods shift the feature to the right (adding distance to left_pos or right_pos), unless distance is negative, and then they shift to the left. You cannot shift them so far that they become backwards and right_pos is less than left_pos. GenomeFeature.shift(distance) shifts both coordinates by the same distance. No amount of shifting will change the reference_id; you can shift past the right end of the reference, with position numbers greater than its entire length, or past the left end, with negative positions, if you have some reason to do that.

There are also versions that shift relative to the orientation of the feature rather than the orientation of the reference. shift_start and shift_end shift the start and end positions, but note that they shift in the direction of the feature. For example, if the feature is forward-oriented, shift_start moves its left position to the right (like shift_left); but if it is reverse-oriented, shift_start moves its right position to the left. shift_forward shifts both coordinates by the same distance, still in the direction of the feature.

For convenience, the + and - operators essentially apply shift, but create a new instance, to keep syntax consistent. So GenomeFeature(reference_id = 5, left_pos = 23, right_pos = 28) + 2 gives you a new feature with left position 25, right 30.

GenomeFeature.switch_strand() simply changes is_reverse to the opposite of what it was before.

Comparing features

The <, >, and == operators compare GenomeFeature instances' positions by sorting order. First they compare by reference_id: all features on reference 2 are "less than" features on reference 3. If the features are on the same reference, they are compared only by left position: a feature with left position 23, right position 42 is less than a feature with left 25, right 38. Strand orientation is ignored by these operators. This is arbitrary and somewhat unintuitive, but it allows features to be put in conventional sorting order with standard Python syntax: the build-in sorted function will sort a list of features in the same order expected by samtools and virtually all other genomics software.

More intuitively, you may want to check whether one feature is entirely to the left of another feature, with no overlap. GenomeFeature.left_of(other) and GenomeFeature.right_of(other) do this. The former will return True if the given feature's right position is to the left of the other feature's left position (or this feature is on an earlier reference than the other).

GenomeFeature.same_as(other) compares whether two features have the same reference, left position, right position, and orientation. They may still contain different data.

GenomeFeature.intersects(other) tests whether two features overlap at any position. They must be on the same reference to overlap. If they do overlap, GenomeFeature.intersection(other) returns a new instance containing only the overlap region, and the same data as the feature it was called from (the data from other are discarded). Contrariwise, GenomeFeature.union(other) returns a new instances containing the longest possible region covered by either feature, as long as there is no gap between then.

Progress calculation

When you are iterating over a whole genome, you may wish to check how far you've gotten, e.g. to estimate remaining time until your script finishes running. GenomeFeature.progress(reference_lengths) calculates the proportion of the whole genome that lies before the given feature, if you give it a list of the lengths of all references.

Exporting

The GenomeFeature class contains a few methods to export an instance into a single line of a common genomics file format.

GenomeFeature.bed(reference_names, name = None, score = None, thick_range = None, rgb = None, block_count = None, block_sizes = None, block_starts = None, use_strand = True) exports into UCSC BED format. You must provide the list of reference names, in the correct order, so it can convert reference_id to a name. You can provide data for the optional fields if desired, otherwise they are filled in with . per the specification. If use_strand is disabled, the strand field will also be filled with . to indicate a nondirectional feature.

GenomeFeature.gff(reference_names, type, source = None, score = None, phase = None, attributes = None, use_strand = True) behaves similarly for GFF format, but one of the fields, type, is required by the format specification so this method requires it too. attributes should be a string containing the entire list of attributes for that field in the format.

GenomeFeature.bedgraph(reference_names) exports to the simpler UCSC bedGraph format, so it only needs the list of reference names. It assumes the feature's data attribute contains the numerical score for this genome region.

Specialized subclasses

`GenomeArray`

This subclass stores position-specific data for a genome region. That is, its data attribute is an iterable containing some kind of data value for every position in the range. Thus len(my_genomearray) is always the same as len(my_genomearray.data). Like collections.defaultdict, GenomeArray is instantiated with a special argument, default_factory, which is a function that creates the default data value for positions in the array. The default default_factory is float, which will create a data value of 0.0 for every position. You can change it to e.g. list or dict and then each position will have its own separate container for larger-scale data.

GenomeArray's methods for accessing individual positions within the region are similar to base GenomeFeature's, except they return instances containing position-specific data. For example, if my_genomearray spans from left position 13 to right position 20, my_genomearray.data has length 8, and my_genomearray[:3] returns a new instance containing only positions 13 through 15 and the data values for only those first 3 positions.

Likewise the methods for changing coordinates also change the data. shift_left, shift_right, shift_start, and shift_end either delete data from the corresponding end of the array (if the shift shortens it) or add new values from the default factory (if lengthening it). shift and shift_forward remove from one end and add to the other, so the same data values will still correspond to the same positions. If you don't want to change the data while moving the array, move(distance) is like shift(distance) but keeps the data intact.

If you want to add specific data while changing the coordinates, the methods are analogous to collections.deque: append(value), extend(values), and pop() operate on the right end and appendleft(value), extendleft(values), and popleft(values) operate on the left end. There is even rotate(steps = 1), which moves the array while simply rotating the contents of the data, so the values from the end that's being shortened are added to the end being lengthened. reverse() reverses the order of the data. intersection(other) returns only the data values from the intersection region.

Because the data value is an iterable, you can use count(value) directly on the array to count the number of positions whose data is value.

genomerator's People

Contributors

Stargazers

Watchers

Forkers

lewisamarshall

genomerator's Issues

maybe GenomeFeature should have a simple function to check intersection

simply test whether two features overlap, not return the overlap

can the filter function in FeatureStream be simplified?

Image Sequence may have a good analogy

should FeatureStream be a subclass of GenomeFeature?

It might be useful to compare a feature with a stream, or compare two streams to see if one of them is getting ahead of another.

no BedgraphStream yet

title

oriented shifting

Maybe GenomeFeature should have analogs to the shift_left and shift_right methods whose direction depends on the strand.

deep GFF/GTF parsing

It might be nice if GFF/GTF files could be parsed so that each set of children is grouped with its parent (maybe as a list inside self.data).

ProgressBar depends on click

The new ProgressBar class is a wrapper for click.progressbar, but this creates an undesirable dependency on click. Ideally click.progressbar would just work better so this extra wrapper isn't necessary, otherwise is there a workaround to let the user create a click.progressbar instance and pass it to this module elegantly?

GenomeBuffer may be ill-conceived

Might want a version for pushing into and a version for pulling out of? Think more about use cases

BED parser doesn't support user-defined fields

title

type hints

All functions should have type hints if possible. This would allow troubleshooting with mypy rather than assertions, which are proving computationally expensive.

Wiggler should have an option for fixedStep output

title

GenomeFeature.getitem

Maybe GenomeFeature should have a __getitem__ method that returns a new instance containing the ith position in a range (or a slice!). This would be overridden in e.g. GenomePositionDeque.

is OrderedDict the right container for FeatureStream._reference_lookup?

title

GenomePositionDeque should have slice functions

even though a normal deque doesn't have those

also, maybe a deque is the wrong name since it's more generic

parse read names

It might be useful to extract the metadata (e.g. run name, lane, tile coords) from read names in regular formats like Illumina's (and other platforms with lower priority). This would waste time on text-parsing if you're not using so it should be optional, and the question is whether it should be on by default.

merge sorted streams

There should be a simple generator that yields sorted features by merging the outputs of other generators that are already sorted, which is potentially much more efficient than naively sorting it all from scratch.

should base classes have alternative methods instead of subclasses?

E.g. FeatureStream has subclasses like SamStream, VariantStream; maybe there should just be one big class?

GenomeFeature subclasses for parsed data

Consider subclassing GenomeFeature to add attributes for the fields in parsed input data. Or make their data a dictionary of field names instead of a list of unparsed strings. But we probably don't want to completely parse every feature every time, for performance, so this shouldn't be default behavior.

check GenomeFeature.intersection for bugs

title

str

GenomeFeature could use a nicely formatted __str__() method because the output of __repr__() is barely human-readable, especially when data isn't empty. Perhaps a function to export to JSON might be nice too.

subclass namedtuple?

It may be possible to make GenomeFeature a subclass of collections.namedtuple. This might improve performance, at the cost of making the attributes immutable. Some specific problems I anticipate:

substantial refactoring required
the methods that change an instance in place would have to be removed, and only the methods that return a new instance would work (but maybe this category is more likely to be needed in most use cases, and if the constructor is much faster it could be a net gain)
self.data should be okay when it's a reference to a container, as the reference will be immutable but the container doesn't have to be; but if self.data is e.g. an int or float, it will be become immutable; this might be addressed by hiding a reference in e.g. self._data and then making self.data a property that dereferences it, but it would also need methods for modifying through the reference, and if it works for data why not use the same trick for coordinates?
some subclasses like GenomeArray might make more sense being mutable, as they'll be instantiated only once and then modified base-by-base to traverse the genome; perhaps it's possible to make them a mutable subclass of an immutable base, or perhaps there could be a third base class that only contains the methods but no data (so GenomeFeature and GenomeArray each inherit separately), or perhaps even those subclasses can work by returning new instances instead of mutating too

GenomeFeature comparisons could be faster

Many of the comparison methods in GenomeFeature (e.g. GenomeFeature.left_of, GenomeFeature.right_of) are implemented using other methods that create new instances (e.g. GenomeFeature.left) for coding convenience, but this may slow things down. It is possible to code these with only direct numerical comparisons on the coordinates of the existing instances, and that would be faster because it would avoid wasting time in GenomeFeature.__init__.

FastaStream could be designed to return overlapping chunks

title

self-reference in classes

e.g. GenomeFeature.left:

return GenomeFeature(reference_id = self.reference_id, left_pos = self.left_pos, is_reverse = self.is_reverse, data = self.data)

This should probably be self.__class__ instead of GenomeFeature but then it won't work in subclasses.

file format library

Should functions for reading and writing specific data formats be in their own library/libraries? E.g. BedStream but also GenomeFeature.bed.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.