GithubHelp home page GithubHelp logo

hsianktin / genomicannotations.jl Goto Github PK

View Code? Open in Web Editor NEW

This project forked from biojulia/genomicannotations.jl

0.0 0.0 0.0 4.51 MB

Fork to fix some bugs.

License: MIT License

Julia 100.00%

genomicannotations.jl's Introduction

GenomicAnnotations.jl

Description

GenomicAnnotations is a package for reading, modifying, and writing genomic annotations in the GenBank and GFF3 file formats.

Installation

julia>]
pkg> add GenomicAnnotations

Usage

GenBank and GFF3 files are read with readgbk(input) and readgff(input), which return vectors of Records. input can be an IOStream or a file path. GZipped data is unzipped automatically if a filename ending in ".gz" is passed as input. If we're only interested in the first chromosome in example.gbk we only need to store the first element.

chr = readgbk("test/example.gbk")[1]

Records have five fields, name, header, genes, genedata, and sequence. The name is read from the header, which is stored as a string. The annotation data is stored in genedata, but generally you should use genes to access that data. For example, it can be used to iterate over annotations, and to modify them.

for gene in chr.genes
    gene.locus_tag = "$(chr.name)_$(gene.locus_tag)"
end

chr.genes[2].locus_tag = "test123"

Accessing properties that haven't been stored will return missing. For this reason, it often makes more sense to use get() than to access the property directly.

# chr.genes[2].pseudo returns missing, so this will throw an error
if chr.genes[2].pseudo
    println("Gene 2 is a pseudogene")
end

# ... but this works:
if get(chr.genes[2], :pseudo, false)
    println("Gene 2 is a pseudogene")
end

The macro @genes can be used to filter through the annotations. The macro takes a Record or a Vector{Record}, followed by any number of expressions that will be evaluated for each gene. The keyword gene is used to refer to the individual Genes. @genes can also be used to modify annotations. Gene attributes can be referred to using Symbols.

@genes(chr, feature(gene) == "CDS")  # Returns all coding regions
@genes(chr, length(gene) > 300) # Returns all features longer than 300 nt
@genes(chr, iscomplement(gene)) # Returns all features on the complement strand
@genes(chr, ismissing(:product)) # Returns all features for which the attribute "product" has not been set

# Some short-hand forms are available to make life easier:
#     `CDS` expands to `feature(gene) == "CDS"`, and
#     `get(s::Symbol, default)` expands to `get(gene, s, default)`
# The following two are thus equivalent:
@genes(chr, feature(gene) == "CDS", occursin("glycoprotein", get(gene, :product, "")))
@genes(chr,                   CDS,  occursin("glycoprotein", get(      :product, "")))

# All arguments have to evaluate to `true` for a gene to be included, so the following expressions are equivalent:
@genes(chr, feature(gene) == "CDS", length(gene) > 300)
@genes(chr, (feature(gene) == "CDS") && (length(gene) > 300))

# `@genes` returns a `Vector{Gene}`. Attributes can be accessed with dot-syntax, and can be assigned to:
@genes(chr, :locus_tag == "tag03")[1].pseudo = true
@genes(chr, CDS, ismissing(:gene)).gene .= "unknown"

Gene sequences can be accessed with sequence(gene). For example, the following code will write the translated sequences of all protein-coding genes to a file:

using BioSequences
using FASTX
open(FASTA.Writer, "proteins.fasta") do w
    for gene in @genes(chr, CDS)
        aaseq = GenomicAnnotations.sequence(gene; translate = true)
        write(w, FASTA.Record(gene.locus_tag, get(:product, ""), aaseq))
    end
end

Genes can be added using addgene!, and sort! can be used to make sure that the resulting annotations are in the correct order for printing. delete! is used to remove genes.

newgene = addgene!(chr, "regulatory", 670:677)
newgene.locus_tag = "reg02"
sort!(chr.genes)

# Genes can be deleted. This works for all genes where `:pseudo` is `true`, and ignores genes where it is `false` or `missing`
delete!(@genes(chr, :pseudo))
# Delete all genes 60 nt or shorter
delete!(@genes(chr, length(gene) <= 60))

Individual genes, and Vector{Gene}s are printed in GBK format. To include the GBK header and the nucleotide sequence, printgbk(io, chr) can be used to write them to a file. printgff(io, chr) prints the annotations as GFF3, in which case the GenBank header is lost.

println(chr.genes[1])
println(@genes(chr, CDS))

open(GenBank.Writer, "updated.gbk") do w
    write(w, chr)
end

genomicannotations.jl's People

Contributors

kdyrhage avatar juliatagbot avatar hsianktin avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.