ensembllite's Introduction

cogent3 is a mature Python library for the analysis of genomic sequence data. We endeavour to provide a first-class experience within Jupyter notebooks, but the algorithms also support parallel execution on compute systems with thousands of processors.

Who is it for?

Anyone who wants to analyse sequence divergence using robust statistical models

cogent3 is unique in providing numerous non-stationary Markov models for modelling sequence evolution, including codon models. cogent3 also includes an extensive collection of time-reversible models (again including novel codon models). We have done more than just invent these new methods; we have established the most robust algorithms for their implementation and their suitability for real data. Additionally, there are novel signal processing methods focussed on statistical estimation of integer period signals.

🎬 Demo non-reversible substitution model
cogent3-demo-composable.mp4

Anyone who wants to undertake exploratory genomic data analysis

Beyond our novel methods, cogent3 provides an extensive suite of capabilities for manipulating and analysing sequence data. You can manipulate sequences by their annotations, e.g.

🎬 Demo sequences with annotations
cogent3-demo-new-ann.mp4

Plus, you can read standard tabular and biological data formats, perform multiple sequence alignment using any cogent3 substitution model, reconstruct and manipulate phylogenetic trees, manipulate tabular data, visualise phylogenies and much more.

Beginner friendly approach to genomic data analysis

Our cogent3.app module provides a very different approach to using the library capabilities. Expertise in structural programming concepts is not essential!

🎬 Demo friendly coding
cogent3-demo-composable.mp4

Installation?

$ pip install cogent3

Install extra -- adds visualisation support

The extra group includes python libraries required for visualisation, i.e. plotly, kaleido, psutil and pandas.

$ pip install "cogent3[extra]"

Install dev -- adds cogent3 development related libraries

The dev group includes python libraries required for development of cogent3.

$ pip install "cogent3[dev]"

Install the development version

$ pip install git+https://github.com/cogent3/cogent3.git@develop#egg=cogent3

Project Information

cogent3 is released under the BSD-3 license. Documentation is at cogent3.org, while the cogent3 code is on GitHub. If you would like to contribute (and we hope you do!), we have created a companion c3dev GitHub repo which provides details on how to contribute and some useful tools for doing so.

Project History

cogent3 is a descendant of PyCogent. While there is much in common with PyCogent, the amount of change has been substantial, motivating the name change to cogent3. This name has been chosen because cogent was always the import name (dating back to PyEvolve in 2004) and it's Python 3 only.

Given this history, we are grateful to the multitude of individuals who have made contributions over the years. Many of these contributors were also co-authors on the original PyEvolve and PyCogent publications. Individual contributions can be seen by using "view git blame" on individual lines of code on GitHub, through git log in the terminal, and more recently the changelog.

Compared to PyCogent version 1.9, there has been a massive number of changes. These include integration of many of the new developments on algorithms and modelling published by the Huttley lab over the last decade. We have also modernised our dependencies. For example, we now use plotly for visualisation, tqdm for progress bar display, concurrent.futures and mpi4py.futures for parallel process execution, and nox and pytest for unit testing.

Funding

Cogent3 has received funding support from the Australian National University and an Essential Open Source Software for Science Grant from the Chan Zuckerberg Initiative.

         

ensembllite's Issues

install fails on some systems

Running this command

elt install -d ~/someuser/whole_genome_mammal87/ensembl_download

Failed with the following exception

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/cli.py", line 277, in install
    local_install_genomes(
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/_install.py", line 115, in local_install_genomes
    for _ in PAR.as_completed(
  File "/home/someuser/env/lib/python3.11/site-packages/cogent3/util/parallel.py", line 239, in as_completed
    yield from _as_completed_mproc(f, s, max_workers)
  File "/home/someuser/env/lib/python3.11/site-packages/cogent3/util/parallel.py", line 227, in _as_completed_mproc
    yield result.result()
          ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

I now think this is because the current implementation always uses concurrent futures, even if running on a single process. This can be fixed by providing a simple mechanism to use the builtin map() function in those cases.

download issue

The homology data are downloaded and installed for every species except 'ovis aries', which keeps raising this error when I try to download its homology data

ftplib.error_perm: 550 Failed to change directory.

Ensembl site-map object

The locations of data types can differ between ensembl and ensemblgenomes, for example the locations of homology data and of whole genome alignments. For the homology data:

Taxa         Site                    Path
Bacteria     ftp.ensemblgenomes.org  pub/release-57/pan_ensembl/tsv/ensembl-compara/homologies
Vertebrates  ftp.ensembl.org         pub/release-110/tsv/ensembl-compara/homologies

To simplify dealing with this, we need a federation object that provides a unified interface for code. For example, define a site-map class with an attribute for each data type.

class EnsemblSiteMap:
    @property
    def homology_path(self) -> str:
        ...

    @property
    def alignment_path(self) -> str:
        ...

    @property
    def genome_path(self) -> str:
        ...

    @property
    def annotation_path(self) -> str:
        ...

Use a separate instance for each of the different Ensembl sites, with each instance made on demand.
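One way the "instance made on demand" idea could look is sketched below. This is only an illustration: the names SiteMap, get_site_map and _HOMOLOGY_TEMPLATES are hypothetical, only the homology paths come from the table above, and the sketch assumes the Bacteria row's path layout applies to the whole ensemblgenomes site.

```python
from dataclasses import dataclass
from functools import lru_cache

# Path templates taken from the homology table above; applying the
# Bacteria layout to all of ftp.ensemblgenomes.org is an assumption.
_HOMOLOGY_TEMPLATES = {
    "ftp.ensembl.org": "pub/release-{release}/tsv/ensembl-compara/homologies",
    "ftp.ensemblgenomes.org": (
        "pub/release-{release}/pan_ensembl/tsv/ensembl-compara/homologies"
    ),
}


@dataclass(frozen=True)
class SiteMap:
    """Hypothetical per-site object exposing data locations as attributes."""

    site: str
    release: int

    @property
    def homology_path(self) -> str:
        return _HOMOLOGY_TEMPLATES[self.site].format(release=self.release)


@lru_cache(maxsize=None)
def get_site_map(site: str, release: int) -> SiteMap:
    """Create the site map on first request and reuse it afterwards."""
    if site not in _HOMOLOGY_TEMPLATES:
        raise ValueError(f"unknown Ensembl site: {site!r}")
    return SiteMap(site=site, release=release)
```

Calling code would then ask for get_site_map("ftp.ensembl.org", 110).homology_path without needing to know which site layout is in play.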

implement reinflating a stored alignment block

EnsemblLite saves the alignments from a maf formatted file in a sqlite db table with the following columns

column      type
source      TEXT
block_id    INT
species     TEXT
coord_name  TEXT
start       INTEGER
end         INTEGER
strand      TEXT
gap_spans   compressed_array
  • block_id corresponds to an alignment segment within the maf file
  • gap_spans is a 2D numpy array in which each row holds the location of a gap and its length.

To get an alignment

We need:

  • all the records for a given block_id
  • the genome sequence corresponding to that coordinate

We use the functions within ensembl_lite.convert to create the individual Aligned instances and from that the Alignment instance.
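As a rough illustration of what reinflating involves, the sketch below rebuilds a gapped sequence from an ungapped one plus its gap spans. It does not use ensembl_lite.convert; the function name reinflate is hypothetical, and it assumes each gap position is expressed in ungapped sequence coordinates (the gap is inserted immediately before that sequence index) and that rows are sorted by position.

```python
def reinflate(seq: str, gap_spans, gap_char: str = "-") -> str:
    """Insert gaps into an ungapped sequence.

    gap_spans is any iterable of (position, length) pairs, for example
    the rows of a 2D numpy array. Positions are assumed to be indices
    into the ungapped sequence, in ascending order.
    """
    parts = []
    prev = 0
    for pos, length in gap_spans:
        pos, length = int(pos), int(length)
        parts.append(seq[prev:pos])  # sequence up to the gap
        parts.append(gap_char * length)  # the gap itself
        prev = pos
    parts.append(seq[prev:])  # remainder after the last gap
    return "".join(parts)


print(reinflate("ACGT", [(1, 2), (3, 1)]))  # A--CG-T
```

Doing this for every record sharing a block_id yields rows of equal length that can be assembled into an alignment.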

ENH: ability to get summary statistics for Ensembl hosted genomes

Provide summary statistics to help a user to decide what species may be suitable for their research. Some of the data is in, for example, http://ftp.ensembl.org/pub/current/uniprot_report_EnsemblVertebrates.txt and other such top-level files.
Desirable data:

  • species tree
  • nucleotide counts (better than GC%)
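To illustrate why nucleotide counts are the more useful statistic, the sketch below computes counts for a sequence and then derives GC% from them (the reverse is not possible). The function names are hypothetical and the input is a plain string rather than an Ensembl download.

```python
from collections import Counter


def nucleotide_counts(seq: str) -> dict:
    """Count A, C, G and T in seq, ignoring case and other characters."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_percent(counts: dict) -> float:
    """Derive GC% from nucleotide counts."""
    total = sum(counts.values())
    return 100 * (counts["G"] + counts["C"]) / total if total else 0.0


counts = nucleotide_counts("acgtGGN")
print(counts, gc_percent(counts))
```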

when num CPU is 1, don't use multiprocess

this is causing problems in Docker containers with limited resources.

The solution is to write a function that, when num CPUs=1, returns a map(func, series) instead of the cogent3.util.parallel.as_completed(func, series)
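A minimal sketch of such a function is shown below. The name run_tasks is hypothetical, and the parallel branch uses the standard library's concurrent.futures as a stand-in for cogent3.util.parallel.as_completed so that the example is self-contained.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def run_tasks(func, series, max_workers=1):
    """Yield func(item) for each item in series.

    With a single worker the builtin map() is used, so no process pool
    is created. Otherwise results are yielded as they complete, which
    means their order is not guaranteed.
    """
    if max_workers <= 1:
        yield from map(func, series)
        return

    # func must be picklable (e.g. defined at module level) for this branch
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(func, item) for item in series]
        for future in as_completed(futures):
            yield future.result()


print(list(run_tasks(abs, [-1, 2, -3])))  # [1, 2, 3]
```

Because both branches are generators, calling code can loop over the result in the same way regardless of the number of CPUs.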

download.cfg paths should be relative

they are currently absolute which means if the directory is removed, the connections are broken

for example, instead of

/Users/someone/working/apes_111/download_111

make it

apes_111/download_111
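The conversion could be sketched as below. The function names are hypothetical, and the base directory (here /Users/someone/working) is an assumption about what the paths would be made relative to. PurePosixPath is used only to keep the example platform independent.

```python
from pathlib import PurePosixPath


def to_relative(path: str, base: str) -> str:
    """Express path relative to base; raises ValueError if not under base."""
    return str(PurePosixPath(path).relative_to(base))


def to_absolute(relative: str, base: str) -> str:
    """Resolve a stored relative path against the current base directory."""
    return str(PurePosixPath(base) / relative)


rel = to_relative("/Users/someone/working/apes_111/download_111",
                  "/Users/someone/working")
print(rel)  # apes_111/download_111
```

Storing the relative form means the base directory can change without breaking the stored path.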

EnsemblLite roadmap sketch

Minimal usability

Note
This means the tool can be installed and used to download and install data from ensembl.org necessary to perform the selected set of queries defined below. These queries will produce standard formats. This variant will have a lot of errors!

  • terminal interface for controlling primary functions
  • given a config, can get sequences, gff3, and alignments from Ensembl.org
  • above is transformed to a "local" install via sqlite databases
  • dump-genes command generates a tsv formatted file for a species
  • export one-to-one homologs as fasta
  • display installed data in the terminal
  • dump alignment segments corresponding to a user provided input list of genomic regions / genes
  • mask segments of exported alignments for selected feature types (e.g. exons)

Track progress on the issues page

Robust usability

Note
Same feature set as above, but now the tool has a test coverage of ~90% and internals have been refactored for clarity. We will call for testers at this point. This variant will have some errors!

  • supported annotations extended to include repeats, CpG islands (see #65)
  • ~90% of all code lines are evaluated with tests
  • homolog querying expanded to different relationship types (e.g. ortholog one-to-many, paralogs)
  • scitrack logging will be added
  • minimal documentation

Track progress on the issues page

Selecting a slice of an alignment

To get an alignment, we need the block_id, and to get this we need a species identifier and genome coordinate. (Let's call this ref_coord.) AlignDb.get_records_matching() does this.

With the block_id we can then get all the coordinates for other species in
the block. (NOTE: if we are going to support strict, i.e. only blocks
where all nominated species are present, we would filter at this level).

The gap_spans returned for a block will exceed the coordinates of the selected region. These will need to be trimmed to just the region so that only the desired sequences are selected. So we need to implement a container class that handles gap positions; let's call this GapPositions. From the ref_coord, we use GapPositions to convert the sequence indices into alignment indices

align_start = <ref gap pos>.from_seq_to_align_index(ref_start)
align_end = align_start + ref_end - ref_start

For all other records, we then convert the alignment indices back into sequence indices, which gives us the coordinates for that species.

other_start = <other species gap pos>.from_align_to_seq_index(align_start)
other_end = <other species gap pos>.from_align_to_seq_index(align_end)

These values can then be used to obtain the genome sequence, followed by a subset of the annotation data, and combined into an Alignment instance.
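A possible shape for GapPositions is sketched below. This is not the EnsemblLite implementation. It assumes gaps are stored as (sequence position, gap length) pairs sorted by position, with each gap inserted immediately before that sequence index, and that an alignment index falling inside a gap maps to the next sequence position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapPositions:
    """Hypothetical container for gaps as (seq_pos, length) pairs."""

    gaps: tuple = ()

    def from_seq_to_align_index(self, seq_index: int) -> int:
        # every gap at or before seq_index shifts it right
        offset = sum(length for pos, length in self.gaps if pos <= seq_index)
        return seq_index + offset

    def from_align_to_seq_index(self, align_index: int) -> int:
        total = 0  # cumulative gap length seen so far
        for pos, length in self.gaps:
            gap_start = pos + total  # gap start in alignment coordinates
            if align_index < gap_start:
                break
            if align_index < gap_start + length:
                return pos  # inside a gap: next sequence position
            total += length
        return align_index - total


# reference sequence ACGT aligned as A--CG-T
ref = GapPositions(gaps=((1, 2), (3, 1)))
ref_start, ref_end = 1, 3
align_start = ref.from_seq_to_align_index(ref_start)
align_end = align_start + ref_end - ref_start

other = GapPositions()  # another species with no gaps in this block
other_start = other.from_align_to_seq_index(align_start)
other_end = other.from_align_to_seq_index(align_end)
```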

Restarting install triggers SQL error

While the solution is to use the -f (force overwrite) flag, the error message is not helpful.

File "/home/someuser/env/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/cli.py", line 277, in install
    local_install_genomes(
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/_install.py", line 129, in local_install_genomes
    db.add_compressed_records(records=records)
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/_genomedb.py", line 108, in add_compressed_records
    self.db.executemany(sql, [(n, s, len(s)) for n, s in records])
sqlite3.IntegrityError: UNIQUE constraint failed: genome.seqid
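One way to make the message more helpful is to catch the sqlite3.IntegrityError and re-raise it with advice, as sketched below. The table layout (other than the seqid column named in the traceback), the function names and the force argument are assumptions made for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the genome table (layout assumed)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE genome (seqid TEXT UNIQUE, seq TEXT, length INT)")
    return db


def add_records(db: sqlite3.Connection, records, force: bool = False) -> None:
    """Insert (seqid, seq) records, explaining what to do on a clash."""
    if force:
        db.execute("DELETE FROM genome")  # overwrite existing data
    sql = "INSERT INTO genome(seqid, seq, length) VALUES (?, ?, ?)"
    try:
        db.executemany(sql, [(n, s, len(s)) for n, s in records])
    except sqlite3.IntegrityError as err:
        raise RuntimeError(
            "genome data already installed; rerun with -f to overwrite"
        ) from err
```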

Exporting homolog groups issue

To export homolog genes:
elt homologs -i <path to the installation directory> -o OUTPATH --limit 100

When setting the limit to 100 it raises the error:

IndexError: list index out of range

Full traceback:

message="Traceback (most recent call last):
  File "/Users/gulugulu/opt/anaconda3/envs/c311/lib/python3.11/site-packages/cogent3/app/composable.py", line 366, in _call
    result = self.main(val, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gulugulu/opt/anaconda3/envs/c311/lib/python3.11/site-packages/cogent3/app/composable.py", line 429, in _main
    return self._user_func(**bound.arguments)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gulugulu/EnsemblLite_folk/src/ensembl_lite/_genomedb.py", line 286, in get_selected_seqs
    return list(get_seqs_for_ids(config=config, species=species, names=gene_ids))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gulugulu/EnsemblLite_folk/src/ensembl_lite/_genomedb.py", line 247, in get_seqs_for_ids
    feature = list(genome.get_features(name=f"%{name}"))[0]
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
")

Define a download cfg file

The ensembl ftp link provides access to many different formats. Among these are multiple sequence alignments.

The full description of the resources is in the README

The structure of the .cfg file needs to support selecting the types of data for each genome:

  • seq=fasta, we will grab the entire genome sequences (what sub directory? dna?)
  • features=gff3 (or json?)

For compara, we need the alignments for the selected species. This could be derived automatically by parsing the READMEs (e.g. this one), but that will be fragile. Better to require hard-coding of the alignment directory name to start with (e.g. *primates).

The homology information is here.
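As a starting point for discussion, the sketch below shows one possible layout for such a .cfg file and how it could be read with the standard library's configparser. Every section and option name here (remote, compara, host, release, align_names) is hypothetical; only the seq=fasta, features=gff3 and *primates values come from the notes above.

```python
import configparser

# A hypothetical layout: one section per species plus two reserved sections.
EXAMPLE_CFG = """
[remote]
host = ftp.ensembl.org
release = 110

[homo_sapiens]
seq = fasta
features = gff3

[compara]
align_names = *primates
"""

RESERVED = {"remote", "compara"}


def parse_cfg(text: str) -> dict:
    """Read the download settings from cfg formatted text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    species = {
        name: dict(parser[name])
        for name in parser.sections()
        if name not in RESERVED
    }
    return {
        "host": parser.get("remote", "host"),
        "release": parser.getint("remote", "release"),
        "species": species,
        "align_names": parser.get("compara", "align_names", fallback=""),
    }


print(parse_cfg(EXAMPLE_CFG))
```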

Usage on memory limited hardware

When trying to install vertebrate genomes, multiple users have run into what appear to be memory limits when running on RAM limited hardware (including my lab server which has 16GB RAM).

We need a more efficient data storage for sequences that will allow loading into memory only what we need.

Possible choices are HDF5 or Arne's custom backend.

Initial release on pypi

  • take a look at current readme
  • is workflow described?
  • basic usage examples?
  • setup token for shared pypi commit

including repeats in annotation data for querying

Minimal data for repeat information is:

  • genomic coordinates (coordinate name, start, end, strand)
  • repeat classification information, including repeat type, repeat class, score (?)

What is the minimum data needed to download that will provide this? (That is, what are the minimum files required.)

Conceptually separate the download from the install step. The former grabs data in a sufficiently complete form that the install step does not require internet access.

We then need to transform this to be compatible with the querying approach for gene features, perhaps a single table.
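The single-table idea could look something like the sketch below, using the minimal fields listed above. The table name, column names (stop is used instead of end to avoid the SQL keyword) and the example rows are illustrative assumptions, not the EnsemblLite schema.

```python
import sqlite3


def make_repeat_db() -> sqlite3.Connection:
    """Create an in-memory table holding the minimal repeat data."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE repeat_feature ("
        "coord_name TEXT, start INTEGER, stop INTEGER, strand TEXT, "
        "repeat_type TEXT, repeat_class TEXT, score REAL)"
    )
    return db


def repeats_overlapping(db, coord_name: str, start: int, stop: int) -> list:
    """Return repeats on coord_name that overlap the span [start, stop)."""
    sql = (
        "SELECT start, stop, repeat_type, repeat_class FROM repeat_feature "
        "WHERE coord_name = ? AND start < ? AND stop > ? ORDER BY start"
    )
    return db.execute(sql, (coord_name, stop, start)).fetchall()


db = make_repeat_db()
db.executemany(
    "INSERT INTO repeat_feature VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("1", 100, 200, "+", "Alu", "SINE", 10.0),  # made-up example rows
        ("1", 500, 600, "-", "L1", "LINE", 20.0),
    ],
)
print(repeats_overlapping(db, "1", 150, 300))
```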
