cogent3 / ensembllite

This project forked from cogent3/ensembldb3

A new approach to obtaining local copies of ensembl data

ensembllite's Introduction

cogent3 is a mature Python library for analysis of genomic sequence data. We endeavour to provide a first-class experience within Jupyter notebooks, but the algorithms also support parallel execution on compute systems with thousands of processors.

Who is it for?

Anyone who wants to analyse sequence divergence using robust statistical models

cogent3 is unique in providing numerous non-stationary Markov models for modelling sequence evolution, including codon models. cogent3 also includes an extensive collection of time-reversible models (again including novel codon models). We have done more than just invent these new methods: we have established the most robust algorithms for their implementation and their suitability for real data. Additionally, there are novel signal processing methods focussed on statistical estimation of integer period signals.

🎬 Demo non-reversible substitution model

cogent3-demo-composable.mp4

Anyone who wants to undertake exploratory genomic data analysis

Beyond our novel methods, cogent3 provides an extensive suite of capabilities for manipulating and analysing sequence data. You can manipulate sequences by their annotations, e.g.

🎬 Demo sequences with annotations

cogent3-demo-new-ann.mp4

Plus, you can read standard tabular and biological data formats, perform multiple sequence alignment using any cogent3 substitution model, reconstruct phylogenies, manipulate trees and tabular data, visualise phylogenies and much more.

Beginner friendly approach to genomic data analysis

Our cogent3.app module provides a very different approach to using the library capabilities. Expertise in structural programming concepts is not essential!

🎬 Demo friendly coding

cogent3-demo-composable.mp4

Installation?

$ pip install cogent3

Install `extra` -- adds visualisation support

The extra group includes Python libraries required for visualisation, i.e. plotly, kaleido, psutil and pandas.

$ pip install "cogent3[extra]"

Install `dev` -- adds `cogent3` development related libraries

The dev group includes Python libraries required for development of cogent3.

$ pip install "cogent3[dev]"

Install the development version

$ pip install git+https://github.com/cogent3/cogent3.git@develop#egg=cogent3

Project Information

cogent3 is released under the BSD-3 license, documentation is at cogent3.org, while cogent3 code is on GitHub. If you would like to contribute (and we hope you do!), we have created a companion c3dev GitHub repo which provides details on how to contribute and some useful tools for doing so.

Project History

cogent3 is a descendant of PyCogent. While there is much in common with PyCogent, the amount of change has been substantial, motivating the name change to cogent3. This name has been chosen because cogent was always the import name (dating back to PyEvolve in 2004) and it's Python 3 only.

Given this history, we are grateful to the multitude of individuals who have made contributions over the years. Many of these contributors were also co-authors on the original PyEvolve and PyCogent publications. Individual contributions can be seen by using "view git blame" on individual lines of code on GitHub, through git log in the terminal, and more recently the changelog.

Compared to PyCogent version 1.9, there has been a massive number of changes. These include integration of many of the new developments on algorithms and modelling published by the Huttley lab over the last decade. We have also modernised our dependencies. For example, we now use plotly for visualisation, tqdm for progress bar display, concurrent.futures and mpi4py.futures for parallel process execution, and nox and pytest for unit testing.

Funding

Cogent3 has received funding support from the Australian National University and an Essential Open Source Software for Science Grant from the Chan Zuckerberg Initiative.

ensembllite's People

Forkers

gavinhuttley ebiarnie firimp kylesu12 yutong-shao sharonli126 yantonglu changsenjiang ruizhe-wang zongjing-han gaopeizhong weilinwu97 shunyiyang

ensembllite's Issues

DEV: decide on the project name

Needs to be discussed with Ensembl folk, cannot use their brand without consent!

install fails on some systems

Running this command

elt install -d ~/someuser/whole_genome_mammal87/ensembl_download

Failed with the following exception

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/cli.py", line 277, in install
    local_install_genomes(
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/_install.py", line 115, in local_install_genomes
    for _ in PAR.as_completed(
  File "/home/someuser/env/lib/python3.11/site-packages/cogent3/util/parallel.py", line 239, in as_completed
    yield from _as_completed_mproc(f, s, max_workers)
  File "/home/someuser/env/lib/python3.11/site-packages/cogent3/util/parallel.py", line 227, in _as_completed_mproc
    yield result.result()
          ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

I now think this is because the current implementation always uses concurrent.futures, even when running on a single process. This can be fixed by providing a simple mechanism to use the builtin map() function in those cases.

ENH: make download experience more robust

deal with flaky internet and / or sleeping laptop and reconnecting on another network
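
One possible starting point is to wrap each network operation in a retry loop with exponential backoff. The sketch below is only an illustration: the function name with_retries and its parameters are hypothetical and not part of EnsemblLite, and it assumes that dropped connections surface as OSError or EOFError.

```python
import time


def with_retries(action, attempts=5, initial_delay=1.0,
                 exceptions=(OSError, EOFError), sleep=time.sleep):
    """Call action() until it succeeds or attempts are exhausted.

    Hypothetical helper: the wait doubles after every failure, which gives
    a sleeping laptop or a changed network some time to reconnect.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # give up and let the caller see the last error
            sleep(delay)
            delay *= 2


# A stand-in for a download that fails twice before succeeding
state = {"calls": 0}


def flaky_download():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionResetError("connection dropped")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = with_retries(flaky_download, sleep=waits.append)
```

A real download would also need to resume partially transferred files, which this sketch does not attempt.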

download issue

The homology data are downloaded and installed, except for 'ovis aries', which keeps raising this error when I try to download its homology data:

ftplib.error_perm: 550 Failed to change directory.

Ensembl site-map object

The locations of data types can differ between ensembl and ensemblgenomes, for example the locations of homology data and of whole genome alignments:

Taxa	Site	Path
Bacteria	ftp.ensemblgenomes.org	pub/release-57/pan_ensembl/tsv/ensembl-compara/homologies
Vertebrates	ftp.ensembl.org	pub/release-110/tsv/ensembl-compara/homologies

To simplify dealing with this, we need a federation object that provides a unified interface for code. For example, define a site-map class with attributes for each data type.

class EnsemblSiteMap:
    @property
    def homology_path(self) -> str:
        ...

    @property
    def alignment_path(self) -> str:
        ...

    @property
    def genome_path(self) -> str:
        ...

    @property
    def annotation_path(self) -> str:
        ...

Separate instances for each of the different Ensembl sites. Have an instance made on demand.
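
A minimal sketch of "one instance per site, made on demand" could look like the following. It covers only the homology path and assumes that, within a site, the path differs only by release number, as in the two table rows above. The get_site_map function and the template dictionary are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from functools import lru_cache

# Path templates taken from the two rows of the table above; the assumption
# that only the release number varies within a site is ours.
_HOMOLOGY_TEMPLATES = {
    "ftp.ensembl.org": "pub/release-{release}/tsv/ensembl-compara/homologies",
    "ftp.ensemblgenomes.org": (
        "pub/release-{release}/pan_ensembl/tsv/ensembl-compara/homologies"
    ),
}


@dataclass(frozen=True)
class EnsemblSiteMap:
    """Locations of data types on one Ensembl FTP site (illustrative only)."""

    site: str
    release: int

    @property
    def homology_path(self) -> str:
        return _HOMOLOGY_TEMPLATES[self.site].format(release=self.release)


@lru_cache(maxsize=None)
def get_site_map(site: str, release: int) -> EnsemblSiteMap:
    """Create the instance for a site on first request, then reuse it."""
    if site not in _HOMOLOGY_TEMPLATES:
        raise ValueError(f"unknown Ensembl site: {site!r}")
    return EnsemblSiteMap(site=site, release=release)


vertebrates = get_site_map("ftp.ensembl.org", 110)
bacteria = get_site_map("ftp.ensemblgenomes.org", 57)
```

Calling code then asks for vertebrates.homology_path or bacteria.homology_path without knowing which site layout applies.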

implement reinflating a stored alignment block

EnsemblLite saves a MAF-formatted alignment file in a sqlite db with the following columns

column	type
source	TEXT
block_id	INT
species	TEXT
coord_name	TEXT
start	INTEGER
end	INTEGER
strand	TEXT
gap_spans	compressed_array

block_id corresponds to an alignment segment within the maf file
gap_spans is a 2D numpy array consisting of the location of each gap and its length.

To get an alignment

We need:

all the records for a given block_id
the genome sequence corresponding to that coordinate

We use the functions within ensembl_lite.convert to create the individual Aligned instances and from that the Alignment instance.
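
To illustrate the idea of reinflating, the sketch below rebuilds a gapped sequence from an ungapped sequence and its gap spans. It is not the ensembl_lite.convert implementation. It assumes each gap location is an index into the ungapped sequence before which the gap is inserted, and that rows are sorted by location. The function name reinflate is hypothetical, and plain lists stand in for the numpy array.

```python
def reinflate(seq, gap_spans, gap_char="-"):
    """Insert gaps back into an ungapped sequence.

    Assumes each row of gap_spans is (position, length), where position is
    an index into the ungapped sequence before which the gap run goes, and
    rows are sorted by position.
    """
    parts = []
    prev = 0
    for pos, length in gap_spans:
        pos, length = int(pos), int(length)
        parts.append(seq[prev:pos])  # sequence between consecutive gaps
        parts.append(gap_char * length)
        prev = pos
    parts.append(seq[prev:])  # whatever follows the last gap
    return "".join(parts)


# Leading gap of 2, a single gap before index 3, trailing gap of 3
gapped = reinflate("ACGTA", [(0, 2), (3, 1), (5, 3)])
```

The same loop works unchanged if gap_spans is a 2D numpy array, since iterating over it yields one (position, length) row at a time.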

create a docker container for installing EnsemblLite

based on the cogent3 workshop container, do a direct git clone from the https link and a developer install so that the code is accessible for playing with

ENH: ability to get summary statistics for Ensembl hosted genomes

Provide summary statistics to help a user to decide what species may be suitable for their research. Some of the data is in, for example, http://ftp.ensembl.org/pub/current/uniprot_report_EnsemblVertebrates.txt and other such top-level files.
Desirable data:

species tree
nucleotide counts (better than GC%)

DOC: create docs for users

what do you think is essential for the first announced release?

when num CPU is 1, don't use multiprocess

this is causing problems in Docker containers with limited resources.

The solution is to write a function that, when num CPUs=1, returns a map(func, series) instead of the cogent3.util.parallel.as_completed(func, series)
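
A rough sketch of such a function is below. To keep it self-contained it uses the standard library ProcessPoolExecutor as a stand-in for cogent3.util.parallel.as_completed. The name apply_to_series is hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def apply_to_series(func, series, max_workers=1):
    """Yield func(item) for each item, avoiding multiprocessing when serial.

    With a single worker the builtin map() is used, so no process pool is
    ever created. Otherwise results are yielded as they complete, which
    means their order may differ from the input order.
    """
    if max_workers <= 1:
        yield from map(func, series)
        return

    # func and the items must be picklable for the multi-process branch
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(func, item) for item in series]
        for future in as_completed(futures):
            yield future.result()


serial_results = list(apply_to_series(abs, [-1, 2, -3]))
```

Because both branches are consumed by iteration, the calling loop in the install step would not need to change.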

download.cfg paths should be relative

they are currently absolute which means if the directory is removed, the connections are broken

for example, instead of

/Users/someone/working/apes_111/download_111

make it

apes_111/download_111
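
The conversion could be sketched with pathlib as below. It assumes paths are stored relative to a base directory, such as the one holding the config file. Both helper names are hypothetical, and PurePosixPath is used only so the example behaves the same on every platform.

```python
from pathlib import PurePosixPath


def to_relative(path, base_dir):
    """Return path relative to base_dir; raises ValueError if outside it."""
    return PurePosixPath(path).relative_to(base_dir)


def to_absolute(rel_path, base_dir):
    """Rebuild a usable path from the stored relative one."""
    return PurePosixPath(base_dir) / rel_path


stored = to_relative(
    "/Users/someone/working/apes_111/download_111", "/Users/someone/working"
)
# After the parent directory is moved, the stored value still resolves
moved = to_absolute(stored, "/data/projects")
```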

EnsemblLite roadmap sketch

Minimal usability

Note
This means the tool can be installed and used to download and install data from ensembl.org necessary to perform the selected set of queries defined below. These queries will produce standard formats. This variant will have a lot of errors!

terminal interface for controlling primary functions
given a config, can get sequences, gff3, and alignments from Ensembl.org
above is transformed to a "local" install via sqlite databases
dump-genes command generates a tsv formatted file for a species
export one-to-one homologs as fasta
display installed data in the terminal
dump alignment segments corresponding to a user provided input list of genomic regions / genes
mask segments of exported alignments for selected feature types (e.g. exons)

Track progress on the issues page

Robust usability

Note
Same feature set as above, but now the tool has a test coverage of ~90% and internals have been refactored for clarity. We will call for testers at this point. This variant will have some errors!

supported annotations extended to include repeats, CpG islands (see #65)
~90% of all code lines are evaluated with tests
homolog querying expanded to different relationship types (e.g. ortholog one-to-many, paralogs)
scitrack logging will be added
minimal documentation

Track progress on the issues page

Selecting a slice of an alignment

To get an alignment, we need the block_id, and to get this we need a species identifier and genome coordinate (let's call this ref_coord). AlignDb.get_records_matching() does this.

With the block_id we can then get all the coordinates for other species in
the block. (NOTE: if we are going to support strict, i.e. only blocks
where all nominated species are present, we would filter at this level).

The gap_spans returned for a block will exceed the coordinates of the selected region. These need to be trimmed to just the region so that only the desired sequences are selected. So we need to implement a container class that handles gap positions; let's call this GapPositions. For the ref_coord, we need GapPositions to convert the sequence indices into alignment indices:

align_start = <ref gap pos>.from_seq_to_align_index(ref_start)
align_end = align_start + ref_end - ref_start

For all other records, we then use the reverse conversion, from alignment index to sequence index, which gives us the coordinates for that species.

other_start = <other species gap pos>.from_align_to_seq_index(align_start)
other_end = <other species gap pos>.from_align_to_seq_index(align_end)

These values can then be used to obtain the genome sequence followed by a subset of the annotation data and combined into an Alignment instance.
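
The two index conversions described above can be sketched as follows. This is one possible shape for GapPositions, not an existing implementation. It assumes gaps are stored as (sequence position, length) pairs, with each gap inserted before that sequence position. Mapping an alignment index that falls inside a gap to the next sequence position is also an assumption.

```python
from bisect import bisect_right
from itertools import accumulate


class GapPositions:
    """Converts between sequence and alignment indices (illustrative)."""

    def __init__(self, gaps):
        # gaps: iterable of (seq_pos, length) pairs
        self.gaps = sorted((int(p), int(n)) for p, n in gaps)
        self._positions = [p for p, _ in self.gaps]
        # running total of gap characters up to and including each gap
        self._cumulative = list(accumulate(n for _, n in self.gaps))

    def from_seq_to_align_index(self, seq_index):
        # every gap located at or before seq_index shifts it to the right
        n = bisect_right(self._positions, seq_index)
        return seq_index + (self._cumulative[n - 1] if n else 0)

    def from_align_to_seq_index(self, align_index):
        total = 0  # gap characters seen so far
        for pos, length in self.gaps:
            gap_start = pos + total  # alignment index where this gap begins
            if align_index < gap_start:
                break
            if align_index < gap_start + length:
                return pos  # inside a gap: report the next sequence position
            total += length
        return align_index - total


# Sequence ACGTA aligned as --ACG-TA
gp = GapPositions([(0, 2), (3, 1)])
align_start = gp.from_seq_to_align_index(3)  # the T
seq_start = gp.from_align_to_seq_index(align_start)
```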

Restarting install triggers SQL error

While the solution is to use the -f (force overwrite) flag, the error message is not helpful.

File "/home/someuser/env/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/cli.py", line 277, in install
    local_install_genomes(
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/_install.py", line 129, in local_install_genomes
    db.add_compressed_records(records=records)
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/_genomedb.py", line 108, in add_compressed_records
    self.db.executemany(sql, [(n, s, len(s)) for n, s in records])
sqlite3.IntegrityError: UNIQUE constraint failed: genome.seqid
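
One way to make the message more helpful is to catch the sqlite3 error at the point of insertion and re-raise it with a hint about -f. The sketch below uses an in-memory database with a hypothetical three-column genome table. The real schema and the function add_records are assumptions, not the ensembl_lite code.

```python
import sqlite3


def add_records(db, records):
    """Insert (seqid, seq) records, explaining duplicates to the user."""
    sql = "INSERT INTO genome(seqid, seq, length) VALUES (?, ?, ?)"
    try:
        with db:  # commits on success, rolls back on error
            db.executemany(sql, [(n, s, len(s)) for n, s in records])
    except sqlite3.IntegrityError as err:
        msg = (
            "genome data is already installed at this location; "
            "rerun with -f to force overwrite"
        )
        raise RuntimeError(msg) from err


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE genome (seqid TEXT PRIMARY KEY, seq TEXT, length INT)")
add_records(db, [("chr1", "ACGT")])

# Simulate restarting the install against the same database
try:
    add_records(db, [("chr1", "ACGT")])
    message = ""
except RuntimeError as err:
    message = str(err)
```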

ENH: added completion checkpoint files

To the downloaded folders and to the installed folders. These can be inspected to eliminate unnecessary computation.
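
A checkpoint can be as simple as an empty marker file written once a folder has been fully processed. In the sketch below the marker name DONE and the helper run_once are hypothetical choices for illustration.

```python
import tempfile
from pathlib import Path

CHECKPOINT_NAME = "DONE"  # hypothetical marker file name


def is_complete(folder):
    return (Path(folder) / CHECKPOINT_NAME).exists()


def run_once(folder, task):
    """Run task(folder) unless the folder already holds a checkpoint.

    Returns True if the task ran. The marker is written only after the
    task finishes, so an interrupted run is repeated next time.
    """
    folder = Path(folder)
    if is_complete(folder):
        return False
    folder.mkdir(parents=True, exist_ok=True)
    task(folder)
    (folder / CHECKPOINT_NAME).touch()
    return True


calls = []
with tempfile.TemporaryDirectory() as tmp:
    first = run_once(tmp, calls.append)   # does the work
    second = run_once(tmp, calls.append)  # skipped, checkpoint exists
```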

investigate process for GitHub action based PyPI releases

we want to be able to have the flit build and publish steps controlled via GitHub actions. For a project, how does this work?

Exporting homolog groups issue

To export homolog genes:
elt homologs -i <path to the installation directory> -o OUTPATH --limit 100

When setting the limit to 100 it raises the error:

IndexError: list index out of range

Full traceback:

message="Traceback (most recent call last):
  File "/Users/gulugulu/opt/anaconda3/envs/c311/lib/python3.11/site-packages/cogent3/app/composable.py", line 366, in _call
    result = self.main(val, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gulugulu/opt/anaconda3/envs/c311/lib/python3.11/site-packages/cogent3/app/composable.py", line 429, in _main
    return self._user_func(**bound.arguments)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gulugulu/EnsemblLite_folk/src/ensembl_lite/_genomedb.py", line 286, in get_selected_seqs
    return list(get_seqs_for_ids(config=config, species=species, names=gene_ids))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gulugulu/EnsemblLite_folk/src/ensembl_lite/_genomedb.py", line 247, in get_seqs_for_ids
    feature = list(genome.get_features(name=f"%{name}"))[0]
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
")

ENH: cli for listing the available whole genome alignments

A command for listing species present in whole genome alignments, for example

https://ftp.ensembl.org/pub/release-110/compara/species_trees/10_primates_EPO_default.nh

Define a download cfg file

The ensembl ftp link provides access to many different formats. Among these are multiple sequence alignments.

The full description of the resources is in the README

The structure of the .cfg file needs to support selecting the types of data for each genome:

seq=fasta, we will grab the entire genome sequences (what sub directory? dna?)
features=gff3 (or json?)

For compara, we need the alignments for the selected species. This could be derived automatically by parsing the READMEs (e.g. this one), but that will be fragile. Better to require hard-coding of the alignment directory name to start with (e.g. *primates).

The homology information is here.
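
As a discussion aid, here is one hypothetical layout for such a file, read with the standard library configparser. Only the seq and features keys come from the list above. The section names, the alignments key and its placeholder value are invented for illustration and are not an agreed format.

```python
from configparser import ConfigParser

# Hypothetical layout: one section per genome plus shared sections
EXAMPLE_CFG = """
[release]
release = 110

[Homo sapiens]
seq = fasta
features = gff3

[compara]
# hard-coded alignment directory name (placeholder value)
alignments = example_primates_dir
"""

parser = ConfigParser()
parser.read_string(EXAMPLE_CFG)

shared = {"release", "compara"}
species = [name for name in parser.sections() if name not in shared]
release = parser.getint("release", "release")
data_types = dict(parser["Homo sapiens"])
```

Keeping one section per genome means the data types can be selected independently for each species.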

Usage on memory limited hardware

When trying to install vertebrate genomes, multiple users have run into what appear to be memory limits when running on RAM limited hardware (including my lab server which has 16GB RAM).

We need a more efficient data storage for sequences that will allow loading into memory only what we need.

Possible choices are HDF5 or Arne's custom backend.

add option to just get homology info to compara section

in case multiple-sequence alignments not desired

Initial release on pypi

take a look at current readme
is workflow described?
basic usage examples?
setup token for shared pypi commit

including repeats in annotation data for querying

Minimal data for repeat information is:

genomic coordinates (coordinate name, start, end, strand)
repeat classification information, including repeat type, repeat class, score (?)

What is the minimum data needed to download that will provide this? (That is, what are the minimum files required.)

Conceptually separate the download from the install step. The former grabs data in a sufficiently complete form that the install step does not require internet access.

We then need to transform this to be compatible with the querying approach for gene features, perhaps a single table.