cogent3 / ensembllite Goto Github PK

This project forked from cogent3/ensembldb3

A new approach to obtaining local copies of ensembl data

Python 99.62% Jinja 0.38%

ensembllite's Issues

Minimal usability

Note
This means the tool can be installed and used to download and install data from ensembl.org necessary to perform the selected set of queries defined below. These queries will produce standard formats. This variant will have a lot of errors!

terminal interface for controlling primary functions
given a config, can get sequences, gff3, and alignments from Ensembl.org
above is transformed to a "local" install via sqlite databases
dump-genes command generates a tsv formatted file for a species
export one-to-one homologs as fasta
display installed data in the terminal
dump alignment segments corresponding to a user provided input list of genomic regions / genes
mask segments of exported alignments for selected feature types (e.g. exons)

Track progress on the issues page

Robust usability

Note
Same feature set as above, but now the tool has a test coverage of ~90% and internals have been refactored for clarity. We will call for testers at this point. This variant will have some errors!

supported annotations extended to include repeats, CpG islands (see #65)
~90% of all code lines are evaluated with tests
homolog querying expanded to different relationship types (e.g. ortholog one-to-many, paralogs)
scitrack logging will be added
minimal documentation

Track progress on the issues page

DOC: create docs for users

What do you think is essential for the first announced release?

investigate process for GitHub action based PyPI releases

we want to be able to have the flit build and publish steps controlled via GitHub actions. For a project, how does this work?

when num CPU is 1, don't use multiprocess

this is causing problems in Docker containers with limited resources.

The solution is to write a function that, when num CPUs=1, returns a map(func, series) instead of the cogent3.util.parallel.as_completed(func, series)
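A minimal sketch of such a function follows. It uses the standard library's concurrent.futures as a stand-in for cogent3.util.parallel.as_completed, and the name apply_to_series and the max_workers argument are assumptions made for illustration, not part of the project.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def apply_to_series(func, series, max_workers=1):
    """Yield func(item) for each item, avoiding a process pool for one worker."""
    if max_workers == 1:
        # plain serial execution, no child processes are spawned
        yield from map(func, series)
        return

    # stand-in for cogent3.util.parallel.as_completed(func, series)
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(func, item) for item in series]
        for future in as_completed(futures):
            yield future.result()
```

In the serial branch results arrive in input order; in the parallel branch they arrive in completion order, so callers should not rely on ordering.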

DEV: decide on the project name

Needs to be discussed with Ensembl folk, cannot use their brand without consent!

ENH: ability to get summary statistics for Ensembl hosted genomes

Provide summary statistics to help a user to decide what species may be suitable for their research. Some of the data is in, for example, http://ftp.ensembl.org/pub/current/uniprot_report_EnsemblVertebrates.txt and other such top-level files.
Desirable data:

species tree
nucleotide counts (better than GC%)

Initial release on pypi

take a look at current readme
is workflow described?
basic usage examples?
setup token for shared pypi commit

ENH: added completion checkpoint files

Add completion checkpoint files to the downloaded folders and to the installed folders. These can be inspected to eliminate unnecessary computation.
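One possible shape for this, as a sketch only: an empty marker file is written once a folder has been fully processed, and checked before repeating the work. The file name COMPLETED and both function names are hypothetical.

```python
from pathlib import Path

CHECKPOINT_NAME = "COMPLETED"  # hypothetical marker file name


def mark_complete(directory):
    """Write an empty marker file once all work in directory has finished."""
    path = Path(directory) / CHECKPOINT_NAME
    path.touch()
    return path


def is_complete(directory):
    """Return True if the marker file exists, so the step can be skipped."""
    return (Path(directory) / CHECKPOINT_NAME).exists()
```

The marker should only be written after the last file has been stored, so an interrupted run leaves the folder unmarked.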

ENH: cli for listing the available whole genome alignments

A command for listing species present in whole genome alignments, for example

https://ftp.ensembl.org/pub/release-110/compara/species_trees/10_primates_EPO_default.nh

download issue

The homology data are downloaded and installed, except for 'ovis arise' because it keeps raising this error when I try to download its homology data

ftplib.error_perm: 550 Failed to change directory.

ENH: make download experience more robust

deal with flaky internet and / or sleeping laptop and reconnecting on another network
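One common approach is to retry a failed transfer with an increasing delay. The sketch below is generic and not tied to the project's download code; with_retries and its parameters are hypothetical names. It retries only on OSError and EOFError (which cover dropped and timed-out connections), so permanent FTP errors such as a 550 response are not retried.

```python
import time


def with_retries(func, attempts=5, delay=1.0, backoff=2.0,
                 exceptions=(OSError, EOFError)):
    """Call func(), retrying on transient errors with a growing delay."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                # out of attempts, let the caller see the last error
                raise
            time.sleep(delay)
            delay *= backoff
```

Combined with skipping files that are already fully downloaded, a restart after a network change would only fetch what is missing.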

Usage on memory limited hardware

When trying to install vertebrate genomes, multiple users have run into what appear to be memory limits when running on RAM limited hardware (including my lab server which has 16GB RAM).

We need a more efficient data storage for sequences that will allow loading into memory only what we need.

Possible choices are HDF5 or Arne's custom backend.

install fails on some systems

Running this command

elt install -d ~/someuser/whole_genome_mammal87/ensembl_download

Failed with the following exception

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/cli.py", line 277, in install
    local_install_genomes(
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/_install.py", line 115, in local_install_genomes
    for _ in PAR.as_completed(
  File "/home/someuser/env/lib/python3.11/site-packages/cogent3/util/parallel.py", line 239, in as_completed
    yield from _as_completed_mproc(f, s, max_workers)
  File "/home/someuser/env/lib/python3.11/site-packages/cogent3/util/parallel.py", line 227, in _as_completed_mproc
    yield result.result()
          ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

I now think this is because the current implementation always uses concurrent futures, even if running on a single process. This can be fixed by providing a simple mechanism to use the builtin map() function in those cases.

Selecting a slice of an alignment

To get an alignment, we need the block_id. To get this, we need a species identifier and a genome coordinate (let's call this ref_coord). AlignDb.get_records_matching() does this.

With the block_id we can then get all the coordinates for other species in
the block. (NOTE: if we are going to support strict, i.e. only blocks
where all nominated species are present, we would filter at this level).

The gap_spans that are returned for a block will exceed the coordinates for the selected region. These will need to be trimmed to just the region so that only the desired sequences are selected. So we need to implement a container class that handles gap positions; let's call this GapPositions. From the ref_coord, we use GapPositions to convert the sequence indices into alignment indices:

align_start = <ref gap pos>.from_seq_to_align_index(ref_start)
align_end = align_start + ref_end - ref_start

For all other records, we then use from_align_to_seq_index(), which gives us the coordinates for that species.

other_start = <other species gap pos>.from_align_to_seq_index(align_start)
other_end = <other species gap pos>.from_align_to_seq_index(align_end)

These values can then be used to obtain the genome sequence followed by a subset of the annotation data and combined into an Alignment instance.
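The index conversions described above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: each gap is taken to be a (seq_pos, length) pair meaning that length gap characters sit immediately before sequence index seq_pos, and end coordinates are exclusive. The sketch converts the end coordinate through the gap positions as well, which also covers the case where the reference itself has gaps inside the region.

```python
from bisect import bisect_right
from itertools import accumulate


class GapPositions:
    """Converts between sequence and alignment indices for one aligned sequence.

    Assumption: each gap is (seq_pos, length), i.e. `length` gap characters
    inserted immediately before sequence index `seq_pos`.
    """

    def __init__(self, gaps):
        gaps = sorted(gaps)
        self.seq_pos = [pos for pos, _ in gaps]
        lengths = [length for _, length in gaps]
        # total gap length up to and including each gap
        self.cum_lengths = list(accumulate(lengths))
        # alignment indices at which each gap starts and ends (end exclusive)
        self.align_ends = [p + c for p, c in zip(self.seq_pos, self.cum_lengths)]
        self.align_starts = [e - n for e, n in zip(self.align_ends, lengths)]

    def from_seq_to_align_index(self, seq_index):
        # number of gaps located at or before this sequence index
        n = bisect_right(self.seq_pos, seq_index)
        return seq_index + (self.cum_lengths[n - 1] if n else 0)

    def from_align_to_seq_index(self, align_index):
        # number of gaps starting at or before this alignment index
        n = bisect_right(self.align_starts, align_index)
        if n == 0:
            return align_index
        if align_index < self.align_ends[n - 1]:
            # inside a gap: report the sequence index that follows the gap
            return self.seq_pos[n - 1]
        return align_index - self.cum_lengths[n - 1]


# made-up example: reference aligned as AC---GTA, other species as ACGT--AC
ref = GapPositions([(2, 3)])
other = GapPositions([(4, 2)])
ref_start, ref_end = 1, 4

align_start = ref.from_seq_to_align_index(ref_start)
align_end = ref.from_seq_to_align_index(ref_end - 1) + 1
other_start = other.from_align_to_seq_index(align_start)
other_end = other.from_align_to_seq_index(align_end)
```

Using sorted lists with bisect keeps each lookup logarithmic in the number of gaps, which matters for long alignment blocks.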

download.cfg paths should be relative

They are currently absolute, which means that if the directory is moved, the paths no longer resolve.

for example, instead of

/Users/someone/working/apes_111/download_111

make it

apes_111/download_111
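A small sketch of the conversion, assuming the path is stored relative to some base directory (for instance the directory holding the cfg file, which is an assumption here). The function name make_relative is hypothetical.

```python
from pathlib import PurePosixPath


def make_relative(path, base):
    """Return path expressed relative to base, as a POSIX-style string."""
    # relative_to raises ValueError if path is not located under base
    return PurePosixPath(path).relative_to(base).as_posix()


relative = make_relative(
    "/Users/someone/working/apes_111/download_111",
    "/Users/someone/working",
)
```

When reading the cfg back, the stored value would be joined onto the current location of the base directory to recover a usable path.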

including repeats in annotation data for querying

Minimal data for repeat information is:

genomic coordinates (coordinate name, start, end, strand)
repeat classification information, including repeat type, repeat class, score (?)

What is the minimum data needed to download that will provide this? (That is, what are the minimum files required.)

Conceptually separate the download from the install step. The former grabs data in a sufficiently complete form that the install step does not require internet access.

We then need to transform this to be compatible with the querying approach for gene features, perhaps a single table.
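As a sketch of what a single table could look like, the following uses the standard library sqlite3 module. The table name, column names and the example record are all hypothetical; the end coordinate is named stop here to avoid the SQL keyword END.

```python
import sqlite3

# hypothetical single-table layout covering the minimal repeat data
SCHEMA = """
CREATE TABLE repeats (
    coord_name TEXT,
    start INTEGER,
    stop INTEGER,
    strand TEXT,
    repeat_type TEXT,
    repeat_class TEXT,
    score REAL
)
"""


def make_repeat_db(records):
    """Create an in-memory database holding the given repeat records."""
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    db.executemany("INSERT INTO repeats VALUES (?, ?, ?, ?, ?, ?, ?)", records)
    return db


def repeats_overlapping(db, coord_name, start, stop):
    """Return repeats that overlap the half-open interval [start, stop)."""
    sql = (
        "SELECT repeat_type, repeat_class, start, stop FROM repeats "
        "WHERE coord_name = ? AND start < ? AND stop > ?"
    )
    return db.execute(sql, (coord_name, stop, start)).fetchall()


# made-up record, for illustration only
db = make_repeat_db([("1", 100, 400, "+", "Alu", "SINE", 2000.0)])
hits = repeats_overlapping(db, "1", 350, 500)
```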

implement reinflating a stored alignment block

EnsemblLite saves a maf formatted alignment file in a sqlite db with the following columns

column	type
source	TEXT
block_id	INT
species	TEXT
coord_name	TEXT
start	INTEGER
end	INTEGER
strand	TEXT
gap_spans	compressed_array

block_id corresponds to an alignment segment within the maf file
gap_spans is a 2D numpy array consisting of the location of a gap and its length.

To get an alignment

We need:

all the records for a given block_id
the genome sequence corresponding to that coordinate

We use the functions within ensembl_lite.convert to create the individual Aligned instances and from that the Alignment instance.
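To illustrate the idea of reinflating, here is a plain-Python sketch that rebuilds a gapped sequence from an ungapped one and its gap spans. It does not use the ensembl_lite.convert functions, and it assumes each gap location is the sequence index before which the gap is inserted; the function name inflate is hypothetical.

```python
def inflate(seq, gap_spans, gap_char="-"):
    """Rebuild a gapped sequence from an ungapped one.

    Assumption: gap_spans holds (seq_pos, length) rows sorted by seq_pos,
    with each gap inserted immediately before sequence index seq_pos.
    """
    parts = []
    prev = 0
    for pos, length in gap_spans:
        parts.append(seq[prev:pos])      # sequence up to the gap
        parts.append(gap_char * length)  # the gap itself
        prev = pos
    parts.append(seq[prev:])             # remainder after the last gap
    return "".join(parts)


gapped = inflate("ACGTA", [(2, 3)])
```

Applying this to every record that shares a block_id, each with its own genome sequence slice, would give the rows of the alignment.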

ENH: cli for downloading latest available species

Downloading the latest species / common name listings from https://ftp.ensembl.org/pub/release-110/species_EnsemblVertebrates.txt

NOTE: this file cannot be automatically read using load_table since data rows have an extra tab character at the end
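A minimal sketch of a tolerant parser for this case follows. It assumes a tab-delimited file whose header line may start with "#", and it drops a trailing empty field only when a data row has exactly one more field than the header. The function name and the sample data are made up for illustration.

```python
def parse_species_table(text):
    """Parse tab-delimited text whose data rows may end with an extra tab."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].lstrip("#").split("\t")
    rows = []
    for line in lines[1:]:
        fields = line.split("\t")
        # an extra trailing tab shows up as one surplus empty field
        if len(fields) == len(header) + 1 and fields[-1] == "":
            fields = fields[:-1]
        rows.append(dict(zip(header, fields)))
    return rows


# made-up sample with the trailing tab on the data row
sample = "#name\tspecies\ttaxonomy_id\nHuman\thomo_sapiens\t9606\t\n"
rows = parse_species_table(sample)
```

Checking the field count before trimming avoids discarding a legitimately empty final column.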

Exporting homolog groups issue

To export homolog genes:
elt homologs -i <path to the installation directory> -o OUTPATH --limit 100

When setting the limit to 100 it raises the error:

IndexError: list index out of range

Full traceback:

message="Traceback (most recent call last):
  File "/Users/gulugulu/opt/anaconda3/envs/c311/lib/python3.11/site-packages/cogent3/app/composable.py", line 366, in _call
    result = self.main(val, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gulugulu/opt/anaconda3/envs/c311/lib/python3.11/site-packages/cogent3/app/composable.py", line 429, in _main
    return self._user_func(**bound.arguments)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gulugulu/EnsemblLite_folk/src/ensembl_lite/_genomedb.py", line 286, in get_selected_seqs
    return list(get_seqs_for_ids(config=config, species=species, names=gene_ids))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gulugulu/EnsemblLite_folk/src/ensembl_lite/_genomedb.py", line 247, in get_seqs_for_ids
    feature = list(genome.get_features(name=f"%{name}"))[0]
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
")

Restarting install triggers SQL error

While the solution is to use the -f (force overwrite) flag, the error message is not helpful.

File "/home/someuser/env/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/cli.py", line 277, in install
    local_install_genomes(
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/_install.py", line 129, in local_install_genomes
    db.add_compressed_records(records=records)
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/_genomedb.py", line 108, in add_compressed_records
    self.db.executemany(sql, [(n, s, len(s)) for n, s in records])
sqlite3.IntegrityError: UNIQUE constraint failed: genome.seqid
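One way to make the failure more informative is to catch the sqlite3.IntegrityError and re-raise it with a hint. The sketch below is standalone: the table layout is inferred from the traceback (seqid, sequence, length) and is an assumption, as are the function name and the message wording.

```python
import sqlite3


def add_records(db, records):
    """Insert (seqid, seq) records, raising a readable error on duplicates."""
    sql = "INSERT INTO genome(seqid, seq, length) VALUES (?, ?, ?)"
    try:
        with db:  # commits on success, rolls back on error
            db.executemany(sql, [(n, s, len(s)) for n, s in records])
    except sqlite3.IntegrityError as err:
        msg = (
            "sequences are already installed in this database; "
            "use -f to force overwrite of the existing install"
        )
        raise RuntimeError(msg) from err


# hypothetical schema with a unique seqid column
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE genome (seqid TEXT PRIMARY KEY, seq TEXT, length INT)")
add_records(db, [("1", "ACGT")])
```

Chaining with "from err" keeps the original sqlite3 error available in the traceback for debugging.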

create a docker container for installing EnsemblLite

Based on the cogent3 workshop container, do a direct git clone from the https link and a developer install so the code is accessible for playing with.

Define a download cfg file

The ensembl ftp link provides access to many different formats. Among these are multiple sequence alignments.

The full description of the resources is in the README

The structure of the .cfg file needs to support selecting the types of data for each genome:

seq=fasta, we will grab the entire genome sequences (what sub directory? dna?)
features=gff3 (or json?)

For compara, we need the alignments for the selected species. This could be derived automatically by parsing the READMEs (e.g. this one), but that will be fragile. Better to require hard-coding of the alignment directory name to start with (e.g. *primates).

The homology information is here.
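As a sketch of how such a file might be laid out and read, the following uses the standard library configparser. The section names (remote, release, compara, one section per genome) and the option names host, number and align_dir are hypothetical; only the seq=fasta and features=gff3 options and the *primates example come from the description above.

```python
import configparser

# hypothetical layout, section and option names are illustrative only
EXAMPLE_CFG = """
[remote]
host = ftp.ensembl.org

[release]
number = 110

[human]
seq = fasta
features = gff3

[compara]
align_dir = *primates
"""

RESERVED = {"remote", "release", "compara"}


def read_download_cfg(text):
    """Parse cfg text into host, release, per-genome options and alignment dir."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # every section that is not reserved is treated as a genome
    genomes = {
        name: dict(parser[name])
        for name in parser.sections()
        if name not in RESERVED
    }
    return {
        "host": parser.get("remote", "host"),
        "release": parser.getint("release", "number"),
        "genomes": genomes,
        "align_dir": parser.get("compara", "align_dir"),
    }


cfg = read_download_cfg(EXAMPLE_CFG)
```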

ENH: cli interface for grabbing the species trees

These can be for the entire Ensembl collection or those present in whole genome alignments. For vertebrates the URL is
https://ftp.ensembl.org/pub/release-110/compara/species_trees/

add option to just get homology info to compara section

In case multiple-sequence alignments are not desired.

Ensembl site-map object

The locations of data types, such as homology data and whole genome alignments, can differ between ensembl and ensemblgenomes. For example:

Taxa	Site	Path
Bacteria	ftp.ensemblgenomes.org	pub/release-57/pan_ensembl/tsv/ensembl-compara/homologies
Vertebrates	ftp.ensembl.org	pub/release-110/tsv/ensembl-compara/homologies

To simplify dealing with this, we need a federation object that provides a unified interface for code. For example, define a site-map class with an attribute for each data type.

class EnsemblSiteMap:
    @property
    def homology_path(self) -> str:
        ...

    @property
    def alignment_path(self) -> str:
        ...

    @property
    def genome_path(self) -> str:
        ...

    @property
    def annotation_path(self) -> str:
        ...

Separate instances for each of the different Ensembl sites. Have an instance made on demand.
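One way to create instances on demand is a cached factory function. The sketch below is an assumption about how this could look, not the planned design: SiteMap, get_site_map and the lookup keys are hypothetical, only the homology path is filled in, and the sites and paths are copied from the table above (so the release numbers are hard-coded for illustration).

```python
from dataclasses import dataclass
from functools import cache


@dataclass(frozen=True)
class SiteMap:
    """Holds the location of each data type for one Ensembl site."""

    site: str
    homology_path: str


# values taken from the table above
_SITES = {
    "vertebrates": (
        "ftp.ensembl.org",
        "pub/release-110/tsv/ensembl-compara/homologies",
    ),
    "bacteria": (
        "ftp.ensemblgenomes.org",
        "pub/release-57/pan_ensembl/tsv/ensembl-compara/homologies",
    ),
}


@cache
def get_site_map(taxa):
    """Create the SiteMap for taxa on first request, then reuse it."""
    site, homology_path = _SITES[taxa.lower()]
    return SiteMap(site=site, homology_path=homology_path)


vertebrates = get_site_map("vertebrates")
```

Because the factory is cached, repeated calls with the same key return the same instance, and an unknown key raises KeyError.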
