cogent3 / ensembllite

This project forked from cogent3/ensembldb3

A new approach to obtaining local copies of ensembl data

Python 99.62% Jinja 0.38%

ensembllite's Issues

EnsemblLite roadmap sketch

Minimal usability

Note
This means the tool can be installed and used to download and install data from ensembl.org necessary to perform the selected set of queries defined below. These queries will produce standard formats. This variant will have a lot of errors!

  • terminal interface for controlling primary functions
  • given a config, can get sequences, gff3, and alignments from Ensembl.org
  • above is transformed to a "local" install via sqlite databases
  • dump-genes command generates a tsv formatted file for a species
  • export one-to-one homologs as fasta
  • display installed data in the terminal
  • dump alignment segments corresponding to a user provided input list of genomic regions / genes
  • mask segments of exported alignments for selected feature types (e.g. exons)

Track progress on the issues page

Robust usability

Note
Same feature set as above, but now the tool has a test coverage of ~90% and internals have been refactored for clarity. We will call for testers at this point. This variant will have some errors!

  • supported annotations extended to include repeats, CpG islands (see #65)
  • ~90% of all code lines are evaluated with tests
  • homolog querying expanded to different relationship types (e.g. ortholog one-to-many, paralogs)
  • scitrack logging will be added
  • minimal documentation

Track progress on the issues page

when num CPU is 1, don't use multiprocess

This is causing problems in Docker containers with limited resources.

The solution is to write a function that, when the number of CPUs is 1, returns map(func, series) instead of cogent3.util.parallel.as_completed(func, series).
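
A minimal sketch of such a function follows. The name as_completed_or_map and its max_workers argument are hypothetical. The standard library's concurrent.futures stands in for cogent3.util.parallel.as_completed so that the example is self-contained.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def as_completed_or_map(func, series, max_workers=1):
    """Yield func(item) for each item in series.

    Hypothetical helper: with a single worker it stays in the current
    process, otherwise it hands the work to a process pool.
    """
    if max_workers <= 1:
        # serial path: no worker processes are spawned
        yield from map(func, series)
        return

    # parallel path: results arrive in completion order, not input order
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(func, item) for item in series]
        for future in as_completed(futures):
            yield future.result()
```

With max_workers=1 the results come back in input order, since the builtin map() is used directly.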

ENH: ability to get summary statistics for Ensembl hosted genomes

Provide summary statistics to help a user decide which species may be suitable for their research. Some of the data is in, for example, http://ftp.ensembl.org/pub/current/uniprot_report_EnsemblVertebrates.txt and other such top-level files.
Desirable data:

  • species tree
  • nucleotide counts (better than GC%)
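
As an illustration of why counts are the more general statistic, the sketch below tallies nucleotides and derives GC% from them. The function name nucleotide_counts and the toy sequence are made up for the example.

```python
from collections import Counter


def nucleotide_counts(seq):
    """Count each character of a sequence, ignoring case."""
    return Counter(seq.upper())


# toy sequence, not real Ensembl data
counts = nucleotide_counts("ACGTNNacgg")

# GC% is derivable from the counts, but not the other way around
canonical = sum(counts[base] for base in "ACGT")
gc_fraction = (counts["G"] + counts["C"]) / canonical
```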

Initial release on pypi

  • take a look at current readme
  • is workflow described?
  • basic usage examples?
  • setup token for shared pypi commit

download issue

The homology data are downloaded and installed, except for 'ovis aries', which keeps raising this error when I try to download its homology data:

ftplib.error_perm: 550 Failed to change directory.
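
One way to keep a single failing directory from stopping the whole download is to treat the 550 reply as "not available" and report it. The sketch below assumes the error comes from changing into a directory that does not exist on the server, which has not been confirmed here. remote_dir_exists is a hypothetical helper, not part of ensembl_lite.

```python
import ftplib


def remote_dir_exists(ftp, path):
    """Return False if the server refuses to change into path with a 550 reply.

    ftp is expected to behave like ftplib.FTP (only cwd() is used).
    """
    try:
        ftp.cwd(path)
    except ftplib.error_perm as err:
        if str(err).startswith("550"):
            return False
        # any other permanent error is unexpected, so let it propagate
        raise
    return True
```

A caller could then skip the species, and log a warning, when the function returns False.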

Usage on memory limited hardware

When trying to install vertebrate genomes, multiple users have run into what appear to be memory limits when running on RAM limited hardware (including my lab server which has 16GB RAM).

We need a more efficient storage format for sequences, one that allows loading into memory only what we need.

Possible choices are HDF5 or Arne's custom backend.

install fails on some systems

Running this command

elt install -d ~/someuser/whole_genome_mammal87/ensembl_download

Failed with the following exception

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/cli.py", line 277, in install
    local_install_genomes(
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/_install.py", line 115, in local_install_genomes
    for _ in PAR.as_completed(
  File "/home/someuser/env/lib/python3.11/site-packages/cogent3/util/parallel.py", line 239, in as_completed
    yield from _as_completed_mproc(f, s, max_workers)
  File "/home/someuser/env/lib/python3.11/site-packages/cogent3/util/parallel.py", line 227, in _as_completed_mproc
    yield result.result()
          ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

I now think this is because the current implementation always uses concurrent futures, even if running on a single process. This can be fixed by providing a simple mechanism to use the builtin map() function in those cases.

Selecting a slice of an alignment

To get an alignment, we need the block_id. To get this we need a species identifier and genome coordinate (let's call this ref_coord). AlignDb.get_records_matching() does this.

With the block_id we can then get all the coordinates for other species in
the block. (NOTE: if we are going to support strict, i.e. only blocks
where all nominated species are present, we would filter at this level).

The gap_spans returned for a block will exceed the coordinates of the selected region. These will need to be trimmed to just the region so that only the desired sequences are selected. So we need to implement a container class that handles gap positions; let's call this GapPositions. For the ref_coord, we need GapPositions to convert the sequence indices into alignment indices:

align_start = <ref gap pos>.from_seq_to_align_index(ref_start)
align_end = align_start + ref_end - ref_start

For all other records, we then convert the alignment indices back to sequence indices, which gives us the coordinates for that species.

other_start = <other species gap pos>.from_align_to_seq_index(align_start)
other_end = <other species gap pos>.from_align_to_seq_index(align_end)

These values can then be used to obtain the genome sequence, followed by a subset of the annotation data, and combined into an Alignment instance.
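
The index conversions described above can be sketched as follows. This is not the ensembl_lite implementation. It assumes each gap is stored as a (seq_pos, length) pair, where seq_pos is the index in the ungapped sequence before which the gap is inserted. The example computes align_end from the last selected reference position so that gaps inside the region are counted, which differs from the simpler start-plus-length expression.

```python
from bisect import bisect_right
from itertools import accumulate


class GapPositions:
    """Converts between sequence and alignment indices for one aligned sequence.

    Assumption: gaps is sorted by position and holds (seq_pos, length) pairs,
    seq_pos being the ungapped index before which the gap is inserted.
    """

    def __init__(self, gaps):
        self.gaps = [(int(pos), int(length)) for pos, length in gaps]
        self._seq_pos = [pos for pos, _ in self.gaps]
        # running total of gap characters up to and including each gap
        self._cum = list(accumulate(length for _, length in self.gaps))

    def from_seq_to_align_index(self, seq_index):
        # every gap inserted at or before seq_index shifts it to the right
        num = bisect_right(self._seq_pos, seq_index)
        return seq_index + (self._cum[num - 1] if num else 0)

    def from_align_to_seq_index(self, align_index):
        total = 0
        for seq_pos, length in self.gaps:
            gap_start = seq_pos + total  # alignment index where this gap begins
            if align_index < gap_start:
                break
            if align_index < gap_start + length:
                # inside a gap: map to the next sequence position
                return seq_pos
            total += length
        return align_index - total


ref = GapPositions([(2, 2)])    # e.g. AC--GTAC
other = GapPositions([(5, 1)])  # e.g. ACGGT-TA

ref_start, ref_end = 1, 4
align_start = ref.from_seq_to_align_index(ref_start)
align_end = ref.from_seq_to_align_index(ref_end - 1) + 1
other_start = other.from_align_to_seq_index(align_start)
other_end = other.from_align_to_seq_index(align_end)
```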

download.cfg paths should be relative

they are currently absolute which means if the directory is removed, the connections are broken

for example, instead of

/Users/someone/working/apes_111/download_111

make it

apes_111/download_111
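
A minimal sketch of the conversion, assuming paths are stored relative to some chosen base directory and resolved against it when the config is read. The helper names and the choice of base directory are hypothetical. PurePosixPath is used only so the example behaves the same on every platform.

```python
from pathlib import PurePosixPath


def make_relative(path, base_dir):
    """Return path relative to base_dir.

    Raises ValueError if path is not inside base_dir.
    """
    return PurePosixPath(path).relative_to(base_dir)


def make_absolute(rel_path, base_dir):
    """Resolve a stored relative path against the current base directory."""
    return PurePosixPath(base_dir) / rel_path


base = "/Users/someone/working"
stored = make_relative("/Users/someone/working/apes_111/download_111", base)
restored = make_absolute(stored, base)
```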

including repeats in annotation data for querying

Minimal data for repeat information is:

  • genomic coordinates (coordinate name, start, end, strand)
  • repeat classification information, including repeat type, repeat class, score (?)

What is the minimum data needed to download that will provide this? (That is, what are the minimum files required.)

Conceptually separate the download from the install step. The former grabs data in a sufficiently complete form that the install step does not require internet access.

We then need to transform this to be compatible with the querying approach for gene features, perhaps a single table.
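
As a starting point for discussion, the single-table idea could look like the sketch below. The table name, column names and the two rows are invented for illustration. They only mirror the minimal fields listed above and do not correspond to any Ensembl file.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# hypothetical schema: one row per repeat; "start" and "end" are quoted
# because END is an SQL keyword
db.execute(
    """
    CREATE TABLE repeat (
        seqid TEXT,
        "start" INTEGER,
        "end" INTEGER,
        strand TEXT,
        repeat_type TEXT,
        repeat_class TEXT,
        score REAL
    )
    """
)
# made-up rows, for demonstration only
db.executemany(
    "INSERT INTO repeat VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("chr1", 100, 200, "+", "type_a", "class_a", 1.0),
        ("chr1", 500, 600, "-", "type_b", "class_b", 2.0),
    ],
)


def repeats_overlapping(db, seqid, start, end):
    """Return repeats on seqid that overlap the half-open interval [start, end)."""
    sql = (
        'SELECT seqid, "start", "end", repeat_type FROM repeat '
        'WHERE seqid = ? AND "start" < ? AND "end" > ? ORDER BY "start"'
    )
    return db.execute(sql, (seqid, end, start)).fetchall()
```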

implement reinflating a stored alignment block

EnsemblLite saves a maf formatted alignment file in a sqlite db table with the following columns

column      type
source      TEXT
block_id    INT
species     TEXT
coord_name  TEXT
start       INTEGER
end         INTEGER
strand      TEXT
gap_spans   compressed_array

  • block_id corresponds to an alignment segment within the maf file
  • gap_spans is a 2D numpy array consisting of the location of a gap and its length.

To get an alignment

We need:

  • all the records for a given block_id
  • the genome sequence corresponding to that coordinate

We use the functions within ensembl_lite.convert to create the individual Aligned instances and from that the Alignment instance.
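
To make the reinflation step concrete, here is a sketch of rebuilding one gapped sequence from its ungapped sequence and gap spans. It is not the ensembl_lite.convert API. The function name inflate is hypothetical, and it assumes each gap location is an index in the ungapped sequence before which the gap is inserted.

```python
def inflate(seq, gap_spans, gap_char="-"):
    """Return seq with gaps reinserted.

    gap_spans is an iterable of (position, length) pairs, for example the
    rows of a 2D array. Positions are assumed to be ungapped sequence indices.
    """
    spans = sorted((int(pos), int(length)) for pos, length in gap_spans)
    parts = []
    prev = 0
    for pos, length in spans:
        parts.append(seq[prev:pos])      # sequence up to the gap
        parts.append(gap_char * length)  # the gap itself
        prev = pos
    parts.append(seq[prev:])             # remainder after the last gap
    return "".join(parts)
```

Applying this to every record of a block_id gives rows of equal length that can then be combined into an alignment.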

Exporting homolog groups issue

To export homolog genes:
elt homologs -i <path to the installation directory> -o OUTPATH --limit 100

When setting the limit to 100 it raises the error:

IndexError: list index out of range

Full traceback:

message="Traceback (most recent call last):
  File "/Users/gulugulu/opt/anaconda3/envs/c311/lib/python3.11/site-packages/cogent3/app/composable.py", line 366, in _call
    result = self.main(val, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gulugulu/opt/anaconda3/envs/c311/lib/python3.11/site-packages/cogent3/app/composable.py", line 429, in _main
    return self._user_func(**bound.arguments)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gulugulu/EnsemblLite_folk/src/ensembl_lite/_genomedb.py", line 286, in get_selected_seqs
    return list(get_seqs_for_ids(config=config, species=species, names=gene_ids))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gulugulu/EnsemblLite_folk/src/ensembl_lite/_genomedb.py", line 247, in get_seqs_for_ids
    feature = list(genome.get_features(name=f"%{name}"))[0]
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
")

Restarting install triggers SQL error

While the solution is to use the -f (force overwrite) flag, the error message is not helpful.

File "/home/someuser/env/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/cli.py", line 277, in install
    local_install_genomes(
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/_install.py", line 129, in local_install_genomes
    db.add_compressed_records(records=records)
  File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/_genomedb.py", line 108, in add_compressed_records
    self.db.executemany(sql, [(n, s, len(s)) for n, s in records])
sqlite3.IntegrityError: UNIQUE constraint failed: genome.seqid
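
A sketch of how the low-level error could be turned into a message that points at the fix. Only the table name genome and the column seqid come from the error above. The other column names, the function name and the message wording are assumptions.

```python
import sqlite3


def add_records(db, records):
    """Insert (seqid, seq) pairs, raising a readable error on duplicates."""
    # column names other than seqid are hypothetical
    sql = "INSERT INTO genome(seqid, seq, length) VALUES (?, ?, ?)"
    try:
        with db:  # commits on success, rolls back on error
            db.executemany(sql, [(n, s, len(s)) for n, s in records])
    except sqlite3.IntegrityError as err:
        msg = (
            "some sequences are already installed; "
            "rerun with -f to overwrite the existing install"
        )
        raise RuntimeError(msg) from err


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE genome (seqid TEXT UNIQUE, seq TEXT, length INTEGER)")
add_records(db, [("chr1", "ACGT")])
```

Calling add_records() a second time with the same seqid now raises the RuntimeError, with the original IntegrityError attached as its cause.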

Define a download cfg file

The ensembl ftp link provides access to many different formats. Among these are multiple sequence alignments.

The full description of the resources is in the README

The structure of the .cfg file needs to support selecting the types of data for each genome:

  • seq=fasta, we will grab the entire genome sequences (what sub directory? dna?)
  • features=gff3 (or json?)

For compara, we need the alignments for the selected species. This could be derived automatically by parsing the READMEs (e.g. this one), but that will be fragile. Better to require hard-coding of the alignment directory name to start with (e.g. *primates).

The homology information is here.
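
To make the discussion concrete, here is one possible shape for such a file, read with the standard library's configparser. Every section name, the align_names key and the species entry are hypothetical. Only the seq=fasta and features=gff3 options, the host, the release number and the *primates pattern come from the issues on this page.

```python
import configparser

# hypothetical layout, for discussion only
EXAMPLE_CFG = """
[remote]
host = ftp.ensembl.org
release = 110

[compara]
align_names = *primates

[Homo sapiens]
seq = fasta
features = gff3
"""


def read_download_cfg(text):
    """Split a config into remote settings, compara settings and per-species data types."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    reserved = {"remote", "compara"}
    species = {
        name: dict(parser[name])
        for name in parser.sections()
        if name not in reserved
    }
    return dict(parser["remote"]), dict(parser["compara"]), species


remote, compara, species = read_download_cfg(EXAMPLE_CFG)
```

One section per species keeps the per-genome data type selection explicit, at the cost of some repetition.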

Ensembl site-map object

The locations of data types can differ between ensembl and ensemblgenomes, for example the location of homology data or of whole genome alignments:

Taxa         Site                    Path
Bacteria     ftp.ensemblgenomes.org  pub/release-57/pan_ensembl/tsv/ensembl-compara/homologies
Vertebrates  ftp.ensembl.org         pub/release-110/tsv/ensembl-compara/homologies

To simplify dealing with this, we need a federation object that provides a unified interface for code. For example, define a site-map class with attributes for each data type.

class EnsemblSiteMap:
    @property
    def homology_path(self) -> str:
        ...

    @property
    def alignment_path(self) -> str:
        ...

    @property
    def genome_path(self) -> str:
        ...

    @property
    def annotation_path(self) -> str:
        ...

Use separate instances for each of the different Ensembl sites, with an instance made on demand.
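
A sketch of the on-demand idea, limited to the homology path because that is the only location given above. The names SiteMap and get_site_map are hypothetical, and so is the assumption that the release number can be substituted into an otherwise fixed path.

```python
from dataclasses import dataclass
from functools import lru_cache

# path templates per site; the release placeholder is an assumption
_HOMOLOGY_TEMPLATES = {
    "ftp.ensembl.org": "pub/release-{release}/tsv/ensembl-compara/homologies",
    "ftp.ensemblgenomes.org": "pub/release-{release}/pan_ensembl/tsv/ensembl-compara/homologies",
}


@dataclass(frozen=True)
class SiteMap:
    site: str
    release: int
    homology_template: str

    @property
    def homology_path(self) -> str:
        return self.homology_template.format(release=self.release)


@lru_cache(maxsize=None)
def get_site_map(site: str, release: int) -> SiteMap:
    """Create (once) and return the site map for an Ensembl site and release."""
    try:
        template = _HOMOLOGY_TEMPLATES[site]
    except KeyError:
        raise ValueError(f"unknown site {site!r}") from None
    return SiteMap(site=site, release=release, homology_template=template)
```

Calling code asks for get_site_map(site, release).homology_path and never needs to know which layout a site uses. The cache means each instance is built only when first requested.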
