cogent3 / ensembllite
This project forked from cogent3/ensembldb3
A new approach to obtaining local copies of ensembl data
Note
This means the tool can be installed and used to download and install data from ensembl.org necessary to perform the selected set of queries defined below. These queries will produce standard formats. This variant will have a lot of errors!
Track progress on the issues page
Note
Same feature set as above, but now the tool has a test coverage of ~90% and internals have been refactored for clarity. We will call for testers at this point. This variant will have some errors!
What do you think is essential for the first announced release?
We want the flit build and publish steps to be controlled via GitHub Actions. How does this work for a project?
This is causing problems in Docker containers with limited resources.
The solution is to write a function that, when the number of CPUs is 1, returns map(func, series) instead of cogent3.util.parallel.as_completed(func, series).
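A minimal sketch of that fallback. It assumes the parallel path can be stood in for by the standard library's `concurrent.futures`; the real code calls `cogent3.util.parallel.as_completed`, which is not reproduced here. The name `as_completed_or_map` and the `max_workers` parameter are hypothetical.

```python
from collections.abc import Callable, Iterable, Iterator
from concurrent.futures import ProcessPoolExecutor, as_completed


def as_completed_or_map(
    func: Callable, series: Iterable, max_workers: int = 1
) -> Iterator:
    """Yield func(item) for each item, serially when only one worker is requested."""
    if max_workers <= 1:
        # Serial path: no process pool is created, so nothing can be
        # killed by the container's resource limits.
        yield from map(func, series)
        return

    # Parallel path (stand-in for the cogent3 helper): results arrive
    # in completion order, not input order.
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(func, item) for item in series]
        for future in as_completed(futures):
            yield future.result()


serial_results = list(as_completed_or_map(abs, [-1, -2, 3]))
```

Because the serial branch never touches `concurrent.futures`, callers can keep a single loop over the returned iterator regardless of how many CPUs are available.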
Needs to be discussed with Ensembl folk, cannot use their brand without consent!
Provide summary statistics to help a user to decide what species may be suitable for their research. Some of the data is in, for example, http://ftp.ensembl.org/pub/current/uniprot_report_EnsemblVertebrates.txt
and other such top-level files.
Desirable data:
To the downloaded folders and to the installed folders. These can be inspected to eliminate unnecessary computation.
A for listing species present in whole genome alignments, for example
https://ftp.ensembl.org/pub/release-110/compara/species_trees/10_primates_EPO_default.nh
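One rough way to list the species in such a tree file is to pull the tip labels out of the newick string. The sketch below uses a regular expression rather than a full newick parser, assumes labels are unquoted, and uses a made-up three-species tree instead of the real file contents. The function name `tip_names` is hypothetical.

```python
import re

# A tip label directly follows "(" or ","; internal node labels follow ")".
_TIP = re.compile(r"[(,]\s*([^(),:;\s]+)")


def tip_names(newick: str) -> list[str]:
    """Return tip labels from a newick string (unquoted labels only)."""
    return _TIP.findall(newick)


# Illustrative tree, not the content of the Ensembl file.
example = "((homo_sapiens:0.1,pan_troglodytes:0.1):0.2,gorilla_gorilla:0.3);"
species = tip_names(example)
```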
The homology data are downloaded and installed, except for 'ovis arise' because it keeps raising this error when I try to download its homology data
ftplib.error_perm: 550 Failed to change directory.
Deal with flaky internet and/or a sleeping laptop reconnecting on another network.
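One possible approach is to retry a download step with a growing delay between attempts. The sketch below is a generic helper, not the project's code: the name `retry`, its parameters, and the choice of which exceptions count as transient are all assumptions. Permanent FTP errors such as the 550 response are deliberately not retried.

```python
import ftplib
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")

# Assumed set of "try again" errors: dropped sockets, EOF from the
# server, and temporary (4xx) FTP replies.
TRANSIENT = (OSError, EOFError, ftplib.error_temp)


def retry(
    func: Callable[[], T],
    attempts: int = 5,
    delay: float = 1.0,
    backoff: float = 2.0,
) -> T:
    """Call func until it succeeds or attempts are exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TRANSIENT:
            if attempt == attempts:
                raise  # give up and surface the last error
            time.sleep(delay)
            delay *= backoff  # wait longer before the next attempt
    raise RuntimeError("unreachable")
```

A caller would wrap the function that opens the connection and fetches one file, so that each retry starts from a fresh connection.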
When trying to install vertebrate genomes, multiple users have run into what appear to be memory limits when running on RAM limited hardware (including my lab server which has 16GB RAM).
We need a more efficient data storage for sequences that will allow loading into memory only what we need.
Possible choices are HDF5 or Arne's custom backend.
Running this command
elt install -d ~/someuser/whole_genome_mammal87/ensembl_download
Failed with the following exception
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/cli.py", line 277, in install
local_install_genomes(
File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/_install.py", line 115, in local_install_genomes
for _ in PAR.as_completed(
File "/home/someuser/env/lib/python3.11/site-packages/cogent3/util/parallel.py", line 239, in as_completed
yield from _as_completed_mproc(f, s, max_workers)
File "/home/someuser/env/lib/python3.11/site-packages/cogent3/util/parallel.py", line 227, in _as_completed_mproc
yield result.result()
^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I now think this is because the current implementation always uses concurrent futures, even if running on a single process. This can be fixed by providing a simple mechanism to use the builtin map() function in those cases.
To get an alignment, we need the block_id. To get this, we need a species identifier and a genome coordinate (let's call this ref_coord). AlignDb.get_records_matching() does this.
With the block_id we can then get all the coordinates for other species in the block. (NOTE: if we are going to support strict, i.e. only blocks where all nominated species are present, we would filter at this level.)
The gap_spans that are returned for the block will exceed the coordinates for the selected region. These will need to be trimmed to just the region so that only the desired sequences need be selected. So we need to implement a container class that handles gap positions; let's call this GapPositions. From the ref_coord, we use GapPositions to convert the sequence indices into alignment indices
align_start = <ref gap pos>.from_seq_to_align_index(ref_start)
align_end = align_start + ref_end - ref_start
For all other records, we then convert the alignment indices back into sequence indices, which gives us the coordinates for that species.
other_start = <other species gap pos>.from_align_to_seq_index(align_start)
other_end = <other species gap pos>.from_align_to_seq_index(align_end)
These values can then be used to obtain the genome sequence followed by a subset of the annotation data and combined into an Alignment instance.
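The index conversions described above can be sketched with a small container class. This is an illustration, not the project's implementation. It assumes each gap is stored as (sequence position, gap length), where the position is the index of the first residue after the gap, that gaps are sorted by position, and that all indices are relative to the start of the block. An alignment index that falls inside a gap is mapped to the next sequence position.

```python
import numpy as np


class GapPositions:
    """Converts between sequence and alignment indices for one sequence."""

    def __init__(self, gaps) -> None:
        # shape (n, 2): column 0 = sequence position, column 1 = gap length
        self.gaps = np.asarray(gaps, dtype=int).reshape(-1, 2)

    def from_seq_to_align_index(self, seq_index: int) -> int:
        # every gap at or before this residue shifts it to the right
        before = self.gaps[:, 0] <= seq_index
        return int(seq_index + self.gaps[before, 1].sum())

    def from_align_to_seq_index(self, align_index: int) -> int:
        total = 0  # gap columns seen so far
        for pos, length in self.gaps:
            gap_start = pos + total  # alignment index where this gap begins
            if align_index < gap_start:
                break
            if align_index < gap_start + length:
                return int(pos)  # inside the gap: next sequence position
            total += length
        return int(align_index - total)


# Toy block: reference "ACGTACG" (no gaps), other "A--CG-T"
ref = GapPositions([])
other = GapPositions([(1, 2), (3, 1)])

ref_start, ref_end = 3, 6
align_start = ref.from_seq_to_align_index(ref_start)
align_end = align_start + ref_end - ref_start
other_start = other.from_align_to_seq_index(align_start)
other_end = other.from_align_to_seq_index(align_end)
```

In this toy block, alignment columns 3 to 5 of the other sequence are `CG-`, so the returned sequence slice is `[1:3]`.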
They are currently absolute, which means that if the directory is removed, the connections are broken.
For example, instead of
/Users/someone/working/apes_111/download_111
make it
apes_111/download_111
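A small sketch of how a stored path could be made relative to a chosen root and resolved again later. The function names and the choice of root directory are assumptions; `PurePosixPath` is used only so the example behaves the same on any platform.

```python
from pathlib import PurePosixPath


def to_relative(path: str, root: str) -> str:
    """Express path relative to root; raises ValueError if path is outside root."""
    return str(PurePosixPath(path).relative_to(PurePosixPath(root)))


def to_absolute(relative: str, root: str) -> str:
    """Rebuild the full path from wherever the root directory lives now."""
    return str(PurePosixPath(root) / relative)


stored = to_relative(
    "/Users/someone/working/apes_111/download_111", "/Users/someone/working"
)
# If the working directory moves, only the root changes.
moved = to_absolute(stored, "/data/projects")
```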
Minimal data for repeat information is:
What is the minimum data needed to download that will provide this? (That is, what are the minimum files required.)
Conceptually separate the download from the install step. The former grabs data in a sufficiently complete form that the install step does not require internet access.
We then need to transform this to be compatible with the querying approach for gene features, perhaps a single table.
EnsemblLite saves a maf formatted alignment file in a sqlite db with the following columns
column | type |
---|---|
source | TEXT |
block_id | INT |
species | TEXT |
coord_name | TEXT |
start | INTEGER |
end | INTEGER |
strand | TEXT |
gap_spans | compressed_array |
block_id corresponds to an alignment segment within the maf file.
gap_spans is a 2D numpy array consisting of the location of a gap and its length.
We need:
block_id
We use the functions within ensembl_lite.convert to create the individual Aligned instances and from that the Alignment instance.
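The table layout above could be sketched with the standard `sqlite3` module as follows. The table name `align`, the helper names, and the use of zlib over `numpy.save` output for the `compressed_array` column are assumptions made for illustration; the project's actual storage type may differ.

```python
import io
import sqlite3
import zlib

import numpy as np


def array_to_blob(arr: np.ndarray) -> bytes:
    """Serialise and compress an array so it can be stored as a BLOB."""
    buf = io.BytesIO()
    np.save(buf, arr)
    return zlib.compress(buf.getvalue())


def blob_to_array(blob: bytes) -> np.ndarray:
    return np.load(io.BytesIO(zlib.decompress(blob)))


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE align (
        source TEXT, block_id INT, species TEXT, coord_name TEXT,
        "start" INTEGER, "end" INTEGER, strand TEXT, gap_spans BLOB
    )"""
)

# One made-up record; gap_spans rows are (gap location, gap length).
gaps = np.array([[1, 2], [3, 1]], dtype=int)
conn.execute(
    "INSERT INTO align VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("example.maf", 0, "human", "1", 100, 104, "+", array_to_blob(gaps)),
)

# All records sharing a block_id make up one alignment block.
rows = conn.execute(
    'SELECT species, "start", "end", gap_spans FROM align WHERE block_id = ?', (0,)
).fetchall()
restored = blob_to_array(rows[0][3])
```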
Downloading the latest species / common name listings from https://ftp.ensembl.org/pub/release-110/species_EnsemblVertebrates.txt
NOTE: this file cannot be automatically read using load_table since data rows have an extra tab character at the end
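One way to cope with the trailing tab is to parse the file with the standard `csv` module and trim each data row to the header width. The sketch below works on text already downloaded; the sample content and its column names are illustrative, not copied from the Ensembl file, and `parse_species_table` is a hypothetical name.

```python
import csv
import io


def parse_species_table(text: str) -> tuple[list[str], list[list[str]]]:
    """Return (header, rows), dropping the extra empty field on data rows."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # header line may start with '#'
    width = len(header)
    rows = [row[:width] for row in reader if row]
    return header, rows


# Illustrative content: note the trailing tab on the data rows only.
sample = (
    "#name\tspecies\tdivision\n"
    "Human\thomo_sapiens\tEnsemblVertebrates\t\n"
    "Chimpanzee\tpan_troglodytes\tEnsemblVertebrates\t\n"
)
header, rows = parse_species_table(sample)
```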
For export homolog genes:
elt homologs -i <path to the installation directory> -o OUTPATH --limit 100
When setting the limit to 100 it raises the error:
IndexError: list index out of range
Full traceback:
message="Traceback (most recent call last):
File "/Users/gulugulu/opt/anaconda3/envs/c311/lib/python3.11/site-packages/cogent3/app/composable.py", line 366, in _call
result = self.main(val, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/gulugulu/opt/anaconda3/envs/c311/lib/python3.11/site-packages/cogent3/app/composable.py", line 429, in _main
return self._user_func(**bound.arguments)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/gulugulu/EnsemblLite_folk/src/ensembl_lite/_genomedb.py", line 286, in get_selected_seqs
return list(get_seqs_for_ids(config=config, species=species, names=gene_ids))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/gulugulu/EnsemblLite_folk/src/ensembl_lite/_genomedb.py", line 247, in get_seqs_for_ids
feature = list(genome.get_features(name=f"%{name}"))[0]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
")
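The traceback ends at an unconditional `[0]` on what turns out to be an empty list. A hedged sketch of one way to guard that step is shown below. `first_or_none` and `fake_get_features` are hypothetical stand-ins; the real `genome.get_features` is not reproduced, and whether a missing feature should be skipped or reported is a design decision for the project.

```python
from collections.abc import Iterable, Iterator
from typing import Optional, TypeVar

T = TypeVar("T")


def first_or_none(items: Iterable[T]) -> Optional[T]:
    """Return the first element, or None when the iterable is empty."""
    return next(iter(items), None)


def fake_get_features(name: str) -> Iterator[dict]:
    # Stand-in for a feature query that may match nothing.
    known = {"ENSG0001": {"name": "ENSG0001"}}
    if name in known:
        yield known[name]


found = first_or_none(fake_get_features("ENSG0001"))
missing = first_or_none(fake_get_features("not-a-gene"))
```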
While the solution is to use the -f (force overwrite) flag, the error message is not helpful.
File "/home/someuser/env/lib/python3.11/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/someuser/env/lib/python3.11/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/someuser/env/lib/python3.11/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/cli.py", line 277, in install
local_install_genomes(
File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/_install.py", line 129, in local_install_genomes
db.add_compressed_records(records=records)
File "/home/someuser/env/lib/python3.11/site-packages/ensembl_lite/_genomedb.py", line 108, in add_compressed_records
self.db.executemany(sql, [(n, s, len(s)) for n, s in records])
sqlite3.IntegrityError: UNIQUE constraint failed: genome.seqid
Based on the cogent3 workshop container, do a direct git clone from the https link and a developer install, so that the code is accessible for playing with.
The ensembl ftp link provides access to many different formats. Among these are multiple sequence alignments.
The full description of the resources is in the README
The structure of the .cfg file needs to support selecting the types of data for each genome:
dna
For compara, we need the alignments for the selected species. This could be derived automatically by parsing the READMEs (e.g. this one), but that will be fragile. Better to require hard-coding of the alignment directory name to start with (e.g. *primates).
The homology information is here.
These can be for the entire Ensembl collection or those present in whole genome alignments. For vertebrates the URL is
https://ftp.ensembl.org/pub/release-110/compara/species_trees/
in case multiple-sequence alignments not desired
The locations of data types can differ between ensembl and ensemblgenomes, for instance the location of homology data or of whole genome alignments. For example
Taxa | Site | Path |
---|---|---|
Bacteria | ftp.ensemblgenomes.org | pub/release-57/pan_ensembl/tsv/ensembl-compara/homologies |
Vertebrates | ftp.ensembl.org | pub/release-110/tsv/ensembl-compara/homologies |
To simplify dealing with this, we need a federation object that provides a unified interface for code. For example, define a site-map class with attributes for each data type.
class EnsemblSiteMap:
    @property
    def homology_path(self) -> str:
        ...

    @property
    def alignment_path(self) -> str:
        ...

    @property
    def genome_path(self) -> str:
        ...

    @property
    def annotation_path(self) -> str:
        ...
Separate instances for each of the different Ensembl sites. Have an instance made on demand.
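A minimal sketch of how per-site instances might be created on demand. Only the homology paths from the table above are filled in; the class name `SiteMap`, the registry, and the `get_site_map` function are hypothetical, and the other path attributes are omitted because their values are not given here.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class SiteMap:
    """Locations of data types on one Ensembl FTP site."""

    site: str
    homology_path: str


# Values taken from the table above; other data types would be added here.
_HOMOLOGY_PATHS = {
    "ftp.ensembl.org": "pub/release-110/tsv/ensembl-compara/homologies",
    "ftp.ensemblgenomes.org": (
        "pub/release-57/pan_ensembl/tsv/ensembl-compara/homologies"
    ),
}


@lru_cache(maxsize=None)
def get_site_map(site: str) -> SiteMap:
    """Build the instance for a site the first time it is requested."""
    try:
        return SiteMap(site=site, homology_path=_HOMOLOGY_PATHS[site])
    except KeyError:
        raise ValueError(f"unknown site: {site!r}") from None


vertebrates = get_site_map("ftp.ensembl.org")
```

Calling code then asks for `get_site_map(site).homology_path` without needing to know which site it is talking to.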