genialis / resolwe-bio-py

Resolwe Bioinformatics Python API

License: Apache License 2.0

Python 100.00%

resolwe-bio-py's Issues

Make help(res.<resource>.filter) point to the list of available fields

For example, if res is a Resolwe instance, currently, help(res.sample.filter) returns only:

filter(**filters) method of resdk.query.ResolweQuery instance
    Return clone of current query with added given filters.

It would be helpful if it showed the list of possible filtering fields for the sample resource that is given here.

Filter by contributor username, first name, and last name not possible

Currently, res.<resource>.filter(contributor=<id#>) is the only way to filter by contributor. Trying to use the user's username, first name, or last name instead of ID # fails to return any objects.

Table.meta not working with the latest upload-metadata processes

The new upload-metadata processes use Sample ID/slug/name naming and no longer include the Orange type prefix (mS#...), so Table.meta doesn't work anymore.
The problem lies from this line onward.

Example to reproduce error:

import resdk
from resdk.tables import RNATables

slug = <any_new_collection_with_new_metadata_process>

res = resdk.Resolwe(url="https://app.genialis.com/")
table = RNATables(res.collection.get(slug))
table.meta

Queries in new objects are not evaluated correctly

When a new object is created, all queries that it includes (e.g. collection.data, collection.samples, ...) are constructed incorrectly, because the object's id is not yet known and the filters are applied with id=None.

Queries shouldn't be initialized in __init__, but on first access, and an error should be raised if the id is not yet known.
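
A minimal sketch of the lazy approach proposed above. The Collection and Query classes here are hypothetical stand-ins, not resdk's actual classes; they only illustrate building the query on first access and failing loudly when the id is still unknown.

```python
class Query:
    """Hypothetical stand-in for a server-side query."""

    def __init__(self, resource, **filters):
        self.resource = resource
        self.filters = filters


class Collection:
    """Hypothetical resource whose related query is built on first access."""

    def __init__(self, id=None):
        self.id = id
        self._data = None  # intentionally not built in __init__

    @property
    def data(self):
        if self.id is None:
            # The object has not been saved yet, so no valid filter exists.
            raise ValueError("Instance must be saved before accessing `data`.")
        if self._data is None:
            self._data = Query("data", collection=self.id)
        return self._data


new = Collection()
try:
    new.data
except ValueError as error:
    print(error)

new.id = 42  # e.g. assigned once the object is saved
print(new.data.filters)
```

With this shape, a filter on id=None can never be constructed, because the query is only created once the id exists.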

Add ways to query available filter parameters for some fields

For example, returning the list of available categories in the process catalog that can be used in the category field of upload_processes = res.process.filter(category=)

This concept would also be useful for:

The type field, listing all of the types of data that are available to the user. It can be difficult for a less-experienced user to know the specific string of types and subtypes (e.g. 'data:reads:fastq:paired') necessary to filter for certain data.
The date-related fields (created, modified, started, finished), where the query could return the date range of available data.
The status field (OK, ER, etc.).

Helper functions for mapping gene IDs

Details in spec.

Write utility functions in resdk to facilitate interaction with the knowledge base backend. Each function should make no more than a single request to the knowledge base backend; improve the backend if necessary.

A function that maps NCBI to MGI/Uniprot using a mapping table. Inputs should be: source_db, target_db and a list of feature_ids. Output is a list of mapped/converted feature_ids
A function that, for a list of given features, returns a dictionary of values from the gene info table. Inputs: source, a list of features and an optional parameter to limit which fields are returned (see gene info table/model for a list of valid options). Output is a dict {'feature_id': {'source': 'NCBI', 'organism': 'Homo sapiens', ... }}. Key 'feature_id' will always be unique since all the returned elements have the same source.

Running a process with unspecified `output` section results in KeyError

Decision support systems (DSS) often have no output fields but spawn new processes instead. While such DSS can be registered via ./manage.py register, they cannot be run via res.run command where res is a Resolwe instance. To run it via resdk, the DSS needs to include the output section and at least one output field, which is undesirable.

Run integration tests for resdk on Travis

Async download not working in Jupyter Notebook

Trying to download expression data with the new async functionality in a Jupyter Notebook results in the following error:

RuntimeError: This event loop is already running

From doing a bit of research it seems that the problem is that notebooks also use the async loop and the functionality can't be nested. The current workaround is to use the package nest_asyncio and add the following before downloading expressions:

import nest_asyncio
nest_asyncio.apply()

When time allows, we should look into this further and see whether there is a permanent fix.
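
One possible direction for a permanent fix, sketched with the standard library only: detect whether an event loop is already running and, if so, run the coroutine on a fresh loop in a worker thread. The names run_sync and fake_download are hypothetical and are not part of resdk.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def run_sync(coro):
    """Run ``coro`` to completion, even if an event loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread (plain script): asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running (e.g. in a notebook): run the coroutine
    # on a fresh loop in a worker thread and block until it finishes.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def fake_download():
    """Hypothetical placeholder for an async download."""
    await asyncio.sleep(0)
    return "done"


async def notebook_like_caller():
    # Called from inside a running loop, as happens in Jupyter.
    return run_sync(fake_download())


print(run_sync(fake_download()))
print(asyncio.run(notebook_like_caller()))
```

Note that the calling loop is blocked while the worker thread runs, which is acceptable for a blocking download call but not for code that must stay responsive.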

Permission denied error for logfile

If the SDK is installed using "sudo pip install resdk", the user gets a "permission denied" message for the logfile when loading the SDK using "import resdk".

This is not an issue if using python virtualenv.
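
A hedged sketch of one way to keep the import from failing: fall back to a no-op handler when the logfile cannot be opened. The function name make_logger and its arguments are hypothetical; how resdk actually configures logging is not shown in this issue.

```python
import logging


def make_logger(name, logfile):
    """Return a logger that writes to ``logfile`` when it can be opened."""
    logger = logging.getLogger(name)
    try:
        handler = logging.FileHandler(logfile)
    except OSError:
        # Covers PermissionError and missing directories: stay importable
        # and drop file logging instead of raising at import time.
        handler = logging.NullHandler()
    logger.addHandler(handler)
    return logger
```

PermissionError is a subclass of OSError, so the same branch handles a logfile owned by root after a sudo install.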

Error reporting

Catch failures at the process register stage (e.g. process type = basic:file). This fails, but should be reported
with the error message "failed to register process: bad syntax" instead of just "error 400".

Mutable object as default argument

resolwe-bio-py/resdk/shortcuts/collection.py

Line 40 in f8b00a9

def _create_relation(self, relation_type, samples, positions=[], label=None):

I suppose it doesn't have an effect, but mutable objects shouldn't be default arguments. http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments
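
For reference, the usual fix is the None idiom. The function below is a simplified, hypothetical stand-in for the method quoted above (its body is invented for illustration); only the signature change is the point.

```python
def create_relation(relation_type, samples, positions=None, label=None):
    """Hypothetical, simplified helper using ``None`` as the default."""
    if positions is None:
        positions = []  # a fresh list on every call
    return {
        "type": relation_type,
        "samples": list(samples),
        "positions": positions,
        "label": label,
    }


first = create_relation("group", ["s1"])
first["positions"].append(1)
second = create_relation("group", ["s2"])
print(second["positions"])  # the second call is unaffected by the first
```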

Error in Collection.remove_data("slug")

Removing data from collections by slug does not work. It does work if data object is specified by its ID.

Error:

ResloweServerError: ValueError at /api/collection/2/remove_data
invalid literal for int() with base 10: 'hg19'

Fails for:

In [10]: c.remove_data('data-11')
---------------------------------------------------------------------------
ResloweServerError                        Traceback (most recent call last)
<ipython-input-10-798cccd2ab41> in <module>()
----> 1 c.remove_data('data-11')

/Users/janez/resolwe-bio-py/resdk/resources/collection.pyc in remove_data(self, *data)
     57         """Remove ``data`` objects from the collection."""
     58         data = [get_data_id(d) for d in data]
---> 59         self.api(self.id).remove_data.post({'ids': data})
     60         self._clear_data_cache()
     61

/Users/janez/resolwe-bio-py/resdk/exceptions.pyc in wrapper(*args, **kwargs)
     33             return func(*args, **kwargs)
     34         except SlumberHttpBaseException as exception:
---> 35             raise ResloweServerError(exception.content)  # pylint: disable=no-member
     36
     37     return wrapper

ResloweServerError: <h1>Server Error (500)</h1>

Download dir fields

It is currently not possible to download fields of type DIR using ReSDK. An example is a "genome" object:

{
    u'fasta': {u'file': u'hg19.fasta.gz', u'size': 937047570},
    u'index_bt': {u'dir': u'bowtie_index', u'size': 3074126272},
    u'index_bt2': {u'dir': u'bowtie2_index', u'size': 7219579166},
    u'index_bwa': {u'dir': u'BWA_index', u'size': 5417472211},
    u'index_hisat2': {u'dir': u'hisat2_index', u'size': 4374656092},
    u'index_subread': {u'dir': u'subread_index', u'size': 5872181163}
}

Raise validation error when register fails for run(process_slug, input={..}, 'src':'process.yml')

Collision on slug change request

Slug change request

obj.slug = 'taken-slug'
obj.save()

where 'taken-slug' is a slug of another existing object, results in changing the slug of obj to 'taken-slug-N' where N is a number without any notification.

This is dangerous for two reasons:
(1) a change to taken-slug-N was not requested and should therefore fail. Consider adding force or similar boolean argument to the save() function to control this behavior.
(2) someone may rely on changing the slug to 'taken-slug' without an error and later re-accessing this object by get('taken-slug') which would return the wrong object.

Filter by process_name returns everything

Reproduce:
res.data.filter(process_name='Bowtie 1.0.0', limit=3) returns the same objects as res.data.filter(process_name='foo', limit=3).

CollectionTables returns expressions and metadata with different order of samples

Indices of ct.exp and ct.meta, where ct is a resdk.collection_tables.CollectionTables object, are in general not the same. The problem occurs when expressions and the associated response, which is typically stored in the metadata, are used as input in a sklearn estimator. Unfortunately, the fit method ignores the index of pandas.DataFrame and pandas.Series objects and matches expression profiles with response data by integer index and not by sample name. Thus, if the order of samples in ct.exp and ct.meta is different, reshuffled (wrong) responses may be assigned to the samples.

@robertcv proposed a solution that computes the intersection of expression and metadata indices and returns both pandas.DataFrames with the same order of rows, removing samples with no expression data objects.

The current workaround is to sort expressions and metadata by their sample index in each script after obtaining them from the server.

CollectionTables doesn't re-download expressions on sample name change

When using CollectionTables.exp to download the expression matrix you get rows indexed by sample name. The problem is that we currently don't take sample name change into account when checking for expression versions. The result is that if you change sample names you get stuck with the older names as the change doesn't trigger a re-download.

Current hotfix: call CollectionTables.clear_cache() to clear the cache and force a re-download.
Long-term fix suggestion: add a list of sample names to the version hash.
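
A minimal sketch of the long-term suggestion, assuming the cache key is a hash over per-object entries. The entry layout (data id, version string, sample name) and the function name version_hash are assumptions made for illustration, not the existing CollectionTables implementation.

```python
import hashlib
import json


def version_hash(entries):
    """Hash (data_id, version, sample_name) entries into a cache key.

    The entry layout is a hypothetical one chosen for this sketch.
    Entries are sorted so that the order of retrieval does not matter.
    """
    payload = json.dumps(sorted(entries))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


entries = [(1, "2020-01-01", "sample A"), (2, "2020-01-02", "sample B")]
renamed = [(1, "2020-01-01", "sample A"), (2, "2020-01-02", "renamed B")]

# A rename changes the key, so the cached matrix would be invalidated.
print(version_hash(entries) != version_hash(renamed))
```

Hashing the name together with its data object, rather than hashing a separate sorted list of names, also catches the case where two samples swap names.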

Some ReSDK methods not showing up in IPython

Python version: 3.6.7
IPython version: 7.2.0

Autocomplete (tab) in IPython doesn't show any of the permissions class methods.

Other classes seem to show the methods normally (but I haven't tested all).

This is how to recreate it:

In [19]: import resdk

In [20]: res = resdk.Resolwe(<username> , <password>, <url>)

In [22]: s = res.sample.get(<sample_id>)

In [24]: s.permissions.

press Tab

Documentation website title

We should fix/remove the — Resolwe SDK for Python 0.0.0 documentation part.

CollectionTables unable to fetch metadata from large collection

Downloading metadata from a large collection (1220 samples) raises the following exception:

/home/robert/git/resolwe-bio-py/venv/bin/python /home/robert/git/resolwe-bio-py/foo.py
Traceback (most recent call last):
  File "/home/robert/git/resolwe-bio-py/src/resdk/exceptions.py", line 32, in wrapper
    return func(*args, **kwargs)
  File "/home/robert/git/resolwe-bio-py/venv/lib/python3.8/site-packages/slumber/__init__.py", line 155, in get
    resp = self._request("GET", params=kwargs)
  File "/home/robert/git/resolwe-bio-py/venv/lib/python3.8/site-packages/slumber/__init__.py", line 101, in _request
    raise exception_class("Client Error %s: %s" % (resp.status_code, url), response=resp, content=resp.content)
slumber.exceptions.HttpClientError: Client Error 414: https://app.genialis.com/api/sample

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/robert/git/resolwe-bio-py/foo.py", line 6, in <module>
    meta = col_tab.meta
  File "/home/robert/git/resolwe-bio-py/src/resdk/collection_tables.py", line 176, in meta
    return self._load_fetch(META)
  File "/home/robert/git/resolwe-bio-py/src/resdk/collection_tables.py", line 354, in _load_fetch
    data = self._download_metadata()
  File "/home/robert/git/resolwe-bio-py/src/resdk/collection_tables.py", line 488, in _download_metadata
    meta = pd.DataFrame(None, index=[s.name for s in self._samples])
  File "/home/robert/git/resolwe-bio-py/src/resdk/collection_tables.py", line 251, in _samples
    return list(
  File "/home/robert/git/resolwe-bio-py/src/resdk/query.py", line 145, in __len__
    return self.count()
  File "/home/robert/git/resolwe-bio-py/src/resdk/query.py", line 222, in count
    count_query._fetch()
  File "/home/robert/git/resolwe-bio-py/src/resdk/query.py", line 196, in _fetch
    items = self.api.get(**filters)
  File "/home/robert/git/resolwe-bio-py/src/resdk/exceptions.py", line 34, in wrapper
    raise ResolweServerError(exception.content)
resdk.exceptions.ResolweServerError: b'<html>\r\n<head><title>414 Request-URI Too Large</title></head>\r\n<body>\r\n<center><h1>414 Request-URI Too Large</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'

The root of the problem is in _samples, where we filter samples by ids. This results in a large id__in filter argument and, consequently, a very long URL sent to the server.
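
One way to keep the URL short is to split the ids into batches and issue one request per batch. The sketch below assumes a hypothetical fetch callable that accepts an id__in keyword; fake_fetch and the batch size of 100 are placeholders, not measured server limits.

```python
def chunked(items, size):
    """Yield consecutive slices of ``items`` with at most ``size`` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(fetch, ids, batch_size=100):
    """Call ``fetch(id__in=...)`` once per batch and concatenate the results.

    ``fetch`` is a hypothetical callable standing in for the filtered query.
    """
    results = []
    for batch in chunked(list(ids), batch_size):
        results.extend(fetch(id__in=batch))
    return results


def fake_fetch(id__in):
    """Hypothetical server call that rejects overly long id lists."""
    if len(id__in) > 100:
        raise RuntimeError("414 Request-URI Too Large")
    return [{"id": i} for i in id__in]


samples = fetch_in_batches(fake_fetch, range(1220))
print(len(samples))
```

The trade-off is more round trips to the server in exchange for requests that stay under the URL length limit.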

Unable to call RNATables "the old-fashioned way"

Tutorials call the RNATables using

import resdk
app = resdk.Resolwe(url='https://app.genialis.com/')
app.login()
collection = app.collection.get("sum149-fresh-for-rename")
sum149 = resdk.tables.RNATables(collection)

However, in my 14.0.0 version of resdk this doesn't appear to work.

In [6]: sum149 = resdk.tables.RNATables(collection)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-5bec9560155e> in <module>
      1 collection = app.collection.get("sum149-fresh-for-rename")
----> 2 sum149 = resdk.tables.RNATables(collection)

AttributeError: module 'resdk' has no attribute 'tables'

If I import RNATables using the importing mechanism, I am able to use the functionality.

In [8]: from resdk.tables import RNATables
In [9]: sum149 = resdk.tables.RNATables(collection)
In [10]:

Is this intended and we need to fix the documentation or did something sneaky creep into the code?

CollectionTables not properly parsing missing string values

Downloading metadata with CollectionTables casts missing values to the string "nan" instead of the numpy/pandas nan object, which may cause unexpected problems in further processing. This happens only for basic:string: fields. Additionally, for basic:integer: and basic:decimal: fields we get pandas <NA> and numpy nan respectively, which may also not be ideal; we should unify the missing-value object we use.

Feature query fails for many genes

import resdk
res = resdk.Resolwe(url='https://qa.genialis.com')

res.feature.filter(source="NCBI", query=range(300)) # works
res.feature.filter(source="NCBI", query=range(400)) # fails
