genialis / resolwe-bio-py
Resolwe Bioinformatics Python API
License: Apache License 2.0
For example, if res is a Resolwe instance, help(res.sample.filter) currently returns only:
filter(**filters) method of resdk.query.ResolweQuery instance
Return clone of current query with added given filters.
It would be helpful if it showed the list of possible filtering fields for the sample resource that is given here.
Currently, res.<resource>.filter(contributor=<id#>) is the only way to filter by contributor. Filtering by the user's username, first name, or last name instead of the ID fails to return any objects.
Since the Sample ID/slug/name naming removed the orange type mS#... in the new processes, Table.meta doesn't work anymore.
The problem lies from this line onward.
Example to reproduce error:
import resdk
from resdk.tables import RNATables
slug = <any_new_collection_with_new_metadata_process>  # placeholder: slug of any new collection
res = resdk.Resolwe(url="https://app.genialis.com/")
table = RNATables(res.collection.get(slug))
table.meta
When a new object is created, all queries attached to it (e.g. collection.data, collection.samples, ...) are constructed incorrectly, because the object's id is not yet known and the filters are applied with id=None.
Queries shouldn't be initialized in __init__, but on first access, and an error should be raised if id is not yet known.
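The lazy initialization proposed above could look roughly like the following sketch. It is not resdk code: LazyCollection and _build_query are hypothetical stand-ins, and the "query" is just a dict of filters so that the example runs on its own.

```python
class LazyCollection:
    """Toy resource whose related query is built on first access."""

    def __init__(self, resource_id=None):
        self.id = resource_id
        self._data = None  # built lazily, not in __init__

    def _build_query(self, **filters):
        # Hypothetical stand-in for a real query object: it only records filters.
        return dict(filters)

    @property
    def data(self):
        if self.id is None:
            # Refuse to build a query filtered by id=None.
            raise ValueError("Instance must be saved before accessing `data`.")
        if self._data is None:
            self._data = self._build_query(collection=self.id)
        return self._data


unsaved = LazyCollection()
try:
    unsaved.data
except ValueError as error:
    print(error)

unsaved.id = 42  # pretend that save() assigned an id
print(unsaved.data)
```

Building the query inside the property means the id is read at access time, so an object created locally and saved later gets a correctly filtered query.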
For example, returning the list of available categories in the process catalog, that can be used in the category field of upload_processes = res.process.filter(category=)
This concept would also be useful for:
Details in spec.
Write utility functions in resdk to facilitate interaction with the knowledge base backend. Each function should make no more than a single request to the knowledge base backend; improve the backend if necessary.
Decision support systems (DSS) often have no output fields but spawn new processes instead. While such a DSS can be registered via ./manage.py register, it cannot be run via the res.run command, where res is a Resolwe instance. To run it via resdk, the DSS needs to include the output section and at least one output field, which is undesirable.
Trying to download expression data with the new async functionality in a Jupyter Notebook results in the following error:
RuntimeError: This event loop is already running
From a bit of research, it seems that the problem is that notebooks also run an asyncio event loop, and event loops can't be nested. The current workaround is to use the package nest_asyncio and add the following before downloading expressions:
import nest_asyncio
nest_asyncio.apply()
When time allows, we should look into this further and see whether there is a permanent fix.
If the SDK is installed using "sudo pip install resdk", the user gets a "permission denied" message for the log file when loading the SDK with "import resdk".
This is not an issue when using a Python virtualenv.
Catch failures at the process registration stage (e.g. process type = basic:file). This fails, but it should be reported with the error message "failed to register process: bad syntax" instead of just "error 400".
I suppose it doesn't have an effect, but mutable objects shouldn't be default arguments. http://docs.python-guide.org/en/latest/writing/gotchas/#mutable-default-arguments
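For reference, a minimal illustration of the gotcha in plain Python. The function names are made up for this example and do not come from resdk.

```python
def append_bad(item, bucket=[]):
    # The default list is created once, at function definition time,
    # and is shared by every call that omits `bucket`.
    bucket.append(item)
    return bucket


def append_good(item, bucket=None):
    # Use None as a sentinel and create a fresh list on each call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


print(append_bad(1))   # [1]
print(append_bad(2))   # [1, 2]  <- state leaked from the first call
print(append_good(1))  # [1]
print(append_good(2))  # [2]
```

Even where the default is never mutated today, the None sentinel keeps a later change from introducing shared state by accident.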
Removing data from collections by slug does not work. It does work if the data object is specified by its ID.
Error:
ResloweServerError: ValueError at /api/collection/2/remove_data
invalid literal for int() with base 10: 'hg19'
Fails for:
In [10]: c.remove_data('data-11')
---------------------------------------------------------------------------
ResloweServerError Traceback (most recent call last)
<ipython-input-10-798cccd2ab41> in <module>()
----> 1 c.remove_data('data-11')
/Users/janez/resolwe-bio-py/resdk/resources/collection.pyc in remove_data(self, *data)
57 """Remove ``data`` objects from the collection."""
58 data = [get_data_id(d) for d in data]
---> 59 self.api(self.id).remove_data.post({'ids': data})
60 self._clear_data_cache()
61
/Users/janez/resolwe-bio-py/resdk/exceptions.pyc in wrapper(*args, **kwargs)
33 return func(*args, **kwargs)
34 except SlumberHttpBaseException as exception:
---> 35 raise ResloweServerError(exception.content) # pylint: disable=no-member
36
37 return wrapper
ResloweServerError: <h1>Server Error (500)</h1>
Fields of type DIR currently cannot be downloaded using resdk. An example is a "genome" object:
{
u'fasta': {u'file': u'hg19.fasta.gz', u'size': 937047570},
u'index_bt': {u'dir': u'bowtie_index', u'size': 3074126272},
u'index_bt2': {u'dir': u'bowtie2_index', u'size': 7219579166},
u'index_bwa': {u'dir': u'BWA_index', u'size': 5417472211},
u'index_hisat2': {u'dir': u'hisat2_index', u'size': 4374656092},
u'index_subread': {u'dir': u'subread_index', u'size': 5872181163}
}
A slug change request such as
obj.slug = 'taken-slug'
obj.save()
where 'taken-slug' is the slug of another existing object, silently changes the slug of obj to 'taken-slug-N', where N is a number.
This is dangerous for two reasons:
(1) A change to taken-slug-N was not requested and should therefore fail. Consider adding a force or similar boolean argument to the save() function to control this behavior.
(2) Someone may rely on changing the slug to 'taken-slug' without an error and later re-access this object with get('taken-slug'), which would return the wrong object.
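Until save() can be told to fail, a caller could guard against the silent rename on the client side. The sketch below assumes that obj.slug reflects the stored slug after save(), as the report implies. save_with_slug, SlugConflictError and FakeObject are hypothetical names for illustration, not part of resdk.

```python
class SlugConflictError(Exception):
    """Raised when the stored slug differs from the requested one."""


def save_with_slug(obj, slug):
    # Hypothetical helper: set the slug, save, then verify that the
    # stored slug is exactly the one that was requested.
    obj.slug = slug
    obj.save()
    if obj.slug != slug:
        raise SlugConflictError(
            f"Requested slug {slug!r}, but the object was saved as {obj.slug!r}."
        )
    return obj


class FakeObject:
    """Stand-in that mimics the silent renaming described above."""

    taken = {"taken-slug"}

    def __init__(self):
        self.slug = None

    def save(self):
        if self.slug in self.taken:
            self.slug = f"{self.slug}-1"


print(save_with_slug(FakeObject(), "free-slug").slug)
try:
    save_with_slug(FakeObject(), "taken-slug")
except SlugConflictError as error:
    print(error)
```

Note that this only detects the rename after the fact. It does not undo it, so a server-side option that rejects the request would still be the cleaner fix.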
Reproduce:
res.data.filter(process_name='Bowtie 1.0.0', limit=3)
returns the same objects as
res.data.filter(process_name='foo', limit=3)
Indices of ct.exp and ct.meta, where ct is a resdk.collection_tables.CollectionTables object, are in general not the same. The problem occurs when expressions and the associated response, which is typically stored in the metadata, are used as input to a sklearn estimator. The fit method ignores the index of pandas.DataFrame and pandas.Series objects and matches expression profiles with response data by position, not by sample name. Thus, if the order of samples in ct.exp and ct.meta differs, reshuffled (wrong) responses may be assigned to the samples.
@robertcv proposed a solution that computes the intersection of the expression and metadata indices and returns both pandas.DataFrames with the same row order, removing samples with no expression data objects.
The current workaround is to sort expressions and metadata by their sample index in each script after obtaining them from the server.
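The idea behind the intersection approach can be sketched without pandas, using dicts keyed by sample name. This is an illustration only: align_by_sample is a hypothetical helper and the data are toy values. With real DataFrames the same result should be obtainable with exp.index.intersection(meta.index) followed by .loc on both tables.

```python
def align_by_sample(exp, meta):
    """Return both tables restricted to shared samples, in the same order."""
    # Keep the order of `exp`, dropping samples that are missing from `meta`.
    shared = [name for name in exp if name in meta]
    return (
        {name: exp[name] for name in shared},
        {name: meta[name] for name in shared},
    )


# Toy data: sample name -> expression values / response.
exp = {"s2": [5.0, 1.0], "s1": [0.5, 2.0]}
meta = {"s1": "responder", "s2": "non-responder", "s3": "responder"}

exp_aligned, meta_aligned = align_by_sample(exp, meta)
X = list(exp_aligned.values())
y = list(meta_aligned.values())
print(list(exp_aligned))  # ['s2', 's1']
print(y)                  # ['non-responder', 'responder']
```

After alignment, position i in X and y refers to the same sample, which is what a positional consumer such as an estimator's fit method relies on.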
When using CollectionTables.exp to download the expression matrix, you get rows indexed by sample name. The problem is that we currently don't take sample name changes into account when checking expression versions. As a result, if you change sample names you are stuck with the old names, because the change doesn't trigger a re-download.
Current hotfix: call CollectionTables.clear_cache() to clear the cache and force a re-download.
Long-term fix suggestion: add a list of sample names to the version hash.
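A sketch of what such a hash could look like, assuming each expression is described by a (data id, modified timestamp, sample name) tuple. The function name and the tuple layout are assumptions for illustration and are not the existing resdk implementation.

```python
import hashlib


def version_hash(entries):
    """Hash (data_id, modified, sample_name) tuples into one cache key."""
    hasher = hashlib.sha256()
    # Sort so the hash does not depend on the order of the entries.
    for data_id, modified, sample_name in sorted(entries):
        hasher.update(f"{data_id}|{modified}|{sample_name}\n".encode("utf-8"))
    return hasher.hexdigest()


before = version_hash([(1, "2021-01-01", "sample A"), (2, "2021-01-02", "sample B")])
after = version_hash([(1, "2021-01-01", "sample A"), (2, "2021-01-02", "renamed B")])
print(before != after)  # True
```

Hashing the name together with the data id, rather than a separate sorted list of names, also changes the hash when two samples swap names.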
Python version: 3.6.7
IPython version: 7.2.0
Autocomplete (tab) in IPython doesn't show any of the permissions class methods.
Other classes seem to show the methods normally (but I haven't tested all):
This is how to recreate it:
In [19]: import resdk
In [20]: res = resdk.Resolwe(<username> , <password>, <url>)
In [22]: s = res.sample.get(<sample_id>)
In [24]: s.permissions.
press Tab
Downloading metadata from a large collection (1220 samples) raises the following exception:
/home/robert/git/resolwe-bio-py/venv/bin/python /home/robert/git/resolwe-bio-py/foo.py
Traceback (most recent call last):
File "/home/robert/git/resolwe-bio-py/src/resdk/exceptions.py", line 32, in wrapper
return func(*args, **kwargs)
File "/home/robert/git/resolwe-bio-py/venv/lib/python3.8/site-packages/slumber/__init__.py", line 155, in get
resp = self._request("GET", params=kwargs)
File "/home/robert/git/resolwe-bio-py/venv/lib/python3.8/site-packages/slumber/__init__.py", line 101, in _request
raise exception_class("Client Error %s: %s" % (resp.status_code, url), response=resp, content=resp.content)
slumber.exceptions.HttpClientError: Client Error 414: https://app.genialis.com/api/sample
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/robert/git/resolwe-bio-py/foo.py", line 6, in <module>
meta = col_tab.meta
File "/home/robert/git/resolwe-bio-py/src/resdk/collection_tables.py", line 176, in meta
return self._load_fetch(META)
File "/home/robert/git/resolwe-bio-py/src/resdk/collection_tables.py", line 354, in _load_fetch
data = self._download_metadata()
File "/home/robert/git/resolwe-bio-py/src/resdk/collection_tables.py", line 488, in _download_metadata
meta = pd.DataFrame(None, index=[s.name for s in self._samples])
File "/home/robert/git/resolwe-bio-py/src/resdk/collection_tables.py", line 251, in _samples
return list(
File "/home/robert/git/resolwe-bio-py/src/resdk/query.py", line 145, in __len__
return self.count()
File "/home/robert/git/resolwe-bio-py/src/resdk/query.py", line 222, in count
count_query._fetch()
File "/home/robert/git/resolwe-bio-py/src/resdk/query.py", line 196, in _fetch
items = self.api.get(**filters)
File "/home/robert/git/resolwe-bio-py/src/resdk/exceptions.py", line 34, in wrapper
raise ResolweServerError(exception.content)
resdk.exceptions.ResolweServerError: b'<html>\r\n<head><title>414 Request-URI Too Large</title></head>\r\n<body>\r\n<center><h1>414 Request-URI Too Large</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'
The root of the problem is in _samples, where we filter samples by ids, which results in a large id__in filter argument and, consequently, a large URL sent to the server.
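One possible way to keep the URL short is to split the ids into chunks and issue one request per chunk. The sketch below is not the resdk fix: fetch_in_chunks is a hypothetical helper, fake_fetch stands in for the server call, and the chunk size of 100 is an arbitrary assumption.

```python
def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_chunks(fetch, ids, size=100):
    """Call `fetch` once per chunk of ids and concatenate the results."""
    results = []
    for chunk in chunked(ids, size):
        results.extend(fetch(id__in=chunk))
    return results


# Hypothetical stand-in for a server call; it only echoes the ids back.
def fake_fetch(id__in):
    return [{"id": i} for i in id__in]


samples = fetch_in_chunks(fake_fetch, range(1220), size=100)
print(len(samples))  # 1220
```

The trade-off is more requests per download. Filtering by collection on the server side, if the backend supports it, would avoid sending the ids at all.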
Tutorials call the RNATables using
import resdk
app = resdk.Resolwe(url='https://app.genialis.com/')
app.login()
collection = app.collection.get("sum149-fresh-for-rename")
sum149 = resdk.tables.RNATables(collection)
However, in my 14.0.0 version of resdk this doesn't appear to work.
In [6]: sum149 = resdk.tables.RNATables(collection)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-6-5bec9560155e> in <module>
1 collection = app.collection.get("sum149-fresh-for-rename")
----> 2 sum149 = resdk.tables.RNATables(collection)
AttributeError: module 'resdk' has no attribute 'tables'
If I import RNATables explicitly, I am able to use the functionality.
In [8]: from resdk.tables import RNATables
In [9]: sum149 = resdk.tables.RNATables(collection)
In [10]:
Is this intended and we need to fix the documentation or did something sneaky creep into the code?
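The behavior is consistent with how Python packages work in general: a submodule becomes an attribute of its package only after something has imported it. The sketch below demonstrates this with a throwaway package named demo_pkg, a made-up name. It does not inspect resdk itself, so whether resdk's __init__ is meant to import tables remains the open question.

```python
import importlib
import pathlib
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    package = pathlib.Path(tmp) / "demo_pkg"
    package.mkdir()
    (package / "__init__.py").write_text("")  # does not import the submodule
    (package / "tables.py").write_text("class RNATables:\n    pass\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        import demo_pkg

        # The submodule has not been imported yet, so it is not an attribute.
        before = hasattr(demo_pkg, "tables")

        from demo_pkg.tables import RNATables

        # Importing the submodule binds it on the parent package.
        after = hasattr(demo_pkg, "tables")
    finally:
        sys.path.remove(tmp)

print(before, after)  # False True
```

If the tutorial form resdk.tables.RNATables is meant to work after a bare import resdk, the package's __init__ would need to import the tables submodule; otherwise the documentation should show the explicit import.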
Downloading metadata with CollectionTables results in missing values being cast to the string "nan" instead of the numpy/pandas nan object, which may cause unexpected problems in further processing. This happens only for basic:string: fields. Additionally, for basic:integer: and basic:decimal: fields we get pandas <NA> and numpy nan, respectively, which may also not be ideal; we should unify the not-a-number object we use.
import resdk
res = resdk.Resolwe(url='https://qa.genialis.com')
res.feature.filter(source="NCBI", query=range(300)) # works
res.feature.filter(source="NCBI", query=range(400)) # fail