Comments (11)
Hi,
You can't make a REST API call to get just a subset of columns, but this has never been an issue because all data produced by our API is cached (on both the server and the client side), and you can filter out columns once you have them all.
Basically, this is a constraint of the REST protocol, not the client: in REST you can't ask for a subset of fields. On the other hand, you can do it using GraphQL (http://graphql.org/learn/), and this is what we are planning to support at some point in the future.
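A minimal sketch of the client-side column filtering mentioned above, using pandas. The records are hand-made placeholders, not real API output, and the helper name `keep_columns` is hypothetical:

```python
import pandas as pd

# Placeholder records standing in for a full API response (values are made up).
records = [
    {"activity_id": 1, "molecule_chembl_id": "MOL_A", "standard_type": "IC50",
     "standard_value": 10.0, "target_pref_name": "Example target"},
    {"activity_id": 2, "molecule_chembl_id": "MOL_B", "standard_type": "Ki",
     "standard_value": 250.0, "target_pref_name": "Example target"},
]


def keep_columns(rows, wanted):
    """Build a DataFrame from all fields, then keep only the wanted columns."""
    df = pd.DataFrame.from_records(rows)
    # Silently skip names that are not present, so a typo does not raise.
    return df[[name for name in wanted if name in df.columns]]


slim = keep_columns(records, ["molecule_chembl_id", "standard_type", "standard_value"])
print(slim)
```

The full records are still downloaded; only the in-memory table gets narrower.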
from chembl_webresource_client.
Hmm, but I don't understand why you chose to return 30 columns by default in new_client.activity?
I presume that the table returned by default by the REST API is the result of a SQL view or selection query?
In my opinion, by adding target and compound information to the activity table you give users more filtering options, but the drawback is that about two thirds of the returned columns contain duplicated values, and the total time required to download data with the new_client version is multiplied by more than 100!
from chembl_webresource_client.
Not really, the available fields were carefully chosen to satisfy the vast majority of users. I'm happy to see timings that would prove that the new client is more than 100x slower; to the best of my knowledge that's not the case. We are using the REST API to build complex applications (https://chembl-glados.herokuapp.com/) that use all the fields to provide advanced filtering.
from chembl_webresource_client.
I apologize, I somewhat overestimated the time factor: the new client is actually about 50-60 times slower.
old client
%time old_df = pd.DataFrame.from_dict(targets.bioactivities('CHEMBL1862'))
CPU times: user 872 ms, sys: 156 ms, total: 1.03 s
Wall time: 8.05 s
new client
bioacts = new_client.activity.filter(target_chembl_id="CHEMBL1862")
%time new_df=pd.DataFrame.from_dict(list(bioacts))
CPU times: user 18.9 s, sys: 1.57 s, total: 20.5 s
Wall time: 7min 10s
And the results are not exactly the same...
print(old_df.shape)
(11095, 16)
print(new_df.shape)
(11058, 31)
Any idea?
from chembl_webresource_client.
Thanks for checking this. So yes, I agree that the new web services are slower than the old ones. But this doesn't have anything to do with the number of columns or bandwidth.
First of all, what you see is mostly related to the fact that the new API is paginated. When you use the client you can't see this (this is why the client is so nice), but behind the scenes the new client fetches data in chunks. The default chunk size is 20 results, but this can be increased up to 1000. Here is how you do it; just put this code at the very top of your script:
from chembl_webresource_client.settings import Settings
Settings.Instance().MAX_LIMIT = 1000
This has to be done BEFORE you import anything else from the client.
Please try that and you should see a significant improvement.
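To illustrate why the chunk size matters, here is a small sketch that counts how many paginated requests are needed for the 11058 rows reported for new_df earlier in this thread. It only does the arithmetic and does not call the API; the function name `pages_needed` is hypothetical:

```python
import math


def pages_needed(total_rows: int, page_size: int) -> int:
    """Number of paginated requests needed to fetch total_rows results."""
    return math.ceil(total_rows / page_size)


total = 11058  # rows in new_df, as printed earlier in this thread

for size in (20, 1000):
    print(f"page size {size}: {pages_needed(total, size)} requests")
```

With the default of 20 results per request this comes to 553 round trips, against 12 with a page size of 1000, so per-request latency adds up far less.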
Still, there are some more performance issues, especially related to the 'activities' endpoint, as it offers the largest amount of data (we have 14M activities). A new API release, due in about 2 weeks from now, should fix this.
So I will keep this issue open, and when the new release is out I will ask you to rerun your tests. If the results are fine I'll close the issue.
from chembl_webresource_client.
Thanks, I will try with Settings.Instance().MAX_LIMIT = 1000, but first I have to clear the cache...
With the old client I just have to remove the .chembl_ws_client__0.8.50.sqlite file, but how do I do this with the new client?
Also, do you have an explanation for the difference in row counts between the old and new client results?
I will send you an email about this...
from chembl_webresource_client.
OK, so there are two separate things:
- If you want to disable caching temporarily, just for your script, this can be controlled using settings as well; just append this line:
Settings.Instance().CACHING = False
just after this line:
Settings.Instance().MAX_LIMIT = 1000
- If you want to delete the cache file: the new client moved the cache file from the current directory (polluting the current directory doesn't make sense, and such a cache is only usable from that one directory) to a hidden file in the home directory. To see it, just run:
ls ~/.chembl_ws_client*
Of course, you can change this default location using settings; just do:
Settings.Instance().CACHE_NAME = '/some/new/location.sqlite'
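For anyone who prefers to do the cleanup from Python rather than the shell, here is a minimal sketch that lists (and optionally deletes) files matching the same `.chembl_ws_client*` pattern in the home directory. The function names are hypothetical, and it assumes the cache location has not been changed through CACHE_NAME:

```python
from pathlib import Path

CACHE_PATTERN = ".chembl_ws_client*"  # same pattern as the `ls` command above


def find_cache_files(home: Path) -> list:
    """Return the cache files directly inside `home`, sorted by name."""
    return sorted(p for p in home.glob(CACHE_PATTERN) if p.is_file())


def remove_cache_files(home: Path, dry_run: bool = True) -> list:
    """List matching cache files; delete them only when dry_run is False."""
    found = find_cache_files(home)
    if not dry_run:
        for path in found:
            path.unlink()
    return found


if __name__ == "__main__":
    # Dry run by default: only prints what would be removed.
    for path in remove_cache_files(Path.home(), dry_run=True):
        print(path)
```

Pass `dry_run=False` once the printed list looks right.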
from chembl_webresource_client.
OK, thanks.
After removing all cached files and setting Settings.Instance().MAX_LIMIT = 1000 for the new client, the time difference between the old and new client is around a factor of 25. This is better, but it could be better still.
Note that bandwidth dramatically affects the total download time: there is a factor of 4 between my home network and my institute network. So in my opinion the observed difference could be due to the size of the dataset, and also to the number of returned columns... Whatever the improvement, there will always be a factor of 2 between the old client and the new one, because I saw that the cache is twice as large with the new client, which is understandable because there are twice as many columns.
I suggest doing the join between the Activities, Compound and Target tables only on the client, after the data has been downloaded, instead of sending thousands of duplicated values (IDs, SMILES and protein descriptions) as is currently the case in new_client.activity.
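A rough sketch of what such a client-side join could look like with pandas. The three tables are tiny hand-made placeholders, and the column names are assumptions chosen for illustration rather than the actual API schema:

```python
import pandas as pd

# Slim activity table: one row per measurement, only IDs and values.
activities = pd.DataFrame({
    "activity_id": [1, 2, 3],
    "molecule_chembl_id": ["M1", "M1", "M2"],
    "target_chembl_id": ["T1", "T1", "T1"],
    "standard_value": [10.0, 12.5, 300.0],
})

# Lookup tables: each compound and target is described exactly once.
compounds = pd.DataFrame({
    "molecule_chembl_id": ["M1", "M2"],
    "canonical_smiles": ["CCO", "c1ccccc1"],
})
targets = pd.DataFrame({
    "target_chembl_id": ["T1"],
    "pref_name": ["Example protein"],
})

# Join locally; validate= raises if a lookup table contains duplicate keys.
joined = (
    activities
    .merge(compounds, on="molecule_chembl_id", how="left", validate="many_to_one")
    .merge(targets, on="target_chembl_id", how="left", validate="many_to_one")
)
print(joined)
```

The repeated SMILES and protein descriptions then exist only in the joined table in memory, not in the downloaded data.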
Anyway, I will wait for the new version and test it to see the difference. I hope I have made an interesting contribution to the project.
Have a good day
from chembl_webresource_client.
Yes, that was helpful, thank you.
I hope we've managed to solve at least some of your problems.
from chembl_webresource_client.
Hi @fabricecarles. Can you please verify the API speed now? Can you see any improvements?
from chembl_webresource_client.
Hi,
Indeed, using client version 0.9.13 the results seem to be better:
from chembl_webresource_client.settings import Settings
Settings.Instance().MAX_LIMIT = 1000
Settings.Instance().CACHING = False
### old client
%time old_df = pd.DataFrame.from_dict(targets.bioactivities('CHEMBL1862'))
CPU times: user 443 ms, sys: 111 ms, total: 554 ms
Wall time: 1.59 s
### new client
bioacts = new_client.activity.filter(target_chembl_id="CHEMBL1862")
%time new_df=pd.DataFrame.from_dict(list(bioacts))
CPU times: user 1.16 s, sys: 217 ms, total: 1.38 s
Wall time: 18.8 s
Now the difference between the old and new client is around a factor of 10. Thank you for this improvement.
Fabrice
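As a quick check of the slowdown factors quoted in this thread, the ratios can be computed directly from the wall times reported in the two timing runs (the helper name `slowdown` is hypothetical):

```python
def slowdown(new_seconds: float, old_seconds: float) -> float:
    """Ratio of new-client wall time to old-client wall time."""
    return new_seconds / old_seconds


# First run: new client 7min 10s vs old client 8.05 s.
first_run = slowdown(7 * 60 + 10, 8.05)
# Second run (client 0.9.13): new client 18.8 s vs old client 1.59 s.
second_run = slowdown(18.8, 1.59)

print(f"first run: {first_run:.1f}x, second run: {second_run:.1f}x")
```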
from chembl_webresource_client.