
Comments (11)

mnowotka commented on June 5, 2024

Hi,

You can't make a REST API call that returns only a subset of columns, but this has never been an issue: all data produced by our API is cached (both on the server and on the client side), so you can filter out columns once you have them all.

Basically, this is a constraint of the REST protocol, not of the client. In REST you can't ask for a subset of fields. On the other hand, you can do it using GraphQL (http://graphql.org/learn/), and this is what we are planning to support at some point in the future.
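
As a minimal sketch of the client-side column filtering described above: the records and field names below (`molecule_chembl_id`, `standard_type`, `standard_value`, `assay_description`) are hypothetical stand-ins, not a confirmed schema. With the real client, the records would be the dictionaries the client returns.

```python
def select_fields(records, wanted):
    """Return copies of the records that keep only the wanted keys.

    Keys missing from a record are filled with None.
    """
    return [{key: record.get(key) for key in wanted} for record in records]


# Hypothetical records standing in for what the API returns.
records = [
    {
        "molecule_chembl_id": "CHEMBL_EXAMPLE_1",
        "standard_type": "IC50",
        "standard_value": "12.0",
        "assay_description": "example assay A",
    },
    {
        "molecule_chembl_id": "CHEMBL_EXAMPLE_2",
        "standard_type": "Ki",
        "standard_value": "3.5",
        "assay_description": "example assay B",
    },
]

slim = select_fields(records, ["molecule_chembl_id", "standard_value"])
for row in slim:
    print(row)
```

The full records still have to be downloaded first; this only trims them afterwards.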

from chembl_webresource_client.

fabricecarles commented on June 5, 2024

Hmm, but I don't understand why you chose to return 30 columns by default in new_client.activity.
I presume that the table returned by default by the REST API is the result of an SQL view or a selection query?
In my opinion, by adding target and compound information to the activity table you give users more filtering options, but the drawback is that about 2/3 of the returned columns contain duplicated values, and the total time required to download the data with the new_client version is multiplied by more than 100!


mnowotka commented on June 5, 2024

Not really; the available fields were carefully chosen to satisfy the vast majority of users. I'd be happy to see timings showing that the new client is more than 100x slower; to the best of my knowledge that's not the case. We are using the REST API to build complex applications (https://chembl-glados.herokuapp.com/) that use all the fields to provide advanced filtering.


fabricecarles commented on June 5, 2024

I apologize, I somewhat overestimated the time factor: the new client is actually about 50-60 times slower.

old client

%time old_df = pd.DataFrame.from_dict(targets.bioactivities('CHEMBL1862'))
CPU times: user 872 ms, sys: 156 ms, total: 1.03 s
Wall time: 8.05 s

new client

bioacts = new_client.activity.filter(target_chembl_id="CHEMBL1862")
%time new_df=pd.DataFrame.from_dict(list(bioacts))
CPU times: user 18.9 s, sys: 1.57 s, total: 20.5 s
Wall time: 7min 10s

And the results are not exactly the same:

print(old_df.shape)
(11095, 16)
print(new_df.shape)
(11058, 31)

Any idea?


mnowotka commented on June 5, 2024

Thanks for checking this. So yes, I agree that the new web services are slower than the old ones. But this has nothing to do with the number of columns or with bandwidth.

First of all, what you see is mostly related to the fact that the new API is paginated. When you use the client you can't see this (this is why the client is so nice), but behind the scenes the new client fetches data in chunks. The default chunk size is 20 results, but this can be increased up to 1000. Here is how you do it; just put this code at the very top of your script:

from chembl_webresource_client.settings import Settings
Settings.Instance().MAX_LIMIT = 1000

This has to be done BEFORE you import any other client-related code.
Please try that and you should see a significant improvement.
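
To see why the chunk size matters, here is a rough back-of-the-envelope sketch. It assumes that each chunk costs one HTTP round trip and that nothing is served from the cache; the row count is the one reported for the new client earlier in this thread, and `requests_needed` is just an illustrative helper name.

```python
import math


def requests_needed(total_records, page_size):
    """Number of paginated requests needed to fetch every record."""
    return math.ceil(total_records / page_size)


total = 11058  # rows reported for the new client earlier in this thread

for page_size in (20, 1000):
    count = requests_needed(total, page_size)
    print(f"page size {page_size}: {count} requests")
```

Under these assumptions the default chunk size needs 553 round trips for this target, while a chunk size of 1000 needs 12.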

Still, there are some more performance issues, especially related to the 'activities' endpoint, as it offers the largest amount of data (we have 14M activities). A new API release, due in about 2 weeks from now, should fix this.

So I will keep this issue open, and when the new release is out I will ask you to rerun your tests. If the results are fine, I'll close the issue.


fabricecarles commented on June 5, 2024

Thanks, I will try with Settings.Instance().MAX_LIMIT = 1000, but first I have to clear the cache...
With the old client I just have to remove the .chembl_ws_client__0.8.50.sqlite file, but how do I do this with the new client?
Also, do you have an explanation for the difference in the number of rows between the old and new client results?
I will send you an email about this...


mnowotka commented on June 5, 2024

OK, so there are two separate things:

  1. If you want to disable caching temporarily, just for your script, this can be controlled through the settings as well. Just add this line:

     Settings.Instance().CACHING = False
    

    just after:

     Settings.Instance().MAX_LIMIT = 1000
    
  2. If you want to delete the cache file: the new client moved the cache file from the current directory (polluting the current directory doesn't make sense, and such a cache is only usable from that one directory) to a hidden file in the home directory. To see it, just run:

     ls ~/.chembl_ws_client*
    

    Of course, you can change this default location using the settings:

     Settings.Instance().CACHE_NAME = '/some/new/location.sqlite'
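
For anyone who prefers to inspect the cache from Python rather than the shell, the `ls` command above could be sketched roughly as follows. This uses only the standard library; the `.chembl_ws_client*` name pattern is taken from the command above, `find_cache_files` is a hypothetical helper name, and the deletion line is left commented out on purpose.

```python
from pathlib import Path


def find_cache_files(home=None):
    """List files in the home directory whose name starts with .chembl_ws_client."""
    base = Path(home) if home is not None else Path.home()
    return sorted(p for p in base.glob(".chembl_ws_client*") if p.is_file())


for path in find_cache_files():
    print(path, path.stat().st_size, "bytes")
    # path.unlink()  # uncomment to actually delete the cache file
```

If CACHE_NAME was changed in the settings, the cache lives at that location instead and this sketch will not find it.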
    


fabricecarles commented on June 5, 2024

OK, thanks.
After removing all cached files and setting Settings.Instance().MAX_LIMIT = 1000 for the new client, the time difference between the old and new client is around a factor of 25. This is better, but it could be better still.
Note that bandwidth dramatically affects the total download time: there is a factor of 4 between my home network and my institute network. So in my opinion the observed difference could be due to the size of the dataset, and also to the number of returned columns... Whatever the improvement, there will always be a factor of 2 between the old client and the new one, because I saw that the cache is twice as large with the new client, which is understandable because there are twice as many columns.
I suggest doing the join between the Activities, Compound and Target tables on the client only, after the data has been downloaded, instead of sending thousands of duplicated values (IDs, SMILES and protein descriptions) as is currently the case in new_client.activity.
Anyway, I will wait for the new version and test it to see the difference. I hope I have made a useful contribution to the project.
Have a good day
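
To make the suggestion about joining on the client more concrete, here is a small sketch of the idea in plain Python. It is not how the API or the client works: the field names (`molecule_chembl_id`, `canonical_smiles`, `standard_value`) and both helper functions are hypothetical and only illustrate sending each compound's data once and re-attaching it locally.

```python
def split_activities(rows, compound_key="molecule_chembl_id",
                     compound_fields=("canonical_smiles",)):
    """Split denormalized rows into slim activity rows and a compound lookup."""
    compounds = {}
    activities = []
    for row in rows:
        compound_id = row[compound_key]
        # Store each compound's fields once, however many activities refer to it.
        compounds.setdefault(compound_id, {f: row[f] for f in compound_fields})
        activities.append({k: v for k, v in row.items() if k not in compound_fields})
    return activities, compounds


def join_activities(activities, compounds, compound_key="molecule_chembl_id"):
    """Re-attach the compound fields to every activity row on the client side."""
    return [{**activity, **compounds[activity[compound_key]]} for activity in activities]


# Hypothetical denormalized rows: the SMILES string is repeated per activity.
rows = [
    {"molecule_chembl_id": "CHEMBL_EXAMPLE_1", "canonical_smiles": "CCO", "standard_value": "12.0"},
    {"molecule_chembl_id": "CHEMBL_EXAMPLE_1", "canonical_smiles": "CCO", "standard_value": "15.0"},
    {"molecule_chembl_id": "CHEMBL_EXAMPLE_2", "canonical_smiles": "CCN", "standard_value": "3.5"},
]

activities, compounds = split_activities(rows)
print(len(activities), "activity rows,", len(compounds), "distinct compounds")
print(join_activities(activities, compounds) == rows)
```

Whether this would save time in practice depends on how much of the payload is really duplicated, which this sketch does not measure.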


mnowotka commented on June 5, 2024

Yes, that was helpful, thank you.
I hope we've managed to solve at least some of your problems.


mnowotka commented on June 5, 2024

Hi @fabricecarles. Can you please verify the API speed now? Can you see any improvements?


fabricecarles commented on June 5, 2024

Hi,
Indeed, using client version 0.9.13 the results seem to be better:

from chembl_webresource_client.settings import Settings
Settings.Instance().MAX_LIMIT = 1000
Settings.Instance().CACHING = False
### old client
%time old_df = pd.DataFrame.from_dict(targets.bioactivities('CHEMBL1862'))
CPU times: user 443 ms, sys: 111 ms, total: 554 ms
Wall time: 1.59 s
### new client
bioacts = new_client.activity.filter(target_chembl_id="CHEMBL1862")
%time new_df=pd.DataFrame.from_dict(list(bioacts))
CPU times: user 1.16 s, sys: 217 ms, total: 1.38 s
Wall time: 18.8 s

Now the difference between the old and new client is around a factor of 10; thank you for this improvement.
Fabrice
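
As a quick check of the slowdown factors quoted in this thread, the wall times reported in the two benchmark runs can be divided directly. This is plain arithmetic on the posted numbers; `slowdown` is just an illustrative helper name.

```python
def slowdown(new_seconds, old_seconds):
    """How many times slower the new client was, based on wall time."""
    return new_seconds / old_seconds


# First run: new client 7 min 10 s, old client 8.05 s.
first = slowdown(7 * 60 + 10, 8.05)
# Run with client 0.9.13: new client 18.8 s, old client 1.59 s.
latest = slowdown(18.8, 1.59)

print(f"first run: {first:.1f}x slower")
print(f"latest run: {latest:.1f}x slower")
```

This gives roughly 53x for the first run and roughly 12x for the latest one, in line with the "50-60 times" and "around a factor of 10" estimates given above.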

