Comments (14)

mnowotka commented on May 25, 2024

Interesting! Thanks for reporting this, I'll have a look.

from chembl_webresource_client.

mnowotka commented on May 25, 2024

@Swarchal - can you please paste your list of SMILES if possible? If not, at least one particular string that fails? (The list would be better; I could add it to my acceptance tests.)

mnowotka commented on May 25, 2024

Ah, in case this 'doesn't seem to fail at a particular smile string, and as it caches, if I re-run it does make progress', I'm not so sure there is anything that can be done here. If you provide a large enough SMILES string with a small enough threshold that it yields thousands of results, then the time taken to collect them will exceed the Apache timeout and you will get a 502. Next time the result will be taken from cache, so there is a chance you will get the correct results. I may implement this asynchronously as ChEMBL grows, but this is not a trivial change. Increasing the gateway timeout may solve the problem in most cases but not all of them. A faster cartridge and sharding may also help, but as I said this won't be an immediate fix.

I suggest you either hammer the API until you get correct results, or download the SMILES and use chemfp while I come up with a better solution on the API side of things. Still, a representative set of SMILES would be helpful.
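
The "hammer the API" suggestion above could be sketched as a small retry helper with exponential backoff. This is only an illustration, not part of the client: `call_with_retries` and its parameters are hypothetical names, and `query_fn` stands for any zero-argument callable wrapping the similarity query. In real use you would probably pass the client's `HttpBadGateway` exception (the one shown in tracebacks from this library) as `retry_on` instead of the broad default.

```python
import time


def call_with_retries(query_fn, max_attempts=5, base_delay=2.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call query_fn() until it succeeds or max_attempts is exhausted.

    After the n-th failure (counting from 0) the helper waits
    base_delay * 2**n seconds before trying again. The last failure
    is re-raised so the caller can record the SMILES that never worked.
    """
    for attempt in range(max_attempts):
        try:
            return query_fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))
```

A call might look like `call_with_retries(lambda: len(similarity_query.filter(smiles=smile, similarity=70)))`. Whether a later attempt succeeds depends on the server having cached the result in the meantime, so this is a workaround rather than a fix.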

mnowotka commented on May 25, 2024

Also, knowing whether it used to be faster would be helpful, in which case I can raise the issue with our DBA team.

Swarchal commented on May 25, 2024

Wow, quick response.

I've run the same list of SMILES before without issues, but that was with a higher similarity threshold (85).

Here's a superset of the SMILES strings; the ~1,000 I'm using in the code are within there. Hush-hush data and all that.

mnowotka commented on May 25, 2024

Perfect, I'll have a look. A general note: as the threshold goes lower, exponentially more similar compounds are found.

Swarchal commented on May 25, 2024

It runs without issue if I increase similarity from 70 => 75.

mnowotka commented on May 25, 2024

Good to know. I also checked, and the cartridge is in heavy use at the moment as we are pregenerating the substructure search cache for the bugfix release on Monday. So please rerun your stuff next week, but I'll try as well and will probably tune the timeout during the release so your compounds will (mostly) pass next time.

mnowotka commented on May 25, 2024

This should be much faster now, with no more 502 errors. @Swarchal, can you please check?

Swarchal commented on May 25, 2024

Just tried again with the master branch; I seem to be getting the same error, but it ran much longer before raising an exception.

Traceback (most recent call last):
  File "test_chembl_fix.py", line 35, in <module>
    if len(res) == 0:
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/query_set.py", line 98, in __len__
    return len(self.query)
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/url_query.py", line 150, in __len__
    self.get_page()
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/url_query.py", line 383, in get_page
    handle_http_error(res)
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/http_errors.py", line 113, in handle_http_error
    raise exception_class(request.url, request.text)
chembl_webresource_client.http_errors.HttpBadGateway: Error for url https://www.ebi.ac.uk/chembl/api/data/similarity.json, server response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/chembl/api/data/similarity.json">POST&nbsp;/chembl/api/data/similarity.json</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
<hr>
<address>Apache/2.2.15 (Red Hat) Server at www.ebi.ac.uk Port 80</address>
</body></html>

mnowotka commented on May 25, 2024

OK, just to clarify: no changes have been made to the client. On the server side I:

  • increased the proxy timeout to 300s.
  • changed the gunicorn worker class from sync to gevent so long-running tasks won't block other requests.
  • tuned the performance using yandex.tank and, as a result, increased the number of workers on a single machine from 8 to 24.
  • configured workers to restart every 1k requests to prevent memory leaks and performance degradation over time.

One thing I don't understand is why the client ignores the TOTAL_RETRIES setting, which defaults to 3. I'll check this, but it still won't solve the problem of similarity search running slowly; I need to profile the SQL statements.
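
For readers who want to apply similar server-side tuning, the gunicorn part of the list above might look roughly like the configuration sketch below. The file name and the jitter value are assumptions of this sketch, not something stated in the thread; `worker_class`, `workers`, `max_requests` and `max_requests_jitter` are standard gunicorn settings.

```python
# gunicorn.conf.py (hypothetical file name; values mirror the list above)

# Asynchronous workers, so one slow similarity search does not tie up a
# whole worker. This needs the gevent package installed next to gunicorn.
worker_class = "gevent"

# Worker processes per machine, raised from 8 to 24 after load testing.
workers = 24

# Recycle each worker after roughly 1,000 requests to contain memory leaks.
max_requests = 1000

# Assumption, not from the thread: stagger the restarts so that all
# workers do not recycle at the same moment.
max_requests_jitter = 50

# Note: the 300 s proxy timeout is configured on the proxy in front of
# gunicorn (Apache in this thread), not in this file.
```

Such a file would typically be loaded with `gunicorn -c gunicorn.conf.py <your WSGI app>`.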

mnowotka commented on May 25, 2024

OK, I've spent some time on this and I believe this is fixed now. Please do the following:

  • Upgrade the client to the latest version (0.9.30)
  • This version introduces the "only" operator. "only" specifies which fields should be retrieved. This is important in the case of the "similarity" endpoint because it returns a lot of information about molecules, which is expensive due to many joins. But in your case (which is actually pretty common) you just want to see which molecules are hit (in fact you only want to know the number of hits, or whether it is zero). So you can now instruct the API to return only molecule identifiers and skip the joins entirely:
from chembl_webresource_client.new_client import new_client

similarity_query = new_client.similarity
dark_smiles = []  # SMILES with no similar molecule in ChEMBL

with open('12K_smile_strings.smi') as f:
    content = f.readlines()

for idx, line in enumerate(content):
    smile = line.strip()
    # Request only the identifier so the server can skip the expensive joins.
    res = similarity_query.filter(smiles=smile, similarity=70).only(['molecule_chembl_id'])
    n_hits = len(res)
    print("{0} {1} {2}".format(idx, smile, n_hits))
    if n_hits == 0:
        dark_smiles.append(smile)

If you also want to know the similarity score, replace only(['molecule_chembl_id']) with only(['molecule_chembl_id', 'similarity']).
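
If the similarity score is requested as well, a small helper can pick the closest match out of the returned records. The sketch below assumes each record behaves like a dict with `molecule_chembl_id` and `similarity` keys and that the score may arrive as a string; both are assumptions about the response shape, and `best_hit` is a hypothetical name, not part of the client.

```python
def best_hit(records):
    """Return (molecule_chembl_id, similarity) of the closest match, or None.

    records is any iterable of dict-like objects; the similarity value is
    converted with float() in case the API returns it as a string.
    """
    best = None
    for rec in records:
        score = float(rec["similarity"])
        if best is None or score > best[1]:
            best = (rec["molecule_chembl_id"], score)
    return best


# Made-up records for illustration only, not real API output.
example = [
    {"molecule_chembl_id": "CHEMBL_A", "similarity": "72.5"},
    {"molecule_chembl_id": "CHEMBL_B", "similarity": "91.0"},
]
print(best_hit(example))
```

With the client, `records` would be the query result from the loop above with `'similarity'` added to the `only([...])` list.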

PLEASE NOTE: I ran your entire 12k example and didn't get any proxy timeout in the process. It still took several hours to complete. Now the SMILES from this file are in the API cache, so it will work much faster (several minutes). If you provide new SMILES not yet known to the API, it will behave slower, but still much faster than last time, and you should not see any proxy timeouts anymore.

@Swarchal - can you please confirm if this solves your problem?

Swarchal commented on May 25, 2024

Just tried the script above and it ran without error. Thanks for your work on this, it's a great tool!

mnowotka commented on May 25, 2024

Perfect! I'm closing this but feel free to reopen in case of any more proxy timeouts.
