I keep running into 502 errors when searching for similar molecules based on smile str

new_client.similarity.filter returns 502 errors with low similarity threshold about chembl_webresource_client HOT 14 CLOSED

Swarchal commented on May 25, 2024

new_client.similarity.filter returns 502 errors with low similarity threshold

from chembl_webresource_client.

Comments (14)

mnowotka commented on May 25, 2024

Interesting! Thanks for reporting this, I'll have a look.

mnowotka commented on May 25, 2024

@Swarchal - can you please paste the list of your smiles if possible? If not, at least one particular that fails? (The list would be better, I could add it to my acceptance tests)

mnowotka commented on May 25, 2024

Ah, if it 'doesn't seem to fail at a particular smile string, and as it caches, if I re-run it does make progress', then I'm not sure there is much that can be done here. If you provide a large enough SMILES string with a small enough threshold that it yields thousands of results, the time taken to collect them will exceed the Apache timeout and you will get a 502. Next time the result will be taken from the cache, so there is a chance you will get the correct results. I may implement this asynchronously as ChEMBL grows, but this is not a trivial change. Increasing the gateway timeout may solve the problem in most cases but not all of them. A faster cartridge and sharding may also help but, as I said, this won't be an immediate fix.

I suggest you either keep hammering the API until you get correct results, or download the SMILES and use chemfp while I come up with a better solution on the API side. A representative set of SMILES would still be helpful.
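
For readers who take the "keep hammering the API" route, a minimal retry sketch might look like the following. It is not part of the client; `call_with_retries` and its parameters are hypothetical names made up for illustration, and the exponential backoff is an assumption about what a polite retry policy could look like.

```python
import time


def call_with_retries(fetch, retries=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fetch() until it succeeds or the retries are used up.

    Hypothetical helper: waits base_delay * 2**attempt seconds between
    attempts and re-raises the last error once retries are exhausted.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == retries:
                raise  # give up: no retries left
            sleep(base_delay * 2 ** attempt)
```

With the client, `fetch` could be a small function that runs one similarity query and returns `len(res)`, and `retry_on` could be the `HttpBadGateway` exception from `chembl_webresource_client.http_errors`. Because results are cached on the server, a later attempt has a better chance of succeeding than the first one.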

mnowotka commented on May 25, 2024

It would also be helpful to know whether it used to be faster, in which case I can raise the issue with our DBA team.

Swarchal commented on May 25, 2024

Wow, quick response.

I've run the same list of smiles before without issues, but that was with a higher similarity threshold (85).

Here's a superset of the smile strings, the ~1,000 I'm using in the code are within there -- hush-hush data and all that.

mnowotka commented on May 25, 2024

Perfect, I'll have a look. As a general note, the lower the threshold goes, the more similar compounds are found, and the number grows exponentially.

Swarchal commented on May 25, 2024

It runs without issue if I increase similarity from 70 => 75.

mnowotka commented on May 25, 2024

Good to know. I also checked, and the cartridge is in heavy use at the moment as we are pregenerating the substructure search cache for the bugfix release on Monday. So please rerun your stuff next week; I'll try as well, and will probably tune the timeout during the release so your compounds will (mostly) pass next time.

mnowotka commented on May 25, 2024

This should be much faster now, with no more 502 errors. @Swarchal, can you please check?

Swarchal commented on May 25, 2024

Just tried again with the master branch. I seem to be getting the same error, but it ran much longer before raising an exception.

Traceback (most recent call last):
  File "test_chembl_fix.py", line 35, in <module>
    if len(res) == 0:
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/query_set.py", line 98, in __len__
    return len(self.query)
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/url_query.py", line 150, in __len__
    self.get_page()
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/url_query.py", line 383, in get_page
    handle_http_error(res)
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/http_errors.py", line 113, in handle_http_error
    raise exception_class(request.url, request.text)
chembl_webresource_client.http_errors.HttpBadGateway: Error for url https://www.ebi.ac.uk/chembl/api/data/similarity.json, server response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/chembl/api/data/similarity.json">POST&nbsp;/chembl/api/data/similarity.json</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
<hr>
<address>Apache/2.2.15 (Red Hat) Server at www.ebi.ac.uk Port 80</address>
</body></html>

mnowotka commented on May 25, 2024

OK, just to clarify: no changes have been made to the client. On the server side I:

increased the proxy timeout to 300s.
changed the gunicorn worker class from sync to gevent so long-running tasks won't block other requests.
tuned the performance using yandex.tank and, as a result, increased the number of workers on a single machine from 8 to 24.
configured workers to restart every 1k requests to prevent memory leaks and performance degradation over time.
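
As a rough illustration of the gunicorn side of these changes, a configuration file along these lines would express them. This is only a sketch: the actual ChEMBL server configuration is not shown in this thread, and the file name is an assumption. The setting names themselves (`workers`, `worker_class`, `max_requests`) are standard gunicorn settings.

```python
# gunicorn.conf.py (hypothetical file name; not the real ChEMBL config)

# Number of worker processes per machine, raised from 8 after load testing.
workers = 24

# Cooperative gevent workers, so one slow similarity query does not
# block the other requests handled by the same worker.
# Requires the gevent package to be installed.
worker_class = "gevent"

# Restart each worker after it has served 1000 requests, which limits
# the effect of memory leaks and gradual slowdown.
max_requests = 1000

# The 300 s proxy timeout is not a gunicorn setting. It belongs to the
# Apache front end, for example via mod_proxy's ProxyTimeout directive.
```

gunicorn config files are plain Python, so a file like this would be passed with `gunicorn -c gunicorn.conf.py` followed by the application module.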

One thing I don't understand is why the client ignores the TOTAL_RETRIES setting, which defaults to 3. I'll check this, but it still won't solve the problem of similarity running slowly; I need to profile the SQL statements.

mnowotka commented on May 25, 2024

OK, I've spent some time on this and I believe it is fixed now. Please do the following:

Upgrade the client to the latest version (0.9.30).
This version introduces the "only" operator, which specifies which fields should be retrieved. This is important for the "similarity" endpoint because it returns a lot of information about molecules, which is expensive due to many joins. But in your case (which is actually pretty common) you just want to see which molecules are hit (in fact, you only want to know the number of hits, or whether it is zero). So you can now instruct the API to return only molecule identifiers and skip the joins entirely:

from chembl_webresource_client.new_client import new_client

similarity_query = new_client.similarity
dark_smiles = []  # SMILES with no similar molecules at this threshold

with open('12K_smile_strings.smi') as f:
    content = f.readlines()

for idx, line in enumerate(content):
    smile = line.strip()
    # Request only the identifier field so the API can skip the expensive joins.
    res = similarity_query.filter(smiles=smile, similarity=70).only(['molecule_chembl_id'])
    hits = len(res)  # compute the count once and reuse it
    print("{0} {1} {2}".format(idx, smile, hits))
    if hits == 0:
        dark_smiles.append(smile)

If you also want to know the similarity score, replace only(['molecule_chembl_id']) with only(['molecule_chembl_id', 'similarity']).
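
If the similarity score is requested as well, the results can be ranked locally. The sketch below assumes each result behaves like a dict with the two keys named above and that the score may arrive as a string; `best_hits` is a hypothetical helper, and the sample records are made up for illustration, not real ChEMBL output.

```python
def best_hits(records, limit=5):
    """Return (chembl_id, similarity) pairs, highest similarity first."""
    pairs = [(r["molecule_chembl_id"], float(r["similarity"])) for r in records]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:limit]


# Made-up records in the assumed shape; the identifiers are placeholders.
sample = [
    {"molecule_chembl_id": "CHEMBL_A", "similarity": "72.5"},
    {"molecule_chembl_id": "CHEMBL_B", "similarity": "91.0"},
    {"molecule_chembl_id": "CHEMBL_C", "similarity": "70.1"},
]

top_two = best_hits(sample, limit=2)
```

In the loop above, `records` would be the `res` object returned by the query, iterated like a list.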

PLEASE NOTE: I ran your entire 12k example and didn't get any proxy timeouts in the process. It still took several hours to complete. The smiles from this file are now in the API cache, so it will work much faster (several minutes). If you provide new smiles not yet known to the API, it will behave slower, but still much faster than last time, and you shouldn't see any proxy timeouts anymore.

@Swarchal - can you please confirm if this solves your problem?

Swarchal commented on May 25, 2024

Just tried the script above and it ran without error. Thanks for your work on this, it's a great tool!

mnowotka commented on May 25, 2024

Perfect! I'm closing this but feel free to reopen in case of any more proxy timeouts.
