Comments (14)
Interesting! Thanks for reporting this, I'll have a look.
from chembl_webresource_client.
@Swarchal - can you please paste your list of SMILES if possible? If not, at least one particular string that fails? (The list would be better, as I could add it to my acceptance tests.)
Ah, in that case ('doesn't seem to fail at a particular smile string, and as it caches, if I re-run it does make progress') I'm not sure there is anything that can be done here. If you provide a large enough SMILES string with a small enough threshold that it yields thousands of results, then the time taken to collect them will exceed the Apache timeout and you will get a 502. Next time the result will be taken from the cache, so there is a chance you will get the correct results. I may implement this asynchronously as ChEMBL grows, but this is not a trivial change. Increasing the gateway timeout may solve the problem in most cases but not all of them. A faster cartridge and sharding may also help, but as I said, this won't be an immediate fix.
I suggest you either hammer the API until you get correct results, or download the SMILES and use chemfp while I come up with a better solution on the API side of things. Still, a representative set of SMILES would be helpful.
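The "hammer the API" approach can be made a little gentler with retries and a growing delay. The following is only a minimal sketch: `retry_with_backoff` and its parameters are hypothetical names introduced for illustration and are not part of the client.

```python
import time


def retry_with_backoff(func, retries=5, base_delay=1.0,
                       exceptions=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or the retry budget is exhausted.

    The delay doubles after each failed attempt: base_delay, 2 * base_delay, ...
    (Hypothetical helper, not part of chembl_webresource_client.)
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return func()
        except exceptions:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

With the client, `func` could be something like `lambda: len(similarity_query.filter(smiles=smile, similarity=70))` and `exceptions` could be narrowed to `(HttpBadGateway,)` from `chembl_webresource_client.http_errors`. Since a failed query may still warm the server-side cache, a later attempt has a better chance of succeeding.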
Also, knowing whether it used to be faster would be helpful, in which case I can raise the issue with our DBA team.
Wow, quick response.
I've run the same list of smiles before without issues, but that was with a higher similarity threshold (85).
Here's a superset of the smile strings, the ~1,000 I'm using in the code are within there -- hush-hush data and all that.
Perfect, I'll have a look. A general note: as the threshold goes lower, exponentially more similar compounds are found.
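To see why a lower threshold returns more hits, here is a toy sketch. It assumes the similarity is a Tanimoto coefficient over fingerprint bits, scaled to 0-100; the fingerprints are made-up bit sets, not real molecules, and the function names are hypothetical. It only shows that the hit count can never shrink as the threshold drops; it says nothing about the actual growth rate in ChEMBL.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def count_hits(query, library, threshold):
    """Count library fingerprints whose similarity to query is >= threshold (0-100)."""
    return sum(1 for fp in library if tanimoto(query, fp) * 100 >= threshold)


# Toy fingerprints (made-up bit sets, not real molecules).
query = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
library = [
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},  # identical: 100
    {1, 2, 3, 4, 5, 6, 7, 8, 9},      # 9/10 = 90
    {1, 2, 3, 4, 5, 6, 7, 8, 11},     # 8/11, about 72.7
    {1, 2, 3, 4, 5, 6, 7, 12, 13},    # 7/12, about 58.3
    {1, 2, 3, 20, 21, 22},            # 3/13, about 23.1
]

for threshold in (85, 70, 50):
    print(threshold, count_hits(query, library, threshold))
```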
It runs without issue if I increase similarity from 70 => 75.
Good to know. I also checked, and the cartridge is in heavy use at the moment as we are pregenerating the substructure search cache for the bugfix release on Monday. So please rerun your stuff next week; I'll try as well, and during the release I'll probably tune the timeout so your compounds will (mostly) pass next time.
This should be much faster now, with no more 502 errors. @Swarchal, can you please check?
Just tried again with the master branch. I seem to be getting the same error, but it ran much longer before raising an exception.
Traceback (most recent call last):
  File "test_chembl_fix.py", line 35, in <module>
    if len(res) == 0:
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/query_set.py", line 98, in __len__
    return len(self.query)
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/url_query.py", line 150, in __len__
    self.get_page()
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/url_query.py", line 383, in get_page
    handle_http_error(res)
  File "/home/scott/.local/lib/python3.6/site-packages/chembl_webresource_client-0.9.25-py3.6.egg/chembl_webresource_client/http_errors.py", line 113, in handle_http_error
    raise exception_class(request.url, request.text)
chembl_webresource_client.http_errors.HttpBadGateway: Error for url https://www.ebi.ac.uk/chembl/api/data/similarity.json, server response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/chembl/api/data/similarity.json">POST /chembl/api/data/similarity.json</a></em>.<p>
Reason: <strong>Error reading from remote server</strong></p></p>
<hr>
<address>Apache/2.2.15 (Red Hat) Server at www.ebi.ac.uk Port 80</address>
</body></html>
OK, just to clarify: no changes have been made to the client. On the server side I:
- increased the proxy timeout to 300s.
- changed the gunicorn worker class from sync to gevent so a long-running task won't block other requests.
- tuned the performance using yandex.tank and as a result increased the number of workers on a single machine from 8 to 24.
- configured workers to restart every 1k requests to prevent memory leaks and a drop in performance over time.
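For readers running a similar stack, the gunicorn side of these changes might look roughly like the sketch below. This is not the actual ChEMBL configuration, which is not shown in this thread. The setting names are standard gunicorn options; the values for worker class, worker count, and request limit come from the list above, and `max_requests_jitter` is an added assumption. The 300s proxy timeout would live in the Apache configuration rather than here.

```python
# gunicorn.conf.py (illustrative sketch only, not the real ChEMBL configuration)

# Cooperative workers instead of the default "sync" class, so a slow
# similarity search does not block other requests (needs the gevent package).
worker_class = "gevent"

# Worker count per machine, raised from 8 after load testing.
workers = 24

# Recycle each worker after 1k requests to limit the effect of memory leaks.
max_requests = 1000

# Assumption, not mentioned in the thread: stagger the restarts so that
# the workers do not all recycle at the same moment.
max_requests_jitter = 50
```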
One thing I don't understand is why the client ignores the TOTAL_RETRIES setting, which defaults to 3. I'll check this, but it still won't solve the problem of similarity running slowly; I need to profile the SQL statements.
OK, I've spent some time on this and I believe this is fixed now. Please do the following:
- Upgrade the client to the latest version (0.9.30)
- This version introduces the "only" operator, which specifies which fields should be retrieved. This is important in the case of the "similarity" endpoint because it returns a lot of information about molecules, which is expensive due to many joins. But in your case (which is actually pretty common) you just want to see which molecules are hit (actually you only want to know the number, or whether the number is zero). So you can now instruct the API to return only molecule identifiers and skip the joins entirely:
from chembl_webresource_client.new_client import new_client

similarity_query = new_client.similarity
dark_smiles = []  # SMILES with no similar molecule at the given threshold

with open('12K_smile_strings.smi') as f:
    content = f.readlines()

for idx, line in enumerate(content):
    smile = line.strip()
    # Request only the identifier field so the server can skip the expensive joins.
    res = similarity_query.filter(smiles=smile, similarity=70).only(['molecule_chembl_id'])
    hits = len(res)
    print("{0} {1} {2}".format(idx, smile, hits))
    if hits == 0:
        dark_smiles.append(smile)
If you also want to know the similarity score, replace only(['molecule_chembl_id']) with only(['molecule_chembl_id', 'similarity']).
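Once the score is requested, picking the closest match could look like the sketch below. It assumes each returned record behaves like a dict with the keys `molecule_chembl_id` and `similarity`, and that the score may arrive as a string; neither detail is confirmed in this thread. `best_hit` is a hypothetical helper, and the IDs and scores in the example are made up.

```python
def best_hit(records):
    """Return (molecule_chembl_id, similarity) for the highest-scoring record, or None.

    Hypothetical helper; the record shape is an assumption, not a documented contract.
    """
    if not records:
        return None
    # float() accepts both numeric and string scores.
    top = max(records, key=lambda r: float(r["similarity"]))
    return top["molecule_chembl_id"], float(top["similarity"])


# Made-up records in the assumed shape (not real ChEMBL data).
example = [
    {"molecule_chembl_id": "CHEMBL_A", "similarity": "71.4"},
    {"molecule_chembl_id": "CHEMBL_B", "similarity": "88.9"},
]
print(best_hit(example))
```

With the client, `records` would be something like `list(res)`, where `res` is the query built with `only(['molecule_chembl_id', 'similarity'])`.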
PLEASE NOTE: I ran your entire 12k example and didn't get any proxy timeout in the process. It still took several hours to complete. The SMILES from this file are now in the API cache, so it will work much faster (several minutes). If you provide new SMILES not yet known to the API it will behave slower, but still much faster than last time, and you shouldn't see any proxy timeouts anymore.
@Swarchal - can you please confirm if this solves your problem?
Just tried the script above and it ran without error. Thanks for your work on this, it's a great tool!
Perfect! I'm closing this but feel free to reopen in case of any more proxy timeouts.