psf / cachecontrol
The httplib2 caching algorithms packaged up for use with requests.
License: Other
CacheControl totally fails on this URL:
'http://www.meristation.com/v3/podcasts_rss.php'
The max-age value is '600s', which is obviously invalid, but I think CacheControl should catch the ValueError raised when casting the string to an int and handle it gracefully.
Exception:
ValueError
invalid literal for int() with base 10: '600s'
Location: cachecontrol/controller.py, cache_response:241
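A sketch of the graceful handling being requested here: parse max-age defensively and treat malformed values like '600s' as if the directive were absent, instead of letting the ValueError propagate.

```python
def parse_max_age(value):
    """Return max-age as an int, or None when the value is malformed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        # Covers both a missing value (None) and junk like '600s'.
        return None
```

With something like this in cache_response, a bogus header would simply disable the freshness-lifetime calculation rather than crash the request.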
I've been thinking about the design of BaseHeuristic. While the new design seems good as a base class (for the uses I've thought of so far), I think that, for common-case uses, it's going to result in a lot of people who don't have the time to become HTTP 1.1 experts reinventing boilerplate and introducing subtle bugs in their header parsing and resetting code.
There should probably be a subclass with a name like SimpleBaseHeuristic where, instead of update_headers(self, response) and warning(self, response), you override set_expiry(self, response, current_expiry) and return the new expiry value.
(That is, a subclass which provides a single, shared implementation of going from multiple subtly different HTTP headers to one "this is how long it'll last" number and back again, and which also handles the Warning 110/113 selection automatically.)
If you don't beat me to it, I'll try to write it as soon as I can figure out the least hacky way to call CacheController.parse_cache_control from a BaseHeuristic subclass, so I can access the value of max-age without duplicating the Cache-Control parsing code.
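The proposed API might look roughly like the sketch below. This is hypothetical (SimpleBaseHeuristic does not exist in CacheControl); the BaseHeuristic here is a minimal stand-in for cachecontrol.heuristics.BaseHeuristic so the example is self-contained.

```python
from datetime import datetime, timedelta
from email.utils import formatdate, parsedate
import calendar

class BaseHeuristic(object):
    # Stand-in for cachecontrol.heuristics.BaseHeuristic.
    def update_headers(self, response):
        return {}
    def warning(self, response):
        return None

class SimpleBaseHeuristic(BaseHeuristic):
    """Proposed: subclasses override set_expiry() and return a datetime."""

    def set_expiry(self, response, current_expiry):
        return current_expiry

    def update_headers(self, response):
        # One shared place that turns the subtly different HTTP headers
        # into a single expiry value and back again.
        parsed = parsedate(response.headers.get('date', ''))
        current = datetime(*parsed[:6]) if parsed else datetime.utcnow()
        expires = self.set_expiry(response, current)
        return {
            'expires': formatdate(calendar.timegm(expires.timetuple())),
            'cache-control': 'public',
        }

    def warning(self, response):
        # The shared Warning 110/113 selection would live here.
        return '110 - "Response is Stale"'

class OneHourHeuristic(SimpleBaseHeuristic):
    def set_expiry(self, response, current_expiry):
        return current_expiry + timedelta(hours=1)
```

A user would then only write the three-line OneHourHeuristic instead of re-deriving the header arithmetic.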
req = requests.get('http://example.org', params={'test':'ok', 'api_key': 's3cr3t', 'bqj':'None'})
print(req.url)
'http://example.org/?bqj=None&api_key=s3cr3t&test=ok'
...
# then in another python interpreter instance, the same code might give:
print(req.url) # http://example.org/?bqj=None&api_key=s3cr3t&test=ok
'http://example.org/?test=ok&api_key=s3cr3t&bqj=None'
I noticed that, for the same HTTP data, the order of the query part (from parse_uri) can change from one run to another.
Since the cache key is computed from the URI, we might end up with two different keys for the same HTTP resource.
This can be an issue for a persistent HTTP cache.
I'm not a (python|http) expert; I don't know if this is actually an issue or a misuse of requests/caching.
Am I wrong, or missing something?
Cheers
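One way around the nondeterministic ordering would be to canonicalize the URL before using it as a cache key, so that any ordering of the same query parameters maps to the same key. A minimal sketch:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def canonical_cache_key(url):
    """Rebuild the URL with its query parameters in sorted order."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query,
                       parts.fragment))
```

Both orderings from the example above then produce the identical key, at the cost of the cache key no longer being byte-identical to the request URL.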
Is it possible to access the original response/status code of cached responses? This would be useful for measuring cache hits/misses.
Here's my SSCCE:
import github3
import cachecontrol
g = github3.GitHub()
cachecontrol.CacheControl(g._session)
print(g.rate_limit()['resources']['core']) # initial rate limit
print(g.rate_limit()['resources']['core']) # rate_limit - should not count
repository = g.repository('sigmavirus24', 'github3.py')
print(g.rate_limit()['resources']['core']) # get repo data - should count
repository = g.repository('sigmavirus24', 'github3.py')
print(g.rate_limit()['resources']['core']) # get repo again - should be served from cache
In the output, I'm seeing that the rate limit is being ticked off for each g.repository call.
{u'reset': 1438863563, u'limit': 60, u'remaining': 53}
{u'reset': 1438863563, u'limit': 60, u'remaining': 53}
{u'reset': 1438863563, u'limit': 60, u'remaining': 52}
{u'reset': 1438863563, u'limit': 60, u'remaining': 51}
With logging cranked up to the max and tons of logging added (see #93 for the exact code, 50a76aa to be specific), I'm seeing the long trace below.
Conspicuously, there's no Updating cache with response from "http://..." message, which is what I added at https://github.com/toolforger/cachecontrol/blob/50a76aa0f022a47d34c65c13c4c813ecb1f2c086/cachecontrol/controller.py#L228, indicating that cachecontrol.controller.CacheController.cache_response is never called.
Since that call is stashed away in a functools.partial, I have no idea where and when it should have happened, so I have come to a dead end.
INFO:github3:Building a url from ('https://api.github.com', 'rate_limit')
INFO:github3:Missed the cache building the url
DEBUG:github3:GET https://api.github.com/rate_limit with {}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/rate_limit" in the cache
DEBUG:cachecontrol.controller:No cache entry available
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): api.github.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /rate_limit HTTP/1.1" 200 None
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
INFO:github3:Building a url from ('https://api.github.com', 'rate_limit')
DEBUG:github3:GET https://api.github.com/rate_limit with {}
{u'reset': 1438867921, u'limit': 60, u'remaining': 60}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/rate_limit" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:requests.packages.urllib3.connectionpool:"GET /rate_limit HTTP/1.1" 200 None
{u'reset': 1438867922, u'limit': 60, u'remaining': 60}
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
INFO:github3:Building a url from ('https://api.github.com', 'repos', 'sigmavirus24', 'github3.py')
INFO:github3:Missed the cache building the url
DEBUG:github3:GET https://api.github.com/repos/sigmavirus24/github3.py with {}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/repos/sigmavirus24/github3.py" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:requests.packages.urllib3.connectionpool:"GET /repos/sigmavirus24/github3.py HTTP/1.1" 200 None
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
INFO:github3:Building a url from ('https://api.github.com', 'rate_limit')
DEBUG:github3:GET https://api.github.com/rate_limit with {}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/rate_limit" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:requests.packages.urllib3.connectionpool:"GET /rate_limit HTTP/1.1" 200 None
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
INFO:github3:Building a url from ('https://api.github.com', 'repos', 'sigmavirus24', 'github3.py')
DEBUG:github3:GET https://api.github.com/repos/sigmavirus24/github3.py with {}
{u'reset': 1438867922, u'limit': 60, u'remaining': 59}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/repos/sigmavirus24/github3.py" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:requests.packages.urllib3.connectionpool:"GET /repos/sigmavirus24/github3.py HTTP/1.1" 200 None
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
INFO:github3:Building a url from ('https://api.github.com', 'rate_limit')
DEBUG:github3:GET https://api.github.com/rate_limit with {}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/rate_limit" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:requests.packages.urllib3.connectionpool:"GET /rate_limit HTTP/1.1" 200 None
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
{u'reset': 1438867922, u'limit': 60, u'remaining': 58}
I'm unable to install via pip:
$ pip -vvv install cachecontrol
Collecting cachecontrol
Getting page https://pypi.python.org/simple/cachecontrol/
Starting new HTTPS connection (1): pypi.python.org
"GET /simple/cachecontrol/ HTTP/1.1" 200 119
1 location(s) to search for versions of cachecontrol:
* https://pypi.python.org/simple/cachecontrol/
Getting page https://pypi.python.org/simple/cachecontrol/
Analyzing links from page https://pypi.python.org/simple/cachecontrol/
Could not find a version that satisfies the requirement cachecontrol (from versions: )
Cleaning up...
No matching distribution found for cachecontrol
Strangely, the link https://pypi.python.org/pypi/CacheControl shows a current version. Could you try re-releasing to fix this?
In requests issue #2409 there is a discussion of removing the specialized history cache that maintains permanent redirects. A solution would be to recommend CacheControl, but unfortunately, as CacheControl assumes requests will handle 301s, it isn't supported.
CacheControl should handle 301s by always caching the response unless other caching headers are included that change the behavior. Similarly, any cache busting headers should be respected as usual.
This is very useful when dealing with paginated APIs (example: the GitHub API).
When a query has a lot of results, the response is paginated, with header links to follow.
When a query spans several pages, a 304 Not Modified means not modified for all of them, so there is no need to make the conditional query for each of the following pages.
I can't find a way to read directly from the cache without issuing a query first.
Hi, is there any way I can add an object to the cache manually? For example, I have the headers and content, in short, a complete response. Is there any way I can add this to the cache and retrieve it later using the URL?
For example:
To put into the cache I have: headers, response, status codes.
To retrieve from the cache I have: URL, request version, etc.
Any code example would be much appreciated.
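One possible approach, sketched below. Note the caveat: this bypasses CacheControl's own serializer, so entries stored this way are only readable back by your own code; CacheControl will not serve them as cached responses. SimpleCache is a stand-in mirroring the documented get/set/delete cache API.

```python
import pickle

class SimpleCache(object):
    """Stand-in with the same get/set/delete interface as CacheControl caches."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def set(self, key, value):
        self.data[key] = value
    def delete(self, key):
        self.data.pop(key, None)

def store_response(cache, url, status, headers, body):
    # Serialize the pieces you have (status, headers, body) under the URL.
    cache.set(url, pickle.dumps(
        {'status': status, 'headers': headers, 'body': body}))

def load_response(cache, url):
    raw = cache.get(url)
    return pickle.loads(raw) if raw is not None else None
```

Retrieval then only needs the URL, which matches the asymmetry described above.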
When I copy-pasted the example code from http://cachecontrol.readthedocs.org/en/latest/custom_heuristics.html#caching-heuristics into my program to serve as a starting point, my Flake8+PyLint integration warned me of several typos:
- cachecontrol.heuristic doesn't exist. You missed an s at the end of that.
- headers is used without being defined, and then isn't returned.
- calendar is used without being imported.
(Which really makes me glad I already had plans to integration-test my Requests+CacheControl stack using HTTPretty to ensure it's actually behaving as intended. Would you like some of that test code once I've gotten around to writing it?)
I'm using the CacheControlAdapter with a FileCache, and I'm getting a FileNotFoundError when issuing a PUT request. Bug or user error? Stepping through the debugger, it doesn't look like the PUT request is ever cached, so the file to be deleted shouldn't exist; or am I missing something?
> /Users/user/.py34/lib/python3.4/site-packages/cachecontrol/caches/file_cache.py(107)delete()
106 if not self.forever:
--> 107 os.remove(name)
108
ipdb> u
> /Users/user/.py34/lib/python3.4/site-packages/cachecontrol/adapter.py(108)build_response()
107 cache_url = self.controller.cache_url(request.url)
--> 108 self.cache.delete(cache_url)
109
ipdb> u
> /Users/user/.py34/lib/python3.4/site-packages/requests-2.7.0-py3.4.egg/requests/adapters.py(437)send()
435 raise
436
--> 437 return self.build_response(request, resp)
With Python 2.6.4 and the following code snippet:
import requests
from cachecontrol import CacheControl
sess = requests.session()
cached_sess = CacheControl(sess)
response = cached_sess.get('http://example.com')
response = cached_sess.get('http://example.com')
The following error occurs:
$ python test_cache.py
Traceback (most recent call last):
File "test_cache.py", line 10, in <module>
response = cached_sess.get('http://example.com')
File "/home/yen/pyenv/lib/python2.6/site-packages/requests/sessions.py", line 480, in get
return self.request('GET', url, **kwargs)
File "/home/yen/pyenv/lib/python2.6/site-packages/requests/sessions.py", line 468, in request
resp = self.send(prep, **send_kwargs)
File "/home/yen/pyenv/lib/python2.6/site-packages/requests/sessions.py", line 576, in send
r = adapter.send(request, **kwargs)
File "/home/yen/pyenv/lib/python2.6/site-packages/cachecontrol/adapter.py", line 36, in send
cached_response = self.controller.cached_request(request)
File "/home/yen/pyenv/lib/python2.6/site-packages/cachecontrol/controller.py", line 102, in cached_request
resp = self.serializer.loads(request, self.cache.get(cache_url))
File "/home/yen/pyenv/lib/python2.6/site-packages/cachecontrol/serialize.py", line 108, in loads
return getattr(self, "_loads_v{0}".format(ver))(request, data)
File "/home/yen/pyenv/lib/python2.6/site-packages/cachecontrol/serialize.py", line 184, in _loads_v2
return self.prepare_response(request, cached)
File "/home/yen/pyenv/lib/python2.6/site-packages/cachecontrol/serialize.py", line 145, in prepare_response
**cached["response"]
TypeError: __init__() keywords must be strings
Related: pypa/pip#3074
I tried to test for the existence of the cache directory, similar to the included test test_storage_filecache.py, but it does not get created. The forever=True flag does not help, and changing the directory .web_cache to something else doesn't either.
import os.path
import logging
logging.basicConfig(level=logging.DEBUG)
import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache
webcache_dir = ".web_cache"
cache = FileCache(webcache_dir)
sess = CacheControl(requests.Session(), cache=cache)
response = sess.get("http://google.com")
print()
print(cache)
print("%s exists?" % webcache_dir, os.path.exists(webcache_dir))
Attached log:
INFO:urllib3.connectionpool:Starting new HTTP connection (1): google.com
DEBUG:urllib3.connectionpool:Setting read timeout to None
DEBUG:urllib3.connectionpool:"GET / HTTP/1.1" 302 258
INFO:urllib3.connectionpool:Starting new HTTP connection (1): www.google.cz
DEBUG:urllib3.connectionpool:Setting read timeout to None
DEBUG:urllib3.connectionpool:"GET /?gfe_rd=cr&ei=DnKeVs2tOOWI8QfDyYbwDw HTTP/1.1" 200 7699
<cachecontrol.caches.file_cache.FileCache object at 0x7f72120f4b00>
.web_cache exists? False
In #8 it was pointed out that when query string arguments are ordered differently, there is an opportunity for a cache miss, even though the request is logically the same.
While one option would be to ensure the params are ordered, that seems heavy-handed and could be confusing.
For the time being, the docs should be updated with a section about cache usage where this issue can be brought up. It could be called "best practices", "FAQ", or "common pitfalls".
I am planning on packaging CacheControl for Debian and Ubuntu. It's now a dependency for pip and I intend to update pip. However, CacheControl's license is vague.
The only reference to a license at all that I can find is in the setup.py where it says "MIT", but that's actually not clear enough. Do you specifically mean this OSI license:
http://opensource.org/licenses/mit-license.php
?
Could you include an explicit LICENSE.txt file or similar in a future release?
I'm using CacheControl as a wrapper around session with a FileCache. It works great, but, given a URL, I'd like to be able to get the path to the underlying file (if it has been cached). This would greatly speed up my use of it, because without it I need to read the contents of the file through the session, then write them out to the file system again for use in a command external to my program.
I've been looking through the code but I can't find an obvious way to do that. I can directly access the FileCache, but I'm having trouble converting the URL to a cache key. Is this use case supported, or can it be?
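A sketch of the key-to-path mapping, under an explicit assumption: this assumes your installed FileCache names files by hashing the cache key with sha224 (check FileCache.encode in your version; older releases used the encoded key directly, and some versions shard files into subdirectories by hash prefix, which a flat layout here ignores).

```python
import hashlib
import os

def file_cache_path(cache_directory, url):
    """Guess the on-disk path FileCache would use for a URL (see caveats)."""
    hashed = hashlib.sha224(url.encode('utf-8')).hexdigest()
    return os.path.join(cache_directory, hashed)
```

Even if the mapping matches your version, the stored file contains the serialized cache entry, not the raw response body, so some decoding would still be needed before handing the file to an external command.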
There's an ImportError when trying to import cachecontrol.caches. The fix is simple: first try to import cPickle, and on ImportError fall back to importing pickle (same as FileCache).
I have a "cache for the greater of X and the site-specified cache duration" wrapper that I'd like to convert into a CacheControl heuristic but the documentation doesn't explain how the heuristics system actually interacts with Time Priority caching.
As such, I'm forced to dig around in the code to figure out which headers I have to override and in what ways to ensure the desired behaviour. (And while it is very nice, clean code, that's still a documentation bug)
Some caches use non-trivial resources such as files, DB connections, etc. Right now, CacheControlAdapter does not offer a way to close them.
The redis-based cache is buggy for this reason, for example.
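A sketch of the missing hook (a hypothetical API, not something CacheControl provides): a cache that owns a real resource could expose close(), which CacheControlAdapter.close() would forward to.

```python
class ClosableCache(object):
    """Hypothetical cache wrapper that releases its underlying resource."""
    def __init__(self, resource):
        self.resource = resource  # e.g. an open file or a redis connection
        self.closed = False

    def close(self):
        # An adapter-level close() could call this for every cache it owns.
        self.resource.close()
        self.closed = True
```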
I write various scripts which automate tasks in my daily life and, both as a courtesy to site owners and to make it harder for paranoid ones to be jerks, I always incorporate a minimum cache duration. (For most uses, something like "1 hour for 2xx/3xx, 5 minutes for 4xx/5xx". In the past, this was implemented as a separate wrapper layer around httplib2.)
However, sometimes, I do need to force the cache to be ignored. Ideally, I'd like to have both of the following options available to me, but the docs say nothing about either one:
When the underlying urllib3.HTTPResponse is chunked, I don't get any caching.
I believe the issue comes from the fact that requests gets the data out using urllib3.HTTPResponse.stream(), which, if the response is chunked, calls urllib3.HTTPResponse.read_chunked. This avoids directly hitting the read() method on the file object (it calls _safe_read instead) that cachecontrol overrides in the CallbackFileWrapper.
I'm using Python 3.5 and requests 2.8.1.
Redis has excellent support for expiration: when adding a key, you can specify in how many seconds it will expire and be automatically deleted. It would be great if RedisCache could use that when there's a max-age or expires header in the response.
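The idea can be sketched like this, assuming the redis-py client API, where setex(name, time, value) stores a key together with a TTL:

```python
def store_response(conn, key, data, max_age=None):
    """Store a cache entry; let Redis expire it itself when max_age is known."""
    if max_age is None:
        conn.set(key, data)
    else:
        # Redis deletes the key automatically after max_age seconds,
        # so stale entries never accumulate in the store.
        conn.setex(key, max_age, data)
```

The max_age argument would come from the parsed Cache-Control/Expires headers of the response being cached.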
All the decisions that lead to CacheControl entering something into the cache or not, or retrieving something from the cache or not, should be logged at DEBUG level.
Reason: HTTP caching is far more complex and influenced by far more factors than most developers who will use CacheControl are aware of. If developers are searching for the cause of unexpected behaviour (a.k.a. a bug), knowledge of the decision-making will help them determine whether they should look in their own logic, the server's logic, or CacheControl's logic.
Hello,
it would be nice if a SQLAlchemy connection (engine) could be passed to CacheControl (maybe using the cache argument).
See http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html
It would provide support for more databases (MySQL, PostgreSQL, Oracle, Microsoft SQL Server...).
Pandas does something similar:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.sql.read_sql.html
APScheduler as well:
https://apscheduler.readthedocs.org/en/latest/
The cache argument could be:
- a SQLAlchemyCache object
- a SQLAlchemy engine
- a database URI like dialect+driver://username:password@host:port/database
- a table URI like dialect+driver://username:password@host:port/database::tablename
The import would then be:
from cachecontrol.caches import SQLAlchemyCache
Kind regards
Hello,
I tried this code with my internet connection enabled:
import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache
req_session = requests.session()
cache = FileCache('web_cache', forever=True)
session = CacheControl(req_session, cache=cache)
response = session.get('http://www.google.com')
print(response.status_code)
I then disabled my internet connection and ran the code again.
It raised ConnectionError: ('Connection aborted.', gaierror(8, 'nodename nor servname provided, or not known'))
That's probably a misunderstanding on my side, but I thought that if I stored both request and response in a file, I could get them back when my connection was disabled.
I also don't understand why this forever flag exists. In my understanding, we should instead pass a custom caching strategy (aka a caching heuristic) to CacheControl:
class Forever(BaseHeuristic):
pass
and use it like
req_session = requests.session()
cache = FileCache('web_cache')
session = CacheControl(req_session, cache=cache, heuristic=Forever())
response = session.get('http://www.google.com')
print(response.status_code)
Any ideas? But, like I said, it's probably a misunderstanding on my side.
Kind regards
Hello,
see #47.
I'm not sure that "cache heuristics" is a great name... but what I'm suggesting is to create your own heuristic derived from BaseHeuristic and put it in the CacheControl source, so it will be much easier to use (for beginners like me).
class ExpireAfter(BaseHeuristic):
pass
This heuristic would never look at headers; it would just use the current UTC datetime (datetime.utcnow()) to decide whether the data needs to be downloaded again.
It could be used like this:
from requests import Session
from cachecontrol import CacheControl, ExpireAfter
expire_after = 60 * 5 # cache_expiration (seconds) 0: no cache - None: no cache expiration
#ideally expire_after could also be a datetime.timedelta
sess = CacheControl(Session(), heuristic=ExpireAfter(expire_after))
r = sess.get('http://google.com')
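The ExpireAfter stub above might be fleshed out roughly as follows. This is a sketch of the proposal, not CacheControl's API; in real use it would subclass cachecontrol.heuristics.BaseHeuristic, but it is written as a plain class here so the example stands alone.

```python
from datetime import datetime, timedelta
from email.utils import formatdate
import calendar

class ExpireAfter(object):
    """Ignore server cache headers; cache everything for `seconds` seconds."""

    def __init__(self, seconds):
        self.delta = timedelta(seconds=seconds)

    def update_headers(self, response):
        # Overwrite the response's caching headers with a fixed lifetime
        # computed from datetime.utcnow(), as described above.
        expires = datetime.utcnow() + self.delta
        return {
            'expires': formatdate(calendar.timegm(expires.timetuple())),
            'cache-control': 'public',
        }

    def warning(self, response):
        return '110 - "Response is Stale"'
```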
Unfortunately, I don't feel comfortable enough with CacheControl to provide you with the code to do this.
Kind regards
PS: I'm coming here because of https://github.com/kennethreitz/requests/issues/2378
Hi again, I found another case similar to #84, but not in exactly the same place.
URL that is failing: http://wizard2.sbs.co.kr/w3/podcast/V0000372136.xml
Specifically, the decoding chokes on the non-ASCII bytes in this header:
... 'p3p': "CP='\xb0\xa3\xb7\xab\xb9\xe6\xc4\xa7\xb1\xe2\xc8\xa3'" ...
Stacktrace:
File "/home/ubuntu/.virtualenvs/webserver/local/lib/python2.7/site-packages/requests/sessions.py", line 476, in get
return self.request('GET', url, **kwargs)
File "/home/ubuntu/.virtualenvs/webserver/local/lib/python2.7/site-packages/opbeat/instrumentation/packages/base.py", line 63, in __call__
args, kwargs)
File "/home/ubuntu/.virtualenvs/webserver/local/lib/python2.7/site-packages/opbeat/instrumentation/packages/base.py", line 214, in call_if_samp
ling
return wrapped(*args, **kwargs)
File "/home/ubuntu/.virtualenvs/webserver/local/lib/python2.7/site-packages/requests/sessions.py", line 464, in request
resp = self.send(prep, **send_kwargs)
File "/home/ubuntu/.virtualenvs/webserver/local/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
r = adapter.send(request, **kwargs)
File "/home/ubuntu/.virtualenvs/webserver/src/cachecontrol-master/cachecontrol/adapter.py", line 36, in send
cached_response = self.controller.cached_request(request)
File "/home/ubuntu/.virtualenvs/webserver/src/cachecontrol-master/cachecontrol/controller.py", line 102, in cached_request
resp = self.serializer.loads(request, self.cache.get(cache_url))
File "/home/ubuntu/.virtualenvs/webserver/src/cachecontrol-master/cachecontrol/serialize.py", line 114, in loads
return getattr(self, "_loads_v{0}".format(ver))(request, data)
File "/home/ubuntu/.virtualenvs/webserver/src/cachecontrol-master/cachecontrol/serialize.py", line 180, in _loads_v2
for k, v in cached["response"]["headers"].items()
File "/home/ubuntu/.virtualenvs/webserver/src/cachecontrol-master/cachecontrol/serialize.py", line 180, in <genexpr>
for k, v in cached["response"]["headers"].items()
File "/home/ubuntu/.virtualenvs/webserver/src/cachecontrol-master/cachecontrol/serialize.py", line 30, in _b64_decode_str
return _b64_decode_bytes(s).decode("utf8")
File "/home/ubuntu/.virtualenvs/webserver/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 4: invalid start byte
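A sketch of tolerant handling for both this traceback and the earlier one: decode the stored header bytes as UTF-8 when possible and fall back to latin-1 (which can decode any byte sequence) rather than raising.

```python
def decode_header_bytes(raw):
    """Decode cached header bytes without dying on non-UTF-8 servers."""
    try:
        return raw.decode('utf8')
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this never fails;
        # the header value may be mojibake, but caching keeps working.
        return raw.decode('latin-1')
```

The decoded text is not guaranteed to be what the server meant (the bytes above are likely EUC-KR), but for headers CacheControl never interprets, a lossy decode is better than an exception.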
After redis gets sorted out, memcache please? :)
I'm encountering this exception when fetching some URLs:
UnicodeDecodeError
'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
The function that is raising is this:
def _b64_encode_str(s):
return _b64_encode_bytes(s.encode("utf8"))
And some example data that some HTTP servers seem to be sending is for example: '\u201cmax-age=31536000\u2033'
It would be great if CacheControl could handle these edge cases without dying.
Thanks for a great package!
I'm not seeing any logging statements.
The issue I'm having right now is that cachecontrol isn't caching responses I'd have expected to be cached, and I have no idea why, not even after single-stepping through it: I'm seeing all the points where it decides it's not going to cache, and I don't know which of those decisions is right (or maybe all of them are right and the website just isn't sending the right headers).
The 'lockfile' package required for file caching is deprecated.
https://pypi.python.org/pypi/fasteners is the recommended alternative.
FileCache stopped working for me with Python 3 and PyPy, and I think I tracked down the cause to a recent change to urllib3's is_fp_closed utility.
From https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/filewrapper.py#L32
# Is this the best way to figure out if the file has been completely
# consumed?
if is_fp_closed(self.__fp):
self.__callback(self.__buf.getvalue())
In my Python 3 and PyPy environments, is_fp_closed was never returning True. Reverting the changes in urllib3/urllib3#435 fixed it.
I tried cloning urllib3 and running the tox tests to dig in further but couldn't get the tests to run, and thought my best next step would be reporting here.
It may be that cachecontrol's code is just fine and the issue is in urllib3, but I figured I'd confirm here first. Does that look like the problem?
Thanks in advance for taking a look!
See comment on 5e8e8a6#commitcomment-12816743
IIUC, currently cache stores are allowed to grow without bound, is that right? If so, is there any interest in adding support for setting a max size on the cache store? If you're using a FileCache, for instance, and have a limited amount of disk space (e.g. you're on the AWS free tier), this would be useful.
For the simplest possible first pass at this, once the max size is reached, the cache store could simply decline to cache anything further, ideally notifying the user somehow (maybe if you use this setting you can pass in a logger). A less naive implementation could intelligently select items to evict from the cache to make room for additional ones, e.g. on the basis of hit rate. (Item with lowest hit rate gets evicted first; ties could be broken with item size and/or least recency of access. Actually, should look at how browsers do it.)
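The naive first pass described above can be sketched as a dict-backed cache that declines new entries once a byte budget is spent (no eviction yet); SizeCappedDictCache is hypothetical, not a CacheControl class, and mirrors the get/set/delete cache interface.

```python
class SizeCappedDictCache(object):
    """Dict-backed cache that refuses new entries past a byte budget."""

    def __init__(self, max_bytes, logger=None):
        self.max_bytes = max_bytes
        self.used = 0
        self.data = {}
        self.logger = logger  # optional, for notifying the user

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.delete(key)  # replacing an entry frees its old bytes first
        if self.used + len(value) > self.max_bytes:
            if self.logger is not None:
                self.logger.warning('cache full; not storing %s', key)
            return
        self.data[key] = value
        self.used += len(value)

    def delete(self, key):
        if key in self.data:
            self.used -= len(self.data.pop(key))
```

A smarter version would evict instead of declining, using hit rate, entry size, or recency as suggested above.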
Thanks for your consideration, and for all the great work on cachecontrol!
This is my code
wikisess = requests.Session()
wikises = CacheControl(wikisess)
search = wikises.get("http://wiki.roblox.com/api.php?format=json&action=query&list=search&srwhat=title&srsearch=%s"%query)
If I were to request the same URL again, even with a different query variable, it says this:
File "c:\Anaconda\lib\site-packages\requests\sessions.py", line 477, in get
return self.request('GET', url, **kwargs)
File "c:\Anaconda\lib\site-packages\requests\sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "c:\Anaconda\lib\site-packages\requests\sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "c:\Anaconda\lib\site-packages\cachecontrol\adapter.py", line 46, in send
resp = super(CacheControlAdapter, self).send(request, **kw)
File "c:\Anaconda\lib\site-packages\requests\adapters.py", line 415, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ResponseNotReady())
This is an issue with filewrapper.py
urllib3/response.py has:
if self.chunked:
for line in self.read_chunked(amt, decode_content=decode_content):
yield line
else:
while not is_fp_closed(self._fp):
data = self.read(amt=amt, decode_content=decode_content)
...
This causes chunked requests to bypass the read implementation in CallbackFileWrapper, and the callback never gets called. The chunked code ends up calling self._fp._safe_read.
I was able to verify this by setting
chunked_transfer_encoding off;
in an nginx config and seeing that responses are then cached.
I think this might be related to #95. fwiw tracking this down only took me 20 minutes.
Hi,
using FileCache, I had a problem with the length of b64-encoded keys.
I'm playing with a webservice which leads me to request quite long URLs.
It is not absurd to reach 200-character-long URLs, which in turn produce even longer b64-encoded strings.
For instance, the largest URL found in the examples of this "Rosetta Stone" webservice [0] from EchoNest is already 145 characters long, which makes 196 once b64-encoded.
That leaves 54 characters of headroom before the 255-character limit of most filesystems.
Of course I hit that limit saving the cache in an XDG sub-folder of my home directory (on linux/ext4) :D
As I said, these kinds of long URLs could be quite common while playing with such webservices, so I don't think this is an unusual use case.
Then why not use the md5 or sha2 hash functions? They both provide fixed-length keys and avoid the problem.
I've started using md5, since there are no security issues with collision attacks here IMHO (but I'm far from a security expert):
from hashlib import md5

def encode(self, x):
    return md5(x.encode()).hexdigest()
I used md5 to ensure it is always available in the Python distribution (same with sha1, sha224, sha256, sha384, sha512, cf. [1]).
Cheers
[0] http://developer.echonest.com/docs/v4/index.html#project-rosetta-stone
[1] http://docs.python.org/library/hashlib.html
According to the spec, you're supposed to return warning 113 if your cache has intentionally served up something more than 24 hours stale. However, warning isn't given access to the response's information, so I can't check how old the response is.
As is, I'm probably going to wind up overriding BaseHeuristic.apply instead, which grates on me since the docs say nothing about its API stability.
Otherwise-cacheable chunked responses (indicated by the response header Transfer-Encoding: chunked) are not cached by CacheControl when requests>=2.6.1 is used.
It seems this issue appeared with the inclusion of urllib3/urllib3@f21c2a2 in the requests codebase via kennethreitz/requests@5fcd843.
Example script demonstrating the issue:
In [1]: import requests
In [2]: from requests.packages import urllib3
In [3]: from cachecontrol import CacheControl
In [4]: urllib3.disable_warnings() # https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning
In [5]: sess = requests.session()
In [6]: cached_sess = CacheControl(sess)
In [7]: # non-chunked response
In [8]: response = cached_sess.get('http://httpbin.org/cache/60')
In [9]: response.headers.get('transfer-encoding') is None
Out[9]: True
In [10]: response.from_cache
Out[10]: False
In [11]: response = cached_sess.get('http://httpbin.org/cache/60')
In [12]: response.from_cache
Out[12]: True
In [13]: # chunked response
In [14]: response = cached_sess.get('https://tor.eff.org/')
In [15]: response.headers.get('transfer-encoding') == 'chunked'
Out[15]: True
In [16]: response.from_cache
Out[16]: False
In [17]: response = cached_sess.get('https://tor.eff.org/')
In [18]: response.from_cache
Out[18]: False
In [19]:
Hello,
it would be nice to provide a sample caching heuristic for "expire_after" (like requests_cache has).
It should ignore all cache headers and just cache the data for the time you specify.
Kind regards
https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/cache.py#L3 says:
The cache object API for implementing caches. The default is just a dictionary, which in turns means it is not threadsafe for writing.
But http://cachecontrol.readthedocs.org/en/latest/storage.html#dictcache says "It is a simple threadsafe dictionary", and the set and delete implementations do in fact lock: https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/cache.py#L30
Currently (I think) the caching code assumes that if we call read() on the file and it's closed, then we've successfully downloaded the file. This caused issues for someone in pip (see https://bitbucket.org/zzzeek/sqlalchemy/issue/3447/unable-to-install-sqlalchemy). I'm not sure what the best way of handling this would be.
The latest on PyPI is 0.10.4.
CacheController.cached_request clears out cache entries if they were stale, even if staleness happened just because the cache entry was older than what the request required.
A later request may tolerate a higher age, so the cache entry could still be useful.
In the following scenario, this will cause a refetch that should have been served from the cache:
In the following scenario, this will fail to return a cached entry where it could have:
Admittedly, neither scenario is very likely, but given that caching is such a foundational part that people expect to "just work", I think that even this stone shouldn't be left unturned.
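To illustrate with numbers (a sketch under my own assumptions, not CacheControl's actual logic): the effective freshness limit is the minimum of the response's and the request's max-age, so an entry that is stale for one strict request can still be fresh for a later, more tolerant one, and deleting it on the first miss discards a usable entry.

```python
def is_fresh(entry_age, response_max_age, request_max_age=None):
    """Freshness check: the request's max-age, when given, can only
    tighten the response's limit, never loosen it."""
    limit = response_max_age
    if request_max_age is not None:
        limit = min(limit, request_max_age)
    return entry_age <= limit

# A strict request misses, but the same 300s-old entry would still
# satisfy a later request that sets no max-age of its own:
is_fresh(300, 600, request_max_age=100)  # False
is_fresh(300, 600)                       # True
```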
Hello,
it would be nice if CacheControl could provide SQLite support.
Kind regards
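A minimal sketch of what that could look like, following the get/set/delete shape of cachecontrol's cache API; the class name and table schema are my own, not anything the library ships:

```python
import sqlite3
import threading

class SQLiteCache:
    """Hypothetical SQLite-backed cache with the same get/set/delete
    surface as cachecontrol's dictionary cache."""

    def __init__(self, path):
        self.lock = threading.Lock()
        self.conn = sqlite3.connect(path, check_same_thread=False)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, value BLOB)")

    def get(self, key):
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    def set(self, key, value):
        with self.lock:
            self.conn.execute(
                "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
                (key, value))
            self.conn.commit()

    def delete(self, key):
        with self.lock:
            self.conn.execute("DELETE FROM cache WHERE key = ?", (key,))
            self.conn.commit()
```

Passing ":memory:" as the path gives an in-process cache for testing; a file path persists entries across runs.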
https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/controller.py#L120 ff. seems to be ignoring cache entries that do not have a date header.
I think https://tools.ietf.org/html/rfc7232#section-2.4 allows sending etag without date; in that case, the cache content would never be used.
Also, I'm wondering why it's deleting a cache entry here; if a cache entry does not have the headers required to make it useful, wouldn't it be better to never enter it into the cache in the first place?
(I may be misunderstanding things, grossly; I'm currently looking only at the code that retrieves data from the cache, in cached_response.)
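One way to handle an etag-only response (my assumption about a possible fix, not current behavior) would be to fall back to the time the response was received when no Date header is present, instead of refusing to use the entry:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def response_date(headers, received_at=None):
    """Use the Date header when present; otherwise fall back to the
    time the response was received (hypothetical fallback)."""
    raw = headers.get('date')
    if raw is not None:
        return parsedate_to_datetime(raw)
    return received_at or datetime.now(timezone.utc)
```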
The docs should mention something about the potential pitfalls of using non-timezone-aware dates. This is something that both servers and clients should consider, and it is relevant for CacheControl when used with clients written by the same organization as the service.
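The pitfall in a nutshell: Python refuses to order naive and timezone-aware datetimes, so a server emitting naive timestamps and a client parsing them as UTC (or vice versa) will either crash or silently be hours off. A minimal demonstration:

```python
from datetime import datetime, timezone

naive = datetime(2024, 1, 1, 12, 0, 0)                       # no tzinfo
aware = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

try:
    naive < aware
except TypeError as exc:
    print("can't compare:", exc)  # ordering naive vs aware raises

# Attaching the intended zone explicitly removes the ambiguity:
assert naive.replace(tzinfo=timezone.utc) == aware
```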
Hi,
thanks for this lib, but I think the LastModified heuristic is not working. Here is a sample:
In [1]: import requests
In [2]: from cachecontrol import CacheControlAdapter
In [3]: from cachecontrol.heuristics import LastModified
In [4]: adapter = CacheControlAdapter(heuristic=LastModified())
In [5]: sess = requests.Session()
In [6]: sess.mount('http://', adapter)
In [7]: sess.mount('https://', adapter)
In [8]: r = sess.get("https://app.roihunter.com/data/example-feed-roi-hunter.xml")
In [9]: r.from_cache
Out[9]: False
In [10]: r = sess.get("https://app.roihunter.com/data/example-feed-roi-hunter.xml")
In [11]: r.from_cache
Out[11]: False
Expected behavior is that the second call of r.from_cache (Out[11]) returns True.
Can you explain the behavior to me or fix it, please? Thanks
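For context on what a Last-Modified heuristic is supposed to do (my paraphrase of the common rule from RFC 7234 §4.2.2, not CacheControl's exact code): the freshness lifetime is taken as 10% of the interval between Date and Last-Modified. A response with no Last-Modified header, or one equal to Date, therefore gets no freshness at all, which would be one explanation for from_cache staying False.

```python
import calendar
from email.utils import parsedate

def heuristic_expiry(date_hdr, last_modified_hdr):
    """Expiry timestamp = Date + 10% of (Date - Last-Modified)."""
    date = calendar.timegm(parsedate(date_hdr))
    last_modified = calendar.timegm(parsedate(last_modified_hdr))
    return date + (date - last_modified) // 10

# A resource last modified 10 days ago is considered fresh for 1 day:
date = 'Thu, 11 Jan 2024 00:00:00 GMT'
lm = 'Mon, 01 Jan 2024 00:00:00 GMT'
heuristic_expiry(date, lm) - calendar.timegm(parsedate(date))  # 86400
```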
I'm able to retrieve the path where contents will be written by a FileCache object, however as the function's docstring indicates there's no guarantee that this file will exist. What I'm finding is that for most URLs I try, the contents are written immediately as I perform the request, see the two examples below.
When I do the following, the path doesn't exist:
In [54]: from cachecontrol.caches import FileCache
In [55]: from cachecontrol.caches.file_cache import url_to_file_path
In [56]: from contextlib import contextmanager
In [57]: from tempfile import gettempdir
In [58]: file_cache = FileCache(gettempdir())
In [60]: from cachecontrol import CacheControl
In [61]: sess = CacheControl(requests.Session(), cache=file_cache)
In [62]: url = ('http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?retmax=1'
....: '00&retmode=text&tool=skbio&db=nucleotide&id=459567&rettype=fas'
....: 'ta&retstart=0&[email protected]')
In [63]: req = sess.get(url)
In [64]: cached_fp = url_to_file_path(url, file_cache)
In [65]: from os.path import exists
In [67]: _ = req.content
# the file doesn't seem to exist
In [68]: exists(cached_fp)
Out[68]: False
However, if I try this with another URL that has the same contents, the file exists.
In [69]: other_url = 'https://gist.githubusercontent.com/ElDeveloper/e0144eaf196f3a641409/raw/f14a6ff47b880537da3067b322526a91124ff742/-'
In [70]: req = sess.get(other_url)
In [71]: cached_fp = url_to_file_path(other_url, file_cache)
# the file exists YAY :D
In [72]: exists(cached_fp)
Out[72]: True
I don't know if this is something specific to the server I'm trying to retrieve the data from or if this is an expected behavior of FileCache. Any help is greatly appreciated.
According to the RFC, "By default, a response is cacheable if the requirements of the request method, request header fields, and the response status indicate that it is cacheable."
For this reason, and also because it's the behavior of other clients such as browsers, I would expect a simple 200 response to a simple GET request to be cached by CacheControl if there are no headers limiting the caching. CacheControl should cache these indefinitely. There should not need to be any flag such as in #18 to invoke this behavior.
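A sketch of the rule being asked for (helper name is mine; the status list is the set RFC 7231 §6.1 marks as cacheable by default):

```python
# Status codes RFC 7231 defines as cacheable by default.
HEURISTICALLY_CACHEABLE = {200, 203, 204, 206, 300, 301,
                           404, 405, 410, 414, 501}

def cacheable_by_default(method, status, headers):
    """True when nothing about the exchange forbids caching, even
    without explicit freshness headers."""
    if method != 'GET':
        return False
    if status not in HEURISTICALLY_CACHEABLE:
        return False
    cache_control = headers.get('cache-control', '')
    return 'no-store' not in cache_control
```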