cachecontrol's Issues

BaseHeuristic is too low-level to be the only API offered

I've been thinking about the design of BaseHeuristic. While the new design seems good as a base class (for the uses I've thought of so far), I think that, for common-case uses, it's going to result in a lot of people who don't have the time to become HTTP 1.1 experts reinventing boilerplate and introducing subtle bugs in their header parsing and resetting code.

There should probably be a subclass with a name like SimpleBaseHeuristic where, instead of update_headers(self, response) and warning(self, response), you override set_expiry(self, response, current_expiry) and return the new expiry value.

(That is, a subclass which provides a single, shared implementation of translating multiple subtly different HTTP headers into one "this is how long it'll last" number and back again, and which also handles the Warning 110/113 selection automatically.)

If you don't beat me to it, I'll try to write it as soon as I can figure out the least hacky way to call CacheController.parse_cache_control from a BaseHeuristic subclass so I can access the value of max-age without duplicating the Cache-Control parsing code.
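For what it's worth, the max-age extraction the author wants to reuse is small enough to sketch without touching CacheController. This is a hypothetical, simplified parser, not the real CacheController.parse_cache_control, which handles more directives and edge cases:

```python
def parse_cache_control(header_value):
    """Minimal sketch of Cache-Control parsing: split on commas and
    collect "name" or "name=value" directives into a dict."""
    directives = {}
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        if "=" in part:
            name, _, value = part.partition("=")
            directives[name.strip().lower()] = value.strip().strip('"')
        else:
            directives[part.lower()] = None
    return directives

print(parse_cache_control("public, max-age=3600"))
# {'public': None, 'max-age': '3600'}
```

A heuristic could then read `directives.get("max-age")` without duplicating the controller's code.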

params order in query

req = requests.get('http://example.org', params={'test': 'ok', 'api_key': 's3cr3t', 'bqj': 'None'})
print(req.url)
# 'http://example.org/?bqj=None&api_key=s3cr3t&test=ok'

# ...then, in another Python interpreter instance, the same code might give:
print(req.url)
# 'http://example.org/?test=ok&api_key=s3cr3t&bqj=None'

I noticed, for the same set of HTTP data, that the order of the query part (from parse_uri) might change from one run to another.

Since the key for the cache is computed from the URI, we might end up with two different keys for the same HTTP resource.

This can be an issue for a persistent HTTP cache.

I'm not a (python|http) expert, so I don't know if this is actually an issue or a misuse of requests/caching.

Am I wrong, missing something?

Cheers
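Until keys are normalized by the library itself, one workaround (a minimal sketch using only the standard library) is to sort the query string before the URL is used as a cache key, so logically identical requests always produce the same URI:

```python
from urllib.parse import urlencode, urlparse, parse_qsl, urlunparse

def canonical_url(url):
    """Return the URL with its query parameters sorted, so logically
    identical requests map to the same cache key."""
    parts = urlparse(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunparse(parts._replace(query=query))

print(canonical_url("http://example.org/?bqj=None&api_key=s3cr3t&test=ok"))
# http://example.org/?api_key=s3cr3t&bqj=None&test=ok
```

Applying this to both orderings from the report yields identical URLs, and therefore identical cache keys.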

Not getting the caching I expected

Here's my SSCCE (short, self-contained, correct example):

import github3
import cachecontrol
g = github3.GitHub()
cachecontrol.CacheControl(g._session)
print(g.rate_limit()['resources']['core']) # initial rate limit
print(g.rate_limit()['resources']['core']) # rate_limit - should not count
repository = g.repository('sigmavirus24', 'github3.py')
print(g.rate_limit()['resources']['core']) # get repo data - should count
repository = g.repository('sigmavirus24', 'github3.py')
print(g.rate_limit()['resources']['core']) # get repo again - should be served from cache

In the output, I'm seeing that the rate limit is being ticked off for each g.repository call.

{u'reset': 1438863563, u'limit': 60, u'remaining': 53}
{u'reset': 1438863563, u'limit': 60, u'remaining': 53}
{u'reset': 1438863563, u'limit': 60, u'remaining': 52}
{u'reset': 1438863563, u'limit': 60, u'remaining': 51}

With logging cranked up to the max and tons of logging added (see #93 for the exact code, 50a76aa to be specific), I'm seeing the long trace below.
Conspicuously, there's no Updating cache with response from "http://...", which is what I added at https://github.com/toolforger/cachecontrol/blob/50a76aa0f022a47d34c65c13c4c813ecb1f2c086/cachecontrol/controller.py#L228, which suggests that cachecontrol.controller.CacheController.cache_response is never called.
Since that call is stashed away in a functools.partial, I have no idea where and when it should have happened, so I have hit a dead end.

INFO:github3:Building a url from ('https://api.github.com', 'rate_limit')
INFO:github3:Missed the cache building the url
DEBUG:github3:GET https://api.github.com/rate_limit with {}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/rate_limit" in the cache
DEBUG:cachecontrol.controller:No cache entry available
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): api.github.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /rate_limit HTTP/1.1" 200 None
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
INFO:github3:Building a url from ('https://api.github.com', 'rate_limit')
DEBUG:github3:GET https://api.github.com/rate_limit with {}
{u'reset': 1438867921, u'limit': 60, u'remaining': 60}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/rate_limit" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:requests.packages.urllib3.connectionpool:"GET /rate_limit HTTP/1.1" 200 None
{u'reset': 1438867922, u'limit': 60, u'remaining': 60}
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
INFO:github3:Building a url from ('https://api.github.com', 'repos', 'sigmavirus24', 'github3.py')
INFO:github3:Missed the cache building the url
DEBUG:github3:GET https://api.github.com/repos/sigmavirus24/github3.py with {}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/repos/sigmavirus24/github3.py" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:requests.packages.urllib3.connectionpool:"GET /repos/sigmavirus24/github3.py HTTP/1.1" 200 None
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
INFO:github3:Building a url from ('https://api.github.com', 'rate_limit')
DEBUG:github3:GET https://api.github.com/rate_limit with {}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/rate_limit" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:requests.packages.urllib3.connectionpool:"GET /rate_limit HTTP/1.1" 200 None
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
INFO:github3:Building a url from ('https://api.github.com', 'repos', 'sigmavirus24', 'github3.py')
DEBUG:github3:GET https://api.github.com/repos/sigmavirus24/github3.py with {}
{u'reset': 1438867922, u'limit': 60, u'remaining': 59}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/repos/sigmavirus24/github3.py" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:requests.packages.urllib3.connectionpool:"GET /repos/sigmavirus24/github3.py HTTP/1.1" 200 None
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
INFO:github3:Building a url from ('https://api.github.com', 'rate_limit')
DEBUG:github3:GET https://api.github.com/rate_limit with {}
DEBUG:cachecontrol.controller:Looking up "https://api.github.com/rate_limit" in the cache
DEBUG:cachecontrol.controller:No cache entry available
DEBUG:requests.packages.urllib3.connectionpool:"GET /rate_limit HTTP/1.1" 200 None
INFO:github3:Attempting to get JSON information from a Response with status code 200 expecting 200
INFO:github3:JSON was returned
{u'reset': 1438867922, u'limit': 60, u'remaining': 58}

cachecontrol not available on pypi

I'm unable to install via pip:

$ pip -vvv install cachecontrol
Collecting cachecontrol
  Getting page https://pypi.python.org/simple/cachecontrol/
  Starting new HTTPS connection (1): pypi.python.org
  "GET /simple/cachecontrol/ HTTP/1.1" 200 119
  1 location(s) to search for versions of cachecontrol:
  * https://pypi.python.org/simple/cachecontrol/
  Getting page https://pypi.python.org/simple/cachecontrol/
  Analyzing links from page https://pypi.python.org/simple/cachecontrol/
  Could not find a version that satisfies the requirement cachecontrol (from versions: )
Cleaning up...
No matching distribution found for cachecontrol

Strangely, the link to https://pypi.python.org/pypi/CacheControl shows a current version. Could you try re-releasing to fix this?

Support caching 301 Moved Permanently

In requests issue #2409 there is a discussion of removing the specialized history cache that maintains permanent redirects. A solution would be to recommend CacheControl, but unfortunately that isn't currently possible: CacheControl assumes requests will handle 301s, so caching them isn't supported.

CacheControl should handle 301s by always caching the response unless other caching headers are included that change the behavior. Similarly, any cache busting headers should be respected as usual.
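One plausible reading of that rule, expressed as a sketch (the function name and header handling here are made up for illustration, not CacheControl API):

```python
def should_cache_permanent_redirect(status, headers):
    """Cache a 301 unconditionally, unless cache-busting directives in
    the response say otherwise."""
    if status != 301:
        return False
    cc = headers.get("cache-control", "").lower()
    return "no-store" not in cc and "no-cache" not in cc

print(should_cache_permanent_redirect(301, {}))                              # True
print(should_cache_permanent_redirect(301, {"cache-control": "no-store"}))   # False
```

Other headers (e.g. max-age) would still be respected as usual once the entry is stored.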

Documentation doesn't explain how to force read from cache even if expired

This is very useful when dealing with paginated APIs (for example, the GitHub API).

When a query has many results, the response is paginated, with header links to follow.
When a query spans several pages, a 304 Not Modified means not modified for all of them, so there is no need to make the conditional request for each subsequent page.
I can't find a way to read directly from the cache without attempting a request first.

adding extra objects

Hi, is there any way that I can add an object to the cache myself? For example, I have the headers and content (in short, a complete response). Is there any way that I can add this to the cache and retrieve it later using the URL?
For example:
To put into the cache I have:
headers, response body, status code

To retrieve from the cache I have:
URL, request version, etc.

any code example would be much appreciated.
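CacheControl's own serializer format is an internal detail, but the shape of the API being asked for can be sketched with a plain dict-backed store keyed by URL. This is a hypothetical illustration, not the library's cache interface:

```python
import json

class ManualCache:
    """Hypothetical sketch: stash a complete response under its URL and
    fetch it back later. The real cachecontrol serializer stores more
    state (request headers, vary data, etc.)."""

    def __init__(self):
        self._store = {}

    def put(self, url, status, headers, body):
        # Serialize the response parts so the stored value is a flat string.
        self._store[url] = json.dumps(
            {"status": status, "headers": headers, "body": body})

    def get(self, url):
        raw = self._store.get(url)
        return json.loads(raw) if raw is not None else None

cache = ManualCache()
cache.put("http://example.org/", 200, {"content-type": "text/html"}, "<html/>")
print(cache.get("http://example.org/")["status"])  # 200
```

A real integration would instead go through the cache object passed to CacheControl, using the same key the controller derives from the URL.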

"Caching Heuristics" example code has two typos

When I copy-pasted the example code from http://cachecontrol.readthedocs.org/en/latest/custom_heuristics.html#caching-heuristics into my program to serve as a starting point, my Flake8+PyLint integration warned me of several typos:

  1. cachecontrol.heuristic doesn't exist. You missed an s at the end of that.
  2. You use headers without defining it and then don't return it.
  3. You didn't import calendar before using it.

(Which really makes me glad I already had plans to integration-test my Requests+CacheControl stack using HTTPretty to ensure it's actually behaving as intended. Would you like some of that test code once I've gotten around to writing it?)

FileNotFound error with PUT and FileCache

I'm using the CacheControlAdapter with a FileCache and I'm getting a FileNotFound error when issuing a PUT request. Bug or user error? Stepping through the debugger, it doesn't look like the PUT request is ever cached, so the file to be deleted shouldn't exist; or am I missing something?

> /Users/user/.py34/lib/python3.4/site-packages/cachecontrol/caches/file_cache.py(107)delete()
    106         if not self.forever:
--> 107             os.remove(name)
    108

ipdb> u
> /Users/user/.py34/lib/python3.4/site-packages/cachecontrol/adapter.py(108)build_response()
    107             cache_url = self.controller.cache_url(request.url)
--> 108             self.cache.delete(cache_url)
    109

ipdb> u
> /Users/user/.py34/lib/python3.4/site-packages/requests-2.7.0-py3.4.egg/requests/adapters.py(437)send()
    435                 raise
    436
--> 437         return self.build_response(request, resp)

Not working with Python < 2.6.5

With Python 2.6.4 and the following code snippet:

import requests
from cachecontrol import CacheControl

sess = requests.session()
cached_sess = CacheControl(sess)

response = cached_sess.get('http://example.com')
response = cached_sess.get('http://example.com')

The following error occurs:

$ python test_cache.py
Traceback (most recent call last):
  File "test_cache.py", line 10, in <module>
    response = cached_sess.get('http://example.com')
  File "/home/yen/pyenv/lib/python2.6/site-packages/requests/sessions.py", line 480, in get
    return self.request('GET', url, **kwargs)
  File "/home/yen/pyenv/lib/python2.6/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/yen/pyenv/lib/python2.6/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/home/yen/pyenv/lib/python2.6/site-packages/cachecontrol/adapter.py", line 36, in send
    cached_response = self.controller.cached_request(request)
  File "/home/yen/pyenv/lib/python2.6/site-packages/cachecontrol/controller.py", line 102, in cached_request
    resp = self.serializer.loads(request, self.cache.get(cache_url))
  File "/home/yen/pyenv/lib/python2.6/site-packages/cachecontrol/serialize.py", line 108, in loads
    return getattr(self, "_loads_v{0}".format(ver))(request, data)
  File "/home/yen/pyenv/lib/python2.6/site-packages/cachecontrol/serialize.py", line 184, in _loads_v2
    return self.prepare_response(request, cached)
  File "/home/yen/pyenv/lib/python2.6/site-packages/cachecontrol/serialize.py", line 145, in prepare_response
    **cached["response"]
TypeError: __init__() keywords must be strings

Related: pypa/pip#3074
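The crash comes from CPython versions before 2.6.5 rejecting unicode keys in `**kwargs` expansion. A sketch of the defensive fix (the helper name is made up) is to coerce the deserialized dict's keys to native str before expanding it:

```python
def str_keys(d):
    """Coerce every key to a native str so that **-expansion works on
    Python < 2.6.5, which rejects unicode keyword names."""
    return dict((str(k), v) for k, v in d.items())

def make_response(**kwargs):
    # Stand-in for the HTTPResponse(**cached["response"]) call that fails.
    return kwargs

cached = {u"status": 200, u"reason": u"OK"}
print(make_response(**str_keys(cached)))
```

On Python 3 this is a no-op, so the same code path works on both major versions.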

FileCache does not work

I tried to test for the existence of the cache directory, similar to the included test test_storage_filecache.py, but it does not get created. The forever=True flag does not help; changing the directory .web_cache to something else doesn't either.

import os.path
import logging
logging.basicConfig(level=logging.DEBUG)

import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache

webcache_dir = ".web_cache"
cache = FileCache(webcache_dir)
sess = CacheControl(requests.Session(), cache=cache)
response = sess.get("http://google.com")

print()
print(cache)
print("%s exists?" % webcache_dir, os.path.exists(webcache_dir))

Attached log:

INFO:urllib3.connectionpool:Starting new HTTP connection (1): google.com
DEBUG:urllib3.connectionpool:Setting read timeout to None
DEBUG:urllib3.connectionpool:"GET / HTTP/1.1" 302 258
INFO:urllib3.connectionpool:Starting new HTTP connection (1): www.google.cz
DEBUG:urllib3.connectionpool:Setting read timeout to None
DEBUG:urllib3.connectionpool:"GET /?gfe_rd=cr&ei=DnKeVs2tOOWI8QfDyYbwDw HTTP/1.1" 200 7699

<cachecontrol.caches.file_cache.FileCache object at 0x7f72120f4b00>
.web_cache exists? False

Update documentation with a notice regarding GET param ordering and caching

In #8 it was pointed out that when query string arguments are ordered differently it presents the opportunity for a cache miss, even though the request is logically the same.

While one option would be to ensure the params are ordered, that seems heavy handed in that it could be confusing.

For the time being the docs should be updated with a section about cache usages where this issue can be brought up. It can be called "best practices", "faq" or "common pitfalls".

Please clarify license

I am planning on packaging CacheControl for Debian and Ubuntu. It's now a dependency for pip and I intend to update pip. However, CacheControl's license is vague.

The only reference to a license at all that I can find is in the setup.py where it says "MIT", but that's actually not clear enough. Do you specifically mean this OSI license:

http://opensource.org/licenses/mit-license.php

?

Could you include an explicit LICENSE.txt file or similar in a future release?

Is there a good way to get the path of the underlying cached file for a given URL?

I'm using CacheControl as a wrapper around session with a FileCache. It works great, but, given a URL, I'd like to be able to get the path to the underlying file (if it has been cached). This would greatly speed up my use of it, because without it I need to read the contents of the file through the session, then write them out to the file system again for use in a command external to my program.

I've been looking through the code but I can't find an obvious way to do that. I can directly access the FileCache, but I'm having trouble converting the URL to a cache key. Is this use case supported, or can it be?
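As a rough guess at what such a lookup might look like: the sketch below hashes the URL and builds a fanned-out path. The actual key scheme and directory layout of FileCache are internals that can differ between versions, so treat every detail here (sha224, the two-level fan-out, the function name) as an assumption to verify against your installed copy before relying on it:

```python
import hashlib
import os

def guess_cache_path(directory, url):
    # Hypothetical: derive a file path from the URL the way a
    # hash-named file cache might. Not guaranteed to match FileCache.
    digest = hashlib.sha224(url.encode("utf8")).hexdigest()
    return os.path.join(directory, digest[:2], digest[2:4], digest)

print(guess_cache_path(".web_cache", "http://example.org/"))
```

Exposing a supported `FileCache`-level "path for URL" helper would avoid users reverse-engineering this at all.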

RedisCache doesn't work with Python 3

There's an ImportError when trying to import cachecontrol.caches. The fix is simple: first try to import cPickle and, on ImportError, fall back to importing pickle (the same as FileCache does).
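The suggested fix, as a runnable sketch: prefer the C implementation where it exists (Python 2), and fall back to the plain pickle module on Python 3, where cPickle is gone:

```python
# Mirror the import dance FileCache already does.
try:
    import cPickle as pickle  # Python 2: faster C implementation
except ImportError:
    import pickle             # Python 3: cPickle no longer exists

data = pickle.dumps({"cached": True})
print(pickle.loads(data))  # {'cached': True}
```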

Documentation doesn't explain algorithm used to determine whether to cache based on time

I have a "cache for the greater of X and the site-specified cache duration" wrapper that I'd like to convert into a CacheControl heuristic but the documentation doesn't explain how the heuristics system actually interacts with Time Priority caching.

As such, I'm forced to dig around in the code to figure out which headers I have to override and in what ways to ensure the desired behaviour. (And while it is very nice, clean code, that's still a documentation bug)

Documentation doesn't explain how to force-invalidate a cached value

I write various scripts which automate tasks in my daily life and, both as a courtesy to site owners and to make it harder for paranoid ones to be jerks, I always incorporate a minimum cache duration. (For most uses, something like "1 hour for 2xx/3xx, 5 minutes for 4xx/5xx"; in the past, this was implemented as a separate wrapper layer around httplib2.)

However, sometimes, I do need to force the cache to be ignored. Ideally, I'd like to have both of the following options available to me, but the docs say nothing about either one:

  1. When using Time Priority caching, I'd like a way to invalidate the time and just use the ETag on a per-request basis.
  2. For either caching strategy, I'd like a way to just invalidate and refresh the cache entry, period.

Chunked responses are not cached

When the underlying urllib3.HTTPResponse is chunked, I don't get any caching.

I believe the issue is coming from the fact that requests gets the data out using urllib3.HTTPResponse.stream(), which, if the response is chunked, calls urllib3.HTTPResponse.read_chunked. This avoids directly hitting the read() method on the file object (it calls _safe_read instead) that cachecontrol overrides in the CallbackFileWrapper.

I'm using python 35 and requests 2.8.1.

RedisCache should use built-in expiration

Redis has excellent support for expiration: when adding a key, you can specify after how many seconds it will expire and be automatically deleted. It would be great if RedisCache could use that when there's a max-age directive or an Expires header in the response.
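The TTL that would be handed to Redis (e.g. via SETEX) can be derived from the response headers with the standard library alone. A sketch, assuming the hypothetical helper below would sit between the controller and the Redis client:

```python
import time
from email.utils import parsedate_tz, mktime_tz

def ttl_from_headers(headers, now=None):
    """Derive a TTL in seconds from a max-age directive or, failing
    that, an Expires header. Returns None when neither is usable."""
    now = time.time() if now is None else now
    for token in headers.get("cache-control", "").split(","):
        token = token.strip()
        if token.startswith("max-age="):
            try:
                return int(token.split("=", 1)[1])
            except ValueError:
                return None
    expires = headers.get("expires")
    if expires:
        parsed = parsedate_tz(expires)
        if parsed is not None:
            return int(mktime_tz(parsed) - now)
    return None

print(ttl_from_headers({"cache-control": "public, max-age=300"}))  # 300
```

The cache's set() could then pass this value as the expiry argument instead of storing the entry forever.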

CacheControl should log its decisionmaking

All the decisions that lead to CacheControl entering something into the cache or not, retrieving something from the cache or not, should be logged at DEBUG level.
Reason: HTTP caching is far more complex and influenced by far more factors than most developers that will use CacheControl are aware. If developers are searching for the cause of unexpected behaviour (a.k.a. a bug), knowledge of the decisionmaking will help them determine whether they should look in their own logic, the server's logic, or CacheControl's logic.

Support for many databases as cache data store / SQLAlchemy engine

Hello,

it would be nice if a SQLAlchemy connection (engine) could be passed to cachecontrol (maybe using the cache argument);
see http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html

That would provide support for more databases (MySQL, PostgreSQL, Oracle, Microsoft SQL Server...).

Pandas is doing something similar:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.sql.read_sql.html

APSScheduler also
https://apscheduler.readthedocs.org/en/latest/

cache argument could be

  • a SQLAlchemyCache object

  • a SQLAlchemy engine

  • a database URI like dialect+driver://username:password@host:port/database

  • a table URI like dialect+driver://username:password@host:port/database::tablename

    from cachecontrol.caches import SQLAlchemyCache

Kind regards

No cache when no internet connection - even with forever set to True

Hello,

I tried this code with my internet connection enabled:

import requests
from cachecontrol import CacheControl
from cachecontrol.caches import FileCache

req_session = requests.session()
cache = FileCache('web_cache', forever=True)
session = CacheControl(req_session, cache=cache)
response = session.get('http://www.google.com')
print(response.status_code)

Then I disabled my internet connection and ran the same code again.

It raised ConnectionError: ('Connection aborted.', gaierror(8, 'nodename nor servname provided, or not known'))

That's probably a misunderstanding on my side, but I thought that if I stored both the request and the response in a file, I could get them back when my connection was disabled.

I also don't understand why this forever flag exists. In my understanding, we should instead pass a custom caching strategy (a.k.a. caching heuristic) to CacheControl:

class Forever(BaseHeuristic):
    pass

and use it like

req_session = requests.session()
cache = FileCache('web_cache')
session = CacheControl(req_session, cache=cache, heuristic=Forever())
response = session.get('http://www.google.com')
print(response.status_code)

Any ideas? But, as I said, that's probably a misunderstanding on my side.

Kind regards

ExpireAfter should be a cache heuristic that CacheControl provides

Hello,

see #47

I'm not sure "cache heuristic" is a great name... but what I'm suggesting is to write this heuristic, derived from BaseHeuristic, and put it in the CacheControl source, so it will be much easier to use (for beginners like me):

class ExpireAfter(BaseHeuristic):
    pass

This heuristic would never look at headers; it would just use the current UTC time (datetime.utcnow()) to decide whether the data needs to be downloaded again.

It could be used like this:

from requests import Session
from cachecontrol import CacheControl, ExpireAfter

expire_after = 60 * 5 # cache_expiration (seconds) 0: no cache - None: no cache expiration
#ideally expire_after could also be a datetime.timedelta
sess = CacheControl(Session(), heuristic=ExpireAfter(expire_after))
r = sess.get('http://google.com')

Unfortunately I don't feel comfortable enough with CacheControl to provide you the code to do this.

Kind regards

PS: I'm coming here because of https://github.com/kennethreitz/requests/issues/2378
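The header rewriting such a heuristic would perform can be sketched standalone. This is not the cachecontrol BaseHeuristic API; a real implementation would subclass it and return these headers from its update_headers() hook:

```python
import time
from email.utils import formatdate

class ExpireAfter:
    """Sketch of the proposed heuristic: ignore what the server said and
    pin a fixed lifetime, expressed in seconds."""

    def __init__(self, expire_after):
        self.expire_after = expire_after

    def update_headers(self, response_headers):
        # Overwrite freshness info with a synthetic Expires / max-age pair.
        return {
            "expires": formatdate(time.time() + self.expire_after, usegmt=True),
            "cache-control": "max-age=%d" % self.expire_after,
        }

print(ExpireAfter(300).update_headers({})["cache-control"])  # max-age=300
```

Accepting a datetime.timedelta in addition to an int, as the issue suggests, would only require converting it with total_seconds() in __init__.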

UnicodeDecodeError on some utf8 content in headers in cachecontrol

Hi again, I found another case similar to #84, but not in exactly the same place.

URL that is failing: http://wizard2.sbs.co.kr/w3/podcast/V0000372136.xml

Specifically, the decoding chokes on the unicode chars in this header:

... 'p3p': "CP='\xb0\xa3\xb7\xab\xb9\xe6\xc4\xa7\xb1\xe2\xc8\xa3'" ...

Stacktrace:

  File "/home/ubuntu/.virtualenvs/webserver/local/lib/python2.7/site-packages/requests/sessions.py", line 476, in get
    return self.request('GET', url, **kwargs)
  File "/home/ubuntu/.virtualenvs/webserver/local/lib/python2.7/site-packages/opbeat/instrumentation/packages/base.py", line 63, in __call__
    args, kwargs)
  File "/home/ubuntu/.virtualenvs/webserver/local/lib/python2.7/site-packages/opbeat/instrumentation/packages/base.py", line 214, in call_if_samp
ling
    return wrapped(*args, **kwargs)
  File "/home/ubuntu/.virtualenvs/webserver/local/lib/python2.7/site-packages/requests/sessions.py", line 464, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ubuntu/.virtualenvs/webserver/local/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/home/ubuntu/.virtualenvs/webserver/src/cachecontrol-master/cachecontrol/adapter.py", line 36, in send
    cached_response = self.controller.cached_request(request)
  File "/home/ubuntu/.virtualenvs/webserver/src/cachecontrol-master/cachecontrol/controller.py", line 102, in cached_request
    resp = self.serializer.loads(request, self.cache.get(cache_url))
  File "/home/ubuntu/.virtualenvs/webserver/src/cachecontrol-master/cachecontrol/serialize.py", line 114, in loads
    return getattr(self, "_loads_v{0}".format(ver))(request, data)
  File "/home/ubuntu/.virtualenvs/webserver/src/cachecontrol-master/cachecontrol/serialize.py", line 180, in _loads_v2
    for k, v in cached["response"]["headers"].items()
  File "/home/ubuntu/.virtualenvs/webserver/src/cachecontrol-master/cachecontrol/serialize.py", line 180, in <genexpr>
    for k, v in cached["response"]["headers"].items()
  File "/home/ubuntu/.virtualenvs/webserver/src/cachecontrol-master/cachecontrol/serialize.py", line 30, in _b64_decode_str
    return _b64_decode_bytes(s).decode("utf8")
  File "/home/ubuntu/.virtualenvs/webserver/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 4: invalid start byte
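One possible defensive fix, sketched below: since header bytes from the wild are not guaranteed to be UTF-8, the deserializer could fall back to latin-1 (which cannot fail) instead of raising. This is a hypothetical tolerant variant of serialize._b64_decode_str, not the current implementation:

```python
import base64

def b64_decode_str(s):
    """Decode stored header bytes, tolerating non-UTF-8 data by falling
    back to latin-1, which maps every byte to a code point."""
    raw = base64.b64decode(s)
    try:
        return raw.decode("utf8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

# The p3p header bytes from the report survive the round trip:
bad = base64.b64encode(b"CP='\xb0\xa3\xb7\xab\xb9\xe6'")
print(b64_decode_str(bad))
```

The trade-off is that mojibake may surface in header values, but a cached response stays usable rather than raising mid-request.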

UnicodeDecodeError raised on some cache max-age headers

I'm encountering this exception when fetching some URLs:

UnicodeDecodeError
 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

The function that is raising is this:

def _b64_encode_str(s):
    return _b64_encode_bytes(s.encode("utf8"))

And some example data that some HTTP servers seem to be sending is for example: '\u201cmax-age=31536000\u2033'

It would be great if CacheControl could handle these edge cases without dying.

Thanks for a great package!
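On Python 2 the failure happens because calling .encode("utf8") on a byte string triggers an implicit ASCII decode first. A sketched tolerant variant of serialize._b64_encode_str (hypothetical, not the shipped code) simply skips the encode when the value is already bytes:

```python
import base64

def b64_encode_str(s):
    """Encode a header value, tolerating values that are already bytes
    (as happens on Python 2 with non-ASCII header data)."""
    raw = s if isinstance(s, bytes) else s.encode("utf8")
    return base64.b64encode(raw)

# The curly-quoted max-age value from the report no longer raises:
print(b64_encode_str("\u201cmax-age=31536000\u2033")[:12])
```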

No logging

I'm not seeing any logging statements.
The issue I'm having right now is that cachecontrol isn't caching responses I'd have expected to be cached, and I have no idea why, not even after single-stepping through it: I can see all the places where it decides it's not going to cache right now, but I don't know which of these decisions is correct (or maybe all of them are correct and the website just isn't sending the right headers).

recent change to urllib3's is_fp_closed broke cachecontrol for Python 3 and PyPy

FileCache stopped working for me with Python 3 and PyPy, and I think I tracked down the cause to a recent change to urllib3's is_fp_closed utility.

From https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/filewrapper.py#L32

        # Is this the best way to figure out if the file has been completely
        #   consumed?
        if is_fp_closed(self.__fp):
            self.__callback(self.__buf.getvalue())

In my Python 3 and PyPy environments, is_fp_closed was never returning True. Reverting the changes in urllib3/urllib3#435 fixed it.

I tried cloning urllib3 and running the tox tests to dig in further but couldn't get the tests to run, and thought my best next step would be reporting here.

It may be that cachecontrol's code is just fine and the issue is in urllib3, but I figured I'd confirm here first. Does that look like the problem?

Thanks in advance for taking a look!

support setting max size on cache store

IIUC, currently cache stores are allowed to grow without bound, is that right? If so, is there any interest in adding support for setting a max size on the cache store? If you're using a FileCache, for instance, and have a limited amount of disk space (e.g. you're on the AWS free tier), this would be useful.

For the simplest possible first pass at this, once the max size is reached, the cache store could simply decline to cache anything further, ideally notifying the user somehow (maybe if you use this setting you can pass in a logger). A less naive implementation could intelligently select items to evict from the cache to make room for additional ones, e.g. on the basis of hit rate. (Item with lowest hit rate gets evicted first; ties could be broken with item size and/or least recency of access. Actually, should look at how browsers do it.)

Thanks for your consideration, and for all the great work on cachecontrol!
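A sketch of the "less naive" option using least recency of access as the eviction policy (everything here is hypothetical, not a cachecontrol API; a byte-size budget would work the same way with len(value) instead of an item count):

```python
from collections import OrderedDict
from threading import Lock

class BoundedCache:
    """Size-capped store: evicts the least recently used entry once
    max_items is exceeded."""

    def __init__(self, max_items=2):
        self.max_items = max_items
        self.lock = Lock()
        self.data = OrderedDict()

    def get(self, key):
        with self.lock:
            if key not in self.data:
                return None
            self.data.move_to_end(key)  # mark as recently used
            return self.data[key]

    def set(self, key, value):
        with self.lock:
            self.data[key] = value
            self.data.move_to_end(key)
            while len(self.data) > self.max_items:
                self.data.popitem(last=False)  # evict the LRU entry

c = BoundedCache(max_items=2)
c.set("a", 1); c.set("b", 2); c.set("c", 3)
print(list(c.data))  # ['b', 'c'] since 'a' was least recently used
```

The "decline to cache and log" first pass from the issue would replace the popitem loop with an early return plus a logger warning.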

Errors if I use the url twice

This is my code

wikisess = requests.Session()
wikises = CacheControl(wikisess)
search = wikises.get("http://wiki.roblox.com/api.php?format=json&action=query&list=search&srwhat=title&srsearch=%s"%query)

If I do the same URL again, even with a different query variable, it says this:

File "c:\Anaconda\lib\site-packages\requests\sessions.py", line 477, in get
    return self.request('GET', url, **kwargs)
File "c:\Anaconda\lib\site-packages\requests\sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
File "c:\Anaconda\lib\site-packages\requests\sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
File "c:\Anaconda\lib\site-packages\cachecontrol\adapter.py", line 46, in send
    resp = super(CacheControlAdapter, self).send(request, **kw)
File "c:\Anaconda\lib\site-packages\requests\adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ResponseNotReady())

Chunked responses are not cached

This is an issue with filewrapper.py

urllib3/response.py has:

    if self.chunked:
        for line in self.read_chunked(amt, decode_content=decode_content):
            yield line
    else:
        while not is_fp_closed(self._fp):
            data = self.read(amt=amt, decode_content=decode_content)
...

This causes chunked requests to bypass the read implementation in CallbackFileWrapper and the callback never gets called. The chunked code ends up calling self._fp._safe_read.

I was able to verify this by setting

chunked_transfer_encoding       off;

in an nginx config and seeing that responses are then cached.

I think this might be related to #95. FWIW, tracking this down only took me 20 minutes.
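The mechanism can be modeled with the standard library alone. This is a minimal stand-in for cachecontrol's CallbackFileWrapper (names simplified), showing why the callback fires only when read() is the method actually consuming the body:

```python
import io

class CallbackWrapper:
    """Buffer every read() and fire a callback with the full body once
    the stream is exhausted. Chunked decoding in urllib3 bypasses
    read() (it calls _safe_read), so this callback never fires there."""

    def __init__(self, fp, callback):
        self._fp = fp
        self._buf = io.BytesIO()
        self._callback = callback

    def read(self, amt=None):
        data = self._fp.read(amt)
        self._buf.write(data)
        if not data and self._callback:
            # EOF reached through read(): hand the body to the cache.
            self._callback(self._buf.getvalue())
            self._callback = None
        return data

captured = []
w = CallbackWrapper(io.BytesIO(b"hello"), captured.append)
while w.read(2):
    pass
print(captured)  # [b'hello']
```

If the consumer pulled bytes via a side channel instead of read(), `captured` would stay empty, which is exactly the chunked-response symptom.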

FileCache: hash table keys too long?

Hi,

Using FileCache I had a problem with the length of b64-encoded keys.

I'm playing with a webservice which leads me to request quite long URLs.
It is not absurd to reach a 200-character URL, which in turn produces an even longer b64-encoded string.

For instance, the largest URL found in the examples of this "Rosetta Stone" webservice [0] from EchoNest is already 145 characters long, which makes 196 once b64-encoded.
That leaves only 54 characters of headroom before the 255-character limit of most file systems.

Of course I hit that limit saving the cache in an XDG sub-folder of my home directory (on linux/ext4) :D

As I said, these kinds of long URLs could be quite common when playing with such webservices, so I don't think this is an unusual use case.

So why not use the md5 or sha2 hash functions? They both provide fixed-length keys, which eliminates the problem.
I've started using md5, since collision attacks are not a security concern here IMHO (but I'm far from a security expert):

    def encode(self, x):
        return md5(x.encode()).hexdigest()

I used md5 to ensure it is always available in a Python distribution (the same goes for sha1, sha224, sha256, sha384, sha512, cf. [1]).

Cheers

[0] http://developer.echonest.com/docs/v4/index.html#project-rosetta-stone
[1] http://docs.python.org/library/hashlib.html
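To illustrate the fixed-length property the report relies on: an md5 hex digest is always 32 characters, no matter how long the URL, so the resulting file name stays well under typical file-system limits (module-level function here for brevity; the original snippet is a method):

```python
import hashlib

def encode(x):
    """Cache key as an md5 hex digest: always 32 characters, regardless
    of how long the input URL is."""
    return hashlib.md5(x.encode()).hexdigest()

long_url = "http://developer.echonest.com/api/v4/" + "x" * 200
print(len(encode(long_url)))  # 32
```

sha224/sha256 would work identically with 56/64-character digests, trading a longer name for a collision-resistant hash.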

Chunked responses are not cached

Otherwise cacheable chunked responses (indicated by a response header Transfer-Encoding: chunked) are not cached by CacheControl when requests>=2.6.1 is used.

It seems this issue appeared with the inclusion of urllib3/urllib3@f21c2a2 in the requests codebase via kennethreitz/requests@5fcd843.

Example script demonstrating the issue:

In [1]: import requests
In [2]: from requests.packages import urllib3
In [3]: from cachecontrol import CacheControl
In [4]: urllib3.disable_warnings()  # https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning
In [5]: sess = requests.session()
In [6]: cached_sess = CacheControl(sess)
In [7]: # non-chunked response
In [8]: response = cached_sess.get('http://httpbin.org/cache/60')
In [9]: response.headers.get('transfer-encoding') is None
Out[9]: True
In [10]: response.from_cache
Out[10]: False
In [11]: response = cached_sess.get('http://httpbin.org/cache/60')
In [12]: response.from_cache
Out[12]: True
In [13]: # chunked response
In [14]: response = cached_sess.get('https://tor.eff.org/')
In [15]: response.headers.get('transfer-encoding') == 'chunked'
Out[15]: True
In [16]: response.from_cache
Out[16]: False
In [17]: response = cached_sess.get('https://tor.eff.org/')
In [18]: response.from_cache
Out[18]: False
In [19]:

inaccurate cachecontrol.cache docstring?

https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/cache.py#L3 says

The cache object API for implementing caches. The default is just a dictionary, which in turns means it is not threadsafe for writing.

But http://cachecontrol.readthedocs.org/en/latest/storage.html#dictcache says "It is a simple threadsafe dictionary", and the set and delete implementations do in fact lock: https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/cache.py#L30
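For reference, the linked `set`/`delete` implementations follow the usual lock-guarded-dict pattern, which is what makes writes threadsafe; roughly (a sketch mirroring the linked code, not copied from it):

```python
from threading import Lock

class DictCache:
    def __init__(self):
        self.lock = Lock()
        self.data = {}

    def get(self, key):
        # Reads on a dict are atomic in CPython, so no lock is taken here
        return self.data.get(key)

    def set(self, key, value):
        with self.lock:
            self.data[key] = value

    def delete(self, key):
        with self.lock:
            if key in self.data:
                del self.data[key]

c = DictCache()
c.set("k", b"v")
assert c.get("k") == b"v"
c.delete("k")
assert c.get("k") is None
```

So the module docstring appears to be stale rather than the storage docs being wrong.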

Cache entries may be cleared prematurely

CacheController.cached_request clears out cache entries if they were stale, even if staleness happened just because the cache entry was older than what the request required.
A later request may tolerate a higher age, so the cache entry could still be useful.

In the following scenario, this will cause a refetch that should have been served from the cache:

  • Request 1 specifies a tight max-age, causing the cache to drop the entry and try fetching a new result
  • For some reason, that result is not entered into the cache (temporary failure, cache control headers, whatever)
  • Request 2 needs to specify a relaxed max age, so the original cache entry would have been returned if it were still in the cache

In the following scenario, this will fail to return a cached entry where it could have:

  • First three steps as above
  • Request 2, when fetching from the server, hits some temporary problem.

Admittedly, neither scenario is very likely, but given that caching is such a foundational part that people expect to "just work", I think that even this stone shouldn't be left unturned.
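To make the scenario concrete, here is a toy sketch of the age check involved (a hypothetical helper, not CacheControl's implementation):

```python
def fresh_enough(entry_age, request_max_age):
    # A request directive like Cache-Control: max-age=N rejects entries older than N seconds
    return entry_age <= request_max_age

entry_age = 300  # the cached entry is five minutes old

# Request 1 (tight max-age) finds the entry too stale...
assert not fresh_enough(entry_age, 60)
# ...but request 2 (relaxed max-age) could still have been served by it,
# had the controller only bypassed the entry instead of deleting it.
assert fresh_enough(entry_age, 3600)
```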

Cache content ignored if no date present

https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/controller.py#L120 ff. seems to be ignoring cache entries that do not have a date header.

I think https://tools.ietf.org/html/rfc7232#section-2.4 allows sending etag without date; in that case, the cache content would never be used.

Also, I'm wondering why it's deleting a cache entry here; if a cache entry does not have the headers required to make it useful, wouldn't it be better to never enter it into the cache in the first place?
(I may be misunderstanding things, grossly; I'm currently looking only at the code that retrieves data from the cache, in cached_response.)

Update documentation with a discussion of timezones and expiration

The docs should mention the potential pitfalls of using non-timezone-aware dates. This is something both servers and clients should consider, and it is especially relevant for CacheControl when the client and the service are written by the same organization.
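As an illustration of the pitfall (a generic sketch, not CacheControl's code): HTTP dates are always GMT, so comparing a parsed header value against a naive local time silently shifts expirations by the local UTC offset, while fully timezone-aware handling either works or fails loudly:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# parsedate_to_datetime returns an aware datetime for HTTP-style dates
expires = parsedate_to_datetime("Tue, 15 Nov 1994 08:12:31 GMT")
assert expires.tzinfo is not None

# Safe comparison: aware "now" against the aware header value
now = datetime.now(timezone.utc)
assert (now - expires).total_seconds() > 0  # long since expired

# Mixing in a naive datetime raises TypeError instead of being silently
# off by the local UTC offset:
try:
    datetime.utcnow() - expires
except TypeError:
    pass  # naive vs aware comparison is rejected
```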

Last Modified heuristic is not working

Hi,
thanks for this lib, but I think the LastModified heuristic is not working. Here is a sample:

In [1]: import requests
In [2]: from cachecontrol import CacheControlAdapter
In [3]: from cachecontrol.heuristics import LastModified
In [4]: adapter = CacheControlAdapter(heuristic=LastModified())
In [5]: sess = requests.Session()
In [6]: sess.mount('http://', adapter)
In [7]: sess.mount('https://', adapter)
In [8]: r = sess.get("https://app.roihunter.com/data/example-feed-roi-hunter.xml")
In [9]: r.from_cache
Out[9]: False
In [10]: r = sess.get("https://app.roihunter.com/data/example-feed-roi-hunter.xml")
In [11]: r.from_cache
Out[11]: False

Expected behavior is that the second call of r.from_cache (Out[11]) returns True.

Can you explain the behavior to me or fix it, please? Thanks
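For reference, my understanding is that the LastModified heuristic implements the common "10% of the time since Last-Modified" freshness rule (the rule Firefox uses), and only supplies an expiry when the response lacks explicit freshness information. A rough sketch of that calculation (my own reconstruction, not the library's exact code):

```python
from datetime import datetime, timedelta

def heuristic_freshness(date, last_modified):
    # Freshness lifetime = 10% of the interval between Date and Last-Modified
    return date + (date - last_modified) / 10

d = datetime(2016, 1, 11, 12, 0, 0)
lm = d - timedelta(days=10)
# A resource unmodified for 10 days is considered fresh for 1 more day
assert heuristic_freshness(d, lm) == d + timedelta(days=1)
```

So if the response was modified very recently, or carries explicit caching headers, the heuristic may legitimately produce no usable expiry; that would be worth checking before concluding the heuristic is broken.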

When are contents written to the path returned by url_to_file_path [question]

I'm able to retrieve the path where contents will be written by a FileCache object, however as the function's docstring indicates there's no guarantee that this file will exist. What I'm finding is that for most URLs I try, the contents are written immediately as I perform the request, see the two examples below.

When I do the following, the path doesn't exist:

In [54]: from cachecontrol.caches import FileCache

In [55]: from cachecontrol.caches.file_cache import url_to_file_path

In [56]: from contextlib import contextmanager

In [57]: from tempfile import gettempdir

In [58]: file_cache = FileCache(gettempdir())

In [60]: from cachecontrol import CacheControl

In [61]: sess = CacheControl(requests.Session(), cache=file_cache)

In [62]: url = ('http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?retmax=1'
   ....:        '00&retmode=text&tool=skbio&db=nucleotide&id=459567&rettype=fas'
   ....:        'ta&retstart=0&[email protected]')

In [63]: req = sess.get(url)

In [64]: cached_fp = url_to_file_path(url, file_cache)

In [65]: from os.path import exists

In [67]: _  = req.content

# the file doesn't seem to exist
In [68]: exists(cached_fp)
Out[68]: False

However, if I try this with another URL that has the same contents, the file exists.

In [69]: other_url = 'https://gist.githubusercontent.com/ElDeveloper/e0144eaf196f3a641409/raw/f14a6ff47b880537da3067b322526a91124ff742/-'

In [70]: req = sess.get(other_url)

In [71]: cached_fp = url_to_file_path(other_url, file_cache)

# the file exists YAY :D
In [72]: exists(cached_fp)
Out[72]: True

I don't know if this is something specific to the server I'm trying to retrieve the data from or if this is an expected behavior of FileCache. Any help is greatly appreciated.

CacheControl should cache any cacheable response

According to the RFC, "By default, a response is cacheable if the requirements of the request method, request header fields, and the response status indicate that it is cacheable."

For this reason, and also because it is the behavior of other clients such as browsers, I would expect a simple 200 response to a simple GET request to be cached by CacheControl if there are no headers limiting the caching. CacheControl should cache these indefinitely. There should not need to be any flag such as in #18 to invoke this behavior.
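For reference, RFC 7231 section 6.1 lists the status codes that are cacheable by default; a cache honoring the quoted rule could store responses with these codes even without explicit freshness headers. A sketch of that check (method/status only; real caches must also consult the header fields):

```python
# Status codes RFC 7231 section 6.1 defines as cacheable by default
HEURISTICALLY_CACHEABLE = {200, 203, 204, 206, 300, 301, 404, 405, 410, 414, 501}

def cacheable_by_default(method, status):
    # Per the quoted rule, the request method, header fields and response
    # status must all permit caching; this sketch checks method and status.
    return method == "GET" and status in HEURISTICALLY_CACHEABLE

assert cacheable_by_default("GET", 200)
assert not cacheable_by_default("POST", 200)
```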
