Comments (8)
That looks like a good start. Remember that the directory path should be unique so /dev/shm/name is probably desired.
from python-diskcache.
I’m curious how much faster that’ll be. If you have a benchmark, please share the results.
Here is a simple bench script that uses locust. You may need multiple processes, because the whole process is probably blocked by the SQLite lock (gevent).
```python
import time

import diskcache
from locust import User, task


class MyClient(diskcache.Deque):
    @classmethod
    def fromcache(cls, cache, iterable=(), maxlen=None, request_event=None):
        self = super().fromcache(cache, iterable, maxlen)
        self._request_event = request_event
        return self

    def __getattribute__(self, item: str):
        if item not in ("append",):
            return diskcache.Deque.__getattribute__(self, item)
        func = diskcache.Deque.__getattribute__(self, item)

        def wrapper(*args, **kwargs):
            request_meta = {
                "request_type": "diskcache",
                "name": func.__name__,
                "start_time": time.time(),
                "response_length": 0,
                "response": None,
                "context": {},  # see HttpUser if you actually want to implement contexts
                "exception": None,
            }
            start_perf_counter = time.perf_counter()
            try:
                request_meta["response"] = func(*args, **kwargs)
            except Exception as e:
                request_meta["exception"] = e
            request_meta["response_time"] = (
                time.perf_counter() - start_perf_counter
            ) * 1000
            # This is what makes the request actually get logged in Locust
            self._request_event.events.request.fire(**request_meta)
            return request_meta["response"]

        return wrapper


class BaseActor(User):
    """A minimal Locust user class that provides an instrumented diskcache
    client to its subclasses."""

    host = ""
    abstract = True  # don't instantiate this as an actual user when running Locust
    client: MyClient

    def __init__(self, environment):
        super().__init__(environment)
        self.environment = environment
        self.cache = diskcache.Cache(
            # CONFIG 1
            # directory="/dev/shm/my_index.db",
            # sqlite_journal_mode="OFF",
            # statistics=0,
            # sqlite_synchronous=0,
            #
            # CONFIG 2
            directory="/tmp/my_index.db",
            statistics=0,
            sqlite_synchronous=0,
            sqlite_journal_mode="wal",
            #
            # shared config
            sqlite_cache_size=0,
            sqlite_mmap_size=0,
            # size_limit=10 * (1024**3),
        )
        # the Locust Environment; its events.request hook logs requests
        self.client = MyClient.fromcache(self.cache, request_event=environment)


class SingleInsert(BaseActor):
    @task
    def only_insert(self):
        self.client.append("s")
```
What are the results?
On my laptop it was 1500/s on disk and 5500/s in memory (Deque.append()). There were many bottlenecks: too many transactions, querying for rows before appending, unused indexes (for the deque), triggers (cache count), etc.
With some relaxed settings (like not checking for fullness on every insert), sharding/fanout, batching, and a faster laptop/server, it should be able to run at least 10x faster.
Assuming I use it, would you be open down the line to accepting PRs making some things optional & performance fixes?
Some examples:
- python-diskcache/diskcache/core.py, line 885 (at 323787f)
- python-diskcache/diskcache/core.py, line 533 (at 323787f)
- Make the triggers keeping the "total row count" optional (python-diskcache/diskcache/core.py, line 544 at 323787f; it does a `select count(*)` underneath).
- Delete can just delete instead of selecting first, and not raise an error (optional) (python-diskcache/diskcache/core.py, lines 1349 to 1365 at 323787f).
- Don't need a unique index on Deque (python-diskcache/diskcache/core.py, lines 527 to 530 at 323787f).
- Use `WITHOUT ROWID` for the normal Cache/Index (python-diskcache/diskcache/core.py, line 513 at 323787f).
- Use AUTOINCREMENT in `Deque.append()` instead of the current approach (python-diskcache/diskcache/core.py, lines 1446 to 1480 at 323787f).
- There are such cases in most methods.
Sure, I’m open to improvements. But I have some comments:
- How do you get the filenames and do the delete at the same time?
- You mean like excluding NULLs? That seems reasonable.
- Maybe. len() should really be fast in my mind but I suppose if it’s optional that would be okay. Maybe a count() method would better indicate the slower path.
- Again, how to cleanup the file then?
- How would you implement that? Deque is layered on top of Cache without specializations today. Specializing Cache could be tricky.
- Probably not, the rowid is really helpful in debugging.
- Probably not.
I like the partial index ideas best and the specializations for Deque least.
I would propose that you make all the changes you would like in a separate project (like fastdeque or diskdeque or whatever) and then benchmark the deque implementations against each other. Depending on how big the improvements are and how extensive the changes, maybe we could merge it back. Or, I could remove my Deque implementation and recommend yours.
Part of my hesitation is the deque scenario itself which is kind of moving away from the primary cache/index scenario.
> How do you get the filenames and do the delete at the same time?

I see you're storing some values as separate files. It won't work in that case. It should work when using incremental BLOBs for large files.
I wouldn't cache small inline values and large values that end up as separate files in the same cache instance, though.
> You mean like excluding NULLs? That seems reasonable.

Yes, less index to maintain.
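For reference, SQLite partial indexes look like this; the sketch below uses a simplified table (the `cache`/`filename` schema here is illustrative, not diskcache's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cache (key TEXT, value BLOB, filename TEXT)")

# A partial index only covers rows where filename IS NOT NULL, so
# inserts of inline values (filename NULL) skip index maintenance.
con.execute(
    "CREATE INDEX ix_filename ON cache (filename) "
    "WHERE filename IS NOT NULL"
)

con.execute("INSERT INTO cache VALUES ('a', x'00', NULL)")    # not indexed
con.execute("INSERT INTO cache VALUES ('b', NULL, 'b.val')")  # indexed

# The query planner can still use the partial index for file lookups.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT key FROM cache WHERE filename = 'b.val'"
).fetchall()
assert any("ix_filename" in row[-1] for row in plan)
```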
> Maybe. len() should really be fast in my mind but I suppose if it’s optional that would be okay. Maybe a count() method would better indicate the slower path.

Optional. I wouldn't want triggers in the hot path when I will rarely (without locks) need the total row count.
> Again, how to cleanup the file then?

See the BLOB API, as in 1. But still, there should be a specialization for not using files. (Actually, the reverse is true: caching big values into separate files is the specialization.)
> How would you implement that? Deque is layered on top of Cache without specializations today. Specializing Cache could be tricky.

There's no reason to store the kv-cache and the deque in the same table in SQLite, since they don't share anything, from a quick look that I took.
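A dedicated deque table can be very small, since ordering by `rowid` is all it needs. A sketch (hypothetical schema, not diskcache's; the `DELETE ... RETURNING` form assumes SQLite 3.35+):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A deque only needs rowid ordering -- no key column, no unique
# index, no expiry metadata shared with the kv cache.
con.execute("CREATE TABLE deque (value BLOB NOT NULL)")

def append(value):
    con.execute("INSERT INTO deque (value) VALUES (?)", (value,))

def popleft():
    # Delete and fetch in one statement, instead of selecting first
    # and then deleting.
    row = con.execute(
        "DELETE FROM deque WHERE rowid = "
        "(SELECT min(rowid) FROM deque) RETURNING value"
    ).fetchone()
    return row[0] if row else None

append(b"a")
append(b"b")
assert popleft() == b"a"
assert popleft() == b"b"
assert popleft() is None
```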
> Probably not, the rowid is really helpful in debugging.

It's still an extra index to maintain on every insert/delete.
> Probably not.

I didn't understand the reasoning here.
> I would propose that you make all the changes you would like in a separate project (like fastdeque or diskdeque or whatever) and then benchmark the deque implementations against each other.

It's ~easy to benchmark: just comment out the triggers, indexes, and transactions. I did some of that and went from 5.5K to 17K ops/s for in-memory Deque.append(), for example (using the code above).
> Depending on how big the improvements are and how extensive the changes, maybe we could merge it back. Or, I could remove my Deque implementation and recommend yours.
> Part of my hesitation is the deque scenario itself which is kind of moving away from the primary cache/index scenario.

I picked it as a hard case, one with only blocking write operations. The same issues (triggers, indexes, transactions, rowid) apply to the normal cache.
My original question was about config, which was answered.