
Comments (8)

grantjenks commented on June 30, 2024

That looks like a good start. Remember that the directory path should be unique so /dev/shm/name is probably desired.
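A minimal sketch of that advice, assuming a Linux tmpfs mounted at /dev/shm (the helper name is hypothetical; `tempfile.mkdtemp` supplies the uniqueness):

```python
import tempfile


def unique_shm_dir(prefix="diskcache-"):
    """Return a fresh, uniquely named directory, preferring the /dev/shm
    tmpfs mount when it exists (Linux) and falling back to the default
    temp dir otherwise."""
    try:
        return tempfile.mkdtemp(prefix=prefix, dir="/dev/shm")
    except FileNotFoundError:
        return tempfile.mkdtemp(prefix=prefix)
```

A cache opened as `diskcache.Cache(directory=unique_shm_dir())` then lives on tmpfs under a path no other process shares.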

from python-diskcache.

grantjenks commented on June 30, 2024

I’m curious how much faster that’ll be. If you have a benchmark, please share the results.


ddorian commented on June 30, 2024

Here is a simple bench script that uses locust. You may need multiple worker processes, because the whole process is probably blocked by the SQLite lock (Locust runs on gevent).

import time

import diskcache
from locust import User, task


class MyClient(diskcache.Deque):
    @classmethod
    def fromcache(cls, cache, iterable=(), maxlen=None, request_event=None):
        self = super().fromcache(cache)
        self._request_event = request_event
        return self

    def __getattribute__(self, item: str):
        if item not in ("append",):
            return diskcache.Deque.__getattribute__(self, item)

        func = diskcache.Deque.__getattribute__(self, item)

        def wrapper(*args, **kwargs):
            request_meta = {
                "request_type": "diskcache",
                "name": func.__name__,
                "start_time": time.time(),
                "response_length": 0,
                # calculating this for an xmlrpc.client response would be too hard
                "response": None,
                "context": {},  # see HttpUser if you actually want to implement contexts
                "exception": None,
            }
            start_perf_counter = time.perf_counter()
            try:
                request_meta["response"] = func(*args, **kwargs)
            except Exception as e:
                request_meta["exception"] = e
            response_time = (time.perf_counter() - start_perf_counter) * 1000
            request_meta["response_time"] = response_time
            # This is what makes the request actually get logged in Locust
            self._request_event.events.request.fire(**request_meta)
            return request_meta["response"]

        return wrapper


class BaseActor(User):
    """
    A minimal Locust user class that provides an XmlRpcClient to its subclasses
    """

    host = ""
    abstract = True  # don't instantiate this as an actual user when running Locust
    client: MyClient

    def __init__(self, environment):
        super().__init__(environment)
        self.environment = environment
        self.cache = diskcache.Cache(
            # CONFIG 1
            # directory="/dev/shm/my_index.db",
            # sqlite_journal_mode="OFF",
            # statistics=0,
            # sqlite_synchronous=0,
            #
            # CONFIG 2
            directory="/tmp/my_index.db",
            statistics=0,
            sqlite_synchronous=0,
            sqlite_journal_mode="wal",
            #
            # shared config
            #
            sqlite_cache_size=0,
            sqlite_mmap_size=0,
            # size_limit=10 * (1024**3),
        )

        self.client = MyClient.fromcache(self.cache, request_event=environment)


class SingleInsert(BaseActor):
    @task
    def only_insert(self):
        self.client.append("s")


grantjenks commented on June 30, 2024

What are the results?


ddorian commented on June 30, 2024

On my laptop it was 1500/s on disk and 5500/s in memory (Deque.append()). There were many bottlenecks: too many transactions, querying for rows before appending, unused indexes (for the deque), triggers (for the cache count), and so on.

With some relaxed settings (like not checking whether the cache is full on every insert), sharding/fanout, batching, and a faster laptop/server, it should be able to run at least 10x faster.
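The "too many transactions" bottleneck can be sketched with plain sqlite3 (a stand-in table, not diskcache's actual schema): committing once per append pays the transaction overhead on every row, while batching pays it once per batch. On disk the gap is typically far larger than in memory, because each commit forces a sync.

```python
import sqlite3
import time


def bench_inserts(batch_size, total=2000):
    """Insert `total` rows, committing once per `batch_size` rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cache (key INTEGER PRIMARY KEY, value BLOB)")
    start = time.perf_counter()
    for _ in range(total // batch_size):
        with conn:  # one transaction per batch
            conn.executemany(
                "INSERT INTO cache (value) VALUES (?)",
                [(b"s",)] * batch_size,
            )
    return time.perf_counter() - start


per_row = bench_inserts(1)    # commit per append
batched = bench_inserts(500)  # commit per 500 appends
```

Pointing `sqlite3.connect` at a real file (with `synchronous` left at its default) makes the per-row case dramatically worse, which is what the on-disk numbers above reflect.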


ddorian commented on June 30, 2024

Assuming I use it, would you be open down the line to accepting PRs that make some things optional and fix performance?

Some examples:

  1. select_expired_template
    can be just 1 CTE query instead of doing 2 queries from Python
  2. 'CREATE INDEX IF NOT EXISTS Cache_expire_time ON'
    should probably be a partial index
  3. Make the triggers keeping "total row count" optional (the db would do SELECT COUNT(*) underneath).
  4. Delete can just delete instead of selecting and not raise an error (optional)

     with self._transact(retry) as (sql, cleanup):
         rows = sql(
             'SELECT rowid, filename FROM Cache'
             ' WHERE key = ? AND raw = ?'
             ' AND (expire_time IS NULL OR expire_time > ?)',
             (db_key, raw, time.time()),
         ).fetchall()
         if not rows:
             raise KeyError(key)
         ((rowid, filename),) = rows
         sql('DELETE FROM Cache WHERE rowid = ?', (rowid,))
         cleanup(filename)
         return True
  5. Don't need unique-index on Deque

     sql(
         'CREATE UNIQUE INDEX IF NOT EXISTS Cache_key_raw ON'
         ' Cache(key, raw)'
     )
  6. Use WITHOUT ROWID for the normal Cache/Index instead of

     ' rowid INTEGER PRIMARY KEY,'
  7. Use AUTOINCREMENT in Deque.append() instead of

     now = time.time()
     raw = True
     expire_time = None if expire is None else now + expire
     size, mode, filename, db_value = self._disk.store(value, read)
     columns = (expire_time, tag, size, mode, filename, db_value)
     order = {'back': 'DESC', 'front': 'ASC'}
     select = (
         'SELECT key FROM Cache'
         ' WHERE ? < key AND key < ? AND raw = ?'
         ' ORDER BY key %s LIMIT 1'
     ) % order[side]
     with self._transact(retry, filename) as (sql, cleanup):
         rows = sql(select, (min_key, max_key, raw)).fetchall()
         if rows:
             ((key,),) = rows
             if prefix is not None:
                 num = int(key[(key.rfind('-') + 1) :])
             else:
                 num = key
             if side == 'back':
                 num += 1
             else:
                 assert side == 'front'
                 num -= 1
         else:
             num = 500000000000000
         if prefix is not None:
             db_key = '{0}-{1:015d}'.format(prefix, num)
         else:
             db_key = num
  8. There are such cases in most methods
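Items 1 and 4 can be sketched against a simplified stand-in for the Cache table: on SQLite 3.35+, `DELETE ... RETURNING` collapses the SELECT-then-DELETE pair into one statement while still handing back the filename needed for cleanup. (Schema and values here are illustrative, not diskcache's real ones.)

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Cache ("
    " rowid INTEGER PRIMARY KEY, key TEXT, raw INTEGER,"
    " expire_time REAL, filename TEXT)"
)
conn.execute(
    "INSERT INTO Cache (key, raw, expire_time, filename)"
    " VALUES ('k', 1, NULL, 'val-000.bin')"
)

# One statement instead of SELECT + DELETE (requires SQLite >= 3.35).
rows = conn.execute(
    "DELETE FROM Cache"
    " WHERE key = ? AND raw = ?"
    " AND (expire_time IS NULL OR expire_time > ?)"
    " RETURNING filename",
    ("k", 1, time.time()),
).fetchall()
# rows -> [('val-000.bin',)]; each filename would then go to cleanup()
```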


grantjenks commented on June 30, 2024

Sure, I’m open to improvements. But, I have some comments:

  1. How do you get the filenames and do the delete at the same time?
  2. You mean like excluding NULLs? That seems reasonable.
  3. Maybe. len() should really be fast in my mind but I suppose if it’s optional that would be okay. Maybe a count() method would better indicate the slower path.
  4. Again, how to cleanup the file then?
  5. How would you implement that? Deque is layered on top of Cache without specializations today. Specializing Cache could be tricky.
  6. Probably not, the rowid is really helpful in debugging.
  7. Probably not.

I like the partial index ideas best and the specializations for Deque least.

I would propose that you make all the changes you would like in a separate project (like fastdeque or diskdeque or whatever) and then benchmark the deque implementations against each other. Depending on how big the improvements are and how extensive the changes, maybe we could merge it back. Or, I could remove my Deque implementation and recommend yours.

Part of my hesitation is the deque scenario itself which is kind of moving away from the primary cache/index scenario.


ddorian commented on June 30, 2024

> How do you get the filenames and do the delete at the same time?

I see you're storing some values as separate files; it won't work in that case. It should work when using an incremental BLOB for large values.

I wouldn't cache small-inline-values & large-values that end up as separate files in the same cache instance though.

> You mean like excluding NULLs? That seems reasonable.

Yes, less index to maintain.
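A sketch of the partial-index idea with plain sqlite3 (simplified stand-in table): rows whose `expire_time` is NULL never match the expiry sweep, so excluding them from the index skips index maintenance for never-expiring inserts, while the sweep query can still use the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Cache (rowid INTEGER PRIMARY KEY, expire_time REAL)"
)

# Partial index: only rows that can actually expire are indexed;
# inserts with expire_time IS NULL touch no index at all.
conn.execute(
    "CREATE INDEX IF NOT EXISTS Cache_expire_time"
    " ON Cache(expire_time) WHERE expire_time IS NOT NULL"
)

# The expiry sweep still uses the index, because its WHERE clause
# implies the index's condition.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT rowid FROM Cache"
    " WHERE expire_time IS NOT NULL AND expire_time < ?",
    (0.0,),
).fetchall()
```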

> Maybe. len() should really be fast in my mind but I suppose if it’s optional that would be okay. Maybe a count() method would better indicate the slower path.

Optional. I wouldn't want triggers in the hot path when I will only rarely need the total row count (and don't need it under locks).

> Again, how to cleanup the file then?

See the BLOB API as in 1. But still, there should be a specialization for not using files. (Actually, the reverse is true: caching big values into separate files is the specialization.)

> How would you implement that? Deque is layered on top of Cache without specializations today. Specializing Cache could be tricky.

There's no reason to store the kv-cache and deque in the same table in SQLite, since from a quick look they don't share anything.

> Probably not, the rowid is really helpful in debugging.

It's still an extra index to maintain on every insert/delete.

> Probably not.

I didn't understand the reasoning here.

> I would propose that you make all the changes you would like in a separate project (like fastdeque or diskdeque or whatever) and then benchmark the deque implementations against each other.

It's fairly easy to benchmark: just comment out the triggers, indexes, and transactions. I did some of that and went from 5.5K/s to 17K/s for in-memory Deque.append(), for example (using the code above).

> Depending on how big the improvements are and how extensive the changes, maybe we could merge it back. Or, I could remove my Deque implementation and recommend yours.

> Part of my hesitation is the deque scenario itself which is kind of moving away from the primary cache/index scenario.

I picked it as a hard case, with only blocking write operations. The same issues (triggers, indexes, transactions, rowid) apply to the normal cache.


My original question was about config, which was answered.

