
Comments (8)

grantjenks commented on June 30, 2024

That looks like a good start. Remember that the directory path should be unique so /dev/shm/name is probably desired.
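A minimal sketch of that advice, assuming a Linux tmpfs mounted at /dev/shm (the helper name is hypothetical; `tempfile.mkdtemp` supplies the uniqueness):

```python
import tempfile


def unique_shm_dir(prefix="diskcache-"):
    """Return a fresh, uniquely named directory, preferring the /dev/shm
    tmpfs mount when it exists (Linux) and falling back to the default
    temp dir otherwise."""
    try:
        return tempfile.mkdtemp(prefix=prefix, dir="/dev/shm")
    except FileNotFoundError:
        return tempfile.mkdtemp(prefix=prefix)
```

A cache opened as `diskcache.Cache(directory=unique_shm_dir())` then lives on tmpfs under a path no other process shares.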

from python-diskcache.

grantjenks commented on June 30, 2024

I’m curious how much faster that’ll be. If you have a benchmark, please share the results.


ddorian commented on June 30, 2024

Here is a simple bench script that uses locust. You may need multiple worker processes, because the whole process is probably blocked by the SQLite lock (Locust runs on gevent).

import time

import diskcache
from locust import User, task


class MyClient(diskcache.Deque):
    @classmethod
    def fromcache(cls, cache, iterable=(), maxlen=None, request_event=None):
        self = super().fromcache(cache)
        self._request_event = request_event
        return self

    def __getattribute__(self, item: str):
        if item not in ("append",):
            return diskcache.Deque.__getattribute__(self, item)

        func = diskcache.Deque.__getattribute__(self, item)

        def wrapper(*args, **kwargs):
            request_meta = {
                "request_type": "diskcache",
                "name": func.__name__,
                "start_time": time.time(),
                "response_length": 0,
                # calculating this for an xmlrpc.client response would be too hard
                "response": None,
                "context": {},  # see HttpUser if you actually want to implement contexts
                "exception": None,
            }
            start_perf_counter = time.perf_counter()
            try:
                request_meta["response"] = func(*args, **kwargs)
            except Exception as e:
                request_meta["exception"] = e
            response_time = (time.perf_counter() - start_perf_counter) * 1000
            request_meta["response_time"] = response_time
            # This is what makes the request actually get logged in Locust
            self._request_event.events.request.fire(**request_meta)
            return request_meta["response"]

        return wrapper


class BaseActor(User):
    """
    A minimal Locust user class that provides an XmlRpcClient to its subclasses
    """

    host = ""
    abstract = True  # don't instantiate this as an actual user when running Locust
    client: MyClient

    def __init__(self, environment):
        super().__init__(environment)
        self.environment = environment
        self.cache = diskcache.Cache(
            # CONFIG 1
            # directory="/dev/shm/my_index.db",
            # sqlite_journal_mode="OFF",
            # statistics=0,
            # sqlite_synchronous=0,
            #
            # CONFIG 2
            directory="/tmp/my_index.db",
            statistics=0,
            sqlite_synchronous=0,
            sqlite_journal_mode="wal",
            #
            # shared config
            #
            sqlite_cache_size=0,
            sqlite_mmap_size=0,
            # size_limit=10 * (1024**3),
        )

        self.client = MyClient.fromcache(self.cache, request_event=environment)


class SingleInsert(BaseActor):
    @task
    def only_insert(self):
        self.client.append("s")


grantjenks commented on June 30, 2024

What are the results?


ddorian commented on June 30, 2024

On my laptop it was 1500/s on disk and 5500/s in memory (Deque.append()). There were many bottlenecks: too many transactions, querying for rows before appending, unused indexes (for the deque), triggers (for the cache count), and so on.

With some relaxed settings (like not checking whether the cache is full on every insert), sharding/fanout, batching, and a faster laptop/server, it should be able to run at least 10x faster.
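The "too many transactions" bottleneck can be sketched with plain sqlite3 (a stand-in table, not diskcache's actual schema): committing once per append pays the transaction overhead on every row, while batching pays it once per batch. On disk the gap is typically far larger than in memory, because each commit forces a sync.

```python
import sqlite3
import time


def bench_inserts(batch_size, total=2000):
    """Insert `total` rows, committing once per `batch_size` rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cache (key INTEGER PRIMARY KEY, value BLOB)")
    start = time.perf_counter()
    for _ in range(total // batch_size):
        with conn:  # one transaction per batch
            conn.executemany(
                "INSERT INTO cache (value) VALUES (?)",
                [(b"s",)] * batch_size,
            )
    return time.perf_counter() - start


per_row = bench_inserts(1)    # commit per append
batched = bench_inserts(500)  # commit per 500 appends
```

Pointing `sqlite3.connect` at a real file (with `synchronous` left at its default) makes the per-row case dramatically worse, which is what the on-disk numbers above reflect.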


ddorian commented on June 30, 2024

Assuming I use it, would you be open down the line to accepting PRs that make some things optional and fix performance?

Some examples:

  1. select_expired_template
    can be just 1 CTE query instead of doing 2 queries from Python
  2. 'CREATE INDEX IF NOT EXISTS Cache_expire_time ON'
    should probably be a partial index
  3. Make the triggers keeping "total row count" optional (the db would do SELECT COUNT(*) underneath).
  4. Delete can just delete instead of selecting and not raise an error (optional)

     with self._transact(retry) as (sql, cleanup):
         rows = sql(
             'SELECT rowid, filename FROM Cache'
             ' WHERE key = ? AND raw = ?'
             ' AND (expire_time IS NULL OR expire_time > ?)',
             (db_key, raw, time.time()),
         ).fetchall()
         if not rows:
             raise KeyError(key)
         ((rowid, filename),) = rows
         sql('DELETE FROM Cache WHERE rowid = ?', (rowid,))
         cleanup(filename)
         return True
  5. Don't need unique-index on Deque

     sql(
         'CREATE UNIQUE INDEX IF NOT EXISTS Cache_key_raw ON'
         ' Cache(key, raw)'
     )
  6. Use WITHOUT ROWID for the normal Cache/Index instead of

     ' rowid INTEGER PRIMARY KEY,'
  7. Use AUTOINCREMENT in Deque.append() instead of

     now = time.time()
     raw = True
     expire_time = None if expire is None else now + expire
     size, mode, filename, db_value = self._disk.store(value, read)
     columns = (expire_time, tag, size, mode, filename, db_value)
     order = {'back': 'DESC', 'front': 'ASC'}
     select = (
         'SELECT key FROM Cache'
         ' WHERE ? < key AND key < ? AND raw = ?'
         ' ORDER BY key %s LIMIT 1'
     ) % order[side]
     with self._transact(retry, filename) as (sql, cleanup):
         rows = sql(select, (min_key, max_key, raw)).fetchall()
         if rows:
             ((key,),) = rows
             if prefix is not None:
                 num = int(key[(key.rfind('-') + 1) :])
             else:
                 num = key
             if side == 'back':
                 num += 1
             else:
                 assert side == 'front'
                 num -= 1
         else:
             num = 500000000000000
         if prefix is not None:
             db_key = '{0}-{1:015d}'.format(prefix, num)
         else:
             db_key = num
  8. There are such cases in most methods
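Items 1 and 4 can be sketched against a simplified stand-in for the Cache table: on SQLite 3.35+, `DELETE ... RETURNING` collapses the SELECT-then-DELETE pair into one statement while still handing back the filename needed for cleanup. (Schema and values here are illustrative, not diskcache's real ones.)

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Cache ("
    " rowid INTEGER PRIMARY KEY, key TEXT, raw INTEGER,"
    " expire_time REAL, filename TEXT)"
)
conn.execute(
    "INSERT INTO Cache (key, raw, expire_time, filename)"
    " VALUES ('k', 1, NULL, 'val-000.bin')"
)

# One statement instead of SELECT + DELETE (requires SQLite >= 3.35).
rows = conn.execute(
    "DELETE FROM Cache"
    " WHERE key = ? AND raw = ?"
    " AND (expire_time IS NULL OR expire_time > ?)"
    " RETURNING filename",
    ("k", 1, time.time()),
).fetchall()
# rows -> [('val-000.bin',)]; each filename would then go to cleanup()
```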


grantjenks commented on June 30, 2024

Sure, I’m open to improvements. But, I have some comments:

  1. How do you get the filenames and do the delete at the same time?
  2. You mean like excluding NULLs? That seems reasonable.
  3. Maybe. len() should really be fast in my mind but I suppose if it’s optional that would be okay. Maybe a count() method would better indicate the slower path.
  4. Again, how to cleanup the file then?
  5. How would you implement that? Deque is layered on top of Cache without specializations today. Specializing Cache could be tricky.
  6. Probably not, the rowid is really helpful in debugging.
  7. Probably not.

I like the partial index ideas best and the specializations for Deque least.

I would propose that you make all the changes you would like in a separate project (like fastdeque or diskdeque or whatever) and then benchmark the deque implementations against each other. Depending on how big the improvements are and how extensive the changes, maybe we could merge it back. Or, I could remove my Deque implementation and recommend yours.

Part of my hesitation is the deque scenario itself which is kind of moving away from the primary cache/index scenario.


ddorian commented on June 30, 2024

> How do you get the filenames and do the delete at the same time?

I see you're storing some values as separate files; it won't work in that case. It should work when using an incremental BLOB for large values.

I wouldn't cache small-inline-values & large-values that end up as separate files in the same cache instance though.

> You mean like excluding NULLs? That seems reasonable.

Yes, less index to maintain.
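A sketch of the partial-index idea with plain sqlite3 (simplified stand-in table): rows whose `expire_time` is NULL never match the expiry sweep, so excluding them from the index skips index maintenance for never-expiring inserts, while the sweep query can still use the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Cache (rowid INTEGER PRIMARY KEY, expire_time REAL)"
)

# Partial index: only rows that can actually expire are indexed;
# inserts with expire_time IS NULL touch no index at all.
conn.execute(
    "CREATE INDEX IF NOT EXISTS Cache_expire_time"
    " ON Cache(expire_time) WHERE expire_time IS NOT NULL"
)

# The expiry sweep still uses the index, because its WHERE clause
# implies the index's condition.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT rowid FROM Cache"
    " WHERE expire_time IS NOT NULL AND expire_time < ?",
    (0.0,),
).fetchall()
```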

> Maybe. len() should really be fast in my mind but I suppose if it’s optional that would be okay. Maybe a count() method would better indicate the slower path.

Optional. I wouldn't want triggers in the hot path when I will only rarely need the total row count (and don't need it under locks).

> Again, how to cleanup the file then?

See the BLOB API as in 1. But still, there should be a specialization for not using files. (Actually, the reverse is true: caching big values into separate files is the specialization.)

> How would you implement that? Deque is layered on top of Cache without specializations today. Specializing Cache could be tricky.

There's no reason to store the kv-cache and deque in the same table in SQLite, since from a quick look they don't share anything.

> Probably not, the rowid is really helpful in debugging.

It's still an extra index to maintain on every insert/delete.

> Probably not.

I didn't understand the reasoning here.

> I would propose that you make all the changes you would like in a separate project (like fastdeque or diskdeque or whatever) and then benchmark the deque implementations against each other.

It's fairly easy to benchmark: just comment out the triggers, indexes, and transactions. I did some of that and went from 5.5K/s to 17K/s for in-memory Deque.append(), for example (using the code above).

> Depending on how big the improvements are and how extensive the changes, maybe we could merge it back. Or, I could remove my Deque implementation and recommend yours.

> Part of my hesitation is the deque scenario itself which is kind of moving away from the primary cache/index scenario.

I picked it as a hard case, with only blocking write operations. The same issues (triggers, indexes, transactions, rowid) apply to the normal cache.


My original question was about config, which was answered.

