Right now, we calculate post hotness on the fly which makes cursoring and performance

Background Post Hotness generation about bsky-furry-feed HOT 5 CLOSED

strideynet commented on June 19, 2024

Background Post Hotness generation

from bsky-furry-feed.

Comments (5)

strideynet commented on June 19, 2024

SELECT
    cp.*
FROM
    candidate_posts cp
        INNER JOIN candidate_actors ca ON cp.actor_did = ca.did
        INNER JOIN post_hotness ph
                   ON ph.post_uri = cp.uri AND ph.algo = @algo AND
                      ph.generated_at = @generated_at
WHERE
      cp.is_hidden = false
  AND cp.deleted_at IS NULL
  AND ca.status = 'approved'
  AND (@require_tags::TEXT[] = '{}' OR @require_tags::TEXT[] <@ cp.tags)
  AND (@exclude_tags::TEXT[] = '{}' OR NOT (@exclude_tags::TEXT[] && cp.tags))
  AND (ph.hotness < @hotness_cursor)
ORDER BY
    ph.hotness DESC
LIMIT @_limit;

Following discussion with Tolf, we like the idea of a background generation process that writes scores out to a table that can be joined in.
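
As a rough illustration of the cursoring in the query above, here is a small Python model of `WHERE hotness < cursor ORDER BY hotness DESC LIMIT n`. It is a sketch under assumptions, not the project's code: `ScoredPost` and `fetch_page` are hypothetical names, and the rows stand in for one pinned `generated_at` snapshot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScoredPost:
    uri: str
    hotness: float


def fetch_page(
    rows: list[ScoredPost], cursor: float, limit: int
) -> tuple[list[ScoredPost], Optional[float]]:
    """Model of: WHERE hotness < cursor ORDER BY hotness DESC LIMIT limit."""
    eligible = [r for r in rows if r.hotness < cursor]
    page = sorted(eligible, key=lambda r: r.hotness, reverse=True)[:limit]
    # A full page may have more rows behind it, so hand back its lowest score
    # as the next cursor; a short page means the feed is exhausted.
    next_cursor = page[-1].hotness if len(page) == limit else None
    return page, next_cursor


snapshot = [ScoredPost("a", 3.0), ScoredPost("b", 1.0), ScoredPost("c", 2.0)]
# The first page uses an infinite cursor so every row is eligible.
first_page, cursor = fetch_page(snapshot, float("inf"), limit=2)
second_page, end = fetch_page(snapshot, cursor, limit=2)
```

One caveat of a strict `<` cursor on a non-unique score: posts that tie with the last row of a page are skipped on the next page. Adding the URI as a tiebreaker would be one way around that.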

itstolf commented on June 19, 2024

still writing up some design notes for this, but what do you think about different tables for different algorithms rather than just putting them all in the same table? i think my feeling is that because each algorithm is semantically distinct, it might not make a lot of sense to put them in the same table and have the "hotness" value have a very different meaning per algorithm, but i don't have a super strong opinion either way!

strideynet commented on June 19, 2024

I see the argument from a semantic side, as the hotness across different algos won't be comparable, but I do think that splitting the tables will be more pain than it's worth. It'll reduce our ability to introduce new algos dynamically in future, and general housekeeping tasks will be more complex (e.g. the background task that cleans out old post hotness scores).

I'm also unsure how well sqlc and other parts of our toolchain will play with this.

If we wanted to track what went into the hotness score for debugging purposes, we could probably just use a JSONB field for this (especially as I doubt we'll ever search by it and it'd mostly be for debugging).
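
To make the JSONB idea concrete, here is a minimal sketch of what such a debug payload might hold. Everything in it is an assumption for illustration: the helper name `score_debug_payload` and the key names are hypothetical, and the thread does not settle on a column or a set of inputs.

```python
import json


def score_debug_payload(likes: int, age_hours: float, algo: str) -> str:
    """Serialize the inputs behind one score as JSON text for a JSONB column."""
    # Hypothetical keys: whatever fed the score for this algo would go here.
    payload = {"algo": algo, "likes": likes, "age_hours": age_hours}
    # sort_keys keeps the output stable, which helps when diffing debug rows.
    return json.dumps(payload, sort_keys=True)


example = score_debug_payload(likes=12, age_hours=3.5, algo="classic")
```

Because each algo can store a different set of keys, a field like this would fit the single-table approach without a schema change per algorithm.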

itstolf commented on June 19, 2024

leaving this here for now until it finds a better home:

schema

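CREATE TABLE post_hotness (
    uri TEXT NOT NULL,
    alg TEXT NOT NULL,
    score REAL NOT NULL,
    generated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    -- one row per post, algorithm and generation run; a key on uri alone
    -- would reject the next run's INSERT while earlier rows are still retained
    PRIMARY KEY (uri, alg, generated_at)
);
-- supports ordering by score within one algorithm
CREATE INDEX post_hotness_score_idx ON post_hotness (alg, score);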

formula

timebase = 2 # hours
gravity = 1.85
# t = age of the post in hours
score = likes / (t + timebase) ** gravity
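
The formula can be tried out directly as a small Python function. This is only a sketch: the function name `hotness` and the non-negative age check are assumptions, and `t` is taken to be the post's age in hours, as in the materializing query.

```python
def hotness(
    likes: int, age_hours: float, timebase: float = 2.0, gravity: float = 1.85
) -> float:
    """Score from the formula above: likes / (t + timebase) ** gravity."""
    if age_hours < 0:
        raise ValueError("age_hours must be non-negative")
    return likes / (age_hours + timebase) ** gravity


# The same 10 likes count for less as the post ages.
fresh = hotness(10, age_hours=0)     # 10 / 2 ** 1.85
day_old = hotness(10, age_hours=24)  # 10 / 26 ** 1.85
```

The `timebase` offset keeps the denominator away from zero for brand-new posts, and a larger `gravity` makes older posts fall away faster.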

materializing query (every 5 minutes)

BEGIN;

DELETE FROM post_hotness
WHERE generated_at < NOW() - INTERVAL '30 minutes';

INSERT INTO post_hotness (uri, alg, score)
SELECT
    cp.uri,
    'classic',
    (SELECT COUNT(*) FROM candidate_likes cl WHERE cl.subject_uri = cp.uri AND cl.deleted_at IS NULL) /
        (EXTRACT(EPOCH FROM NOW() - cp.created_at) / (60 * 60) + 2) ^
        1.85
FROM candidate_posts cp
WHERE
    cp.deleted_at IS NULL AND
    cp.created_at >= NOW() - INTERVAL '48 hours';  -- only compute score over last 48 hours

COMMIT;
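
For readers who want to trace what one run of this transaction does, here is an in-memory Python model of it. It is an illustrative sketch, not the real job: `materialize` and the dict-based rows are hypothetical stand-ins for the tables, and like counts are passed in precomputed.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(minutes=30)  # generations older than this are deleted
WINDOW = timedelta(hours=48)       # only posts this recent are scored


def materialize(post_hotness, posts, likes, now, alg="classic"):
    """Return the post_hotness rows as they would look after one run."""
    # DELETE step: keep rows whose generated_at is within the retention period.
    rows = [r for r in post_hotness if r["generated_at"] >= now - RETENTION]
    # INSERT step: one new row per live post created inside the window.
    for post in posts:
        if post["deleted_at"] is not None or post["created_at"] < now - WINDOW:
            continue
        age_hours = (now - post["created_at"]).total_seconds() / 3600
        score = likes.get(post["uri"], 0) / (age_hours + 2) ** 1.85
        rows.append(
            {"uri": post["uri"], "alg": alg, "score": score, "generated_at": now}
        )
    return rows


now = datetime(2024, 6, 19, 12, 0, tzinfo=timezone.utc)
existing = [
    {"uri": "at://x", "alg": "classic", "score": 1.0,
     "generated_at": now - timedelta(minutes=31)},  # too old, dropped
    {"uri": "at://x", "alg": "classic", "score": 0.9,
     "generated_at": now - timedelta(minutes=5)},   # still retained
]
posts = [
    {"uri": "at://new", "created_at": now - timedelta(hours=1), "deleted_at": None},
    {"uri": "at://old", "created_at": now - timedelta(hours=72), "deleted_at": None},
    {"uri": "at://gone", "created_at": now - timedelta(hours=1), "deleted_at": now},
]
result = materialize(existing, posts, {"at://new": 3}, now)
```

With a 5 minute cadence and 30 minute retention, several generations exist side by side, which is presumably why the selection query pins a single `generated_at`.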

selection query

SELECT
    cp.*
FROM
    candidate_posts cp
        INNER JOIN candidate_actors ca ON cp.actor_did = ca.did
        -- column names follow the schema above: uri and score
        INNER JOIN post_hotness ph
                   ON ph.uri = cp.uri AND ph.alg = @alg AND
                      ph.generated_at = @generated_at
WHERE
      cp.is_hidden = false
  AND ca.status = 'approved'
  AND (COALESCE($1::TEXT[], '{}') = '{}' OR $1::TEXT[] && cp.hashtags)
  AND ($2::BOOLEAN IS NULL OR COALESCE(cp.has_media, false) = $2)
  AND ($3::BOOLEAN IS NULL OR (ARRAY['nsfw', 'mursuit', 'murrsuit'] && cp.hashtags) = $3)
  AND (cp.indexed_at < $4)
  AND cp.deleted_at IS NULL
  AND (ph.score < @hotness_cursor)
ORDER BY
    ph.score DESC
LIMIT @_limit;
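
The three optional filters in the `WHERE` clause can be read as the following Python predicate. This is a sketch of the semantics only: `matches` and its parameter names are hypothetical, and a post's hashtags are assumed to be a non-null list.

```python
from typing import Optional

NSFW_TAGS = {"nsfw", "mursuit", "murrsuit"}


def matches(
    hashtags: list[str],
    has_media: Optional[bool],
    want_tags: Optional[list[str]] = None,  # $1: NULL or empty disables it
    want_media: Optional[bool] = None,      # $2: NULL disables it
    want_nsfw: Optional[bool] = None,       # $3: NULL disables it
) -> bool:
    tags = set(hashtags)
    # $1 && cp.hashtags: at least one requested tag must be present.
    if want_tags and not (set(want_tags) & tags):
        return False
    # COALESCE(cp.has_media, false) = $2: unknown media counts as no media.
    if want_media is not None and bool(has_media) != want_media:
        return False
    # (ARRAY[...] && cp.hashtags) = $3: keep only NSFW or only non-NSFW posts.
    if want_nsfw is not None and bool(NSFW_TAGS & tags) != want_nsfw:
        return False
    return True


unfiltered = matches(["art"], has_media=True)
sfw_only_rejects_nsfw = matches(["art", "nsfw"], True, ["art"], True, False)
```

Each filter is tri-state: passing nothing leaves the feed unfiltered on that axis, while `True` and `False` select opposite halves.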

strideynet commented on June 19, 2024

Completed by #127
