mchangrh / sb-mirror

Docker containers to mirror the SponsorBlock database + API

License: Other

Languages: Shell 95.59%, Dockerfile 4.41%
Topics: docker, sponsorblock, sqlite, rsync

sb-mirror's Introduction

SponsorBlock Mirror

Docker containers to mirror the SponsorBlock database + API

SponsorBlock data and databases are under CC BY-NC-SA 4.0 from https://sponsor.ajay.app.


Licenses: MIT, GPL v3

Usage

This copies the latest SponsorBlock database into the local ./sb-mirror directory:

docker run --rm -it -v "${PWD}/sb-mirror:/mirror" mchangrh/sb-mirror:alpine

docker-compose

sb-mirror:
  image: mchangrh/sb-mirror
  container_name: sb-mirror
  volumes:
    - ./mirror:/mirror
    - ./export:/export
  ports:
    - 873:873
  environment:
  # - MIRROR=TRUE # enable cronjob
  # - MIRROR_URL=mirror.sb.mchang.xyz # override to set upstream mirror 
  # - SQLITE=FALSE # generate .db in /export  

Mirroring

If you would like to set up an active mirror, allow 873/tcp through your firewall for rsyncd and uncomment the relevant lines in docker-compose.
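For reference, the rsyncd module such a mirror exposes might look like the following minimal rsyncd.conf sketch; the module name and comment are assumptions, not the image's actual configuration:

```ini
[sponsorblock]
    path = /mirror
    read only = yes
    comment = SponsorBlock database mirror
```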

If you would like to set up a full API mirror, see containers


Contributions & pull requests are always welcome & appreciated

A non-exhaustive list of packages & their respective licences is available here

archive.sb.mchang.xyz

  • 24hr delay
  • historical archive
  • rsync + http(s)

mirror.sb.mchang.xyz

  • 5 minute delay
  • rsync + http(s)

sponsorblock.kavin.rocks

  • 5 minute delay
  • rsync

Special thanks to Ajay, SponsorBlock, SponsorBlockServer and SponsorBlockSite contributors, SponsorBlock VIPs and the community for their contributions.

Don't be shy! Join us on Discord or Matrix

sb-mirror's People

Contributors

ajayyy, firemasterk, mchangrh


sb-mirror's Issues

rsync not starting

Upon starting the docker container, I get:

sb-mirror_1  | @ERROR: chdir failed
sb-mirror_1  | rsync error: error starting client-server protocol (code 5) at main.c(1859) [Receiver=3.2.7]

I have the required ports open, but I have no idea what's causing this.

SQLite3 Support

Instead of spinning up a Postgres mirror, add a config option for running off of a generated .db file.
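A rough sketch of what such an option might do with the sqlite3 CLI; the table name, column layout, and sample row here are assumptions for illustration, not the project's actual schema:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"

# stand-in for a downloaded dump (real dumps are much larger)
printf 'videoID,startTime,endTime\nabc123,1.5,10.0\n' > sponsorTimes.csv

# import the CSV; with no existing table, .import uses the header
# row as column names and creates the table automatically
sqlite3 export.db <<'EOF'
.mode csv
.import sponsorTimes.csv sponsorTimes
EOF

sqlite3 export.db 'SELECT COUNT(*) FROM sponsorTimes;'   # → 1
```

A cron job could rerun this after each rsync pull to refresh the .db in /export.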

ERROR: rejecting unrequested file-list name: categoryVotes.csv

Uses SponsorBlock data from https://sponsor.ajay.app/
Downloading from mirror: sponsorblock.kavin.rocks
receiving incremental file list
ERROR: rejecting unrequested file-list name: categoryVotes.csv
rsync error: protocol incompatibility (code 2) at flist.c(998) [Receiver=3.2.4]
Starting rsync daemon
Downloading from mirror: sponsorblock.kavin.rocks
receiving incremental file list
ERROR: rejecting unrequested file-list name: categoryVotes.csv
rsync error: protocol incompatibility (code 2) at flist.c(998) [Receiver=3.2.4]

Can anybody help me?

Can't get this working

I've yet to get this working. I'm following the docker-compose file to deploy via Ansible. Translated back into a docker-compose file, it would look like:

version: '3'
services:
  postgres:
    ports:
      - '5432:5432'
    environment:
      - POSTGRES_USER=sponsorblock
      - POSTGRES_PASSWORD=abc123
      - POSTGRES_DB=sponsorblock
    volumes:
      - ./mirror:/mirror
      - ./postgres:/var/lib/postgresql/data:rw
    image: postgres:alpine

  sponsorblock-mirror:
    image: ghcr.io/mchangrh/sb-mirror:latest
    environment:
      - MIRROR="TRUE"
      - SQLITE="FALSE"
      - MIRROR_URL="sponsor.ajay.app"
    volumes:
      - ./mirror:/mirror
      - ./export:/export

  sponsorblock-server:
    ports: 
     - "8080:8080"
    volumes:
      - ./export/SponsorTimesDB.db:/app/database/SponsorTimesDB.db
      - ./postgres-config.json:/app/config.json
    image: ghcr.io/mchangrh/sb-server-runner:latest

config:

{
  "port": 8080,
  "globalSalt": "mirrorsalt",
  "adminUserID": "c132d179bfa6a48f4014c163e4f530ecb401f505cf2e5fd02f5b1bb55ac97f5c",
  "behindProxy": true,
  "postgres": {
    "user": "sponsorblock",
    "password": "abc123",
    "host": "postgres",
    "port": 5432
  },
  "mode": "mirror",
  "dumpDatabase": {
    "postgresExportPath": "/mirror"
  }
}

and the current logs:

❯ docker logs -f sponsorblock-mirror
Uses SponsorBlock data from https://sponsor.ajay.app/
Downloading from sponsor.ajay.app
Validating Downloads
rsync error: timeout waiting for daemon connection (code 35) at socket.c(278) [Receiver=3.2.3]
head: /mirror/*.csv: No such file or directory
awk: /mirror/*.csv: No such file or directory
mv: can't rename 'tmp.csv': No such file or directory
Starting SQLite Conversion
Error: cannot open "/mirror/*.csv"
Starting rsync daemon
Downloading from sponsor.ajay.app
rsync error: timeout waiting for daemon connection (code 35) at socket.c(278) [Receiver=3.2.3]
Validating Downloads
head: /mirror/*.csv: No such file or directory
awk: /mirror/*.csv: No such file or directory
mv: can't rename 'tmp.csv': No such file or directory
Starting SQLite Conversion
Error: cannot open "/mirror/*.csv"
❯ docker logs -f sponsorblock-server
Licenced under the MIT Licence https://github.com/ajayyy/SponsorBlockServer
Already up to date.

> [email protected] start
> ts-node src/index.ts

 WARN  2021-12-04T08:51:02.929Z:   [dumpDatabase] No tables configured

404 - Not Found

After setting everything up, the only answers I'm getting from the API is this:
sponsorblock.errors.NotFoundException: Not Found: 404 Not Found
I'm able to connect to the API, but it won't give any result.
The CSVs are being downloaded by rsync into the mirror folder and the sqlite .db files are being generated into the export folder. Is there anything I'm missing?

docker-compose.yml:
version: '3'
services:
  postgres:
    ports:
      - '5432:5432'
    environment:
      - POSTGRES_USER=mirror_db_user
      - POSTGRES_PASSWORD=mirror_db_pass
    volumes:
      - ./mirror:/mirror
    image: postgres:alpine
  sb-mirror:
    image: mchangrh/sb-mirror:latest
    build: ./build/sb-mirror
    # map port externally
    ports:
      - "873:873"
    environment:
      - MIRROR=TRUE # enable cronjob
      # - MIRROR_URL=qc.mchang.xyz # override to set upstream mirror
      - VALIDATE=TRUE # enable rsync checksum validation
      - CSVLINT=TRUE # lint csv files (will just stop sqlite3 from complaining)
      - SQLITE=TRUE # generate .db in /export
      - PADDING_VAR=false # here to make compose not complain
    volumes:
      - ./mirror:/mirror
      - ./export:/export
  sb-server:
    ports:
      - "8080:8080"
    volumes:
      - ./export/:/app/database/
      # - ./sqlite-config.json:/app/config.json
      # - ./postgres-config.json:/app/config.json
    image: ghcr.io/ajayyy/sb-server:latest
networks:
  default:
    external:
      name: npm-nw

secure/trusted replication

Maybe for the future, but in order to have secure replication there must be some sort of central trust database or some other way to validate the files. This can be done with hashes such as xxh3, which runs at RAM or L3-cache speeds.

This isn't that difficult to integrate, but it requires validation of the hash itself, either by downloading the hashes from the main server over rsync or HTTPS. While bad data is still possible, it becomes less likely once a list of trusted individuals/mirrors is established.

a hypothetical would go like this:

server

  • generates new dumps
  • xxh3 is generated along with dump
  • push is sent over webhook with xxh3 hash
  • xxh3hash file generated in dump

mirror

  • receive xxh3hash by webhook or rsync
  • download new files
  • verify against hashes from main server
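The hypothetical above can be sketched in shell. sha256sum stands in for xxh3 here (xxhsum -H3 would be the faster real-world choice), and the file names are made up for illustration:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"

# server side: generate the dump and a hash file alongside it
echo 'videoID,startTime,endTime' > sponsorTimes.csv
sha256sum sponsorTimes.csv > dump.hashes

# mirror side: after fetching both files, verify the dump
# against the server's published hashes
sha256sum -c dump.hashes   # prints "sponsorTimes.csv: OK" on success
```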

RSync Block size

After a few tests to see why deltas were not being generated, I came to the unfortunate conclusion that the data isn't really chunkable or workable in blocks.


All tests replicate against the May 10 dataset.

Column 1 is the replication source date; the header row is the rsync block size in bytes; cells are bytes matched.

Source   2048       1024       512        256        128        64
4/29     162.62 MB  404.24 MB  672.73 MB  884.11 MB  1.03 GB    1.11 GB
3/1      0.00 KB    0.00 KB    0.00 KB    0.77 KB    1.47 MB    194.18 MB
12/31    0.00 KB    0.00 KB    0.00 KB    0.26 KB    760.19 KB  162.23 MB

With zstd compression:

Source   2048       1024       512        256        128        64
4/29     212.99 KB  510.98 KB  2.05 MB    5.91 MB    11.78 MB   19.86 MB
3/1      0.00 KB    0.00 KB    0.00 KB    0.00 KB    0.64 KB    4.80 KB
12/31    0.00 KB    0.00 KB    0.00 KB    0.00 KB    0.26 KB    0.77 KB

Test with larger block sizes:

Block size   Matched data
34.46 KB     1.62 MB
32.00 KB     1.66 MB
24.00 KB     2.38 MB
12.00 KB     4.28 MB
6.40 KB      10.89 MB
3.20 KB      65.66 MB
1.60 KB      237.85 MB

Instructions unclear

Hey, I am trying to get an sb-mirror with HTTP API running over postgres. However, after starting the containers with docker-compose up and waiting for files to download, the postgres database is empty and the http server returns 404's for any query.

Configs

docker-compose.yml

version: '3'
services:
  postgres:
    ports:
      - '127.0.0.1:5432:5432'
    environment:
      - POSTGRES_USER=mirror_db_user
      - POSTGRES_PASSWORD=mirror_db_pass
    volumes:
      - ./mirror:/mirror
    image: postgres:alpine
  sb-mirror:
    image: mchangrh/sb-mirror:latest
    build: ./build/sb-mirror
    # map port externally
    ports:
      - "127.0.0.1:873:873"
    environment:
      - MIRROR=TRUE # enable cronjob
      - MIRROR_URL=qc.mchang.xyz # override to set upstream mirror
      # - VALIDATE=TRUE # enable rsync checksum validation
      # - CSVLINT=TRUE # lint csv files (will just stop sqlite3 from complaining)
      # - SQLITE=TRUE # generate .db in /export
        # SQLITE will not always generate usable files since postgres does not export files correctly.
      - PADDING_VAR=false # here to make compose not complain
    volumes:
      - ./mirror:/mirror
      - ./export:/export
  sb-server:
    ports:
     - "127.0.0.1:6000:8080"
    volumes:
      - ./export/:/app/database/
      - ./mirror/:/mirror
      # - ./sqlite-config.json:/app/config.json
      - ./postgres-config.json:/app/config.json
    image: ghcr.io/ajayyy/sb-server:latest

postgres-config.json

{
  "port": 8080,
  "globalSalt": "mirrorsalt",
  "adminUserID": "c132d179bfa6a48f4014c163e4f530ecb401f505cf2e5fd02f5b1bb55ac97f5c",
  "behindProxy": true,
  "postgres": {
      "user": "mirror_db_user",
      "password": "mirror_db_pass",
      "host": "postgres",
      "port": 5432
  },
  "mode": "mirror",
  "dumpDatabase": {
    "postgresExportPath": "/mirror"
  }
}

Output

Creating sb-mirror_postgres_1  ... done
Creating sb-mirror_sb-server_1 ... done
Creating sb-mirror_sb-mirror_1 ... done
Attaching to sb-mirror_sb-mirror_1, sb-mirror_sb-server_1, sb-mirror_postgres_1
sb-mirror_1  | Uses SponsorBlock data from https://sponsor.ajay.app/
sb-mirror_1  | Downloading from mirror: qc.mchang.xyz
sb-server_1  | Entrypoint script
postgres_1   | The files belonging to this database system will be owned by user "postgres".
postgres_1   | This user must also own the server process.
postgres_1   |
postgres_1   | The database cluster will be initialized with locale "en_US.utf8".
postgres_1   | The default database encoding has accordingly been set to "UTF8".
postgres_1   | The default text search configuration will be set to "english".
postgres_1   |
postgres_1   | Data page checksums are disabled.
postgres_1   |
postgres_1   | fixing permissions on existing directory /var/lib/postgresql/data ... ok
postgres_1   | creating subdirectories ... ok
postgres_1   | selecting dynamic shared memory implementation ... posix
postgres_1   | selecting default max_connections ... 100
postgres_1   | selecting default shared_buffers ... 128MB
postgres_1   | selecting default time zone ... UTC
postgres_1   | creating configuration files ... ok
postgres_1   | running bootstrap script ... ok
sb-mirror_1  | receiving incremental file list
postgres_1   | performing post-bootstrap initialization ... sh: locale: not found
postgres_1   | 2022-11-09 18:48:54.437 UTC [30] WARNING:  no usable system locales were found
postgres_1   | ok
postgres_1   | syncing data to disk ... ok
postgres_1   |
postgres_1   |
postgres_1   | Success. You can now start the database server using:
postgres_1   |
postgres_1   |     pg_ctl -D /var/lib/postgresql/data -l logfile start
postgres_1   |
postgres_1   | initdb: warning: enabling "trust" authentication for local connections
postgres_1   | initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
postgres_1   | waiting for server to start....2022-11-09 18:48:55.647 UTC [36] LOG:  starting PostgreSQL 15.0 on x86_64-pc-linux-musl, compiled by gcc (Alpine 11.2.1_git20220219) 11.2.1 20220219, 64-bit
postgres_1   | 2022-11-09 18:48:55.649 UTC [36] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
postgres_1   | 2022-11-09 18:48:55.656 UTC [39] LOG:  database system was shut down at 2022-11-09 18:48:55 UTC
postgres_1   | 2022-11-09 18:48:55.661 UTC [36] LOG:  database system is ready to accept connections
postgres_1   |  done
postgres_1   | server started
postgres_1   | CREATE DATABASE
postgres_1   |
postgres_1   |
postgres_1   | /usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*
postgres_1   |
postgres_1   | waiting for server to shut down....2022-11-09 18:48:55.853 UTC [36] LOG:  received fast shutdown request
postgres_1   | 2022-11-09 18:48:55.884 UTC [36] LOG:  aborting any active transactions
postgres_1   | 2022-11-09 18:48:55.886 UTC [36] LOG:  background worker "logical replication launcher" (PID 42) exited with exit code 1
postgres_1   | 2022-11-09 18:48:55.886 UTC [37] LOG:  shutting down
postgres_1   | 2022-11-09 18:48:55.888 UTC [37] LOG:  checkpoint starting: shutdown immediate
postgres_1   | 2022-11-09 18:48:55.961 UTC [37] LOG:  checkpoint complete: wrote 918 buffers (5.6%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.024 s, sync=0.045 s, total=0.076 s; sync files=250, longest=0.037 s, average=0.001 s; distance=4221 kB, estimate=4221 kB
postgres_1   | 2022-11-09 18:48:55.972 UTC [36] LOG:  database system is shut down
postgres_1   |  done
postgres_1   | server stopped
postgres_1   |
postgres_1   | PostgreSQL init process complete; ready for start up.
postgres_1   |
postgres_1   | 2022-11-09 18:48:56.084 UTC [1] LOG:  starting PostgreSQL 15.0 on x86_64-pc-linux-musl, compiled by gcc (Alpine 11.2.1_git20220219) 11.2.1 20220219, 64-bit
postgres_1   | 2022-11-09 18:48:56.084 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
postgres_1   | 2022-11-09 18:48:56.084 UTC [1] LOG:  listening on IPv6 address "::", port 5432
postgres_1   | 2022-11-09 18:48:56.127 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
postgres_1   | 2022-11-09 18:48:56.133 UTC [51] LOG:  database system was shut down at 2022-11-09 18:48:55 UTC
postgres_1   | 2022-11-09 18:48:56.137 UTC [1] LOG:  database system is ready to accept connections
sb-mirror_1  | ./
sb-mirror_1  | categoryVotes.csv
      8,771,182 100%    6.83MB/s    0:00:01 (xfr#1, to-chk=8/10)
sb-mirror_1  | lockCategories.csv
      1,609,363 100%    4.25MB/s    0:00:00 (xfr#2, to-chk=7/10)
sb-mirror_1  | ratings.csv
        924,650 100%    2.06MB/s    0:00:00 (xfr#3, to-chk=6/10)
sb-mirror_1  | sponsorTimes.csv
  1,328,630,890 100%   11.10MB/s    0:01:54 (xfr#4, to-chk=5/10)
sb-mirror_1  | unlistedVideos.csv
      5,729,807 100%   11.43MB/s    0:00:00 (xfr#5, to-chk=4/10)
sb-mirror_1  | userNames.csv
      7,971,851 100%    6.55MB/s    0:00:01 (xfr#6, to-chk=3/10)
sb-mirror_1  | videoInfo.csv
    408,766,072 100%   13.63MB/s    0:00:28 (xfr#7, to-chk=2/10)
sb-mirror_1  | vipUsers.csv
          3,558 100%    5.70kB/s    0:00:00 (xfr#8, to-chk=1/10)
sb-mirror_1  | warnings.csv
        153,796 100%  240.69kB/s    0:00:00 (xfr#9, to-chk=0/10)
sb-mirror_1  |
sb-mirror_1  | sent 209 bytes  received 1,690,016,195 bytes  11,304,457.55 bytes/sec
sb-mirror_1  | total size is 1,762,561,169  speedup is 1.04
sb-mirror_sb-mirror_1 exited with code 0

What I also tried:

  • Setting the DBINIT=TRUE environment variable in sb-server seems to create a "sponsorTimes" database in postgres, which contains many sane-looking tables (all of which stay empty, however)
  • Changing /app/config.json to /usr/src/app/config.json, since that seems to be the correct location for sb-server after checking the source code; doing so prints this warning to the console: sb-server_1 | WARN 2022-11-09T19:18:46.410Z: [dumpDatabase] No tables configured
  • Enabling all flags in sb-mirror, which results in countless errors from sb-mirror that all look like this:
sb-mirror_1  | /mirror/sponsorTimes.csv:2802: INSERT failed: NOT NULL constraint failed: sponsorTimes.startTime
sb-mirror_1  | /mirror/sponsorTimes.csv:2803: expected 20 columns but found 1 - filling the rest with NULL

add XMR donation address

Projects like this can use bounties for fixing certain issues, or to refund instance and website costs, etc.

Therefore, please add an XMR (Monero) address for donations, since it's more privacy-friendly.

Thank you!

mangle/ workaround for bad csv

Postgres exports CSVs with "" indicating an empty field, which is interpreted as a single escaped quote. Either SQLite will have to read from mangled stdin, or another tool will be needed to import into SQLite.
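One possible workaround is a naive sed pre-pass that rewrites bare "" fields into truly empty fields before the import. Note this sketch would also mangle a literal "" inside a longer quoted field, so it's a band-aid rather than a proper CSV parser:

```shell
#!/bin/sh
# the middle substitution runs twice because a single global replace
# cannot match the overlapping separators of adjacent empty fields
fix_csv() {
  sed -e 's/^"",/,/'   \
      -e 's/,""$/,/'   \
      -e 's/,"",/,,/g' \
      -e 's/,"",/,,/g' \
      -e 's/^""$//'
}

echo 'abc123,"","",5.0' | fix_csv   # → abc123,,,5.0
```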

sb-server-runner crashing

I am using the mchangrh/sb-server-runner:latest image directly with postgres.

Licenced under the MIT Licence https://github.com/ajayyy/SponsorBlockServer
mv: can't rename '/build/node_modules': No such file or directory

> [email protected] start
> ts-node src/index.ts

 WARN  2021-10-29T13:06:36.211Z:   [dumpDatabase] No tables configured
/bin/sh: git: not found
'Error: Command failed: git rev-parse HEAD\n' +
  '/bin/sh: git: not found\n' +
  '\n' +
  '    at checkExecSyncError (node:child_process:826:11)\n' +
  '    at execSync (node:child_process:900:15)\n' +
  '    at getCommit (/app/src/utils/getCommit.ts:4:47)\n' +
  '    at /app/src/index.ts:21:24\n' +
  '    at Generator.next (<anonymous>)\n' +
  '    at fulfilled (/app/src/index.ts:5:58)\n' +
  '    at processTicksAndRejections (node:internal/process/task_queues:96:5)'

rsync --append without-verify won't make the output file identical

➜  echo -n "Hello world" > source.txt
➜  echo -n "123" > dest.txt
➜  rsync -ztvP --zc=lz4 --append source.txt dest.txt
source.txt
             11 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=0/1)

sent 96 bytes  received 35 bytes  262.00 bytes/sec
total size is 11  speedup is 0.08
➜  cat dest.txt
123lo world

# expect dest.txt contains "Hello world"

rsync --append works under the assumption that the existing content of the source file is never edited and that new content is only ever appended to its end.
Rows in sponsorTimes.csv can be modified or deleted from time to time, so rsync --append ends up producing a corrupted file.


I tried --append-verify, but it's slow and only saves about 50% of the bandwidth; it isn't worth the CPU wasted compressing and comparing the diff. ☹️
