noman-land / transcript.fish

Unofficial No Such Thing As A Fish episode transcripts.

Home Page: https://transcript.fish/

HTML 1.59% Python 14.54% TypeScript 81.32% Shell 1.49% CSS 1.06%

ai facts javascript python whisper-cpp nosuchthingasafish nstaaf typescript react sqlite

transcript.fish's Introduction

transcript.fish

Running webapp locally

Run npm install
Run npm run dev

Download episodes from the RSS feed, transcribe them, and add them to the database

TODO: Add instructions for creating database

Install deps
- Run pip install -r requirements.txt
Download most recent episodes and transcribe them
- Change line 11 of whisper.py to local_files_only=False
- (Optional): On line 5 of whisper.py, change model_size = 'large-v2' to your preferred model. See the note below for details and the list of available models.
- Run npm run convert (this is idempotent and will go through all episodes)
  
  NOTE: By default this uses the medium.en Whisper model. On an M1 Mac with 64GB of RAM this transcribes at about 1.4x speed, so an hour-long episode is transcribed in about 43 minutes.
  
  So, as of 25 July 2023:
```
select sum(duration) from episodes
-- 1292175
```
```
   1,292,175.0 seconds
÷         60.0 seconds
÷         60.0 minutes
÷         24.0 hours
-----------------------
=         15.0 days
÷          1.4 speed
-----------------------
=         10.7 days
```
  The good news is that switching to the small.en or tiny.en model increases this speed dramatically, at the cost of slightly lower accuracy. small.en transcribes at about 3x speed, for example.
  
  The other good news is you can kill the script (Ctrl + C) and restart it at any time and it will pick back up after the last fully transcribed episode.
  
  NOTE: This script also downloads all the audio files for the episodes as well as each episode's album art. As of 25 July 2023 this amounts to 487 episodes, ~20GB audio, ~130MB images.
Split database into chunks
- Run npm run split:db
(Optional) Sync database, audio, images, and fonts to (Cloudflare) R2. Needs rclone and jq installed.
- Run npm run sync
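
The transcription-time estimate in the note above can be reproduced with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the speed factors are the approximate figures quoted in the note (1.4x for medium.en, 3x for small.en), not fresh measurements.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def estimated_transcription_days(total_audio_seconds: float, speed: float) -> float:
    """Return wall-clock days needed to transcribe audio at `speed` x real time."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return total_audio_seconds / speed / SECONDS_PER_DAY


if __name__ == "__main__":
    total = 1_292_175  # sum(duration) from the episodes table, as of 25 July 2023
    # Approximate speed factors quoted in the note above (assumed, not measured here).
    for model, speed in [("medium.en", 1.4), ("small.en", 3.0)]:
        days = estimated_transcription_days(total, speed)
        print(f"{model}: {days:.1f} days")
```

With the figures from the note, this gives about 10.7 days for medium.en and about 5.0 days for small.en.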

transcript.fish's Issues

chore: Serve episode images from own domain

Stop spamming audioboom and cache heavily.

Todo

~~Download images as part of convert script~~ (#85)
~~Update URLs in markup to point to locally hosted files~~ (#99)

feat: Global audio player

Player should always be visible no matter what episode you're listening to and what episode you're looking at.

bug(accessibility): Make work with screen reader

It's really bad now. Unusable.

feat(audio): Allow playing episode audio

feat(ux): Make words clickable, being able to start audio from the word that was clicked

Still need to figure out the UX of this because I don't think clicking should immediately start playing, otherwise it would interfere with highlighting text.

Highlighting text is important for:

Copy/paste to other places (sharing quotes, etc)
Generating links directly to words and phrases (#73)

There may need to be a little popup menu that shows up when you highlight some text, where you can choose options like sharing the link to the quote or playing from the highlighted section.

Add footer with last updated date and by

Get more structured metadata for episodes

feat: Make installable progressive web app (PWA)

https://developer.mozilla.org/en-US/docs/Web/Manifest

feat: Add filters to search

bug: Blank yellow screen on mobile Safari for some people

I haven't been able to repro because I don't have an iPhone but at least two people have reported it on mobile Safari.

Knowns:

Person A is not using iCloud Private Relay
Person A said it "used to work"
Person A said other mobile browsers also don't work
Person A said it works on their iPad

Unknowns:

Versions of iOS
Versions of Safari

model name
transcription date

Note: The large-v2 model has been used for episodes since about episode 492 (#134), but most of the ones before that were done with the medium model.

number
title
description

Use sqlite FTS: https://www.sqlite.org/fts5.html

Make another table with one episode per row with the whole transcript in one cell
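
As a rough illustration of the idea above, here is a minimal FTS5 sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are hypothetical placeholders, not the project's actual schema, and the sketch assumes the local SQLite build has FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per episode, with the whole transcript in a single column.
# `episode` is stored alongside the text but excluded from the full-text index.
conn.execute(
    "CREATE VIRTUAL TABLE transcripts "
    "USING fts5(episode UNINDEXED, title, transcript)"
)
conn.executemany(
    "INSERT INTO transcripts (episode, title, transcript) VALUES (?, ?, ?)",
    [
        # Placeholder rows for illustration only.
        (1, "Placeholder episode one", "a fact about fish and another about bees"),
        (2, "Placeholder episode two", "a fact about mountains and rivers"),
    ],
)


def search(conn, query):
    """Return (episode, snippet) pairs for transcripts matching `query`."""
    rows = conn.execute(
        # snippet() marks matches in column 2 (transcript) with [ and ].
        "SELECT episode, snippet(transcripts, 2, '[', ']', '...', 8) "
        "FROM transcripts WHERE transcripts MATCH ? ORDER BY rank",
        (query,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    for episode, excerpt in search(conn, "fish"):
        print(episode, excerpt)
```

Note that MATCH takes FTS5 query syntax, so raw user input containing quotes or operators may need escaping before it is passed in.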

feat(router): Use URL routing to link to episodes

Make convert.py idempotent

Iterate the whole episode list instead of just finding the latest episodes, fill in any gaps
Create the database and tables if they don't exist
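
The two points above could be sketched as follows. This is only an illustration under stated assumptions: the `transcribed` column, the `convert_all` and `ensure_schema` names, and the shape of `feed` are hypothetical and do not reflect the real convert.py.

```python
import sqlite3


def ensure_schema(conn: sqlite3.Connection) -> None:
    """Create the table on first run; do nothing if it already exists."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS episodes (
            number INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            transcribed INTEGER NOT NULL DEFAULT 0  -- hypothetical flag column
        )
        """
    )


def convert_all(conn, feed, transcribe):
    """Walk the whole feed and process every episode not yet transcribed.

    `feed` is an iterable of (number, title) pairs and `transcribe` is a
    callable taking an episode number; both are stand-ins for real code.
    """
    ensure_schema(conn)
    processed = []
    for number, title in feed:
        # Insert missing episodes; rows that already exist are left untouched.
        conn.execute(
            "INSERT OR IGNORE INTO episodes (number, title) VALUES (?, ?)",
            (number, title),
        )
        done = conn.execute(
            "SELECT transcribed FROM episodes WHERE number = ?", (number,)
        ).fetchone()[0]
        if done:
            continue  # finished on an earlier run
        transcribe(number)
        # Mark as done only after transcription succeeds, then commit so an
        # interrupted run resumes after the last fully transcribed episode.
        conn.execute("UPDATE episodes SET transcribed = 1 WHERE number = ?", (number,))
        conn.commit()
        processed.append(number)
    return processed
```

Because every run iterates the full list and skips finished rows, running it twice does no extra work, and any gaps left by earlier failures are filled in.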

-- 06:59:20 -- Episode 296 -- Downloading at https://pscrb.fm/rss/p/pdst.fm/e/arttrk.com/p/ABMA5/audioboom.com/posts/7431775.mp3?modified=1657034312&sid=2399216&source=rss.
Traceback (most recent call last):
  File "/dev/nosuchthing/src/convert.py", line 7, in <module>
    fetch.download_audio(episode)
  File "/dev/nosuchthing/src/fetch.py", line 15, in download_audio
    urllib.request.urlretrieve(audio_url, audio_path)
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
                            ^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 525, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 634, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 557, in error
    result = self._call_chain(*args)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 749, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 525, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 634, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 557, in error
    result = self._call_chain(*args)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 749, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 525, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 634, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 563, in error
    return self._call_chain(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable: Back-end server is at capacity
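
The traceback above ends in an HTTP 503 from the audio host. One common way to handle a transient error like this is to retry with exponential backoff; whether the project does so is not stated here, so treat the following as a sketch. The function name, its parameters, and the injectable `fetch` and `sleep` arguments are assumptions for illustration; only urllib.request.urlretrieve and urllib.error.HTTPError come from the traceback.

```python
import time
import urllib.error
import urllib.request


def download_with_retry(url, path, retries=5, base_delay=1.0,
                        fetch=urllib.request.urlretrieve, sleep=time.sleep):
    """Download `url` to `path`, retrying HTTP 5xx errors with backoff.

    Returns the number of attempts used. `fetch` and `sleep` are injectable
    so the retry logic can be exercised without network access.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            fetch(url, path)
            return attempt + 1
        except urllib.error.HTTPError as err:
            # Retry only server-side errors such as 503; re-raise client
            # errors immediately, and give up after the final attempt.
            if err.code < 500 or attempt == retries - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Combined with an idempotent convert script, a download that still fails after the final retry can simply be picked up on the next run.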