noman-land / transcript.fish

Unofficial No Such Thing As A Fish episode transcripts.

Home Page: https://transcript.fish/

HTML 1.59% Python 14.54% TypeScript 81.32% Shell 1.49% CSS 1.06%
ai facts javascript python whisper-cpp nosuchthingasafish nstaaf typescript react sqlite

transcript.fish

Running webapp locally

  1. Run npm install
  2. Run npm run dev

Download episodes from the RSS feed, transcribe them, and add them to the database

TODO: Add instructions for creating database

  1. Install deps

    • Run pip install -r requirements.txt
  2. Download most recent episodes and transcribe them

    • Change line 11 of whisper.py to local_files_only=False

    • (Optional) Change model_size = 'large-v2' on line 5 of whisper.py to your preferred model. See the note below for details, and see the available models.

    • Run npm run convert (this is idempotent and will go through all episodes)

      NOTE: By default this uses the medium.en Whisper model. On an M1 Mac with 64GB of RAM this transcribes at about 1.4x speed, so an hour-long episode takes about 43 minutes to transcribe.

      So, as of 25 July 2023:

      select sum(duration) from episodes
      -- 1292175
         1,292,175.0 seconds
      ÷         60.0 seconds
      ÷         60.0 minutes
      ÷         24.0 hours
      -----------------------
      =         15.0 days
      ÷          1.4 speed
      -----------------------
      =         10.7 days
      

      The good news is that switching to the small.en or tiny.en model increases this speed dramatically, at the cost of slightly lower accuracy. For example, small.en transcribes at about 3x speed.
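The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `transcription_days` is hypothetical, the total duration comes from the query above, and the speed factors are the approximate figures quoted in this note.

```python
def transcription_days(total_audio_seconds: float, speed: float) -> float:
    """Wall-clock days needed to transcribe audio at `speed` times real time."""
    audio_days = total_audio_seconds / 60 / 60 / 24  # seconds -> days of audio
    return audio_days / speed


# sum(duration) from the episodes table as of 25 July 2023
TOTAL_SECONDS = 1_292_175

if __name__ == "__main__":
    # Roughly 1.4x for medium.en and 3x for small.en, per the note above.
    for model, speed in (("medium.en", 1.4), ("small.en", 3.0)):
        days = transcription_days(TOTAL_SECONDS, speed)
        print(f"{model}: about {days:.1f} days")
```

Plugging in a different speed factor gives a rough idea of how long a full run would take on other hardware or with another model.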

      The other good news is that you can kill the script (Ctrl + C) and restart it at any time, and it will pick back up after the last fully transcribed episode.

      NOTE: This script also downloads all the audio files for the episodes as well as each episode's album art. As of 25 July 2023 this amounts to 487 episodes, ~20GB audio, ~130MB images.

  3. Split database into chunks

    • Run npm run split:db
  4. (Optional) Sync database, audio, images, and fonts to (Cloudflare) R2. Needs rclone and jq installed.

    • Run npm run sync

transcript.fish's People

Contributors

dependabot[bot], lilymrt, noman-land

transcript.fish's Issues

feat: Global audio player

The player should always be visible, no matter which episode you're listening to or which episode you're looking at.

feat(ux): Make words clickable, being able to start audio from the word that was clicked

Still need to figure out the UX of this because I don't think clicking should immediately start playing, otherwise it would interfere with highlighting text.

Highlighting text is important for:

  • Copy/paste to other places (sharing quotes, etc)
  • Generating links directly to words and phrases (#73)

There may need to be a little popup menu that shows up when you highlight some text, where you can choose options like sharing a link to the quote or playing from the highlighted section.

bug: Blank yellow screen on mobile Safari for some people

I haven't been able to repro because I don't have an iPhone but at least two people have reported it on mobile Safari.

Knowns:

  • Person A is not using iCloud Private Relay
  • Person A said it "used to work"
  • Person A said other mobile browsers also don't work
  • Person A said it works on their iPad

Unknowns:

  • Versions of iOS
  • Versions of Safari

feat: Paginate episode list

Currently it loads almost 500 episode summaries and, as a result, a lot of data, mostly in the form of background images (150 MB).
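One possible way to page through the list is a LIMIT/OFFSET query against the SQLite database. The sketch below is in Python for illustration only; the web app itself would need the equivalent on the client side. The `episodes` table name appears in the README, but the `episode` and `title` columns, the function name, and the page size are assumptions.

```python
import sqlite3


def episode_page(conn: sqlite3.Connection, page: int, page_size: int = 20):
    """Return one page of episodes, newest first. Pages are 1-indexed."""
    offset = (page - 1) * page_size
    return conn.execute(
        # Column names here are hypothetical, not the project's real schema.
        "SELECT episode, title FROM episodes "
        "ORDER BY episode DESC LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()
```

With this approach only the summaries (and therefore only the background images) for the visible page need to be fetched.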

Redo transcriptions with larger model

Add db columns:

  • model name
  • transcription date

Note: The large-v2 model has been used for episodes since about episode 492 (#134), but most of the ones before that were done with the medium model.
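A minimal sketch of how the two columns could be added to an existing SQLite database without failing on a second run. The column names `model_name` and `transcribed_at`, their types, and the function name are assumptions, not the project's actual schema.

```python
import sqlite3

# Hypothetical column names and types for the two proposed fields.
NEW_COLUMNS = {"model_name": "TEXT", "transcribed_at": "TEXT"}


def add_missing_columns(conn: sqlite3.Connection, table: str = "episodes"):
    """Add any of NEW_COLUMNS that the table lacks; return the names added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added
```

Existing rows get NULL in the new columns, which makes it easy to query for episodes whose model is unknown and that may need to be redone.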

Make convert.py idempotent

  1. Iterate the whole episode list instead of just finding the latest episodes, fill in any gaps
  2. Create the database and tables if they don't exist
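The two steps above could look roughly like the following sketch. It is not the project's convert.py: the schema, the `transcribed` flag, the shape of the feed entries, and all function names are assumptions made for illustration.

```python
import sqlite3


def ensure_schema(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) episodes table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS episodes ("
        "episode INTEGER PRIMARY KEY, "
        "title TEXT NOT NULL, "
        "transcribed INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()


def pending_episodes(conn: sqlite3.Connection, feed: list) -> list:
    """Return every feed entry that is not fully transcribed, oldest first.

    Walking the whole feed, rather than only entries newer than the latest
    row, is what fills in gaps left by earlier failed or interrupted runs.
    """
    done = {
        row[0]
        for row in conn.execute("SELECT episode FROM episodes WHERE transcribed = 1")
    }
    todo = [entry for entry in feed if entry["episode"] not in done]
    return sorted(todo, key=lambda entry: entry["episode"])


def mark_transcribed(conn: sqlite3.Connection, entry: dict) -> None:
    """Record an episode as done; safe to call more than once."""
    conn.execute(
        "INSERT OR REPLACE INTO episodes (episode, title, transcribed) "
        "VALUES (?, ?, 1)",
        (entry["episode"], entry["title"]),
    )
    conn.commit()
```

Because an episode is only marked after its transcription finishes, killing the script mid-episode leaves that episode in the pending list for the next run.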

bug: convert.py crashes if downloading episode fails

-- 06:59:20 -- Episode 296 -- Downloading at https://pscrb.fm/rss/p/pdst.fm/e/arttrk.com/p/ABMA5/audioboom.com/posts/7431775.mp3?modified=1657034312&sid=2399216&source=rss.
Traceback (most recent call last):
  File "/dev/nosuchthing/src/convert.py", line 7, in <module>
    fetch.download_audio(episode)
  File "/dev/nosuchthing/src/fetch.py", line 15, in download_audio
    urllib.request.urlretrieve(audio_url, audio_path)
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 241, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
                            ^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 525, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 634, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 557, in error
    result = self._call_chain(*args)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 749, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 525, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 634, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 557, in error
    result = self._call_chain(*args)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 749, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 525, in open
    response = meth(req, response)
               ^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 634, in http_response
    response = self.parent.error(
               ^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 563, in error
    return self._call_chain(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/.pyenv/versions/3.11.3/lib/python3.11/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable: Back-end server is at capacity
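One way to keep a transient failure like the 503 above from crashing the whole run is to retry the download a few times and then skip the episode. The sketch below is an assumption about how `download_audio` could be restructured, not the project's fetch.py: the signature, the `retries` and `backoff_seconds` parameters, and the injectable `retrieve` and `sleep` arguments are all hypothetical.

```python
import time
import urllib.error
import urllib.request


def download_audio(
    audio_url: str,
    audio_path: str,
    retries: int = 3,
    backoff_seconds: float = 5.0,
    retrieve=urllib.request.urlretrieve,  # injectable so it can be faked in tests
    sleep=time.sleep,
) -> bool:
    """Try to download a file; return True on success, False if every attempt failed."""
    for attempt in range(1, retries + 1):
        try:
            retrieve(audio_url, audio_path)
            return True
        except urllib.error.URLError as error:  # HTTPError is a subclass of URLError
            print(f"Download attempt {attempt}/{retries} failed: {error}")
            if attempt < retries:
                # Wait a little longer after each failed attempt.
                sleep(backoff_seconds * attempt)
    return False
```

The caller can then skip an episode when the function returns False and move on to the next one; a later run would pick the skipped episode up again if the script iterates the whole episode list.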
