Comments (2)

AxillV commented on July 27, 2024

I see, thank you for the very thoughtful answer. Sounds like too much work without a whole lot of reward. I'm still at the start of my language learning journey so indeed, the utility might be a lot lower than what I expected.

(Sorry for the (re)opening spam.)

from memento.

ripose-jp commented on July 27, 2024

The major problem to solve is subtitle tokenization.

This can be done quickly and easily with MeCab. The issue with relying only on MeCab's results is that it tokenizes based only on the data in ipadic, which isn't necessarily going to line up with what is actually available in a user's dictionaries. For example, JMdict contains a lot of entries for phrases that MeCab likely won't consider a single token.

The alternative to MeCab would be writing a tokenizer that's aware of the user's dictionaries. A simple algorithm would be: for each character in the subtitle, create a token for every possible substring starting from that character, then highlight all the matches. This is O(n^2) in searches alone, which is expensive since each search goes out to disk and to Anki in order to get a result. If subtitles are on screen for only a second or two, there's no guarantee that you even get a result back in time unless you're preloading results.
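To make the cost concrete, here is a minimal sketch of that simple algorithm in Python. It is not Memento's code: `find_matches`, `has_entry`, and the toy dictionary are hypothetical names for illustration, and an in-memory set lookup stands in for the real search that would go out to disk and Anki.

```python
from typing import Callable, List, Tuple

Match = Tuple[int, int, str]  # (start index, end index, matched text)


def find_matches(
    subtitle: str, has_entry: Callable[[str], bool]
) -> Tuple[List[Match], int]:
    """Try every substring of the subtitle against the dictionary.

    has_entry is a hypothetical stand-in for a real dictionary/Anki
    query. Returns the matches plus the number of lookups performed.
    """
    matches: List[Match] = []
    lookups = 0
    n = len(subtitle)
    for start in range(n):
        # Every possible substring that begins at this character.
        for end in range(start + 1, n + 1):
            candidate = subtitle[start:end]
            lookups += 1  # one search per candidate substring
            if has_entry(candidate):
                matches.append((start, end, candidate))
    return matches, lookups


# Toy dictionary (assumption): a set replaces the disk/Anki search.
toy_dictionary = {"猫", "好き", "が好き", "猫が好き"}

matches, lookups = find_matches("猫が好きです", toy_dictionary.__contains__)
print(lookups)  # 21 lookups for a 6-character subtitle: n * (n + 1) / 2
for start, end, text in matches:
    print(start, end, text)
```

Even this six-character line needs 21 searches, and the count grows quadratically with subtitle length. Capping the candidate length or preloading results would reduce the cost, but those are separate design choices not shown here.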

The other question I have is: what is the utility of all this? If you search a word, it's likely because you didn't know it or didn't remember it. Knowing you have a card for the term before you even search doesn't really move the needle in my opinion, since Memento is not an SRS program.

Sorry for the half-posted comment originally. I accidentally pressed Ctrl+Enter, which GitHub takes as "publish my in-progress comment".
