GithubHelp home page GithubHelp logo

subtitle_cleanser's Introduction

Introduction

I like watching movies and TV shows with subtitles. Mostly it's because I'm more of a visual person so I like to read the words, rather than just listening. Sometimes it's also because the actors speak with accents, or mumble, or use lingo I'm not familiar with, so it's helpful to confirm what they're saying with the subtitles.

However, I get annoyed when the subtitles aren't what I would deem "perfect". They have typos, or they're all in CAPS, or the words are wrong because the subtitler themselves have misheard. I Googled around a little, but I couldn't find any tool that would let me adjust the subs the way I wanted, so I decided to make it myself. Besides the practicality of this project, I also wanted to create something simple but tangible to show off my coding skills.

Eventually, I hope to turn this into a useful tool with customizable options, to allow other people to use it outside my definition of "perfect". For example, I personally don't like the a name with a colon that indicates who's talking, because I can hear the actor myself. But there are plenty of people who do like that, so that would be an option to turn on or off.

Usage

This is an executable Python script written in Python 3.7.

You can drag the script to the same level as the subtitle script you want to cleanse and execute it from the command line with the filename as a parameter. It will output a new file with the word "-cleansed" attached to the original file name in the same level. Currently, I've only tested it with .srt files, since that's the only subtitle file type I use. But it probably works with plaintext file type.

Example:

$ ./subtitle_cleanser 12\ Meet\ Your\ Maker.srt
$ ls -la
...
-rw-r--r--   1 jacetan  ***    71K Aug 24 13:34 12 Meet Your Maker-cleansed.srt
-rwxrwxrwx   1 jacetan  ***    72K Aug 24 13:33 12 Meet Your Maker.srt
-rwxr-xr-x   1 jacetan  ***   9.7K Aug 24 13:30 subtitle_cleanser.py

Features

This script will:

  • Re-number subtitle block numberings
  • Preserve timestamps
  • Remove brackets and their contents (), [], <>,
  • Remove musical notes and paragraph symbols and their contents (usually lyrics for background music)
  • Remove names of speakers and their corresponding colon : e.g. "Jace: Hello World!" --> "Hello World!" (ignores numbers since they're part of timings e.g. "4:00")
  • Remove underscores _ that appear consecutively 2 or more times
  • Remove spaces before sentence punctuation e.g. "This is a sentence part , this is the other part ." --> "This is a sentence part, this is the other part."
  • Add spaces after sentence punctuation (but respects and capitalizes acronyms) e.g. "This is a sentence part,made by j.t., which is me!This is a new sentence." --> "This is a sentence part, made by J.T., which is me! This is a new sentence."
  • Add spaces between a small letter followed by a capial letter as this usually indicates joined words e.g. "Previously onElementary" --> "Previously on Elementary"
  • Replace double symbols and double spaces with single symbols (ignores ellipses ...) e.g. "Wait... You can't do--" --> "Wait... You can't do-"
  • Removes unnecessary spaces between common contractions e.g. "can 't" --> "can't"
  • Removes subtitle blocks that are non-content like Advertising or Titles based on a custom list of JUNK_PATTERNS that can be expanded upon over time
  • Compress multi-lines that are 40 chars or less into single lines
  • Convert multi-lines into 2 roughly balanced lines

Future Features

  • Appropriate Capitalizing
  • Typo detection and fixes
  • Handling of non-English characters
  • Combining short subtitle blocks into bigger blocks
  • Move ending periods inside closing quotes to be outside them unless the speaker is quoting the whole sentence.

Changelog

19 Sep 2021

  • Fixed regex mistake with paragraph symbol
  • Added pattern to detect and remove star symbol pairs * * and single star symbols *
  • Added SMALL_CAPS_EXCEPTIONS to store exceptions to the rule that inserts a space between a small letter and a capital letter
  • Added logic to handle exceptions to the smallCAPS rule
  • Removed logic for handling preceding-hyphens
  • Updated regex pattern to remove speaker name and colon only if the preceding word starts with a capital letter that indicates a person's name
  • Added logic to remove multi-line unwanted content
  • Added a new handlePrecedingHyphens func and added it to main func
  • Called balanceContent only for content that does not have preceding hyphens

24 Aug 2021

  • Added sys.argv[1] to allow specification of the filename from command line
  • Updated outputFilename to only replace the last period with "-cleansed." instead of all periods

18 Aug 2021

  • Added the hyphen symbol - as a possible prefix to speaker names
  • Updated removeUnwantedContent with logic to remove underscores that appear consecutively 2 or more times

16 Aug 2021

  • Updated removeUnwantedContent func to remove paragraph symbols

14 Aug 2021

  • Shortened the pattern to recognize double symbols
  • Added logic to recognize acronyms, capitalize them and prevent them from having spaces added in the middle
  • Added balanceContent func with detailed docstring and logic add back preceding hyphens if any
  • Added logic to compress multi-lines that are 40 chars or less into single lines
  • Added logic to convert multi-lines into 2 roughly balanced lines

12 Aug 2021

  • Added 2 new patterns to JUNK_PATTERNS
  • Updated regex to ignore numbers before colons that are part of times
  • Updated regex to ignore the punctuation that immediately precedes a speaker name with colon

10 Aug 2021

  • Added verifyContent func with detailed docstring and logic to detect junk content
  • Added constants.py file with a list of JUNK_PATTERNS and used it in verifyContent func
  • Added logic to skip over non-content subtitleBlocks based on results from verifyContent

09 Aug 2021

  • Added removeUnwantedContent func with detailed docstring and logic to remove all unwanted content like bracketed content, speaker names with colons, and preceding hyphens
  • Updated getNextSubtitleBlock to detect end of file and return None
  • Added a while loop in the main func to loop through all subtitleBlocks until the end of the file
  • Added a preceding-hypens key to the subtitleBlock to flag it for processing later
  • Added cleanupContent func with detailed docstring and logic to cleanup simple mistakes like double symbols, double spaces, removing spaces before and adding spaces after sentence punctuations, and removing spaces between common contractions.
  • Added logic to write the cleansed subtitle block to a separate file
  • Added COPYING file with GNU 3.0 licence
  • Added README.md file with Introduction, Usage, Features and Future Features, and Changelog

07 Aug 2021

  • Added subtitle_cleanser.py with shebang line and made it executable
  • Added main func and pseudo code
  • Added validation logic to ensure the subtitle file exists and can be opened
  • Added getNextSubtitleBlock func with detailed docstring and logic to remove Byte Order Marks, ignore the Subtitle Block Number and extract the timestamp and content lines

subtitle_cleanser's People

Contributors

jacetan avatar

Watchers

 avatar

subtitle_cleanser's Issues

cleanupContent adds spaces after periods, but doesn't respect acronyms

Original text is "F.B.I.", but the function changes it to "F. B. I. ".

Expected behaviour is to recognize acronym patterns and do not add spaces.

Suggestion: Use the pattern ([A-Z]\.){2,} to detect a single capital letter followed by a period, and the whole thing needs to appear concurrently at least twice.

Removal of bracketed content `()` `[]` fails to remove content that spans multiple lines

Observation:
Original text is

(door closing,
bell ringing)

After running the ./subtitle_cleanser, I still see this text in the cleansed file.

Expectation:
All content inside brackets are removed.

(Optional) Suggestion:
Convert the content in each subtitle block as a single line before making changes.
This is already done in the balanceContent func anyway.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.