jsonzilla / vtt_to_srt3

python script to convert all SIMPLE vtt files in a directory and all of its subdirectories to srt subtitle format

Home Page: https://pypi.org/project/vtt-to-srt3/

License: Apache License 2.0

Python 100.00%
vtt vtt-files srt srt-subtitle-format

vtt_to_srt3's Introduction

vtt_to_srt3

Convert vtt files to srt subtitle format

For Python 3.x. A version for Python 2.7 is available separately.

Docs

https://jsonzilla.github.io/vtt_to_srt3/

Installation

pip install vtt_to_srt3
python -m pip install vtt_to_srt3

Usage from terminal

usage: vtt_to_srt [-h] [-r] [-e ENCODING] pathname

Convert vtt files to srt files

positional arguments:
  pathname              a file or directory with files to be converted

options:
  -h, --help            show this help message and exit
  -r, --recursive       walk path recursively
  -e ENCODING, --encoding ENCODING
                        encoding format for input and output files

Usage as a lib

Convert vtt file

from vtt_to_srt.vtt_to_srt import ConvertFile

convert_file = ConvertFile("input_utf8.vtt", "utf-8")
convert_file.convert()

Convert all vtt files in a directory (set recursive to True to include subdirectories)

from vtt_to_srt.vtt_to_srt import ConvertDirectories

recursive = False  # set to True to also walk subdirectories
convert_directories = ConvertDirectories(".", recursive, "utf-8")
convert_directories.convert()

Manual build

Generate wheel

python -m pip install --upgrade setuptools wheel build
python -m build

Generate documentation

python -m pip install pdoc3
pdoc --html vtt_to_srt/vtt_to_srt.py -o docs
mv docs/vtt_to_srt.html docs/index.html
rm -rf docs/vtt_to_srt

vtt_to_srt3's People

Contributors

heniotierra, jamesderlin, jansenicus, jsonzilla, mend-bolt-for-github[bot], neel-bp, trekologer

vtt_to_srt3's Issues

[BUG] Converting timestamp

Note: for support questions, please use stackoverflow.
This repository's issues are reserved for feature requests and bug reports.
Your issue may already be reported! Please search on the issue tab before creating one.

Expected Behavior

Converting correctly

Current Behavior

Can't convert 1 case

Possible Solution

Context (Environment)

  • Version: vtt_to_srt3-0.2.0.1-py3-none-any.whl
  • Platform: Windows 64-bit, Python 3.9-64bit
  • Subsystem:
  • Files: vtt_to_srt.py

Detailed Description

Converting this subtitle has an issue with 1 specific timestamp

VTT

59:53.280 --> 59:57.480 line:90% position:50% align:middle
‫text


59:57.720 --> 01:00:00.240 line:90% position:50% align:middle
‫text


01:00:00.360 --> 01:00:02.040 line:90% position:50% align:middle
text

01:00:02.160 --> 01:00:06.080 line:90% position:50% align:middle
text

01:00:06.880 --> 01:00:09.080 line:90% position:50% align:middle
text

01:00:09.200 --> 01:00:10.920 line:90% position:50% align:middle
text

SRT

664
00:59:53,280 --> 00:59:57,480
text

59:57,720 --> 01:00:00,240
text

665
01:00:00,360 --> 01:00:02,040
text

666
01:00:02,160 --> 01:00:06,080
text

667
01:00:06,880 --> 01:00:09,080
text

668
01:00:09,200 --> 01:00:10,920
text

Should be

664
00:59:53,280 --> 00:59:57,480
text

665
00:59:57,720 --> 01:00:00,240
text

666
01:00:00,360 --> 01:00:02,040
text

667
01:00:02,160 --> 01:00:06,080
text

668
01:00:06,880 --> 01:00:09,080
text

669
01:00:09,200 --> 01:00:10,920
text

As you can see, the cue 59:57.720 --> 01:00:00.240 is converted to 59:57,720 --> 01:00:00,240:
the start time is missing its 00: hours prefix, and the cue gets no sequence number.
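
The failing cue mixes a start time without hours and an end time with hours. As a minimal sketch (not the library's actual fix), each timestamp on a timing line could be normalized independently, so the hours prefix is filled in whichever side lacks it. The names vtt_timestamp_to_srt and convert_timing_line are hypothetical and do not come from vtt_to_srt3.

```python
import re

# A WebVTT timestamp: optional hours, then mm:ss.ttt,
# e.g. "59:57.720" or "01:00:00.240".
_TIMESTAMP = re.compile(r"(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})")


def vtt_timestamp_to_srt(stamp):
    """Return the SRT form (hh:mm:ss,ttt) of a single WebVTT timestamp."""
    match = _TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not a WebVTT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    # Hours are optional in WebVTT but mandatory in SRT.
    return f"{hours or '00'}:{minutes}:{seconds},{millis}"


def convert_timing_line(line):
    """Convert 'start --> end [cue settings]' to an SRT timing line.

    Cue settings such as 'line:90%' are dropped, since SRT has no equivalent.
    """
    start, arrow, rest = line.partition("-->")
    fields = rest.split()
    if not arrow or not fields:
        raise ValueError(f"not a timing line: {line!r}")
    return f"{vtt_timestamp_to_srt(start)} --> {vtt_timestamp_to_srt(fields[0])}"


print(convert_timing_line(
    "59:57.720 --> 01:00:00.240 line:90% position:50% align:middle"
))
```

Because the start and end are handled separately, the line from this report would come out as 00:59:57,720 --> 01:00:00,240, which a sequence-numbering step can then recognize like any other timing line.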

Possible Implementation

Cannot work with directory of vtt files

Traceback (most recent call last):
  File "path\Documents\audicut\vtts_to_srtttt.py", line 1, in <module>
    from vtt_to_srt import vtts_to_srt
ImportError: cannot import name 'vtts_to_srt' from 'vtt_to_srt' (C:\path\Programs\Python\Python37\lib\site-packages\vtt_to_srt\__init__.py)
[Finished in 0.4s]

'gbk' codec can't decode byte 0x82 in position 107: illegal multibyte sequence

Traceback (most recent call last):
  File "\music_core\ytb_dlr.py", line 203, in file_path
    v2s.vtt_to_srt(file_path + "/" + filename)
  File "\vtt_to_srt\vtt_to_srt.py", line 68, in vtt_to_srt
    file_contents: str = read_text_file(str_name_file)
  File "\vtt_to_srt\vtt_to_srt.py", line 59, in read_text_file
    return f.read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 107: illegal multibyte sequence

The English version is converted normally, but zh-TW does not seem to be converted correctly

n-BXNXvTvV4.zh-TW.zip
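
The traceback shows the file being read with the 'gbk' codec, which is what Python's open() falls back to on a Windows system with a Chinese locale when no encoding is passed. A minimal sketch of the usual remedy is to pass the encoding explicitly. The helper name read_vtt_text is hypothetical, and the assumption that the subtitle file is UTF-8 is not confirmed in the report.

```python
import os
import tempfile


def read_vtt_text(path, encoding="utf-8"):
    """Read a subtitle file with an explicit encoding.

    Without the encoding argument, open() uses the platform's preferred
    encoding, which may be 'gbk' rather than UTF-8.
    """
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Demonstration with a temporary UTF-8 file containing Traditional Chinese text.
sample = "WEBVTT\n\n00:01.000 --> 00:02.000\n字幕測試\n"
with tempfile.NamedTemporaryFile("wb", suffix=".vtt", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))

try:
    text = read_vtt_text(tmp.name)
    print(text == sample)
finally:
    os.remove(tmp.name)
```

The README documents an encoding argument on ConvertFile and an -e/--encoding command-line option, which serve the same purpose in versions that provide them.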

No module named 'vtt_to_srt'

I ran

pip install vtt_to_srt3

and it said Successfully installed vtt-to-srt3-0.1.8.4.

But I can't import the module. Why?
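
The distribution installed by pip is named vtt_to_srt3, while the import name used in the README is vtt_to_srt. One common cause of this error (an assumption here, not confirmed in the report) is that pip installed the package into a different interpreter than the one running the script. A small diagnostic sketch, with the hypothetical helper module_available, shows which interpreter is running and whether it can see the module:

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the running interpreter can locate the top-level module."""
    return importlib.util.find_spec(name) is not None


# The interpreter that is actually executing this script.
print("interpreter:", sys.executable)
# Whether that interpreter can find the import name used in the README.
print("vtt_to_srt importable:", module_available("vtt_to_srt"))
```

If the module is reported as not importable, installing with the same interpreter (python -m pip install vtt_to_srt3, as listed under Installation) rules out a pip/python mismatch.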

[BUG] Conversion does not remove optional cue identifiers in a vtt cue block (results in doubled sequence numbering)

Expected Behavior

In the WebVTT format each cue block can include an optional cue identifier.

The vtt-files I currently want to convert with the help of this library are very close to the srt format and so use a sequence number as cue identifier already.

So looking at your code that converts the formats, I would expect this cue identifier to get handled (removed) to create a valid conversion.

Current Behavior

When I convert those numbered cues, I end up with two consecutive number lines before each timestamp.
This happens because the current version of the library does not look for cue identifiers and remove them.
Sequence numbers are always added, so the numbering ends up doubled.

Possible Solution

As the conversion in this library is mostly done by several direct replacements on the file contents instead of parsing the full vtt content first, it is not easy to modify it and drop any possibly detected cue identifier lines.
So I modified the function add_sequence_numbers to drop any non-empty lines before a line with a timestamp.
It's not a very elegant solution, but it works and doesn't need a complex redesign of the input handling in convert_content.

Steps to Reproduce

  1. Download example file 'E1x1_en.vtt.txt' from attachments and rename it to .vtt
  2. Convert the single file with the following code snippet
import vtt_to_srt.vtt_to_srt as vtt_to_srt
vtt_file = vtt_to_srt.ConvertFile('E1x1_en.vtt', 'utf-8')
vtt_file.convert()
  3. Check the created srt file for doubled numbering before each cue block

Context (Environment)

  • Version: vtt_to_srt3-0.2.0.0-py3-none-any.whl
  • Platform: Windows 64-bit, Python 3.7-32bit
  • Subsystem: -
  • Files: vtt_to_srt.py

Detailed Description

see above

Possible Implementation

    def add_sequence_numbers(self, contents):
        """Adds sequence numbers to subtitle contents and returns new subtitle contents

        :contents -- contents of vtt file
        """
        output = ''
        lines = contents.split('\n')
        i = 1
        n = 0
        while n < len(lines)-1:
            line = lines[n]
            next_line = lines[n+1]
            if self.has_timestamp(next_line):
                if line == '':
                    output += '\n'
                output += str(i) + '\n'
                output += next_line + '\n'
                i += 1
                n += 2
            else:
                output += line + '\n'
                n += 1
        output += lines[-1] + '\n'
        return output

E1x1_en.vtt.txt
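
To try the idea outside the library, the approach described above (drop any non-empty line that directly precedes a timing line, then emit a fresh sequence number) can be sketched as a standalone function. This is an illustrative variant, not the code from the issue or the library: the has_timestamp helper and its regular expression are assumptions, and the function would also drop a subtitle text line if it sat directly above a timing line with no blank line in between.

```python
import re

# Timing line such as "00:00:01.000 --> 00:00:02.000"; hours are optional and
# the fraction separator may be '.' (WebVTT) or ',' (SRT).
_TIMING = re.compile(
    r"^(?:\d{2,}:)?\d{2}:\d{2}[.,]\d{3}\s+-->\s+(?:\d{2,}:)?\d{2}:\d{2}[.,]\d{3}"
)


def has_timestamp(line):
    """Return True if the line starts with a 'start --> end' timing pair."""
    return _TIMING.match(line) is not None


def add_sequence_numbers(contents):
    """Number every cue, discarding an existing cue identifier if present."""
    lines = contents.split("\n")
    output = []
    counter = 1
    for index, line in enumerate(lines):
        if has_timestamp(line):
            output.append(str(counter))
            output.append(line)
            counter += 1
        elif line and index + 1 < len(lines) and has_timestamp(lines[index + 1]):
            # Non-empty line right before a timing line: treat it as a
            # cue identifier and drop it.
            continue
        else:
            output.append(line)
    return "\n".join(output)


sample = (
    "intro\n"
    "00:00:01,000 --> 00:00:02,000\n"
    "Hello\n"
    "\n"
    "00:00:03,000 --> 00:00:04,000\n"
    "World\n"
)
print(add_sequence_numbers(sample))
```

In the sample, the identifier "intro" is replaced by 1 and the second cue, which had no identifier, is numbered 2.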
