learningequality / pressurecooker

A library of various media and content processing utilities for use in Ricecooker

License: MIT License

Makefile 0.27% Python 98.73% HTML 1.01%

pressurecooker's Introduction

pressurecooker

A library of various media processing functions and utilities

Setting up your environment

TODO

Converting caption files

The pressurecooker library contains utilities for converting caption files from several formats into the preferred VTT format. The formats referred to in this document are DFXP, SAMI, SCC, SRT, TTML, and VTT.

Within pressurecooker, the terms "captions" and "subtitles" are used interchangeably. All of the classes and functions handling conversion use the "subtitles" term.

Language codes

The DFXP, SAMI, and TTML formats can encapsulate caption contents for multiple languages within one file. The SCC, SRT, and VTT formats are generally limited to a single language that isn't defined in the file (VTT may be an exception to this rule, but our converters do not detect its language). Therefore, when converting these files, we cannot know what language we are working with and must instead use the constant LANGUAGE_CODE_UNKNOWN to extract the converted subtitles.

Note also that language codes used within the subtitle files might differ from the LE internal language codes defined in le-utils.

Creating a converter from a file

To create a subtitle converter from a local file path, use these commands:

from pressurecooker.subtitles import build_subtitle_converter_from_file

converter = build_subtitle_converter_from_file('/path/to/file.srt')

Creating a converter from a string

If you already have the captions loaded into a string variable, you can create the converter like so:

from pressurecooker.subtitles import build_subtitle_converter

captions_str = ''   # In this example, `captions_str` holds the caption contents
converter = build_subtitle_converter(captions_str)

Converting captions

For the SCC, SRT, and VTT subtitle formats, which do not contain language code info, you must refer to the language as the constant LANGUAGE_CODE_UNKNOWN at the time of extracting the converted subtitles:

from pressurecooker.subtitles import build_subtitle_converter_from_file
from pressurecooker.subtitles import LANGUAGE_CODE_UNKNOWN

converter = build_subtitle_converter_from_file('/path/to/file.srt')

# Option A: Obtain the contents of the converted VTT file as a string
output_str = converter.convert(LANGUAGE_CODE_UNKNOWN)

# Option B: Write the converted subtitles to a local path
converter.write("/path/to/file.vtt", LANGUAGE_CODE_UNKNOWN)

The LANGUAGE_CODE_UNKNOWN constant is the internal representation pycaption uses to denote subtitles in an unknown language code. This will be the default and only language code for SCC, SRT, and VTT subtitle converters.

If you are unsure of the format, but you know the language of the file, it is safer to conditionally replace the LANGUAGE_CODE_UNKNOWN with that language:

from pressurecooker.subtitles import build_subtitle_converter_from_file
from pressurecooker.subtitles import LANGUAGE_CODE_UNKNOWN

converter = build_subtitle_converter_from_file('/path/to/file')

# Replace unknown language code if present
if converter.has_language(LANGUAGE_CODE_UNKNOWN):
    converter.replace_unknown_language('en')
    
assert converter.has_language('en'), 'Must have English after replace'

output_str = converter.convert('en')

The example below shows how to handle subtitle formats like DFXP, SAMI, and TTML, which can contain multiple languages:

from pressurecooker.subtitles import build_subtitle_converter_from_file
from pressurecooker.subtitles import LANGUAGE_CODE_UNKNOWN, InvalidSubtitleLanguageError

converter = build_subtitle_converter_from_file('/path/to/file')

for lang_code in converter.get_language_codes():
    # `some_logic` would be your decisions on whether to use this language
    if some_logic(lang_code):
        converter.write("/path/to/file-{}.vtt".format(lang_code), lang_code)
    elif lang_code == LANGUAGE_CODE_UNKNOWN:
        raise InvalidSubtitleLanguageError('Unexpected unknown language')


pressurecooker's Issues

PDF Splitting

PDF Splitting has become a feature needed by multiple sushi-chefs, and there are people asking for this functionality in Studio, so it would be good to have this functionality moved to pressurecooker where both ricecooker and Studio can benefit from it.

srt2vtt dependency webvtt-py is Py3.4+ only

See https://pypi.org/project/webvtt-py/

This shouldn't be a problem on studio if we don't import pressurecooker.converters, but we'll have to address this before we introduce subs conversion on Studio.

Desired resolution is for this issue to go away on its own once we update Studio to Py3; otherwise we'll have to find another library to convert subs.

Subtitles formats and compatible youtube_language code checks

This is sample data from the subtitles info returned by yt_resource.get_resource_subtitles()
for this video.

{
  "en": [
    {
      "url": "https://www.youtube.com/api/timedtext?lang=en&v=FN12ty5ztAs&fmt=ttml&name=+via+Dotsub",
      "ext": "ttml"
    },
    {
      "url": "https://www.youtube.com/api/timedtext?lang=en&v=FN12ty5ztAs&fmt=vtt&name=+via+Dotsub",
      "ext": "vtt"
    }
  ],
  "fr": [
    {
      "url": "https://www.youtube.com/api/timedtext?lang=fr&v=FN12ty5ztAs&fmt=ttml&name=+via+Dotsub",
      "ext": "ttml"
    },
    {
      "url": "https://www.youtube.com/api/timedtext?lang=fr&v=FN12ty5ztAs&fmt=vtt&name=+via+Dotsub",
      "ext": "vtt"
    }
  ],
  "zu": [
    {
      "url": "https://www.youtube.com/api/timedtext?lang=zu&v=FN12ty5ztAs&fmt=ttml&name=+via+Dotsub",
      "ext": "ttml"
    },
    {
      "url": "https://www.youtube.com/api/timedtext?lang=zu&v=FN12ty5ztAs&fmt=vtt&name=+via+Dotsub",
      "ext": "vtt"
    }
  ]
}

Upstream code will have to be aware of the following data issues:

  1. Must process only the ext=vtt subs and ignore the ext=ttml ones
  2. Must check for compatibility of the youtube language code with our internal representation, as defined in le-utils. Ricecooker uses the function is_youtube_subtitle_file_supported_language, which would make sense to move to pressurecooker so it can be used here too. For example, is_youtube_subtitle_file_supported_language('zu') returns True since zu can be mapped to internal language code zul.
  3. For incompatible languages---should skip subtitle file, raise a warning, and send an email to admins@studio so we'll know about it and can add to le-utils. In ricecooker someone is checking the logs so there is a human in the loop that can see when we run into an incompatible language code, but if youtube import is running as a background task on studio we won't know about it. (maybe also tell the user that we failed to import certain languages, but not that useful since there is nothing they can do about it--it's LE admins job to add lang code to le-utils).
  4. We have to map youtube language code to internal language representation before creating the subtitles files on Studio. Ricecooker provides another helper function _get_language_with_alpha2_fallback for this purpose -- returns the language_object. For example calling _get_language_with_alpha2_fallback('zu') returns Language(native_name='isiZulu', primary_code='zul', subcode=None, name='Zulu', ka_name=None).
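
The checks listed above could be sketched roughly as follows. This is only an illustration, not code from ricecooker or pressurecooker: `YOUTUBE_TO_INTERNAL` is a hypothetical stand-in for the le-utils lookup, `select_vtt_subtitles` is a made-up name, and the `example.org` URLs are placeholders.

```python
# Hypothetical stand-in for the le-utils lookup. Only the zu -> zul mapping
# comes from the discussion above; a real lookup would go through le-utils.
YOUTUBE_TO_INTERNAL = {"zu": "zul"}


def select_vtt_subtitles(subtitles_info, lang_map=YOUTUBE_TO_INTERNAL):
    """Return (selected, skipped); selected maps internal code -> vtt URL."""
    selected = {}
    skipped = []
    for youtube_code, tracks in subtitles_info.items():
        internal_code = lang_map.get(youtube_code)
        if internal_code is None:
            # Point 3: incompatible language, skip it so the caller can warn.
            skipped.append(youtube_code)
            continue
        # Point 1: keep only the ext=vtt entries and ignore ext=ttml.
        vtt_urls = [t["url"] for t in tracks if t.get("ext") == "vtt"]
        if vtt_urls:
            # Point 4: key the result by the internal language code.
            selected[internal_code] = vtt_urls[0]
    return selected, skipped


sample = {
    "zu": [
        {"url": "https://example.org/zu.ttml", "ext": "ttml"},
        {"url": "https://example.org/zu.vtt", "ext": "vtt"},
    ],
    "xx": [{"url": "https://example.org/xx.vtt", "ext": "vtt"}],
}
selected, skipped = select_vtt_subtitles(sample)
print(selected)
print(skipped)
```

Returning the skipped codes separately leaves it to the caller to decide how to report them (log a warning, email admins, or tell the user).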

@kollivier I'm going to open a PR to move these functions to pressurecooker, so you'll have them available once you start working on the Studio youtube import functionality.

ffmpeg fails to encode some videos

The file format we're encoding to requires the height in pixels to be divisible by two:

If we acquire a sample video:
wget http://kennedyctr.vo.llnwd.net/o41/artsedge/videostories/culkin/culkin10.m4v

And if we try to encode using the command from https://github.com/learningequality/pressurecooker/blob/master/pressurecooker/videos.py#L42
(including the scale=....:-1)
ffmpeg -y -i culkin10.m4v -profile:v baseline -level 3.0 -b:a 32k -ac 1 -vf scale='480:-1' -crf 28 -preset slow -strict -2 -stats 'dragon.mp4'

it fails:

[libx264 @ 0x27e40a0] height not divisible by 2 (480x267)
Error while opening encoder for output stream #0:0 - maybe incorrect parameters such as bit_rate, rate, width or height

But if we use -2, as per https://ffmpeg.org/ffmpeg-filters.html#scale

If one and only one of the values is -n with n >= 1, the scale filter will use a value that maintains the aspect ratio of the input image, calculated from the other specified dimension. After that it will, however, make sure that the calculated dimension is divisible by n and adjust the value if necessary.

we get a successfully encoded video.

ffmpeg -y -i culkin10.m4v -profile:v baseline -level 3.0 -b:a 32k -ac 1 -vf scale='480:-2' -crf 28 -preset slow -strict -2 -stats 'dragon.mp4'

I'd also recommend we use subprocess.check_call(), catch the exception and maybe log an error: at the moment this just creates 0 byte files silently.

import subprocess

try:
    subprocess.check_output(["ffmpeg", "foo.mp4"], stderr=subprocess.STDOUT)
except subprocess.CalledProcessError as e:
    # e.output holds ffmpeg's combined stdout and stderr
    print(e.output)

TL;DR: change the -1 to a -2 in scale = "'{}:-1'" on line 40 of pressurecooker's videos.py
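
Putting both suggestions together, a minimal sketch might look like this. The function names `build_ffmpeg_command` and `compress_video` are hypothetical and are not pressurecooker's actual API; the ffmpeg flags are copied from the command above. Because the arguments are passed as a list rather than through a shell, the scale value needs no extra quoting.

```python
import subprocess


def build_ffmpeg_command(source_path, target_path, max_width=480, crf=28):
    """Build the ffmpeg argument list with an even-height scale filter."""
    # `-2` keeps the aspect ratio and makes the computed height divisible by 2.
    scale = "scale={}:-2".format(max_width)
    return [
        "ffmpeg", "-y", "-i", source_path,
        "-profile:v", "baseline", "-level", "3.0",
        "-b:a", "32k", "-ac", "1",
        "-vf", scale,
        "-crf", str(crf), "-preset", "slow",
        "-strict", "-2", "-stats",
        target_path,
    ]


def compress_video(source_path, target_path):
    """Run ffmpeg and surface its output on failure instead of failing silently."""
    command = build_ffmpeg_command(source_path, target_path)
    try:
        subprocess.check_output(command, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as e:
        # Show what ffmpeg reported, then re-raise so the caller notices.
        print(e.output.decode("utf-8", errors="replace"))
        raise


command = build_ffmpeg_command("culkin10.m4v", "dragon.mp4")
print(" ".join(command))
```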

Tagging @jayoshih as they wrote the original code and might find this useful or know reasons not to do this.

Conflicting versions of lxml

  • ricecooker version: 0.6.61
  • Python version: 3.6

Description

So pip is not thaaaat great at dependencies, but at least now it's good at finding out about conflicts. In this case, lxml is resolved by the top-level dependency pressurecooker, but later contradicted by a transitive dependency.

Can be resolved by:

  • adding le-pycaption's version target in pressurecooker
  • updating le-pycaption with a higher lxml max version (which is great)
  • relaxing the lxml max version in le-pycaption

I'll try out ricecooker with lxml 4.5.0 and see how that goes.

Cutting out the various parts of below output, here's the issue:

  - pressurecooker [required: >=0.0.27, installed: 0.0.27]
    - EbookLib [required: >=0.17.1, installed: 0.17.1]
      - lxml [required: Any, installed: 4.5.0]
    - le-pycaption [required: >=2.1.0a2, installed: 2.1.0a2]
      - lxml [required: >=3.2.3,<4.4.0, installed: 4.5.0]

Installing latest ricecooker:

ERROR: le-pycaption 2.1.0a2 has requirement lxml<4.4.0,>=3.2.3, but you'll have lxml 4.5.0 which is incompatible.
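
To make the conflict concrete, here is a rough sketch of the range check that the pin `lxml>=3.2.3,<4.4.0` implies. It assumes plain dotted numeric versions and is not a full version-specifier implementation; the helper names are made up for illustration.

```python
def parse_version(text):
    """Turn a plain dotted version such as '4.5.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies_lxml_pin(installed, minimum="3.2.3", upper_bound="4.4.0"):
    """Check minimum <= installed < upper_bound, the shape of le-pycaption's pin."""
    return parse_version(minimum) <= parse_version(installed) < parse_version(upper_bound)


print(satisfies_lxml_pin("4.5.0"))  # the installed version from the error above
print(satisfies_lxml_pin("4.3.5"))  # a version inside the pinned range
```

The installed 4.5.0 falls outside the range, which is why pip reports the incompatibility; relaxing the upper bound in le-pycaption would make it pass.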

Output from pipdeptree:

➜  sushi-chef-ubongokids git:(2020-update) ✗ pipdeptree                   
Warning!!! Possibly conflicting dependencies found:
* le-pycaption==2.1.0a2
 - lxml [required: >=3.2.3,<4.4.0, installed: 4.5.0]
------------------------------------------------------------------------
pipdeptree==0.13.2
  - pip [required: >=6.0.0, installed: 20.0.2]
ricecooker==0.6.42
  - beautifulsoup4 [required: >=4.6.3, installed: 4.9.0]
    - soupsieve [required: >1.2, installed: 2.0]
  - cachecontrol [required: ==0.12.0, installed: 0.12.0]
    - msgpack-python [required: Any, installed: 0.5.6]
    - requests [required: Any, installed: 2.23.0]
      - certifi [required: >=2017.4.17, installed: 2020.4.5.1]
      - chardet [required: >=3.0.2,<4, installed: 3.0.4]
      - idna [required: >=2.5,<3, installed: 2.9]
      - urllib3 [required: >=1.21.1,<1.26,!=1.25.1,!=1.25.0, installed: 1.25.9]
  - colorlog [required: >=4.1.0,<4.2, installed: 4.1.0]
  - css-html-js-minify [required: ==2.2.2, installed: 2.2.2]
    - anglerfish [required: Any, installed: 2.5.0]
  - dictdiffer [required: >=0.8.0, installed: 0.8.1]
  - docopt [required: >=0.6.2, installed: 0.6.2]
  - html5lib [required: Any, installed: 1.0.1]
    - six [required: >=1.9, installed: 1.14.0]
    - webencodings [required: Any, installed: 0.5.1]
  - le-utils [required: >=0.1.24, installed: 0.1.24]
    - pycountry [required: ==17.5.14, installed: 17.5.14]
  - lockfile [required: ==0.12.2, installed: 0.12.2]
  - mock [required: ==2.0.0, installed: 2.0.0]
    - pbr [required: >=0.11, installed: 5.4.5]
    - six [required: >=1.9, installed: 1.14.0]
  - Pillow [required: ==5.4.1, installed: 5.4.1]
  - pressurecooker [required: >=0.0.27, installed: 0.0.27]
    - beautifulsoup4 [required: >=4.6.3, installed: 4.9.0]
      - soupsieve [required: >1.2, installed: 2.0]
    - EbookLib [required: >=0.17.1, installed: 0.17.1]
      - lxml [required: Any, installed: 4.5.0]
      - six [required: Any, installed: 1.14.0]
    - ffmpy [required: >=0.2.2, installed: 0.2.2]
    - le-pycaption [required: >=2.1.0a2, installed: 2.1.0a2]
      - beautifulsoup4 [required: >=4.6.3, installed: 4.9.0]
        - soupsieve [required: >1.2, installed: 2.0]
      - cssutils [required: >=0.9.10, installed: 1.0.2]
      - future [required: Any, installed: 0.18.2]
      - lxml [required: >=3.2.3,<4.4.0, installed: 4.5.0]
      - six [required: >=1.9.0, installed: 1.14.0]
    - le-utils [required: >=0.1.24, installed: 0.1.24]
      - pycountry [required: ==17.5.14, installed: 17.5.14]
    - matplotlib [required: ==2.2.3, installed: 2.2.3]
      - cycler [required: >=0.10, installed: 0.10.0]
        - six [required: Any, installed: 1.14.0]
      - kiwisolver [required: >=1.0.1, installed: 1.2.0]
      - numpy [required: >=1.7.1, installed: 1.15.4]
      - pyparsing [required: >=2.0.1,!=2.1.6,!=2.1.2,!=2.0.4, installed: 2.4.7]
      - python-dateutil [required: >=2.1, installed: 2.8.1]
        - six [required: >=1.5, installed: 1.14.0]
      - pytz [required: Any, installed: 2019.3]
      - six [required: >=1.10, installed: 1.14.0]
    - numpy [required: ==1.15.4, installed: 1.15.4]
    - pdf2image [required: ==1.11.0, installed: 1.11.0]
      - pillow [required: Any, installed: 5.4.1]
    - Pillow [required: ==5.4.1, installed: 5.4.1]
    - youtube-dl [required: >=2020.1.24, installed: 2020.3.24]
  - pypdf2 [required: >=1.26.0, installed: 1.26.0]
  - pytest [required: >=3.0.2, installed: 5.4.1]
    - attrs [required: >=17.4.0, installed: 19.3.0]
    - importlib-metadata [required: >=0.12, installed: 1.6.0]
      - zipp [required: >=0.5, installed: 3.1.0]
    - more-itertools [required: >=4.0.0, installed: 8.2.0]
    - packaging [required: Any, installed: 20.3]
      - pyparsing [required: >=2.0.2, installed: 2.4.7]
      - six [required: Any, installed: 1.14.0]
    - pluggy [required: >=0.12,<1.0, installed: 0.13.1]
      - importlib-metadata [required: >=0.12, installed: 1.6.0]
        - zipp [required: >=0.5, installed: 3.1.0]
    - py [required: >=1.5.0, installed: 1.8.1]
    - wcwidth [required: Any, installed: 0.1.9]
  - requests [required: >=2.11.1, installed: 2.23.0]
    - certifi [required: >=2017.4.17, installed: 2020.4.5.1]
    - chardet [required: >=3.0.2,<4, installed: 3.0.4]
    - idna [required: >=2.5,<3, installed: 2.9]
    - urllib3 [required: >=1.21.1,<1.26,!=1.25.1,!=1.25.0, installed: 1.25.9]
  - requests-cache [required: >=0.4.13, installed: 0.5.2]
    - requests [required: >=1.1.0, installed: 2.23.0]
      - certifi [required: >=2017.4.17, installed: 2020.4.5.1]
      - chardet [required: >=3.0.2,<4, installed: 3.0.4]
      - idna [required: >=2.5,<3, installed: 2.9]
      - urllib3 [required: >=1.21.1,<1.26,!=1.25.1,!=1.25.0, installed: 1.25.9]
  - requests-file [required: Any, installed: 1.4.3]
    - requests [required: >=1.0.0, installed: 2.23.0]
      - certifi [required: >=2017.4.17, installed: 2020.4.5.1]
      - chardet [required: >=3.0.2,<4, installed: 3.0.4]
      - idna [required: >=2.5,<3, installed: 2.9]
      - urllib3 [required: >=1.21.1,<1.26,!=1.25.1,!=1.25.0, installed: 1.25.9]
    - six [required: Any, installed: 1.14.0]
  - selenium [required: ==3.0.1, installed: 3.0.1]
  - validators [required: Any, installed: 0.14.3]
    - decorator [required: >=3.4.0, installed: 4.4.2]
    - six [required: >=1.4.0, installed: 1.14.0]
  - websocket-client [required: ==0.40.0, installed: 0.40.0]
    - six [required: Any, installed: 1.14.0]
  - youtube-dl [required: Any, installed: 2020.3.24]
setuptools==46.1.3
shelve2==1.0
wheel==0.34.2

Audio quality suffers during compression

Description

During QA for the KA-bg channel, it was established that audio quality is significantly degraded during compression (sound appears muffled).

example muffled video
original

It seems the "muffling" is audio-codec-dependent because this other video has good audio. orig

What I Did

Ran the KA chef with --compress using default settings (crf: 32) and with (crf: 22). In both cases audio track of video

Possible workaround

Compress video as currently, then re-combine with original audio. Since audio tracks are usually much smaller files, resulting video files should not get much larger.
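
One way this workaround could be expressed is an ffmpeg remux step that takes the video stream from the compressed file and the audio stream from the original. The sketch below only builds the argument list; `build_remux_command` and the file names are hypothetical, and it assumes the original audio codec is one the output container accepts.

```python
def build_remux_command(compressed_path, original_path, output_path):
    """Combine the video stream of one file with the audio stream of another."""
    return [
        "ffmpeg", "-y",
        "-i", compressed_path,   # input 0: compressed video
        "-i", original_path,     # input 1: original file with untouched audio
        "-map", "0:v:0",         # first video stream from input 0
        "-map", "1:a:0",         # first audio stream from input 1
        "-c", "copy",            # copy both streams without re-encoding
        output_path,
    ]


remux = build_remux_command("compressed.mp4", "original.mp4", "combined.mp4")
print(" ".join(remux))
```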

Next steps

Investigate audio formats in originals and perform compression tests

Mac Invalid GUI

I got this error today while working on something unrelated:

UsageError: Invalid GUI request 'ps', valid ones are:dict_keys(['inline', 'nbagg', 'notebook', 'ipympl', 'widget', None, 'qt4', 'qt', 'qt5', 'wx', 'tk', 'gtk', 'gtk3', 'osx', 'asyncio'])

I'm on Mac OS with pressurecooker==0.0.17 and matplotlib==2.0.0 in a virtualenv

