
suber's Introduction

SubER - Subtitle Edit Rate

SubER is an automatic, reference-based, segmentation- and timing-aware edit distance metric for measuring the quality of subtitle files. For a detailed description of the metric and a human post-editing evaluation, we refer to our IWSLT 2022 paper. In addition to the SubER metric, this scoring tool calculates a wide range of established speech recognition and machine translation metrics (WER, BLEU, TER, chrF) directly on subtitle files.

Installation

pip install subtitle-edit-rate

will install the suber command line tool. Alternatively, check out this git repository and run the contained suber module with python -m suber.
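
For example, the two invocations below should be equivalent (using the placeholder file names from the Basic Usage section below):

$ suber -H hypothesis.srt -R reference.srt
$ python -m suber -H hypothesis.srt -R reference.srt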

Basic Usage

Currently, we expect subtitle files to come in SubRip text (SRT) format. Given a human reference subtitle file reference.srt and a hypothesis file hypothesis.srt (typically the output of an automatic subtitling system), the SubER score can be calculated by running:

$ suber -H hypothesis.srt -R reference.srt
{
    "SubER": 19.048
}

The SubER score is printed to stdout in JSON format. As SubER is an edit rate, lower scores are better. As a rough rule of thumb from our experience, a score below 20(%) indicates very good quality, while a score above 40 to 50(%) indicates poor quality.
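
Since the output is plain JSON, it is easy to consume from a script. The following is a minimal sketch (not part of the tool itself; file names are placeholders) that calls the suber command via subprocess and extracts the score:

import json
import subprocess

def suber_score(hypothesis_path, reference_path):
    """Run the suber CLI and return the SubER score as a float."""
    result = subprocess.run(
        ["suber", "-H", hypothesis_path, "-R", reference_path],
        capture_output=True, text=True, check=True)
    scores = json.loads(result.stdout)  # e.g. {"SubER": 19.048}
    return scores["SubER"]

print(suber_score("hypothesis.srt", "reference.srt"))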

Make sure that there is no constant time offset between the timestamps in hypothesis and reference as this will lead to incorrect scores. Also, note that <i>, <b> and <u> formatting tags are ignored if present in the files. All other formatting must be removed from the files before scoring for accurate results.
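
As an example of such preprocessing, here is a small sketch (an assumption about a typical pipeline, not part of the tool; file names are placeholders) that strips HTML-like formatting tags from an SRT file before scoring:

import re

def strip_tags(input_path, output_path):
    """Remove HTML-like formatting tags (e.g. <font ...>) from an SRT file."""
    with open(input_path, encoding="utf-8") as f:
        text = f.read()
    # <i>, <b> and <u> would be ignored by SubER anyway, so removing all tags is safe here.
    cleaned = re.sub(r"</?[a-zA-Z][^>]*>", "", text)
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(cleaned)

strip_tags("hypothesis_raw.srt", "hypothesis.srt")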

Punctuation and Case-Sensitivity

The main SubER metric is computed on normalized text, i.e. case-insensitively and without taking punctuation into account, as we observe higher correlation with human judgements and post-editing effort in this setting. We also provide a case-sensitive variant that uses a tokenizer to treat punctuation marks as separate tokens, which you can use "at your own risk" or to reassess our findings. For this, add --metrics SubER-cased to the command above. Please do not report results from this variant as "SubER" without explicitly mentioning the punctuation- and case-sensitivity.
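
For example, assuming several metric names can be combined in a single call (as shown for the other metrics below):

$ suber -H hypothesis.srt -R reference.srt --metrics SubER SubER-cased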

Other Metrics

The SubER tool supports computing the following other metrics directly on subtitle files:

  • word error rate (WER)
  • bilingual evaluation understudy (BLEU)
  • translation edit rate (TER)
  • character n-gram F score (chrF)
  • character error rate (CER)

BLEU, TER and chrF calculations are done using SacreBLEU with default settings. WER is computed with JiWER on normalized text (lower-cased, punctuation removed).
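
For reference, the sketch below illustrates roughly how such scores relate to the underlying libraries once parallel segments are available as plain strings; the exact normalization used inside the SubER tool may differ from this simplified version, and the segment strings are made up:

import string
import sacrebleu
import jiwer

hypotheses = ["ladies and gentlemen the dance is about to begin"]
references = ["ladies and gentlemen the dance is about to start"]

# SacreBLEU metrics with default settings:
print(sacrebleu.corpus_bleu(hypotheses, [references]).score)
print(sacrebleu.corpus_ter(hypotheses, [references]).score)
print(sacrebleu.corpus_chrf(hypotheses, [references]).score)

# WER on normalized text (lower-cased, punctuation removed):
def normalize(text):
    return text.lower().translate(str.maketrans("", "", string.punctuation))

print(jiwer.wer([normalize(r) for r in references],
                [normalize(h) for h in hypotheses]))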

Assuming hypothesis.srt and reference.srt are parallel, i.e. they contain the same number of subtitles and the contents of the n-th subtitle in both files correspond to each other, the above-mentioned metrics can be computed by running:

$ suber -H hypothesis.srt -R reference.srt --metrics WER BLEU TER chrF CER
{
    "WER": 23.529,
    "BLEU": 39.774,
    "TER": 23.529,
    "chrF": 68.402,
    "CER": 17.857
}

In this mode, the text from each parallel subtitle pair is considered to be a sentence pair.

Scoring Non-Parallel Subtitle Files

In the general case, subtitle files for the same video can have different numbers of subtitles with different time stamps. All metrics except SubER require parallel segments to be computed. To apply these metrics to general subtitle files, the hypothesis file has to be re-segmented to correspond to the reference subtitles. The SubER tool implements two options for this:

  • Levenshtein alignment of the hypothesis words to the reference segments ("automatic segmentation")
  • time-alignment, which assigns hypothesis words to reference segments based on the subtitle timestamps

See our paper for further details.

To use the Levenshtein method add an AS- prefix to the metric name, e.g.:

suber -H hypothesis.srt -R reference.srt --metrics AS-BLEU

The AS- prefix terminology is taken from Matusov et al. and stands for "automatic segmentation". To use the time-alignment method instead, add a t- prefix. This works for all metrics (except for SubER itself, which does not require re-segmentation). In particular, we implement t-BLEU from Cherry et al. We encode the segmentation method (or lack thereof) in the metric name to explicitly distinguish the different resulting metric scores!
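
For example, time-aligned variants of BLEU and WER could be computed with (using the same placeholder files as above):

$ suber -H hypothesis.srt -R reference.srt --metrics t-BLEU t-WER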

To inspect the re-segmentation applied to the hypothesis you can use the align_hyp_to_ref.py tool (run python -m suber.tools.align_hyp_to_ref -h for help).

In case of Levenshtein alignment, there is also the option to give a plain file as the reference. This can be used to provide sentences instead of subtitles as reference segments (each line will be considered a segment):

suber -H hypothesis.srt -R reference.txt --reference-format plain --metrics AS-TER

We provide a simple tool to extract sentences from SRT files based on punctuation:

python -m suber.tools.srt_to_plain -i reference.srt -o reference.txt --sentence-segmentation

It can be used to create the plain sentence-level reference reference.txt for the scoring command above.

Scoring Line Breaks as Tokens

The line breaks present in the subtitle files can be included into the text segments to be scored as <eol> (end of line) and <eob> (end of block) tokens. For example:

636
00:50:52,200 --> 00:50:57,120
Ladies and gentlemen,
the dance is about to begin.

would be represented as

Ladies and gentlemen, <eol> the dance is about to begin. <eob>

To do so, add a -seg ("segmentation-aware") postfix to the metric name, e.g. BLEU-seg, AS-TER-seg or t-WER-seg. Character-level metrics (chrF and CER) do not support this as it is not obvious how to count character edits for <eol> tokens.
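
For example, using the metric names mentioned above (and assuming they can be combined in a single call):

$ suber -H hypothesis.srt -R reference.srt --metrics AS-TER-seg t-WER-seg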

TER-br

As a special case, we implement TER-br from Karakanta et al.. It is similar to TER-seg, but all (real) words are replaced by a mask token. This would convert the sentence from the example above to:

<mask> <mask> <mask> <eol> <mask> <mask> <mask> <mask> <mask> <mask> <eob>

Note that TER-br also has variants for computing it on existing parallel segments (TER-br) or on re-aligned segments (AS-TER-br / t-TER-br). Re-segmentation happens before masking.
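
To make the masking step concrete, here is a small illustrative sketch (not the tool's actual implementation) that converts a tokenized segment with break tokens into its masked representation:

def mask_words(tokens):
    """Replace every real word by <mask>, keeping <eol> and <eob> break tokens."""
    return [tok if tok in ("<eol>", "<eob>") else "<mask>" for tok in tokens]

segment = "Ladies and gentlemen, <eol> the dance is about to begin. <eob>".split()
print(" ".join(mask_words(segment)))
# <mask> <mask> <mask> <eol> <mask> <mask> <mask> <mask> <mask> <mask> <eob>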

Contributing

If you run into an issue, have a feature request or have questions about the usage or the implementation of SubER, please do not hesitate to open an issue or a thread under "discussions". Pull requests are welcome too, of course!

Things I'm already considering adding in future versions:

  • support for subtitling formats other than SRT
  • a verbose output that explains the SubER score (list of edit operations)

suber's Issues

SubER computation on more than 1 srt file

Hi, this is more a question than a problem: if I have more than one .srt file to compare against, how can I compute the SubER metric (and also the AS-/t-BLEU metrics)?
Is it sufficient to concatenate the files and then compute the metrics, or do we need something more sophisticated?

Thanks for your work.

Statistical Significance / Confidence Intervals

Hi @patrick-wilken ,

I think it would be great to offer the option to compute the statistical significance of the difference between two hypotheses. This could be done with bootstrap resampling, although the main challenge would be how to sample from SRT files, as there is (in general) no alignment between the SRTs generated by two systems, nor with the reference. Do you have comments/ideas on how to do this? I can also assist with the implementation.

Thanks,
Marco

Punctuation and case sensitivity

Hi @patrick-wilken,
I am here again to point out some observations I made on the SubER outcomes I obtained from the analysis of different models.
I know that you found no significant differences in the correlation of SubER with and without punctuation and true casing, as reported in the paper, but I think it would be very useful to add an option to the SubER tool for indicating whether or not to use punctuation and true casing. Currently you are normalizing the text, and tokenization is not needed to compute TER (as far as I understood from your implementation), but it would become necessary if we skip the normalization step (as is done in the sacrebleu tool).
I noticed that computing SubER on normalized text strongly favors systems that are not good at inserting punctuation and correctly capitalizing words; the SubER outcomes are in fact in contrast with BLEU scores, but also with Sigma scores.
Just to give you an idea, I found that a system scoring 5 BLEU points less than all the other systems I tested can achieve a lower (thus better) SubER, even though the difference in translation quality also emerges in manual evaluation and the absence of punctuation strongly affects understanding. Therefore, I suggest integrating the option mentioned above and maybe further exploring this aspect.

Thanks

Verbatim SubER

Hi again,
This is not a real issue but an "enhancement" request.
I am using SubER for a paper and would like to know whether there is a way to obtain more information about the results, i.e. since the metric is Levenshtein-based, can we get details about deletions, insertions, etc.?
It would be useful for performing analyses and for getting some insight into system behavior.
Thank you

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.