Punctuation and case sensitive (suber issue, CLOSED)

sarapapi avatar sarapapi commented on September 15, 2024
Punctuation and case sensitive

from suber.

Comments (5)

patrick-wilken avatar patrick-wilken commented on September 15, 2024

Yes, thanks for those proposals. As you saw, we experimented with casing and punctuation, and also with tokenization, when designing the metric, and it is indeed a bit unfortunate that normalized SubER worked best in our experiments. 😅 I excluded the other versions from the code mainly to avoid confusion about the metric definition.
But I guess I can add "SubER-cased" as a metric, which would be true-cased and include punctuation, in analogy to the "WER-cased" metric. By the way, the default for TER in other tools is also case-insensitive...
Regarding tokenization: the default for TER always seems to be to turn it off, probably for historical reasons. I agree that enabling it is intuitive, but I don't know whether anybody has shown rigorously that it improves the TER metric. I can revisit my experiments and see what numbers I get with and without tokenization for SubER.

sarapapi avatar sarapapi commented on September 15, 2024

Hi Patrick, it would be great to include SubER-cased.
Moreover, I saw that the original TER implementation (TERCOM) does not tokenize the input by default; tokenization can be enabled with the "normalized" parameter, as in sacrebleu. However, in the official paper the authors wrote "In addition, punctuation tokens are treated as normal words and mis-capitalization is counted as an edit." So punctuation is treated as a token (which is true only if we tokenize, or "normalize" in TERCOM terms, the text) and the computation is actually case-sensitive. I think the library's default parameters differ from what the authors actually used for the official calculation (which I think is the correct one).
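
To make the point about tokenization and casing concrete, here is a minimal sketch in plain Python. It is not the TERCOM, sacrebleu, or SubER code: the helper names (`tokenize`, `word_edit_distance`, `edits`) are hypothetical, the punctuation splitting is a simple regex rather than any tool's real tokenizer, and it counts only insertions, deletions, and substitutions (no TER shifts).

```python
import re


def tokenize(text, split_punct):
    """Split on whitespace; optionally make each punctuation mark its own token."""
    if split_punct:
        # Surround every non-word, non-space character with spaces.
        text = re.sub(r"([^\w\s])", r" \1 ", text)
    return text.split()


def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists (no shifts)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            curr.append(min(prev[j] + 1,              # delete hypothesis word
                            curr[j - 1] + 1,          # insert reference word
                            prev[j - 1] + (h != r)))  # match or substitute
        prev = curr
    return prev[-1]


def edits(hyp, ref, cased, split_punct):
    """Count word edits under the chosen casing and tokenization settings."""
    if not cased:
        hyp, ref = hyp.lower(), ref.lower()
    return word_edit_distance(tokenize(hyp, split_punct),
                              tokenize(ref, split_punct))


if __name__ == "__main__":
    ref = "Yes, it works."
    hyp = "yes it works"
    for cased in (False, True):
        for split_punct in (False, True):
            n_ref = len(tokenize(ref, split_punct))
            n_edits = edits(hyp, ref, cased, split_punct)
            print(f"cased={cased} split_punct={split_punct}: "
                  f"{n_edits} edits over {n_ref} reference tokens")
```

Without punctuation splitting, a missing comma turns "Yes," into a whole-word substitution; with splitting, the comma is its own token and the word itself matches. Case sensitivity only adds an edit where the capitalization actually differs.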

patrick-wilken avatar patrick-wilken commented on September 15, 2024

Ok, that all makes sense. :) I implemented it, see #6. Maybe you want to check the details.
Another question is whether we should also change the "TER" metrics to be tokenized and case-sensitive. But I would rather keep it as just an interface to sacrebleu with default options, because providing all the options for the other metrics is not really the focus of this repo. It's easy to set them in suber/metrics/sacrebleu_interface.py if someone needs them.

sarapapi avatar sarapapi commented on September 15, 2024

Hi Patrick, sorry for the late reply; I took some time to look at the implementation and compute the metrics myself. The cased version seems sound to me, and the results are now consistent with those of the other metrics I am using. Thanks again for your time.

patrick-wilken avatar patrick-wilken commented on September 15, 2024

That sounds good! I will merge then.
