Comments (2)
Hi Marco,
thanks for the proposal, it definitely sounds like a useful addition! You are referring in particular to this paper, https://aclanthology.org/W04-3250.pdf, right?
Regarding sampling from subtitles: yes, that seems to be much less obvious than sampling from sentences. For the SubER calculation, the files are already split into parts at points in time where both hypothesis and reference agree that there is no subtitle. So far this is an implementation detail for more efficient computation. But it is the closest thing to parallel segments that currently exists, so maybe those parts could be used as units for sampling? There are several problems with this, though: 1. the segmentation depends on the hypotheses; 2. there are probably too few segments, depending on the specific subtitle content; 3. the length of the segments varies greatly.
Another idea that comes to mind is to calculate the SubER edit operations on the whole file, sample a subset of reference subtitle blocks, and calculate SubER scores using only the edit operations (and reference length) corresponding to those blocks. But this is only brainstorming right now, I have to think it through...
I will be travelling for the next two weeks, so I can only really look into this after that. 🙃
Hi @patrick-wilken! Thanks for your reply. Yes, that is the paper I was referring to. I have looked into the code over the last few days, and the easiest thing that comes to my mind is the following:
In the SubER for loop (https://github.com/apptek/SubER/blob/main/suber/metrics/suber.py#L29), we can keep track of the individual edit counts and reference lengths instead of just accumulating them. Once we have these fine-grained stats, we can bootstrap with them. I already have some sort of implementation doing this. The main issues in this case would be:
- How to integrate this in a clean way in the tool?
- This way we can only compute confidence intervals, not the statistical significance of the difference between two hypotheses. The latter is very hard because of all the alignment issues, so as a first step, CIs may be enough. What do you think?
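
To make the idea concrete, here is a minimal sketch of such a bootstrap, not the implementation mentioned above. It assumes the per-part statistics are available as `(num_edits, reference_length)` pairs and that the score is reported as a percentage of total edits over total reference length; the names `suber_score` and `bootstrap_ci`, the percentile method, and the toy numbers are all hypothetical and not part of the SubER tool.

```python
import random
from typing import List, Sequence, Tuple

# One entry per part: (number of edit operations, reference length).
# This layout is an assumption made for illustration only.
Stats = Tuple[int, int]


def suber_score(stats: Sequence[Stats]) -> float:
    """Corpus-level score in percent: total edits over total reference length."""
    total_edits = sum(edits for edits, _ in stats)
    total_ref_len = sum(ref_len for _, ref_len in stats)
    return 100.0 * total_edits / total_ref_len if total_ref_len else 0.0


def bootstrap_ci(
    stats: Sequence[Stats],
    num_samples: int = 1000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the corpus-level score."""
    rng = random.Random(seed)
    # Resample the parts with replacement and rescore each resampled corpus.
    scores: List[float] = sorted(
        suber_score(rng.choices(stats, k=len(stats)))
        for _ in range(num_samples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = scores[int(alpha * num_samples)]
    upper = scores[min(num_samples - 1, int((1.0 - alpha) * num_samples))]
    return lower, upper


if __name__ == "__main__":
    # Made-up statistics, purely to show the call.
    toy_stats = [(2, 10), (0, 5), (3, 5), (1, 8), (4, 12)]
    print(suber_score(toy_stats), bootstrap_ci(toy_stats))
```

Because each resample draws whole parts, the interval is only as reliable as the number and homogeneity of those parts, which is exactly the concern about too few segments of very different lengths raised earlier in the thread.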
Thanks,
Marco