rumaak / parczech_speech_lengths
Compute speech lengths of members of parliament.
Create a wiki and move project-related information there.
The interval aggregator should be able to accept a `datetime` object as well as a string representation of date and time. This would be practical because, during request validation, the corresponding fields are already converted to `datetime` objects.
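A minimal sketch of how such an input could be normalized. The helper name `to_datetime` is hypothetical, and the assumption that string inputs use ISO 8601 is ours, not something the project states.

```python
from datetime import datetime
from typing import Union


def to_datetime(value: Union[datetime, str]) -> datetime:
    """Return `value` as a datetime, parsing it first if it is a string."""
    if isinstance(value, datetime):
        # Already validated and converted upstream; pass it through unchanged.
        return value
    # Assumption: strings are ISO 8601, e.g. "2020-01-31T09:30:00".
    return datetime.fromisoformat(value)


start = to_datetime("2020-01-31T09:30:00")
end = to_datetime(datetime(2020, 2, 1, 12, 0))
```

With a helper like this, the aggregator can call it on both bounds and work with `datetime` objects internally, regardless of what the caller passed.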
Title.
Design and implement a server API. This issue will contain the server API description until it becomes more stable; then it will be moved to the wiki.
FastAPI. Reasons:
Also considered - Django, Flask.
Request / response structure specification.
Notes:
<parameter>
<value>
Every member of parliament will (probably) have his own page with basic personal data and statistics computed over the totality of data. There will be an option to re-calculate the statistics over a custom time interval. This leads us to the following request/response pairs:
/single/precomputed
Request structure:
{
"MoP": <MoP_id>
}
Response structure:
{
"speaking_time": {
"regular": [
["Election period", "Words", "Sentences", "Paragraphs", "Utterances"],
[<period1>, <words1>, <sentences1>, <paragraphs1>, <utterances1>],
[<period2>, <words2>, <sentences2>, <paragraphs2>, <utterances2>],
...
],
...
},
"relative_diff": {
"regular": [
["Election period", "Utterance vs paragraph", "Paragraph vs sentence", "Sentence vs word"],
[<period1>, <up1>, <ps1>, <sw1>],
[<period2>, <up2>, <ps2>, <sw2>],
...
],
...
},
"unanchored": {
"regular": [
["Election period", "Unanchored"],
[<period1>, <unanchored1>],
[<period2>, <unanchored2>],
...
],
},
"wpm": {
"regular": [
["Election period", "Words per minute"],
[<period1>, <wpm1>],
[<period2>, <wpm2>],
...
],
...
}
}
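As an illustration of the table layout used in the response (a header row followed by one row per election period), here is a small sketch. The function `to_table`, the input dictionary keys, and the example numbers are all hypothetical.

```python
from typing import Dict, List

HEADER = ["Election period", "Words", "Sentences", "Paragraphs", "Utterances"]
# Hypothetical keys of the per-period input dictionaries, in column order.
KEYS = ["words", "sentences", "paragraphs", "utterances"]


def to_table(per_period: Dict[str, Dict[str, float]]) -> List[list]:
    """Build a header row plus one data row per election period."""
    table: List[list] = [HEADER]
    for period, stats in per_period.items():
        table.append([period] + [stats[key] for key in KEYS])
    return table


# Made-up values, for illustration only.
example = to_table(
    {"2013-2017": {"words": 1200, "sentences": 1100, "paragraphs": 1000, "utterances": 900}}
)
```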
/single/interval
Request structure:
{
"MoP": <MoP_id>,
"start": <time_start>,
"end": <time_end>
}
Response structure:
{
"speaking_time": {
"regular": [
["Speaker", "Words", "Sentences", "Paragraphs", "Utterances"],
[<MoP_name>, <words>, <sentences>, <paragraphs>, <utterances>]
],
...
},
"relative_diff": {
...
},
"unanchored": {
...
},
"wpm": {
...
}
}
We've skipped the `relative_diff`, `unanchored`, and `wpm` fields because their structure is almost identical to that of the same fields in the `/single/precomputed` response, except that the independent variable is `Speaker` instead of `Election period`.
The user can select a subset of speakers for which the statistics will be plotted.
/multiple/interval
This results in a response almost identical to that of the `/single/interval` request, except that it contains data for multiple speakers.
Request structure:
{
"speakers": {
"static": {
"MoPs": [
<MoP_id1>,
<MoP_id2>,
...
],
"birth": {
"from": <birth_from>,
"to": <birth_to>
},
"sex": <sex>,
"role": <role>
},
"dynamic": {
"age": {
"from": <age_from>,
"to": <age_to>
},
"group": <group>,
"party": <party>
}
},
"data": {
"interval": {
"start": <start>,
"end": <end>
},
"term": <term>,
"meeting": <meeting>,
"sitting": <sitting>,
"agenda": <agenda>
}
}
Response structure:
{
"speaking_time": {
"regular": [
["Speaker", "Words", "Sentences", "Paragraphs", "Utterances"],
[<MoP_name1>, <words1>, <sentences1>, <paragraphs1>, <utterances1>],
[<MoP_name2>, <words2>, <sentences2>, <paragraphs2>, <utterances2>],
...
],
...
},
"relative_diff": {
...
},
"unanchored": {
...
},
"wpm": {
...
}
}
The request is more complicated in this case, so we describe it in further detail. There are two parts of the request:
- `data` - over what data the statistics should be computed (i.e. what audio files should be included)
- `speakers` - what speakers to include when computing the statistics (we consider the same person with two different roles as two different speakers in this context)
The fields in the `data` part are quite self-explanatory. In the case of the `speakers` field, the notion of `static` and `dynamic` parameters is introduced:
- `static` - speaker properties that don't change in time
- `dynamic` - speaker properties that change in time
An interesting thing to note is that `role` is included as a `static` property of a speaker; this is related to what we stated before, i.e. the same person in two different roles is seen as two different speakers. From an implementation point of view, this makes more sense: for example, if a person's age moves outside the selected age interval during the selected time interval, it makes sense to disregard the person completely; however, it wouldn't make sense to disregard a person completely just because they also spoke in other roles (even though we would count only the statistics for the selected role).
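The static/dynamic distinction described above can be sketched as follows. The `Speaker` record, the `select_speakers` function, and the inclusive treatment of the age bounds are assumptions made for illustration; only the `role` and `age` filters are shown.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Speaker:
    # The same person in two roles is modelled as two Speaker records.
    mop_id: str
    role: str
    birth: date


def age_on(birth: date, day: date) -> int:
    """Age in whole years on the given day."""
    years = day.year - birth.year
    if (day.month, day.day) < (birth.month, birth.day):
        years -= 1
    return years


def select_speakers(
    speakers: List[Speaker],
    role: Optional[str],
    age_range: Optional[Tuple[int, int]],
    start: date,
    end: date,
) -> List[Speaker]:
    selected = []
    for speaker in speakers:
        # Static property: a simple equality check on the record.
        if role is not None and speaker.role != role:
            continue
        # Dynamic property: age only grows, so checking both ends of the
        # interval is enough to know it stayed inside the range throughout.
        if age_range is not None:
            low, high = age_range
            if age_on(speaker.birth, start) < low or age_on(speaker.birth, end) > high:
                continue
        selected.append(speaker)
    return selected


speakers = [
    Speaker("A", "regular", date(1970, 6, 1)),
    Speaker("A", "chair", date(1970, 6, 1)),
    Speaker("B", "regular", date(1960, 1, 1)),
]
kept = select_speakers(speakers, "regular", (40, 50), date(2015, 1, 1), date(2020, 1, 1))
```

Here person A stays in the result for the `regular` role only, while B is dropped entirely because their age leaves the selected range.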
The user can also skip choosing the subset of MoPs and see top performers according to each metric instead.
/top/interval
Request structure:
{
"start": <time_start>,
"end": <time_end>
}
Response structure:
{
"speaking_time": {
"regular": [
["Speaker", "Words", "Sentences", "Paragraphs", "Utterances"],
[<MoP_name1>, <words1>, <sentences1>, <paragraphs1>, <utterances1>],
[<MoP_name2>, <words2>, <sentences2>, <paragraphs2>, <utterances2>],
...
],
...
},
"relative_diff": {
...
},
"unanchored": {
...
},
"wpm": {
...
}
}
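The issue does not say how top performers are chosen. One possible reading, sketched below, ranks the rows of a table by a single column and keeps the first few; the function `top_rows`, the cut-off `n`, and the example rows are hypothetical.

```python
from typing import List

HEADER = ["Speaker", "Words", "Sentences", "Paragraphs", "Utterances"]


def top_rows(rows: List[list], metric: str, n: int = 3) -> List[list]:
    """Return the header plus the n rows with the largest value of `metric`."""
    column = HEADER.index(metric)
    ranked = sorted(rows, key=lambda row: row[column], reverse=True)
    return [HEADER] + ranked[:n]


# Made-up rows, for illustration only.
rows = [
    ["A", 10, 2, 1, 1],
    ["B", 30, 1, 1, 1],
    ["C", 20, 5, 2, 1],
]
top_by_words = top_rows(rows, "Words", n=2)
```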
Routes providing other functionality.
/data/speakers
A route providing a list of available speakers. The request is a simple `GET` request without any parameters. The response has the following structure:
{
"speakers": [
<speaker1>,
<speaker2>,
...
]
}
Additional notes regarding frameworks and interface.
... `wpm` statistic and only for role `chair`), but as of right now, values of all the statistics are sent back to the client.
Statistics `word_count` and `no_anchor` are not updated from existing statistics computed over audio files. This means that when an audio file is used in multiple XML files (which I am afraid can happen), only the statistics computed over the last XML file are kept, which is obviously wrong.
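One way to avoid overwriting is to accumulate per-audio statistics across XML files. The sketch below assumes that summation is the intended way to combine them; the function `merge_statistics` and the input shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

STAT_KEYS = ("word_count", "no_anchor")


def merge_statistics(
    per_xml_stats: Iterable[Tuple[str, Dict[str, int]]]
) -> Dict[str, Dict[str, int]]:
    """Sum statistics for each audio file over all XML files that use it."""
    totals: Dict[str, Dict[str, int]] = defaultdict(lambda: dict.fromkeys(STAT_KEYS, 0))
    for audio_id, stats in per_xml_stats:
        for key in STAT_KEYS:
            # Add to the existing value instead of replacing it.
            totals[audio_id][key] += stats[key]
    return dict(totals)


merged = merge_statistics(
    [
        ("a.mp3", {"word_count": 10, "no_anchor": 1}),
        ("a.mp3", {"word_count": 5, "no_anchor": 2}),
        ("b.mp3", {"word_count": 3, "no_anchor": 0}),
    ]
)
```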
Title.
As the project grows larger, it might be a good idea to think about the documentation. Currently the biggest issue is that many classes / functions lack docstrings.
The project has good coverage in terms of user documentation (i.e. installation, example usage, some simple presentation of results).
For some reason, only statistics in a single role seem to be computed.
Add to wiki info about server usage.
When the `word_count` and `no_anchor` statistics are too small, a number of problems arise, e.g. when computing the words per minute statistic.
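A possible guard is to skip the words per minute computation when there is too little data. The sketch below is built on several assumptions: that `no_anchor` counts words without a time anchor, that only anchored words enter the rate, and that a threshold of 20 words is reasonable. None of these are confirmed by the project.

```python
from typing import Optional

MIN_ANCHORED_WORDS = 20  # hypothetical threshold


def words_per_minute(
    word_count: int,
    no_anchor: int,
    duration_seconds: float,
    min_words: int = MIN_ANCHORED_WORDS,
) -> Optional[float]:
    """Return words per minute, or None when the sample is too small."""
    # Assumption: only words with a time anchor contribute to the rate.
    anchored = word_count - no_anchor
    if anchored < min_words or duration_seconds <= 0:
        return None
    return anchored / (duration_seconds / 60)
```

Returning `None` rather than a number lets the caller decide whether to omit the speaker or show a placeholder.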
When there are multiple timelines, tabs are pasted for each of them. This is wrong, as the tabs were intended to be used only when no matching timeline exists.
As of right now an aggregator saves computed data to the filesystem. This was convenient for offline plotting; however, it is not practical in the case of online aggregated data retrieval. The aggregator should return the computed values, too.
`IntervalAggregator` and `TermAggregator` share most of their code. It would be wise to either merge them into a single class or create a common ancestor.
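The common-ancestor option could look roughly like this. The classes below are simplified stand-ins that reuse the project's class names, not its actual code; `BaseAggregator`, the `selects` method, and the record fields are hypothetical. The sketch also returns the computed values instead of only writing them to disk.

```python
from abc import ABC, abstractmethod
from datetime import datetime
from typing import Dict, List


class BaseAggregator(ABC):
    """Hypothetical ancestor holding the shared summation logic."""

    def __init__(self, records: List[dict]):
        self.records = records

    @abstractmethod
    def selects(self, record: dict) -> bool:
        """Decide whether a record belongs to this aggregation."""

    def aggregate(self) -> Dict[str, int]:
        # Shared code path: sum words per speaker over the selected records.
        totals: Dict[str, int] = {}
        for record in self.records:
            if self.selects(record):
                speaker = record["speaker"]
                totals[speaker] = totals.get(speaker, 0) + record["words"]
        return totals


class IntervalAggregator(BaseAggregator):
    def __init__(self, records: List[dict], start: datetime, end: datetime):
        super().__init__(records)
        self.start, self.end = start, end

    def selects(self, record: dict) -> bool:
        return self.start <= record["time"] <= self.end


class TermAggregator(BaseAggregator):
    def __init__(self, records: List[dict], term: str):
        super().__init__(records)
        self.term = term

    def selects(self, record: dict) -> bool:
        return record["term"] == self.term


# Made-up records, for illustration only.
records = [
    {"speaker": "A", "words": 10, "time": datetime(2020, 1, 1), "term": "t1"},
    {"speaker": "A", "words": 5, "time": datetime(2021, 1, 1), "term": "t2"},
    {"speaker": "B", "words": 7, "time": datetime(2020, 6, 1), "term": "t1"},
]
by_interval = IntervalAggregator(records, datetime(2020, 1, 1), datetime(2020, 12, 31)).aggregate()
by_term = TermAggregator(records, "t2").aggregate()
```

Only the selection rule differs between the two subclasses, which is the part the issue suggests is not shared.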
We would like to be able to select a subset of speakers when visualizing the data on the website frontend. However, as of right now, there is no route providing such information. Such a route should thus be added and documented in #14. Below is the structure of the corresponding request/response pair.
/data/speakers
A route providing a list of available speakers. The request is a simple `GET` request without any parameters. The response has the following structure:
{
"speakers": [
<speaker1>,
<speaker2>,
...
]
}
Generate speakers' statistics for a selected agenda item (items on the meeting level)
For some reason, the website isn't responsive.
Extract data about MoPs, guests, ... and save it in some reasonable (tabular) representation for later use. One example use would be showing an MoP's full name instead of an id; another would be computing statistics based on these data, e.g. the dependence of speaking speed on the age of an MoP.
As of now, the XSLT script is inefficient and runs too slowly.
This is just a list of ideas (not necessarily speech-related) that can be implemented. This can be included in our eshop/comparator...
The `xslt_apply.sh` bash script appends to the output file instead of overwriting it. This is undesirable behavior and should be fixed.
As of right now, the only way to filter speakers during data aggregation/visualization is by explicitly listing the speakers to visualize. The only way to compute statistics over a subset of the data is by specifying an exact time interval. We would like to be able to filter based on the following properties:
Design and implement the frontend part of the website. Details will be specified in this issue for now; later, they will be moved to the wiki.
Some redundancies were found in the `per_file_statistics.py` script; for example, the structure below does not need to remember `speaker` and `role` values.
parczech_speech_lengths/scripts/per_file_statistics.py
Lines 30 to 35 in 51b6ebf
As of now, all the statistics are updated by the `word` field, which is obviously wrong (e.g. the `sentence` field is updated using the `word` value). This is probably because the code was copy-pasted heavily (and as we know, copy-pasting is often a source of mistakes).
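A loop over field names avoids this class of copy-paste error, because each field is updated from the value stored under its own name. The field list and the function `update_statistics` below are hypothetical and do not reflect the project's actual data structures.

```python
from typing import Dict

# Hypothetical field names, one per statistic.
FIELDS = ("word", "sentence", "paragraph", "utterance")


def update_statistics(totals: Dict[str, float], new_values: Dict[str, float]) -> Dict[str, float]:
    """Add each new value to the total stored under the same field name."""
    for field in FIELDS:
        totals[field] = totals.get(field, 0) + new_values[field]
    return totals


totals = update_statistics({}, {"word": 1.0, "sentence": 2.0, "paragraph": 3.0, "utterance": 4.0})
totals = update_statistics(totals, {"word": 0.5, "sentence": 0.5, "paragraph": 0.5, "utterance": 0.5})
```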
This error is not caught on the user side, so to the user the application seems to be inactive.