rumaak / parczech_speech_lengths

Compute speech lengths of members of parliament.

XSLT 17.61% Python 82.05% Shell 0.33%

parczech_speech_lengths's Issues

Move examples / usage from `README.md` to wiki

Create a wiki and move project-related information there.

Aggregator accept `datetime` object

The interval aggregator should accept a datetime object as well as a string representation of a date and time. This would be practical because, during request validation, the corresponding fields are already converted to datetime objects.
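One way to support both input types is to normalize the argument at the aggregator's boundary. The sketch below is only an illustration: the helper name `to_datetime` is hypothetical, and it assumes the string representation is ISO 8601, which the issue does not specify.

```python
from datetime import datetime
from typing import Union


def to_datetime(value: Union[str, datetime]) -> datetime:
    """Return `value` as a datetime (hypothetical helper, not project code)."""
    if isinstance(value, datetime):
        # Already converted, e.g. by request validation; pass it through.
        return value
    if isinstance(value, str):
        # Assumption: strings are ISO 8601, e.g. "2020-01-02T03:04:05".
        return datetime.fromisoformat(value)
    raise TypeError(f"expected str or datetime, got {type(value).__name__}")
```

An aggregator constructor could then call `to_datetime` on its start and end arguments and work with datetime objects internally.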

Aggregator should offer an option not to save data to filesystem

Title.

Server API

Design and implement a server API. This issue will contain the server API description until it becomes more stable; then it will be moved to the wiki.

Server API design

Framework

FastAPI. Reasons:

written in Python (easier integration of already existing code)
simple setup (compared to Django)
fast, asynchronous (compared to Flask)
microframework (compared to Django)
automatic documentation (not sure whether available in Flask, Django)
type hints (not sure whether available in Flask, Django)

Also considered - Django, Flask.

Interface

Request / response structure specification.

Notes:

MoP is an abbreviation of member of parliament
request parameters are specified as <parameter>
values to be computed on/retrieved from the server are specified as <value>

Single speaker

Every member of parliament will (probably) have his own page with basic personal data and statistics computed over the totality of data. There will be an option to re-calculate the statistics over a custom time interval. This leads us to the following request/response pairs:

`/single/precomputed`

Request structure:

{
    "MoP": <MoP_id>
}

Response structure:

{
    "speaking_time": {
        "regular": [
            ["Election period", "Words", "Sentences", "Paragraphs", "Utterances"],
            [<period1>, <words1>, <sentences1>, <paragraphs1>, <utterances1>],
            [<period2>, <words2>, <sentences2>, <paragraphs2>, <utterances2>],
            ...
        ],
        ...
    },
    "relative_diff": {
        "regular": [
            ["Election period", "Utterance vs paragraph", "Paragraph vs sentence", "Sentence vs word"],
            [<period1>, <up1>, <ps1>, <sw1>],
            [<period2>, <up2>, <ps2>, <sw2>],
            ...
        ],
        ...
    },
    "unanchored": {
        "regular": [
            ["Election period", "Unanchored"],
            [<period1>, <unanchored1>],
            [<period2>, <unanchored2>],
            ...
        ]
    },
    "wpm": {
        "regular": [
            ["Election period", "Words per minute"],
            [<period1>, <wpm1>],
            [<period2>, <wpm2>],
            ...
        ],
        ...
    }
}
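Each statistic in the response is a table whose first row is a header and whose remaining rows hold one entry per election period. A minimal sketch of building such a table from per-period records might look as follows; the function name `build_table`, the record keys, and the sample values are all made up for illustration.

```python
from typing import Dict, List, Union

Cell = Union[str, int, float]


def build_table(header: List[str], keys: List[str],
                records: List[Dict[str, Cell]]) -> List[List[Cell]]:
    """Return a header-first table with one row per record."""
    if len(header) != len(keys):
        raise ValueError("header and keys must have the same length")
    table: List[List[Cell]] = [list(header)]
    for record in records:
        # Pick the cells in the same order as the header columns.
        table.append([record[key] for key in keys])
    return table


# Placeholder data; real values would come from the aggregator.
records = [
    {"period": "period1", "words": 100, "sentences": 90,
     "paragraphs": 80, "utterances": 70},
    {"period": "period2", "words": 200, "sentences": 180,
     "paragraphs": 160, "utterances": 140},
]
speaking_time = {
    "regular": build_table(
        ["Election period", "Words", "Sentences", "Paragraphs", "Utterances"],
        ["period", "words", "sentences", "paragraphs", "utterances"],
        records,
    )
}
```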

`/single/interval`

Request structure:

{
    "MoP": <MoP_id>,
    "start": <time_start>,
    "end": <time_end>
}

Response structure:

{
    "speaking_time": {
        "regular": [
            ["Speaker", "Words", "Sentences", "Paragraphs", "Utterances"],
            [<MoP_name>, <words>, <sentences>, <paragraphs>, <utterances>]
        ],
        ...
    },
    "relative_diff": {
        ...
    },
    "unanchored": {
        ...
    },
    "wpm": {
        ...
    }
}

We've skipped the relative_diff, unanchored, and wpm fields because their structure is almost identical to the structure of these fields in the response to the /single/precomputed request, except that the independent variable is Speaker instead of Election period.
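The request body of `/single/interval` could be validated before it reaches the aggregator. The sketch below uses only the standard library; the class name is hypothetical, and it assumes that the MoP id is a string and that `start` and `end` arrive as ISO 8601 strings, none of which the design fixes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict


@dataclass(frozen=True)
class SingleIntervalRequest:
    """Parsed /single/interval request (illustrative only)."""
    mop: str
    start: datetime
    end: datetime

    @classmethod
    def from_json(cls, payload: Dict[str, Any]) -> "SingleIntervalRequest":
        # Assumption: timestamps are ISO 8601 strings.
        start = datetime.fromisoformat(payload["start"])
        end = datetime.fromisoformat(payload["end"])
        if start > end:
            raise ValueError("start must not be later than end")
        return cls(mop=str(payload["MoP"]), start=start, end=end)
```

In FastAPI the same checks would more typically live in a request model, so this class only shows which conditions are worth enforcing.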

Multiple speakers

The user can select a subset of speakers for which the statistics will be plotted.

`/multiple/interval`

The response is almost identical to that of the /single/interval request, except that it contains data for multiple speakers.

Request structure:

{
    "speakers": {
        "static": {
            "MoPs": [
                <MoP_id1>,
                <MoP_id2>,
                ...
            ],
            "birth": {
                "from": <birth_from>,
                "to": <birth_to>
            },
            "sex": <sex>,
            "role": <role>
        },
        "dynamic": {
            "age": {
                "from": <age_from>,
                "to": <age_to>
            },
            "group": <group>,
            "party": <party>
        }
    },
    "data": {
        "interval": {
            "start": <start>,
            "end": <end>
        },
        "term": <term>,
        "meeting": <meeting>,
        "sitting": <sitting>,
        "agenda": <agenda>
    }
}

Response structure:

{
    "speaking_time": {
        "regular": [
            ["Speaker", "Words", "Sentences", "Paragraphs", "Utterances"],
            [<MoP_name1>, <words1>, <sentences1>, <paragraphs1>, <utterances1>],
            [<MoP_name2>, <words2>, <sentences2>, <paragraphs2>, <utterances2>],
            ...
        ],
        ...
    },
    "relative_diff": {
        ...
    },
    "unanchored": {
        ...
    },
    "wpm": {
        ...
    }
}

The request is more complicated in this case, so we shall describe it in further detail. There are two parts of the request:

data - over what data should the statistics be computed (i.e. what audio files should be included)
speakers - what speakers to include when computing the statistics (we consider the same person with two different roles as two different speakers in this context)

The fields in the data part are quite self-explanatory. In the case of the speakers field, the notion of static and dynamic parameters is introduced:

static - speaker properties that don't change in time
dynamic - properties of speaker that change in time

An interesting thing to note is that role is included as a static property of a speaker. This follows from what we stated before: the same person in two different roles is seen as two different speakers. From an implementation point of view, this makes more sense. For example, if a person's age moves outside the selected age interval during the selected time interval, it makes sense to disregard the person completely; however, it wouldn't make sense to disregard a person completely just because they also spoke in other roles (even though we would count only the statistics for the selected role).
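The difference between a static filter (role) and a dynamic filter (age) can be sketched as follows. This is only one reading of the rules described here: the `Speaker` class and the function names are hypothetical, and a speaker is modelled as a (person, role) pair, so the same person can appear once per role.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Speaker:
    person_id: str
    role: str      # static: each (person, role) pair is its own speaker
    birth: date    # static; age is derived from it and is dynamic


def age_on(birth: date, day: date) -> int:
    """Age in whole years on the given day."""
    years = day.year - birth.year
    if (day.month, day.day) < (birth.month, birth.day):
        years -= 1
    return years


def select_speakers(speakers: List[Speaker], role: Optional[str],
                    age_range: Optional[Tuple[int, int]],
                    start: date, end: date) -> List[Speaker]:
    selected = []
    for speaker in speakers:
        # Static filter: drops only the non-matching (person, role) pairs.
        if role is not None and speaker.role != role:
            continue
        if age_range is not None:
            low, high = age_range
            # Dynamic filter: age only grows, so checking both ends of the
            # interval is enough to know it stays inside the range.
            if age_on(speaker.birth, start) < low or age_on(speaker.birth, end) > high:
                continue
        selected.append(speaker)
    return selected
```

A person whose age leaves the range during the interval is dropped entirely, while a person who spoke in several roles is kept for the selected role only.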

Top speakers

The user can also skip choosing the subset of MoPs and see top performers according to each metric instead.

`/top/interval`

Request structure:

{
    "start": <time_start>,
    "end": <time_end>
}

Response structure:

{
    "speaking_time": {
        "regular": [
            ["Speaker", "Words", "Sentences", "Paragraphs", "Utterances"],
            [<MoP_name1>, <words1>, <sentences1>, <paragraphs1>, <utterances1>],
            [<MoP_name2>, <words2>, <sentences2>, <paragraphs2>, <utterances2>],
            ...
        ],
        ...
    },
    "relative_diff": {
        ...
    },
    "unanchored": {
        ...
    },
    "wpm": {
        ...
    }
}
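Picking the top performers for one metric amounts to sorting the speakers by that metric and keeping the first few. The sketch below assumes per-speaker totals are already available as a dictionary; the function name and the `limit` parameter are hypothetical, since the design does not say how many speakers are returned.

```python
from typing import Dict, List


def top_speakers(totals: Dict[str, Dict[str, float]], metric: str,
                 limit: int = 10) -> List[str]:
    """Return up to `limit` speaker names, highest value of `metric` first."""
    ranked = sorted(totals, key=lambda name: totals[name][metric], reverse=True)
    return ranked[:limit]
```

The server would call this once per metric, because the top speakers by word count need not be the top speakers by words per minute.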

Additional routes

Routes providing other functionality.

`/data/speakers`

A route providing a list of available speakers. The request is realized as a simple GET request without any parameters. A response will have the following structure:

{
    "speakers": [
        <speaker1>,
        <speaker2>,
        ...
    ] 
}

Notes

Additional notes regarding frameworks and interface.

when designing the interface, only statistics we are currently able to compute were taken into account; for example, even though there will (probably) be a statistic comparing values of metrics across genders, the request for such statistic is not yet specified in the design
it is possible to change the API to allow for the explicit selection of a single statistic (e.g. send only data for the wpm statistic and only for role chair), but as of right now, values of all the statistics are sent back to the client

New statistics are not updated from existing audios

Statistics word_count and no_anchor are not updated from existing statistics computed over audio files. This means that when an audio file is used in multiple XML files (which I am afraid can happen), only the statistics computed over the last XML file are kept, which is wrong.
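A possible fix is to accumulate the statistics per audio file instead of overwriting them. The sketch below assumes that the statistics are additive counts keyed by an audio identifier; the function name and the data layout are hypothetical.

```python
from typing import Dict

Stats = Dict[str, float]


def merge_statistics(existing: Dict[str, Stats], audio_id: str, new: Stats) -> None:
    """Add `new` to the statistics already stored for `audio_id`, in place."""
    current = existing.setdefault(audio_id, {})
    for field, value in new.items():
        # Each field is updated from its own value, never from another field.
        current[field] = current.get(field, 0) + value
```

Calling `merge_statistics` once per XML file that references the audio keeps the contributions of all files rather than only the last one.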

`plot_top.py` script missing description

Title.

Documentation

As the project grows larger, it might be a good idea to think about the documentation. Currently the biggest issue is that many classes / functions lack docstrings.

The project has a good coverage in terms of user documentation (i.e. installation, example usage, some simple presentation of results).

Per-audio file statistics are computed incorrectly

For some reason, only statistics in a single role seem to be computed.

Document server usage

Add to wiki info about server usage.

Low word count / unanchored breaks statistics

When the word_count and no_anchor statistics are too small, a number of problems arise, e.g. when computing the words per minute statistic.
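One defensive option is to treat the words per minute statistic as undefined when there is too little data. The sketch below assumes the statistic is a word count divided by a duration; the function name and the `min_words` threshold are hypothetical.

```python
from typing import Optional


def words_per_minute(word_count: int, seconds: float,
                     min_words: int = 1) -> Optional[float]:
    """Return words per minute, or None when the input is too small."""
    if word_count < min_words or seconds <= 0:
        # Avoid division by zero and meaningless rates from tiny samples.
        return None
    return word_count / (seconds / 60.0)
```

Callers then have to decide how to present a missing value, for example by leaving the speaker out of the plot.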

XSLT script tabs for each timeline

When there are multiple timelines, tabs are pasted for each of them. This is wrong, as they were intended to be used only when no matching timeline exists.

Aggregator should return computed data

As of right now an aggregator saves computed data to the filesystem. This was convenient for offline plotting; however, it is not practical in the case of online aggregated data retrieval. The aggregator should return the computed values, too.
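A sketch of an aggregator that always returns its result and writes to the filesystem only on request is shown below. The function name, the `output_path` parameter, and the JSON output format are assumptions and are not taken from the project.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Optional


def aggregate(per_file_stats: Iterable[Dict[str, float]],
              output_path: Optional[Path] = None) -> Dict[str, float]:
    """Sum per-file statistics; save them only if `output_path` is given."""
    totals: Dict[str, float] = {}
    for stats in per_file_stats:
        for field, value in stats.items():
            totals[field] = totals.get(field, 0) + value
    if output_path is not None:
        # Offline plotting can still read the result from disk.
        output_path.write_text(json.dumps(totals))
    return totals
```

The server would call `aggregate(stats)` without a path, while offline scripts could keep passing one.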

IntervalAggregator and TermAggregator mostly similar

IntervalAggregator and TermAggregator share most of their code. It would be wise to either merge them into a single class or extract a common ancestor.
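The common-ancestor option could look roughly like this. Only the class names `IntervalAggregator` and `TermAggregator` come from the project; the base class, the constructor arguments, and the record fields are invented for the sketch.

```python
from abc import ABC, abstractmethod
from datetime import datetime
from typing import Any, Dict, Iterable


class BaseAggregator(ABC):
    """Shared aggregation logic; subclasses only decide which records count."""

    @abstractmethod
    def matches(self, record: Dict[str, Any]) -> bool:
        """Return True if the record belongs to the aggregated subset."""

    def aggregate(self, records: Iterable[Dict[str, Any]]) -> Dict[str, int]:
        totals: Dict[str, int] = {}
        for record in records:
            if not self.matches(record):
                continue
            speaker = record["speaker"]
            totals[speaker] = totals.get(speaker, 0) + record["words"]
        return totals


class IntervalAggregator(BaseAggregator):
    def __init__(self, start: datetime, end: datetime) -> None:
        self.start, self.end = start, end

    def matches(self, record: Dict[str, Any]) -> bool:
        return self.start <= record["time"] <= self.end


class TermAggregator(BaseAggregator):
    def __init__(self, term: str) -> None:
        self.term = term

    def matches(self, record: Dict[str, Any]) -> bool:
        return record["term"] == self.term
```

With this layout the summing code exists once, and a new way of selecting data only needs a new `matches` implementation.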

Route to retrieve all speakers

We would like to be able to select a subset of speakers when visualizing the data on the website frontend. However, as of right now, there is no route providing such information. Such a route should thus be added and documented in #14. Below is the structure of the corresponding request/response pair.

`/data/speakers`

A route providing a list of available speakers. The request is realized as a simple GET request without any parameters. A response will have the following structure:

{
    "speakers": [
        <speaker1>,
        <speaker2>,
        ...
    ] 
}

Statistics for agenda items

Generate speakers' statistics for a selected agenda item (items on the meeting level)

Mobile view broken

For some reason, the website isn't responsive.

Missing origin in charts

All charts are missing the origin, which can be confusing for users.

Extracting person related data

Extract data about MoPs, guests, ... and save it in some reasonable (tabular) representation for later use. One example use would be showing a MoP's full name instead of an id; another would be creating statistics based on these data, e.g. the dependence of speaking speed on the age of the MoP.

XSLT script performance

As of now, the XSLT script is inefficient and runs too slowly.

Possible future extensions

This is just a list of ideas (not necessarily speech-related) that can be implemented. This can be included in our eshop/comparator...

(1) named entities related statistics

citations ("podle...", Czech for "according to...")
mentions

(2) words

PoS statistics
vocabulary

(3) text coherence

https://ufal.mff.cuni.cz/evald

bash script for xslt transformation appends to file

The xslt_apply.sh bash script appends to the output file instead of overwriting it. This is undesirable behavior: the redirection should use `>` rather than `>>`.

parczech_speech_lengths/scripts/xslt_apply.sh

Line 7 in 444ab75

xsltproc scripts/speech_timestamps.xsl "$file" >> "${2}/${bn}.txt"

Speaker and data filtering

As of right now, the only way to filter speakers during data aggregation/visualization is by explicitly listing the speakers to visualize. The only way to compute statistics over a subset of the data is by specifying an exact time interval. We would like to be able to filter based on the following properties:

speakers
- ids
- birth
- sex
- role
- age
- group
- party
data
- time period
- term
- meeting
- sitting
- agenda

Website frontend

Design and implement the frontend part of the website. Details will be specified in this issue for now; later, they will be moved to the wiki.

`per_file_statistics.py` refactoring

Some redundancies were found in the per_file_statistics.py script; for example, the structure below does not need to remember speaker and role values.

parczech_speech_lengths/scripts/per_file_statistics.py

Lines 30 to 35 in 51b6ebf

self.last_continuous = {
    "speaker": None,
    "role": None,
    "beg": None,
    "end": None
}

Values are updated from existing audio files in wrong way

As of now, all the statistics are updated by the word field, which is obviously wrong (e.g. sentence field is updated using word value). This is probably because the code was copy-pasted heavily (and as we know, copy-pasting is often a source of mistakes).

Server returns error 500 when requested timespan not overlap data timespan

This error is not caught on the client side, so to the user the application appears to be inactive.
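On the server side, the failure could be avoided by checking whether the requested timespan overlaps the available data before aggregating. The sketch below is an assumption about how that might be done: the function names are hypothetical, and returning an empty dictionary stands in for whatever "no data" response the API settles on.

```python
from datetime import datetime
from typing import Callable, Dict


def intervals_overlap(start_a: datetime, end_a: datetime,
                      start_b: datetime, end_b: datetime) -> bool:
    """Closed intervals overlap when each one starts before the other ends."""
    return start_a <= end_b and start_b <= end_a


def aggregate_or_empty(start: datetime, end: datetime,
                       data_start: datetime, data_end: datetime,
                       aggregate: Callable[[datetime, datetime], Dict]) -> Dict:
    """Run `aggregate` only if the request touches the available data."""
    if not intervals_overlap(start, end, data_start, data_end):
        # Nothing to compute; respond with an empty result instead of failing.
        return {}
    return aggregate(start, end)
```

The client still needs to handle the empty result, for example by telling the user that no data exists for the chosen interval.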
