parczech_speech_lengths's People

Contributors

rumaak

parczech_speech_lengths's Issues

Aggregator should accept `datetime` objects

The interval aggregator should accept datetime objects as well as string representations of date and time. This would be practical because, during request validation, the corresponding fields are already converted to datetime objects.
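A minimal sketch of how such an input could be normalised, assuming string inputs use the ISO 8601 format. The names to_datetime and aggregate_interval are hypothetical and do not come from the project.

```python
from datetime import datetime
from typing import Tuple, Union

DateLike = Union[str, datetime]


def to_datetime(value: DateLike) -> datetime:
    """Return `value` unchanged if it is already a datetime, otherwise parse it."""
    if isinstance(value, datetime):
        return value
    # Assumption: strings are ISO 8601, e.g. "2020-01-31T09:00:00".
    return datetime.fromisoformat(value)


def aggregate_interval(start: DateLike, end: DateLike) -> Tuple[datetime, datetime]:
    """Hypothetical entry point: normalise both bounds before aggregating."""
    start_dt, end_dt = to_datetime(start), to_datetime(end)
    if start_dt > end_dt:
        raise ValueError("start must not be later than end")
    return start_dt, end_dt
```

With this shape, the request validation layer can pass its datetime objects straight through, while existing callers that pass strings keep working.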

Server API

Design and implement a server API. This issue will contain the server API description until it becomes more stable; then it will be moved to the wiki.

Server API design

Framework

FastAPI. Reasons:

  • written in Python (easier integration of already existing code)
  • simple setup (compared to Django)
  • fast, asynchronous (compared to Flask)
  • microframework (compared to Django)
  • automatic documentation (not sure whether available in Flask, Django)
  • type hints (not sure whether available in Flask, Django)

Also considered - Django, Flask.

Interface

Request / response structure specification.

Notes:

  • MoP is an abbreviation of member of parliament
  • request parameters are specified as <parameter>
  • values to be computed on/retrieved from the server are specified as <value>

Single speaker

Every member of parliament will (probably) have their own page with basic personal data and statistics computed over all available data. There will be an option to recalculate the statistics over a custom time interval. This leads us to the following request/response pairs:

/single/precomputed

Request structure:

{
    "MoP": <MoP_id>
}

Response structure:

{
    "speaking_time": {
        "regular": [
            ["Election period", "Words", "Sentences", "Paragraphs", "Utterances"],
            [<period1>, <words1>, <sentences1>, <paragraphs1>, <utterances1>],
            [<period2>, <words2>, <sentences2>, <paragraphs2>, <utterances2>],
            ...
        ],
        ...
    },
    "relative_diff": {
        "regular": [
            ["Election period", "Utterance vs paragraph", "Paragraph vs sentence", "Sentence vs word"],
            [<period1>, <up1>, <ps1>, <sw1>],
            [<period2>, <up2>, <ps2>, <sw2>],
            ...
        ],
        ...
    },
    "unanchored": {
        "regular": [
            ["Election period", "Unanchored"],
            [<period1>, <unanchored1>],
            [<period2>, <unanchored2>],
            ...
        ],
        ...
    },
    "wpm": {
        "regular": [
            ["Election period", "Words per minute"],
            [<period1>, <wpm1>],
            [<period2>, <wpm2>],
            ...
        ],
        ...
    }
}

/single/interval

Request structure:

{
    "MoP": <MoP_id>,
    "start": <time_start>,
    "end": <time_end>
}

Response structure:

{
    "speaking_time": {
        "regular": [
            ["Speaker", "Words", "Sentences", "Paragraphs", "Utterances"],
            [<MoP_name>, <words>, <sentences>, <paragraphs>, <utterances>]
        ],
        ...
    },
    "relative_diff": {
        ...
    },
    "unanchored": {
        ...
    },
    "wpm": {
        ...
    }
}

We've skipped the relative_diff, unanchored, and wpm fields because their structure is almost identical to the structure of these fields in the /single/precomputed response, except that the independent variable is Speaker instead of Election period.
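The shared shape of these fields (a header row followed by data rows, differing only in the independent variable) can be sketched as follows. The helper build_table is hypothetical, and the values are illustrative only, not real data.

```python
from typing import List, Sequence, Union

Cell = Union[str, int, float]


def build_table(independent: str, columns: Sequence[str],
                rows: Sequence[Sequence[Cell]]) -> List[List[Cell]]:
    """Return a header row followed by data rows, as in the "regular" arrays."""
    header: List[Cell] = [independent, *columns]
    table: List[List[Cell]] = [header]
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} cells, got {len(row)}")
        table.append(list(row))
    return table


# Same helper, different independent variable (illustrative values).
precomputed_wpm = build_table("Election period", ["Words per minute"],
                              [["period A", 120.0], ["period B", 131.5]])
interval_wpm = build_table("Speaker", ["Words per minute"],
                           [["Example Speaker", 125.0]])
```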

Multiple speakers

The user can select a subset of speakers for which the statistics will be plotted.

/multiple/interval

The response is almost identical to that of the /single/interval request, except that it contains data for multiple speakers.

Request structure:

{
    "speakers": {
        "static": {
            "MoPs": [
                <MoP_id1>,
                <MoP_id2>,
                ...
            ],
            "birth": {
                "from": <birth_from>,
                "to": <birth_to>
            },
            "sex": <sex>,
            "role": <role>
        },
        "dynamic": {
            "age": {
                "from": <age_from>,
                "to": <age_to>
            },
            "group": <group>,
            "party": <party>
        }
    },
    "data": {
        "interval": {
            "start": <start>,
            "end": <end>
        },
        "term": <term>,
        "meeting": <meeting>,
        "sitting": <sitting>,
        "agenda": <agenda>
    }
}

Response structure:

{
    "speaking_time": {
        "regular": [
            ["Speaker", "Words", "Sentences", "Paragraphs", "Utterances"],
            [<MoP_name1>, <words1>, <sentences1>, <paragraphs1>, <utterances1>],
            [<MoP_name2>, <words2>, <sentences2>, <paragraphs2>, <utterances2>],
            ...
        ],
        ...
    },
    "relative_diff": {
        ...
    },
    "unanchored": {
        ...
    },
    "wpm": {
        ...
    }
}

The request is more complicated in this case, so we shall describe it in further detail. There are two parts of the request:

  • data - over what data should the statistics be computed (i.e. what audio files should be included)
  • speakers - what speakers to include when computing the statistics (we consider the same person with two different roles as two different speakers in this context)

The fields in the data part are quite self-explanatory. In the case of the speakers field, the notion of static and dynamic parameters is introduced:

  • static - speaker properties that don't change in time
  • dynamic - properties of speaker that change in time

An interesting thing to note is that role is included as a static property of a speaker; this follows from what we stated before, i.e. the same person in two different roles is seen as two different speakers. From an implementation point of view, this makes more sense. For example, if a person's age moves above or below the selected age range during the selected time interval, it makes sense to disregard the person completely; however, it wouldn't make sense to disregard a person completely just because they also spoke in other roles (even though we would count only the statistics for the selected role).
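The distinction can be sketched as follows, under the assumption that a speaker is modelled as a (person, role) pair. The Speaker class, its field names, and select_speakers are hypothetical, not the project's own types.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class Speaker:
    person_id: str  # hypothetical identifier
    role: str       # static: the same person in another role is another Speaker
    birth: date     # static


def age_on(birth: date, day: date) -> int:
    """Age in completed years on the given day."""
    had_birthday = (day.month, day.day) >= (birth.month, birth.day)
    return day.year - birth.year - (0 if had_birthday else 1)


def select_speakers(speakers: Iterable[Speaker], role: str,
                    age_from: int, age_to: int,
                    start: date, end: date) -> List[Speaker]:
    selected = []
    for s in speakers:
        if s.role != role:
            continue  # drops only this (person, role) entry, not the person
        # Age never decreases, so checking both ends of the interval suffices.
        if age_on(s.birth, start) < age_from or age_on(s.birth, end) > age_to:
            continue  # age leaves the selected range: disregard completely
        selected.append(s)
    return selected
```

Because role is part of the speaker's identity here, filtering by role never removes the same person's entries in other roles; the age check, by contrast, removes the whole entry.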

Top speakers

The user can also skip choosing the subset of MoPs and see top performers according to each metric instead.

/top/interval

Request structure:

{
    "start": <time_start>,
    "end": <time_end>
}

Response structure:

{
    "speaking_time": {
        "regular": [
            ["Speaker", "Words", "Sentences", "Paragraphs", "Utterances"],
            [<MoP_name1>, <words1>, <sentences1>, <paragraphs1>, <utterances1>],
            [<MoP_name2>, <words2>, <sentences2>, <paragraphs2>, <utterances2>],
            ...
        ],
        ...
    },
    "relative_diff": {
        ...
    },
    "unanchored": {
        ...
    },
    "wpm": {
        ...
    }
}
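A rough sketch of how the top performers for one metric could be picked from a table of this shape. It assumes "top" means the highest value and that the number of rows to keep is a server-side choice; neither is specified above, and top_by_metric is a hypothetical helper.

```python
from typing import List, Sequence, Union

Cell = Union[str, int, float]


def top_by_metric(table: Sequence[Sequence[Cell]], metric: str,
                  n: int) -> List[List[Cell]]:
    """Keep the header and the n rows with the largest value of `metric`."""
    header, rows = table[0], table[1:]
    column = list(header).index(metric)  # raises ValueError for unknown metric
    ranked = sorted(rows, key=lambda row: row[column], reverse=True)
    return [list(header)] + [list(row) for row in ranked[:n]]


# Illustrative values only.
speaking_time = [
    ["Speaker", "Words", "Sentences"],
    ["Speaker A", 100, 10],
    ["Speaker B", 300, 20],
    ["Speaker C", 200, 30],
]
top_words = top_by_metric(speaking_time, "Words", 2)
```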

Additional routes

Routes providing other functionality.

/data/speakers

A route providing a list of available speakers. The request is realized as a simple GET request without any parameters. A response will have the following structure:

{
    "speakers": [
        <speaker1>,
        <speaker2>,
        ...
    ] 
}

Notes

Additional notes regarding frameworks and interface.

  • when designing the interface, only statistics we are currently able to compute were taken into account; for example, even though there will (probably) be a statistic comparing values of metrics across genders, the request for such statistic is not yet specified in the design
  • it is possible to change the API to allow for the explicit selection of a single statistic (e.g. send only data for the wpm statistic and only for role chair), but as of right now, values of all the statistics are sent back to the client

New statistics are not updated from existing audios

The word_count and no_anchor statistics are not updated from existing statistics computed over audio files. This means that when an audio file is used in multiple XML files (which, I am afraid, can happen), only the statistics computed over the last XML file are kept, which is obviously wrong.
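One possible fix, sketched under the assumption that the values from different XML files referring to the same audio should be added together (the project may need a different merge rule). The function accumulate and the audio identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Stats = Dict[str, int]


def accumulate(per_file: Iterable[Tuple[str, Stats]]) -> Dict[str, Stats]:
    """Merge per-XML-file statistics by audio id instead of overwriting them."""
    totals: Dict[str, Stats] = defaultdict(
        lambda: {"word_count": 0, "no_anchor": 0})
    for audio_id, stats in per_file:
        for name in ("word_count", "no_anchor"):
            # Add to the existing value rather than replacing it.
            totals[audio_id][name] += stats.get(name, 0)
    return dict(totals)
```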

Documentation

As the project grows larger, it might be a good idea to think about the documentation. Currently, the biggest issue is that many classes and functions lack docstrings.

The project has good coverage in terms of user documentation (i.e. installation, example usage, some simple presentation of results).

XSLT script tabs for each timeline

When there are multiple timelines, tabs are pasted for each of them. This is wrong, as they were intended to be used only when no matching timeline exists.

Aggregator should return computed data

As of right now an aggregator saves computed data to the filesystem. This was convenient for offline plotting; however, it is not practical in the case of online aggregated data retrieval. The aggregator should return the computed values, too.
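A minimal sketch of an aggregator that returns its result and only optionally writes it to disk. The aggregate function, its parameters, and the summing step are placeholders for illustration, not the project's actual aggregator.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


def aggregate(values: Dict[str, List[float]],
              output_path: Optional[Path] = None) -> Dict[str, float]:
    """Aggregate values, optionally save them, and always return them."""
    # Placeholder aggregation: sum the values of each statistic.
    result = {name: sum(items) for name, items in values.items()}
    if output_path is not None:
        # Keeps the offline plotting workflow working.
        output_path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

A server route can then call aggregate without a path and serialise the returned dictionary directly.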

Route to retrieve all speakers

We would like to be able to select a subset of speakers when visualizing the data on the website frontend. However, as of right now, there is no route providing such information. Such a route should thus be added and documented in #14. Below is the structure of the corresponding request/response pair.

/data/speakers

A route providing a list of available speakers. The request is realized as a simple GET request without any parameters. A response will have the following structure:

{
    "speakers": [
        <speaker1>,
        <speaker2>,
        ...
    ] 
}

Extracting person related data

Extract data about MoPs, guests, ... and save it in some reasonable (tabular) representation for later use. One example use is displaying an MoP's full name instead of an id; another is creating statistics based on these data, e.g. the dependence of speaking speed on the age of an MoP.
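As an illustration of what such a tabular representation could enable, the sketch below loads a CSV table and resolves ids to full names. The column names and rows are assumptions made up for the example, not the project's actual schema or data.

```python
import csv
import io
from typing import Dict

# Illustrative rows only; the column names are assumptions.
PERSONS_CSV = """id,full_name,birth,sex
p1,Example Person One,1970-06-15,F
p2,Example Person Two,1980-01-01,M
"""


def load_persons(text: str) -> Dict[str, Dict[str, str]]:
    """Index the rows of a CSV table by the person id."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["id"]: dict(row) for row in reader}


persons = load_persons(PERSONS_CSV)


def display_name(person_id: str) -> str:
    """Return the full name if known, otherwise fall back to the id."""
    return persons.get(person_id, {}).get("full_name", person_id)
```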

Possible future extensions

This is just a list of ideas (not necessarily speech-related) that can be implemented. This can be included in our eshop/comparator...

(1) named entities related statistics

  • citations ("podle...", Czech for "according to...")
  • mentions

(2) words

  • PoS statistics
  • vocabulary

(3) text coherence

Speaker and data filtering

As of right now, the only way to filter speakers during data aggregation/visualization is by explicitly listing the speakers to visualize. The only way to compute statistics over a subset of the data is by specifying an exact time interval. We would like to be able to filter based on the following properties:

  • speakers
    • ids
    • birth
    • sex
    • role
    • age
    • group
    • party
  • data
    • time period
    • term
    • meeting
    • sitting
    • agenda
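Filtering on the data properties could be sketched as below, assuming that a filter set to None means "no constraint" and that each record is a flat mapping. The record layout and the matches and filter_records helpers are hypothetical.

```python
from typing import Any, Iterable, List, Mapping


def matches(record: Mapping[str, Any], filters: Mapping[str, Any]) -> bool:
    """True if the record equals every filter value that is not None."""
    return all(value is None or record.get(key) == value
               for key, value in filters.items())


def filter_records(records: Iterable[Mapping[str, Any]],
                   filters: Mapping[str, Any]) -> List[Mapping[str, Any]]:
    return [record for record in records if matches(record, filters)]


# Illustrative records only.
records = [
    {"term": 1, "meeting": 1, "sitting": 1, "agenda": 1},
    {"term": 1, "meeting": 2, "sitting": 1, "agenda": 3},
    {"term": 2, "meeting": 1, "sitting": 2, "agenda": 1},
]
first_term = filter_records(records, {"term": 1, "meeting": None})
```

Range-based properties such as birth, age, and time period would need a comparison against "from" and "to" bounds instead of plain equality.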

Website frontend

Design and implement the frontend part of the website. Details will be specified in this issue for now; later, they will be moved to the wiki.

Values are updated from existing audio files in wrong way

As of now, all the statistics are updated by the word field, which is obviously wrong (e.g. sentence field is updated using word value). This is probably because the code was copy-pasted heavily (and as we know, copy-pasting is often a source of mistakes).
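One way to avoid this class of copy-paste error is to drive the update from a list of field names, so that each field can only be combined with its own value. This is a sketch assuming an update means adding the new value to the existing one; update_statistics is hypothetical, and the paragraph and utterance field names are inferred by analogy with word and sentence.

```python
from typing import Dict

# "word" and "sentence" are named in the issue; the other two are assumed.
FIELDS = ("word", "sentence", "paragraph", "utterance")


def update_statistics(existing: Dict[str, int],
                      new: Dict[str, int]) -> Dict[str, int]:
    """Update every field from the value of the same field."""
    return {field: existing.get(field, 0) + new.get(field, 0)
            for field in FIELDS}
```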
