
ml-kuleuven / socceraction


Convert soccer event stream data to SPADL and value player actions using VAEP or xT

License: MIT License

Python 99.75% Makefile 0.25%
soccer-analytics soccer soccer-data sports-analytics

socceraction's Introduction

Convert soccer event stream data to the SPADL format
and value on-the-ball player actions





Socceraction is a Python package for objectively quantifying the impact of the individual actions performed by soccer players using event stream data. The general idea is to assign a value to each on-the-ball action based on the action's impact on the game outcome, while accounting for the context in which the action happened. The video below gives a quick two-minute introduction to action values.

Valuing.Player.Actions.in.Soccer.mp4

Features

Socceraction contains the following components:

  • A set of API clients for loading event stream data from StatsBomb, Opta, Wyscout, Stats Perform and WhoScored as Pandas DataFrames using a unified data model.
  • Converters from each of these providers' proprietary data formats to the SPADL and atomic-SPADL formats, which are unified and expressive languages for on-the-ball player actions.
  • An implementation of the Expected Threat (xT) possession value framework.
  • An implementation of the VAEP and Atomic-VAEP possession value frameworks.

Installation / Getting started

The recommended way to install socceraction is to simply use pip. The latest version officially supports Python 3.9 - 3.11.

$ pip install socceraction

The folder public-notebooks provides a demo of the full pipeline from raw StatsBomb event stream data to action values and player ratings. More detailed installation/usage instructions can be found in the Documentation.

Contributing

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome. However, be aware that socceraction is not actively developed. Its primary use is to enable reproducibility of our research. If you believe a feature is missing, feel free to raise a feature request, but please be aware that it is very unlikely to be accepted. To learn more about how to contribute, see the Contributor Guide.

Research

If you make use of this package in your research, please consider citing the following papers:

  • Tom Decroos, Lotte Bransen, Jan Van Haaren, and Jesse Davis. Actions speak louder than goals: Valuing player actions in soccer. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1851-1861. 2019.
    [ pdf | bibtex ]

  • Maaike Van Roy, Pieter Robberechts, Tom Decroos, and Jesse Davis. Valuing on-the-ball actions in soccer: a critical comparison of XT and VAEP. In Proceedings of the AAAI-20 Workshop on Artificial Intelligence in Team Sports. AI in Team Sports Organising Committee, 2020.
    [ pdf | bibtex ]

The Expected Threat (xT) framework was originally introduced by Karun Singh on his blog in 2019.

License

Distributed under the terms of the MIT license, socceraction is free and open source software. Although not strictly required, we appreciate it if you include a link to this repo or cite our research in your work if you make use of socceraction.

socceraction's People

Contributors

0xflotus, bwyckaert, c-roensholt, dependabot[bot], karunsingh, ksbharaj, maaikevr, measaverb, nikitakoselev, npranav10, prateek-senapati, probberechts, tomdecroos, zanderhinton, znstrider

socceraction's Issues

No matching distribution found for json (from socceraction)

Running pip install socceraction fails because the json dependency cannot be found. The error occurs with Anaconda, with a plain Python pip installation, and on Google Colab. Since you changed the json dependencies recently, it would be worth looking into.

Collecting socceraction
  Downloading https://files.pythonhosted.org/packages/0c/d1/9e51784a2375d477996ef2574218103bb0a0bfadf4914b7d3738cfd25af1/socceraction-0.0.7.tar.gz
Collecting tqdm
  Downloading https://files.pythonhosted.org/packages/bb/62/6f823501b3bf2bac242bd3c320b592ad1516b3081d82c77c1d813f076856/tqdm-4.39.0-py2.py3-none-any.whl (53kB)
     |████████████████████████████████| 61kB 558kB/s
ERROR: Could not find a version that satisfies the requirement json (from socceraction) (from versions: none)
ERROR: No matching distribution found for json (from socceraction)

Add left/right foot to bodyparts

In Perform's data we can split actions by left/right foot.

This is helpful when we know a player's preferred foot.

I've added it to the bodyparts variable

bodyparts: List[str] = ['foot', 'right foot', 'left foot', 'head', 'other', 'head/other']

and to the _get_bodypart_id() function, to improve the metrics of my xG classifier.

Is this OK? Will it break other functionality? I'm ready to open a pull request.

Add Support for Wyscout API v3

On July 20th 2021, Wyscout switched their API to v3. Currently, socceraction only supports v2 of the Wyscout API, which is now a legacy version. Adding support for the new API format will require substantial changes to the socceraction.data.wyscout and socceraction.spadl.wyscout modules. As v2 of the API will remain available until the release of v4 (no release date yet), making these changes is currently not a priority. However, pull requests would be welcome.

See https://apidocs.wyscout.com/ for details.

Information leakage when estimating scoring probabilities

I believe there is information leakage in 3-estimate-scoring-and-conceding-probabilities.ipynb when the result of action a0 or the end location of action a0 is used as a feature to predict scoring and a0 is of type shot (including free kicks and penalties).

Bug in conversion of owngoals to atomic SPADL

When converting SPADL actions to atomic SPADL actions, an owngoal is currently converted as follows:

shotlike = ['shot', 'shot_freekick', 'shot_penalty']
shot_ids = list(_spadl.actiontypes.index(ty) for ty in shotlike)
samegame = actions.game_id == next_actions.game_id
sameperiod = actions.period_id == next_actions.period_id
shot = actions.type_id.isin(shot_ids)
goal = shot & (actions.result_id == _spadl.results.index('success'))
owngoal = shot & (actions.result_id == _spadl.results.index('owngoal'))

However, owngoals in SPADL are the result of a bad_touch and not of a shotlike, so owngoals are never correctly converted.

This can be fixed by replacing line 127 with:

    owngoal = (actions.type_id == _spadl.actiontypes.index("bad_touch")) & (actions.result_id == _spadl.results.index('owngoal'))

encoding for statsbomb add_players_and_teams and add_events

Thanks for the great work on making statsbomb data more accessible.

For the statsbomb.py functions, utf-8 encoding is declared for matches and competitions but not teams and players or events.

with open(competition_file, "rt", encoding="utf-8") as fh:
    matches += json.load(fh)

vs.

with open(lineup_file, "r") as fh:
    lineups += json.load(fh)

This is causing my code to break as default decoding is ascii. Could the functions be aligned to utf-8? Alternatively - add try/except logic?

Thanks

Chris
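One way to align the loaders, sketched under the assumption that every StatsBomb file is UTF-8 encoded JSON, is to route all reads through a single helper that always passes encoding="utf-8". The helper name load_json_utf8 is hypothetical and is not part of socceraction.

```python
import json
from typing import Any


def load_json_utf8(path: str) -> Any:
    """Load a JSON file, always decoding it as UTF-8.

    Hypothetical helper: passing the encoding explicitly avoids relying on
    the platform's default text encoding.
    """
    with open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)
```

Passing the encoding explicitly means the result no longer depends on the locale of the machine that runs the loader, so a try/except fallback should not be needed.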

KeyError: "['referee_id', 'venue'] not in index" when trying to load StatsBomb Champions League data

I encountered the following error when trying to follow the steps in public notebook 1, using StatsBomb's open data.
This error isn't encountered for the other datasets (e.g. La Liga), so presumably the Champions League data is a bit different to the rest. The section that gives the error is below:

games = list(
    SBL.games(row.competition_id, row.season_id)
    for row in selected_competitions.itertuples()
)
games = pd.concat(games, sort=True).reset_index(drop=True)
games[["home_team_id", "away_team_id", "game_date", "home_score", "away_score"]]

Error with function add_players_and_teams in statsbomb.py

Hi all

I am trying to execute notebook 2 of the tutorial and it is failing in this function

def add_players_and_teams(lineups_url, h5file):
on this line
players = pd.DataFrame(players.values())

DataFrame constructor not properly called!


Regards
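A possible cause of the error above, assuming players is a plain dict that maps player ids to record dicts (the report does not show how it is built), is that some older pandas versions reject a dict_values view in the DataFrame constructor. A minimal sketch of a workaround is to materialise the view as a list first. The sample data below is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the `players` dict built inside add_players_and_teams.
players = {
    1: {"player_id": 1, "player_name": "A"},
    2: {"player_id": 2, "player_name": "B"},
}

# Turn the dict view into a list of records before handing it to pandas.
players_df = pd.DataFrame(list(players.values()))
```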

Opta data never converts to goalkick SPADL action

When converting Opta event stream data, there is never a conversion to a 'goalkick' SPADL action.

To do so, the _get_type_id function in opta.py needs to be changed. According to this source (couldn't find an up-to-date official document on the internet), a goal kick corresponds to pass qualifier 124.

Another question also arises: is a keeper throw (qualifier 123) also considered a goal kick in terms of SPADL actions? I think it makes sense to either have separate SPADL actions for these two, or one (renamed?) SPADL action for them both. I do not think that a keeper throw should be considered a regular pass, since there is no pressure on the keeper to execute this action quickly (in contrast to a regular pass). Therefore it may need a different treatment when processing the data.
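The qualifier-based distinction described above could be sketched as follows. This is an illustration only: the function name, the "keeper_throw" label and the shape of the qualifiers mapping are assumptions, and the qualifier ids 124 and 123 are taken from the issue text rather than from official Opta documentation.

```python
from typing import Any, Mapping

GOAL_KICK_QUALIFIER = 124     # per the source cited in the issue, unverified
KEEPER_THROW_QUALIFIER = 123  # per the issue text, unverified


def classify_keeper_restart(type_name: str, qualifiers: Mapping[int, Any]) -> str:
    """Refine a pass into a goal kick or keeper throw based on its qualifiers.

    Hypothetical helper, not the actual _get_type_id logic in opta.py.
    """
    if type_name != "pass":
        return type_name
    if GOAL_KICK_QUALIFIER in qualifiers:
        return "goalkick"
    if KEEPER_THROW_QUALIFIER in qualifiers:
        # Made-up label: SPADL currently has no separate keeper-throw action.
        return "keeper_throw"
    return "pass"
```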

Header shots in Wyscout

Bug

Wyscout does not distinguish between headers and other body parts on shots. The SPADL converter simply labels all shots as performed by foot, which causes issues when training an expected goals model.

Solution

This can be fixed easily:

def determine_bodypart_id(event):
    """
    This function determines the body part used for an event
    Args:
    event (pd.Series): Wyscout event Series
    Returns:
    int: id of the body part used for the action
    """
    if event["subtype_id"] in [81, 36, 21, 90, 91]:
        body_part = "other"
    elif event["subtype_id"] == 82:  # or event['head_or_body']:
        body_part = "head"
+   elif event["type_id"] == 10 and event['head/body']:
+       body_part = "?"
    else:  # all other cases
        body_part = "foot"
    return bodyparts.index(body_part)

However, I do not know which body part I should use for these shots. Just other, or head, or create a new one head/other?

SPADL Definition

For my thesis I'm defining a sequence of ball possession of a team as a specific sequence of SPADL actions that occur in a larger sequence of SPADL actions (the precise definition is not important for this issue). For this, I'm relying on Table 3.1 in Tom Decroos's PhD thesis (https://tomdecroos.github.io/reports/thesis_tomdecroos.pdf). This table defines all SPADL actions and which attribute values each action can have. However, this definition does not seem up to date with the actual implementation in socceraction. I encountered the following differences:

  • The goalkick action is not described in the definition of SPADL in the thesis
  • The definition of SPADL in the thesis describes a keeper pick-up action. However, judging from the code, the StatsBomb to SPADL converter never converts an action to this type.
  • The definition of SPADL in the thesis states that fouls will always have 'success' as a result, while the converter will always give 'fail' as the result of this action
  • The definition of SPADL in the thesis states that interceptions will always have 'success' as a result, while the converter will attribute 'success' to some interceptions (in case they succeed) and 'fail' to other interceptions (in case the Statsbomb data say that the ball was intercepted but knocked to opposition or the ball was intercepted but went out of bounds by doing so)
  • The definition of SPADL in the thesis states that keeper saves will always have 'success' as a result, but the Statsbomb action-attribute pair 'Shot Saved (In Play Danger)' would lead to a conversion to a failed keeper save action. The same goes for the Statsbomb action-attribute pair 'Punch (In Play Danger)', although I'm not sure whether that combination can actually occur in the data.

A precise definition of the SPADL data format is necessary to correctly define a sequence of ball possession in terms of SPADL actions. It is important to spell out the trickier cases, e.g. that a failed interception by the other team does not interrupt a team's ball possession sequence. The thesis, for example, rules out the possibility of failed interceptions.

I therefore propose keeping an up-to-date definition somewhere that precisely defines SPADL and which action-attribute pairs are valid. This would make it possible to build definitions in terms of SPADL actions or SPADL action sequences.

When building this definition, I also think that the following things in the original definition in the thesis deserve some attention:

  • According to the original definition, all corner actions can have offside as a special result. This cannot occur in practice.
  • According to the original definition, tackles can have a yellow or red card as a special result. However, I think that in this case it should be classified as a foul. Maybe the converter should be built in such a way that it converts a failed tackle with a card as result (from the original data of 3rd parties) to 2 actions in SPADL, in which the first is a failed tackle and the second is a foul with the card as a result.
  • According to the original definition, penalty shots and free kick shots can have an owngoal as a special result. However, unless some truly mafioso things are going on, this can never be the case in practice. However, in football you never know of course... :)

check for same period when adding separating out dribbles

(I'm using statsbomb data; from a glimpse it looks like the opta and wyscout spadl conversion scripts suffer from the same bug)

https://github.com/ML-KULeuven/socceraction/blob/master/socceraction/spadl/statsbomb.py#L406

When selecting events to add dribbles in between, there is no check that the two events take place in the same period (half) of the game. Because time_seconds resets to roughly 0 at the start of the next half, the time difference between the two events is negative (and so less than 10).

As a result, a dribble can be inserted between the last event of the preceding half and the pass that starts the following half.

A simple check for

    same_period = actions.period_id == next_actions.period_id

should resolve this

I'm on holiday at the moment but will create a pull request to solve this in the next few days
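The check proposed above can be sketched in isolation as follows. This is a simplified illustration, not the converter's actual _add_dribbles logic: the function name is hypothetical, only the team, period and time conditions are modelled, and the 10-second threshold is the one mentioned in the issue.

```python
import pandas as pd


def dribble_candidates(actions: pd.DataFrame, max_dt: float = 10.0) -> pd.Series:
    """Flag actions after which a dribble may be inserted (simplified sketch)."""
    next_actions = actions.shift(-1)
    same_team = actions.team_id == next_actions.team_id
    # The proposed extra condition: both actions must be in the same period.
    same_period = actions.period_id == next_actions.period_id
    dt = next_actions.time_seconds - actions.time_seconds
    return same_team & same_period & (dt < max_dt)


actions = pd.DataFrame(
    {
        "team_id": [1, 1, 1],
        "period_id": [1, 2, 2],
        "time_seconds": [2700.0, 1.0, 3.0],
    }
)
mask = dribble_candidates(actions)
```

Without same_period, the first row would be flagged because its time difference to the next row is negative.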

SchemaError when converting Wyscout events to SPADL actions

When using the convert_to_actions method

def convert_to_actions(events: pd.DataFrame, home_team_id: int) -> DataFrame[SPADLSchema]:

to convert Wyscout events to SPADL actions, I get the following error:

[screenshot of the error]

I am using Python 3.8 and socceraction 1.1.3. The bug/error can be recreated as follows (set the download flag as needed):

    import socceraction.data.wyscout
    import socceraction.spadl.wyscout

    pwl = socceraction.data.wyscout.PublicWyscoutLoader(download=True)
    competitions = pwl.competitions()
    world_cup = competitions.loc[6]
    wc_games = pwl.games(world_cup.competition_id, world_cup.season_id)
    game_id = wc_games.at[0, 'game_id']
    home_team_id = wc_games.at[0, 'home_team_id']
    events = pwl.events(game_id)
    actions = socceraction.spadl.wyscout.convert_to_actions(events, home_team_id)

Removing the .astype(int) on line 53 seems to fix this for me.

actions[col] = actions[col].astype(int)

Determine more accurate timestamps for extra actions in Atomic-SPADL

Socceraction sets the timestamps of dribbles and (atomic) receptions to the midpoint (t1 + t2) / 2 between the preceding and subsequent actions.

The downside is that a short pass followed by a long dribble still gets a timestamp in the middle, which falls inside the dribble.
The opposite can happen too: when a long ball travels for a few seconds and is immediately passed on, the timestamp falls in the middle of the pass.

Would it make sense to assume a fixed passing speed and derive the timestamp of the ball reception from the distance, and thus from the time the ball would take to travel from origin to destination?

I am not sure how relevant this is. It might be useful, for example, for looking at how long the ball is held in individual possessions.
Are Opta timestamps always in whole seconds? If so, their limited accuracy might make this approach unnecessary.
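The fixed-speed idea could look roughly like the sketch below. The function name and the default ball speed of 15 m/s are assumptions chosen for illustration, not values used by socceraction. The estimate is capped at the time of the next action so that a reception never happens after the action that follows it.

```python
import math
from typing import Tuple


def estimate_reception_time(
    t_pass: float,
    start: Tuple[float, float],
    end: Tuple[float, float],
    t_next: float,
    ball_speed: float = 15.0,  # assumed average ball speed in m/s
) -> float:
    """Estimate when a pass is received from its length and an assumed speed."""
    distance = math.hypot(end[0] - start[0], end[1] - start[1])
    t_arrival = t_pass + distance / ball_speed
    # Never place the reception after the next recorded action.
    return min(t_arrival, t_next)
```

For a 30 m pass played at t = 10 s with the next action at t = 20 s, this gives a reception at 12 s instead of the midpoint of 15 s.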

Refactor opta.py and wyscout.py to follow the same pattern of statsbomb.py

statsbomb.py now offers some great atomic functions to download and convert statsbomb data. The notebook public-notebooks/1-load-and-convert-statsbomb-data demonstrates the full pipeline and makes it possible for users to inspect intermediate results, view only a small part of the data, and debug in case things go wrong.

In contrast, opta.py and wyscout.py are still very much black boxes that only offer a single public function for converting an entire folder of data. These modules have none of the benefits statsbomb.py has now.

Bug in conversion to atomic SPADL due to gaps in Wyscout data

Wyscout collects its data by video analysis. This means that when replays of certain events are shown, the events that occur during that replay are not captured. The most commonly replayed events are goals, and this occasionally causes the kick-off and subsequent actions to be absent from the game data. Most of the time this causes no issues (except for the fact that there is no data for small parts of the game), but when converting SPADL actions to atomic SPADL actions we run into problems. When converting from default SPADL to atomic SPADL, the result (result_id and result_name) is replaced by an extra action. For shotlike actions this is done as follows:

def _extra_from_shots(actions: pd.DataFrame) -> pd.DataFrame:
    next_actions = actions.shift(-1)
    shotlike = ['shot', 'shot_freekick', 'shot_penalty']
    shot_ids = list(_spadl.actiontypes.index(ty) for ty in shotlike)
    samegame = actions.game_id == next_actions.game_id
    sameperiod = actions.period_id == next_actions.period_id
    shot = actions.type_id.isin(shot_ids)
    goal = shot & (actions.result_id == _spadl.results.index('success'))
    owngoal = shot & (actions.result_id == _spadl.results.index('owngoal'))
    next_corner_goalkick = next_actions.type_id.isin(
        [
            _atomicspadl.actiontypes.index('corner_crossed'),
            _atomicspadl.actiontypes.index('corner_short'),
            _atomicspadl.actiontypes.index('goalkick'),
        ]
    )
    out = shot & next_corner_goalkick & samegame & sameperiod
    extra_idx = goal | owngoal | out
    prev = actions[extra_idx]
    # nex = next_actions[extra_idx]
    extra = pd.DataFrame()
    extra['game_id'] = prev.game_id
    extra['original_event_id'] = prev.original_event_id
    extra['period_id'] = prev.period_id
    extra['action_id'] = prev.action_id + 0.1
    extra['time_seconds'] = prev.time_seconds  # + nex.time_seconds) / 2
    extra['start_x'] = prev.end_x
    extra['start_y'] = prev.end_y
    extra['end_x'] = prev.end_x
    extra['end_y'] = prev.end_y
    extra['bodypart_id'] = prev.bodypart_id
    extra['result_id'] = -1
    extra['team_id'] = prev.team_id
    extra['player_id'] = prev.player_id
    ar = _atomicspadl.actiontypes
    extra['type_id'] = -1
    extra['type_id'] = (
        extra.type_id.mask(goal, ar.index('goal'))
        .mask(owngoal, ar.index('owngoal'))
        .mask(out, ar.index('out'))
    )
    actions = pd.concat([actions, extra], ignore_index=True, sort=False)
    actions = actions.sort_values(['game_id', 'period_id', 'action_id']).reset_index(drop=True)
    actions['action_id'] = range(len(actions))
    return actions

In short: when the result of a shot action is success, the next action will be a goal; when the action following the shot is a corner or a goalkick, the next action will be out. The problem lies in:

ar = _atomicspadl.actiontypes
extra['type_id'] = -1
extra['type_id'] = (
    extra.type_id.mask(goal, ar.index('goal'))
    .mask(owngoal, ar.index('owngoal'))
    .mask(out, ar.index('out'))
)

Due to some events not being registered by Wyscout after a goal, it is possible that the first event registered after a goal is a goalkick or a corner, instead of the expected pass (the kickoff). This means that line 161 will override what is done on line 159, causing the goals to be incorrectly converted.

There are two possible ways to fix this (that I've come up with at least):

  • The first is to simply replace line 159 with line 161, changing the order of the masks.
  • The second is to allow a maximum time difference between a shotlike action and a goalkick or corner for the goalkick or corner to be considered the action following the shot. However, considering e.g. VAR interventions, which might take some time to complete, this might be imprecise.
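The first option above, changing the order of the masks, can be illustrated with a small self-contained sketch. The numeric ids are placeholders rather than the real atomic-SPADL indices, and extra_type_ids is a hypothetical helper, not a function in socceraction.

```python
import pandas as pd

GOAL, OWNGOAL, OUT = 1, 2, 3  # placeholder ids, not the real atomic-SPADL indices


def extra_type_ids(goal: pd.Series, owngoal: pd.Series, out: pd.Series) -> pd.Series:
    """Assign the extra action type, letting goals take priority over 'out'."""
    type_id = pd.Series(-1, index=goal.index)
    # Apply 'out' first so that a later goal/owngoal mask can override it.
    return type_id.mask(out, OUT).mask(goal, GOAL).mask(owngoal, OWNGOAL)


# Row 0: a goal whose next recorded event happens to be a corner or goalkick.
goal = pd.Series([True, False, False])
owngoal = pd.Series([False, False, False])
out = pd.Series([True, True, False])
type_ids = extra_type_ids(goal, owngoal, out)
```

Because later masks overwrite earlier ones, the condition that should win has to be applied last.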

Inconsistencies in action type of own goals

In the SPADL representation of Opta and Statsbomb all own goals are labeled as shots, while the Wyscout converter labels them as passes, interceptions or clearances. First, I think it would be better to be consistent and use the same action types for each provider. Second, shot is not a good action type for own goals in my opinion. I prefer bad_touch. Another reasonable option would be to use the type of the intended action (i.e., clearance, interception, pass, keeper_save,...) as is done now in the Wyscout converter, but I do not know whether it is easy to do this for Statsbomb and Opta.

Discrepancy between successful passes in the SPADL and atomic-SPADL representations

While converting Wyscout events to SPADL actions, most duels are removed because they are not considered on-the-ball actions. In doing so, however, some information is lost. Wyscout considers a pass that is followed by a duel as accurate (translated to SPADL as a successful action), even if the duel is lost by the teammate of the player who gave the pass. This causes successful passes to be followed by an action of the opposing team. It would make more sense (in my opinion) to mark the pass as failed and follow it up with an interception by the opposing player.

Bug in _get_minutes_played

The _get_minutes_played method in wyscout loader contains 2 bugs:

  1. The duration of a game is defined as follows:

periods_ts = {i: [0] for i in range(6)}
for e in events:
    period_id = wyscout_periods[e['matchPeriod']]
    periods_ts[period_id].append(e['eventSec'])
duration = int(sum([max(periods_ts[i]) / 60 for i in range(5)]))

In words: take the time in seconds of the last event in every period, convert to minutes and sum up the values. This means that injury time is taken into account for every period. Wyscout, however, only takes into account injury time of the last period of the game when defining the minute of e.g. a substitution, a red card. So when you now define the amount of minutes played of a substitute as follows:

'minutes_played': duration - substitution['minute'],

and the time played by a player who gets substituted as:

pg[substitution['playerOut']]['minutes_played'] = substitution['minute']

the number of minutes played will be too high and too low respectively. For players who play a full game there is obviously no problem.

  2. Red cards are not taken into account, so a player who gets a red card in e.g. minute 5 will get his minutes played set to the duration of the game.

Create a consistent definition for keeper events in SPADL

SPADL defines 4 different event types for describing save/ball recovery actions by keepers:

Action type     | Description                    | Success
Keeper save     | Keeper saves a shot on goal    | Always success
Keeper claim    | Keeper catches a cross         | Does not drop the ball
Keeper punch    | Keeper punches the ball clear  | Always success
Keeper pick-up  | Keeper picks up the ball       | Always success

First, it is somewhat unclear what the differences between these four events are. For example,

  • When a keeper saves a shot but can not claim the ball, is it a "keeper save" or "keeper punch" action? Or are "keeper punch" actions only for crosses?
  • When a keeper rushes out to either cut out an attacking pass (in a race with the opposition player) or to close-down an opposition player is that a pick-up or a claim?
  • Is there a difference between a keeper pick-up and interception action (apart from the body part)? If there is no difference it might be better to simply drop the keeper pick-up action.

Second, there are inconsistencies between the different converters.

  • The keeper pick-up action is missing in the StatsBomb and Wyscout converters. I believe that "Goalkeeper" events with type "collected (25)" and outcome "success (15)" should be converted to this type, while events with outcome "claim (47)" should be keeper claim events.
  • Both the keeper pick-up and keeper claim actions are missing in the Wyscout converter
  • The definition of SPADL states that keeper saves will always have 'success' as a result, but the Statsbomb action-attribute pair 'Shot Saved (In Play Danger)' would lead to a conversion to a failed keeper save action. The same goes for the Statsbomb action-attribute pair 'Punch (In Play Danger)', although I'm not sure whether that combination can actually occur in the data.

A related point was raised in #45: How should keeper throws be addressed? Is a keeper throw also considered a goal kick in terms of SPADL actions? I think it makes sense to either have separate SPADL actions for these two, or one (renamed?) SPADL action for them both. I do not think that a keeper throw should be considered a regular pass, since there is no pressure on the keeper to execute this action quickly (in contrast to a regular pass). Therefore it may need a different treatment when processing the data.

Although keeper actions are not very important in the action valuing frameworks, this might be useful in other applications of SPADL. Therefore, I believe it would be good to agree upon a definition for these events and fix the inconsistencies in the converters.

Handling of label.scores and label.concedes of final actions

file: socceraction/classification/features.py

for i in range(1, nr_actions):
    for c in ["team_id", "goal", "owngoal"]:
        shifted = y[c].shift(-i)
        shifted[-i:] = y[c][len(y) - i]
        y["%s+%d" % (c, i)] = shifted

This code does not correctly propagate goals for the last nr_actions actions in a match.

Action positions are wrong when converting from Wyscout events

Two mistakes are made when converting Wyscout events to SPADL events:

  1. make_new_positions clips all x and y coordinates to (0, 105) and (0, 68) respectively, where 105 and 68 are the field length and field width used in the whole package.

def make_new_positions(events: pd.DataFrame) -> pd.DataFrame:
    """Extract the start and end coordinates for each action.

    Parameters
    ----------
    events : pd.DataFrame
        Wyscout event dataframe

    Returns
    -------
    pd.DataFrame
        Wyscout event dataframe with start and end coordinates for each action.
    """
    new_positions = events[['event_id', 'positions']].apply(
        lambda x: _make_position_vars(x[0], x[1]), axis=1
    )
    new_positions.columns = ['event_id', 'start_x', 'start_y', 'end_x', 'end_y']
    events = pd.merge(events, new_positions, left_on='event_id', right_on='event_id')
    events[['start_x', 'end_x']] = events[['start_x', 'end_x']].clip(0, 105)
    events[['start_y', 'end_y']] = events[['start_y', 'end_y']].clip(0, 68)
    events = events.drop('positions', axis=1)
    return events

The problem is that when this method is called, we are still working with Wyscout positions and Wyscout defines positions as follows:

[image: Wyscout pitch coordinate diagram]

which I got from https://figshare.com/articles/dataset/Events/7770599?backTo=/collections/Soccer_match_event_dataset/4415000. This means that either the positions have to be clipped to (0, 100) (both x and y) or they have to be clipped at a later stage. I don't know what's more desirable.

  2. The second mistake is that line 47 in convert_to_actions fixes the direction of play (so that the player who performs the action always plays from left to right), but as can be seen in the picture above, Wyscout already defines its events like this, so this line just reverts the process.

def convert_to_actions(events: pd.DataFrame, home_team_id: int) -> DataFrame[SPADLSchema]:
    """
    Convert Wyscout events to SPADL actions.

    Parameters
    ----------
    events : pd.DataFrame
        DataFrame containing Wyscout events from a single game.
    home_team_id : int
        ID of the home team in the corresponding game.

    Returns
    -------
    actions : pd.DataFrame
        DataFrame with corresponding SPADL actions.
    """
    events = pd.concat([events, get_tagsdf(events)], axis=1)
    events = make_new_positions(events)
    events = fix_wyscout_events(events)
    actions = create_df_actions(events)
    actions = fix_actions(actions)
    actions = _fix_direction_of_play(actions, home_team_id)
    actions = _fix_clearances(actions)
    actions['action_id'] = range(len(actions))
    actions = _add_dribbles(actions)
    return actions.pipe(DataFrame[SPADLSchema])
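For the first mistake, one of the two options mentioned above is to clip in Wyscout's own 0 to 100 range and only then rescale to the 105 by 68 pitch. A minimal sketch of that ordering follows. The function name is hypothetical, and any axis flipping the real converter may apply is deliberately left out.

```python
import pandas as pd

FIELD_LENGTH, FIELD_WIDTH = 105.0, 68.0  # pitch size used throughout the package


def clip_and_rescale(events: pd.DataFrame) -> pd.DataFrame:
    """Clip Wyscout percentage coordinates to [0, 100], then rescale to meters."""
    events = events.copy()
    x_cols, y_cols = ["start_x", "end_x"], ["start_y", "end_y"]
    # Clip while the values are still Wyscout percentages.
    events[x_cols + y_cols] = events[x_cols + y_cols].clip(0, 100)
    events[x_cols] = events[x_cols] * FIELD_LENGTH / 100
    events[y_cols] = events[y_cols] * FIELD_WIDTH / 100
    return events


events = pd.DataFrame(
    {
        "start_x": [-5.0, 50.0],
        "end_x": [100.0, 120.0],
        "start_y": [0.0, 50.0],
        "end_y": [110.0, 100.0],
    }
)
rescaled = clip_and_rescale(events)
```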

Add support for new Opta type IDs

It seems like Opta added some new type IDs for the 2021/22 season. These are not yet supported in socceraction, causing the following error:

SchemaError: non-nullable series 'type_name' contains null value

As a temporary solution, you can downgrade to v1.1.1 (pip install socceraction==1.1.1).

bug when fixing clearances

In statsbomb.py, opta.py and wyscout.py, the function fix_clearances fails when a clearance is the last action of a game. In this case end_x and end_y become NaN because there is no next action.

def fix_clearances(actions):
    next_actions = actions.shift(-1)
    clearance_idx = actions.type_id == actiontypes.index("clearance")
    actions.loc[clearance_idx, "end_x"] = next_actions[clearance_idx].start_x.values
    actions.loc[clearance_idx, "end_y"] = next_actions[clearance_idx].start_y.values
    return actions
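One possible way to guard against the missing next action is sketched below. It assumes that keeping the clearance's original end coordinates is an acceptable fallback, which is a design choice rather than something the issue prescribes. The CLEARANCE id is a placeholder, and the sketch does not handle a next action that belongs to a different game.

```python
import pandas as pd

CLEARANCE = 18  # placeholder id; the real code uses actiontypes.index("clearance")


def fix_clearances(actions: pd.DataFrame) -> pd.DataFrame:
    """Set the end of a clearance to the start of the next action, if any."""
    actions = actions.copy()
    next_actions = actions.shift(-1)
    clearance_idx = actions.type_id == CLEARANCE
    # Fall back to the existing end location when there is no next action.
    next_x = next_actions.start_x.fillna(actions.end_x)
    next_y = next_actions.start_y.fillna(actions.end_y)
    actions.loc[clearance_idx, "end_x"] = next_x[clearance_idx].values
    actions.loc[clearance_idx, "end_y"] = next_y[clearance_idx].values
    return actions


actions = pd.DataFrame(
    {
        "type_id": [CLEARANCE, 0, CLEARANCE],
        "start_x": [5.0, 40.0, 60.0],
        "start_y": [5.0, 30.0, 30.0],
        "end_x": [50.0, 60.0, 70.0],
        "end_y": [50.0, 30.0, 35.0],
    }
)
fixed = fix_clearances(actions)
```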

StatsPerform JSON parsers: load data from memory

The current loaders are designed to read files and process them into dataframes.

This is not suitable if you want to use the SDDP feed instead of SDAPI. The difference is that SDDP adds events during the match, while SDAPI is available only after the match.

I want to calculate some metrics during the match, so I created an alternative in-memory loader based on the MA3 loader:
https://gist.github.com/denisov-vlad/28d4668c4861b7c551a6caba3c341ba2

As you can see, there is a lot of duplicated code. It would be great to split the extract functions into loading from disk and processing data.

Data leak in expected goals model

Hi!

First of all, thanks for the great package, it makes working with JSON files much easier!

I was reviewing the latest expected goal model code (EXTRA-build-expected-goals-models) and I think there is a small data leak in the model.

The features dx_a0 and dy_a0 (and also movement_a0, as it is derived from the other two) use the end location of the action, which for shots is the end location of the shot. All successful shots (goals) have an end x location of either 0 or 105 (and y values of 30-37), so the movement information effectively encodes the result of the shot. For example, if a shot is taken from a horizontal distance of 25 meters from the goal (start_x_a0 = 25), any dx_a0 value below 25 automatically means that the shot was not a goal. It is not a direct leak, but I still believe the model would be better without these features.

Please let me know if I'm missing a point!

Best wishes!

Handle lagging SPADL features for first actions in games/periods

A question, not an issue per se.

When lagging game states to compute features on SPADL data with

def gamestates(actions : pd.DataFrame, nb_prev_actions: int =3) -> List[pd.DataFrame]:
the default fill value is 0. Given that 0 is a valid type_id (at least for StatsBomb, where it is a pass), does this (ever so slightly) affect results by implying that, e.g., when a team kicks off, the last 3 actions were passes?

I imagine this is of little to no consequence in practice, as so few actions happen right after kick-off, but it might be worth assigning either 999 or NA (etc.) to lagged actions that do not have a preceding action.
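
The NA variant of this idea can be sketched with a group-wise shift, so that the first actions of each game and period get missing values instead of 0. This is not the library's gamestates function: the name lag_actions and the toy data are assumptions for illustration.

```python
from typing import List

import pandas as pd


def lag_actions(actions: pd.DataFrame, nb_prev_actions: int = 3) -> List[pd.DataFrame]:
    """Return [a0, a1, ...] where a_i holds the i-th previous action of each row.

    Rows without an i-th previous action in the same game and period get NaN.
    """
    grouped = actions.groupby(["game_id", "period_id"], sort=False)
    lagged = [actions]
    for i in range(1, nb_prev_actions):
        # Group-wise shift keeps the original index and never crosses periods.
        lagged.append(grouped.shift(i))
    return lagged


toy = pd.DataFrame(
    {
        "game_id": [1, 1, 1, 1],
        "period_id": [1, 1, 2, 2],
        "type_id": [0, 5, 0, 3],
    }
)
states = lag_actions(toy)
```

Note that the lagged frames do not contain the grouping columns and that integer columns become floating point once they hold NaN, so downstream feature code would have to handle missing values explicitly.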

Bugs and Error in OptaLoader's extract_lineups() function affecting "is_starter" & "minutes_played" columns (F7_XML)

To make the bugs reproducible, I used the OptaLoader on the XML feeds from test folder and have attached a screenshot here.

bugs

I noticed 2 bugs that affect the columns is_starter and minutes_played. Both bugs are located in the extract_lineups() function of the _F7XMLParser class. There is also a possible error in the logic used to calculate minutes_played (see Issue 3 below).

Issue 1: a possible bug
Location : Line 841 in spadl/opta.py

In the line is_starter=player_elm.attrib['Formation_Place'] != 0, the attribute Formation_Place is a string containing a number from 0 to 11. Because a string is never equal to the integer 0, is_starter is always True, irrespective of the value of Formation_Place.
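
The type mismatch can be reproduced and avoided with a few lines. The helper name is_starter is hypothetical; the plain dict stands in for the XML element's attrib mapping.

```python
from typing import Dict


def is_starter(player_attrib: Dict[str, str]) -> bool:
    """Cast the attribute to int before comparing it with 0."""
    return int(player_attrib["Formation_Place"]) != 0


# XML attribute values are strings, so this comparison is always True.
buggy_result = "0" != 0

bench_player = {"Formation_Place": "0"}
starting_player = {"Formation_Place": "4"}
```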

Issue 2: a possible bug
Location : Line 827 in spadl/opta.py

The following piece of code
sub_on = int(next((item['Time'] for item in subst if item['SubOn'] == f'p{player_id}'), 0))

assigns the value 0 to the variable sub_on for substitutes who never come on. As a result, players who stay on the bench for the whole game get minutes_played equal to stats['match_time'], because minutes_played = sub_off - sub_on and sub_off is set to stats['match_time'] for all players who are not subbed off. As seen in the picture above, Iturraspe, a player who doesn't play a single minute in the game, has minutes_played = 96 - 0 = 96.

Issue 3: a possible error
Location : Line 827 in spadl/opta.py
Substitution events in Opta don't necessarily involve 2 players. A player retirement is also recorded as a substitution event. So when a player retires (i.e. the team has exhausted its available substitutions), the Substitution element will not have the 'SubOn' attribute, only 'SubOff'. In these circumstances one gets a KeyError, because the generator expression looks up the SubOn key in each Substitution element.
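
A minimal sketch that addresses Issues 2 and 3 together is shown below. It is an illustration, not the code from the pull request: the function name minutes_played, its parameters and the toy substitution records are assumptions, with each substitution represented as a plain dict of XML attributes.

```python
from typing import Dict, List


def minutes_played(
    player_id: int,
    starter: bool,
    substitutions: List[Dict[str, str]],
    match_time: int,
) -> int:
    """Minutes on the pitch for one player (illustrative sketch)."""
    key = f"p{player_id}"
    # .get() tolerates retirements, which carry 'SubOff' but no 'SubOn'.
    sub_on = next((int(s["Time"]) for s in substitutions if s.get("SubOn") == key), None)
    sub_off = next((int(s["Time"]) for s in substitutions if s.get("SubOff") == key), match_time)
    if starter:
        return sub_off
    if sub_on is None:
        # Unused substitute: never entered the pitch.
        return 0
    return sub_off - sub_on


substitutions = [
    {"Time": "60", "SubOff": "p1", "SubOn": "p12"},
    {"Time": "80", "SubOff": "p2"},  # retirement: no 'SubOn' attribute
]
```

Using None instead of 0 as the "never came on" marker keeps unused substitutes distinguishable from players who were on the pitch from minute 0.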

P.S : I managed a fix locally for all the 3 issues and will be issuing a pull request momentarily. I thought posting this as an issue would create a log of this issue on the issues section of this repo and might help people in the future.

OptaLoader Whoscored parser

Hello, can you please explain a little better how the OptaLoader with Whoscored works?
What is the feeds dict() format?

dict_opta = {
                'whoscored': "PremierLeague-2020_2021\\1485314.json"
            }
datafolder = "..\data\Premier_League-2020_2021"
SBL = opta.OptaLoader(root=datafolder, feeds=dict_opta, parser='whoscored')

I tried this quick test but I don't think I am doing it correctly since I am not getting any competitions from competitions = SBL.competitions()

Can you give me a quick example on how to load the whoscored json?

Kind regards

add original_event_id to SPADL data

Sometimes people want to use some extra information of event data not available in the SPADL format, e.g., pressure attribute of StatsBomb data.

We can't extend SPADL to accommodate every extra piece of information that might be relevant, because then we would lose the simplicity of SPADL and also its cross-compatibility with Wyscout, Opta, and StatsBomb.

For now, the best way to let people use extra information seems to be to add an "original_event_id" column to the SPADL data. This column will allow people to join SPADL dataframes with the original event dataframes of Wyscout, Opta, or StatsBomb. People will thus be able to access all extra event information of a vendor with only one simple join operation.
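
The join described above could look like the following sketch. The column names event_id and under_pressure on the provider side, and the toy values, are assumptions for illustration; only original_event_id comes from the proposal.

```python
import pandas as pd

# SPADL actions carrying the proposed original_event_id column (toy data).
actions = pd.DataFrame(
    {
        "action_id": [0, 1],
        "type_id": [0, 11],
        "original_event_id": ["a1", "b2"],
    }
)

# Provider events with an attribute that SPADL does not keep (toy data).
events = pd.DataFrame(
    {
        "event_id": ["a1", "b2", "c3"],
        "under_pressure": [True, False, True],
    }
)

# A left join keeps every SPADL action and attaches the extra attribute.
enriched = actions.merge(
    events, how="left", left_on="original_event_id", right_on="event_id"
)
```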

Add kloppy as reader

First of all: thanks for this great library!

I was wondering what is needed to add kloppy as a reader for input files. If kloppy could be used, users of socceraction could easily switch to other formats supported by kloppy (like Sportec).

Challenges:

  • What attributes are required for socceraction to work, and is kloppy (at this moment) able to provide all those?
  • The same question, but for future development

Curious what you think about this. If it seems doable I can start working on a PR for socceraction.

Adding R xThreat Implementation

This repo did wonders for me when I was trying to wrap my head around the mathematics behind Karun Singh's Expected Threat model. As someone who is much more comfortable with R instead of Python, I actually ended up converting xthreat.py to an RScript. That file can be found here.

If your group wanted to add that script (which is SPADL compatible) to your repository, I would have no reservations. I think it would allow the Expected Threat model to become more accessible.

Thanks for your work!

xG example: remove movement_a0 from features

In the xG model example you use the movement_a0 feature, which is highly correlated with the classification result and distorts the importances of the other features.

I've tested the model with and without it and compared the results with Understat data. With this feature you get good values for goals, but other shots from good positions get small probabilities.

Of course, the AUC score decreases to ~0.83 without it (tested with XGBoost / LightGBM models), but the final result for each shot seems more accurate.

Lack of compatibility with Wyscout's Soccer-logs open dataset

Hi!
Lately I have been working with Wyscout's soccer-logs open dataset of match events, and I tried to calculate VAEP scores for each pass made during the games, as I need this for my master's thesis. I ran into a problem, though.

It looks like the JSON files stored there have a different format/structure from the "normal" Wyscout ones; for example, the match info is not included in the events JSON files. As far as I can tell, it would be fairly easy to change the current jsons_to_h5 Wyscout function into one that loads data from the open dataset (I think I have it working now), but I can't test it because I don't know the values the original algorithm would produce. I will probably open a pull request so you can look at this issue.

Since SPADL/VAEP is said to be compatible with Wyscout data, and many people would probably use this dataset and want to reproduce the results of your work, support for this kind of data would be a good feature to add.

Best wishes!

Edit: I forgot the link to the dataset: https://figshare.com/collections/Soccer_match_event_dataset/4415000/2

manual running of tests/datasets/download.py required for tests

For the test suite to succeed, it seems we need to manually run tests/datasets/download.py multiple times, once with each of the "statsbomb", "wyscout", "convert-statsbomb" and "convert-wyscout" args.

As it stands, this should at least be mentioned in the contributing guide. Or can this be automated when running nox?

Furthermore, the main block errors out if no argument is provided, because sys.argv[1] is read even when it does not exist. What should the logic be here?

if __name__ == '__main__':
    if len(sys.argv) == 1 or sys.argv[1] == 'statsbomb':
        download_statsbomb_data()
    if sys.argv[1] == 'convert-statsbomb':
        convert_statsbomb_data()
    if len(sys.argv) == 1 or sys.argv[1] == 'wyscout':
        download_wyscout_data()
    if sys.argv[1] == 'convert-wyscout':
        convert_wyscout_data()
    if len(sys.argv) == 1 or sys.argv[1] == 'spadl':
        create_spadl(8657, 777)
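
One possible answer to the question about the logic is to resolve the requested steps once, before calling anything. This is only a sketch under assumptions: the name select_steps is hypothetical, the default set mirrors what the script above appears to intend when no argument is given, and accepting several arguments at once is an addition.

```python
from typing import List, Sequence

# Step names taken from the script above, in execution order.
ALL_STEPS = ("statsbomb", "convert-statsbomb", "wyscout", "convert-wyscout", "spadl")
# Steps the original script tries to run when no argument is given.
DEFAULT_STEPS = ("statsbomb", "wyscout", "spadl")


def select_steps(argv: Sequence[str]) -> List[str]:
    """Map command-line arguments to the steps to run, without indexing errors."""
    requested = list(argv[1:])
    if not requested:
        return list(DEFAULT_STEPS)
    unknown = [step for step in requested if step not in ALL_STEPS]
    if unknown:
        raise SystemExit(f"unknown step(s): {', '.join(unknown)}")
    # Keep the canonical order regardless of the order on the command line.
    return [step for step in ALL_STEPS if step in requested]


no_args = select_steps(["download.py"])
two_args = select_steps(["download.py", "convert-wyscout", "statsbomb"])
```

The main block would then loop over select_steps(sys.argv) and dispatch each name to its function.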

goalscore function also picking up goal kicks

The goalscore function in atomic/vaep/features.py matches goal kicks as well as goals. I discovered this while working through atomic tutorial 2.
Attached screenshot is from game ID 7537.
Screen Shot 2021-02-18 at 7 10 54 PM.
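
The report does not show the offending line, so the cause is an assumption here: a substring match on the action type name would produce exactly this symptom, because "goalkick" contains "goal". The toy series below illustrates the difference between a substring match and an exact match.

```python
import pandas as pd

# Toy action type names (illustrative subset).
type_name = pd.Series(["pass", "goalkick", "shot", "goal", "owngoal"])

# A substring match also flags goal kicks and own goals.
substring_match = type_name.str.contains("goal")

# An exact match only flags goals.
exact_match = type_name == "goal"
```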

ModuleNotFoundError: No module named 'socceraction.data'

I am getting this error when running 1-load-and-convert-statsbomb-data.ipynb using the latest version. Here is the code from the notebook:
from socceraction.data.statsbomb import StatsBombLoader
Which gives an error.

The following works, however:
from socceraction.socceraction.data.statsbomb import StatsBombLoader

This is an issue for all imports, e.g. StatsBombLoader requires:
from socceraction.data.base import EventDataLoader, ParseError

tests failing when running nox with "expected series to have type int64, got int32"

When running nox, multiple tests fail for me for the same reason:
pandera expected a series to have type int64, got int32.

When I try to convert the downloaded statsbomb / wyscout files by invoking tests/datasets/downloads.py with convert-"provider" I get the same error.

I am using python 3.9.9, socceraction 1.1.2 and pandera 0.8.0

I had a similar problem while using socceraction that I can't exactly recall, where I had to use a prior version of pandera (0.6.1) to make it work. convert-statsbomb and convert-wyscout work fine with this pandera version as well.
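
The report does not establish the cause. One possibility, stated here as an assumption, is that some integer columns are created with a 32-bit dtype (for example on platforms where NumPy's default integer is 32-bit), while the schema expects int64. Casting explicitly before validation would sidestep that; the helper name enforce_int64 and the column list are hypothetical.

```python
from typing import List

import numpy as np
import pandas as pd

# Hypothetical subset of integer columns that a schema might check.
INT64_COLUMNS = ["game_id", "period_id"]


def enforce_int64(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
    """Cast the given columns to int64, independent of the platform default."""
    return df.astype({c: "int64" for c in columns if c in df.columns})


# Simulate a frame whose integer columns ended up as int32.
df32 = pd.DataFrame(
    {
        "game_id": np.array([7537, 7538], dtype="int32"),
        "period_id": np.array([1, 2], dtype="int32"),
    }
)
df64 = enforce_int64(df32, INT64_COLUMNS)
```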

socceraction_nox_errors_output

Wyscout convertor discards own goals from touch events

Bug

Own goals resulting from bad touch events in the Wyscout event streams are missing in the SPADL representation.

Minimal example

As a minimal example, here is an own goal from the game between Leicester and Stoke on 24 Feb 2018. Stoke's goalkeeper Jack Butland allows a low cross to bounce off his gloves and into the net:

eventId subEventName tags playerId positions matchId eventName teamId matchPeriod eventSec subEventId id
466559 8 Cross "[{'id': 402}, {'id': 801}, {'id': 1802}]" 8013 "[{'y': 89, 'x': 97}, {'y': 0, 'x': 0}]" 2499994 Pass 1631 2H 1496 80 230320305
466560 7 Touch [{'id': 102}] 8094 "[{'y': 50, 'x': 1}, {'y': 100, 'x': 100}]" 2499994 Others on the ball 1639 2H 1497 72 230320132
466561 9 Reflexes "[{'id': 101}, {'id': 1802}]" 8094 "[{'y': 100, 'x': 100}, {'y': 50, 'x': 1}]" 2499994 Save attempt 1639 2H 1499 90 230320135

--> Download source

And the corresponding SPADL representation:

   game_id    period_id  time_seconds  team_id  player_id  start_x  start_y  end_x  end_y  bodypart_id  type_id  result_id
0  2499994.0  2.0        1496          1631.0   8013.0     101.85   7.48     0.0    68.0   0            1        0
1  2499994.0  2.0        1499          1639.0   8094.0     1.05     34.0     1.05   34.0   2            14       1

The result_id of the second action should be 3 (= own_goal).
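
The missing rule can be sketched as a tag check on the touch event. The helper names are hypothetical; tag id 102 is the own-goal tag carried by the touch event in the example, and result id 3 is the own_goal result mentioned above.

```python
from typing import Dict, List

OWN_GOAL_TAG = 102  # tag on the touch event in the example above
RESULT_OWN_GOAL = 3  # SPADL result_id for own_goal, as stated above


def has_tag(tags: List[Dict[str, int]], tag_id: int) -> bool:
    """True if the Wyscout tag list contains the given tag id."""
    return any(tag["id"] == tag_id for tag in tags)


def touch_result_id(tags: List[Dict[str, int]], default_result_id: int) -> int:
    """Use the own-goal result when the touch is tagged as an own goal."""
    return RESULT_OWN_GOAL if has_tag(tags, OWN_GOAL_TAG) else default_result_id


own_goal_touch = [{"id": 102}]
ordinary_touch = [{"id": 1801}]
```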

Encoding issues

Hi! First, thanks for this package. I can't wait to dive into the "cleaned" data.
Second, I keep running into this error, and I think it has something to do with the encoding of a few of the lineup files, because if I remove certain lineup files, it runs through. Sorry, I'm 99% an R user, so I'm not sure how to diagnose this! I'm following along in the open notebook you provided.

Thanks!

...Adding competitions to [redacted]\statsbomb.h5
...Adding matches to [redacted]\statsbomb.h5
...Adding players and teams to [redacted]\statsbomb.h5: 
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-0b9bd700b599> in <module>
----> 1 spadl.statsbombjson_to_statsbombh5("[redacted],statsbomb_h5)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\socceraction\spadl\statsbomb.py in jsonfiles_to_h5(datafolder, h5file)
     18     print(f"...Adding matches to {h5file}")
     19     add_matches(os.path.join(datafolder, "matches/"), h5file)
---> 20     add_players_and_teams(os.path.join(datafolder, "lineups/"), h5file)
     21     add_events(os.path.join(datafolder, "events/"), h5file)
     22 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\socceraction\spadl\statsbomb.py in add_players_and_teams(lineups_url, h5file)
     43     ):
     44         with open(lineup_file, "r") as fh:
---> 45             lineups += json.load(fh)
     46             for lineup in lineups:
     47                 for p in [flatten_id(p) for p in lineup["lineup"]]:

~\AppData\Local\Continuum\anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 562: character maps to <undefined>
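
The traceback shows the file being opened without an explicit encoding, so Python falls back to the platform default (cp1252 here). Assuming the lineup files are UTF-8 encoded, passing the encoding explicitly avoids the error. The helper name load_json_utf8 and the toy file are assumptions for illustration.

```python
import json
import os
import tempfile
from typing import Any


def load_json_utf8(path: str) -> Any:
    """Read a JSON file as UTF-8 regardless of the platform default encoding."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


# Demo: "Á" is stored as the bytes C3 81 in UTF-8, and 0x81 is undefined in cp1252.
with tempfile.TemporaryDirectory() as tmp:
    lineup_path = os.path.join(tmp, "lineup.json")
    payload = json.dumps([{"player_name": "Álvaro"}], ensure_ascii=False)
    with open(lineup_path, "wb") as fh:
        fh.write(payload.encode("utf-8"))
    lineups = load_json_utf8(lineup_path)
```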

Different output for xT values and visualizations in the 'EXTRA-run-xT.ipynb' notebook

When running the 'EXTRA-run-xT.ipynb' notebook under the 'public-notebooks' folder, I got a completely different output for xT values and thus different visualizations as follows.

image

image

As instructed, I ran the '1-load-and-convert-statsbomb-data.ipynb' notebook first. I noticed that the StatsBomb dataset has changed somewhat, but I don't think that explains the completely different output. I am using Python 3.7.2 and socceraction 1.2.2.

Minutes played by a player in a game is wrong using the PublicWyscoutLoader class

The _get_minutes_played method uses the timestamps of events in a single game to determine the length of that game and thus how long players played.

def _get_minutes_played(
    teamsData: List[Dict[str, Any]], events: List[Dict[str, Any]]
) -> pd.DataFrame:
    periods_ts = {i: [0] for i in range(6)}
    for e in events:
        period_id = wyscout_periods[e['matchPeriod']]
        periods_ts[period_id].append(e['eventSec'])
    duration = int(sum([max(periods_ts[i]) / 60 for i in range(5)]))

The players method in the PublicWyscoutLoader, however, passes all the events of the game's competition instead of only the events of the game itself.

competition_id, season_id = self._match_index.loc[game_id, ['competition_id', 'season_id']]
path_events = os.path.join(
    self.root, self._index.at[(competition_id, season_id), 'db_events']
)
mp = _get_minutes_played(lineups, cast(List[Dict[str, Any]], self.get(path_events)))

This causes all games in the same competition to have the same length and the minutes played for all players to be wrong. As a temporary solution, replacing line 290 with the following 2 lines fixes the issue (but I don't know if it's the most efficient way to do it):

match_events = filter(lambda event: event['matchId'] == game_id, self.get(path_events))
mp = _get_minutes_played(lineups, cast(List[Dict[str, Any]], match_events))
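
On the efficiency question: filtering the full competition file once per game scans every event each time. If minutes are needed for many games, grouping the events by matchId once is an alternative. This is a sketch, not the loader's code; the function name index_events_by_match and the toy events are assumptions.

```python
from collections import defaultdict
from typing import Any, Dict, List


def index_events_by_match(events: List[Dict[str, Any]]) -> Dict[int, List[Dict[str, Any]]]:
    """Group a competition's events by matchId in a single pass."""
    by_match: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for event in events:
        by_match[event["matchId"]].append(event)
    return dict(by_match)


# Toy competition file with events from two games.
competition_events = [
    {"matchId": 2499994, "matchPeriod": "1H", "eventSec": 12.5},
    {"matchId": 2499737, "matchPeriod": "1H", "eventSec": 3.0},
    {"matchId": 2499994, "matchPeriod": "2H", "eventSec": 1496.0},
]
events_by_match = index_events_by_match(competition_events)
match_events = events_by_match.get(2499994, [])
```

The per-game list can then be passed to _get_minutes_played in place of the filtered iterator.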

Incorrect result_id in Wyscout interception passes.

Bug

The Wyscout convertor converts passes that are also interceptions in the event data into two separate events, first an interception and then a pass. However, the interception gets the result_id of the original combined event, which can be problematic when the pass results in an own goal. If this happens, it seems like the player made two own goals instead of one.

Minimal example

The Wyscout event:

eventId subEventName tags playerId positions matchId eventName teamId matchPeriod eventSec subEventId id
30658 8 Head pass "[{'id': 102}, {'id': 1401}, {'id': 1801}]" 38093 "[{'y': 56, 'x': 5}, {'y': 100, 'x': 100}]" 2499737 Pass 1610 2H 2184 82 180427412

--> Download source

The corresponding SPADL representation:

   game_id    period_id  time_seconds  team_id  player_id  start_x  start_y  end_x  end_y  bodypart_id  type_id  result_id
0  2499737.0  2.0        2184          1610.0   38093.0    99.75    38.08    99.75  38.08  0            10       3
1  2499737.0  2.0        2184          1610.0   38093.0    99.75    38.08    0.0    68.0   1            0        3

Solution

The insert_interception_passes function has to be adapted. This can be fixed easily, but I need to know the exact definition of a successful interception event. In particular, should the result in the example above be "success" (because he intercepted the ball) or "fail" (because he lost it immediately)? More generally, is an interception successful if you touch the ball, or only if you keep possession in the next action?

Exception in socceraction.atomic.spadl.convert_to_atomic method

NameError Traceback (most recent call last)
in
12 player_games.append(statsbomb.extract_player_games(events))
13 actions = statsbomb.convert_to_actions(events,match.home_team_id)
---> 14 atomic_actions[match.match_id] = atomicspadl.convert_to_atomic(actions)
15
16 games = matches.rename(columns={"match_id":"game_id"})

C:\ProgramData\Anaconda3\envs\datasciencesoccer-RQVybP6P\lib\site-packages\socceraction\atomic\spadl.py in convert_to_atomic(actions)
36 def convert_to_atomic(actions):
37 actions = actions.copy()
---> 38 actions = extra_from_passes(actions)
39 actions = add_dribbles(actions) # for some reason this adds more dribbles
40 actions = extra_from_shots(actions)

C:\ProgramData\Anaconda3\envs\datasciencesoccer-RQVybP6P\lib\site-packages\socceraction\atomic\spadl.py in extra_from_passes(actions)
113 extra["result_id"] = -1
114
--> 115 offside = prev.result_id == results.index("offside")
116 out = ((nex.type_id == actiontypes.index("goalkick")) & (~same_team)) | (
117 nex.type_id == actiontypes.index("throw_in")

NameError: name 'results' is not defined

Error in inspect hdf file

After trying to read the HDF file in notebook 2, I get this error. Maybe it is a package version issue. Thanks in advance.

imagen

error !

Reading about it, it could be a numpy issue ... any solution?
This is my numpy version:

np.__version__
'1.16.4'
