ml-kuleuven / socceraction

Convert soccer event stream data to SPADL and value player actions using VAEP or xT

License: MIT License

Python 99.75% Makefile 0.25%

soccer-analytics soccer soccer-data sports-analytics

socceraction's Issues

tests failing when running nox with "expected series to have type int64, got int32"

When running nox, multiple tests fail for me for the same reason:
pandera expected a series to have type int64, got int32.

When I try to convert the downloaded statsbomb / wyscout files by invoking tests/datasets/downloads.py with convert-"provider" I get the same error.

I am using Python 3.9.9, socceraction 1.1.2, and pandera 0.8.0.

I had a similar problem earlier while using socceraction (I can't recall the exact details), where I had to fall back to an older version of pandera (0.6.1) to make it work. convert-statsbomb and convert-wyscout work fine with that pandera version as well.

Header shots in Wyscout

Bug

Wyscout does not distinguish between headers and other body parts on shots. The SPADL convertor simply labels all shots as performed by foot, which causes issues when training an expected goals model.

Solution

This can be fixed easily:

def determine_bodypart_id(event):
    """
    Determine the body part used for an event.

    Args:
        event (pd.Series): Wyscout event Series
    Returns:
        int: id of the body part used for the action
    """
    if event["subtype_id"] in [81, 36, 21, 90, 91]:
        body_part = "other"
    elif event["subtype_id"] == 82:  # or event['head_or_body']:
        body_part = "head"
+   elif event["type_id"] == 10 and event['head/body']:
+       body_part = "?"
    else:  # all other cases
        body_part = "foot"
    return bodyparts.index(body_part)

However, I do not know which body part I should use for these shots. Should it be "other", "head", or a new "head/other" category?

add original_event_id to SPADL data

Sometimes people want to use extra information from the event data that is not available in the SPADL format, e.g., the pressure attribute of StatsBomb data.

We can't extend SPADL to accommodate every extra piece of information that might be relevant, because then we lose the simplicity of SPADL and also its cross-compatibility with Wyscout, Opta, and StatsBomb.

For now, the best way to allow people to use extra information seems to be to add an "original_event_id" column to the SPADL data. This column will allow people to join SPADL dataframes with the original event dataframes of Wyscout, Opta, or StatsBomb. People will thus be able to access all extra event information of a vendor with only one simple join operation.
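The join described above could look roughly like the following sketch. The dataframes are toy data, and the vendor-side column names ("id", "under_pressure") are assumptions for illustration only; the real column names depend on the provider.

```python
import pandas as pd

# Toy SPADL actions; "original_event_id" is the column proposed in this issue.
actions = pd.DataFrame(
    {
        "action_id": [0, 1],
        "type_name": ["pass", "shot"],
        "original_event_id": ["a1", "b2"],
    }
)

# Toy vendor events; "id" and "under_pressure" are hypothetical column names.
events = pd.DataFrame(
    {
        "id": ["a1", "b2", "c3"],
        "under_pressure": [True, False, True],
    }
)

# A left join keeps every SPADL action and attaches the vendor-specific columns.
enriched = actions.merge(
    events, how="left", left_on="original_event_id", right_on="id"
)
```

A left join is used here so that actions without a matching vendor event (for example, synthetic dribbles) would be kept rather than dropped.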

Different output for xT values and visualizations in the 'EXTRA-run-xT.ipynb' notebook

When running the 'EXTRA-run-xT.ipynb' notebook in the 'public-notebooks' folder, I got completely different xT values, and thus different visualizations, from the ones shown in the notebook.

As instructed, I ran the '1-load-and-convert-statsbomb-data.ipynb' notebook first. I noticed that the StatsBomb dataset has changed somewhat, but I don't think that explains such a completely different output. I am using Python 3.7.2 and socceraction 1.2.2.

Create a consistent definition for keeper events in SPADL

SPADL defines 4 different event types for describing save/ball recovery actions by keepers:

Action type	Description	Success
Keeper save	Keeper saves a shot on goal	Always success
Keeper claim	Keeper catches a cross	Does not drop the ball
Keeper punch	Keeper punches the ball clear	Always success
Keeper pick-up	Keeper picks up the ball	Always success

First, it is somewhat unclear what the differences between these four events are. For example,

When a keeper saves a shot but cannot claim the ball, is it a "keeper save" or a "keeper punch" action? Or are "keeper punch" actions only for crosses?
When a keeper rushes out either to cut out an attacking pass (in a race with the opposition player) or to close down an opposition player, is that a pick-up or a claim?
Is there a difference between a keeper pick-up and an interception action (apart from the body part)? If there is no difference, it might be better to simply drop the keeper pick-up action.

Second, there are inconsistencies between the different converters.

The keeper pick-up action is missing in the StatsBomb and Wyscout converters. I believe that "Goalkeeper" events with type "collected (25)" and outcome "success (15)" should be converted to this type, while events with outcome "claim (47)" should be keeper claim events.
Both the keeper pick-up and keeper claim actions are missing in the Wyscout converter.
The definition of SPADL states that keeper saves will always have 'success' as a result, but the Statsbomb action-attribute pair 'Shot Saved (In Play Danger)' would lead to a conversion to a failed keeper save action. The same goes for the Statsbomb action-attribute pair 'Punch (In Play Danger)', although I'm not sure whether that combination can actually occur in the data.

A related point was raised in #45: How should keeper throws be addressed? Is a keeper throw also considered a goal kick in terms of SPADL actions? I think it makes sense to either have separate SPADL actions for these two, or one (renamed?) SPADL action for them both. I do not think that a keeper throw should be considered a regular pass, since there is no pressure on the keeper to execute this action quickly (in contrast to a regular pass). Therefore it may need a different treatment when processing the data.

Although keeper actions are not very important in the action valuing frameworks, this might be useful in other applications of SPADL. Therefore, I believe it would be good to agree upon a definition for these events and fix the inconsistencies in the converters.

Handling of label.scores and label.concedes of final actions

file: socceraction/classification/features.py

for i in range(1, nr_actions):
    for c in ["team_id", "goal", "owngoal"]:
        shifted = y[c].shift(-i)
        shifted[-i:] = y[c][len(y) - i]
        y["%s+%d" % (c, i)] = shifted

This code does not correctly propagate goals for the last nr_actions actions in a match.
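To make the effect of the padding rule concrete, here is a minimal sketch on toy data. It reproduces only the shift-and-fill step of the snippet for a single column and a single lookahead; the dataframe is made up, and this may or may not be the exact failure the issue refers to.

```python
import pandas as pd

# Toy label frame for one match: the only goal is the third action (index 2).
y = pd.DataFrame({"goal": [0, 0, 1, 0]})

i = 2  # look two actions ahead
shifted = y["goal"].shift(-i)  # [1.0, 0.0, NaN, NaN]

# Same padding rule as the snippet, written with explicit positional indexing:
# the last i rows are filled with the value of the action at position len(y) - i.
shifted.iloc[-i:] = y["goal"].iloc[len(y) - i]
y["goal+%d" % i] = shifted
```

In this toy case the final action ends up with goal+2 equal to 1, although the goal happened before it rather than within the next two actions.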

ModuleNotFoundError: No module named 'socceraction.data'

I am getting this error when running 1-load-and-convert-statsbomb-data.ipynb using the latest version. Here is the code from the notebook:
from socceraction.data.statsbomb import StatsBombLoader
Which gives an error.

The following works, however:
from socceraction.socceraction.data.statsbomb import StatsBombLoader

This is an issue for all imports, e.g. StatsBombLoader requires:
from socceraction.data.base import EventDataLoader, ParseError

Determine more accurate timestamps for extra actions in Atomic-SPADL

Socceraction sets the timestamps of the extra dribble and reception actions (atomic) to the midpoint between the first and the subsequent action, i.e. t1 + (t2 - t1) / 2.

The downside is that a short pass followed by a long dribble still gets a timestamp in the middle (during the dribble).
The reverse can happen too: a long ball that travels for a few seconds and is immediately passed on gets a timestamp in the middle of the pass.

Would it make sense to assume a fixed passing speed and derive the timestamp of the ball reception from the distance, and thus the time the ball would take from origin to destination?

I am not sure how relevant this is. It might be useful, for example, for looking at how long players hold the ball in individual ball possessions.
Are Opta timestamps always just in seconds? If so, their limited precision might make this approach unnecessary.
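As a rough sketch of the fixed-speed idea, the reception time could be estimated from the pass distance and capped at the timestamp of the next action. The function name and the ball speed of 15 m/s are assumptions chosen for illustration, not values taken from socceraction.

```python
import math

# Hypothetical constant ball speed in metres per second (an assumption).
ASSUMED_BALL_SPEED = 15.0


def estimate_reception_time(t_pass, t_next, start, end, ball_speed=ASSUMED_BALL_SPEED):
    """Estimate when a pass is received, given its start and end coordinates.

    t_pass and t_next are the timestamps (in seconds) of the pass and of the
    following action; start and end are (x, y) positions in metres.
    """
    distance = math.dist(start, end)
    travel_time = distance / ball_speed
    # The reception cannot happen after the next recorded action.
    return min(t_pass + travel_time, t_next)


# A 30 m pass at 15 m/s takes 2 s, so the reception lands at t = 12 s.
reception = estimate_reception_time(10.0, 20.0, (0.0, 0.0), (30.0, 0.0))
```

Capping at the next action's timestamp keeps the ordering of actions intact when the assumed speed is too low for a particular pass.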

encoding for statsbomb add_players_and_teams and add_events

Thanks for the great work on making statsbomb data more accessible.

For the statsbomb.py functions, UTF-8 encoding is declared for matches and competitions but not for teams and players or events.

with open(competition_file, "rt", encoding="utf-8") as fh:
    matches += json.load(fh)

vs.

with open(lineup_file, "r") as fh:
    lineups += json.load(fh)

This causes my code to break because the default decoding on my system is ASCII. Could the functions be aligned to use UTF-8? Alternatively, could try/except logic be added?
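A minimal sketch of the requested alignment: always pass the encoding explicitly when opening the JSON files, so the result no longer depends on the platform default. The helper name load_json_utf8 and the sample lineup content are made up for this example.

```python
import json
import os
import tempfile


def load_json_utf8(path):
    """Read a JSON file, decoding it explicitly as UTF-8."""
    with open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


# Round trip with a non-ASCII player name to show the explicit encoding at work.
with tempfile.TemporaryDirectory() as tmp:
    lineup_file = os.path.join(tmp, "lineup.json")
    with open(lineup_file, "wt", encoding="utf-8") as fh:
        json.dump([{"player_name": "Thomas Müller"}], fh, ensure_ascii=False)
    lineups = load_json_utf8(lineup_file)
```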

Thanks

Chris

Inconsistencies in action type of own goals

In the SPADL representation of Opta and StatsBomb, all own goals are labeled as shots, while the Wyscout convertor labels them as passes, interceptions or clearances. First, I think it would be better to be consistent and use the same action types for each provider. Second, shot is not a good action type for own goals in my opinion. I prefer bad_touch. Another reasonable option would be to use the type of the intended action (i.e., clearance, interception, pass, keeper_save, ...), as is done now in the Wyscout convertor, but I do not know whether this is easy to do for StatsBomb and Opta.

SPADL Definition

For my thesis I'm defining a sequence of ball possession of a team as a specific sequence of SPADL actions that occurs within a larger sequence of SPADL actions (the precise definition is not important for this issue). For this, I rely on Table 3.1 in Tom Decroos' PhD thesis (https://tomdecroos.github.io/reports/thesis_tomdecroos.pdf). This table defines all SPADL actions and which attribute values each action can have. However, this definition does not seem to be up to date with the actual implementation in socceraction. I encountered the following differences:

The goalkick action is not described in the definition of SPADL in the thesis.
The definition of SPADL in the thesis describes a keeper pick-up action. However, judging from the code, the StatsBomb to SPADL converter will never convert an action to this type.
The definition of SPADL in the thesis states that fouls will always have 'success' as a result, while the converter will always give 'fail' as the result of this action.
The definition of SPADL in the thesis states that interceptions will always have 'success' as a result, while the converter will attribute 'success' to some interceptions (in case they succeed) and 'fail' to other interceptions (in case the Statsbomb data say that the ball was intercepted but knocked to opposition or the ball was intercepted but went out of bounds by doing so)
The definition of SPADL in the thesis states that keeper saves will always have 'success' as a result, but the Statsbomb action-attribute pair 'Shot Saved (In Play Danger)' would lead to a conversion to a failed keeper save action. The same goes for the Statsbomb action-attribute pair 'Punch (In Play Danger)', although I'm not sure whether that combination can actually occur in the data.

A precise definition of the SPADL data format is necessary to correctly define a sequence of ball possession in terms of SPADL actions. It's important to state the trickier cases explicitly, e.g. that a failed interception by the other team does not interrupt a team's ball possession sequence. The thesis, for example, rules out failed interceptions altogether.

I therefore propose maintaining an up-to-date definition somewhere that precisely defines SPADL and which action-attribute pairs are valid. This would make it possible to build definitions in terms of SPADL actions or SPADL action sequences.

When building this definition, I also think that the following points in the original definition in the thesis deserve some attention:

According to the original definition, all corner actions can have offside as a special result. This cannot occur in practice.
According to the original definition, tackles can have a yellow or red card as a special result. However, I think that in this case the action should be classified as a foul. Maybe the converter should be built in such a way that it converts a failed tackle with a card as result (from the original third-party data) to 2 actions in SPADL, in which the first is a failed tackle and the second is a foul with the card as a result.
According to the original definition, penalty shots and free kick shots can have an owngoal as a special result. Unless some truly mafioso things are going on, this can never be the case in practice. Then again, in football you never know of course... :)

goalscore function also picking up goal kicks

The goalscore function in atomic/vaep/features.py matches goalkicks as well as goals. I discovered this when working through atomic tutorial -2.
The attached screenshot is from game ID 7537.

Action positions are wrong when converting from Wyscout events

Two mistakes are made when converting Wyscout events to SPADL events:

make_new_positions clips all x and y coordinates to (0, 105) and (0, 68) respectively, where 105 and 68 are the field length and field width used in the whole package.

socceraction/socceraction/spadl/wyscout.py

Lines 161 to 182 in 3e267fb

def make_new_positions(events: pd.DataFrame) -> pd.DataFrame:
    """Extract the start and end coordinates for each action.

    Parameters
    ----------
    events : pd.DataFrame
        Wyscout event dataframe

    Returns
    -------
    pd.DataFrame
        Wyscout event dataframe with start and end coordinates for each action.
    """
    new_positions = events[['event_id', 'positions']].apply(
        lambda x: _make_position_vars(x[0], x[1]), axis=1
    )
    new_positions.columns = ['event_id', 'start_x', 'start_y', 'end_x', 'end_y']
    events = pd.merge(events, new_positions, left_on='event_id', right_on='event_id')
    events[['start_x', 'end_x']] = events[['start_x', 'end_x']].clip(0, 105)
    events[['start_y', 'end_y']] = events[['start_y', 'end_y']].clip(0, 68)
    events = events.drop('positions', axis=1)
    return events

The problem is that when this method is called, we are still working with Wyscout positions, and Wyscout defines positions as follows:

[Figure: Wyscout pitch coordinate diagram]

I got this from https://figshare.com/articles/dataset/Events/7770599?backTo=/collections/Soccer_match_event_dataset/4415000. This means that either the positions have to be clipped to (0, 100) for both x and y, or they have to be clipped at a later stage. I don't know which is more desirable.
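The first option (clip while still in Wyscout units, then scale) could be sketched as follows. The function name is hypothetical, and the y-axis flip is an assumption inferred from the Wyscout/SPADL example pairs quoted elsewhere on this page (for instance x=97, y=89 becoming 101.85, 7.48), not from the converter's source.

```python
FIELD_LENGTH = 105.0  # metres, as used in the package
FIELD_WIDTH = 68.0    # metres, as used in the package


def wyscout_to_spadl(x_pct, y_pct):
    """Convert Wyscout percentage coordinates to SPADL metres (sketch)."""
    # Clip in Wyscout units first, where both axes run from 0 to 100.
    x_pct = min(max(x_pct, 0.0), 100.0)
    y_pct = min(max(y_pct, 0.0), 100.0)
    x = x_pct / 100.0 * FIELD_LENGTH
    # Assumption: the y axis is flipped between the two coordinate systems.
    y = (100.0 - y_pct) / 100.0 * FIELD_WIDTH
    return x, y


x, y = wyscout_to_spadl(97, 89)
```

Clipping before scaling means no valid Wyscout value between 68 and 100 on the y axis is silently truncated.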

The second mistake is that line 47 in convert_to_actions fixes the direction of play (so that the player who performs the action always plays from left to right), but as can be seen in the picture above, Wyscout already defines its events like this, so this line just reverts the process.

socceraction/socceraction/spadl/wyscout.py

Lines 25 to 52 in 3e267fb

def convert_to_actions(events: pd.DataFrame, home_team_id: int) -> DataFrame[SPADLSchema]:
    """
    Convert Wyscout events to SPADL actions.

    Parameters
    ----------
    events : pd.DataFrame
        DataFrame containing Wyscout events from a single game.
    home_team_id : int
        ID of the home team in the corresponding game.

    Returns
    -------
    actions : pd.DataFrame
        DataFrame with corresponding SPADL actions.
    """
    events = pd.concat([events, get_tagsdf(events)], axis=1)
    events = make_new_positions(events)
    events = fix_wyscout_events(events)
    actions = create_df_actions(events)
    actions = fix_actions(actions)
    actions = _fix_direction_of_play(actions, home_team_id)
    actions = _fix_clearances(actions)
    actions['action_id'] = range(len(actions))
    actions = _add_dribbles(actions)
    return actions.pipe(DataFrame[SPADLSchema])

StatsPerform JSON parsers: load data from memory

The current loaders are designed to read files and process them into dataframes.

This is not suitable if you want to use the SDDP feed instead of SDAPI. The difference is that SDDP adds events during the match, whereas SDAPI data is available only after the match.

I want to calculate some metrics during the match, so I created an alternative in-memory loader based on the MA3 loader:
https://gist.github.com/denisov-vlad/28d4668c4861b7c551a6caba3c341ba2

As you can see, there is a lot of duplicated code. It would be great to split the extract functions into loading from disk and processing data.
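One possible shape for such a split is sketched below: a single processing function that works on already-decoded data, plus thin wrappers for disk and memory. All function names and the "events" key are hypothetical and do not reflect the real feed layout or the socceraction API.

```python
import json
from typing import Any, Dict, List


def parse_events(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Processing step: works on decoded data, wherever it came from."""
    # "events", "id" and "minute" are hypothetical keys used for illustration.
    return [
        {"event_id": e["id"], "minute": e["minute"]}
        for e in payload.get("events", [])
    ]


def load_events_from_file(path: str) -> List[Dict[str, Any]]:
    """Loading step for files on disk."""
    with open(path, "rt", encoding="utf-8") as fh:
        return parse_events(json.load(fh))


def load_events_from_memory(raw: str) -> List[Dict[str, Any]]:
    """Loading step for a payload received during the match."""
    return parse_events(json.loads(raw))


live = load_events_from_memory('{"events": [{"id": 1, "minute": 3}]}')
```

With this layout the parsing logic exists only once, so a live feed and a post-match file would go through the same code path.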

Handle lagging SPADL features for first actions in games/periods

This is a question rather than an issue per se.

When lagging game states to compute features on SPADL actions,

socceraction/socceraction/vaep/features.py

Line 36 in 772fa76

def gamestates(actions: pd.DataFrame, nb_prev_actions: int = 3) -> List[pd.DataFrame]:

the default fill value is 0. Given that 0 is a valid type_id (at least for StatsBomb, where it is a pass), is this (ever so slightly) affecting results by implying that, for example, when a team kicks off, the last 3 actions were passes?

I imagine this is of little to no consequence in practice, since so few actions happen from kick-off, but it might be worth assigning either 999 or NA (etc.) to lagged actions that do not have a preceding action.
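The difference between the two fill strategies can be seen on a toy series. This is only a sketch of the general pandas behaviour, not of the actual gamestates implementation; the type_id values are made up.

```python
import pandas as pd

# Toy type_id column for the first three actions of a period.
type_id = pd.Series([5, 0, 3], name="type_id")

# Filling with 0 makes the missing previous action look like type_id 0.
lag_zero = type_id.shift(1, fill_value=0)

# Leaving the gap as NaN keeps "no previous action" distinguishable.
lag_na = type_id.shift(1)
```

With the NaN variant, a downstream model or feature transformer has to handle missing values explicitly, which is the trade-off of this option.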

Nitpick: predict in xthreat.py returns type np.ndarray, not pd.Series

Nitpick: the return value here is a numpy.ndarray and not a pandas.Series

socceraction/socceraction/xthreat.py

Line 229 in 185ba8f

) -> pd.Series:

Add kloppy as reader

First of all: thanks for this great library!

I was wondering what is needed to add kloppy as a reader for input files. If kloppy can be used, users of socceraction can easily switch to other formats supported by kloppy (like Sportec).

Challenges:

What attributes are required for socceraction to work, and is kloppy (at this moment) able to provide all those?
The same question, but then for future development.

Curious what you think about this. If it seems doable I can start working on a PR for socceraction.

Wyscout convertor discards own goals from touch events

Bug

Own goals resulting from bad touch events in the Wyscout event streams are missing in the SPADL representation.

Minimal example

As a minimal example, here is an own goal from the game between Leicester and Stoke on 24 Feb 2018. Stoke's goalkeeper Jack Butland allows a low cross to bounce off his gloves and into the net:

	eventId	subEventName	tags	playerId	positions	matchId	eventName	teamId	matchPeriod	eventSec	subEventId	id
466559	8	Cross	"[{'id': 402}, {'id': 801}, {'id': 1802}]"	8013	"[{'y': 89, 'x': 97}, {'y': 0, 'x': 0}]"	2499994	Pass	1631	2H	1496	80	230320305
466560	7	Touch	[{'id': 102}]	8094	"[{'y': 50, 'x': 1}, {'y': 100, 'x': 100}]"	2499994	Others on the ball	1639	2H	1497	72	230320132
466561	9	Reflexes	"[{'id': 101}, {'id': 1802}]"	8094	"[{'y': 100, 'x': 100}, {'y': 50, 'x': 1}]"	2499994	Save attempt	1639	2H	1499	90	230320135


And the corresponding SPADL representation:

	game_id	period_id	time_seconds	team_id	player_id	start_x	start_y	end_x	end_y	bodypart_id	type_id	result_id
0	2499994.0	2.0	1496	1631.0	8013.0	101.85	7.48	0.0	68.0	0	1	0
1	2499994.0	2.0	1499	1639.0	8094.0	1.05	34.0	1.05	34.0	2	14	1

The result_id of the second action should be 3 (= own_goal).

Bug in _get_minutes_played

The _get_minutes_played method in wyscout loader contains 2 bugs:

The duration of a game is defined as follows:

socceraction/socceraction/data/wyscout/loader.py

Lines 731 to 735 in 5974857

periods_ts = {i: [0] for i in range(6)}
for e in events:
    period_id = wyscout_periods[e['matchPeriod']]
    periods_ts[period_id].append(e['eventSec'])
duration = int(sum([max(periods_ts[i]) / 60 for i in range(5)]))

In words: take the time in seconds of the last event in every period, convert to minutes and sum up the values. This means that injury time is taken into account for every period. Wyscout, however, only takes into account injury time of the last period of the game when defining the minute of e.g. a substitution, a red card. So when you now define the amount of minutes played of a substitute as follows:

socceraction/socceraction/data/wyscout/loader.py

Line 764 in 5974857

'minutes_played': duration - substitution['minute'],

and the time played by a player who gets substituted as:

socceraction/socceraction/data/wyscout/loader.py

Line 768 in 5974857

pg[substitution['playerOut']]['minutes_played'] = substitution['minute']

the amount of minutes played will be too high and too low respectively. For players who play a full game there is obviously no problem.

Red cards are not taken into account, so a player who gets a red card in e.g. minute 5 will get his minutes played set to the duration of the game.

Discrepancy between successful passes in the SPADL and atomic-SPADL representations

While converting Wyscout events to SPADL actions, most duels are removed because they are not considered on-the-ball actions. In doing so, however, some information is lost. Wyscout considers a pass that is followed by a duel as accurate (translated to SPADL as a successful action), even if the duel is lost by the teammate of the player who made the pass. This causes successful passes to be followed by an action of the opposing team. It would make more sense (in my opinion) to mark the pass as failed and follow it up with an interception by the opposing player.

add expected threat

Add expected threat from Karun Singh (https://karun.in/blog/expected-threat.html) as an alternative to VAEP.

Possible filename: socceraction/xthreat.py

Refactor opta.py and wyscout.py to follow the same pattern of statsbomb.py

statsbomb.py now offers some great atomic functions to download and convert statsbomb data. The notebook public-notebooks/1-load-and-convert-statsbomb-data demonstrates the full pipeline and makes it possible for users to inspect intermediate results, view only a small part of the data, and debug in case things go wrong.

In contrast, opta.py and wyscout.py are still very much black boxes that only offer a single public function for converting an entire folder of data. These modules have none of the benefits statsbomb.py has now.

https://raw.githubusercontent.com/statsbomb/open-data/master/data/ is invalid

Add left/right foot to bodyparts

In Perform's data we can split actions by left/right foot.

This is helpful when we know a player's preferred foot.

I've added it to the bodyparts variable

bodyparts: List[str] = ['foot', 'right foot', 'left foot', 'head', 'other', 'head/other']

and to the _get_bodypart_id() function to improve the xG classifier metrics.

Is this OK? Will it break other functionality? I'm ready to make a pull request.
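One thing worth checking before the pull request: since ids come from the position in the list, inserting entries in the middle shifts the ids of everything after them. The sketch below assumes the existing order is ['foot', 'head', 'other', 'head/other']; that order is an assumption, and the appended variant is just one hypothetical alternative.

```python
# Assumed existing order (an assumption, not taken from the package source).
current = ["foot", "head", "other", "head/other"]

# Order proposed in this issue.
proposed = ["foot", "right foot", "left foot", "head", "other", "head/other"]

# Hypothetical alternative: append the new body parts at the end.
appended = current + ["right foot", "left foot"]

# Body parts whose id would change under the proposed order: name -> (old, new).
shifted = {
    bp: (current.index(bp), proposed.index(bp))
    for bp in current
    if current.index(bp) != proposed.index(bp)
}

# Appending keeps every existing id stable.
stable = all(appended.index(bp) == current.index(bp) for bp in current)
```

Under these assumptions, previously stored SPADL data and trained models would interpret the old ids differently with the proposed order, while appending avoids that.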

Bug in conversion to atomic SPADL due to gaps in Wyscout data

Wyscout collects its data by video analysis. This means that when replays of certain events are shown, the events that occur during that replay are not captured. The most commonly replayed events are goals, and this occasionally causes the kick-off and subsequent actions to be absent from the game data. Most of the time this causes no issues (except for the fact that there is no data for small parts of the game), but when converting SPADL actions to atomic SPADL actions we run into problems. When converting from default SPADL to atomic SPADL, the result (result_id and result_name) is replaced by an extra action. For shot-like actions this is done as follows:

socceraction/socceraction/atomic/spadl/base.py

Lines 116 to 166 in 8e29c57

def _extra_from_shots(actions: pd.DataFrame) -> pd.DataFrame:
    next_actions = actions.shift(-1)

    shotlike = ['shot', 'shot_freekick', 'shot_penalty']
    shot_ids = list(_spadl.actiontypes.index(ty) for ty in shotlike)

    samegame = actions.game_id == next_actions.game_id
    sameperiod = actions.period_id == next_actions.period_id

    shot = actions.type_id.isin(shot_ids)
    goal = shot & (actions.result_id == _spadl.results.index('success'))
    owngoal = shot & (actions.result_id == _spadl.results.index('owngoal'))
    next_corner_goalkick = next_actions.type_id.isin(
        [
            _atomicspadl.actiontypes.index('corner_crossed'),
            _atomicspadl.actiontypes.index('corner_short'),
            _atomicspadl.actiontypes.index('goalkick'),
        ]
    )
    out = shot & next_corner_goalkick & samegame & sameperiod

    extra_idx = goal | owngoal | out
    prev = actions[extra_idx]
    # nex = next_actions[extra_idx]

    extra = pd.DataFrame()
    extra['game_id'] = prev.game_id
    extra['original_event_id'] = prev.original_event_id
    extra['period_id'] = prev.period_id
    extra['action_id'] = prev.action_id + 0.1
    extra['time_seconds'] = prev.time_seconds  # + nex.time_seconds) / 2
    extra['start_x'] = prev.end_x
    extra['start_y'] = prev.end_y
    extra['end_x'] = prev.end_x
    extra['end_y'] = prev.end_y
    extra['bodypart_id'] = prev.bodypart_id
    extra['result_id'] = -1
    extra['team_id'] = prev.team_id
    extra['player_id'] = prev.player_id

    ar = _atomicspadl.actiontypes
    extra['type_id'] = -1
    extra['type_id'] = (
        extra.type_id.mask(goal, ar.index('goal'))
        .mask(owngoal, ar.index('owngoal'))
        .mask(out, ar.index('out'))
    )
    actions = pd.concat([actions, extra], ignore_index=True, sort=False)
    actions = actions.sort_values(['game_id', 'period_id', 'action_id']).reset_index(drop=True)
    actions['action_id'] = range(len(actions))
    return actions

In short: when the result of a shot action is success, the extra action will be a goal; when the action following the shot is a corner or a goalkick, the extra action will be out. The problem now lies in:

socceraction/socceraction/atomic/spadl/base.py

Lines 156 to 161 in 8e29c57

    ar = _atomicspadl.actiontypes
    extra['type_id'] = -1
    extra['type_id'] = (
        extra.type_id.mask(goal, ar.index('goal'))
        .mask(owngoal, ar.index('owngoal'))
        .mask(out, ar.index('out'))

Because Wyscout does not register some events after a goal, the first event registered after a goal can be a goalkick or a corner instead of the expected pass (the kick-off). In that case line 161 overrides what is done on line 159, causing the goal to be incorrectly converted to an out action.

There are two possible ways to fix this (that I've come up with at least):

The first is to simply swap the masks on lines 159 and 161, changing the order of the masks so that the goal mask is applied last.
The second is to allow a maximum time difference between a shot-like action and a goalkick or corner for the goalkick or corner to be considered the action following the shot. However, considering e.g. VAR interventions, which might take some time to complete, this might be imprecise.
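The effect of the mask order can be reproduced on a toy example. The integer ids below are placeholders, not the real atomic-SPADL action type ids, and the boolean series are made up: the first shot is a goal whose next registered event happens to be a goalkick.

```python
import pandas as pd

# Placeholder ids for illustration; the real ids come from the actiontypes list.
GOAL, OWNGOAL, OUT = 1, 2, 3

# Row 0: a goal followed (due to missing events) by a goalkick.
# Row 1: a missed shot followed by a goalkick.
goal = pd.Series([True, False])
owngoal = pd.Series([False, False])
out = pd.Series([True, True])
type_id = pd.Series([-1, -1])

# Current order: the "out" mask is applied last and overrides the goal.
current = type_id.mask(goal, GOAL).mask(owngoal, OWNGOAL).mask(out, OUT)

# Reordered: the "goal" mask is applied last, so the goal survives.
reordered = type_id.mask(out, OUT).mask(owngoal, OWNGOAL).mask(goal, GOAL)
```

In this sketch the reordered chain keeps the goal in row 0 while still marking row 1 as out, which corresponds to the first proposed fix.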

xG example: remove movement_a0 from features

In the xG model example you use the movement_a0 feature, which is highly correlated with the classification result and distorts the importances of the other features.

I've tested the model with and without it and compared the results with Understat data. With this feature you get good values for goals, but other shots from good positions get small probabilities.

Of course, the AUC score will decrease to about 0.83 (tested with XGBoost / LightGBM models), but the final result for each shot seems more accurate.

Incorrect result_id in Wyscout interception passes.

Bug

The Wyscout convertor converts passes that are also interceptions in the event data into two separate events, first an interception and then a pass. However, the interception gets the result_id of the original combined event, which can be problematic when the pass results in an own goal. If this happens, it seems like the player made two own goals instead of one.

Minimal example

The Wyscout event:

	eventId	subEventName	tags	playerId	positions	matchId	eventName	teamId	matchPeriod	eventSec	subEventId	id
30658	8	Head pass	"[{'id': 102}, {'id': 1401}, {'id': 1801}]"	38093	"[{'y': 56, 'x': 5}, {'y': 100, 'x': 100}]"	2499737	Pass	1610	2H	2184	82	180427412


The corresponding SPADL representation:

	game_id	period_id	time_seconds	team_id	player_id	start_x	start_y	end_x	end_y	bodypart_id	type_id	result_id
0	2499737.0	2.0	2184	1610.0	38093.0	99.75	38.08	99.75	38.08	0	10	3
1	2499737.0	2.0	2184	1610.0	38093.0	99.75	38.08	0.0	68.0	1	0	3

Solution

The insert_interception_passes function has to be adapted. This can be fixed easily, but I need to know the exact definition of a successful interception event. In particular, should the result in the example above be "success" (because he intercepted the ball) or "fail" (because he lost it immediately)? More generally, is an interception successful if you touch the ball, or only if you keep possession in the subsequent action?

Bugs and Error in OptaLoader's extract_lineups() function affecting "is_starter" & "minutes_played" columns (F7_XML)

To make the bugs reproducible, I used the OptaLoader on the XML feeds from the test folder and have attached a screenshot here.

I noticed 2 bugs that affect the columns is_starter and minutes_played. Both bugs are located in the extract_lineups() function of the _F7XMLParser class. There is also a possible error in the logic used to calculate minutes_played (see Issue 3 below).

Issue 1: a possible bug
Location : Line 841 in spadl/opta.py

In is_starter=player_elm.attrib['Formation_Place'] != 0, the attribute Formation_Place turns out to be a string containing a number from 0 to 11. So is_starter becomes True irrespective of the value of Formation_Place, because a string is never equal to the integer 0.

Issue 2: a possible bug
Location : Line 827 in spadl/opta.py

The following piece of code
sub_on = int(next((item['Time'] for item in subst if item['SubOn'] == f'p{player_id}'), 0))

assigns the value 0 to the variable sub_on for substitutes who don't get subbed on. As a result, players who stay on the bench throughout the game get minutes_played equal to stats['match_time'], because minutes_played = sub_off - sub_on and the sub_off value for all players who don't get subbed off is set to stats['match_time']. So, as seen in the picture above, Iturraspe, a player who doesn't play a single minute in the game, has minutes_played = 96 - 0 = 96.

Issue 3: a possible error
Location : Line 827 in spadl/opta.py
Substitution events in Opta don't necessarily involve 2 players. A player retirement is also recorded as a sub event. So when a player retires (i.e. the team has exhausted its available substitutions), the Substitution element will not have the 'SubOn' attribute but only 'SubOff'. Hence one would get a KeyError in these circumstances, because the generator expression looks up the SubOn key in each Substitution element.
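The three issues above can be illustrated together on a small example. This is a sketch, not the fix from the pull request: the XML is a simplified, hypothetical stand-in for an F7 team block, and minutes_by_player is a made-up helper that only mirrors the attribute names mentioned in this issue.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for an F7 team block (not the full schema).
F7_SNIPPET = """
<TeamData>
  <MatchPlayer PlayerRef="p1" Formation_Place="1"/>
  <MatchPlayer PlayerRef="p2" Formation_Place="0"/>
  <MatchPlayer PlayerRef="p3" Formation_Place="0"/>
  <Substitution SubOff="p1" SubOn="p2" Time="60"/>
  <Substitution SubOff="p4" Time="80"/>
</TeamData>
"""


def minutes_by_player(team_elm, match_time):
    """Return {player_ref: (is_starter, minutes_played)} for one team."""
    subs = [s.attrib for s in team_elm.iter("Substitution")]
    result = {}
    for player in team_elm.iter("MatchPlayer"):
        ref = player.attrib["PlayerRef"]
        # Issue 1: compare as an integer, since XML attributes are strings.
        is_starter = int(player.attrib["Formation_Place"]) != 0
        # Issue 3: .get() tolerates Substitution elements without SubOn.
        sub_on = next((int(s["Time"]) for s in subs if s.get("SubOn") == ref), None)
        sub_off = next((int(s["Time"]) for s in subs if s.get("SubOff") == ref), match_time)
        if is_starter:
            minutes = sub_off
        elif sub_on is not None:
            minutes = sub_off - sub_on
        else:
            # Issue 2: unused substitutes played zero minutes.
            minutes = 0
        result[ref] = (is_starter, minutes)
    return result


lineup = minutes_by_player(ET.fromstring(F7_SNIPPET), match_time=96)
```

Here the second Substitution element has no SubOn attribute, mimicking a retirement, and the lookup no longer raises a KeyError.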

P.S.: I have fixed all 3 issues locally and will open a pull request shortly. I thought that posting this as an issue would leave a record in the issues section of this repo and might help people in the future.

Add Support for Wyscout API v3

On July 20th 2021, Wyscout switched their API to v3. Currently, socceraction only supports v2 of the Wyscout API, which is now a legacy version. Adding support for the new API format will require substantial changes to the socceraction.data.wyscout and socceraction.spadl.wyscout modules. As v2 of the API will remain available until the release of v4 (no release date yet), making these changes is currently not a priority. However, pull requests would be welcome.

See https://apidocs.wyscout.com/ for details.

Encoding issues

Hi! First, thanks for this package. I can't wait to dive into the "cleaned" data.
Second, I keep running into this error, and I think it has something to do with encodings in a few of the lineups, because if I remove certain lineup files, it runs through. Sorry, I'm 99% an R user, so I'm not sure how to diagnose this! I'm following along in the open notebook you provided.

Thanks!

...Adding competitions to [redacted]\statsbomb.h5
...Adding matches to [redacted]\statsbomb.h5
...Adding players and teams to [redacted]\statsbomb.h5: 
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-0b9bd700b599> in <module>
----> 1 spadl.statsbombjson_to_statsbombh5("[redacted],statsbomb_h5)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\socceraction\spadl\statsbomb.py in jsonfiles_to_h5(datafolder, h5file)
     18     print(f"...Adding matches to {h5file}")
     19     add_matches(os.path.join(datafolder, "matches/"), h5file)
---> 20     add_players_and_teams(os.path.join(datafolder, "lineups/"), h5file)
     21     add_events(os.path.join(datafolder, "events/"), h5file)
     22 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\socceraction\spadl\statsbomb.py in add_players_and_teams(lineups_url, h5file)
     43     ):
     44         with open(lineup_file, "r") as fh:
---> 45             lineups += json.load(fh)
     46             for lineup in lineups:
     47                 for p in [flatten_id(p) for p in lineup["lineup"]]:

~\AppData\Local\Continuum\anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 562: character maps to <undefined>
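
The traceback above passes through cp1252.py, which suggests that the lineup file was opened with the Windows default encoding instead of UTF-8. Below is a minimal sketch of a possible workaround, assuming the StatsBomb JSON files are UTF-8 encoded. load_json_utf8 is a hypothetical helper, not part of socceraction, and the player name is made up for illustration.

```python
import json
import os
import tempfile


def load_json_utf8(path):
    """Load a JSON file, decoding it explicitly as UTF-8 (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


# Illustrative lineup entry; "Á" is encoded in UTF-8 as the bytes 0xC3 0x81.
lineup = [{"player_name": "Ángel"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lineup.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(lineup, fh, ensure_ascii=False)

    with open(path, "rb") as fh:
        raw = fh.read()

    # Decoding the same bytes as cp1252 fails on byte 0x81, as in the traceback.
    try:
        raw.decode("cp1252")
        cp1252_ok = True
    except UnicodeDecodeError:
        cp1252_ok = False

    # Passing the encoding explicitly works regardless of the platform default.
    loaded = load_json_utf8(path)
```

If this is indeed the cause, passing encoding="utf-8" to the open call in add_players_and_teams would likely avoid the error on Windows.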

Data leak in expected goals model

Hi!

First of all, thanks for the great package, it makes working with JSON files much easier!

I was reviewing the latest expected goal model code (EXTRA-build-expected-goals-models) and I think there is a small data leak in the model.

The features dx_a0 and dy_a0 (and movement_a0, which is derived from the other two) use the end location of the action; for shots, this is the end location of the shot. All successful shots (goals) have an end x location of either 0 or 105 (with y values between 30 and 37), so the movement features effectively encode the result of the shot. For example, if a shot is taken from 25 meters horizontal distance from the goal (start_x_a0 = 25), any dx_a0 value smaller than 25 automatically means that the shot was not a goal. It is not a direct leak, but I still believe the model would be better without these features.

Please let me know if I'm missing a point!

Best wishes!
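
The leak described above can be illustrated with a small sketch. It uses plain Python rather than the notebook's actual pipeline; the feature names come from the issue, while drop_leaky_features and the toy shots are hypothetical.

```python
# Feature names quoted in the issue; the helper below is hypothetical.
LEAKY_SHOT_FEATURES = ("dx_a0", "dy_a0", "movement_a0")


def drop_leaky_features(feature_names):
    """Return the feature names without the end-location-derived features."""
    return [name for name in feature_names if name not in LEAKY_SHOT_FEATURES]


# Toy shots taken 25 m from a goal line at x = 105.
shots = [
    {"start_x": 80.0, "end_x": 105.0, "goal": True},   # reaches the goal line
    {"start_x": 80.0, "end_x": 95.0, "goal": False},   # stopped after 15 m
]
for shot in shots:
    # dx_a0 is derived from the end location, so it reveals the outcome.
    shot["dx_a0"] = shot["end_x"] - shot["start_x"]

all_features = ["start_x_a0", "start_y_a0", "dx_a0", "dy_a0", "movement_a0"]
safe_features = drop_leaky_features(all_features)
```

In this toy data only the goal covers the full 25 m, which is exactly the information a shot model should not see at prediction time.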

Bug in conversion of owngoals to atomic SPADL

When converting SPADL actions to atomic SPADL actions, an owngoal is currently converted as follows:

socceraction/socceraction/atomic/spadl/base.py

Lines 119 to 127 in 8e29c57

    shotlike = ['shot', 'shot_freekick', 'shot_penalty']
    shot_ids = list(_spadl.actiontypes.index(ty) for ty in shotlike)
    samegame = actions.game_id == next_actions.game_id
    sameperiod = actions.period_id == next_actions.period_id
    shot = actions.type_id.isin(shot_ids)
    goal = shot & (actions.result_id == _spadl.results.index('success'))
    owngoal = shot & (actions.result_id == _spadl.results.index('owngoal'))

However, owngoals in SPADL are the result of a bad_touch and not of a shotlike, so owngoals are never correctly converted.

This can be fixed by replacing line 127 with:

    owngoal = (actions.type_id == _spadl.actiontypes.index("bad_touch")) & (actions.result_id == _spadl.results.index('owngoal'))

No matching distribution found for json (from socceraction)

Running pip install socceraction fails because the json dependency cannot be found. This happens with Anaconda, with a regular Python pip installation, and on Google Colab. Since you changed the json dependencies recently, it might be worth taking a look.

Collecting socceraction
  Downloading https://files.pythonhosted.org/packages/0c/d1/9e51784a2375d477996ef2574218103bb0a0bfadf4914b7d3738cfd25af1/socceraction-0.0.7.tar.gz
Collecting tqdm
  Downloading https://files.pythonhosted.org/packages/bb/62/6f823501b3bf2bac242bd3c320b592ad1516b3081d82c77c1d813f076856/tqdm-4.39.0-py2.py3-none-any.whl (53kB)
     |████████████████████████████████| 61kB 558kB/s
ERROR: Could not find a version that satisfies the requirement json (from socceraction) (from versions: none)
ERROR: No matching distribution found for json (from socceraction)

SchemaError when converting Wyscout events to SPADL actions

When using the convert_to_actions method

socceraction/socceraction/spadl/wyscout.py

Line 25 in 5f6dcb5

 def convert_to_actions(events: pd.DataFrame, home_team_id: int) -> DataFrame[SPADLSchema]: 

to convert Wyscout events to SPADL actions, I get the following error:

I am using Python 3.8 and socceraction 1.1.3. The bug/error can be recreated as follows (set the download flag as needed):

    pwl = socceraction.data.wyscout.PublicWyscoutLoader(download=True)
    competitions = pwl.competitions()
    world_cup = competitions.loc[6]
    wc_games = pwl.games(world_cup.competition_id, world_cup.season_id)
    game_id = wc_games.at[0, 'game_id']
    home_team_id = wc_games.at[0, 'home_team_id']
    events = pwl.events(game_id)
    actions = socceraction.spadl.wyscout.convert_to_actions(events, home_team_id)

Removing the .astype(int) on line 53 seems to fix this for me.

socceraction/socceraction/spadl/wyscout.py

Line 53 in 5f6dcb5

actions[col] = actions[col].astype(int)

Lack of compatibility with Wyscout's Soccer-logs open dataset

Hi!
Lately I have been working with Wyscout's soccer-logs open dataset of match ball events and tried to calculate VAEP scores for each pass made during the games, which I need for my master's thesis. However, I ran into a problem.

It looks like the JSON files stored there have a different format/structure than the "normal" Wyscout ones; for example, match info does not occur in the events JSON files. As far as I can tell, it would be fairly easy to change the current jsons_to_h5 Wyscout function into one that loads data from the open dataset, but I can't test it without knowing the values that the original algorithm would produce. I will probably open a pull request so you can look at this issue.

Since SPADL/VAEP is said to be compatible with Wyscout data, and many people would probably want to use this dataset to reproduce the results of your work, support for it could be a good feature to add.

Best wishes!

Edit: I forgot the link to the dataset: https://figshare.com/collections/Soccer_match_event_dataset/4415000/2

Error with function add_players_and_teams in statsbomb.py

Hi all

I am trying to run notebook 2 of the tutorial and it fails in this function

def add_players_and_teams(lineups_url, h5file):
on this line
players = pd.DataFrame(players.values())

DataFrame constructor not properly called!

Regards

Adding R xThreat Implementation

This repo did wonders for me when I was trying to wrap my head around the mathematics behind Karun Singh's Expected Threat model. As someone who is much more comfortable with R instead of Python, I actually ended up converting xthreat.py to an RScript. That file can be found here.

If your group wanted to add that script (which is SPADL compatible) to your repository, I would have no reservations. I think it would allow the Expected Threat model to become more accessible.

Thanks for your work!

Add support for new Opta type IDs

It seems like Opta added some new type IDs for the 2021/22 season. These are not yet supported in socceraction, which causes the following error.

SchemaError: non-nullable series 'type_name' contains null value

As a temporary solution, you can downgrade to v1.1.1 (pip install socceraction==1.1.1).

Exception in socceraction.atomic.spadl.convert_to_atomic method

NameError Traceback (most recent call last)
in
12 player_games.append(statsbomb.extract_player_games(events))
13 actions = statsbomb.convert_to_actions(events,match.home_team_id)
---> 14 atomic_actions[match.match_id] = atomicspadl.convert_to_atomic(actions)
15
16 games = matches.rename(columns={"match_id":"game_id"})

C:\ProgramData\Anaconda3\envs\datasciencesoccer-RQVybP6P\lib\site-packages\socceraction\atomic\spadl.py in convert_to_atomic(actions)
36 def convert_to_atomic(actions):
37 actions = actions.copy()
---> 38 actions = extra_from_passes(actions)
39 actions = add_dribbles(actions) # for some reason this adds more dribbles
40 actions = extra_from_shots(actions)

C:\ProgramData\Anaconda3\envs\datasciencesoccer-RQVybP6P\lib\site-packages\socceraction\atomic\spadl.py in extra_from_passes(actions)
113 extra["result_id"] = -1
114
--> 115 offside = prev.result_id == results.index("offside")
116 out = ((nex.type_id == actiontypes.index("goalkick")) & (~same_team)) | (
117 nex.type_id == actiontypes.index("throw_in")

NameError: name 'results' is not defined

KeyError: "['referee_id', 'venue'] not in index" when trying to load StatsBomb Champions League data

I encountered the following error when trying to follow the steps in public notebook 1, using StatsBomb's open data.
This error isn't encountered for the other datasets (e.g. La Liga), so presumably the Champions League data is a bit different to the rest. The section that gives the error is below:

games = list(
    SBL.games(row.competition_id, row.season_id)
    for row in selected_competitions.itertuples()
)
games = pd.concat(games, sort=True).reset_index(drop=True)
games[["home_team_id", "away_team_id", "game_date", "home_score", "away_score"]]

Opta data never converts to goalkick SPADL action

When converting Opta event stream data, there is never a conversion to a 'goalkick' SPADL action.

To do so, the _get_type_id function in opta.py needs to be changed. According to this source (couldn't find an up-to-date official document on the internet), a goal kick corresponds to pass qualifier 124.

Another question also arises: is a keeper throw (qualifier 123) also considered a goal kick in terms of SPADL actions? I think it makes sense to either have separate SPADL actions for these two, or one (renamed?) SPADL action for both. I do not think that a keeper throw should be considered a regular pass, since there is no pressure on the keeper to execute this action quickly (in contrast to a regular pass). It may therefore need different treatment when processing the data.
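
The proposed mapping can be sketched as follows. This is a simplified illustration, not the real _get_type_id function: classify_pass and its parameters are hypothetical, the qualifier IDs are those quoted in the issue (not verified against official Opta documentation), and the keeper-throw question is left open as a flag.

```python
# Qualifier IDs as quoted in the issue; not checked against official Opta docs.
GOAL_KICK_QUALIFIER = 124
KEEPER_THROW_QUALIFIER = 123


def classify_pass(qualifier_ids, keeper_throw_as_goalkick=False):
    """Return a SPADL action type name for an Opta pass event (hypothetical)."""
    qualifier_ids = set(qualifier_ids)
    if GOAL_KICK_QUALIFIER in qualifier_ids:
        return "goalkick"
    # Whether keeper throws belong here is the open question raised above.
    if keeper_throw_as_goalkick and KEEPER_THROW_QUALIFIER in qualifier_ids:
        return "goalkick"
    return "pass"
```

For example, classify_pass([124]) returns "goalkick", while classify_pass([123]) returns "pass" unless the flag is set.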

Typo in SPADL Statsbomb script misclassifies duels

Typo at https://github.com/ML-KULeuven/socceraction/blob/master/socceraction/spadl/statsbomb.py#L313

"Lost in Play " should capitalise the I and be "Lost In Play"

Error in inspect hdf file

After trying to read the HDF file in notebook 2, I get this error. Maybe it is a package version issue? Thanks in advance.

From what I have read, it could be a numpy issue. Any solution?
This is my numpy version:

np.__version__
'1.16.4'

bug when fixing clearances

In statsbomb.py, opta.py and wyscout.py, the function fix_clearances fails when a clearance is the last action of a game. In this case end_x and end_y become nan as there is no next action.

def fix_clearances(actions):
    next_actions = actions.shift(-1)
    clearance_idx = actions.type_id == actiontypes.index("clearance")
    actions.loc[clearance_idx, "end_x"] = next_actions[clearance_idx].start_x.values
    actions.loc[clearance_idx, "end_y"] = next_actions[clearance_idx].start_y.values
    return actions
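
One possible fix is to fall back to the clearance's own start location when there is no next action. The sketch below shows the idea on a plain list of dicts rather than a DataFrame; the "type" key and the fallback choice are assumptions, not the library's implementation.

```python
def fix_clearances(actions, clearance_type="clearance"):
    """Set each clearance's end location to the next action's start location.

    `actions` is a chronologically ordered list of dicts for a single game.
    A clearance without a next action keeps its own start location (assumed
    fallback), so end_x and end_y are never left undefined.
    """
    fixed = []
    for i, action in enumerate(actions):
        action = dict(action)  # copy, so the caller's data is not mutated
        if action["type"] == clearance_type:
            has_next = i + 1 < len(actions)
            reference = actions[i + 1] if has_next else action
            action["end_x"] = reference["start_x"]
            action["end_y"] = reference["start_y"]
        fixed.append(action)
    return fixed
```

In the pandas version, the same fallback could probably be expressed by filling the missing shifted values with the clearance's own start_x and start_y.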

manual running of tests/datasets/download.py required for tests

For the testing suite to succeed, it seems we need to manually run tests/datasets/download.py multiple times with all of "statsbomb", "wyscout", "convert-statsbomb" and "convert-wyscout" args.

As it stands, this should probably at least be mentioned in the contributing guide. Or can this be automated when running nox?

Furthermore, the main function errors out if no argument is provided. What should the logic be here?

if __name__ == '__main__':
    if len(sys.argv) == 1 or sys.argv[1] == 'statsbomb':
        download_statsbomb_data()
    if sys.argv[1] == 'convert-statsbomb':
        convert_statsbomb_data()
    if len(sys.argv) == 1 or sys.argv[1] == 'wyscout':
        download_wyscout_data()
    if sys.argv[1] == 'convert-wyscout':
        convert_wyscout_data()
    if len(sys.argv) == 1 or sys.argv[1] == 'spadl':
        create_spadl(8657, 777)
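
The snippet above fails without arguments because sys.argv[1] is read even when len(sys.argv) == 1. One way to restructure it is a dispatch table with an explicit default, sketched below. The five functions are placeholder stand-ins for the real ones in the script, and the choice of default steps simply mirrors the len(sys.argv) == 1 branches above; both are assumptions.

```python
calls = []


# Placeholder stand-ins for the real functions in the download script.
def download_statsbomb_data():
    calls.append("statsbomb")


def convert_statsbomb_data():
    calls.append("convert-statsbomb")


def download_wyscout_data():
    calls.append("wyscout")


def convert_wyscout_data():
    calls.append("convert-wyscout")


def create_spadl(*args):
    calls.append("spadl")


COMMANDS = {
    "statsbomb": download_statsbomb_data,
    "convert-statsbomb": convert_statsbomb_data,
    "wyscout": download_wyscout_data,
    "convert-wyscout": convert_wyscout_data,
    "spadl": lambda: create_spadl(8657, 777),
}

# Steps run when no argument is given, mirroring the original branches.
DEFAULT_COMMANDS = ("statsbomb", "wyscout", "spadl")


def main(argv):
    """Run the requested steps, or the defaults if none are given."""
    selected = argv[1:] or DEFAULT_COMMANDS
    for name in selected:
        if name not in COMMANDS:
            raise SystemExit(f"unknown command: {name}")
        COMMANDS[name]()
```

In the script itself this would be called as main(sys.argv). Accepting several commands at once would also remove the need to invoke the script multiple times.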

Lack of compatibility with Wyscout's Soccer-logs open dataset

Hi @probberechts , where can I find the EXTRA-load-and-convert-wyscout-data.ipynb notebook? The link below is no longer working. Thank you

@agdhruv I've added a notebook which downloads and converts the Wyscout dataset to the SPADL format in the wyscout_support branch. If you use it, you should be aware that there are still some bugs in the wyscout converter (see other issues).

Originally posted by @probberechts in #14 (comment)

Information leakage when estimating scoring probabilities

I believe there is information leakage in 3-estimate-scoring-and-conceding-probabilities.ipynb when the result of action a0 or the end location of action a0 is used as a feature to predict scoring and a0 is of type shot (including free kicks and penalties).

Minutes played by a player in a game is wrong using the PublicWyscoutLoader class

The _get_minutes_played method uses the timestamps of events in a single game to determine the length of that game and thus how long players played.

socceraction/socceraction/data/wyscout/loader.py

Lines 726 to 733 in e7bc0d0

    def _get_minutes_played(
        teamsData: List[Dict[str, Any]], events: List[Dict[str, Any]]
    ) -> pd.DataFrame:
        periods_ts = {i: [0] for i in range(6)}
        for e in events:
            period_id = wyscout_periods[e['matchPeriod']]
            periods_ts[period_id].append(e['eventSec'])
        duration = int(sum([max(periods_ts[i]) / 60 for i in range(5)]))

However, the players method in the PublicWyscoutLoader passes all the events of the game's competition, instead of only the events of the game itself.

socceraction/socceraction/data/wyscout/loader.py

Lines 286 to 290 in e7bc0d0

    competition_id, season_id = self._match_index.loc[game_id, ['competition_id', 'season_id']]
    path_events = os.path.join(
        self.root, self._index.at[(competition_id, season_id), 'db_events']
    )
    mp = _get_minutes_played(lineups, cast(List[Dict[str, Any]], self.get(path_events)))

This causes all games in the same competition to have the same length and the minutes played for all players to be wrong. As a temporary solution, replacing line 290 with the following 2 lines fixes the issue (but I don't know if it's the most efficient way to do it):

match_events = filter(lambda event: event['matchId'] == game_id, self.get(path_events))
mp = _get_minutes_played(lineups, cast(List[Dict[str, Any]], match_events))
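
On the efficiency question: filtering the whole competition once per game scans every event each time. A possible alternative, sketched below in plain Python, is to group the events by matchId in a single pass and reuse the result. Both helpers are hypothetical, and the period labels in the usage note are illustrative.

```python
from collections import defaultdict


def group_events_by_match(events):
    """Index a competition's events by their 'matchId' in a single pass."""
    by_match = defaultdict(list)
    for event in events:
        by_match[event["matchId"]].append(event)
    return by_match


def duration_minutes(match_events):
    """Sum the last event timestamp of each period, converted to minutes."""
    last_ts = {}
    for event in match_events:
        period = event["matchPeriod"]
        last_ts[period] = max(last_ts.get(period, 0), event["eventSec"])
    return int(sum(ts / 60 for ts in last_ts.values()))
```

With the grouped index, each game gets its own duration: two games whose periods end at 2700 s and 2820 s, and at 2760 s and 3000 s, yield 92 and 96 minutes respectively instead of one shared value.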

OptaLoader Whoscored parser

Hello, can you please explain a little better how the OptaLoader with Whoscored works?
What is the feeds dict() format?

dict_opta = {
                'whoscored': "PremierLeague-2020_2021\\1485314.json"
            }
datafolder = "..\data\Premier_League-2020_2021"
SBL = opta.OptaLoader(root=datafolder, feeds=dict_opta, parser='whoscored')

I tried this quick test but I don't think I am doing it correctly since I am not getting any competitions from competitions = SBL.competitions()

Can you give me a quick example on how to load the whoscored json?

Kind regards

check for same period when adding separating out dribbles

(I'm using StatsBomb data; at a glance, the Opta and Wyscout SPADL conversion scripts appear to suffer from the same bug.)

https://github.com/ML-KULeuven/socceraction/blob/master/socceraction/spadl/statsbomb.py#L406

When selecting events to add dribbles in between, there is no check that the two events take place in the same period (half) of the game. Since the time in seconds resets to (roughly) 0 for the first event of the next period, the time difference is negative (and therefore less than 10).

As a result, a dribble can be inserted between the last event of the preceding half and the pass that starts the following half.

A simple check for

    same_period = actions.period_id == next_actions.period_id

should resolve this

I'm on holiday at the moment but will create a pull request to solve this in the next few days
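
The effect of the missing check can be reproduced with a simplified sketch. It only models the team, period and time conditions; the real converter applies further conditions that are not reproduced here. is_dribble_gap and its parameters are hypothetical, and the 10-second threshold is the one mentioned above.

```python
def is_dribble_gap(action, next_action, max_dt=10.0, check_period=True):
    """Decide whether a dribble may be inserted between two consecutive actions."""
    same_team = action["team_id"] == next_action["team_id"]
    same_period = action["period_id"] == next_action["period_id"]
    dt = next_action["time_seconds"] - action["time_seconds"]
    if check_period and not same_period:
        return False
    return same_team and dt < max_dt


# Last action of the first half and kick-off pass of the second half.
last_first_half = {"team_id": 1, "period_id": 1, "time_seconds": 2700.0}
first_second_half = {"team_id": 1, "period_id": 2, "time_seconds": 1.0}

# Without the period check, the negative time difference passes the test.
buggy = is_dribble_gap(last_first_half, first_second_half, check_period=False)
fixed = is_dribble_gap(last_first_half, first_second_half)
```

Here buggy is True because 1.0 - 2700.0 is below the threshold, while fixed is False once the periods are compared.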

def make_new_positions(events: pd.DataFrame) -> pd.DataFrame:
    """Extract the start and end coordinates for each action.

    Parameters
    ----------
    events : pd.DataFrame
        Wyscout event dataframe

    Returns
    -------
    pd.DataFrame
        Wyscout event dataframe with start and end coordinates for each action.
    """
    new_positions = events[['event_id', 'positions']].apply(
        lambda x: _make_position_vars(x[0], x[1]), axis=1
    )
    new_positions.columns = ['event_id', 'start_x', 'start_y', 'end_x', 'end_y']
    events = pd.merge(events, new_positions, left_on='event_id', right_on='event_id')
    events[['start_x', 'end_x']] = events[['start_x', 'end_x']].clip(0, 105)
    events[['start_y', 'end_y']] = events[['start_y', 'end_y']].clip(0, 68)
    events = events.drop('positions', axis=1)
    return events

def convert_to_actions(events: pd.DataFrame, home_team_id: int) -> DataFrame[SPADLSchema]:
    """
    Convert Wyscout events to SPADL actions.

    Parameters
    ----------
    events : pd.DataFrame
        DataFrame containing Wyscout events from a single game.
    home_team_id : int
        ID of the home team in the corresponding game.

    Returns
    -------
    actions : pd.DataFrame
        DataFrame with corresponding SPADL actions.

    """
    events = pd.concat([events, get_tagsdf(events)], axis=1)
    events = make_new_positions(events)
    events = fix_wyscout_events(events)
    actions = create_df_actions(events)
    actions = fix_actions(actions)
    actions = _fix_direction_of_play(actions, home_team_id)
    actions = _fix_clearances(actions)
    actions['action_id'] = range(len(actions))
    actions = _add_dribbles(actions)

    return actions.pipe(DataFrame[SPADLSchema])

def _extra_from_shots(actions: pd.DataFrame) -> pd.DataFrame:
    next_actions = actions.shift(-1)

    shotlike = ['shot', 'shot_freekick', 'shot_penalty']
    shot_ids = list(_spadl.actiontypes.index(ty) for ty in shotlike)

    samegame = actions.game_id == next_actions.game_id
    sameperiod = actions.period_id == next_actions.period_id

    shot = actions.type_id.isin(shot_ids)
    goal = shot & (actions.result_id == _spadl.results.index('success'))
    owngoal = shot & (actions.result_id == _spadl.results.index('owngoal'))
    next_corner_goalkick = next_actions.type_id.isin(
        [
            _atomicspadl.actiontypes.index('corner_crossed'),
            _atomicspadl.actiontypes.index('corner_short'),
            _atomicspadl.actiontypes.index('goalkick'),
        ]
    )
    out = shot & next_corner_goalkick & samegame & sameperiod

    extra_idx = goal | owngoal | out
    prev = actions[extra_idx]
    # nex = next_actions[extra_idx]

    extra = pd.DataFrame()
    extra['game_id'] = prev.game_id
    extra['original_event_id'] = prev.original_event_id
    extra['period_id'] = prev.period_id
    extra['action_id'] = prev.action_id + 0.1
    extra['time_seconds'] = prev.time_seconds  # + nex.time_seconds) / 2
    extra['start_x'] = prev.end_x
    extra['start_y'] = prev.end_y
    extra['end_x'] = prev.end_x
    extra['end_y'] = prev.end_y
    extra['bodypart_id'] = prev.bodypart_id
    extra['result_id'] = -1
    extra['team_id'] = prev.team_id
    extra['player_id'] = prev.player_id

    ar = _atomicspadl.actiontypes
    extra['type_id'] = -1
    extra['type_id'] = (
        extra.type_id.mask(goal, ar.index('goal'))
        .mask(owngoal, ar.index('owngoal'))
        .mask(out, ar.index('out'))
    )
    actions = pd.concat([actions, extra], ignore_index=True, sort=False)
    actions = actions.sort_values(['game_id', 'period_id', 'action_id']).reset_index(drop=True)
    actions['action_id'] = range(len(actions))
    return actions
