ml-kuleuven / socceraction
Convert soccer event stream data to SPADL and value player actions using VAEP or xT
License: MIT License
When running nox, multiple tests fail for me for the same reason:
pandera expected a series to have type int64, got int32.
When I try to convert the downloaded StatsBomb / Wyscout files by invoking tests/datasets/downloads.py with convert-"provider", I get the same error.
I am using Python 3.9.9, socceraction 1.1.2 and pandera 0.8.0.
I had a similar problem while using socceraction, which I can't recall exactly, where I had to use an earlier version of pandera (0.6.1) to make it work. convert-statsbomb and convert-wyscout work fine with this pandera version as well.
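One way to sidestep an int32/int64 mismatch, sketched below, is to cast the affected integer columns to int64 explicitly before schema validation. The helper `to_int64` and the column name are hypothetical, and the sketch does not reproduce socceraction's own schemas; it only illustrates the cast. A platform whose default integer is 32 bits wide is one possible cause of the error, which is an assumption here, not something confirmed in the report.

```python
import numpy as np
import pandas as pd


def to_int64(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df with the given columns cast to int64 (hypothetical helper)."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype(np.int64)
    return out


# Simulate a frame whose integer column was created as int32.
events = pd.DataFrame({"game_id": np.array([1, 2], dtype=np.int32)})
fixed = to_int64(events, ["game_id"])
print(fixed.dtypes)
```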
Wyscout does not distinguish between headers and other body parts on shots. The SPADL converter simply labels all shots as performed by foot, which causes issues when training an expected goals model.
This can be fixed easily:
def determine_bodypart_id(event):
    """
    This function determines the body part used for an event
    Args:
        event (pd.Series): Wyscout event Series
    Returns:
        int: id of the body part used for the action
    """
    if event["subtype_id"] in [81, 36, 21, 90, 91]:
        body_part = "other"
    elif event["subtype_id"] == 82:  # or event['head_or_body']:
        body_part = "head"
+   elif event["type_id"] == 10 and event['head/body']:
+       body_part = "?"
    else:  # all other cases
        body_part = "foot"
    return bodyparts.index(body_part)
However, I do not know which body part I should use for these shots. Just other, or head, or create a new one, head/other?
Sometimes people want to use some extra information of event data not available in the SPADL format, e.g., pressure attribute of StatsBomb data.
We can't extend SPADL to accommodate every extra piece of information that might be relevant, because then we lose the simplicity of SPADL and also its cross-compatibility with Wyscout, Opta, and StatsBomb.
For now, the best way to allow people to use extra information seems to be to add an "original_event_id" column to the SPADL data. This column will allow people to join SPADL dataframes with the original event dataframes of Wyscout, Opta, or StatsBomb. People will thus be able to access all extra event information of a vendor with only one simple join operation.
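The join described above could look roughly like the following sketch. Both dataframes, the `event_id` key on the vendor side, and the `under_pressure` column are made-up illustrations; only the "original_event_id" column name comes from the proposal.

```python
import pandas as pd

# Hypothetical SPADL actions carrying a reference to the source event.
actions = pd.DataFrame({
    "original_event_id": ["a1", "a2", "a3"],
    "type_name": ["pass", "dribble", "shot"],
})

# Hypothetical vendor events with an extra attribute not available in SPADL.
events = pd.DataFrame({
    "event_id": ["a1", "a2", "a3", "a4"],
    "under_pressure": [True, False, True, False],
})

# A left join keeps every SPADL action and attaches the vendor-specific column.
enriched = actions.merge(
    events, how="left", left_on="original_event_id", right_on="event_id"
)
print(enriched[["type_name", "under_pressure"]])
```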
When running the 'EXTRA-run-xT.ipynb' notebook under the 'public-notebooks' folder, I got a completely different output for xT values and thus different visualizations as follows.
As guided, I ran the '1-load-and-convert-statsbomb-data.ipynb' notebook first. I noticed that there are some changes for the StatsBomb dataset, but I don't think it is the reason resulting in the completely different output. I am using Python 3.7.2 and socceraction 1.2.2.
SPADL defines 4 different event types for describing save/ball recovery actions by keepers:
Action type | Description | Success |
---|---|---|
Keeper save | Keeper saves a shot on goal | Always success |
Keeper claim | Keeper catches a cross | Does not drop the ball |
Keeper punch | Keeper punches the ball clear | Always success |
Keeper pick-up | Keeper picks up the ball | Always success |
First, it is somewhat unclear what the differences between these four events are. For example,
Second, there are inconsistencies between the different converters.
A related point was raised in #45: How should keeper throws be addressed? Is a keeper throw also considered a goal kick in terms of SPADL actions? I think it makes sense to either have separate SPADL actions for these two, or one (renamed?) SPADL action for them both. I do not think that a keeper throw should be considered a regular pass, since there is no pressure on the keeper to execute this action quickly (in contrast to a regular pass). Therefore it may need a different treatment when processing the data.
Although keeper actions are not very important in the action valuing frameworks, this might be useful in other applications of SPADL. Therefore, I believe it would be good to agree upon a definition for these events and fix the inconsistencies in the converters.
file: socceraction/classification/features.py
for i in range(1, nr_actions):
    for c in ["team_id", "goal", "owngoal"]:
        shifted = y[c].shift(-i)
        shifted[-i:] = y[c][len(y) - i]
        y["%s+%d" % (c, i)] = shifted
This code does not correctly propagate goals for the last n_actions in a match.
I am getting this error when running 1-load-and-convert-statsbomb-data.ipynb using the latest version. Here is the code from the notebook:
from socceraction.data.statsbomb import StatsBombLoader
Which gives an error.
The following works, however:
from socceraction.socceraction.data.statsbomb import StatsBombLoader
This is an issue for all imports, e.g. StatsBombLoader requires:
from socceraction.data.base import EventDataLoader, ParseError
Socceraction creates the timestamps for dribbles and receptions (atomic) as the midpoint (t1 + t2) / 2 between the first and subsequent actions.
The downside might be that you can have a short pass followed by a long dribble, but the timestamp would still be in the middle (of the dribble).
The other way around might happen too, with a long ball traveling a few seconds that is immediately passed on, so that the timestamp falls in the middle of the pass.
Would it make sense to assume a fixed passing speed and derive the timestamp of the ball reception from the distance, and thus from the time the ball would take from origin to destination?
Now I am not sure if this is very relevant. It might be useful for looking at holding times of the ball in those individual ball possessions for example.
Are Opta Timestamps always just in seconds? Maybe the lacking accuracy of those would make this approach unnecessary.
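The fixed-speed idea raised above can be sketched as follows. The function name, its parameters, and the 15 m/s ball speed are illustrative assumptions, not values taken from socceraction or from any measurement.

```python
import math


def estimate_reception_time(t_pass, t_next, start, end, ball_speed=15.0):
    """Estimate when a pass is received, assuming a constant ball speed.

    t_pass, t_next: timestamps (s) of the pass and the following action.
    start, end: (x, y) pass coordinates in meters.
    ball_speed: assumed average speed in m/s (illustrative value only).
    """
    distance = math.hypot(end[0] - start[0], end[1] - start[1])
    travel_time = distance / ball_speed
    # Never place the reception after the next recorded action.
    return min(t_pass + travel_time, t_next)


# A 30 m pass followed by a long dribble: the estimate depends on the pass
# distance rather than on the midpoint of the 10 s gap.
t_reception = estimate_reception_time(0.0, 10.0, (10.0, 34.0), (40.0, 34.0))
print(t_reception)
```

With timestamps that only have one-second resolution, the gain over the midpoint may well be negligible, as the question above already suspects.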
Thanks for the great work on making statsbomb data more accessible.
For the statsbomb.py functions, utf-8 encoding is declared for matches and competitions but not teams and players or events.
with open(competition_file, "rt", encoding="utf-8") as fh:
    matches += json.load(fh)
vs.
with open(lineup_file, "r") as fh:
    lineups += json.load(fh)
This is causing my code to break as default decoding is ascii. Could the functions be aligned to utf-8? Alternatively - add try/except logic?
Thanks
Chris
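The alignment requested above amounts to passing `encoding="utf-8"` on every `open` call. The sketch below writes a small made-up lineup file to a temporary directory and reads it back; the file name and content are illustrative only.

```python
import json
import os
import tempfile

# Write a small lineup-like file containing a non-ASCII character.
path = os.path.join(tempfile.mkdtemp(), "lineup.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump([{"player_name": "Luka Modrić"}], fh, ensure_ascii=False)

# Passing encoding="utf-8" makes the read independent of the platform default.
with open(path, "r", encoding="utf-8") as fh:
    lineups = json.load(fh)

print(lineups[0]["player_name"])
```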
In the SPADL representation of Opta and StatsBomb all own goals are labeled as shots, while the Wyscout converter labels them as passes, interceptions or clearances. First, I think it would be better to be consistent and use the same data types for each provider. Second, shot is not a good action type for own goals in my opinion. I prefer bad_touch. Another reasonable option would be to use the type of the intended action (i.e., clearance, interception, pass, keeper_save, ...) as is done now in the Wyscout converter, but I do not know whether it is easy to do this for StatsBomb and Opta.
For my thesis I'm defining a sequence of ball possession of a team as a specific sequence of SPADL actions that occur in a larger sequence of SPADL actions (the precise definition is not important for this issue). For this, I'm basing myself on Table 3.1 in Tom Decroos PhD thesis (https://tomdecroos.github.io/reports/thesis_tomdecroos.pdf). This Table defines all SPADL actions and which attribute values each action can have. However, this definition does not seem up to date with the actual implementation in socceraction. I encountered the following differences:
A precise definition of the SPADL data format is necessary to correctly define a sequence of ball possession in terms of SPADL actions. It's important to state the trickier points, e.g. that a failed interception by the other team does not affect a team's ball possession sequence. The thesis, for example, ruled out the possibility of failed interceptions.
I therefore propose keeping an up-to-date definition somewhere that precisely defines SPADL and which action-attribute pairs are valid. This would make it possible to build definitions in terms of SPADL actions or SPADL action sequences.
When building this definition, I also think that the following things in the original definition in the thesis should require some attention:
Two mistakes are made when converting Wyscout events to SPADL events:
make_new_positions clips all x and y coordinates to (0, 105) and (0, 68) respectively, where 105 and 68 are the field length and field width used in the whole package.
socceraction/socceraction/spadl/wyscout.py
Lines 161 to 182 in 3e267fb
The problem is that when this method is called, we are still working with Wyscout positions and Wyscout defines positions as follows:
which I got from https://figshare.com/articles/dataset/Events/7770599?backTo=/collections/Soccer_match_event_dataset/4415000. This means that either the positions have to be clipped to (0, 100) (both x and y) or they have to be clipped at a later stage. I don't know what's more desirable.
convert_to_actions fixes the direction of play (the player who performs the action always plays from left to right), but as can be seen in the picture above, Wyscout already defines its events like this, so this line just reverts the process.
socceraction/socceraction/spadl/wyscout.py
Lines 25 to 52 in 3e267fb
The current loaders are designed to read files and process them into dataframes.
They are not suitable if you want to use the SDDP feed instead of SDAPI. The difference is that SDDP adds events during the match, whereas SDAPI is available only after the match.
I want to calculate some metrics during the match, so I created an alternative in-memory loader based on the MA3 loader:
https://gist.github.com/denisov-vlad/28d4668c4861b7c551a6caba3c341ba2
As you can see, there is a lot of duplicated code. It would be great to split the extract functions into loading from disk and processing data.
A question, not an issue per se.
When lagging gamestates to compute features on SPADL, the default fill is 0. Given that 0 is a valid type_id (at least for StatsBomb, where it is a pass), is this (ever so slightly) affecting results by saying that, e.g., when a team kicks off, the last 3 actions have been passes? I imagine this is of little to no consequence in reality, as so few actions happen from kick-off, but it might be worth assigning either 999 or NA (etc.) to lagged actions which do not have a preceding action.
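The difference between filling with a valid id and filling with a sentinel can be seen in a small pandas sketch. The series below is made up, and -1 is an arbitrary sentinel chosen for illustration; this is not how socceraction builds its gamestates.

```python
import pandas as pd

# type_id 0 is a valid action type (a pass), so it is ambiguous as filler.
type_id = pd.Series([5, 0, 11], name="type_id")

# A plain shift leaves NaN where no preceding action exists.
prev_nan = type_id.shift(1)

# An explicit sentinel keeps an integer dtype while staying distinguishable
# from every real action type.
prev_sentinel = type_id.shift(1, fill_value=-1)

print(prev_nan.tolist())
print(prev_sentinel.tolist())
```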
Nitpick: the return value here is a numpy.ndarray and not a pandas.Series
socceraction/socceraction/xthreat.py
Line 229 in 185ba8f
First of all: thanks for this great library!
I was wondering what is needed to add kloppy as a reader for input files. When kloppy can be used users of socceraction can easily switch to other formats supported by kloppy (like Sportec).
Challenges:
Curious what you think about this. If it seems doable I can start working on a PR for socceraction.
Own goals resulting from bad touch events in the Wyscout event streams are missing in the SPADL representation.
As a minimal example, here is an own goal from the game between Leicester and Stoke on 24 Feb 2018. Stoke's goalkeeper Jack Butland allows a low cross to bounce off his gloves and into the net:
index | eventId | subEventName | tags | playerId | positions | matchId | eventName | teamId | matchPeriod | eventSec | subEventId | id |
---|---|---|---|---|---|---|---|---|---|---|---|---|
466559 | 8 | Cross | "[{'id': 402}, {'id': 801}, {'id': 1802}]" | 8013 | "[{'y': 89, 'x': 97}, {'y': 0, 'x': 0}]" | 2499994 | Pass | 1631 | 2H | 1496 | 80 | 230320305 |
466560 | 7 | Touch | [{'id': 102}] | 8094 | "[{'y': 50, 'x': 1}, {'y': 100, 'x': 100}]" | 2499994 | Others on the ball | 1639 | 2H | 1497 | 72 | 230320132 |
466561 | 9 | Reflexes | "[{'id': 101}, {'id': 1802}]" | 8094 | "[{'y': 100, 'x': 100}, {'y': 50, 'x': 1}]" | 2499994 | Save attempt | 1639 | 2H | 1499 | 90 | 230320135 |
And the corresponding SPADL representation:
index | game_id | period_id | time_seconds | team_id | player_id | start_x | start_y | end_x | end_y | bodypart_id | type_id | result_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2499994.0 | 2.0 | 1496 | 1631.0 | 8013.0 | 101.85 | 7.48 | 0.0 | 68.0 | 0 | 1 | 0 |
1 | 2499994.0 | 2.0 | 1499 | 1639.0 | 8094.0 | 1.05 | 34.0 | 1.05 | 34.0 | 2 | 14 | 1 |
The result_id of the second action should be 3 (= own_goal).
The _get_minutes_played method in the Wyscout loader contains 2 bugs:
socceraction/socceraction/data/wyscout/loader.py
Lines 731 to 735 in 5974857
In words: take the time in seconds of the last event in every period, convert to minutes and sum up the values. This means that injury time is taken into account for every period. Wyscout, however, only takes into account injury time of the last period of the game when defining the minute of e.g. a substitution, a red card. So when you now define the amount of minutes played of a substitute as follows:
and the time played by a player who gets substituted as:
the amount of minutes played will be too high and too low respectively. For players who play a full game there is obviously no problem.
While converting Wyscout events to SPADL actions most duels are removed as they are not considered on the ball actions, however in doing so some information is lost. Wyscout considers a pass which is followed by a duel as accurate (translated to SPADL as a successful action) even if the duel is lost by the teammate of the player who gave the pass. This causes successful passes to be followed by an action of the opposing team. It would make more sense (in my opinion) to mark the pass as failed and follow it up by an interception of the opposing player.
Add expected threat from Karun Singh (https://karun.in/blog/expected-threat.html) as an alternative to VAEP.
Possible filename: socceraction/xthreat.py
statsbomb.py now offers some great atomic functions to download and convert StatsBomb data. The notebook public-notebooks/1-load-and-convert-statsbomb-data demonstrates the full pipeline and makes it possible for users to inspect intermediate results, view only a small part of the data, and debug in case things go wrong.
In contrast, opta.py and wyscout.py are still very much black boxes that only offer a single public function for converting an entire folder of data. These modules have none of the benefits statsbomb.py has now.
In Perform's data we can split actions by left/right foot.
This is helpful when we know a player's preferred foot.
I've added it to the bodyparts variable
bodyparts: List[str] = ['foot', 'right foot', 'left foot', 'head', 'other', 'head/other']
and to the _get_bodypart_id() function to improve the xG classifier metrics.
Is this OK? Will it break other functionality? I'm ready to make a pull request.
Wyscout collects its data by video analysis. This means that when replays of certain events are shown, the events that occur during that replay are not captured. The most common replayed events are goals and this occasionally causes the kick-off and subsequent actions to be absent from the game data. Most of the time this causes no issues (except for the fact that there is no data for small parts of the game), but when converting SPADL actions to atomic SPADL actions we run into problems. When converting from default SPADL to atomic SPADL, the result (result_id and result_name) is replaced by an extra action. For shotlike actions this is done as follows:
socceraction/socceraction/atomic/spadl/base.py
Lines 116 to 166 in 8e29c57
In short: when the result of a shot action is success, the next action will be a goal; when the action following the shot is a corner or a goalkick, the next action will be out. The problem now lies in:
socceraction/socceraction/atomic/spadl/base.py
Lines 156 to 161 in 8e29c57
Due to some events not being registered by Wyscout after a goal, it is possible that the first event registered after a goal is a goalkick or a corner, instead of the expected pass (the kickoff). This means that line 161 will override what is done on line 159, causing the goals to be incorrectly converted.
There are two possible ways to fix this (that I've come up with at least):
In the xG model example you use the movement_a0 feature, which is highly correlated with the classification result and distorts the importances of the other features.
I've tested the model with and without it and compared with Understat data. With this feature you get good values for goals, but other shots from good positions get small probabilities.
Of course, the AUC score decreases to ~0.83 (tested with XGBoost / LightGBM models), but the final result for each shot seems more accurate.
The Wyscout converter converts passes that are also interceptions in the event data into two separate events, first an interception and then a pass. However, the interception gets the result_id of the original combined event, which can be problematic when the pass results in an own goal. If this happens, it seems like the player made two own goals instead of one.
The Wyscout event:
index | eventId | subEventName | tags | playerId | positions | matchId | eventName | teamId | matchPeriod | eventSec | subEventId | id |
---|---|---|---|---|---|---|---|---|---|---|---|---|
30658 | 8 | Head pass | "[{'id': 102}, {'id': 1401}, {'id': 1801}]" | 38093 | "[{'y': 56, 'x': 5}, {'y': 100, 'x': 100}]" | 2499737 | Pass | 1610 | 2H | 2184 | 82 | 180427412 |
The corresponding SPADL representation:
index | game_id | period_id | time_seconds | team_id | player_id | start_x | start_y | end_x | end_y | bodypart_id | type_id | result_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2499737.0 | 2.0 | 2184 | 1610.0 | 38093.0 | 99.75 | 38.08 | 99.75 | 38.08 | 0 | 10 | 3 |
1 | 2499737.0 | 2.0 | 2184 | 1610.0 | 38093.0 | 99.75 | 38.08 | 0.0 | 68.0 | 1 | 0 | 3 |
The insert_interception_passes function has to be adapted. This can be fixed easily, but I need to know the exact definition of a successful interception event. In particular, should the result in the example above be "success" (because he intercepted the ball) or "fail" (because he lost it immediately)? More generally, is an interception successful if you touch the ball, or only if you keep possession in the subsequent action?
To make the bugs reproducible, I used the OptaLoader on the XML feeds from test folder and have attached a screenshot here.
I noticed 2 bugs which affect the columns is_starter and minutes_played. Both the bugs can be located in extract_lineups() function belonging to _F7XMLParser class. There is also an error possible in logic used to calculate minutes_played (refer Issue 3 on this page )
Issue 1: a possible bug
Location : Line 841 in spadl/opta.py
In is_starter=player_elm.attrib['Formation_Place'] != 0, the attribute Formation_Place turns out to be a string containing a number from 0 to 11. So in this case, is_starter becomes True irrespective of the value of Formation_Place, because a string is compared with an integer.
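The type mismatch in Issue 1 can be reproduced without any Opta data. The dictionary below is a made-up stand-in for the XML attributes of a bench player; converting the value with `int()` is one possible fix, not necessarily the one used in the pull request.

```python
# XML attribute values are strings, so comparing against the int 0 is always True.
attrib = {"Formation_Place": "0"}  # hypothetical bench player

buggy_is_starter = attrib["Formation_Place"] != 0       # True, even for "0"
fixed_is_starter = int(attrib["Formation_Place"]) != 0  # False for "0"

print(buggy_is_starter, fixed_is_starter)
```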
Issue 2: a possible bug
Location : Line 827 in spadl/opta.py
The following piece of code
sub_on = int(next((item['Time'] for item in subst if item['SubOn'] == f'p{player_id}'), 0))
assigns the value 0 to the variable sub_on for substitutes who don't get subbed on. So players who stay on the bench throughout the game have minutes_played equal to stats['match_time'], because minutes_played = sub_off - sub_on and the sub_off value for all players who don't get subbed off is set to stats['match_time']. As seen in the picture above, Iturraspe, a player who doesn't play a single minute in the game, has minutes_played = 96 - 0 = 96.
Issue 3 : possible Error causing line:
Location : Line 827 in spadl/opta.py
Substitution events in Opta don't necessarily involve 2 players. A player retirement is also recorded as a sub event. So when a player retires (i.e. the team has exhausted its available substitutions), the Substitution element will not have the 'SubOn' attribute, only 'SubOff'. Hence one would get a KeyError in these circumstances, as the comprehension looks up the SubOn key in each Substitution element.
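Issues 2 and 3 can be handled together, roughly as in the sketch below. The function `minutes_played` and its signature are hypothetical and are not the fix submitted in the pull request; the substitution records only mimic the 'Time', 'SubOn' and 'SubOff' keys mentioned above.

```python
def minutes_played(player_id, is_starter, substitutions, match_time):
    """Compute minutes played from Opta-style substitution records (sketch).

    substitutions: list of dicts with 'Time' and optionally 'SubOn'/'SubOff'.
    """
    pid = f"p{player_id}"
    # .get() avoids a KeyError when a retirement record has no 'SubOn' key.
    sub_on = next(
        (int(s["Time"]) for s in substitutions if s.get("SubOn") == pid), None
    )
    sub_off = next(
        (int(s["Time"]) for s in substitutions if s.get("SubOff") == pid),
        match_time,
    )
    if is_starter:
        return sub_off
    if sub_on is None:
        # A substitute who never entered the pitch played no minutes.
        return 0
    return sub_off - sub_on


subs = [
    {"Time": "60", "SubOn": "p2", "SubOff": "p1"},
    {"Time": "80", "SubOff": "p3"},  # retirement: no 'SubOn'
]
print(minutes_played(1, True, subs, 96))   # starter substituted off
print(minutes_played(2, False, subs, 96))  # substitute who came on
print(minutes_played(4, False, subs, 96))  # unused substitute
```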
P.S : I managed a fix locally for all the 3 issues and will be issuing a pull request momentarily. I thought posting this as an issue would create a log of this issue on the issues section of this repo and might help people in the future.
On July 20th 2021, Wyscout switched their API to v3. Currently, socceraction only supports v2 of the Wyscout API, which is now a legacy version. Adding support for the new API format will require substantial changes to the socceraction.data.wyscout and socceraction.spadl.wyscout modules. As v2 of the API will remain available until the release of v4 (no release date yet), making these changes is currently not a priority. However, pull requests would be welcome.
See https://apidocs.wyscout.com/ for details.
Hi! first thanks for this package--I can't wait to dive into the "cleaned" data.
Second, I keep running into this error, and I think it has something to do with encodings in a few of the lineups, because if I remove certain line up files, it will run through. Sorry, I'm 99% an R user, so I'm not sure how to diagnose this! I'm following along in the open notebook you provided.
Thanks!
...Adding competitions to [redacted]\statsbomb.h5
...Adding matches to [redacted]\statsbomb.h5
...Adding players and teams to [redacted]\statsbomb.h5:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-4-0b9bd700b599> in <module>
----> 1 spadl.statsbombjson_to_statsbombh5("[redacted],statsbomb_h5)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\socceraction\spadl\statsbomb.py in jsonfiles_to_h5(datafolder, h5file)
18 print(f"...Adding matches to {h5file}")
19 add_matches(os.path.join(datafolder, "matches/"), h5file)
---> 20 add_players_and_teams(os.path.join(datafolder, "lineups/"), h5file)
21 add_events(os.path.join(datafolder, "events/"), h5file)
22
~\AppData\Local\Continuum\anaconda3\lib\site-packages\socceraction\spadl\statsbomb.py in add_players_and_teams(lineups_url, h5file)
43 ):
44 with open(lineup_file, "r") as fh:
---> 45 lineups += json.load(fh)
46 for lineup in lineups:
47 for p in [flatten_id(p) for p in lineup["lineup"]]:
~\AppData\Local\Continuum\anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 562: character maps to <undefined>
Hi!
First of all, thanks for the great package, it makes working with JSON files much easier!
I was reviewing the latest expected goal model code (EXTRA-build-expected-goals-models) and I think there is a small data leak in the model.
Features dx_a0 and dy_a0 (and also movement_a0, as it is derived from the other two) use the end location of the action; for shots this is the end location of the shot. All successful shots (goals) have an end x location of either 0 or 105 (and 30-37 for y values), so the movement information actually encodes the result of the shot. For example, if the shot is taken from 25 meters horizontal distance from the goal (start_x_a0 = 25), any dx_a0 value less than 25 automatically means that the shot was not a goal. It is not a direct leak, but I still believe the model would be better without these features.
Please let me know if I'm missing a point!
Best wishes!
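Excluding the features named above before training could be sketched as follows. The feature table is made up for illustration; only the column names dx_a0, dy_a0, movement_a0 and start_x_a0 come from the report.

```python
import pandas as pd

# Hypothetical feature table for shots; the values are invented.
X = pd.DataFrame({
    "start_x_a0": [80.0, 95.0],
    "start_y_a0": [30.0, 36.0],
    "dx_a0": [25.0, 4.0],
    "dy_a0": [4.0, -2.0],
    "movement_a0": [25.3, 4.5],
})

# Features derived from the end location of the shot itself.
leaky = ["dx_a0", "dy_a0", "movement_a0"]
X_clean = X.drop(columns=leaky)

print(list(X_clean.columns))
```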
When converting SPADL actions to atomic SPADL actions, an owngoal is currently converted as follows:
socceraction/socceraction/atomic/spadl/base.py
Lines 119 to 127 in 8e29c57
However, owngoals in SPADL are the result of a bad_touch and not of a shotlike, so owngoals are never correctly converted.
This can be fixed by replacing line 127 with:
owngoal = (actions.type_id == _spadl.actiontypes.index("bad_touch")) & (actions.result_id == _spadl.results.index('owngoal'))
During pip install socceraction, an error occurs because the json dependency cannot be found. It occurs with Anaconda, with a normal Python pip installation and also on Google Colab. As I saw you've changed the json dependencies lately, it would be worth looking into.
Collecting socceraction
Downloading https://files.pythonhosted.org/packages/0c/d1/9e51784a2375d477996ef2574218103bb0a0bfadf4914b7d3738cfd25af1/socceraction-0.0.7.tar.gz
Collecting tqdm
Downloading https://files.pythonhosted.org/packages/bb/62/6f823501b3bf2bac242bd3c320b592ad1516b3081d82c77c1d813f076856/tqdm-4.39.0-py2.py3-none-any.whl (53kB)
|████████████████████████████████| 61kB 558kB/s
ERROR: Could not find a version that satisfies the requirement json (from socceraction) (from versions: none)
ERROR: No matching distribution found for json (from socceraction)
When using the convert_to_actions method to convert Wyscout events to SPADL actions, I get the following error:
I am using Python 3.8 and socceraction 1.1.3. The bug/error can be recreated as follows (set the download flag as needed):
pwl = socceraction.data.wyscout.PublicWyscoutLoader(download=True)
competitions = pwl.competitions()
world_cup = competitions.loc[6]
wc_games = pwl.games(world_cup.competition_id, world_cup.season_id)
game_id = wc_games.at[0, 'game_id']
home_team_id = wc_games.at[0, 'home_team_id']
events = pwl.events(game_id)
actions = socceraction.spadl.wyscout.convert_to_actions(events, home_team_id)
Removing the .astype(int) on line 53 seems to fix this for me.
Hi!
Lately I've been working with Wyscout's soccer-logs open dataset of match ball events and tried to calculate VAEP scores for each pass made during games, as I need this for my master's thesis. A problem occurred, though.
It looks like the JSON files stored there have a different format/structure from the "normal" Wyscout ones, because, for example, match info does not occur in the event JSON files. As far as I can tell, it would be pretty easy to change the current jsons_to_h5 Wyscout function into one that loads data from the open dataset, but I can't test it without knowing the values that the original algorithm would produce. I will probably open a pull request so you can look at this issue.
Since SPADL/VAEP is said to be compatible with Wyscout's data, and given the number of people who would probably use it and want to see the results of your work, this could be a good feature to add.
Best wishes!
Ed: forgot the link to dataset: https://figshare.com/collections/Soccer_match_event_dataset/4415000/2
This repo did wonders for me when I was trying to wrap my head around the mathematics behind Karun Singh's Expected Threat model. As someone who is much more comfortable with R instead of Python, I actually ended up converting xthreat.py to an RScript. That file can be found here.
If your group wanted to add that script (which is SPADL compatible) to your repository, I would have no reservations. I think it would allow the Expected Threat model to become more accessible.
Thanks for your work!
It seems like Opta added some new type IDs for the 2021/22 season. These are not yet supported in socceraction, causing the following error.
SchemaError: non-nullable series 'type_name' contains null value
As a temporary solution, you can downgrade to v1.1.1 (pip install socceraction==1.1.1).
NameError Traceback (most recent call last)
in
12 player_games.append(statsbomb.extract_player_games(events))
13 actions = statsbomb.convert_to_actions(events,match.home_team_id)
---> 14 atomic_actions[match.match_id] = atomicspadl.convert_to_atomic(actions)
15
16 games = matches.rename(columns={"match_id":"game_id"})
C:\ProgramData\Anaconda3\envs\datasciencesoccer-RQVybP6P\lib\site-packages\socceraction\atomic\spadl.py in convert_to_atomic(actions)
36 def convert_to_atomic(actions):
37 actions = actions.copy()
---> 38 actions = extra_from_passes(actions)
39 actions = add_dribbles(actions) # for some reason this adds more dribbles
40 actions = extra_from_shots(actions)
C:\ProgramData\Anaconda3\envs\datasciencesoccer-RQVybP6P\lib\site-packages\socceraction\atomic\spadl.py in extra_from_passes(actions)
113 extra["result_id"] = -1
114
--> 115 offside = prev.result_id == results.index("offside")
116 out = ((nex.type_id == actiontypes.index("goalkick")) & (~same_team)) | (
117 nex.type_id == actiontypes.index("throw_in")
NameError: name 'results' is not defined
I encountered the following error when trying to follow the steps in public notebook 1, using StatsBomb's open data.
This error isn't encountered for the other datasets (e.g. La Liga), so presumably the Champions League data is a bit different to the rest. The section that gives the error is below:
games = list(
    SBL.games(row.competition_id, row.season_id)
    for row in selected_competitions.itertuples()
)
games = pd.concat(games, sort=True).reset_index(drop=True)
games[["home_team_id", "away_team_id", "game_date", "home_score", "away_score"]]
When converting Opta event stream data, there is never a conversion to a 'goalkick' SPADL action.
To do so, the _get_type_id function in opta.py needs to be changed. According to this source (couldn't find an up-to-date official document on the internet), a goal kick corresponds to pass qualifier 124.
Another question also arises: is a keeper throw (qualifier 123) also considered a goal kick in terms of SPADL actions. I think it makes sense to either have separate SPADL actions for these two, or one (renamed?) SPADL action for them both. I do not think that a keeper throw should be considered a regular pass, since there is no pressure on the keeper to execute this action quickly (in contrast to a regular pass). Therefore it may need a different treatment when processing the data.
Typo at https://github.com/ML-KULeuven/socceraction/blob/master/socceraction/spadl/statsbomb.py#L313
"Lost in Play " should capitalise the I and be "Lost In Play"
In statsbomb.py, opta.py and wyscout.py, the function fix_clearances fails when a clearance is the last action of a game. In this case end_x and end_y become nan as there is no next action.
def fix_clearances(actions):
    next_actions = actions.shift(-1)
    clearance_idx = actions.type_id == actiontypes.index("clearance")
    actions.loc[clearance_idx, "end_x"] = next_actions[clearance_idx].start_x.values
    actions.loc[clearance_idx, "end_y"] = next_actions[clearance_idx].start_y.values
    return actions
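A minimal sketch of one way to avoid the NaN values: when there is no next action, fall back to the clearance's own start location. The fallback choice and the two-element `actiontypes` list are assumptions made for this example; the maintainers may prefer a different end location.

```python
import pandas as pd

# Minimal stand-in for the SPADL action type list (assumption for this sketch).
actiontypes = ["pass", "clearance"]


def fix_clearances(actions: pd.DataFrame) -> pd.DataFrame:
    """Set a clearance's end location to the start of the next action.

    If no next action exists, fall back to the clearance's own start location.
    """
    actions = actions.copy()
    next_actions = actions.shift(-1)
    is_clearance = actions.type_id == actiontypes.index("clearance")
    for src, dst in (("start_x", "end_x"), ("start_y", "end_y")):
        # NaN appears only in the last row, where no next action exists.
        fallback = next_actions[src].fillna(actions[src])
        actions.loc[is_clearance, dst] = fallback[is_clearance].values
    return actions


actions = pd.DataFrame({
    "type_id": [1, 0, 1],
    "start_x": [10.0, 50.0, 20.0],
    "start_y": [30.0, 40.0, 35.0],
    "end_x": [0.0, 60.0, 0.0],
    "end_y": [0.0, 45.0, 0.0],
})
fixed = fix_clearances(actions)
print(fixed[["end_x", "end_y"]])
```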
For the testing suite to succeed, it seems we need to manually run tests/datasets/download.py multiple times with all of "statsbomb", "wyscout", "convert-statsbomb" and "convert-wyscout" args.
As is this should probably at least be mentioned in the contributing guide. Or can this be automated when running nox?
Further the main function errors out if no further arg is provided. How should the logic be here?
if __name__ == '__main__':
    if len(sys.argv) == 1 or sys.argv[1] == 'statsbomb':
        download_statsbomb_data()
    if sys.argv[1] == 'convert-statsbomb':
        convert_statsbomb_data()
    if len(sys.argv) == 1 or sys.argv[1] == 'wyscout':
        download_wyscout_data()
    if sys.argv[1] == 'convert-wyscout':
        convert_wyscout_data()
    if len(sys.argv) == 1 or sys.argv[1] == 'spadl':
        create_spadl(8657, 777)
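One possible shape for the argument handling is sketched below: read the optional argument once, then dispatch. The stub functions only stand in for the real ones, the parameter names of `create_spadl` are guesses, and the "runs by default" flags mirror the `len(sys.argv) == 1` conditions in the snippet above rather than any decision by the maintainers.

```python
# Stubs standing in for the real functions in the download script.
def download_statsbomb_data():
    return "statsbomb"


def convert_statsbomb_data():
    return "convert-statsbomb"


def download_wyscout_data():
    return "wyscout"


def convert_wyscout_data():
    return "convert-wyscout"


def create_spadl(game_id, home_team_id):  # parameter names are hypothetical
    return "spadl"


# Step name -> (callable, whether it runs when no argument is given).
STEPS = {
    "statsbomb": (download_statsbomb_data, True),
    "convert-statsbomb": (convert_statsbomb_data, False),
    "wyscout": (download_wyscout_data, True),
    "convert-wyscout": (convert_wyscout_data, False),
    "spadl": (lambda: create_spadl(8657, 777), True),
}


def main(argv):
    """Run the requested step, or the default steps when none is given."""
    # Reading argv[1] once avoids the IndexError of the original checks.
    target = argv[1] if len(argv) > 1 else None
    if target is not None and target not in STEPS:
        raise SystemExit(f"unknown step: {target}")
    return [
        func()
        for name, (func, runs_by_default) in STEPS.items()
        if name == target or (target is None and runs_by_default)
    ]


# The real script would pass sys.argv instead of these literal lists.
print(main(["download.py"]))
print(main(["download.py", "convert-wyscout"]))
```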
Hi @probberechts , where can I find the EXTRA-load-and-convert-wyscout-data.ipynb notebook? The link below is no longer working. Thank you
@agdhruv I've added a notebook which downloads and converts the Wyscout dataset to the SPADL format in the wyscout_support branch. If you use it, you should be aware that there are still some bugs in the Wyscout converter (see other issues).
Originally posted by @probberechts in #14 (comment)
I believe there is information leakage in 3-estimate-scoring-and-conceding-probabilities.ipynb when using the result of action a0 or the end location of action a0 as a feature to predict scoring when a0 is of type shot (including freekicks and penalties).
The _get_minutes_played method uses the timestamps of events in a single game to determine the length of that game and thus how long players played.
socceraction/socceraction/data/wyscout/loader.py
Lines 726 to 733 in e7bc0d0
The players method in the PublicWyscoutLoader, however, passes all the events of the competition the game belongs to, instead of only the events of the game itself.
socceraction/socceraction/data/wyscout/loader.py
Lines 286 to 290 in e7bc0d0
This causes all games in the same competition to have the same length and the minutes played for all players to be wrong. As a temporary solution, replacing line 290 with the following 2 lines fixes the issue (but I don't know if it's the most efficient way to do it):
match_events = filter(lambda event: event['matchId'] == game_id, self.get(path_events))
mp = _get_minutes_played(lineups, cast(List[Dict[str, Any]], match_events))
Hello, can you please explain a little better how the OptaLoader with Whoscored works?
What is the feeds dict() format?
dict_opta = {
    'whoscored': "PremierLeague-2020_2021\\1485314.json"
}
datafolder = "..\data\Premier_League-2020_2021"
SBL = opta.OptaLoader(root=datafolder, feeds=dict_opta, parser='whoscored')
I tried this quick test but I don't think I am doing it correctly since I am not getting any competitions from competitions = SBL.competitions()
Can you give me a quick example on how to load the whoscored json?
Kind regards
(I'm using statsbomb data, from a glimpse it looks like the opta and wyscout spadl conversion scripts should suffer the same bug)
https://github.com/ML-KULeuven/socceraction/blob/master/socceraction/spadl/statsbomb.py#L406
When selecting events to add dribbles in between, there is no check that the two events take place in the same period (half) of the game. Given that the time in seconds reverts back to 0(ish) for the next event, the difference in times is negative (and so less than 10).
Therefore, the first event of the next half can end up being a dribble between the last event of the preceding half and the pass that starts the following half.
A simple check for
same_period = actions.period_id == next_actions.period_id
should resolve this
I'm on holiday at the moment but will create a pull request to solve this in the next few days
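The effect of the proposed same_period check can be seen on a small made-up frame. The 10-second threshold follows the description above; the data and the other variable names are illustrative and do not reproduce the converter's full dribble logic.

```python
import pandas as pd

# Two actions at the end of the first half, two at the start of the second.
actions = pd.DataFrame({
    "period_id": [1, 1, 2, 2],
    "time_seconds": [2700.0, 2705.0, 1.0, 4.0],
})
next_actions = actions.shift(-1)

# Time gap alone: the jump from the end of one half to the start of the next
# gives a negative difference, which passes a "less than 10 s" test.
dt = next_actions.time_seconds - actions.time_seconds
close_in_time = dt < 10

# Additionally require both actions to belong to the same period.
same_period = actions.period_id == next_actions.period_id
dribble_candidate = close_in_time & same_period

print(close_in_time.tolist())
print(dribble_candidate.tolist())
```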