Lately, I’ve also been looking to improve my Python, and this seemed like a nice opportunity to do so. My idea was to compare picks based on our gut instincts with those of some simple “bots” (for fun, I’m using the term loosely). I wanted to develop a more complex prediction model, but I’ll leave that for another time since I’m already short on time (the season starts today!).[2] Can we do better than these very simple bots? I’ll use this post to introduce the bots, and then write up a little retrospective at the end of the season. More interestingly, we might learn whether these bots were too aggressive or too conservative and, if I get around to adding a fourth bot that uses more information,[3] how much we gain from that extra information.
Before we turn to each of the three bots, let’s set up some preliminaries: importing the required libraries, setting a random number seed (for reproducibility and accountability), reading in the results from the 2021-2022 Premier League season, cleaning up that data frame, and replacing the relegated teams with the newly promoted teams.[4]
    import random

    import numpy as np
    import pandas as pd
    import wikipedia as wp

    # seed both generators: the bots use `random` and `np.random`
    random.seed(4082022)
    np.random.seed(4082022)

    # read webpage
    html = wp.page('2021–22_Premier_League').html().encode('UTF-8')

    # take the league table into a data frame; the original table index was
    # lost, so selecting the table by its 'Pld' header is an assumption
    df = pd.read_html(html, match='Pld')[0]

    # select only relevant columns
    df = df[['Team', 'Pld', 'W', 'D', 'L', 'GF', 'GA']].copy()

    # strip the champion "(C)" and relegated "(R)" markers from team names
    df['Team'] = df['Team'].str.replace(r' \((C|R)\)', '', regex=True)

    # calculate win percentage
    df['W_pct'] = df['W'] / df['Pld']

    # calculate goals per game
    df['gpg'] = df['GF'] / df['Pld']

    # promoted teams
    promoted_teams = ['Fulham', 'Bournemouth', 'Nottingham Forest']

    # replace relegated teams (the bottom three rows of the table)
    for i, team in enumerate(promoted_teams):
        df.loc[df.index[17 + i], 'Team'] = team
Now let’s go around the room and introduce our three bots. The first bot simply picks a result at random for each matchup. If you’re doing worse than this, that’s bad news for you.
    # Bot 1: pick a win, draw, or loss for Team1 uniformly at random
    def simple_bot(Team1, Team2):
        result = random.choice(['W', 'D', 'L'])
        if result == 'W':
            result = Team1
        elif result == 'L':
            result = Team2
        return result
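To put a number on that baseline: a picker that chooses uniformly among three outcomes is right one third of the time in expectation, whatever the true mix of wins, draws, and losses. The small sketch below checks this. The outcome frequencies are made-up illustrative values, not real league data, and `expected_accuracy` is a hypothetical helper name.

```python
def expected_accuracy(pick_probs, outcome_probs):
    """Chance that an independent random pick matches the actual outcome."""
    return sum(pick_probs[o] * outcome_probs[o] for o in outcome_probs)

# Bot 1 picks each outcome with probability 1/3.
uniform_picks = {'W': 1 / 3, 'D': 1 / 3, 'L': 1 / 3}

# Illustrative (made-up) outcome frequencies; any values summing to 1 work.
outcomes = {'W': 0.45, 'D': 0.25, 'L': 0.30}

baseline = expected_accuracy(uniform_picks, outcomes)
print(round(baseline, 3))
```

So if your gut picks are correct in fewer than about a third of matches, the random bot is beating you.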
The second bot takes each team’s win percentage from last season and draws from the uniform distribution (bounded by 0 and 1) to decide whether each team earns a “point”; the result then follows from the points. For example, suppose Team1 won 70% of their matches in 2021-2022, while Team2 won 50% of theirs. We would then draw two random numbers uniformly between 0 and 1. If Team1’s random number is less than 0.70, they get a point. If Team2’s random number is less than 0.50, they get a point (notice how the better team is more likely to get a point). Now, if Team1 has a point and Team2 does not, then Team1 “wins”, and vice versa. If both teams have a point, or neither does, the result is a draw.
I run this procedure for a single draw only, to mimic the randomness associated with a soccer match. Doing this in expectation would just be equivalent to the team with the better win percentage always winning, which of course doesn’t always happen (though it would be another interesting comparison to make). However, I have a feeling this bot is going to be too conservative (and overestimate the number of matches that end in a draw), and I would also be curious how many random draws per match (samples from the distribution) minimize error. This would tell us something interesting about the variance associated with these matches.
    # Bot 2: each team earns a "point" with probability equal to
    # last season's win percentage
    def win_pct_bot(Team1, Team2):
        w_pct1 = df.loc[df['Team'] == Team1, 'W_pct'].iloc[0]
        w_pct2 = df.loc[df['Team'] == Team2, 'W_pct'].iloc[0]
        team1_point = 1 if random.uniform(a=0, b=1) < w_pct1 else 0
        team2_point = 1 if random.uniform(a=0, b=1) < w_pct2 else 0
        if team1_point > team2_point:
            result = Team1
        elif team1_point == team2_point:
            result = 'D'
        else:
            result = Team2
        return result
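Because the two uniform draws are independent, the second bot’s outcome probabilities can be worked out exactly rather than simulated. Here is a minimal sketch of that arithmetic, assuming (as in the code above) that equal points, whether both teams or neither team earns one, count as a draw. The function name `win_pct_outcome_probs` is my own and not part of the bots.

```python
def win_pct_outcome_probs(w_pct1, w_pct2):
    """Exact outcome probabilities for the win-percentage bot.

    Each team independently earns a point with probability equal to its
    win percentage; equal points (both or neither) count as a draw.
    """
    p_team1 = w_pct1 * (1 - w_pct2)
    p_team2 = (1 - w_pct1) * w_pct2
    p_draw = w_pct1 * w_pct2 + (1 - w_pct1) * (1 - w_pct2)
    return {'Team1': p_team1, 'D': p_draw, 'Team2': p_team2}

# The 70% vs. 50% example from the text.
probs = win_pct_outcome_probs(0.70, 0.50)
for outcome, p in probs.items():
    print(f"{outcome}: {p:.2f}")
```

For the 70% versus 50% example this gives a 35% chance of picking Team1, 15% for Team2, and 50% for a draw, which is consistent with the worry that this bot predicts too many draws.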
The third bot uses each team’s goals scored per game in the previous season as the rate parameter of a Poisson distribution to simulate the number of goals each team scores in the upcoming matchup. The result is then determined in the same manner as for any soccer match: whoever scores more goals wins, and equal scores mean a draw.
This comes with the same caveat I gave for the second bot: performing this procedure in expectation would lead to a prediction of the team with the higher goals per game in the previous season always winning. In a similar manner, it will be interesting to compare this bot to the results in expectation, and to see how many random draws per match minimize error. Another upgrade to the bot would be to consider defensive prowess from the previous season: one could also draw from another Poisson distribution with the average number of goals conceded in the previous season and add that to the opposing team’s score. While this would probably overestimate the total number of goals scored (since it’s essentially double counting), it may lead to more accurate predictions.
    # Bot 3: simulate each team's goals from a Poisson distribution with
    # rate equal to last season's goals per game
    def pois_bot(Team1, Team2):
        gpg1 = df.loc[df['Team'] == Team1, 'gpg'].iloc[0]
        gpg2 = df.loc[df['Team'] == Team2, 'gpg'].iloc[0]
        team1_goals = np.random.poisson(lam=gpg1)
        team2_goals = np.random.poisson(lam=gpg2)
        if team1_goals > team2_goals:
            result = Team1
        elif team1_goals == team2_goals:
            result = 'D'
        else:
            result = Team2
        return result
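The defensive upgrade mentioned above could look something like the sketch below: each team’s score is the sum of one Poisson draw at its own scoring rate and one at the opponent’s conceding rate. This is only one way to read that idea, not something I have run on the real table. The name `pois_def_bot`, the `gapg` (goals against per game) lookup, and the per-game rates are all hypothetical, and the rates are made-up numbers for two placeholder teams.

```python
import numpy as np

def pois_def_bot(team1, team2, gpg, gapg, rng):
    """Poisson bot that also draws from each opponent's goals conceded."""
    # Goals from a team's own attack plus goals "given up" by the
    # opponent's defence (this double counts, as noted in the text).
    team1_goals = rng.poisson(gpg[team1]) + rng.poisson(gapg[team2])
    team2_goals = rng.poisson(gpg[team2]) + rng.poisson(gapg[team1])
    if team1_goals > team2_goals:
        return team1
    if team1_goals < team2_goals:
        return team2
    return 'D'

# Made-up per-game rates for two placeholder teams (not real data).
gpg = {'Team A': 2.0, 'Team B': 1.0}    # goals scored per game
gapg = {'Team A': 0.8, 'Team B': 1.5}   # goals conceded per game

rng = np.random.default_rng(4082022)
print(pois_def_bot('Team A', 'Team B', gpg, gapg, rng))
```

Passing the generator in explicitly keeps the sketch reproducible without relying on NumPy’s global random state.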
Coding and writing this up has been a pleasant experience and made me think quite a bit about predictions in soccer. Perhaps I will look into all these extra pieces ex post,[5] but the season is almost upon us, so I wanted to get this out before the first match. Consider this a pre-registration of sorts.
[1] That said, I don’t like Manchester United, and in an individual match, I’m typically inclined to root for whichever team has an Argentinian, Colombian, or American player I like. Currently I’m all in on Cristian Romero, who I think is, finally, the solution to Argentina’s defensive woes.
[2] Oh, and I have a dissertation to work on… right.
[3] Maybe transfer spending, team salaries, or advanced stats like expected goals.
[4] I did this by “order” of the relegated and promoted teams respectively, so Championship winners Fulham replaced the best-performing relegated team, Burnley, and so on.
[5] One other point that comes to mind is that the existing bots that use data from the previous season would probably be improved if the parameters were updated each week using this season’s matches. We’ll see!