This Jupyter notebook describes the nflmodel Python package, which can be used to predict the full probability distribution of NFL point-spread and point-total outcomes.
I describe the theory behind the algorithm at length in an arXiv pre-print, and you can also read about it on the MELO sphinx documentation page. The purpose of this blog post is not to rehash the theory behind the model, but rather to demonstrate how to use it effectively in practice.
The nflmodel package requires Python 3 and is intended to be used through a command line interface. I've tested the package on Arch Linux and OSX, but not on Windows.
The model also requires an active internet connection to pull in the latest game data and schedule information. I'm working to enable offline capabilities, but this does not exist at the moment.
Navigate to the parent directory where you want to save the nflmodel package source, then run the following from the command line to install.
!git clone https://github.com/morelandjs/nfl-model.git nflmodel
!pip3 install --user nflmodel/.
Cloning into 'nflmodel'...
remote: Enumerating objects: 199, done.
remote: Counting objects: 100% (199/199), done.
remote: Compressing objects: 100% (125/125), done.
remote: Total 316 (delta 101), reused 146 (delta 59), pack-reused 117
...
Successfully built nflmodel
Installing collected packages: nflmodel
Successfully installed nflmodel-0.1
After installing the nflmodel package, you'll need to populate the database of NFL game data using the update command. Since this is presumably your first time running the package, it will download all available games dating back to the 2009 season.

!nflmodel update
[INFO][data] updating season 2009 week 1
[INFO][data] updating season 2009 week 2
...
[INFO][data] updating season 2019 week 17
Subsequent calls to nflmodel update will incrementally refresh the database, pulling in all new game data since the last update. If at any point your database becomes corrupted, you can rebuild it from scratch using the optional rebuild flag.
Now that we've populated the database, let's inspect some of the game data contained within.

!nflmodel games
                datetime  season  week  ... team_away       qb_away score_away
0    2009-09-10 20:30:00    2009     1  ...       TEN     K.Collins         10
1    2009-09-13 13:00:00    2009     1  ...       MIA  C.Pennington          7
2    2009-09-13 13:00:00    2009     1  ...        KC      B.Croyle         24
3    2009-09-13 13:00:00    2009     1  ...       PHI      D.McNabb         38
4    2009-09-13 13:00:00    2009     1  ...       DEN       K.Orton         12
...                  ...     ...   ...  ...       ...           ...        ...
2811 2019-12-29 13:00:00    2019    17  ...       PHI       C.Wentz         34
2812 2019-12-29 13:00:00    2019    17  ...       ATL        M.Ryan         28
2813 2019-12-29 16:25:00    2019    17  ...       OAK        D.Carr         15
2814 2019-12-29 16:25:00    2019    17  ...       ARI      K.Murray         24
2815 2019-12-29 16:25:00    2019    17  ...        SF   J.Garoppolo         26

[2816 rows x 9 columns]
Each row corresponds to a single game, and each column is a certain attribute of that game, such as the date, season, week, home and away teams, starting quarterbacks, and final scores.
The nflmodel games runner prints a pandas dataframe, so it may hide columns in order to fit the width of your terminal window. If you want to see more column output, try enlarging your terminal window.
If you want to see more row output, use the --head and --tail options. For example,
!nflmodel games --head 3
             datetime  season  week  ... team_away       qb_away score_away
0 2009-09-10 20:30:00    2009     1  ...       TEN     K.Collins         10
1 2009-09-13 13:00:00    2009     1  ...       MIA  C.Pennington          7
2 2009-09-13 13:00:00    2009     1  ...        KC      B.Croyle         24

[3 rows x 9 columns]
The games dataframe is sorted from oldest to most recent, so --head N returns the N oldest games in the database and --tail N returns the N most recent games. If you want to print the entire dataset to standard out, just set --tail to a very large number.
The model predictions depend on a handful of hyperparameter values that are unknown a priori. These hyperparameters are:
Home field advantage is accounted for naturally by the MELO model constructor, so it is not included in the above list.
The hyperparameter values are calibrated by the following command (this will take several minutes, so grab a drink).
!nflmodel calibrate --steps 200
[INFO][model] calibrating spread hyperparameters
[INFO][data] updating season 2019 week 17
[INFO][tpe] tpe_transform took 0.001760 seconds
[INFO][tpe] TPE using 0 trials
...
[INFO][tpe] tpe_transform took 0.001731 seconds
[INFO][tpe] TPE using 199/199 trials with best loss 10.373855
[INFO][model] caching spread model to /home/morelandjs/.local/share/nflmodel/spread.pkl
[INFO][model] calibrating total hyperparameters
[INFO][data] updating season 2019 week 17
[INFO][tpe] tpe_transform took 0.001551 seconds
[INFO][tpe] TPE using 0 trials
...
[INFO][tpe] tpe_transform took 0.001817 seconds
[INFO][tpe] TPE using 199/199 trials with best loss 10.694746
[INFO][model] caching total model to /home/morelandjs/.local/share/nflmodel/total.pkl
Here the argument --steps (default value 100) specifies the number of calibration steps used by hyperopt to optimize the hyperparameter values. I've generally found that around 200 calibration steps is sufficient, but the convergence is already very good after only 100 steps.
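For intuition: each calibration step proposes a candidate set of hyperparameter values and scores it against the historical games. hyperopt's TPE algorithm makes the proposals adaptively, but the outer loop can be sketched with a plain random search. Everything below is illustrative: the loss function, parameter names, and ranges are made up, standing in for the model's real objective (its prediction error over historical games).

```python
import random

def loss(k, regress):
    """Stand-in objective with a known optimum at k=0.25, regress=0.6.
    The real objective scores hyperparameters by the model's mean
    absolute prediction error over historical games."""
    return (k - 0.25) ** 2 + (regress - 0.6) ** 2

def calibrate(steps, seed=0):
    """Plain random search over the hyperparameter box [0, 1]^2.
    hyperopt's TPE replaces the uniform sampling with an adaptive
    proposal distribution, but the outer loop looks the same."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(steps):
        params = {"k": rng.uniform(0, 1), "regress": rng.uniform(0, 1)}
        score = loss(**params)
        if score < best_loss:
            best_params, best_loss = params, score
    return best_params, best_loss

params, best = calibrate(steps=200)
print(params, best)
```

More steps can only improve the best loss found, which is why the marginal benefit beyond 100–200 steps is small.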
The MELO model is actually very fast to train once the hyperparameter values are known, so the default behavior is to retrain the model every time it is queried from the CLI. This ensures that the model predictions are always up to date.
Once calibrated, the nflmodel package can forecast various metrics for a given NFL season and week. For instance, the command
!nflmodel forecast 2019 10
[INFO][nflmodel] Forecast for season 2019 week 10

            favorite  underdog  win prob  spread  total
date
2019-11-10       @NO       ATL      0.91    -9.9   47.6
2019-11-10       BAL      @CIN      0.91    -9.2   45.7
2019-11-10      @IND       MIA      0.71    -5.2   47.7
2019-11-11       @SF       SEA      0.67    -4.4   48.9
2019-11-10      @DAL       MIN      0.54    -3.0   45.0
2019-11-10       BUF      @CLE      0.57    -2.8   42.0
2019-11-10        KC      @TEN      0.58    -2.7   48.6
2019-11-10      @CHI       DET      0.61    -1.8   44.7
2019-11-07      @OAK       LAC      0.54    -1.7   43.2
2019-11-10       @GB       CAR      0.53    -1.4   53.6
2019-11-10      @NYJ       NYG      0.53    -1.3   46.9
2019-11-10       ARI       @TB      0.53    -0.6   50.9
2019-11-10        LA      @PIT      0.51    -0.3   44.9

*win probability and spread are for the favored team
will generate predictions for the 10th week of the 2019 NFL season, using all available game data prior to the start of that week. If you do not provide the year and week, the code will try to infer the upcoming week based on the current date and known NFL season schedule.
The output is structured relative to the favorite in each matchup, with home teams indicated by an '@' symbol. For example, the forecast output above says that NO is playing at home versus ATL, where they have a 91% win probability, are favored by 9.9 points, and the two teams are predicted to combine for 47.6 total points. The games are also sorted so that the most lopsided matchups appear first and the most even matchups appear last.
The model will also provide team rankings at a given moment in time. For example, suppose I want to rank every team at precisely datetime = 2019-11-11T20:23:06. I can issue the command
!nflmodel rank --datetime "2019-11-11T20:23:06"
[INFO][nflmodel] Rankings as of 2019-11-11T20:23:06

rank   win prob        spread          total
   1   BAL 0.78  │  NE  -10.5  │  TB   50.1
   2   NE  0.77  │  BAL  -7.7  │  KC   50.1
   3   SEA 0.75  │  SF   -6.3  │  NYG  48.3
   4   SF  0.72  │  MIN  -5.8  │  BAL  48.0
   5   NO  0.71  │  HOU  -4.5  │  SEA  47.9
   6   HOU 0.69  │  LA   -4.5  │  CAR  47.6
   7   MIN 0.68  │  DAL  -4.2  │  ARI  46.8
   8   GB  0.67  │  NO   -4.1  │  ATL  46.7
   9   PIT 0.64  │  SEA  -3.2  │  OAK  46.6
  10   CAR 0.62  │  PIT  -3.0  │  PHI  46.5
  11   PHI 0.61  │  KC   -2.8  │  DET  46.4
  12   LA  0.60  │  PHI  -2.8  │  GB   46.0
  13   KC  0.57  │  GB   -2.6  │  HOU  45.9
  14   DAL 0.55  │  CAR  -2.5  │  SF   45.8
  15   OAK 0.54  │  LAC  -2.0  │  CIN  45.6
  16   CHI 0.54  │  CHI  -2.0  │  NO   45.5
  17   TEN 0.53  │  BUF  -0.7  │  LA   45.4
  18   JAX 0.51  │  JAX  -0.2  │  DAL  45.3
  19   BUF 0.50  │  DEN  -0.2  │  CLE  45.2
  20   DEN 0.48  │  CLE   1.1  │  NYJ  45.0
  21   LAC 0.48  │  ATL   1.2  │  IND  45.0
  22   CLE 0.47  │  IND   1.4  │  PIT  44.8
  23   ATL 0.43  │  TEN   1.4  │  MIA  44.8
  24   DET 0.41  │  DET   1.4  │  MIN  44.2
  25   TB  0.39  │  TB    1.7  │  NE   44.0
  26   ARI 0.38  │  OAK   3.0  │  LAC  43.4
  27   IND 0.38  │  ARI   3.1  │  JAX  43.0
  28   MIA 0.38  │  NYG   4.5  │  TEN  43.0
  29   NYJ 0.33  │  WAS   4.7  │  WAS  42.3
  30   WAS 0.33  │  CIN   4.8  │  BUF  42.2
  31   CIN 0.31  │  NYJ   5.4  │  CHI  41.7
  32   NYG 0.28  │  MIA   5.9  │  DEN  41.6

*expected performance against league average opponent on a neutral field
and see how the model thinks the teams should be ranked at that moment in time, according to their predicted performance against a league average opponent on a neutral field. The far left column is each team's predicted win probability, the middle column is its predicted Vegas point spread, and the far right column is its predicted point total.
The table above, for instance, tells me that BAL is most likely to win a generic matchup, while NE is most likely to blow out their opponent, and TB is the most likely to find itself in a shootout.
One attractive feature of the MELO base estimator is that it predicts full probability distributions. This enables the model to do all sorts of cool things, like predict interquartile ranges for the point spread or draw samples from a matchup's point total distribution. Most notably, it means the model can estimate the probability that the point spread (or point total) lands above or below a given line, which in turn lets it evaluate the profitability of bets against various point spread and point total lines.
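To make that concrete, here is a minimal sketch of how a cover probability falls out of a predictive distribution. The normal shape, the mean, the scale, and the line below are all invented for illustration; they are not the model's actual parameterization.

```python
from statistics import NormalDist

# Hypothetical predictive distribution over the home-minus-away margin.
margin = NormalDist(mu=-7.0, sigma=13.0)  # away team favored by 7

line = -9.5  # point spread quoted for the away favorite

# P(away team covers) = P(margin < line), i.e. they win by more than 9.5.
cover_prob = margin.cdf(line)

# P(away team wins outright) = P(margin < 0).
win_prob = margin.cdf(0.0)

print(round(cover_prob, 3), round(win_prob, 3))
```

Win probability, cover probability, and any other line-based quantity are all just integrals of the same predicted distribution.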
This capability is accessed using the nflmodel predict entry point. For example, suppose you want to analyze the game 2019-12-01 CLE at MIA with the following betting profile:
| team | spread line  | total line     |
|------|--------------|----------------|
| CLE  | -9.5 (-110)  | O 46.5 (-105)  |
| MIA  | +9.5 (-110)  | U 46.5 (-115)  |
This is accomplished by calling the nflmodel predict runner with the following arguments.
!nflmodel predict "2019-12-01" CLE MIA --spread -110 -110 9.5 --total -105 -115 46.5
[INFO][nflmodel] 2019-12-01T00:00:00 CLE at MIA

               away    home
team            CLE     MIA
win prob        68%     32%
spread        -12.6    12.6
total          44.2    44.2
score            28      16
spread cover    59%     41%
spread return   14%    -24%

               over   under
total cover     45%     55%
total return   -13%      4%

*actual return rate lower than predicted
There's a lot of information here, so let's unpack what it means. First, the model believes CLE is a heavy favorite on the road: their predicted win probability is 68%, their predicted point spread is -12.6 points, and the predicted point total is 44.2 points. This corresponds to a characteristic final score of 28-16, CLE over MIA, with fairly large uncertainties.
The model also provides input on the Vegas point spread and point total lines. It expects CLE to cover their point spread line 59% of the time, which is good for a 14% return on investment after accounting for the house cut. Conversely, betting on MIA is expected to net a 24% loss on average.
The over/under metrics are reported in a similar fashion. The model thinks there's a 55% chance the total score goes under the published Vegas line, which is good for a 4% return on average. Similarly, taking the over is expected to net a 13% loss.
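These return figures follow directly from the cover probabilities and the quoted odds. As a quick sanity check, assuming the standard American-odds convention (risk 110 to win 100 at -110):

```python
def expected_return(cover_prob, american_odds):
    """Expected profit per unit staked at negative American odds,
    e.g. -110 means risking 110 to win 100."""
    profit = 100 / abs(american_odds)  # profit per unit risked
    return cover_prob * profit - (1 - cover_prob)

# CLE covering 59% of the time at -110, per the table above.
print(round(expected_return(0.59, -110), 3))
```

This gives roughly 13% for CLE, close to the quoted 14%; the small gap is consistent with the displayed cover probability being rounded to the nearest percent. The break-even cover rate at -110 is 11/21, or about 52.4%, which is why a 59% edge is so valuable.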
While it has not been readily apparent up to this point, the model accounts for changes at the quarterback position. It does this by tracking ratings at both the team and quarterback level.
For example, suppose Tom Brady were the only quarterback that ever played for the Patriots. Then there would be two associated ratings, one for T.Brady and one for NE, which are effectively identical. If however, Tom Brady left NE and went to play for SF for the last two years of his career, his rating would diverge from NE's rating and begin to track the performance of SF.
I take a weighted average of the QB-level and team-level ratings when generating the effective rating for each upcoming game. In this way, I mix together the historical performance of the team with that of its QB. The weighting is controlled by the qb_weight hyperparameter, which is fixed when calibrating the model.
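In code, that blending is just a convex combination. The rating values below are hypothetical numbers, with qb_weight standing in for the calibrated hyperparameter:

```python
def effective_rating(team_rating, qb_rating, qb_weight):
    """Convex combination of team-level and QB-level ratings, with
    qb_weight controlling how much the quarterback matters."""
    return qb_weight * qb_rating + (1 - qb_weight) * team_rating

# Hypothetical numbers: a backup QB rated well below his team.
print(effective_rating(team_rating=5.0, qb_rating=1.0, qb_weight=0.5))  # 3.0
```

At qb_weight = 0 the quarterback is ignored entirely, and at qb_weight = 1 the team's history is ignored; calibration picks the mix that predicts best.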
The model has no direct knowledge of QB injuries, so you'll have to explicitly tell the model to generate predictions with a different quarterback if that's what you intend to do. At the moment, the only runner that can accommodate QB injuries is the nflmodel predict runner.
Suppose, for example, you want to see how CLE would perform on the road against PHI on 2019-12-01 if Carson Wentz went down with an injury in practice before the game. First, let's see how the two teams would match up if Wentz never got hurt.
!nflmodel predict "2019-12-01" CLE-B.Mayfield PHI-C.Wentz
[INFO][nflmodel] 2019-12-01T00:00:00 CLE-B.Mayfield at PHI-C.Wentz

                     away          home
team       CLE-B.Mayfield   PHI-C.Wentz
win prob              45%           55%
spread                4.4          -4.4
total                44.3          44.3
score                  20            24

*actual return rate lower than predicted
Notice that the predictions are exactly the same if I omit the quarterback suffixes.
!nflmodel predict "2019-12-01" CLE PHI
[INFO][nflmodel] 2019-12-01T00:00:00 CLE at PHI

           away   home
team        CLE    PHI
win prob    45%    55%
spread      4.4   -4.4
total      44.3   44.3
score        20     24

*actual return rate lower than predicted
This is because both Baker Mayfield and Carson Wentz played in the game preceding the specified date, so the model assumes they are the starting QBs for upcoming games by default.
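That default-starter inference amounts to "whoever started the team's most recent game before the prediction date." A toy sketch of the idea over a few hypothetical rows (the real package pulls this from its game database):

```python
def default_starters(games, as_of):
    """Infer each team's default QB as the starter of its most recent
    game before the given date. `games` holds (date, team, qb) tuples."""
    starters = {}
    for date, team, qb in sorted(games):  # ISO dates sort chronologically
        if date < as_of:
            starters[team] = qb  # later games overwrite earlier ones
    return starters

games = [
    ("2019-11-17", "PHI", "C.Wentz"),
    ("2019-11-24", "PHI", "C.Wentz"),
    ("2019-11-24", "CLE", "B.Mayfield"),
]
print(default_starters(games, as_of="2019-12-01"))
```

Passing an explicit TEAM-QB suffix on the command line simply overrides this lookup.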
Suppose now that Wentz got hurt. We can specify his backup QB, Josh McCown, to see how that would affect the model predictions.
!nflmodel predict "2019-12-01" CLE PHI-J.McCown
[INFO][nflmodel] 2019-12-01T00:00:00 CLE at PHI-J.McCown

             away           home
team          CLE   PHI-J.McCown
win prob      53%            47%
spread       -2.8            2.8
total        44.3           44.3
score          24             21

*actual return rate lower than predicted
This creates roughly a 7-point swing in the point spread, and CLE is now the favorite. In fairness to Josh McCown, the magnitude of this shift is not purely a statement about which QB is better. It also accounts for the fact that PHI has not been game planning for McCown, and the offense is not built around him.
The nflmodel package includes a command line runner to validate the predictions of the model. If the model predictions are statistically robust, then the distribution of standardized residuals will be unit normal and the distribution of residual quantiles will be uniform.
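On synthetic data, those two checks look like the sketch below. The sigma value and the simulated games are invented, but any model that reports its uncertainties correctly should behave the same way:

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(42)
sigma = 13.0  # hypothetical reported uncertainty on each prediction

# Simulate outcomes from a model whose uncertainty estimate is correct.
predicted = [rng.uniform(-10.0, 10.0) for _ in range(5000)]
observed = [p + rng.gauss(0.0, sigma) for p in predicted]

# Standardized residuals should look like draws from a unit normal...
z = [(o - p) / sigma for p, o in zip(predicted, observed)]

# ...and residual quantiles should look uniform on [0, 1].
q = [NormalDist(p, sigma).cdf(o) for p, o in zip(predicted, observed)]

print(round(mean(z), 2), round(stdev(z), 2), round(mean(q), 2))
```

The validation runner applies the same idea to the model's real historical predictions.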
This is a very powerful test of the model's veracity, but it does not necessarily mean the model is accurate. Rather, it tests whether the model is correctly reporting its own uncertainties. To quantify the model's accuracy, nflmodel validate also reports the model's mean absolute prediction error for seasons 2011 through present.

!nflmodel validate
[INFO][validate] spread residual mean: 0.08
[INFO][validate] spread residual mean absolute error: 10.37
[INFO][validate] total residual mean: 0.26
[INFO][validate] total residual mean absolute error: 10.69
This will produce two figures in the current working directory, one of which is validate_total.png. For example, the point spread validation figure is shown below.
The model's point spreads have a mean absolute error of 10.37 points and its point totals have a mean absolute error of 10.69 points. The table below compares the model's mean absolute errors to Vegas for the specific season range 2011–2018.
|       | spread MAE | total MAE |
|-------|------------|-----------|
| MODEL | 10.35 pts  | 10.67 pts |
| VEGAS | 10.22 pts  | 10.49 pts |
The model is less accurate than Vegas, but these numbers are still promising considering that many factors, such as weather and personnel changes, remain missing from the model.