Methodology

Models

We will investigate two models for estimating within-game win probability:

  • The lifelines time-varying Cox proportional hazards model, and

  • An XGBoost regression model with the survival:cox objective function.

To fairly compare the models to each other and to the NBA's own win probability, we will split the dataset into two pieces: build (80%) and holdout (20%). The build dataset will be used for model training and hyperparameter tuning, while the holdout dataset will be reserved for comparing the models. The build dataset will be split further for each model:

| Model     | Build dataset | Proportion of build (total) | Description                          |
|-----------|---------------|-----------------------------|--------------------------------------|
| lifelines | Train         | 75% (60%)                   | Model training data.                 |
| lifelines | Tune          | 25% (20%)                   | Hyperparameter tuning data.          |
| xgboost   | Train         | 75% (60%)                   | Model training data.                 |
| xgboost   | Stopping/Tune | 25% (20%)                   | Data for tuning and early stopping.* |

The datasets will be stratified by season and by the target to ensure that the models are being built on representative data.

\* We will use early stopping to determine the number of boosting rounds for the model.
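
To make the scheme concrete, here is a minimal sketch of the split using scikit-learn; the `games` DataFrame and its `season` and `target` columns are hypothetical names for illustration.

```python
# A minimal sketch of the splitting scheme, assuming a DataFrame `games`
# with one row per game and hypothetical `season` and `target` columns.
from sklearn.model_selection import train_test_split

# Stratify on season x target so every split sees representative data.
strata = games["season"].astype(str) + "_" + games["target"].astype(str)

# 80/20 build/holdout, then 75/25 train/tune within build (60/20 overall).
build, holdout = train_test_split(
    games, test_size=0.20, stratify=strata, random_state=0
)
train, tune = train_test_split(
    build, test_size=0.25, stratify=strata.loc[build.index], random_state=0
)
```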

Hyperparameter tuning

Both models have hyperparameters that we will tune using hyperopt.
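
The tuning loop follows hyperopt's standard `fmin` pattern, sketched below with the lifelines search space (defined in the next section) for concreteness; `evaluate` is a hypothetical objective that fits a model on the train split and returns its loss on the tune split.

```python
# A minimal sketch of the hyperopt tuning loop used for both models.
from hyperopt import Trials, fmin, hp, tpe

# `evaluate` (hypothetical) fits a model with the sampled hyperparameters
# on the train split and returns its loss on the tune split.
space = {
    "penalizer": hp.uniform("penalizer", 0, 1),
    "l1_ratio": hp.uniform("l1_ratio", 0, 1),
}

trials = Trials()
best = fmin(fn=evaluate, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)
```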

Lifelines

We will use the following hyperparameter search space for the lifelines model.

| Hyperparameter | Search space   |
|----------------|----------------|
| penalizer      | \(Unif(0, 1)\) |
| l1_ratio       | \(Unif(0, 1)\) |

To make the most of the tuning budget, we narrowed the search to the following space after some initial trials:

| Hyperparameter | Search space         |
|----------------|----------------------|
| penalizer      | \(Unif(0.05, 0.15)\) |
| l1_ratio       | \(Unif(0, 0.015)\)   |
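
For reference, fitting the tuned model looks roughly like the sketch below; the long-format DataFrame and its column names are illustrative assumptions, and the hyperparameter values are simply points drawn from the narrowed space above.

```python
# A minimal sketch of fitting the lifelines time-varying Cox model.
from lifelines import CoxTimeVaryingFitter

ctv = CoxTimeVaryingFitter(penalizer=0.10, l1_ratio=0.01)
ctv.fit(
    train_long,            # long format: one row per game-state interval
    id_col="game_id",      # identifies the subject (a game)
    start_col="start",     # interval start, in game seconds
    stop_col="stop",       # interval stop, in game seconds
    event_col="home_win",  # whether the event occurred at `stop`
)
ctv.print_summary()
```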

XGBoost

Based on discussions with heytheredli, we started with the following space and then iteratively limited the range for each parameter to maximize the effectiveness of the search:

| Hyperparameter    | Search space            |
|-------------------|-------------------------|
| learning_rate     | \(Unif(0, 0.01)\)       |
| subsample         | \(Unif(0, 1)\)          |
| max_delta_step    | \(Unif(0, 1)\)          |
| max_depth         | \(QUnif(2, 15, 1)\)     |
| gamma             | \(Unif(0, 1)\)          |
| reg_alpha         | \(Unif(0, 1)\)          |
| reg_lambda        | \(Unif(0, 1)\)          |
| colsample_bytree  | \(Unif(0, 1)\)          |
| colsample_bylevel | \(Unif(0, 1)\)          |
| colsample_bynode  | \(Unif(0, 1)\)          |
| min_child_weight  | \(QUnif(100, 600, 10)\) |

After iterating, we settled on the following space:

| Hyperparameter    | Search space           |
|-------------------|------------------------|
| learning_rate     | \(Unif(0, 0.01)\)      |
| subsample         | \(Unif(0.4, 1)\)       |
| max_delta_step    | 1                      |
| max_depth         | 4                      |
| gamma             | \(Unif(0.5, 1)\)       |
| reg_alpha         | \(Unif(0.6, 1)\)       |
| reg_lambda        | \(Unif(0.25, 0.75)\)   |
| colsample_bytree  | \(Unif(0.5, 1)\)       |
| colsample_bylevel | 1                      |
| colsample_bynode  | \(Unif(0, 0.5)\)       |
| min_child_weight  | \(QUnif(510, 530, 1)\) |

We also added a monotonic constraint to ensure that the model output is monotonic in scoring margin.
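
A training call that puts these pieces together is sketched below; the feature matrices, the particular hyperparameter values (drawn from the narrowed space), and the assumption that scoring margin is the first of three features are all illustrative.

```python
# A minimal sketch of training the XGBoost survival model. Labels follow
# the survival:cox convention: survival time, negated when censored.
import xgboost as xgb

params = {
    "objective": "survival:cox",
    "learning_rate": 0.005,
    "subsample": 0.7,
    "max_delta_step": 1,
    "max_depth": 4,
    "gamma": 0.75,
    "reg_alpha": 0.8,
    "reg_lambda": 0.5,
    "colsample_bytree": 0.75,
    "colsample_bylevel": 1,
    "colsample_bynode": 0.25,
    "min_child_weight": 520,
    # One entry per feature; +1 forces the output to be non-decreasing in
    # the first feature (assumed here to be scoring margin).
    "monotone_constraints": "(1,0,0)",
}

dtrain = xgb.DMatrix(X_train, label=y_train)
dstop = xgb.DMatrix(X_stop, label=y_stop)  # stopping/tune split

booster = xgb.train(
    params,
    dtrain,
    num_boost_round=5000,
    evals=[(dstop, "stop")],
    early_stopping_rounds=50,  # determines the number of boosting rounds
)
```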

Calibration

We will use isotonic regression to calibrate the output probabilities from each model so that the outputs can be interpreted directly as win probabilities.
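
A minimal sketch of the calibration step, assuming `p_raw` holds a model's uncalibrated probabilities on the tune split and `won` is the corresponding binary outcome:

```python
# Fit a monotone mapping from raw scores to calibrated probabilities.
from sklearn.isotonic import IsotonicRegression

iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(p_raw, won)                    # learn mapping on the tune split
p_calibrated = iso.predict(p_holdout)  # apply to holdout predictions
```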

Model evaluation

We will compare each survival model with the NBA win probability output using AUROC. Specifically, we will generate a plot of the AUROC at each time step from 0 to 2880 seconds (48 minutes); this metric is based on a similar time-dependent AUC concept introduced in scikit-survival.
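
A minimal sketch of the per-time-step computation, assuming a holdout DataFrame with hypothetical `seconds`, `win_prob`, and `home_win` columns:

```python
# Compute AUROC of a win probability column at each second of the game.
from sklearn.metrics import roc_auc_score

def auroc_by_second(df, prob_col="win_prob"):
    out = {}
    for t in range(0, 2881):
        snap = df[df["seconds"] == t]
        if snap["home_win"].nunique() == 2:  # need both outcomes present
            out[t] = roc_auc_score(snap["home_win"], snap[prob_col])
    return out
```

Plotting these values for each model alongside the NBA's own estimates then shows how each model's discrimination evolves over the course of a game.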