Package Usage¶
Pulling data from the NBA API¶
For this model, we are pulling data from the public, undocumented NBA Stats API. The data loader is built on excellent work by swar and seemethere.
Python¶
Each endpoint has an associated class (see the API reference
for a complete list). Let’s focus on nbaspa.data.endpoints.boxscore.BoxScoreTraditional.
First, we need to initialize the class:
from nbaspa.data.endpoints import BoxScoreTraditional
box = BoxScoreTraditional(GameID="0021800001", output_dir="nba-data/2018-19")
In this call, we’re specifying the game identifier and the output directory for the downloaded data. Then, we can download the data:
box.get()
You should see a new file, nba-data/2018-19/boxscoretraditionalv2/data_0021800001.json. To
preview the data, you can choose a dataset from box.datasets and then select a
dataset type:
df = box.get_data("PlayerStats")
and that’s it!
Multiple calls¶
Calling an individual endpoint for every game in a season is… tiring. So, we made a factory class that will loop through multiple calls. First, define your list of calls.
calls = [
("BoxScoreTraditional", {"GameID": "0021800001"}),
("BoxScoreTraditional", {"GameID": "0021800002"})
]
Next, initialize nbaspa.data.factory.NBADataFactory and download the data:
from nbaspa.data.factory import NBADataFactory
factory = NBADataFactory(calls=calls, output_dir="nba-data/2018-19")
factory.get()
Important
We use ratelimit to prevent overloading the NBA API. The ratelimiting is very conservative and limits to 1 call every minute.
Command-line interface¶
The simplest way to retrieve the data used for this model build is to use the CLI.
$ nbaspa-download scoreboard --output-dir nba-data --season 2018-19
The call above will download the metadata for the 2018-19 NBA season. The data will be
saved to nba-data/2018-19. Next, we can download the player-level data, including
shooting dashboards with
$ nbaspa-download players --output-dir nba-data --season 2018-19
Again, this will download the data to nba-data/2018-19. Then, let’s download team
data,
$ nbaspa-download teams --output-dir nba-data --season 2018-19
Finally, we can download the game data:
$ nbaspa-download games --output-dir nba-data --season 2018-19
If you want to bundle these calls into a single CLI command, use the season endpoint:
$ nbaspa-download season --output-dir nba-data --season 2018-19
Cleaning data¶
Our prefect data cleaning pipeline iterates through all
games on a given day. The pipeline produces two types of data: model and rating. The
model dataset will be an input for the survival analysis model while the rating dataset
will be used for generating SPA ratings.
Python¶
To clean a given day in python,
from nbaspa.data.pipeline import gen_pipeline, run_pipeline
flow = gen_pipeline()
output = run_pipeline(
flow=flow,
data_dir="nba-data/2018-19",
output_dir="nba-data/2018-19",
save_data=True,
mode="model",
Season="2018-19",
GameDate="10/16/2018"
)
This flow will save each game as a CSV in nba-data/2018-19/model-data. To read the CSV back
into python,
import pandas as pd
df = pd.read_csv(
"nba-data/2018-19/model-data/data_0021800001.csv",
sep="|",
index_col=0,
dtype={"GAME_ID": str}
)
Command-line interface¶
As with downloading data, the CLI is the best way to clean data. For model data:
$ nbaspa-clean model --data-dir nba-data --output-dir nba-data --season 2018-19
and for ratings data:
$ nbaspa-clean rating --data-dir nba-data --output-dir nba-data --season 2018-19
Both of these calls will save data to nba-data/2018-19.
Training and evaluating the models¶
Important
The outputs for the following pipelines will be saved using Prefect checkpointing. For this to work you must set the following environment variable:
$ export PREFECT__FLOWS__CHECKPOINTING=true
Python¶
Building the datasets¶
To create the build and holdout CSV files,
from nbaspa.model.pipeline import gen_data_pipeline, run_pipeline
flow = gen_data_pipeline()
output = run_pipeline(
flow=flow, data_dir="nba-data", output_dir="nba-data", splits=(0.6, 0.2, 0.2), seed=42
)
This flow will save build.csv and holdout.csv to nba-data/models.
Training the models¶
To train a lifelines model,
from nbaspa.model.pipeline import gen_lifelines_pipeline, run_pipeline
flow = gen_lifelines_pipeline()
output = run_pipeline(
flow=flow, data_dir="nba-data", output_dir="nba-data", max_evals=5000, seed=42
)
If you ran the flow on 2021-02-21, the lifelines model artifacts will be saved to the
nba-data/models/2021-02-21/lifelines folder. To train a xgboost model,
from nbaspa.model.pipeline import gen_xgboost_pipeline, run_pipeline
flow = gen_xgboost_pipeline()
output = run_pipeline(
flow=flow, data_dir="nba-data", output_dir="nba-data", max_evals=5000, seed=42
)
If you ran the flow on 2021-02-21, the xgboost model artifacts will be saved to the
nba-data/models/2021-02-21/xgboost folder.
Evaluating models¶
To evaluate a set of models,
from nbaspa.model.pipeline import gen_evaluate_pipeline, run_pipeline
flow = gen_evaluate_pipeline(
lifelines="nba-data/models/2021-02-21/lifelines/model.pkl",
xgboost="nba-data/models/2021-02-21/xgboost/model.pkl"
)
output = run_pipeline(flow=flow, data_dir="nba-data", output_dir="nba-data")
This flow will read in the model.pkl files, create the AUROC visualizations, and
save the visualizations to nba-data/models/2021-02-21.
Game-level predictions¶
To run game-level predictions for all of your data, run
from nbaspa.model.pipeline import gen_predict_pipeline, run_pipeline
flow = gen_predict_pipeline()
output = run_pipeline(
flow=flow,
data_dir="nba-data",
output_dir="nba-data",
filesystem="file",
model="nba-data/models/2021-02-21/lifelines/model.pkl",
)
To restrict to a single season, supply the season parameter:
output = run_pipeline(
flow=flow,
data_dir="nba-data",
output_dir="nba-data",
filesystem="file",
model="nba-data/models/2021-02-21/lifelines/model.pkl",
Season="2018-19"
)
Similarly, you can run the predictions for a single game:
output = run_pipeline(
flow=flow,
data_dir="nba-data",
output_dir="nba-data",
filesystem="file",
model="nba-data/models/2021-02-21/lifelines/model.pkl",
GameID="0021800001"
)
Command-line interface¶
Building the datasets¶
First, we need to split the initial dataset into build and holdout:
$ nbaspa-model build --data-dir nba-data --output-dir nba-data
This CLI call will save two CSV files to nba-data/models: build.csv and holdout.csv.
Training the models¶
Next, we can fit a model
$ nbaspa-model train --data-dir nba-data --output-dir nba-data --model lifelines
This CLI call will train a lifelines model with
a 75-25 train-tune split within the build dataset, and
a maximum of 100
hyperoptevaluations.
You can modify these parameters with --splits and --max-evals, respectively.
To train an xgboost model,
$ nbaspa-model train --data-dir nba-data --output-dir nba-data --model xgboost
For the xgboost model, our tuning dataset will double as the early stopping data.
After you call the train endpoint you will see a new subfolder within nba-data/models
corresponding to the system date. The lifelines artifacts will be saved to a lifelines
subfolder; the xgboost artifacts will be saved to a xgboost subfolder.
Evaluating models¶
To evaluate your models, use the evaluate endpoint. Suppose you trained your model on 2021-02-21:
$ nbaspa-model evaluate \
--data-dir nba-data \
--output-dir nba-data \
--model lifelines nba-data/models/2021-02-21/lifelines/model.pkl \
--model xgboost nba-data/models/2021-02-21/xgboost/model.pkl
This endpoint will read in the model .pkl files, create the AUROC visualizations, and
save them to the nba-data/models/2021-02-21 folder.
Game-level predictions¶
To run game-level predictions, use the predict endpoint. Suppose you trained your model on 2021-02-21:
$ nbaspa-model predict \
--data-dir nba-data \
--output-dir nba-data \
--model nba-data/models/2021-02-21/lifelines/model.pkl \
The above call will create game-level predictions for all cleaned game data available in nba-data.
Important
The predictions can be found in nba-data/<Season>/survival-prediction/data_<GameID>.csv.
To restrict to a season or game, supply --season or --game-id:
$ nbaspa-model predict \
--data-dir nba-data \
--output-dir nba-data \
--model nba-data/models/2021-02-21/lifelines/model.pkl \
--season 2018-19 \
--game-id 0021800001
Generate player ratings¶
Python¶
To generate player ratings for all of your data, run
from nbaspa.player_ratings.pipeline import gen_pipeline, run_pipeline
flow = gen_pipeline()
output = run_pipeline(
flow=flow,
data_dir="nba-data",
output_dir="nba-data",
filesystem="file"
)
To restrict to a given season, supply Season
output = run_pipeline(
flow=flow,
data_dir="nba-data",
output_dir="nba-data",
filesystem="file",
Season="2018-19"
)
and to restrict to a game, supply GameID
output = run_pipeline(
flow=flow,
data_dir="nba-data",
output_dir="nba-data",
filesystem="file",
GameID="0021800001"
)
To remove contextual information like team quality and schedule, (see here)
supply mode:
output = run_pipeline(
flow=flow,
data_dir="nba-data",
output_dir="nba-data",
filesystem="file",
GameID="0021800001",
mode="survival-plus"
)
Important
You can find the play-by-play impact data at <output_dir>/<Season>/pbp-impact/data_<GameID>.csv.
The aggregated game-level data can be found at <output_dir>/<Season>/game-impact/data_<GameID>/csv.
This call will also save a season summary CSV with total and average impact for each player to
<output_dir>/<Season>/impact-summary.csv, as well as player timeseries data to
<output_dir>/<Season>/impact-timeseries/data_<PlayerID>.csv.
Command-line interface¶
To run game-level player ratings,
$ nbaspa-rate \
--data-dir nba-data \
--output-dir nba-data
The above call will create ratings for all cleaned game data available in nba-data. To restrict
to a season or game, supply --season or --game-id:
$ nbaspa-rate \
--data-dir nba-data \
--output-dir nba-data \
--season 2018-19 \
--game-id 0021800001
To generate ratings that remove contextual information like team quality and schedule , change the mode parameter
$ nbaspa-rate \
--data-dir nba-data \
--output-dir nba-data \
--season 2018-19 \
--game-id 0021800001 \
--mode survival-plus
Daily snapshot¶
If you want to analyze the data for a single day’s worth of games (maybe to prevent a large batch
at the end of the season), you can use the set of daily CLI endpoints. For Christmas 2018, use
$ nbaspa-download daily \
--output-dir nba-data \
--game-date 2018-12-25
$ nbaspa-clean daily \
--data-dir nba-data \
--output-dir nba-data \
--game-date 2018-12-25
$ nbaspa-model daily \
--data-dir nba-data \
--output-dir nba-data \
--model nba-data/models/2021-02-21/lifelines/model.pkl \
--game-date 2018-12-25
$ nbaspa-rate \
--data-dir nba-data \
--output-dir nba-data \
--game-date 2018-12-25
$ nbaspa-rate \
--data-dir nba-data \
--output-dir nba-data \
--game-date 2018-12-25 \
--mode survival-plus
Using docker and gcs to push daily updates¶
Note
This documentation is adapted from the Google Cloud SDK.
First, pull the gcloud image
$ docker pull gcr.io/google.com/cloudsdktool/cloud-sdk:latest
and authenticate gcloud with service account credentials:
$ docker run \
--name gcloud-config \
gcr.io/google.com/cloudsdktool/cloud-sdk gcloud auth activate-service-account SERVICE_ACCOUNT@DOMAIN.COM --key-file=/path/key.json --project=PROJECT_ID
Important
If you want to authenticate with user credentials, call
$ docker run \
-ti \
--name gcloud-config \
gcr.io/google.com/cloudsdktool/cloud-sdk gcloud auth login
Then, build the nbaspa docker image
$ docker build --tag nbaspa .
and run the container. We will
include the authentication container
gcloud-configas a volume,update the application default credentials (ADC), and
mount our local
nba-datadirectory to the container.
Run the target container with the following script (snapshot.sh)
DATE=$(date -d "yesterday 13:00" +"%Y-%m-%d")
cd /opt
nbaspa-download daily --output-dir $DATA_DIR --game-date $DATE
nbaspa-clean daily --data-dir $DATA_DIR --output-dir $DATA_DIR --game-date $DATE
nbaspa-model daily --data-dir $DATA_DIR --output-dir $DATA_DIR --model $MODEL_PATH--game-date $DATE
nbaspa-rate --data-dir $DATA_DIR --output-dir $DATA_DIR --game-date $DATE
nbaspa-rate --data-dir $DATA_DIR --output-dir $DATA_DIR --game-date $DATE --mode survival-plus
gsutil -m rsync -r $DATA_DIR $GCS_PATH
$ docker run \
--rm \
--volumes-from gcloud-config \
--mount type=bind,src=<PATH_TO_PARENT>,target=/opt \
-e DATA_DIR=<DATA_DIRECTORY> \
-e MODEL_PATH=/opt/<PATH_TO_MODEL_PKL> \
-e GCS_PATH=gs://<BUCKET_NAME>/<TARGET_DIRECTORY> \
nbaspa snapshot.sh
for example, if you’re using docker on Windows Subsystem for Linux and the nba-data directory
exists in your local branch of nbaspa.
$ docker run \
--rm \
--volumes-from gcloud-config \
--mount type=bind,src=/mnt/c/Users/UserName/Documents/GitHub/nbaspa,target=/opt \
-e DATA_DIR=nba-data \
-e MODEL_PATH=/opt/nba-data/models/2021-02-21/lifelines/model.pkl \
-e GCS_PATH=gs://mybucket/nba-data \
nbaspa snapshot.sh