Data Exploration

This page was created from a Jupyter notebook. The original notebook can be found here. It explores some of the data contained in or derived from the database. First we must import the necessary installed modules.

import sqlite3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

Next we need to import a local module to calculate some advanced stats not stored in the database.

import os
import sys

module_path = os.path.abspath(os.path.join('..'))

if module_path not in sys.path:
    sys.path.append(module_path)

from nba.database import Database

The code below simply customizes font sizes for all the plots that follow.

plt.rc('font', size=14)        # controls default text sizes
plt.rc('axes', titlesize=16)   # fontsize of the axes title
plt.rc('axes', labelsize=16)   # fontsize of the x and y labels
plt.rc('xtick', labelsize=14)  # fontsize of the tick labels
plt.rc('ytick', labelsize=14)  # fontsize of the tick labels
plt.rc('legend', fontsize=14)  # legend fontsize

We then need to connect to the database generated during the data wrangling process.

conn = sqlite3.connect('../nba.db')

Without knowing what two teams are playing, a reasonable baseline prediction is probably that the home team wins. Let’s take a look at how often that actually happens in the NBA. The SQL query below calculates the home team’s winning percentage since the 1990-91 season and groups the results by season. Specifically, it sums up all occurrences of ‘W’ in the HOME_WL column and divides by the total number of games in a given season.

data = pd.read_sql('''
    SELECT SEASON,
           100.0 * SUM(CASE WHEN HOME_WL = 'W' THEN 1 ELSE 0 END) / COUNT(HOME_WL) AS HomeWinPct
    FROM games
    WHERE SEASON >= 1990
    GROUP BY SEASON''', conn)

The plot below shows the win percentage for all home teams across the league from 1990-2015. The chart along with the annotation show that the home team wins about 60% of the time historically. That rate is also remarkably consistent. It has a standard deviation of less than 2% and has stayed within about 4% since the 1990-91 season. FiveThirtyEight reported a similar percentage when analyzing home court/field/ice advantages of the four major American sports. They calculated that the home team in the NBA has won 59.9% of regular season games since the 2000 season. They also estimated that playing at home provides the biggest advantage in the NBA, where home teams win nearly 10% more games than expected had all games been played at neutral sites. Contrast that with MLB, where home teams win only 4% more games than expected. It is interesting to note that regardless of the sport, FiveThirtyEight’s models expect the “home” team to win about 50% of the time on neutral sites, which makes sense when averaged across all teams and multiple seasons.

pct = data.HomeWinPct
plt.figure(figsize=(10, 6))
plt.plot(data.SEASON, pct)
plt.text(1990.5, 83,
         '$\mu=${0:.1f}%\n$\sigma=${1:.1f}%\nRange = {2:.1f}%'
         .format(np.mean(pct), np.std(pct), np.ptp(pct)))
plt.xlabel('Season')
plt.ylabel('NBA Home Team Win Percentage')
plt.title('NBA Home Teams Win Predictably Often')
plt.xlim(1990, 2015)
plt.ylim(0, 100)
plt.show()

png

Historically Great Teams

Now let’s calculate some advanced stats from the basic box score data. The code below returns a Pandas DataFrame with season-averaged team offensive and defensive ratings (points scored/allowed per 100 possessions), as well as SRS defined during data wrangling.

database = Database('../nba.db')
season_stats = database.season_stats()

We are first going to look at teams that have some of the best offenses and defenses in NBA history. Since the team stats table includes a team ID that maps to a teams table, we need to pull in the teams table to better identify which teams we are looking at.

teams = pd.read_sql('SELECT * FROM TEAMS', conn)

Below are the limits for what defines a great offense, defense, and well-balanced team (high net rating). These limits are not formally defined and are merely selected such that the resulting number of teams is small.

off_lim = 114
def_lim = 98
net_lim = 9
off_flag = (season_stats.TEAM_OFF_RTG > off_lim) & (season_stats.SEASON >= 1990)
def_flag = (season_stats.TEAM_DEF_RTG < def_lim) & (season_stats.SEASON >= 1990)
net_flag = (season_stats.TEAM_NET_RTG > net_lim) & (season_stats.SEASON >= 1990)
stats = ['SEASON', 'TEAM_ID', 'TEAM_OFF_RTG', 'TEAM_DEF_RTG', 'TEAM_NET_RTG', 'TEAM_SRS']

There are 13 teams since the 1990-91 season with an offensive rating greater than 114 and 4 of them are Jordan-led Bulls teams. The only other teams to appear more than once are the Suns (one led by Charles Barkley and two by the duo of Steve Nash and Amar’e Stoudemire) and the Warriors (one led by hall of famer Chris Mullin and the record-setting 2015-16 team). Note that the NBA’s database does not differentiate teams that change location and/or mascots, so the table identifies the 1994 Oklahoma City Thunder, even though they were the Seattle SuperSonics at the time.

# Isolate desired teams
best_off = season_stats.loc[off_flag, stats]

# Merge with teams table to add team information
best_off = teams.merge(best_off, left_on='ID', right_on='TEAM_ID')

# Remove ID columns
best_off = best_off[[c for c in best_off.columns if 'ID' not in c]]

# Sort by descending offensive rating
best_off.sort_values(by='TEAM_OFF_RTG', ascending=False)
ABBREVIATION CITY MASCOT SEASON TEAM_OFF_RTG TEAM_DEF_RTG TEAM_NET_RTG TEAM_SRS
2 CHI Chicago Bulls 1991 116.189211 105.151816 11.037395 10.068209
3 CHI Chicago Bulls 1995 115.933867 102.438493 13.495374 11.799077
7 ORL Orlando Magic 1994 115.824746 108.447942 7.376804 6.438905
11 OKC Oklahoma City Thunder 1994 115.429696 106.876043 8.553653 7.907164
10 PHX Phoenix Suns 2009 115.341356 110.211671 5.129685 4.677408
1 CHI Chicago Bulls 1990 115.229301 105.703474 9.525826 8.565998
12 UTA Utah Jazz 1994 114.985616 106.354448 8.631169 7.751535
8 PHX Phoenix Suns 1994 114.930813 110.902597 4.028215 3.849692
9 PHX Phoenix Suns 2004 114.531469 107.143974 7.387495 7.076859
6 GSW Golden State Warriors 2015 114.495585 103.776436 10.719149 10.380270
4 CHI Chicago Bulls 1996 114.352283 102.373550 11.978733 10.697142
0 CLE Cleveland Cavaliers 1991 114.335981 108.610216 5.725766 5.340416
5 GSW Golden State Warriors 1991 114.060850 110.310391 3.750460 3.766748

There are 11 teams since 1990 with a defensive rating under 98 and 3 of them are late 90s/early 2000s Spurs teams with Tim Duncan, two of which featured hall of famer David Robinson. What is more impressive is that they are the only team to grace this list more than once.

# Isolate desired teams
best_def = season_stats.loc[def_flag, stats]

# Merge with teams table to add team information
best_def = teams.merge(best_def, left_on='ID', right_on='TEAM_ID')

# Remove ID columns
best_def = best_def[[c for c in best_def.columns if 'ID' not in c]]

# Sort by ascending defensive rating
best_def.sort_values(by='TEAM_DEF_RTG')
ABBREVIATION CITY MASCOT SEASON TEAM_OFF_RTG TEAM_DEF_RTG TEAM_NET_RTG TEAM_SRS
9 SAS San Antonio Spurs 2003 102.248071 94.178366 8.069705 7.512427
7 SAS San Antonio Spurs 1998 104.036733 95.000784 9.035949 7.151218
10 DET Detroit Pistons 2003 102.057437 95.440557 6.616880 5.035363
0 ATL Atlanta Hawks 1998 100.516244 97.138526 3.377719 2.790164
3 IND Indiana Pacers 2003 103.866828 97.324036 6.542792 4.932295
2 ORL Orlando Magic 1998 100.297189 97.382225 2.914964 3.082168
1 NYK New York Knicks 1998 98.636273 97.471817 1.164456 1.423433
4 PHI Philadelphia 76ers 1998 99.928127 97.632210 2.295917 2.528796
6 POR Portland Trail Blazers 1998 104.759421 97.734222 7.025199 5.697009
8 SAS San Antonio Spurs 2000 106.556068 97.962401 8.593667 7.916999
5 PHX Phoenix Suns 2000 100.321422 97.966276 2.355146 2.626301

The 16 teams listed below have a net rating greater than 9 and are some of the strongest teams in NBA history. These include 4 of the Jordan-led Bulls and two of the recent Warriors teams. Only one of the strong defensive teams isolated above make the cut, while 5 of the strong offensive teams appear. As before, note that the 1993 Oklahoma City Thunder were in fact the Seattle SuperSonics.

# Isolate desired teams
best_net = season_stats.loc[net_flag, stats]

# Merge with teams table to add team information
best_net = teams.merge(best_net, left_on='ID', right_on='TEAM_ID')

# Remove ID columns
best_net = best_net[[c for c in best_net.columns if 'ID' not in c]]

# Sort by descending net rating
best_net.sort_values(by='TEAM_NET_RTG', ascending=False)
ABBREVIATION CITY MASCOT SEASON TEAM_OFF_RTG TEAM_DEF_RTG TEAM_NET_RTG TEAM_SRS
4 CHI Chicago Bulls 1995 115.933867 102.438493 13.495374 11.799077
5 CHI Chicago Bulls 1996 114.352283 102.373550 11.978733 10.697142
12 SAS San Antonio Spurs 2015 110.290021 98.962236 11.327785 10.276618
0 BOS Boston Celtics 2007 110.171323 98.933715 11.237609 9.307345
3 CHI Chicago Bulls 1991 116.189211 105.151816 11.037395 10.068209
8 GSW Golden State Warriors 2015 114.495585 103.776436 10.719149 10.380270
7 GSW Golden State Warriors 2014 111.603607 101.354296 10.249311 10.008125
1 CLE Cleveland Cavaliers 2008 112.441598 102.432204 10.009394 8.680389
14 OKC Oklahoma City Thunder 2012 112.398592 102.609581 9.789011 9.149670
15 UTA Utah Jazz 1996 113.642224 103.950239 9.691985 7.969128
13 OKC Oklahoma City Thunder 1993 111.969516 102.366973 9.602543 8.675828
2 CHI Chicago Bulls 1990 115.229301 105.703474 9.525826 8.565998
11 SAS San Antonio Spurs 2006 109.199410 99.859543 9.339868 8.347217
6 CHI Chicago Bulls 2011 107.405392 98.284632 9.120760 7.426924
9 LAL Los Angeles Lakers 1999 107.325502 98.224840 9.100663 8.413792
10 SAS San Antonio Spurs 1998 104.036733 95.000784 9.035949 7.151218

The following table shows all NBA champions since the 1990-91 season. Not surprisingly, most of the champions have solid net ratings and SRS numbers with the 1995 Bulls shining above all other teams. The weakest champion by these measures is the 1994 Rockets led by hall of famers Hakeem Olajuwon and Clyde Drexler.

# Create champions table that identifies which team won the NBA championship each season
champs = pd.DataFrame({'SEASON': range(1990, 2016),
                       'ABBREVIATION': ['CHI', 'CHI', 'CHI', 'HOU', 'HOU', 'CHI', 'CHI', 'CHI', 'SAS',
                                        'LAL', 'LAL', 'LAL', 'SAS', 'DET', 'SAS', 'MIA', 'SAS', 'BOS',
                                        'LAL', 'LAL', 'DAL', 'MIA', 'MIA', 'SAS', 'GSW', 'CLE']})
# Isolate desired stats from all teams
champ_stats = season_stats[stats]

# Merge with teams table to add team information
champ_stats = teams.merge(champ_stats, left_on='ID', right_on='TEAM_ID')

# Merge with champs table created above
columns = ['ABBREVIATION', 'SEASON']
champ_stats = champs.merge(champ_stats, left_on=columns, right_on=columns)

# Remove ID columns
champ_stats = champ_stats[[c for c in best_net.columns if 'ID' not in c]]
champ_stats
ABBREVIATION CITY MASCOT SEASON TEAM_OFF_RTG TEAM_DEF_RTG TEAM_NET_RTG TEAM_SRS
0 CHI Chicago Bulls 1990 115.229301 105.703474 9.525826 8.565998
1 CHI Chicago Bulls 1991 116.189211 105.151816 11.037395 10.068209
2 CHI Chicago Bulls 1992 113.620324 106.822864 6.797459 6.192637
3 HOU Houston Rockets 1993 106.444671 101.900362 4.544309 4.194021
4 HOU Houston Rockets 1994 110.403045 108.140634 2.262411 2.318131
5 CHI Chicago Bulls 1995 115.933867 102.438493 13.495374 11.799077
6 CHI Chicago Bulls 1996 114.352283 102.373550 11.978733 10.697142
7 CHI Chicago Bulls 1997 107.696343 99.779691 7.916652 7.244613
8 SAS San Antonio Spurs 1998 104.036733 95.000784 9.035949 7.151218
9 LAL Los Angeles Lakers 1999 107.325502 98.224840 9.100663 8.413792
10 LAL Los Angeles Lakers 2000 108.435083 104.794734 3.640349 3.742580
11 LAL Los Angeles Lakers 2001 109.407427 101.713071 7.694357 7.145147
12 SAS San Antonio Spurs 2002 105.629877 99.659960 5.969917 5.649958
13 DET Detroit Pistons 2003 102.057437 95.440557 6.616880 5.035363
14 SAS San Antonio Spurs 2004 107.496327 98.774515 8.721811 7.835147
15 MIA Miami Heat 2005 108.659652 104.454414 4.205239 3.592696
16 SAS San Antonio Spurs 2006 109.199410 99.859543 9.339868 8.347217
17 BOS Boston Celtics 2007 110.171323 98.933715 11.237609 9.307345
18 LAL Los Angeles Lakers 2008 112.815873 104.735539 8.080334 7.112944
19 LAL Los Angeles Lakers 2009 108.764387 103.716801 5.047586 4.782049
20 DAL Dallas Mavericks 2010 109.671262 105.041587 4.629675 4.407577
21 MIA Miami Heat 2011 106.630278 100.150438 6.479840 5.719582
22 MIA Miami Heat 2012 112.332835 103.744087 8.588748 7.031001
23 SAS San Antonio Spurs 2013 110.501196 102.404512 8.096684 7.994012
24 GSW Golden State Warriors 2014 111.603607 101.354296 10.249311 10.008125
25 CLE Cleveland Cavaliers 2015 110.873303 104.496954 6.376349 5.452116

The scatter plot below shows all teams since 1990 represented by their offensive and defensive ratings. The strongest teams are towards the bottom right, those that score a lot and do not give up many points. The line represents a constant net rating of 9, below which are the 16 teams listed above that have a net rating greater than 9.

# Add line of constant net rating
x = np.array([95, 120])
plt.figure(figsize=(10, 8))
plt.plot(x, x-net_lim, 'black', label='Net Rating = %d' % net_lim)

# Isolate teams that do not make the offensive/defensive/net rating cutoffs
others = season_stats.loc[~off_flag & ~def_flag & ~net_flag & (season_stats.SEASON >= 1990), stats]
others = teams.merge(others, left_on='ID', right_on='TEAM_ID')

# Remove the champions with an outer join
columns = ['ABBREVIATION', 'SEASON']
others = champs.merge(others, left_on=columns, right_on=columns, how='outer', indicator=True)
others = others[others._merge == 'right_only']

# Plot data
plt.scatter(x='TEAM_OFF_RTG', y='TEAM_DEF_RTG', data=others, label='Everyone Else')
plt.scatter(x='TEAM_OFF_RTG', y='TEAM_DEF_RTG', data=best_off, label='Off Rating > %d' % off_lim)
plt.scatter(x='TEAM_OFF_RTG', y='TEAM_DEF_RTG', data=best_def, label='Def Rating < %d' % def_lim)
plt.scatter(x='TEAM_OFF_RTG', y='TEAM_DEF_RTG', data=best_net, label='Net Rating > %d' % net_lim)
plt.scatter(x='TEAM_OFF_RTG', y='TEAM_DEF_RTG', data=champ_stats, label='Champions')

# Label axes and add legend
plt.legend()
plt.xlabel('Offensive Rating')
plt.ylabel('Defensive Rating')
plt.xlim(90, 120)
plt.ylim(90, 120)
plt.show()

png

Team Strength Distribution

The histogram and kernel density estimation (KDE) of team SRS below show that teams are fairly normally distributed. The best fit normal distribution has a mean of essentially zero with a standard deviation of about 4.6 points. A zero-mean distribution makes sense here because an SRS of zero indicates a perfectly average team.

srs = season_stats['TEAM_SRS']
(mu, sigma) = norm.fit(srs)

plt.figure(figsize=(12, 6))
ax = sns.distplot(srs, fit=norm, kde=True, kde_kws={'label': 'KDE', 'color': 'green', 'linewidth': 3},
                  hist_kws={'label': 'SRS', 'edgecolor': 'k', 'linewidth': 2},
                  fit_kws={'label': 'Normal Dist ($\mu=${0:.2g}, $\sigma=${1:.2f})'.format(mu, sigma),
                           'linewidth': 3})
ax.legend()
plt.xlabel('SRS')
plt.ylabel('Frequency')
plt.title('Most NBA Teams are Average')
plt.xlim(-20, 20)
plt.ylim(0, 0.1)
plt.show()

png

Home vs. Away Strength

The next step is to look at games in terms of home and away team stats. The code below joins the games table with the stats table initially by home team stats and followed by away team stats.

seasons = season_stats.filter(regex='SEASON|TEAM')
games = pd.read_sql('SELECT * FROM games', conn)
games = games.merge(seasons, left_on=['SEASON', 'HOME_TEAM_ID'], right_on=['SEASON', 'TEAM_ID'])
games = games.merge(seasons, left_on=['SEASON', 'AWAY_TEAM_ID'], right_on=['SEASON', 'TEAM_ID'], 
                    suffixes=('', '_AWAY'))

The plot below shows a 2D KDE that compares home and away team SRS. By inspection, the majority of games occur between teams with SRS values within 5 points of average. This makes intuitive sense given the standard deviation of 4.6 points calculated above. Assuming the Gaussian distribution above, more than 68% of all teams since 1990 have SRS values with a magnitude less than 5 based on the definition of a normal distribution. The distribution appears symmetric about a y=x line because under normal circumstances (each team plays a full season), teams have the same number of home and away games.

ax = sns.jointplot(x='TEAM_SRS', y='TEAM_SRS_AWAY', data=games, kind='kde',
                   shade_lowest=False, stat_func=None, xlim=(-15, 15), ylim=(-15, 15), size=8)
ax.set_axis_labels(xlabel='Home Team SRS', ylabel='Away Team SRS')
plt.show()

png

The function below makes and customizes KDE plots of home and away team stats.

def kde(data, stat, label, title, ax):
    stat = 'TEAM_' + stat
    sns.kdeplot(data[stat], data[stat + '_AWAY'], cmap='Blues', shade=True, shade_lowest=False, ax=ax)
    ax.plot(0, 0, 'or')
    ax.set_xlabel('Home Team ' + label)
    ax.set_ylabel('Away Team ' + label)
    ax.set_title(title)
    ax.set_xlim(-15, 15)
    ax.set_ylim(-15, 15)

A better view of this data is to separate games into home team wins and losses. The plots below show KDE plots of SRS for home team wins and losses with a red marker added to easily identify the origin (average home and away teams). The high-density area towards the lower right of the origin for home team wins (left plot) indicates there are many games in the dataset where above-average home teams beat below-average away teams, which is not a surprising revelation. We draw the opposite conclusion for home team losses. The highest density occurs towards the upper left of the origin, meaning games where a below-average home team plays an above-average visiting teams typically does not go well for the home team. There is also a small cluster of games in the left plot where above-average home teams beat above-average visitors of roughly equal quality. This may be caused by home court advantage, but was not investigated further.

plt.figure(figsize=(14, 6))

# Find games where the home team won
kde(games[games.HOME_WL=='W'], 'SRS', 'SRS', 'KDE of Home Team Wins', plt.subplot(121))

# Find games where the home team lost
kde(games[games.HOME_WL=='L'], 'SRS', 'SRS', 'KDE of Home Team Losses', plt.subplot(122))

plt.show()

png

The KDE plots below repeat those above for team net ratings (offensive rating - defensive rating). They illustrate that the same trends hold true for net ratings as did for SRS above.

plt.figure(figsize=(14, 6))

# Find games where the home team won
kde(games[games.HOME_WL=='W'], 'NET_RTG', 'Net Rating', 'KDE of Home Team Wins', plt.subplot(121))

# Find games where the home team lost
kde(games[games.HOME_WL=='L'], 'NET_RTG', 'Net Rating', 'KDE of Home Team Losses', plt.subplot(122))

plt.show()

png

< Previous Page
Next Page >