Data Exploration

This page was created from a Jupyter notebook. The original notebook can be found here. It explores some of the data contained in or derived from the database. First we must import the necessary installed modules.

import sqlite3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

Next we need to import a local module to calculate some advanced stats not stored in the database.

import os
import sys

module_path = os.path.abspath(os.path.join('..'))

if module_path not in sys.path:
    sys.path.append(module_path)

from nba.database import Database

The code below simply customizes font sizes for all the plots that follow.

plt.rc('font', size=14)        # controls default text sizes
plt.rc('axes', titlesize=16)   # fontsize of the axes title
plt.rc('axes', labelsize=16)   # fontsize of the x and y labels
plt.rc('xtick', labelsize=14)  # fontsize of the tick labels
plt.rc('ytick', labelsize=14)  # fontsize of the tick labels
plt.rc('legend', fontsize=14)  # legend fontsize

We then need to connect to the database generated during the data wrangling process.

conn = sqlite3.connect('../nba.db')

Without knowing what two teams are playing, a reasonable baseline prediction is probably that the home team wins. Let’s take a look at how often that actually happens in the NBA. The SQL query below calculates the home team’s winning percentage since the 1990-91 season and groups the results by season. Specifically, it sums up all occurrences of ‘W’ in the HOME_WL column and divides by the total number of games in a given season.

data = pd.read_sql('''
    SELECT SEASON,
           100.0 * SUM(CASE WHEN HOME_WL = 'W' THEN 1 ELSE 0 END) / COUNT(HOME_WL) AS HomeWinPct
    FROM games
    WHERE SEASON >= 1990
    GROUP BY SEASON''', conn)

The plot below shows the win percentage for all home teams across the league from 1990-2015. The chart along with the annotation show that the home team wins about 60% of the time historically. That rate is also remarkably consistent. It has a standard deviation of less than 2% and has stayed within about $\pm$ 4% since the 1990-91 season. FiveThirtyEight reported a similar percentage when analyzing home court/field/ice advantages of the four major American sports. They calculated that the home team in the NBA has won 59.9% of regular season games since the 2000 season. They also estimated that playing at home provides the biggest advantage in the NBA, where home teams win nearly 10% more games than expected had all games been played at neutral sites. Contrast that with MLB, where home teams win only 4% more games than expected. It is interesting to note that regardless of the sport, FiveThirtyEight’s models expect the “home” team to win about 50% of the time on neutral sites, which makes sense when averaged across all teams and multiple seasons.

pct = data.HomeWinPct
plt.figure(figsize=(10, 6))
plt.plot(data.SEASON, pct)
plt.text(1990.5, 83,
         '$\mu=${0:.1f}%\n$\sigma=${1:.1f}%\nRange = {2:.1f}%'
         .format(np.mean(pct), np.std(pct), np.ptp(pct)))
plt.xlabel('Season')
plt.ylabel('NBA Home Team Win Percentage')
plt.title('NBA Home Teams Win Predictably Often')
plt.xlim(1990, 2015)
plt.ylim(0, 100)
plt.show()

png

Historically Great Teams

Now let’s calculate some advanced stats from the basic box score data. The code below returns a Pandas DataFrame with season-averaged team offensive and defensive ratings (points scored/allowed per 100 possessions), as well as SRS defined during data wrangling.

database = Database('../nba.db')
season_stats = database.season_stats()

We are first going to look at teams that have some of the best offenses and defenses in NBA history. Since the team stats table includes a team ID that maps to a teams table, we need to pull in the teams table to better identify which teams we are looking at.

teams = pd.read_sql('SELECT * FROM TEAMS', conn)

Below are the limits for what defines a great offense, defense, and well-balanced team (high net rating). These limits are not formally defined and are merely selected such that the resulting number of teams is small.

off_lim = 114
def_lim = 98
net_lim = 9
off_flag = (season_stats.TEAM_OFF_RTG > off_lim) & (season_stats.SEASON >= 1990)
def_flag = (season_stats.TEAM_DEF_RTG < def_lim) & (season_stats.SEASON >= 1990)
net_flag = (season_stats.TEAM_NET_RTG > net_lim) & (season_stats.SEASON >= 1990)
stats = ['SEASON', 'TEAM_ID', 'TEAM_OFF_RTG', 'TEAM_DEF_RTG', 'TEAM_NET_RTG', 'TEAM_SRS']

There are 13 teams since the 1990-91 season with an offensive rating greater than 114 and 4 of them are Jordan-led Bulls teams. The only other teams to appear more than once are the Suns (one led by Charles Barkley and two by the duo of Steve Nash and Amar’e Stoudemire) and the Warriors (one led by hall of famer Chris Mullin and the record-setting 2015-16 team). Note that the NBA’s database does not differentiate teams that change location and/or mascots, so the table identifies the 1994 Oklahoma City Thunder, even though they were the Seattle SuperSonics at the time.

# Isolate desired teams
best_off = season_stats.loc[off_flag, stats]

# Merge with teams table to add team information
best_off = teams.merge(best_off, left_on='ID', right_on='TEAM_ID')

# Remove ID columns
best_off = best_off[[c for c in best_off.columns if 'ID' not in c]]

# Sort by descending offensive rating
best_off.sort_values(by='TEAM_OFF_RTG', ascending=False)

	ABBREVIATION	CITY	MASCOT	SEASON	TEAM_OFF_RTG	TEAM_DEF_RTG	TEAM_NET_RTG	TEAM_SRS
2	CHI	Chicago	Bulls	1991	116.189211	105.151816	11.037395	10.068209
3	CHI	Chicago	Bulls	1995	115.933867	102.438493	13.495374	11.799077
7	ORL	Orlando	Magic	1994	115.824746	108.447942	7.376804	6.438905
11	OKC	Oklahoma City	Thunder	1994	115.429696	106.876043	8.553653	7.907164
10	PHX	Phoenix	Suns	2009	115.341356	110.211671	5.129685	4.677408
1	CHI	Chicago	Bulls	1990	115.229301	105.703474	9.525826	8.565998
12	UTA	Utah	Jazz	1994	114.985616	106.354448	8.631169	7.751535
8	PHX	Phoenix	Suns	1994	114.930813	110.902597	4.028215	3.849692
9	PHX	Phoenix	Suns	2004	114.531469	107.143974	7.387495	7.076859
6	GSW	Golden State	Warriors	2015	114.495585	103.776436	10.719149	10.380270
4	CHI	Chicago	Bulls	1996	114.352283	102.373550	11.978733	10.697142
0	CLE	Cleveland	Cavaliers	1991	114.335981	108.610216	5.725766	5.340416
5	GSW	Golden State	Warriors	1991	114.060850	110.310391	3.750460	3.766748

There are 11 teams since 1990 with a defensive rating under 98 and 3 of them are late 90s/early 2000s Spurs teams with Tim Duncan, two of which featured hall of famer David Robinson. What is more impressive is that they are the only team to grace this list more than once.

# Isolate desired teams
best_def = season_stats.loc[def_flag, stats]

# Merge with teams table to add team information
best_def = teams.merge(best_def, left_on='ID', right_on='TEAM_ID')

# Remove ID columns
best_def = best_def[[c for c in best_def.columns if 'ID' not in c]]

# Sort by ascending defensive rating
best_def.sort_values(by='TEAM_DEF_RTG')

	ABBREVIATION	CITY	MASCOT	SEASON	TEAM_OFF_RTG	TEAM_DEF_RTG	TEAM_NET_RTG	TEAM_SRS
9	SAS	San Antonio	Spurs	2003	102.248071	94.178366	8.069705	7.512427
7	SAS	San Antonio	Spurs	1998	104.036733	95.000784	9.035949	7.151218
10	DET	Detroit	Pistons	2003	102.057437	95.440557	6.616880	5.035363
0	ATL	Atlanta	Hawks	1998	100.516244	97.138526	3.377719	2.790164
3	IND	Indiana	Pacers	2003	103.866828	97.324036	6.542792	4.932295
2	ORL	Orlando	Magic	1998	100.297189	97.382225	2.914964	3.082168
1	NYK	New York	Knicks	1998	98.636273	97.471817	1.164456	1.423433
4	PHI	Philadelphia	76ers	1998	99.928127	97.632210	2.295917	2.528796
6	POR	Portland	Trail Blazers	1998	104.759421	97.734222	7.025199	5.697009
8	SAS	San Antonio	Spurs	2000	106.556068	97.962401	8.593667	7.916999
5	PHX	Phoenix	Suns	2000	100.321422	97.966276	2.355146	2.626301

The 16 teams listed below have a net rating greater than 9 and are some of the strongest teams in NBA history. These include 4 of the Jordan-led Bulls and two of the recent Warriors teams. Only one of the strong defensive teams isolated above make the cut, while 5 of the strong offensive teams appear. As before, note that the 1993 Oklahoma City Thunder were in fact the Seattle SuperSonics.

# Isolate desired teams
best_net = season_stats.loc[net_flag, stats]

# Merge with teams table to add team information
best_net = teams.merge(best_net, left_on='ID', right_on='TEAM_ID')

# Remove ID columns
best_net = best_net[[c for c in best_net.columns if 'ID' not in c]]

# Sort by descending net rating
best_net.sort_values(by='TEAM_NET_RTG', ascending=False)

	ABBREVIATION	CITY	MASCOT	SEASON	TEAM_OFF_RTG	TEAM_DEF_RTG	TEAM_NET_RTG	TEAM_SRS
4	CHI	Chicago	Bulls	1995	115.933867	102.438493	13.495374	11.799077
5	CHI	Chicago	Bulls	1996	114.352283	102.373550	11.978733	10.697142
12	SAS	San Antonio	Spurs	2015	110.290021	98.962236	11.327785	10.276618
0	BOS	Boston	Celtics	2007	110.171323	98.933715	11.237609	9.307345
3	CHI	Chicago	Bulls	1991	116.189211	105.151816	11.037395	10.068209
8	GSW	Golden State	Warriors	2015	114.495585	103.776436	10.719149	10.380270
7	GSW	Golden State	Warriors	2014	111.603607	101.354296	10.249311	10.008125
1	CLE	Cleveland	Cavaliers	2008	112.441598	102.432204	10.009394	8.680389
14	OKC	Oklahoma City	Thunder	2012	112.398592	102.609581	9.789011	9.149670
15	UTA	Utah	Jazz	1996	113.642224	103.950239	9.691985	7.969128
13	OKC	Oklahoma City	Thunder	1993	111.969516	102.366973	9.602543	8.675828
2	CHI	Chicago	Bulls	1990	115.229301	105.703474	9.525826	8.565998
11	SAS	San Antonio	Spurs	2006	109.199410	99.859543	9.339868	8.347217
6	CHI	Chicago	Bulls	2011	107.405392	98.284632	9.120760	7.426924
9	LAL	Los Angeles	Lakers	1999	107.325502	98.224840	9.100663	8.413792
10	SAS	San Antonio	Spurs	1998	104.036733	95.000784	9.035949	7.151218

The following table shows all NBA champions since the 1990-91 season. Not surprisingly, most of the champions have solid net ratings and SRS numbers with the 1995 Bulls shining above all other teams. The weakest champion by these measures is the 1994 Rockets led by hall of famers Hakeem Olajuwon and Clyde Drexler.

# Create champions table that identifies which team won the NBA championship each season
champs = pd.DataFrame({'SEASON': range(1990, 2016),
                       'ABBREVIATION': ['CHI', 'CHI', 'CHI', 'HOU', 'HOU', 'CHI', 'CHI', 'CHI', 'SAS',
                                        'LAL', 'LAL', 'LAL', 'SAS', 'DET', 'SAS', 'MIA', 'SAS', 'BOS',
                                        'LAL', 'LAL', 'DAL', 'MIA', 'MIA', 'SAS', 'GSW', 'CLE']})
# Isolate desired stats from all teams
champ_stats = season_stats[stats]

# Merge with teams table to add team information
champ_stats = teams.merge(champ_stats, left_on='ID', right_on='TEAM_ID')

# Merge with champs table created above
columns = ['ABBREVIATION', 'SEASON']
champ_stats = champs.merge(champ_stats, left_on=columns, right_on=columns)

# Remove ID columns
champ_stats = champ_stats[[c for c in best_net.columns if 'ID' not in c]]
champ_stats

	ABBREVIATION	CITY	MASCOT	SEASON	TEAM_OFF_RTG	TEAM_DEF_RTG	TEAM_NET_RTG	TEAM_SRS
0	CHI	Chicago	Bulls	1990	115.229301	105.703474	9.525826	8.565998
1	CHI	Chicago	Bulls	1991	116.189211	105.151816	11.037395	10.068209
2	CHI	Chicago	Bulls	1992	113.620324	106.822864	6.797459	6.192637
3	HOU	Houston	Rockets	1993	106.444671	101.900362	4.544309	4.194021
4	HOU	Houston	Rockets	1994	110.403045	108.140634	2.262411	2.318131
5	CHI	Chicago	Bulls	1995	115.933867	102.438493	13.495374	11.799077
6	CHI	Chicago	Bulls	1996	114.352283	102.373550	11.978733	10.697142
7	CHI	Chicago	Bulls	1997	107.696343	99.779691	7.916652	7.244613
8	SAS	San Antonio	Spurs	1998	104.036733	95.000784	9.035949	7.151218
9	LAL	Los Angeles	Lakers	1999	107.325502	98.224840	9.100663	8.413792
10	LAL	Los Angeles	Lakers	2000	108.435083	104.794734	3.640349	3.742580
11	LAL	Los Angeles	Lakers	2001	109.407427	101.713071	7.694357	7.145147
12	SAS	San Antonio	Spurs	2002	105.629877	99.659960	5.969917	5.649958
13	DET	Detroit	Pistons	2003	102.057437	95.440557	6.616880	5.035363
14	SAS	San Antonio	Spurs	2004	107.496327	98.774515	8.721811	7.835147
15	MIA	Miami	Heat	2005	108.659652	104.454414	4.205239	3.592696
16	SAS	San Antonio	Spurs	2006	109.199410	99.859543	9.339868	8.347217
17	BOS	Boston	Celtics	2007	110.171323	98.933715	11.237609	9.307345
18	LAL	Los Angeles	Lakers	2008	112.815873	104.735539	8.080334	7.112944
19	LAL	Los Angeles	Lakers	2009	108.764387	103.716801	5.047586	4.782049
20	DAL	Dallas	Mavericks	2010	109.671262	105.041587	4.629675	4.407577
21	MIA	Miami	Heat	2011	106.630278	100.150438	6.479840	5.719582
22	MIA	Miami	Heat	2012	112.332835	103.744087	8.588748	7.031001
23	SAS	San Antonio	Spurs	2013	110.501196	102.404512	8.096684	7.994012
24	GSW	Golden State	Warriors	2014	111.603607	101.354296	10.249311	10.008125
25	CLE	Cleveland	Cavaliers	2015	110.873303	104.496954	6.376349	5.452116

The scatter plot below shows all teams since 1990 represented by their offensive and defensive ratings. The strongest teams are towards the bottom right, those that score a lot and do not give up many points. The line represents a constant net rating of 9, below which are the 16 teams listed above that have a net rating greater than 9.

# Add line of constant net rating
x = np.array([95, 120])
plt.figure(figsize=(10, 8))
plt.plot(x, x-net_lim, 'black', label='Net Rating = %d' % net_lim)

# Isolate teams that do not make the offensive/defensive/net rating cutoffs
others = season_stats.loc[~off_flag & ~def_flag & ~net_flag & (season_stats.SEASON >= 1990), stats]
others = teams.merge(others, left_on='ID', right_on='TEAM_ID')

# Remove the champions with an outer join
columns = ['ABBREVIATION', 'SEASON']
others = champs.merge(others, left_on=columns, right_on=columns, how='outer', indicator=True)
others = others[others._merge == 'right_only']

# Plot data
plt.scatter(x='TEAM_OFF_RTG', y='TEAM_DEF_RTG', data=others, label='Everyone Else')
plt.scatter(x='TEAM_OFF_RTG', y='TEAM_DEF_RTG', data=best_off, label='Off Rating > %d' % off_lim)
plt.scatter(x='TEAM_OFF_RTG', y='TEAM_DEF_RTG', data=best_def, label='Def Rating < %d' % def_lim)
plt.scatter(x='TEAM_OFF_RTG', y='TEAM_DEF_RTG', data=best_net, label='Net Rating > %d' % net_lim)
plt.scatter(x='TEAM_OFF_RTG', y='TEAM_DEF_RTG', data=champ_stats, label='Champions')

# Label axes and add legend
plt.legend()
plt.xlabel('Offensive Rating')
plt.ylabel('Defensive Rating')
plt.xlim(90, 120)
plt.ylim(90, 120)
plt.show()

png

Team Strength Distribution

The histogram and kernel density estimation (KDE) of team SRS below show that teams are fairly normally distributed. The best fit normal distribution has a mean of essentially zero with a standard deviation of about 4.6 points. A zero-mean distribution makes sense here because an SRS of zero indicates a perfectly average team.

srs = season_stats['TEAM_SRS']
(mu, sigma) = norm.fit(srs)

plt.figure(figsize=(12, 6))
ax = sns.distplot(srs, fit=norm, kde=True, kde_kws={'label': 'KDE', 'color': 'green', 'linewidth': 3},
                  hist_kws={'label': 'SRS', 'edgecolor': 'k', 'linewidth': 2},
                  fit_kws={'label': 'Normal Dist ($\mu=${0:.2g}, $\sigma=${1:.2f})'.format(mu, sigma),
                           'linewidth': 3})
ax.legend()
plt.xlabel('SRS')
plt.ylabel('Frequency')
plt.title('Most NBA Teams are Average')
plt.xlim(-20, 20)
plt.ylim(0, 0.1)
plt.show()

png

Home vs. Away Strength

The next step is to look at games in terms of home and away team stats. The code below joins the games table with the stats table initially by home team stats and followed by away team stats.

seasons = season_stats.filter(regex='SEASON|TEAM')
games = pd.read_sql('SELECT * FROM games', conn)
games = games.merge(seasons, left_on=['SEASON', 'HOME_TEAM_ID'], right_on=['SEASON', 'TEAM_ID'])
games = games.merge(seasons, left_on=['SEASON', 'AWAY_TEAM_ID'], right_on=['SEASON', 'TEAM_ID'], 
                    suffixes=('', '_AWAY'))

The plot below shows a 2D KDE that compares home and away team SRS. By inspection, the majority of games occur between teams with SRS values within 5 points of average. This makes intuitive sense given the standard deviation of 4.6 points calculated above. Assuming the Gaussian distribution above, more than 68% of all teams since 1990 have SRS values with a magnitude less than 5 based on the definition of a normal distribution. The distribution appears symmetric about a y=x line because under normal circumstances (each team plays a full season), teams have the same number of home and away games.

ax = sns.jointplot(x='TEAM_SRS', y='TEAM_SRS_AWAY', data=games, kind='kde',
                   shade_lowest=False, stat_func=None, xlim=(-15, 15), ylim=(-15, 15), size=8)
ax.set_axis_labels(xlabel='Home Team SRS', ylabel='Away Team SRS')
plt.show()

png

The function below makes and customizes KDE plots of home and away team stats.

def kde(data, stat, label, title, ax):
    stat = 'TEAM_' + stat
    sns.kdeplot(data[stat], data[stat + '_AWAY'], cmap='Blues', shade=True, shade_lowest=False, ax=ax)
    ax.plot(0, 0, 'or')
    ax.set_xlabel('Home Team ' + label)
    ax.set_ylabel('Away Team ' + label)
    ax.set_title(title)
    ax.set_xlim(-15, 15)
    ax.set_ylim(-15, 15)

A better view of this data is to separate games into home team wins and losses. The plots below show KDE plots of SRS for home team wins and losses with a red marker added to easily identify the origin (average home and away teams). The high-density area towards the lower right of the origin for home team wins (left plot) indicates there are many games in the dataset where above-average home teams beat below-average away teams, which is not a surprising revelation. We draw the opposite conclusion for home team losses. The highest density occurs towards the upper left of the origin, meaning games where a below-average home team plays an above-average visiting teams typically does not go well for the home team. There is also a small cluster of games in the left plot where above-average home teams beat above-average visitors of roughly equal quality. This may be caused by home court advantage, but was not investigated further.

plt.figure(figsize=(14, 6))

# Find games where the home team won
kde(games[games.HOME_WL=='W'], 'SRS', 'SRS', 'KDE of Home Team Wins', plt.subplot(121))

# Find games where the home team lost
kde(games[games.HOME_WL=='L'], 'SRS', 'SRS', 'KDE of Home Team Losses', plt.subplot(122))

plt.show()

png

The KDE plots below repeat those above for team net ratings (offensive rating - defensive rating). They illustrate that the same trends hold true for net ratings as did for SRS above.

plt.figure(figsize=(14, 6))

# Find games where the home team won
kde(games[games.HOME_WL=='W'], 'NET_RTG', 'Net Rating', 'KDE of Home Team Wins', plt.subplot(121))

# Find games where the home team lost
kde(games[games.HOME_WL=='L'], 'NET_RTG', 'Net Rating', 'KDE of Home Team Losses', plt.subplot(122))

plt.show()

< Previous Page

Next Page >

DataBall

Data Exploration

Historically Great Teams

Team Strength Distribution

Home vs. Away Strength