Modeling NBA salary with bio-metric features
Predicting NBA salary from biological stats.
I will follow the workflow set out in standard ML courses:
- Find data sources
- Explore and visualize the data
- Clean the data
- Engineer features
- Add supplementary data
- Train models
- Deploy the best model
Refs: IBM, Coursera
Imports and settings
# Other
import os
# Data science
import pandas as pd
import numpy as np
# Machine Learning
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
# Visualization
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 14,
'figure.figsize':(11,7)})
def plot_metric_salary(metric, joined_data):
    # Replace NaN with 0, then drop the zero values: these are physical
    # measurements that should never legitimately be 0
    joined_data_replace_nan = joined_data.fillna(0)
    data = joined_data_replace_nan[joined_data_replace_nan[metric] > 0]
    plt.scatter(x=data[metric], y=data['Log Salary (M $)'],
                s=50, alpha=0.5)
    plt.xlabel(metric)
    plt.ylabel('Log Salary (Millions, $)')
    plt.show()
Find data sources
These are from Kaggle: player salary and player bio stats.
player_salary = pd.read_csv(os.path.join('NBA_data','player_salary_2017','NBA_season1718_salary.csv'))
player_salary.tail()
 | Unnamed: 0 | Player | Tm | season17_18 |
---|---|---|---|---|
568 | 569 | Quinn Cook | NOP | 25000.0 |
569 | 570 | Chris Johnson | HOU | 25000.0 |
570 | 571 | Beno Udrih | DET | 25000.0 |
571 | 572 | Joel Bolomboy | MIL | 22248.0 |
572 | 573 | Jarell Eddie | CHI | 17224.0 |
From Statista, the league minimum salary is $815,615. So let’s eliminate anyone with a salary below that; I’m not sure how those rows snuck in!
player_salary = player_salary[player_salary['season17_18'] >= 815615]
player_bio_stats = pd.read_csv(os.path.join('NBA_data','player_measurements_1947-to-2017','player_measures_1947-2017.csv'))
player_bio_stats.head()
 | Player Full Name | Birth Date | Year Start | Year End | Position | Height (ft 1/2) | Height (inches 2/2) | Height (in cm) | Wingspan (in cm) | Standing Reach (in cm) | Hand Length (in inches) | Hand Width (in inches) | Weight (in lb) | Body Fat (%) | College |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A.C. Green | 10/4/1963 | 1986 | 2001 | F-C | 6.0 | 9.0 | 205.7 | NaN | NaN | NaN | NaN | 220.0 | NaN | Oregon State University |
1 | A.J. Bramlett | 1/10/1977 | 2000 | 2000 | C | 6.0 | 10.0 | 208.3 | NaN | NaN | NaN | NaN | 227.0 | NaN | University of Arizona |
2 | A.J. English | 7/11/1967 | 1991 | 1992 | G | 6.0 | 3.0 | 190.5 | NaN | NaN | NaN | NaN | 175.0 | NaN | Virginia Union University |
3 | A.J. Guyton | 2/12/1978 | 2001 | 2003 | G | 6.0 | 1.0 | 185.4 | 192.4 | 247.7 | NaN | NaN | 180.0 | NaN | Indiana University |
4 | A.J. Hammons | 8/27/1992 | 2017 | 2017 | C | 7.0 | 0.0 | 213.4 | NaN | NaN | NaN | NaN | 260.0 | NaN | Purdue University |
I want to limit my predictor data to match my salary data: players active in the 2017-2018 season.
From the first entry, A.C. Green, I can work out the convention for this dataset. A.C. Green’s first season was 1985-1986 and his last was 2000-2001, so the Year Start and Year End columns use the later year of each season.
(Note: that’s a super long career, A.C.!)
player_bio_1718 = player_bio_stats[
(player_bio_stats['Year Start'] <= 2018)&(player_bio_stats['Year End'] >= 2018)
].reset_index(drop=True)
player_bio_1718.head()
 | Player Full Name | Birth Date | Year Start | Year End | Position | Height (ft 1/2) | Height (inches 2/2) | Height (in cm) | Wingspan (in cm) | Standing Reach (in cm) | Hand Length (in inches) | Hand Width (in inches) | Weight (in lb) | Body Fat (%) | College |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aaron Brooks | 1/14/1985 | 2008 | 2018 | G | 6.0 | 0.0 | 182.9 | 193.0 | 238.8 | NaN | NaN | 161.0 | 2.7% | University of Oregon |
1 | Aaron Gordon | 9/16/1995 | 2015 | 2018 | F | 6.0 | 9.0 | 205.7 | 212.7 | 266.7 | 8.75 | 10.5 | 220.0 | 5.1% | University of Arizona |
2 | Abdel Nader | 9/25/1993 | 2018 | 2018 | F | 6.0 | 6.0 | 198.1 | NaN | NaN | NaN | NaN | 230.0 | NaN | Iowa State University |
3 | Adreian Payne | 2/19/1991 | 2015 | 2018 | F-C | 6.0 | 10.0 | 208.3 | 223.5 | 276.9 | 9.25 | 9.5 | 237.0 | 7.6% | Michigan State University |
4 | Al Horford | 6/3/1986 | 2008 | 2018 | C-F | 6.0 | 10.0 | 208.3 | 215.3 | 271.8 | NaN | NaN | 245.0 | 9.1% | University of Florida |
Explore and visualize the data
player_salary['Salary (Million USD)'] = player_salary['season17_18']/1000000
fig, ax = plt.subplots(1,2, figsize=(11*2,7))
plt.subplot(1,2,1)
plt.hist(player_salary['Salary (Million USD)'], align='right',rwidth=.95,)
plt.ylabel("Frequency")
plt.xlabel("Salary (Millions, $)")
plt.subplot(1,2,2)
plt.hist(player_salary['Salary (Million USD)'], align='right',bins=400,
cumulative=True,
density=True,
histtype='step'
)
plt.ylabel("Cumulative probability")
plt.xlabel("Salary (Millions, $)")
plt.show()
These data are highly skewed, with a mean near \$7.0 M/year and a standard deviation of \$7.3 M/year, and a long tail toward the higher salaries.
player_salary['Salary (Million USD)'].mean(), player_salary['Salary (Million USD)'].std(),
(7.001891322851145, 7.336425350468712)
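As an aside, we can also put a number on that asymmetry; pandas' skew() returns the sample skewness, and a strongly right-tailed distribution like this one should score well above zero.
# Positive skewness quantifies the long right tail seen in the histogram
print(player_salary['Salary (Million USD)'].skew())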
Join the data
I have my predictor and response variables: biological stats and salary.
player_bio_1718['Player'] = player_bio_1718['Player Full Name']
joined_data = player_bio_1718.merge(player_salary, on="Player")
joined_data.head()
 | Player Full Name | Birth Date | Year Start | Year End | Position | Height (ft 1/2) | Height (inches 2/2) | Height (in cm) | Wingspan (in cm) | Standing Reach (in cm) | Hand Length (in inches) | Hand Width (in inches) | Weight (in lb) | Body Fat (%) | College | Player | Unnamed: 0 | Tm | season17_18 | Salary (Million USD) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aaron Brooks | 1/14/1985 | 2008 | 2018 | G | 6.0 | 0.0 | 182.9 | 193.0 | 238.8 | NaN | NaN | 161.0 | 2.7% | University of Oregon | Aaron Brooks | 319 | MIN | 2116955.0 | 2.116955 |
1 | Aaron Gordon | 9/16/1995 | 2015 | 2018 | F | 6.0 | 9.0 | 205.7 | 212.7 | 266.7 | 8.75 | 10.5 | 220.0 | 5.1% | University of Arizona | Aaron Gordon | 190 | ORL | 5504420.0 | 5.504420 |
2 | Abdel Nader | 9/25/1993 | 2018 | 2018 | F | 6.0 | 6.0 | 198.1 | NaN | NaN | NaN | NaN | 230.0 | NaN | Iowa State University | Abdel Nader | 446 | BOS | 1167333.0 | 1.167333 |
3 | Al Horford | 6/3/1986 | 2008 | 2018 | C-F | 6.0 | 10.0 | 208.3 | 215.3 | 271.8 | NaN | NaN | 245.0 | 9.1% | University of Florida | Al Horford | 11 | BOS | 27734405.0 | 27.734405 |
4 | Al Jefferson | 1/4/1985 | 2005 | 2018 | C-F | 6.0 | 10.0 | 208.3 | 219.7 | 279.4 | NaN | NaN | 289.0 | 10.5% | NaN | Al Jefferson | 128 | IND | 9769821.0 | 9.769821 |
For the player data, I’m going to look at Height (cm), Weight (lbs), age, …
# Clean this up by converting body fat from string to float
joined_data['Body Fat (%)'] = joined_data['Body Fat (%)'].str.rstrip('%').astype('float') / 100.0
columns = player_bio_1718.columns
metrics = list(columns[7:-2])
# Replace NaN with 0 for now, and
# neglect 0 values because these are physical measurements that should never be 0
joined_data_replace_nan = joined_data.fillna(0)
fig, ax = plt.subplots(len(metrics), 2, figsize=(11*2, 7*len(metrics)))
for count, metric in enumerate(metrics):
    data = joined_data_replace_nan[joined_data_replace_nan[metric] > 0]
    ax[count, 0].hist(data[metric], rwidth=.95)
    ax[count, 0].set_xlabel(metric)
    ax[count, 1].hist(data[metric],
                      bins=400,
                      cumulative=True,
                      density=True,
                      histtype='step')
    ax[count, 1].set_xlabel(metric)
for hist in ax[:, 0]:
    hist.set_ylabel('Frequency')
for CDF in ax[:, 1]:
    CDF.set_ylabel('Cumulative probability')
These are looking pretty Gaussian, but I will have to scale them to make their values and ranges similar.
Clean data and Feature Engineer
Salary
The salaries appear roughly log-normal, so I am going to transform them to make the distribution look more Gaussian.
joined_data['Log Salary (M $)'] = np.log10(joined_data['Salary (Million USD)'])
plt.hist(joined_data['Log Salary (M $)'], align='right',rwidth=.95,)
plt.ylabel("Frequency")
plt.xlabel("Log 10 Salary (Millions, $)")
plt.show()
This shows the distribution is more log-uniform than log-normal. Nevertheless, let’s proceed:
Let’s see how the biological stats compare to the Log10 Salary
columns = player_bio_1718.columns
metrics = list(columns[7:-2])
scatter_list = []
# Replace NaN with 0 for now, and
# neglect 0 values because these are physical measurements that should never be 0
joined_data_replace_nan = joined_data.fillna(0)
fig, ax = plt.subplots(4, 2, figsize=(11*2, 7*4))
for count, metric in enumerate(metrics):
    data = joined_data_replace_nan[joined_data_replace_nan[metric] > 0]
    ax[count//2, count%2].scatter(x=data[metric],
                                  y=data['Log Salary (M $)'],
                                  s=50,
                                  alpha=0.5)
    ax[count//2, count%2].set_xlabel(metric)
    ax[count//2, count%2].set_ylabel('Log Salary (Millions, $)')
These look generally pretty uniform, meaning salary is independent of most of these features.
There are some data points at the extremes that stick out. Let’s keep going - maybe we can add some more predictive features.
Create Features
print('Total number of players: '+str(len(player_bio_1718)))
for column in player_bio_1718:
    print(player_bio_1718[column].isna().sum(), column)
Total number of players: 471
0 Player Full Name
0 Birth Date
0 Year Start
0 Year End
0 Position
0 Height (ft 1/2)
0 Height (inches 2/2)
0 Height (in cm)
177 Wingspan (in cm)
177 Standing Reach (in cm)
259 Hand Length (in inches)
259 Hand Width (in inches)
0 Weight (in lb)
190 Body Fat (%)
82 College
0 Player
Predict wingspan
We’re missing a fair amount of data in Body Fat, Wingspan, Reach, and Hand Length and Width. Can we reconstruct it from domain knowledge?
data=player_bio_1718[player_bio_1718['Wingspan (in cm)'] > 0]
x_dim = 'Height (in cm)'
y_dim = 'Wingspan (in cm)'
plt.scatter(
x=data[x_dim],
y=data[y_dim],
alpha=0.5)
plt.xlabel(x_dim)
plt.ylabel(y_dim)
Text(0, 0.5, 'Wingspan (in cm)')
We know that height can be used to predict wingspan fairly well in the general population, and the chart above is promising. Let’s try it.
# Rows that DO have a wingspan measurement, used to fit the regression
wingspan_data = player_bio_1718[player_bio_1718['Wingspan (in cm)'] > 0]
heights = wingspan_data['Height (in cm)']
wingspans = wingspan_data['Wingspan (in cm)']
regression = LinearRegression().fit(np.array(heights).reshape(-1,1), np.array(wingspans))
player_bio_1718['Wingspan predictions (in cm)'] = regression.predict(
np.array(player_bio_1718['Height (in cm)']).reshape(-1,1))
regression.score(np.array(heights).reshape(-1,1), np.array(wingspans))
0.6885110373104122
data=player_bio_1718[player_bio_1718['Wingspan (in cm)'] > 0]
x_dim = 'Height (in cm)'
y_dim = 'Wingspan (in cm)'
plt.scatter(
x=data[x_dim],
y=data[y_dim],
label='Observations',
alpha=0.5)
plt.scatter(
x=data[x_dim],
y=data['Wingspan predictions (in cm)'],
label='Predictions from height')
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.legend()
plt.show()
This is not a great prediction: the $R^2$ score is $0.69$ out of $1.0$.
Wingspan is less tightly tied to height among NBA players than in the general population, so I’m not going to use the predicted wingspans in my model. Because I’m fitting a linear regression, a feature that is itself a linear function of height would basically just modify the coefficient on the height feature already in the data.
Create BMI
Do NBA players have similar BMIs? \(BMI = \frac{mass\ (kg)}{(height\ (m))^2}\). Typically this measurement is not useful for athletes because it does not distinguish fat weight from muscle weight (see the note on BMI).
We have height in cm and mass in pounds, so to convert it I will use the formula: \(BMI = \frac{mass (lbs) / 2.2}{(height (cm)/100)^2}\)
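As a quick sanity check on the unit conversion, take Aaron Gordon from the bio table above (205.7 cm, 220 lb): \(BMI = \frac{220 / 2.2}{(2.057)^2} \approx 23.6\), which matches the BMI column computed below.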
joined_data['BMI']= (joined_data['Weight (in lb)']/2.2)/((joined_data['Height (in cm)']/100)**2)
plot_metric_salary('BMI',joined_data)
We have some players coming in “overweight” according to BMI (BMI > 25), but, as mentioned above, it doesn’t account for what type of tissue the weight is coming from.
Actually, those guys are some of the higher paid players!
Create Hand Area
I think the surface area of the hand is more closely related to its impact on the game of basketball than either the width or length individually. Treating the hand as an ellipse whose axes are the hand length and width, the area is \(A = \pi \frac{length}{2} \frac{width}{2}\); since a constant factor is irrelevant to a linear model, I’ll create the feature as simply \(length \times width\):
joined_data['Hand Area (inches^2)'] = joined_data['Hand Width (in inches)']*joined_data['Hand Length (in inches)']
metric='Hand Area (inches^2)'
plot_metric_salary(metric,joined_data)
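For a worked check, Aaron Gordon’s 8.75 in by 10.5 in hands give \(8.75 \times 10.5 = 91.875\ in^2\), which is the value that appears in his row in the tables below.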
Create Age
I will use the birthdate column to calculate the age of the player in the 2017-2018 season.
joined_data['Birth Date'].head()
0 1/14/1985
1 9/16/1995
2 9/25/1993
3 6/3/1986
4 1/4/1985
Name: Birth Date, dtype: object
joined_data['Birth Date']=pd.to_datetime(joined_data['Birth Date'])
joined_data['Age'] = 2018 - pd.DatetimeIndex(joined_data['Birth Date']).year
joined_data.Age.head()
0 33
1 23
2 25
3 32
4 33
Name: Age, dtype: int64
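Note this gives age at year resolution only: a player born in December 1995 gets the same age as one born in January 1995. A minimal sketch of a finer-grained alternative, assuming we anchor to a mid-season reference date of Feb 1, 2018 (my choice, not from the data):
# Hypothetical refinement: whole years elapsed as of a mid-season reference date
reference_date = pd.Timestamp('2018-02-01')
age_days = (reference_date - joined_data['Birth Date']).dt.days
joined_data['Age (mid-season)'] = (age_days / 365.25).astype(int)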
metric='Age'
plot_metric_salary(metric,joined_data)
Who is in the 45+ category?
joined_data[joined_data.Age > 45]
 | Player Full Name | Birth Date | Year Start | Year End | Position | Height (ft 1/2) | Height (inches 2/2) | Height (in cm) | Wingspan (in cm) | Standing Reach (in cm) | ... | College | Player | Unnamed: 0 | Tm | season17_18 | Salary (Million USD) | Log Salary (M $) | BMI | Hand Area (inches^2) | Age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
259 | Larry Nance | 1959-02-12 | 2016 | 2018 | F | 6.0 | 9.0 | 205.7 | 217.2 | 274.3 | ... | University of Wyoming | Larry Nance | 384 | CLE | 1471382.0 | 1.471382 | 0.167725 | 24.707942 | 87.75 | 59 |
389 | Tim Hardaway | 1966-09-01 | 2014 | 2018 | G | 6.0 | 6.0 | 198.1 | NaN | NaN | ... | University of Michigan | Tim Hardaway | 64 | NYK | 16500000.0 | 16.500000 | 1.217484 | 23.744456 | NaN | 52 |
2 rows × 24 columns
These are former pros whose sons in the league share their names. Per Wikipedia, the biological stats for Larry Nance and Tim Hardaway match the Juniors more closely (except for Birth Date), so I’m going to update the Birth Dates and recalculate the ages.
player_salary[player_salary.Player.str.contains('Nance')], player_salary[player_salary.Player.str.contains('Hardaway')]
( Unnamed: 0 Player Tm season17_18 Salary (Million USD)
383 384 Larry Nance CLE 1471382.0 1.471382,
Unnamed: 0 Player Tm season17_18 Salary (Million USD)
63 64 Tim Hardaway NYK 16500000.0 16.5)
These teams and salaries match the data from the 2017-2018 season.
joined_data.loc[joined_data['Player']=='Larry Nance','Birth Date'] = '1993-01-01' #Larry Nance Jr.
joined_data.loc[joined_data['Player']=='Tim Hardaway','Birth Date'] = '1992-03-16' #Tim Hardaway Jr.
joined_data['Birth Date']=pd.to_datetime(joined_data['Birth Date'])
joined_data['Age'] = 2018 - pd.DatetimeIndex(joined_data['Birth Date']).year
metric='Age'
plot_metric_salary(metric,joined_data)
joined_data[joined_data.Age > 39]
 | Player Full Name | Birth Date | Year Start | Year End | Position | Height (ft 1/2) | Height (inches 2/2) | Height (in cm) | Wingspan (in cm) | Standing Reach (in cm) | ... | College | Player | Unnamed: 0 | Tm | season17_18 | Salary (Million USD) | Log Salary (M $) | BMI | Hand Area (inches^2) | Age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
105 | Dirk Nowitzki | 1978-06-19 | 1999 | 2018 | F | 7.0 | 0.0 | 213.4 | NaN | NaN | ... | NaN | Dirk Nowitzki | 201 | DAL | 5000000.0 | 5.000000 | 0.698970 | 24.454263 | NaN | 40 |
179 | Jason Terry | 1977-09-15 | 2000 | 2018 | G | 6.0 | 2.0 | 188.0 | NaN | NaN | ... | University of Arizona | Jason Terry | 296 | MIL | 2328652.0 | 2.328652 | 0.367105 | 23.792131 | NaN | 41 |
274 | Manu Ginobili | 1977-07-28 | 2003 | 2018 | G | 6.0 | 6.0 | 198.1 | NaN | NaN | ... | NaN | Manu Ginobili | 277 | SAS | 2500000.0 | 2.500000 | 0.397940 | 23.744456 | NaN | 41 |
416 | Vince Carter | 1977-01-26 | 1999 | 2018 | G-F | 6.0 | 6.0 | 198.1 | NaN | NaN | ... | University of North Carolina | Vince Carter | 143 | SAC | 8000000.0 | 8.000000 | 0.903090 | 25.481856 | NaN | 41 |
4 rows × 24 columns
This checks out!
Create Years in the league
joined_data['Years in the league'] = 2018 - joined_data['Year Start']
metric='Years in the league'
plot_metric_salary(metric,joined_data)
Train the model
Before I do that, I definitely need to drop some columns. I’m going to set the categorical data (College, Position, Team) aside for now, in case I want it later.
joined_data.columns
Index(['Player Full Name', 'Birth Date', 'Year Start', 'Year End', 'Position',
'Height (ft 1/2)', 'Height (inches 2/2)', 'Height (in cm)',
'Wingspan (in cm)', 'Standing Reach (in cm)', 'Hand Length (in inches)',
'Hand Width (in inches)', 'Weight (in lb)', 'Body Fat (%)', 'College',
'Player', 'Unnamed: 0', 'Tm', 'season17_18', 'Salary (Million USD)',
'Log Salary (M $)', 'BMI', 'Hand Area (inches^2)', 'Age',
'Years in the league'],
dtype='object')
dropped_columns = ['Player Full Name', 'Birth Date', 'Year Start', 'Year End',
'Height (ft 1/2)', 'Height (inches 2/2)','Unnamed: 0','season17_18','Player']
categorical_columns = ['Tm','Position','College']
joined_data_dropped = joined_data.drop(columns=dropped_columns)
joined_data_dropped = joined_data_dropped.drop(columns=categorical_columns)
joined_data_dropped.head()
 | Height (in cm) | Wingspan (in cm) | Standing Reach (in cm) | Hand Length (in inches) | Hand Width (in inches) | Weight (in lb) | Body Fat (%) | Salary (Million USD) | Log Salary (M $) | BMI | Hand Area (inches^2) | Age | Years in the league |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 182.9 | 193.0 | 238.8 | NaN | NaN | 161.0 | 0.027 | 2.116955 | 0.325712 | 21.876396 | NaN | 33 | 10 |
1 | 205.7 | 212.7 | 266.7 | 8.75 | 10.5 | 220.0 | 0.051 | 5.504420 | 0.740712 | 23.633684 | 91.875 | 23 | 3 |
2 | 198.1 | NaN | NaN | NaN | NaN | 230.0 | NaN | 1.167333 | 0.067195 | 26.640122 | NaN | 25 | 0 |
3 | 208.3 | 215.3 | 271.8 | NaN | NaN | 245.0 | 0.091 | 27.734405 | 1.443019 | 25.666394 | NaN | 32 | 10 |
4 | 208.3 | 219.7 | 279.4 | NaN | NaN | 289.0 | 0.105 | 9.769821 | 0.989887 | 30.275869 | NaN | 33 | 13 |
Options for NaNs
- Work on dataset of only intact rows
- Work on dataset of only intact columns
- Replace nan with median
Work on dataset with only intact rows
Let’s start with the first option and see how they all compare.
training_data_intact_rows = joined_data_dropped.dropna()
training_data_intact_rows.head()
 | Height (in cm) | Wingspan (in cm) | Standing Reach (in cm) | Hand Length (in inches) | Hand Width (in inches) | Weight (in lb) | Body Fat (%) | Salary (Million USD) | Log Salary (M $) | BMI | Hand Area (inches^2) | Age | Years in the league |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 205.7 | 212.7 | 266.7 | 8.75 | 10.50 | 220.0 | 0.051 | 5.504420 | 0.740712 | 23.633684 | 91.875 | 23 | 3 |
5 | 198.1 | 208.3 | 262.9 | 9.00 | 8.25 | 214.0 | 0.051 | 10.845506 | 1.035250 | 24.786896 | 74.250 | 27 | 6 |
7 | 215.9 | 222.3 | 0.0 | 9.00 | 10.75 | 260.0 | 0.064 | 4.187599 | 0.621965 | 25.353936 | 96.750 | 25 | 4 |
8 | 205.7 | 221.6 | 275.6 | 9.50 | 9.50 | 220.0 | 0.082 | 7.319035 | 0.864454 | 23.633684 | 90.250 | 28 | 7 |
10 | 198.1 | 211.5 | 262.9 | 8.25 | 8.50 | 210.0 | 0.047 | 19.332500 | 1.286288 | 24.323589 | 70.125 | 26 | 4 |
X = training_data_intact_rows.drop(columns=['Salary (Million USD)','Log Salary (M $)'])
y = training_data_intact_rows[['Log Salary (M $)']]
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X,y)
print('score: '+str(model.score(X,y)))
score: 0.6153114126697995
That’s a good score, but how does it predict outside of the training data?
To answer that, I’m going to fill in the values for the players I removed with the median value of each column.
training_data_filled = joined_data_dropped.fillna(joined_data_dropped.median())
X_filled = training_data_filled.drop(columns=['Salary (Million USD)','Log Salary (M $)'])
y_filled = training_data_filled[['Log Salary (M $)']]
model.score(X_filled,y_filled)
-0.9542480606293884
This model is overfit - when I evaluate it on the rest of the players, with reasonable (median) values filled in for hand size, reach, wingspan, etc., it fails spectacularly.
y_filled.insert(0,"Salary (M $)",10**y_filled['Log Salary (M $)'])
y_filled.insert(0,"Predicted Log Salary", model.predict(X_filled))
y_filled.insert(0,"Predicted Salary (M $)",10**y_filled["Predicted Log Salary"])
x_dim = 'Salary (M $)'
y_dim = 'Salary (M $)'
plt.scatter(
data=y_filled,
x=x_dim,
y=y_dim,
label='Observations',
alpha=0.5)
y_dim = 'Predicted Salary (M $)'
plt.scatter(
data=y_filled,
x=x_dim,
y=y_dim,
label='Predictions',
alpha=0.5,)
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.legend()
plt.show()
Who is it that we predict should be pulling in $1B?
joined_data.iloc[y_filled["Predicted Salary (M $)"].idxmax()].head()
Player Full Name Dirk Nowitzki
Birth Date 1978-06-19 00:00:00
Year Start 1999
Year End 2018
Position F
Name: 105, dtype: object
Per Wikipedia, Dirk Nowitzki “is widely regarded as one of the greatest power forwards of all time and is considered by many to be the greatest European player of all time.”
So maybe the Mavs were getting a deal! Or maybe the model is flawed.
Work on dataset with only intact columns
columns_with_na = []
for column in joined_data_dropped.columns:
    if np.sum(joined_data_dropped[column].isna()):
        columns_with_na.append(column)
training_data_intact_cols = joined_data_dropped.drop(columns=columns_with_na)
training_data_intact_cols.head()
 | Height (in cm) | Weight (in lb) | Salary (Million USD) | Log Salary (M $) | BMI | Age | Years in the league |
---|---|---|---|---|---|---|---|
0 | 182.9 | 161.0 | 2.116955 | 0.325712 | 21.876396 | 33 | 10 |
1 | 205.7 | 220.0 | 5.504420 | 0.740712 | 23.633684 | 23 | 3 |
2 | 198.1 | 230.0 | 1.167333 | 0.067195 | 26.640122 | 25 | 0 |
3 | 208.3 | 245.0 | 27.734405 | 1.443019 | 25.666394 | 32 | 10 |
4 | 208.3 | 289.0 | 9.769821 | 0.989887 | 30.275869 | 33 | 13 |
X = training_data_intact_cols.drop(columns=['Salary (Million USD)','Log Salary (M $)'])
y = training_data_intact_cols[['Log Salary (M $)']]
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X,y)
print('score: '+str(model.score(X,y)))
score: 0.2710707378505274
The score is lower here - as expected, since there are more examples and fewer features.
Work with filling NaN with median
training_data_filled = joined_data_dropped.fillna(joined_data_dropped.median())
X = training_data_filled.drop(columns=['Salary (Million USD)','Log Salary (M $)'])
y = training_data_filled[['Log Salary (M $)']]
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X,y)
print('score: '+str(model.score(X,y)))
score: 0.2827808374769417
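All three of these scores are computed on the same data each model was fit to, which flatters the intact-rows model in particular. A minimal sketch of a fairer comparison with cross-validation, assuming the three data frames above are still in scope:
from sklearn.model_selection import cross_val_score
datasets = {'intact rows': training_data_intact_rows,
            'intact columns': training_data_intact_cols,
            'median filled': training_data_filled}
for name, frame in datasets.items():
    X_cv = frame.drop(columns=['Salary (Million USD)','Log Salary (M $)'])
    y_cv = frame['Log Salary (M $)']
    pipe = make_pipeline(StandardScaler(), LinearRegression())
    # Mean out-of-fold R^2 penalizes strategies that merely memorize their rows
    print(name, cross_val_score(pipe, X_cv, y_cv, cv=5).mean())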
Test the model
Test/train split
Preprocessing
I’m going to try some polynomial features here on the “intact columns” dataset.
I’m choosing that dataset because many of the features I added are products of other features, and polynomial features generate exactly those kinds of product terms automatically.
I’m also choosing to drop BMI for the same reason: it’s derived from height and weight, so the degree-2 terms would duplicate it.
X = training_data_intact_cols.drop(columns=['Salary (Million USD)','Log Salary (M $)','BMI'])
y = training_data_intact_cols[['Log Salary (M $)']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
score_list =[]
degrees = list(range(1,5))
for degree in degrees:
    model = make_pipeline(StandardScaler(), PolynomialFeatures(degree, include_bias=False), Ridge())
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    score_list.append([train_score, test_score])
plt.plot(degrees, score_list, marker='o')
plt.legend(['Training data','Testing data'])
plt.xlabel('Polynomial degree')
plt.ylabel('R^2 score')
plt.show()
Polynomial degree of 2 has a slightly higher score on the testing data - above 2 we begin overfitting. This is a high variance error. We will use degree 2 for the following predictions.
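One knob left at its default above is Ridge’s regularization strength (\(\alpha = 1\)). A hedged sketch of tuning degree and \(\alpha\) together with a grid search - the candidate values here are my guesses, not part of the original analysis:
from sklearn.model_selection import GridSearchCV
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(include_bias=False), Ridge())
param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'ridge__alpha': [0.1, 1.0, 10.0]}
# 5-fold CV over the training split; final score on the held-out test split
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))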
Predictions/Production
model = make_pipeline(StandardScaler(), PolynomialFeatures(2, include_bias=False), Ridge())
model.fit(X,y) #train on the full dataset
model.feature_names_in_
array(['Height (in cm)', 'Weight (in lb)', 'Age', 'Years in the league'],
dtype=object)
RWL_stats = {
X.columns[0]:[180.4], #5'11" (height in cm)
X.columns[1]:[175], # weight in lbs
X.columns[2]:[29], #age
X.columns[3]:[0], #years in league
}
RWL_df = pd.DataFrame.from_dict(RWL_stats)
RWL_df
 | Height (in cm) | Weight (in lb) | Age | Years in the league |
---|---|---|---|---|
0 | 180.4 | 175 | 29 | 0 |
log10_RWL_salary = model.predict(RWL_df)
RWL_salary = 10**log10_RWL_salary
RWL_salary[0][0]
1.0938939170193462
So, a fair salary for me would be about $1.09 million a year.
Coach Ham, I already live in LA. My application is in the mail!
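The last step in the plan was to deploy the best model. A minimal sketch of persisting the fitted pipeline with joblib (the filename is my invention):
import joblib
# Persist the fitted scaler + polynomial-features + ridge pipeline
joblib.dump(model, 'nba_salary_model.joblib')
# Later, e.g. in a serving process, reload and predict without retraining
loaded_model = joblib.load('nba_salary_model.joblib')
print(10**loaded_model.predict(RWL_df)[0][0])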
y_pred = y.copy()  # copy so the inserts below don't also mutate y
y_pred.insert(0,"Salary (M $)",10**y_pred['Log Salary (M $)'])
y_pred.insert(0,"Predicted Log Salary", model.predict(X))
y_pred.insert(0,"Predicted Salary (M $)",10**y_pred["Predicted Log Salary"])
x_dim = 'Salary (M $)'
y_dim = 'Salary (M $)'
plt.scatter(
data=y_pred,
x=x_dim,
y=y_dim,
label='Observations')
y_dim = 'Predicted Salary (M $)'
plt.scatter(
data=y_pred,
x=x_dim,
y=y_dim,
label='Predictions',
alpha=0.5,)
plt.scatter(
x=0,
y=RWL_salary,
label='RWL',
alpha=1,)
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.legend()
plt.show()
plt.subplots(1,2,figsize=(22,7))
plt.subplot(1,2,1)
x_dim = 'Salary (M $)'
y_dim = 'Salary (M $)'
plt.scatter(
data=y_filled,
x=x_dim,
y=y_dim,
label='Observations')
y_dim = 'Predicted Salary (M $)'
plt.scatter(
data=y_filled,
x=x_dim,
y=y_dim,
label='Predictions',
alpha=0.5,)
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.title('Predictions with all biometrics')
plt.ylim([0,1100])
plt.legend()
plt.subplot(1,2,2)
x_dim = 'Salary (M $)'
y_dim = 'Salary (M $)'
plt.scatter(
data=y_pred,
x=x_dim,
y=y_dim,
label='Observations')
y_dim = 'Predicted Salary (M $)'
plt.scatter(
data=y_pred,
x=x_dim,
y=y_dim,
label='Predictions',
alpha=0.5,)
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.ylim([0,1100])
plt.title('Predictions with height, weight, age, and years')
plt.legend()
plt.show()
When viewed on the same scale, the model with fewer features is doing much better. But its $R^2$ score was still only 0.45, which is not great. We must wonder:
Why is this model doing so badly?
Or, why am I worth more than the minimum wage for an NBA player?
coef_df = pd.DataFrame(
    zip(
        list(model['polynomialfeatures'].get_feature_names_out(model.feature_names_in_)),
        list(model['ridge'].coef_)[0]), columns=['Category','Coefficient']
)
coef_df.insert(2, 'Magnitude of coef', np.abs(model['ridge'].coef_)[0])
coef_df.sort_values(by=['Magnitude of coef'], ascending=False)
 | Category | Coefficient | Magnitude of coef |
---|---|---|---|
3 | Years in the league | 0.579669 | 0.579669 |
12 | Age Years in the league | -0.393072 | 0.393072 |
2 | Age | -0.244206 | 0.244206 |
11 | Age^2 | 0.140130 | 0.140130 |
13 | Years in the league^2 | 0.077877 | 0.077877 |
0 | Height (in cm) | 0.048072 | 0.048072 |
6 | Height (in cm) Age | 0.044353 | 0.044353 |
7 | Height (in cm) Years in the league | -0.034242 | 0.034242 |
5 | Height (in cm) Weight (in lb) | 0.026550 | 0.026550 |
9 | Weight (in lb) Age | 0.021481 | 0.021481 |
4 | Height (in cm)^2 | -0.019526 | 0.019526 |
1 | Weight (in lb) | -0.017313 | 0.017313 |
10 | Weight (in lb) Years in the league | -0.016464 | 0.016464 |
8 | Weight (in lb)^2 | -0.003941 | 0.003941 |
I really like this analysis because it finds that age and years in the league are more important than height or weight.
Not only that, but the quadratic features Age$^2$ and Years$^2$ match my intuition and represent two different cases of high salary:
Young players have high potential: teams are eager to get young, talented players, and incentivize them with high salaries.
Players with more years in the league are experienced, known quantities, which is valuable in a different way. There could also be some survivorship bias at play here: perhaps only the good (valuable) players last many years. In that case, they are valuable for another reason (high skill) that correlates with years in the league.
Age$^2$ has a positive coefficient with high magnitude because players at both ends of that parabola are valuable. I draw a similar conclusion from the high-magnitude, positive coefficient on Years$^2$.
X_uniform = pd.DataFrame({"Age":np.linspace(20,40,num=len(X)), "Years in the league":np.linspace(0,20,num=len(X))})
X.update(X_uniform)
scaled_X=model['standardscaler'].transform(X)
scaled_age = scaled_X[:,2]
scaled_years=scaled_X[:,3]
age_coef, age2_coef = (np.array(coef_df.loc[coef_df['Category']=='Age','Coefficient']),
np.array(coef_df.loc[coef_df['Category']=='Age^2','Coefficient']))
age_salary = scaled_age*(age_coef)+(scaled_age**2)*(age2_coef)
years_coef, years2_coef = (np.array(coef_df.loc[coef_df['Category']=='Years in the league','Coefficient']),
np.array(coef_df.loc[coef_df['Category']=='Years in the league^2','Coefficient']))
years_salary =scaled_years*(years_coef)+(scaled_years**2)*(years2_coef)
plt.subplots(1,2,figsize=(22,7))
plt.subplot(1,2,1)
x_dim='Age'
y_dim='Salary (Million USD)'
plt.scatter(
data=joined_data,
x=x_dim,
y=y_dim,
label='Observations',
alpha=0.5,
)
plt.plot(X_uniform['Age'],10**age_salary,
label="Contribution from age only",
color='#ff7f0e')
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.ylim([-2,38])
plt.legend()
plt.subplot(1,2,2)
x_dim='Years in the league'
plt.scatter(
data=joined_data,
x=x_dim,
y=y_dim,
label='Observations',
alpha=0.5,
)
plt.plot(X_uniform['Years in the league'],10**years_salary,
label="Contribution from years only",
color='#ff7f0e')
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.ylim([-2,38])
plt.legend()
plt.show()
This calculation neglected the cross term $age \times years$. Let’s include it and re-evaluate.
X_uniform = pd.DataFrame({"Age":np.linspace(20,40,num=len(X)), "Years in the league":np.linspace(0,20,num=len(X))})
X.update(X_uniform)
scaled_X=model['standardscaler'].transform(X)
scaled_age = scaled_X[:,2]
scaled_years=scaled_X[:,3]
age_years_coef = np.array(coef_df.loc[coef_df['Category']=='Age Years in the league','Coefficient'])
age_years_salary = scaled_age*(age_coef)+(scaled_age**2)*(age2_coef)+(scaled_age*scaled_years)*(age_years_coef)
x_dim='Age'
y_dim='Salary (Million USD)'
plt.scatter(
data=joined_data,
x=x_dim,
y=y_dim,
label='Observations',
alpha=0.5
)
plt.plot(X_uniform['Age'],10**age_years_salary,
label="Contribution from age and years",
color='#ff7f0e')
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.ylim([-2,38])
plt.legend()
plt.show()
The little bump centered around age 23 shows that NBA salaries reward the rare combination of low age and high experience. After about age 28, the penalty of age overtakes the benefit of experience.