Modeling NBA salary with biometric features

Predicting NBA salary from biological stats.

I will follow the standard workflow taught in ML courses:

  1. Find data sources
  2. Explore and visualize the data
  3. Clean the data
  4. Engineer features
  5. Gather additional data
  6. Train models
  7. Deploy the best model

Refs: IBM, Coursera

Imports and settings

# Other
import os

# Data science
import pandas as pd
import numpy as np

# Machine Learning
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Visualization
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 14,
                    'figure.figsize':(11,7)})

def plot_metric_salary(metric, joined_data):
    """Scatter one metric against log salary, ignoring missing or zero values."""
    # Replace NaN with 0, then keep only positive values: these are physical
    # measurements that should never legitimately be 0
    joined_data_replace_nan = joined_data.fillna(0)
    data = joined_data_replace_nan[joined_data_replace_nan[metric] > 0]
    plt.scatter(x=data[metric], y=data['Log Salary (M $)'], 
                s=50, alpha=0.5)
    plt.xlabel(metric)
    plt.ylabel('Log Salary (Millions, $)')
    plt.show()

Find data sources

These are from Kaggle: player salary and player bio stats.

player_salary = pd.read_csv(os.path.join('NBA_data','player_salary_2017','NBA_season1718_salary.csv'))
player_salary.tail()
|     | Unnamed: 0 | Player        | Tm  | season17_18 |
|-----|------------|---------------|-----|-------------|
| 568 | 569        | Quinn Cook    | NOP | 25000.0     |
| 569 | 570        | Chris Johnson | HOU | 25000.0     |
| 570 | 571        | Beno Udrih    | DET | 25000.0     |
| 571 | 572        | Joel Bolomboy | MIL | 22248.0     |
| 572 | 573        | Jarell Eddie  | CHI | 17224.0     |

From Statista, the minimum salary is $815,615, so let’s eliminate anyone with a salary below that. I’m not sure how they snuck in there!

player_salary = player_salary[player_salary['season17_18'] >= 815615]
player_bio_stats = pd.read_csv(os.path.join('NBA_data','player_measurements_1947-to-2017','player_measures_1947-2017.csv'))
player_bio_stats.head()
|   | Player Full Name | Birth Date | Year Start | Year End | Position | Height (ft 1/2) | Height (inches 2/2) | Height (in cm) | Wingspan (in cm) | Standing Reach (in cm) | Hand Length (in inches) | Hand Width (in inches) | Weight (in lb) | Body Fat (%) | College |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A.C. Green | 10/4/1963 | 1986 | 2001 | F-C | 6.0 | 9.0 | 205.7 | NaN | NaN | NaN | NaN | 220.0 | NaN | Oregon State University |
| 1 | A.J. Bramlett | 1/10/1977 | 2000 | 2000 | C | 6.0 | 10.0 | 208.3 | NaN | NaN | NaN | NaN | 227.0 | NaN | University of Arizona |
| 2 | A.J. English | 7/11/1967 | 1991 | 1992 | G | 6.0 | 3.0 | 190.5 | NaN | NaN | NaN | NaN | 175.0 | NaN | Virginia Union University |
| 3 | A.J. Guyton | 2/12/1978 | 2001 | 2003 | G | 6.0 | 1.0 | 185.4 | 192.4 | 247.7 | NaN | NaN | 180.0 | NaN | Indiana University |
| 4 | A.J. Hammons | 8/27/1992 | 2017 | 2017 | C | 7.0 | 0.0 | 213.4 | NaN | NaN | NaN | NaN | 260.0 | NaN | Purdue University |

I want to limit my predictor data to match my salary data: players active in the 2017-2018 season.
From the first entry, A.C. Green, I can understand the convention for this dataset. A.C. Green’s first season was the 1985-1986 season and his last season was the 2000-2001 season. So the Year Start and Year End columns use the later year of the season.

(Note: that’s a super long career, A.C.!)

player_bio_1718 = player_bio_stats[
    (player_bio_stats['Year Start'] <= 2018)&(player_bio_stats['Year End'] >= 2018)
].reset_index(drop=True)
player_bio_1718.head()
|   | Player Full Name | Birth Date | Year Start | Year End | Position | Height (ft 1/2) | Height (inches 2/2) | Height (in cm) | Wingspan (in cm) | Standing Reach (in cm) | Hand Length (in inches) | Hand Width (in inches) | Weight (in lb) | Body Fat (%) | College |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Brooks | 1/14/1985 | 2008 | 2018 | G | 6.0 | 0.0 | 182.9 | 193.0 | 238.8 | NaN | NaN | 161.0 | 2.7% | University of Oregon |
| 1 | Aaron Gordon | 9/16/1995 | 2015 | 2018 | F | 6.0 | 9.0 | 205.7 | 212.7 | 266.7 | 8.75 | 10.5 | 220.0 | 5.1% | University of Arizona |
| 2 | Abdel Nader | 9/25/1993 | 2018 | 2018 | F | 6.0 | 6.0 | 198.1 | NaN | NaN | NaN | NaN | 230.0 | NaN | Iowa State University |
| 3 | Adreian Payne | 2/19/1991 | 2015 | 2018 | F-C | 6.0 | 10.0 | 208.3 | 223.5 | 276.9 | 9.25 | 9.5 | 237.0 | 7.6% | Michigan State University |
| 4 | Al Horford | 6/3/1986 | 2008 | 2018 | C-F | 6.0 | 10.0 | 208.3 | 215.3 | 271.8 | NaN | NaN | 245.0 | 9.1% | University of Florida |

Explore and visualize the data

player_salary['Salary (Million USD)'] = player_salary['season17_18']/1000000
fig, ax = plt.subplots(1,2, figsize=(11*2,7))
plt.subplot(1,2,1)
plt.hist(player_salary['Salary (Million USD)'], align='right', rwidth=.95)
plt.ylabel("Frequency")
plt.xlabel("Salary (Millions, $)")
plt.subplot(1,2,2)
plt.hist(player_salary['Salary (Million USD)'], align='right', bins=400,
         cumulative=True,
         density=True,
         histtype='step')
plt.ylabel("Cumulative probability")
plt.xlabel("Salary (Millions, $)")

plt.show()

png

These data are highly skewed, with a mean near \$7.0 M/year and a standard deviation of \$7.3 M/year, and a long tail towards the higher salaries.

player_salary['Salary (Million USD)'].mean(), player_salary['Salary (Million USD)'].std(), 
(7.001891322851145, 7.336425350468712)

Join the data

I have my predictor and response variables: biological stats and salary

player_bio_1718['Player'] = player_bio_1718['Player Full Name']
joined_data = player_bio_1718.merge(player_salary, on="Player")
joined_data.head()
|   | Player Full Name | Birth Date | Year Start | Year End | Position | Height (ft 1/2) | Height (inches 2/2) | Height (in cm) | Wingspan (in cm) | Standing Reach (in cm) | Hand Length (in inches) | Hand Width (in inches) | Weight (in lb) | Body Fat (%) | College | Player | Unnamed: 0 | Tm | season17_18 | Salary (Million USD) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaron Brooks | 1/14/1985 | 2008 | 2018 | G | 6.0 | 0.0 | 182.9 | 193.0 | 238.8 | NaN | NaN | 161.0 | 2.7% | University of Oregon | Aaron Brooks | 319 | MIN | 2116955.0 | 2.116955 |
| 1 | Aaron Gordon | 9/16/1995 | 2015 | 2018 | F | 6.0 | 9.0 | 205.7 | 212.7 | 266.7 | 8.75 | 10.5 | 220.0 | 5.1% | University of Arizona | Aaron Gordon | 190 | ORL | 5504420.0 | 5.504420 |
| 2 | Abdel Nader | 9/25/1993 | 2018 | 2018 | F | 6.0 | 6.0 | 198.1 | NaN | NaN | NaN | NaN | 230.0 | NaN | Iowa State University | Abdel Nader | 446 | BOS | 1167333.0 | 1.167333 |
| 3 | Al Horford | 6/3/1986 | 2008 | 2018 | C-F | 6.0 | 10.0 | 208.3 | 215.3 | 271.8 | NaN | NaN | 245.0 | 9.1% | University of Florida | Al Horford | 11 | BOS | 27734405.0 | 27.734405 |
| 4 | Al Jefferson | 1/4/1985 | 2005 | 2018 | C-F | 6.0 | 10.0 | 208.3 | 219.7 | 279.4 | NaN | NaN | 289.0 | 10.5% | NaN | Al Jefferson | 128 | IND | 9769821.0 | 9.769821 |

For the player data, I’m going to look at Height (cm), Weight (lbs), age, …

# Clean this up by converting body fat from string to float
joined_data['Body Fat (%)'] = joined_data['Body Fat (%)'].str.rstrip('%').astype('float') / 100.0
columns = player_bio_1718.columns
metrics = list(columns[7:-2])
# replace Nan with 0 for now, and
# neglect 0 values because these are physical measurements that should never be 0
joined_data_replace_nan = joined_data.fillna(0)
fig, ax = plt.subplots(len(metrics), 2, figsize=(11*2, 7*len(metrics)))
for count, metric in enumerate(metrics):
    data = joined_data_replace_nan[joined_data_replace_nan[metric] > 0]
    ax[count, 0].hist(data[metric], rwidth=.95)
    ax[count, 0].set_xlabel(metric)
    ax[count, 1].hist(data[metric],
                      bins=400,
                      cumulative=True,
                      density=True,
                      histtype='step')
    ax[count, 1].set_xlabel(metric)
for hist in ax[:, 0]:
    hist.set_ylabel('Frequency')
for CDF in ax[:, 1]:
    CDF.set_ylabel('Cumulative probability')

png

These are looking pretty Gaussian, but I will have to scale them to make their values and ranges similar.
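
As a quick sketch of what that scaling does (using StandardScaler, which reappears in the model pipeline later; the height column here is just an example):

from sklearn.preprocessing import StandardScaler

# Standardization maps each feature to z = (x - mean) / std,
# so every feature ends up with mean 0 and standard deviation 1
scaler = StandardScaler()
scaled_height = scaler.fit_transform(joined_data_replace_nan[['Height (in cm)']])
print(scaled_height.mean().round(3), scaled_height.std().round(3))  # 0.0 1.0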

Clean data and engineer features

Salary

The salaries appear roughly log-normal, so I am going to log-transform them to make the distribution more Gaussian.

joined_data['Log Salary (M $)'] = np.log10(joined_data['Salary (Million USD)'])
plt.hist(joined_data['Log Salary (M $)'], align='right',rwidth=.95,)
plt.ylabel("Frequency")
plt.xlabel("Log 10 Salary (Millions, $)")
plt.show()

png

This shows the distribution is closer to log-uniform than log-normal.
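
Rather than just eyeballing that, a quick check (a sketch, assuming scipy is available) is a Q-Q plot of the log salaries against a normal distribution:

from scipy import stats

# Points falling off the diagonal indicate departure from normality;
# a log-uniform distribution bows away from the line at both tails
stats.probplot(joined_data['Log Salary (M $)'], dist='norm', plot=plt)
plt.show()

Nevertheless, let's proceed: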

Let’s see how the biological stats compare to the Log10 Salary

columns = player_bio_1718.columns
metrics = list(columns[7:-2])
# replace NaN with 0 for now, and
# neglect 0 values because these are physical measurements that should never be 0
joined_data_replace_nan = joined_data.fillna(0)
fig, ax = plt.subplots(4,2, figsize=(11*2,7*4))
for count, metric in enumerate(metrics):
    data = joined_data_replace_nan[joined_data_replace_nan[metric] > 0]
    ax[count//2, count%2].scatter(x=data[metric], 
                               y=data['Log Salary (M $)'],
                                  s=50,
                                 alpha=0.5)
    ax[count//2, count%2].set_xlabel(metric)
    ax[count//2, count%2].set_ylabel('Log Salary (Millions, $)')

png

The scatter looks fairly uniform, suggesting salary is largely independent of most of these features.

There are some data points at the extremes that stick out. Let’s keep going - maybe we can add some more predictive features.

Create Features

print('Total number of players: '+str(len(player_bio_1718)))
for column in player_bio_1718:
    print(player_bio_1718[column].isna().sum(), column)
Total number of players: 471
0 Player Full Name
0 Birth Date
0 Year Start
0 Year End
0 Position
0 Height (ft 1/2)
0 Height (inches 2/2)
0 Height (in cm)
177 Wingspan (in cm)
177 Standing Reach (in cm)
259 Hand Length (in inches)
259 Hand Width (in inches)
0 Weight (in lb)
190 Body Fat (%)
82 College
0 Player

Predict wingspan

We’re missing a fair amount of data in Body Fat, Wingspan, Reach, and Hand Length and Width. Can we reconstruct it from domain knowledge?

data=player_bio_1718[player_bio_1718['Wingspan (in cm)'] > 0]
x_dim = 'Height (in cm)'
y_dim = 'Wingspan (in cm)'
plt.scatter(
    x=data[x_dim],
    y=data[y_dim],
    alpha=0.5)
plt.xlabel(x_dim)
plt.ylabel(y_dim)

Text(0, 0.5, 'Wingspan (in cm)')

png

We know that height can be used to predict wingspan fairly well in the general population, and the chart above is promising. Let’s try it.

# players that do have a wingspan measurement
wingspan_data = player_bio_1718[player_bio_1718['Wingspan (in cm)'] > 0]
heights = wingspan_data['Height (in cm)']
wingspans = wingspan_data['Wingspan (in cm)']
regression = LinearRegression().fit(np.array(heights).reshape(-1,1), np.array(wingspans))
player_bio_1718['Wingspan predictions (in cm)'] = regression.predict(
    np.array(player_bio_1718['Height (in cm)']).reshape(-1,1))

regression.score(np.array(heights).reshape(-1,1), np.array(wingspans))
0.6885110373104122
data=player_bio_1718[player_bio_1718['Wingspan (in cm)'] > 0]
x_dim = 'Height (in cm)'
y_dim = 'Wingspan (in cm)'
plt.scatter(
    x=data[x_dim],
    y=data[y_dim],
    label='Observations',
    alpha=0.5)
plt.scatter(
    x=data[x_dim],
    y=data['Wingspan predictions (in cm)'],
    label='Predictions from height')
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.legend()
plt.show()

png

This is not a great prediction: the $R^2$ score is only $0.69$ out of a possible $1.0$.

Wingspan is less tightly tied to height among NBA players than in the general population, so I’m not going to use the prediction in my model. Because I’m doing a linear regression, an imputed wingspan that is a linear function of height would basically just modify the coefficient on the height feature already in the data.
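
To see why, here is a minimal synthetic sketch (standalone, not the NBA data): adding a feature that is an exact linear function of an existing feature leaves a linear regression's fit unchanged.

rng = np.random.default_rng(0)
height = rng.normal(200, 10, size=(200, 1))
salary = 0.05 * height[:, 0] + rng.normal(0, 1, size=200)
imputed_wingspan = 1.06 * height + 4.0  # exactly linear in height (illustrative numbers)

r2_height = LinearRegression().fit(height, salary).score(height, salary)
both = np.hstack([height, imputed_wingspan])
r2_both = LinearRegression().fit(both, salary).score(both, salary)
print(r2_height, r2_both)  # equal: the imputed column adds no information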

Create BMI

Do NBA players have similar BMIs? \(BMI = \frac{mass (kg)}{(height (m))^2}\) Typically this measurement is not useful for athletes because it does not distinguish fat weight from muscle weight. Note on BMI.

We have height in cm and mass in pounds, so to convert it I will use the formula: \(BMI = \frac{mass (lbs) / 2.2}{(height (cm)/100)^2}\)

joined_data['BMI']= (joined_data['Weight (in lb)']/2.2)/((joined_data['Height (in cm)']/100)**2)
plot_metric_salary('BMI',joined_data)

png

We have some players coming in “overweight” according to BMI (BMI > 25), but, as mentioned above, it doesn’t account for what type of tissue the weight is coming from.

Actually, those guys are some of the higher paid players!

Create Hand Area

I think the surface area of the hand is more closely related to its impact on the game of basketball than either the width or length individually. I’m going to treat the hand as an ellipse whose axes are the hand length and width: \(A = \pi \cdot \frac{length}{2} \cdot \frac{width}{2}\). Since the constant factor \(\pi/4\) only rescales the feature, the code below uses the simple product \(length \times width\).

joined_data['Hand Area (inches^2)'] = joined_data['Hand Width (in inches)']*joined_data['Hand Length (in inches)']
metric='Hand Area (inches^2)'
plot_metric_salary(metric,joined_data)

png

Create Age

I will use the birthdate column to calculate the age of the player in the 2017-2018 season.

joined_data['Birth Date'].head()
0    1/14/1985
1    9/16/1995
2    9/25/1993
3     6/3/1986
4     1/4/1985
Name: Birth Date, dtype: object
joined_data['Birth Date']=pd.to_datetime(joined_data['Birth Date'])
joined_data['Age'] = 2018 - pd.DatetimeIndex(joined_data['Birth Date']).year
joined_data.Age.head()
0    33
1    23
2    25
3    32
4    33
Name: Age, dtype: int64
metric='Age'
plot_metric_salary(metric,joined_data)

png

Who is in the 45+ category?

joined_data[joined_data.Age > 45]
|   | Player Full Name | Birth Date | Year Start | Year End | Position | Height (ft 1/2) | Height (inches 2/2) | Height (in cm) | Wingspan (in cm) | Standing Reach (in cm) | ... | College | Player | Unnamed: 0 | Tm | season17_18 | Salary (Million USD) | Log Salary (M $) | BMI | Hand Area (inches^2) | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 259 | Larry Nance | 1959-02-12 | 2016 | 2018 | F | 6.0 | 9.0 | 205.7 | 217.2 | 274.3 | ... | University of Wyoming | Larry Nance | 384 | CLE | 1471382.0 | 1.471382 | 0.167725 | 24.707942 | 87.75 | 59 |
| 389 | Tim Hardaway | 1966-09-01 | 2014 | 2018 | G | 6.0 | 6.0 | 198.1 | NaN | NaN | ... | University of Michigan | Tim Hardaway | 64 | NYK | 16500000.0 | 16.500000 | 1.217484 | 23.744456 | NaN | 52 |

2 rows × 24 columns

These are former pros whose sons, who share their names, are in the league. Per Wikipedia, the biological stats for Larry Nance and Tim Hardaway match the Juniors more closely (except for Birth Date), so I’m going to update the Birth Dates and recalculate the ages.

player_salary[player_salary.Player.str.contains('Nance')], player_salary[player_salary.Player.str.contains('Hardaway')]
(     Unnamed: 0       Player   Tm  season17_18  Salary (Million USD)
 383         384  Larry Nance  CLE    1471382.0              1.471382,
     Unnamed: 0        Player   Tm  season17_18  Salary (Million USD)
 63          64  Tim Hardaway  NYK   16500000.0                  16.5)

These teams and salaries match the data from the 2017-2018 season.

joined_data.loc[joined_data['Player']=='Larry Nance','Birth Date'] = '1993-01-01' #Larry Nance Jr.
joined_data.loc[joined_data['Player']=='Tim Hardaway','Birth Date'] = '1992-03-16' #Tim Hardaway Jr.
joined_data['Birth Date']=pd.to_datetime(joined_data['Birth Date'])
joined_data['Age'] = 2018 - pd.DatetimeIndex(joined_data['Birth Date']).year
metric='Age'
plot_metric_salary(metric,joined_data)

png

joined_data[joined_data.Age > 39]
|   | Player Full Name | Birth Date | Year Start | Year End | Position | Height (ft 1/2) | Height (inches 2/2) | Height (in cm) | Wingspan (in cm) | Standing Reach (in cm) | ... | College | Player | Unnamed: 0 | Tm | season17_18 | Salary (Million USD) | Log Salary (M $) | BMI | Hand Area (inches^2) | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 105 | Dirk Nowitzki | 1978-06-19 | 1999 | 2018 | F | 7.0 | 0.0 | 213.4 | NaN | NaN | ... | NaN | Dirk Nowitzki | 201 | DAL | 5000000.0 | 5.000000 | 0.698970 | 24.454263 | NaN | 40 |
| 179 | Jason Terry | 1977-09-15 | 2000 | 2018 | G | 6.0 | 2.0 | 188.0 | NaN | NaN | ... | University of Arizona | Jason Terry | 296 | MIL | 2328652.0 | 2.328652 | 0.367105 | 23.792131 | NaN | 41 |
| 274 | Manu Ginobili | 1977-07-28 | 2003 | 2018 | G | 6.0 | 6.0 | 198.1 | NaN | NaN | ... | NaN | Manu Ginobili | 277 | SAS | 2500000.0 | 2.500000 | 0.397940 | 23.744456 | NaN | 41 |
| 416 | Vince Carter | 1977-01-26 | 1999 | 2018 | G-F | 6.0 | 6.0 | 198.1 | NaN | NaN | ... | University of North Carolina | Vince Carter | 143 | SAC | 8000000.0 | 8.000000 | 0.903090 | 25.481856 | NaN | 41 |

4 rows × 24 columns

This checks out!

Create Years in the league

joined_data['Years in the league'] = 2018 - joined_data['Year Start']
metric='Years in the league'
plot_metric_salary(metric,joined_data)

png

Train the model

Before I do that, I definitely need to drop some columns. I’m going to set the categorical data (College, Position, Team) aside separately for now in case I want it later.

joined_data.columns
Index(['Player Full Name', 'Birth Date', 'Year Start', 'Year End', 'Position',
       'Height (ft 1/2)', 'Height (inches 2/2)', 'Height (in cm)',
       'Wingspan (in cm)', 'Standing Reach (in cm)', 'Hand Length (in inches)',
       'Hand Width (in inches)', 'Weight (in lb)', 'Body Fat (%)', 'College',
       'Player', 'Unnamed: 0', 'Tm', 'season17_18', 'Salary (Million USD)',
       'Log Salary (M $)', 'BMI', 'Hand Area (inches^2)', 'Age',
       'Years in the league'],
      dtype='object')
dropped_columns = ['Player Full Name', 'Birth Date', 'Year Start', 'Year End',
       'Height (ft 1/2)', 'Height (inches 2/2)','Unnamed: 0','season17_18','Player']
categorical_columns = ['Tm','Position','College']
joined_data_dropped = joined_data.drop(columns=dropped_columns)
joined_data_dropped = joined_data_dropped.drop(columns=categorical_columns)
joined_data_dropped.head()
|   | Height (in cm) | Wingspan (in cm) | Standing Reach (in cm) | Hand Length (in inches) | Hand Width (in inches) | Weight (in lb) | Body Fat (%) | Salary (Million USD) | Log Salary (M $) | BMI | Hand Area (inches^2) | Age | Years in the league |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 182.9 | 193.0 | 238.8 | NaN | NaN | 161.0 | 0.027 | 2.116955 | 0.325712 | 21.876396 | NaN | 33 | 10 |
| 1 | 205.7 | 212.7 | 266.7 | 8.75 | 10.5 | 220.0 | 0.051 | 5.504420 | 0.740712 | 23.633684 | 91.875 | 23 | 3 |
| 2 | 198.1 | NaN | NaN | NaN | NaN | 230.0 | NaN | 1.167333 | 0.067195 | 26.640122 | NaN | 25 | 0 |
| 3 | 208.3 | 215.3 | 271.8 | NaN | NaN | 245.0 | 0.091 | 27.734405 | 1.443019 | 25.666394 | NaN | 32 | 10 |
| 4 | 208.3 | 219.7 | 279.4 | NaN | NaN | 289.0 | 0.105 | 9.769821 | 0.989887 | 30.275869 | NaN | 33 | 13 |

Options for NaNs

  1. Work on dataset of only intact rows
  2. Work on dataset of only intact columns
  3. Replace NaN with median (a scikit-learn equivalent is sketched below)
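
For reference, option 3 has a built-in scikit-learn equivalent; a minimal sketch (the training_data_imputed name is mine):

from sklearn.impute import SimpleImputer

# Learns each column's median on fit() and fills NaNs on transform();
# equivalent to the fillna(median) used below, but reusable in a pipeline
imputer = SimpleImputer(strategy='median')
training_data_imputed = pd.DataFrame(
    imputer.fit_transform(joined_data_dropped),
    columns=joined_data_dropped.columns)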

Work on dataset with only intact rows

Let’s start with option 1 and see how they all compare.

training_data_intact_rows = joined_data_dropped.dropna()
training_data_intact_rows.head()
|   | Height (in cm) | Wingspan (in cm) | Standing Reach (in cm) | Hand Length (in inches) | Hand Width (in inches) | Weight (in lb) | Body Fat (%) | Salary (Million USD) | Log Salary (M $) | BMI | Hand Area (inches^2) | Age | Years in the league |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 205.7 | 212.7 | 266.7 | 8.75 | 10.50 | 220.0 | 0.051 | 5.504420 | 0.740712 | 23.633684 | 91.875 | 23 | 3 |
| 5 | 198.1 | 208.3 | 262.9 | 9.00 | 8.25 | 214.0 | 0.051 | 10.845506 | 1.035250 | 24.786896 | 74.250 | 27 | 6 |
| 7 | 215.9 | 222.3 | 0.0 | 9.00 | 10.75 | 260.0 | 0.064 | 4.187599 | 0.621965 | 25.353936 | 96.750 | 25 | 4 |
| 8 | 205.7 | 221.6 | 275.6 | 9.50 | 9.50 | 220.0 | 0.082 | 7.319035 | 0.864454 | 23.633684 | 90.250 | 28 | 7 |
| 10 | 198.1 | 211.5 | 262.9 | 8.25 | 8.50 | 210.0 | 0.047 | 19.332500 | 1.286288 | 24.323589 | 70.125 | 26 | 4 |
X = training_data_intact_rows.drop(columns=['Salary (Million USD)','Log Salary (M $)'])
y = training_data_intact_rows[['Log Salary (M $)']]
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X,y)
print('score: '+str(model.score(X,y)))

score: 0.6153114126697995

That’s a good score, but how does it predict outside of the training data?

To answer that, I’m going to fill in the values for the players I removed with the median values for each category.

training_data_filled = joined_data_dropped.fillna(joined_data_dropped.median())
X_filled = training_data_filled.drop(columns=['Salary (Million USD)','Log Salary (M $)'])
y_filled = training_data_filled[['Log Salary (M $)']]
model.score(X_filled,y_filled)
-0.9542480606293884

This model is overfit - when I evaluate it on the rest of the players, with reasonable (median) values filled in for hand size, reach, wingspan, etc., it fails spectacularly.
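
A more systematic way to expose this (a sketch using cross-validation, which the comparison above doesn't use) is to score the intact-row model on held-out folds:

from sklearn.model_selection import cross_val_score

# 5-fold CV: every player is scored by a model that never saw them.
# Low or negative fold scores confirm the model does not generalize
cv_model = make_pipeline(StandardScaler(), LinearRegression())
cv_scores = cross_val_score(cv_model, X, np.ravel(y), cv=5, scoring='r2')
print(cv_scores)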

y_filled.insert(0,"Salary (M $)",10**y_filled['Log Salary (M $)'])
y_filled.insert(0,"Predicted Log Salary", model.predict(X_filled))
y_filled.insert(0,"Predicted Salary (M $)",10**y_filled["Predicted Log Salary"])

data=y_filled
x_dim = 'Salary (M $)'
y_dim = 'Salary (M $)'
plt.scatter(
    data=y_filled,
    x=x_dim,
    y=y_dim,
    label='Observations',
    alpha=0.5)
y_dim = 'Predicted Salary (M $)'
plt.scatter(
    data=y_filled,
    x=x_dim,
    y=y_dim,
    label='Predictions',
    alpha=0.5,)
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.legend()
plt.show()

png

Who is it that we predict should be pulling in $1B?

joined_data.iloc[y_filled["Predicted Salary (M $)"].idxmax()].head()
Player Full Name          Dirk Nowitzki
Birth Date          1978-06-19 00:00:00
Year Start                         1999
Year End                           2018
Position                              F
Name: 105, dtype: object

Per Wikipedia, Dirk Nowitzki “is widely regarded as one of the greatest power forwards of all time and is considered by many to be the greatest European player of all time.”

So maybe the Mavs were getting a deal! Or maybe the model is flawed.

Work on dataset with only intact columns

columns_with_na =[]
for column in joined_data_dropped.columns:
    if np.sum(joined_data_dropped[column].isna()):
        columns_with_na.append(column)
training_data_intact_cols=joined_data_dropped.drop(columns=columns_with_na)
training_data_intact_cols.head()
|   | Height (in cm) | Weight (in lb) | Salary (Million USD) | Log Salary (M $) | BMI | Age | Years in the league |
|---|---|---|---|---|---|---|---|
| 0 | 182.9 | 161.0 | 2.116955 | 0.325712 | 21.876396 | 33 | 10 |
| 1 | 205.7 | 220.0 | 5.504420 | 0.740712 | 23.633684 | 23 | 3 |
| 2 | 198.1 | 230.0 | 1.167333 | 0.067195 | 26.640122 | 25 | 0 |
| 3 | 208.3 | 245.0 | 27.734405 | 1.443019 | 25.666394 | 32 | 10 |
| 4 | 208.3 | 289.0 | 9.769821 | 0.989887 | 30.275869 | 33 | 13 |
X = training_data_intact_cols.drop(columns=['Salary (Million USD)','Log Salary (M $)'])
y = training_data_intact_cols[['Log Salary (M $)']]
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X,y)
print('score: '+str(model.score(X,y)))
score: 0.2710707378505274

The score is lower for this - as expected, since there are more examples and fewer features.

Work with filling NaN with median

training_data_filled = joined_data_dropped.fillna(joined_data_dropped.median())
X = training_data_filled.drop(columns=['Salary (Million USD)','Log Salary (M $)'])
y = training_data_filled[['Log Salary (M $)']]
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X,y)
print('score: '+str(model.score(X,y)))
score: 0.2827808374769417

Test the model

Test/train split

Preprocessing

I’m going to try some polynomial features here, on the “intact columns” dataset.

I’m choosing that dataset because several of the features I engineered are products of other features, and polynomial features would regenerate exactly those products.

I’m also choosing to drop BMI for the same reason: it’s a function of height and weight.

X = training_data_intact_cols.drop(columns=['Salary (Million USD)','Log Salary (M $)','BMI'])
y = training_data_intact_cols[['Log Salary (M $)']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
score_list =[]
degrees = list(range(1,5))
for degree in degrees:
    model = make_pipeline(StandardScaler(), PolynomialFeatures(degree, include_bias=False), Ridge()) 
    model.fit(X_train,y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    score_list.append([train_score, test_score])
plt.plot(degrees, score_list, marker='o')
plt.legend(['Training data','Testing data'])
plt.xlabel('Polynomial degree')
plt.ylabel('R^2 score')
plt.show()

png

Polynomial degree 2 has a slightly higher score on the testing data; above 2 we begin overfitting, which is a high-variance error. We will use degree 2 for the following predictions.
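
The same degree selection can be automated; a sketch with GridSearchCV (cross-validating over the training set rather than using the single split above):

from sklearn.model_selection import GridSearchCV

# Search the polynomial degree inside the pipeline with 5-fold CV
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(include_bias=False), Ridge())
search = GridSearchCV(pipe, {'polynomialfeatures__degree': [1, 2, 3, 4]},
                      cv=5, scoring='r2')
search.fit(X_train, np.ravel(y_train))
print(search.best_params_)  # expected to agree with the plot: degree 2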

Predictions/Production

model = make_pipeline(StandardScaler(), PolynomialFeatures(2, include_bias=False), Ridge()) 
model.fit(X,y) #train on the full dataset
model.feature_names_in_
array(['Height (in cm)', 'Weight (in lb)', 'Age', 'Years in the league'],
      dtype=object)
RWL_stats = {
            X.columns[0]:[180.4],  #5'11" (height in cm)
            X.columns[1]:[175], # weight in lbs
            X.columns[2]:[29], #age
            X.columns[3]:[0], #years in league
            }
RWL_df = pd.DataFrame.from_dict(RWL_stats)
RWL_df
|   | Height (in cm) | Weight (in lb) | Age | Years in the league |
|---|---|---|---|---|
| 0 | 180.4 | 175 | 29 | 0 |
log10_RWL_salary = model.predict(RWL_df)
RWL_salary = 10**log10_RWL_salary
RWL_salary[0][0]
1.0938939170193462

So, a fair salary for me would be about $1,094,000 a year.

Coach Ham, I already live in LA. My application is in the mail!

y_pred = y.copy()  # copy so the inserts below don't mutate the original y
y_pred.insert(0,"Salary (M $)",10**y_pred['Log Salary (M $)'])
y_pred.insert(0,"Predicted Log Salary", model.predict(X))
y_pred.insert(0,"Predicted Salary (M $)",10**y_pred["Predicted Log Salary"])
x_dim = 'Salary (M $)'
y_dim = 'Salary (M $)'
plt.scatter(
    data=y_pred,
    x=x_dim,
    y=y_dim,
    label='Observations')
y_dim = 'Predicted Salary (M $)'
plt.scatter(
    data=y_pred,
    x=x_dim,
    y=y_dim,
    label='Predictions',
    alpha=0.5,)
plt.scatter(
    x=0,
    y=RWL_salary,
    label='RWL',
    alpha=1,)
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.legend()
plt.show()

png

plt.subplots(1,2,figsize=(22,7))
plt.subplot(1,2,1)
data=y_filled
x_dim = 'Salary (M $)'
y_dim = 'Salary (M $)'
plt.scatter(
    data=y_filled,
    x=x_dim,
    y=y_dim,
    label='Observations')
y_dim = 'Predicted Salary (M $)'
plt.scatter(
    data=y_filled,
    x=x_dim,
    y=y_dim,
    label='Predictions',
    alpha=0.5,)
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.title('Predictions with all biometrics')
plt.ylim([0,1100])
plt.legend()
plt.subplot(1,2,2)
x_dim = 'Salary (M $)'
y_dim = 'Salary (M $)'
plt.scatter(
    data=y_pred,
    x=x_dim,
    y=y_dim,
    label='Observations')
y_dim = 'Predicted Salary (M $)'
plt.scatter(
    data=y_pred,
    x=x_dim,
    y=y_dim,
    label='Predictions',
    alpha=0.5,)
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.ylim([0,1100])
plt.title('Predictions with height, weight, age, and years')
plt.legend()
plt.show()

png

When viewed on the same scale, the model with fewer features is doing much better. But its $R^2$ score was still only 0.45, which is not great. We must wonder:

Why is this model doing so badly?

Or, why am I worth more than minimum wage for an NBA player?

coef_df = pd.DataFrame(
    zip(
    list(model['polynomialfeatures'].get_feature_names_out(model.feature_names_in_)),
    list(model['ridge'].coef_)[0]), columns=['Category','Coefficient']
)
coef_df.insert(2,'Magnitude of coef',np.abs(model['ridge'].coef_)[0])
coef_df.sort_values(by=['Magnitude of coef'],ascending=False)
|    | Category | Coefficient | Magnitude of coef |
|----|----------|-------------|-------------------|
| 3  | Years in the league | 0.579669 | 0.579669 |
| 12 | Age Years in the league | -0.393072 | 0.393072 |
| 2  | Age | -0.244206 | 0.244206 |
| 11 | Age^2 | 0.140130 | 0.140130 |
| 13 | Years in the league^2 | 0.077877 | 0.077877 |
| 0  | Height (in cm) | 0.048072 | 0.048072 |
| 6  | Height (in cm) Age | 0.044353 | 0.044353 |
| 7  | Height (in cm) Years in the league | -0.034242 | 0.034242 |
| 5  | Height (in cm) Weight (in lb) | 0.026550 | 0.026550 |
| 9  | Weight (in lb) Age | 0.021481 | 0.021481 |
| 4  | Height (in cm)^2 | -0.019526 | 0.019526 |
| 1  | Weight (in lb) | -0.017313 | 0.017313 |
| 10 | Weight (in lb) Years in the league | -0.016464 | 0.016464 |
| 8  | Weight (in lb)^2 | -0.003941 | 0.003941 |

I really like this analysis because it finds that age and years in the league are more important than height or weight.

Not only that, but the quadratic features of Age$^2$ and years$^2$ match my intuition and represent two different cases of high salary:

Young players have high potential: teams are eager to get young, talented players, and incentivize them with high salaries.

Players with more years in the league are more experienced and known quantities, which is valuable in a different way. There could also be some survivorship bias at play here: perhaps only the good (valuable) players play for many years. In that case, they are valuable for another reason (high skill), which correlates with years in the league.

Age$^2$ has a positive coefficient with high magnitude because players at both ends of that parabola are valuable. I draw a similar conclusion from the high-magnitude, positive coefficient on years$^2$.
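
To make "both ends of the parabola" concrete, here is the arithmetic using the standardized coefficients from the table above (a sketch; \(z_{age}\) is age measured in standard deviations from the mean):

\[
f(z_{age}) = -0.244\,z_{age} + 0.140\,z_{age}^{2}, \qquad z^{*} = -\frac{-0.244}{2 \times 0.140} \approx 0.87
\]

The parabola opens upward with its minimum about 0.87 standard deviations above the mean age, so the predicted contribution to log salary rises on both sides of that point.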

# Overwrite the Age and Years columns with uniform ramps over realistic
# ranges, then scale them with the pipeline's fitted StandardScaler
X_uniform = pd.DataFrame({"Age":np.linspace(20,40,num=len(X)), "Years in the league":np.linspace(0,20,num=len(X))})
X.update(X_uniform)
scaled_X = model['standardscaler'].transform(X)
scaled_age = scaled_X[:,2]    # Age is the third feature column
scaled_years = scaled_X[:,3]  # Years in the league is the fourth
age_coef, age2_coef = (np.array(coef_df.loc[coef_df['Category']=='Age','Coefficient']), 
                       np.array(coef_df.loc[coef_df['Category']=='Age^2','Coefficient']))
age_salary = scaled_age*(age_coef)+(scaled_age**2)*(age2_coef)
years_coef, years2_coef =  (np.array(coef_df.loc[coef_df['Category']=='Years in the league','Coefficient']), 
                            np.array(coef_df.loc[coef_df['Category']=='Years in the league^2','Coefficient']))
years_salary =scaled_years*(years_coef)+(scaled_years**2)*(years2_coef)
plt.subplots(1,2,figsize=(22,7))
plt.subplot(1,2,1)
x_dim='Age'
y_dim='Salary (Million USD)'
plt.scatter(
    data=joined_data,
    x=x_dim,
    y=y_dim,
    label='Observations',
    alpha=0.5,
)
plt.plot(X_uniform['Age'],10**age_salary,
        label="Contribution from age only",
         color='#ff7f0e')
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.ylim([-2,38])
plt.legend()

plt.subplot(1,2,2)
x_dim='Years in the league'
plt.scatter(
    data=joined_data,
    x=x_dim,
    y=y_dim,
    label='Observations',
    alpha=0.5,

)
plt.plot(X_uniform['Years in the league'],10**years_salary,
        label="Contribution from years only",
         color='#ff7f0e')
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.ylim([-2,38])
plt.legend()
plt.show()

png

This calculation neglected the cross term of $age \times years$. Let's include it and re-evaluate.

X_uniform = pd.DataFrame({"Age":np.linspace(20,40,num=len(X)), "Years in the league":np.linspace(0,20,num=len(X))})
X.update(X_uniform)
scaled_X=model['standardscaler'].transform(X)
scaled_age = scaled_X[:,2]
scaled_years=scaled_X[:,3]
age_years_coef = np.array(coef_df.loc[coef_df['Category']=='Age Years in the league','Coefficient'])
age_years_salary = scaled_age*(age_coef)+(scaled_age**2)*(age2_coef)+(scaled_age*scaled_years)*(age_years_coef)
x_dim='Age'
y_dim='Salary (Million USD)'
plt.scatter(
    data=joined_data,
    x=x_dim,
    y=y_dim,
    label='Observations',
    alpha=0.5
)
plt.plot(X_uniform['Age'],10**age_years_salary,
        label="Contribution from age and years",
         color='#ff7f0e')
plt.xlabel(x_dim)
plt.ylabel(y_dim)
plt.ylim([-2,38])
plt.legend()
plt.show()

png

The little bump centered around 23 years old shows that NBA salaries reward the rare combination of low age and high experience. After an age of about 28, the penalty from age overtakes the benefit from experience.