Today we’ll expand on the basics of machine learning and examine how to train a linear regression machine learning model. This is the seventh part of a series that walks through the entire process of a data science project - from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, creating visualizations, and machine learning.
Getting Started
First, let’s take a look at an overview of this data science project. If you’re already familiar with it, feel free to skip to the next section.
Project Overview
As a reminder, the dataset we’ll be using in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season. Here’s a brief description of each major step of this project:
- Data Acquisition - This initial step involves obtaining data from two sources: (1) exporting the NCAA’s online individual player statistics report and (2) making API requests to the Yahoo Sports endpoint.
- Data Cleaning - This step focuses on identifying and correcting any errors within the dataset. This includes removing duplicates, correcting inaccuracies, and handling missing data.
- Data Preprocessing - This step ensures the data is suitable for analysis by converting datatypes, standardizing units, and replacing abbreviations.
- Feature Engineering - This step involves selecting and expanding upon the dataset’s features (or columns). This includes calculating additional metrics from existing columns.
- Data Exploration - This step focuses on analyzing and visualizing the dataset to uncover patterns, relationships, and general trends and is a helpful preliminary step before deeper analysis.
- Creating Visualizations - This step involves identifying the relationships between various parameters (such as height and blocked shots) and generating meaningful visualizations (such as bar charts, scatterplots, and candlestick charts).
- Machine Learning - This step focuses on selecting, training, and evaluating a machine learning model. For this project, the model will identify the combination of individual player statistics that correlates with optimal performance.
We’ll use Python along with popular libraries like pandas, numpy, and scikit-learn to accomplish these tasks efficiently. By the end of this series, you’ll be equipped with the skills needed to gather raw data from online sources, structure it into a usable format, eliminate any inconsistencies and errors, identify relationships between variables, create meaningful visualizations, and train a basic machine learning model. Due to the size of this project, today we’ll cover part of the seventh step: training a machine learning model.
Dependencies
Since this is the seventh installment in the series, you likely already have your environment set up and can skip to the next section. If you're not set up yet and want to follow along on your own machine, it's recommended to read the first article of the series, or at least the Getting Started section of that post, before continuing.
You'll want to have the latest version of Python installed, along with the packages used throughout this series, including pandas, numpy, seaborn, matplotlib, and scikit-learn.
For today's machine learning segment specifically, we'll want to import a few of these libraries:
import pandas as pd
import seaborn as sns
import joblib
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
Import Data
In Part 3 of this series, we engineered new features for our dataset, which is stored in a dataframe named player_data. No changes have been made to the underlying dataset in the intermediary articles. If you want to follow along with the code examples in this article, it's recommended to import the player_data dataframe before proceeding.
player_data = pd.read_excel('player_data_engineered.xlsx')
player_data.head()
- | PLAYER_NAME | Team | Class | Height | Position | PLAYER_ID | TEAM_NAME | GAMES | MINUTES_PLAYED | FIELD_GOALS_MADE | ... | Conference | MINUTES_PER_GAME | FOULS_PER_GAME | POINTS_PER_GAME | ASSISTS_PER_GAME | STEALS_PER_GAME | BLOCKS_PER_GAME | REBOUNDS_PER_GAME | ASSIST_TO_TURNOVER | FANTASY_POINTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Kiara Jackson | UNLV (Mountain West) | Junior | 67 | Guard | ncaaw.p.67149 | UNLV | 29 | 895 | 128 | ... | Mountain West | 30.862069 | 1.620690 | 11.137931 | 4.655172 | 1.068966 | 0.172414 | 4.448276 | 3.214286 | 710.3 |
1 | Raven Johnson | South Carolina (SEC) | Sophomore | 68 | Guard | ncaaw.p.67515 | South Carolina | 30 | 823 | 98 | ... | SEC | 27.433333 | 1.133333 | 8.100000 | 4.933333 | 2.000000 | 0.166667 | 5.366667 | 2.792453 | 735.2 |
2 | Gina Marxen | Montana (Big Sky) | Senior | 68 | Guard | ncaaw.p.57909 | Montana | 29 | 778 | 88 | ... | Big Sky | 26.827586 | 0.896552 | 10.241379 | 3.827586 | 0.551724 | 0.068966 | 2.068966 | 2.921053 | 533.5 |
3 | McKenna Hofschild | Colorado St. (Mountain West) | Senior | 62 | Guard | ncaaw.p.60402 | Colorado St. | 29 | 1046 | 231 | ... | Mountain West | 36.068966 | 1.172414 | 22.551724 | 7.275862 | 1.241379 | 0.137931 | 3.965517 | 2.971831 | 1117.5 |
4 | Kaylah Ivey | Boston College (ACC) | Junior | 68 | Guard | ncaaw.p.64531 | Boston Coll. | 33 | 995 | 47 | ... | ACC | 30.151515 | 1.454545 | 4.333333 | 5.636364 | 1.090909 | 0.030303 | 1.727273 | 2.906250 | 500.4 |
5 rows × 40 columns
Basics of Machine Learning
Before we get into training a model, let’s briefly revisit a few basics of machine learning. If you are already familiar with these concepts, feel free to skip to the next section. Machine learning is a branch of artificial intelligence that focuses on creating algorithms and statistical models that allow computer systems to “learn” how to improve their performance on a specific task through experience. In the context of our basketball statistics project, machine learning can be particularly useful for predicting player performance, classifying player position, and identifying similar players.
Key concepts in machine learning that we’ll encounter include:
- Model - The system that learns patterns from data and can be used to make predictions on previously unseen data. Machine learning models are often of a specific type (Linear or Logistic Regression, Random Forests, Support Vector Machines, Neural Networks, etc.). Today’s model is a Linear Regression model.
- Training Data - The subset of our data used to train the model.
- Testing Data - A separate subset of data used to evaluate the model’s performance.
- Features - The input variables used to make predictions. These are sometimes referred to as the independent variable(s) or the predictor(s). For this project, these are various player statistics like three points made and assists.
- Target Variable - The variable we’re trying to predict or optimize. This is sometimes referred to as the dependent variable(s), as it depends on the independent variable(s). In today’s project, this is Fantasy Points.
- Parameters - The values that the model learns during training, such as coefficients in linear regression. These parameters define how the model transforms input features into predictions.
- Hyperparameters - The configuration settings for the model that are set before training begins. These are not learned from the data but are specified by the data scientist. Examples include learning rate, number of iterations, or regularization strength. Hyperparameters can significantly affect model performance and are often tuned to optimize the model.
- Note: The model we’ll be using today is straightforward and doesn’t typically have hyperparameters in the traditional sense. However, it’s still important to know the difference between parameters and hyperparameters since many models will have hyperparameters.
- Residuals - The differences between the observed values and the predicted values from the model. Residuals help assess how well the model fits the data and can reveal patterns or issues in the model’s predictions.
- Model Evaluation - Metrics used to assess how well our model is performing. For a Linear Regression model, this will include metrics like Mean Squared Error (MSE) and the R-squared value.
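To make the last two concepts concrete, here is a tiny, self-contained sketch of residuals, Mean Squared Error, and R-squared using toy numbers (not our player dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy observed and predicted values (illustrative only)
y_actual = np.array([3.0, 5.0, 7.0])
y_predicted = np.array([2.5, 5.0, 8.0])

# Residuals: observed minus predicted
residuals = y_actual - y_predicted
print(residuals)                                  # [ 0.5  0.  -1. ]

# MSE: average of the squared residuals
print(mean_squared_error(y_actual, y_predicted))  # ≈ 0.4167

# R-squared: proportion of variance in y explained by the model
print(r2_score(y_actual, y_predicted))            # 0.84375
```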
We'll primarily use the first six terms throughout this article, so it's best to familiarize yourself with them now. The remaining concepts will be explored in more detail in future articles (please let me know if that is something you are interested in!).
Note: Our focus in this article is on classic machine learning models designed for tabular data. We won’t be covering models built specifically for natural language processing, image recognition, or video analysis. However, it’s worth mentioning that many problems in these domains often get transformed into tabular data problems, so some of the principles we discuss here may still apply in those contexts. With all of that out of the way, let’s move on to training the machine learning model.
Model Training
Now that we’ve covered the basics of machine learning, we’re ready to move on to the exciting part: training our Ordinary Least Squares (OLS) linear regression model! This process involves several key steps that will help us build a robust and accurate predictive model for our basketball player statistics.
Define the Variables
As a reminder from the previous article, we defined the target and features variables as:
target = 'FANTASY_POINTS'
features = ['Height', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'THREE_POINTS_MADE',
'TWO_POINTS_MADE', 'FREE_THROWS_MADE', 'TOTAL_REBOUNDS', 'ASSISTS',
'TURNOVERS', 'STEALS', 'BLOCKS', 'FOULS', 'POINTS']
We'll label the features (independent variables) as X and the target (dependent) variable as y for conciseness.
X = player_data[features]
y = player_data[target]
Let's take a quick look at the values in X:
X
- | Height | MINUTES_PLAYED | FIELD_GOALS_MADE | THREE_POINTS_MADE | TWO_POINTS_MADE | FREE_THROWS_MADE | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 67 | 895 | 128 | 28 | 100 | 39 | 129 | 135 | 42 | 31 | 5 | 47 | 323 |
1 | 68 | 823 | 98 | 20 | 78 | 27 | 161 | 148 | 53 | 60 | 5 | 34 | 243 |
2 | 68 | 778 | 88 | 58 | 30 | 63 | 60 | 111 | 38 | 16 | 2 | 26 | 297 |
3 | 62 | 1046 | 231 | 55 | 176 | 137 | 115 | 211 | 71 | 36 | 4 | 34 | 654 |
4 | 68 | 995 | 47 | 32 | 15 | 17 | 57 | 186 | 64 | 36 | 1 | 48 | 143 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
895 | 66 | 742 | 92 | 53 | 39 | 45 | 113 | 73 | 66 | 45 | 2 | 54 | 282 |
896 | 73 | 815 | 108 | 58 | 50 | 26 | 140 | 34 | 46 | 19 | 19 | 51 | 300 |
897 | 71 | 774 | 102 | 56 | 46 | 67 | 176 | 29 | 48 | 29 | 3 | 68 | 327 |
898 | 71 | 848 | 127 | 54 | 73 | 76 | 123 | 71 | 90 | 35 | 9 | 94 | 384 |
899 | 67 | 872 | 133 | 55 | 78 | 44 | 127 | 29 | 41 | 37 | 2 | 52 | 365 |
900 rows × 13 columns
Let's check the values in y as well:
y
0 710.3
1 735.2
2 533.5
3 1117.5
4 500.4
...
895 555.1
896 549.0
897 597.7
898 636.1
899 597.9
Name: FANTASY_POINTS, Length: 900, dtype: float64
These look great and match with the values we saw in the previous article, so we can move on to the next step.
Create Training and Testing Splits
Now that we have our variables defined, we can create the training and testing splits. This involves dividing our dataset into two parts: the training set, which will be used to fit the model, and the test set, which will be reserved exclusively for testing and evaluating the model after training.
At this point, you might wonder: Why don’t we just use all of the data for training? There are several reasons for this:
- Model Evaluation - Having a test set allows us to evaluate how well our model performs on unseen data, giving us a more realistic estimate of its performance in real-world scenarios.
- Preventing Overfitting - By keeping a portion of our data separate for testing, we can detect if our model is overfitting to the training data. Overfitting occurs when a model learns the training data too well, including its noise and peculiarities, leading to poor generalization on new data.
- Validating Model Generalization - The test set helps us validate whether our model can generalize well to new, unseen data. This is crucial for ensuring that our model will perform reliably when deployed in practice.
- Hyperparameter Tuning - While we don’t have any hyperparameters to tune today, splitting the data is also essential for hyperparameter tuning techniques like cross-validation.
To create our training and test splits, we'll use the train_test_split function from scikit-learn. This function shuffles the data and divides it into training and testing subsets in a single step. Here's how we can implement it:
X_train, X_test, y_train, y_test = train_test_split(X, y)
The _train splits (X_train and y_train) include the data for training the model, and the _test splits (X_test and y_test) include the data for testing the model. By using these separate splits, we know that our model is trained on one subset of the data and evaluated on a completely separate subset, so we can essentially set aside the _test splits until we're ready to evaluate the performance of the model.
We can take a look at the first few rows of the training split for the features using X_train:
X_train.head(5)
- | Height | MINUTES_PLAYED | FIELD_GOALS_MADE | THREE_POINTS_MADE | TWO_POINTS_MADE | FREE_THROWS_MADE | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
200 | 68 | 669 | 53 | 10 | 43 | 75 | 73 | 111 | 89 | 25 | 3 | 50 | 191 |
300 | 73 | 577 | 103 | 0 | 103 | 46 | 170 | 29 | 56 | 30 | 37 | 81 | 252 |
336 | 74 | 791 | 108 | 6 | 102 | 38 | 213 | 19 | 56 | 33 | 40 | 97 | 260 |
727 | 70 | 591 | 84 | 0 | 84 | 39 | 210 | 12 | 60 | 20 | 21 | 93 | 207 |
403 | 74 | 303 | 49 | 2 | 47 | 16 | 125 | 8 | 23 | 10 | 26 | 52 | 116 |
Reproducibility
You might notice that if you run train_test_split() a second time, different rows of data end up in each split. Here's an example of re-running the exact same code:
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train.head(5)
- | Height | MINUTES_PLAYED | FIELD_GOALS_MADE | THREE_POINTS_MADE | TWO_POINTS_MADE | FREE_THROWS_MADE | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
397 | 74 | 596 | 80 | 4 | 76 | 63 | 192 | 28 | 48 | 16 | 32 | 69 | 227 |
501 | 73 | 829 | 199 | 4 | 195 | 65 | 301 | 53 | 73 | 42 | 7 | 96 | 467 |
107 | 69 | 1072 | 128 | 42 | 86 | 75 | 187 | 137 | 86 | 111 | 7 | 80 | 373 |
462 | 75 | 619 | 59 | 0 | 59 | 54 | 126 | 18 | 57 | 29 | 26 | 53 | 172 |
263 | 78 | 995 | 278 | 2 | 276 | 103 | 330 | 52 | 62 | 17 | 60 | 50 | 661 |
This happens because the data is shuffled before splitting, and the shuffling is not reproducible by default. As a result, the model may be trained and tested on different subsets each time you run the code. That randomness can sometimes be desirable, but reproducible results are better for the initial creation and evaluation of a model (especially if you want to follow along with this guide).
We can ensure reproducibility of the splits by controlling the shuffling with the random_state parameter:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=314)
X_train.head(5)
- | Height | MINUTES_PLAYED | FIELD_GOALS_MADE | THREE_POINTS_MADE | TWO_POINTS_MADE | FREE_THROWS_MADE | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
239 | 68 | 1037 | 258 | 50 | 208 | 92 | 152 | 96 | 87 | 32 | 3 | 59 | 658 |
638 | 70 | 961 | 109 | 44 | 65 | 66 | 136 | 87 | 75 | 63 | 11 | 71 | 328 |
848 | 69 | 908 | 142 | 72 | 70 | 58 | 78 | 61 | 71 | 20 | 5 | 71 | 414 |
260 | 76 | 913 | 112 | 28 | 84 | 59 | 200 | 40 | 64 | 23 | 73 | 94 | 311 |
745 | 66 | 713 | 146 | 36 | 110 | 57 | 124 | 105 | 96 | 72 | 2 | 93 | 385 |
Now, no matter how many times you run the snippet, you should get the same rows of data in the train and test splits every time.
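To confirm this behavior, here's a small self-contained sketch (using a toy DataFrame rather than player_data) showing that the same random_state produces identical splits across runs:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical, not the player dataset)
df = pd.DataFrame({'a': range(10), 'b': range(10, 20)})
s = pd.Series(range(10))

# Two separate calls with the same seed
X1, _, y1, _ = train_test_split(df, s, random_state=314)
X2, _, y2, _ = train_test_split(df, s, random_state=314)

# Same seed -> same rows in the same order
print(X1.equals(X2), y1.equals(y2))  # True True
```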
Dataset Proportions
Another parameter that is commonly specified is the test_size
(or, less commonly, the train_size
) to specify a proportion of the dataset to include in the test or train split, respectively. According to the documentation, the default test_size
(assuming train_size
is not set) is 0.25
, so one-quarter of the data will be included in the test dataset and three-quarters of the data will be included in the train dataset.
We can verify this on our own splits:
print(f'Test data split proportion: {len(X_test) / len(X)}')
print(f'Train data split proportion: {len(X_train) / len(X)}')
Test data split proportion: 0.25
Train data split proportion: 0.75
If you want to change those proportions, you can use either the test_size or the train_size parameter. For example, if you want the test split to be 20% of the data instead of 25%, you would run:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
print(f'Test data split proportion: {len(X_test) / len(X)}')
print(f'Train data split proportion: {len(X_train) / len(X)}')
Test data split proportion: 0.2
Train data split proportion: 0.8
We'll be using the default 25% test size, so we can omit the test_size parameter for today.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=314)
print(f'Test data split proportion: {len(X_test) / len(X)}')
print(f'Train data split proportion: {len(X_train) / len(X)}')
Test data split proportion: 0.25
Train data split proportion: 0.75
A Note on DataFrames versus NumPy Arrays
You might notice that we're passing DataFrames directly (player_data[features]) to train_test_split, but some code samples use NumPy arrays instead (player_data[features].to_numpy()). Both approaches work, but they have different implications depending on how you intend to use the data. Let's quickly review the pros and cons of each approach.
Using pandas DataFrames or Series
This entails passing X and y as DataFrames or Series directly to the train_test_split function.
print(f'dtype of X: {type(X)}')
print(f'dtype of y: {type(y)}')
dtype of X: <class 'pandas.core.frame.DataFrame'>
dtype of y: <class 'pandas.core.series.Series'>
Some advantages of this approach are:
- Retains column names, which can be useful for model interpretation and debugging.
- Works seamlessly with scikit-learn, which can handle DataFrames directly.
- Makes it easier to perform certain operations, like feature selection and transformation.
Some disadvantages of this approach are:
- Slightly more overhead than working with NumPy arrays, though this is generally negligible.
Using NumPy arrays
This entails passing X and y as NumPy arrays to the train_test_split function.
print(f'dtype of X: {type(X.to_numpy())}')
print(f'dtype of y: {type(y.to_numpy())}')
dtype of X: <class 'numpy.ndarray'>
dtype of y: <class 'numpy.ndarray'>
Some advantages of this approach are:
- Can be slightly faster for certain operations, since NumPy arrays are lower-level structures.
- Required by some libraries that only accept NumPy arrays (although most scikit-learn functions accept DataFrames as well).
Some disadvantages of this approach are:
- You lose the column names and index information, which can make it harder to trace errors or interpret results later.
- Not necessary for most scikit-learn functions, which work fine with DataFrames.
In summary, you can choose either method based on your preference, but it's generally more convenient to use DataFrames directly unless you have a specific reason to convert: you retain the column names and index metadata that are useful during data analysis and model interpretation. If you have a specific scenario where a NumPy array is required, use .to_numpy(). For today, we'll move on with the DataFrames approach.
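As a quick illustration of the metadata difference, here's a toy example (hypothetical columns, not the full feature list):

```python
import pandas as pd

# A small DataFrame with named columns
df = pd.DataFrame({'Height': [67, 68], 'ASSISTS': [135, 148]})
print(list(df.columns))    # ['Height', 'ASSISTS'] — names retained

# Converting to a NumPy array keeps the values but drops names and index
arr = df.to_numpy()
print(type(arr).__name__)  # ndarray
```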
Train the Model
Now that we have our data split into training and test sets, we’re ready to train our linear regression model. We’ll use scikit-learn’s LinearRegression class (which uses ordinary least squares) for this purpose.
We can start by initializing the model as an instance of the LinearRegression class with the default parameters:
linear_reg_model = LinearRegression()
We can then use the fit() method to train our model on our data. This method takes two arguments: X_train (the training split of our features) and y_train (the training split of our target variable).
linear_reg_model.fit(X_train, y_train)
During the training process, the model uses the training set to learn the optimal parameters for each feature that minimize the difference between the predicted and actual values of the target variable. Once training is complete, our linear_reg_model object will contain the learned parameters (coefficients and intercept) that can be used to make predictions on new data. In the future, we can use X_test to predict what the model thinks y is, and then compare that output to the actual y values stored in y_test to evaluate the model's performance. For now, let's take a look at the model parameters directly.
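To make the fit-then-predict workflow concrete before we inspect our own model, here's a minimal sketch on synthetic data (not our player dataset), where the true relationship is y = 2x + 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2x + 1 exactly
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()
model.fit(X_toy, y_toy)

# The learned parameters recover the true slope and intercept
print(round(model.coef_[0], 6))    # 2.0
print(round(model.intercept_, 6))  # 1.0

# predict() applies the learned equation to new data
print(model.predict([[5.0]]))      # [11.]
```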
Print the Model Equation
While not strictly necessary, it can be helpful to circle back to the linear regression equation mentioned in an earlier section by printing the final equation of our trained model. For a linear regression model, the learned parameters are the coefficients and intercept, which can be used to assemble the model equation.
The model coefficients can be printed with the coef_
property:
linear_reg_model.coef_
array([ 1.37791128e-14, 1.19348975e-15, 1.66666667e+00, 1.33333333e+00,
3.33333333e-01, 1.00000000e+00, 1.20000000e+00, 1.50000000e+00,
-1.00000000e+00, 2.00000000e+00, 2.00000000e+00, -1.20129601e-15,
-7.23379689e-16])
The model intercept can be printed with the intercept_ property:
linear_reg_model.intercept_
-6.821210263296962e-13
We can print the feature names with the feature_names_in_ property:
linear_reg_model.feature_names_in_
array(['Height', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE',
'THREE_POINTS_MADE', 'TWO_POINTS_MADE', 'FREE_THROWS_MADE',
'TOTAL_REBOUNDS', 'ASSISTS', 'TURNOVERS', 'STEALS', 'BLOCKS',
'FOULS', 'POINTS'], dtype=object)
This isn't very readable though, so with a bit of effort we can map each coefficient to its corresponding feature name. We can start by creating a pandas Series of the coefficients, indexed by feature name:
coef_series = pd.Series(data=linear_reg_model.coef_, index=linear_reg_model.feature_names_in_)
coef_series
Height 1.377911e-14
MINUTES_PLAYED 1.193490e-15
FIELD_GOALS_MADE 1.666667e+00
THREE_POINTS_MADE 1.333333e+00
TWO_POINTS_MADE 3.333333e-01
FREE_THROWS_MADE 1.000000e+00
TOTAL_REBOUNDS 1.200000e+00
ASSISTS 1.500000e+00
TURNOVERS -1.000000e+00
STEALS 2.000000e+00
BLOCKS 2.000000e+00
FOULS -1.201296e-15
POINTS -7.233797e-16
dtype: float64
This makes the coefficients far more readable! We can print this data as a string in the format coefficient*feature_name (with newlines for formatting) to use in the model equation:
coef_string = "\n + ".join(f"{coef}*{feat}" for feat, coef in coef_series.items())
print(coef_string)
1.3779112764770156e-14*Height
+ 1.1934897514720433e-15*MINUTES_PLAYED
+ 1.6666666666666634*FIELD_GOALS_MADE
+ 1.333333333333333*THREE_POINTS_MADE
+ 0.33333333333333315*TWO_POINTS_MADE
+ 1.0000000000000009*FREE_THROWS_MADE
+ 1.1999999999999982*TOTAL_REBOUNDS
+ 1.499999999999999*ASSISTS
+ -0.9999999999999992*TURNOVERS
+ 1.999999999999999*STEALS
+ 2.000000000000001*BLOCKS
+ -1.201296007113939e-15*FOULS
+ -7.233796894823286e-16*POINTS
Let’s use this coefficient string to assemble the final model equation:
print(f'{target} = {coef_string} + {linear_reg_model.intercept_} + error')
FANTASY_POINTS = 1.3779112764770156e-14*Height
+ 1.1934897514720433e-15*MINUTES_PLAYED
+ 1.6666666666666634*FIELD_GOALS_MADE
+ 1.333333333333333*THREE_POINTS_MADE
+ 0.33333333333333315*TWO_POINTS_MADE
+ 1.0000000000000009*FREE_THROWS_MADE
+ 1.1999999999999982*TOTAL_REBOUNDS
+ 1.499999999999999*ASSISTS
+ -0.9999999999999992*TURNOVERS
+ 1.999999999999999*STEALS
+ 2.000000000000001*BLOCKS
+ -1.201296007113939e-15*FOULS
+ -7.233796894823286e-16*POINTS + -6.821210263296962e-13 + error
Analyze the Model Equation
Now that we have the final model equation, we can see that multiple variables have a coefficient that is very close to zero (on the order of 1e-15 or smaller). If we remove those values using abs(coef) > 0.0001, the equation simplifies to:
coef_series_simple = coef_series[abs(coef_series) > 0.0001]
coef_string_simple = "\n\t\t + ".join(f"{coef:.4f} * {feat}" for feat, coef in coef_series_simple.items())
print(f'{target} = {coef_string_simple} + {linear_reg_model.intercept_} + error')
FANTASY_POINTS = 1.6667 * FIELD_GOALS_MADE
+ 1.3333 * THREE_POINTS_MADE
+ 0.3333 * TWO_POINTS_MADE
+ 1.0000 * FREE_THROWS_MADE
+ 1.2000 * TOTAL_REBOUNDS
+ 1.5000 * ASSISTS
+ -1.0000 * TURNOVERS
+ 2.0000 * STEALS
+ 2.0000 * BLOCKS + 2.2737367544323206e-13 + error
Excellent! We can compare this to the original equation for Fantasy Points:
FANTASY_POINTS = 3 * THREE_POINTS_MADE +
2 * TWO_POINTS_MADE +
1 * FREE_THROWS_MADE +
1.2 * TOTAL_REBOUNDS +
1.5 * ASSISTS +
2 * BLOCKS +
2 * STEALS +
-1 * TURNOVERS
The model estimated some of the coefficients to be the same values as in the known equation. We included more features in the machine learning model than appear in the original equation, and most of those extras were estimated to have a coefficient of approximately zero. However, the coefficient was not zero for one of those extra features, FIELD_GOALS_MADE, and the coefficients for THREE_POINTS_MADE and TWO_POINTS_MADE do not match between the estimated and original equations. Here's a quick table to summarize the differences:
- | FIELD_GOALS_MADE | THREE_POINTS_MADE | TWO_POINTS_MADE |
---|---|---|---|
Estimated Coefficient | 1.6 | 1.3 | 0.3 |
Original Coefficient | Does Not Exist | 3 | 2 |
Note: the fractions were truncated to one decimal place for readability, but the actual values are \(1 \frac{2}{3}\) rather than 1.6, \(1 \frac{1}{3}\) rather than 1.3, \(\frac{1}{3}\) rather than 0.3, etc.
This already seems pretty close, but we can take it a step further by revisiting the definition of field goals. The FIELD_GOALS_MADE variable is actually the sum of TWO_POINTS_MADE and THREE_POINTS_MADE (but not FREE_THROWS_MADE). We can sanity check this in our dataset as well:
check = player_data['FIELD_GOALS_MADE'] == player_data['TWO_POINTS_MADE'] + player_data['THREE_POINTS_MADE']
print(f'True count: {check.sum()} rows')
print(f'False count: {(~check).sum()} rows')
True count: 900 rows
False count: 0 rows
So, we can write this relationship as the equation:
FIELD_GOALS_MADE = TWO_POINTS_MADE + THREE_POINTS_MADE
We can then substitute this equation into the model’s equation:
FANTASY_POINTS = 1.6 * (TWO_POINTS_MADE + THREE_POINTS_MADE) +
1.3 * THREE_POINTS_MADE +
0.3 * TWO_POINTS_MADE +
1 * FREE_THROWS_MADE +
1.2 * TOTAL_REBOUNDS +
1.5 * ASSISTS +
2 * BLOCKS +
2 * STEALS +
-1 * TURNOVERS
A quick distribution of the coefficient turns this into:
FANTASY_POINTS = 1.6 * THREE_POINTS_MADE +
1.3 * THREE_POINTS_MADE +
1.6 * TWO_POINTS_MADE +
0.3 * TWO_POINTS_MADE +
1 * FREE_THROWS_MADE +
1.2 * TOTAL_REBOUNDS +
1.5 * ASSISTS +
2 * BLOCKS +
2 * STEALS +
-1 * TURNOVERS
Using the actual fractional notation of these coefficients (instead of the rounded values shown above), this simplifies down into:
FANTASY_POINTS = 3 * THREE_POINTS_MADE +
2 * TWO_POINTS_MADE +
1 * FREE_THROWS_MADE +
1.2 * TOTAL_REBOUNDS +
1.5 * ASSISTS +
2 * BLOCKS +
2 * STEALS +
-1 * TURNOVERS
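We can verify this simplification directly with exact fractions, since the coefficients the model learned are 5/3, 4/3, and 1/3:

```python
from fractions import Fraction

fg = Fraction(5, 3)     # FIELD_GOALS_MADE coefficient (1 2/3)
three = Fraction(4, 3)  # THREE_POINTS_MADE coefficient (1 1/3)
two = Fraction(1, 3)    # TWO_POINTS_MADE coefficient (1/3)

# After substituting FIELD_GOALS_MADE = TWO_POINTS_MADE + THREE_POINTS_MADE,
# each shot type picks up the field-goal coefficient as well
print(fg + three)  # 3 — matches the original THREE_POINTS_MADE coefficient
print(fg + two)    # 2 — matches the original TWO_POINTS_MADE coefficient
```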
This means that the model estimated approximately the same equation as the original fantasy points calculation, with the addition of a few independent variables whose coefficients are close to zero and an intercept value close to zero. As a reminder, some models are sufficiently complex that it can be difficult to output and effectively analyze the estimated equation, but doing so provides a lot of value in this case.
Alternate Training
Since we have ended up with essentially the same equation as the original fantasy points calculation, we can logically expect our model to perform quite well in the next phase of model evaluation. So, we can also train an alternate model with a few of the features removed for comparison. First, let's create an alternate version of X with all three of the features with high correlation coefficients removed:
X_alt = player_data[features].drop(columns=['FIELD_GOALS_MADE', 'TWO_POINTS_MADE', 'POINTS'])
Note: this is more features than you would likely want to remove in a real-world scenario, but removing too many features will give us an opportunity to compare a less-than-perfect model to a perfect model in the model evaluation phase.
Our target variable y is unchanged, so we can create alternate training and test splits using this X_alt:
X_train_alt, X_test_alt, y_train_alt, y_test_alt = train_test_split(X_alt, y, random_state=314)
We can now train an alternate model using these new training splits:
ols_alt = LinearRegression()
ols_alt.fit(X_train_alt, y_train_alt)
We can print the model equation for this alternate model as well:
coef_series_alt = pd.Series(data=ols_alt.coef_, index=ols_alt.feature_names_in_)
coef_series_alt = coef_series_alt[abs(coef_series_alt) > 0.0001]
coef_string_alt = "\n\t\t + ".join(f"{coef:.4f} * {feat}" for feat, coef in coef_series_alt.items())
print(f'{target} = {coef_string_alt} + {ols_alt.intercept_} + error')
FANTASY_POINTS = 2.4532 * Height
+ 0.1039 * MINUTES_PLAYED
+ 2.2037 * THREE_POINTS_MADE
+ 2.3917 * FREE_THROWS_MADE
+ 1.5219 * TOTAL_REBOUNDS
+ 1.3231 * ASSISTS
+ -0.5706 * TURNOVERS
+ 2.2393 * STEALS
+ 2.4818 * BLOCKS
+ -0.2612 * FOULS + -203.27425560271263 + error
We can see that the model coefficients and the y-intercept are substantially different from the model we originally trained. We won’t know if this alternate model performs as well as the original one until we evaluate each model in the next article.
Export Data & Models
If you're going to use a new Jupyter notebook / Python script for the next part of this series, then it's a good idea to export the testing datasets:
X_test.to_csv('X_test_full.csv', index=False)
X_test_alt.to_csv('X_test_few.csv', index=False)
y_test.to_csv('y_actual.csv', index=False)
While it’s not strictly necessary to export small, simple models like these, it’s often helpful for checkpointing and collaboration. There are multiple ways to export machine learning models detailed in scikit-learn’s model persistence page, including the popular pickle library, but for today we’ll use joblib.
joblib.dump(linear_reg_model, 'model_full.sav')
joblib.dump(ols_alt, 'model_few.sav')
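As an optional sanity check, an exported model can be loaded back and should produce identical predictions. Here's a sketch with a toy model and a hypothetical filename (not the files exported above):

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model round-trip: fit on y = 2x, save, reload, predict
m = LinearRegression().fit(np.array([[0.0], [1.0]]), np.array([0.0, 2.0]))
joblib.dump(m, 'toy_model.sav')
m_loaded = joblib.load('toy_model.sav')

print(m_loaded.predict([[2.0]]))  # [4.]
```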
Wrap Up
In today’s guide, we covered how to train the selected machine learning model, including how to properly split our dataset into train and test subsets. In the next part, we’ll focus on how to evaluate the model’s performance.
Also, all of the code snippets in today’s guide are available in a Jupyter Notebook in the ncaa-basketball-stats repository on GitHub.