Today we’ll expand on the basics of machine learning and examine how to train a linear regression machine learning model. This is the seventh part of a series that walks through the entire process of a data science project - from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, creating visualizations, and machine learning.
Getting Started
First, let’s take a look at an overview of this data science project. If you’re already familiar with it, feel free to skip to the next section.
Project Overview
As a reminder, the dataset we’ll be using in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season. Here’s a brief description of each major step of this project:
- Data Acquisition - This initial step involves obtaining data from two sources: (1) exporting the NCAA’s online individual player statistics report and (2) making API requests to the Yahoo Sports endpoint.
- Data Cleaning - This step focuses on identifying and correcting any errors within the dataset. This includes removing duplicates, correcting inaccuracies, and handling missing data.
- Data Preprocessing - This step ensures the data is suitable for analysis by converting datatypes, standardizing units, and replacing abbreviations.
- Feature Engineering - This step involves selecting and expanding upon the dataset’s features (or columns). This includes calculating additional metrics from existing columns.
- Data Exploration - This step focuses on analyzing and visualizing the dataset to uncover patterns, relationships, and general trends and is a helpful preliminary step before deeper analysis.
- Creating Visualizations - This step involves identifying the relationships between various parameters (such as height and blocked shots) and generating meaningful visualizations (such as bar charts, scatterplots, and candlestick charts).
- Machine Learning - This step focuses on selecting, training, and evaluating a machine learning model. For this project, the model will identify the combination of individual player statistics that correlates with optimal performance.
We’ll use Python along with popular libraries like pandas, numpy, and scikit-learn to accomplish these tasks efficiently. By the end of this series, you’ll be equipped with the skills needed to gather raw data from online sources, structure it into a usable format, eliminate any inconsistencies and errors, identify relationships between variables, create meaningful visualizations, and train a basic machine learning model. Due to the size of this project, today we’ll cover part of the seventh step: training a machine learning model.
Dependencies
Since this is the seventh installment in the series, you likely already have your environment set up and can skip to the next section. If you're not set up yet and want to follow along on your own machine, it's recommended to read the first article of the series, or at least the Getting Started section of that post, before continuing.
You'll want to have the latest version of Python installed, along with the packages used throughout this series, including pandas, numpy, seaborn, matplotlib, and scikit-learn.
For today's machine learning segment specifically, we'll want to import a few of these libraries:
import pandas as pd
import seaborn as sns
import joblib
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
Import Data
In Part 3 of this series, we engineered new features for our dataset, which is stored in a dataframe named player_data. No changes have been made to the underlying dataset in the intermediary articles. If you want to follow along with the code examples in this article, it's recommended to import the player_data dataframe before proceeding.
player_data = pd.read_excel('player_data_engineered.xlsx')
player_data.head()
- | PLAYER_NAME | Team | Class | Height | Position | PLAYER_ID | TEAM_NAME | GAMES | MINUTES_PLAYED | FIELD_GOALS_MADE | ... | Conference | MINUTES_PER_GAME | FOULS_PER_GAME | POINTS_PER_GAME | ASSISTS_PER_GAME | STEALS_PER_GAME | BLOCKS_PER_GAME | REBOUNDS_PER_GAME | ASSIST_TO_TURNOVER | FANTASY_POINTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Kiara Jackson | UNLV (Mountain West) | Junior | 67 | Guard | ncaaw.p.67149 | UNLV | 29 | 895 | 128 | ... | Mountain West | 30.862069 | 1.620690 | 11.137931 | 4.655172 | 1.068966 | 0.172414 | 4.448276 | 3.214286 | 710.3 |
1 | Raven Johnson | South Carolina (SEC) | Sophomore | 68 | Guard | ncaaw.p.67515 | South Carolina | 30 | 823 | 98 | ... | SEC | 27.433333 | 1.133333 | 8.100000 | 4.933333 | 2.000000 | 0.166667 | 5.366667 | 2.792453 | 735.2 |
2 | Gina Marxen | Montana (Big Sky) | Senior | 68 | Guard | ncaaw.p.57909 | Montana | 29 | 778 | 88 | ... | Big Sky | 26.827586 | 0.896552 | 10.241379 | 3.827586 | 0.551724 | 0.068966 | 2.068966 | 2.921053 | 533.5 |
3 | McKenna Hofschild | Colorado St. (Mountain West) | Senior | 62 | Guard | ncaaw.p.60402 | Colorado St. | 29 | 1046 | 231 | ... | Mountain West | 36.068966 | 1.172414 | 22.551724 | 7.275862 | 1.241379 | 0.137931 | 3.965517 | 2.971831 | 1117.5 |
4 | Kaylah Ivey | Boston College (ACC) | Junior | 68 | Guard | ncaaw.p.64531 | Boston Coll. | 33 | 995 | 47 | ... | ACC | 30.151515 | 1.454545 | 4.333333 | 5.636364 | 1.090909 | 0.030303 | 1.727273 | 2.906250 | 500.4 |
5 rows × 40 columns
Basics of Machine Learning
Before we get into training a model, let’s briefly revisit a few basics of machine learning. If you are already familiar with these concepts, feel free to skip to the next section. Machine learning is a branch of artificial intelligence that focuses on creating algorithms and statistical models that allow computer systems to “learn” how to improve their performance on a specific task through experience. In the context of our basketball statistics project, machine learning can be particularly useful for predicting player performance, classifying player position, and identifying similar players.
Key concepts in machine learning that we’ll encounter include:
- Model - The system that learns patterns from data and can be used to make predictions on previously unseen data. Machine learning models are often of a specific type (Linear or Logistic Regression, Random Forests, Support Vector Machines, Neural Networks, etc.). Today’s model is a Linear Regression model.
- Training Data - The subset of our data used to train the model.
- Testing Data - A separate subset of data used to evaluate the model’s performance.
- Features - The input variables used to make predictions. These are sometimes referred to as the independent variable(s) or the predictor(s). For this project, these are various player statistics like three points made and assists.
- Target Variable - The variable we’re trying to predict or optimize. This is sometimes referred to as the dependent variable(s), as it depends on the independent variable(s). In today’s project, this is Fantasy Points.
- Parameters - The values that the model learns during training, such as coefficients in linear regression. These parameters define how the model transforms input features into predictions.
- Hyperparameters - The configuration settings for the model that are set before training begins. These are not learned from the data but are specified by the data scientist. Examples include learning rate, number of iterations, or regularization strength. Hyperparameters can significantly affect model performance and are often tuned to optimize the model.
- Note: The model we’ll be using today is straightforward and doesn’t typically have hyperparameters in the traditional sense. However, it’s still important to know the difference between parameters and hyperparameters since many models will have hyperparameters.
- Residuals - The differences between the observed values and the predicted values from the model. Residuals help assess how well the model fits the data and can reveal patterns or issues in the model’s predictions.
- Model Evaluation - Metrics used to assess how well our model is performing. For a Linear Regression model, this will include metrics like Mean Squared Error (MSE) and the R-squared value.
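To make the last two concepts concrete, here is a tiny, self-contained sketch of residuals, Mean Squared Error, and R-squared using toy numbers (not our player dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy observed and predicted values (illustrative only)
y_actual = np.array([3.0, 5.0, 7.0])
y_predicted = np.array([2.5, 5.0, 8.0])

# Residuals: observed minus predicted
residuals = y_actual - y_predicted
print(residuals)                                  # [ 0.5  0.  -1. ]

# MSE: average of the squared residuals
print(mean_squared_error(y_actual, y_predicted))  # ≈ 0.4167

# R-squared: proportion of variance in y explained by the model
print(r2_score(y_actual, y_predicted))            # 0.84375
```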
We'll primarily use the first six terms throughout this article, so it's best to familiarize yourself with them now. The remaining concepts will be explored in more detail in future articles (please let me know if that is something you are interested in!).
Note: Our focus in this article is on classic machine learning models designed for tabular data. We won’t be covering models built specifically for natural language processing, image recognition, or video analysis. However, it’s worth mentioning that many problems in these domains often get transformed into tabular data problems, so some of the principles we discuss here may still apply in those contexts. With all of that out of the way, let’s move on to training the machine learning model.
Model Training
Now that we’ve covered the basics of machine learning, we’re ready to move on to the exciting part: training our Ordinary Least Squares (OLS) linear regression model! This process involves several key steps that will help us build a robust and accurate predictive model for our basketball player statistics.
Define the Variables
As a reminder from the previous article, we defined the target and features variables as:
target = 'FANTASY_POINTS'
features = ['Height', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'THREE_POINTS_MADE',
'TWO_POINTS_MADE', 'FREE_THROWS_MADE', 'TOTAL_REBOUNDS', 'ASSISTS',
'TURNOVERS', 'STEALS', 'BLOCKS', 'FOULS', 'POINTS']
We'll label the features (independent variables) as X and the target (dependent) variable as y for conciseness.
X = player_data[features]
y = player_data[target]
Let's take a quick look at the values in X:
X
- | Height | MINUTES_PLAYED | FIELD_GOALS_MADE | THREE_POINTS_MADE | TWO_POINTS_MADE | FREE_THROWS_MADE | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 67 | 895 | 128 | 28 | 100 | 39 | 129 | 135 | 42 | 31 | 5 | 47 | 323 |
1 | 68 | 823 | 98 | 20 | 78 | 27 | 161 | 148 | 53 | 60 | 5 | 34 | 243 |
2 | 68 | 778 | 88 | 58 | 30 | 63 | 60 | 111 | 38 | 16 | 2 | 26 | 297 |
3 | 62 | 1046 | 231 | 55 | 176 | 137 | 115 | 211 | 71 | 36 | 4 | 34 | 654 |
4 | 68 | 995 | 47 | 32 | 15 | 17 | 57 | 186 | 64 | 36 | 1 | 48 | 143 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
895 | 66 | 742 | 92 | 53 | 39 | 45 | 113 | 73 | 66 | 45 | 2 | 54 | 282 |
896 | 73 | 815 | 108 | 58 | 50 | 26 | 140 | 34 | 46 | 19 | 19 | 51 | 300 |
897 | 71 | 774 | 102 | 56 | 46 | 67 | 176 | 29 | 48 | 29 | 3 | 68 | 327 |
898 | 71 | 848 | 127 | 54 | 73 | 76 | 123 | 71 | 90 | 35 | 9 | 94 | 384 |
899 | 67 | 872 | 133 | 55 | 78 | 44 | 127 | 29 | 41 | 37 | 2 | 52 | 365 |
900 rows × 13 columns
Let's check the values in y as well:
y
0 710.3
1 735.2
2 533.5
3 1117.5
4 500.4
...
895 555.1
896 549.0
897 597.7
898 636.1
899 597.9
Name: FANTASY_POINTS, Length: 900, dtype: float64
These look great and match with the values we saw in the previous article, so we can move on to the next step.
Create Training and Testing Splits
Now that we have our variables defined, we can create the training and testing splits. This involves dividing our dataset into two parts: the training set, which will be used to fit the model, and the test set, which will be reserved exclusively for testing and evaluating the model after training.
At this point, you might wonder: Why don’t we just use all of the data for training? There are several reasons for this:
- Model Evaluation - Having a test set allows us to evaluate how well our model performs on unseen data, giving us a more realistic estimate of its performance in real-world scenarios.
- Preventing Overfitting - By keeping a portion of our data separate for testing, we can detect if our model is overfitting to the training data. Overfitting occurs when a model learns the training data too well, including its noise and peculiarities, leading to poor generalization on new data.
- Validating Model Generalization - The test set helps us validate whether our model can generalize well to new, unseen data. This is crucial for ensuring that our model will perform reliably when deployed in practice.
- Hyperparameter Tuning - While we don’t have any hyperparameters to tune today, splitting the data is also essential for hyperparameter tuning techniques like cross-validation.
To create our training and test splits, we'll use the train_test_split function from scikit-learn. This function shuffles the data and divides it into training and testing subsets in a single step. Here's how we can implement it:
X_train, X_test, y_train, y_test = train_test_split(X, y)
The _train splits (X_train and y_train) include the data for training the model, and the _test splits (X_test and y_test) include the data for testing the model. By using these separate splits, we know that our model is trained on one subset of the data and evaluated on a completely separate subset, so we can essentially set aside the _test splits until we're ready to evaluate the performance of the model.
We can take a look at the first few rows of the training split for the features using X_train:
X_train.head(5)
- | Height | MINUTES_PLAYED | FIELD_GOALS_MADE | THREE_POINTS_MADE | TWO_POINTS_MADE | FREE_THROWS_MADE | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
200 | 68 | 669 | 53 | 10 | 43 | 75 | 73 | 111 | 89 | 25 | 3 | 50 | 191 |
300 | 73 | 577 | 103 | 0 | 103 | 46 | 170 | 29 | 56 | 30 | 37 | 81 | 252 |
336 | 74 | 791 | 108 | 6 | 102 | 38 | 213 | 19 | 56 | 33 | 40 | 97 | 260 |
727 | 70 | 591 | 84 | 0 | 84 | 39 | 210 | 12 | 60 | 20 | 21 | 93 | 207 |
403 | 74 | 303 | 49 | 2 | 47 | 16 | 125 | 8 | 23 | 10 | 26 | 52 | 116 |
Reproducibility
You might notice that if you run train_test_split() a second time, different rows of data end up in each split. Here's an example of re-running the exact same code:
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train.head(5)
- | Height | MINUTES_PLAYED | FIELD_GOALS_MADE | THREE_POINTS_MADE | TWO_POINTS_MADE | FREE_THROWS_MADE | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
397 | 74 | 596 | 80 | 4 | 76 | 63 | 192 | 28 | 48 | 16 | 32 | 69 | 227 |
501 | 73 | 829 | 199 | 4 | 195 | 65 | 301 | 53 | 73 | 42 | 7 | 96 | 467 |
107 | 69 | 1072 | 128 | 42 | 86 | 75 | 187 | 137 | 86 | 111 | 7 | 80 | 373 |
462 | 75 | 619 | 59 | 0 | 59 | 54 | 126 | 18 | 57 | 29 | 26 | 53 | 172 |
263 | 78 | 995 | 278 | 2 | 276 | 103 | 330 | 52 | 62 | 17 | 60 | 50 | 661 |
This happens because the data is shuffled before splitting, and the shuffling is not reproducible by default. As a result, the model may be trained and tested on different subsets each time you run the code. That randomness can sometimes be desirable, but reproducible results are better for the initial creation and evaluation of a model (especially if you want to follow along with this guide).
We can ensure reproducibility of the splits by controlling the shuffling with the random_state parameter:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=314)
X_train.head(5)
- | Height | MINUTES_PLAYED | FIELD_GOALS_MADE | THREE_POINTS_MADE | TWO_POINTS_MADE | FREE_THROWS_MADE | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
239 | 68 | 1037 | 258 | 50 | 208 | 92 | 152 | 96 | 87 | 32 | 3 | 59 | 658 |
638 | 70 | 961 | 109 | 44 | 65 | 66 | 136 | 87 | 75 | 63 | 11 | 71 | 328 |
848 | 69 | 908 | 142 | 72 | 70 | 58 | 78 | 61 | 71 | 20 | 5 | 71 | 414 |
260 | 76 | 913 | 112 | 28 | 84 | 59 | 200 | 40 | 64 | 23 | 73 | 94 | 311 |
745 | 66 | 713 | 146 | 36 | 110 | 57 | 124 | 105 | 96 | 72 | 2 | 93 | 385 |
Now, no matter how many times you run the snippet, you should get the same rows of data in the train and test splits every time.
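To confirm this behavior, here's a small self-contained sketch (using a toy DataFrame rather than player_data) showing that the same random_state produces identical splits across runs:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical, not the player dataset)
df = pd.DataFrame({'a': range(10), 'b': range(10, 20)})
s = pd.Series(range(10))

# Two separate calls with the same seed
X1, _, y1, _ = train_test_split(df, s, random_state=314)
X2, _, y2, _ = train_test_split(df, s, random_state=314)

# Same seed -> same rows in the same order
print(X1.equals(X2), y1.equals(y2))  # True True
```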
Dataset Proportions
Another parameter that is commonly specified is the test_size
(or, less commonly, the train_size
) to specify a proportion of the dataset to include in the test or train split, respectively. According to the documentation, the default test_size
(assuming train_size
is not set) is 0.25
, so one-quarter of the data will be included in the test dataset and three-quarters of the data will be included in the train dataset.
We can verify this on our own splits:
print(f'Test data split proportion: {len(X_test) / len(X)}')
print(f'Train data split proportion: {len(X_train) / len(X)}')
Test data split proportion: 0.25
Train data split proportion: 0.75
If you want to change those proportions, you can use either the test_size or the train_size parameter. For example, if you want the test split to be 20% of the data instead of 25%, you would run:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
print(f'Test data split proportion: {len(X_test) / len(X)}')
print(f'Train data split proportion: {len(X_train) / len(X)}')
Test data split proportion: 0.2
Train data split proportion: 0.8
We'll be using the default 25% test size, so we can omit the test_size parameter for today.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=314)
print(f'Test data split proportion: {len(X_test) / len(X)}')
print(f'Train data split proportion: {len(X_train) / len(X)}')
Test data split proportion: 0.25
Train data split proportion: 0.75
A Note on DataFrames versus NumPy Arrays
You might notice that we're passing DataFrames directly (player_data[features]) to train_test_split, but some code samples use NumPy arrays instead (player_data[features].to_numpy()). Both approaches work, but they have different implications depending on how you intend to use the data. Let's quickly review the pros and cons of each approach.
Using pandas DataFrames or Series
This entails passing X and y as DataFrames or Series directly to the train_test_split function.
print(f'dtype of X: {type(X)}')
print(f'dtype of y: {type(y)}')
dtype of X: <class 'pandas.core.frame.DataFrame'>
dtype of y: <class 'pandas.core.series.Series'>
Some advantages of this approach are:
- Retains column names, which can be useful for model interpretation and debugging.
- Works seamlessly with scikit-learn, which can handle DataFrames directly.
- Makes it easier to perform certain operations, like feature selection and transformation.
Some disadvantages of this approach are:
- Slightly more overhead than working with NumPy arrays, though this is generally negligible.
Using NumPy arrays
This entails passing X and y as NumPy arrays to the train_test_split function.
print(f'dtype of X: {type(X.to_numpy())}')
print(f'dtype of y: {type(y.to_numpy())}')
dtype of X: <class 'numpy.ndarray'>
dtype of y: <class 'numpy.ndarray'>
Some advantages of this approach are:
- Can be slightly faster for certain operations, since NumPy arrays are lower-level structures.
- Required by some libraries that only accept NumPy arrays (although most scikit-learn functions accept DataFrames as well).
Some disadvantages of this approach are:
- You lose the column names and index information, which can make it harder to trace errors or interpret results later.
- Not necessary for most scikit-learn functions, which work fine with DataFrames.
In summary, you can choose either method based on your preference, but it's generally more convenient to use DataFrames directly unless you have a specific reason to convert: you retain the column names and index metadata that are useful during data analysis and model interpretation. If you have a specific scenario where a NumPy array is required, use .to_numpy(). For today, we'll move on with the DataFrames approach.
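As a quick illustration of the metadata difference, here's a toy example (hypothetical columns, not the full feature list):

```python
import pandas as pd

# A small DataFrame with named columns
df = pd.DataFrame({'Height': [67, 68], 'ASSISTS': [135, 148]})
print(list(df.columns))    # ['Height', 'ASSISTS'] — names retained

# Converting to a NumPy array keeps the values but drops names and index
arr = df.to_numpy()
print(type(arr).__name__)  # ndarray
```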
Train the Model
Now that we have our data split into training and test sets, we’re ready to train our linear regression model. We’ll use scikit-learn’s LinearRegression class (which uses ordinary least squares) for this purpose.
We can start by initializing the model as an instance of the LinearRegression class with the default parameters:
linear_reg_model = LinearRegression()
We can then use the fit() method to train our model on our data. This method takes two arguments: X_train (the training split of our features) and y_train (the training split of our target variable).
linear_reg_model.fit(X_train, y_train)
During the training process, the model uses the training set to learn the optimal parameters for each feature that minimize the difference between the predicted and actual values of the target variable. Once training is complete, our linear_reg_model object will contain the learned parameters (coefficients and intercept) that can be used to make predictions on new data. In the future, we can use X_test to predict what the model thinks y is, and then compare that output to the actual y values stored in y_test to evaluate the model's performance. For now, let's take a look at the model parameters directly.
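To make the fit-then-predict workflow concrete before we inspect our own model, here's a minimal sketch on synthetic data (not our player dataset), where the true relationship is y = 2x + 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2x + 1 exactly
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()
model.fit(X_toy, y_toy)

# The learned parameters recover the true slope and intercept
print(round(model.coef_[0], 6))    # 2.0
print(round(model.intercept_, 6))  # 1.0

# predict() applies the learned equation to new data
print(model.predict([[5.0]]))      # [11.]
```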
Print the Model Equation
While not strictly necessary, it can be helpful to circle back to the linear regression equation mentioned in an earlier section by printing the final equation of our trained model. For a linear regression model, the learned parameters are the coefficients and intercept, which can be used to assemble the model equation.
The model coefficients can be printed with the coef_
property:
linear_reg_model.coef_
array([ 1.37791128e-14, 1.19348975e-15, 1.66666667e+00, 1.33333333e+00,
3.33333333e-01, 1.00000000e+00, 1.20000000e+00, 1.50000000e+00,
-1.00000000e+00, 2.00000000e+00, 2.00000000e+00, -1.20129601e-15,
-7.23379689e-16])
The model intercept can be printed with the intercept_ property:
linear_reg_model.intercept_
-6.821210263296962e-13
We can print the feature names with the feature_names_in_ property:
linear_reg_model.feature_names_in_
array(['Height', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE',
'THREE_POINTS_MADE', 'TWO_POINTS_MADE', 'FREE_THROWS_MADE',
'TOTAL_REBOUNDS', 'ASSISTS', 'TURNOVERS', 'STEALS', 'BLOCKS',
'FOULS', 'POINTS'], dtype=object)
This isn't very readable though, so with a bit of effort we can map each coefficient to its corresponding feature name. We can start by creating a pandas Series of the coefficients, indexed by feature name:
coef_series = pd.Series(data=linear_reg_model.coef_, index=linear_reg_model.feature_names_in_)
coef_series
Height 1.377911e-14
MINUTES_PLAYED 1.193490e-15
FIELD_GOALS_MADE 1.666667e+00
THREE_POINTS_MADE 1.333333e+00
TWO_POINTS_MADE 3.333333e-01
FREE_THROWS_MADE 1.000000e+00
TOTAL_REBOUNDS 1.200000e+00
ASSISTS 1.500000e+00
TURNOVERS -1.000000e+00
STEALS 2.000000e+00
BLOCKS 2.000000e+00
FOULS -1.201296e-15
POINTS -7.233797e-16
dtype: float64
This makes the coefficients far more readable! We can print this data as a string in the format coefficient*feature_name (with newlines for formatting) to use in the model equation:
coef_string = "\n + ".join(f"{coef}*{feat}" for feat, coef in coef_series.items())
print(coef_string)
1.3779112764770156e-14*Height
+ 1.1934897514720433e-15*MINUTES_PLAYED
+ 1.6666666666666634*FIELD_GOALS_MADE
+ 1.333333333333333*THREE_POINTS_MADE
+ 0.33333333333333315*TWO_POINTS_MADE
+ 1.0000000000000009*FREE_THROWS_MADE
+ 1.1999999999999982*TOTAL_REBOUNDS
+ 1.499999999999999*ASSISTS
+ -0.9999999999999992*TURNOVERS
+ 1.999999999999999*STEALS
+ 2.000000000000001*BLOCKS
+ -1.201296007113939e-15*FOULS
+ -7.233796894823286e-16*POINTS
Let’s use this coefficient string to assemble the final model equation:
print(f'{target} = {coef_string} + {linear_reg_model.intercept_} + error')
FANTASY_POINTS = 1.3779112764770156e-14*Height
+ 1.1934897514720433e-15*MINUTES_PLAYED
+ 1.6666666666666634*FIELD_GOALS_MADE
+ 1.333333333333333*THREE_POINTS_MADE
+ 0.33333333333333315*TWO_POINTS_MADE
+ 1.0000000000000009*FREE_THROWS_MADE
+ 1.1999999999999982*TOTAL_REBOUNDS
+ 1.499999999999999*ASSISTS
+ -0.9999999999999992*TURNOVERS
+ 1.999999999999999*STEALS
+ 2.000000000000001*BLOCKS
+ -1.201296007113939e-15*FOULS
+ -7.233796894823286e-16*POINTS + -6.821210263296962e-13 + error
Analyze the Model Equation
Now that we have the final model equation, we can see that multiple variables have a coefficient that is very close to zero (on the order of 1e-15 or smaller). If we remove those values using abs(coef) > 0.0001, the equation simplifies to:
coef_series_simple = coef_series[abs(coef_series) > 0.0001]
coef_string_simple = "\n\t\t + ".join(f"{coef:.4f} * {feat}" for feat, coef in coef_series_simple.items())
print(f'{target} = {coef_string_simple} + {linear_reg_model.intercept_} + error')
FANTASY_POINTS = 1.6667 * FIELD_GOALS_MADE
+ 1.3333 * THREE_POINTS_MADE
+ 0.3333 * TWO_POINTS_MADE
+ 1.0000 * FREE_THROWS_MADE
+ 1.2000 * TOTAL_REBOUNDS
+ 1.5000 * ASSISTS
+ -1.0000 * TURNOVERS
+ 2.0000 * STEALS
+ 2.0000 * BLOCKS + 2.2737367544323206e-13 + error
Excellent! We can compare this to the original equation for Fantasy Points:
FANTASY_POINTS = 3 * THREE_POINTS_MADE +
2 * TWO_POINTS_MADE +
1 * FREE_THROWS_MADE +
1.2 * TOTAL_REBOUNDS +
1.5 * ASSISTS +
2 * BLOCKS +
2 * STEALS +
-1 * TURNOVERS
The model estimated some of the coefficients to be the same values as in the known equation. We included more features in the machine learning model than appear in the original equation, and most of those extras were estimated to have a coefficient of approximately zero. However, the coefficient was not zero for one of those extra features, FIELD_GOALS_MADE, and the coefficients for THREE_POINTS_MADE and TWO_POINTS_MADE do not match between the estimated and original equations. Here's a quick table to summarize the differences:
- | FIELD_GOALS_MADE | THREE_POINTS_MADE | TWO_POINTS_MADE |
---|---|---|---|
Estimated Coefficient | 1.6 | 1.3 | 0.3 |
Original Coefficient | Does Not Exist | 3 | 2 |
Note: the fractions were truncated to one decimal place for readability, but the actual values are \(1 \frac{2}{3}\) rather than 1.6, \(1 \frac{1}{3}\) rather than 1.3, \(\frac{1}{3}\) rather than 0.3, etc.
This already seems pretty close, but we can take it a step further by revisiting the definition of field goals. The FIELD_GOALS_MADE variable is actually the sum of TWO_POINTS_MADE and THREE_POINTS_MADE (but not FREE_THROWS_MADE). We can sanity check this in our dataset as well:
check = player_data['FIELD_GOALS_MADE'] == player_data['TWO_POINTS_MADE'] + player_data['THREE_POINTS_MADE']
print(f'True count: {check.sum()} rows')
print(f'False count: {(~check).sum()} rows')
True count: 900 rows
False count: 0 rows
So, we can write this relationship as the equation:
FIELD_GOALS_MADE = TWO_POINTS_MADE + THREE_POINTS_MADE
We can then substitute this equation into the model’s equation:
FANTASY_POINTS = 1.6 * (TWO_POINTS_MADE + THREE_POINTS_MADE) +
1.3 * THREE_POINTS_MADE +
0.3 * TWO_POINTS_MADE +
1 * FREE_THROWS_MADE +
1.2 * TOTAL_REBOUNDS +
1.5 * ASSISTS +
2 * BLOCKS +
2 * STEALS +
-1 * TURNOVERS
A quick distribution of the coefficient turns this into:
FANTASY_POINTS = 1.6 * THREE_POINTS_MADE +
1.3 * THREE_POINTS_MADE +
1.6 * TWO_POINTS_MADE +
0.3 * TWO_POINTS_MADE +
1 * FREE_THROWS_MADE +
1.2 * TOTAL_REBOUNDS +
1.5 * ASSISTS +
2 * BLOCKS +
2 * STEALS +
-1 * TURNOVERS
Using the actual fractional notation of these coefficients (instead of the rounded values shown above), this simplifies down into:
FANTASY_POINTS = 3 * THREE_POINTS_MADE +
2 * TWO_POINTS_MADE +
1 * FREE_THROWS_MADE +
1.2 * TOTAL_REBOUNDS +
1.5 * ASSISTS +
2 * BLOCKS +
2 * STEALS +
-1 * TURNOVERS
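We can verify this simplification directly with exact fractions, since the coefficients the model learned are 5/3, 4/3, and 1/3:

```python
from fractions import Fraction

fg = Fraction(5, 3)     # FIELD_GOALS_MADE coefficient (1 2/3)
three = Fraction(4, 3)  # THREE_POINTS_MADE coefficient (1 1/3)
two = Fraction(1, 3)    # TWO_POINTS_MADE coefficient (1/3)

# After substituting FIELD_GOALS_MADE = TWO_POINTS_MADE + THREE_POINTS_MADE,
# each shot type picks up the field-goal coefficient as well
print(fg + three)  # 3 — matches the original THREE_POINTS_MADE coefficient
print(fg + two)    # 2 — matches the original TWO_POINTS_MADE coefficient
```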
This means that the model estimated approximately the same equation as the original fantasy points calculation, with the addition of a few independent variables whose coefficients are close to zero and an intercept value close to zero. As a reminder, some models are sufficiently complex that it can be difficult to output and effectively analyze the estimated equation, but doing so provides a lot of value in this case.
Alternate Training
Since we have ended up with essentially the same equation as the original fantasy points calculation, we can logically expect our model to perform quite well in the next phase of model evaluation. So, we can also train an alternate model with a few of the features removed for comparison. First, let's create an alternate version of X with all three of the features with high correlation coefficients removed:
X_alt = player_data[features].drop(columns=['FIELD_GOALS_MADE', 'TWO_POINTS_MADE', 'POINTS'])
Note: this is more features than you would likely want to remove in a real-world scenario, but removing too many features will give us an opportunity to compare a less-than-perfect model to a perfect model in the model evaluation phase.
Our target variable y is unchanged, so we can create alternate training and test splits using this X_alt:
X_train_alt, X_test_alt, y_train_alt, y_test_alt = train_test_split(X_alt, y, random_state=314)
We can now train an alternate model using these new training splits:
ols_alt = LinearRegression()
ols_alt.fit(X_train_alt, y_train_alt)
We can print the model equation for this alternate model as well:
coef_series_alt = pd.Series(data=ols_alt.coef_, index=ols_alt.feature_names_in_)
coef_series_alt = coef_series_alt[abs(coef_series_alt) > 0.0001]
coef_string_alt = "\n\t\t + ".join(f"{coef:.4f} * {feat}" for feat, coef in coef_series_alt.items())
print(f'{target} = {coef_string_alt} + {ols_alt.intercept_} + error')
FANTASY_POINTS = 2.4532 * Height
+ 0.1039 * MINUTES_PLAYED
+ 2.2037 * THREE_POINTS_MADE
+ 2.3917 * FREE_THROWS_MADE
+ 1.5219 * TOTAL_REBOUNDS
+ 1.3231 * ASSISTS
+ -0.5706 * TURNOVERS
+ 2.2393 * STEALS
+ 2.4818 * BLOCKS
+ -0.2612 * FOULS + -203.27425560271263 + error
We can see that the model coefficients and the y-intercept are substantially different from the model we originally trained. We won’t know if this alternate model performs as well as the original one until we evaluate each model in the next article.
Export Data & Models
If you're going to use a new Jupyter notebook / Python script for the next part of this series, then it's a good idea to export the testing datasets:
X_test.to_csv('X_test_full.csv', index=False)
X_test_alt.to_csv('X_test_few.csv', index=False)
y_test.to_csv('y_actual.csv', index=False)
While it’s not strictly necessary to export small, simple models like these, it’s often helpful for checkpointing and collaboration. There are multiple ways to export machine learning models detailed in scikit-learn’s model persistence page, including the popular pickle library, but for today we’ll use joblib.
joblib.dump(linear_reg_model, 'model_full.sav')
joblib.dump(ols_alt, 'model_few.sav')
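As an optional sanity check, an exported model can be loaded back and should produce identical predictions. Here's a sketch with a toy model and a hypothetical filename (not the files exported above):

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model round-trip: fit on y = 2x, save, reload, predict
m = LinearRegression().fit(np.array([[0.0], [1.0]]), np.array([0.0, 2.0]))
joblib.dump(m, 'toy_model.sav')
m_loaded = joblib.load('toy_model.sav')

print(m_loaded.predict([[2.0]]))  # [4.]
```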
Wrap Up
In today’s guide, we covered how to train the selected machine learning model, including how to properly split our dataset into train and test subsets. In the next part, we’ll focus on how to evaluate the model’s performance.
Also, all of the code snippets in today’s guide are available in a Jupyter Notebook in the ncaa-basketball-stats repository on GitHub.