Today we’ll cover the basics of machine learning and examine how to train a linear regression machine learning model. This is the fifth part of a series that walks through the entire process of a data science project - from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, creating visualizations, and machine learning.

As a reminder, the dataset we’ll be using in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season. Here’s a brief description of each major step of this project:

the steps for this data science project

  1. Data Acquisition - This initial step involves obtaining data from two sources: (1) exporting the NCAA’s online individual player statistics report and (2) making API requests to the Yahoo Sports endpoint.
  2. Data Cleaning - This step focuses on identifying and correcting any errors within the dataset. This includes removing duplicates, correcting inaccuracies, and handling missing data.
  3. Data Preprocessing - This step ensures the data is suitable for analysis by converting datatypes, standardizing units, and replacing abbreviations.
  4. Feature Engineering - This step involves selecting and expanding upon the dataset’s features (or columns). This includes calculating additional metrics from existing columns.
  5. Creating Visualizations - This step involves identifying the relationships between various parameters (such as height and blocked shots) and generating meaningful visualizations (such as bar charts, scatterplots, and candlestick charts).
  6. Machine Learning - This step focuses on training a machine learning model to identify the combination of individual player statistics that correlates with optimal performance.

We’ll use Python along with the popular scikit-learn and statsmodels libraries to train and evaluate the model. By the end of this series, you’ll be equipped with the skills needed to gather raw data from online sources, structure it into a usable format, eliminate any inconsistencies and errors, create meaningful visualizations, and train a basic machine learning model. Since we already gathered the raw data from online sources in Part 1, cleaned that data in Part 2, engineered new features in Part 3, and explored visualizations in Part 4, we’re ready to move on to training a machine learning model.

Getting Started

Since this is the fifth installment in the series, you likely already have your environment set up and can skip to the next section. If you’re not already set up and you want to follow along on your own machine, it’s recommended to read the first article of the series, or at least review the Getting Started section of that post, before continuing.

Import Packages

You’ll want to have the latest version of Python installed, along with the following packages: pandas, NumPy, matplotlib, seaborn, scikit-learn, and statsmodels.

For today’s machine learning segment specifically, we’ll want to import a few of these libraries:

import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

Import Data

In Part 3 of this series, we engineered new features for our dataset, which is stored in a dataframe named player_data. No changes were made to the underlying data set in Part 4 of this series, since that part focused on creating visualizations from the data set. If you want to follow along with the code examples in this article, it’s recommended to import the player_data dataframe before proceeding.

player_data = pd.read_excel('player_data_engineered.xlsx')
player_data.head()
PLAYER_NAME Team Class Height Position PLAYER_ID TEAM_NAME GAMES MINUTES_PLAYED FIELD_GOALS_MADE ... Conference MINUTES_PER_GAME FOULS_PER_GAME POINTS_PER_GAME ASSISTS_PER_GAME STEALS_PER_GAME BLOCKS_PER_GAME REBOUNDS_PER_GAME ASSIST_TO_TURNOVER FANTASY_POINTS
0 Kiara Jackson UNLV (Mountain West) Junior 67 Guard ncaaw.p.67149 UNLV 29 895 128 ... Mountain West 30.862069 1.620690 11.137931 4.655172 1.068966 0.172414 4.448276 3.214286 710.3
1 Raven Johnson South Carolina (SEC) Sophomore 68 Guard ncaaw.p.67515 South Carolina 30 823 98 ... SEC 27.433333 1.133333 8.100000 4.933333 2.000000 0.166667 5.366667 2.792453 735.2
2 Gina Marxen Montana (Big Sky) Senior 68 Guard ncaaw.p.57909 Montana 29 778 88 ... Big Sky 26.827586 0.896552 10.241379 3.827586 0.551724 0.068966 2.068966 2.921053 533.5
3 McKenna Hofschild Colorado St. (Mountain West) Senior 62 Guard ncaaw.p.60402 Colorado St. 29 1046 231 ... Mountain West 36.068966 1.172414 22.551724 7.275862 1.241379 0.137931 3.965517 2.971831 1117.5
4 Kaylah Ivey Boston College (ACC) Junior 68 Guard ncaaw.p.64531 Boston Coll. 33 995 47 ... ACC 30.151515 1.454545 4.333333 5.636364 1.090909 0.030303 1.727273 2.906250 500.4

5 rows × 40 columns

Basics of Machine Learning

Before we get into training a model, let’s briefly cover a few basics of machine learning. Machine learning is a branch of artificial intelligence that focuses on creating algorithms and statistical models that allow computer systems to “learn” how to improve their performance on a specific task through experience. In the context of our basketball statistics project, machine learning can be particularly useful for predicting player performance, classifying player position, and identifying similar players.

Key concepts in machine learning that we’ll encounter include:

  1. Model: The system that learns patterns from data and can be used to make predictions on previously unseen data. Machine learning models are often of a specific type (Linear or Logistic Regression, Random Forests, Support Vector Machines, Neural Networks, etc.).
  2. Training Data: The subset of our data used to train the model.
  3. Testing Data: A separate subset of data used to evaluate the model’s performance.
  4. Features: The input variables used to make predictions. This is sometimes referred to as the independent variable(s). In our case, these could be various player statistics like three points made or assists.
  5. Target Variable: The variable we’re trying to predict or optimize, such as points scored or fantasy points. This is sometimes referred to as the dependent variable(s), as it depends on the independent variable(s).
  6. Parameters: The values that the model learns during training, such as coefficients in linear regression. These parameters define how the model transforms input features into predictions.
  7. Hyperparameters: The configuration settings for the model that are set before training begins. These are not learned from the data but are specified by the data scientist. Examples include learning rate, number of iterations, or regularization strength. Hyperparameters can significantly affect model performance and are often tuned to optimize the model.
    • Note: The model we’ll be using today is straightforward and doesn’t typically have hyperparameters in the traditional sense. However, it’s still important to know the difference between parameters and hyperparameters since many models will have hyperparameters.
  8. Residuals: The differences between the observed values and the predicted values from the model. Residuals help assess how well the model fits the data and can reveal patterns or issues in the model’s predictions.
  9. Model Evaluation: Metrics used to assess how well our model is performing. For a Linear Regression model, this will include metrics like Mean Squared Error (MSE) and the R-squared value.

We’ll primarily use the first six terms throughout this article, so it’s best to familiarize yourself with them now. The other concepts will be explored in more detail in future articles (please let me know if that’s something you’re interested in!).
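To make the distinction between parameters and hyperparameters a little more concrete, here’s a minimal illustration using scikit-learn’s Ridge regression (not the model we’ll train today; the alpha value shown is just an arbitrary example):

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)  # alpha is a hyperparameter: we choose it before training
# After calling ridge.fit(...), the learned parameters would live in ridge.coef_ and ridge.intercept_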

It’s important to note that our focus in this article is on classic machine learning models designed for tabular data. We won’t be covering models built specifically for natural language processing, image recognition, or video analysis. However, it’s worth mentioning that many problems in these domains often get transformed into tabular data problems, so some of the principles we discuss here may still apply in those contexts. With all of that out of the way, let’s move on to defining the problem and selecting an appropriate machine learning model.

Select a Model

Before we choose a model, it’s a good idea to clearly define our objective to help us ensure we’re using an appropriate model for our task. This step sets the foundation for our entire machine learning process and helps guide our decision-making throughout the project.

Define the Objective

The goal of a machine learning project in commercial settings will often be determined by a desired business outcome. However, for a hobby project like this, we have the freedom to pick the objective. So, for today’s machine learning model, we’ll focus on training the model to predict a target variable based on one or more input features (such as field goals, blocks, assists, etc.). Let’s choose the target variable and set of features as well.

Define the Target Variable

The target variable has a massive impact on the machine learning model, including what type (regression, classification, clustering, etc.) of machine learning model is appropriate. For today, let’s choose one of the numerical columns from Part 4:

numerical_columns = ['Height', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 
                     'THREE_POINTS_MADE', 'TWO_POINTS_MADE', 'FREE_THROWS_MADE',
                     'TOTAL_REBOUNDS', 'ASSISTS', 'TURNOVERS', 
                     'STEALS', 'BLOCKS', 'FOULS', 'POINTS', 'FANTASY_POINTS']

We could choose the POINTS variable, but that would end up focusing the model primarily on offensive players. Choosing the BLOCKS variable would instead prioritize defensive players. To include all types of players, let’s use FANTASY_POINTS as the target variable.

target = 'FANTASY_POINTS'

Define the Features

Next up, let’s define the features that the model should use. Typically, this will involve performing feature selection and feature engineering on the dataset, but we’ve already essentially done that in Part 3 and Part 4. So, we’ll use the list of numerical columns with the target variable (FANTASY_POINTS) removed.

features = ['Height', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'THREE_POINTS_MADE',
            'TWO_POINTS_MADE', 'FREE_THROWS_MADE', 'TOTAL_REBOUNDS', 'ASSISTS',
            'TURNOVERS', 'STEALS', 'BLOCKS', 'FOULS', 'POINTS']
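Equivalently, since the features are just the numerical columns minus the target, this list could be built programmatically:

# Same result as the hard-coded list above: every numerical column except the target
features = [col for col in numerical_columns if col != target]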

Final Objective

In summary, today’s objective is to train a machine learning model to predict FANTASY_POINTS (the target variable) based on Height, MINUTES_PLAYED, FIELD_GOALS_MADE, THREE_POINTS_MADE, TWO_POINTS_MADE, FREE_THROWS_MADE, TOTAL_REBOUNDS, ASSISTS, TURNOVERS, STEALS, BLOCKS, FOULS, and POINTS (the features).

A Solved Problem

If you’ve been following this series from the beginning, you might remember that we actually calculated FANTASY_POINTS from some of these variables at the end of Part 3. The equation we used was:

FANTASY_POINTS =  3   * THREE_POINTS_MADE + 
                  2   * TWO_POINTS_MADE + 
                  1   * FREE_THROWS_MADE + 
                  1.2 * TOTAL_REBOUNDS + 
                  1.5 * ASSISTS + 
                  2   * BLOCKS + 
                  2   * STEALS + 
                 -1   * TURNOVERS
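Since the relationship is known, we can quickly sanity-check it against the dataframe. Here’s a minimal sketch that recomputes the formula from the underlying columns and compares it to the stored FANTASY_POINTS values:

# Recompute fantasy points from the box-score columns and compare to the stored values
recalculated = (3   * player_data['THREE_POINTS_MADE']
              + 2   * player_data['TWO_POINTS_MADE']
              + 1   * player_data['FREE_THROWS_MADE']
              + 1.2 * player_data['TOTAL_REBOUNDS']
              + 1.5 * player_data['ASSISTS']
              + 2   * player_data['BLOCKS']
              + 2   * player_data['STEALS']
              - 1   * player_data['TURNOVERS'])
print((recalculated - player_data['FANTASY_POINTS']).abs().max())  # should be approximately zero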

So this is essentially a solved problem and training a machine learning model is technically unnecessary. However, there are some benefits of using a known relationship as a starting point for learning about machine learning and training your first model:

  1. Verification of results - Since we know the exact formula used to calculate the target variable, we can easily verify if our model is working correctly. This provides a clear benchmark for our model’s performance.
  2. Understanding feature importance - By comparing the coefficients your model learns to the actual coefficients used in the fantasy points calculation, we can gain insights into how well the model identifies important features.
  3. Concept reinforcement - Working with a known relationship helps reinforce key machine learning concepts like feature selection, model training, and evaluation in a controlled environment.
  4. Confidence building - Successfully predicting a known relationship can also boost confidence in applying machine learning techniques to more complex, unknown relationships in the future.

While using a known relationship is a great learning tool, it’s important to remember that in real-world scenarios, the relationships between variables are often unknown or more complex. This exercise serves as a stepping stone to tackling more challenging predictive tasks. To simulate a bit of this complexity and to make the future article on model evaluation more valuable, we’ll also train an alternate version of the model after removing some of the features.

Identify Appropriate Models

The first part of training any machine learning model is selecting the model to use. This might sound obvious, but selecting the “best” model for your problem depends on a variety of factors. We’ll likely explore this step in more detail in a future article (send me a note if you would be interested in that), but for today we’ll go through scikit-learn’s model flowchart to find an appropriate model.

scikit-learn's algorithm flowchart

Let’s start with the first question in the flowchart: “Are there more than 50 samples?” We can answer this by checking the number of rows in our dataframe.

len(player_data)
900

This confirms that there are more than 50 rows in our dataframe, so we can follow the “yes” path to the next question: “Are we predicting a category?” We’ll be predicting FANTASY_POINTS, whose datatype we can check using .dtypes:

player_data['FANTASY_POINTS'].dtypes
dtype('float64')

So in this case our target variable contains float64 values. Note that in many cases, the data type might be listed as object, such as if numeric data is stored as strings or if there are multiple datatypes in the column. In those cases, it can also be a good idea to look at a sample of the target variable data:

player_data['FANTASY_POINTS'].sample(10)
611    776.6
554    901.9
201    624.2
386    875.9
791    611.0
628    563.9
19     474.0
853    668.6
175    410.0
741    534.0
Name: FANTASY_POINTS, dtype: float64

Since our target variable contains entirely numeric data, we can answer “no” to “predicting a category?”. Note that in some cases you might treat numeric data as categorical data or separate the numeric values into bins, but for today we are not predicting a category.
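As an aside, if we did want to treat the numeric target as categories, binning is one common way to do it. Here’s a quick sketch using pandas (the three tiers and their labels are arbitrary):

# Hypothetical: convert the continuous target into three categorical tiers
fantasy_tiers = pd.cut(player_data['FANTASY_POINTS'], bins=3, labels=['low', 'mid', 'high'])
fantasy_tiers.value_counts()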

Next up: “Are we predicting a quantity?” In this case, the answer is “yes”, since FANTASY_POINTS is indeed a quantity. Answering “yes” guides us to the regression category of models. Regression analysis is a statistical method used to model the relationship between a continuous numeric dependent variable and one or more independent variables, with the aim of predicting the value of the dependent variable from the independent variables. Regression models are a staple in the industry and are often a good starting point due to their simplicity and the explainability of their results.

The next question is: “Are there fewer than 100,000 samples?” Looking back to the output of len(player_data) above, we know that we do have fewer than 100k samples, so we can go down the “yes” path.

At this point, we have ended up at the question: “Should only a few features be important?” Answering “yes” will take us to Lasso and answering “no” will take us to RidgeRegression. However, both Lasso and Ridge regression in this context are meant to be improvements upon Ordinary Least Squares (OLS) linear regression. To clarify, linear regression is a specific type of regression analysis where the relationship between the dependent variable and the independent variables is assumed to be linear. It finds the best-fitting straight line (called the regression line) through the data points to make predictions. So, instead of using either lasso or ridge regression today, we’ll explore the classic OLS linear regression model.

Model Characteristics

Ordinary Least Squares (OLS) is the most basic form of linear regression. It aims to minimize the sum of the squared differences between observed and predicted values (the residuals). As a type of linear regression, it assumes a linear relationship between the independent and dependent variables. It can be sensitive to outliers, which can skew the results significantly. It also does not offer regularization, which means it can overfit when there are many predictors or when predictors are highly correlated. However, it is simple and explainable, so it offers a good starting point.

As mentioned earlier, linear regression assumes that the relationship between the independent variables (features) and the dependent variable (target) can be described by a straight line. This line (also known as the regression line) is represented by an equation of the form:

\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n\]

Where:

  • \(y\) is the predicted value of the target variable
  • \(x_1, x_2, \ldots, x_n\) are the input features
  • \(\beta_0\) is the y-intercept (the value of \(y\) when all \(x\) are zero)
  • \(\beta_1, \beta_2, \ldots, \beta_n\) are the coefficients that represent the change in \(y\) for a one-unit change in the corresponding \(x\), holding all other \(x\) constant

We can also rewrite this equation for today’s project as:

\[FantasyPoints = \beta_0 + \beta_1Height + \beta_2MinutesPlayed + ... + \beta_nPoints\]

The goal of linear regression is to find the values of \(\beta_0\) and \(\beta_1, \beta_2, \ldots, \beta_n\) that minimize the difference between the predicted \(FantasyPoints\) values and the actual \(FantasyPoints\) values in our dataset.

Note: This is a slightly simplified explanation of the underlying mathematics. I highly recommend reading the Wikipedia page, or other suitable sources, for a more nuanced understanding of the process.
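For a bit of extra intuition on what “minimizing the difference” means, the OLS coefficients can be computed directly as the solution to a least-squares problem. Here’s a small standalone sketch on synthetic data (not part of our project pipeline) using NumPy:

import numpy as np

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 2))                 # two synthetic features
y_demo = 4 + 2 * X_demo[:, 0] - 3 * X_demo[:, 1]   # target built from a known linear relationship
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])  # prepend a column of ones for the intercept
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)    # least-squares estimate of [intercept, coef_1, coef_2]
print(beta)  # should recover approximately [4, 2, -3]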

Now that we conceptually understand a bit more about how this model works, we can take a quick look at the common assumptions of the model and whether or not those assumptions are satisfied.

Verify Assumptions of the Model

Next up, we should verify the underlying assumptions of the machine learning model are satisfied by our particular problem and situation. This step might be tempting to skip, but it can save hours of time in the future and can help ensure your model is generalized. The basic assumptions of linear regression models generally are:

  • Linearity - The relationship between the target variable and the features is linear.
  • No Multicollinearity - The features are not too highly correlated with each other.
  • Weak Exogeneity - The features are treated as fixed values, not random variables, and are free from measurement errors.
  • Independence of errors - Residuals are independent of and unrelated to one another.
  • Zero Mean of Residuals - The mean of the residuals is zero or close to zero.
  • Constant Variance (Homoskedasticity) - Residuals have constant variance across all levels of the independent variables.

Note: These are simplified summaries of each assumption. We’ll go through each one in a bit more detail later in this article and in future articles, but once again I highly recommend reading a supplemental source for a deeper understanding. Some suitable sources include Wikipedia, Stanford, and Statistics by Jim.

You might notice that the first three assumptions pertain to the features and the last three pertain to the residuals. Residuals are the differences between the predicted and actual values of the target variable, which means we must be able to generate predictions (and therefore have trained the model) before evaluating those assumptions. So we’ll go through the first three assumptions (linearity, no multicollinearity, and weak exogeneity) today and leave the other three for a future article.

Linearity

The first assumption we’ll look at is whether the relationship between the independent and dependent variables is linear. Looking at the basic OLS linear regression equation from the earlier section, we can see that the coefficients (parameters), random error, and variables are linear. Linear regression models can model curvature to some extent (in the variables, not the parameters), but for today we’ll focus on a strictly linear relationship.

There are a few common methods to detect linearity:

  1. Scatterplot - We can use a simple scatterplot to evaluate the relationship between the target variable and the features. For projects with only one feature, we would only need one scatterplot. Since today’s project has multiple features, we’ll need one scatterplot for each feature.
  2. LOWESS Curve - This is the same idea as the scatterplot, but adds in a LOWESS curve to more robustly evaluate the linearity assumption.
  3. Residuals Plot - If the residuals plot shows curves or trends, it could mean that the linearity assumption is not satisfied.

Since we’re looking for a strictly linear relationship today, we can start with the same pairplot() function that we used in Part 4. Because we’re only interested in the relationship between the target variable (FANTASY_POINTS) and each feature, we can make a slight modification to show only the target variable plots.

pairplot = sns.pairplot(data=player_data, x_vars=features, y_vars=target)

scatterplots of FANTASY_POINTS versus each feature

A LOWESS (Locally Weighted Scatterplot Smoothing) curve is a type of local regression that fits a smooth curve through a scatter plot to visualize relationships between variables. It’s particularly useful for identifying non-linear patterns in data by fitting simple models to localized subsets. LOWESS curves are flexible and can capture complex relationships, making them a helpful visual aid for determining the linearity of each pairing.

We can use Seaborn’s regplot() function with the lowess=True parameter to add a LOWESS curve to each scatterplot:

def loess_reg(x, y, **kwargs):
    sns.regplot(x=x, y=y, lowess=True, line_kws={'color': 'red'}, **kwargs)


pairplot = sns.pairplot(data=player_data, x_vars=features, y_vars=target)
pairplot.map(loess_reg)
<seaborn.axisgrid.PairGrid at 0x74cd86773280>

scatterplots of FANTASY_POINTS versus each feature with LOWESS curves

No Multicollinearity

The next assumption that we’ll examine is the lack of multicollinearity. Multicollinearity occurs when the features in a regression model are highly correlated with each other. This can make it challenging to determine individual variable effects, lead to unstable coefficient estimates, and increase standard errors. As a result, some variables may appear statistically insignificant when they should be significant.

There are a few common ways to detect multicollinearity; we’ll look at the top two:

  1. Correlation Coefficient - We can check the correlation coefficient between each feature.
  2. Variance Inflation Factor - We can check the variance inflation factor after training the model.

To check the correlation coefficients, we can use the same method as in Part 3: a correlation matrix, which we can compute before training the model. As a reminder, a correlation matrix displays the correlation coefficients between all pairs of variables, with values ranging from -1 to 1. Strong correlations (typically an absolute value above 0.8 or 0.9) suggest potential multicollinearity.

plt.figure(figsize=(12, 10))
correlation_matrix = player_data[features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
<Axes: >

correlation matrix heatmap of the features

From this chart, we can see that the variables with the strongest correlations are:

  • FIELD_GOALS_MADE and TWO_POINTS_MADE with a correlation coefficient of 0.86
  • FIELD_GOALS_MADE and POINTS with a correlation coefficient of 0.97
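These pairs can also be pulled out of the correlation matrix programmatically instead of reading them off the heatmap. Here’s one quick sketch, where the 0.8 cutoff is just the rule-of-thumb threshold mentioned above:

# List feature pairs whose absolute correlation exceeds the threshold (excluding self-correlations)
corr_pairs = correlation_matrix.abs().unstack().sort_values(ascending=False)
corr_pairs = corr_pairs[(corr_pairs > 0.8) & (corr_pairs < 1.0)]
print(corr_pairs[::2])  # each pair appears twice (A vs B and B vs A), so show every other row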

When you encounter variable pairings with strong correlations in your correlation matrix, there are several ways to deal with them:

  1. Remove one of the correlated variables - If two variables are highly correlated, they might provide similar information. Keeping both may lead to redundancy and potential issues with multicollinearity. We could keep the variable that we believe is more important or relevant to our analysis.
  2. Combine correlated variables - We might create a new feature that combines the information from both correlated variables. This could be done through a variety of methods such as by creating a composite score.
  3. Use regularization techniques - Methods like Ridge Regression (L2 regularization) or Lasso Regression (L1 regularization) can help mitigate the effects of multicollinearity by adding a penalty term to the model that discourages large coefficients. We could also apply regularization to the features separately, before training the OLS model.
  4. Collect more data - Sometimes, multicollinearity can be reduced by increasing the sample size, which may help differentiate the effects of correlated variables. This is more situational and will not apply in all scenarios.

Since we’re interested in seeing how the model behaves with a simple application of OLS linear regression, we’ll proceed without removing, combining, or regularizing any of the features. However, we should keep this information in mind in the future since we might want to retrain the model without one or more of the features.
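For completeness, the second detection method listed above (the variance inflation factor) can be computed with statsmodels. We’ll revisit VIF properly during model evaluation, but a rough sketch looks like this (a VIF above roughly 5-10 is commonly read as a sign of multicollinearity):

# Rough sketch: variance inflation factor for each feature using statsmodels
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_vif = sm.add_constant(player_data[features])  # add an intercept column, as VIF assumes one
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif.drop('const').sort_values(ascending=False))  # expect very large values for the perfectly collinear columns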

Weak Exogeneity

Weak exogeneity is another crucial assumption in linear regression that we need to verify. This assumption essentially means that the predictor variables (our independent variables) can be treated as fixed values, rather than random variables. In other words, we assume that our predictor variables are not influenced by the dependent variable or by external factors that also affect the dependent variable.

Practically speaking, weak exogeneity implies that our predictor variables are error-free, meaning they are not contaminated with measurement errors. While this assumption may not always be realistic in many real-world settings, it’s an important simplification that allows us to use standard linear regression techniques.

It’s worth noting that dropping this assumption leads to significantly more complex models known as errors-in-variables models. These models account for measurement errors in the predictor variables but are considerably more challenging to implement and interpret.

There is no direct statistical test for weak exogeneity, so we’ll treat this as more of a logical check than a mathematical one. For our basketball player statistics model, weak exogeneity would mean that the statistics we’re using as predictors (such as minutes played, field goals attempted, etc.) are not themselves influenced by the player’s fantasy points or by unmeasured factors that also affect fantasy points. This makes logical sense based on our domain knowledge and understanding of how the data was collected, so we’ll consider this assumption satisfied.

Train the Model

Now that we’ve covered the basics of machine learning and verified the suitability of our chosen model, we’re ready to move on to the exciting part: training our linear regression model! This process involves several key steps that will help us build a robust and accurate predictive model for our basketball player statistics.

Define the Variables

As a reminder, in an earlier section we defined the features and target variable. We’ll label the features (independent variables) as X and the target (dependent) variable as y for conciseness.

X = player_data[features]
y = player_data[target]
X
Height MINUTES_PLAYED FIELD_GOALS_MADE THREE_POINTS_MADE TWO_POINTS_MADE FREE_THROWS_MADE TOTAL_REBOUNDS ASSISTS TURNOVERS STEALS BLOCKS FOULS POINTS
0 67 895 128 28 100 39 129 135 42 31 5 47 323
1 68 823 98 20 78 27 161 148 53 60 5 34 243
2 68 778 88 58 30 63 60 111 38 16 2 26 297
3 62 1046 231 55 176 137 115 211 71 36 4 34 654
4 68 995 47 32 15 17 57 186 64 36 1 48 143
... ... ... ... ... ... ... ... ... ... ... ... ... ...
895 66 742 92 53 39 45 113 73 66 45 2 54 282
896 73 815 108 58 50 26 140 34 46 19 19 51 300
897 71 774 102 56 46 67 176 29 48 29 3 68 327
898 71 848 127 54 73 76 123 71 90 35 9 94 384
899 67 872 133 55 78 44 127 29 41 37 2 52 365

900 rows × 13 columns

y
0       710.3
1       735.2
2       533.5
3      1117.5
4       500.4
        ...  
895     555.1
896     549.0
897     597.7
898     636.1
899     597.9
Name: FANTASY_POINTS, Length: 900, dtype: float64

Create Training and Testing Splits

Now that we have our variables defined, we can create the training and testing splits. This involves dividing our dataset into two parts: a set for training and a set for testing. The train set will be used to train the model, and the test set will be used exclusively for testing and evaluating the model after training.

Why wouldn’t we just use all of the data for training?

  1. Model Evaluation - The test set allows us to evaluate how well our model performs on unseen data, giving us a more realistic estimate of its performance in real-world scenarios.
  2. Preventing Overfitting - By keeping a portion of our data separate for testing, we can detect if our model is overfitting to the training data. Overfitting occurs when a model learns the training data too well, including its noise and peculiarities, leading to poor generalization on new data.
  3. Validating Model Generalization - The test set helps us validate whether our model can generalize well to new, unseen data. This is crucial for ensuring that our model will perform reliably when deployed in practice.
  4. Hyperparameter Tuning - While we don’t have any hyperparameters to tune today, splitting the data is also essential for hyperparameter tuning techniques like cross-validation.

To create our training and test splits, we’ll use the train_test_split function from scikit-learn, which splits our data into randomly selected train and test subsets. Here’s how we can implement it:

X_train, X_test, y_train, y_test = train_test_split(X, y)

The _train splits (X_train and y_train) include the data for training the model and the _test splits (X_test and y_test) include the data for testing the model. By using these separate splits, we know that our model is trained on one subset of the data and evaluated on a completely separate subset, so we can essentially set aside the _test splits until we’re ready to evaluate the performance of the model.
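We can also confirm the size of each split with a quick shape check (with 900 rows in total, the default split puts roughly three-quarters of them into training):

print(X_train.shape, y_train.shape)  # training features and target
print(X_test.shape, y_test.shape)    # held-out test features and target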

We can take a look at the first few rows of the training split for the features using X_train:

X_train.head(5)
Height MINUTES_PLAYED FIELD_GOALS_MADE THREE_POINTS_MADE TWO_POINTS_MADE FREE_THROWS_MADE TOTAL_REBOUNDS ASSISTS TURNOVERS STEALS BLOCKS FOULS POINTS
200 68 669 53 10 43 75 73 111 89 25 3 50 191
300 73 577 103 0 103 46 170 29 56 30 37 81 252
336 74 791 108 6 102 38 213 19 56 33 40 97 260
727 70 591 84 0 84 39 210 12 60 20 21 93 207
403 74 303 49 2 47 16 125 8 23 10 26 52 116

Reproducibility

You might notice that if you run the train_test_split() for a second time, there are different rows of data included in each split. This is because the data is shuffled before splitting, and the shuffling is not guaranteed to be reproducible by default.

X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train.head(5)
Height MINUTES_PLAYED FIELD_GOALS_MADE THREE_POINTS_MADE TWO_POINTS_MADE FREE_THROWS_MADE TOTAL_REBOUNDS ASSISTS TURNOVERS STEALS BLOCKS FOULS POINTS
397 74 596 80 4 76 63 192 28 48 16 32 69 227
501 73 829 199 4 195 65 301 53 73 42 7 96 467
107 69 1072 128 42 86 75 187 137 86 111 7 80 373
462 75 619 59 0 59 54 126 18 57 29 26 53 172
263 78 995 278 2 276 103 330 52 62 17 60 50 661

This means that the model can be trained and tested on a different subset of the data each time you run it. That’s often a good thing, but reproducible results are preferable for the initial creation and evaluation of the model (especially if you want to follow along with this guide).

We can ensure reproducibility of the splits by controlling the shuffling with the random_state parameter:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=314)
X_train.head(5)
Height MINUTES_PLAYED FIELD_GOALS_MADE THREE_POINTS_MADE TWO_POINTS_MADE FREE_THROWS_MADE TOTAL_REBOUNDS ASSISTS TURNOVERS STEALS BLOCKS FOULS POINTS
239 68 1037 258 50 208 92 152 96 87 32 3 59 658
638 70 961 109 44 65 66 136 87 75 63 11 71 328
848 69 908 142 72 70 58 78 61 71 20 5 71 414
260 76 913 112 28 84 59 200 40 64 23 73 94 311
745 66 713 146 36 110 57 124 105 96 72 2 93 385

Now, no matter how many times you run the snippet, you should get the same rows of data in the train and test splits every time.

Dataset Proportions

Another parameter that is commonly specified is the test_size (or, less commonly, the train_size) to specify a proportion of the dataset to include in the test or train split, respectively. According to the documentation, the default test_size (assuming train_size is not set) is 0.25, so one-quarter of the data will be included in the test dataset and three-quarters of the data will be included in the train dataset.

We can verify this on our own splits:

print(f'Test data split proportion: {len(X_test) / len(X)}')
print(f'Train data split proportion: {len(X_train) / len(X)}')
Test data split proportion: 0.25
Train data split proportion: 0.75

If you want to change those proportions, then you can use either the test_size or the train_size parameter. For example, if you want the test split to be 20% of the data instead of 25%, you would run:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
print(f'Test data split proportion: {len(X_test) / len(X)}')
print(f'Train data split proportion: {len(X_train) / len(X)}')
Test data split proportion: 0.2
Train data split proportion: 0.8

We’ll be using the default 25% test size, so we can remove the test_size parameter for today.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=314)
print(f'Test data split proportion: {len(X_test) / len(X)}')
print(f'Train data split proportion: {len(X_train) / len(X)}')
Test data split proportion: 0.25
Train data split proportion: 0.75

A Note on DataFrames versus NumPy Arrays

You might notice that we’re using DataFrames directly (player_data[features]) in the train_test_split, but some code samples will use NumPy arrays instead (player_data[features].to_numpy()). Both approaches work, but they have different implications depending on how you intend to use the data. Let’s quickly review the pros and cons of each approach.

Using pandas DataFrames or Series

This entails passing X and y as DataFrames or Series directly to the train_test_split function.

print(f'dtype of X: {type(X)}')
print(f'dtype of y: {type(y)}')
dtype of X: <class 'pandas.core.frame.DataFrame'>
dtype of y: <class 'pandas.core.series.Series'>

Some advantages of this approach are:

  • Retains column names, which can be useful for model interpretation and debugging.
  • Works seamlessly with scikit-learn, which can handle DataFrames directly.
  • Makes it easier to perform certain operations, like feature selection and transformation.

Some disadvantages of this approach are:

  • Slightly more overhead than working with NumPy arrays, but generally negligible.

Using NumPy arrays

This entails passing X and y as NumPy arrays to the train_test_split function.

print(f'dtype of X: {type(X.to_numpy())}')
print(f'dtype of y: {type(y.to_numpy())}')
dtype of X: <class 'numpy.ndarray'>
dtype of y: <class 'numpy.ndarray'>

Some advantages of this approach are:

  • Can be slightly faster for certain operations because NumPy arrays are lower-level structures.
  • Useful if you need to work with libraries that require NumPy arrays (although most scikit-learn functions accept DataFrames as well).

Some disadvantages of this approach are:

  • You lose the column names and index information, which can make it harder to trace errors or interpret results later.
  • Not necessary for most scikit-learn functions, which work fine with DataFrames.

In summary, you can choose either method based on your preference, but it’s generally more convenient to use DataFrames directly unless you have a specific reason to convert to NumPy arrays, since DataFrames retain the metadata that is useful during data analysis and model interpretation. If you have a specific scenario where a NumPy array is required, then use .to_numpy(). For today, we’ll move on with the DataFrames approach.

Train the Model

Now that we have our data split into training and test sets, we’re ready to train our linear regression model. We’ll use scikit-learn’s LinearRegression class (which uses ordinary least squares) for this purpose.

We can start by initializing the model using an instance of the LinearRegression class with the default parameters:

linear_reg_model = LinearRegression()

We can then use the fit() method to actually train our model on our data. This method takes two arguments: X_train (the training split of our features) and y_train (the training split of our target variable).

linear_reg_model.fit(X_train, y_train)
LinearRegression()

During the training process, the model uses the training set to learn the optimal parameters for each feature that minimize the difference between the predicted and actual values of the target variable. Once training is complete, our linear_reg_model object will contain the learned parameters (coefficients and intercept) that can be used to make predictions on new data. Later, we can feed X_test to the model to generate predictions and compare them against the actual values stored in y_test to evaluate the model’s performance.
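As a quick preview of that step (covered properly in the next article), generating predictions from the trained model is a one-liner:

# Preview: predict fantasy points for the held-out test rows and peek at the first few
y_pred = linear_reg_model.predict(X_test)
print(y_pred[:5])
print(y_test[:5].values)

We’ll save the actual scoring for the next article; for now, let’s take a look at the learned parameters directly.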

While not strictly necessary, it can be helpful to circle back to the linear regression equation mentioned in an earlier section by printing the final equation of our trained model. For a linear regression model, the learned parameters are the coefficients and intercept, which can be used to assemble the model equation.

The model coefficients can be printed with the coef_ property:

linear_reg_model.coef_
array([ 1.37791128e-14,  1.19348975e-15,  1.66666667e+00,  1.33333333e+00,
        3.33333333e-01,  1.00000000e+00,  1.20000000e+00,  1.50000000e+00,
       -1.00000000e+00,  2.00000000e+00,  2.00000000e+00, -1.20129601e-15,
       -7.23379689e-16])

The model intercept can be printed with the intercept_ property:

linear_reg_model.intercept_
-6.821210263296962e-13

We can print the feature names with the feature_names_in_ property:

linear_reg_model.feature_names_in_
array(['Height', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE',
       'THREE_POINTS_MADE', 'TWO_POINTS_MADE', 'FREE_THROWS_MADE',
       'TOTAL_REBOUNDS', 'ASSISTS', 'TURNOVERS', 'STEALS', 'BLOCKS',
       'FOULS', 'POINTS'], dtype=object)

This isn’t very readable on its own, so with a bit of effort we can map each coefficient to the corresponding feature name. We can do this by creating a pandas Series of the coefficients, indexed by the feature names:

coef_series = pd.Series(data=linear_reg_model.coef_, index=linear_reg_model.feature_names_in_)
coef_series
Height               1.377911e-14
MINUTES_PLAYED       1.193490e-15
FIELD_GOALS_MADE     1.666667e+00
THREE_POINTS_MADE    1.333333e+00
TWO_POINTS_MADE      3.333333e-01
FREE_THROWS_MADE     1.000000e+00
TOTAL_REBOUNDS       1.200000e+00
ASSISTS              1.500000e+00
TURNOVERS           -1.000000e+00
STEALS               2.000000e+00
BLOCKS               2.000000e+00
FOULS               -1.201296e-15
POINTS              -7.233797e-16
dtype: float64

This makes the coefficients far more readable! We can print this data as a string in the format of coefficient*feature_name (with newlines for formatting) to then use in the model equation:

coef_string = "\n + ".join(f"{coef}*{feat}" for feat, coef in coef_series.items())
print(coef_string)
1.3779112764770156e-14*Height
 + 1.1934897514720433e-15*MINUTES_PLAYED
 + 1.6666666666666634*FIELD_GOALS_MADE
 + 1.333333333333333*THREE_POINTS_MADE
 + 0.33333333333333315*TWO_POINTS_MADE
 + 1.0000000000000009*FREE_THROWS_MADE
 + 1.1999999999999982*TOTAL_REBOUNDS
 + 1.499999999999999*ASSISTS
 + -0.9999999999999992*TURNOVERS
 + 1.999999999999999*STEALS
 + 2.000000000000001*BLOCKS
 + -1.201296007113939e-15*FOULS
 + -7.233796894823286e-16*POINTS

Let’s use this coefficient string to assemble the final model equation:

print(f'{target} = {coef_string} + {linear_reg_model.intercept_} + error')
FANTASY_POINTS = 1.3779112764770156e-14*Height
 + 1.1934897514720433e-15*MINUTES_PLAYED
 + 1.6666666666666634*FIELD_GOALS_MADE
 + 1.333333333333333*THREE_POINTS_MADE
 + 0.33333333333333315*TWO_POINTS_MADE
 + 1.0000000000000009*FREE_THROWS_MADE
 + 1.1999999999999982*TOTAL_REBOUNDS
 + 1.499999999999999*ASSISTS
 + -0.9999999999999992*TURNOVERS
 + 1.999999999999999*STEALS
 + 2.000000000000001*BLOCKS
 + -1.201296007113939e-15*FOULS
 + -7.233796894823286e-16*POINTS + -6.821210263296962e-13 + error

Analyze the Model Equation

Now that we have the final model equation, we can see that several variables have a coefficient that is very close to zero (on the order of 1e-15 or smaller). If we keep only the coefficients where abs(coef) > 0.0001, the equation simplifies to:

coef_series_simple = coef_series[coef_series.abs() > 0.0001]
coef_string_simple = "\n\t\t + ".join(f"{coef:.4f} * {feat}" for feat, coef in coef_series_simple.items())
print(f'{target} = {coef_string_simple} + {linear_reg_model.intercept_} + error')
FANTASY_POINTS = 1.6667 * FIELD_GOALS_MADE
		 + 1.3333 * THREE_POINTS_MADE
		 + 0.3333 * TWO_POINTS_MADE
		 + 1.0000 * FREE_THROWS_MADE
		 + 1.2000 * TOTAL_REBOUNDS
		 + 1.5000 * ASSISTS
		 + -1.0000 * TURNOVERS
		 + 2.0000 * STEALS
		 + 2.0000 * BLOCKS + -6.821210263296962e-13 + error

Excellent! We can compare this to the original equation for Fantasy Points:

FANTASY_POINTS =  3   * THREE_POINTS_MADE + 
                  2   * TWO_POINTS_MADE + 
                  1   * FREE_THROWS_MADE + 
                  1.2 * TOTAL_REBOUNDS + 
                  1.5 * ASSISTS + 
                  2   * BLOCKS + 
                  2   * STEALS + 
                 -1   * TURNOVERS

The model estimated many of the coefficients to have the same values as in the known equation. We included more features in the machine learning model than appear in the original equation, and most of those extra features were estimated to have coefficients of approximately zero. However, the coefficient was not zero for one of those extra features, FIELD_GOALS_MADE, and the coefficients for THREE_POINTS_MADE and TWO_POINTS_MADE do not match between the estimated and original equations. Here’s a quick table summarizing the differences:

                       FIELD_GOALS_MADE  THREE_POINTS_MADE  TWO_POINTS_MADE
Estimated Coefficient  1.6               1.3                0.3
Original Coefficient   Does Not Exist    3                  2

Note: the fractions were rounded to the nearest tenth for readability, but the values are actually \(1 \frac{2}{3}\) instead of 1.6, \(1 \frac{1}{3}\) instead of 1.3, and \(\frac{1}{3}\) instead of 0.3.

This already seems pretty close, but we can take it a step further by revisiting the definition of field goals. The FIELD_GOALS_MADE variable is actually the sum of TWO_POINTS_MADE and THREE_POINTS_MADE (but not FREE_THROWS_MADE). We can sanity check this in our dataset as well:

check = player_data['FIELD_GOALS_MADE'] == player_data['TWO_POINTS_MADE'] + player_data['THREE_POINTS_MADE']
print(f'True count: {check.sum()} rows')
print(f'False count: {(~check).sum()} rows')
True count: 900 rows
False count: 0 rows

So, we can write this relationship as the equation:

FIELD_GOALS_MADE = TWO_POINTS_MADE + THREE_POINTS_MADE

We can then substitute this equation into the model’s equation:

FANTASY_POINTS =  1.6 * (TWO_POINTS_MADE + THREE_POINTS_MADE) + 
                  1.3 * THREE_POINTS_MADE + 
                  0.3 * TWO_POINTS_MADE + 
                  1   * FREE_THROWS_MADE + 
                  1.2 * TOTAL_REBOUNDS + 
                  1.5 * ASSISTS + 
                  2   * BLOCKS + 
                  2   * STEALS + 
                 -1   * TURNOVERS

Distributing the 1.6 coefficient turns this into:

FANTASY_POINTS =  1.6 * THREE_POINTS_MADE + 
                  1.3 * THREE_POINTS_MADE + 
                  1.6 * TWO_POINTS_MADE +
                  0.3 * TWO_POINTS_MADE + 
                  1   * FREE_THROWS_MADE + 
                  1.2 * TOTAL_REBOUNDS + 
                  1.5 * ASSISTS + 
                  2   * BLOCKS + 
                  2   * STEALS + 
                 -1   * TURNOVERS

Using the actual fractional notation of these coefficients (instead of the rounded values shown above), this simplifies down into:

FANTASY_POINTS =  3   * THREE_POINTS_MADE + 
                  2   * TWO_POINTS_MADE +
                  1   * FREE_THROWS_MADE + 
                  1.2 * TOTAL_REBOUNDS + 
                  1.5 * ASSISTS + 
                  2   * BLOCKS + 
                  2   * STEALS + 
                 -1   * TURNOVERS

This means that the model estimated approximately the same equation as the original fantasy points calculation, with the addition of a few extra features whose coefficients are close to zero and an intercept value close to zero. As a reminder, some models are sufficiently complex that it can be difficult to output and effectively analyze the estimated equation, but doing so provides a lot of value in this case.

Alternate Training

Since we have ended up with essentially the same equation as the original fantasy points calculation, we can logically expect our model to perform quite well in the next phase, model evaluation. So, we can also train an alternate model with a few of the features removed for comparison. First, let’s remove all three of the features involved in the strongest correlations identified earlier (FIELD_GOALS_MADE, TWO_POINTS_MADE, and POINTS):

X_alt = player_data[features].drop(columns=['FIELD_GOALS_MADE', 'TWO_POINTS_MADE', 'POINTS'])

Note: this is more features than you would likely want to remove in a real-world scenario, but removing too many features will give us an opportunity to compare a less-than-perfect model to a perfect model in the model evaluation phase.

Our target variable is unchanged, so we can create alternate training and test splits:

X_train_alt, X_test_alt, y_train_alt, y_test_alt = train_test_split(X_alt, y, random_state=314)

We can now train an alternate model using these new splits:

ols_alt = LinearRegression()
ols_alt.fit(X_train_alt, y_train_alt)
LinearRegression()

We can print the model equation for this alternate model as well:

coef_series_alt = pd.Series(data=ols_alt.coef_, index=ols_alt.feature_names_in_)
coef_series_alt = coef_series_alt[coef_series_alt > 0.0001]
coef_string_alt = "\n\t\t + ".join(f"{coef:.4f} * {feat}" for feat, coef in coef_series_alt.items())
print(f'{target} = {coef_string_alt} + {ols_alt.intercept_} + error')
FANTASY_POINTS = 1.4457 * Height
		 + 0.1001 * MINUTES_PLAYED
		 + 2.2441 * THREE_POINTS_MADE
		 + 2.3925 * FREE_THROWS_MADE
		 + 1.5098 * TOTAL_REBOUNDS
		 + 1.2482 * ASSISTS
		 + 2.1495 * STEALS
		 + 2.5698 * BLOCKS + -131.85858178609237 + error

We can see that the model coefficients and the y-intercept are substantially different from the model we originally trained. We won’t know if this alternate model performs as well as the original one until we evaluate each model in the next article.

Wrap Up

In this series, we’ve built a new dataset by acquiring and then combining the NCAA women’s basketball player information dataset with the Yahoo Sports player statistics dataset. We laid the groundwork for data analysis by cleaning and preprocessing the combined player data, and then expanded upon it by engineering a few new features. In the previous part, we took a closer look at the underlying data in each column and created visualizations to identify the relationship between various parameters. In today’s article, we learned how to select an appropriate machine learning model, properly split our data set into train and test subsets, and trained the model. In the next section, we’ll move on to evaluating the model’s performance.