Today we’ll cover the basics of machine learning and examine how to select an appropriate machine learning model. This is the sixth part of a series that walks through the entire process of a data science project - from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, creating visualizations, and machine learning.

Getting Started

First, let’s take a look at an overview of this data science project. If you’re already familiar with it, feel free to skip to the next section.

Project Overview

As a reminder, the dataset we’ll be using in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season. Here’s a brief description of each major step of this project:

the steps for this data science project

  1. Data Acquisition - This initial step involves obtaining data from two sources: (1) exporting the NCAA’s online individual player statistics report and (2) making API requests to the Yahoo Sports endpoint.
  2. Data Cleaning - This step focuses on identifying and correcting any errors within the dataset. This includes removing duplicates, correcting inaccuracies, and handling missing data.
  3. Data Preprocessing - This step ensures the data is suitable for analysis by converting datatypes, standardizing units, and replacing abbreviations.
  4. Feature Engineering - This step involves selecting and expanding upon the dataset’s features (or columns). This includes calculating additional metrics from existing columns.
  5. Data Exploration - This step focuses on analyzing and visualizing the dataset to uncover patterns, relationships, and general trends and is a helpful preliminary step before deeper analysis.
  6. Creating Visualizations - This step involves identifying the relationships between various parameters (such as height and blocked shots) and generating meaningful visualizations (such as bar charts, scatterplots, and candlestick charts).
  7. Machine Learning - This step focuses on selecting, training, and evaluating a machine learning model. For this project, the model will identify the combination of individual player statistics that correlates with optimal performance.

We’ll use Python along with popular libraries like pandas, numpy, and scikit-learn to accomplish these tasks efficiently. By the end of this series, you’ll be equipped with the skills needed to gather raw data from online sources, structure it into a usable format, eliminate any inconsistencies and errors, identify relationships between variables, create meaningful visualizations, and train a basic machine learning model. Due to the size of this project, today we’ll cover part of the seventh step: selecting a machine learning model.

Dependencies

Since this is the sixth installment in the series, you likely already have your environment set up and can skip to the next section. If you’re not already set up and want to follow along on your own machine, it’s recommended to read the first article of the series, or at least review the Getting Started section of that post, before continuing.

You’ll want to have the latest version of Python installed with the following packages: pandas, numpy, seaborn, matplotlib, and scikit-learn.

For today’s machine learning segment specifically, we’ll want to import a few of these libraries:

import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

Import Data

In Part 3 of this series, we engineered new features for our dataset, which is stored in a dataframe named player_data. No changes have been made to the underlying dataset in the intervening articles. If you want to follow along with the code examples in this article, it’s recommended to import the player_data dataframe before proceeding.

player_data = pd.read_excel('player_data_engineered.xlsx')
player_data.head()
PLAYER_NAME Team Class Height Position PLAYER_ID TEAM_NAME GAMES MINUTES_PLAYED FIELD_GOALS_MADE ... Conference MINUTES_PER_GAME FOULS_PER_GAME POINTS_PER_GAME ASSISTS_PER_GAME STEALS_PER_GAME BLOCKS_PER_GAME REBOUNDS_PER_GAME ASSIST_TO_TURNOVER FANTASY_POINTS
0 Kiara Jackson UNLV (Mountain West) Junior 67 Guard ncaaw.p.67149 UNLV 29 895 128 ... Mountain West 30.862069 1.620690 11.137931 4.655172 1.068966 0.172414 4.448276 3.214286 710.3
1 Raven Johnson South Carolina (SEC) Sophomore 68 Guard ncaaw.p.67515 South Carolina 30 823 98 ... SEC 27.433333 1.133333 8.100000 4.933333 2.000000 0.166667 5.366667 2.792453 735.2
2 Gina Marxen Montana (Big Sky) Senior 68 Guard ncaaw.p.57909 Montana 29 778 88 ... Big Sky 26.827586 0.896552 10.241379 3.827586 0.551724 0.068966 2.068966 2.921053 533.5
3 McKenna Hofschild Colorado St. (Mountain West) Senior 62 Guard ncaaw.p.60402 Colorado St. 29 1046 231 ... Mountain West 36.068966 1.172414 22.551724 7.275862 1.241379 0.137931 3.965517 2.971831 1117.5
4 Kaylah Ivey Boston College (ACC) Junior 68 Guard ncaaw.p.64531 Boston Coll. 33 995 47 ... ACC 30.151515 1.454545 4.333333 5.636364 1.090909 0.030303 1.727273 2.906250 500.4

5 rows × 40 columns

Basics of Machine Learning

Before we get into selecting a machine learning model, let’s briefly cover a few basics of machine learning. Machine learning is a branch of artificial intelligence that focuses on creating algorithms and statistical models that allow computer systems to “learn” how to improve their performance on a specific task through experience. In the context of our basketball statistics project, machine learning can be particularly useful for predicting player performance, classifying player position, and identifying similar players.

Key concepts in machine learning that we’ll encounter include:

  1. Model - The system that learns patterns from data and can be used to make predictions on previously unseen data. Machine learning models are often of a specific type (Linear or Logistic Regression, Random Forests, Support Vector Machines, Neural Networks, etc.).
  2. Training Data - The subset of our data used to train the model.
  3. Testing Data - A separate subset of data used to evaluate the model’s performance.
  4. Features - The input variables used to make predictions. This is sometimes referred to as the independent variable(s). In our case, these could be various player statistics like three points made or assists.
  5. Target Variable - The variable we’re trying to predict or optimize, such as points scored or fantasy points. This is sometimes referred to as the dependent variable(s), as it depends on the independent variable(s).
  6. Parameters - The values that the model learns during training, such as coefficients in linear regression. These parameters define how the model transforms input features into predictions.
  7. Hyperparameters - The configuration settings for the model that are set before training begins. These are not learned from the data but are specified by the data scientist. Examples include learning rate, number of iterations, or regularization strength. Hyperparameters can significantly affect model performance and are often tuned to optimize the model.
    • Note: The model we’ll be using in this project is straightforward and doesn’t typically have hyperparameters in the traditional sense. However, it’s still important to know the difference between parameters and hyperparameters, since many models do have them (see the short sketch after this list).
  8. Residuals - The differences between the observed values and the predicted values from the model. Residuals help assess how well the model fits the data and can reveal patterns or issues in the model’s predictions.
  9. Model Evaluation - Metrics used to assess how well our model is performing. For a Linear Regression model, this will include metrics like Mean Squared Error (MSE) and the R-squared value.
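
To make the distinction between parameters and hyperparameters more concrete, here’s a minimal scikit-learn sketch. It uses Ridge regression purely as an illustration (since it has an obvious hyperparameter), and X_train and y_train are hypothetical placeholders for training data:

from sklearn.linear_model import Ridge

# Hyperparameter: chosen by us before training (here, the regularization strength)
model = Ridge(alpha=1.0)

# Parameters: learned from the data during training and stored on the fitted model
# model.fit(X_train, y_train)
# print(model.coef_, model.intercept_)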

We’ll primarily use the first six terms throughout this article, so it’s best to familiarize yourself with them now. The other concepts will be explored in more detail in future articles (please let me know if that is something you are interested in!).

It’s important to note that our focus in this article is on classic machine learning models designed for tabular data. We won’t be covering models built specifically for natural language processing, image recognition, or video analysis. However, it’s worth mentioning that many problems in these domains often get transformed into tabular data problems, so some of the principles we discuss here may still apply in those contexts. With all of that out of the way, let’s move on to defining the problem and selecting an appropriate machine learning model.

Model Selection

Before we choose a model, it’s a good idea to clearly define our objective to help us ensure we’re using an appropriate model for our task. This step sets the foundation for our entire machine learning process and helps guide our decision-making throughout the project.

Define the Objective

The goal of a machine learning project in commercial settings will often be determined by a desired business outcome. However, for a hobby project like this, we have the freedom to pick the objective. So, for today’s machine learning model, we’ll focus on training the model to predict a target variable based on one or more input features (such as field goals, blocks, assists, etc.). Next, let’s choose that target variable and the set of features.

Define the Target Variable

The target variable has a massive impact on the machine learning model, including what type (regression, classification, clustering, etc.) of machine learning model is appropriate. For today, let’s choose one of the numerical columns from Part 4:

numerical_columns = ['Height', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 
                     'THREE_POINTS_MADE', 'TWO_POINTS_MADE', 'FREE_THROWS_MADE',
                     'TOTAL_REBOUNDS', 'ASSISTS', 'TURNOVERS', 
                     'STEALS', 'BLOCKS', 'FOULS', 'POINTS', 'FANTASY_POINTS']

We could choose the POINTS variable, but that would focus the model primarily on offensive players. Choosing the BLOCKS variable would instead prioritize defensive players. To include all types of players, let’s use FANTASY_POINTS as the target variable.

target = 'FANTASY_POINTS'

Define the Features

Next up, let’s define the features that the model should use. Typically, this will involve performing feature selection and feature engineering on the dataset, but we’ve already essentially done that in Part 3 and Part 4. So, we’ll use the list of numerical columns with the target variable (FANTASY_POINTS) removed.

features = ['Height', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'THREE_POINTS_MADE',
            'TWO_POINTS_MADE', 'FREE_THROWS_MADE', 'TOTAL_REBOUNDS', 'ASSISTS',
            'TURNOVERS', 'STEALS', 'BLOCKS', 'FOULS', 'POINTS']
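
If you’d rather not maintain this list by hand, the same result can be derived programmatically from numerical_columns by removing the target:

# Every numerical column except the target variable
features = [column for column in numerical_columns if column != target]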

Ultimate Objective

In summary, our ultimate objective is to train a machine learning model to predict FANTASY_POINTS (the target variable) based on Height, MINUTES_PLAYED, FIELD_GOALS_MADE, THREE_POINTS_MADE, TWO_POINTS_MADE, FREE_THROWS_MADE, TOTAL_REBOUNDS, ASSISTS, TURNOVERS, STEALS, BLOCKS, FOULS, and POINTS (the features).

A Solved Problem

If you’ve been following this series from the beginning, you might remember that we actually calculated FANTASY_POINTS from some of these variables at the end of Part 3. The equation we used was:

FANTASY_POINTS =  3   * THREE_POINTS_MADE + 
                  2   * TWO_POINTS_MADE + 
                  1   * FREE_THROWS_MADE + 
                  1.2 * TOTAL_REBOUNDS + 
                  1.5 * ASSISTS + 
                  2   * BLOCKS + 
                  2   * STEALS + 
                 -1   * TURNOVERS
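
As a quick sanity check, we can recompute this formula directly from the dataframe’s columns and confirm that it matches the stored FANTASY_POINTS values (any differences should come down to floating-point rounding):

# Recompute fantasy points from the underlying statistics
recomputed = (3   * player_data['THREE_POINTS_MADE']
            + 2   * player_data['TWO_POINTS_MADE']
            + 1   * player_data['FREE_THROWS_MADE']
            + 1.2 * player_data['TOTAL_REBOUNDS']
            + 1.5 * player_data['ASSISTS']
            + 2   * player_data['BLOCKS']
            + 2   * player_data['STEALS']
            - 1   * player_data['TURNOVERS'])

# The largest absolute difference should be effectively zero
print((player_data['FANTASY_POINTS'] - recomputed).abs().max())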

So this is essentially a solved problem and training a machine learning model is technically unnecessary. However, there are some benefits of using a known relationship as a starting point for learning about machine learning and training your first model:

  1. Verification of results - Since we know the exact formula used to calculate the target variable, we can easily verify if our model is working correctly. This provides a clear benchmark for our model’s performance.
  2. Understanding feature importance - By comparing the coefficients our model learns to the actual coefficients used in the fantasy points calculation, we can gain insights into how well the model identifies important features.
  3. Concept reinforcement - Working with a known relationship helps reinforce key machine learning concepts like feature selection, model training, and evaluation in a controlled environment.
  4. Confidence building - Successfully predicting a known relationship can also boost confidence in applying machine learning techniques to more complex, unknown relationships in the future.

While using a known relationship is a great learning tool, it’s important to remember that in real-world scenarios, the relationships between variables are often unknown or more complex. This exercise serves as a stepping stone to tackling more challenging predictive tasks. To simulate a bit of this complexity and to make the future article on model evaluation more valuable, we’ll also train an alternate version of the model after removing some of the features.

Identify Appropriate Models

The first part of training any machine learning model is selecting the model to use. This might sound obvious, but selecting the “best” model for your problem depends on a variety of factors. We’ll likely explore this step in more detail in a future article (send me a note if you would be interested in that), but for today we’ll go through scikit-learn’s model flowchart to find an appropriate model.

scikit-learn's algorithm flowchart

Let’s start with the first question in the flowchart: “Are there more than 50 samples?” We can answer this by checking the number of rows in our dataframe.

len(player_data)
900

This confirms that there are more than 50 rows in our dataframe, so we can follow the “yes” path to the next question: “Are we predicting a category?” We’ll be predicting FANTASY_POINTS, so let’s check its datatype using .dtypes:

player_data['FANTASY_POINTS'].dtypes
dtype('float64')

So in this case our target variable contains float64 values. Note that in many cases, the data type might be listed as object, such as if numeric data is stored as strings or if there are multiple datatypes in the column. In those cases, it can also be a good idea to look at a sample of the target variable data:

player_data['FANTASY_POINTS'].sample(10)
611    776.6
554    901.9
201    624.2
386    875.9
791    611.0
628    563.9
19     474.0
853    668.6
175    410.0
741    534.0
Name: FANTASY_POINTS, dtype: float64

Since our target variable contains entirely numeric data, we can answer “no” to “predicting a category?”. Note that in some cases you might treat numeric data as categorical, or you might want to separate the numeric data into bins, but for today we are not predicting a category.
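
For example, if we did want to treat FANTASY_POINTS as a category, a minimal sketch using pandas’ cut() might look like the following (the bin edges and tier labels are arbitrary, purely for illustration):

# Bucket the continuous fantasy points into three illustrative tiers
fantasy_tiers = pd.cut(player_data['FANTASY_POINTS'],
                       bins=[0, 400, 800, float('inf')],
                       labels=['low', 'medium', 'high'])
print(fantasy_tiers.value_counts())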

Next up: “Are we predicting a quantity?” In this case, the answer is “yes”, since FANTASY_POINTS is indeed a quantity. Answering “yes” here guides us to the regression category of models. Regression analysis is a statistical method used to model the relationship between a continuous numeric dependent variable and one or more independent variables. It aims to predict the value of the dependent variable based on the independent variables. Regression models are a staple in the industry and are often a good starting point due to their simplicity and the explainability of their results.

The next question is: “Are there fewer than 100,000 samples?” Looking back to the output of len(player_data) above, we know that we do have fewer than 100k samples, so we can go down the “yes” path.

At this point, we have ended up at the question: “Should only a few features be important?” Answering “yes” will take us to Lasso and answering “no” will take us to RidgeRegression. However, both Lasso and Ridge regression in this context are meant to be improvements upon Ordinary Least Squares (OLS) linear regression. To clarify, linear regression is a specific type of regression analysis where the relationship between the dependent variable and the independent variables is assumed to be linear. It finds the best-fitting straight line (called the regression line) through the data points to make predictions. So, instead of using either lasso or ridge regression today, we’ll explore the classic OLS linear regression model.
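
For reference, all three of these options are available in scikit-learn’s linear_model module and share the same fit/predict interface, so swapping one for another later only requires changing a single line (the alpha values shown are simply the library defaults):

from sklearn.linear_model import Lasso, LinearRegression, Ridge

model = LinearRegression()    # classic OLS: no regularization, used in this series
# model = Lasso(alpha=1.0)    # L1 regularization: can shrink some coefficients to zero
# model = Ridge(alpha=1.0)    # L2 regularization: shrinks all coefficients toward zero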

Model Characteristics

Ordinary Least Squares (OLS) is the most basic form of linear regression. It aims to minimize the sum of the squared differences between observed and predicted values (the residuals). As a type of linear regression, it assumes a linear relationship between the independent and dependent variables. It can be sensitive to outliers, which can skew the results significantly. It also does not offer regularization, which means it can overfit when there are many predictors or when predictors are highly correlated. However, it is simple and explainable, so it offers a good starting point.

As mentioned earlier, linear regression assumes that the relationship between the independent variables (features) and the dependent variable (target) can be described by a straight line. This line (also known as the regression line) is represented by an equation of the form:

\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n\]

Where:

  • \(y\) is the predicted value of the target variable
  • \(x_1, x_2, …, x_n\) are the input features
  • \(\beta_0\) is the y-intercept (the value of \(y\) when all \(x\) are zero)
  • \(\beta_1, \beta_2, …, \beta_n\) are the coefficients that represent the change in \(y\) for a one-unit change in the corresponding \(x\), holding all other \(x\) constant

We can also rewrite this equation for today’s project as:

\[FantasyPoints = \beta_0 + \beta_1Height + \beta_2MinutesPlayed + ... + \beta_nPoints\]

The goal of linear regression is to find the values of \(\beta_0\) and \(\beta_1, \beta_2, …, \beta_n\) that minimize the difference between the predicted \(FantasyPoints\) values and the actual \(FantasyPoints\) values in our dataset.
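
More formally, OLS chooses the coefficients that minimize the sum of squared residuals over the \(m\) rows of training data:

\[\hat{\beta} = \underset{\beta}{\arg\min} \sum_{i=1}^{m} \left(y_i - \beta_0 - \beta_1x_{i1} - \beta_2x_{i2} - ... - \beta_nx_{in}\right)^2\]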

Note: This is a slightly simplified explanation of the underlying mathematics. I highly recommend reading the Wikipedia page, or other suitable sources, for a more nuanced understanding of the process.

Now that we conceptually understand a bit more about how this model works, we can take a quick look at the common assumptions of the model and whether or not those assumptions are satisfied.

Verify Assumptions of the Model

Next up, we should verify that the underlying assumptions of the machine learning model are satisfied by our particular problem and situation. It might be tempting to skip this step, but it can save hours of time down the road and helps ensure your model generalizes well. The basic assumptions of linear regression models are generally:

  • Linearity - The relationship between the target variable and the features is linear.
  • No Multicollinearity - The features are not too highly correlated with each other.
  • Weak Exogeneity - The features are treated as fixed values, not random variables, and are not correlated with any error in \(FantasyPoints\).
  • Independence of Errors - Residuals are independent of and unrelated to one another.
  • Zero Mean of Residuals - The mean of the residuals is zero or close to zero.
  • Constant Variance (Homoskedasticity) - Residuals have constant variance across all levels of the independent variables.

Note: These are simplified summaries of each assumption. We’ll go through each one in a bit more detail later in this article and in future articles, but once again I highly recommend reading a supplemental source for a deeper understanding. Some suitable sources include Wikipedia, Stanford, and Statistics by Jim.

You might notice that the first three assumptions pertain to the features and the last three pertain to the residuals. Residuals are the differences between the actual target values and the values the model predicts. This means we must be able to generate predictions, and therefore must have already trained the model, before evaluating those assumptions. So we’ll go through the first three assumptions (linearity, no multicollinearity, and weak exogeneity) today and leave the other three for a future article.

Linearity

The first assumption we’ll look at is whether the relationship between the independent and dependent variables is linear. Linear regression models can model curvature to some extent (in the variables, not the parameters), but for today we’ll focus on a strictly linear relationship.
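
For instance, a model of the form

\[y = \beta_0 + \beta_1x + \beta_2x^2\]

can capture curvature in \(x\), yet it still counts as linear regression because it is linear in the \(\beta\) coefficients.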

There are a few common methods to detect linearity:

  1. Scatterplot - We can use a simple scatterplot to evaluate the relationship between the target variable and the features. For projects with only one feature, we would only need one scatterplot. Since today’s project has multiple features, we’ll need one scatterplot for each feature.
  2. LOWESS Curve - This is the same idea as the scatterplot, but adds in a LOWESS curve to more robustly evaluate the linearity assumption.
  3. Residuals Plot - If the residuals plot shows curves or trends, it could mean that the linearity assumption is not satisfied.

Since we’re looking for a strictly linear relationship today, we can start with the same pairplot() function that we used in Part 4. Because we only care about the relationship between the target variable (FANTASY_POINTS) and each feature, we can make a slight modification to show only the target variable’s plots.

pairplot = sns.pairplot(data=player_data, x_vars=features, y_vars=target)

scatterplots of fantasy points versus each feature

A LOWESS (Locally Weighted Scatterplot Smoothing) curve is a type of local regression that fits a smooth curve through a scatter plot to visualize relationships between variables. It’s particularly useful for identifying non-linear patterns in data by fitting simple models to localized subsets. LOWESS curves are flexible and can capture complex relationships, making them a helpful visual aid for determining the linearity of each pairing.

We can use Seaborn’s regplot() function with the lowess=True parameter to add a LOWESS curve to each scatterplot:

def loess_reg(x, y, **kwargs):
    # Overlay a LOWESS-smoothed trend line (in red) on top of each scatterplot
    sns.regplot(x=x, y=y, lowess=True, line_kws={'color': 'red'}, **kwargs)


# Recreate the pairplot and map the LOWESS helper onto each subplot
pairplot = sns.pairplot(data=player_data, x_vars=features, y_vars=target)
pairplot.map(loess_reg)
<seaborn.axisgrid.PairGrid at 0x74cd86773280>

scatterplots of fantasy points versus each feature with LOWESS curves overlaid

No Multicollinearity

The next assumption that we’ll examine is the lack of multicollinearity. Multicollinearity occurs when the features in a regression model are highly correlated with each other. This can make it challenging to determine individual variable effects, lead to unstable coefficient estimates, and increase standard errors. As a result, some variables may appear statistically insignificant when they should be significant.

There are a few common ways to detect multicollinearity; we’ll look at the top two:

  1. Correlation Coefficient - We can check the correlation coefficient between each feature.
  2. Variance Inflation Factor - We can compute the variance inflation factor (VIF) for each feature, which quantifies how strongly that feature is explained by the other features (a short sketch is included at the end of this section).

To check the correlation coefficients, we can use the same method as in Part 3: a correlation matrix, which can be generated before training the model. As a reminder, a correlation matrix displays the correlation coefficients between all pairs of variables, with values ranging from -1 to 1. Strong correlations (absolute values above roughly 0.8 or 0.9) suggest potential multicollinearity.

plt.figure(figsize=(12, 10))
correlation_matrix = player_data[features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)

correlation matrix heatmap of the feature columns

From this chart, we can see that the variables with the strongest correlations are:

  • FIELD_GOALS_MADE and TWO_POINTS_MADE with a correlation coefficient of 0.86
  • FIELD_GOALS_MADE and POINTS with a correlation coefficient of 0.97

When you encounter variable pairings with strong correlations in your correlation matrix, there are several ways to deal with them:

  1. Remove one of the correlated variables - If two variables are highly correlated, they might provide similar information. Keeping both may lead to redundancy and potential issues with multicollinearity. We could choose to keep the variable that we believe is more important or relevant to the analysis.
  2. Combine correlated variables - We might create a new feature that combines the information from both correlated variables. This could be done through a variety of methods such as by creating a composite score.
  3. Use regularization techniques - Methods like Ridge Regression (L2 regularization) or Lasso Regression (L1 regularization) can help mitigate the effects of multicollinearity by adding a penalty term to the model that discourages large coefficients. We could also apply regularization to the features separately, before training the OLS model.
  4. Collect more data - Sometimes, multicollinearity can be reduced by increasing the sample size, which may help differentiate the effects of correlated variables. This is more situational and will not apply in all scenarios.

Since we’re interested in seeing how the model behaves with a simple application of OLS linear regression, we’ll proceed without removing, combining, or regularizing any of the features. However, we should keep this information in mind in the future since we might want to compare the results to a separate model that was trained without one or more of the highly correlated features.
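
For completeness, here’s a minimal sketch of the second detection method from the list above, the variance inflation factor. It assumes the statsmodels package is installed (it isn’t part of this series’ core dependencies); VIF values above roughly 5-10 are a common rule of thumb for problematic multicollinearity:

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# VIF regresses each feature on all of the other features;
# add_constant() adds an intercept term for those regressions
X = add_constant(player_data[features])
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif.drop('const').sort_values(ascending=False))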

Weak Exogeneity

Weak exogeneity is another crucial assumption in linear regression that we need to verify. In the context of linear regression, weak exogeneity means that the predictor variables (independent variables) are not correlated with the error term in the model. More formally, weak exogeneity assumes that the independent variables can be treated as fixed in value and are not influenced by the dependent variable or by factors that affect both the dependent variable and the predictors.

This assumption does not require the predictor variables to be free from all external influences, but rather that the predictors are not subject to endogeneity. Endogeneity could arise if, for example, the predictors are correlated with unobserved factors that also affect the dependent variable. Under weak exogeneity, we assume that such correlations do not exist.

In practical terms, weak exogeneity also implies that the predictors are not contaminated by measurement errors that are correlated with the error term in the regression. While weak exogeneity is often an idealization that may not always hold in real-world data, it is necessary for the standard ordinary least squares (OLS) estimation procedure to produce unbiased and consistent estimates.

It’s worth noting that dropping this assumption would require significantly more complex models (such as errors-in-variables models). These models account for measurement errors in the predictor variables but are generally more challenging to implement and interpret.

There is no direct statistical test for weak exogeneity, so we’ll treat this as more of a logical check than a mathematical one. For our basketball player statistics model, weak exogeneity would mean that the statistics we’re using as predictors (such as minutes played, field goals attempted, etc.) are not themselves influenced by unmeasured factors that also affect fantasy points. Based on our understanding of the domain, this makes rough logical sense and we’ll consider this assumption satisfied.

Wrap Up

In today’s article, we covered the basics of machine learning and learned how to select an appropriate machine learning model. In the next post, we’ll cover how to actually train the model.

Also, all of the code snippets in today’s guide are available in a Jupyter Notebook in the ncaa-basketball-stats repository on GitHub.

Articles in this Series

  1. Acquiring and Combining the Datasets
  2. Cleaning and Preprocessing the Data
  3. Engineering New Features
  4. Exploratory Data Analysis
  5. Visualizations, Charts, and Graphs
  6. Selecting a Machine Learning Model (Today’s Guide)
  7. Training the Machine Learning Model
  8. Evaluating the Machine Learning Model