Today we’ll explore how to derive new features from existing columns by calculating additional metrics and extracting textual data. This is the third part of a series that walks through the entire process of a data science project - from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, machine learning, and creating visualizations.
As a reminder, the dataset we’ll be using in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season. Here’s a brief description of each major step that we’ll go through for this project:
- Data Acquisition - This initial step involves obtaining data from two sources: (1) exporting the NCAA’s online individual player statistics report and (2) making API requests to the Yahoo Sports endpoint.
- Data Cleaning - This step focuses on identifying and correcting any errors within the dataset. This includes removing duplicates, correcting inaccuracies, and handling missing data.
- Data Preprocessing - This step ensures the data is suitable for analysis by converting datatypes, standardizing units, and replacing abbreviations.
- Feature Engineering - This step involves selecting and expanding upon the dataset’s features (or columns). This includes calculating additional metrics from existing columns.
- Machine Learning - This step focuses on training a machine learning model to identify the combination of individual player statistics that correlates with optimal performance.
- Creating Visualizations - This step involves identifying the relationships between various parameters (such as height and blocked shots) and generating meaningful visualizations (such as bar charts, scatterplots, and candlestick charts).
We’ll use Python along with the popular pandas and requests libraries to accomplish these tasks efficiently. By the end of this series, you’ll be equipped with the skills needed to gather raw data from online sources, structure it into a usable format, eliminate any inconsistencies and errors, create meaningful visualizations, and train a basic machine learning model. Since we already gathered the raw data from online sources in Part 1 and cleaned that data in Part 2, we’re ready to move on to feature engineering.
Getting Started
Since this is the third installment in the series, you likely already have your environment set up and can skip to the next section. If you're not already set up and want to follow along on your own machine, it's recommended to read the previous post, or at least its Getting Started section, before continuing. In summary, you'll want to have Python installed with the following packages:
In Part 2 of this series, we cleaned and preprocessed the values in our dataset, which is stored in a dataframe named `player_data`. If you want to follow along with the code examples in this article, it's recommended to import the `player_data` dataframe before proceeding.
import pandas as pd
import requests
import json
import os
import numpy as np
player_data = pd.read_excel('player_data_clean.xlsx')
Feature Engineering
In this section, we’ll create new features (columns) by extracting data and calculating additional metrics (derived column creation) from columns that already exist in our dataset. For example, we can calculate the per-game averages by using the number of total games played and another statistic (such as total three-point baskets made). By the end of this process, we should have all the columns we want when we start creating visualizations and training machine learning models.
Calculate Two-Point Basket Metrics
This dataset contains statistics for both field goals (the combination of two-point and three-point baskets) and three-point baskets specifically, so the first set of columns we can derive is two-point basket statistics.
# Calculate two-pointers made
player_data['TWO_POINTS_MADE'] = player_data['FIELD_GOALS_MADE'] - player_data['THREE_POINTS_MADE']
# Calculate two-point attempts
player_data['TWO_POINT_ATTEMPTS'] = player_data['FIELD_GOAL_ATTEMPTS'] - player_data['THREE_POINT_ATTEMPTS']
# Calculate two-point percentage
player_data['TWO_POINT_PERCENTAGE'] = (player_data['TWO_POINTS_MADE'] / player_data['TWO_POINT_ATTEMPTS']) * 100
player_data.dtypes
Unnamed: 0 int64
PLAYER_NAME object
Team object
Class object
Height int64
Position object
PLAYER_ID object
TEAM_NAME object
GAMES int64
MINUTES_PLAYED int64
FIELD_GOALS_MADE int64
FIELD_GOAL_ATTEMPTS int64
FIELD_GOAL_PERCENTAGE float64
THREE_POINTS_MADE int64
THREE_POINT_ATTEMPTS int64
THREE_POINT_PERCENTAGE float64
FREE_THROWS_MADE int64
FREE_THROW_ATTEMPTS int64
FREE_THROW_PERCENTAGE float64
OFFENSIVE_REBOUNDS int64
DEFENSIVE_REBOUNDS int64
TOTAL_REBOUNDS int64
ASSISTS int64
TURNOVERS int64
STEALS int64
BLOCKS int64
FOULS int64
POINTS int64
TWO_POINTS_MADE int64
TWO_POINT_ATTEMPTS int64
TWO_POINT_PERCENTAGE float64
dtype: object
Just like the three-point basket statistics, we can see that the two-pointers made and attempted are integers whereas the two-point percentage is stored as a float.
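One caveat with the percentage calculation: a player with zero two-point attempts would trigger a division by zero, which pandas silently turns into `NaN` or `inf` rather than raising an error. If your data could contain such rows, a defensive sketch like the following avoids the issue. The tiny dataframe here is purely illustrative; only the column names match the real `player_data`:

```python
import numpy as np
import pandas as pd

# Toy frame -- the real player_data has these same columns
df = pd.DataFrame({
    'TWO_POINTS_MADE': [100, 0],
    'TWO_POINT_ATTEMPTS': [200, 0],  # second player never attempted a two
})

# Turning zero attempts into NaN before dividing keeps the result NaN
# instead of inf, so it can't skew later summary statistics
attempts = df['TWO_POINT_ATTEMPTS'].replace(0, np.nan)
df['TWO_POINT_PERCENTAGE'] = df['TWO_POINTS_MADE'] / attempts * 100
```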
Extract Conference from Team Name
This dataset includes two fields for the team name: one that came from the player information dataset and the other that came from the player statistics dataset. Let’s look at a sample of each.
player_data[['TEAM_NAME', 'Team']].sample(10)
| | TEAM_NAME | Team |
|---|---|---|
| 727 | Old Dominion | Old Dominion (Sun Belt) |
| 856 | Missouri | Missouri (SEC) |
| 867 | LIU Brooklyn | LIU (NEC) |
| 94 | Arkansas | Arkansas St. (Sun Belt) |
| 715 | Rutgers | Rutgers (Big Ten) |
| 500 | S.F. Austin | SFA (WAC) |
| 676 | Gonzaga | Gonzaga (WCC) |
| 150 | Wisconsin | Wisconsin (Big Ten) |
| 666 | Holy Cross | Holy Cross (Patriot) |
| 427 | Miami (FL) | Notre Dame (ACC) |
So the TEAM_NAME column from the player statistics dataset contains strictly the name of the team, whereas the Team column from the player information dataset contains both the team name and the basketball conference in parentheses. If we only had the TEAM_NAME column and wanted to know the conference, we could pull in another data source to map team names to conferences. However, since the conference is already listed in the Team column, we can split this information out into a separate column as a feature.
There are multiple ways to do this, so let's start with the most obvious one. It looks like the conference name is enclosed in parentheses after the team name, so we could extract it by splitting the Team column on the opening parenthesis, keeping the text after it, and then splitting again on the closing parenthesis to keep the text before it.
player_data['Team'].str.split(r'\(', expand=True)[1].str.split(r'\)', expand=True)[0]
0 Mountain West
1 SEC
2 Big Sky
3 Mountain West
4 ACC
...
895 MAAC
896 AAC
897 SoCon
898 SWAC
899 ASUN
Name: 0, Length: 900, dtype: object
However, depending on the results of your previous sample, you might notice an issue with this approach: certain teams have two sets of parentheses instead of just one. Here are two examples.
player_data.loc[[125, 824], 'Team']
125 LMU (CA) (WCC)
824 Miami (FL) (ACC)
Name: Team, dtype: object
For these two examples, the extracted Conference name would be “CA” and “FL” instead of the proper “WCC” and “ACC”.
player_data['Team'].str.split(r'\(', expand=True)[1].str.split(r'\)', expand=True)[0].iloc[[125, 824]]
125 CA
824 FL
Name: 0, dtype: object
We could modify the previous code to select only the value in the right-most set of parentheses, but switching from the `.split()` method to a regular expression (a.k.a. "regex") is a more robust solution. Regular expressions provide more flexibility in pattern matching and are quite efficient, so let's build a pattern to select the proper value.
In this regular expression, we can use the following parts:

- `\(` matches an opening parenthesis. Parentheses in regular expressions are used for capturing groups, so the `\` escapes the parenthesis to match the literal character.
- `([^)]+)` is a capturing group that matches one or more characters that are not a closing parenthesis (the text inside the parentheses). The `^` at the start of the bracketed set negates it, so `[^)]` matches anything except a closing parenthesis, and the `+` quantifier means "one or more" of those characters. This ensures we capture the entire conference name instead of just the first letter.
- `\)` matches a closing parenthesis. Just like the opening parenthesis, this needs to be escaped.
- `$` anchors the match to the end of the string, so only the final pair of parentheses can match.
This pattern should capture the text within the last pair of parentheses at the end of the string. Let’s preview what this logic would extract as the conference name.
player_data['Team'].str.extract(r'\(([^)]+)\)$')
| | 0 |
|---|---|
| 0 | Mountain West |
| 1 | SEC |
| 2 | Big Sky |
| 3 | Mountain West |
| 4 | ACC |
| ... | ... |
| 895 | MAAC |
| 896 | AAC |
| 897 | SoCon |
| 898 | SWAC |
| 899 | ASUN |

900 rows × 1 columns
We can double-check that this regex pattern will pull from the proper set of parentheses for the values with multiple parentheses.
player_data['Team'].str.extract(r'\(([^)]+)\)$').iloc[[125, 824]]
| | 0 |
|---|---|
| 125 | WCC |
| 824 | ACC |
This is perfect, so let’s go ahead and create a new column for this data.
player_data['Conference'] = player_data['Team'].str.extract(r'\(([^)]+)\)$')
We can look at the unique values in this new column to verify that we have not extracted any incorrect data.
sorted(player_data['Conference'].unique())
['AAC',
'ACC',
'ASUN',
'America East',
'Atlantic 10',
'Big 12',
'Big East',
'Big Sky',
'Big South',
'Big Ten',
'Big West',
'CAA',
'CUSA',
'DI Independent',
'Horizon',
'Ivy League',
'MAAC',
'MAC',
'MEAC',
'MVC',
'Mountain West',
'NEC',
'OVC',
'Pac-12',
'Patriot',
'SEC',
'SWAC',
'SoCon',
'Southland',
'Summit League',
'Sun Belt',
'WAC',
'WCC']
These values closely match the list of NCAA conferences (with a few minor naming differences, such as "DI Independent" for independent schools), so this feature is complete. The rest of the features we'll generate today are straightforward calculations based on existing columns.
Calculate Per-Game Metrics
The next set of columns we can calculate are per-game metrics. Each metric can be divided by the total number of games in the season to get the per-game average of that metric. For example, to determine the average points-per-game, we can divide the total points by the total games played. We could do this for almost every one of the numeric columns in this dataset, but let’s focus on the number of minutes played, fouls, and the big five player statistics.
player_data['MINUTES_PER_GAME'] = player_data['MINUTES_PLAYED'] / player_data['GAMES']
player_data['FOULS_PER_GAME'] = player_data['FOULS'] / player_data['GAMES']
player_data['POINTS_PER_GAME'] = player_data['POINTS'] / player_data['GAMES']
player_data['ASSISTS_PER_GAME'] = player_data['ASSISTS'] / player_data['GAMES']
player_data['STEALS_PER_GAME'] = player_data['STEALS'] / player_data['GAMES']
player_data['BLOCKS_PER_GAME'] = player_data['BLOCKS'] / player_data['GAMES']
player_data['REBOUNDS_PER_GAME'] = player_data['TOTAL_REBOUNDS'] / player_data['GAMES']
player_data[['PLAYER_NAME', 'MINUTES_PER_GAME', 'FOULS_PER_GAME', 'POINTS_PER_GAME', 'ASSISTS_PER_GAME', 'STEALS_PER_GAME', 'BLOCKS_PER_GAME', 'REBOUNDS_PER_GAME']].sample(5)
| | PLAYER_NAME | MINUTES_PER_GAME | FOULS_PER_GAME | POINTS_PER_GAME | ASSISTS_PER_GAME | STEALS_PER_GAME | BLOCKS_PER_GAME | REBOUNDS_PER_GAME |
|---|---|---|---|---|---|---|---|---|
| 865 | Dena Jarrells | 27.625000 | 1.437500 | 10.625000 | 3.500000 | 0.687500 | 0.062500 | 2.562500 |
| 566 | Jasmine Gayles | 34.566667 | 1.800000 | 16.800000 | 2.466667 | 0.933333 | 0.033333 | 4.033333 |
| 172 | Elena Rodriguez | 27.173913 | 2.739130 | 10.652174 | 3.347826 | 1.173913 | 0.695652 | 6.695652 |
| 326 | Caitlin Staley | 13.689655 | 2.482759 | 3.310345 | 0.310345 | 0.310345 | 1.379310 | 2.517241 |
| 76 | Bria Sanders-Woods | 25.068966 | 2.482759 | 7.758621 | 4.206897 | 1.241379 | 0.206897 | 2.103448 |
These look great, so we can move on to the next feature.
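As an aside, those seven near-identical assignments could also be generated with a loop over a mapping of column names, which keeps things compact if you later decide to add more per-game metrics. This is just an equivalent restructuring, shown here on a toy dataframe that shares only its column names with `player_data`:

```python
import pandas as pd

# Toy frame -- same column names as player_data, made-up values
df = pd.DataFrame({
    'GAMES': [30, 28],
    'MINUTES_PLAYED': [900, 700],
    'POINTS': [450, 322],
})

# Map each season-total column to its per-game counterpart,
# then divide by games played in a single loop
per_game = {
    'MINUTES_PLAYED': 'MINUTES_PER_GAME',
    'POINTS': 'POINTS_PER_GAME',
}
for source, target in per_game.items():
    df[target] = df[source] / df['GAMES']
```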
Calculate Assist-to-Turnover Ratio
Another metric often used in basketball performance analysis is the assist-to-turnover ratio, so let’s add a column for that. This can be calculated by dividing the number of assists by the number of turnovers for each player.
player_data['ASSIST_TO_TURNOVER'] = player_data['ASSISTS'] / player_data['TURNOVERS']
player_data[['PLAYER_NAME', 'ASSISTS', 'TURNOVERS', 'ASSIST_TO_TURNOVER']].sample(5)
| | PLAYER_NAME | ASSISTS | TURNOVERS | ASSIST_TO_TURNOVER |
|---|---|---|---|---|
| 766 | Kylee Mabry | 35 | 61 | 0.573770 |
| 678 | Quay Miller | 42 | 72 | 0.583333 |
| 825 | MiLaysia Fulwiley | 70 | 54 | 1.296296 |
| 294 | JuJu Watkins | 96 | 120 | 0.800000 |
| 588 | Jasmine Shavers | 50 | 80 | 0.625000 |
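One edge case to be aware of: a player with zero turnovers would make this ratio infinite, since pandas returns `inf` for division by zero rather than raising an error. If that matters for your downstream analysis, one option is to convert those values to `NaN` afterwards. A sketch of that guard, using a toy dataframe with the same column names:

```python
import numpy as np
import pandas as pd

# Toy frame -- the real player_data has these same columns
df = pd.DataFrame({'ASSISTS': [70, 12], 'TURNOVERS': [54, 0]})

ratio = df['ASSISTS'] / df['TURNOVERS']
# A zero-turnover player yields inf; replace it with NaN so it
# can't skew summary statistics or model inputs later
df['ASSIST_TO_TURNOVER'] = ratio.replace(np.inf, np.nan)
```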
That’s all we need to do for the assist-to-turnover ratio, so let’s create the final feature.
Calculate Fantasy Points
The last column we'll calculate today is a metric called Fantasy Points. There are multiple ways to calculate this metric, but today we'll go with a weighted sum of the major metrics. Notice that, in this method, nearly all of the metrics contribute positively towards the Fantasy Points, but turnovers contribute negatively. This means players with high points, rebounds, assists, blocks, and steals are rewarded, whereas players with high turnovers are slightly penalized.
player_data['FANTASY_POINTS'] = (player_data['THREE_POINTS_MADE'] * 3) + \
(player_data['TWO_POINTS_MADE'] * 2) + \
(player_data['FREE_THROWS_MADE'] * 1) + \
(player_data['TOTAL_REBOUNDS'] * 1.2) + \
(player_data['ASSISTS'] * 1.5) + \
(player_data['BLOCKS'] * 2) + \
(player_data['STEALS'] * 2) + \
(player_data['TURNOVERS'] * -1)
player_data[['PLAYER_NAME', 'FANTASY_POINTS']].sample(5)
| | PLAYER_NAME | FANTASY_POINTS |
|---|---|---|
| 152 | Benthe Versteeg | 751.1 |
| 266 | Rayah Marshall | 803.0 |
| 501 | Lex Therien | 932.7 |
| 411 | Ashlee Locke | 395.0 |
| 145 | Last-Tear Poa | 393.3 |
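If you expect to experiment with the scoring scheme, the same weighted sum can be driven by a dictionary of weights, so adjusting the formula means editing a single entry. A sketch using the same weights as the expression above, on a one-row toy dataframe:

```python
import pandas as pd

# Toy frame with the stat columns used in the weighted sum
df = pd.DataFrame({
    'THREE_POINTS_MADE': [10], 'TWO_POINTS_MADE': [100],
    'FREE_THROWS_MADE': [50], 'TOTAL_REBOUNDS': [100],
    'ASSISTS': [40], 'BLOCKS': [5], 'STEALS': [20], 'TURNOVERS': [30],
})

# Same weights as the explicit expression above; tweak one entry
# to change the scoring scheme
weights = {
    'THREE_POINTS_MADE': 3, 'TWO_POINTS_MADE': 2, 'FREE_THROWS_MADE': 1,
    'TOTAL_REBOUNDS': 1.2, 'ASSISTS': 1.5, 'BLOCKS': 2, 'STEALS': 2,
    'TURNOVERS': -1,
}
df['FANTASY_POINTS'] = sum(df[col] * w for col, w in weights.items())
```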
This completes the feature engineering section of this project! Let’s take a look at our final set of columns and data types before we move on to data visualization and machine learning.
player_data.dtypes
Unnamed: 0 int64
PLAYER_NAME object
Team object
Class object
Height int64
Position object
PLAYER_ID object
TEAM_NAME object
GAMES int64
MINUTES_PLAYED int64
FIELD_GOALS_MADE int64
FIELD_GOAL_ATTEMPTS int64
FIELD_GOAL_PERCENTAGE float64
THREE_POINTS_MADE int64
THREE_POINT_ATTEMPTS int64
THREE_POINT_PERCENTAGE float64
FREE_THROWS_MADE int64
FREE_THROW_ATTEMPTS int64
FREE_THROW_PERCENTAGE float64
OFFENSIVE_REBOUNDS int64
DEFENSIVE_REBOUNDS int64
TOTAL_REBOUNDS int64
ASSISTS int64
TURNOVERS int64
STEALS int64
BLOCKS int64
FOULS int64
POINTS int64
TWO_POINTS_MADE int64
TWO_POINT_ATTEMPTS int64
TWO_POINT_PERCENTAGE float64
Conference object
MINUTES_PER_GAME float64
FOULS_PER_GAME float64
POINTS_PER_GAME float64
ASSISTS_PER_GAME float64
STEALS_PER_GAME float64
BLOCKS_PER_GAME float64
REBOUNDS_PER_GAME float64
ASSIST_TO_TURNOVER float64
FANTASY_POINTS float64
dtype: object
If you're going to use a new Jupyter notebook or Python script for the next part of this series, it's a good idea to export this final dataset. As a reminder, you can use the `.to_csv()` method instead of `.to_excel()` if you prefer.
player_data.to_excel('player_data_engineered.xlsx', index=False)
Wrap up
In this series, we’ve built a new dataset by acquiring and then combining the NCAA women’s basketball player information dataset with the Yahoo Sports player statistics dataset. We laid the groundwork for data analysis by cleaning and preprocessing the combined player data, and then expanded upon it by engineering a few new features. In the next part, we’ll take a closer look at the underlying data in each column and create visualizations to identify the relationship between various parameters. After that, we’ll move on to training a machine learning model.