Today we’ll explore how to derive new features from existing columns by calculating additional metrics and extracting textual data. This is the third part of a series that walks through the entire process of a data science project - from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, machine learning, and creating visualizations.

As a reminder, the dataset we’ll be using in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season. Here’s a brief description of each major step that we’ll go through for this project:

the steps for this data science project

  1. Data Acquisition - This initial step involves obtaining data from two sources: (1) exporting the NCAA’s online individual player statistics report and (2) making API requests to the Yahoo Sports endpoint.
  2. Data Cleaning - This step focuses on identifying and correcting any errors within the dataset. This includes removing duplicates, correcting inaccuracies, and handling missing data.
  3. Data Preprocessing - This step ensures the data is suitable for analysis by converting datatypes, standardizing units, and replacing abbreviations.
  4. Feature Engineering - This step involves selecting and expanding upon the dataset’s features (or columns). This includes calculating additional metrics from existing columns.
  5. Machine Learning - This step focuses on training a machine learning model to identify the combination of individual player statistics that correlates with optimal performance.
  6. Creating Visualizations - This step involves identifying the relationships between various parameters (such as height and blocked shots) and generating meaningful visualizations (such as bar charts, scatterplots, and candlestick charts).

We’ll use Python along with the popular pandas and requests libraries to accomplish these tasks efficiently. By the end of this series, you’ll be equipped with the skills needed to gather raw data from online sources, structure it into a usable format, eliminate any inconsistencies and errors, create meaningful visualizations, and train a basic machine learning model. Since we already gathered the raw data from online sources in Part 1 and cleaned that data in Part 2, we’re ready to move on to feature engineering.

Getting Started

Since this is the third installment in the series, you likely already have your environment setup and can skip to the next section. If you’re not already set up and you want to follow along on your own machine, it’s recommended to read the previous post or at least review the Getting Started section of that post before continuing. In summary, you’ll want to have Python installed with the following packages:

In Part 2 of this series, we cleaned and preprocessed the values in our dataset, which is stored in a dataframe named player_data. If you want to follow along with the code examples in this article, it’s recommended to import the player_data dataframe before proceeding.

import pandas as pd
import requests
import json
import os
import numpy as np


player_data = pd.read_excel('player_data_clean.xlsx')

Feature Engineering

In this section, we’ll create new features (columns) by extracting data and calculating additional metrics (derived column creation) from columns that already exist in our dataset. For example, we can calculate the per-game averages by using the number of total games played and another statistic (such as total three-point baskets made). By the end of this process, we should have all the columns we want when we start creating visualizations and training machine learning models.

Calculate Two-Point Basket Metrics

This dataset contains the statistics on both field goals (the combination of two-point and three-point baskets) and three-point baskets specifically, so the first set of columns we can derive are two-point basket statistics.

# Calculate two-pointers made
player_data['TWO_POINTS_MADE'] = player_data['FIELD_GOALS_MADE'] - player_data['THREE_POINTS_MADE']

# Calculate two-point attempts
player_data['TWO_POINT_ATTEMPTS'] = player_data['FIELD_GOAL_ATTEMPTS'] - player_data['THREE_POINT_ATTEMPTS']

# Calculate two-point percentage
player_data['TWO_POINT_PERCENTAGE'] = (player_data['TWO_POINTS_MADE'] / player_data['TWO_POINT_ATTEMPTS']) * 100

player_data.dtypes
Unnamed: 0                  int64
PLAYER_NAME                object
Team                       object
Class                      object
Height                      int64
Position                   object
PLAYER_ID                  object
TEAM_NAME                  object
GAMES                       int64
MINUTES_PLAYED              int64
FIELD_GOALS_MADE            int64
FIELD_GOAL_ATTEMPTS         int64
FIELD_GOAL_PERCENTAGE     float64
THREE_POINTS_MADE           int64
THREE_POINT_ATTEMPTS        int64
THREE_POINT_PERCENTAGE    float64
FREE_THROWS_MADE            int64
FREE_THROW_ATTEMPTS         int64
FREE_THROW_PERCENTAGE     float64
OFFENSIVE_REBOUNDS          int64
DEFENSIVE_REBOUNDS          int64
TOTAL_REBOUNDS              int64
ASSISTS                     int64
TURNOVERS                   int64
STEALS                      int64
BLOCKS                      int64
FOULS                       int64
POINTS                      int64
TWO_POINTS_MADE             int64
TWO_POINT_ATTEMPTS          int64
TWO_POINT_PERCENTAGE      float64
dtype: object

Just like the three-point basket statistics, we can see that the two-pointers made and attempted are integers whereas the two-point percentage is stored as a float.

Extract Conference from Team Name

This dataset includes two fields for the team name: one that came from the player information dataset and the other that came from the player statistics dataset. Let’s look at a sample of each.

player_data[['TEAM_NAME', 'Team']].sample(10)
TEAM_NAME Team
727 Old Dominion Old Dominion (Sun Belt)
856 Missouri Missouri (SEC)
867 LIU Brooklyn LIU (NEC)
94 Arkansas Arkansas St. (Sun Belt)
715 Rutgers Rutgers (Big Ten)
500 S.F. Austin SFA (WAC)
676 Gonzaga Gonzaga (WCC)
150 Wisconsin Wisconsin (Big Ten)
666 Holy Cross Holy Cross (Patriot)
427 Miami (FL) Notre Dame (ACC)

So the TEAM_NAME column from the player statistics dataset contains strictly the name of the team, but the Team column from the player information dataset contains both the team name and the basketball conference in parenthesis. If we only had the TEAM_NAME column and wanted to know the conference, then we could pull in another data source to map team names to conference names. However, since the conference is listed in the TEAM column, we can split this information into a separate column as a feature.

There are multiple ways to do this, so let’s start with the most obvious one. It looks like the conference name is enclosed in parentheses after the team name. We could extract the conference name by splitting the ‘Team’ column on the opening and closing parentheses and selecting the last element of the resulting list.

player_data['Team'].str.split('\(', expand=True)[1].str.split('\)', expand=True)[0]
0      Mountain West
1                SEC
2            Big Sky
3      Mountain West
4                ACC
           ...      
895             MAAC
896              AAC
897            SoCon
898             SWAC
899             ASUN
Name: 0, Length: 900, dtype: object

However, depending on the results of your previous sample, you might notice an issue with this. Certain teams have two sets of parenthesis instead of just one. Here are two examples.

player_data.loc[[125, 824], 'Team']
125       LMU (CA) (WCC)
824     Miami (FL) (ACC)
Name: Team, dtype: object

For these two examples, the extracted Conference name would be “CA” and “FL” instead of the proper “WCC” and “ACC”.

player_data['Team'].str.split('\(', expand=True)[1].str.split('\)', expand=True)[0].iloc[[125, 824]]
125    CA
824    FL
Name: 0, dtype: object

We could modify the previous code to select only the values in the right-most set of parenthesis, but switching from the .split() method to using a regular expression (a.k.a. “regex”) might be a more robust solution. Regular expressions provide more flexibility in pattern matching and are quite efficient, so let’s build a pattern to select the proper value.

In this regular expression, we can use the following parts:

  1. \( matches on an opening parenthesis. Parenthesis in regular expressions are used for capturing groups, so the \ escapes the parenthesis and matches the literal value.
  2. ([^)]+) matches matches one or more characters that are not a closing parentheses (any text inside the parentheses). The ^ symbol negates the set, so this pattern matches anything except a right parenthesis. The + quantifier means “one or more” of these characters. This ensures we capture the entire conference name instead of just the first letter.
  3. \) matches on a closing parenthesis. Just like the opening parenthesis, this needs to be escaped.
  4. $ searches for matches from the end of the string first instead of from the start of the string.

This pattern should capture the text within the last pair of parentheses at the end of the string. Let’s preview what this logic would extract as the conference name.

player_data['Team'].str.extract(r'\(([^)]+)\)$')
0
0 Mountain West
1 SEC
2 Big Sky
3 Mountain West
4 ACC
... ...
895 MAAC
896 AAC
897 SoCon
898 SWAC
899 ASUN

900 rows × 1 columns

We can double-check that this regex pattern will pull from the proper set of parentheses for the values with multiple parentheses.

player_data['Team'].str.extract(r'\(([^)]+)\)$').iloc[[125, 824]]
0
125 WCC
824 ACC

This is perfect, so let’s go ahead and create a new column for this data.

player_data['Conference'] = player_data['Team'].str.extract(r'\(([^)]+)\)$')

We can look at the unique values in this new column to verify that we have not extracted any incorrect data.

sorted(player_data['Conference'].unique())
['AAC',
 'ACC',
 'ASUN',
 'America East',
 'Atlantic 10',
 'Big 12',
 'Big East',
 'Big Sky',
 'Big South',
 'Big Ten',
 'Big West',
 'CAA',
 'CUSA',
 'DI Independent',
 'Horizon',
 'Ivy League',
 'MAAC',
 'MAC',
 'MEAC',
 'MVC',
 'Mountain West',
 'NEC',
 'OVC',
 'Pac-12',
 'Patriot',
 'SEC',
 'SWAC',
 'SoCon',
 'Southland',
 'Summit League',
 'Sun Belt',
 'WAC',
 'WCC']

These values closely match the list of conferences (with a few minor syntax differences such as “DI Independent” for independent schools) so this feature is complete. The rest of the features we’ll be generating today will be based on straightforward calculations using existing columns.

Calculate per-Game Metrics

The next set of columns we can calculate are per-game metrics. Each metric can be divided by the total number of games in the season to get the per-game average of that metric. For example, to determine the average points-per-game, we can divide the total points by the total games played. We could do this for almost every one of the numeric columns in this dataset, but let’s focus on the number of minutes played, fouls, and the big five player statistics.

player_data['MINUTES_PER_GAME'] = player_data['MINUTES_PLAYED'] / player_data['GAMES']
player_data['FOULS_PER_GAME'] = player_data['FOULS'] / player_data['GAMES']
player_data['POINTS_PER_GAME'] = player_data['POINTS'] / player_data['GAMES']
player_data['ASSISTS_PER_GAME'] = player_data['ASSISTS'] / player_data['GAMES']
player_data['STEALS_PER_GAME'] = player_data['STEALS'] / player_data['GAMES']
player_data['BLOCKS_PER_GAME'] = player_data['BLOCKS'] / player_data['GAMES']
player_data['REBOUNDS_PER_GAME'] = player_data['TOTAL_REBOUNDS'] / player_data['GAMES']

player_data[['PLAYER_NAME', 'MINUTES_PER_GAME', 'FOULS_PER_GAME', 'POINTS_PER_GAME', 'ASSISTS_PER_GAME', 'STEALS_PER_GAME', 'BLOCKS_PER_GAME', 'REBOUNDS_PER_GAME']].sample(5)
PLAYER_NAME MINUTES_PER_GAME FOULS_PER_GAME POINTS_PER_GAME ASSISTS_PER_GAME STEALS_PER_GAME BLOCKS_PER_GAME REBOUNDS_PER_GAME
865 Dena Jarrells 27.625000 1.437500 10.625000 3.500000 0.687500 0.062500 2.562500
566 Jasmine Gayles 34.566667 1.800000 16.800000 2.466667 0.933333 0.033333 4.033333
172 Elena Rodriguez 27.173913 2.739130 10.652174 3.347826 1.173913 0.695652 6.695652
326 Caitlin Staley 13.689655 2.482759 3.310345 0.310345 0.310345 1.379310 2.517241
76 Bria Sanders-Woods 25.068966 2.482759 7.758621 4.206897 1.241379 0.206897 2.103448

These look great, so we can move on to the next feature.

Calculate Assist-to-Turnover Ratio

Another metric often used in basketball performance analysis is the assist-to-turnover ratio, so let’s add a column for that. This can be calculated by dividing the number of assists by the number of turnovers for each player.

player_data['ASSIST_TO_TURNOVER'] = player_data['ASSISTS'] / player_data['TURNOVERS']

player_data[['PLAYER_NAME', 'ASSISTS', 'TURNOVERS', 'ASSIST_TO_TURNOVER']].sample(5)
PLAYER_NAME ASSISTS TURNOVERS ASSIST_TO_TURNOVER
766 Kylee Mabry 35 61 0.573770
678 Quay Miller 42 72 0.583333
825 MiLaysia Fulwiley 70 54 1.296296
294 JuJu Watkins 96 120 0.800000
588 Jasmine Shavers 50 80 0.625000

That’s all we need to do for the assist-to-turnover ratio, so let’s create the final feature.

Calculate Fantasy Points

The last column we’ll calculate today is a metric called Fantasy Points. There are multiple ways to calculate this metric, but today we’ll go with a weighted sum of the major metrics. Notice that, in this method, nearly all of the metrics contribute positively towards the Fantasy Points, but turnovers contributes negatively. This means players with higher points, rebounds, assists, blocks, and steals are rewarded, whereas players with high turnovers are slightly penalized.

player_data['FANTASY_POINTS'] = (player_data['THREE_POINTS_MADE'] * 3) + \
                                (player_data['TWO_POINTS_MADE'] * 2) + \
                                (player_data['FREE_THROWS_MADE'] * 1) + \
                                (player_data['TOTAL_REBOUNDS'] * 1.2) + \
                                (player_data['ASSISTS'] * 1.5) + \
                                (player_data['BLOCKS'] * 2) + \
                                (player_data['STEALS'] * 2) + \
                                (player_data['TURNOVERS'] * -1)

player_data[['PLAYER_NAME', 'FANTASY_POINTS']].sample(5)
PLAYER_NAME FANTASY_POINTS
152 Benthe Versteeg 751.1
266 Rayah Marshall 803.0
501 Lex Therien 932.7
411 Ashlee Locke 395.0
145 Last-Tear Poa 393.3

This completes the feature engineering section of this project! Let’s take a look at our final set of columns and data types before we move on to data visualization and machine learning.

player_data.dtypes
Unnamed: 0                  int64
PLAYER_NAME                object
Team                       object
Class                      object
Height                      int64
Position                   object
PLAYER_ID                  object
TEAM_NAME                  object
GAMES                       int64
MINUTES_PLAYED              int64
FIELD_GOALS_MADE            int64
FIELD_GOAL_ATTEMPTS         int64
FIELD_GOAL_PERCENTAGE     float64
THREE_POINTS_MADE           int64
THREE_POINT_ATTEMPTS        int64
THREE_POINT_PERCENTAGE    float64
FREE_THROWS_MADE            int64
FREE_THROW_ATTEMPTS         int64
FREE_THROW_PERCENTAGE     float64
OFFENSIVE_REBOUNDS          int64
DEFENSIVE_REBOUNDS          int64
TOTAL_REBOUNDS              int64
ASSISTS                     int64
TURNOVERS                   int64
STEALS                      int64
BLOCKS                      int64
FOULS                       int64
POINTS                      int64
TWO_POINTS_MADE             int64
TWO_POINT_ATTEMPTS          int64
TWO_POINT_PERCENTAGE      float64
Conference                 object
MINUTES_PER_GAME          float64
FOULS_PER_GAME            float64
POINTS_PER_GAME           float64
ASSISTS_PER_GAME          float64
STEALS_PER_GAME           float64
BLOCKS_PER_GAME           float64
REBOUNDS_PER_GAME         float64
ASSIST_TO_TURNOVER        float64
FANTASY_POINTS            float64
dtype: object

If you’re going to use a new Jupyter notebook / Python script for the next part of this series, then it’s a good idea to export this final dataset. As a reminder, you can use the to_csv() method instead of .to_excel() if you prefer.

player_data.to_excel('player_data_engineered.xlsx', index=False)

Wrap up

In this series, we’ve built a new dataset by acquiring and then combining the NCAA women’s basketball player information dataset with the Yahoo Sports player statistics dataset. We laid the groundwork for data analysis by cleaning and preprocessing the combined player data, and then expanded upon it by engineering a few new features. In the next part, we’ll take a closer look at the underlying data in each column and create visualizations to identify the relationship between various parameters. After that, we’ll move on to training a machine learning model.