Today we’ll explore how to derive new features from existing columns by calculating additional metrics and extracting textual data. This is the third part of a series that walks through the entire process of a data science project - from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, machine learning, and creating visualizations.
As a reminder, the dataset we’ll be using in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season. Here’s a brief description of each major step that we’ll go through for this project:
- Data Acquisition - This initial step involves obtaining data from two sources: (1) exporting the NCAA’s online individual player statistics report and (2) making API requests to the Yahoo Sports endpoint.
- Data Cleaning - This step focuses on identifying and correcting any errors within the dataset. This includes removing duplicates, correcting inaccuracies, and handling missing data.
- Data Preprocessing - This step ensures the data is suitable for analysis by converting datatypes, standardizing units, and replacing abbreviations.
- Feature Engineering - This step involves selecting and expanding upon the dataset’s features (or columns). This includes calculating additional metrics from existing columns.
- Machine Learning - This step focuses on training a machine learning model to identify the combination of individual player statistics that correlates with optimal performance.
- Creating Visualizations - This step involves identifying the relationships between various parameters (such as height and blocked shots) and generating meaningful visualizations (such as bar charts, scatterplots, and candlestick charts).
We’ll use Python along with the popular pandas and requests libraries to accomplish these tasks efficiently. By the end of this series, you’ll be equipped with the skills needed to gather raw data from online sources, structure it into a usable format, eliminate any inconsistencies and errors, create meaningful visualizations, and train a basic machine learning model. Since we already gathered the raw data from online sources in Part 1 and cleaned that data in Part 2, we’re ready to move on to feature engineering.
Getting Started
Since this is the third installment in the series, you likely already have your environment set up and can skip to the next section. If you're not already set up and want to follow along on your own machine, it's recommended to read the previous post, or at least its Getting Started section, before continuing. In summary, you'll want to have Python installed with the following packages:
In Part 2 of this series, we cleaned and preprocessed the values in our dataset, which is stored in a dataframe named `player_data`. If you want to follow along with the code examples in this article, it's recommended to import the `player_data` dataframe before proceeding.
import pandas as pd
import requests
import json
import os
import numpy as np
player_data = pd.read_excel('player_data_clean.xlsx')
Feature Engineering
In this section, we’ll create new features (columns) by extracting data and calculating additional metrics (derived column creation) from columns that already exist in our dataset. For example, we can calculate the per-game averages by using the number of total games played and another statistic (such as total three-point baskets made). By the end of this process, we should have all the columns we want when we start creating visualizations and training machine learning models.
Calculate Two-Point Basket Metrics
This dataset contains statistics for both field goals (the combination of two-point and three-point baskets) and three-point baskets specifically, so the first set of columns we can derive is two-point basket statistics.
# Calculate two-pointers made
player_data['TWO_POINTS_MADE'] = player_data['FIELD_GOALS_MADE'] - player_data['THREE_POINTS_MADE']
# Calculate two-point attempts
player_data['TWO_POINT_ATTEMPTS'] = player_data['FIELD_GOAL_ATTEMPTS'] - player_data['THREE_POINT_ATTEMPTS']
# Calculate two-point percentage
player_data['TWO_POINT_PERCENTAGE'] = (player_data['TWO_POINTS_MADE'] / player_data['TWO_POINT_ATTEMPTS']) * 100
player_data.dtypes
Unnamed: 0 int64
PLAYER_NAME object
Team object
Class object
Height int64
Position object
PLAYER_ID object
TEAM_NAME object
GAMES int64
MINUTES_PLAYED int64
FIELD_GOALS_MADE int64
FIELD_GOAL_ATTEMPTS int64
FIELD_GOAL_PERCENTAGE float64
THREE_POINTS_MADE int64
THREE_POINT_ATTEMPTS int64
THREE_POINT_PERCENTAGE float64
FREE_THROWS_MADE int64
FREE_THROW_ATTEMPTS int64
FREE_THROW_PERCENTAGE float64
OFFENSIVE_REBOUNDS int64
DEFENSIVE_REBOUNDS int64
TOTAL_REBOUNDS int64
ASSISTS int64
TURNOVERS int64
STEALS int64
BLOCKS int64
FOULS int64
POINTS int64
TWO_POINTS_MADE int64
TWO_POINT_ATTEMPTS int64
TWO_POINT_PERCENTAGE float64
dtype: object
Just like the three-point basket statistics, we can see that the two-pointers made and attempted are integers whereas the two-point percentage is stored as a float.
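One caveat with the percentage calculation: a player with zero two-point attempts would trigger a division by zero, which pandas silently turns into `NaN` or `inf` rather than raising an error. If your data could contain such rows, a defensive sketch like the following avoids the issue. The tiny dataframe here is purely illustrative; only the column names match the real `player_data`:

```python
import numpy as np
import pandas as pd

# Toy frame -- the real player_data has these same columns
df = pd.DataFrame({
    'TWO_POINTS_MADE': [100, 0],
    'TWO_POINT_ATTEMPTS': [200, 0],  # second player never attempted a two
})

# Turning zero attempts into NaN before dividing keeps the result NaN
# instead of inf, so it can't skew later summary statistics
attempts = df['TWO_POINT_ATTEMPTS'].replace(0, np.nan)
df['TWO_POINT_PERCENTAGE'] = df['TWO_POINTS_MADE'] / attempts * 100
```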
Extract Conference from Team Name
This dataset includes two fields for the team name: one that came from the player information dataset and the other that came from the player statistics dataset. Let’s look at a sample of each.
player_data[['TEAM_NAME', 'Team']].sample(10)
| | TEAM_NAME | Team |
|---|---|---|
| 727 | Old Dominion | Old Dominion (Sun Belt) |
| 856 | Missouri | Missouri (SEC) |
| 867 | LIU Brooklyn | LIU (NEC) |
| 94 | Arkansas | Arkansas St. (Sun Belt) |
| 715 | Rutgers | Rutgers (Big Ten) |
| 500 | S.F. Austin | SFA (WAC) |
| 676 | Gonzaga | Gonzaga (WCC) |
| 150 | Wisconsin | Wisconsin (Big Ten) |
| 666 | Holy Cross | Holy Cross (Patriot) |
| 427 | Miami (FL) | Notre Dame (ACC) |
So the TEAM_NAME column from the player statistics dataset contains strictly the name of the team, whereas the Team column from the player information dataset contains both the team name and the basketball conference in parentheses. If we only had the TEAM_NAME column and wanted to know the conference, we could pull in another data source to map team names to conferences. However, since the conference is already listed in the Team column, we can split this information out into a separate column as a feature.
There are multiple ways to do this, so let's start with the most obvious one. It looks like the conference name is enclosed in parentheses after the team name, so we could extract it by splitting the Team column on the opening parenthesis, keeping the text after it, and then splitting again on the closing parenthesis to keep the text before it.
player_data['Team'].str.split(r'\(', expand=True)[1].str.split(r'\)', expand=True)[0]
0 Mountain West
1 SEC
2 Big Sky
3 Mountain West
4 ACC
...
895 MAAC
896 AAC
897 SoCon
898 SWAC
899 ASUN
Name: 0, Length: 900, dtype: object
However, depending on the results of your previous sample, you might notice an issue with this approach: certain teams have two sets of parentheses instead of just one. Here are two examples.
player_data.loc[[125, 824], 'Team']
125 LMU (CA) (WCC)
824 Miami (FL) (ACC)
Name: Team, dtype: object
For these two examples, the extracted Conference name would be “CA” and “FL” instead of the proper “WCC” and “ACC”.
player_data['Team'].str.split(r'\(', expand=True)[1].str.split(r'\)', expand=True)[0].iloc[[125, 824]]
125 CA
824 FL
Name: 0, dtype: object
We could modify the previous code to select only the value in the right-most set of parentheses, but switching from the `.split()` method to a regular expression (a.k.a. "regex") is a more robust solution. Regular expressions provide more flexibility in pattern matching and are quite efficient, so let's build a pattern to select the proper value.
In this regular expression, we can use the following parts:

- `\(` matches an opening parenthesis. Parentheses in regular expressions are used for capturing groups, so the `\` escapes the parenthesis to match the literal character.
- `([^)]+)` is a capturing group that matches one or more characters that are not a closing parenthesis (the text inside the parentheses). The `^` at the start of the bracketed set negates it, so `[^)]` matches anything except a closing parenthesis, and the `+` quantifier means "one or more" of those characters. This ensures we capture the entire conference name instead of just the first letter.
- `\)` matches a closing parenthesis. Just like the opening parenthesis, this needs to be escaped.
- `$` anchors the match to the end of the string, so only the final pair of parentheses can match.
This pattern should capture the text within the last pair of parentheses at the end of the string. Let’s preview what this logic would extract as the conference name.
player_data['Team'].str.extract(r'\(([^)]+)\)$')
| | 0 |
|---|---|
| 0 | Mountain West |
| 1 | SEC |
| 2 | Big Sky |
| 3 | Mountain West |
| 4 | ACC |
| ... | ... |
| 895 | MAAC |
| 896 | AAC |
| 897 | SoCon |
| 898 | SWAC |
| 899 | ASUN |

900 rows × 1 columns
We can double-check that this regex pattern will pull from the proper set of parentheses for the values with multiple parentheses.
player_data['Team'].str.extract(r'\(([^)]+)\)$').iloc[[125, 824]]
| | 0 |
|---|---|
| 125 | WCC |
| 824 | ACC |
This is perfect, so let’s go ahead and create a new column for this data.
player_data['Conference'] = player_data['Team'].str.extract(r'\(([^)]+)\)$')
We can look at the unique values in this new column to verify that we have not extracted any incorrect data.
sorted(player_data['Conference'].unique())
['AAC',
'ACC',
'ASUN',
'America East',
'Atlantic 10',
'Big 12',
'Big East',
'Big Sky',
'Big South',
'Big Ten',
'Big West',
'CAA',
'CUSA',
'DI Independent',
'Horizon',
'Ivy League',
'MAAC',
'MAC',
'MEAC',
'MVC',
'Mountain West',
'NEC',
'OVC',
'Pac-12',
'Patriot',
'SEC',
'SWAC',
'SoCon',
'Southland',
'Summit League',
'Sun Belt',
'WAC',
'WCC']
These values closely match the list of NCAA conferences (with a few minor naming differences, such as "DI Independent" for independent schools), so this feature is complete. The rest of the features we'll generate today are straightforward calculations based on existing columns.
Calculate Per-Game Metrics
The next set of columns we can calculate are per-game metrics. Each metric can be divided by the total number of games in the season to get the per-game average of that metric. For example, to determine the average points-per-game, we can divide the total points by the total games played. We could do this for almost every one of the numeric columns in this dataset, but let’s focus on the number of minutes played, fouls, and the big five player statistics.
player_data['MINUTES_PER_GAME'] = player_data['MINUTES_PLAYED'] / player_data['GAMES']
player_data['FOULS_PER_GAME'] = player_data['FOULS'] / player_data['GAMES']
player_data['POINTS_PER_GAME'] = player_data['POINTS'] / player_data['GAMES']
player_data['ASSISTS_PER_GAME'] = player_data['ASSISTS'] / player_data['GAMES']
player_data['STEALS_PER_GAME'] = player_data['STEALS'] / player_data['GAMES']
player_data['BLOCKS_PER_GAME'] = player_data['BLOCKS'] / player_data['GAMES']
player_data['REBOUNDS_PER_GAME'] = player_data['TOTAL_REBOUNDS'] / player_data['GAMES']
player_data[['PLAYER_NAME', 'MINUTES_PER_GAME', 'FOULS_PER_GAME', 'POINTS_PER_GAME', 'ASSISTS_PER_GAME', 'STEALS_PER_GAME', 'BLOCKS_PER_GAME', 'REBOUNDS_PER_GAME']].sample(5)
| | PLAYER_NAME | MINUTES_PER_GAME | FOULS_PER_GAME | POINTS_PER_GAME | ASSISTS_PER_GAME | STEALS_PER_GAME | BLOCKS_PER_GAME | REBOUNDS_PER_GAME |
|---|---|---|---|---|---|---|---|---|
| 865 | Dena Jarrells | 27.625000 | 1.437500 | 10.625000 | 3.500000 | 0.687500 | 0.062500 | 2.562500 |
| 566 | Jasmine Gayles | 34.566667 | 1.800000 | 16.800000 | 2.466667 | 0.933333 | 0.033333 | 4.033333 |
| 172 | Elena Rodriguez | 27.173913 | 2.739130 | 10.652174 | 3.347826 | 1.173913 | 0.695652 | 6.695652 |
| 326 | Caitlin Staley | 13.689655 | 2.482759 | 3.310345 | 0.310345 | 0.310345 | 1.379310 | 2.517241 |
| 76 | Bria Sanders-Woods | 25.068966 | 2.482759 | 7.758621 | 4.206897 | 1.241379 | 0.206897 | 2.103448 |
These look great, so we can move on to the next feature.
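As an aside, those seven near-identical assignments could also be generated with a loop over a mapping of column names, which keeps things compact if you later decide to add more per-game metrics. This is just an equivalent restructuring, shown here on a toy dataframe that shares only its column names with `player_data`:

```python
import pandas as pd

# Toy frame -- same column names as player_data, made-up values
df = pd.DataFrame({
    'GAMES': [30, 28],
    'MINUTES_PLAYED': [900, 700],
    'POINTS': [450, 322],
})

# Map each season-total column to its per-game counterpart,
# then divide by games played in a single loop
per_game = {
    'MINUTES_PLAYED': 'MINUTES_PER_GAME',
    'POINTS': 'POINTS_PER_GAME',
}
for source, target in per_game.items():
    df[target] = df[source] / df['GAMES']
```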
Calculate Assist-to-Turnover Ratio
Another metric often used in basketball performance analysis is the assist-to-turnover ratio, so let’s add a column for that. This can be calculated by dividing the number of assists by the number of turnovers for each player.
player_data['ASSIST_TO_TURNOVER'] = player_data['ASSISTS'] / player_data['TURNOVERS']
player_data[['PLAYER_NAME', 'ASSISTS', 'TURNOVERS', 'ASSIST_TO_TURNOVER']].sample(5)
| | PLAYER_NAME | ASSISTS | TURNOVERS | ASSIST_TO_TURNOVER |
|---|---|---|---|---|
| 766 | Kylee Mabry | 35 | 61 | 0.573770 |
| 678 | Quay Miller | 42 | 72 | 0.583333 |
| 825 | MiLaysia Fulwiley | 70 | 54 | 1.296296 |
| 294 | JuJu Watkins | 96 | 120 | 0.800000 |
| 588 | Jasmine Shavers | 50 | 80 | 0.625000 |
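One edge case to be aware of: a player with zero turnovers would make this ratio infinite, since pandas returns `inf` for division by zero rather than raising an error. If that matters for your downstream analysis, one option is to convert those values to `NaN` afterwards. A sketch of that guard, using a toy dataframe with the same column names:

```python
import numpy as np
import pandas as pd

# Toy frame -- the real player_data has these same columns
df = pd.DataFrame({'ASSISTS': [70, 12], 'TURNOVERS': [54, 0]})

ratio = df['ASSISTS'] / df['TURNOVERS']
# A zero-turnover player yields inf; replace it with NaN so it
# can't skew summary statistics or model inputs later
df['ASSIST_TO_TURNOVER'] = ratio.replace(np.inf, np.nan)
```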
That’s all we need to do for the assist-to-turnover ratio, so let’s create the final feature.
Calculate Fantasy Points
The last column we'll calculate today is a metric called Fantasy Points. There are multiple ways to calculate this metric, but today we'll go with a weighted sum of the major metrics. Notice that, in this method, nearly all of the metrics contribute positively towards the Fantasy Points, but turnovers contribute negatively. This means players with high points, rebounds, assists, blocks, and steals are rewarded, whereas players with high turnovers are slightly penalized.
player_data['FANTASY_POINTS'] = (player_data['THREE_POINTS_MADE'] * 3) + \
(player_data['TWO_POINTS_MADE'] * 2) + \
(player_data['FREE_THROWS_MADE'] * 1) + \
(player_data['TOTAL_REBOUNDS'] * 1.2) + \
(player_data['ASSISTS'] * 1.5) + \
(player_data['BLOCKS'] * 2) + \
(player_data['STEALS'] * 2) + \
(player_data['TURNOVERS'] * -1)
player_data[['PLAYER_NAME', 'FANTASY_POINTS']].sample(5)
| | PLAYER_NAME | FANTASY_POINTS |
|---|---|---|
| 152 | Benthe Versteeg | 751.1 |
| 266 | Rayah Marshall | 803.0 |
| 501 | Lex Therien | 932.7 |
| 411 | Ashlee Locke | 395.0 |
| 145 | Last-Tear Poa | 393.3 |
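If you expect to experiment with the scoring scheme, the same weighted sum can be driven by a dictionary of weights, so adjusting the formula means editing a single entry. A sketch using the same weights as the expression above, on a one-row toy dataframe:

```python
import pandas as pd

# Toy frame with the stat columns used in the weighted sum
df = pd.DataFrame({
    'THREE_POINTS_MADE': [10], 'TWO_POINTS_MADE': [100],
    'FREE_THROWS_MADE': [50], 'TOTAL_REBOUNDS': [100],
    'ASSISTS': [40], 'BLOCKS': [5], 'STEALS': [20], 'TURNOVERS': [30],
})

# Same weights as the explicit expression above; tweak one entry
# to change the scoring scheme
weights = {
    'THREE_POINTS_MADE': 3, 'TWO_POINTS_MADE': 2, 'FREE_THROWS_MADE': 1,
    'TOTAL_REBOUNDS': 1.2, 'ASSISTS': 1.5, 'BLOCKS': 2, 'STEALS': 2,
    'TURNOVERS': -1,
}
df['FANTASY_POINTS'] = sum(df[col] * w for col, w in weights.items())
```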
This completes the feature engineering section of this project! Let’s take a look at our final set of columns and data types before we move on to data visualization and machine learning.
player_data.dtypes
Unnamed: 0 int64
PLAYER_NAME object
Team object
Class object
Height int64
Position object
PLAYER_ID object
TEAM_NAME object
GAMES int64
MINUTES_PLAYED int64
FIELD_GOALS_MADE int64
FIELD_GOAL_ATTEMPTS int64
FIELD_GOAL_PERCENTAGE float64
THREE_POINTS_MADE int64
THREE_POINT_ATTEMPTS int64
THREE_POINT_PERCENTAGE float64
FREE_THROWS_MADE int64
FREE_THROW_ATTEMPTS int64
FREE_THROW_PERCENTAGE float64
OFFENSIVE_REBOUNDS int64
DEFENSIVE_REBOUNDS int64
TOTAL_REBOUNDS int64
ASSISTS int64
TURNOVERS int64
STEALS int64
BLOCKS int64
FOULS int64
POINTS int64
TWO_POINTS_MADE int64
TWO_POINT_ATTEMPTS int64
TWO_POINT_PERCENTAGE float64
Conference object
MINUTES_PER_GAME float64
FOULS_PER_GAME float64
POINTS_PER_GAME float64
ASSISTS_PER_GAME float64
STEALS_PER_GAME float64
BLOCKS_PER_GAME float64
REBOUNDS_PER_GAME float64
ASSIST_TO_TURNOVER float64
FANTASY_POINTS float64
dtype: object
If you're going to use a new Jupyter notebook or Python script for the next part of this series, it's a good idea to export this final dataset. As a reminder, you can use the `.to_csv()` method instead of `.to_excel()` if you prefer.
player_data.to_excel('player_data_engineered.xlsx', index=False)
Wrap up
In this series, we’ve built a new dataset by acquiring and then combining the NCAA women’s basketball player information dataset with the Yahoo Sports player statistics dataset. We laid the groundwork for data analysis by cleaning and preprocessing the combined player data, and then expanded upon it by engineering a few new features. In the next part, we’ll take a closer look at the underlying data in each column and create visualizations to identify the relationship between various parameters. After that, we’ll move on to training a machine learning model.