Today we’ll walk through how to clean and preprocess a dataset to ensure it is ready for analysis. This is the second part of a series that walks through the entire process of a data science project - from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, machine learning, and creating visualizations.
As a reminder, the dataset we’ll be using in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season. Here’s a brief description of each major step that we’ll go through for this project:
- Data Acquisition - This initial step involves obtaining data from two sources: (1) exporting the NCAA’s online individual player statistics report and (2) making API requests to the Yahoo Sports endpoint.
- Data Cleaning - This step focuses on identifying and correcting any errors within the dataset. This includes removing duplicates, correcting inaccuracies, and handling missing data.
- Data Preprocessing - This step ensures the data is suitable for analysis by converting datatypes, standardizing units, and replacing abbreviations.
- Feature Engineering - This step involves selecting and expanding upon the dataset’s features (or columns). This includes calculating additional metrics from existing columns.
- Machine Learning - This step focuses on training a machine learning model to identify the combination of individual player statistics that correlates with optimal performance.
- Creating Visualizations - This step involves identifying the relationships between various parameters (such as height and blocked shots) and generating meaningful visualizations (such as bar charts, scatterplots, and candlestick charts).
We’ll use Python along with the popular pandas and requests libraries to accomplish these tasks efficiently. By the end of this series, you’ll be equipped with the skills needed to gather raw data from online sources, structure it into a usable format, eliminate any inconsistencies and errors, create meaningful visualizations, and train a basic machine learning model. Since we already gathered the raw data from online sources in the last part, let’s move on to the data cleaning and preprocessing steps.
Getting Started
Since this is the second installment in the series, you likely already have your environment set up and can skip to the next section. If you're not already set up and you want to follow along on your own machine, it's recommended to read the previous post, or at least review the Getting Started section of that post, before continuing. In summary, you'll want to have Python installed along with the packages used in this series, such as pandas (plus openpyxl, which pandas uses under the hood to read and write Excel files).
In Part 1 of this series, we acquired two datasets and combined them into one final dataset, stored in a dataframe named player_data. If you want to follow along with the code examples in this article, it's recommended to import the player_data dataframe before proceeding.
import pandas as pd

player_data = pd.read_excel('player_data_raw.xlsx')
Data Cleaning
Before we can analyze the dataset, we need to ensure it is clean and reliable. In this section, we’ll address issues like missing values, incorrect entries, and inconsistencies. This saves you the headache of training a model with unintended values or creating graphs without the full dataset.
Handle Missing Values
Missing values (such as None and NaN) can significantly impact the accuracy and validity of statistical analyses and visualizations, so we want to identify and handle (remove, impute, or accept) each instance of missing and empty values in the dataset. There are several ways to handle missing values, but let's take a look at the most common methods:
- Correction - In limited circumstances, the missing values may be due to an import error or available in an alternate data source that can supplement your original dataset. This allows you to make corrections to the original dataset and eliminate missing values.
- Imputation - Imputation involves replacing missing values with estimated or predicted values based on other available information in the dataset. Common imputation techniques include mean, median, mode imputation, or more advanced methods such as regression imputation or k-nearest neighbors (KNN) imputation.
- Deletion - Deleting rows or columns with missing values is a straightforward approach, but it should be used carefully as it can lead to the loss of valuable information. Row deletion (also known as listwise deletion) removes entire observations with missing values, while column deletion (variable-wise deletion) removes entire variables with missing values.
- Advanced Techniques - Advanced techniques such as multiple imputation, which generates multiple imputed datasets and combines the results, or sophisticated machine learning algorithms designed to handle missing data directly, offer more robust solutions for handling missing values in complex datasets.
How you handle missing values will depend on a variety of factors (the nature of the missing data, the requirements and objectives of your project, etc.) and should be evaluated on a case-by-case basis. For today, we’ll go through each row with missing values and determine the best way to handle them one at a time.
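Before we do, here's a minimal, self-contained sketch of what the imputation and deletion approaches look like in pandas. This is illustrative only (using toy values, not our dataset) since we won't be applying blanket rules like these today:

import pandas as pd

# Toy dataframe with one missing numeric value and one missing category
df = pd.DataFrame({'points': [398, None, 519], 'position': ['G', 'F', None]})

# Imputation: fill the missing numeric value with the column mean
df['points'] = df['points'].fillna(df['points'].mean())

# Deletion (listwise): drop any rows that still contain a missing value
df = df.dropna()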
Identify Missing Values
Let's begin by looking at any rows that contain at least one missing value (represented as None and NaN in Python). We'll use the pandas isna() method for this.
player_data[player_data.isna().any(axis=1)]
PLAYER_NAME | Team | Class | Height | Position | PLAYER_ID | TEAM_NAME | GAMES | MINUTES_PLAYED | FIELD_GOALS_MADE | ... | FREE_THROW_PERCENTAGE | OFFENSIVE_REBOUNDS | DEFENSIVE_REBOUNDS | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
148 | Ally Becki | Ball St. (MAC) | Jr. | 5-8 | NaN | ncaaw.p.66590 | Ball St. | 31 | 972 | 143 | ... | 75.4 | 21 | 122 | 143 | 148 | 108 | 60 | 11 | 69 | 398 |
245 | Caitlin Weimar | Boston U. (Patriot) | Sr. | 6-4 | F | ncaaw.p.64464 | N.C. State | 28 | 987 | 199 | ... | 68.4 | 76 | 219 | 295 | 54 | 69 | 37 | 80 | 74 | 519 |
250 | Abby Muse | Boise St. (Mountain West) | Sr. | 6-2 | F | ncaaw.p.64516 | Boise St. | 31 | 785 | 90 | ... | 54.3 | 74 | 191 | 265 | 34 | 66 | 36 | 87 | 72 | 230 |
254 | Emily Bowman | Samford (SoCon) | So. | 6-5 | C | ncaaw.p.64719 | Samford | 30 | 703 | 89 | ... | 53.1 | 83 | 172 | 255 | 17 | 55 | 5 | 74 | 88 | 238 |
257 | Christina Dalce | Villanova (Big East) | Jr. | 6-2 | F | ncaaw.p.67708 | Villanova | 30 | 833 | 108 | ... | 54.9 | 145 | 146 | 291 | 30 | 52 | 25 | 70 | 89 | 255 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
726 | Kseniia Kozlova | James Madison (Sun Belt) | Sr. | 6-3 | C | ncaaw.p.64131 | James Madison | 24 | 526 | 118 | ... | 57.1 | 74 | 97 | 171 | 19 | 51 | 8 | 8 | 55 | 284 |
727 | Simone Cunningham | Old Dominion (Sun Belt) | Jr. | 5-10 | G | ncaaw.p.113163 | Old Dominion | 30 | 591 | 84 | ... | 56.5 | 87 | 123 | 210 | 12 | 60 | 20 | 21 | 93 | 207 |
729 | Otaifo Esenabhalu | Longwood (Big South) | Fr. | 6-2 | F | ncaaw.p.113170 | Longwood | 30 | 547 | 60 | ... | 60.0 | 79 | 128 | 207 | 10 | 59 | 23 | 13 | 78 | 147 |
755 | Sedayjha Payne | Morgan St. (MEAC) | Sr. | 5-8 | G | ncaaw.p.113276 | Morgan St. | 29 | 634 | 80 | ... | 52.0 | 66 | 73 | 139 | 24 | 53 | 67 | 7 | 58 | 186 |
843 | Madelyn Bischoff | Ball St. (MAC) | Jr. | 5-9 | NaN | ncaaw.p.66600 | Ball St. | 30 | 881 | 112 | ... | 86.6 | 11 | 60 | 71 | 37 | 36 | 20 | 6 | 38 | 359 |
63 rows × 27 columns
There are quite a few rows and columns in this output, so let’s find out which columns contain missing values.
player_data.columns[player_data.isna().any()].tolist()
['Position', 'THREE_POINT_PERCENTAGE']
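As an aside, counting the missing values per column gives the same overview in a single line:

# Count missing values per column and show only the non-zero counts
missing_counts = player_data.isna().sum()
print(missing_counts[missing_counts > 0])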
Now that we know which columns to focus on, let’s look at just those two columns with the player name and team name.
player_data[player_data.isna().any(axis=1)][['PLAYER_NAME', 'Team', 'Position', 'THREE_POINT_PERCENTAGE']]
PLAYER_NAME | Team | Position | THREE_POINT_PERCENTAGE | |
---|---|---|---|---|
148 | Ally Becki | Ball St. (MAC) | NaN | 34.8 |
245 | Caitlin Weimar | Boston U. (Patriot) | F | None |
250 | Abby Muse | Boise St. (Mountain West) | F | None |
254 | Emily Bowman | Samford (SoCon) | C | None |
257 | Christina Dalce | Villanova (Big East) | F | None |
... | ... | ... | ... | ... |
726 | Kseniia Kozlova | James Madison (Sun Belt) | C | None |
727 | Simone Cunningham | Old Dominion (Sun Belt) | G | None |
729 | Otaifo Esenabhalu | Longwood (Big South) | F | None |
755 | Sedayjha Payne | Morgan St. (MEAC) | G | None |
843 | Madelyn Bischoff | Ball St. (MAC) | NaN | 40.3 |
63 rows × 4 columns
The NaN values in each column should be dealt with separately, meaning the NaNs in the Position column will be handled separately from the NaNs in the THREE_POINT_PERCENTAGE column. Since THREE_POINT_PERCENTAGE is a calculated field, let's take a look at that first.
Handle Missing Three-Point Percentages
Three-point percentages in basketball are calculated by dividing the number of three-point baskets made by the number of three-point baskets attempted. So, let's add those two columns into our output.
player_data[player_data['THREE_POINT_PERCENTAGE'].isna()][['PLAYER_NAME', 'Team', 'Position', 'THREE_POINTS_MADE', 'THREE_POINT_ATTEMPTS', 'THREE_POINT_PERCENTAGE']]
PLAYER_NAME | Team | Position | THREE_POINTS_MADE | THREE_POINT_ATTEMPTS | THREE_POINT_PERCENTAGE | |
---|---|---|---|---|---|---|
245 | Caitlin Weimar | Boston U. (Patriot) | F | 0 | 0 | None |
250 | Abby Muse | Boise St. (Mountain West) | F | 0 | 0 | None |
254 | Emily Bowman | Samford (SoCon) | C | 0 | 0 | None |
257 | Christina Dalce | Villanova (Big East) | F | 0 | 0 | None |
258 | Emily Saunders | Youngstown St. (Horizon) | F | 0 | 0 | None |
264 | Lauren Betts | UCLA (Pac-12) | C | 0 | 0 | None |
265 | Tenin Magassa | Rhode Island (Atlantic 10) | C | 0 | 0 | None |
275 | Brooklyn Meyer | South Dakota St. (Summit League) | F | 0 | 0 | None |
282 | Breya Cunningham | Arizona (Pac-12) | F | 0 | 0 | None |
290 | Rochelle Norris | Central Mich. (MAC) | C | 0 | 0 | None |
296 | Rita Igbokwe | Ole Miss (SEC) | F | 0 | 0 | None |
297 | Kyndall Golden | Kennesaw St. (ASUN) | F | 0 | 0 | None |
306 | Kennedy Basham | Oregon (Pac-12) | F | 0 | 0 | None |
307 | Maria Gakdeng | North Carolina (ACC) | C | 0 | 0 | None |
311 | Kate Samson | Navy (Patriot) | C | 0 | 0 | None |
316 | Liatu King | Pittsburgh (ACC) | F | 0 | 0 | None |
318 | Ashlee Lewis | Cal St. Fullerton (Big West) | C | 0 | 0 | None |
319 | Jillian Archer | St. John's (NY) (Big East) | F | 0 | 0 | None |
322 | Kylee Watson | Notre Dame (ACC) | F | 0 | 0 | None |
326 | Caitlin Staley | Western Ky. (CUSA) | F | 0 | 0 | None |
332 | Sacha Washington | Vanderbilt (SEC) | F | 0 | 0 | None |
340 | Adreanna Waddle | Prairie View (SWAC) | F | 0 | 0 | None |
348 | River Baldwin | NC State (ACC) | C | 0 | 0 | None |
353 | Zyheima Swint | Hofstra (CAA) | C | 0 | 0 | None |
355 | Phillipina Kyei | Oregon (Pac-12) | C | 0 | 0 | None |
358 | Jasmin Dixon | Northwestern St. (Southland) | F | 0 | 0 | None |
361 | Shamarre Hale | Austin Peay (ASUN) | C | 0 | 0 | None |
366 | Jadyn Donovan | Duke (ACC) | G | 0 | 0 | None |
367 | Melyia Grayson | Southern Miss. (Sun Belt) | C | 0 | 0 | None |
379 | ZiKeyah Carter | Chicago St. (DI Independent) | C | 0 | 0 | None |
380 | Meghan Downing | ETSU (SoCon) | F | 0 | 0 | None |
394 | Makayla Minett | Denver (Summit League) | F | 0 | 0 | None |
401 | Ugonne Onyiah | California (Pac-12) | F | 0 | 0 | None |
402 | Bella Murekatete | Washington St. (Pac-12) | C | 0 | 0 | None |
409 | Zoe McCrary | Col. of Charleston (CAA) | F | 0 | 0 | None |
411 | Ashlee Locke | Mercer (SoCon) | F | 0 | 0 | None |
414 | Kyra Wood | Syracuse (ACC) | F | 0 | 0 | None |
420 | Laura Bello | Idaho St. (Big Sky) | F | 0 | 0 | None |
424 | Khalis Cain | UNC Greensboro (SoCon) | F | 0 | 0 | None |
428 | Hannah Noveroske | Toledo (MAC) | C | 0 | 0 | None |
431 | Yaubryon Chambers | Tennessee Tech (OVC) | F | 0 | 0 | None |
433 | Sierra McCullough | Eastern Ky. (ASUN) | F | 0 | 0 | None |
437 | India Howard | North Ala. (ASUN) | F | 0 | 0 | None |
440 | Kayla Clark | Bethune-Cookman (SWAC) | C | 0 | 0 | None |
443 | Katlyn Manuel | ULM (Sun Belt) | F | 0 | 0 | None |
445 | Jordana Reisma | Cleveland St. (Horizon) | F | 0 | 0 | None |
462 | Carys Roy | Saint Peter's (MAAC) | C | 0 | 0 | None |
468 | Clarice Akunwafo | Southern California (Pac-12) | C | 0 | 0 | None |
526 | Carter McCray | Northern Ky. (Horizon) | C | 0 | 0 | None |
548 | Ajae Petty | Kentucky (SEC) | F | 0 | 0 | None |
664 | Ellie Mitchell | Princeton (Ivy League) | F | 0 | 0 | None |
682 | Teneisia Brown | FDU (NEC) | F | 0 | 0 | None |
686 | Alancia Ramsey | Coastal Carolina (Sun Belt) | F | 0 | 0 | None |
715 | Chyna Cornwell | Rutgers (Big Ten) | F | 0 | 0 | None |
722 | Kennedy Taylor | Missouri St. (MVC) | F | 0 | 0 | None |
726 | Kseniia Kozlova | James Madison (Sun Belt) | C | 0 | 0 | None |
727 | Simone Cunningham | Old Dominion (Sun Belt) | G | 0 | 0 | None |
729 | Otaifo Esenabhalu | Longwood (Big South) | F | 0 | 0 | None |
755 | Sedayjha Payne | Morgan St. (MEAC) | G | 0 | 0 | None |
We can see that for all of the rows where THREE_POINT_PERCENTAGE is None, the three-point baskets made and attempted values are zero. If we manually calculated the three-point percentage for these rows, we would get NaN results instead of None due to the division by zero. NaN and None are different values, but we've confirmed that there is no issue with the underlying data. The None values will automatically be replaced with NaNs later in this process, so we'll leave these values as None for now and move on to the Position field.
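You can verify the division-by-zero behavior with a tiny standalone example; pandas yields NaN rather than raising an error:

import pandas as pd

made = pd.Series([0, 3])
attempts = pd.Series([0, 8])
print(made / attempts)  # 0/0 -> NaN, 3/8 -> 0.375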
Handle Missing Positions
Let’s see how many rows are missing a Position value.
player_data[player_data['Position'].isna()][['PLAYER_NAME', 'Team', 'Position']]
PLAYER_NAME | Team | Position | |
---|---|---|---|
148 | Ally Becki | Ball St. (MAC) | NaN |
359 | Marie Kiefer | Ball St. (MAC) | NaN |
709 | Ava Uhrich | Southern Utah (WAC) | NaN |
843 | Madelyn Bischoff | Ball St. (MAC) | NaN |
For each column with missing values, it can be helpful to evaluate whether a missing or null value is even plausible. In this case, it doesn't make much sense for a basketball player to be missing a position (guard, forward, center), since every player is assigned a standard position on their team. Knowing that, we can assume that our current dataset is missing these values in error, and the entries should be updated.
We’ll be using an external data source to look up the correct positions for each of these players. If there were more rows with a missing Position value, then we might acquire another dataset and combine it with the current one. However, there are only four rows here, so we’ll look up the values manually.
You can use several external data sources, but here's an example using ESPN's website. Search for the player's name and school to pull up the player's individual page (for example, Ally Becki at Ball St.). On this page, you can see the player's team, class, height, and jersey number, as well as an F, C, or G for the position (in this example, the position is listed as G for Guard).
With that, we can locate and set the correct value of the Position field for each of the four players.
player_data.loc[player_data['PLAYER_NAME'] == 'Ally Becki', 'Position'] = 'G'
player_data.loc[player_data['PLAYER_NAME'] == 'Marie Kiefer', 'Position'] = 'F'
player_data.loc[player_data['PLAYER_NAME'] == 'Ava Uhrich', 'Position'] = 'F'
player_data.loc[player_data['PLAYER_NAME'] == 'Madelyn Bischoff', 'Position'] = 'G'
We can confirm that these changes were applied correctly by re-pulling the rows with a missing Position value.
player_data[player_data['Position'].isna()][['PLAYER_NAME', 'Team', 'Position']]
PLAYER_NAME | Team | Position |
---|
There are no results, so we've now handled all of the values caught by the isna() method. This is a great start, but you'll often find datasets with missing or incorrect values that are not caught by the isna() method.
Handle Incorrect Values
Depending on the size of your dataset, the unique() method can be a good way to find additional missing or incorrectly encoded values. Let's try that out on the Height column.
Handle Incorrect Heights
Using the unique() method, we can see one height value that is not valid: 0-0.
player_data['Height'].unique()
array(['5-7', '5-8', '5-2', '5-9', '5-6', '6-0', '5-10', '5-3', '5-11',
'5-5', '5-4', '6-2', '6-1', '6-3', '6-4', '6-6', '6-5', '6-7',
'6-8', '0-0'], dtype=object)
Let's pull all of the rows where the Height is set to 0-0.
player_data[player_data['Height'].eq('0-0')]
PLAYER_NAME | Team | Class | Height | Position | PLAYER_ID | TEAM_NAME | GAMES | MINUTES_PLAYED | FIELD_GOALS_MADE | ... | FREE_THROW_PERCENTAGE | OFFENSIVE_REBOUNDS | DEFENSIVE_REBOUNDS | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
709 | Ava Uhrich | Southern Utah (WAC) | Fr. | 0-0 | F | ncaaw.p.115529 | Southern Utah | 29 | 877 | 151 | ... | 68.4 | 65 | 149 | 214 | 47 | 55 | 22 | 19 | 76 | 383 |
823 | Payton Hull | Abilene Christian (WAC) | Fr. | 0-0 | G | ncaaw.p.112709 | Abilene Christian | 29 | 837 | 155 | ... | 75.3 | 33 | 67 | 100 | 59 | 87 | 47 | 8 | 71 | 444 |
2 rows × 27 columns
There are only two rows with missing heights, so we can manually look up and update these values using the same method as the missing Position values.
player_data.loc[player_data['PLAYER_NAME'] == 'Ava Uhrich', 'Height'] = '6-0'
player_data.loc[player_data['PLAYER_NAME'] == 'Payton Hull', 'Height'] = '5-11'
We can double-check that this worked properly, just as we did for the Position field.
player_data[player_data.eq('0-0').any(axis=1)]
PLAYER_NAME | Team | Class | Height | Position | PLAYER_ID | TEAM_NAME | GAMES | MINUTES_PLAYED | FIELD_GOALS_MADE | ... | FREE_THROW_PERCENTAGE | OFFENSIVE_REBOUNDS | DEFENSIVE_REBOUNDS | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS |
---|
0 rows × 27 columns
Handle Incorrect Classes
We can apply the same process used for Height to the Class field. Since there’s only one player with an incorrect class, this is a quick update.
# Inspect the unique Class values; this reveals one invalid '---' entry
player_data['Class'].unique()

# Identify the affected player, then look up and set the correct class
player_data[player_data['Class'] == '---']
player_data.loc[player_data['PLAYER_NAME'] == 'Ayanna Khalfani', 'Class'] = 'Sr.'

# Confirm that only valid class abbreviations remain
player_data['Class'].unique()
array(['Jr.', 'So.', 'Sr.', 'Fr.'], dtype=object)
Now that we’ve handled all of the empty and incorrect values, we’re ready to move on to other data preprocessing steps.
Data Preprocessing
The goal of this step is to make sure our dataset is consistent and suitable for analysis. This step will change quite a bit depending on the exact project. For this project, we'll be setting column datatypes, converting units, and substituting abbreviated values. For advanced machine learning projects, data preprocessing often includes additional steps like feature scaling, normalizing certain features, and encoding categorical values.
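We won't need those advanced steps today, but for context, here's a small standalone sketch of what feature scaling and categorical encoding can look like in plain pandas (toy values, not part of this project's pipeline):

import pandas as pd

df = pd.DataFrame({'points': [398, 519, 230], 'position': ['G', 'F', 'C']})

# Feature scaling: standardize a numeric column to zero mean, unit variance
df['points_scaled'] = (df['points'] - df['points'].mean()) / df['points'].std()

# Encoding: expand a categorical column into one-hot indicator columns
df = pd.get_dummies(df, columns=['position'])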
Data Type Conversion
Data type conversion is a fundamental step in preparing a dataset for analysis. Specifically, we often encounter situations where numbers are stored as strings or objects instead of their native numerical formats (e.g., integers or floats). In this step, let’s take a closer look at the initial data type of each column and identify opportunities to convert these values into more suitable datatypes.
player_data.dtypes
PLAYER_NAME object
Team object
Class object
Height object
Position object
PLAYER_ID object
TEAM_NAME object
GAMES object
MINUTES_PLAYED object
FIELD_GOALS_MADE object
FIELD_GOAL_ATTEMPTS object
FIELD_GOAL_PERCENTAGE object
THREE_POINTS_MADE object
THREE_POINT_ATTEMPTS object
THREE_POINT_PERCENTAGE object
FREE_THROWS_MADE object
FREE_THROW_ATTEMPTS object
FREE_THROW_PERCENTAGE object
OFFENSIVE_REBOUNDS object
DEFENSIVE_REBOUNDS object
TOTAL_REBOUNDS object
ASSISTS object
TURNOVERS object
STEALS object
BLOCKS object
FOULS object
POINTS object
dtype: object
We can see that all of the columns currently have the object data type, which is not ideal for numeric columns like number of minutes played or field goals made.
Convert Numeric-only Columns
There are a few reasons to convert columns from the object data type to a numeric one. The main reason is that most mathematical operations (including numpy methods and the sums, averages, etc. taken during aggregation steps) only work on numeric data types. Plotting and visualization libraries also expect numeric data types, so you may not be able to create most charts or graphs (aside from categorical ones like bar charts) unless specific columns use a numeric data type. Storing numbers as strings can also give unexpected results, such as sorting the values 1, 2, and 11 in the order 1, 11, 2. For large datasets, it's worth noting that storing numbers as strings can also take up more memory and be slower than using native numeric data types.
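You can see the sorting pitfall for yourself with a couple of lines:

import pandas as pd

# Lexicographic (string) sort vs. numeric sort
print(pd.Series(['1', '2', '11']).sort_values().tolist())  # ['1', '11', '2']
print(pd.Series([1, 2, 11]).sort_values().tolist())        # [1, 2, 11]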
Now that we know why we would want to convert numeric columns to numeric data types, let’s start by defining which columns we expect to store only numeric data.
numeric_columns = ['GAMES', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'FIELD_GOAL_ATTEMPTS',
'FIELD_GOAL_PERCENTAGE', 'THREE_POINTS_MADE', 'THREE_POINT_ATTEMPTS',
'THREE_POINT_PERCENTAGE', 'FREE_THROWS_MADE', 'FREE_THROW_ATTEMPTS',
'FREE_THROW_PERCENTAGE', 'OFFENSIVE_REBOUNDS', 'DEFENSIVE_REBOUNDS',
'TOTAL_REBOUNDS', 'ASSISTS', 'TURNOVERS', 'STEALS', 'BLOCKS', 'FOULS', 'POINTS']
Now that we have this list of columns, we can use the pandas to_numeric() method to convert these columns to the appropriate numeric data types. We can confirm that it worked by re-checking the data type of each column.
player_data[numeric_columns] = player_data[numeric_columns].apply(pd.to_numeric)
player_data.dtypes
PLAYER_NAME object
Team object
Class object
Height object
Position object
PLAYER_ID object
TEAM_NAME object
GAMES int64
MINUTES_PLAYED int64
FIELD_GOALS_MADE int64
FIELD_GOAL_ATTEMPTS int64
FIELD_GOAL_PERCENTAGE float64
THREE_POINTS_MADE int64
THREE_POINT_ATTEMPTS int64
THREE_POINT_PERCENTAGE float64
FREE_THROWS_MADE int64
FREE_THROW_ATTEMPTS int64
FREE_THROW_PERCENTAGE float64
OFFENSIVE_REBOUNDS int64
DEFENSIVE_REBOUNDS int64
TOTAL_REBOUNDS int64
ASSISTS int64
TURNOVERS int64
STEALS int64
BLOCKS int64
FOULS int64
POINTS int64
dtype: object
All of the defined numeric columns were properly converted to either integers or floats. The percentage fields are now stored as floats, while the rest of them are integers.
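One caveat worth knowing: by default, to_numeric() raises an error if it encounters a value it cannot parse, which is a useful signal that something unexpected is in the data. If you'd rather convert anyway and investigate afterwards, a variant of the call above with errors='coerce' turns unparseable values into NaN:

# Variant of the conversion above: unparseable values become NaN
# instead of raising a ValueError
player_data[numeric_columns] = player_data[numeric_columns].apply(pd.to_numeric, errors='coerce')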
We can also revisit the None values in the THREE_POINT_PERCENTAGE column to see if they were impacted by the datatype conversion.
player_data[player_data['THREE_POINT_PERCENTAGE'].isna()][['PLAYER_NAME', 'Team', 'Position', 'THREE_POINTS_MADE', 'THREE_POINT_ATTEMPTS', 'THREE_POINT_PERCENTAGE']].head()
PLAYER_NAME | Team | Position | THREE_POINTS_MADE | THREE_POINT_ATTEMPTS | THREE_POINT_PERCENTAGE | |
---|---|---|---|---|---|---|
245 | Caitlin Weimar | Boston U. (Patriot) | F | 0 | 0 | NaN |
250 | Abby Muse | Boise St. (Mountain West) | F | 0 | 0 | NaN |
254 | Emily Bowman | Samford (SoCon) | C | 0 | 0 | NaN |
257 | Christina Dalce | Villanova (Big East) | F | 0 | 0 | NaN |
258 | Emily Saunders | Youngstown St. (Horizon) | F | 0 | 0 | NaN |
So the to_numeric() method automatically handled the None values in the three-point percentage column and converted them to NaNs. With the numeric columns converted, we can move on to the columns that should contain only text.
Convert Text-only Columns
In pandas, both the object and string data types can be used to represent text. However, there is a slight difference between them:
- object dtype - This is the default dtype and is a catch-all for any non-numeric data. It can hold any Python object, including strings, lists, dictionaries, etc. When a column contains multiple data types or when pandas cannot guess the data type, it defaults to the object dtype. While this is more flexible, operations on columns with object dtype may be slower compared to columns with the string data type. The pandas documentation outlines reasons why it’s better to use the string dtype for storing text-only data.
- string dtype - This dtype specifically represents strings. Columns with this dtype contain only string data (not a mixture of strings and other dtypes), which improves readability and allows for better dtype-specific operations like the select_dtypes() method. Additionally, with the string dtype, you can use some string-specific methods and functions directly on the column without the need for explicit type conversions.
Let’s start by identifying which columns should contain only string data.
string_columns = ['PLAYER_NAME', 'Team', 'Class', 'Height', 'Position', 'PLAYER_ID', 'TEAM_NAME']
We can use the pandas astype() method to convert object columns to string columns. However, you should watch out for values with other data types when performing this conversion. For example, any missing values (such as NaN) will be converted to the string "NaN". We've already replaced all of the missing values in these columns, so this is no longer a concern. However, we do not want other non-string values (numbers, arrays, dictionaries, etc.) to be converted either, so it's best to check for those before performing the conversion. For this dataset, we do not expect these columns to have any non-string values, but it's a good idea to look at a sample of the data in each column just to be safe.
player_data[string_columns].sample(10)
PLAYER_NAME | Team | Class | Height | Position | PLAYER_ID | TEAM_NAME | |
---|---|---|---|---|---|---|---|
400 | Precious Johnson | Duquesne (Atlantic 10) | Sr. | 6-4 | C | ncaaw.p.61337 | Duquesne |
385 | Meghan O'Brien | Lehigh (Patriot) | Jr. | 6-1 | F | ncaaw.p.66792 | Lehigh |
770 | Carmen Villalobos | Cleveland St. (Horizon) | Sr. | 6-1 | G | ncaaw.p.61574 | Cleveland St. |
180 | Ashley Hawkins | Gardner-Webb (Big South) | Jr. | 5-6 | G | ncaaw.p.100658 | Gardner-Webb |
821 | Natalie Picton | Montana St. (Big Sky) | Fr. | 5-5 | G | ncaaw.p.112319 | Montana St. |
273 | Jada Tiggett | N.C. Central (MEAC) | Fr. | 6-2 | G | ncaaw.p.113273 | N.C. Central |
710 | Nyla McGill | Yale (Ivy League) | Jr. | 5-8 | G | ncaaw.p.67510 | Yale |
200 | Keshuna Luckett | Jackson St. (SWAC) | Sr. | 5-8 | G | ncaaw.p.61415 | Jackson St. |
577 | Bri McDaniel | Maryland (Big Ten) | So. | 5-10 | G | ncaaw.p.70818 | Maryland |
649 | Halli Poock | Bradley (MVC) | Fr. | 5-4 | G | ncaaw.p.112767 | Murray St. |
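If you want a stricter check than eyeballing a sample, you can also inspect the underlying Python type of every value (a quick sketch; anything other than str here would deserve a closer look before converting):

# Show the distinct Python types present in each text column
print(player_data[string_columns].apply(lambda col: col.map(type).unique()))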
This looks good, so we’ll go ahead and convert the data type for these columns and confirm the change was made.
player_data[string_columns] = player_data[string_columns].astype('string')
player_data[string_columns].dtypes
PLAYER_NAME string[python]
Team string[python]
Class string[python]
Height string[python]
Position string[python]
PLAYER_ID string[python]
TEAM_NAME string[python]
dtype: object
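As a quick confirmation of the dtype-specific benefit mentioned earlier, the select_dtypes() method can now pick out these columns directly:

# Select only the columns stored with the string dtype
player_data.select_dtypes(include='string').columns.tolist()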
We’re finished converting data types, so it’s time to move on to value substitution.
Value Substitution
Value substitution (or value mapping) refers to replacing one value with another. One of the most common uses for this is to replace abbreviations with their full values (such as replacing "Fr." with "Freshman"). For machine learning datasets, value substitution can also include mapping categorical values to numeric or array representations (such as one-hot encodings).
The first value substitution we'll make is swapping the abbreviated position names for their full values. The Position field in this dataset refers to where the player typically plays on the court (guard, forward, or center).
Let’s see what Position values are currently in our dataset.
player_data['Position'].unique()
<StringArray>
['G', 'F', 'C']
Length: 3, dtype: string
So this dataset uses the three position groups instead of the five specific positions. To make the values more readable, let's create a mapping between the current single-letter values and the full-length position names.
position_names = {
'F': 'Forward',
'G': 'Guard',
'C': 'Center'
}
Now all we have to do is use the pandas replace() method to apply this mapping to the entire series.
player_data['Position'] = player_data['Position'].replace(position_names)
player_data['Position'].unique()
<StringArray>
['Guard', 'Forward', 'Center']
Length: 3, dtype: string
We can use the same process to substitute the values in the Class column as well.
class_names = {
'Fr.': 'Freshman',
'So.': 'Sophomore',
'Jr.': 'Junior',
'Sr.': 'Senior'
}
player_data['Class'] = player_data['Class'].replace(class_names)
player_data['Class'].unique()
<StringArray>
['Junior', 'Sophomore', 'Senior', 'Freshman']
Length: 4, dtype: string
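One design note on replace(): it leaves any value not found in the mapping untouched, whereas the similar map() method turns unmapped values into NaN. Here's a tiny sketch of the difference:

import pandas as pd

s = pd.Series(['Fr.', 'So.', '??'])
print(s.replace({'Fr.': 'Freshman', 'So.': 'Sophomore'}))  # '??' is left as-is
print(s.map({'Fr.': 'Freshman', 'So.': 'Sophomore'}))      # '??' becomes NaN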
There are more substitutions that we could make (such as replacing “St.” with “State” in the team name), but those are the only ones we’ll make today. Next, let’s look at unit conversions.
Unit Conversion
The only unit conversion we'll be doing today is to convert the feet-inches notation in the Height column to the total number of inches. This means that a height of five feet, three inches is currently stored as 5-3, but will be converted to the total number of inches, 63. We can define a function to convert the individual Height values:
def height_to_inches(height):
    # Convert a 'feet-inches' string (e.g., '5-3') to total inches (e.g., 63)
    feet, inches = map(int, height.split('-'))
    return feet * 12 + inches
This function splits the values in the Height column on the hyphen and stores the number of feet and inches in separate variables. It then multiplies the number of feet by 12 (the number of inches per foot) and adds the number of inches to get the player's total height in inches.
Now that we have this function, we can apply it to the Height column and verify that it worked by checking the unique values.
player_data['Height'] = player_data['Height'].apply(height_to_inches)
player_data['Height'].unique()
array([67, 68, 62, 69, 66, 72, 70, 63, 71, 65, 64, 74, 73, 75, 76, 78, 77,
79, 80])
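As an aside, the same conversion could have been done without apply() using pandas string methods. Here's a vectorized sketch; note it's an alternative to the step we just ran, not something to run in addition, since Height now holds integers:

# Split 'feet-inches' strings into two integer columns, then combine
parts = player_data['Height'].str.split('-', expand=True).astype(int)
player_data['Height'] = parts[0] * 12 + parts[1]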
That wraps up everything needed for data preprocessing, and our dataset is ready for feature engineering and machine learning.
If you're going to use a new Jupyter notebook or Python script for the next part of this series, then it's a good idea to export this dataset. As a reminder, you can use the to_csv() method instead of to_excel() if you prefer.
player_data.to_excel('player_data_clean.xlsx', index=False)
Wrap up
In today’s guide, we laid the groundwork for data analysis by cleaning and preprocessing the combined player data. In the next article, we’ll expand upon this dataset by engineering a few new features and training a machine learning model. In the final installment of this series, we’ll identify relationships between various parameters and create meaningful visualizations.