Today we’ll walk through how to clean and preprocess a dataset to ensure it is ready for analysis. This is the second part of a series that walks through the entire process of a data science project - from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, creating visualizations, and machine learning.

Getting Started

First, let’s take a look at an overview of this data science project. If you’re already familiar with it, feel free to skip to the next section.

Project Overview

As a reminder, the dataset we’ll be using in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season. Here’s a brief description of each major step that we’ll go through for this project:

the steps for this data science project

  1. Data Acquisition - This initial step involves obtaining data from two sources: (1) exporting the NCAA’s online individual player statistics report and (2) making API requests to the Yahoo Sports endpoint.
  2. Data Cleaning - This step focuses on identifying and correcting any errors within the dataset. This includes removing duplicates, correcting inaccuracies, and handling missing data.
  3. Data Preprocessing - This step ensures the data is suitable for analysis by converting datatypes, standardizing units, and replacing abbreviations.
  4. Feature Engineering - This step involves selecting and expanding upon the dataset’s features (or columns). This includes calculating additional metrics from existing columns.
  5. Data Exploration - This step focuses on analyzing and visualizing the dataset to uncover patterns, relationships, and general trends and is a helpful preliminary step before deeper analysis.
  6. Creating Visualizations - This step involves identifying the relationships between various parameters (such as height and blocked shots) and generating meaningful visualizations (such as bar charts, scatterplots, and candlestick charts).
  7. Machine Learning - This step focuses on selecting, training, and evaluating a machine learning model. For this project, the model will identify the combination of individual player statistics that correlates with optimal performance.

We’ll use Python along with popular libraries like pandas, numpy, and scikit-learn to accomplish these tasks efficiently. By the end of this series, you’ll be equipped with the skills needed to gather raw data from online sources, structure it into a usable format, eliminate any inconsistencies and errors, identify relationships between variables, create meaningful visualizations, and train a basic machine learning model. Due to the size of this project, today we’ll cover the second and third steps: data cleaning and data preprocessing.

Dependencies

Since this is the second installment in the series, you might already have your environment set up and can skip to the next section. If you’re not already set up and you want to follow along on your own machine, it’s recommended to read the previous post, or at least its Getting Started section, before continuing. In summary, you’ll want to have Python installed with the pandas and numpy packages (plus openpyxl, which pandas uses behind the scenes to read Excel files).

For today’s guide specifically, we’ll want to import the following packages:

import pandas as pd
import numpy as np

Import Data

In Part 1 of this series, we acquired two datasets and combined them into one final dataset, stored in a dataframe named player_data. If you want to follow along with the code examples in this article, it’s recommended to import the player_data dataframe before proceeding.

player_data = pd.read_excel('player_data_raw.xlsx')

Data Cleaning

Before we can analyze the dataset, we need to ensure it is clean and reliable. This prevents headaches like training a model with unintended values or creating graphs without the full dataset. In this guide, we’ll address issues like missing values, incorrect entries, and inconsistencies.

Handle Missing Values

Missing values (such as None and NaN) can significantly impact the accuracy and validity of statistical analyses and visualizations, so we want to identify and handle (correct, impute, remove, or accept) each instance of missing or empty values in the dataset. There are several ways to handle missing values, but let’s take a look at the most common methods:

  • Correction - In limited circumstances, the missing values may be due to an import error, or the correct values may be available in an alternate data source that can supplement your original dataset. This allows you to correct the original dataset and eliminate the missing values.
  • Imputation - Imputation involves replacing missing values with estimated or predicted values based on other available information in the dataset. Common techniques include mean, median, and mode imputation, as well as more advanced methods such as regression imputation or k-nearest neighbors (KNN) imputation (see the short sketch after this list).
  • Deletion - Deleting rows or columns with missing values is a straightforward approach, but it should be used carefully as it can lead to the loss of valuable information. Row deletion (also known as listwise deletion) removes entire observations with missing values, while column deletion (variable-wise deletion) removes entire variables with missing values.
  • Advanced Techniques - More robust options for complex datasets include multiple imputation, which generates several imputed datasets and combines the results, and machine learning algorithms designed to handle missing data directly.
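
To make the imputation and deletion approaches concrete, here’s a minimal sketch on a tiny, made-up dataframe (the df, POINTS, and BLOCKS names are hypothetical and separate from our dataset):

import pandas as pd
import numpy as np

df = pd.DataFrame({'POINTS': [10, np.nan, 25], 'BLOCKS': [1, 2, np.nan]})

# Imputation: fill each missing value with its column's mean
df_imputed = df.fillna(df.mean(numeric_only=True))

# Deletion: drop rows (listwise) or columns (variable-wise) with missing values
df_rows_dropped = df.dropna(axis=0)
df_cols_dropped = df.dropna(axis=1)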

How you handle missing values will depend on a variety of factors (the nature of the missing data, the requirements and objectives of your project, etc.) and should be evaluated on a case-by-case basis. For today, we’ll go through each row with missing values and determine the best way to handle them one at a time.

Identify Missing Values

Let’s begin by looking at any rows that contain at least one missing value (represented as None and NaN in Python). We’ll use the pandas isna() method for this.

player_data[player_data.isna().any(axis=1)]
PLAYER_NAME Team Class Height Position PLAYER_ID TEAM_NAME GAMES MINUTES_PLAYED FIELD_GOALS_MADE ... FREE_THROW_PERCENTAGE OFFENSIVE_REBOUNDS DEFENSIVE_REBOUNDS TOTAL_REBOUNDS ASSISTS TURNOVERS STEALS BLOCKS FOULS POINTS
148 Ally Becki Ball St. (MAC) Jr. 5-8 NaN ncaaw.p.66590 Ball St. 31 972 143 ... 75.4 21 122 143 148 108 60 11 69 398
245 Caitlin Weimar Boston U. (Patriot) Sr. 6-4 F ncaaw.p.64464 N.C. State 28 987 199 ... 68.4 76 219 295 54 69 37 80 74 519
250 Abby Muse Boise St. (Mountain West) Sr. 6-2 F ncaaw.p.64516 Boise St. 31 785 90 ... 54.3 74 191 265 34 66 36 87 72 230
254 Emily Bowman Samford (SoCon) So. 6-5 C ncaaw.p.64719 Samford 30 703 89 ... 53.1 83 172 255 17 55 5 74 88 238
257 Christina Dalce Villanova (Big East) Jr. 6-2 F ncaaw.p.67708 Villanova 30 833 108 ... 54.9 145 146 291 30 52 25 70 89 255
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
726 Kseniia Kozlova James Madison (Sun Belt) Sr. 6-3 C ncaaw.p.64131 James Madison 24 526 118 ... 57.1 74 97 171 19 51 8 8 55 284
727 Simone Cunningham Old Dominion (Sun Belt) Jr. 5-10 G ncaaw.p.113163 Old Dominion 30 591 84 ... 56.5 87 123 210 12 60 20 21 93 207
729 Otaifo Esenabhalu Longwood (Big South) Fr. 6-2 F ncaaw.p.113170 Longwood 30 547 60 ... 60.0 79 128 207 10 59 23 13 78 147
755 Sedayjha Payne Morgan St. (MEAC) Sr. 5-8 G ncaaw.p.113276 Morgan St. 29 634 80 ... 52.0 66 73 139 24 53 67 7 58 186
843 Madelyn Bischoff Ball St. (MAC) Jr. 5-9 NaN ncaaw.p.66600 Ball St. 30 881 112 ... 86.6 11 60 71 37 36 20 6 38 359

63 rows × 27 columns

There are quite a few rows and columns in this output, so let’s find out which columns contain missing values.

player_data.columns[player_data.isna().any()].tolist()
['Position', 'THREE_POINT_PERCENTAGE']

Now that we know which columns to focus on, let’s look at just those two columns alongside the player and team names.

player_data[player_data.isna().any(axis=1)][['PLAYER_NAME', 'Team', 'Position', 'THREE_POINT_PERCENTAGE']]
PLAYER_NAME Team Position THREE_POINT_PERCENTAGE
148 Ally Becki Ball St. (MAC) NaN 34.8
245 Caitlin Weimar Boston U. (Patriot) F None
250 Abby Muse Boise St. (Mountain West) F None
254 Emily Bowman Samford (SoCon) C None
257 Christina Dalce Villanova (Big East) F None
... ... ... ... ...
726 Kseniia Kozlova James Madison (Sun Belt) C None
727 Simone Cunningham Old Dominion (Sun Belt) G None
729 Otaifo Esenabhalu Longwood (Big South) F None
755 Sedayjha Payne Morgan St. (MEAC) G None
843 Madelyn Bischoff Ball St. (MAC) NaN 40.3

63 rows × 4 columns

The NaN values in each column should be dealt with separately, meaning the NaNs in the Position column will be handled separately from the NaNs in the THREE_POINT_PERCENTAGE column. Since THREE_POINT_PERCENTAGE is a calculated field, let’s take a look at it first.

Handle Missing Three-Point Percentages

Three-point percentages in basketball are calculated by dividing the number of three-point baskets made by the number of three-point baskets attempted. So, let’s add those two columns to our output.

player_data[player_data['THREE_POINT_PERCENTAGE'].isna()][['PLAYER_NAME', 'Team', 'Position', 'THREE_POINTS_MADE', 'THREE_POINT_ATTEMPTS', 'THREE_POINT_PERCENTAGE']]
PLAYER_NAME Team Position THREE_POINTS_MADE THREE_POINT_ATTEMPTS THREE_POINT_PERCENTAGE
245 Caitlin Weimar Boston U. (Patriot) F 0 0 None
250 Abby Muse Boise St. (Mountain West) F 0 0 None
254 Emily Bowman Samford (SoCon) C 0 0 None
257 Christina Dalce Villanova (Big East) F 0 0 None
... ... ... ... ... ... ...
726 Kseniia Kozlova James Madison (Sun Belt) C 0 0 None
727 Simone Cunningham Old Dominion (Sun Belt) G 0 0 None
729 Otaifo Esenabhalu Longwood (Big South) F 0 0 None
755 Sedayjha Payne Morgan St. (MEAC) G 0 0 None

We can see that for all of the rows where THREE_POINT_PERCENTAGE is None, the three-point baskets made and attempted are both zero. If we manually calculated the three-point percentage for these rows, the division of zero by zero would give us NaN results instead of None. NaN and None are different values, but we’ve confirmed that there is no issue with the underlying data. The None values will automatically be replaced with NaNs later in this process, so we’ll leave them at None for now.
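
To see that division-by-zero behavior concretely, here’s a quick illustration on toy values (not part of our cleaning pipeline):

made = pd.Series([0])
attempts = pd.Series([0])
(made / attempts) * 100
0   NaN
dtype: float64

With that confirmed, let’s move on to the Position field.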

Handle Missing Positions

Let’s see how many rows are missing a Position value.

player_data[player_data['Position'].isna()][['PLAYER_NAME', 'Team', 'Position']]
PLAYER_NAME Team Position
148 Ally Becki Ball St. (MAC) NaN
359 Marie Kiefer Ball St. (MAC) NaN
709 Ava Uhrich Southern Utah (WAC) NaN
843 Madelyn Bischoff Ball St. (MAC) NaN

It can be helpful to evaluate whether a missing or null value even makes sense for each affected column. In this case, it doesn’t make sense for a basketball player to be missing a position (guard, forward, center), since every player is assigned a standard position on their team. Knowing that, we can assume our dataset is missing these values in error, and the entries should be updated.

We’ll be using an external data source to look up the correct positions for each of these players. If there were more rows with a missing Position value, then we might acquire another dataset and combine it with the current one. However, there are only four rows here, so we’ll look up the values manually.

You can use several external data sources, but here’s an example using ESPN’s website. Search for the player’s name and school to pull up the player’s individual page (for example, Ally Becki at Ball St.). On this page, you can see the player’s team, class, height, and jersey number, as well as an F, C, or G for the position (in this example, the position is listed as G for Guard).

Manually look up a player's position

With that, we can locate and set the correct value of the Position field for each of the four players.

player_data.loc[player_data['PLAYER_NAME'] == 'Ally Becki', 'Position'] = 'G'
player_data.loc[player_data['PLAYER_NAME'] == 'Marie Kiefer', 'Position'] = 'F'
player_data.loc[player_data['PLAYER_NAME'] == 'Ava Uhrich', 'Position'] = 'F'
player_data.loc[player_data['PLAYER_NAME'] == 'Madelyn Bischoff', 'Position'] = 'G'

We can confirm that these changes were applied correctly by re-pulling the rows with a missing Position value.

player_data[player_data['Position'].isna()][['PLAYER_NAME', 'Team', 'Position']]
PLAYER_NAME Team Position

There are no results, so we’ve now handled all of the values caught by the isna() method. This is a great start, but you’ll often find datasets with missing or incorrect values that are not caught by the isna() method.

Handle Incorrect Values

Depending on the size of your dataset, the unique() method can be a good way to find placeholder or sentinel values that represent missing data but are not caught by isna(). Let’s try that out on the Height column.

Handle Incorrect Heights

Using the unique() method, we can see one height value that is not valid: 0-0.

player_data['Height'].unique()
array(['5-7', '5-8', '5-2', '5-9', '5-6', '6-0', '5-10', '5-3', '5-11',
       '5-5', '5-4', '6-2', '6-1', '6-3', '6-4', '6-6', '6-5', '6-7',
       '6-8', '0-0'], dtype=object)

Let’s pull all of the rows where the Height is set to 0-0.

player_data[player_data['Height'].eq('0-0')]
PLAYER_NAME Team Class Height Position PLAYER_ID TEAM_NAME GAMES MINUTES_PLAYED FIELD_GOALS_MADE ... FREE_THROW_PERCENTAGE OFFENSIVE_REBOUNDS DEFENSIVE_REBOUNDS TOTAL_REBOUNDS ASSISTS TURNOVERS STEALS BLOCKS FOULS POINTS
709 Ava Uhrich Southern Utah (WAC) Fr. 0-0 F ncaaw.p.115529 Southern Utah 29 877 151 ... 68.4 65 149 214 47 55 22 19 76 383
823 Payton Hull Abilene Christian (WAC) Fr. 0-0 G ncaaw.p.112709 Abilene Christian 29 837 155 ... 75.3 33 67 100 59 87 47 8 71 444

2 rows × 27 columns

There are only two rows with missing heights, so we can manually look up and update these values using the same method as the missing Position values.

player_data.loc[player_data['PLAYER_NAME'] == 'Ava Uhrich', 'Height'] = '6-0'
player_data.loc[player_data['PLAYER_NAME'] == 'Payton Hull', 'Height'] = '5-11'

We can double-check that this worked properly for the Height column, just like the Position field.

player_data[player_data.eq('0-0').any(axis=1)]
PLAYER_NAME Team Class Height Position PLAYER_ID TEAM_NAME GAMES MINUTES_PLAYED FIELD_GOALS_MADE ... FREE_THROW_PERCENTAGE OFFENSIVE_REBOUNDS DEFENSIVE_REBOUNDS TOTAL_REBOUNDS ASSISTS TURNOVERS STEALS BLOCKS FOULS POINTS

0 rows × 27 columns

Handle Incorrect Classes

We can apply the same process used for Height to the Class field. Since there’s only one player with an incorrect class, this is a quick update.

player_data['Class'].unique()   # reveals one invalid value: '---'
player_data[player_data['Class'] == '---']   # the only affected row is Ayanna Khalfani's
player_data.loc[player_data['PLAYER_NAME'] == 'Ayanna Khalfani', 'Class'] = 'Sr.'
player_data['Class'].unique()   # confirm only valid classes remain
array(['Jr.', 'So.', 'Sr.', 'Fr.'], dtype=object)

Now that we’ve handled all of the empty and incorrect values, we’re ready to move on to other data preprocessing steps.

Data Preprocessing

The goal of this step is to make sure our dataset is consistent and suitable for analysis. This step will change quite a bit depending on the exact project. For this project, we’ll be setting column datatypes, converting units, and substituting abbreviated values. For advanced machine learning projects, data preprocessing often includes additional steps like feature scaling, normalizing certain features, and encoding categorical values.
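
Those extra steps are beyond today’s scope, but as a taste, feature scaling with scikit-learn might look like this (a toy example on made-up values, not part of this project’s pipeline):

from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[10.0], [20.0], [30.0]])   # one toy feature column
StandardScaler().fit_transform(X)        # rescaled to zero mean, unit variance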

Data Type Conversion

Data type conversion is a fundamental step in preparing a dataset for analysis. Specifically, we often encounter situations where numbers are stored as strings or objects instead of their native numerical formats (e.g., integers or floats). In this step, let’s take a closer look at the initial data type of each column and identify opportunities to convert these values into more suitable datatypes.

player_data.dtypes
PLAYER_NAME               object
Team                      object
Class                     object
Height                    object
Position                  object
PLAYER_ID                 object
TEAM_NAME                 object
GAMES                     object
MINUTES_PLAYED            object
FIELD_GOALS_MADE          object
FIELD_GOAL_ATTEMPTS       object
FIELD_GOAL_PERCENTAGE     object
THREE_POINTS_MADE         object
THREE_POINT_ATTEMPTS      object
THREE_POINT_PERCENTAGE    object
FREE_THROWS_MADE          object
FREE_THROW_ATTEMPTS       object
FREE_THROW_PERCENTAGE     object
OFFENSIVE_REBOUNDS        object
DEFENSIVE_REBOUNDS        object
TOTAL_REBOUNDS            object
ASSISTS                   object
TURNOVERS                 object
STEALS                    object
BLOCKS                    object
FOULS                     object
POINTS                    object
dtype: object

We can see that all of the columns currently have the object data type, which is not ideal for numeric columns like number of minutes played or field goals made.

Convert Numeric-only Columns

There are a few reasons to convert columns from the object data type to a numeric one. The main reason is that most mathematical operations (including numpy methods and the sums/averages/etc. taken during aggregation steps) only work on numeric data types. Plotting and visualization libraries also expect numeric data types, so you may not be able to create charts or graphs (beyond categorical plots like bar charts of counts) unless specific columns are numeric. Storing numbers as strings can also give unexpected results, such as sorting the values 1, 2, and 11 in the order 1, 11, 2, as shown below. Finally, for large datasets, strings can take up more memory and be slower to process than native numeric data types.
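
As a quick illustration of that sorting pitfall (toy values, separate from our dataset):

pd.Series(['1', '2', '11']).sort_values().tolist()
['1', '11', '2']
pd.to_numeric(pd.Series(['1', '2', '11'])).sort_values().tolist()
[1, 2, 11]

The string column sorts lexicographically, while the numeric version sorts by value.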

Now that we know why we would want to convert numeric columns to numeric data types, let’s start by defining which columns we expect to store only numeric data.

numeric_columns = ['GAMES', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'FIELD_GOAL_ATTEMPTS', 
                  'FIELD_GOAL_PERCENTAGE', 'THREE_POINTS_MADE', 'THREE_POINT_ATTEMPTS', 
                  'THREE_POINT_PERCENTAGE', 'FREE_THROWS_MADE', 'FREE_THROW_ATTEMPTS', 
                  'FREE_THROW_PERCENTAGE', 'OFFENSIVE_REBOUNDS', 'DEFENSIVE_REBOUNDS', 
                  'TOTAL_REBOUNDS', 'ASSISTS', 'TURNOVERS', 'STEALS', 'BLOCKS', 'FOULS', 'POINTS']

Now that we have this list of columns, we can use the pandas to_numeric() method to convert these columns to the appropriate numeric data type. We can confirm that worked by re-checking the data type of each column.

player_data[numeric_columns] = player_data[numeric_columns].apply(pd.to_numeric)
player_data.dtypes
PLAYER_NAME                object
Team                       object
Class                      object
Height                     object
Position                   object
PLAYER_ID                  object
TEAM_NAME                  object
GAMES                       int64
MINUTES_PLAYED              int64
FIELD_GOALS_MADE            int64
FIELD_GOAL_ATTEMPTS         int64
FIELD_GOAL_PERCENTAGE     float64
THREE_POINTS_MADE           int64
THREE_POINT_ATTEMPTS        int64
THREE_POINT_PERCENTAGE    float64
FREE_THROWS_MADE            int64
FREE_THROW_ATTEMPTS         int64
FREE_THROW_PERCENTAGE     float64
OFFENSIVE_REBOUNDS          int64
DEFENSIVE_REBOUNDS          int64
TOTAL_REBOUNDS              int64
ASSISTS                     int64
TURNOVERS                   int64
STEALS                      int64
BLOCKS                      int64
FOULS                       int64
POINTS                      int64
dtype: object

All of the defined numeric columns were properly converted to either integers or floats. The percentage fields are now stored as floats, while the rest of them are integers.
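
One caveat worth knowing: by default, to_numeric() raises a ValueError when it encounters a value it can’t parse. If you’d rather turn unparseable entries into missing values, you can pass errors='coerce' (a toy example, not needed for our dataset):

pd.to_numeric(pd.Series(['12', '3.5', 'N/A']), errors='coerce')
0    12.0
1     3.5
2     NaN
dtype: float64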

We can also revisit the None values in the THREE_POINT_PERCENTAGE column to see if they were impacted by the datatype conversion.

player_data[player_data['THREE_POINT_PERCENTAGE'].isna()][['PLAYER_NAME', 'Team', 'Position', 'THREE_POINTS_MADE', 'THREE_POINT_ATTEMPTS', 'THREE_POINT_PERCENTAGE']].head()
PLAYER_NAME Team Position THREE_POINTS_MADE THREE_POINT_ATTEMPTS THREE_POINT_PERCENTAGE
245 Caitlin Weimar Boston U. (Patriot) F 0 0 NaN
250 Abby Muse Boise St. (Mountain West) F 0 0 NaN
254 Emily Bowman Samford (SoCon) C 0 0 NaN
257 Christina Dalce Villanova (Big East) F 0 0 NaN
258 Emily Saunders Youngstown St. (Horizon) F 0 0 NaN

So the to_numeric() method automatically handled the None values in the three-point percentage column and converted them to NaNs. Neat!

With the numeric columns converted, we can move on to the columns that should contain only text.

Convert Text-only Columns

In pandas, both the object and string data types can be used to represent text. However, there is a slight difference between them:

  • object dtype - This is the default dtype and is a catch-all for any non-numeric data. It can hold any Python object, including strings, lists, dictionaries, etc. When a column contains multiple data types or when pandas cannot guess the data type, it defaults to the object dtype. While this is more flexible, operations on columns with object dtype may be slower compared to columns with the string data type. The pandas documentation outlines reasons why it’s better to use the string dtype for storing text-only data.
  • string dtype - This dtype specifically represents strings. Columns with this dtype contain only string data (not a mixture of strings and other dtypes), which improves readability and enables dtype-specific operations like the select_dtypes() method (see the short example after this list). Additionally, with the string dtype, you can use string-specific methods and functions directly on the column without explicit type conversions.
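
Here’s a short example of that dtype-specific selection on a toy dataframe (the toy, name, and points names are made up for illustration):

toy = pd.DataFrame({
    'name': pd.array(['Ava', 'Bri'], dtype='string'),  # string dtype
    'points': [10, 12]                                  # int64
})
toy.select_dtypes(include='string')   # returns only the 'name' column
toy['name'].str.upper()               # string methods work directly on the column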

Let’s start by identifying which columns should contain only string data.

string_columns = ['PLAYER_NAME', 'Team', 'Class', 'Height', 'Position', 'PLAYER_ID', 'TEAM_NAME']

We can use the pandas astype() method to convert object columns to string columns. However, you should watch out for values with other data types when performing this conversion. For example, a plain astype(str) would turn missing values (such as NaN) into the literal string 'nan'; the nullable 'string' dtype we’ll use below preserves them as pd.NA instead, and we’ve already replaced all of the missing values in these columns, so this is no longer a concern. However, we do not want other non-string values (numbers, arrays, dictionaries, etc.) to be converted to their string representations either, so it’s best to check for those before performing the conversion. For this dataset, we don’t expect these columns to have any non-string values, but it’s a good idea to look at a sample of the data in each column just to be safe.

player_data[string_columns].sample(10)
PLAYER_NAME Team Class Height Position PLAYER_ID TEAM_NAME
400 Precious Johnson Duquesne (Atlantic 10) Sr. 6-4 C ncaaw.p.61337 Duquesne
385 Meghan O'Brien Lehigh (Patriot) Jr. 6-1 F ncaaw.p.66792 Lehigh
770 Carmen Villalobos Cleveland St. (Horizon) Sr. 6-1 G ncaaw.p.61574 Cleveland St.
180 Ashley Hawkins Gardner-Webb (Big South) Jr. 5-6 G ncaaw.p.100658 Gardner-Webb
821 Natalie Picton Montana St. (Big Sky) Fr. 5-5 G ncaaw.p.112319 Montana St.
273 Jada Tiggett N.C. Central (MEAC) Fr. 6-2 G ncaaw.p.113273 N.C. Central
710 Nyla McGill Yale (Ivy League) Jr. 5-8 G ncaaw.p.67510 Yale
200 Keshuna Luckett Jackson St. (SWAC) Sr. 5-8 G ncaaw.p.61415 Jackson St.
577 Bri McDaniel Maryland (Big Ten) So. 5-10 G ncaaw.p.70818 Maryland
649 Halli Poock Bradley (MVC) Fr. 5-4 G ncaaw.p.112767 Murray St.

This looks good, so we’ll go ahead and convert the data type for these columns and confirm the change was made.

player_data[string_columns] = player_data[string_columns].astype('string')
player_data[string_columns].dtypes
PLAYER_NAME    string[python]
Team           string[python]
Class          string[python]
Height         string[python]
Position       string[python]
PLAYER_ID      string[python]
TEAM_NAME      string[python]
dtype: object
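
As an aside, here’s the difference between a plain astype(str) and the nullable string dtype when missing values are present (a toy example, separate from our dataset):

pd.Series(['A', np.nan]).astype(str).tolist()
['A', 'nan']
pd.Series(['A', np.nan]).astype('string')
0       A
1    <NA>
dtype: string

The plain str conversion silently turns the missing value into the string 'nan', while the string dtype keeps it as a proper missing value.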

Now that we’re finished converting data types, it’s time to move on to value substitution.

Value Substitution

Value substitution (or value mapping) refers to replacing one value with another. One of the most common uses is replacing abbreviations with their full values (such as replacing “Fr.” with “Freshman”). For machine learning datasets, value substitution can also mean mapping categorical values to numeric codes or one-hot arrays, as sketched below.
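
For reference, those ML-style encodings might look like this on a toy series (hypothetical values; we won’t need either encoding today):

s = pd.Series(['G', 'F', 'C'])
s.map({'G': 0, 'F': 1, 'C': 2})    # integer codes
pd.get_dummies(s, prefix='POS')    # one-hot columns: POS_C, POS_F, POS_G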

The first value substitution we’ll make is swapping the abbreviated position names for their full values. The Position field in this dataset refers to where the player typically plays on the court. Here’s a diagram of the positions:

Diagram of basketball positions

Let’s see what Position values are currently in our dataset.

player_data['Position'].unique()
<StringArray>
['G', 'F', 'C']
Length: 3, dtype: string

So this dataset uses the three general position types instead of the five specific positions. This means that shooting guards and point guards are both stored as G for guard, and power forwards and small forwards are both stored as F for forward. To make these F, G, C values more readable, let’s create a map from each single-letter value to the full position name.

position_names = {
    'F': 'Forward',
    'G': 'Guard',
    'C': 'Center'
}

Now all we have to do is use the pandas replace() method to apply this mapping to the entire series.

player_data['Position'] = player_data['Position'].replace(position_names)
player_data['Position'].unique()
<StringArray>
['Guard', 'Forward', 'Center']
Length: 3, dtype: string
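
A quick design note: replace() leaves any value that isn’t in the mapping untouched, whereas map() would turn unmapped values into NaN, which makes replace() the safer choice here. On toy values:

pd.Series(['G', 'X']).replace({'G': 'Guard'}).tolist()
['Guard', 'X']
pd.Series(['G', 'X']).map({'G': 'Guard'}).tolist()
['Guard', nan]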

We can use the same process to substitute the values in the Class column as well.

class_names = {
    'Fr.': 'Freshman',
    'So.': 'Sophomore',
    'Jr.': 'Junior',
    'Sr.': 'Senior'
}
player_data['Class'] = player_data['Class'].replace(class_names)
player_data['Class'].unique()
<StringArray>
['Junior', 'Sophomore', 'Senior', 'Freshman']
Length: 4, dtype: string

There are more substitutions that we could make (such as replacing “St.” with “State” in the team name), but those are the only ones we’ll make today. Next, let’s look at unit conversions.

Unit Conversion

The only unit conversion we’ll be doing today is converting the feet-inches notation in the Height column to a total number of inches. This means that a height of five feet, three inches, currently stored as 5-3, will be converted to 63. We can define a function to convert the individual Height values:

def height_to_inches(height):
    feet, inches = map(int, height.split('-'))
    return feet * 12 + inches

This function splits each Height value on the hyphen and stores the number of feet and inches in separate variables. It then multiplies the number of feet by 12 (the number of inches per foot) and adds the remaining inches to get the player’s total height in inches.
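
As a quick sanity check, the function converts a sample value as expected:

height_to_inches('5-3')
63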

Now that we have this function, we can apply it to the Height column and verify it worked by checking the unique values.

player_data['Height'] = player_data['Height'].apply(height_to_inches)
player_data['Height'].unique()
array([67, 68, 62, 69, 66, 72, 70, 63, 71, 65, 64, 74, 73, 75, 76, 78, 77,
       79, 80])

That wraps up everything needed for data preprocessing, and our dataset is ready for the next steps of the project, like feature engineering and machine learning!

Export Data

If you’re going to use a new Jupyter notebook or Python script for the next part of this series, it’s a good idea to export this dataset. As a reminder, you can use the to_csv() method instead of the to_excel() method if you prefer.

player_data.to_excel('player_data_clean.xlsx', index=False)
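
If you prefer CSV, the equivalent export would look like this (the filename is just a suggestion):

player_data.to_csv('player_data_clean.csv', index=False)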

Wrap up

In today’s guide, we laid the groundwork for data analysis by cleaning and preprocessing the combined player data. In the next article, we’ll expand upon this dataset by engineering new features.

Also, all of the code snippets in today’s guide are available in a Jupyter Notebook in the ncaa-basketball-stats repository on GitHub.

Articles in this Series

  1. Acquiring and Combining the Datasets
  2. Cleaning and Preprocessing the Data (Today’s Guide)
  3. Engineering New Features
  4. Exploratory Data Analysis
  5. Visualizations, Charts, and Graphs
  6. Selecting a Machine Learning Model
  7. Training the Machine Learning Model
  8. Evaluating the Machine Learning Model