Today we’ll walk through how to clean and preprocess a dataset to ensure it is ready for analysis. This is the second part of a series that walks through the entire process of a data science project - from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, machine learning, and creating visualizations.

As a reminder, the dataset we’ll be using in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season. Here’s a brief description of each major step that we’ll go through for this project:

the steps for this data science project

  1. Data Acquisition - This initial step involves obtaining data from two sources: (1) exporting the NCAA’s online individual player statistics report and (2) making API requests to the Yahoo Sports endpoint.
  2. Data Cleaning - This step focuses on identifying and correcting any errors within the dataset. This includes removing duplicates, correcting inaccuracies, and handling missing data.
  3. Data Preprocessing - This step ensures the data is suitable for analysis by converting datatypes, standardizing units, and replacing abbreviations.
  4. Feature Engineering - This step involves selecting and expanding upon the dataset’s features (or columns). This includes calculating additional metrics from existing columns.
  5. Machine Learning - This step focuses on training a machine learning model to identify the combination of individual player statistics that correlates with optimal performance.
  6. Creating Visualizations - This step involves identifying the relationships between various parameters (such as height and blocked shots) and generating meaningful visualizations (such as bar charts, scatterplots, and candlestick charts).

We’ll use Python along with the popular pandas and requests libraries to accomplish these tasks efficiently. By the end of this series, you’ll be equipped with the skills needed to gather raw data from online sources, structure it into a usable format, eliminate any inconsistencies and errors, create meaningful visualizations, and train a basic machine learning model. Since we already gathered the raw data from online sources in the last part, let’s move on to the data cleaning and preprocessing steps.

Getting Started

Since this is the second installment in the series, you likely already have your environment set up and can skip to the next section. If you’re not already set up and you want to follow along on your own machine, it’s recommended to read the previous post, or at least its Getting Started section, before continuing. In summary, you’ll want to have Python installed along with the pandas and requests packages.

In Part 1 of this series, we acquired two datasets and combined them into one final dataset, stored in a dataframe named player_data. If you want to follow along with the code examples in this article, it’s recommended to import the player_data dataframe before proceeding.

import pandas as pd

player_data = pd.read_excel('player_data_raw.xlsx')

Data Cleaning

Before we can analyze the dataset, we need to ensure it is clean and reliable. In this section, we’ll address issues like missing values, incorrect entries, and inconsistencies. This saves you the headache of training a model with unintended values or creating graphs without the full dataset.

Handle Missing Values

Missing values (such as None and NaN) can significantly impact the accuracy and validity of statistical analyses and visualizations, so we want to identify and handle (remove, impute, or accept) each instance of missing or empty values in the dataset. There are several ways to handle missing values, but let’s take a look at the most common methods:

  • Correction - In limited circumstances, the missing values may be due to an import error or available in an alternate data source that can supplement your original dataset. This allows you to make corrections to the original dataset and eliminate missing values.
  • Imputation - Imputation involves replacing missing values with estimated or predicted values based on other available information in the dataset. Common imputation techniques include mean, median, mode imputation, or more advanced methods such as regression imputation or k-nearest neighbors (KNN) imputation.
  • Deletion - Deleting rows or columns with missing values is a straightforward approach, but it should be used carefully as it can lead to the loss of valuable information. Row deletion (also known as listwise deletion) removes entire observations with missing values, while column deletion (variable-wise deletion) removes entire variables with missing values.
  • Advanced Techniques - Advanced techniques such as multiple imputation, which generates multiple imputed datasets and combines the results, or sophisticated machine learning algorithms designed to handle missing data directly, offer more robust solutions for handling missing values in complex datasets.

How you handle missing values will depend on a variety of factors (the nature of the missing data, the requirements and objectives of your project, etc.) and should be evaluated on a case-by-case basis. For today, we’ll go through each row with missing values and determine the best way to handle them one at a time.
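To make the options above concrete, here’s a minimal sketch of mean imputation, row deletion, and column deletion in pandas (the frame and column names are made up for the example):

```python
import pandas as pd

# Hypothetical frame with a missing value in each column
df = pd.DataFrame({'points': [10, None, 30], 'position': ['G', 'F', None]})

# Imputation: replace missing numeric values with the column mean
mean_imputed = df['points'].fillna(df['points'].mean())

# Row (listwise) deletion: drop any observation with a missing value
listwise = df.dropna()

# Column (variable-wise) deletion: drop any column with a missing value
variablewise = df.dropna(axis=1)
```

Notice how aggressive each approach can be: row deletion keeps only the one complete row here, and column deletion removes both columns.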

Identify Missing Values

Let’s begin by looking at any rows that contain at least one missing value (represented as None and NaN in Python). We’ll use the pandas isna() method for this.

player_data[player_data.isna().any(axis=1)]
PLAYER_NAME Team Class Height Position PLAYER_ID TEAM_NAME GAMES MINUTES_PLAYED FIELD_GOALS_MADE ... FREE_THROW_PERCENTAGE OFFENSIVE_REBOUNDS DEFENSIVE_REBOUNDS TOTAL_REBOUNDS ASSISTS TURNOVERS STEALS BLOCKS FOULS POINTS
148 Ally Becki Ball St. (MAC) Jr. 5-8 NaN ncaaw.p.66590 Ball St. 31 972 143 ... 75.4 21 122 143 148 108 60 11 69 398
245 Caitlin Weimar Boston U. (Patriot) Sr. 6-4 F ncaaw.p.64464 N.C. State 28 987 199 ... 68.4 76 219 295 54 69 37 80 74 519
250 Abby Muse Boise St. (Mountain West) Sr. 6-2 F ncaaw.p.64516 Boise St. 31 785 90 ... 54.3 74 191 265 34 66 36 87 72 230
254 Emily Bowman Samford (SoCon) So. 6-5 C ncaaw.p.64719 Samford 30 703 89 ... 53.1 83 172 255 17 55 5 74 88 238
257 Christina Dalce Villanova (Big East) Jr. 6-2 F ncaaw.p.67708 Villanova 30 833 108 ... 54.9 145 146 291 30 52 25 70 89 255
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
726 Kseniia Kozlova James Madison (Sun Belt) Sr. 6-3 C ncaaw.p.64131 James Madison 24 526 118 ... 57.1 74 97 171 19 51 8 8 55 284
727 Simone Cunningham Old Dominion (Sun Belt) Jr. 5-10 G ncaaw.p.113163 Old Dominion 30 591 84 ... 56.5 87 123 210 12 60 20 21 93 207
729 Otaifo Esenabhalu Longwood (Big South) Fr. 6-2 F ncaaw.p.113170 Longwood 30 547 60 ... 60.0 79 128 207 10 59 23 13 78 147
755 Sedayjha Payne Morgan St. (MEAC) Sr. 5-8 G ncaaw.p.113276 Morgan St. 29 634 80 ... 52.0 66 73 139 24 53 67 7 58 186
843 Madelyn Bischoff Ball St. (MAC) Jr. 5-9 NaN ncaaw.p.66600 Ball St. 30 881 112 ... 86.6 11 60 71 37 36 20 6 38 359

63 rows × 27 columns

There are quite a few rows and columns in this output, so let’s find out which columns contain missing values.

player_data.columns[player_data.isna().any()].tolist()
['Position', 'THREE_POINT_PERCENTAGE']
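If you also want to know how many missing values each of those columns contains, chaining isna() with sum() gives a per-column count. Here’s a sketch on a small synthetic frame standing in for player_data:

```python
import pandas as pd

# Synthetic stand-in for player_data, just to show the pattern
df = pd.DataFrame({
    'Position': ['G', None, 'F', None],
    'THREE_POINT_PERCENTAGE': [34.8, None, 40.3, 28.1],
    'POINTS': [398, 519, 230, 359],
})

# Count missing values per column, keeping only the columns that have any
missing_counts = df.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
```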

Now that we know which columns to focus on, let’s look at just those two columns with the player name and team name.

player_data[player_data.isna().any(axis=1)][['PLAYER_NAME', 'Team', 'Position', 'THREE_POINT_PERCENTAGE']]
PLAYER_NAME Team Position THREE_POINT_PERCENTAGE
148 Ally Becki Ball St. (MAC) NaN 34.8
245 Caitlin Weimar Boston U. (Patriot) F None
250 Abby Muse Boise St. (Mountain West) F None
254 Emily Bowman Samford (SoCon) C None
257 Christina Dalce Villanova (Big East) F None
... ... ... ... ...
726 Kseniia Kozlova James Madison (Sun Belt) C None
727 Simone Cunningham Old Dominion (Sun Belt) G None
729 Otaifo Esenabhalu Longwood (Big South) F None
755 Sedayjha Payne Morgan St. (MEAC) G None
843 Madelyn Bischoff Ball St. (MAC) NaN 40.3

63 rows × 4 columns

The NaN values in each column should be dealt with separately, meaning the NaNs in the Position column will be handled separately from the NaNs in the THREE_POINT_PERCENTAGE column. Since the THREE_POINT_PERCENTAGE is a calculated field, let’s take a look at that first.

Handle Missing Three-Point Percentages

Three-point percentages in basketball are calculated by dividing the number of three-point baskets made by the number of three-point baskets attempted (and multiplying by 100). So, let’s add those two columns to our output.

player_data[player_data['THREE_POINT_PERCENTAGE'].isna()][['PLAYER_NAME', 'Team', 'Position', 'THREE_POINTS_MADE', 'THREE_POINT_ATTEMPTS', 'THREE_POINT_PERCENTAGE']]
PLAYER_NAME Team Position THREE_POINTS_MADE THREE_POINT_ATTEMPTS THREE_POINT_PERCENTAGE
245 Caitlin Weimar Boston U. (Patriot) F 0 0 None
250 Abby Muse Boise St. (Mountain West) F 0 0 None
254 Emily Bowman Samford (SoCon) C 0 0 None
257 Christina Dalce Villanova (Big East) F 0 0 None
258 Emily Saunders Youngstown St. (Horizon) F 0 0 None
264 Lauren Betts UCLA (Pac-12) C 0 0 None
265 Tenin Magassa Rhode Island (Atlantic 10) C 0 0 None
275 Brooklyn Meyer South Dakota St. (Summit League) F 0 0 None
282 Breya Cunningham Arizona (Pac-12) F 0 0 None
290 Rochelle Norris Central Mich. (MAC) C 0 0 None
296 Rita Igbokwe Ole Miss (SEC) F 0 0 None
297 Kyndall Golden Kennesaw St. (ASUN) F 0 0 None
306 Kennedy Basham Oregon (Pac-12) F 0 0 None
307 Maria Gakdeng North Carolina (ACC) C 0 0 None
311 Kate Samson Navy (Patriot) C 0 0 None
316 Liatu King Pittsburgh (ACC) F 0 0 None
318 Ashlee Lewis Cal St. Fullerton (Big West) C 0 0 None
319 Jillian Archer St. John's (NY) (Big East) F 0 0 None
322 Kylee Watson Notre Dame (ACC) F 0 0 None
326 Caitlin Staley Western Ky. (CUSA) F 0 0 None
332 Sacha Washington Vanderbilt (SEC) F 0 0 None
340 Adreanna Waddle Prairie View (SWAC) F 0 0 None
348 River Baldwin NC State (ACC) C 0 0 None
353 Zyheima Swint Hofstra (CAA) C 0 0 None
355 Phillipina Kyei Oregon (Pac-12) C 0 0 None
358 Jasmin Dixon Northwestern St. (Southland) F 0 0 None
361 Shamarre Hale Austin Peay (ASUN) C 0 0 None
366 Jadyn Donovan Duke (ACC) G 0 0 None
367 Melyia Grayson Southern Miss. (Sun Belt) C 0 0 None
379 ZiKeyah Carter Chicago St. (DI Independent) C 0 0 None
380 Meghan Downing ETSU (SoCon) F 0 0 None
394 Makayla Minett Denver (Summit League) F 0 0 None
401 Ugonne Onyiah California (Pac-12) F 0 0 None
402 Bella Murekatete Washington St. (Pac-12) C 0 0 None
409 Zoe McCrary Col. of Charleston (CAA) F 0 0 None
411 Ashlee Locke Mercer (SoCon) F 0 0 None
414 Kyra Wood Syracuse (ACC) F 0 0 None
420 Laura Bello Idaho St. (Big Sky) F 0 0 None
424 Khalis Cain UNC Greensboro (SoCon) F 0 0 None
428 Hannah Noveroske Toledo (MAC) C 0 0 None
431 Yaubryon Chambers Tennessee Tech (OVC) F 0 0 None
433 Sierra McCullough Eastern Ky. (ASUN) F 0 0 None
437 India Howard North Ala. (ASUN) F 0 0 None
440 Kayla Clark Bethune-Cookman (SWAC) C 0 0 None
443 Katlyn Manuel ULM (Sun Belt) F 0 0 None
445 Jordana Reisma Cleveland St. (Horizon) F 0 0 None
462 Carys Roy Saint Peter's (MAAC) C 0 0 None
468 Clarice Akunwafo Southern California (Pac-12) C 0 0 None
526 Carter McCray Northern Ky. (Horizon) C 0 0 None
548 Ajae Petty Kentucky (SEC) F 0 0 None
664 Ellie Mitchell Princeton (Ivy League) F 0 0 None
682 Teneisia Brown FDU (NEC) F 0 0 None
686 Alancia Ramsey Coastal Carolina (Sun Belt) F 0 0 None
715 Chyna Cornwell Rutgers (Big Ten) F 0 0 None
722 Kennedy Taylor Missouri St. (MVC) F 0 0 None
726 Kseniia Kozlova James Madison (Sun Belt) C 0 0 None
727 Simone Cunningham Old Dominion (Sun Belt) G 0 0 None
729 Otaifo Esenabhalu Longwood (Big South) F 0 0 None
755 Sedayjha Payne Morgan St. (MEAC) G 0 0 None

We can see that for all of the rows where THREE_POINT_PERCENTAGE is None, the three-point baskets made and attempted values are zero. If we manually calculated the three-point percentage for these rows, we would get NaN results instead of None due to the division by zero. NaN and None are different values, but we’ve confirmed that there is no issue with the underlying data. The None values will automatically be replaced with NaNs later in this process, so we’ll leave these values at None for now and move on to the Position field.
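The division-by-zero behavior described above is easy to verify in isolation: dividing zero baskets made by zero attempted in pandas yields NaN rather than raising an error.

```python
import pandas as pd

made = pd.Series([0, 50])
attempts = pd.Series([0, 100])

# 0 / 0 produces NaN in pandas; no ZeroDivisionError is raised
pct = made / attempts * 100
```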

Handle Missing Positions

Let’s see how many rows are missing a Position value.

player_data[player_data['Position'].isna()][['PLAYER_NAME', 'Team', 'Position']]
PLAYER_NAME Team Position
148 Ally Becki Ball St. (MAC) NaN
359 Marie Kiefer Ball St. (MAC) NaN
709 Ava Uhrich Southern Utah (WAC) NaN
843 Madelyn Bischoff Ball St. (MAC) NaN

It can be helpful to evaluate whether each column with missing values should ever legitimately contain them. In this case, it doesn’t make much sense for a basketball player to be missing a position (guard, forward, center), since every player is assigned a standard position on their team. Knowing that, we can assume that our current dataset is missing these values in error, and the entries should be updated.

We’ll be using an external data source to look up the correct positions for each of these players. If there were more rows with a missing Position value, then we might acquire another dataset and combine it with the current one. However, there are only four rows here, so we’ll look up the values manually.

You can use several external data sources, but here’s an example using ESPN’s website. Search for the player’s name and school to pull up the player’s individual page (for example, Ally Becki at Ball St.). On this page, you can see the player’s team, class, height, and jersey number, as well as an F, C, or G for the position (in this example, the position is listed as G for Guard).

Manually look up a player's position

With that, we can locate and set the correct value of the Position field for each of the four players.

player_data.loc[player_data['PLAYER_NAME'] == 'Ally Becki', 'Position'] = 'G'
player_data.loc[player_data['PLAYER_NAME'] == 'Marie Kiefer', 'Position'] = 'F'
player_data.loc[player_data['PLAYER_NAME'] == 'Ava Uhrich', 'Position'] = 'F'
player_data.loc[player_data['PLAYER_NAME'] == 'Madelyn Bischoff', 'Position'] = 'G'
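The four assignments above could also be written as a single lookup table, which scales better if more corrections ever turn up. Here’s a sketch on a tiny synthetic frame (the names and corrected positions come from the rows above):

```python
import pandas as pd

# Corrections found by manually looking up each player
position_fixes = {
    'Ally Becki': 'G',
    'Marie Kiefer': 'F',
    'Ava Uhrich': 'F',
    'Madelyn Bischoff': 'G',
}

# Two-row stand-in for player_data to demonstrate the pattern
df = pd.DataFrame({'PLAYER_NAME': ['Ally Becki', 'Ava Uhrich'],
                   'Position': [None, None]})

# Fill each missing Position from the lookup table, keyed by player name
df['Position'] = df['Position'].fillna(df['PLAYER_NAME'].map(position_fixes))
```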

We can confirm that these changes were applied correctly by re-pulling the rows with a missing Position value.

player_data[player_data['Position'].isna()][['PLAYER_NAME', 'Team', 'Position']]
PLAYER_NAME Team Position

There are no results, so we’ve now handled all of the values caught by the isna() method. This is a great start, but you’ll often find datasets with missing or incorrect values that are not caught by the isna() method.

Handle Incorrect Values

Depending on the size of your dataset, the unique() method can be a good way to find additional missing or invalid values. Let’s try that out on the Height column.

Handle Incorrect Heights

Using the unique() method, we can see one height value that is not valid: 0-0.

player_data['Height'].unique()
array(['5-7', '5-8', '5-2', '5-9', '5-6', '6-0', '5-10', '5-3', '5-11',
       '5-5', '5-4', '6-2', '6-1', '6-3', '6-4', '6-6', '6-5', '6-7',
       '6-8', '0-0'], dtype=object)

Let’s pull all of the rows where the Height is set to 0-0.

player_data[player_data['Height'].eq('0-0')]
PLAYER_NAME Team Class Height Position PLAYER_ID TEAM_NAME GAMES MINUTES_PLAYED FIELD_GOALS_MADE ... FREE_THROW_PERCENTAGE OFFENSIVE_REBOUNDS DEFENSIVE_REBOUNDS TOTAL_REBOUNDS ASSISTS TURNOVERS STEALS BLOCKS FOULS POINTS
709 Ava Uhrich Southern Utah (WAC) Fr. 0-0 F ncaaw.p.115529 Southern Utah 29 877 151 ... 68.4 65 149 214 47 55 22 19 76 383
823 Payton Hull Abilene Christian (WAC) Fr. 0-0 G ncaaw.p.112709 Abilene Christian 29 837 155 ... 75.3 33 67 100 59 87 47 8 71 444

2 rows × 27 columns

There are only two rows with invalid heights, so we can manually look up and update these values using the same method as for the missing Position values.

player_data.loc[player_data['PLAYER_NAME'] == 'Ava Uhrich', 'Height'] = '6-0'
player_data.loc[player_data['PLAYER_NAME'] == 'Payton Hull', 'Height'] = '5-11'

We can double-check that this worked properly, just like the Position field.

player_data[player_data.eq('0-0').any(axis=1)]
PLAYER_NAME Team Class Height Position PLAYER_ID TEAM_NAME GAMES MINUTES_PLAYED FIELD_GOALS_MADE ... FREE_THROW_PERCENTAGE OFFENSIVE_REBOUNDS DEFENSIVE_REBOUNDS TOTAL_REBOUNDS ASSISTS TURNOVERS STEALS BLOCKS FOULS POINTS

0 rows × 27 columns

Handle Incorrect Classes

We can apply the same process used for Height to the Class field. Since there’s only one player with an incorrect class, this is a quick update.

player_data['Class'].unique()
player_data[player_data['Class'] == '---']
player_data.loc[player_data['PLAYER_NAME'] == 'Ayanna Khalfani', 'Class'] = 'Sr.'
player_data['Class'].unique()
array(['Jr.', 'So.', 'Sr.', 'Fr.'], dtype=object)

Now that we’ve handled all of the empty and incorrect values, we’re ready to move on to other data preprocessing steps.

Data Preprocessing

The goal of this step is to make sure our dataset is consistent and suitable for analysis. This step will change quite a bit depending on the exact project. For this project, we’ll be setting column datatypes, converting units, and substituting abbreviated values. For advanced machine learning projects, data preprocessing often includes additional steps like feature scaling, normalizing certain features, and encoding categorical values.

Data Type Conversion

Data type conversion is a fundamental step in preparing a dataset for analysis. Specifically, we often encounter situations where numbers are stored as strings or objects instead of their native numerical formats (e.g., integers or floats). In this step, let’s take a closer look at the initial data type of each column and identify opportunities to convert these values into more suitable datatypes.

player_data.dtypes
PLAYER_NAME               object
Team                      object
Class                     object
Height                    object
Position                  object
PLAYER_ID                 object
TEAM_NAME                 object
GAMES                     object
MINUTES_PLAYED            object
FIELD_GOALS_MADE          object
FIELD_GOAL_ATTEMPTS       object
FIELD_GOAL_PERCENTAGE     object
THREE_POINTS_MADE         object
THREE_POINT_ATTEMPTS      object
THREE_POINT_PERCENTAGE    object
FREE_THROWS_MADE          object
FREE_THROW_ATTEMPTS       object
FREE_THROW_PERCENTAGE     object
OFFENSIVE_REBOUNDS        object
DEFENSIVE_REBOUNDS        object
TOTAL_REBOUNDS            object
ASSISTS                   object
TURNOVERS                 object
STEALS                    object
BLOCKS                    object
FOULS                     object
POINTS                    object
dtype: object

We can see that all of the columns currently have the object data type, which is not ideal for numeric columns like number of minutes played or field goals made.

Convert Numeric-only Columns

There are a few reasons to convert columns from the object data type to a numeric one. The main reason is that most mathematical operations (including numpy methods and the sums, averages, etc. taken during aggregation steps) only work on numeric data types. Plotting and visualization libraries also expect numeric data types, so you may not be able to create most charts or graphs (besides categorical charts such as bar charts of value counts) unless specific columns use a numeric data type. Storing numbers as strings can also give unexpected results, such as sorting the values 1, 2, and 11 in the order 1, 11, 2. For large datasets, it’s worth noting that storing numbers as strings can take up more memory and be slower than using numeric data types.
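The sorting quirk is easy to reproduce. Here’s a small sketch comparing lexicographic and numeric ordering:

```python
import pandas as pd

as_strings = pd.Series(['1', '2', '11'])

# Lexicographic order compares character by character, so '11' sorts before '2'
string_order = as_strings.sort_values().tolist()

# After numeric conversion, the values sort as expected
numeric_order = pd.to_numeric(as_strings).sort_values().tolist()
```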

Now that we know why we would want to convert numeric columns to numeric data types, let’s start by defining which columns we expect to store only numeric data.

numeric_columns = ['GAMES', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'FIELD_GOAL_ATTEMPTS', 
                  'FIELD_GOAL_PERCENTAGE', 'THREE_POINTS_MADE', 'THREE_POINT_ATTEMPTS', 
                  'THREE_POINT_PERCENTAGE', 'FREE_THROWS_MADE', 'FREE_THROW_ATTEMPTS', 
                  'FREE_THROW_PERCENTAGE', 'OFFENSIVE_REBOUNDS', 'DEFENSIVE_REBOUNDS', 
                  'TOTAL_REBOUNDS', 'ASSISTS', 'TURNOVERS', 'STEALS', 'BLOCKS', 'FOULS', 'POINTS']

Now that we have this list of columns, we can use the pandas to_numeric() method to convert these columns to the appropriate numeric data type. We can confirm that worked by re-checking the data type of each column.

player_data[numeric_columns] = player_data[numeric_columns].apply(pd.to_numeric)
player_data.dtypes
PLAYER_NAME                object
Team                       object
Class                      object
Height                     object
Position                   object
PLAYER_ID                  object
TEAM_NAME                  object
GAMES                       int64
MINUTES_PLAYED              int64
FIELD_GOALS_MADE            int64
FIELD_GOAL_ATTEMPTS         int64
FIELD_GOAL_PERCENTAGE     float64
THREE_POINTS_MADE           int64
THREE_POINT_ATTEMPTS        int64
THREE_POINT_PERCENTAGE    float64
FREE_THROWS_MADE            int64
FREE_THROW_ATTEMPTS         int64
FREE_THROW_PERCENTAGE     float64
OFFENSIVE_REBOUNDS          int64
DEFENSIVE_REBOUNDS          int64
TOTAL_REBOUNDS              int64
ASSISTS                     int64
TURNOVERS                   int64
STEALS                      int64
BLOCKS                      int64
FOULS                       int64
POINTS                      int64
dtype: object

All of the defined numeric columns were properly converted to either integers or floats. The percentage fields are now stored as floats, while the rest of them are integers.
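One caveat worth knowing: by default, to_numeric() raises an error if it encounters a value it cannot parse. Passing errors='coerce' instead turns unparseable entries into NaN, which can be handy for messier datasets than this one. A quick sketch:

```python
import pandas as pd

raw = pd.Series(['398', '519', 'N/A'])

# errors='coerce' converts unparseable entries to NaN instead of raising
coerced = pd.to_numeric(raw, errors='coerce')
```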

We can also revisit the None values in the THREE_POINT_PERCENTAGE column to see if they were impacted by the datatype conversion.

player_data[player_data['THREE_POINT_PERCENTAGE'].isna()][['PLAYER_NAME', 'Team', 'Position', 'THREE_POINTS_MADE', 'THREE_POINT_ATTEMPTS', 'THREE_POINT_PERCENTAGE']].head()
PLAYER_NAME Team Position THREE_POINTS_MADE THREE_POINT_ATTEMPTS THREE_POINT_PERCENTAGE
245 Caitlin Weimar Boston U. (Patriot) F 0 0 NaN
250 Abby Muse Boise St. (Mountain West) F 0 0 NaN
254 Emily Bowman Samford (SoCon) C 0 0 NaN
257 Christina Dalce Villanova (Big East) F 0 0 NaN
258 Emily Saunders Youngstown St. (Horizon) F 0 0 NaN

So the to_numeric() method automatically handled the None values in the three-point percentage column and converted them to NaNs. With the numeric columns converted, we can move on to the columns that should contain only text.

Convert Text-only Columns

In pandas, both the object and string data types can be used to represent text. However, there is a slight difference between them:

  • object dtype - This is the default dtype and is a catch-all for any non-numeric data. It can hold any Python object, including strings, lists, dictionaries, etc. When a column contains multiple data types or when pandas cannot guess the data type, it defaults to the object dtype. While this is more flexible, operations on columns with object dtype may be slower compared to columns with the string data type. The pandas documentation outlines reasons why it’s better to use the string dtype for storing text-only data.
  • string dtype - This dtype specifically represents strings. Columns with this dtype contain only string data (not a mixture of strings and other dtypes), which improves readability and allows for better dtype-specific operations like the select_dtypes() method. Additionally, with the string dtype, you can use string-specific methods and functions directly on the column without the need for explicit type conversions.

Let’s start by identifying which columns should contain only string data.

string_columns = ['PLAYER_NAME', 'Team', 'Class', 'Height', 'Position', 'PLAYER_ID', 'TEAM_NAME']

We can use the pandas astype() method to convert object columns to string columns. However, you should watch out for values with other data types when performing this conversion. For example, casting with astype(str) turns missing values (such as NaN) into the literal string 'nan', while casting to the dedicated string dtype preserves them as missing values. We’ve already replaced all of the missing values in these columns, so this is no longer a concern. However, we do not want other non-string values (numbers, arrays, dictionaries, etc.) to be converted either, so it’s best to check for those before performing the conversion. For this dataset, we do not expect these columns to have any non-string values, but it’s a good idea to look at a sample of the data in each column just to be safe.

player_data[string_columns].sample(10)
PLAYER_NAME Team Class Height Position PLAYER_ID TEAM_NAME
400 Precious Johnson Duquesne (Atlantic 10) Sr. 6-4 C ncaaw.p.61337 Duquesne
385 Meghan O'Brien Lehigh (Patriot) Jr. 6-1 F ncaaw.p.66792 Lehigh
770 Carmen Villalobos Cleveland St. (Horizon) Sr. 6-1 G ncaaw.p.61574 Cleveland St.
180 Ashley Hawkins Gardner-Webb (Big South) Jr. 5-6 G ncaaw.p.100658 Gardner-Webb
821 Natalie Picton Montana St. (Big Sky) Fr. 5-5 G ncaaw.p.112319 Montana St.
273 Jada Tiggett N.C. Central (MEAC) Fr. 6-2 G ncaaw.p.113273 N.C. Central
710 Nyla McGill Yale (Ivy League) Jr. 5-8 G ncaaw.p.67510 Yale
200 Keshuna Luckett Jackson St. (SWAC) Sr. 5-8 G ncaaw.p.61415 Jackson St.
577 Bri McDaniel Maryland (Big Ten) So. 5-10 G ncaaw.p.70818 Maryland
649 Halli Poock Bradley (MVC) Fr. 5-4 G ncaaw.p.112767 Murray St.

This looks good, so we’ll go ahead and convert the data type for these columns and confirm the change was made.

player_data[string_columns] = player_data[string_columns].astype('string')
player_data[string_columns].dtypes
PLAYER_NAME    string[python]
Team           string[python]
Class          string[python]
Height         string[python]
Position       string[python]
PLAYER_ID      string[python]
TEAM_NAME      string[python]
dtype: object

We’re finished converting data types, so it’s time to move on to value substitution.

Value Substitution

Value substitution (or value mapping) refers to replacing one value with another. One of the most common uses for this is to replace abbreviations with their full values (such as replacing “Fr.” with “Freshman”). For machine learning datasets, value substitution can also include mapping categorical values to number or array formats.

The first value substitution we’ll make is swapping the abbreviated position names with the full values. The Position field in this dataset refers to where the player typically plays on the court. Here’s a diagram of the positions:

Diagram of basketball positions

Let’s see what Position values are currently in our dataset.

player_data['Position'].unique()
<StringArray>
['G', 'F', 'C']
Length: 3, dtype: string

So this dataset uses the three position types instead of the five specific positions. To make the values more readable, let’s create a map between the current single-letter values and the full-length position names.

position_names = {
    'F': 'Forward',
    'G': 'Guard',
    'C': 'Center'
}

Now all we have to do is use the pandas replace() method to apply this mapping to the entire series.

player_data['Position'] = player_data['Position'].replace(position_names)
player_data['Position'].unique()
<StringArray>
['Guard', 'Forward', 'Center']
Length: 3, dtype: string

We can use the same process to substitute the values in the Class column as well.

class_names = {
    'Fr.': 'Freshman',
    'So.': 'Sophomore',
    'Jr.': 'Junior',
    'Sr.': 'Senior'
}
player_data['Class'] = player_data['Class'].replace(class_names)
player_data['Class'].unique()
<StringArray>
['Junior', 'Sophomore', 'Senior', 'Freshman']
Length: 4, dtype: string

There are more substitutions that we could make (such as replacing “St.” with “State” in the team name), but those are the only ones we’ll make today. Next, let’s look at unit conversions.
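For reference, the “St.”-to-“State” substitution mentioned above could be done with the pandas str.replace() method. Here’s a sketch using a regex anchored to the end of the string, since a blanket replacement would also mangle names like “St. John’s”:

```python
import pandas as pd

teams = pd.Series(['Ball St.', 'Boise St.', "St. John's"])

# Replace only a trailing 'St.' with 'State', leaving 'St.' elsewhere alone
expanded = teams.str.replace(r'St\.$', 'State', regex=True)
```

Note that the Team column in this dataset also carries a conference suffix like “(MAC)”, so the pattern would need further adjusting there.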

Unit Conversion

The only unit conversion we’ll be doing today is to convert the feet-inches notation in the Height column to the total number of inches. This means that a height of five feet, three inches tall is currently stored as 5-3, but will be converted to the total number of inches, 63. We can define a function to convert the individual Height values:

def height_to_inches(height):
    feet, inches = map(int, height.split('-'))
    return feet * 12 + inches

This function splits each value in the Height column on the hyphen and stores the number of feet and inches in separate variables. It then multiplies the number of feet by 12 (the number of inches per foot) and adds the number of inches to get the player’s total height in inches.
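As a quick sanity check, the function is restated here with a couple of example conversions:

```python
def height_to_inches(height):
    feet, inches = map(int, height.split('-'))
    return feet * 12 + inches

# 5 feet 3 inches = 5 * 12 + 3 = 63 total inches
five_three = height_to_inches('5-3')

# 6 feet 4 inches = 6 * 12 + 4 = 76 total inches
six_four = height_to_inches('6-4')
```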

Now that we have this function, we can apply it to the Height column and verify it worked by checking the unique values.

player_data['Height'] = player_data['Height'].apply(height_to_inches)
player_data['Height'].unique()
array([67, 68, 62, 69, 66, 72, 70, 63, 71, 65, 64, 74, 73, 75, 76, 78, 77,
       79, 80])

That wraps up everything needed for data preprocessing and our dataset is ready for feature engineering and machine learning.

If you’re going to use a new Jupyter notebook or Python script for the next part of this series, it’s a good idea to export this dataset. As a reminder, you can use the to_csv() method instead of to_excel() if you prefer.

player_data.to_excel('player_data_clean.xlsx', index=False)
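If you’d rather export to CSV, a sketch of the round trip looks like this (using a tiny stand-in frame and an example filename); note that to_excel() requires an Excel engine such as openpyxl, while to_csv() has no extra dependency:

```python
import pandas as pd

# Tiny stand-in frame; with the real data you'd call this on player_data
df = pd.DataFrame({'PLAYER_NAME': ['Ally Becki'], 'POINTS': [398]})

# Write without the pandas index, then read it back to verify the round trip
df.to_csv('player_data_clean_example.csv', index=False)
restored = pd.read_csv('player_data_clean_example.csv')
```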

Wrap up

In today’s guide, we laid the groundwork for data analysis by cleaning and preprocessing the combined player data. In the next article, we’ll expand upon this dataset by engineering a few new features and training a machine learning model. In the final installment of this series, we’ll identify relationships between various parameters and create meaningful visualizations.