Today we’ll walk through how to clean and preprocess a dataset to ensure it is ready for analysis. This is the second part of a series that walks through the entire process of a data science project - from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, machine learning, and creating visualizations.
As a reminder, the dataset we’ll be using in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season. Here’s a brief description of each major step that we’ll go through for this project:
- Data Acquisition - This initial step involves obtaining data from two sources: (1) exporting the NCAA’s online individual player statistics report and (2) making API requests to the Yahoo Sports endpoint.
- Data Cleaning - This step focuses on identifying and correcting any errors within the dataset. This includes removing duplicates, correcting inaccuracies, and handling missing data.
- Data Preprocessing - This step ensures the data is suitable for analysis by converting datatypes, standardizing units, and replacing abbreviations.
- Feature Engineering - This step involves selecting and expanding upon the dataset’s features (or columns). This includes calculating additional metrics from existing columns.
- Machine Learning - This step focuses on training a machine learning model to identify the combination of individual player statistics that correlates with optimal performance.
- Creating Visualizations - This step involves identifying the relationships between various parameters (such as height and blocked shots) and generating meaningful visualizations (such as bar charts, scatterplots, and candlestick charts).
We’ll use Python along with the popular pandas and requests libraries to accomplish these tasks efficiently. By the end of this series, you’ll be equipped with the skills needed to gather raw data from online sources, structure it into a usable format, eliminate any inconsistencies and errors, create meaningful visualizations, and train a basic machine learning model. Since we already gathered the raw data from online sources in the last part, let’s move on to the data cleaning and preprocessing steps.
Getting Started
Since this is the second installment in the series, you likely already have your environment set up and can skip to the next section. If you're not already set up and you want to follow along on your own machine, it's recommended to read the previous post, or at least review the Getting Started section of that post, before continuing. In summary, you'll want to have Python installed along with the packages used in this series, such as pandas (plus openpyxl, which pandas uses under the hood to read and write Excel files).
In Part 1 of this series, we acquired two datasets and combined them into one final dataset, stored in a dataframe named player_data. If you want to follow along with the code examples in this article, it's recommended to import the player_data dataframe before proceeding.
import pandas as pd

player_data = pd.read_excel('player_data_raw.xlsx')
Data Cleaning
Before we can analyze the dataset, we need to ensure it is clean and reliable. In this section, we’ll address issues like missing values, incorrect entries, and inconsistencies. This saves you the headache of training a model with unintended values or creating graphs without the full dataset.
Handle Missing Values
Missing values (such as None and NaN) can significantly impact the accuracy and validity of statistical analyses and visualizations, so we want to identify and handle (remove, impute, or accept) each instance of missing and empty values in the dataset. There are several ways to handle missing values, but let's take a look at the most common methods:
- Correction - In limited circumstances, the missing values may be due to an import error or available in an alternate data source that can supplement your original dataset. This allows you to make corrections to the original dataset and eliminate missing values.
- Imputation - Imputation involves replacing missing values with estimated or predicted values based on other available information in the dataset. Common imputation techniques include mean, median, mode imputation, or more advanced methods such as regression imputation or k-nearest neighbors (KNN) imputation.
- Deletion - Deleting rows or columns with missing values is a straightforward approach, but it should be used carefully as it can lead to the loss of valuable information. Row deletion (also known as listwise deletion) removes entire observations with missing values, while column deletion (variable-wise deletion) removes entire variables with missing values.
- Advanced Techniques - Advanced techniques such as multiple imputation, which generates multiple imputed datasets and combines the results, or sophisticated machine learning algorithms designed to handle missing data directly, offer more robust solutions for handling missing values in complex datasets.
How you handle missing values will depend on a variety of factors (the nature of the missing data, the requirements and objectives of your project, etc.) and should be evaluated on a case-by-case basis. For today, we’ll go through each row with missing values and determine the best way to handle them one at a time.
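Before we do, here's a minimal, self-contained sketch of what the imputation and deletion approaches look like in pandas. This is illustrative only (using toy values, not our dataset) since we won't be applying blanket rules like these today:

import pandas as pd

# Toy dataframe with one missing numeric value and one missing category
df = pd.DataFrame({'points': [398, None, 519], 'position': ['G', 'F', None]})

# Imputation: fill the missing numeric value with the column mean
df['points'] = df['points'].fillna(df['points'].mean())

# Deletion (listwise): drop any rows that still contain a missing value
df = df.dropna()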
Identify Missing Values
Let's begin by looking at any rows that contain at least one missing value (represented as None and NaN in Python). We'll use the pandas isna() method for this.
player_data[player_data.isna().any(axis=1)]
PLAYER_NAME | Team | Class | Height | Position | PLAYER_ID | TEAM_NAME | GAMES | MINUTES_PLAYED | FIELD_GOALS_MADE | ... | FREE_THROW_PERCENTAGE | OFFENSIVE_REBOUNDS | DEFENSIVE_REBOUNDS | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
148 | Ally Becki | Ball St. (MAC) | Jr. | 5-8 | NaN | ncaaw.p.66590 | Ball St. | 31 | 972 | 143 | ... | 75.4 | 21 | 122 | 143 | 148 | 108 | 60 | 11 | 69 | 398 |
245 | Caitlin Weimar | Boston U. (Patriot) | Sr. | 6-4 | F | ncaaw.p.64464 | N.C. State | 28 | 987 | 199 | ... | 68.4 | 76 | 219 | 295 | 54 | 69 | 37 | 80 | 74 | 519 |
250 | Abby Muse | Boise St. (Mountain West) | Sr. | 6-2 | F | ncaaw.p.64516 | Boise St. | 31 | 785 | 90 | ... | 54.3 | 74 | 191 | 265 | 34 | 66 | 36 | 87 | 72 | 230 |
254 | Emily Bowman | Samford (SoCon) | So. | 6-5 | C | ncaaw.p.64719 | Samford | 30 | 703 | 89 | ... | 53.1 | 83 | 172 | 255 | 17 | 55 | 5 | 74 | 88 | 238 |
257 | Christina Dalce | Villanova (Big East) | Jr. | 6-2 | F | ncaaw.p.67708 | Villanova | 30 | 833 | 108 | ... | 54.9 | 145 | 146 | 291 | 30 | 52 | 25 | 70 | 89 | 255 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
726 | Kseniia Kozlova | James Madison (Sun Belt) | Sr. | 6-3 | C | ncaaw.p.64131 | James Madison | 24 | 526 | 118 | ... | 57.1 | 74 | 97 | 171 | 19 | 51 | 8 | 8 | 55 | 284 |
727 | Simone Cunningham | Old Dominion (Sun Belt) | Jr. | 5-10 | G | ncaaw.p.113163 | Old Dominion | 30 | 591 | 84 | ... | 56.5 | 87 | 123 | 210 | 12 | 60 | 20 | 21 | 93 | 207 |
729 | Otaifo Esenabhalu | Longwood (Big South) | Fr. | 6-2 | F | ncaaw.p.113170 | Longwood | 30 | 547 | 60 | ... | 60.0 | 79 | 128 | 207 | 10 | 59 | 23 | 13 | 78 | 147 |
755 | Sedayjha Payne | Morgan St. (MEAC) | Sr. | 5-8 | G | ncaaw.p.113276 | Morgan St. | 29 | 634 | 80 | ... | 52.0 | 66 | 73 | 139 | 24 | 53 | 67 | 7 | 58 | 186 |
843 | Madelyn Bischoff | Ball St. (MAC) | Jr. | 5-9 | NaN | ncaaw.p.66600 | Ball St. | 30 | 881 | 112 | ... | 86.6 | 11 | 60 | 71 | 37 | 36 | 20 | 6 | 38 | 359 |
63 rows × 27 columns
There are quite a few rows and columns in this output, so let’s find out which columns contain missing values.
player_data.columns[player_data.isna().any()].tolist()
['Position', 'THREE_POINT_PERCENTAGE']
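As an aside, counting the missing values per column gives the same overview in a single line:

# Count missing values per column and show only the non-zero counts
missing_counts = player_data.isna().sum()
print(missing_counts[missing_counts > 0])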
Now that we know which columns to focus on, let’s look at just those two columns with the player name and team name.
player_data[player_data.isna().any(axis=1)][['PLAYER_NAME', 'Team', 'Position', 'THREE_POINT_PERCENTAGE']]
PLAYER_NAME | Team | Position | THREE_POINT_PERCENTAGE | |
---|---|---|---|---|
148 | Ally Becki | Ball St. (MAC) | NaN | 34.8 |
245 | Caitlin Weimar | Boston U. (Patriot) | F | None |
250 | Abby Muse | Boise St. (Mountain West) | F | None |
254 | Emily Bowman | Samford (SoCon) | C | None |
257 | Christina Dalce | Villanova (Big East) | F | None |
... | ... | ... | ... | ... |
726 | Kseniia Kozlova | James Madison (Sun Belt) | C | None |
727 | Simone Cunningham | Old Dominion (Sun Belt) | G | None |
729 | Otaifo Esenabhalu | Longwood (Big South) | F | None |
755 | Sedayjha Payne | Morgan St. (MEAC) | G | None |
843 | Madelyn Bischoff | Ball St. (MAC) | NaN | 40.3 |
63 rows × 4 columns
The NaN values in each column should be dealt with separately, meaning the NaNs in the Position column will be handled separately from the NaNs in the THREE_POINT_PERCENTAGE column. Since THREE_POINT_PERCENTAGE is a calculated field, let's take a look at that first.
Handle Missing Three-Point Percentages
Three-point percentages in basketball are calculated by dividing the number of three-point baskets made by the number of three-point baskets attempted. So, let's add those two columns into our output.
player_data[player_data['THREE_POINT_PERCENTAGE'].isna()][['PLAYER_NAME', 'Team', 'Position', 'THREE_POINTS_MADE', 'THREE_POINT_ATTEMPTS', 'THREE_POINT_PERCENTAGE']]
PLAYER_NAME | Team | Position | THREE_POINTS_MADE | THREE_POINT_ATTEMPTS | THREE_POINT_PERCENTAGE | |
---|---|---|---|---|---|---|
245 | Caitlin Weimar | Boston U. (Patriot) | F | 0 | 0 | None |
250 | Abby Muse | Boise St. (Mountain West) | F | 0 | 0 | None |
254 | Emily Bowman | Samford (SoCon) | C | 0 | 0 | None |
257 | Christina Dalce | Villanova (Big East) | F | 0 | 0 | None |
258 | Emily Saunders | Youngstown St. (Horizon) | F | 0 | 0 | None |
264 | Lauren Betts | UCLA (Pac-12) | C | 0 | 0 | None |
265 | Tenin Magassa | Rhode Island (Atlantic 10) | C | 0 | 0 | None |
275 | Brooklyn Meyer | South Dakota St. (Summit League) | F | 0 | 0 | None |
282 | Breya Cunningham | Arizona (Pac-12) | F | 0 | 0 | None |
290 | Rochelle Norris | Central Mich. (MAC) | C | 0 | 0 | None |
296 | Rita Igbokwe | Ole Miss (SEC) | F | 0 | 0 | None |
297 | Kyndall Golden | Kennesaw St. (ASUN) | F | 0 | 0 | None |
306 | Kennedy Basham | Oregon (Pac-12) | F | 0 | 0 | None |
307 | Maria Gakdeng | North Carolina (ACC) | C | 0 | 0 | None |
311 | Kate Samson | Navy (Patriot) | C | 0 | 0 | None |
316 | Liatu King | Pittsburgh (ACC) | F | 0 | 0 | None |
318 | Ashlee Lewis | Cal St. Fullerton (Big West) | C | 0 | 0 | None |
319 | Jillian Archer | St. John's (NY) (Big East) | F | 0 | 0 | None |
322 | Kylee Watson | Notre Dame (ACC) | F | 0 | 0 | None |
326 | Caitlin Staley | Western Ky. (CUSA) | F | 0 | 0 | None |
332 | Sacha Washington | Vanderbilt (SEC) | F | 0 | 0 | None |
340 | Adreanna Waddle | Prairie View (SWAC) | F | 0 | 0 | None |
348 | River Baldwin | NC State (ACC) | C | 0 | 0 | None |
353 | Zyheima Swint | Hofstra (CAA) | C | 0 | 0 | None |
355 | Phillipina Kyei | Oregon (Pac-12) | C | 0 | 0 | None |
358 | Jasmin Dixon | Northwestern St. (Southland) | F | 0 | 0 | None |
361 | Shamarre Hale | Austin Peay (ASUN) | C | 0 | 0 | None |
366 | Jadyn Donovan | Duke (ACC) | G | 0 | 0 | None |
367 | Melyia Grayson | Southern Miss. (Sun Belt) | C | 0 | 0 | None |
379 | ZiKeyah Carter | Chicago St. (DI Independent) | C | 0 | 0 | None |
380 | Meghan Downing | ETSU (SoCon) | F | 0 | 0 | None |
394 | Makayla Minett | Denver (Summit League) | F | 0 | 0 | None |
401 | Ugonne Onyiah | California (Pac-12) | F | 0 | 0 | None |
402 | Bella Murekatete | Washington St. (Pac-12) | C | 0 | 0 | None |
409 | Zoe McCrary | Col. of Charleston (CAA) | F | 0 | 0 | None |
411 | Ashlee Locke | Mercer (SoCon) | F | 0 | 0 | None |
414 | Kyra Wood | Syracuse (ACC) | F | 0 | 0 | None |
420 | Laura Bello | Idaho St. (Big Sky) | F | 0 | 0 | None |
424 | Khalis Cain | UNC Greensboro (SoCon) | F | 0 | 0 | None |
428 | Hannah Noveroske | Toledo (MAC) | C | 0 | 0 | None |
431 | Yaubryon Chambers | Tennessee Tech (OVC) | F | 0 | 0 | None |
433 | Sierra McCullough | Eastern Ky. (ASUN) | F | 0 | 0 | None |
437 | India Howard | North Ala. (ASUN) | F | 0 | 0 | None |
440 | Kayla Clark | Bethune-Cookman (SWAC) | C | 0 | 0 | None |
443 | Katlyn Manuel | ULM (Sun Belt) | F | 0 | 0 | None |
445 | Jordana Reisma | Cleveland St. (Horizon) | F | 0 | 0 | None |
462 | Carys Roy | Saint Peter's (MAAC) | C | 0 | 0 | None |
468 | Clarice Akunwafo | Southern California (Pac-12) | C | 0 | 0 | None |
526 | Carter McCray | Northern Ky. (Horizon) | C | 0 | 0 | None |
548 | Ajae Petty | Kentucky (SEC) | F | 0 | 0 | None |
664 | Ellie Mitchell | Princeton (Ivy League) | F | 0 | 0 | None |
682 | Teneisia Brown | FDU (NEC) | F | 0 | 0 | None |
686 | Alancia Ramsey | Coastal Carolina (Sun Belt) | F | 0 | 0 | None |
715 | Chyna Cornwell | Rutgers (Big Ten) | F | 0 | 0 | None |
722 | Kennedy Taylor | Missouri St. (MVC) | F | 0 | 0 | None |
726 | Kseniia Kozlova | James Madison (Sun Belt) | C | 0 | 0 | None |
727 | Simone Cunningham | Old Dominion (Sun Belt) | G | 0 | 0 | None |
729 | Otaifo Esenabhalu | Longwood (Big South) | F | 0 | 0 | None |
755 | Sedayjha Payne | Morgan St. (MEAC) | G | 0 | 0 | None |
We can see that for all of the rows where THREE_POINT_PERCENTAGE is None, the three-point baskets made and attempted values are zero. If we manually calculated the three-point percentage for these rows, we would get NaN results instead of None due to the division by zero. NaN and None are different values, but we've confirmed that there is no issue with the underlying data. The None values will automatically be replaced with NaNs later in this process, so we'll leave these values as None for now and move on to the Position field.
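You can verify the division-by-zero behavior with a tiny standalone example; pandas yields NaN rather than raising an error:

import pandas as pd

made = pd.Series([0, 3])
attempts = pd.Series([0, 8])
print(made / attempts)  # 0/0 -> NaN, 3/8 -> 0.375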
Handle Missing Positions
Let’s see how many rows are missing a Position value.
player_data[player_data['Position'].isna()][['PLAYER_NAME', 'Team', 'Position']]
PLAYER_NAME | Team | Position | |
---|---|---|---|
148 | Ally Becki | Ball St. (MAC) | NaN |
359 | Marie Kiefer | Ball St. (MAC) | NaN |
709 | Ava Uhrich | Southern Utah (WAC) | NaN |
843 | Madelyn Bischoff | Ball St. (MAC) | NaN |
For each column with missing values, it can be helpful to evaluate whether a missing or null value is even plausible. In this case, it doesn't make much sense for a basketball player to be missing a position (guard, forward, center), since every player is assigned a standard position on their team. Knowing that, we can assume that our current dataset is missing these values in error, and the entries should be updated.
We’ll be using an external data source to look up the correct positions for each of these players. If there were more rows with a missing Position value, then we might acquire another dataset and combine it with the current one. However, there are only four rows here, so we’ll look up the values manually.
You can use several external data sources, but here's an example using ESPN's website. Search for the player's name and school to pull up the player's individual page (for example, Ally Becki at Ball St.). On this page, you can see the player's team, class, height, and jersey number, as well as an F, C, or G for the position (in this example, the position is listed as G for Guard).
With that, we can locate and set the correct value of the Position field for each of the four players.
player_data.loc[player_data['PLAYER_NAME'] == 'Ally Becki', 'Position'] = 'G'
player_data.loc[player_data['PLAYER_NAME'] == 'Marie Kiefer', 'Position'] = 'F'
player_data.loc[player_data['PLAYER_NAME'] == 'Ava Uhrich', 'Position'] = 'F'
player_data.loc[player_data['PLAYER_NAME'] == 'Madelyn Bischoff', 'Position'] = 'G'
We can confirm that these changes were applied correctly by re-pulling the rows with a missing Position value.
player_data[player_data['Position'].isna()][['PLAYER_NAME', 'Team', 'Position']]
PLAYER_NAME | Team | Position |
---|
There are no results, so we've now handled all of the values caught by the isna() method. This is a great start, but you'll often find datasets with missing or incorrect values that are not caught by the isna() method.
Handle Incorrect Values
Depending on the size of your dataset, the unique() method can be a good way to find additional missing or incorrectly encoded values. Let's try that out on the Height column.
Handle Incorrect Heights
Using the unique() method, we can see one height value that is not valid: 0-0.
player_data['Height'].unique()
array(['5-7', '5-8', '5-2', '5-9', '5-6', '6-0', '5-10', '5-3', '5-11',
'5-5', '5-4', '6-2', '6-1', '6-3', '6-4', '6-6', '6-5', '6-7',
'6-8', '0-0'], dtype=object)
Let's pull all of the rows where the Height is set to 0-0.
player_data[player_data['Height'].eq('0-0')]
PLAYER_NAME | Team | Class | Height | Position | PLAYER_ID | TEAM_NAME | GAMES | MINUTES_PLAYED | FIELD_GOALS_MADE | ... | FREE_THROW_PERCENTAGE | OFFENSIVE_REBOUNDS | DEFENSIVE_REBOUNDS | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
709 | Ava Uhrich | Southern Utah (WAC) | Fr. | 0-0 | F | ncaaw.p.115529 | Southern Utah | 29 | 877 | 151 | ... | 68.4 | 65 | 149 | 214 | 47 | 55 | 22 | 19 | 76 | 383 |
823 | Payton Hull | Abilene Christian (WAC) | Fr. | 0-0 | G | ncaaw.p.112709 | Abilene Christian | 29 | 837 | 155 | ... | 75.3 | 33 | 67 | 100 | 59 | 87 | 47 | 8 | 71 | 444 |
2 rows × 27 columns
There are only two rows with missing heights, so we can manually look up and update these values using the same method as the missing Position values.
player_data.loc[player_data['PLAYER_NAME'] == 'Ava Uhrich', 'Height'] = '6-0'
player_data.loc[player_data['PLAYER_NAME'] == 'Payton Hull', 'Height'] = '5-11'
We can double-check that this worked properly, just as we did for the Position field.
player_data[player_data.eq('0-0').any(axis=1)]
PLAYER_NAME | Team | Class | Height | Position | PLAYER_ID | TEAM_NAME | GAMES | MINUTES_PLAYED | FIELD_GOALS_MADE | ... | FREE_THROW_PERCENTAGE | OFFENSIVE_REBOUNDS | DEFENSIVE_REBOUNDS | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS |
---|
0 rows × 27 columns
Handle Incorrect Classes
We can apply the same process used for Height to the Class field. Since there’s only one player with an incorrect class, this is a quick update.
# Inspect the unique Class values; this reveals one invalid '---' entry
player_data['Class'].unique()

# Identify the affected player, then look up and set the correct class
player_data[player_data['Class'] == '---']
player_data.loc[player_data['PLAYER_NAME'] == 'Ayanna Khalfani', 'Class'] = 'Sr.'

# Confirm that only valid class abbreviations remain
player_data['Class'].unique()
array(['Jr.', 'So.', 'Sr.', 'Fr.'], dtype=object)
Now that we’ve handled all of the empty and incorrect values, we’re ready to move on to other data preprocessing steps.
Data Preprocessing
The goal of this step is to make sure our dataset is consistent and suitable for analysis. This step will change quite a bit depending on the exact project. For this project, we'll be setting column datatypes, converting units, and substituting abbreviated values. For advanced machine learning projects, data preprocessing often includes additional steps like feature scaling, normalizing certain features, and encoding categorical values.
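We won't need those advanced steps today, but for context, here's a small standalone sketch of what feature scaling and categorical encoding can look like in plain pandas (toy values, not part of this project's pipeline):

import pandas as pd

df = pd.DataFrame({'points': [398, 519, 230], 'position': ['G', 'F', 'C']})

# Feature scaling: standardize a numeric column to zero mean, unit variance
df['points_scaled'] = (df['points'] - df['points'].mean()) / df['points'].std()

# Encoding: expand a categorical column into one-hot indicator columns
df = pd.get_dummies(df, columns=['position'])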
Data Type Conversion
Data type conversion is a fundamental step in preparing a dataset for analysis. Specifically, we often encounter situations where numbers are stored as strings or objects instead of their native numerical formats (e.g., integers or floats). In this step, let’s take a closer look at the initial data type of each column and identify opportunities to convert these values into more suitable datatypes.
player_data.dtypes
PLAYER_NAME object
Team object
Class object
Height object
Position object
PLAYER_ID object
TEAM_NAME object
GAMES object
MINUTES_PLAYED object
FIELD_GOALS_MADE object
FIELD_GOAL_ATTEMPTS object
FIELD_GOAL_PERCENTAGE object
THREE_POINTS_MADE object
THREE_POINT_ATTEMPTS object
THREE_POINT_PERCENTAGE object
FREE_THROWS_MADE object
FREE_THROW_ATTEMPTS object
FREE_THROW_PERCENTAGE object
OFFENSIVE_REBOUNDS object
DEFENSIVE_REBOUNDS object
TOTAL_REBOUNDS object
ASSISTS object
TURNOVERS object
STEALS object
BLOCKS object
FOULS object
POINTS object
dtype: object
We can see that all of the columns currently have the object data type, which is not ideal for numeric columns like number of minutes played or field goals made.
Convert Numeric-only Columns
There are a few reasons to convert columns from the object data type to a numeric one. The main reason is that most mathematical operations (including numpy methods and the sums, averages, etc. taken during aggregation steps) only work on numeric data types. Plotting and visualization libraries also expect numeric data types, so you may not be able to create most charts or graphs (aside from categorical ones like bar charts) unless specific columns use a numeric data type. Storing numbers as strings can also give unexpected results, such as sorting the values 1, 2, and 11 in the order 1, 11, 2. For large datasets, it's worth noting that storing numbers as strings can also take up more memory and be slower than using native numeric data types.
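You can see the sorting pitfall for yourself with a couple of lines:

import pandas as pd

# Lexicographic (string) sort vs. numeric sort
print(pd.Series(['1', '2', '11']).sort_values().tolist())  # ['1', '11', '2']
print(pd.Series([1, 2, 11]).sort_values().tolist())        # [1, 2, 11]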
Now that we know why we would want to convert numeric columns to numeric data types, let’s start by defining which columns we expect to store only numeric data.
numeric_columns = ['GAMES', 'MINUTES_PLAYED', 'FIELD_GOALS_MADE', 'FIELD_GOAL_ATTEMPTS',
'FIELD_GOAL_PERCENTAGE', 'THREE_POINTS_MADE', 'THREE_POINT_ATTEMPTS',
'THREE_POINT_PERCENTAGE', 'FREE_THROWS_MADE', 'FREE_THROW_ATTEMPTS',
'FREE_THROW_PERCENTAGE', 'OFFENSIVE_REBOUNDS', 'DEFENSIVE_REBOUNDS',
'TOTAL_REBOUNDS', 'ASSISTS', 'TURNOVERS', 'STEALS', 'BLOCKS', 'FOULS', 'POINTS']
Now that we have this list of columns, we can use the pandas to_numeric() method to convert these columns to the appropriate numeric data types. We can confirm that it worked by re-checking the data type of each column.
player_data[numeric_columns] = player_data[numeric_columns].apply(pd.to_numeric)
player_data.dtypes
PLAYER_NAME object
Team object
Class object
Height object
Position object
PLAYER_ID object
TEAM_NAME object
GAMES int64
MINUTES_PLAYED int64
FIELD_GOALS_MADE int64
FIELD_GOAL_ATTEMPTS int64
FIELD_GOAL_PERCENTAGE float64
THREE_POINTS_MADE int64
THREE_POINT_ATTEMPTS int64
THREE_POINT_PERCENTAGE float64
FREE_THROWS_MADE int64
FREE_THROW_ATTEMPTS int64
FREE_THROW_PERCENTAGE float64
OFFENSIVE_REBOUNDS int64
DEFENSIVE_REBOUNDS int64
TOTAL_REBOUNDS int64
ASSISTS int64
TURNOVERS int64
STEALS int64
BLOCKS int64
FOULS int64
POINTS int64
dtype: object
All of the defined numeric columns were properly converted to either integers or floats. The percentage fields are now stored as floats, while the rest of them are integers.
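One caveat worth knowing: by default, to_numeric() raises an error if it encounters a value it cannot parse, which is a useful signal that something unexpected is in the data. If you'd rather convert anyway and investigate afterwards, a variant of the call above with errors='coerce' turns unparseable values into NaN:

# Variant of the conversion above: unparseable values become NaN
# instead of raising a ValueError
player_data[numeric_columns] = player_data[numeric_columns].apply(pd.to_numeric, errors='coerce')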
We can also revisit the None values in the THREE_POINT_PERCENTAGE column to see if they were impacted by the datatype conversion.
player_data[player_data['THREE_POINT_PERCENTAGE'].isna()][['PLAYER_NAME', 'Team', 'Position', 'THREE_POINTS_MADE', 'THREE_POINT_ATTEMPTS', 'THREE_POINT_PERCENTAGE']].head()
PLAYER_NAME | Team | Position | THREE_POINTS_MADE | THREE_POINT_ATTEMPTS | THREE_POINT_PERCENTAGE | |
---|---|---|---|---|---|---|
245 | Caitlin Weimar | Boston U. (Patriot) | F | 0 | 0 | NaN |
250 | Abby Muse | Boise St. (Mountain West) | F | 0 | 0 | NaN |
254 | Emily Bowman | Samford (SoCon) | C | 0 | 0 | NaN |
257 | Christina Dalce | Villanova (Big East) | F | 0 | 0 | NaN |
258 | Emily Saunders | Youngstown St. (Horizon) | F | 0 | 0 | NaN |
So the to_numeric() method automatically handled the None values in the three-point percentage column and converted them to NaNs. With the numeric columns converted, we can move on to the columns that should contain only text.
Convert Text-only Columns
In pandas, both the object and string data types can be used to represent text. However, there is a slight difference between them:
- object dtype - This is the default dtype and is a catch-all for any non-numeric data. It can hold any Python object, including strings, lists, dictionaries, etc. When a column contains multiple data types or when pandas cannot guess the data type, it defaults to the object dtype. While this is more flexible, operations on columns with object dtype may be slower compared to columns with the string data type. The pandas documentation outlines reasons why it’s better to use the string dtype for storing text-only data.
- string dtype - This dtype specifically represents strings. Columns with this dtype contain only string data (not a mixture of strings and other dtypes), which improves readability and allows for better dtype-specific operations like the select_dtypes() method. Additionally, with the string dtype, you can use some string-specific methods and functions directly on the column without the need for explicit type conversions.
Let’s start by identifying which columns should contain only string data.
string_columns = ['PLAYER_NAME', 'Team', 'Class', 'Height', 'Position', 'PLAYER_ID', 'TEAM_NAME']
We can use the pandas astype() method to convert object columns to string columns. However, you should watch out for values with other data types when performing this conversion. For example, any missing values (such as NaN) will be converted to the string "NaN". We've already replaced all of the missing values in these columns, so this is no longer a concern. However, we do not want other non-string values (numbers, arrays, dictionaries, etc.) to be converted either, so it's best to check for those before performing the conversion. For this dataset, we do not expect these columns to have any non-string values, but it's a good idea to look at a sample of the data in each column just to be safe.
player_data[string_columns].sample(10)
PLAYER_NAME | Team | Class | Height | Position | PLAYER_ID | TEAM_NAME | |
---|---|---|---|---|---|---|---|
400 | Precious Johnson | Duquesne (Atlantic 10) | Sr. | 6-4 | C | ncaaw.p.61337 | Duquesne |
385 | Meghan O'Brien | Lehigh (Patriot) | Jr. | 6-1 | F | ncaaw.p.66792 | Lehigh |
770 | Carmen Villalobos | Cleveland St. (Horizon) | Sr. | 6-1 | G | ncaaw.p.61574 | Cleveland St. |
180 | Ashley Hawkins | Gardner-Webb (Big South) | Jr. | 5-6 | G | ncaaw.p.100658 | Gardner-Webb |
821 | Natalie Picton | Montana St. (Big Sky) | Fr. | 5-5 | G | ncaaw.p.112319 | Montana St. |
273 | Jada Tiggett | N.C. Central (MEAC) | Fr. | 6-2 | G | ncaaw.p.113273 | N.C. Central |
710 | Nyla McGill | Yale (Ivy League) | Jr. | 5-8 | G | ncaaw.p.67510 | Yale |
200 | Keshuna Luckett | Jackson St. (SWAC) | Sr. | 5-8 | G | ncaaw.p.61415 | Jackson St. |
577 | Bri McDaniel | Maryland (Big Ten) | So. | 5-10 | G | ncaaw.p.70818 | Maryland |
649 | Halli Poock | Bradley (MVC) | Fr. | 5-4 | G | ncaaw.p.112767 | Murray St. |
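If you want a stricter check than eyeballing a sample, you can also inspect the underlying Python type of every value (a quick sketch; anything other than str here would deserve a closer look before converting):

# Show the distinct Python types present in each text column
print(player_data[string_columns].apply(lambda col: col.map(type).unique()))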
This looks good, so we’ll go ahead and convert the data type for these columns and confirm the change was made.
player_data[string_columns] = player_data[string_columns].astype('string')
player_data[string_columns].dtypes
PLAYER_NAME string[python]
Team string[python]
Class string[python]
Height string[python]
Position string[python]
PLAYER_ID string[python]
TEAM_NAME string[python]
dtype: object
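As a quick confirmation of the dtype-specific benefit mentioned earlier, the select_dtypes() method can now pick out these columns directly:

# Select only the columns stored with the string dtype
player_data.select_dtypes(include='string').columns.tolist()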
We’re finished converting data types, so it’s time to move on to value substitution.
Value Substitution
Value substitution (or value mapping) refers to replacing one value with another. One of the most common uses for this is to replace abbreviations with their full values (such as replacing "Fr." with "Freshman"). For machine learning datasets, value substitution can also include mapping categorical values to numeric or array representations (such as one-hot encodings).
The first value substitution we'll make is swapping the abbreviated position names for their full values. The Position field in this dataset refers to where the player typically plays on the court (guard, forward, or center).
Let’s see what Position values are currently in our dataset.
player_data['Position'].unique()
<StringArray>
['G', 'F', 'C']
Length: 3, dtype: string
So this dataset uses the three position groups instead of the five specific positions. To make the values more readable, let's create a mapping between the current single-letter values and the full-length position names.
position_names = {
'F': 'Forward',
'G': 'Guard',
'C': 'Center'
}
Now all we have to do is use the pandas replace() method to apply this mapping to the entire series.
player_data['Position'] = player_data['Position'].replace(position_names)
player_data['Position'].unique()
<StringArray>
['Guard', 'Forward', 'Center']
Length: 3, dtype: string
We can use the same process to substitute the values in the Class column as well.
class_names = {
'Fr.': 'Freshman',
'So.': 'Sophomore',
'Jr.': 'Junior',
'Sr.': 'Senior'
}
player_data['Class'] = player_data['Class'].replace(class_names)
player_data['Class'].unique()
<StringArray>
['Junior', 'Sophomore', 'Senior', 'Freshman']
Length: 4, dtype: string
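One design note on replace(): it leaves any value not found in the mapping untouched, whereas the similar map() method turns unmapped values into NaN. Here's a tiny sketch of the difference:

import pandas as pd

s = pd.Series(['Fr.', 'So.', '??'])
print(s.replace({'Fr.': 'Freshman', 'So.': 'Sophomore'}))  # '??' is left as-is
print(s.map({'Fr.': 'Freshman', 'So.': 'Sophomore'}))      # '??' becomes NaN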
There are more substitutions that we could make (such as replacing “St.” with “State” in the team name), but those are the only ones we’ll make today. Next, let’s look at unit conversions.
Unit Conversion
The only unit conversion we'll be doing today is to convert the feet-inches notation in the Height column to the total number of inches. This means that a height of five feet, three inches is currently stored as 5-3, but will be converted to the total number of inches, 63. We can define a function to convert the individual Height values:
def height_to_inches(height):
    # Convert a 'feet-inches' string (e.g., '5-3') to total inches (e.g., 63)
    feet, inches = map(int, height.split('-'))
    return feet * 12 + inches
This function splits the values in the Height column on the hyphen and stores the number of feet and inches in separate variables. It then multiplies the number of feet by 12 (the number of inches per foot) and adds the number of inches to get the player's total height in inches.
Now that we have this function, we can apply it to the Height column and verify that it worked by checking the unique values.
player_data['Height'] = player_data['Height'].apply(height_to_inches)
player_data['Height'].unique()
array([67, 68, 62, 69, 66, 72, 70, 63, 71, 65, 64, 74, 73, 75, 76, 78, 77,
79, 80])
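As an aside, the same conversion could have been done without apply() using pandas string methods. Here's a vectorized sketch; note it's an alternative to the step we just ran, not something to run in addition, since Height now holds integers:

# Split 'feet-inches' strings into two integer columns, then combine
parts = player_data['Height'].str.split('-', expand=True).astype(int)
player_data['Height'] = parts[0] * 12 + parts[1]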
That wraps up everything needed for data preprocessing, and our dataset is ready for feature engineering and machine learning.
If you're going to use a new Jupyter notebook or Python script for the next part of this series, then it's a good idea to export this dataset. As a reminder, you can use the to_csv() method instead of to_excel() if you prefer.
player_data.to_excel('player_data_clean.xlsx', index=False)
Wrap up
In today’s guide, we laid the groundwork for data analysis by cleaning and preprocessing the combined player data. In the next article, we’ll expand upon this dataset by engineering a few new features and training a machine learning model. In the final installment of this series, we’ll identify relationships between various parameters and create meaningful visualizations.