Today we’ll go over how to start a new data science project by acquiring the data (using APIs and a CSV export). This is the first part of a series that will walk through the entire process of a data science project - from initial steps like data acquisition, cleaning, and preprocessing to more advanced steps like feature engineering, machine learning, and creating visualizations.

The dataset we’ll be using in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season. Here’s a brief description of each major step that we’ll go through for this project:

the steps for this data science project

  1. Data Acquisition - This initial step involves obtaining data from two sources: (1) exporting the NCAA’s online individual player statistics report and (2) making API requests to the Yahoo Sports endpoint.
  2. Data Cleaning - This step focuses on identifying and correcting any errors within the dataset. This includes removing duplicates, correcting inaccuracies, and handling missing data.
  3. Data Preprocessing - This step ensures the data is suitable for analysis by converting datatypes, standardizing units, and replacing abbreviations.
  4. Feature Engineering - This step involves selecting and expanding upon the dataset’s features (or columns). This includes calculating additional metrics from existing columns.
  5. Machine Learning - This step focuses on training a machine learning model to identify the combination of individual player statistics that correlates with optimal performance.
  6. Creating Visualizations - This step involves identifying the relationships between various parameters (such as height and blocked shots) and generating meaningful visualizations (such as bar charts, scatterplots, and candlestick charts).

We’ll use Python along with the popular pandas and requests libraries to accomplish these tasks efficiently. By the end of this series, you’ll be equipped with the skills needed to gather raw data from online sources, structure it into a usable format, eliminate any inconsistencies and errors, create meaningful visualizations, and train a basic machine learning model. Due to the size of this project, we’ll start today with just the first step: data acquisition.

Getting Started

First, let’s cover what you’ll need if you want to follow along with this guide. If you already have a Python environment up and running and are familiar with how to install packages, then feel free to skip to the next section.

Requirements

Before we get started, you should have:

  • A computer with the appropriate access level to install and remove programs.
    • This guide uses a Linux distribution (specifically Ubuntu), but this code can work on any major OS with a few minor changes.
  • A reliable internet connection to download the necessary software and make API requests.
  • A text editor or IDE to create and edit program files.
  • Basic programming knowledge is a plus. If you’ve never used Python before, then going through the beginner’s guide first might be helpful.

Dependencies

This project depends on having Python, a package manager (such as pip), and the relevant packages (listed below) installed before moving forward.

Python

This project depends on Python, which is probably already installed on your computer if you’re using a common OS. You can verify if Python is installed and which version is currently being used by running:

$ python3 --version
Python 3.10.12

If you get a response that says command not found instead of a version number, then Python is not installed. In that case, you can follow Python’s official installation instructions. You can also try running python --version to see if you have Python 2 installed instead of Python 3, but ultimately you should install Python 3 for this project.
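You can also check the version from within Python itself, which is handy if you have more than one installation on your machine:

import sys

# Print the version string of the interpreter running this code
print(sys.version)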

Package Manager (pip)

You’ll also need a package manager to install dependencies. I’ve used pip, but this works with any package manager. You can check if pip is installed by running:

$ pip --version
pip 24.0 from /home/scoops/.local/lib/python3.10/site-packages/pip (python 3.10)

Similar to the command for Python, if you get a response that says command not found instead of a version number, then pip might not be installed on your device. You should follow the official instructions to install pip (or your package manager of choice) before proceeding.

Specific Packages

You should install the following packages and modules (you can install all of the third-party packages with the single command shown after this list):

  • pandas - a powerful data manipulation library, essential for handling structured data. If you’re using pip, then you can install pandas by running pip install pandas.
  • requests - a library used for making HTTP requests, which will be useful for fetching data from online sources.
  • json - a standard library module for encoding and decoding JSON data, facilitating the interchange of data between a Python program and external systems or files. It ships with Python, so there’s nothing to install.
  • os - a standard library module providing a portable way of interacting with the operating system, enabling functionalities such as file management, directory manipulation, and environment variable handling. It also ships with Python.
  • numpy - a fundamental package for scientific computing in Python, providing support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.
  • openpyxl - a useful library for reading from, writing to, and modifying Excel files. This library is optional for today’s project - you can write the dataframes to .csv files (using .to_csv()) instead of .xlsx (using .to_excel()) files to skip this requirement.
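If you’re using pip, you can install all four third-party packages in a single command (json and os ship with Python, so there’s nothing to install for those):

$ pip install pandas requests numpy openpyxl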

Once those packages have been installed, we can start writing the program by importing them.

import pandas as pd
import requests
import json
import os
import numpy as np
import openpyxl 

Now that we’ve ensured the necessary dependencies are installed, it’s time to acquire the data.

Identifying Datasets

Before diving into the process of gathering basketball player statistics for the 2023-2024 NCAA women’s basketball season, let’s briefly look at the two datasets we’ll be acquiring and a rough overview of the process to get each one:

  1. Player Information Dataset
    • This dataset will include player information such as name, height, position, team name, and class.
    • To obtain this data, we’ll navigate to the NCAA website’s basketball statistics section. From there, we’ll select the desired season, division, and reporting week to access the statistics. After selecting “All Statistics for Individual,” we’ll click on “Show report CSV” to generate and download the dataset in CSV format.
  2. Player Statistics Dataset
    • This dataset will include player statistics and individual results for the latest season, including points scored, field goals made, blocks, steals, assists, etc.
    • To obtain this dataset, we’ll be making multiple API requests to the Yahoo Sports API. Each request will pull the top players for a given statistic (such as points, blocks, assists, etc.) and then those results will be combined into one dataset.

Once we have those two datasets, the next step will be to merge them to have one comprehensive dataset that includes player information (such as height, position, class, etc.) as well as the player’s total statistics (such as total points scored, blocks made, steals made, etc.). The final dataset can then be used for analysis and to generate visualizations.

Acquiring Player Information Data

As described above, the player information data can be obtained from the NCAA website’s basketball statistics section. You can write a small web scraper in Python to scrape this data automatically, but today we’ll walk through how to manually download the data. Starting on the NCAA’s website, the page should look like this:

NCAA basketball statistics website

From this page, we’ll select the desired season, which is the most recent 2023-2024 season:

select the proper season

Once you click “view”, it should open a new tab with a page that looks like this:

select division and statistics

We can now select the division (which will be Division 1), the reporting week (which will be the final week of the season), and the statistics (which will be all individual player statistics). With those selections made, we can click “Show Report (CSV - Spreadsheet)” to get the data in CSV format.

download csv results

With the CSV downloaded, the only step left is to import the data (after deleting all blank and duplicate rows).

player_info = pd.read_csv("2024-04-07 NCAA WBB-Div1 Player Info.csv", encoding='latin1')
player_info.head()
Player Team Class Height Position
0 Kiara Jackson UNLV (Mountain West) Jr. 5-7 G
1 Raven Johnson South Carolina (SEC) So. 5-8 G
2 Gina Marxen Montana (Big Sky) Sr. 5-8 G
3 McKenna Hofschild Colorado St. (Mountain West) Sr. 5-2 G
4 Kaylah Ivey Boston College (ACC) Jr. 5-8 G
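As an alternative to deleting the blank and duplicate rows by hand before importing, you can handle them in pandas right after reading the file. Here’s a minimal sketch (assuming the same CSV as above):

player_info = pd.read_csv("2024-04-07 NCAA WBB-Div1 Player Info.csv", encoding='latin1')

# Drop rows where every value is missing, then drop exact duplicates
player_info = player_info.dropna(how='all').drop_duplicates().reset_index(drop=True)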

We can see how many rows and columns are in this dataset using the .shape attribute.

player_info.shape
(1009, 5)

This dataset contains 1,009 rows and 5 columns. Here’s a quick description of each column:

  • Player - The name of the basketball player.
  • Team - The name of the basketball team the player is associated with, including the conference they belong to.
  • Class - The academic year of the player (freshman, sophomore, junior, senior).
  • Height - The height of the player in feet-inches.
  • Position - The playing position of the player on the basketball court (guard, forward, center).
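Before saving, it’s worth a quick structural check. Here’s a short sketch (the exact output will depend on your CSV export):

# Column datatypes and non-null counts
player_info.info()

# Distinct values in the categorical columns
print(player_info['Position'].unique())
print(player_info['Class'].unique())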

Now that we’re done with this dataset, let’s save a copy of it. As a reminder, you can use the .to_csv() method instead of .to_excel() if you prefer.

player_info.to_excel('player_info.xlsx', index=False)

Acquiring Player Statistics Data

As mentioned before, the player statistics data can be obtained by making API requests to the Yahoo Sports API. There are a variety of API endpoints available for sports data, but today we’ll be using the Yahoo Sports API since it is free and contains all of the player statistics in one convenient result.

Explore the API Endpoint

You can view this data in a table format on Yahoo’s website, which can help identify particular columns (such as “G” standing for “Games Played”).

Yahoo Sports basketball statistics webpage

Tools such as Postman can be extremely helpful for fine-tuning the API request and viewing the response data. Postman’s “View Code” button displays the API request as ready-to-run Python code (in this case, using the requests library) in the right sidebar.

Postman API request with code sidebar

Write a Function to Request Data

Here’s an example of making an API request to the Yahoo Sports data using the Python requests library (copied from Postman).

import requests

url = "https://graphite-secure.sports.yahoo.com/v1/query/shangrila/seasonStatsBasketballTotal?lang=en-US&region=US&tz=America/New_York&ysp_redesign=1&ysp_platform=desktop&season=2023&league=ncaaw&leagueStructure=ncaaw.struct.div.1&count=200&sortStatId=FREE_THROWS_MADE&positionIds=&qualified=TRUE"

payload = {}
headers = {}

response = requests.request("GET", url, headers=headers, data=payload)

print(response.text)

Depending on your use case, you might not need to make any modifications to this code. However, for this project, we’ll want to request data for multiple statistics, so we will make a few changes to improve usability. First, we’ll separate all of the query string parameters into a dictionary to be passed into the requests.get() function. Then, we’ll put the request into a convenience function with basic error handling.

def get_data_for_stat(stat_name, season='2023', league='ncaaw', count='500'):
    """
    Retrieve basketball player statistics for a specified statistical category (stat_name) for a given season and league.

    Parameters:
    - stat_name (str): Specifies the statistical category for which data is requested.
    - season (str): Specifies the season for which data is requested (default is '2023').
    - league (str): Specifies the league for which data is requested (default is 'ncaaw' for NCAA Women's Basketball).
    - count (str): Specifies the maximum number of data entries to retrieve (default is '500').

    Returns:
    - dict: JSON response containing player statistics for the specified statistical category.
    """
    url = "https://graphite-secure.sports.yahoo.com/v1/query/shangrila/seasonStatsBasketballTotal"
    params = {
        'lang': 'en-US',
        'region': 'US',
        'tz': 'America/New_York',
        'ysp_redesign': '1',
        'ysp_platform': 'desktop',
        'season': season,
        'league': league,
        'leagueStructure': f'{league}.struct.div.1',
        'count': count,
        'sortStatId': stat_name,
        'positionIds': '',
        'qualified': 'FALSE'
    }
    try:
        # Send GET request to the API endpoint with specified parameters
        response = requests.get(url, params)
        response.raise_for_status()  # Raise exception for 4xx and 5xx status codes
        return response.json()
    except requests.exceptions.RequestException as e:
        # Handle any errors encountered during the API request
        print(f'Error requesting data for {stat_name}: {e}')
        return None

This get_data_for_stat() function retrieves basketball player statistics for the top players in a specified statistical category (stat_name) for a given season (2023) and league (ncaaw). At a high level, here’s an overview of how it works:

  1. Function Arguments
    • stat_name - specifies the statistic to pull the top players (points scored, field goals made, etc.)
    • season - specifies the starting year of the desired season (set to ‘2023’ by default to get data for the 2023-2024 season)
    • league - specifies the league (set to ‘ncaaw’ for NCAA Women’s Basketball by default)
    • count - specifies the maximum number of players to retrieve (set to ‘500’ by default)
  2. API Endpoint and Parameters
    • The function begins by setting the base URL for the Yahoo Sports API endpoint for basketball season statistics in the url variable.
    • Next, the function uses the params variable to store various query string parameters for the API request (such as language, region, time zone, etc.)
  3. Making the API Request
    • The function uses the requests.get() method to send a GET request to the API endpoint with the specified parameters.
    • It checks the response status code for errors using response.raise_for_status(). If the response indicates an error (4xx or 5xx status code), it raises an exception.
    • If the request is successful, the function returns the JSON data obtained from the API response.
  4. Error Handling
    • The function includes error handling using a try-except block to catch any exceptions that may occur during the API request process.
    • If an exception occurs (e.g., network issues, invalid response), an error message is printed, and the function returns None.

You could further modify this code to fit your needs and add additional parameters to the function arguments (such as language or region), but we’ll use this function to retrieve the top players for a given statistical category. Here’s an example of retrieving the top 500 players (the default count is set to 500 in the function) by total points scored during the season:

example_response = get_data_for_stat('POINTS')

Since the function returns data in JSON format, we can take a look at the structure and identify a few values of interest.

  • example_response['data']['statTypes'] stores the syntax and sort order for all of the available statistics
  • example_response['data']['leagues'][0]['leaders'] stores the result set for our request (the ‘leaders’ or top players for the given request)

Let’s print the available statistic types as well as the first result for the top player by points scored.

example_response.keys()
example_response['data']['statTypes']
example_response['data']['leagues'][0]['leaders'][0]
    {'player': {'displayName': 'Caitlin Clark',
      'playerId': 'ncaaw.p.64550',
      'team': {'displayName': 'Iowa',
       'abbreviation': 'IOWA',
       'teamLogo': {'url': 'https://s.yimg.com/iu/api/res/1.2/c1eT0fjpIOp9jIlg5xiq0w--~B/YXBwaWQ9c2hhcmVkO2ZpPWZpbGw7cT0xMDA7aD0xMjg7dz0xMjg-/https://s.yimg.com/cv/apiv2/default/ncaab/20181211/500x500/Iowa.png',
        'height': 128,
        'width': 128}},
      'alias': {'url': 'https://sports.yahoo.com/ncaaw/players/64550/'},
      'playerCutout': None},
     'stats': [{'statId': 'GAMES', 'value': '32'},
      {'statId': 'MINUTES_PLAYED', 'value': '1088'},
      {'statId': 'FIELD_GOALS_MADE', 'value': '332'},
      {'statId': 'FIELD_GOAL_ATTEMPTS', 'value': '719'},
      {'statId': 'FIELD_GOAL_PERCENTAGE', 'value': '46.2'},
      {'statId': 'THREE_POINTS_MADE', 'value': '168'},
      {'statId': 'THREE_POINT_ATTEMPTS', 'value': '437'},
      {'statId': 'THREE_POINT_PERCENTAGE', 'value': '38.4'},
      {'statId': 'FREE_THROWS_MADE', 'value': '188'},
      {'statId': 'FREE_THROW_ATTEMPTS', 'value': '219'},
      {'statId': 'FREE_THROW_PERCENTAGE', 'value': '85.8'},
      {'statId': 'OFFENSIVE_REBOUNDS', 'value': '10'},
      {'statId': 'DEFENSIVE_REBOUNDS', 'value': '224'},
      {'statId': 'TOTAL_REBOUNDS', 'value': '234'},
      {'statId': 'ASSISTS', 'value': '282'},
      {'statId': 'TURNOVERS', 'value': '151'},
      {'statId': 'STEALS', 'value': '55'},
      {'statId': 'BLOCKS', 'value': '17'},
      {'statId': 'FOULS', 'value': '61'},
      {'statId': 'POINTS', 'value': '1020'}]}

This is correctly showing Caitlin Clark as the top-scoring player of the 2023-2024 season, so we’re ready to move on to the next step.
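Before moving on, one quick aside: the identifiers under data.statTypes are the same names we’ll pass as stat_name later. Here’s a sketch for listing them, assuming each entry carries a statId field like the stats shown above (adjust if your response is structured differently):

# List the stat identifiers available for the sortStatId parameter
for stat_type in example_response['data']['statTypes']:
    print(stat_type.get('statId', stat_type))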

Write a Function to Format Data

Now that the API request function is working, let’s define another function. This one will extract the relevant data and load it into a dataframe.

def format_response_data(response_data):
    """
    Process and format the JSON response data obtained from the Yahoo Sports API into a pandas DataFrame.

    Parameters:
    - response_data (dict): JSON response data obtained from the Yahoo Sports API, containing player statistics.

    Returns:
    - DataFrame: Pandas DataFrame containing formatted player statistics.
    """
    if not response_data:
        return None
    try:
        # Extract relevant data from the JSON response
        response_data = response_data['data']['leagues'][0]['leaders']
        data = []
        for item in response_data:
            # Extract player details
            player_details = {
                'PLAYER_NAME': item['player']['displayName'],
                'PLAYER_ID': item['player']['playerId'],
                'TEAM_NAME': item['player']['team']['displayName']
            }
            # Extract player statistics
            player_stats = {stat['statId']: stat['value'] for stat in item['stats']}
            # Combine player details and statistics into a single dictionary
            player_row = {**player_details, **player_stats}
            data.append(player_row)
        # Convert the list of dictionaries into a pandas DataFrame
        return pd.DataFrame(data)
    except KeyError as e:
        # Handle any errors encountered during data formatting
        print(f'Error formatting response data: {e}')
        return None

This format_response_data() function extracts the player statistics from the Yahoo Sports API response data (response_data) and formats it into a pandas DataFrame. Here’s a summary of how it works:

  1. Function Arguments
    • response_data - the JSON object containing player statistics from the Yahoo Sports API.
  2. Response Data Formatting
    • The function begins by checking if the response_data is empty. If it is, the function returns None.
    • It then attempts to access specific nested fields within the JSON data to extract relevant information.
    • Inside the try block, the function extracts the player details (name, ID, team name) and player statistics from the JSON data.
    • It iterates through each player in the response data, creating a dictionary (player_row) containing player details and statistics.
    • The player details include ‘PLAYER_NAME’, ‘PLAYER_ID’, and ‘TEAM_NAME’, while the player statistics are dynamically extracted from the JSON data.
    • These player details and statistics are appended to a list (data).
    • Finally, the function constructs a pandas DataFrame from the list of dictionaries and returns it.
  3. Error Handling
    • The function includes error handling using a try-except block to catch any exceptions that may occur during the data formatting process.
    • If an exception occurs (e.g., missing or unexpected data structure), an error message is printed, and the function returns None.

By using this function in conjunction with the get_data_for_stat() function, we can efficiently retrieve, process, and format player statistics from the Yahoo Sports API into a structured DataFrame.

We can test out this function using the example_response from the previous step.

example_dataframe = format_response_data(example_response)
example_dataframe.head()
  PLAYER_NAME PLAYER_ID TEAM_NAME GAMES MINUTES_PLAYED FIELD_GOALS_MADE FIELD_GOAL_ATTEMPTS FIELD_GOAL_PERCENTAGE THREE_POINTS_MADE THREE_POINT_ATTEMPTS ... FREE_THROW_PERCENTAGE OFFENSIVE_REBOUNDS DEFENSIVE_REBOUNDS TOTAL_REBOUNDS ASSISTS TURNOVERS STEALS BLOCKS FOULS POINTS
0 Caitlin Clark ncaaw.p.64550 Iowa 32 1088 332 719 46.2 168 437 ... 85.8 10 224 234 282 151 55 17 61 1020
1 JuJu Watkins ncaaw.p.112021 USC 29 989 270 656 41.2 58 176 ... 84.6 52 161 213 96 120 72 45 78 801
2 Hannah Hidalgo ncaaw.p.112250 Notre Dame 31 1104 255 560 45.5 45 132 ... 78.3 25 175 200 170 109 145 3 86 725
3 Lucy Olsen ncaaw.p.67706 Iowa 30 1087 268 612 43.8 47 158 ... 80.9 30 114 144 115 73 57 18 72 697
4 Ta’Niya Latson ncaaw.p.70600 Florida St. 32 991 249 566 44.0 27 98 ... 85.2 17 118 135 128 98 50 13 53 680
5 rows × 23 columns

The middle of the dataframe is cut off, so let’s print the full list of columns in this dataframe as well:

example_dataframe.columns
Index(['PLAYER_NAME', 'PLAYER_ID', 'TEAM_NAME', 'GAMES', 'MINUTES_PLAYED',
       'FIELD_GOALS_MADE', 'FIELD_GOAL_ATTEMPTS', 'FIELD_GOAL_PERCENTAGE',
       'THREE_POINTS_MADE', 'THREE_POINT_ATTEMPTS', 'THREE_POINT_PERCENTAGE',
       'FREE_THROWS_MADE', 'FREE_THROW_ATTEMPTS', 'FREE_THROW_PERCENTAGE',
       'OFFENSIVE_REBOUNDS', 'DEFENSIVE_REBOUNDS', 'TOTAL_REBOUNDS', 'ASSISTS',
       'TURNOVERS', 'STEALS', 'BLOCKS', 'FOULS', 'POINTS'],
      dtype='object')

Here’s a short description of each column:

  • PLAYER_NAME: The name of the basketball player.
  • PLAYER_ID: The unique identifier for each basketball player.
  • TEAM_NAME: The name of the player’s basketball team.
  • GAMES: The number of games the player has played in. This does not include attending a game but not playing (such as when a player is injured).
  • MINUTES_PLAYED: The total number of minutes the player has been on the court. This is measured by the game clock, not in real time. A regulation game has 40 minutes of game time, consisting of four 10-minute quarters; overtime adds extra time beyond that.
  • FIELD_GOALS_MADE: The number of successful field goals made by the player. This includes both two-point and three-point field goals and does not include free throws.
  • FIELD_GOAL_ATTEMPTS: The total number of field goal attempts by the player.
  • FIELD_GOAL_PERCENTAGE: The percentage of successful field goals made by the player.
  • THREE_POINTS_MADE: The number of successful three-point baskets made by the player.
  • THREE_POINT_ATTEMPTS: The total number of three-point basket attempts by the player.
  • THREE_POINT_PERCENTAGE: The percentage of successful three-point baskets made by the player.
  • FREE_THROWS_MADE: The number of successful free throws made by the player.
  • FREE_THROW_ATTEMPTS: The total number of free throw attempts by the player.
  • FREE_THROW_PERCENTAGE: The percentage of successful free throws made by the player.
  • OFFENSIVE_REBOUNDS: The number of offensive rebounds grabbed by the player.
  • DEFENSIVE_REBOUNDS: The number of defensive rebounds grabbed by the player.
  • TOTAL_REBOUNDS: The total number of rebounds grabbed by the player.
  • ASSISTS: The number of assists made by the player.
  • TURNOVERS: The number of turnovers committed by the player.
  • STEALS: The number of steals made by the player.
  • BLOCKS: The number of opponents’ shots blocked by the player.
  • FOULS: The number of fouls committed by the player.
  • POINTS: The total number of points scored by the player.
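Note that the percentage columns are derived from the corresponding made/attempted columns, which gives us an easy sanity check. Keep in mind that the API returns values as strings, so they need to be converted before doing any math (we’ll handle this properly during preprocessing):

# Recompute Caitlin Clark's field goal percentage from makes and attempts
fgm = float(example_dataframe.loc[0, 'FIELD_GOALS_MADE'])     # '332' -> 332.0
fga = float(example_dataframe.loc[0, 'FIELD_GOAL_ATTEMPTS'])  # '719' -> 719.0
print(round(fgm / fga * 100, 1))  # 46.2, matching FIELD_GOAL_PERCENTAGE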

Let’s quickly verify that the dataframe has the proper number of rows using the .shape attribute. Since the request used the default count of 500, there should be 500 rows in this dataframe.

example_dataframe.shape
(500, 23)

This is correct, so we can move on to the next step.

Write a Function to Request and Format Data

We could stop here and use those two functions to make our API requests, but let’s define one more function that will request and format the data in one step.

def get_and_format_data_for_stat(stat_name, season='2023', league='ncaaw'):
    """
    Retrieve basketball player statistics for a specified statistical category (stat_name) for a given season and league
    and format the data into a pandas DataFrame.

    Parameters:
    - stat_name (str): Specifies the statistical category for which data is requested.
    - season (str): Specifies the season for which data is requested (default is '2023').
    - league (str): Specifies the league for which data is requested (default is 'ncaaw' for NCAA Women's Basketball).

    Returns:
    - DataFrame: Pandas DataFrame containing formatted player statistics for the specified statistical category.
    """
    # Retrieve player statistics for the specified statistical category, season, and league
    response_data = get_data_for_stat(stat_name, season, league)
    # Format the retrieved data into a pandas DataFrame
    return format_response_data(response_data)

This function combines the other two function calls and passes through the same parameters. In summary, it retrieves player statistics for a specified statistical category (stat_name) for a given season and league using the get_data_for_stat() function and then formats the data into a pandas DataFrame using the format_response_data() function. Since we’ll be sending multiple requests to the API, this function streamlines the entire API request and formatting into one command and simplifies the process of generating the player statistics dataset.

Request the Data for each Statistic

Now we’re ready to request data for each statistic. For this project, we’ll pull the list of top players by five statistics: points, assists, rebounds, blocks, and steals. Here’s what that looks like:

# Get and format data for each stat
points_top_players = get_and_format_data_for_stat('POINTS')
assists_top_players = get_and_format_data_for_stat('ASSISTS')
rebounds_top_players = get_and_format_data_for_stat('TOTAL_REBOUNDS')
blocks_top_players = get_and_format_data_for_stat('BLOCKS')
steals_top_players = get_and_format_data_for_stat('STEALS')

We now have five dataframes, and each one contains the top 500 players for the given statistic.
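As a side note, if you decide to pull more statistics later, a dictionary comprehension keeps this step concise. Here’s a sketch that produces the same five dataframes keyed by statistic name:

stats_to_pull = ['POINTS', 'ASSISTS', 'TOTAL_REBOUNDS', 'BLOCKS', 'STEALS']

# Map each statistic name to its dataframe of top players
top_players_by_stat = {stat: get_and_format_data_for_stat(stat) for stat in stats_to_pull}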

Combine Each Statistic into One Dataset

Since these dataframes contain the same columns with different data rows, we’ll combine all five into one single dataframe. The same player could show up in each dataframe (such as Caitlin Clark being a top player in both points and assists), so we’ll drop any duplicate rows and reset the index in this step as well.

# Combine the leaders for each stat into one df
player_stats = pd.concat([points_top_players, assists_top_players, rebounds_top_players,
                         blocks_top_players, steals_top_players],
                        ignore_index=True).drop_duplicates().reset_index(drop=True)
player_stats.head()
PLAYER_NAME PLAYER_ID TEAM_NAME GAMES MINUTES_PLAYED FIELD_GOALS_MADE FIELD_GOAL_ATTEMPTS FIELD_GOAL_PERCENTAGE THREE_POINTS_MADE THREE_POINT_ATTEMPTS ... FREE_THROW_PERCENTAGE OFFENSIVE_REBOUNDS DEFENSIVE_REBOUNDS TOTAL_REBOUNDS ASSISTS TURNOVERS STEALS BLOCKS FOULS POINTS
0 Caitlin Clark ncaaw.p.64550 Iowa 32 1088 332 719 46.2 168 437 ... 85.8 10 224 234 282 151 55 17 61 1020
1 JuJu Watkins ncaaw.p.112021 USC 29 989 270 656 41.2 58 176 ... 84.6 52 161 213 96 120 72 45 78 801
2 Hannah Hidalgo ncaaw.p.112250 Notre Dame 31 1104 255 560 45.5 45 132 ... 78.3 25 175 200 170 109 145 3 86 725
3 Lucy Olsen ncaaw.p.67706 Iowa 30 1087 268 612 43.8 47 158 ... 80.9 30 114 144 115 73 57 18 72 697
4 Ta'Niya Latson ncaaw.p.70600 Florida St. 32 991 249 566 44.0 27 98 ... 85.2 17 118 135 128 98 50 13 53 680

5 rows × 23 columns

We can see that this combined dataframe has the same number of columns as the individual dataframes, but will likely have fewer rows.

player_stats.shape
(1392, 23)

The final player statistics dataset does have fewer rows, since we dropped the duplicate rows. Now that we’re done with this dataset, let’s save it. As a reminder, you can use the .to_csv() method instead of .to_excel() if you prefer.

player_stats.to_excel('player_stats.xlsx', index=False)

With that, we’ve finished acquiring both of the datasets needed for this project and can combine them into one final dataset that includes both the player information and player statistics.

Combine Player Information and Statistics Datasets

Combining datasets can be easy or difficult, depending on what columns are shared by the two datasets. In this case, the Player column in the player_info dataframe closely matches the PLAYER_NAME column in the player_stats dataframe, so we’ll start by renaming one of those columns to match the other.

player_info.rename(columns={"Player": "PLAYER_NAME"}, inplace=True)
player_info.head()
PLAYER_NAME Team Class Height Position
0 Kiara Jackson UNLV (Mountain West) Jr. 5-7 G
1 Raven Johnson South Carolina (SEC) So. 5-8 G
2 Gina Marxen Montana (Big Sky) Sr. 5-8 G
3 McKenna Hofschild Colorado St. (Mountain West) Sr. 5-2 G
4 Kaylah Ivey Boston College (ACC) Jr. 5-8 G

Next, we’ll use the pandas merge() function to merge the two dataframes on the player’s name. This final player_data dataframe includes columns for the player’s information (height, position, class, etc.) as well as their statistics for the season (points, rebounds, blocks, assists, etc.).

player_data = pd.merge(player_info, player_stats, on=['PLAYER_NAME'], how='inner')
player_data.head()
PLAYER_NAME Team Class Height Position PLAYER_ID TEAM_NAME GAMES MINUTES_PLAYED FIELD_GOALS_MADE ... FREE_THROW_PERCENTAGE OFFENSIVE_REBOUNDS DEFENSIVE_REBOUNDS TOTAL_REBOUNDS ASSISTS TURNOVERS STEALS BLOCKS FOULS POINTS
0 Kiara Jackson UNLV (Mountain West) Jr. 5-7 G ncaaw.p.67149 UNLV 29 895 128 ... 75.0 27 102 129 135 42 31 5 47 323
1 Raven Johnson South Carolina (SEC) So. 5-8 G ncaaw.p.67515 South Carolina 30 823 98 ... 64.3 33 128 161 148 53 60 5 34 243
2 Gina Marxen Montana (Big Sky) Sr. 5-8 G ncaaw.p.57909 Montana 29 778 88 ... 72.4 6 54 60 111 38 16 2 26 297
3 McKenna Hofschild Colorado St. (Mountain West) Sr. 5-2 G ncaaw.p.60402 Colorado St. 29 1046 231 ... 83.5 6 109 115 211 71 36 4 34 654
4 Kaylah Ivey Boston College (ACC) Jr. 5-8 G ncaaw.p.64531 Boston Coll. 33 995 47 ... 60.7 12 45 57 186 64 36 1 48 143

5 rows × 27 columns

We’ve done an inner join to keep only the rows whose player name appears in both dataframes. For some projects, you might want to use an outer join or even a right or left join, but an inner join works well for this project. If you’re familiar with SQL, the how parameter is similar to the different types of SQL JOINs, and you can read about (and see examples of) the different how values in the pandas merge documentation. In this case, using an inner join means that we have fewer rows in the final dataset than we had in either of the other two dataframes.

player_data.shape
(900, 27)

So our final dataset has 900 rows and 27 columns. This is enough data for our project, so we’re ready to move on to the cleaning and preprocessing steps.
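As an optional check before exporting, you can see exactly which players the inner join dropped by using the indicator parameter:

# An outer join with indicator=True adds a _merge column showing each row's origin
check = pd.merge(player_info, player_stats, on=['PLAYER_NAME'], how='outer', indicator=True)

# Rows found in only one dataframe are the ones an inner join discards
print(check['_merge'].value_counts())
print(check.loc[check['_merge'] != 'both', 'PLAYER_NAME'].head())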

If you’re going to use a new Jupyter notebook / Python script for the next part of this series, then it’s a good idea to export this dataset. As a reminder, you can use the .to_csv() method instead of .to_excel() if you prefer.

player_data.to_excel('player_data_raw.xlsx', index=False)

Wrap up

In this guide, we built a new dataset by acquiring and then combining the NCAA women’s basketball player information dataset with the Yahoo Sports player statistics dataset. In the next part, we’ll lay the groundwork for data analysis by cleaning and preprocessing the combined player data. In future articles, we’ll expand upon this dataset by engineering a few new features, creating meaningful visualizations, and training a machine learning model.