6. Grouping data and creating pivot tables with Pandas

In this lesson you will learn how to group data together in a DataFrame and create summary statistics of the grouped data. You will also learn how to reshape a DataFrame, and create a pivot table.

For this lesson we will use the following data of (fake) exam scores:

import pandas as pd
exams_data = 'https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/exam-scores/exam-scores.csv'
exams = pd.read_csv(exams_data)
exams.sample(10)

	Class_Section	Exam_Version	Completion_Time	Made_Own_Study_Guide	Did_Exam_Prep Assignment	Studied_In_Groups	Student_Score	Percentage	Letter_Grade
62	M02	D	45	?	?	?	24	80.00%	B
64	M02	D	60	N	N	Y	24	80.00%	B
0	M01	A	20	N	N	Y	24	80.00%	B
7	M01	B	15	Y	Y	Y	26	86.70%	B+
44	M02	B	45	Y	Y	Y	25	83.30%	B
26	M01	D	60	Y	Y	N	19	63.30%	C-
39	M02	B	15	N	N	Y	19	63.30%	C-
46	M02	B	55	Y	N	N	17	56.70%	D
23	M01	D	35	?	?	?	13	43.30%	F
52	M02	C	40	Y	Y	N	16	53.30%	D

To follow along with the commands below, download the following scratch notebook to your ist356 directory, then open it with VS Code:

Download scratch-pandas-5.ipynb

This has the exams_data URL in it.

Group By

With large data sets we frequently want to group common data together and summarize those groups in some way. For example, you might want to know:

Average exam score by section
Number of students who took each exam
Average grade based on whether students studied in groups
Total completion time by letter grade

The Pandas groupby method allows us to quickly group data like this. Using it involves two steps:

Run the groupby method on a dataframe, giving it a column or list of columns to group the data by.
The groupby method returns an DataFrameGroupBy object. On its own, this object isn’t terrible informative. To get something useful, we then use the aggregate method (agg) to summarize the grouped data and display it as a DataFrame. When aggregating the data, we need to provide the name of an operation (sum, min, max, mean, std, quartile, count) by which to summarizie the data.

Here’s an example using our exams data:

# Example: Total number of exams take by section and the average score in each section:
exams_by_section = exams.groupby(by=['Class_Section']).agg({ 'Class_Section': 'count', 'Student_Score': 'mean' })
exams_by_section

	Class_Section	Student_Score
Class_Section
M01	29	23.000000
M02	36	22.527778

Here, we’ve chosen to group the exams by the class section. We’ve then produced a DataFrame in which the class sections are summarized by the number of students and their average score. The column names are a little confusing, however. Let’s rename them to make the output more readable:

exams_by_section = exams.groupby(by=['Class_Section']).agg({ 'Class_Section': 'count', 'Student_Score': 'mean' })
exams_by_section = exams_by_section.rename(columns={'Class_Section': 'Exam_Count', 'Student_Score': 'Average_Score'})
exams_by_section

	Exam_Count	Average_Score
Class_Section
M01	29	23.000000
M02	36	22.527778

Note that the grouped columns end up in the index method to add the grouped columns back as a column:

exams_by_section['Class_Section'] = exams_by_section.index
exams_by_section

	Exam_Count	Average_Score	Class_Section
Class_Section
M01	29	23.000000	M01
M02	36	22.527778	M02

Code Challenge 6.1

Create a Streamlit app that will load the exams csv file above and will allow the user to select one of the following: Made_Own_Study_Guide, Did_Exam_Prep Assignment, Studied_In_Groups. After the selection is made, display a dataframe that summarizes the count of students and the average student score for the selection.

Hint: For offering the selection, use Streamlit’s selectbox method.

Solution

import streamlit as st
import pandas as pd

st.title("Exam Scores")


options = ['Made_Own_Study_Guide', 'Did_Exam_Prep Assignment', 'Studied_In_Groups']

exams = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/exam-scores/exam-scores.csv')

option = st.selectbox('Select Exam:', options)

summary_df = exams.groupby(by=option).agg({'Class_Section': 'count', 'Student_Score' :'mean'})
summary_df = summary_df.rename(columns={'Class_Section': 'Student Count', 'Student_Score': 'Mean Score'})

st.dataframe(summary_df)

Pivot and Melt

Pivot and melt are inverse operations:

pivot makes “long” data “wide” moving rows into columns.
melt makes “wide” data “long” moving columns into rows. Basically, the opposite of pivot.

Note

These functions only move data, they are unable to summarize it.
The intersection of row/column must contain a single value. Multiple values under the same row/column will fail.

To set this up this example from exams let’s create a dataframe that summarizes the data. We will add the index columns back to the dataframe for clarity. Please note this is not something that needs to be done typically. We are just re-using the dataset for this example. (In fact, the following block of code is basically creating a pivot table from the exams data. We’ll see how to create pivot tables using a much simpler way below.)

# Get average scores by section and exam version:
avg_scores_by_section_and_version = exams.groupby(
    by=['Class_Section', 'Exam_Version']).agg({'Student_Score': 'mean'})

# add section and exam version back to dataframe
avg_scores_by_section_and_version['Class_Section'] = avg_scores_by_section_and_version.index.get_level_values('Class_Section')
avg_scores_by_section_and_version['Exam_Version'] = avg_scores_by_section_and_version.index.get_level_values('Exam_Version')
# reset the index
avg_scores_by_section_and_version = avg_scores_by_section_and_version.reset_index(drop=True)
#rename the Student_score to average score
avg_scores_by_section_and_version =  avg_scores_by_section_and_version.rename(columns={'Student_Score': 'Average_Score'})
#reorder the columns
avg_scores_by_section_and_version = avg_scores_by_section_and_version[['Class_Section', 'Exam_Version', 'Average_Score']]

#show
avg_scores_by_section_and_version

	Class_Section	Exam_Version	Average_Score
0	M01	A	25.428571
1	M01	B	23.571429
2	M01	C	23.714286
3	M01	D	19.750000
4	M02	A	22.400000
5	M02	B	23.222222
6	M02	C	21.777778
7	M02	D	22.750000

Pivot()

Let’s pivot this data two different ways:

exam_version_in_col - a pivot where the exam version is in the column
class_section_in_col - a pivot where the class section is in the column

exam_version_in_col = avg_scores_by_section_and_version.pivot(
    index='Class_Section', columns='Exam_Version', values='Average_Score')
exam_version_in_col

Exam_Version	A	B	C	D
Class_Section
M01	25.428571	23.571429	23.714286	19.75
M02	22.400000	23.222222	21.777778	22.75

class_section_in_col = avg_scores_by_section_and_version.pivot(
    index='Exam_Version', columns='Class_Section', values='Average_Score')
class_section_in_col

Class_Section	M01	M02
Exam_Version
A	25.428571	22.400000
B	23.571429	23.222222
C	23.714286	21.777778
D	19.750000	22.750000

Melt()

We will now melt the data back into its original shape. Melt requires:

id_vars=list list of columns which remain in the melt
var_name=str column name of the columns to unpivot
value_name column name of the values to unpivot

First, to get this this example to work, we need to add the index values as a column (here called Class_Section):

exam_version_in_col['Class_Section'] = exam_version_in_col.index
melted1 = exam_version_in_col.melt(id_vars=["Class_Section"],
                                   var_name="Exam_Version",
                                   value_name='Average_Score')
melted1

	Class_Section	Exam_Version	Average_Score
0	M01	A	25.428571
1	M02	A	22.400000
2	M01	B	23.571429
3	M02	B	23.222222
4	M01	C	23.714286
5	M02	C	21.777778
6	M01	D	19.750000
7	M02	D	22.750000

Doing the same with the class_section_in_col:

class_section_in_col['Exam_Version'] = class_section_in_col.index
melted2 = class_section_in_col.melt(id_vars=["Exam_Version"],
                                    var_name="Class_Section",
                                    value_name='Average_Score')
melted2

	Exam_Version	Class_Section	Average_Score
0	A	M01	25.428571
1	B	M01	23.571429
2	C	M01	23.714286
3	D	M01	19.750000
4	A	M02	22.400000
5	B	M02	23.222222
6	C	M02	21.777778
7	D	M02	22.750000

Pivot_table()

The pd.pivot_table() function combines a groupby() with a pivot(). Its intended for when you need to pivot and aggregate in the pivot, avoiding a lot of extra code such as adding indexes as columns (as we had to above).

Here’s the examples above, but with a pivot_table on the original examsdata. We can skip the processing building avg_scores_by_section_and_version because pivot_table() allows us to summarize data.

exam_version_in_col = exams.pivot_table(index='Class_Section',
                                        columns='Exam_Version',
                                        values='Student_Score',
                                        aggfunc='mean')
exam_version_in_col

Exam_Version	A	B	C	D
Class_Section
M01	25.428571	23.571429	23.714286	19.75
M02	22.400000	23.222222	21.777778	22.75

class_section_in_col = exams.pivot_table(index='Exam_Version',
                                         columns='Class_Section',
                                         values='Student_Score',
                                         aggfunc='mean')
class_section_in_col

Class_Section	M01	M02
Exam_Version
A	25.428571	22.400000
B	23.571429	23.222222
C	23.714286	21.777778
D	19.750000	22.750000

Code Challenge 6.2

Let’s build an interactive pivot table in streamlit! Using the exams csv file from above:

Add a selection widget (use Steamlit’s selectbox method) that lets the user select one of the following fields to put in the row of the pivot table:

::: {#c33fc35d .cell execution_count=13} {.python .cell-code} fields = ['Class_Section', 'Exam_Version', 'Made_Own_Study_Guide', 'Did_Exam_Prep Assignment', 'Studied_In_Groups','Letter_Grade'] :::
Add another selection widget that allows the user to select which field to display in the columns. Note: you will need to remove the field the user selected for the row in the list of options, else you’ll get an error if they select the same field twice.
Create another selection widget that allows the user to select what data to populate the pivot table with. The pivot table should display the average of the selected value. The options for the displayed data should be:
```
measures = ['Completion_Time','Student_Score']
```
Build the pivot table dataframe from the inputs. Use the average for the aggfunc
Display the pivot table.

Here’s a screen shot of what your app should like:

Bonus: Cache the exams data so that it is not reloaded every time you interact with the app! Refer back to the Session State section of the Streamlit tutorial for help.

Solution

import streamlit as st
import pandas as pd

st.title("Exam Scores Pivot Table")

# load the data: we'll only do this once, then cache
if 'exams' not in st.session_state:
    st.session_state.exams = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/exam-scores/exam-scores.csv')

# set up the options
fields = [ 'Class_Section', 'Exam_Version', 'Made_Own_Study_Guide', 'Did_Exam_Prep Assignment', 'Studied_In_Groups','Letter_Grade']
measures = ['Student_Score', 'Completion_Time']
row = st.selectbox('Display in Row:', fields)
fields.remove(row)
col = st.selectbox('Display in Column:', fields)
value = st.selectbox('Display Average of:', measures)

pivot_df = st.session_state.exams.pivot_table(index=row, columns=col, values=value, aggfunc='mean')
st.dataframe(pivot_df)