import pandas as pd
import numpy as np
2. Introduction to Pandas: Series and DataFrames
What is Pandas?
Pandas is a Python library for working with tabular data. The name is derived from “panel data” (PANel DAta).
Pandas is like a programmable spreadsheet. Programmers use it to wrangle data (sort, filter, clean, enhance, etc.).
Pandas Series and DataFrame
The two fundamental components of Pandas are the Series and the DataFrame:
- a Series is a list of values with labels. This creates a column of data.
- a DataFrame is a collection of Series. This creates a table of data.
Null / No Value
The constant np.nan is used to represent “no value”.
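A short sketch of how np.nan behaves may help here. It assumes only NumPy and Pandas; the variable names are illustrative. The point is that NaN cannot be found with ==, so a dedicated check such as pd.isna() is needed.

```python
import numpy as np
import pandas as pd

missing = np.nan

# NaN is an ordinary Python float
is_float = isinstance(missing, float)

# NaN never compares equal to anything, itself included
equals_itself = (missing == missing)

# Because == cannot detect NaN, use pd.isna() to test for it
detected = pd.isna(missing)

print(is_float, equals_itself, detected)
```

Because NaN is a float, a Series of whole numbers that contains a NaN is stored as float64.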
Series
A Series is a named list of values.
The Series also has an index to reference each value. The default index is zero-based, similar to a Python list.
grades = pd.Series(data=[100,80,90,np.nan,100], name="Midterm Grades")
grades
0 100.0
1 80.0
2 90.0
3 NaN
4 100.0
Name: Midterm Grades, dtype: float64
# Get the value at index 2
grades[2]
np.float64(90.0)
The index can be anything. Here are the same grades with student names as the index.
grades2 = pd.Series( data=[100,80,90,np.nan,100],
name="Midterm Grades",
index=["Alice", "Bob", "Charlie", "David", "Eve"])
grades2
Alice 100.0
Bob 80.0
Charlie 90.0
David NaN
Eve 100.0
Name: Midterm Grades, dtype: float64
# Get Charlie's grade
grades2["Charlie"]
np.float64(90.0)
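As a hedged illustration of the two lookup styles, the sketch below rebuilds grades2 and reads the same value by label and by position. It uses the Series accessors .loc (label-based) and .iloc (position-based); the variable names by_label, by_position, and several are illustrative.

```python
import numpy as np
import pandas as pd

grades2 = pd.Series(
    data=[100, 80, 90, np.nan, 100],
    name="Midterm Grades",
    index=["Alice", "Bob", "Charlie", "David", "Eve"],
)

by_label = grades2.loc["Charlie"]          # look up by index label
by_position = grades2.iloc[2]              # look up by ordinal position (0-based)
several = grades2.loc[["Alice", "Eve"]]    # a list of labels returns a Series

print(by_label, by_position)
print(several)
```

Spelling out .loc or .iloc makes it explicit whether a key is meant as a label or as a position, which matters once the index is no longer the default 0, 1, 2, ...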
Series Vectorized Functions
Like NumPy arrays, you can perform element-wise mathematical operations on Pandas series without needing for loops (i.e., vectorization). For example:
# add 5 points to all the grades
grades3 = grades2 + 5
print(grades3)
Alice 105.0
Bob 85.0
Charlie 95.0
David NaN
Eve 105.0
Name: Midterm Grades, dtype: float64
# square the grades
gradesq = grades2**2
print(gradesq)
Alice 10000.0
Bob 6400.0
Charlie 8100.0
David NaN
Eve 10000.0
Name: Midterm Grades, dtype: float64
# add two series together
grades4 = grades2 + grades3
print(grades4)
Alice 205.0
Bob 165.0
Charlie 185.0
David NaN
Eve 205.0
Name: Midterm Grades, dtype: float64
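When two series are added, Pandas matches values by index label rather than by position. The sketch below uses two small made-up series (midterm and final are hypothetical names and values) whose indexes only partly overlap, to show what happens to labels that appear in just one of them.

```python
import pandas as pd

# Hypothetical data: the two indexes overlap only on Alice and Charlie
midterm = pd.Series([100, 80, 90], index=["Alice", "Bob", "Charlie"])
final = pd.Series([95, 85, 70], index=["Charlie", "Alice", "Dana"])

# Values are paired up by label; labels missing from either side give NaN
total = midterm + final
print(total)
```

Here Alice and Charlie each get a sum, while Bob and Dana come out as NaN because one of the two series has no value for them.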
Pandas series also have a number of vectorized methods that you can call on the series themselves, again like NumPy arrays. Some examples:
print("Highest grade:", grades.max())
print("Average grade:", grades.mean())
print("lowest grade:", grades.min())
print("Sum of grades:", grades.sum())
print("Count of grades", grades.count())
Highest grade: 100.0
Average grade: 92.5
lowest grade: 80.0
Sum of grades: 370.0
Count of grades 4
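The results above reflect that these methods skip missing values by default: the average is taken over four grades, not five. A minimal sketch of that behavior, using the same grades series and the skipna parameter of mean():

```python
import numpy as np
import pandas as pd

grades = pd.Series(data=[100, 80, 90, np.nan, 100], name="Midterm Grades")

n_values = len(grades)                    # counts every slot, NaN included
n_present = grades.count()                # counts only the non-null values
mean_default = grades.mean()              # NaN is skipped by default
mean_strict = grades.mean(skipna=False)   # with skipna=False the result is NaN

print(n_values, n_present, mean_default, mean_strict)
```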
Other Series Functions
We can use the unique() method to return the distinct values in the series, with duplicates removed.
The value_counts() method counts how often each value occurs, creating a new series where the index is the value and the value is the count. By default it leaves out missing values.
For example, consider the following series:
votes = pd.Series(data=[ 'y','y','y','n','y',np.nan,'n','n','y'], name="Vote")
print("deduplicate the votes:", votes.unique())
print("counts by value:", votes.value_counts())
deduplicate the votes: ['y' 'n' nan]
counts by value: Vote
y 5
n 3
Name: count, dtype: int64
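Note that unique() reports nan while value_counts() does not. The sketch below, on the same votes series, shows a few related options: the dropna and normalize parameters of value_counts() and the nunique() method. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

votes = pd.Series(data=['y', 'y', 'y', 'n', 'y', np.nan, 'n', 'n', 'y'], name="Vote")

counts_all = votes.value_counts(dropna=False)   # also counts the missing vote
shares = votes.value_counts(normalize=True)     # fractions of the non-missing votes
n_distinct = votes.nunique()                    # number of distinct non-missing values

print(counts_all)
print(shares)
print(n_distinct)
```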
Comparison to NumPy
In many ways, you can think of a Pandas series as being like a NumPy array (in fact, series are built on top of NumPy arrays). It even has similar performance. For example:
a = np.arange(1000000)
aseries = pd.Series(a)
%timeit a.mean()
537 μs ± 807 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit aseries.mean()
549 μs ± 1.05 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
However, unlike NumPy arrays, Pandas series can only be one-dimensional. Example:
# 2D NumPy array? No problem!
a = np.ones((1000, 1000))
print(a.shape)
(1000, 1000)
# 2D Pandas series? Nope!
aseries = pd.Series(a)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[15], line 2
      1 # 2D Pandas series? Nope!
----> 2 aseries = pd.Series(a)

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/series.py:584, in Series.__init__(self, data, index, dtype, name, copy, fastpath)
    582 data = data.copy()
    583 else:
--> 584 data = sanitize_array(data, index, dtype, copy)
    586 manager = _get_option("mode.data_manager", silent=True)
    587 if manager == "block":

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/construction.py:656, in sanitize_array(data, index, dtype, copy, allow_2d)
    653 subarr = cast(np.ndarray, subarr)
    654 subarr = maybe_infer_to_datetimelike(subarr)
--> 656 subarr = _sanitize_ndim(subarr, data, dtype, index, allow_2d=allow_2d)
    658 if isinstance(subarr, np.ndarray):
    659 # at this point we should have dtype be None or subarr.dtype == dtype
    660 dtype = cast(np.dtype, dtype)

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/construction.py:715, in _sanitize_ndim(result, data, dtype, index, allow_2d)
    713 if allow_2d:
    714 return result
--> 715 raise ValueError(
    716 f"Data must be 1-dimensional, got ndarray of shape {data.shape} instead"
    717 )
    718 if is_object_dtype(dtype) and isinstance(dtype, ExtensionDtype):
    719 # i.e. NumpyEADtype("O")
    721 result = com.asarray_tuplesafe(data, dtype=np.dtype("object"))

ValueError: Data must be 1-dimensional, got ndarray of shape (1000, 1000) instead
DataFrame
For 2D data, you use a Pandas DataFrame. A DataFrame is a table representation of data. It is the primary use case for pandas itself. A DataFrame is simply a collection of Series that share a common index. It’s like a programmable spreadsheet: it has rows and columns which can be accessed and manipulated with Python.
An example:
names = pd.Series( data = ['Allen','Bob','Chris','Dave','Ed','Frank','Gus'])
gpas = pd.Series( data = [4.0, np.nan, 3.4, 2.8, 2.5, 3.8, 3.0])
years = pd.Series( data = ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'])
series_dict = { 'Name': names, 'GPA': gpas, 'Year' : years } # dict of Series, keys are the series names
students = pd.DataFrame( series_dict )
students
| | Name | GPA | Year |
|---|---|---|---|
| 0 | Allen | 4.0 | So |
| 1 | Bob | NaN | Fr |
| 2 | Chris | 3.4 | Fr |
| 3 | Dave | 2.8 | Jr |
| 4 | Ed | 2.5 | Sr |
| 5 | Frank | 3.8 | Sr |
| 6 | Gus | 3.0 | Fr |
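To make the "collection of Series that share a common index" idea concrete, the sketch below rebuilds the same students table and inspects one of its columns. The names gpa_column and shares_index are illustrative.

```python
import numpy as np
import pandas as pd

names = pd.Series(data=['Allen', 'Bob', 'Chris', 'Dave', 'Ed', 'Frank', 'Gus'])
gpas = pd.Series(data=[4.0, np.nan, 3.4, 2.8, 2.5, 3.8, 3.0])
years = pd.Series(data=['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'])
students = pd.DataFrame({'Name': names, 'GPA': gpas, 'Year': years})

# Pulling out one column gives back a Series named after the column
gpa_column = students['GPA']
print(type(gpa_column).__name__, gpa_column.name)

# That Series carries the same index as the DataFrame it came from
shares_index = gpa_column.index.equals(students.index)
print(shares_index)

# shape reports (number of rows, number of columns)
print(students.shape)
```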
Other ways to create DataFrames:
# Lists of lists
pd.DataFrame([['Tom', 7], ['Mike', 15], ['Tiffany', 3]])
| | 0 | 1 |
|---|---|---|
| 0 | Tom | 7 |
| 1 | Mike | 15 |
| 2 | Tiffany | 3 |
# Dictionary
pd.DataFrame({"Name": ['Tom', 'Mike', 'Tiffany'], "Number": [7, 15, 3]})
| | Name | Number |
|---|---|---|
| 0 | Tom | 7 |
| 1 | Mike | 15 |
| 2 | Tiffany | 3 |
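The list-of-lists form above ends up with columns labeled 0 and 1. As a sketch of two further options, the code below names the columns with the columns parameter and also builds the same table from a list of dictionaries, one per row. The variable names are illustrative.

```python
import pandas as pd

# List of lists, with column names supplied explicitly
rows = [['Tom', 7], ['Mike', 15], ['Tiffany', 3]]
from_lists = pd.DataFrame(rows, columns=['Name', 'Number'])

# List of dictionaries: each dictionary is one row, keys become column names
records = [
    {'Name': 'Tom', 'Number': 7},
    {'Name': 'Mike', 'Number': 15},
    {'Name': 'Tiffany', 'Number': 3},
]
from_records = pd.DataFrame(records)

print(from_lists)
print(from_lists.equals(from_records))
```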
For more, see the Pandas documentation on DataFrames.
Accessing elements with indexing
You can access columns in the DataFrame using the names of the series, much in the same way you would a dictionary. For example:
students['GPA'] # select the GPA column by name
Allen 4.0
Bob NaN
Chris 3.4
Dave NaN
Ed 2.8
Frank 2.5
Name: GPA, dtype: float64
Since the columns of a DataFrame are Series, you can then access a particular value using the Series index. For example, since the Series data in studentsn were indexed by name, we can get Chris’s GPA by doing:
students['GPA']['Chris']
np.float64(3.4)
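A minimal sketch of this kind of lookup, assuming a name-indexed frame built with set_index('Name') from the seven-student data used earlier in this section. The variable by_name is a hypothetical name, and its values are those of that earlier table.

```python
import numpy as np
import pandas as pd

students = pd.DataFrame({
    'Name': ['Allen', 'Bob', 'Chris', 'Dave', 'Ed', 'Frank', 'Gus'],
    'GPA': [4.0, np.nan, 3.4, 2.8, 2.5, 3.8, 3.0],
    'Year': ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'],
})

# Move the Name column into the index so rows can be looked up by name
by_name = students.set_index('Name')

chained = by_name['GPA']['Chris']       # column first, then row label
direct = by_name.loc['Chris', 'GPA']    # row label and column in one step

print(chained, direct)
```

Both forms read the same value. The single .loc call is generally the safer habit, especially when assigning values, because it works on the frame directly rather than on an intermediate Series.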
Much like a Series is like a special NumPy array with fancy indexing (and other useful features), a DataFrame is like a special type of dictionary, with some extra features that make handling datasets much easier. In fact, as we’ll see below, a DataFrame is more like a cross between a dictionary and a NumPy array, which makes it excel at data wrangling. (pun intended)
Accessing elements with loc and iloc
The loc[index, col] and iloc[row_pos, col_pos] properties allow you to slice the dataframe. loc uses the index and column names, while iloc uses ordinal positions starting at zero.
Here are some examples, using studentsn:
# Examples using loc
print("loc: Get Chris's GPA: ", students.loc['Chris', 'GPA'])
print("loc: Get the Year of the last student (Frank): ", students.loc['Frank', 'Year'])
# Same examples using iloc
print("iloc: Get the GPA of the student at row 2 (Chris): ", students.iloc[2, 0])
print("iloc: Get the Year of the last student (Frank): ", students.iloc[-1, 1])
loc: Get Chris's GPA: 3.4
loc: Get the Year of the last student (Frank): Sr
iloc: Get the GPA of the student at row 2 (Chris): 3.4
iloc: Get the Year of the last student (Frank): Sr
# You can also slice using loc and iloc
print("loc: last two rows:\n", students.loc['Ed':, 'GPA':'Year'])
print()
print("iloc: last two rows:\n", students.iloc[-2:, 0:2])
loc: last two rows:
GPA Year
Ed 2.8 NaN
Frank 2.5 Sr
iloc: last two rows:
GPA Year
Ed 2.8 NaN
Frank 2.5 Sr
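One difference between the two slicing styles is easy to miss: a loc slice includes both endpoints, while an iloc slice excludes its end position, like a Python list. The sketch below assumes a name-indexed frame built from the seven-student data used earlier in this section; by_name is a hypothetical variable name.

```python
import numpy as np
import pandas as pd

by_name = pd.DataFrame({
    'Name': ['Allen', 'Bob', 'Chris', 'Dave', 'Ed', 'Frank', 'Gus'],
    'GPA': [4.0, np.nan, 3.4, 2.8, 2.5, 3.8, 3.0],
    'Year': ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'],
}).set_index('Name')

# Label slice: both 'Ed' and 'Frank' are included
loc_rows = by_name.loc['Ed':'Frank']

# Position slice: position 4 is included, position 5 is not
iloc_rows = by_name.iloc[4:5]

# To match the label slice, the position slice must run to 6
iloc_both = by_name.iloc[4:6]

print(list(loc_rows.index), list(iloc_rows.index), list(iloc_both.index))
```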
Null Checks
Use isna() to check for np.nan.
students[students.GPA.isna()]
| | GPA | Year |
|---|---|---|
| Bob | NaN | Fr |
| Dave | NaN | Jr |
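Beyond finding missing values, it is common to count, drop, or replace them. The sketch below uses the seven-student data from earlier in this section (an assumption; it has a single missing GPA) together with the Pandas methods isna(), dropna(), and fillna(). Filling with 0.0 is only an example choice, not a recommendation.

```python
import numpy as np
import pandas as pd

students = pd.DataFrame({
    'Name': ['Allen', 'Bob', 'Chris', 'Dave', 'Ed', 'Frank', 'Gus'],
    'GPA': [4.0, np.nan, 3.4, 2.8, 2.5, 3.8, 3.0],
    'Year': ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'],
})

missing_gpa = students[students['GPA'].isna()]        # rows with no GPA
n_missing = students['GPA'].isna().sum()              # True counts as 1
without_missing = students.dropna(subset=['GPA'])     # drop rows lacking a GPA
filled = students['GPA'].fillna(0.0)                  # replace NaN with a value

print(missing_gpa)
print(n_missing, len(without_missing))
```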
Basic Dataframe operations
- info() provides the names of columns, counts of non-null values in each column, and data types.
- describe() provides some basic statistics (min, max, mean, and quartiles) for each numerical column.
- head(n=5) views the FIRST n rows in the dataframe (defaults to 5).
- tail(n=5) views the LAST n rows in the dataframe (defaults to 5).
- sample(n=1) views n random rows from the dataframe (defaults to 1).
- columns retrieves a list of columns in the dataframe.
To illustrate this, we’ll load a comma-separated-value (CSV) file, customers.csv, that contains some customer data. We can load the file directly as a DataFrame using Pandas’ read_csv function. Notice that we can pass a URL to the function. We don’t need to download the file first; Pandas takes care of that for us under the hood!
customers = pd.read_csv('https://su-ist356-m003-fall-2025.github.io/course-home/04_data_wrangling/customers.csv')
print(customers)
 First Last Email Gender Last IP Address \
0 Al Fresco afresco@dayrep.com M 74.111.18.161
1 Abby Kuss akuss@rhyta.com F 23.80.125.101
2 Arial Photo aphoto@dayrep.com F 24.0.14.56
3 Bette Alott balott@rhyta.com F 56.216.127.219
4 Barb Barion bbarion@superrito.com F 38.68.15.223
5 Barry DeHatchett bdehatchett@dayrep.com M 23.192.215.78
6 Bill Melator bmelator@einrot.com M 24.11.125.10
7 Candi Cayne ccayne@rhyta.com F 24.39.14.15
8 Carol Ling cling@superrito.com F 23.180.242.66
9 Cam Rha crha@einrot.com M 24.1.25.140
10 Dan Delyons ddelyons@dayrep.com M 24.38.224.161
11 Erin Detyers edetyers@dayrep.com F 70.209.14.54
12 Euron Tasomthin etasomthin@superrito.com M 68.199.40.156
13 Justin Case jcase@dayrep.com M 23.192.215.44
14 Jean Poole jpoole@dayrep.com F 23.182.25.40
15 Lee Hvmeehom lhvmeehom@einrot.com F 215.82.23.2
16 Lisa Karfurless lkarfurless@dayrep.com F 172.189.252.8
17 Mary Melator mmelator@rhyta.com F 23.88.15.5
18 Mike Rofone mrofone@dayrep.com M 23.224.160.4
19 Oren Jouglad ojouglad@einrot.com M 128.122.140.238
20 Phil Meaup pmeaup@dayrep.com M 23.83.132.200
21 Rowan Deboat rdeboat@dayrep.com M 23.84.32.22
22 Ray Ovlight rovlight@dayrep.com M 74.111.18.59
23 Sara Bellum sbellum@superrito.com F 74.111.6.173
24 Sal Ladd sladd@superrito.com M 23.112.202.16
25 Seymour Ofewe sofewe@dayrep.com M 98.29.25.44
26 Ty Anott tanott@rhyta.com M 23.230.12.5
27 Tally Itupp titupp@superrito.com F 24.38.114.105
28 Tim Pani tpani@superrito.com M 23.84.132.226
29 Victor Rhee vrhee@einrot.com M 23.112.232.160
City State Total Orders Total Purchased Months Customer
0 Syracuse NY 1 45 1
1 Phoenix AZ 1 25 2
2 Newark NJ 1 680 1
3 Raleigh NC 6 560 18
4 Dallas TX 4 1590 1
5 Boston MA 1 15 6
6 Orem UT 9 6090 35
7 Portland ME 1 620 2
8 Syracuse NY 2 440 6
9 Chicago IL 0 0 1
10 Greenwich CT 2 2570 10
11 Tampa FL 5 1105 38
12 Hempstead NY 13 4630 28
13 Boston MA 3 1050 1
14 Kingston NY 7 185 12
15 Columbus OH 9 207 18
16 Fairfax VA 6 250 27
17 Los Angeles CA 8 4275 40
18 Cheyenne WY 0 0 0
19 New York NY 12 4500 36
20 Phoenix AZ 4 930 24
21 Topeka KS 1 3500 42
22 Syracuse NY 6 125 42
23 Alexandria VA 2 189 2
24 Rochester NY 14 594 10
25 Cleveland OH 9 1190 3
26 San Jose CA 1 50 3
27 Sea Cliff NY 11 380 42
28 Buffalo NY 0 0 1
29 Green Bay WI 0 0 2
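As a sketch of the basic operations listed earlier, the following uses a small in-memory CSV instead of the remote file, so it runs without network access. The five rows and four columns are copied from the customers output above; the names csv_text and sample are illustrative.

```python
import io
import pandas as pd

# A small excerpt of the customer data, held in a string instead of a file
csv_text = """First,Last,State,Total Orders
Al,Fresco,NY,1
Abby,Kuss,AZ,1
Arial,Photo,NJ,1
Bette,Alott,NC,6
Barb,Barion,TX,4
"""
sample = pd.read_csv(io.StringIO(csv_text))

sample.info()                    # column names, non-null counts, data types
print(sample.describe())         # statistics for the numeric column only
print(sample.head(2))            # first two rows
print(sample.tail(2))            # last two rows
print(sample.sample(n=1))        # one row chosen at random
print(list(sample.columns))      # column labels as a plain list
```

The same calls work unchanged on the full customers DataFrame loaded from the URL.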
Display the dataframe in Streamlit
You can use the st.dataframe() function to display a DataFrame in Streamlit.
Here is an example:
import streamlit as st
import pandas as pd
st.title("Dataframe Example")
customers = pd.read_csv('https://su-ist356-m003-fall-2025.github.io/course-home/04_data_wrangling/customers.csv')
st.dataframe(customers.head(20))
st.dataframe(customers.describe())
Selecting Rows and Columns
We can pare down the output of a dataframe by using:
- a list of column names to select columns.
- a boolean index to select matching rows.
data_dict = {
'Name': ['Allen','Bob','Chris','Dave','Ed','Frank','Gus'],
'GPA': [4.0, np.nan, 3.4, 2.8, 2.5, 3.8, 3.0],
'Year' : ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'] }
students = pd.DataFrame( data_dict )
students
| | Name | GPA | Year |
|---|---|---|---|
| 0 | Allen | 4.0 | So |
| 1 | Bob | NaN | Fr |
| 2 | Chris | 3.4 | Fr |
| 3 | Dave | 2.8 | Jr |
| 4 | Ed | 2.5 | Sr |
| 5 | Frank | 3.8 | Sr |
| 6 | Gus | 3.0 | Fr |
Selecting Columns
This example gets just the Name and GPA columns:
columns_to_show = ['Name', 'GPA']
students[columns_to_show]
| | Name | GPA |
|---|---|---|
| 0 | Allen | 4.0 |
| 1 | Bob | NaN |
| 2 | Chris | 3.4 |
| 3 | Dave | 2.8 |
| 4 | Ed | 2.5 |
| 5 | Frank | 3.8 |
| 6 | Gus | 3.0 |
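A related detail: what comes back depends on whether a single name or a list of names goes inside the brackets. A brief sketch on the same students data (the names one_name and one_column are illustrative):

```python
import numpy as np
import pandas as pd

students = pd.DataFrame({
    'Name': ['Allen', 'Bob', 'Chris', 'Dave', 'Ed', 'Frank', 'Gus'],
    'GPA': [4.0, np.nan, 3.4, 2.8, 2.5, 3.8, 3.0],
    'Year': ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'],
})

one_name = students['Name']        # a single name returns a Series
one_column = students[['Name']]    # a list, even of one name, returns a DataFrame

print(type(one_name).__name__, type(one_column).__name__)
print(one_column.shape)
```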
Getting the freshmen using a boolean index
Consider the following:
students['Year'] == 'Fr'
0 False
1 True
2 True
3 False
4 False
5 False
6 True
Name: Year, dtype: bool
This is called a boolean index. The boolean expression is evaluated for each row in the DataFrame. It’s similar to the boolean “mask” array we used for extracting values from an array in the NumPy unit.
When we apply the boolean index to the dataframe, only the rows where the index evaluates to True are returned.
students[students['Year'] == 'Fr']
| | Name | GPA | Year |
|---|---|---|---|
| 1 | Bob | NaN | Fr |
| 2 | Chris | 3.4 | Fr |
| 6 | Gus | 3.0 | Fr |
Likewise, we can assign these to variables for clarity:
only_freshmen_index = students['Year'] == 'Fr'
only_freshmen = students[only_freshmen_index]
only_freshmen
| | Name | GPA | Year |
|---|---|---|---|
| 1 | Bob | NaN | Fr |
| 2 | Chris | 3.4 | Fr |
| 6 | Gus | 3.0 | Fr |
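And, Or, and Not with Boolean indexes
What if we want freshmen or seniors? We cannot use the Python keyword or in this case; instead we must use the bitwise or operator. This is because the series contains multiple values, so the comparison has to be applied element by element.
Bitwise Operators
- and: &
- or: |
- not: ~
Note: parentheses () are required around each comparison that a bitwise operator combines, because these operators bind more tightly than comparisons such as ==.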
# freshmen and seniors
only_freshmen_seniors = (students['Year'] == 'Fr') | (students['Year'] == 'Sr')
students[only_freshmen_seniors]
| | Name | GPA | Year |
|---|---|---|---|
| 1 | Bob | NaN | Fr |
| 2 | Chris | 3.4 | Fr |
| 4 | Ed | 2.5 | Sr |
| 5 | Frank | 3.8 | Sr |
| 6 | Gus | 3.0 | Fr |
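The sketch below shows, on the same students data, what goes wrong with the keyword or, plus two related tools: the Series method isin() as a shorter way to test against several values, and ~ to invert a boolean index. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

students = pd.DataFrame({
    'Name': ['Allen', 'Bob', 'Chris', 'Dave', 'Ed', 'Frank', 'Gus'],
    'GPA': [4.0, np.nan, 3.4, 2.8, 2.5, 3.8, 3.0],
    'Year': ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'],
})

# The keyword `or` needs a single True/False, which a whole Series cannot give
try:
    bad = (students['Year'] == 'Fr') or (students['Year'] == 'Sr')
    or_failed = False
except ValueError as err:
    or_failed = True
    print("ValueError:", err)

# Element-wise or, written two ways
either = (students['Year'] == 'Fr') | (students['Year'] == 'Sr')
with_isin = students['Year'].isin(['Fr', 'Sr'])

# ~ flips every True/False: everyone who is neither a freshman nor a senior
neither = students[~with_isin]
print(neither)
```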
Putting it Together
Get the name and GPA of only the freshmen that have a GPA stored (i.e., for which the GPA is not NaN):
cols = ['Name', 'GPA']
fr_with_gpa = (students['Year'] == 'Fr') & (students['GPA'].notna())
students[fr_with_gpa][cols]
| | Name | GPA |
|---|---|---|
| 2 | Chris | 3.4 |
| 6 | Gus | 3.0 |
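The same selection can also be written as a single loc call, passing the boolean index for the rows and the column list for the columns. This is a sketch of an equivalent form on the same data, not a replacement for the version above; the names result and chained are illustrative.

```python
import numpy as np
import pandas as pd

students = pd.DataFrame({
    'Name': ['Allen', 'Bob', 'Chris', 'Dave', 'Ed', 'Frank', 'Gus'],
    'GPA': [4.0, np.nan, 3.4, 2.8, 2.5, 3.8, 3.0],
    'Year': ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'],
})

cols = ['Name', 'GPA']
fr_with_gpa = (students['Year'] == 'Fr') & (students['GPA'].notna())

# Rows and columns selected in one step
result = students.loc[fr_with_gpa, cols]

# The two-step form, for comparison
chained = students[fr_with_gpa][cols]

print(result)
print(result.equals(chained))
```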