2. Introduction to Pandas: Series and DataFrames

What is Pandas?

Pandas is a Python library for working with tabular data. The name is derived from “panel data”, an econometrics term.

Pandas is like a programmable spreadsheet. Programmers use it to wrangle data (sort, filter, clean, enhance, etc.).

Pandas Series and DataFrame

The two fundamental components of Pandas are the Series and the DataFrame:

  • a Series is a one-dimensional list of values with labels. Think of it as a single column of data.
  • a DataFrame is a collection of Series. Think of it as a table of data.

Null / No Value

The constant np.nan (NumPy’s “not a number” value) is used to represent “no value”.

import pandas as pd
import numpy as np
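
As a small sketch of how np.nan behaves (the variable names here are illustrative and not part of the original notes): it is an ordinary floating-point value, it never compares equal to itself, and so it is usually detected with pd.isna() rather than with ==.

```python
import numpy as np
import pandas as pd

missing = np.nan

# NaN is a plain Python float, which is why a Series of integers
# that contains it ends up stored as float64
nan_is_float = isinstance(missing, float)

# NaN never compares equal to anything, including itself
equals_itself = (missing == missing)

# so use pd.isna() to test for it instead of ==
detected = pd.isna(missing)

print(nan_is_float, equals_itself, detected)
```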

Series

A Series is a named list of values.

A Series also has an index for referencing each value. The default index is zero-based, similar to a Python list.

grades = pd.Series(data=[100,80,90,np.nan,100], name="Midterm Grades")
grades
0    100.0
1     80.0
2     90.0
3      NaN
4    100.0
Name: Midterm Grades, dtype: float64
# Get the value at index 2
grades[2]
np.float64(90.0)

The index can be anything. Here are the same grades with student names as the index.

grades2 = pd.Series( data=[100,80,90,np.nan,100], 
                    name="Midterm Grades",
                    index=["Alice", "Bob", "Charlie", "David", "Eve"])
grades2
Alice      100.0
Bob         80.0
Charlie     90.0
David        NaN
Eve        100.0
Name: Midterm Grades, dtype: float64
# Get Charlie's grade
grades2["Charlie"]
np.float64(90.0)
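
With a labeled index, a value can be reached either by its label or by its ordinal position. The following is a minimal sketch using the Series accessors loc and iloc; the names by_label, by_position, and labels are only for illustration.

```python
import numpy as np
import pandas as pd

grades2 = pd.Series(
    data=[100, 80, 90, np.nan, 100],
    name="Midterm Grades",
    index=["Alice", "Bob", "Charlie", "David", "Eve"],
)

by_label = grades2.loc["Charlie"]   # look up by index label
by_position = grades2.iloc[2]       # look up by zero-based position
labels = list(grades2.index)        # the index itself is available as .index

print(by_label, by_position)
print(labels)
```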

Series Vectorized Functions

Like NumPy arrays, Pandas Series support element-wise mathematical operations without explicit for loops (i.e., vectorization). For example:

# add 5 points to all the grades
grades3 = grades2 + 5
print(grades3)
Alice      105.0
Bob         85.0
Charlie     95.0
David        NaN
Eve        105.0
Name: Midterm Grades, dtype: float64
# square the grades
gradesq = grades2**2
print(gradesq)
Alice      10000.0
Bob         6400.0
Charlie     8100.0
David          NaN
Eve        10000.0
Name: Midterm Grades, dtype: float64
# add two series together
grades4 = grades2 + grades3
print(grades4)
Alice      205.0
Bob        165.0
Charlie    185.0
David        NaN
Eve        205.0
Name: Midterm Grades, dtype: float64
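
One detail worth sketching: when two Series are added, values are matched by index label, not by position. In the example above both Series share the same labels, so this is invisible. The hypothetical midterm and final Series below have only partly overlapping labels to make the behavior visible.

```python
import pandas as pd

# Hypothetical data: only Bob and Charlie appear in both Series
midterm = pd.Series([100, 80, 90], index=["Alice", "Bob", "Charlie"])
final = pd.Series([95, 85, 70], index=["Bob", "Charlie", "Dana"])

# Labels present in only one Series produce NaN in the result
total = midterm + final
print(total)

# Series.add() with fill_value treats a missing entry as 0 instead
total_filled = midterm.add(final, fill_value=0)
print(total_filled)
```

Whether filling with 0 is appropriate depends on what a missing grade means in your data.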

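Pandas Series also have a number of vectorized methods that you can call on the Series themselves, again like NumPy arrays. By default these methods skip NaN values, which is why the count in the output below is 4 rather than 5. Some examples: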

print("Highest grade:", grades.max())
print("Average grade:", grades.mean())
print("lowest grade:", grades.min())
print("Sum of grades:", grades.sum())
print("Count of grades", grades.count())
Highest grade: 100.0
Average grade: 92.5
lowest grade: 80.0
Sum of grades: 370.0
Count of grades 4
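
The results above leave out the NaN entry. As a brief sketch of how to change that, the aggregation methods accept a skipna argument; the variable names below are illustrative.

```python
import numpy as np
import pandas as pd

grades = pd.Series(data=[100, 80, 90, np.nan, 100], name="Midterm Grades")

mean_skipping_nan = grades.mean()           # NaN ignored: 370 / 4
mean_with_nan = grades.mean(skipna=False)   # NaN propagates into the result
n_entries = len(grades)                     # counts every slot, including NaN
n_values = grades.count()                   # counts non-null values only

print(mean_skipping_nan, mean_with_nan, n_entries, n_values)
```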

Other Series Functions

The unique() method returns the distinct values in the Series, with duplicates removed.

The value_counts() method counts how many times each value occurs, creating a new Series where the index is the value and the value is the count. By default it leaves out NaN.

For example, consider the following Series:

votes = pd.Series(data=[ 'y','y','y','n','y',np.nan,'n','n','y'], name="Vote")
print("deduplicate the votes:", votes.unique())
print("counts by value:", votes.value_counts())
deduplicate the votes: ['y' 'n' nan]
counts by value: Vote
y    5
n    3
Name: count, dtype: int64
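
Note that NaN shows up in unique() but not in value_counts(). A short sketch of two value_counts() options that are often useful here, dropna and normalize (variable names are illustrative):

```python
import numpy as np
import pandas as pd

votes = pd.Series(data=['y', 'y', 'y', 'n', 'y', np.nan, 'n', 'n', 'y'], name="Vote")

# Include the missing vote as its own category
with_missing = votes.value_counts(dropna=False)

# Report proportions of the non-missing votes instead of raw counts
shares = votes.value_counts(normalize=True)

print(with_missing)
print(shares)
```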

Comparison to NumPy

In many ways, you can think of a Pandas series as being like a NumPy array (in fact, series are built on top of NumPy arrays). It even has similar performance. For example:

a = np.arange(1000000)
aseries = pd.Series(a)
%timeit a.mean()
537 μs ± 807 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit aseries.mean()
549 μs ± 1.05 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

However, unlike NumPy arrays, Pandas series can only be one dimensional. Example:

# 2D NumPy array? No problem!
a = np.ones((1000, 1000))
print(a.shape)
(1000, 1000)
# 2D Pandas series? Nope!
aseries = pd.Series(a)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[15], line 2
      1 # 2D Pandas series? Nope!
----> 2 aseries = pd.Series(a)

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/series.py:584, in Series.__init__(self, data, index, dtype, name, copy, fastpath)
    582         data = data.copy()
    583 else:
--> 584     data = sanitize_array(data, index, dtype, copy)
    586     manager = _get_option("mode.data_manager", silent=True)
    587     if manager == "block":

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/construction.py:656, in sanitize_array(data, index, dtype, copy, allow_2d)
    653             subarr = cast(np.ndarray, subarr)
    654             subarr = maybe_infer_to_datetimelike(subarr)
--> 656 subarr = _sanitize_ndim(subarr, data, dtype, index, allow_2d=allow_2d)
    658 if isinstance(subarr, np.ndarray):
    659     # at this point we should have dtype be None or subarr.dtype == dtype
    660     dtype = cast(np.dtype, dtype)

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/construction.py:715, in _sanitize_ndim(result, data, dtype, index, allow_2d)
    713     if allow_2d:
    714         return result
--> 715     raise ValueError(
    716         f"Data must be 1-dimensional, got ndarray of shape {data.shape} instead"
    717     )
    718 if is_object_dtype(dtype) and isinstance(dtype, ExtensionDtype):
    719     # i.e. NumpyEADtype("O")
    721     result = com.asarray_tuplesafe(data, dtype=np.dtype("object"))

ValueError: Data must be 1-dimensional, got ndarray of shape (1000, 1000) instead
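
Two common ways around this error, sketched here on a small array rather than the 1000 x 1000 one above: flatten the array to one dimension before building a Series, or keep both dimensions by building a DataFrame instead.

```python
import numpy as np
import pandas as pd

a = np.ones((3, 4))

flat = pd.Series(a.ravel())   # flatten to 1D first: 12 values
table = pd.DataFrame(a)       # or keep rows and columns in a DataFrame

print(flat.shape)
print(table.shape)
```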

DataFrame

For 2D data, you use a Pandas DataFrame. A DataFrame is a table representation of data. It is the primary use case for pandas itself. A DataFrame is simply a collection of Series that share a common index. It’s like a programmable spreadsheet: it has rows and columns which can be accessed and manipulated with Python.

An example:

names = pd.Series( data = ['Allen','Bob','Chris','Dave','Ed','Frank','Gus'])
gpas = pd.Series( data = [4.0, np.nan, 3.4, 2.8, 2.5, 3.8, 3.0])
years = pd.Series( data = ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'])
series_dict = { 'Name':  names, 'GPA': gpas, 'Year' : years }  # dict of Series, keys are the series names
students = pd.DataFrame( series_dict )
students
    Name  GPA Year
0  Allen  4.0   So
1    Bob  NaN   Fr
2  Chris  3.4   Fr
3   Dave  2.8   Jr
4     Ed  2.5   Sr
5  Frank  3.8   Sr
6    Gus  3.0   Fr

Other ways to create DataFrames:

# Lists of lists
pd.DataFrame([['Tom', 7], ['Mike', 15], ['Tiffany', 3]])
         0   1
0      Tom   7
1     Mike  15
2  Tiffany   3
# Dictionary
pd.DataFrame({"Name": ['Tom', 'Mike', 'Tiffany'], "Number": [7, 15, 3]})
      Name  Number
0      Tom       7
1     Mike      15
2  Tiffany       3

For more, see the Pandas documentation on DataFrames.

DataFrames share the index

The DataFrame is stitched together from values matching on their index. Labels that appear in only one Series get NaN in the other columns. For example:

gpas = pd.Series(data=[4.0, np.nan, 3.4, 2.8, 2.5 ], index=['Allen','Bob','Chris','Ed', 'Frank'])
yrs = pd.Series(data=['So', 'Fr', 'Jr', 'Sr'], index=['Allen','Bob','Dave', 'Frank'])
students = pd.DataFrame( {'GPA': gpas, 'Year': yrs})
students
       GPA Year
Allen  4.0   So
Bob    NaN   Fr
Chris  3.4  NaN
Dave   NaN   Jr
Ed     2.8  NaN
Frank  2.5   Sr
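
To underline that matching happens on labels and not on position, here is a minimal sketch with two hypothetical Series whose labels are listed in opposite order. Each row of the resulting DataFrame still pairs the values that belong to the same label.

```python
import pandas as pd

# Hypothetical data: same labels, different order in each Series
gpas = pd.Series(data=[3.1, 4.0], index=['Bob', 'Allen'])
yrs = pd.Series(data=['So', 'Fr'], index=['Allen', 'Bob'])

students = pd.DataFrame({'GPA': gpas, 'Year': yrs})

# Allen's row holds Allen's GPA (4.0) and Allen's year (So),
# even though they sat at different positions in the inputs
print(students)
```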

Accessing elements with indexing

You can access columns in the DataFrame using the names of the series, much in the same way you would a dictionary. For example:

students['GPA'] # select a column by its name
Allen    4.0
Bob      NaN
Chris    3.4
Dave     NaN
Ed       2.8
Frank    2.5
Name: GPA, dtype: float64

Since each column in a DataFrame is a Series, you can then access a particular value using the Series index. For example, since the Series data in students were indexed by name, we can get Chris’s GPA by doing:

students['GPA']['Chris']
np.float64(3.4)

Much like a Series is like a special NumPy array with fancy indexing (and other useful features), a DataFrame is like a special type of dictionary, with some extra features that make handling datasets much easier. In fact, as we’ll see below, a DataFrame is more like a cross between a dictionary and a NumPy array, which makes it excel at data wrangling. (pun intended)

Accessing elements with loc and iloc

The loc[index, col] and iloc[row_pos, col_pos] properties allow you to slice the dataframe. loc uses the index and column names, while iloc uses ordinal positions starting at zero.

Here are some examples, using students:

# Examples using loc
print("loc: Get Chris's GPA: ", students.loc['Chris', 'GPA'])
print("loc: Get the Year of the last student (Frank): ", students.loc['Frank', 'Year'])

# Same examples using iloc
print("iloc: Get the GPA of the student at row 2 (Chris): ", students.iloc[2, 0])
print("iloc: Get the Year of the last student (Frank): ", students.iloc[-1, 1])
loc: Get Chris's GPA:  3.4
loc: Get the Year of the last student (Frank):  Sr
iloc: Get the GPA of the student at row 2 (Chris):  3.4
iloc: Get the Year of the last student (Frank):  Sr
# You can also slice using loc and iloc
print("loc: last two rows:\n", students.loc['Ed':, 'GPA':'Year'])
print()
print("iloc: last two rows:\n", students.iloc[-2:, 0:2])
loc: last two rows:
        GPA Year
Ed     2.8  NaN
Frank  2.5   Sr

iloc: last two rows:
        GPA Year
Ed     2.8  NaN
Frank  2.5   Sr
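
One difference between the two is easy to miss when slicing: a loc slice includes its end label, while an iloc slice excludes its end position, just like a Python list slice. A minimal sketch on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical frame with a string index
df = pd.DataFrame(
    {"GPA": [4.0, 3.4, 2.8, 2.5]},
    index=["Allen", "Chris", "Ed", "Frank"],
)

by_label = df.loc["Allen":"Ed"]   # end label "Ed" is INCLUDED: 3 rows
by_position = df.iloc[0:2]        # end position 2 is EXCLUDED: 2 rows

print(by_label)
print(by_position)
```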

Null Checks

Use isna() to check for np.nan.

students[students.GPA.isna()]
      GPA Year
Bob   NaN   Fr
Dave  NaN   Jr
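
A few related methods are sketched below on a small hypothetical frame: notna() is the inverse of isna(), isna().sum() counts the missing values, and fillna() substitutes a replacement. Filling with 0.0 is only for illustration; the right replacement depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing GPA
students = pd.DataFrame(
    {"GPA": [4.0, np.nan, 3.4], "Year": ["So", "Fr", "Fr"]},
    index=["Allen", "Bob", "Chris"],
)

has_gpa = students[students["GPA"].notna()]   # rows where GPA is present
n_missing = students["GPA"].isna().sum()      # True counts as 1 when summed
filled = students["GPA"].fillna(0.0)          # new Series with NaN replaced

print(has_gpa)
print(n_missing)
print(filled)
```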

Code Challenge 2.1

Create this DataFrame:

   s1   s2 s3
a   1  2.2  q
b   2  NaN  q
c   3  3.0  z
d   4  1.5  z

In other words, the frame should have 3 columns named s1, s2, and s3, and the rows should be indexed with the strings a, b, c, and d. Use Series to create it to make sure the index is correct. Print the full DataFrame (so that you get back something like the above), then print the first 2 rows and columns using loc or iloc.

import pandas as pd
import numpy as np

s1 = pd.Series(data = [1, 2, 3, 4], index=['a', 'b', 'c', 'd'], name='s1')
s2 = pd.Series(data = [2.2, np.nan, 3.0, 1.5], index=['a', 'b', 'c', 'd'], name='s2')
s3 = pd.Series(data = ['q', 'q', 'z', 'z'], index=['a', 'b', 'c', 'd'], name='s3')

df = pd.DataFrame({'s1':s1,'s2':s2,'s3':s3})
print(df)

print(df.loc['a':'b', 's1':'s2'])
   s1   s2 s3
a   1  2.2  q
b   2  NaN  q
c   3  3.0  z
d   4  1.5  z
   s1   s2
a   1  2.2
b   2  NaN

Basic Dataframe operations

  • info() provides the names of the columns, the count of non-null values in each column, and the data types.
  • describe() provides basic statistics (count, mean, standard deviation, min, max, and quartiles) for each numerical column.
  • head(n=5) views the FIRST n rows in the DataFrame (defaults to 5).
  • tail(n=5) views the LAST n rows in the DataFrame (defaults to 5).
  • sample(n=1) views n random rows from the DataFrame (defaults to 1).
  • .columns retrieves the column labels of the DataFrame.
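
As a quick sketch, here are the operations listed above applied to the small students table from earlier in these notes. The variable names are illustrative, and random_state is passed to sample() only to make the random pick repeatable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Allen', 'Bob', 'Chris', 'Dave', 'Ed', 'Frank', 'Gus'],
    'GPA': [4.0, np.nan, 3.4, 2.8, 2.5, 3.8, 3.0],
    'Year': ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'],
})

df.info()                                   # prints columns, non-null counts, dtypes

stats = df.describe()                       # statistics for the numeric column (GPA)
first_three = df.head(3)                    # first 3 rows
last_two = df.tail(2)                       # last 2 rows
random_two = df.sample(n=2, random_state=0) # 2 random rows, repeatable
column_names = list(df.columns)             # column labels as a plain list

print(stats)
print(first_three)
print(last_two)
print(random_two)
print(column_names)
```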

To illustrate this, we’ll load a comma-separated values (CSV) file, customers.csv, that contains some customer data. We can load the file directly as a DataFrame using Pandas’ read_csv function. Notice that we can pass a URL to the function. We don’t need to download the file first; Pandas takes care of that for us under the hood!

customers = pd.read_csv('https://su-ist356-m003-fall-2025.github.io/course-home/04_data_wrangling/customers.csv')
print(customers)
      First        Last                     Email Gender  Last IP Address  \
0        Al      Fresco        afresco@dayrep.com      M    74.111.18.161   
1      Abby        Kuss           akuss@rhyta.com      F    23.80.125.101   
2     Arial       Photo         aphoto@dayrep.com      F       24.0.14.56   
3     Bette       Alott          balott@rhyta.com      F   56.216.127.219   
4     Barb       Barion     bbarion@superrito.com      F     38.68.15.223   
5     Barry  DeHatchett    bdehatchett@dayrep.com      M    23.192.215.78   
6      Bill     Melator       bmelator@einrot.com      M     24.11.125.10   
7     Candi       Cayne          ccayne@rhyta.com      F      24.39.14.15   
8     Carol        Ling       cling@superrito.com      F    23.180.242.66   
9       Cam         Rha           crha@einrot.com      M      24.1.25.140   
10      Dan     Delyons       ddelyons@dayrep.com      M    24.38.224.161   
11     Erin     Detyers       edetyers@dayrep.com      F     70.209.14.54   
12    Euron   Tasomthin  etasomthin@superrito.com      M    68.199.40.156   
13   Justin        Case          jcase@dayrep.com      M    23.192.215.44   
14     Jean       Poole         jpoole@dayrep.com      F     23.182.25.40   
15      Lee    Hvmeehom      lhvmeehom@einrot.com      F      215.82.23.2   
16     Lisa  Karfurless    lkarfurless@dayrep.com      F    172.189.252.8   
17     Mary     Melator        mmelator@rhyta.com      F       23.88.15.5   
18     Mike      Rofone        mrofone@dayrep.com      M     23.224.160.4   
19     Oren     Jouglad       ojouglad@einrot.com      M  128.122.140.238   
20     Phil       Meaup         pmeaup@dayrep.com      M    23.83.132.200   
21    Rowan      Deboat        rdeboat@dayrep.com      M      23.84.32.22   
22      Ray     Ovlight       rovlight@dayrep.com      M     74.111.18.59   
23     Sara      Bellum     sbellum@superrito.com      F     74.111.6.173   
24      Sal        Ladd       sladd@superrito.com      M    23.112.202.16   
25  Seymour       Ofewe         sofewe@dayrep.com      M      98.29.25.44   
26       Ty       Anott          tanott@rhyta.com      M      23.230.12.5   
27    Tally       Itupp      titupp@superrito.com      F    24.38.114.105   
28      Tim        Pani       tpani@superrito.com      M    23.84.132.226   
29   Victor        Rhee          vrhee@einrot.com      M   23.112.232.160   

           City State  Total Orders  Total Purchased  Months Customer  
0      Syracuse    NY             1               45                1  
1       Phoenix    AZ             1               25                2  
2        Newark    NJ             1              680                1  
3       Raleigh    NC             6              560               18  
4        Dallas    TX             4             1590                1  
5        Boston    MA             1               15                6  
6          Orem    UT             9             6090               35  
7      Portland    ME             1              620                2  
8      Syracuse    NY             2              440                6  
9       Chicago    IL             0                0                1  
10    Greenwich    CT             2             2570               10  
11        Tampa    FL             5             1105               38  
12    Hempstead    NY            13             4630               28  
13       Boston    MA             3             1050                1  
14     Kingston    NY             7              185               12  
15     Columbus    OH             9              207               18  
16      Fairfax    VA             6              250               27  
17  Los Angeles    CA             8             4275               40  
18     Cheyenne    WY             0                0                0  
19     New York    NY            12             4500               36  
20      Phoenix    AZ             4              930               24  
21       Topeka    KS             1             3500               42  
22     Syracuse    NY             6              125               42  
23   Alexandria    VA             2              189                2  
24    Rochester    NY            14              594               10  
25    Cleveland    OH             9             1190                3  
26     San Jose    CA             1               50                3  
27    Sea Cliff    NY            11              380               42  
28      Buffalo    NY             0                0                1  
29    Green Bay    WI             0                0                2  
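
read_csv also accepts file-like objects, which makes it possible to try it without a network connection. The sketch below feeds it a few rows and columns copied from the output above through io.StringIO; the trimmed-down csv_text is an illustration, not the real file.

```python
import io
import pandas as pd

# A small excerpt in CSV form (three rows, four of the columns)
csv_text = """First,Last,State,Total Orders
Al,Fresco,NY,1
Abby,Kuss,AZ,1
Bette,Alott,NC,6
"""

# The first line is used as the header; column types are inferred
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)
```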

Display the dataframe in Streamlit

You can use the st.dataframe() function to display a DataFrame in Streamlit.

Here is an example:

import streamlit as st
import pandas as pd

st.title("Dataframe Example")

customers = pd.read_csv('https://su-ist356-m003-fall-2025.github.io/course-home/04_data_wrangling/customers.csv')

st.dataframe(customers.head(20))
st.dataframe(customers.describe())

Code Challenge 2.2

Similar to the previous example, load this file into a customers dataframe:

https://su-ist356-m003-fall-2025.github.io/course-home/04_data_wrangling/customers.csv

Then create a radio widget to allow the user to select Head or Tail and a number input widget to enter a number of lines.

Output the head or tail of the dataframe and only show the number of lines input.

Hint: Use Streamlit’s radio and number_input functions.

import streamlit as st
import pandas as pd

st.title('My first dataframe')

customers = pd.read_csv('https://su-ist356-m003-fall-2025.github.io/course-home/04_data_wrangling/customers.csv')

radio = st.radio('Show:', options=[ 'Head', 'Tail'], index=0)
rows = st.number_input('Rows:', min_value=1, max_value=len(customers), value=5)
if radio == 'Head':
    st.dataframe(customers.head(rows))
else:
    st.dataframe(customers.tail(rows))

Selecting Rows and Columns

We can pare down the output of a DataFrame by using:

  • a list of column names to select columns.
  • a boolean index to select matching rows.

data_dict = { 
    'Name':  ['Allen','Bob','Chris','Dave','Ed','Frank','Gus'], 
    'GPA': [4.0, np.nan, 3.4, 2.8, 2.5, 3.8, 3.0], 
    'Year' : ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'] } 
students = pd.DataFrame( data_dict )
students
    Name  GPA Year
0  Allen  4.0   So
1    Bob  NaN   Fr
2  Chris  3.4   Fr
3   Dave  2.8   Jr
4     Ed  2.5   Sr
5  Frank  3.8   Sr
6    Gus  3.0   Fr

Selecting Columns

This example selects just the Name and GPA columns:

columns_to_show = ['Name', 'GPA']
students[columns_to_show]
    Name  GPA
0  Allen  4.0
1    Bob  NaN
2  Chris  3.4
3   Dave  2.8
4     Ed  2.5
5  Frank  3.8
6    Gus  3.0
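
A related point that often trips people up: indexing with a single column name returns a Series, while indexing with a list of names returns a DataFrame, even if the list holds only one name. A brief sketch (variable names are illustrative):

```python
import numpy as np
import pandas as pd

students = pd.DataFrame({
    'Name': ['Allen', 'Bob', 'Chris', 'Dave', 'Ed', 'Frank', 'Gus'],
    'GPA': [4.0, np.nan, 3.4, 2.8, 2.5, 3.8, 3.0],
    'Year': ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'],
})

name_series = students['Name']     # a string key gives a Series
name_frame = students[['Name']]    # a list key gives a one-column DataFrame

print(type(name_series).__name__, name_series.shape)
print(type(name_frame).__name__, name_frame.shape)
```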

Getting the freshmen using a boolean index

Consider the following:

students['Year'] == 'Fr'
0    False
1     True
2     True
3    False
4    False
5    False
6     True
Name: Year, dtype: bool

This is called a boolean index. The boolean expression is evaluated for each row in the DataFrame. It’s similar to the boolean “mask” array we used for extracting values from an array in the NumPy unit.

When we apply the boolean index to the dataframe, only the rows where the index evaluates to True are returned.

students[students['Year'] == 'Fr'] 
    Name  GPA Year
1    Bob  NaN   Fr
2  Chris  3.4   Fr
6    Gus  3.0   Fr

Likewise, we can assign these to variables for clarity:

only_freshmen_index = students['Year'] == 'Fr'
only_freshmen = students[only_freshmen_index]
only_freshmen
    Name  GPA Year
1    Bob  NaN   Fr
2  Chris  3.4   Fr
6    Gus  3.0   Fr

And Or and Not with Boolean indexes

What if we want freshmen or seniors? We cannot use Python’s or keyword here; instead, we must use the bitwise or operator, |. This is because or expects a single True/False value, whereas each comparison produces a Series of many boolean values.

Bitwise Operators

  • and &
  • or |
  • not ~

Note: parentheses are required around each comparison, because the bitwise operators take precedence over comparison operators such as ==.

# freshmen or seniors
only_freshmen_seniors = (students['Year'] == 'Fr') | (students['Year'] == 'Sr')
students[only_freshmen_seniors]
    Name  GPA Year
1    Bob  NaN   Fr
2  Chris  3.4   Fr
4     Ed  2.5   Sr
5  Frank  3.8   Sr
6    Gus  3.0   Fr
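
The sketch below shows what happens if the or keyword is used by mistake, and two related tools: the ~ operator for negation and the Series method isin(), which is a compact alternative when testing one column against several values. Variable names are illustrative.

```python
import numpy as np
import pandas as pd

students = pd.DataFrame({
    'Name': ['Allen', 'Bob', 'Chris', 'Dave', 'Ed', 'Frank', 'Gus'],
    'GPA': [4.0, np.nan, 3.4, 2.8, 2.5, 3.8, 3.0],
    'Year': ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'],
})

is_fr = students['Year'] == 'Fr'
is_sr = students['Year'] == 'Sr'

# Python's `or` needs a single True/False, so it raises on a Series
try:
    is_fr or is_sr
    raised = False
except ValueError:
    raised = True

# isin() gives the same rows as (is_fr | is_sr)
fr_or_sr = students[students['Year'].isin(['Fr', 'Sr'])]

# ~ flips every boolean: everyone who is NOT a freshman
not_fr = students[~is_fr]

print(raised)
print(fr_or_sr)
print(not_fr)
```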

Putting it Together

Get the name and GPA of only the freshmen that have a GPA stored (i.e., for which the GPA is not a NaN):

cols = ['Name', 'GPA']
fr_with_gpa = (students['Year'] == 'Fr') & (students['GPA'].notna())
students[fr_with_gpa][cols]
    Name  GPA
2  Chris  3.4
6    Gus  3.0
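
The same selection can also be written as a single loc call, which takes the row mask and the column list together. This is offered as an alternative sketch, not a replacement for the approach above.

```python
import numpy as np
import pandas as pd

students = pd.DataFrame({
    'Name': ['Allen', 'Bob', 'Chris', 'Dave', 'Ed', 'Frank', 'Gus'],
    'GPA': [4.0, np.nan, 3.4, 2.8, 2.5, 3.8, 3.0],
    'Year': ['So', 'Fr', 'Fr', 'Jr', 'Sr', 'Sr', 'Fr'],
})

cols = ['Name', 'GPA']
fr_with_gpa = (students['Year'] == 'Fr') & (students['GPA'].notna())

# Rows and columns selected in one step
result = students.loc[fr_with_gpa, cols]
print(result)
```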

Code Challenge 2.3

Similar to the previous example, load this file into a customers dataframe:

https://su-ist356-m003-fall-2025.github.io/course-home/04_data_wrangling/customers.csv

Then:

  1. Create a radio widget to allow the user to select “M” or “F” for gender,

  2. a multi-select widget to pick which columns to display (Hint: use Streamlit’s multiselect method),

  3. and filter the rows to match the gender and selected columns.

Display the dataframe in the Streamlit app.

import streamlit as st
import pandas as pd

st.title('Customers')
customers = pd.read_csv('https://su-ist356-m003-fall-2025.github.io/course-home/04_data_wrangling/customers.csv')
radio = st.radio('Gender:', options=[ 'M', 'F'], index=0)
cols = st.multiselect('Columns:', options=customers.columns)
gender_index = customers['Gender'] == radio
st.dataframe(customers[gender_index][cols])