3. Data I/O with Pandas

In this tutorial we’ll learn about how to load and save files using Pandas, including how to handle different file formats (csv, json, Excel, etc.).

Pandas reads data in a variety of formats

Examples:

Text: CSV / Delimited
pd.read_csv("file.csv", sep=",", header=0)
Semi- Structured: JSON, HTML, XML
pd.read_json("file.json", orient="records")
Microsoft Excel
pd.read_excel("file.xlsx", sheet_name="Sheet 1")
Big Data formats (ORC, Parquet, HDF5)
pd.read_parquet("file.parquet")
SQL Databases

For more details, see the Pandas IO documentation.

Pandas can read from almost anywhere

Local files
pd_read_csv("./folder/file.csv")
Files over the network using http / https
pd.read_csv("https://website/folder/file.csv")
File-like: binary / text streams

with open('file.csv', 'r') as file:
    data = file.read()
    df = pd.read_csv(pd.compat.StringIO(data))  # text stream

Reading CSV / Delimited Text

For reading CSV files (or text files with other delimiters, such as tab), we use the read_csv() function.

This function is for processing text files one record per line with values separated by a delimiter (typically a ,, but can be any string).
Common named arguments:
- sep= the delimiter, default is a comma.
- header= Which row, amongst those not skipped is the header
- names= list of column names to use in the DataFrame
- skiprows= how many lines to skip before the data begins?

Some examples of reading in the same data in different ways are below. In every case, the output is the same DataFrame:

import pandas as pd
# To view the following files, see:
# https://github.com/mafudge/datasets/tree/master/delimited
location = "https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/delimited"

# Header is first row, Comma-delimited
students = pd.read_csv(f'{location}/students-header.csv') 

# No header in first row, Comma-delimited
students = pd.read_csv(f'{location}/students-no-header.csv', header=None, names =['Name','Grade','Year'])

# No header in first row, Pipe-delimited  "|"
students = pd.read_csv(f'{location}/students-header.psv', sep="|")

# Header not in first row, header in 6th row, Comma-delimited"
students = pd.read_csv(f'{location}/students-header-blanks.csv', skiprows=5)

# no header, data starts in 6th row, semicolon-delimited"
students = pd.read_csv(f'{location}/students-no-header-blanks.ssv', skiprows=5, header=None, sep=";", names =['Name','Grade','Year'])

students

	Name	Grade	Year
0	Abby	7.0	Freshman
1	Bob	9.0	Sophomore
2	Chris	10.0	Senior
3	Dave	8.0	Freshman
4	Ellen	7.0	Sophomore
5	Fran	10.0	Senior
6	Greg	8.0	Freshman
7	Helen	NaN	Sophomore
8	Iris	10.0	Senior
9	Jimmy	8.0	Freshman
10	Karen	7.5	Freshman
11	Lynne	10.0	Sophomore
12	Mike	10.0	Sophomore
13	Nico	NaN	Junior
14	Pete	8.0	Freshman

Code Challenge 3.1

Read this file into a Pandas DataFrame:

https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/delimited/webtraffic.log

What is the delimiter?
Is there a header? Which row?
Do you need to skip lines?

Display only data where the time taken > 500 (msec) and the sc-status is equal to 200.

Bonus: display the data in a Streamlit app.

Solution

import pandas as pd

wt = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/delimited/webtraffic.log", skiprows=3, header=0, sep="\s+")
wt.info() # colunmn info (to console only)

wt_filter = (wt['sc-status'] == 200) & ( wt['time-taken'] > 500)
wt_slow_but_successful = wt[wt_filter]
wt_slow_but_successful

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 489 entries, 0 to 488
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   date             489 non-null    object
 1   time             489 non-null    object
 2   s-ip             489 non-null    object
 3   cs-method        489 non-null    object
 4   cs-uri-stem      489 non-null    object
 5   cs-uri-query     489 non-null    object
 6   s-port           489 non-null    int64 
 7   cs-username      489 non-null    object
 8   c-ip             489 non-null    object
 9   cs(User-Agent)   489 non-null    object
 10  cs(Referer)      489 non-null    object
 11  sc-status        489 non-null    int64 
 12  sc-substatus     489 non-null    int64 
 13  sc-win32-status  489 non-null    int64 
 14  time-taken       489 non-null    int64 
dtypes: int64(5), object(10)
memory usage: 57.4+ KB

<>:3: SyntaxWarning: invalid escape sequence '\s'
<>:3: SyntaxWarning: invalid escape sequence '\s'
/tmp/ipykernel_11215/568368627.py:3: SyntaxWarning: invalid escape sequence '\s'
  wt = pd.read_csv("https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/delimited/webtraffic.log", skiprows=3, header=0, sep="\s+")

	date	time	s-ip	cs-method	cs-uri-stem	cs-uri-query	s-port	cs-username	c-ip	cs(User-Agent)	cs(Referer)	sc-status	time-taken
32	2016-02-11	17:16:17	128.230.247.37	GET	/desktops	-	80	-	215.82.23.2	Mozilla/5.0+(Windows+NT+10.0;+WOW64;+rv:43.0)+...	http://group0.ist722.ischool.syr.edu/	200	1144
93	2016-02-11	17:16:59	128.230.247.37	GET	/adidas-consortium-campus-80s-running-shoes	-	80	-	128.122.140.238	Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKi...	http://group0.ist722.ischool.syr.edu/shoes	200	701
105	2016-02-11	17:17:07	128.230.247.37	GET	/htc-one-m8-android-l-50-lollipop	-	80	-	128.122.140.238	Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKi...	http://group0.ist722.ischool.syr.edu/	200	613
158	2016-02-11	18:03:08	128.230.247.37	GET	/digital-downloads	-	80	-	74.111.6.173	Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKi...	http://group0.ist722.ischool.syr.edu/register	200	572
184	2016-02-11	18:03:58	128.230.247.37	GET	/jewelry	-	80	-	74.111.6.173	Mozilla/5.0+(Windows+NT+6.3;+WOW64)+AppleWebKi...	http://group0.ist722.ischool.syr.edu/cart	200	516
281	2016-02-11	18:07:48	128.230.247.37	GET	/electronics	-	80	-	172.189.252.8	Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_11_3...	http://group0.ist722.ischool.syr.edu/	200	704
293	2016-02-11	18:07:50	128.230.247.37	GET	/apparel	-	80	-	172.189.252.8	Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_11_3...	http://group0.ist722.ischool.syr.edu/electronics	200	645
303	2016-02-11	18:07:53	128.230.247.37	GET	/computers	-	80	-	172.189.252.8	Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_11_3...	http://group0.ist722.ischool.syr.edu/apparel	200	703
399	2016-02-11	19:55:38	128.230.247.37	GET	/if-you-wait	-	80	-	8.37.70.99	AddThis.com+(http://support.addthis.com/)	-	200	539

To display in a Streamlit app, put the above in a script file (say, webtraffic.py) and add the following:

import streamlit as st

st.title("Webtraffic Data")
st.dataframe(wt_slow_but_successful) # first 20 rows

Then in a terminal run:

python -m streamlit run webtraffic.py

Reading JSON Text

To load JSON files as a Pandas DataFrame we use the read_json() function. Examples:

pd.read_json("file.json", orient="columns")
pd.read_json("file.json", orient="records", lines=True) <== Line-oriented json

Orientations: - split: dict like {index -> [index]; columns -> [columns]; data -> [values]}

records: list like [{column -> value} …]
index: dict like {index -> {column -> value}}
columns: dict like {column -> {index -> value}}
values: just the values array
table: dict adhering to the JSON Table Schema https://specs.frictionlessdata.io/table-schema/#descriptor

For more on reading JSON files, see the Pandas Reading JSON guide.

Some examples of reading in the same JSON data in different ways follows. In every case, the output is the same DataFrame:

# https://github.com/mafudge/datasets/tree/master/json-formats to view the files
location = "https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/json-formats"

# Row-oriented JSON [ { "Name": "Alice", "Grade": 12, "Year": 2021 }, { "Name": "Bob", "Grade": 11, "Year": 2022 } ]
students = pd.read_json(f'{location}/students-records.json', orient='records')

# line-oriented JSON { "Name": "Alice", "Grade": 12, "Year": 2021 }\n { "Name": "Bob", "Grade": 11, "Year": 2022 }\n
students = pd.read_json(f'{location}/students-lines.json', orient='records', lines=True)

# column-oriented JSON { "Name": ["Alice", "Bob"], "Grade": [12, 11], "Year": [2021, 2022] }
students = pd.read_json(f'{location}/students-columns.json', orient='columns')

students

	Name	Grade	Year
0	Abby	7.0	Freshman
1	Bob	9.0	Sophomore
2	Chris	10.0	Senior
3	Dave	8.0	Freshman
4	Ellen	7.0	Sophomore
5	Fran	10.0	Senior
6	Greg	8.0	Freshman
7	Helen	NaN	Sophomore
8	Iris	10.0	Senior
9	Jimmy	8.0	Freshman
10	Karen	7.5	Freshman
11	Lynne	10.0	Sophomore
12	Mike	10.0	Sophomore
13	Nico	NaN	Junior
14	Pete	8.0	Freshman

Handling Nested JSON

The read_json() method does not perform well on nested JSON structures. For example consider the following JSON file of customer orders:

The file orders.json:

[
    {
        "Customer" : { "FirstName" : "Abby", "LastName" : "Kuss"}, 
        "Items" : [
            { "Name" : "T-Shirt", "Price" : 10.0, "Quantity" : 3},
            { "Name" : "Jacket", "Price" : 20.0, "Quantity" : 1}
        ]
    },
    {
        "Customer" : { "FirstName" : "Bette", "LastName" : "Alott"}, 
        "Items" : [
            { "Name" : "Shoes", "Price" : 25.0, "Quantity" : 1}, 
            { "Name" : "Jacket", "Price" : 20.0, "Quantity" : 1}
        ]
    },
    {
        "Customer" : { "FirstName" : "Chris", "LastName" : "Peanugget"}, 
        "Items" : [
            { "Name" : "T-Shirt", "Price" : 10.0, "Quantity" : 1}
        ]
    }
]

When we read this with read_json() we get the three orders but only two columns — one for the "Customer" key, and the other for the "Items" key:

orders = pd.read_json("https://raw.githubusercontent.com/mafudge/datasets/master/json-samples/orders.json")
orders

	Customer	Items
0	{'FirstName': 'Abby', 'LastName': 'Kuss'}	[{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity'...
1	{'FirstName': 'Bette', 'LastName': 'Alott'}	[{'Name': 'Shoes', 'Price': 25.0, 'Quantity': ...
2	{'FirstName': 'Chris', 'LastName': 'Peanugget'}	[{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity'...

What we want is one row per item on the the order and the customer name to be in separate columns. The json_normalize() function can help here.

Note

It is important to note that json_normalize() does not take a file as input, but rather de-serialized json (i.e., a dict or list of dicts).

An example (note that we first need to load the file as JSON dict; for that, we’ll use the requests module to download the data):

# first down load the data
import requests
response = requests.get("https://raw.githubusercontent.com/mafudge/datasets/master/json-samples/orders.json")
json_data = response.json()  #de-serialize
print('Original JSON data:')
print(json_data)
# now load into a DataFrame
orders = pd.json_normalize(json_data)
print("\nLoaded DataFrame:")
orders

Original JSON data:
[{'Customer': {'FirstName': 'Abby', 'LastName': 'Kuss'}, 'Items': [{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity': 3}, {'Name': 'Jacket', 'Price': 20.0, 'Quantity': 1}]}, {'Customer': {'FirstName': 'Bette', 'LastName': 'Alott'}, 'Items': [{'Name': 'Shoes', 'Price': 25.0, 'Quantity': 1}, {'Name': 'Jacket', 'Price': 20.0, 'Quantity': 1}]}, {'Customer': {'FirstName': 'Chris', 'LastName': 'Peanugget'}, 'Items': [{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity': 1}]}]

Loaded DataFrame:

	Items	Customer.FirstName	Customer.LastName
0	[{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity'...	Abby	Kuss
1	[{'Name': 'Shoes', 'Price': 25.0, 'Quantity': ...	Bette	Alott
2	[{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity'...	Chris	Peanugget

Better but this only processed nested dict and not nested list. We still need to handle the list of Items. To accomplish this we :

Set the record_path to be the nested list 'Items'. This tells json_normalize() to use that JSON key as the row level. So now we will have 5 rows (one for each item) instead of 3.
Then we set the meta named argument to a list of each of the other values we wish to include, in this instance last name and first name.

Note: The meta syntax is a bit weird. It’s a list of JSON paths (also represented as lists) to each item in the JSON. For example:

The meta Argument        ==> Matches This in the JSON           ==> And Displays As This Pandas Column
["Customer","FirstName"] ==> { "Customer" : { "FirstName": ...} ==> Customer.Firstname

orders = pd.json_normalize(json_data, record_path="Items", meta=[["Customer","FirstName"],["Customer","LastName"]])
orders

	Name	Price	Quantity	Customer.FirstName	Customer.LastName
0	T-Shirt	10.0	3	Abby	Kuss
1	Jacket	20.0	1	Abby	Kuss
2	Shoes	25.0	1	Bette	Alott
3	Jacket	20.0	1	Bette	Alott
4	T-Shirt	10.0	1	Chris	Peanugget

Yes it seems complicated, because conceptually it is a bit complicated. Let’s try another example, with some abstract values.

In the following example we want to generate a normalized table with 3 rows and 4 columns.

The rows are based on the "A" record_path, which has two sub-sets, A1 and A2. There are three sets of A data: (101, 102); (111, 112); and (201, 202).
The meta data are based on columns "B", and "C1"

json_data = [
    {
        "A": [
            {"A1": 101, "A2": 102},
            {"A1": 111, "A2": 112}
        ],
        "B": 103,
        "C": {"C1": 104}
    },
    {
        "A": [
            {"A1": 201, "A2": 202}
        ],
        "B": 203,
        "C": {"C1": 204}
    }
]

df = pd.json_normalize(json_data, record_path="A", meta=["B", ["C", "C1"]])
df

	A1	A2	B	C.C1
0	101	102	103	104
1	111	112	103	104
2	201	202	203	204

Code Challenge 3.2

Use the json_normalize function to tabularize this JSON data:

https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/json-samples/employees.json

The final table should have these columns: dept, age, firstname, lastname.

Hint: read the file using the requests module, like in the above example.

Solution

import requests
import pandas as pd

response = requests.get("https://raw.githubusercontent.com/mafudge/datasets/master/json-samples/employees.json")
employees = response.json()
employees_df = pd.json_normalize(employees, record_path=["employees"], meta=["dept"])
employees_df

	firstName	lastName	age	dept
0	John	Doe	23	accounting
1	Mary	Smith	32	accounting
2	Sally	Green	27	sales
3	Jim	Galley	41	sales

Reading Excel files

Excel files can be read using the read_excel() function. For example: pd.read_excel('file.xlsx', sheet_name=None)

Note

In order to use the read_excel method, you need to additional install the optional Pandas dependency openpyxl. To do that using pip, run:

pip install openpyxl

This will read in all sheets as a dict, with the sheet names as the keys and the values as Pandas DataFrames representing the contents. An example using this with Streamlit:

import streamlit as st
import pandas as pd

st.title("Excel Example - multiple sheets")

contents = pd.read_excel("https://github.com/mafudge/datasets/raw/refs/heads/master/excel-examples/books_of_interest.xlsx", sheet_name=None)

# names of sheets in the excel file its a dictionary
sheets = list(contents.keys()) 

# make tabs for each sheet
tabs = st.tabs(sheets)

#loop through each tab and write the contents of the sheet to the tab
for i in range(len(tabs)):
    df = contents[sheets[i]]
    tabs[i].dataframe(df)

Reading HTML Tables

You can scrape an HTML table off a webpage using the read_html() function. This will return a list of all HTML tables on the page, with each table as a DataFrame.

Note

In order to use the read_html method, you need to additional install the optional Pandas dependency lxml. To do that using pip, run:

pip install lxml

For example:

contents = pd.read_html("https://su-ist356-m003-fall-2025.github.io/course-home")
for df in contents:
    print(df)

    Week         Dates                   Topic
0      1    8/25, 8/29    Intro; CLI and Conda
1      2           9/3       Python review - 1
2      3     9/8, 9/10       Python review - 2
3      4    9/15, 9/17                      UI
4      5    9/22, 9/24      Data wrangling - 1
5      6    9/29, 10/1      Data wrangling - 2
6      7    10/6, 10/8      Data wrangling - 3
7      8         10/15                  Exam 1
8      9  10/20, 10/22            Web APIs - 1
9     10  10/27, 10/29            Web APIs - 2
10    11    11/3, 11/5        Web scraping - 1
11    12  11/10, 11/12        Web scraping - 2
12    13  11/17, 11/19  Data visualization - 1
13    14    12/1, 12/3  Data visualization - 2
14    15          12/8                  Exam 2
15    16         12/15             Project due
          Date                                         Unnamed: 1
0    Mon. 8/25                                 First day of class
1     Mon. 9/1                               Labor day - No class
2    Mon. 9/15  Academic/Financial drop deadline; Religious ob...
3   Mon. 10/13                              Fall break - No class
4   Fri. 11/21                                Withdrawal deadline
5  11/23-11/30                      Thanksgiving Break - No class
6    Mon. 12/8                                  Last day of class

An example of turning this into a Streamlit app:

import streamlit as st
import pandas as pd

st.title("HTML Example - multiple tables")

url = "https://su-ist356-m003-fall-2025.github.io/course-home"
contents = pd.read_html(url)

# There are 2 tables on this page, but we don't know this
tables_count = len(contents)

st.write(f"Found {tables_count} tables on the page")

# make tabs for each HTML Table
tab_names = [ f"HTML Table {i}" for i in range(tables_count)]
tabs = st.tabs(tab_names)

# for each tab, show its table
for i in range(len(tabs)):
    df = contents[i]
    tabs[i].dataframe(df)

Writing Dataframes

Once the data is in a pd.DataFrame is can be written out with one of the to() methods such as to_csv(), to_json(), to_parquet() etc.
This makes pandas a superior data conversion tool.
If you include a file, the to() method writes to the file, otherwise the binary contents are returned.
https://pandas.pydata.org/pandas-docs/stable/reference/io.html

An example of converting the above Excel spreadsheet to CSV:

contents = pd.read_excel("https://github.com/mafudge/datasets/raw/refs/heads/master/excel-examples/books_of_interest.xlsx", sheet_name=None)
for sheetname, df in contents.items():
    # convert spaces to underscores for filenames
    sheetname = sheetname.replace(' ', '_')
    filename = f'books_of_interest-{sheetname}.csv'
    df.to_csv(filename, header=True, index=False)

Code Challenge 3.3

Write a Streamlit app that will accept an Excel file via file uploader and then write out a record-oriented JSON file from the first tab in the excel file.

The program should display the contents of the dataframe and provide a download button for the converted the csv file. Hints: - To provide the ability to upload a file, see: st.file_uploader - To provide the ability to download a file, see: st.download_button

Solution

import requests
import pandas as pd
import streamlit as st

st.title("Excel to JSON")

uploaded_file = st.file_uploader("Upload an EXCEL file", type=["xlsx"])

if uploaded_file:
    df = pd.read_excel(uploaded_file.getvalue())
    st.dataframe(df)
    json_file = df.to_json(orient="records", index=False)
    json_filename = uploaded_file.name.replace('.xlsx', '.json')
    download = st.download_button(f"Download {json_filename}", data=json_file, file_name=json_filename)

Pandas reads data in a variety of formats

Pandas can read from almost anywhere

Reading CSV / Delimited Text

Reading JSON Text

Handling Nested JSON

Reading Excel files

Reading HTML Tables

Writing Dataframes

Challenge 3-2-3

Excel to JSON