with open('file.csv', 'r') as file:
data = file.read()
df = pd.read_csv(pd.compat.StringIO(data)) # text stream3. Data I/O with Pandas
In this tutorial we’ll learn about how to load and save files using Pandas, including how to handle different file formats (csv, json, Excel, etc.).
Pandas reads data in a variety of formats
Examples:
- Text: CSV / Delimited
pd.read_csv("file.csv", sep=",", header=0) - Semi- Structured: JSON, HTML, XML
pd.read_json("file.json", orient="records") - Microsoft Excel
pd.read_excel("file.xlsx", sheet_name="Sheet 1") - Big Data formats (ORC, Parquet, HDF5)
pd.read_parquet("file.parquet") - SQL Databases
For more details, see the Pandas IO documentation.
Pandas can read from almost anywhere
- Local files
pd_read_csv("./folder/file.csv") - Files over the network using http / https
pd.read_csv("https://website/folder/file.csv") - File-like: binary / text streams
Reading CSV / Delimited Text
For reading CSV files (or text files with other delimiters, such as tab), we use the read_csv() function.
- This function is for processing text files one record per line with values separated by a delimiter (typically a
,, but can be any string). - Common named arguments:
sep=the delimiter, default is a comma.header=Which row, amongst those not skipped is the headernames=list of column names to use in the DataFrameskiprows=how many lines to skip before the data begins?
Some examples of reading in the same data in different ways are below. In every case, the output is the same DataFrame:
import pandas as pd
# To view the following files, see:
# https://github.com/mafudge/datasets/tree/master/delimited
location = "https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/delimited"
# Header is first row, Comma-delimited
students = pd.read_csv(f'{location}/students-header.csv')
# No header in first row, Comma-delimited
students = pd.read_csv(f'{location}/students-no-header.csv', header=None, names =['Name','Grade','Year'])
# No header in first row, Pipe-delimited "|"
students = pd.read_csv(f'{location}/students-header.psv', sep="|")
# Header not in first row, header in 6th row, Comma-delimited"
students = pd.read_csv(f'{location}/students-header-blanks.csv', skiprows=5)
# no header, data starts in 6th row, semicolon-delimited"
students = pd.read_csv(f'{location}/students-no-header-blanks.ssv', skiprows=5, header=None, sep=";", names =['Name','Grade','Year'])
students| Name | Grade | Year | |
|---|---|---|---|
| 0 | Abby | 7.0 | Freshman |
| 1 | Bob | 9.0 | Sophomore |
| 2 | Chris | 10.0 | Senior |
| 3 | Dave | 8.0 | Freshman |
| 4 | Ellen | 7.0 | Sophomore |
| 5 | Fran | 10.0 | Senior |
| 6 | Greg | 8.0 | Freshman |
| 7 | Helen | NaN | Sophomore |
| 8 | Iris | 10.0 | Senior |
| 9 | Jimmy | 8.0 | Freshman |
| 10 | Karen | 7.5 | Freshman |
| 11 | Lynne | 10.0 | Sophomore |
| 12 | Mike | 10.0 | Sophomore |
| 13 | Nico | NaN | Junior |
| 14 | Pete | 8.0 | Freshman |
Reading JSON Text
To load JSON files as a Pandas DataFrame we use the read_json() function. Examples:
pd.read_json("file.json", orient="columns")pd.read_json("file.json", orient="records", lines=True)<== Line-oriented json
Orientations: - split: dict like {index -> [index]; columns -> [columns]; data -> [values]}
records: list like
[{column -> value} …]index: dict like
{index -> {column -> value}}columns: dict like
{column -> {index -> value}}values: just the values array
table: dict adhering to the JSON Table Schema https://specs.frictionlessdata.io/table-schema/#descriptor
For more on reading JSON files, see the Pandas Reading JSON guide.
Some examples of reading in the same JSON data in different ways follows. In every case, the output is the same DataFrame:
# https://github.com/mafudge/datasets/tree/master/json-formats to view the files
location = "https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/json-formats"
# Row-oriented JSON [ { "Name": "Alice", "Grade": 12, "Year": 2021 }, { "Name": "Bob", "Grade": 11, "Year": 2022 } ]
students = pd.read_json(f'{location}/students-records.json', orient='records')
# line-oriented JSON { "Name": "Alice", "Grade": 12, "Year": 2021 }\n { "Name": "Bob", "Grade": 11, "Year": 2022 }\n
students = pd.read_json(f'{location}/students-lines.json', orient='records', lines=True)
# column-oriented JSON { "Name": ["Alice", "Bob"], "Grade": [12, 11], "Year": [2021, 2022] }
students = pd.read_json(f'{location}/students-columns.json', orient='columns')
students| Name | Grade | Year | |
|---|---|---|---|
| 0 | Abby | 7.0 | Freshman |
| 1 | Bob | 9.0 | Sophomore |
| 2 | Chris | 10.0 | Senior |
| 3 | Dave | 8.0 | Freshman |
| 4 | Ellen | 7.0 | Sophomore |
| 5 | Fran | 10.0 | Senior |
| 6 | Greg | 8.0 | Freshman |
| 7 | Helen | NaN | Sophomore |
| 8 | Iris | 10.0 | Senior |
| 9 | Jimmy | 8.0 | Freshman |
| 10 | Karen | 7.5 | Freshman |
| 11 | Lynne | 10.0 | Sophomore |
| 12 | Mike | 10.0 | Sophomore |
| 13 | Nico | NaN | Junior |
| 14 | Pete | 8.0 | Freshman |
Handling Nested JSON
The read_json() method does not perform well on nested JSON structures. For example consider the following JSON file of customer orders:
The file orders.json:
[
{
"Customer" : { "FirstName" : "Abby", "LastName" : "Kuss"},
"Items" : [
{ "Name" : "T-Shirt", "Price" : 10.0, "Quantity" : 3},
{ "Name" : "Jacket", "Price" : 20.0, "Quantity" : 1}
]
},
{
"Customer" : { "FirstName" : "Bette", "LastName" : "Alott"},
"Items" : [
{ "Name" : "Shoes", "Price" : 25.0, "Quantity" : 1},
{ "Name" : "Jacket", "Price" : 20.0, "Quantity" : 1}
]
},
{
"Customer" : { "FirstName" : "Chris", "LastName" : "Peanugget"},
"Items" : [
{ "Name" : "T-Shirt", "Price" : 10.0, "Quantity" : 1}
]
}
]
When we read this with read_json() we get the three orders but only two columns — one for the "Customer" key, and the other for the "Items" key:
orders = pd.read_json("https://raw.githubusercontent.com/mafudge/datasets/master/json-samples/orders.json")
orders| Customer | Items | |
|---|---|---|
| 0 | {'FirstName': 'Abby', 'LastName': 'Kuss'} | [{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity'... |
| 1 | {'FirstName': 'Bette', 'LastName': 'Alott'} | [{'Name': 'Shoes', 'Price': 25.0, 'Quantity': ... |
| 2 | {'FirstName': 'Chris', 'LastName': 'Peanugget'} | [{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity'... |
What we want is one row per item on the the order and the customer name to be in separate columns. The json_normalize() function can help here.
It is important to note that json_normalize() does not take a file as input, but rather de-serialized json (i.e., a dict or list of dicts).
An example (note that we first need to load the file as JSON dict; for that, we’ll use the requests module to download the data):
# first down load the data
import requests
response = requests.get("https://raw.githubusercontent.com/mafudge/datasets/master/json-samples/orders.json")
json_data = response.json() #de-serialize
print('Original JSON data:')
print(json_data)
# now load into a DataFrame
orders = pd.json_normalize(json_data)
print("\nLoaded DataFrame:")
ordersOriginal JSON data:
[{'Customer': {'FirstName': 'Abby', 'LastName': 'Kuss'}, 'Items': [{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity': 3}, {'Name': 'Jacket', 'Price': 20.0, 'Quantity': 1}]}, {'Customer': {'FirstName': 'Bette', 'LastName': 'Alott'}, 'Items': [{'Name': 'Shoes', 'Price': 25.0, 'Quantity': 1}, {'Name': 'Jacket', 'Price': 20.0, 'Quantity': 1}]}, {'Customer': {'FirstName': 'Chris', 'LastName': 'Peanugget'}, 'Items': [{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity': 1}]}]
Loaded DataFrame:
| Items | Customer.FirstName | Customer.LastName | |
|---|---|---|---|
| 0 | [{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity'... | Abby | Kuss |
| 1 | [{'Name': 'Shoes', 'Price': 25.0, 'Quantity': ... | Bette | Alott |
| 2 | [{'Name': 'T-Shirt', 'Price': 10.0, 'Quantity'... | Chris | Peanugget |
Better but this only processed nested dict and not nested list. We still need to handle the list of Items. To accomplish this we :
- Set the
record_pathto be the nested list'Items'. This tellsjson_normalize()to use that JSON key as the row level. So now we will have 5 rows (one for each item) instead of 3. - Then we set the
metanamed argument to alistof each of the other values we wish to include, in this instance last name and first name.
Note: The meta syntax is a bit weird. It’s a list of JSON paths (also represented as lists) to each item in the JSON. For example:
The meta Argument ==> Matches This in the JSON ==> And Displays As This Pandas Column
["Customer","FirstName"] ==> { "Customer" : { "FirstName": ...} ==> Customer.Firstname
orders = pd.json_normalize(json_data, record_path="Items", meta=[["Customer","FirstName"],["Customer","LastName"]])
orders| Name | Price | Quantity | Customer.FirstName | Customer.LastName | |
|---|---|---|---|---|---|
| 0 | T-Shirt | 10.0 | 3 | Abby | Kuss |
| 1 | Jacket | 20.0 | 1 | Abby | Kuss |
| 2 | Shoes | 25.0 | 1 | Bette | Alott |
| 3 | Jacket | 20.0 | 1 | Bette | Alott |
| 4 | T-Shirt | 10.0 | 1 | Chris | Peanugget |
Yes it seems complicated, because conceptually it is a bit complicated. Let’s try another example, with some abstract values.
In the following example we want to generate a normalized table with 3 rows and 4 columns.
- The rows are based on the
"A"record_path, which has two sub-sets,A1andA2. There are three sets ofAdata: (101, 102); (111, 112); and (201, 202). - The meta data are based on columns
"B", and"C1"
json_data = [
{
"A": [
{"A1": 101, "A2": 102},
{"A1": 111, "A2": 112}
],
"B": 103,
"C": {"C1": 104}
},
{
"A": [
{"A1": 201, "A2": 202}
],
"B": 203,
"C": {"C1": 204}
}
]
df = pd.json_normalize(json_data, record_path="A", meta=["B", ["C", "C1"]])
df| A1 | A2 | B | C.C1 | |
|---|---|---|---|---|
| 0 | 101 | 102 | 103 | 104 |
| 1 | 111 | 112 | 103 | 104 |
| 2 | 201 | 202 | 203 | 204 |
Reading Excel files
Excel files can be read using the read_excel() function. For example: pd.read_excel('file.xlsx', sheet_name=None)
In order to use the read_excel method, you need to additional install the optional Pandas dependency openpyxl. To do that using pip, run:
pip install openpyxl
This will read in all sheets as a dict, with the sheet names as the keys and the values as Pandas DataFrames representing the contents. An example using this with Streamlit:
import streamlit as st
import pandas as pd
st.title("Excel Example - multiple sheets")
contents = pd.read_excel("https://github.com/mafudge/datasets/raw/refs/heads/master/excel-examples/books_of_interest.xlsx", sheet_name=None)
# names of sheets in the excel file its a dictionary
sheets = list(contents.keys())
# make tabs for each sheet
tabs = st.tabs(sheets)
#loop through each tab and write the contents of the sheet to the tab
for i in range(len(tabs)):
df = contents[sheets[i]]
tabs[i].dataframe(df)Reading HTML Tables
You can scrape an HTML table off a webpage using the read_html() function. This will return a list of all HTML tables on the page, with each table as a DataFrame.
In order to use the read_html method, you need to additional install the optional Pandas dependency lxml. To do that using pip, run:
pip install lxml
For example:
contents = pd.read_html("https://su-ist356-m003-fall-2025.github.io/course-home")
for df in contents:
print(df) Week Dates Topic
0 1 8/25, 8/29 Intro; CLI and Conda
1 2 9/3 Python review - 1
2 3 9/8, 9/10 Python review - 2
3 4 9/15, 9/17 UI
4 5 9/22, 9/24 Data wrangling - 1
5 6 9/29, 10/1 Data wrangling - 2
6 7 10/6, 10/8 Data wrangling - 3
7 8 10/15 Exam 1
8 9 10/20, 10/22 Web APIs - 1
9 10 10/27, 10/29 Web APIs - 2
10 11 11/3, 11/5 Web scraping - 1
11 12 11/10, 11/12 Web scraping - 2
12 13 11/17, 11/19 Data visualization - 1
13 14 12/1, 12/3 Data visualization - 2
14 15 12/8 Exam 2
15 16 12/15 Project due
Date Unnamed: 1
0 Mon. 8/25 First day of class
1 Mon. 9/1 Labor day - No class
2 Mon. 9/15 Academic/Financial drop deadline; Religious ob...
3 Mon. 10/13 Fall break - No class
4 Fri. 11/21 Withdrawal deadline
5 11/23-11/30 Thanksgiving Break - No class
6 Mon. 12/8 Last day of class
An example of turning this into a Streamlit app:
import streamlit as st
import pandas as pd
st.title("HTML Example - multiple tables")
url = "https://su-ist356-m003-fall-2025.github.io/course-home"
contents = pd.read_html(url)
# There are 2 tables on this page, but we don't know this
tables_count = len(contents)
st.write(f"Found {tables_count} tables on the page")
# make tabs for each HTML Table
tab_names = [ f"HTML Table {i}" for i in range(tables_count)]
tabs = st.tabs(tab_names)
# for each tab, show its table
for i in range(len(tabs)):
df = contents[i]
tabs[i].dataframe(df)Writing Dataframes
- Once the data is in a
pd.DataFrameis can be written out with one of theto()methods such asto_csv(),to_json(),to_parquet()etc. - This makes pandas a superior data conversion tool.
- If you include a file, the
to()method writes to the file, otherwise the binary contents are returned. - https://pandas.pydata.org/pandas-docs/stable/reference/io.html
An example of converting the above Excel spreadsheet to CSV:
contents = pd.read_excel("https://github.com/mafudge/datasets/raw/refs/heads/master/excel-examples/books_of_interest.xlsx", sheet_name=None)
for sheetname, df in contents.items():
# convert spaces to underscores for filenames
sheetname = sheetname.replace(' ', '_')
filename = f'books_of_interest-{sheetname}.csv'
df.to_csv(filename, header=True, index=False)