checks_data = 'https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/dining/check-data.csv'5. Basic data cleaning with Pandas
In this lesson we will start learning how to clean a dataframe data and loop over it. To follow along with the commands below, download the following scratch notebook to your ist356 directory, then open it with VS Code:
Download scratch-pandas-4.ipynb
For this tutorial, we will be using some data representing a pretend restaurant’s transactions. The file is:
(This link is provided in the scratch notebook.)
Let’s load the file with Pandas and sample a few of its rows to see what it contains:
import pandas as pd
checks = pd.read_csv(checks_data)
checks.sample(10)| check | date | party size | total items on check | total amount of check | gratuity | |
|---|---|---|---|---|---|---|
| 30 | 2705 | 2024-07-08 | 10 | 19 | $838.85 | $671.08 |
| 25 | 4031 | 2024-08-12 | 6 | 14 | $655.48 | $65.55 |
| 8 | 1066 | 2024-08-20 | 10 | 22 | $485.76 | $77.72 |
| 29 | 4590 | 2024-05-08 | 3 | 5 | $220.40 | $22.04 |
| 38 | 2341 | 2024-06-03 | 7 | 16 | $1,118.88 | $313.29 |
| 14 | 3676 | 2024-02-25 | 1 | 1 | $19.89 | $1.99 |
| 34 | 1368 | 2024-12-21 | 10 | 25 | $2,193.00 | $372.81 |
| 49 | 3404 | 2024-07-19 | 9 | 26 | $2,382.90 | $71.49 |
| 13 | 3867 | 2024-05-02 | 4 | 14 | $499.10 | $119.78 |
| 6 | 2527 | 2024-03-27 | 6 | 21 | $921.48 | $55.29 |
Let’s use the info method to get some more information about the columns:
checks.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 check 50 non-null int64
1 date 50 non-null object
2 party size 50 non-null int64
3 total items on check 50 non-null int64
4 total amount of check 50 non-null object
5 gratuity 50 non-null object
dtypes: int64(3), object(3)
memory usage: 2.5+ KB
There’s something odd here! Note that the data type of some of the columns (e.g., total amount of check) are object instead of floats, as you might expect. As we’ll see below, this is because of the $ in the values; that will cause issues when we try to work with these columns. We’ll learn how to “clean” these columns so that we can do useful things with them.
Apply
The apply method allows us to execute a function over a Series or the entire DataFrame.
The general syntax:
Series.apply(func)<== call functionfuncfor every item in the Series.DataFrame.apply(func, axis=1)<== call functionfuncfor every row in the DataFrame (axis=1=> row).DataFrame.apply(func, axis=0)<== call functionfuncfor every column in the DataFrame (axis=0=> col).
Note that the first argument apply is the function itself, not the function applied to some data. For example, suppose we define a function called sq that squares the input values:
def sq(x):
return x**2To apply this to one of the columns in our DataFrame (say, the party size column):
checks['party size'].apply(sq)0 64
1 9
2 25
3 4
4 36
5 1
6 36
7 64
8 100
9 1
10 36
11 100
12 16
13 16
14 1
15 25
16 25
17 9
18 64
19 4
20 25
21 1
22 64
23 4
24 100
25 36
26 36
27 1
28 16
29 9
30 100
31 9
32 9
33 36
34 100
35 16
36 4
37 81
38 49
39 49
40 9
41 49
42 49
43 25
44 49
45 81
46 1
47 1
48 81
49 81
Name: party size, dtype: int64
Why apply?
In the above example, you might wonder why we don’t just apply the function directly to the Series, rather than use apply. Afterall, in the first pandas tutorial we learned that Series are vectorized just like numpy arrays. In other words, why not just do:
sq(checks['party size'])0 64
1 9
2 25
3 4
4 36
5 1
6 36
7 64
8 100
9 1
10 36
11 100
12 16
13 16
14 1
15 25
16 25
17 9
18 64
19 4
20 25
21 1
22 64
23 4
24 100
25 36
26 36
27 1
28 16
29 9
30 100
31 9
32 9
33 36
34 100
35 16
36 4
37 81
38 49
39 49
40 9
41 49
42 49
43 25
44 49
45 81
46 1
47 1
48 81
49 81
Name: party size, dtype: int64
In this case, you could just run the function on the series. Where apply is useful is when you have more complicated functions, in particular, ones that need to do different things depending on what the input data is. For example, suppose we define the following function to group parties into small, medium, and large depending on how many people are in the party:
def classify_size(x):
if x < 4:
return 'small'
elif x < 8:
return 'medium'
else:
return 'large'If we try to run this on the party size Series, we get an error:
classify_size(checks['party size'])--------------------------------------------------------------------------- ValueError Traceback (most recent call last) /tmp/ipykernel_11312/1113461677.py in ?() ----> 1 classify_size(checks['party size']) /tmp/ipykernel_11312/2519828553.py in ?(x) 1 def classify_size(x): ----> 2 if x < 4: 3 return 'small' 4 elif x < 8: 5 return 'medium' /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/generic.py in ?(self) 1578 @final 1579 def __nonzero__(self) -> NoReturn: -> 1580 raise ValueError( 1581 f"The truth value of a {type(self).__name__} is ambiguous. " 1582 "Use a.empty, a.bool(), a.item(), a.any() or a.all()." 1583 ) ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This is because if statements cannot be vectorized like this. In contrast, the apply method does allow us to apply the function to the series:
checks['party size'].apply(classify_size)0 large
1 small
2 medium
3 small
4 medium
5 small
6 medium
7 large
8 large
9 small
10 medium
11 large
12 medium
13 medium
14 small
15 medium
16 medium
17 small
18 large
19 small
20 medium
21 small
22 large
23 small
24 large
25 medium
26 medium
27 small
28 medium
29 small
30 large
31 small
32 small
33 medium
34 large
35 medium
36 small
37 large
38 medium
39 medium
40 small
41 medium
42 medium
43 medium
44 medium
45 large
46 small
47 small
48 large
49 large
Name: party size, dtype: object
This is because the apply method takes care to cycle over every element in the series and apply the function.
Cleaning data with apply
The ability of apply to apply more complicated functions involving if statements makes it extremely useful for cleaning datasets. By “cleaning” we mean reformatting data and/or removing spurios values, so that we can use it without issue.
For example, say we want to add price per item to our DataFrame, defined as:
price per item = total amount of check / total items on check
The problem is total amount of check is an object, not a float. This means we cannot do math on it:
# This will raise a TypeError because of the dollar sign and commas!!!
checks['price_per_item'] = checks['total amount of check'] / checks['total items on check']--------------------------------------------------------------------------- TypeError Traceback (most recent call last) File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/ops/array_ops.py:218, in _na_arithmetic_op(left, right, op, is_cmp) 217 try: --> 218 result = func(left, right) 219 except TypeError: File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/computation/expressions.py:242, in evaluate(op, a, b, use_numexpr) 240 if use_numexpr: 241 # error: "None" not callable --> 242 return _evaluate(op, op_str, a, b) # type: ignore[misc] 243 return _evaluate_standard(op, op_str, a, b) File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/computation/expressions.py:73, in _evaluate_standard(op, op_str, a, b) 72 _store_test_result(False) ---> 73 return op(a, b) TypeError: unsupported operand type(s) for /: 'str' and 'int' During handling of the above exception, another exception occurred: TypeError Traceback (most recent call last) Cell In[10], line 2 1 # This will raise a TypeError because of the dollar sign and commas!!! ----> 2 checks['price_per_item'] = checks['total amount of check'] / checks['total items on check'] File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other) 72 return NotImplemented 74 other = item_from_zerodim(other) ---> 76 return method(self, other) File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/arraylike.py:210, in OpsMixin.__truediv__(self, other) 208 @unpack_zerodim_and_defer("__truediv__") 209 def __truediv__(self, other): --> 210 return self._arith_method(other, operator.truediv) File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/series.py:6154, in Series._arith_method(self, other, op) 6152 def _arith_method(self, other, op): 6153 self, other = self._align_for_op(other) -> 6154 return base.IndexOpsMixin._arith_method(self, other, op) File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/base.py:1391, in IndexOpsMixin._arith_method(self, other, op) 1388 rvalues = np.arange(rvalues.start, rvalues.stop, rvalues.step) 1390 with np.errstate(all="ignore"): -> 1391 result = ops.arithmetic_op(lvalues, rvalues, op) 1393 return self._construct_result(result, name=res_name) File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/ops/array_ops.py:283, in arithmetic_op(left, right, op) 279 _bool_arith_check(op, left, right) # type: ignore[arg-type] 281 # error: Argument 1 to "_na_arithmetic_op" has incompatible type 282 # "Union[ExtensionArray, ndarray[Any, Any]]"; expected "ndarray[Any, Any]" --> 283 res_values = _na_arithmetic_op(left, right, op) # type: ignore[arg-type] 285 return res_values File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/ops/array_ops.py:227, in _na_arithmetic_op(left, right, op, is_cmp) 219 except TypeError: 220 if not is_cmp and ( 221 left.dtype == object or getattr(right, "dtype", None) == object 222 ): (...) 225 # Don't do this for comparisons, as that will handle complex numbers 226 # incorrectly, see GH#32047 --> 227 result = _masked_arith_op(left, right, op) 228 else: 229 raise File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/ops/array_ops.py:163, in _masked_arith_op(x, y, op) 161 # See GH#5284, GH#5035, GH#19448 for historical reference 162 if mask.any(): --> 163 result[mask] = op(xrav[mask], yrav[mask]) 165 else: 166 if not is_scalar(y): TypeError: unsupported operand type(s) for /: 'str' and 'int'
How do we fix this? Let’s write a function to convert string values like this: $4,590.45 into floats like this: 4590.45
def clean_currency(value:str) -> float:
'''
This function will take a string value and remove the dollar sign and commas
and return a float value.
'''
return float(value.replace(',', '').replace('$', ''))
# tests
assert clean_currency('$1,000.00') == 1000.00
assert clean_currency('$1,000') == 1000.00
assert clean_currency('1,000') == 1000.00
assert clean_currency('$1000') == 1000.00With our function written we can use apply() to transform the series:
checks['total_amount_of_check_cleaned'] = checks['total amount of check'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks.sample(10)| check | date | party size | total items on check | total amount of check | gratuity | total_amount_of_check_cleaned | price_per_item | |
|---|---|---|---|---|---|---|---|---|
| 22 | 1336 | 2024-08-30 | 8 | 28 | $1,199.80 | $275.95 | 1199.80 | 42.85 |
| 9 | 2968 | 2024-12-28 | 1 | 3 | $122.97 | $23.36 | 122.97 | 40.99 |
| 49 | 3404 | 2024-07-19 | 9 | 26 | $2,382.90 | $71.49 | 2382.90 | 91.65 |
| 7 | 1564 | 2024-09-23 | 8 | 11 | $928.40 | $204.25 | 928.40 | 84.40 |
| 21 | 4440 | 2024-06-11 | 1 | 3 | $168.96 | $10.14 | 168.96 | 56.32 |
| 0 | 2827 | 2024-05-06 | 8 | 12 | $415.08 | $107.92 | 415.08 | 34.59 |
| 44 | 2053 | 2024-12-14 | 7 | 23 | $588.11 | $164.67 | 588.11 | 25.57 |
| 17 | 3795 | 2024-02-21 | 3 | 7 | $212.38 | $46.72 | 212.38 | 30.34 |
| 2 | 3685 | 2024-12-07 | 5 | 5 | $252.95 | $50.59 | 252.95 | 50.59 |
| 23 | 1194 | 2024-07-06 | 2 | 6 | $453.06 | $72.49 | 453.06 | 75.51 |
Remember its a really good idea to track lineage when you are building a data pipeline.
NEVER replace columns, always create new ones.
Using lambdas to apply functions to multiple columns
So far we’ve used apply with functions that take in a single Series. What do we do if we need a function to operate on multiple columns in a DataFrame? For that, we can use Python lambda functions.
An example using our checks data:
Marketing wants you to build some key performance indicators (KPIs) using the checks data. A KPI is a statistic that summarizes some larger data set, so the data can be more easily tracked over time. For example a letter grade such as an A- is a KPI summary of all your graded efforts to date.
In this example, marketing wants you to build two KPIs:
KPI 1: Whales
Marketing has decided to group customers into the following categories: - big eaters: Customers who are in the top 25% (i.e., above the 0.75 quantile) for items per person. - big spenders: Customers who are in the top 25% for price per person. - whale: Customers who are in the top 25% for both items per person and price per person.
KPI 2: Tippers
Marketing has decided to further group customers into the following categories based on their tipping: - light: Customers who are in the botton 25% (i.e., below the 0.25 quantile) by tip percentage. - heavy: Customers who are in the top 25% (i.e., above the 0.75 quantile) by tip percentage.
To calculate percentiles we will use the quantile Series method in Pandas. This returns the value at which X% of the data is below the given percentile.
Before we can apply our KPI’s we must write the functions!
checks['gratuity_cleaned'] = checks['gratuity'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks['price_per_person'] = checks['total_amount_of_check_cleaned'] / checks['party size']
checks['items_per_person'] = checks['total items on check'] / checks['party size']
checks['tip_percentage'] = checks['gratuity_cleaned'] / checks['total_amount_of_check_cleaned']To categorize whales:
def detect_whale(
items_per_person:float,
price_per_person:float,
items_per_person_75th_pctile:float,
price_per_person_75_pctile:float) -> str:
if items_per_person > items_per_person_75th_pctile and price_per_person > price_per_person_75_pctile:
return 'whale'
if items_per_person > items_per_person_75th_pctile:
return 'big eater'
if price_per_person > price_per_person_75_pctile:
return 'big spender'
return ''Let’s test our function using the quantile method:
# tests
ppp_75 = checks['price_per_person'].quantile(0.75)
ipp_75 = checks['items_per_person'].quantile(0.75)
print(ppp_75, ipp_75)
assert detect_whale(5, 250, 3, 175) == 'whale'
assert detect_whale(5, 100, 3, 175) == 'big eater'
assert detect_whale(1, 250, 3, 175) == 'big spender'
assert detect_whale(1, 100, 3, 175) == ''158.35666666666668 3.0
Now we want to apply the detect_whale function to the checks DataFrame. But detect_whale requires two columns as input, the items_per_person and price_per_person. How do we do that?
We can use a lambda function to quickly define a small function that will return the required columns when provided a row. In general, the syntax for a lambda function is lambda ARGS: FUNC. For example, lambda a, b: a+b will return the sum of the two arguments its given.
In our case, we can use a lambda function to pull out the needed columns and give them to apply, like so:
checks['whale'] = checks.apply(lambda row: detect_whale(row['items_per_person'], row['price_per_person'], ipp_75, ppp_75), axis=1)
checks.sample(25)| check | date | party size | total items on check | total amount of check | gratuity | total_amount_of_check_cleaned | price_per_item | gratuity_cleaned | price_per_person | items_per_person | tip_percentage | whale | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | 1945 | 2024-02-05 | 3 | 7 | $132.86 | $21.26 | 132.86 | 18.98 | 21.26 | 44.286667 | 2.333333 | 0.160018 | |
| 28 | 2446 | 2024-12-15 | 4 | 12 | $575.64 | $28.78 | 575.64 | 47.97 | 28.78 | 143.910000 | 3.000000 | 0.049997 | |
| 25 | 4031 | 2024-08-12 | 6 | 14 | $655.48 | $65.55 | 655.48 | 46.82 | 65.55 | 109.246667 | 2.333333 | 0.100003 | |
| 13 | 3867 | 2024-05-02 | 4 | 14 | $499.10 | $119.78 | 499.10 | 35.65 | 119.78 | 124.775000 | 3.500000 | 0.239992 | big eater |
| 9 | 2968 | 2024-12-28 | 1 | 3 | $122.97 | $23.36 | 122.97 | 40.99 | 23.36 | 122.970000 | 3.000000 | 0.189965 | |
| 42 | 1361 | 2024-11-21 | 7 | 14 | $65.80 | $16.45 | 65.80 | 4.70 | 16.45 | 9.400000 | 2.000000 | 0.250000 | |
| 3 | 1957 | 2024-02-15 | 2 | 2 | $42.44 | $8.91 | 42.44 | 21.22 | 8.91 | 21.220000 | 1.000000 | 0.209943 | |
| 38 | 2341 | 2024-06-03 | 7 | 16 | $1,118.88 | $313.29 | 1118.88 | 69.93 | 313.29 | 159.840000 | 2.285714 | 0.280003 | big spender |
| 33 | 3842 | 2024-03-31 | 6 | 6 | $147.12 | $5.88 | 147.12 | 24.52 | 5.88 | 24.520000 | 1.000000 | 0.039967 | |
| 16 | 3694 | 2024-11-03 | 5 | 17 | $1,574.37 | $173.18 | 1574.37 | 92.61 | 173.18 | 314.874000 | 3.400000 | 0.110000 | whale |
| 19 | 3718 | 2024-10-30 | 2 | 5 | $464.70 | $120.82 | 464.70 | 92.94 | 120.82 | 232.350000 | 2.500000 | 0.259996 | big spender |
| 46 | 3621 | 2024-06-23 | 1 | 2 | $138.76 | $19.43 | 138.76 | 69.38 | 19.43 | 138.760000 | 2.000000 | 0.140026 | |
| 5 | 2191 | 2024-01-06 | 1 | 3 | $17.85 | $1.96 | 17.85 | 5.95 | 1.96 | 17.850000 | 3.000000 | 0.109804 | |
| 43 | 1186 | 2024-09-21 | 5 | 16 | $298.72 | $74.68 | 298.72 | 18.67 | 74.68 | 59.744000 | 3.200000 | 0.250000 | big eater |
| 35 | 2486 | 2024-01-27 | 4 | 13 | $569.01 | $108.11 | 569.01 | 43.77 | 108.11 | 142.252500 | 3.250000 | 0.189997 | big eater |
| 15 | 2386 | 2024-03-31 | 5 | 12 | $1,147.80 | $137.74 | 1147.80 | 95.65 | 137.74 | 229.560000 | 2.400000 | 0.120003 | big spender |
| 1 | 2443 | 2024-06-09 | 3 | 10 | $286.40 | $31.50 | 286.40 | 28.64 | 31.50 | 95.466667 | 3.333333 | 0.109986 | big eater |
| 34 | 1368 | 2024-12-21 | 10 | 25 | $2,193.00 | $372.81 | 2193.00 | 87.72 | 372.81 | 219.300000 | 2.500000 | 0.170000 | big spender |
| 10 | 2809 | 2024-12-30 | 6 | 6 | $104.46 | $1.04 | 104.46 | 17.41 | 1.04 | 17.410000 | 1.000000 | 0.009956 | |
| 32 | 1440 | 2024-11-30 | 3 | 8 | $589.04 | $141.37 | 589.04 | 73.63 | 141.37 | 196.346667 | 2.666667 | 0.240001 | big spender |
| 29 | 4590 | 2024-05-08 | 3 | 5 | $220.40 | $22.04 | 220.40 | 44.08 | 22.04 | 73.466667 | 1.666667 | 0.100000 | |
| 22 | 1336 | 2024-08-30 | 8 | 28 | $1,199.80 | $275.95 | 1199.80 | 42.85 | 275.95 | 149.975000 | 3.500000 | 0.229997 | big eater |
| 48 | 4161 | 2024-06-22 | 9 | 28 | $1,385.16 | $235.48 | 1385.16 | 49.47 | 235.48 | 153.906667 | 3.111111 | 0.170002 | big eater |
| 20 | 3393 | 2024-08-26 | 5 | 6 | $302.64 | $24.21 | 302.64 | 50.44 | 24.21 | 60.528000 | 1.200000 | 0.079996 | |
| 14 | 3676 | 2024-02-25 | 1 | 1 | $19.89 | $1.99 | 19.89 | 19.89 | 1.99 | 19.890000 | 1.000000 | 0.100050 |
Looping over Dataframes
If you must run a for loop over your DataFrames, there are two choices:
df.iterrows()dict-like iterationdf.itertuples()named-tuple like iteration (faster)
Let’s do an example where we display the check number, whale and tipper for “heavy tipper” checks.
## Using the iterrows() method
print("Total Amount of Whale Checks")
for i,row in checks.iterrows():
if row['whale'] == 'whale':
print(i, row['check'], row['total_amount_of_check_cleaned'])# Same example with the itertuples() method
print("Total Amount of Whale Checks")
for row in checks.itertuples():
if row.whale == 'whale':
print(row.check, row.total_amount_of_check_cleaned)# Of course you don't need a loop to do this:
checks[checks['whale'] == 'whale'][['check', 'total_amount_of_check_cleaned']]