5. Basic data cleaning with Pandas

In this lesson we will start learning how to clean a dataframe data and loop over it. To follow along with the commands below, download the following scratch notebook to your ist356 directory, then open it with VS Code:

Download scratch-pandas-4.ipynb

For this tutorial, we will be using some data representing a pretend restaurant’s transactions. The file is:

checks_data = 'https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/dining/check-data.csv'

(This link is provided in the scratch notebook.)

Let’s load the file with Pandas and sample a few of its rows to see what it contains:

import pandas as pd

checks = pd.read_csv(checks_data)
checks.sample(10)
check date party size total items on check total amount of check gratuity
30 2705 2024-07-08 10 19 $838.85 $671.08
25 4031 2024-08-12 6 14 $655.48 $65.55
8 1066 2024-08-20 10 22 $485.76 $77.72
29 4590 2024-05-08 3 5 $220.40 $22.04
38 2341 2024-06-03 7 16 $1,118.88 $313.29
14 3676 2024-02-25 1 1 $19.89 $1.99
34 1368 2024-12-21 10 25 $2,193.00 $372.81
49 3404 2024-07-19 9 26 $2,382.90 $71.49
13 3867 2024-05-02 4 14 $499.10 $119.78
6 2527 2024-03-27 6 21 $921.48 $55.29

Let’s use the info method to get some more information about the columns:

checks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   check                  50 non-null     int64 
 1   date                   50 non-null     object
 2   party size             50 non-null     int64 
 3   total items on check   50 non-null     int64 
 4   total amount of check  50 non-null     object
 5   gratuity               50 non-null     object
dtypes: int64(3), object(3)
memory usage: 2.5+ KB

There’s something odd here! Note that the data type of some of the columns (e.g., total amount of check) are object instead of floats, as you might expect. As we’ll see below, this is because of the $ in the values; that will cause issues when we try to work with these columns. We’ll learn how to “clean” these columns so that we can do useful things with them.

Apply

The apply method allows us to execute a function over a Series or the entire DataFrame.

The general syntax:

  • Series.apply(func) <== call function func for every item in the Series.

  • DataFrame.apply(func, axis=1) <== call function func for every row in the DataFrame (axis=1 => row).

  • DataFrame.apply(func, axis=0) <== call function func for every column in the DataFrame (axis=0 => col).

Note that the first argument apply is the function itself, not the function applied to some data. For example, suppose we define a function called sq that squares the input values:

def sq(x):
    return x**2

To apply this to one of the columns in our DataFrame (say, the party size column):

checks['party size'].apply(sq)
0      64
1       9
2      25
3       4
4      36
5       1
6      36
7      64
8     100
9       1
10     36
11    100
12     16
13     16
14      1
15     25
16     25
17      9
18     64
19      4
20     25
21      1
22     64
23      4
24    100
25     36
26     36
27      1
28     16
29      9
30    100
31      9
32      9
33     36
34    100
35     16
36      4
37     81
38     49
39     49
40      9
41     49
42     49
43     25
44     49
45     81
46      1
47      1
48     81
49     81
Name: party size, dtype: int64

Why apply?

In the above example, you might wonder why we don’t just apply the function directly to the Series, rather than use apply. Afterall, in the first pandas tutorial we learned that Series are vectorized just like numpy arrays. In other words, why not just do:

sq(checks['party size'])
0      64
1       9
2      25
3       4
4      36
5       1
6      36
7      64
8     100
9       1
10     36
11    100
12     16
13     16
14      1
15     25
16     25
17      9
18     64
19      4
20     25
21      1
22     64
23      4
24    100
25     36
26     36
27      1
28     16
29      9
30    100
31      9
32      9
33     36
34    100
35     16
36      4
37     81
38     49
39     49
40      9
41     49
42     49
43     25
44     49
45     81
46      1
47      1
48     81
49     81
Name: party size, dtype: int64

In this case, you could just run the function on the series. Where apply is useful is when you have more complicated functions, in particular, ones that need to do different things depending on what the input data is. For example, suppose we define the following function to group parties into small, medium, and large depending on how many people are in the party:

def classify_size(x):
    if x < 4:
        return 'small'
    elif x < 8:
        return 'medium'
    else:
        return 'large'

If we try to run this on the party size Series, we get an error:

classify_size(checks['party size'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_11312/1113461677.py in ?()
----> 1 classify_size(checks['party size'])

/tmp/ipykernel_11312/2519828553.py in ?(x)
      1 def classify_size(x):
----> 2     if x < 4:
      3         return 'small'
      4     elif x < 8:
      5         return 'medium'

/opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/generic.py in ?(self)
   1578     @final
   1579     def __nonzero__(self) -> NoReturn:
-> 1580         raise ValueError(
   1581             f"The truth value of a {type(self).__name__} is ambiguous. "
   1582             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1583         )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This is because if statements cannot be vectorized like this. In contrast, the apply method does allow us to apply the function to the series:

checks['party size'].apply(classify_size)
0      large
1      small
2     medium
3      small
4     medium
5      small
6     medium
7      large
8      large
9      small
10    medium
11     large
12    medium
13    medium
14     small
15    medium
16    medium
17     small
18     large
19     small
20    medium
21     small
22     large
23     small
24     large
25    medium
26    medium
27     small
28    medium
29     small
30     large
31     small
32     small
33    medium
34     large
35    medium
36     small
37     large
38    medium
39    medium
40     small
41    medium
42    medium
43    medium
44    medium
45     large
46     small
47     small
48     large
49     large
Name: party size, dtype: object

This is because the apply method takes care to cycle over every element in the series and apply the function.

Cleaning data with apply

The ability of apply to apply more complicated functions involving if statements makes it extremely useful for cleaning datasets. By “cleaning” we mean reformatting data and/or removing spurios values, so that we can use it without issue.

For example, say we want to add price per item to our DataFrame, defined as:

price per item = total amount of check / total items on check

The problem is total amount of check is an object, not a float. This means we cannot do math on it:

# This will raise a TypeError because of the dollar sign and commas!!!
checks['price_per_item'] = checks['total amount of check'] / checks['total items on check']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/ops/array_ops.py:218, in _na_arithmetic_op(left, right, op, is_cmp)
    217 try:
--> 218     result = func(left, right)
    219 except TypeError:

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/computation/expressions.py:242, in evaluate(op, a, b, use_numexpr)
    240     if use_numexpr:
    241         # error: "None" not callable
--> 242         return _evaluate(op, op_str, a, b)  # type: ignore[misc]
    243 return _evaluate_standard(op, op_str, a, b)

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/computation/expressions.py:73, in _evaluate_standard(op, op_str, a, b)
     72     _store_test_result(False)
---> 73 return op(a, b)

TypeError: unsupported operand type(s) for /: 'str' and 'int'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[10], line 2
      1 # This will raise a TypeError because of the dollar sign and commas!!!
----> 2 checks['price_per_item'] = checks['total amount of check'] / checks['total items on check']

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     72             return NotImplemented
     74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/arraylike.py:210, in OpsMixin.__truediv__(self, other)
    208 @unpack_zerodim_and_defer("__truediv__")
    209 def __truediv__(self, other):
--> 210     return self._arith_method(other, operator.truediv)

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/series.py:6154, in Series._arith_method(self, other, op)
   6152 def _arith_method(self, other, op):
   6153     self, other = self._align_for_op(other)
-> 6154     return base.IndexOpsMixin._arith_method(self, other, op)

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/base.py:1391, in IndexOpsMixin._arith_method(self, other, op)
   1388     rvalues = np.arange(rvalues.start, rvalues.stop, rvalues.step)
   1390 with np.errstate(all="ignore"):
-> 1391     result = ops.arithmetic_op(lvalues, rvalues, op)
   1393 return self._construct_result(result, name=res_name)

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/ops/array_ops.py:283, in arithmetic_op(left, right, op)
    279     _bool_arith_check(op, left, right)  # type: ignore[arg-type]
    281     # error: Argument 1 to "_na_arithmetic_op" has incompatible type
    282     # "Union[ExtensionArray, ndarray[Any, Any]]"; expected "ndarray[Any, Any]"
--> 283     res_values = _na_arithmetic_op(left, right, op)  # type: ignore[arg-type]
    285 return res_values

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/ops/array_ops.py:227, in _na_arithmetic_op(left, right, op, is_cmp)
    219 except TypeError:
    220     if not is_cmp and (
    221         left.dtype == object or getattr(right, "dtype", None) == object
    222     ):
   (...)    225         # Don't do this for comparisons, as that will handle complex numbers
    226         #  incorrectly, see GH#32047
--> 227         result = _masked_arith_op(left, right, op)
    228     else:
    229         raise

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/ops/array_ops.py:163, in _masked_arith_op(x, y, op)
    161     # See GH#5284, GH#5035, GH#19448 for historical reference
    162     if mask.any():
--> 163         result[mask] = op(xrav[mask], yrav[mask])
    165 else:
    166     if not is_scalar(y):

TypeError: unsupported operand type(s) for /: 'str' and 'int'

How do we fix this? Let’s write a function to convert string values like this: $4,590.45 into floats like this: 4590.45

def clean_currency(value:str) -> float:
    '''
    This function will take a string value and remove the dollar sign and commas
    and return a float value.
    '''
    return float(value.replace(',', '').replace('$', ''))


# tests
assert clean_currency('$1,000.00') == 1000.00
assert clean_currency('$1,000') == 1000.00
assert clean_currency('1,000') == 1000.00
assert clean_currency('$1000') == 1000.00

With our function written we can use apply() to transform the series:

checks['total_amount_of_check_cleaned'] = checks['total amount of check'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks.sample(10)
check date party size total items on check total amount of check gratuity total_amount_of_check_cleaned price_per_item
22 1336 2024-08-30 8 28 $1,199.80 $275.95 1199.80 42.85
9 2968 2024-12-28 1 3 $122.97 $23.36 122.97 40.99
49 3404 2024-07-19 9 26 $2,382.90 $71.49 2382.90 91.65
7 1564 2024-09-23 8 11 $928.40 $204.25 928.40 84.40
21 4440 2024-06-11 1 3 $168.96 $10.14 168.96 56.32
0 2827 2024-05-06 8 12 $415.08 $107.92 415.08 34.59
44 2053 2024-12-14 7 23 $588.11 $164.67 588.11 25.57
17 3795 2024-02-21 3 7 $212.38 $46.72 212.38 30.34
2 3685 2024-12-07 5 5 $252.95 $50.59 252.95 50.59
23 1194 2024-07-06 2 6 $453.06 $72.49 453.06 75.51
Note

Remember its a really good idea to track lineage when you are building a data pipeline.

NEVER replace columns, always create new ones.

CautionCode Challenge 5.1

Let’s take what we did so far, and create a dataset that would be better prepared for analysis / machine learning. We’ll display it in a Streamlit app.

  1. Create a script called st-cleaned_checks.py.

  2. Add your clean_currency function to the script.

  3. Load the checks dataset into a Pandas DataFrame and:

    • Clean the total amount of check and gratuity columns.
    • Calculate the price_per_item as total amount of check / total items on check.
    • Calcualte the price_per_person as total amont of check / party size.
    • Calcualte the items_per_person as total items on check / party size.
    • Calcualte the tip_percentage as the total amount of check / gratuity.
  4. Display the cleaned DataFrame along with a description of its data (i.e., display the output of describe) in a Streamlit app.

  5. Test that your app works by running it.

  1. The st-cleaned_checks.py script:
import streamlit as st
import pandas as pd

def clean_currency(value:str) -> float:
    '''
    This function will take a string value and remove the dollar sign and commas
    and return a float value.
    '''
    return float(value.replace(',', '').replace('$', ''))

st.title("Dining Check Data")

# load
checks = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/dining/check-data.csv')

# transformations
checks['total_amount_of_check_cleaned'] = checks['total amount of check'].apply(clean_currency)
checks['gratuity_cleaned'] = checks['gratuity'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks['price_per_person'] = checks['total_amount_of_check_cleaned'] / checks['party size']
checks['items_per_person'] = checks['total items on check'] / checks['party size']
checks['tip_percentage'] = checks['gratuity_cleaned'] / checks['total_amount_of_check_cleaned']

st.dataframe(checks, width=1000)

st.header("Summary:")
st.dataframe(checks.describe())
  1. Test it by running the following in the terminal (make sure to activate your conda environment first):
python -m streamlit run st-cleaned_checks.py

Using lambdas to apply functions to multiple columns

So far we’ve used apply with functions that take in a single Series. What do we do if we need a function to operate on multiple columns in a DataFrame? For that, we can use Python lambda functions.

An example using our checks data:

Marketing wants you to build some key performance indicators (KPIs) using the checks data. A KPI is a statistic that summarizes some larger data set, so the data can be more easily tracked over time. For example a letter grade such as an A- is a KPI summary of all your graded efforts to date.

In this example, marketing wants you to build two KPIs:

KPI 1: Whales

Marketing has decided to group customers into the following categories: - big eaters: Customers who are in the top 25% (i.e., above the 0.75 quantile) for items per person. - big spenders: Customers who are in the top 25% for price per person. - whale: Customers who are in the top 25% for both items per person and price per person.

KPI 2: Tippers

Marketing has decided to further group customers into the following categories based on their tipping: - light: Customers who are in the botton 25% (i.e., below the 0.25 quantile) by tip percentage. - heavy: Customers who are in the top 25% (i.e., above the 0.75 quantile) by tip percentage.

To calculate percentiles we will use the quantile Series method in Pandas. This returns the value at which X% of the data is below the given percentile.

Before we can apply our KPI’s we must write the functions!

checks['gratuity_cleaned'] = checks['gratuity'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks['price_per_person'] = checks['total_amount_of_check_cleaned'] / checks['party size']
checks['items_per_person'] = checks['total items on check'] / checks['party size']
checks['tip_percentage'] = checks['gratuity_cleaned'] / checks['total_amount_of_check_cleaned']

To categorize whales:

def detect_whale(
        items_per_person:float, 
        price_per_person:float, 
        items_per_person_75th_pctile:float, 
        price_per_person_75_pctile:float) -> str:
    if items_per_person > items_per_person_75th_pctile and price_per_person > price_per_person_75_pctile:
        return 'whale'
    if items_per_person > items_per_person_75th_pctile:
        return 'big eater'
    if price_per_person > price_per_person_75_pctile:
        return 'big spender'
    return ''

Let’s test our function using the quantile method:

# tests
ppp_75 = checks['price_per_person'].quantile(0.75)
ipp_75 = checks['items_per_person'].quantile(0.75)
print(ppp_75, ipp_75)
assert detect_whale(5, 250, 3, 175) == 'whale'
assert detect_whale(5, 100, 3, 175) == 'big eater'
assert detect_whale(1, 250, 3, 175) == 'big spender'
assert detect_whale(1, 100, 3, 175) == ''
158.35666666666668 3.0

Now we want to apply the detect_whale function to the checks DataFrame. But detect_whale requires two columns as input, the items_per_person and price_per_person. How do we do that?

We can use a lambda function to quickly define a small function that will return the required columns when provided a row. In general, the syntax for a lambda function is lambda ARGS: FUNC. For example, lambda a, b: a+b will return the sum of the two arguments its given.

In our case, we can use a lambda function to pull out the needed columns and give them to apply, like so:

checks['whale'] = checks.apply(lambda row: detect_whale(row['items_per_person'], row['price_per_person'], ipp_75, ppp_75), axis=1)
checks.sample(25)
check date party size total items on check total amount of check gratuity total_amount_of_check_cleaned price_per_item gratuity_cleaned price_per_person items_per_person tip_percentage whale
31 1945 2024-02-05 3 7 $132.86 $21.26 132.86 18.98 21.26 44.286667 2.333333 0.160018
28 2446 2024-12-15 4 12 $575.64 $28.78 575.64 47.97 28.78 143.910000 3.000000 0.049997
25 4031 2024-08-12 6 14 $655.48 $65.55 655.48 46.82 65.55 109.246667 2.333333 0.100003
13 3867 2024-05-02 4 14 $499.10 $119.78 499.10 35.65 119.78 124.775000 3.500000 0.239992 big eater
9 2968 2024-12-28 1 3 $122.97 $23.36 122.97 40.99 23.36 122.970000 3.000000 0.189965
42 1361 2024-11-21 7 14 $65.80 $16.45 65.80 4.70 16.45 9.400000 2.000000 0.250000
3 1957 2024-02-15 2 2 $42.44 $8.91 42.44 21.22 8.91 21.220000 1.000000 0.209943
38 2341 2024-06-03 7 16 $1,118.88 $313.29 1118.88 69.93 313.29 159.840000 2.285714 0.280003 big spender
33 3842 2024-03-31 6 6 $147.12 $5.88 147.12 24.52 5.88 24.520000 1.000000 0.039967
16 3694 2024-11-03 5 17 $1,574.37 $173.18 1574.37 92.61 173.18 314.874000 3.400000 0.110000 whale
19 3718 2024-10-30 2 5 $464.70 $120.82 464.70 92.94 120.82 232.350000 2.500000 0.259996 big spender
46 3621 2024-06-23 1 2 $138.76 $19.43 138.76 69.38 19.43 138.760000 2.000000 0.140026
5 2191 2024-01-06 1 3 $17.85 $1.96 17.85 5.95 1.96 17.850000 3.000000 0.109804
43 1186 2024-09-21 5 16 $298.72 $74.68 298.72 18.67 74.68 59.744000 3.200000 0.250000 big eater
35 2486 2024-01-27 4 13 $569.01 $108.11 569.01 43.77 108.11 142.252500 3.250000 0.189997 big eater
15 2386 2024-03-31 5 12 $1,147.80 $137.74 1147.80 95.65 137.74 229.560000 2.400000 0.120003 big spender
1 2443 2024-06-09 3 10 $286.40 $31.50 286.40 28.64 31.50 95.466667 3.333333 0.109986 big eater
34 1368 2024-12-21 10 25 $2,193.00 $372.81 2193.00 87.72 372.81 219.300000 2.500000 0.170000 big spender
10 2809 2024-12-30 6 6 $104.46 $1.04 104.46 17.41 1.04 17.410000 1.000000 0.009956
32 1440 2024-11-30 3 8 $589.04 $141.37 589.04 73.63 141.37 196.346667 2.666667 0.240001 big spender
29 4590 2024-05-08 3 5 $220.40 $22.04 220.40 44.08 22.04 73.466667 1.666667 0.100000
22 1336 2024-08-30 8 28 $1,199.80 $275.95 1199.80 42.85 275.95 149.975000 3.500000 0.229997 big eater
48 4161 2024-06-22 9 28 $1,385.16 $235.48 1385.16 49.47 235.48 153.906667 3.111111 0.170002 big eater
20 3393 2024-08-26 5 6 $302.64 $24.21 302.64 50.44 24.21 60.528000 1.200000 0.079996
14 3676 2024-02-25 1 1 $19.89 $1.99 19.89 19.89 1.99 19.890000 1.000000 0.100050
CautionCode Challenge 5.2

Let’s build on the Streamlit app you made in Challenge 5.1, and add our KPIs to it.

  1. Copy st-cleaned_checks.py to a new file called st-checks_kpi.py.

  2. Add the detect_whale function defined above to st-checks_kpi.py.

  3. Write a new function detect_tipper(tip_pct, tip_pcy_75th_pctile, tip_pct_25_pctile) that should return either “light” or “heavy” depending on whether the tip is below the 0.25 quantile or above the 0.75 quantile, respectively. If neither, the function should just return an empty string.

  4. Apply the detect_whale and detect_tipper functions to the DataFrame, storing the results in new columns called whale and tipper, respectively.

  5. As with before, diplay the modified DataFrame and the output of describe in a Streamlit app.

  1. It’s easiest to do the copy using the command line. In the terminal, run:
cp st-cleaned_checks.py st-checks_kpi.py
  1. Now modify st-checks_kpi.py. It should look like:
import streamlit as st
import pandas as pd

def clean_currency(value:str) -> float:
    '''
    This function will take a string value and remove the dollar sign and commas
    and return a float value.
    '''
    return float(value.replace(',', '').replace('$', ''))

def detect_whale(
        items_per_person:float, 
        price_per_person:float, 
        items_per_person_75th_pctile:float, 
        price_per_person_75_pctile:float) -> str:
    if items_per_person > items_per_person_75th_pctile and price_per_person > price_per_person_75_pctile:
        return 'whale'
    if items_per_person > items_per_person_75th_pctile:
        return 'big eater'
    if price_per_person > price_per_person_75_pctile:
        return 'big spender'
    
    return ''


def detect_tipper(tip_pct:float, tip_pct_75th_pctile:float, tip_pct_25th_pctile:float) -> str:
    if tip_pct > tip_pct_75th_pctile:
        return 'heavy tipper'
    if tip_pct < tip_pct_25th_pctile:
        return 'light tipper'
    return ''

st.title("Dining Check Data")

# load
checks = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/dining/check-data.csv')

# transformations
checks['total_amount_of_check_cleaned'] = checks['total amount of check'].apply(clean_currency)
checks['gratuity_cleaned'] = checks['gratuity'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks['price_per_person'] = checks['total_amount_of_check_cleaned'] / checks['party size']
checks['items_per_person'] = checks['total items on check'] / checks['party size']
checks['tip_percentage'] = checks['gratuity_cleaned'] / checks['total_amount_of_check_cleaned']

## The new stuff adding KPIs:
# get KPI boundaries
ppp_75 = checks['price_per_person'].quantile(0.75)
ipp_75 = checks['items_per_person'].quantile(0.75)
tp_75 = checks['tip_percentage'].quantile(0.75)
tp_25 = checks['tip_percentage'].quantile(0.25)

# Calcualte KPI's
checks['whale'] = checks.apply(lambda row: detect_whale(row['items_per_person'], row['price_per_person'], ipp_75, ppp_75), axis=1)
checks['tipper'] = checks.apply(lambda row: detect_tipper(row['tip_percentage'], tp_75, tp_25), axis=1)

# Now display
st.dataframe(checks, width=1000)

st.header("Summary:")
st.dataframe(checks.describe())

Looping over Dataframes

If you must run a for loop over your DataFrames, there are two choices:

  • df.iterrows() dict-like iteration
  • df.itertuples() named-tuple like iteration (faster)

Let’s do an example where we display the check number, whale and tipper for “heavy tipper” checks.

## Using the iterrows() method
print("Total Amount of Whale Checks")
for i,row in checks.iterrows():
    if row['whale'] == 'whale':
        print(i, row['check'], row['total_amount_of_check_cleaned'])
# Same example with the itertuples() method
print("Total Amount of Whale Checks")
for row in checks.itertuples():
    if row.whale == 'whale':
        print(row.check, row.total_amount_of_check_cleaned)
# Of course you don't need a loop to do this:
checks[checks['whale'] == 'whale'][['check', 'total_amount_of_check_cleaned']]