5. Basic data cleaning with Pandas

In this lesson we will start learning how to clean a dataframe data and loop over it. To follow along with the commands below, download the following scratch notebook to your ist356 directory, then open it with VS Code:

Download scratch-pandas-4.ipynb

For this tutorial, we will be using some data representing a pretend restaurant’s transactions. The file is:

checks_data = 'https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/dining/check-data.csv'

(This link is provided in the scratch notebook.)

Let’s load the file with Pandas and sample a few of its rows to see what it contains:

import pandas as pd

checks = pd.read_csv(checks_data)
checks.sample(10)

	check	date	party size	total items on check	total amount of check	gratuity
30	2705	2024-07-08	10	19	$838.85	$671.08
25	4031	2024-08-12	6	14	$655.48	$65.55
8	1066	2024-08-20	10	22	$485.76	$77.72
29	4590	2024-05-08	3	5	$220.40	$22.04
38	2341	2024-06-03	7	16	$1,118.88	$313.29
14	3676	2024-02-25	1	1	$19.89	$1.99
34	1368	2024-12-21	10	25	$2,193.00	$372.81
49	3404	2024-07-19	9	26	$2,382.90	$71.49
13	3867	2024-05-02	4	14	$499.10	$119.78
6	2527	2024-03-27	6	21	$921.48	$55.29

Let’s use the info method to get some more information about the columns:

checks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   check                  50 non-null     int64 
 1   date                   50 non-null     object
 2   party size             50 non-null     int64 
 3   total items on check   50 non-null     int64 
 4   total amount of check  50 non-null     object
 5   gratuity               50 non-null     object
dtypes: int64(3), object(3)
memory usage: 2.5+ KB

There’s something odd here! Note that the data type of some of the columns (e.g., total amount of check) are object instead of floats, as you might expect. As we’ll see below, this is because of the $ in the values; that will cause issues when we try to work with these columns. We’ll learn how to “clean” these columns so that we can do useful things with them.

Apply

The apply method allows us to execute a function over a Series or the entire DataFrame.

The general syntax:

Series.apply(func) <== call function func for every item in the Series.
DataFrame.apply(func, axis=1) <== call function func for every row in the DataFrame (axis=1 => row).
DataFrame.apply(func, axis=0) <== call function func for every column in the DataFrame (axis=0 => col).

Note that the first argument apply is the function itself, not the function applied to some data. For example, suppose we define a function called sq that squares the input values:

def sq(x):
    return x**2

To apply this to one of the columns in our DataFrame (say, the party size column):

checks['party size'].apply(sq)

0      64
1       9
2      25
3       4
4      36
5       1
6      36
7      64
8     100
9       1
10     36
11    100
12     16
13     16
14      1
15     25
16     25
17      9
18     64
19      4
20     25
21      1
22     64
23      4
24    100
25     36
26     36
27      1
28     16
29      9
30    100
31      9
32      9
33     36
34    100
35     16
36      4
37     81
38     49
39     49
40      9
41     49
42     49
43     25
44     49
45     81
46      1
47      1
48     81
49     81
Name: party size, dtype: int64

Why `apply`?

In the above example, you might wonder why we don’t just apply the function directly to the Series, rather than use apply. Afterall, in the first pandas tutorial we learned that Series are vectorized just like numpy arrays. In other words, why not just do:

sq(checks['party size'])

0      64
1       9
2      25
3       4
4      36
5       1
6      36
7      64
8     100
9       1
10     36
11    100
12     16
13     16
14      1
15     25
16     25
17      9
18     64
19      4
20     25
21      1
22     64
23      4
24    100
25     36
26     36
27      1
28     16
29      9
30    100
31      9
32      9
33     36
34    100
35     16
36      4
37     81
38     49
39     49
40      9
41     49
42     49
43     25
44     49
45     81
46      1
47      1
48     81
49     81
Name: party size, dtype: int64

In this case, you could just run the function on the series. Where apply is useful is when you have more complicated functions, in particular, ones that need to do different things depending on what the input data is. For example, suppose we define the following function to group parties into small, medium, and large depending on how many people are in the party:

def classify_size(x):
    if x < 4:
        return 'small'
    elif x < 8:
        return 'medium'
    else:
        return 'large'

If we try to run this on the party size Series, we get an error:

classify_size(checks['party size'])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_11312/1113461677.py in ?()
----> 1 classify_size(checks['party size'])

/tmp/ipykernel_11312/2519828553.py in ?(x)
      1 def classify_size(x):
----> 2     if x < 4:
      3         return 'small'
      4     elif x < 8:
      5         return 'medium'

/opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/generic.py in ?(self)
   1578     @final
   1579     def __nonzero__(self) -> NoReturn:
-> 1580         raise ValueError(
   1581             f"The truth value of a {type(self).__name__} is ambiguous. "
   1582             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1583         )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This is because if statements cannot be vectorized like this. In contrast, the apply method does allow us to apply the function to the series:

checks['party size'].apply(classify_size)

0      large
1      small
2     medium
3      small
4     medium
5      small
6     medium
7      large
8      large
9      small
10    medium
11     large
12    medium
13    medium
14     small
15    medium
16    medium
17     small
18     large
19     small
20    medium
21     small
22     large
23     small
24     large
25    medium
26    medium
27     small
28    medium
29     small
30     large
31     small
32     small
33    medium
34     large
35    medium
36     small
37     large
38    medium
39    medium
40     small
41    medium
42    medium
43    medium
44    medium
45     large
46     small
47     small
48     large
49     large
Name: party size, dtype: object

This is because the apply method takes care to cycle over every element in the series and apply the function.

Cleaning data with `apply`

The ability of apply to apply more complicated functions involving if statements makes it extremely useful for cleaning datasets. By “cleaning” we mean reformatting data and/or removing spurios values, so that we can use it without issue.

For example, say we want to add price per item to our DataFrame, defined as:

price per item = total amount of check / total items on check

The problem is total amount of check is an object, not a float. This means we cannot do math on it:

# This will raise a TypeError because of the dollar sign and commas!!!
checks['price_per_item'] = checks['total amount of check'] / checks['total items on check']

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/ops/array_ops.py:218, in _na_arithmetic_op(left, right, op, is_cmp)
    217 try:
--> 218     result = func(left, right)
    219 except TypeError:

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/computation/expressions.py:242, in evaluate(op, a, b, use_numexpr)
    240     if use_numexpr:
    241         # error: "None" not callable
--> 242         return _evaluate(op, op_str, a, b)  # type: ignore[misc]
    243 return _evaluate_standard(op, op_str, a, b)

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/computation/expressions.py:73, in _evaluate_standard(op, op_str, a, b)
     72     _store_test_result(False)
---> 73 return op(a, b)

TypeError: unsupported operand type(s) for /: 'str' and 'int'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[10], line 2
      1 # This will raise a TypeError because of the dollar sign and commas!!!
----> 2 checks['price_per_item'] = checks['total amount of check'] / checks['total items on check']

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     72             return NotImplemented
     74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/arraylike.py:210, in OpsMixin.__truediv__(self, other)
    208 @unpack_zerodim_and_defer("__truediv__")
    209 def __truediv__(self, other):
--> 210     return self._arith_method(other, operator.truediv)

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/series.py:6154, in Series._arith_method(self, other, op)
   6152 def _arith_method(self, other, op):
   6153     self, other = self._align_for_op(other)
-> 6154     return base.IndexOpsMixin._arith_method(self, other, op)

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/base.py:1391, in IndexOpsMixin._arith_method(self, other, op)
   1388     rvalues = np.arange(rvalues.start, rvalues.stop, rvalues.step)
   1390 with np.errstate(all="ignore"):
-> 1391     result = ops.arithmetic_op(lvalues, rvalues, op)
   1393 return self._construct_result(result, name=res_name)

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/ops/array_ops.py:283, in arithmetic_op(left, right, op)
    279     _bool_arith_check(op, left, right)  # type: ignore[arg-type]
    281     # error: Argument 1 to "_na_arithmetic_op" has incompatible type
    282     # "Union[ExtensionArray, ndarray[Any, Any]]"; expected "ndarray[Any, Any]"
--> 283     res_values = _na_arithmetic_op(left, right, op)  # type: ignore[arg-type]
    285 return res_values

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/ops/array_ops.py:227, in _na_arithmetic_op(left, right, op, is_cmp)
    219 except TypeError:
    220     if not is_cmp and (
    221         left.dtype == object or getattr(right, "dtype", None) == object
    222     ):
   (...)    225         # Don't do this for comparisons, as that will handle complex numbers
    226         #  incorrectly, see GH#32047
--> 227         result = _masked_arith_op(left, right, op)
    228     else:
    229         raise

File /opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/pandas/core/ops/array_ops.py:163, in _masked_arith_op(x, y, op)
    161     # See GH#5284, GH#5035, GH#19448 for historical reference
    162     if mask.any():
--> 163         result[mask] = op(xrav[mask], yrav[mask])
    165 else:
    166     if not is_scalar(y):

TypeError: unsupported operand type(s) for /: 'str' and 'int'

How do we fix this? Let’s write a function to convert string values like this: $4,590.45 into floats like this: 4590.45

def clean_currency(value:str) -> float:
    '''
    This function will take a string value and remove the dollar sign and commas
    and return a float value.
    '''
    return float(value.replace(',', '').replace('$', ''))


# tests
assert clean_currency('$1,000.00') == 1000.00
assert clean_currency('$1,000') == 1000.00
assert clean_currency('1,000') == 1000.00
assert clean_currency('$1000') == 1000.00

With our function written we can use apply() to transform the series:

checks['total_amount_of_check_cleaned'] = checks['total amount of check'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks.sample(10)

	check	date	party size	total items on check	total amount of check	gratuity	total_amount_of_check_cleaned	price_per_item
22	1336	2024-08-30	8	28	$1,199.80	$275.95	1199.80	42.85
9	2968	2024-12-28	1	3	$122.97	$23.36	122.97	40.99
49	3404	2024-07-19	9	26	$2,382.90	$71.49	2382.90	91.65
7	1564	2024-09-23	8	11	$928.40	$204.25	928.40	84.40
21	4440	2024-06-11	1	3	$168.96	$10.14	168.96	56.32
0	2827	2024-05-06	8	12	$415.08	$107.92	415.08	34.59
44	2053	2024-12-14	7	23	$588.11	$164.67	588.11	25.57
17	3795	2024-02-21	3	7	$212.38	$46.72	212.38	30.34
2	3685	2024-12-07	5	5	$252.95	$50.59	252.95	50.59
23	1194	2024-07-06	2	6	$453.06	$72.49	453.06	75.51

Note

Remember its a really good idea to track lineage when you are building a data pipeline.

NEVER replace columns, always create new ones.

Code Challenge 5.1

Let’s take what we did so far, and create a dataset that would be better prepared for analysis / machine learning. We’ll display it in a Streamlit app.

Create a script called st-cleaned_checks.py.
Add your clean_currency function to the script.
Load the checks dataset into a Pandas DataFrame and:
- Clean the total amount of check and gratuity columns.
- Calculate the price_per_item as total amount of check / total items on check.
- Calcualte the price_per_person as total amont of check / party size.
- Calcualte the items_per_person as total items on check / party size.
- Calcualte the tip_percentage as the total amount of check / gratuity.
Display the cleaned DataFrame along with a description of its data (i.e., display the output of describe) in a Streamlit app.
Test that your app works by running it.

Solution

The st-cleaned_checks.py script:

import streamlit as st
import pandas as pd

def clean_currency(value:str) -> float:
    '''
    This function will take a string value and remove the dollar sign and commas
    and return a float value.
    '''
    return float(value.replace(',', '').replace('$', ''))

st.title("Dining Check Data")

# load
checks = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/dining/check-data.csv')

# transformations
checks['total_amount_of_check_cleaned'] = checks['total amount of check'].apply(clean_currency)
checks['gratuity_cleaned'] = checks['gratuity'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks['price_per_person'] = checks['total_amount_of_check_cleaned'] / checks['party size']
checks['items_per_person'] = checks['total items on check'] / checks['party size']
checks['tip_percentage'] = checks['gratuity_cleaned'] / checks['total_amount_of_check_cleaned']

st.dataframe(checks, width=1000)

st.header("Summary:")
st.dataframe(checks.describe())

Test it by running the following in the terminal (make sure to activate your conda environment first):

python -m streamlit run st-cleaned_checks.py

Using `lambdas` to apply functions to multiple columns

So far we’ve used apply with functions that take in a single Series. What do we do if we need a function to operate on multiple columns in a DataFrame? For that, we can use Python lambda functions.

An example using our checks data:

Marketing wants you to build some key performance indicators (KPIs) using the checks data. A KPI is a statistic that summarizes some larger data set, so the data can be more easily tracked over time. For example a letter grade such as an A- is a KPI summary of all your graded efforts to date.

In this example, marketing wants you to build two KPIs:

KPI 1: Whales

Marketing has decided to group customers into the following categories: - big eaters: Customers who are in the top 25% (i.e., above the 0.75 quantile) for items per person. - big spenders: Customers who are in the top 25% for price per person. - whale: Customers who are in the top 25% for both items per person and price per person.

KPI 2: Tippers

Marketing has decided to further group customers into the following categories based on their tipping: - light: Customers who are in the botton 25% (i.e., below the 0.25 quantile) by tip percentage. - heavy: Customers who are in the top 25% (i.e., above the 0.75 quantile) by tip percentage.

To calculate percentiles we will use the quantile Series method in Pandas. This returns the value at which X% of the data is below the given percentile.

Before we can apply our KPI’s we must write the functions!

checks['gratuity_cleaned'] = checks['gratuity'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks['price_per_person'] = checks['total_amount_of_check_cleaned'] / checks['party size']
checks['items_per_person'] = checks['total items on check'] / checks['party size']
checks['tip_percentage'] = checks['gratuity_cleaned'] / checks['total_amount_of_check_cleaned']

To categorize whales:

def detect_whale(
        items_per_person:float, 
        price_per_person:float, 
        items_per_person_75th_pctile:float, 
        price_per_person_75_pctile:float) -> str:
    if items_per_person > items_per_person_75th_pctile and price_per_person > price_per_person_75_pctile:
        return 'whale'
    if items_per_person > items_per_person_75th_pctile:
        return 'big eater'
    if price_per_person > price_per_person_75_pctile:
        return 'big spender'
    return ''

Let’s test our function using the quantile method:

# tests
ppp_75 = checks['price_per_person'].quantile(0.75)
ipp_75 = checks['items_per_person'].quantile(0.75)
print(ppp_75, ipp_75)
assert detect_whale(5, 250, 3, 175) == 'whale'
assert detect_whale(5, 100, 3, 175) == 'big eater'
assert detect_whale(1, 250, 3, 175) == 'big spender'
assert detect_whale(1, 100, 3, 175) == ''

158.35666666666668 3.0

Now we want to apply the detect_whale function to the checks DataFrame. But detect_whale requires two columns as input, the items_per_person and price_per_person. How do we do that?

We can use a lambda function to quickly define a small function that will return the required columns when provided a row. In general, the syntax for a lambda function is lambda ARGS: FUNC. For example, lambda a, b: a+b will return the sum of the two arguments its given.

In our case, we can use a lambda function to pull out the needed columns and give them to apply, like so:

checks['whale'] = checks.apply(lambda row: detect_whale(row['items_per_person'], row['price_per_person'], ipp_75, ppp_75), axis=1)
checks.sample(25)

	check	date	party size	total items on check	total amount of check	gratuity	total_amount_of_check_cleaned	price_per_item	gratuity_cleaned	price_per_person	items_per_person	tip_percentage	whale
31	1945	2024-02-05	3	7	$132.86	$21.26	132.86	18.98	21.26	44.286667	2.333333	0.160018
28	2446	2024-12-15	4	12	$575.64	$28.78	575.64	47.97	28.78	143.910000	3.000000	0.049997
25	4031	2024-08-12	6	14	$655.48	$65.55	655.48	46.82	65.55	109.246667	2.333333	0.100003
13	3867	2024-05-02	4	14	$499.10	$119.78	499.10	35.65	119.78	124.775000	3.500000	0.239992	big eater
9	2968	2024-12-28	1	3	$122.97	$23.36	122.97	40.99	23.36	122.970000	3.000000	0.189965
42	1361	2024-11-21	7	14	$65.80	$16.45	65.80	4.70	16.45	9.400000	2.000000	0.250000
3	1957	2024-02-15	2	2	$42.44	$8.91	42.44	21.22	8.91	21.220000	1.000000	0.209943
38	2341	2024-06-03	7	16	$1,118.88	$313.29	1118.88	69.93	313.29	159.840000	2.285714	0.280003	big spender
33	3842	2024-03-31	6	6	$147.12	$5.88	147.12	24.52	5.88	24.520000	1.000000	0.039967
16	3694	2024-11-03	5	17	$1,574.37	$173.18	1574.37	92.61	173.18	314.874000	3.400000	0.110000	whale
19	3718	2024-10-30	2	5	$464.70	$120.82	464.70	92.94	120.82	232.350000	2.500000	0.259996	big spender
46	3621	2024-06-23	1	2	$138.76	$19.43	138.76	69.38	19.43	138.760000	2.000000	0.140026
5	2191	2024-01-06	1	3	$17.85	$1.96	17.85	5.95	1.96	17.850000	3.000000	0.109804
43	1186	2024-09-21	5	16	$298.72	$74.68	298.72	18.67	74.68	59.744000	3.200000	0.250000	big eater
35	2486	2024-01-27	4	13	$569.01	$108.11	569.01	43.77	108.11	142.252500	3.250000	0.189997	big eater
15	2386	2024-03-31	5	12	$1,147.80	$137.74	1147.80	95.65	137.74	229.560000	2.400000	0.120003	big spender
1	2443	2024-06-09	3	10	$286.40	$31.50	286.40	28.64	31.50	95.466667	3.333333	0.109986	big eater
34	1368	2024-12-21	10	25	$2,193.00	$372.81	2193.00	87.72	372.81	219.300000	2.500000	0.170000	big spender
10	2809	2024-12-30	6	6	$104.46	$1.04	104.46	17.41	1.04	17.410000	1.000000	0.009956
32	1440	2024-11-30	3	8	$589.04	$141.37	589.04	73.63	141.37	196.346667	2.666667	0.240001	big spender
29	4590	2024-05-08	3	5	$220.40	$22.04	220.40	44.08	22.04	73.466667	1.666667	0.100000
22	1336	2024-08-30	8	28	$1,199.80	$275.95	1199.80	42.85	275.95	149.975000	3.500000	0.229997	big eater
48	4161	2024-06-22	9	28	$1,385.16	$235.48	1385.16	49.47	235.48	153.906667	3.111111	0.170002	big eater
20	3393	2024-08-26	5	6	$302.64	$24.21	302.64	50.44	24.21	60.528000	1.200000	0.079996
14	3676	2024-02-25	1	1	$19.89	$1.99	19.89	19.89	1.99	19.890000	1.000000	0.100050

Code Challenge 5.2

Let’s build on the Streamlit app you made in Challenge 5.1, and add our KPIs to it.

Copy st-cleaned_checks.py to a new file called st-checks_kpi.py.
Add the detect_whale function defined above to st-checks_kpi.py.
Write a new function detect_tipper(tip_pct, tip_pcy_75th_pctile, tip_pct_25_pctile) that should return either “light” or “heavy” depending on whether the tip is below the 0.25 quantile or above the 0.75 quantile, respectively. If neither, the function should just return an empty string.
Apply the detect_whale and detect_tipper functions to the DataFrame, storing the results in new columns called whale and tipper, respectively.
As with before, diplay the modified DataFrame and the output of describe in a Streamlit app.

Solution

It’s easiest to do the copy using the command line. In the terminal, run:

cp st-cleaned_checks.py st-checks_kpi.py

Now modify st-checks_kpi.py. It should look like:

import streamlit as st
import pandas as pd

def clean_currency(value:str) -> float:
    '''
    This function will take a string value and remove the dollar sign and commas
    and return a float value.
    '''
    return float(value.replace(',', '').replace('$', ''))

def detect_whale(
        items_per_person:float, 
        price_per_person:float, 
        items_per_person_75th_pctile:float, 
        price_per_person_75_pctile:float) -> str:
    if items_per_person > items_per_person_75th_pctile and price_per_person > price_per_person_75_pctile:
        return 'whale'
    if items_per_person > items_per_person_75th_pctile:
        return 'big eater'
    if price_per_person > price_per_person_75_pctile:
        return 'big spender'
    
    return ''


def detect_tipper(tip_pct:float, tip_pct_75th_pctile:float, tip_pct_25th_pctile:float) -> str:
    if tip_pct > tip_pct_75th_pctile:
        return 'heavy tipper'
    if tip_pct < tip_pct_25th_pctile:
        return 'light tipper'
    return ''

st.title("Dining Check Data")

# load
checks = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/refs/heads/master/dining/check-data.csv')

# transformations
checks['total_amount_of_check_cleaned'] = checks['total amount of check'].apply(clean_currency)
checks['gratuity_cleaned'] = checks['gratuity'].apply(clean_currency)
checks['price_per_item'] = checks['total_amount_of_check_cleaned'] / checks['total items on check']
checks['price_per_person'] = checks['total_amount_of_check_cleaned'] / checks['party size']
checks['items_per_person'] = checks['total items on check'] / checks['party size']
checks['tip_percentage'] = checks['gratuity_cleaned'] / checks['total_amount_of_check_cleaned']

## The new stuff adding KPIs:
# get KPI boundaries
ppp_75 = checks['price_per_person'].quantile(0.75)
ipp_75 = checks['items_per_person'].quantile(0.75)
tp_75 = checks['tip_percentage'].quantile(0.75)
tp_25 = checks['tip_percentage'].quantile(0.25)

# Calcualte KPI's
checks['whale'] = checks.apply(lambda row: detect_whale(row['items_per_person'], row['price_per_person'], ipp_75, ppp_75), axis=1)
checks['tipper'] = checks.apply(lambda row: detect_tipper(row['tip_percentage'], tp_75, tp_25), axis=1)

# Now display
st.dataframe(checks, width=1000)

st.header("Summary:")
st.dataframe(checks.describe())

Looping over Dataframes

If you must run a for loop over your DataFrames, there are two choices:

df.iterrows() dict-like iteration
df.itertuples() named-tuple like iteration (faster)

Let’s do an example where we display the check number, whale and tipper for “heavy tipper” checks.

## Using the iterrows() method
print("Total Amount of Whale Checks")
for i,row in checks.iterrows():
    if row['whale'] == 'whale':
        print(i, row['check'], row['total_amount_of_check_cleaned'])

# Same example with the itertuples() method
print("Total Amount of Whale Checks")
for row in checks.itertuples():
    if row.whale == 'whale':
        print(row.check, row.total_amount_of_check_cleaned)

# Of course you don't need a loop to do this:
checks[checks['whale'] == 'whale'][['check', 'total_amount_of_check_cleaned']]

Apply

Why apply?

Cleaning data with apply

Using lambdas to apply functions to multiple columns

Looping over Dataframes

Why `apply`?

Cleaning data with `apply`

Using `lambdas` to apply functions to multiple columns