
Session 04: Advanced Pandas

Pandas
Aggregating
Merging

Introduction

In the previous session, we learned how to import .csv files into Pandas and perform the first exploratory checks on a dataset. We worked with the structure of a single table and practiced reading rows, understanding columns, and interpreting the basic shape of a DataFrame.

In this session, we move from single-table exploration to multi-table analytical thinking.

Our goal is to understand the Instacart dataset table by table, identify what each table represents, ask meaningful analytical questions, and then connect the tables using pd.merge().

This reflects real analytical work. In practice, data is often stored in several related tables rather than in one large spreadsheet. An analyst must first understand each table independently, then combine them carefully in order to answer richer business questions.

For this session, we work with the following files:

  • orders.csv
  • order_products_train.csv
  • products.csv
  • departments_t.csv
  • aisles.csv
Important

Here you can find the Instacart Market Basket Analysis dataset and its related documentation.

Importing the data

Now let’s import the needed data one file at a time:

Important

Remember to download the Archieve.zip file again.

Importing orders.csv file

df_orders = pd.read_csv('../data/raw/orders.csv')

print(df_orders.columns)
print(df_orders.shape)
df_orders.head()
Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='str')
(3421083, 7)
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 NaN
1 2398795 1 prior 2 3 7 15.0
2 473747 1 prior 3 3 12 21.0
3 2254736 1 prior 4 4 7 29.0
4 431534 1 prior 5 4 15 28.0

Importing order_products_train.csv file

# Reading data
df_orders_products_train = pd.read_csv('../data/raw/order_products_train.csv')
print(df_orders_products_train.columns)
print(df_orders_products_train.shape)
df_orders_products_train.head()
Index(['order_id', 'product_id', 'add_to_cart_order', 'reordered'], dtype='str')
(1384617, 4)
order_id product_id add_to_cart_order reordered
0 1 49302 1 1
1 1 11109 2 1
2 1 10246 3 0
3 1 49683 4 0
4 1 43633 5 1

Importing products.csv file

df_products = pd.read_csv("../data/raw/products.csv")
print(df_products.columns)
print(df_products.shape)
df_products.head()
Index(['product_id', 'product_name', 'aisle_id', 'department_id', 'prices'], dtype='str')
(49693, 5)
product_id product_name aisle_id department_id prices
0 1 Chocolate Sandwich Cookies 61 19 5.8
1 2 All-Seasons Salt 104 13 9.3
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.5
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.5
4 5 Green Chile Anytime Sauce 5 13 4.3

Importing departments_t.csv

df_departments = pd.read_csv('../data/processed/departments_t.csv')
print(df_departments.columns)
print(df_departments.shape)
df_departments.head()
Index(['department_id', 'department'], dtype='str')
(21, 2)
department_id department
0 1 frozen
1 2 other
2 3 bakery
3 4 produce
4 5 alcohol

Importing aisles.csv

df_aisles = pd.read_csv('../data/raw/aisles.csv')
print(df_aisles.columns)
print(df_aisles.shape)
df_aisles.head()
Index(['aisle_id', 'aisle'], dtype='str')
(134, 2)
aisle_id aisle
0 1 prepared soups salads
1 2 specialty cheeses
2 3 energy granola bars
3 4 instant foods
4 5 marinades meat preparation

Data Dictionary

Before performing any analysis, we need to understand what each table represents.

Overview of the Tables

Table                 Analytical Role           Grain                                Main Key
orders                Behavioral table          One row per order                    order_id
order_products_train  Transaction bridge table  One row per product within an order  no single unique key
products              Product master table      One row per product                  product_id
departments_t         Department lookup table   One row per department               department_id
aisles                Aisle lookup table        One row per aisle                    aisle_id

The orders Table

The orders table is the behavioral table of the dataset.

Each row represents one order placed by one user.

df_orders.head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 NaN
1 2398795 1 prior 2 3 7 15.0
2 473747 1 prior 3 3 12 21.0
3 2254736 1 prior 4 4 7 29.0
4 431534 1 prior 5 4 15 28.0

Important columns:

  • order_id: unique order identifier
  • user_id: identifier of the customer
  • eval_set: indicates whether the order belongs to prior, train, or test (we are going to use train)
  • order_number: sequence number of the order for that user
  • order_dow: day of week of the order
  • order_hour_of_day: hour of day of the order
  • days_since_prior_order: number of days since the customer’s previous order

Analytical questions that can be answered from this table alone:

  • How many total orders are there?
  • How many users are represented?
  • What is the distribution of eval_set?
  • On which days do customers place orders most frequently?
  • At what hours do customers place orders most frequently?
  • Are missing values in days_since_prior_order meaningful?

The order_products_train Table

This is the transaction bridge table for train orders.

Each row represents one product inside one order.

Important columns:

  • order_id: identifier of the order
  • product_id: identifier of the product
  • add_to_cart_order: the sequence in which the product was added to the basket
  • reordered: indicates whether the item had been ordered before by the same customer
Important
  • order_id is not unique in this table
  • product_id is not unique in this table
  • repeated values are expected because one order can contain many products, and one product can appear in many orders

Analytical questions that can be answered from this table alone:

  • How many line items are there?
  • How many unique orders are represented?
  • How many unique products are represented?
  • What proportion of items are reordered?
  • What does the distribution of add_to_cart_order look like?

The products Table

The products table is the product master table.

Each row represents one product.

Important columns:

  • product_id: unique identifier of the product
  • product_name: product name
  • aisle_id: identifier of the aisle
  • department_id: identifier of the department

Analytical questions that can be answered from this table alone:

  • How many products exist?
  • Are product IDs unique?
  • How many departments are represented in the product catalog?
  • How many aisles are represented in the product catalog?
  • Do some product names repeat?

The departments_t Table

This is a lookup table.

Each row represents one department.

Important columns:

  • department_id: unique department identifier
  • department: department name

Analytical questions that can be answered from this table alone:

  • How many departments exist?
  • Are department IDs unique?

The aisles Table

This is also a lookup table.

Each row represents one aisle.

Important columns:

  • aisle_id: unique aisle identifier
  • aisle: aisle name

Analytical questions that can be answered from this table alone:

  • How many aisles exist?
  • Are aisle IDs unique?

Relationship Between the Tables

Before merging, we should understand how the tables are connected.

flowchart LR
    A[orders] -->|order_id| B[order_products_train]
    B -->|product_id| C[products]
    C -->|department_id| D[departments_t]
    C -->|aisle_id| E[aisles]

This tells us the following:

  • the orders table contains order-level behavior
  • the order_products_train table connects orders to products
  • the products table adds product information
  • the departments_t and aisles tables make product categories more interpretable
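These key relationships can be verified directly with a membership test. Below is a minimal sketch on toy frames (illustration only; in the notebook the same check would use df_products and df_orders_products_train):

```python
import pandas as pd

# Toy stand-ins for the product master and the bridge table.
products = pd.DataFrame({"product_id": [1, 2, 3], "department_id": [19, 13, 7]})
bridge = pd.DataFrame({"order_id": [10, 10, 11], "product_id": [1, 3, 3]})

# Referential integrity: every product_id referenced by the bridge table
# should exist in the product master table.
all_matched = bridge["product_id"].isin(products["product_id"]).all()
print(all_matched)  # True when no orphan product_id exists
```

If this returns False, some transactions reference products that are missing from the catalog, and a later merge would produce unmatched rows.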

Why We Explore Table by Table First

Before using pd.merge(), it is good analytical practice to inspect each table independently.

This allows us to:

  • understand the grain of the data
  • identify key columns
  • find missing values
  • check whether IDs are unique where they should be unique
  • detect possible problems before combining tables
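A quick way to start this inspection is to loop over all tables and print their shapes in one pass. A sketch with toy frames (in the notebook, the dict would hold the five dataframes loaded above):

```python
import pandas as pd

# Toy stand-ins; in practice these would be the five loaded dataframes.
tables = {
    "orders": pd.DataFrame({"order_id": [1, 2]}),
    "products": pd.DataFrame({"product_id": [1, 2, 3]}),
}

# One line per table: name, rows, columns.
for name, df in tables.items():
    print(f"{name}: {df.shape[0]} rows x {df.shape[1]} columns")
```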

Exploring the orders Table

We begin with the orders table because it is the most intuitive table in the Instacart dataset. It contains behavioral information about when users place orders and how frequently they shop.

Each row in this table represents one order placed by one user.

Previewing the data

df_orders.head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 NaN
1 2398795 1 prior 2 3 7 15.0
2 473747 1 prior 3 3 12 21.0
3 2254736 1 prior 4 4 7 29.0
4 431534 1 prior 5 4 15 28.0

The output helps us answer two important structural questions:

  • How many rows and columns does the table have?
  • What variables are available for analysis?

Inspecting the schema

A very important step in Pandas is checking the schema of the table. This helps us understand the data types and whether missing values are present.

df_orders.info()
<class 'pandas.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int64  
 1   user_id                 int64  
 2   eval_set                str    
 3   order_number            int64  
 4   order_dow               int64  
 5   order_hour_of_day       int64  
 6   days_since_prior_order  float64
dtypes: float64(1), int64(5), str(1)
memory usage: 198.9 MB

Looking at data types

Each column has a specific role, and its data type should match its meaning.

order_id                    int64
user_id                     int64
eval_set                      str
order_number                int64
order_dow                   int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object

Descriptive statistics

For numeric columns, descriptive statistics provide a first summary of the data.

df_orders.describe()
order_id user_id order_number order_dow order_hour_of_day days_since_prior_order
count 3.421083e+06 3.421083e+06 3.421083e+06 3.421083e+06 3.421083e+06 3.214874e+06
mean 1.710542e+06 1.029782e+05 1.715486e+01 2.776219e+00 1.345202e+01 1.111484e+01
std 9.875817e+05 5.953372e+04 1.773316e+01 2.046829e+00 4.226088e+00 9.206737e+00
min 1.000000e+00 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 8.552715e+05 5.139400e+04 5.000000e+00 1.000000e+00 1.000000e+01 4.000000e+00
50% 1.710542e+06 1.026890e+05 1.100000e+01 3.000000e+00 1.300000e+01 7.000000e+00
75% 2.565812e+06 1.543850e+05 2.300000e+01 5.000000e+00 1.600000e+01 1.500000e+01
max 3.421083e+06 2.062090e+05 1.000000e+02 6.000000e+00 2.300000e+01 3.000000e+01

This is useful for checking the ranges of variables such as:

  • order_number
  • order_dow
  • order_hour_of_day
  • days_since_prior_order

Note that from now on we will use the pd.options.display.float_format = '{:,.2f}'.format option for data reporting, to make the values easier to read.

pd.options.display.float_format = '{:,.2f}'.format
df_orders.describe()
order_id user_id order_number order_dow order_hour_of_day days_since_prior_order
count 3,421,083.00 3,421,083.00 3,421,083.00 3,421,083.00 3,421,083.00 3,214,874.00
mean 1,710,542.00 102,978.21 17.15 2.78 13.45 11.11
std 987,581.74 59,533.72 17.73 2.05 4.23 9.21
min 1.00 1.00 1.00 0.00 0.00 0.00
25% 855,271.50 51,394.00 5.00 1.00 10.00 4.00
50% 1,710,542.00 102,689.00 11.00 3.00 13.00 7.00
75% 2,565,812.50 154,385.00 23.00 5.00 16.00 15.00
max 3,421,083.00 206,209.00 100.00 6.00 23.00 30.00

Missing values

Missing values should always be checked before moving to deeper analysis.

df_orders.isna().sum()
order_id                       0
user_id                        0
eval_set                       0
order_number                   0
order_dow                      0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64
Tip

This is especially important for days_since_prior_order, because in this dataset null values may be meaningful rather than erroneous.

Analytical questions for orders

Even without merging, the orders table already allows us to answer several useful analytical questions.

  • How many orders are there in total?
  • How many unique users are represented?
  • What is the distribution of eval_set?
  • On which days do customers place orders most frequently?
  • At what hours do customers place orders most frequently?
  • Are missing values in days_since_prior_order meaningful?
  • Is the first order of a user treated differently from later orders?

Q1: How many unique orders exist?

df_orders["order_id"].nunique()
3421083

Q2: How many distinct customers are represented in the dataset?

df_orders["user_id"].nunique()
206209

Q3: Exploring the eval_set distribution

The eval_set variable is very important because it determines which subset of orders we will use later in the analysis.

This shows how the orders are split across the prior, train, and test subsets:

df_orders["eval_set"].value_counts(dropna=False)
eval_set
prior    3214874
train     131209
test       75000
Name: count, dtype: int64

Later, we will filter this table to the train subset before merging it with order_products_train.

Q4: Orders by day of week

This tells us how orders are distributed across the week. We can now begin exploring ordering behavior.

df_orders["order_dow"].value_counts().sort_index()
order_dow
0    600905
1    587478
2    467260
3    436972
4    426339
5    453368
6    448761
Name: count, dtype: int64

A useful analytical question here is: Are some days clearly more active than others?

Q5: Orders by hour of day

Next, we examine the hourly distribution of orders.

  • At what time of day do customers shop most often?
  • Is ordering behavior concentrated in a small time window?
df_orders["order_hour_of_day"].value_counts().sort_index()
order_hour_of_day
0      22758
1      12398
2       7539
3       5474
4       5527
5       9569
6      30529
7      91868
8     178201
9     257812
10    288418
11    284728
12    272841
13    277999
14    283042
15    283639
16    272553
17    228795
18    182912
19    140569
20    104292
21     78109
22     61468
23     40043
Name: count, dtype: int64

Q6: Understanding missing values in days_since_prior_order

This is one of the most interesting columns in the table because its missing values may actually have a logical explanation.

df_orders["days_since_prior_order"].isna().sum()
np.int64(206209)

At first glance, missing values may seem like a data quality issue. However, here we should ask a better question:

Could these null values correspond to the first order of each user?

Logical check: first orders and null values

We can test this directly.

df_orders.loc[df_orders["order_number"] == 1, "days_since_prior_order"].isna().all()
np.True_

If the result is True, then the null values are not random. They are structurally meaningful.

Caution

This is an important analytical lesson:

  • not every null is a problem
  • some null values reflect the business logic of the data
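The check above confirms that every first order has a null; the converse (every null belongs to a first order) can be tested the same way. A minimal sketch on a toy frame mimicking the orders schema:

```python
import pandas as pd

# Toy orders: first orders (order_number == 1) have no prior order,
# so days_since_prior_order is null for them.
orders = pd.DataFrame({
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [None, 15.0, 21.0, None, 7.0],
})

# Converse check: every row with a null days_since_prior_order
# should be a first order.
nulls_are_first_orders = orders.loc[
    orders["days_since_prior_order"].isna(), "order_number"
].eq(1).all()
print(nulls_are_first_orders)
```

Running both directions of the check establishes a one-to-one correspondence between nulls and first orders, which is exactly the business-logic explanation we proposed.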

Consistency checks for the orders table

Before using a table in deeper analysis, it is good practice to perform a few basic consistency checks.

Is order_id unique?

Since each row should represent one order, order_id should uniquely identify rows.

df_orders["order_id"].is_unique
True

Is order_dow in the correct range?

Day of week should logically be between 0 and 6.

df_orders["order_dow"].between(0, 6).all()
np.True_

Is order_hour_of_day in the correct range?

Hour of day should logically be between 0 and 23.

df_orders["order_hour_of_day"].between(0, 23).all()
np.True_
Warning

These checks are simple, but they are important. They help us confirm that the table behaves as expected before we move on.

Filtering the orders table to train orders

Since the next transactional table we will use is order_products_train.csv, we should prepare the matching subset of orders in advance.

df_orders_train = df_orders[df_orders["eval_set"] == "train"]
df_orders_train.head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
10 1187899 1 train 11 4 8 14.00
25 1492625 2 train 15 1 11 30.00
49 2196797 5 train 5 0 11 6.00
74 525192 7 train 21 2 11 6.00
78 880375 8 train 4 1 14 10.00

Why do we filter first?

This is an important analytical step.

  • order_products_train.csv contains only products from train orders
  • filtering first keeps the analysis focused and consistent
  • this is similar to applying a WHERE condition before a JOIN in SQL
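The SQL parallel can be made explicit. A sketch on a toy frame, with the equivalent SQL shown in comments (names here are illustrative, not the notebook's dataframes):

```python
import pandas as pd

# SQL equivalent of the filter-first pattern:
#   SELECT * FROM orders WHERE eval_set = 'train';
# Only afterwards would we JOIN against the basket table.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "eval_set": ["prior", "train", "test"],
})

# Boolean filtering plays the role of the WHERE clause.
train_orders = orders[orders["eval_set"] == "train"]
print(train_orders)
```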

Keeping only the columns we need

To make the next steps cleaner, we can keep only the most relevant columns from the filtered table.

df_orders_train = df_orders_train.drop(columns=["eval_set"])
df_orders_train.head()
order_id user_id order_number order_dow order_hour_of_day days_since_prior_order
10 1187899 1 11 4 8 14.00
25 1492625 2 15 1 11 30.00
49 2196797 5 5 0 11 6.00
74 525192 7 21 2 11 6.00
78 880375 8 4 1 14 10.00
Important

df_orders takes a significant amount of memory, so once we finish working with the dataframe we can delete it:

del df_orders
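Before deleting, the footprint can be measured with memory_usage; a sketch on a toy frame (the numbers for the real df_orders will differ):

```python
import pandas as pd

# Toy frame standing in for a large dataframe (illustration only).
df = pd.DataFrame({"order_id": range(1000), "eval_set": ["prior"] * 1000})

# deep=True accounts for the actual string payloads, not just pointers.
mb = df.memory_usage(deep=True).sum() / 1e6
print(f"{mb:.2f} MB")

del df  # drop the reference once the dataframe is no longer needed
```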

Summary of the orders table

At this stage, we understand the orders table both structurally and analytically.

  • It is a behavioral table with one row per order
  • It helps us study timing and frequency of customer orders
  • It introduces the important eval_set variable
  • It provides a good example of meaningful null values
  • It allows us to prepare the correct subset before merging with the train basket table

Exploring the order_products_train Table

We now move to the order_products_train table. Unlike orders, this is not an order-level table. Instead, it is a transaction bridge table that connects orders to products.

Each row in this table represents one product inside one order.

Previewing the data

df_orders_products_train.head()
order_id product_id add_to_cart_order reordered
0 1 49302 1 1
1 1 11109 2 1
2 1 10246 3 0
3 1 49683 4 0
4 1 43633 5 1

The output helps us answer two important structural questions:

  • What does one row represent in this table?
  • Which columns help us connect orders to products?

Inspecting the schema

A very important step in Pandas is checking the schema of the table. This helps us understand the data types and whether missing values are present.

df_orders_products_train.info()
<class 'pandas.DataFrame'>
RangeIndex: 1384617 entries, 0 to 1384616
Data columns (total 4 columns):
 #   Column             Non-Null Count    Dtype
---  ------             --------------    -----
 0   order_id           1384617 non-null  int64
 1   product_id         1384617 non-null  int64
 2   add_to_cart_order  1384617 non-null  int64
 3   reordered          1384617 non-null  int64
dtypes: int64(4)
memory usage: 42.3 MB

Looking at data types

Each column has a specific role, and its data type should match its meaning.

df_orders_products_train.dtypes
order_id             int64
product_id           int64
add_to_cart_order    int64
reordered            int64
dtype: object

Descriptive statistics

For numeric columns, descriptive statistics provide a first summary of the data.

df_orders_products_train.describe()
order_id product_id add_to_cart_order reordered
count 1,384,617.00 1,384,617.00 1,384,617.00 1,384,617.00
mean 1,706,297.62 25,556.24 8.76 0.60
std 989,732.65 14,121.27 7.42 0.49
min 1.00 1.00 1.00 0.00
25% 843,370.00 13,380.00 3.00 0.00
50% 1,701,880.00 25,298.00 7.00 1.00
75% 2,568,023.00 37,940.00 12.00 1.00
max 3,421,070.00 49,688.00 80.00 1.00

This is useful for checking the ranges of variables such as:

  • order_id
  • product_id
  • add_to_cart_order
  • reordered

Missing values

Missing values should always be checked before moving to deeper analysis.

df_orders_products_train.isna().sum()
order_id             0
product_id           0
add_to_cart_order    0
reordered            0
dtype: int64
Tip

This is an especially important table because it will later be used for merging. If key columns such as order_id or product_id have missing values, the merge process may produce incomplete results.

Analytical questions for order_products_train

Even without merging, the order_products_train table already allows us to answer several useful analytical questions.

  • How many line items are there in total?
  • How many unique orders are represented?
  • How many unique products are represented?
  • What is the distribution of reordered?
  • What does add_to_cart_order tell us about basket construction?
  • Are there any suspicious or inconsistent values?

Q1: How many line items exist?

len(df_orders_products_train)
1384617

Q2: How many unique orders are represented in this table?

df_orders_products_train["order_id"].nunique()
131209

Q3: How many unique products are represented in this table?

df_orders_products_train["product_id"].nunique()
39123

Q4: Exploring reorder behavior

The reordered variable is analytically important because it helps us understand repeated purchase behavior.

This shows how the line items are split between reordered and first-time products:

df_orders_products_train["reordered"].value_counts(dropna=False)
reordered
1    828824
0    555793
Name: count, dtype: int64

Later, after merging with products, departments_t, and aisles, we will be able to see which products and categories have the highest reorder tendency.

Q5: Reorder share

Instead of only counts, we can also look at proportions.

df_orders_products_train["reordered"].value_counts(normalize=True)
reordered
1   0.60
0   0.40
Name: proportion, dtype: float64

A useful analytical question here is: What share of line items in train orders corresponds to reordered products?
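Because reordered is a 0/1 flag, the same share can also be computed as a mean; a sketch on a toy column:

```python
import pandas as pd

# Toy reordered flags: 3 reorders out of 5 line items.
reordered = pd.Series([1, 0, 1, 1, 0], name="reordered")

# The mean of a 0/1 flag equals the share of 1s,
# matching value_counts(normalize=True).
share = reordered.mean()
print(share)  # 0.6
```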

Q6: Understanding add_to_cart_order

This variable tells us the sequence in which a product was added to the cart.

df_orders_products_train["add_to_cart_order"].describe()
count   1,384,617.00
mean            8.76
std             7.42
min             1.00
25%             3.00
50%             7.00
75%            12.00
max            80.00
Name: add_to_cart_order, dtype: float64

This allows us to ask:

  • Are most products added early in the basket-building process?
  • How large do baskets tend to become?
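Basket size per order can be derived by grouping on order_id; a minimal sketch on a toy bridge table (the notebook would group df_orders_products_train the same way):

```python
import pandas as pd

# Toy bridge table: order 1 has 3 items, order 2 has 1 item.
bridge = pd.DataFrame({
    "order_id": [1, 1, 1, 2],
    "product_id": [49302, 11109, 10246, 43633],
})

# One value per order, counting line items: the basket size.
basket_size = bridge.groupby("order_id")["product_id"].count()
print(basket_size)
```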

Q7: Which cart positions appear most often?

df_orders_products_train["add_to_cart_order"].value_counts().sort_index().head(20)
add_to_cart_order
1     131209
2     124364
3     116996
4     108963
5     100745
6      91850
7      83142
8      74601
9      66618
10     59401
11     52848
12     46814
13     41431
14     36588
15     32194
16     28363
17     24841
18     21733
19     19014
20     16541
Name: count, dtype: int64

Later, after merging, we may ask whether some products or departments tend to be added earlier than others.

Understanding repeated keys in the table

This table behaves differently from orders.

df_orders_products_train["order_id"].is_unique
False
df_orders_products_train["product_id"].is_unique
False

Both results are False, and that is expected here.

Caution

This is an important analytical lesson:

  • order_id is expected to repeat because one order can contain many products
  • product_id is expected to repeat because the same product can appear in many different orders
  • repeated keys are not always an error; sometimes they reflect the grain of the table
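While each key repeats on its own, the grain of a line-item table is the pair of keys: the same product should not appear twice within the same order. A minimal sketch of that check, on a toy frame standing in for df_orders_products_train:

```python
import pandas as pd

# Toy stand-in: order_id and product_id each repeat, but the pair is unique
df_line_items = pd.DataFrame({
    "order_id":   [10, 10, 20, 20],
    "product_id": [196, 407, 196, 407],
})

# The individual key columns repeat -- expected for line-item data
print(df_line_items["order_id"].is_unique)    # False
print(df_line_items["product_id"].is_unique)  # False

# The grain is the (order_id, product_id) pair: duplicated() on the
# pair of columns should find nothing
pair_is_unique = not df_line_items.duplicated(subset=["order_id", "product_id"]).any()
print(pair_is_unique)  # True
```

If this check returned False on the real table, it would mean the same product was recorded twice within one order, which would be worth investigating.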

Consistency checks for the order_products_train table

Before using a table in deeper analysis, it is good practice to perform a few basic consistency checks.

Is reordered limited to valid values?

The reordered column should logically contain only 0 and 1.

df_orders_products_train["reordered"].isin([0, 1]).all()
np.True_

Is add_to_cart_order positive?

Since this variable captures position in the basket, it should be greater than 0.

(df_orders_products_train["add_to_cart_order"] > 0).all()
np.True_

Are there missing values in the key columns?

df_orders_products_train[["order_id", "product_id"]].isna().sum()
order_id      0
product_id    0
dtype: int64
Warning

These checks are simple, but they are important. They help us confirm that the table behaves as expected before we move on.
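One way to make such checks fail loudly instead of requiring a visual scan of the output is to wrap them in assert statements. A minimal sketch, on a toy frame with the same columns as df_orders_products_train:

```python
import pandas as pd

# Toy stand-in with the same columns as df_orders_products_train
df = pd.DataFrame({
    "order_id":          [10, 10, 20],
    "product_id":        [196, 407, 196],
    "add_to_cart_order": [1, 2, 1],
    "reordered":         [1, 0, 1],
})

# assert raises immediately if a check fails, which is handy in notebooks
assert df["reordered"].isin([0, 1]).all(), "reordered has values other than 0/1"
assert (df["add_to_cart_order"] > 0).all(), "non-positive cart positions found"
assert df[["order_id", "product_id"]].notna().all().all(), "missing key values"
print("all consistency checks passed")
```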

Why this table matters for merging

This table is the bridge between orders and products.

  • order_id will connect this table to orders_train
  • product_id will connect this table to products
  • after merging, we will be able to connect customer order behavior with product attributes

Without this table, we would know when orders happened and what products exist, but not which products were purchased in which orders.

Summary of the order_products_train table

At this stage, we understand the order_products_train table both structurally and analytically.

  • It is a transaction bridge table with one row per product within an order
  • It connects orders to products
  • It helps us study basket composition and reorder behavior
  • It introduces repeated keys as a natural property of line-item data
  • It prepares the foundation for the merge process

Exploring the products Table

We now move to the products table. Unlike orders and order_products_train, this is a master data table.

Each row in this table represents one product.

Previewing the data

df_products.head()
product_id product_name aisle_id department_id prices
0 1 Chocolate Sandwich Cookies 61 19 5.80
1 2 All-Seasons Salt 104 13 9.30
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.50
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50
4 5 Green Chile Anytime Sauce 5 13 4.30

The output helps us answer two important structural questions:

  • What attributes are available for each product?
  • Which columns will later help us enrich the transaction data?

Inspecting the schema

A very important step in Pandas is checking the schema of the table. This helps us understand the data types and whether missing values are present.

df_products.info()
<class 'pandas.DataFrame'>
RangeIndex: 49693 entries, 0 to 49692
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product_id     49693 non-null  int64  
 1   product_name   49677 non-null  str    
 2   aisle_id       49693 non-null  int64  
 3   department_id  49693 non-null  int64  
 4   prices         49693 non-null  float64
dtypes: float64(1), int64(3), str(1)
memory usage: 3.4 MB

Looking at data types

Each column has a specific role, and its data type should match its meaning.

df_products.dtypes
product_id         int64
product_name         str
aisle_id           int64
department_id      int64
prices           float64
dtype: object

Descriptive statistics

For numeric columns, descriptive statistics provide a first summary of the data.

df_products.describe()
product_id aisle_id department_id prices
count 49,693.00 49,693.00 49,693.00 49,693.00
mean 24,844.35 67.77 11.73 9.99
std 14,343.72 38.32 5.85 453.52
min 1.00 1.00 1.00 1.00
25% 12,423.00 35.00 7.00 4.10
50% 24,845.00 69.00 13.00 7.10
75% 37,265.00 100.00 17.00 11.20
max 49,688.00 134.00 21.00 99,999.00

This is useful for checking the ranges of variables such as:

  • product_id
  • aisle_id
  • department_id

Missing values

Missing values should always be checked before moving to deeper analysis.

df_products.isna().sum()
product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64
Tip

This table is especially important because it provides product attributes that will later be attached to the transaction data. Missing values in aisle_id or department_id could weaken the interpretation of merged results.
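When a column has missing values, the next step is usually to look at the affected rows. The sketch below filters to them with a boolean mask built from isna(), using a toy stand-in for df_products with one missing name:

```python
import pandas as pd
import numpy as np

# Toy stand-in for df_products: one product_name is missing
df_products_demo = pd.DataFrame({
    "product_id":    [1, 2, 3],
    "product_name":  ["Chocolate Sandwich Cookies", np.nan, "Oolong Tea"],
    "aisle_id":      [61, 104, 94],
    "department_id": [19, 13, 7],
})

# Keep only the rows where the name is missing so we can inspect them
missing_names = df_products_demo[df_products_demo["product_name"].isna()]
print(missing_names)
```

On the real table, the same mask would surface the 16 products without names.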

Analytical questions for products

Even without merging, the products table already allows us to answer several useful analytical questions.

  • How many products are there in total?
  • Are product IDs unique?
  • How many aisles and departments are represented?
  • Do product names repeat?
  • Is the product catalog evenly distributed across categories?

Q1: How many unique products exist?

df_products["product_id"].nunique()
49686

Q2: Are product IDs unique?

df_products["product_id"].is_unique
False
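Since the table has 49,693 rows but only 49,686 unique IDs, a few product_id values must repeat. duplicated(keep=False) marks every copy of a duplicated key, which lets us inspect them side by side. A minimal sketch on a toy stand-in:

```python
import pandas as pd

# Toy stand-in: product_id 2 appears twice
df_products_demo = pd.DataFrame({
    "product_id":   [1, 2, 2, 3],
    "product_name": ["Cookies", "Salt", "Salt", "Tea"],
})

# keep=False marks every copy of a duplicated id, not just the later ones
dup_ids = df_products_demo[df_products_demo["product_id"].duplicated(keep=False)]
print(dup_ids)
```

Running the same filter on df_products (sorted by product_id) would show whether the duplicated IDs are full duplicate rows or conflicting records.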

Q3: How many unique departments are referenced in the product table?

df_products["department_id"].nunique()
21

Q4: How many unique aisles are referenced in the product table?

df_products["aisle_id"].nunique()
134

Q5: Do some product names repeat?

df_products["product_name"].duplicated().sum()
np.int64(20)

Later, after merging with lookup tables, we will be able to interpret department and aisle information in business language instead of numeric IDs.

Understanding the role of department_id and aisle_id

The products table does not contain department names or aisle names directly. Instead, it contains numeric keys.

df_products[["product_id", "product_name", "aisle_id", "department_id"]].head()
product_id product_name aisle_id department_id
0 1 Chocolate Sandwich Cookies 61 19
1 2 All-Seasons Salt 104 13
2 3 Robust Golden Unsweetened Oolong Tea 94 7
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1
4 5 Green Chile Anytime Sauce 5 13

This means:

  • department_id will later connect to departments_t
  • aisle_id will later connect to aisles

Consistency checks for the products table

Before using a table in deeper analysis, it is good practice to perform a few basic consistency checks.

Are product_id values unique?

df_products["product_id"].is_unique
False

Are department_id values missing?

df_products["department_id"].isna().sum()
np.int64(0)

Are aisle_id values missing?

df_products["aisle_id"].isna().sum()
np.int64(0)
Warning

These checks are simple, but they are important. They help us confirm that the table behaves as expected before we move on.

Why this table matters for merging

This table gives business meaning to the product_id values in the transaction table.

  • it introduces product names
  • it connects products to aisles
  • it connects products to departments
  • it will later help us move from product IDs to interpretable categories

Summary of the products table

At this stage, we understand the products table both structurally and analytically.

  • It is a master data table with one row per product
  • It provides product names and categorical keys
  • It prepares the foundation for enrichment through lookup tables
  • It will later help us interpret transaction data more meaningfully

Exploring the departments_t Table

We now move to the departments_t table. This is a lookup table.

Each row in this table represents one department.

Previewing the data

df_departments.head()
department_id department
0 1 frozen
1 2 other
2 3 bakery
3 4 produce
4 5 alcohol

The output helps us answer two important structural questions:

  • How many departments are there?
  • What labels will later be used to interpret department_id values?

Inspecting the schema

A very important step in Pandas is checking the schema of the table. This helps us understand the data types and whether missing values are present.

df_departments.info()
<class 'pandas.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   department_id  21 non-null     int64
 1   department     21 non-null     str  
dtypes: int64(1), str(1)
memory usage: 638.0 bytes

Looking at data types

Each column has a specific role, and its data type should match its meaning.

df_departments.dtypes
department_id    int64
department         str
dtype: object

Missing values

Missing values should always be checked before moving to deeper analysis.

df_departments.isna().sum()
department_id    0
department       0
dtype: int64

Analytical questions for departments_t

Even without merging, the departments_t table already allows us to answer several useful analytical questions.

  • How many departments exist?
  • Are department IDs unique?
  • Are there any missing department labels?

Q1: How many departments exist?

df_departments["department"].nunique()
21

Q2: Are department_id values unique?

df_departments["department_id"].is_unique
True

Q3: Are there missing values in the table?

df_departments.isna().sum()
department_id    0
department       0
dtype: int64

Why this table matters for merging

This is a small table, but it is analytically important.

  • it translates department_id into a readable department name
  • it makes the final merged dataset interpretable
  • without it, we would only see numeric department codes

Summary of the departments_t table

At this stage, we understand the departments_t table both structurally and analytically.

  • It is a lookup table with one row per department
  • It provides readable department names
  • It will later enrich the products table

Exploring the aisles Table

We now move to the aisles table. Like departments_t, this is also a lookup table.

Each row in this table represents one aisle.

Previewing the data

df_aisles.head()
aisle_id aisle
0 1 prepared soups salads
1 2 specialty cheeses
2 3 energy granola bars
3 4 instant foods
4 5 marinades meat preparation

The output helps us answer two important structural questions:

  • How many aisles are there?
  • What labels will later be used to interpret aisle_id values?

Inspecting the schema

A very important step in Pandas is checking the schema of the table. This helps us understand the data types and whether missing values are present.

df_aisles.info()
<class 'pandas.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   aisle_id  134 non-null    int64
 1   aisle     134 non-null    str  
dtypes: int64(1), str(1)
memory usage: 4.2 KB

Looking at data types

Each column has a specific role, and its data type should match its meaning.

df_aisles.dtypes
aisle_id    int64
aisle         str
dtype: object

Missing values

Missing values should always be checked before moving to deeper analysis.

df_aisles.isna().sum()
aisle_id    0
aisle       0
dtype: int64

Analytical questions for aisles

Even without merging, the aisles table already allows us to answer several useful analytical questions.

  • How many aisles exist?
  • Are aisle IDs unique?
  • Are there any missing aisle labels?

Q1: How many aisles exist?

df_aisles["aisle"].nunique()
134

Q2: Are aisle_id values unique?

df_aisles["aisle_id"].is_unique
True

Q3: Are there missing values in the table?

df_aisles.isna().sum()
aisle_id    0
aisle       0
dtype: int64

Why this table matters for merging

This is also a small table, but it is analytically important.

  • it translates aisle_id into a readable aisle name
  • it provides a more detailed grouping level than department
  • it will later enrich the products table and make category-level analysis more informative

Summary of the aisles table

At this stage, we understand the aisles table both structurally and analytically.

  • It is a lookup table with one row per aisle
  • It provides readable aisle names
  • It will later enrich the products table

Understanding pd.merge()

Until now, we explored each table independently. That was an important first step because good analytical work starts with understanding the structure, meaning, and limitations of each table before combining them.

Now we move to the next stage: merging tables.

In relational data, different pieces of information are stored in different tables.

  • orders tells us about order behavior
  • order_products_train tells us which products were included in each order
  • products gives product attributes
  • departments_t and aisles translate category IDs into readable labels

If we want to answer richer analytical questions, we need to connect these tables.

In Pandas, the main tool for this is pd.merge().

Before Merge

At this stage, we have explored all tables individually.

  • orders is a behavioral table with one row per order
  • order_products_train is a transaction bridge table with one row per product within an order
  • products is a master table with one row per product
  • departments_t is a lookup table for department names
  • aisles is a lookup table for aisle names

SQL and Pandas Connection

If you already know SQL, the logic of pd.merge() should feel familiar.

SQL Concept Pandas Equivalent
JOIN pd.merge()
ON on=
LEFT JOIN how="left"
INNER JOIN how="inner"
RIGHT JOIN how="right"
FULL OUTER JOIN how="outer"

The core idea is the same:

  • identify a shared key
  • decide which rows you want to keep
  • combine the matching columns

General Structure of pd.merge()

The general syntax is:

pd.merge(left_df, right_df, on="key_column", how="left")

The most important arguments are:

  • left_df: the DataFrame on the left side of the merge
  • right_df: the DataFrame on the right side of the merge
  • on: the column used as the matching key
  • how: the type of join
  • indicator: adds a special column showing where each row came from

Visual Idea of a Merge

A merge can be imagined as matching rows based on shared IDs.

flowchart LR
    A[Left Table] --> C[Shared Key]
    B[Right Table] --> C
    C --> D[Merged Result]

This looks simple, but analytically it is very important because:

  • if the key is wrong, the result is wrong
  • if keys are duplicated unexpectedly, rows may multiply
  • if matches are missing, null values may appear

That is why merging should always be followed by validation.

Simple Merge

Before merging the large transactional tables, it is better to start with a simpler and safer example.

The cleanest first step is:

  • merge products with departments_t
  • merge that result with aisles

This lets us enrich product information first.

Merge: products with departments_t

The products table contains department_id, but it does not contain the actual department name.

If we want readable categories, we need to merge it with departments_t.

Visual view of the first merge

flowchart LR
    A[products] -->|department_id| C[merge]
    B[departments_t] -->|department_id| C
    C --> D[products + department]

Performing the merge

df_products_departments = pd.merge(
    df_products,
    df_departments,
    on="department_id",
    how="left"
)

df_products_departments.head()
product_id product_name aisle_id department_id prices department
0 1 Chocolate Sandwich Cookies 61 19 5.80 snacks
1 2 All-Seasons Salt 104 13 9.30 pantry
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.50 beverages
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50 frozen
4 5 Green Chile Anytime Sauce 5 13 4.30 pantry

Why do we use how="left" here?

We use a left join because we want to keep all rows from df_products.

That means:

  • every product should remain in the result
  • department information will be added where a match exists
  • if a match is missing, Pandas will keep the product row and place NaN in the department column

This is usually the safest merge for analytical enrichment.

Understanding the indicator Argument

One of the most useful but underused options in pd.merge() is the indicator argument.

When we set indicator=True, Pandas adds a new column called _merge that tells us where each row came from.

df_products_departments_check = pd.merge(
    df_products,
    df_departments,
    on="department_id",
    how="left",
    indicator=True
)

df_products_departments_check.head()
product_id product_name aisle_id department_id prices department _merge
0 1 Chocolate Sandwich Cookies 61 19 5.80 snacks both
1 2 All-Seasons Salt 104 13 9.30 pantry both
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.50 beverages both
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50 frozen both
4 5 Green Chile Anytime Sauce 5 13 4.30 pantry both

What does _merge mean?

The _merge column can contain three values:

_merge value Meaning
left_only row appears only in the left table
right_only row appears only in the right table
both row matched in both tables

This is extremely helpful for checking merge quality.

Visual explanation of indicator=True

flowchart LR
    A[Left table row] --> D[_merge = left_only]
    B[Right table row] --> E[_merge = right_only]
    C[Matched row] --> F[_merge = both]

Looking at the indicator results

df_products_departments_check["_merge"].value_counts()
_merge
both          49693
left_only         0
right_only        0
Name: count, dtype: int64

Why is this useful?

This tells us whether the merge behaved as expected.

For example:

  • if all rows are both, then every product found a matching department
  • if some rows are left_only, then some products have department IDs that do not exist in departments_t
  • right_only can never appear in a left join result, because a left join keeps only rows from the left table (matched or not); to surface departments that no product references, we would need how="outer" instead
Important

The indicator argument is especially useful for data quality checks and referential integrity checks.

Investigating unmatched rows

If the merge indicator shows left_only, we should inspect those rows.

df_products_departments_check[
    df_products_departments_check["_merge"] == "left_only"
].head()
product_id product_name aisle_id department_id prices department _merge

This helps us identify whether some department_id values in products are missing from the lookup table.

Cleaning the indicator result

Once we finish checking the merge, we usually drop the _merge column or rerun the merge without indicator=True.

df_products_departments = pd.merge(
    df_products,
    df_departments,
    on="department_id",
    how="left"
)

df_products_departments.head()
product_id product_name aisle_id department_id prices department
0 1 Chocolate Sandwich Cookies 61 19 5.80 snacks
1 2 All-Seasons Salt 104 13 9.30 pantry
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.50 beverages
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50 frozen
4 5 Green Chile Anytime Sauce 5 13 4.30 pantry
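Alternatively, instead of re-running the merge, the helper column can be removed with drop(columns="_merge"). A minimal sketch on toy frames mirroring products and departments_t:

```python
import pandas as pd

# Toy stand-in frames mirroring products and departments_t
left = pd.DataFrame({"product_id": [1, 2], "department_id": [19, 13]})
right = pd.DataFrame({"department_id": [19, 13],
                      "department": ["snacks", "pantry"]})

checked = pd.merge(left, right, on="department_id", how="left", indicator=True)
print(checked["_merge"].value_counts())

# Once the merge is validated, drop the helper column in place of a re-run
cleaned = checked.drop(columns="_merge")
print(cleaned.columns.tolist())  # ['product_id', 'department_id', 'department']
```

Both approaches end with the same columns; dropping is cheaper on large tables because the merge runs only once.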

Homework P1: Questions unlocked after the first merge

After this merge, we can ask more meaningful questions.

Before the merge, we only had numeric department_id values.

Now we can ask:

  • How many products belong to each department?
  • Which departments contain the most products?
  • Are products concentrated in a small number of departments?

Example: number of products by department

df_products_departments["department"].value_counts()
department
personal care      6565
snacks             6264
pantry             5371
beverages          4365
frozen             4007
dairy eggs         3449
household          3085
canned goods       2092
dry goods pasta    1858
produce            1684
bakery             1516
deli               1322
missing            1258
international      1139
breakfast          1116
babies             1081
alcohol            1056
pets                972
meat seafood        907
other               548
bulk                 38
Name: count, dtype: int64

Adding aisles

Now we enrich the product table further.

The products table also contains aisle_id, but not the aisle name itself. So we merge with aisles.

Visual view of the second merge

flowchart LR
    A[products + department] -->|aisle_id| C[merge]
    B[aisles] -->|aisle_id| C
    C --> D[products enriched]

Performing the second merge

df_products_enriched = pd.merge(
    df_products_departments,
    df_aisles,
    on="aisle_id",
    how="left"
)

df_products_enriched.head()
product_id product_name aisle_id department_id prices department aisle
0 1 Chocolate Sandwich Cookies 61 19 5.80 snacks cookies cakes
1 2 All-Seasons Salt 104 13 9.30 pantry spices seasonings
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.50 beverages tea
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50 frozen frozen meals
4 5 Green Chile Anytime Sauce 5 13 4.30 pantry marinades meat preparation

Checking the second merge with indicator=True

df_products_aisles_check = pd.merge(
    df_products_departments,
    df_aisles,
    on="aisle_id",
    how="left",
    indicator=True
)

df_products_aisles_check["_merge"].value_counts()
_merge
both          49693
left_only         0
right_only        0
Name: count, dtype: int64

Investigating unmatched aisle rows

df_products_aisles_check[
    df_products_aisles_check["_merge"] == "left_only"
].head()
product_id product_name aisle_id department_id prices department aisle _merge

What do we have after these two merges?

At this stage, we have created an enriched product table.

Each product now includes:

  • product_name
  • department
  • aisle

This is much better for analysis than working only with IDs.

Why validate row counts after a merge?

One very important habit is checking whether the number of rows changed unexpectedly.

If departments_t has one row per department, and each product has only one department, then merging products with departments should usually keep the same number of rows.

Checking row counts

print(df_products.shape)
print(df_products_departments.shape)
print(df_products_enriched.shape)
(49693, 5)
(49693, 6)
(49693, 7)

If the number of rows increases unexpectedly, that may indicate duplicated keys in the lookup table.

Checking uniqueness in lookup tables before merging

This is why uniqueness checks matter.

df_departments["department_id"].is_unique
True
df_aisles["aisle_id"].is_unique
True

If these are False, a merge can create duplicated rows in the result.
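pd.merge() can also enforce this expectation directly through its validate argument: passing validate="many_to_one" makes pandas raise a MergeError if the right-hand (lookup) keys are not unique, instead of silently multiplying rows. A minimal sketch on toy frames:

```python
import pandas as pd

products = pd.DataFrame({"product_id": [1, 2], "department_id": [19, 19]})
departments_ok = pd.DataFrame({"department_id": [19],
                               "department": ["snacks"]})
departments_dup = pd.DataFrame({"department_id": [19, 19],
                                "department": ["snacks", "snacks (dup)"]})

# validate="many_to_one" asserts that the right-hand keys are unique
merged = pd.merge(products, departments_ok, on="department_id",
                  how="left", validate="many_to_one")
print(merged.shape)  # (2, 3): row count preserved

# With a duplicated lookup key, pandas raises instead of duplicating rows
try:
    pd.merge(products, departments_dup, on="department_id",
             how="left", validate="many_to_one")
except pd.errors.MergeError as err:
    print("merge rejected:", err)
```

This turns the uniqueness assumption into an automatic guardrail rather than a manual check.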

Homework P2: Questions unlocked after enriching products

After merging products, departments_t, and aisles, we can answer questions such as:

  • How many products belong to each department?
  • How many products belong to each aisle?
  • Which aisles are most populated?
  • Which departments span the widest variety of products?
  • How are departments and aisles related?

Example: products by aisle

df_products_enriched["aisle"].value_counts().head(10)
aisle
missing                 1258
candy chocolate         1246
ice cream ice           1091
vitamins supplements    1038
yogurt                  1026
chips pretzels           989
tea                      894
packaged cheese          891
frozen meals             880
cookies cakes            874
Name: count, dtype: int64

Example: cross-tab of departments and aisles

pd.crosstab(
    df_products_enriched["department"],
    df_products_enriched["aisle"]
)
aisle air fresheners candles asian foods baby accessories baby bath body care baby food formula bakery desserts baking ingredients baking supplies decor beauty beers coolers ... spreads tea tofu meat alternatives tortillas flat bread trail mix snack mix trash bags liners vitamins supplements water seltzer sparkling water white wines yogurt
department
alcohol 0 0 0 0 0 0 0 0 0 387 ... 0 0 0 0 0 0 0 0 147 0
babies 0 0 44 132 718 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
bakery 0 0 0 0 0 297 0 0 0 0 ... 0 0 0 241 0 0 0 0 0 0
beverages 0 0 0 0 0 0 0 0 0 0 ... 0 894 0 0 0 0 0 344 0 0
breakfast 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
bulk 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
canned goods 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
dairy eggs 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1026
deli 0 0 0 0 0 0 0 0 0 0 ... 0 0 159 0 0 0 0 0 0 0
dry goods pasta 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
frozen 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
household 355 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 112 0 0 0 0
international 0 605 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
meat seafood 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
missing 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
other 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
pantry 0 0 0 0 0 0 623 290 0 0 ... 493 0 0 0 0 0 0 0 0 0
personal care 0 0 0 0 0 0 0 0 178 0 ... 0 0 0 0 0 0 1038 0 0 0
pets 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
produce 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
snacks 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 69 0 0 0 0 0

21 rows × 134 columns
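For the question of which departments span the widest variety, a full crosstab is more detail than needed; counting distinct aisles per department gives a compact answer. A minimal sketch on a toy stand-in for df_products_enriched:

```python
import pandas as pd

# Toy stand-in for df_products_enriched
df_enriched_demo = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "department": ["snacks", "snacks", "pantry", "snacks"],
    "aisle":      ["cookies cakes", "candy chocolate",
                   "spices seasonings", "cookies cakes"],
})

# Distinct aisles per department, widest variety first
aisles_per_department = (
    df_enriched_demo.groupby("department")["aisle"]
    .nunique()
    .sort_values(ascending=False)
)
print(aisles_per_department)
```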

Merge types in brief

Although we mainly use left joins in this session, it is good to understand the common merge types.

Inner join

Keeps only rows that match in both tables.

pd.merge(
    df_products,
    df_departments,
    on="department_id",
    how="inner"
).head()
product_id product_name aisle_id department_id prices department
0 1 Chocolate Sandwich Cookies 61 19 5.80 snacks
1 2 All-Seasons Salt 104 13 9.30 pantry
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.50 beverages
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50 frozen
4 5 Green Chile Anytime Sauce 5 13 4.30 pantry

Left join

Keeps all rows from the left table.

pd.merge(
    df_products,
    df_departments,
    on="department_id",
    how="left"
).head()
product_id product_name aisle_id department_id prices department
0 1 Chocolate Sandwich Cookies 61 19 5.80 snacks
1 2 All-Seasons Salt 104 13 9.30 pantry
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.50 beverages
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50 frozen
4 5 Green Chile Anytime Sauce 5 13 4.30 pantry

Outer join

Keeps all rows from both tables.

pd.merge(
    df_products,
    df_departments,
    on="department_id",
    how="outer",
    indicator=True
).head()
product_id product_name aisle_id department_id prices department _merge
0 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50 frozen both
1 8 Cut Russet Potatoes Steam N' Mash 116 1 1.10 frozen both
2 12 Chocolate Fudge Layer Cake 119 1 9.40 frozen both
3 18 Pizza for One Suprema  Frozen Pizza 79 1 10.60 frozen both
4 30 Three Cheese Ziti, Marinara with Meatballs 38 1 13.80 frozen both

Why indicator is especially useful with outer joins

The indicator column becomes even more informative in an outer join because all three categories are easier to observe:

  • rows found only in the left table
  • rows found only in the right table
  • rows found in both tables

This makes outer joins useful for diagnostics, even if they are not always the final analytical choice.

Visual summary of the product enrichment process

flowchart LR
    A[products] -->|department_id| B[products + departments]
    B -->|aisle_id| C[products enriched]
    D[departments_t] --> B
    E[aisles] --> C

pd.merge vs SQL join

If you already know SQL, you may notice an important difference.

In SQL, we can join many tables inside a single query. In Pandas, however, pd.merge() works with only two DataFrames at a time.

This means that when we need to combine several tables, we must do it step by step:

  • merge two DataFrames first
  • store the result in a new DataFrame
  • merge that result with the next DataFrame
  • continue until the final analytical table is built

This is not only a limitation: working step by step makes the merge process easier to understand, validate, and debug.
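The step-by-step pattern can also be written as a method chain, since DataFrame.merge() is equivalent to pd.merge() with the calling frame on the left. A minimal sketch on toy frames mirroring the three tables:

```python
import pandas as pd

products = pd.DataFrame({"product_id": [1, 2], "aisle_id": [61, 104],
                         "department_id": [19, 13]})
departments = pd.DataFrame({"department_id": [19, 13],
                            "department": ["snacks", "pantry"]})
aisles = pd.DataFrame({"aisle_id": [61, 104],
                       "aisle": ["cookies cakes", "spices seasonings"]})

# Each .merge() call still combines exactly two frames;
# chaining applies them one after another
enriched = (
    products
    .merge(departments, on="department_id", how="left")
    .merge(aisles, on="aisle_id", how="left")
)
print(enriched)
```

The chained form produces the same result as the two explicit pd.merge() calls; the explicit form is easier to validate step by step, while the chain is more compact once the merges are trusted.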

Continuing the Merge Process

At this point, we already created an enriched product table by combining:

  • products
  • departments_t
  • aisles

Now we move to the behavioral side of the data.

The next goal is to connect:

orders_train → order_products_train

This is the step that links order behavior to basket contents.

Why do we need this merge?

  • df_orders_train tells us when an order happened and which user placed it
  • df_orders_products_train tells us which products were included in the order

Separately, each table is useful. But once we merge them, we can start answering richer analytical questions such as:

  • How many items are there in each order?
  • Do some days tend to have larger baskets?
  • At what hour do larger baskets appear?
  • Does reorder behavior vary by order timing?

Visual view of the merge

flowchart LR
    A[orders_train] -->|order_id| C[merge]
    B[order_products_train] -->|order_id| C
    C --> D[train_transactions]

Performing the merge

We merge the filtered order table with the transaction bridge table using order_id.

df_train_transactions = pd.merge(
    df_orders_train,
    df_orders_products_train,
    on="order_id",
    how="left"
)

df_train_transactions.head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered
0 1187899 1 train 11 4 8 14.00 196 1 1
1 1187899 1 train 11 4 8 14.00 25133 2 1
2 1187899 1 train 11 4 8 14.00 38928 3 1
3 1187899 1 train 11 4 8 14.00 26405 4 1
4 1187899 1 train 11 4 8 14.00 39657 5 1

Why do we use how="left" here?

We use a left join because we want to keep all rows from df_orders_train.

That means:

  • every train order should remain in the result
  • transaction rows will be added where a match exists
  • if an order has no matching line items, the product-related columns will become NaN

This is analytically useful because it helps us preserve the order-level structure and inspect missing matches if they exist.

Checking the merge with indicator=True

As before, it is a good idea to validate the merge using the indicator argument.

df_train_transactions_check = pd.merge(
    df_orders_train,
    df_orders_products_train,
    on="order_id",
    how="left",
    indicator=True
)

df_train_transactions_check.head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered _merge
0 1187899 1 train 11 4 8 14.00 196 1 1 both
1 1187899 1 train 11 4 8 14.00 25133 2 1 both
2 1187899 1 train 11 4 8 14.00 38928 3 1 both
3 1187899 1 train 11 4 8 14.00 26405 4 1 both
4 1187899 1 train 11 4 8 14.00 39657 5 1 both

Inspecting the _merge column

df_train_transactions_check["_merge"].value_counts()
_merge
both          1384617
left_only           0
right_only          0
Name: count, dtype: int64

How do we interpret the result?

The _merge column tells us whether each row came from:

  • only the left table
  • only the right table
  • both tables

In this case:

  • both means the order from orders_train found matching product rows in order_products_train
  • left_only means the order exists in orders_train but did not find matching rows in order_products_train

For a left join, this is a very important diagnostic check.

Looking at unmatched orders

If any rows are marked as left_only, we should inspect them.

df_train_transactions_check[
    df_train_transactions_check["_merge"] == "left_only"
].head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered _merge

This allows us to investigate whether some train orders are missing from the transaction table.

Why can the number of rows increase after this merge?

This is a very important point.

When we merged products with departments_t, the number of rows was expected to stay the same.

But here the logic is different.

Each order can contain many products, so when we merge orders_train with order_products_train, the number of rows can increase.

This is not an error. It is expected because:

  • orders_train has one row per order
  • order_products_train has one row per product within an order

So after the merge, the resulting table has one row per order-product combination.
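This expansion can be reproduced on a tiny toy example (the frames below are invented for illustration, mimicking the structure of `orders_train` and `order_products_train`):

```python
import pandas as pd

# one row per order (like orders_train)
orders = pd.DataFrame({"order_id": [1, 2], "order_dow": [4, 6]})

# one row per product within an order (like order_products_train)
order_products = pd.DataFrame({
    "order_id": [1, 1, 1, 2],
    "product_id": [196, 25133, 38928, 196],
})

merged = pd.merge(orders, order_products, on="order_id", how="left")

# 2 orders expand into 4 order-product rows
print(len(orders), len(order_products), len(merged))  # 2 4 4
```

Note that the merged row count equals the right table's row count here, exactly as in the real data above: every line item keeps its order's attributes attached.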

Visual explanation of row expansion

flowchart TD
    A[One row in orders_train] --> B[Product 1]
    A --> C[Product 2]
    A --> D[Product 3]
    B --> E[Three rows in merged table]
    C --> E
    D --> E

This is one of the most important conceptual differences between different merge scenarios.

Checking the row counts

print(df_orders_train.shape)
print(df_orders_products_train.shape)
print(df_train_transactions.shape)
(131209, 7)
(1384617, 4)
(1384617, 10)

This helps us confirm that the merged table has the expected order-product level of detail.
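pandas can also enforce this expectation for us: `merge` accepts a `validate` argument, and `validate="one_to_many"` raises a `MergeError` if the key is duplicated on the left side. A minimal sketch on toy frames:

```python
import pandas as pd

left = pd.DataFrame({"order_id": [1, 2], "user_id": [10, 20]})
right = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [5, 6, 7]})

# passes: order_id is unique on the left and may repeat on the right
ok = pd.merge(left, right, on="order_id", how="left", validate="one_to_many")

# fails: the right side is not unique, so "one_to_one" raises MergeError
try:
    pd.merge(left, right, on="order_id", how="left", validate="one_to_one")
except pd.errors.MergeError:
    print("one_to_one validation failed, as expected")
```

Declaring the expected cardinality up front turns a surprising row explosion into an immediate, explicit error.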

What columns do we now have?

df_train_transactions.columns
Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order', 'product_id',
       'add_to_cart_order', 'reordered'],
      dtype='str')

After this merge, we have both:

  • order-level variables such as user_id, order_dow, and order_hour_of_day
  • transaction-level variables such as product_id, add_to_cart_order, and reordered

This is already a much more useful analytical table.

Homework P3: Questions unlocked after this merge

After merging orders_train with order_products_train, we can now answer questions such as:

  • How many products are included in each order?
  • What is the typical basket size?
  • Are larger baskets more common on specific days?
  • Are reordered items more common at certain times of day?
  • Does basket composition change across a customer's order sequence?

Example: basket size by order

If each row now represents one product within an order, then counting rows per order_id gives us basket size.

df_train_transactions.groupby("order_id").size().head()
order_id
1      8
36     8
38     9
96     7
98    49
dtype: int64
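The same per-order counts can also be obtained with `value_counts`, which sorts by frequency rather than by `order_id`. A toy sketch showing the equivalence:

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 1, 1, 2, 2]})

# groupby-size and value_counts produce the same per-order counts
by_group = df.groupby("order_id").size()
by_counts = df["order_id"].value_counts().sort_index()

print(by_group.tolist())   # [3, 2]
print(by_counts.tolist())  # [3, 2]
```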

Example: descriptive summary of basket size

df_train_transactions.groupby("order_id").size().describe()
count   131,209.00
mean         10.55
std           7.93
min           1.00
25%           5.00
50%           9.00
75%          14.00
max          80.00
dtype: float64

This helps us understand the distribution of how many items customers place in their baskets.

Example: average basket size by day of week

df_train_transactions.groupby("order_dow").size() / df_train_transactions["order_id"].nunique()
order_dow
0   2.47
1   1.57
2   1.22
3   1.18
4   1.18
5   1.35
6   1.58
dtype: float64

The expression above is only a quick exploratory step. A more careful calculation groups at the order level first and then averages by day.

basket_size_by_order = (
    df_train_transactions
    .groupby(["order_id", "order_dow"])
    .size()
    .reset_index(name="basket_size")
)

basket_size_by_order.groupby("order_dow")["basket_size"].mean()
order_dow
0   11.80
1   10.47
2    9.96
3    9.84
4    9.74
5   10.16
6   10.97
Name: basket_size, dtype: float64

Example: reorder behavior by hour of day

df_train_transactions.groupby("order_hour_of_day")["reordered"].mean()
order_hour_of_day
0    0.57
1    0.58
2    0.58
3    0.58
4    0.60
5    0.62
6    0.65
7    0.65
8    0.64
9    0.62
10   0.61
11   0.59
12   0.59
13   0.59
14   0.59
15   0.58
16   0.59
17   0.59
18   0.58
19   0.59
20   0.60
21   0.61
22   0.60
23   0.59
Name: reordered, dtype: float64

This allows us to explore whether reordered items are more common at certain times.

Are there missing product_id values after the merge?

df_train_transactions["product_id"].isna().sum()
np.int64(0)

Are there missing user_id values after the merge?

df_train_transactions["user_id"].isna().sum()
np.int64(0)

Does every row now represent an order-product combination?

A quick way to inspect this is to look at repeated order_id values.

df_train_transactions["order_id"].value_counts().head()
order_id
2813632    80
1395075    80
949182     77
2869702    76
341238     76
Name: count, dtype: int64

This confirms that orders now appear multiple times, which is exactly what we expect in a line-item-level merged table.
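A related sanity check uses `duplicated`: `order_id` should repeat across line items, while each `(order_id, product_id)` pair should appear at most once. Sketched on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [5, 6, 7]})

# order_id repeats across line items...
assert df["order_id"].duplicated().any()
# ...but each (order_id, product_id) pair appears only once
assert not df.duplicated(subset=["order_id", "product_id"]).any()
print("line-item structure confirmed")
```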

Warning

After a merge, always check both the row count and the missing values in newly attached columns.

A merge may run without errors and still produce incomplete or misleading analytical results.
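Both checks can be bundled into a small reusable helper. This is only a sketch: `check_merge` is our own name, not a pandas function, and the frames below are toy data.

```python
import pandas as pd

def check_merge(left, right, on, how="left"):
    """Merge two frames and report row counts plus unmatched left rows."""
    merged = pd.merge(left, right, on=on, how=how, indicator=True)
    unmatched = (merged["_merge"] == "left_only").sum()
    print(f"left={len(left)} right={len(right)} "
          f"merged={len(merged)} unmatched={unmatched}")
    return merged.drop(columns="_merge")

# toy example: order 3 has no matching product rows
orders = pd.DataFrame({"order_id": [1, 2, 3]})
items = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [5, 6, 7]})
result = check_merge(orders, items, on="order_id")
```

Wrapping the checks this way means every merge in the pipeline gets the same diagnostics for free.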

From transactional table to analytical table

At this stage, we have created a transactional table that combines:

  • order behavior
  • basket contents

However, product_id is still only a numeric identifier. To make the table fully interpretable, we need one final merge.

We now attach the enriched product table.

The last step is to merge df_train_transactions with df_products_enriched.

flowchart LR
    A[train_transactions] -->|product_id| C[merge]
    B[products_enriched] -->|product_id| C
    C --> D[train_analytic]

df_train_analytic = pd.merge(
    df_train_transactions,
    df_products_enriched,
    on="product_id",
    how="left"
)

df_train_analytic.head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered product_name aisle_id department_id prices department aisle
0 1187899 1 train 11 4 8 14.00 196 1 1 Soda 77.00 7.00 9.00 beverages soft drinks
1 1187899 1 train 11 4 8 14.00 25133 2 1 Organic String Cheese 21.00 16.00 8.60 dairy eggs packaged cheese
2 1187899 1 train 11 4 8 14.00 38928 3 1 0% Greek Strained Yogurt 120.00 16.00 12.60 dairy eggs yogurt
3 1187899 1 train 11 4 8 14.00 26405 4 1 XL Pick-A-Size Paper Towel Rolls 54.00 17.00 1.00 household paper goods
4 1187899 1 train 11 4 8 14.00 39657 5 1 Milk Chocolate Almonds 45.00 19.00 6.80 snacks candy chocolate

Checking the final merge with indicator=True

df_train_analytic_check = pd.merge(
    df_train_transactions,
    df_products_enriched,
    on="product_id",
    how="left",
    indicator=True
)

df_train_analytic_check["_merge"].value_counts()
_merge
both          1384618
left_only          88
right_only          0
Name: count, dtype: int64

Investigating unmatched product rows

df_train_analytic_check[
    df_train_analytic_check["_merge"] == "left_only"
].head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered product_name aisle_id department_id prices department aisle _merge
10780 3009052 1629 train 37 0 13 3.00 6799 24 1 NaN NaN NaN NaN NaN NaN left_only
52822 195363 7955 train 12 6 14 8.00 6799 4 1 NaN NaN NaN NaN NaN NaN left_only
59749 831974 9000 train 11 3 13 20.00 6799 19 0 NaN NaN NaN NaN NaN NaN left_only
64159 206208 9640 train 16 6 11 12.00 6799 6 1 NaN NaN NaN NaN NaN NaN left_only
81585 472316 12168 train 10 4 12 0.00 6799 12 1 NaN NaN NaN NaN NaN NaN left_only

If such rows exist, then some product_id values in the transaction table do not have a match in the enriched product table.
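To see which identifiers are affected, we can list the distinct unmatched `product_id` values; in the sample rows shown above they all share `product_id` 6799. A toy sketch of the same selection:

```python
import pandas as pd

# toy stand-in for df_train_analytic_check
check = pd.DataFrame({
    "product_id": [196, 6799, 6799, 25133],
    "_merge": ["both", "left_only", "left_only", "both"],
})

# distinct product ids that found no match in the enriched product table
missing_ids = check.loc[check["_merge"] == "left_only", "product_id"].unique()
print(missing_ids.tolist())  # [6799]
```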

What do we gain from the final merge?

At this stage, the table becomes fully analysis-ready.

Each row now includes:

  • order-level behavior
  • transaction-level basket information
  • product name
  • department name
  • aisle name

This means we can finally move from numeric IDs to interpretable business categories.

Homework P4: Questions unlocked after the final merge

After building the full analytical table, we can answer much richer questions such as:

  • Which products are purchased most often?
  • Which departments dominate train orders?
  • Which aisles dominate train orders?
  • Which departments have the highest reorder rates?
  • Which aisles appear most often in large baskets?
  • At what times are certain departments purchased more frequently?

Example: most frequently purchased products

df_train_analytic["product_name"].value_counts().head(10)
product_name
Banana                    18726
Bag of Organic Bananas    15480
Organic Strawberries      10894
Organic Baby Spinach       9784
Large Lemon                8135
Organic Avocado            7409
Organic Hass Avocado       7293
Strawberries               6494
Limes                      6033
Organic Raspberries        5546
Name: count, dtype: int64

Example: most frequently purchased departments

df_train_analytic["department"].value_counts()
department
produce            409087
dairy eggs         217051
snacks             118862
beverages          113962
frozen             100426
pantry              81242
bakery              48394
canned goods        46799
deli                44291
dry goods pasta     38713
household           35986
meat seafood        30307
breakfast           29552
personal care       21599
babies              14941
international       11902
missing              8251
alcohol              5602
pets                 4497
other                1795
bulk                 1359
Name: count, dtype: int64

Example: reorder rate by department

df_train_analytic.groupby("department")["reordered"].mean().sort_values(ascending=False)
department
dairy eggs        0.67
produce           0.66
beverages         0.66
bakery            0.63
pets              0.63
deli              0.62
alcohol           0.61
meat seafood      0.59
snacks            0.58
bulk              0.58
breakfast         0.57
frozen            0.56
babies            0.54
dry goods pasta   0.49
canned goods      0.49
household         0.43
other             0.39
missing           0.38
international     0.38
pantry            0.36
personal care     0.34
Name: reordered, dtype: float64

Example: cross-tab of department and reordered flag

pd.crosstab(
    df_train_analytic["department"],
    df_train_analytic["reordered"]
)
reordered 0 1
department
alcohol 2203 3399
babies 6857 8084
bakery 17702 30692
beverages 38960 75002
breakfast 12654 16898
bulk 573 786
canned goods 24017 22782
dairy eggs 70549 146502
deli 16924 27367
dry goods pasta 19828 18885
frozen 44258 56168
household 20614 15372
international 7380 4522
meat seafood 12400 17907
missing 5103 3148
other 1098 697
pantry 51744 29498
personal care 14309 7290
pets 1663 2834
produce 137201 271886
snacks 49760 69102

Example: cross-tab of order day and reordered flag

pd.crosstab(
    df_train_analytic["order_dow"],
    df_train_analytic["reordered"],
    normalize="index"
)
reordered 0 1
order_dow
0 0.39 0.61
1 0.40 0.60
2 0.41 0.59
3 0.41 0.59
4 0.41 0.59
5 0.39 0.61
6 0.41 0.59

Visual summary of the full merge pipeline

flowchart LR
    A[products] -->|department_id| B[products + departments]
    B -->|aisle_id| C[products_enriched]
    D[departments_t] --> B
    E[aisles] --> C
    F[orders_train] -->|order_id| G[train_transactions]
    H[order_products_train] -->|order_id| G
    G -->|product_id| I[train_analytic]
    C -->|product_id| I

Homework | Solution

Step 1 | Downloading the Data

In order to build the final version of the dataframe and continue with the main analytical tasks, we also need to merge the customers and states dimensions.

data = ['customers.csv', 'states.csv']

URL = 'https://raw.githubusercontent.com/hovhannisyan91/aca/refs/heads/main/lab/python/data/raw/'

ldf = []
for i in data:
    df = pd.read_csv(URL + i)
    # since we want to work with both tables, we can load them in one pass,
    # first collecting each dataframe in a list
    ldf.append(df)
df_customers = ldf[0]
df_states = ldf[1]

# alternatively, read from the local copies in data/raw
df_customers = pd.read_csv('../data/raw/customers.csv')
df_states = pd.read_csv('../data/raw/states.csv')

Step 2 | Exploring the Dataframes

customers

df_customers.head()
user_id First Name Surname Gender STATE Age date_joined n_dependants fam_status income
0 26711 Deborah Esquivel Female Missouri 48 1/1/2017 3 married 165665
1 33890 Patricia Hart Female New Mexico 36 1/1/2017 0 single 59285
2 65803 Kenneth Farley Male Idaho 35 1/1/2017 2 married 99568
3 125935 Michelle Hicks Female Iowa 40 1/1/2017 0 single 42049
4 130797 Ann Gilmore Female Maryland 26 1/1/2017 1 married 40374

states

df_states.head()

Step 3 | Merging customers with states

Important

Pay attention to the state name column in df_customers.

Before merging, we first need to rename it from STATE to state.

df_customers.rename(columns={'STATE':'state'}, inplace=True)
df_customers_states = df_customers.merge(df_states, on='state')
df_customers_states.head()
user_id First Name Surname Gender state Age date_joined n_dependants fam_status income region division
0 26711 Deborah Esquivel Female Missouri 48 1/1/2017 3 married 165665 Midwest West North Central
1 33890 Patricia Hart Female New Mexico 36 1/1/2017 0 single 59285 West Mountain
2 65803 Kenneth Farley Male Idaho 35 1/1/2017 2 married 99568 West Mountain
3 125935 Michelle Hicks Female Iowa 40 1/1/2017 0 single 42049 Midwest West North Central
4 130797 Ann Gilmore Female Maryland 26 1/1/2017 1 married 40374 South South Atlantic

Step 4 | Merging with the main dataframe

df_train_analytics_final = df_train_analytic.merge(df_customers_states, on='user_id', how='left')

Now we can drop all the ID columns.

df_train_analytics_final.drop(columns=['user_id','eval_set','product_id', 'aisle_id', 'department_id'],inplace=True)
print(df_train_analytics_final.shape)
df_train_analytics_final.head()
(1384706, 22)
order_id order_number order_dow order_hour_of_day days_since_prior_order add_to_cart_order reordered product_name prices department ... Surname Gender state Age date_joined n_dependants fam_status income region division
0 1187899 11 4 8 14.00 1 1 Soda 9.00 beverages ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
1 1187899 11 4 8 14.00 2 1 Organic String Cheese 8.60 dairy eggs ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
2 1187899 11 4 8 14.00 3 1 0% Greek Strained Yogurt 12.60 dairy eggs ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
3 1187899 11 4 8 14.00 4 1 XL Pick-A-Size Paper Towel Rolls 1.00 household ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
4 1187899 11 4 8 14.00 5 1 Milk Chocolate Almonds 6.80 snacks ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central

5 rows × 22 columns

Step 5 | Saving in the data/processed folder

The resulting dataframe is very large, so we will save it as a Parquet file; the Parquet format compresses the data much more effectively than CSV.

# NOTE: the virtual environment must be active,
# and a parquet engine installed first: pip install pyarrow fastparquet
df_train_analytics_final.to_parquet("../data/processed/instacart.parquet", index=False)

OR

df_train_analytics_final.to_csv("../data/processed/instacart.csv", index = False)
Tip

Try to save the same df_train_analytics_final dataframe both in csv and parquet format, in order to feel the difference!

Pandas Cheatsheet

Important

More information about pandas can be found here.