
Session 04: Advanced Pandas

Pandas
Aggregating
Merging

Introduction

In the previous session, we learned how to import .csv files into Pandas and perform the first exploratory checks on a dataset. We worked with the structure of a single table and practiced reading rows, understanding columns, and interpreting the basic shape of a DataFrame.

In this session, we move from single-table exploration to multi-table analytical thinking.

Our goal is to understand the Instacart dataset table by table, identify what each table represents, ask meaningful analytical questions, and then connect the tables using pd.merge().

This reflects real analytical work. In practice, data is often stored in several related tables rather than in one large spreadsheet. An analyst must first understand each table independently, then combine them carefully in order to answer richer business questions.

For this session, we work with the following files:

  • orders.csv
  • order_products_train.csv
  • products.csv
  • departments_t.csv
  • aisles.csv
Important

Here you can find the Instacart Market Basket Analysis dataset and its related documentation.

Importing the data

Now let’s import the needed data one file at a time:

Important

Remember to download the Archieve.zip file again.

Importing orders.csv file

df_orders = pd.read_csv('../data/raw/orders.csv')

print(df_orders.columns)
print(df_orders.shape)
df_orders.head()
Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='str')
(3421083, 7)
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 NaN
1 2398795 1 prior 2 3 7 15.0
2 473747 1 prior 3 3 12 21.0
3 2254736 1 prior 4 4 7 29.0
4 431534 1 prior 5 4 15 28.0

Importing order_products_train.csv file

# Reading data
df_orders_products_train = pd.read_csv('../data/raw/order_products_train.csv')
print(df_orders_products_train.columns)
print(df_orders_products_train.shape)
df_orders_products_train.head()
Index(['order_id', 'product_id', 'add_to_cart_order', 'reordered'], dtype='str')
(1384617, 4)
order_id product_id add_to_cart_order reordered
0 1 49302 1 1
1 1 11109 2 1
2 1 10246 3 0
3 1 49683 4 0
4 1 43633 5 1

Importing products.csv file

df_products = pd.read_csv("../data/raw/products.csv")
print(df_products.columns)
print(df_products.shape)
df_products.head()
Index(['product_id', 'product_name', 'aisle_id', 'department_id', 'prices'], dtype='str')
(49693, 5)
product_id product_name aisle_id department_id prices
0 1 Chocolate Sandwich Cookies 61 19 5.8
1 2 All-Seasons Salt 104 13 9.3
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.5
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.5
4 5 Green Chile Anytime Sauce 5 13 4.3

Importing departments_t.csv

df_departments = pd.read_csv('../data/processed/departments_t.csv')
print(df_departments.columns)
print(df_departments.shape)
df_departments.head()
Index(['department_id', 'department'], dtype='str')
(21, 2)
department_id department
0 1 frozen
1 2 other
2 3 bakery
3 4 produce
4 5 alcohol

Importing aisles.csv

df_aisles = pd.read_csv('../data/raw/aisles.csv')
print(df_aisles.columns)
print(df_aisles.shape)
df_aisles.head()
Index(['aisle_id', 'aisle'], dtype='str')
(134, 2)
aisle_id aisle
0 1 prepared soups salads
1 2 specialty cheeses
2 3 energy granola bars
3 4 instant foods
4 5 marinades meat preparation

Data Dictionary

Before performing any analysis, we need to understand what each table represents.

Overview of the Tables

Table                 Analytical Role           Grain                                Main Key
orders                Behavioral table          One row per order                    order_id
order_products_train  Transaction bridge table  One row per product within an order  no single unique key
products              Product master table      One row per product                  product_id
departments_t         Department lookup table   One row per department               department_id
aisles                Aisle lookup table        One row per aisle                    aisle_id

The orders Table

The orders table is the behavioral table of the dataset.

Each row represents one order placed by one user.

df_orders.head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 NaN
1 2398795 1 prior 2 3 7 15.0
2 473747 1 prior 3 3 12 21.0
3 2254736 1 prior 4 4 7 29.0
4 431534 1 prior 5 4 15 28.0

Important columns:

  • order_id: unique order identifier
  • user_id: identifier of the customer
  • eval_set: indicates whether the order belongs to prior, train, or test (we are going to use train)
  • order_number: sequence number of the order for that user
  • order_dow: day of week of the order
  • order_hour_of_day: hour of day of the order
  • days_since_prior_order: number of days since the customer’s previous order

Analytical questions that can be answered from this table alone:

  • How many total orders are there?
  • How many users are represented?
  • What is the distribution of eval_set?
  • On which days do customers place orders most frequently?
  • At what hours do customers place orders most frequently?
  • Are missing values in days_since_prior_order meaningful?

The order_products_train Table

This is the transaction bridge table for train orders.

Each row represents one product inside one order.

Important columns:

  • order_id: identifier of the order
  • product_id: identifier of the product
  • add_to_cart_order: the sequence in which the product was added to the basket
  • reordered: indicates whether the item had been ordered before by the same customer
Important
  • order_id is not unique in this table
  • product_id is not unique in this table
  • repeated values are expected because one order can contain many products, and one product can appear in many orders

Analytical questions that can be answered from this table alone:

  • How many line items are there?
  • How many unique orders are represented?
  • How many unique products are represented?
  • What proportion of items are reordered?
  • What does the distribution of add_to_cart_order look like?

The products Table

The products table is the product master table.

Each row represents one product.

Important columns:

  • product_id: unique identifier of the product
  • product_name: product name
  • aisle_id: identifier of the aisle
  • department_id: identifier of the department

Analytical questions that can be answered from this table alone:

  • How many products exist?
  • Are product IDs unique?
  • How many departments are represented in the product catalog?
  • How many aisles are represented in the product catalog?
  • Do some product names repeat?

The departments_t Table

This is a lookup table.

Each row represents one department.

Important columns:

  • department_id: unique department identifier
  • department: department name

Analytical questions that can be answered from this table alone:

  • How many departments exist?
  • Are department IDs unique?

The aisles Table

This is also a lookup table.

Each row represents one aisle.

Important columns:

  • aisle_id: unique aisle identifier
  • aisle: aisle name

Analytical questions that can be answered from this table alone:

  • How many aisles exist?
  • Are aisle IDs unique?

Relationship Between the Tables

Before merging, we should understand how the tables are connected.

flowchart LR
    A[orders] -->|order_id| B[order_products_train]
    B -->|product_id| C[products]
    C -->|department_id| D[departments_t]
    C -->|aisle_id| E[aisles]

This tells us the following:

  • the orders table contains order-level behavior
  • the order_products_train table connects orders to products
  • the products table adds product information
  • the departments_t and aisles tables make product categories more interpretable
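These key relationships can be verified directly with a membership test. Below is a minimal sketch on toy frames (illustration only; in the notebook the same check would use df_products and df_orders_products_train):

```python
import pandas as pd

# Toy stand-ins for the product master and the bridge table.
products = pd.DataFrame({"product_id": [1, 2, 3], "department_id": [19, 13, 7]})
bridge = pd.DataFrame({"order_id": [10, 10, 11], "product_id": [1, 3, 3]})

# Referential integrity: every product_id referenced by the bridge table
# should exist in the product master table.
all_matched = bridge["product_id"].isin(products["product_id"]).all()
print(all_matched)  # True when no orphan product_id exists
```

If this returns False, some transactions reference products that are missing from the catalog, and a later merge would produce unmatched rows.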

Why We Explore Table by Table First

Before using pd.merge(), it is good analytical practice to inspect each table independently.

This allows us to:

  • understand the grain of the data
  • identify key columns
  • find missing values
  • check whether IDs are unique where they should be unique
  • detect possible problems before combining tables
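A quick way to start this inspection is to loop over all tables and print their shapes in one pass. A sketch with toy frames (in the notebook, the dict would hold the five dataframes loaded above):

```python
import pandas as pd

# Toy stand-ins; in practice these would be the five loaded dataframes.
tables = {
    "orders": pd.DataFrame({"order_id": [1, 2]}),
    "products": pd.DataFrame({"product_id": [1, 2, 3]}),
}

# One line per table: name, rows, columns.
for name, df in tables.items():
    print(f"{name}: {df.shape[0]} rows x {df.shape[1]} columns")
```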

Exploring the orders Table

We begin with the orders table because it is the most intuitive table in the Instacart dataset. It contains behavioral information about when users place orders and how frequently they shop.

Each row in this table represents one order placed by one user.

Previewing the data

df_orders.head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 NaN
1 2398795 1 prior 2 3 7 15.0
2 473747 1 prior 3 3 12 21.0
3 2254736 1 prior 4 4 7 29.0
4 431534 1 prior 5 4 15 28.0

The output helps us answer two important structural questions:

  • How many rows and columns does the table have?
  • What variables are available for analysis?

Inspecting the schema

A very important step in Pandas is checking the schema of the table. This helps us understand the data types and whether missing values are present.

df_orders.info()
<class 'pandas.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int64  
 1   user_id                 int64  
 2   eval_set                str    
 3   order_number            int64  
 4   order_dow               int64  
 5   order_hour_of_day       int64  
 6   days_since_prior_order  float64
dtypes: float64(1), int64(5), str(1)
memory usage: 198.9 MB

Looking at data types

Each column has a specific role, and its data type should match its meaning.

order_id                    int64
user_id                     int64
eval_set                      str
order_number                int64
order_dow                   int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object

Descriptive statistics

For numeric columns, descriptive statistics provide a first summary of the data.

df_orders.describe()
order_id user_id order_number order_dow order_hour_of_day days_since_prior_order
count 3.421083e+06 3.421083e+06 3.421083e+06 3.421083e+06 3.421083e+06 3.214874e+06
mean 1.710542e+06 1.029782e+05 1.715486e+01 2.776219e+00 1.345202e+01 1.111484e+01
std 9.875817e+05 5.953372e+04 1.773316e+01 2.046829e+00 4.226088e+00 9.206737e+00
min 1.000000e+00 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 8.552715e+05 5.139400e+04 5.000000e+00 1.000000e+00 1.000000e+01 4.000000e+00
50% 1.710542e+06 1.026890e+05 1.100000e+01 3.000000e+00 1.300000e+01 7.000000e+00
75% 2.565812e+06 1.543850e+05 2.300000e+01 5.000000e+00 1.600000e+01 1.500000e+01
max 3.421083e+06 2.062090e+05 1.000000e+02 6.000000e+00 2.300000e+01 3.000000e+01

This is useful for checking the ranges of variables such as:

  • order_number
  • order_dow
  • order_hour_of_day
  • days_since_prior_order

Note that from now on we will use the pd.options.display.float_format = '{:,.2f}'.format option for data reporting, to make the values easier to read.

pd.options.display.float_format = '{:,.2f}'.format
df_orders.describe()
order_id user_id order_number order_dow order_hour_of_day days_since_prior_order
count 3,421,083.00 3,421,083.00 3,421,083.00 3,421,083.00 3,421,083.00 3,214,874.00
mean 1,710,542.00 102,978.21 17.15 2.78 13.45 11.11
std 987,581.74 59,533.72 17.73 2.05 4.23 9.21
min 1.00 1.00 1.00 0.00 0.00 0.00
25% 855,271.50 51,394.00 5.00 1.00 10.00 4.00
50% 1,710,542.00 102,689.00 11.00 3.00 13.00 7.00
75% 2,565,812.50 154,385.00 23.00 5.00 16.00 15.00
max 3,421,083.00 206,209.00 100.00 6.00 23.00 30.00

Missing values

Missing values should always be checked before moving to deeper analysis.

df_orders.isna().sum()
order_id                       0
user_id                        0
eval_set                       0
order_number                   0
order_dow                      0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64
Tip

This is especially important for days_since_prior_order, because in this dataset null values may be meaningful rather than erroneous.

Analytical questions for orders

Even without merging, the orders table already allows us to answer several useful analytical questions.

  • How many orders are there in total?
  • How many unique users are represented?
  • What is the distribution of eval_set?
  • On which days do customers place orders most frequently?
  • At what hours do customers place orders most frequently?
  • Are missing values in days_since_prior_order meaningful?
  • Is the first order of a user treated differently from later orders?

Q1: How many unique orders exist?

df_orders["order_id"].nunique()
3421083

Q2: How many distinct customers are represented in the dataset?

df_orders["user_id"].nunique()
206209

Q3: Exploring the eval_set distribution

The eval_set variable is very important because it determines which subset of orders we will use later in the analysis.

This shows how the orders are split across the prior, train, and test subsets:

df_orders["eval_set"].value_counts(dropna=False)
eval_set
prior    3214874
train     131209
test       75000
Name: count, dtype: int64

Later, we will filter this table to the train subset before merging it with order_products_train.

Q4: Orders by day of week

This tells us how orders are distributed across the week. We can now begin exploring ordering behavior.

df_orders["order_dow"].value_counts().sort_index()
order_dow
0    600905
1    587478
2    467260
3    436972
4    426339
5    453368
6    448761
Name: count, dtype: int64

A useful analytical question here is: Are some days clearly more active than others?

Q5: Orders by hour of day

Next, we examine the hourly distribution of orders.

  • At what time of day do customers shop most often?
  • Is ordering behavior concentrated in a small time window?
df_orders["order_hour_of_day"].value_counts().sort_index()
order_hour_of_day
0      22758
1      12398
2       7539
3       5474
4       5527
5       9569
6      30529
7      91868
8     178201
9     257812
10    288418
11    284728
12    272841
13    277999
14    283042
15    283639
16    272553
17    228795
18    182912
19    140569
20    104292
21     78109
22     61468
23     40043
Name: count, dtype: int64

Q6: Understanding missing values in days_since_prior_order

This is one of the most interesting columns in the table because its missing values may actually have a logical explanation.

df_orders["days_since_prior_order"].isna().sum()
np.int64(206209)

At first glance, missing values may seem like a data quality issue. However, here we should ask a better question:

Could these null values correspond to the first order of each user?

Logical check: first orders and null values

We can test this directly.

df_orders.loc[df_orders["order_number"] == 1, "days_since_prior_order"].isna().all()
np.True_

If the result is True, then the null values are not random. They are structurally meaningful.

Caution

This is an important analytical lesson:

  • not every null is a problem
  • some null values reflect the business logic of the data
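The check above confirms that every first order has a null; the converse (every null belongs to a first order) can be tested the same way. A minimal sketch on a toy frame mimicking the orders schema:

```python
import pandas as pd

# Toy orders: first orders (order_number == 1) have no prior order,
# so days_since_prior_order is null for them.
orders = pd.DataFrame({
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [None, 15.0, 21.0, None, 7.0],
})

# Converse check: every row with a null days_since_prior_order
# should be a first order.
nulls_are_first_orders = orders.loc[
    orders["days_since_prior_order"].isna(), "order_number"
].eq(1).all()
print(nulls_are_first_orders)
```

Running both directions of the check establishes a one-to-one correspondence between nulls and first orders, which is exactly the business-logic explanation we proposed.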

Consistency checks for the orders table

Before using a table in deeper analysis, it is good practice to perform a few basic consistency checks.

Is order_id unique?

Since each row should represent one order, order_id should uniquely identify rows.

df_orders["order_id"].is_unique
True

Is order_dow in the correct range?

Day of week should logically be between 0 and 6.

df_orders["order_dow"].between(0, 6).all()
np.True_

Is order_hour_of_day in the correct range?

Hour of day should logically be between 0 and 23.

df_orders["order_hour_of_day"].between(0, 23).all()
np.True_
Warning

These checks are simple, but they are important. They help us confirm that the table behaves as expected before we move on.

Filtering the orders table to train orders

Since the next transactional table we will use is order_products_train.csv, we should prepare the matching subset of orders in advance.

df_orders_train = df_orders[df_orders["eval_set"] == "train"]
df_orders_train.head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
10 1187899 1 train 11 4 8 14.00
25 1492625 2 train 15 1 11 30.00
49 2196797 5 train 5 0 11 6.00
74 525192 7 train 21 2 11 6.00
78 880375 8 train 4 1 14 10.00

Why do we filter first?

This is an important analytical step.

  • order_products_train.csv contains only products from train orders
  • filtering first keeps the analysis focused and consistent
  • this is similar to applying a WHERE condition before a JOIN in SQL
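The SQL parallel can be made explicit. A sketch on a toy frame, with the equivalent SQL shown in comments (names here are illustrative, not the notebook's dataframes):

```python
import pandas as pd

# SQL equivalent of the filter-first pattern:
#   SELECT * FROM orders WHERE eval_set = 'train';
# Only afterwards would we JOIN against the basket table.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "eval_set": ["prior", "train", "test"],
})

# Boolean filtering plays the role of the WHERE clause.
train_orders = orders[orders["eval_set"] == "train"]
print(train_orders)
```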

Keeping only the columns we need

To make the next steps cleaner, we can keep only the most relevant columns from the filtered table.

df_orders_train = df_orders_train.drop(columns=["eval_set"])
df_orders_train.head()
order_id user_id order_number order_dow order_hour_of_day days_since_prior_order
10 1187899 1 11 4 8 14.00
25 1492625 2 15 1 11 30.00
49 2196797 5 5 0 11 6.00
74 525192 7 21 2 11 6.00
78 880375 8 4 1 14 10.00
Important

df_orders takes a significant amount of memory, so once we finish working with the dataframe we can delete it:

del df_orders
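Before deleting, the footprint can be measured with memory_usage; a sketch on a toy frame (the numbers for the real df_orders will differ):

```python
import pandas as pd

# Toy frame standing in for a large dataframe (illustration only).
df = pd.DataFrame({"order_id": range(1000), "eval_set": ["prior"] * 1000})

# deep=True accounts for the actual string payloads, not just pointers.
mb = df.memory_usage(deep=True).sum() / 1e6
print(f"{mb:.2f} MB")

del df  # drop the reference once the dataframe is no longer needed
```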

Summary of the orders table

At this stage, we understand the orders table both structurally and analytically.

  • It is a behavioral table with one row per order
  • It helps us study timing and frequency of customer orders
  • It introduces the important eval_set variable
  • It provides a good example of meaningful null values
  • It allows us to prepare the correct subset before merging with the train basket table

Exploring the order_products_train Table

We now move to the order_products_train table. Unlike orders, this is not an order-level table. Instead, it is a transaction bridge table that connects orders to products.

Each row in this table represents one product inside one order.

Previewing the data

df_orders_products_train.head()
order_id product_id add_to_cart_order reordered
0 1 49302 1 1
1 1 11109 2 1
2 1 10246 3 0
3 1 49683 4 0
4 1 43633 5 1

The output helps us answer two important structural questions:

  • What does one row represent in this table?
  • Which columns help us connect orders to products?

Inspecting the schema

A very important step in Pandas is checking the schema of the table. This helps us understand the data types and whether missing values are present.

df_orders_products_train.info()
<class 'pandas.DataFrame'>
RangeIndex: 1384617 entries, 0 to 1384616
Data columns (total 4 columns):
 #   Column             Non-Null Count    Dtype
---  ------             --------------    -----
 0   order_id           1384617 non-null  int64
 1   product_id         1384617 non-null  int64
 2   add_to_cart_order  1384617 non-null  int64
 3   reordered          1384617 non-null  int64
dtypes: int64(4)
memory usage: 42.3 MB

Looking at data types

Each column has a specific role, and its data type should match its meaning.

df_orders_products_train.dtypes
order_id             int64
product_id           int64
add_to_cart_order    int64
reordered            int64
dtype: object

Descriptive statistics

For numeric columns, descriptive statistics provide a first summary of the data.

df_orders_products_train.describe()
order_id product_id add_to_cart_order reordered
count 1,384,617.00 1,384,617.00 1,384,617.00 1,384,617.00
mean 1,706,297.62 25,556.24 8.76 0.60
std 989,732.65 14,121.27 7.42 0.49
min 1.00 1.00 1.00 0.00
25% 843,370.00 13,380.00 3.00 0.00
50% 1,701,880.00 25,298.00 7.00 1.00
75% 2,568,023.00 37,940.00 12.00 1.00
max 3,421,070.00 49,688.00 80.00 1.00

This is useful for checking the ranges of variables such as:

  • order_id
  • product_id
  • add_to_cart_order
  • reordered

Missing values

Missing values should always be checked before moving to deeper analysis.

df_orders_products_train.isna().sum()
order_id             0
product_id           0
add_to_cart_order    0
reordered            0
dtype: int64
Tip

This is an especially important table because it will later be used for merging. If key columns such as order_id or product_id have missing values, the merge process may produce incomplete results.

Analytical questions for order_products_train

Even without merging, the order_products_train table already allows us to answer several useful analytical questions.

  • How many line items are there in total?
  • How many unique orders are represented?
  • How many unique products are represented?
  • What is the distribution of reordered?
  • What does add_to_cart_order tell us about basket construction?
  • Are there any suspicious or inconsistent values?

Q1: How many line items exist?

len(df_orders_products_train)
1384617

Q2: How many unique orders are represented in this table?

df_orders_products_train["order_id"].nunique()
131209

Q3: How many unique products are represented in this table?

df_orders_products_train["product_id"].nunique()
39123

Q4: Exploring reorder behavior

The reordered variable is analytically important because it helps us understand repeated purchase behavior.

This shows how the line items are split between reordered and first-time products:

df_orders_products_train["reordered"].value_counts(dropna=False)
reordered
1    828824
0    555793
Name: count, dtype: int64

Later, after merging with products, departments_t, and aisles, we will be able to see which products and categories have the highest reorder tendency.

Q5: Reorder share

Instead of only counts, we can also look at proportions.

df_orders_products_train["reordered"].value_counts(normalize=True)
reordered
1   0.60
0   0.40
Name: proportion, dtype: float64

A useful analytical question here is: What share of line items in train orders corresponds to reordered products?
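Because reordered is a 0/1 flag, the same share can also be computed as a mean; a sketch on a toy column:

```python
import pandas as pd

# Toy reordered flags: 3 reorders out of 5 line items.
reordered = pd.Series([1, 0, 1, 1, 0], name="reordered")

# The mean of a 0/1 flag equals the share of 1s,
# matching value_counts(normalize=True).
share = reordered.mean()
print(share)  # 0.6
```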

Q6: Understanding add_to_cart_order

This variable tells us the sequence in which a product was added to the cart.

df_orders_products_train["add_to_cart_order"].describe()
count   1,384,617.00
mean            8.76
std             7.42
min             1.00
25%             3.00
50%             7.00
75%            12.00
max            80.00
Name: add_to_cart_order, dtype: float64

This allows us to ask:

  • Are most products added early in the basket-building process?
  • How large do baskets tend to become?
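Basket size per order can be derived by grouping on order_id; a minimal sketch on a toy bridge table (the notebook would group df_orders_products_train the same way):

```python
import pandas as pd

# Toy bridge table: order 1 has 3 items, order 2 has 1 item.
bridge = pd.DataFrame({
    "order_id": [1, 1, 1, 2],
    "product_id": [49302, 11109, 10246, 43633],
})

# One value per order, counting line items: the basket size.
basket_size = bridge.groupby("order_id")["product_id"].count()
print(basket_size)
```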

Q7: Which cart positions appear most often?

df_orders_products_train["add_to_cart_order"].value_counts().sort_index().head(20)
add_to_cart_order
1     131209
2     124364
3     116996
4     108963
5     100745
6      91850
7      83142
8      74601
9      66618
10     59401
11     52848
12     46814
13     41431
14     36588
15     32194
16     28363
17     24841
18     21733
19     19014
20     16541
Name: count, dtype: int64

Later, after merging, we may ask whether some products or departments tend to be added earlier than others.

Understanding repeated keys in the table

This table behaves differently from orders.

df_orders_products_train["order_id"].is_unique
False
df_orders_products_train["product_id"].is_unique
False

Both results are False, and that is expected here.

Caution

This is an important analytical lesson:

  • order_id is expected to repeat because one order can contain many products
  • product_id is expected to repeat because the same product can appear in many different orders
  • repeated keys are not always an error; sometimes they reflect the grain of the table
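While each key repeats on its own, the grain of a line-item table is the pair of keys: the same product should not appear twice within the same order. A minimal sketch of that check, on a toy frame standing in for df_orders_products_train:

```python
import pandas as pd

# Toy stand-in: order_id and product_id each repeat, but the pair is unique
df_line_items = pd.DataFrame({
    "order_id":   [10, 10, 20, 20],
    "product_id": [196, 407, 196, 407],
})

# The individual key columns repeat -- expected for line-item data
print(df_line_items["order_id"].is_unique)    # False
print(df_line_items["product_id"].is_unique)  # False

# The grain is the (order_id, product_id) pair: duplicated() on the
# pair of columns should find nothing
pair_is_unique = not df_line_items.duplicated(subset=["order_id", "product_id"]).any()
print(pair_is_unique)  # True
```

If this check returned False on the real table, it would mean the same product was recorded twice within one order, which would be worth investigating.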

Consistency checks for the order_products_train table

Before using a table in deeper analysis, it is good practice to perform a few basic consistency checks.

Is reordered limited to valid values?

The reordered column should logically contain only 0 and 1.

df_orders_products_train["reordered"].isin([0, 1]).all()
np.True_

Is add_to_cart_order positive?

Since this variable captures position in the basket, it should be greater than 0.

(df_orders_products_train["add_to_cart_order"] > 0).all()
np.True_

Are there missing values in the key columns?

df_orders_products_train[["order_id", "product_id"]].isna().sum()
order_id      0
product_id    0
dtype: int64
Warning

These checks are simple, but they are important. They help us confirm that the table behaves as expected before we move on.
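One way to make such checks fail loudly instead of requiring a visual scan of the output is to wrap them in assert statements. A minimal sketch, on a toy frame with the same columns as df_orders_products_train:

```python
import pandas as pd

# Toy stand-in with the same columns as df_orders_products_train
df = pd.DataFrame({
    "order_id":          [10, 10, 20],
    "product_id":        [196, 407, 196],
    "add_to_cart_order": [1, 2, 1],
    "reordered":         [1, 0, 1],
})

# assert raises immediately if a check fails, which is handy in notebooks
assert df["reordered"].isin([0, 1]).all(), "reordered has values other than 0/1"
assert (df["add_to_cart_order"] > 0).all(), "non-positive cart positions found"
assert df[["order_id", "product_id"]].notna().all().all(), "missing key values"
print("all consistency checks passed")
```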

Why this table matters for merging

This table is the bridge between orders and products.

  • order_id will connect this table to orders_train
  • product_id will connect this table to products
  • after merging, we will be able to connect customer order behavior with product attributes

Without this table, we would know when orders happened and what products exist, but not which products were purchased in which orders.

Summary of the order_products_train table

At this stage, we understand the order_products_train table both structurally and analytically.

  • It is a transaction bridge table with one row per product within an order
  • It connects orders to products
  • It helps us study basket composition and reorder behavior
  • It introduces repeated keys as a natural property of line-item data
  • It prepares the foundation for the merge process

Exploring the products Table

We now move to the products table. Unlike orders and order_products_train, this is a master data table.

Each row in this table represents one product.

Previewing the data

df_products.head()
product_id product_name aisle_id department_id prices
0 1 Chocolate Sandwich Cookies 61 19 5.80
1 2 All-Seasons Salt 104 13 9.30
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.50
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50
4 5 Green Chile Anytime Sauce 5 13 4.30

The output helps us answer two important structural questions:

  • What attributes are available for each product?
  • Which columns will later help us enrich the transaction data?

Inspecting the schema

A very important step in Pandas is checking the schema of the table. This helps us understand the data types and whether missing values are present.

df_products.info()
<class 'pandas.DataFrame'>
RangeIndex: 49693 entries, 0 to 49692
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   product_id     49693 non-null  int64  
 1   product_name   49677 non-null  str    
 2   aisle_id       49693 non-null  int64  
 3   department_id  49693 non-null  int64  
 4   prices         49693 non-null  float64
dtypes: float64(1), int64(3), str(1)
memory usage: 3.4 MB

Looking at data types

Each column has a specific role, and its data type should match its meaning.

df_products.dtypes
product_id         int64
product_name         str
aisle_id           int64
department_id      int64
prices           float64
dtype: object

Descriptive statistics

For numeric columns, descriptive statistics provide a first summary of the data.

df_products.describe()
product_id aisle_id department_id prices
count 49,693.00 49,693.00 49,693.00 49,693.00
mean 24,844.35 67.77 11.73 9.99
std 14,343.72 38.32 5.85 453.52
min 1.00 1.00 1.00 1.00
25% 12,423.00 35.00 7.00 4.10
50% 24,845.00 69.00 13.00 7.10
75% 37,265.00 100.00 17.00 11.20
max 49,688.00 134.00 21.00 99,999.00

This is useful for checking the ranges of variables such as:

  • product_id
  • aisle_id
  • department_id

Missing values

Missing values should always be checked before moving to deeper analysis.

df_products.isna().sum()
product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64
Tip

This table is especially important because it provides product attributes that will later be attached to the transaction data. Missing values in aisle_id or department_id could weaken the interpretation of merged results.
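When a column has missing values, the next step is usually to look at the affected rows. The sketch below filters to them with a boolean mask built from isna(), using a toy stand-in for df_products with one missing name:

```python
import pandas as pd
import numpy as np

# Toy stand-in for df_products: one product_name is missing
df_products_demo = pd.DataFrame({
    "product_id":    [1, 2, 3],
    "product_name":  ["Chocolate Sandwich Cookies", np.nan, "Oolong Tea"],
    "aisle_id":      [61, 104, 94],
    "department_id": [19, 13, 7],
})

# Keep only the rows where the name is missing so we can inspect them
missing_names = df_products_demo[df_products_demo["product_name"].isna()]
print(missing_names)
```

On the real table, the same mask would surface the 16 products without names.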

Analytical questions for products

Even without merging, the products table already allows us to answer several useful analytical questions.

  • How many products are there in total?
  • Are product IDs unique?
  • How many aisles and departments are represented?
  • Do product names repeat?
  • Is the product catalog evenly distributed across categories?

Q1: How many unique products exist?

df_products["product_id"].nunique()
49686

Q2: Are product IDs unique?

df_products["product_id"].is_unique
False
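Since the table has 49,693 rows but only 49,686 unique IDs, a few product_id values must repeat. duplicated(keep=False) marks every copy of a duplicated key, which lets us inspect them side by side. A minimal sketch on a toy stand-in:

```python
import pandas as pd

# Toy stand-in: product_id 2 appears twice
df_products_demo = pd.DataFrame({
    "product_id":   [1, 2, 2, 3],
    "product_name": ["Cookies", "Salt", "Salt", "Tea"],
})

# keep=False marks every copy of a duplicated id, not just the later ones
dup_ids = df_products_demo[df_products_demo["product_id"].duplicated(keep=False)]
print(dup_ids)
```

Running the same filter on df_products (sorted by product_id) would show whether the duplicated IDs are full duplicate rows or conflicting records.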

Q3: How many unique departments are referenced in the product table?

df_products["department_id"].nunique()
21

Q4: How many unique aisles are referenced in the product table?

df_products["aisle_id"].nunique()
134

Q5: Do some product names repeat?

df_products["product_name"].duplicated().sum()
np.int64(20)

Later, after merging with lookup tables, we will be able to interpret department and aisle information in business language instead of numeric IDs.

Understanding the role of department_id and aisle_id

The products table does not contain department names or aisle names directly. Instead, it contains numeric keys.

df_products[["product_id", "product_name", "aisle_id", "department_id"]].head()
product_id product_name aisle_id department_id
0 1 Chocolate Sandwich Cookies 61 19
1 2 All-Seasons Salt 104 13
2 3 Robust Golden Unsweetened Oolong Tea 94 7
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1
4 5 Green Chile Anytime Sauce 5 13

This means:

  • department_id will later connect to departments_t
  • aisle_id will later connect to aisles

Consistency checks for the products table

Before using a table in deeper analysis, it is good practice to perform a few basic consistency checks.

Are product_id values unique?

df_products["product_id"].is_unique
False

Are department_id values missing?

df_products["department_id"].isna().sum()
np.int64(0)

Are aisle_id values missing?

df_products["aisle_id"].isna().sum()
np.int64(0)
Warning

These checks are simple, but they are important. They help us confirm that the table behaves as expected before we move on.

Why this table matters for merging

This table gives business meaning to the product_id values in the transaction table.

  • it introduces product names
  • it connects products to aisles
  • it connects products to departments
  • it will later help us move from product IDs to interpretable categories

Summary of the products table

At this stage, we understand the products table both structurally and analytically.

  • It is a master data table with one row per product
  • It provides product names and categorical keys
  • It prepares the foundation for enrichment through lookup tables
  • It will later help us interpret transaction data more meaningfully

Exploring the departments_t Table

We now move to the departments_t table. This is a lookup table.

Each row in this table represents one department.

Previewing the data

df_departments.head()
department_id department
0 1 frozen
1 2 other
2 3 bakery
3 4 produce
4 5 alcohol

The output helps us answer two important structural questions:

  • How many departments are there?
  • What labels will later be used to interpret department_id values?

Inspecting the schema

A very important step in Pandas is checking the schema of the table. This helps us understand the data types and whether missing values are present.

df_departments.info()
<class 'pandas.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   department_id  21 non-null     int64
 1   department     21 non-null     str  
dtypes: int64(1), str(1)
memory usage: 638.0 bytes

Looking at data types

Each column has a specific role, and its data type should match its meaning.

df_departments.dtypes
department_id    int64
department         str
dtype: object

Missing values

Missing values should always be checked before moving to deeper analysis.

df_departments.isna().sum()
department_id    0
department       0
dtype: int64

Analytical questions for departments_t

Even without merging, the departments_t table already allows us to answer several useful analytical questions.

  • How many departments exist?
  • Are department IDs unique?
  • Are there any missing department labels?

Q1: How many departments exist?

df_departments["department"].nunique()
21

Q2: Are department_id values unique?

df_departments["department_id"].is_unique
True

Q3: Are there missing values in the table?

df_departments.isna().sum()
department_id    0
department       0
dtype: int64

Why this table matters for merging

This is a small table, but it is analytically important.

  • it translates department_id into a readable department name
  • it makes the final merged dataset interpretable
  • without it, we would only see numeric department codes

Summary of the departments_t table

At this stage, we understand the departments_t table both structurally and analytically.

  • It is a lookup table with one row per department
  • It provides readable department names
  • It will later enrich the products table

Exploring the aisles Table

We now move to the aisles table. Like departments_t, this is also a lookup table.

Each row in this table represents one aisle.

Previewing the data

df_aisles.head()
aisle_id aisle
0 1 prepared soups salads
1 2 specialty cheeses
2 3 energy granola bars
3 4 instant foods
4 5 marinades meat preparation

The output helps us answer two important structural questions:

  • How many aisles are there?
  • What labels will later be used to interpret aisle_id values?

Inspecting the schema

A very important step in Pandas is checking the schema of the table. This helps us understand the data types and whether missing values are present.

df_aisles.info()
<class 'pandas.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   aisle_id  134 non-null    int64
 1   aisle     134 non-null    str  
dtypes: int64(1), str(1)
memory usage: 4.2 KB

Looking at data types

Each column has a specific role, and its data type should match its meaning.

df_aisles.dtypes
aisle_id    int64
aisle         str
dtype: object

Missing values

Missing values should always be checked before moving to deeper analysis.

df_aisles.isna().sum()
aisle_id    0
aisle       0
dtype: int64

Analytical questions for aisles

Even without merging, the aisles table already allows us to answer several useful analytical questions.

  • How many aisles exist?
  • Are aisle IDs unique?
  • Are there any missing aisle labels?

Q1: How many aisles exist?

df_aisles["aisle"].nunique()
134

Q2: Are aisle_id values unique?

df_aisles["aisle_id"].is_unique
True

Q3: Are there missing values in the table?

df_aisles.isna().sum()
aisle_id    0
aisle       0
dtype: int64

Why this table matters for merging

This is also a small table, but it is analytically important.

  • it translates aisle_id into a readable aisle name
  • it provides a more detailed grouping level than department
  • it will later enrich the products table and make category-level analysis more informative

Summary of the aisles table

At this stage, we understand the aisles table both structurally and analytically.

  • It is a lookup table with one row per aisle
  • It provides readable aisle names
  • It will later enrich the products table

Understanding pd.merge()

Until now, we explored each table independently. That was an important first step because good analytical work starts with understanding the structure, meaning, and limitations of each table before combining them.

Now we move to the next stage: merging tables.

In relational data, different pieces of information are stored in different tables.

  • orders tells us about order behavior
  • order_products_train tells us which products were included in each order
  • products gives product attributes
  • departments_t and aisles translate category IDs into readable labels

If we want to answer richer analytical questions, we need to connect these tables.

In Pandas, the main tool for this is pd.merge().

Before Merge

At this stage, we have explored all tables individually.

  • orders is a behavioral table with one row per order
  • order_products_train is a transaction bridge table with one row per product within an order
  • products is a master table with one row per product
  • departments_t is a lookup table for department names
  • aisles is a lookup table for aisle names

SQL and Pandas Connection

If you already know SQL, the logic of pd.merge() should feel familiar.

SQL Concept Pandas Equivalent
JOIN pd.merge()
ON on=
LEFT JOIN how="left"
INNER JOIN how="inner"
RIGHT JOIN how="right"
FULL OUTER JOIN how="outer"

The core idea is the same:

  • identify a shared key
  • decide which rows you want to keep
  • combine the matching columns

General Structure of pd.merge()

The general syntax is:

pd.merge(left_df, right_df, on="key_column", how="left")

The most important arguments are:

  • left_df: the DataFrame on the left side of the merge
  • right_df: the DataFrame on the right side of the merge
  • on: the column used as the matching key
  • how: the type of join
  • indicator: adds a special column showing where each row came from

Visual Idea of a Merge

A merge can be imagined as matching rows based on shared IDs.

flowchart LR
    A[Left Table] --> C[Shared Key]
    B[Right Table] --> C
    C --> D[Merged Result]

This looks simple, but analytically it is very important because:

  • if the key is wrong, the result is wrong
  • if keys are duplicated unexpectedly, rows may multiply
  • if matches are missing, null values may appear

That is why merging should always be followed by validation.

Simple Merge

Before merging the large transactional tables, it is better to start with a simpler and safer example.

The cleanest first step is:

  • merge products with departments_t
  • merge that result with aisles

This lets us enrich product information first.

Merge: products with departments_t

The products table contains department_id, but it does not contain the actual department name.

If we want readable categories, we need to merge it with departments_t.

Visual view of the first merge

flowchart LR
    A[products] -->|department_id| C[merge]
    B[departments_t] -->|department_id| C
    C --> D[products + department]

Performing the merge

df_products_departments = pd.merge(
    df_products,
    df_departments,
    on="department_id",
    how="left"
)

df_products_departments.head()
product_id product_name aisle_id department_id prices department
0 1 Chocolate Sandwich Cookies 61 19 5.80 snacks
1 2 All-Seasons Salt 104 13 9.30 pantry
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.50 beverages
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50 frozen
4 5 Green Chile Anytime Sauce 5 13 4.30 pantry

Why do we use how="left" here?

We use a left join because we want to keep all rows from df_products.

That means:

  • every product should remain in the result
  • department information will be added where a match exists
  • if a match is missing, Pandas will keep the product row and place NaN in the department column

This is usually the safest merge for analytical enrichment.

Understanding the indicator Argument

One of the most useful but underused options in pd.merge() is the indicator argument.

When we set indicator=True, Pandas adds a new column called _merge that tells us where each row came from.

df_products_departments_check = pd.merge(
    df_products,
    df_departments,
    on="department_id",
    how="left",
    indicator=True
)

df_products_departments_check.head()
product_id product_name aisle_id department_id prices department _merge
0 1 Chocolate Sandwich Cookies 61 19 5.80 snacks both
1 2 All-Seasons Salt 104 13 9.30 pantry both
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.50 beverages both
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50 frozen both
4 5 Green Chile Anytime Sauce 5 13 4.30 pantry both

What does _merge mean?

The _merge column can contain three values:

_merge value Meaning
left_only row appears only in the left table
right_only row appears only in the right table
both row matched in both tables

This is extremely helpful for checking merge quality.

Visual explanation of indicator=True

flowchart LR
    A[Left table row] --> D[_merge = left_only]
    B[Right table row] --> E[_merge = right_only]
    C[Matched row] --> F[_merge = both]

Looking at the indicator results

df_products_departments_check["_merge"].value_counts()
_merge
both          49693
left_only         0
right_only        0
Name: count, dtype: int64

Why is this useful?

This tells us whether the merge behaved as expected.

For example:

  • if all rows are both, then every product found a matching department
  • if some rows are left_only, then some products have department IDs that do not exist in departments_t
  • right_only can never appear in a left join result, because a left join keeps only rows from the left table (matched or not); to surface departments that no product references, we would need how="outer" instead
Important

The indicator argument is especially useful for data quality checks and referential integrity checks.

Investigating unmatched rows

If the merge indicator shows left_only, we should inspect those rows.

df_products_departments_check[
    df_products_departments_check["_merge"] == "left_only"
].head()
product_id product_name aisle_id department_id prices department _merge

This helps us identify whether some department_id values in products are missing from the lookup table.

Cleaning the indicator result

Once we finish checking the merge, we usually drop the _merge column or rerun the merge without indicator=True.

df_products_departments = pd.merge(
    df_products,
    df_departments,
    on="department_id",
    how="left"
)

df_products_departments.head()
product_id product_name aisle_id department_id prices department
0 1 Chocolate Sandwich Cookies 61 19 5.80 snacks
1 2 All-Seasons Salt 104 13 9.30 pantry
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.50 beverages
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50 frozen
4 5 Green Chile Anytime Sauce 5 13 4.30 pantry
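Alternatively, instead of re-running the merge, the helper column can be removed with drop(columns="_merge"). A minimal sketch on toy frames mirroring products and departments_t:

```python
import pandas as pd

# Toy stand-in frames mirroring products and departments_t
left = pd.DataFrame({"product_id": [1, 2], "department_id": [19, 13]})
right = pd.DataFrame({"department_id": [19, 13],
                      "department": ["snacks", "pantry"]})

checked = pd.merge(left, right, on="department_id", how="left", indicator=True)
print(checked["_merge"].value_counts())

# Once the merge is validated, drop the helper column in place of a re-run
cleaned = checked.drop(columns="_merge")
print(cleaned.columns.tolist())  # ['product_id', 'department_id', 'department']
```

Both approaches end with the same columns; dropping is cheaper on large tables because the merge runs only once.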

Homework P1: Questions unlocked after the first merge

After this merge, we can ask more meaningful questions.

Before the merge, we only had numeric department_id values.

Now we can ask:

  • How many products belong to each department?
  • Which departments contain the most products?
  • Are products concentrated in a small number of departments?

Example: number of products by department

df_products_departments["department"].value_counts()
department
personal care      6565
snacks             6264
pantry             5371
beverages          4365
frozen             4007
dairy eggs         3449
household          3085
canned goods       2092
dry goods pasta    1858
produce            1684
bakery             1516
deli               1322
missing            1258
international      1139
breakfast          1116
babies             1081
alcohol            1056
pets                972
meat seafood        907
other               548
bulk                 38
Name: count, dtype: int64

Adding aisles

Now we enrich the product table further.

The products table also contains aisle_id, but not the aisle name itself. So we merge with aisles.

Visual view of the second merge

flowchart LR
    A[products + department] -->|aisle_id| C[merge]
    B[aisles] -->|aisle_id| C
    C --> D[products enriched]

Performing the second merge

df_products_enriched = pd.merge(
    df_products_departments,
    df_aisles,
    on="aisle_id",
    how="left"
)

df_products_enriched.head()
product_id product_name aisle_id department_id prices department aisle
0 1 Chocolate Sandwich Cookies 61 19 5.80 snacks cookies cakes
1 2 All-Seasons Salt 104 13 9.30 pantry spices seasonings
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.50 beverages tea
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50 frozen frozen meals
4 5 Green Chile Anytime Sauce 5 13 4.30 pantry marinades meat preparation

Checking the second merge with indicator=True

df_products_aisles_check = pd.merge(
    df_products_departments,
    df_aisles,
    on="aisle_id",
    how="left",
    indicator=True
)

df_products_aisles_check["_merge"].value_counts()
_merge
both          49693
left_only         0
right_only        0
Name: count, dtype: int64

Investigating unmatched aisle rows

df_products_aisles_check[
    df_products_aisles_check["_merge"] == "left_only"
].head()
product_id product_name aisle_id department_id prices department aisle _merge

What do we have after these two merges?

At this stage, we have created an enriched product table.

Each product now includes:

  • product_name
  • department
  • aisle

This is much better for analysis than working only with IDs.

Why validate row counts after a merge?

One very important habit is checking whether the number of rows changed unexpectedly.

If departments_t has one row per department, and each product has only one department, then merging products with departments should usually keep the same number of rows.

Checking row counts

print(df_products.shape)
print(df_products_departments.shape)
print(df_products_enriched.shape)
(49693, 5)
(49693, 6)
(49693, 7)

If the number of rows increases unexpectedly, that may indicate duplicated keys in the lookup table.

Checking uniqueness in lookup tables before merging

This is why uniqueness checks matter.

df_departments["department_id"].is_unique
True
df_aisles["aisle_id"].is_unique
True

If these are False, a merge can create duplicated rows in the result.
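pd.merge() can also enforce this expectation directly through its validate argument: passing validate="many_to_one" makes pandas raise a MergeError if the right-hand (lookup) keys are not unique, instead of silently multiplying rows. A minimal sketch on toy frames:

```python
import pandas as pd

products = pd.DataFrame({"product_id": [1, 2], "department_id": [19, 19]})
departments_ok = pd.DataFrame({"department_id": [19],
                               "department": ["snacks"]})
departments_dup = pd.DataFrame({"department_id": [19, 19],
                                "department": ["snacks", "snacks (dup)"]})

# validate="many_to_one" asserts that the right-hand keys are unique
merged = pd.merge(products, departments_ok, on="department_id",
                  how="left", validate="many_to_one")
print(merged.shape)  # (2, 3): row count preserved

# With a duplicated lookup key, pandas raises instead of duplicating rows
try:
    pd.merge(products, departments_dup, on="department_id",
             how="left", validate="many_to_one")
except pd.errors.MergeError as err:
    print("merge rejected:", err)
```

This turns the uniqueness assumption into an automatic guardrail rather than a manual check.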

Homework P2: Questions unlocked after enriching products

After merging products, departments_t, and aisles, we can answer questions such as:

  • How many products belong to each department?
  • How many products belong to each aisle?
  • Which aisles are most populated?
  • Which departments span the widest variety of products?
  • How are departments and aisles related?

Example: products by aisle

df_products_enriched["aisle"].value_counts().head(10)
aisle
missing                 1258
candy chocolate         1246
ice cream ice           1091
vitamins supplements    1038
yogurt                  1026
chips pretzels           989
tea                      894
packaged cheese          891
frozen meals             880
cookies cakes            874
Name: count, dtype: int64

Example: cross-tab of departments and aisles

pd.crosstab(
    df_products_enriched["department"],
    df_products_enriched["aisle"]
)
aisle air fresheners candles asian foods baby accessories baby bath body care baby food formula bakery desserts baking ingredients baking supplies decor beauty beers coolers ... spreads tea tofu meat alternatives tortillas flat bread trail mix snack mix trash bags liners vitamins supplements water seltzer sparkling water white wines yogurt
department
alcohol 0 0 0 0 0 0 0 0 0 387 ... 0 0 0 0 0 0 0 0 147 0
babies 0 0 44 132 718 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
bakery 0 0 0 0 0 297 0 0 0 0 ... 0 0 0 241 0 0 0 0 0 0
beverages 0 0 0 0 0 0 0 0 0 0 ... 0 894 0 0 0 0 0 344 0 0
breakfast 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
bulk 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
canned goods 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
dairy eggs 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1026
deli 0 0 0 0 0 0 0 0 0 0 ... 0 0 159 0 0 0 0 0 0 0
dry goods pasta 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
frozen 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
household 355 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 112 0 0 0 0
international 0 605 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
meat seafood 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
missing 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
other 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
pantry 0 0 0 0 0 0 623 290 0 0 ... 493 0 0 0 0 0 0 0 0 0
personal care 0 0 0 0 0 0 0 0 178 0 ... 0 0 0 0 0 0 1038 0 0 0
pets 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
produce 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
snacks 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 69 0 0 0 0 0

21 rows × 134 columns
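For the question of which departments span the widest variety, a full crosstab is more detail than needed; counting distinct aisles per department gives a compact answer. A minimal sketch on a toy stand-in for df_products_enriched:

```python
import pandas as pd

# Toy stand-in for df_products_enriched
df_enriched_demo = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "department": ["snacks", "snacks", "pantry", "snacks"],
    "aisle":      ["cookies cakes", "candy chocolate",
                   "spices seasonings", "cookies cakes"],
})

# Distinct aisles per department, widest variety first
aisles_per_department = (
    df_enriched_demo.groupby("department")["aisle"]
    .nunique()
    .sort_values(ascending=False)
)
print(aisles_per_department)
```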

Merge types in brief

Although we mainly use left joins in this session, it is good to understand the common merge types.

Inner join

Keeps only rows that match in both tables.

pd.merge(
    df_products,
    df_departments,
    on="department_id",
    how="inner"
).head()
product_id product_name aisle_id department_id prices department
0 1 Chocolate Sandwich Cookies 61 19 5.80 snacks
1 2 All-Seasons Salt 104 13 9.30 pantry
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.50 beverages
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50 frozen
4 5 Green Chile Anytime Sauce 5 13 4.30 pantry

Left join

Keeps all rows from the left table.

pd.merge(
    df_products,
    df_departments,
    on="department_id",
    how="left"
).head()
product_id product_name aisle_id department_id prices department
0 1 Chocolate Sandwich Cookies 61 19 5.80 snacks
1 2 All-Seasons Salt 104 13 9.30 pantry
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.50 beverages
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50 frozen
4 5 Green Chile Anytime Sauce 5 13 4.30 pantry

Outer join

Keeps all rows from both tables.

pd.merge(
    df_products,
    df_departments,
    on="department_id",
    how="outer",
    indicator=True
).head()
product_id product_name aisle_id department_id prices department _merge
0 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.50 frozen both
1 8 Cut Russet Potatoes Steam N' Mash 116 1 1.10 frozen both
2 12 Chocolate Fudge Layer Cake 119 1 9.40 frozen both
3 18 Pizza for One Suprema  Frozen Pizza 79 1 10.60 frozen both
4 30 Three Cheese Ziti, Marinara with Meatballs 38 1 13.80 frozen both

Why indicator is especially useful with outer joins

The indicator column becomes even more informative in an outer join because all three categories are easier to observe:

  • rows found only in the left table
  • rows found only in the right table
  • rows found in both tables

This makes outer joins useful for diagnostics, even if they are not always the final analytical choice.

Visual summary of the product enrichment process

flowchart LR
    A[products] -->|department_id| B[products + departments]
    B -->|aisle_id| C[products enriched]
    D[departments_t] --> B
    E[aisles] --> C

pd.merge vs SQL join

If you already know SQL, you may notice an important difference.

In SQL, we can join many tables inside a single query. In Pandas, however, pd.merge() works with only two DataFrames at a time.

This means that when we need to combine several tables, we must do it step by step:

  • merge two DataFrames first
  • store the result in a new DataFrame
  • merge that result with the next DataFrame
  • continue until the final analytical table is built

This is not only a limitation: working step by step makes the merge process easier to understand, validate, and debug.
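The step-by-step pattern can also be written as a method chain, since DataFrame.merge() is equivalent to pd.merge() with the calling frame on the left. A minimal sketch on toy frames mirroring the three tables:

```python
import pandas as pd

products = pd.DataFrame({"product_id": [1, 2], "aisle_id": [61, 104],
                         "department_id": [19, 13]})
departments = pd.DataFrame({"department_id": [19, 13],
                            "department": ["snacks", "pantry"]})
aisles = pd.DataFrame({"aisle_id": [61, 104],
                       "aisle": ["cookies cakes", "spices seasonings"]})

# Each .merge() call still combines exactly two frames;
# chaining applies them one after another
enriched = (
    products
    .merge(departments, on="department_id", how="left")
    .merge(aisles, on="aisle_id", how="left")
)
print(enriched)
```

The chained form produces the same result as the two explicit pd.merge() calls; the explicit form is easier to validate step by step, while the chain is more compact once the merges are trusted.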

Continuing the Merge Process

At this point, we already created an enriched product table by combining:

  • products
  • departments_t
  • aisles

Now we move to the behavioral side of the data.

The next goal is to connect:

orders_train → order_products_train

This is the step that links order behavior to basket contents.

Why do we need this merge?

  • df_orders_train tells us when an order happened and which user placed it
  • df_orders_products_train tells us which products were included in the order

Separately, each table is useful. But once we merge them, we can start answering richer analytical questions such as:

  • How many items are there in each order?
  • Do some days tend to have larger baskets?
  • At what hour do larger baskets appear?
  • Does reorder behavior vary by order timing?

Visual view of the merge

flowchart LR
    A[orders_train] -->|order_id| C[merge]
    B[order_products_train] -->|order_id| C
    C --> D[train_transactions]

Performing the merge

We merge the filtered order table with the transaction bridge table using order_id.

df_train_transactions = pd.merge(
    df_orders_train,
    df_orders_products_train,
    on="order_id",
    how="left"
)

df_train_transactions.head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered
0 1187899 1 train 11 4 8 14.00 196 1 1
1 1187899 1 train 11 4 8 14.00 25133 2 1
2 1187899 1 train 11 4 8 14.00 38928 3 1
3 1187899 1 train 11 4 8 14.00 26405 4 1
4 1187899 1 train 11 4 8 14.00 39657 5 1

Why do we use how="left" here?

We use a left join because we want to keep all rows from df_orders_train.

That means:

  • every train order should remain in the result
  • transaction rows will be added where a match exists
  • if an order has no matching line items, the product-related columns will become NaN

This is analytically useful because it helps us preserve the order-level structure and inspect missing matches if they exist.

Checking the merge with indicator=True

As before, it is a good idea to validate the merge using the indicator argument.

df_train_transactions_check = pd.merge(
    df_orders_train,
    df_orders_products_train,
    on="order_id",
    how="left",
    indicator=True
)

df_train_transactions_check.head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered _merge
0 1187899 1 train 11 4 8 14.00 196 1 1 both
1 1187899 1 train 11 4 8 14.00 25133 2 1 both
2 1187899 1 train 11 4 8 14.00 38928 3 1 both
3 1187899 1 train 11 4 8 14.00 26405 4 1 both
4 1187899 1 train 11 4 8 14.00 39657 5 1 both

Inspecting the _merge column

df_train_transactions_check["_merge"].value_counts()
_merge
both          1384617
left_only           0
right_only          0
Name: count, dtype: int64

How do we interpret the result?

The _merge column tells us whether each row came from:

  • only the left table
  • only the right table
  • both tables

In this case:

  • both means the order from orders_train found matching product rows in order_products_train
  • left_only means the order exists in orders_train but did not find matching rows in order_products_train

For a left join, this is a very important diagnostic check.

Looking at unmatched orders

If any rows are marked as left_only, we should inspect them.

df_train_transactions_check[
    df_train_transactions_check["_merge"] == "left_only"
].head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered _merge

This allows us to investigate whether some train orders are missing from the transaction table.

Why can the number of rows increase after this merge?

This is a very important point.

When we merged products with departments_t, the number of rows was expected to stay the same.

But here the logic is different.

Each order can contain many products, so when we merge orders_train with order_products_train, the number of rows can increase.

This is not an error. It is expected because:

  • orders_train has one row per order
  • order_products_train has one row per product within an order

So after the merge, the resulting table has one row per order-product combination.
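This expansion can be reproduced on a tiny toy example (the frames below are invented for illustration, mimicking the structure of `orders_train` and `order_products_train`):

```python
import pandas as pd

# one row per order (like orders_train)
orders = pd.DataFrame({"order_id": [1, 2], "order_dow": [4, 6]})

# one row per product within an order (like order_products_train)
order_products = pd.DataFrame({
    "order_id": [1, 1, 1, 2],
    "product_id": [196, 25133, 38928, 196],
})

merged = pd.merge(orders, order_products, on="order_id", how="left")

# 2 orders expand into 4 order-product rows
print(len(orders), len(order_products), len(merged))  # 2 4 4
```

Note that the merged row count equals the right table's row count here, exactly as in the real data above: every line item keeps its order's attributes attached.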

Visual explanation of row expansion

flowchart TD
    A[One row in orders_train] --> B[Product 1]
    A --> C[Product 2]
    A --> D[Product 3]
    B --> E[Three rows in merged table]
    C --> E
    D --> E

This is one of the most important conceptual differences between different merge scenarios.

Checking the row counts

print(df_orders_train.shape)
print(df_orders_products_train.shape)
print(df_train_transactions.shape)
(131209, 7)
(1384617, 4)
(1384617, 10)

This helps us confirm that the merged table has the expected order-product level of detail.
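pandas can also enforce this expectation for us: `merge` accepts a `validate` argument, and `validate="one_to_many"` raises a `MergeError` if the key is duplicated on the left side. A minimal sketch on toy frames:

```python
import pandas as pd

left = pd.DataFrame({"order_id": [1, 2], "user_id": [10, 20]})
right = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [5, 6, 7]})

# passes: order_id is unique on the left and may repeat on the right
ok = pd.merge(left, right, on="order_id", how="left", validate="one_to_many")

# fails: the right side is not unique, so "one_to_one" raises MergeError
try:
    pd.merge(left, right, on="order_id", how="left", validate="one_to_one")
except pd.errors.MergeError:
    print("one_to_one validation failed, as expected")
```

Declaring the expected cardinality up front turns a surprising row explosion into an immediate, explicit error.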

What columns do we now have?

df_train_transactions.columns
Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order', 'product_id',
       'add_to_cart_order', 'reordered'],
      dtype='str')

After this merge, we have both:

  • order-level variables such as user_id, order_dow, and order_hour_of_day
  • transaction-level variables such as product_id, add_to_cart_order, and reordered

This is already a much more useful analytical table.

Homework P3: Questions unlocked after this merge

After merging orders_train with order_products_train, we can now answer questions such as:

  • How many products are included in each order?
  • What is the typical basket size?
  • Are larger baskets more common on specific days?
  • Are reordered items more common at certain times of day?
  • Does basket composition change across a customer's order sequence?

Example: basket size by order

If each row now represents one product within an order, then counting rows per order_id gives us basket size.

df_train_transactions.groupby("order_id").size().head()
order_id
1      8
36     8
38     9
96     7
98    49
dtype: int64
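The same per-order counts can also be obtained with `value_counts`, which sorts by frequency rather than by `order_id`. A toy sketch showing the equivalence:

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 1, 1, 2, 2]})

# groupby-size and value_counts produce the same per-order counts
by_group = df.groupby("order_id").size()
by_counts = df["order_id"].value_counts().sort_index()

print(by_group.tolist())   # [3, 2]
print(by_counts.tolist())  # [3, 2]
```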

Example: descriptive summary of basket size

df_train_transactions.groupby("order_id").size().describe()
count   131,209.00
mean         10.55
std           7.93
min           1.00
25%           5.00
50%           9.00
75%          14.00
max          80.00
dtype: float64

This helps us understand the distribution of how many items customers place in their baskets.

Example: average basket size by day of week

df_train_transactions.groupby("order_dow").size() / df_train_transactions["order_id"].nunique()
order_dow
0   2.47
1   1.57
2   1.22
3   1.18
4   1.18
5   1.35
6   1.58
dtype: float64

The expression above is only a quick exploratory step. A more careful calculation groups at the order level first and then averages by day.

basket_size_by_order = (
    df_train_transactions
    .groupby(["order_id", "order_dow"])
    .size()
    .reset_index(name="basket_size")
)

basket_size_by_order.groupby("order_dow")["basket_size"].mean()
order_dow
0   11.80
1   10.47
2    9.96
3    9.84
4    9.74
5   10.16
6   10.97
Name: basket_size, dtype: float64

Example: reorder behavior by hour of day

df_train_transactions.groupby("order_hour_of_day")["reordered"].mean()
order_hour_of_day
0    0.57
1    0.58
2    0.58
3    0.58
4    0.60
5    0.62
6    0.65
7    0.65
8    0.64
9    0.62
10   0.61
11   0.59
12   0.59
13   0.59
14   0.59
15   0.58
16   0.59
17   0.59
18   0.58
19   0.59
20   0.60
21   0.61
22   0.60
23   0.59
Name: reordered, dtype: float64

This allows us to explore whether reordered items are more common at certain times.

Are there missing product_id values after the merge?

df_train_transactions["product_id"].isna().sum()
np.int64(0)

Are there missing user_id values after the merge?

df_train_transactions["user_id"].isna().sum()
np.int64(0)

Does every row now represent an order-product combination?

A quick way to inspect this is to look at repeated order_id values.

df_train_transactions["order_id"].value_counts().head()
order_id
2813632    80
1395075    80
949182     77
2869702    76
341238     76
Name: count, dtype: int64

This confirms that orders now appear multiple times, which is exactly what we expect in a line-item-level merged table.
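A related sanity check uses `duplicated`: `order_id` should repeat across line items, while each `(order_id, product_id)` pair should appear at most once. Sketched on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [5, 6, 7]})

# order_id repeats across line items...
assert df["order_id"].duplicated().any()
# ...but each (order_id, product_id) pair appears only once
assert not df.duplicated(subset=["order_id", "product_id"]).any()
print("line-item structure confirmed")
```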

Warning

After a merge, always check both the row count and the missing values in newly attached columns.

A merge may run without errors and still produce incomplete or misleading analytical results.
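Both checks can be bundled into a small reusable helper. This is only a sketch: `check_merge` is our own name, not a pandas function, and the frames below are toy data.

```python
import pandas as pd

def check_merge(left, right, on, how="left"):
    """Merge two frames and report row counts plus unmatched left rows."""
    merged = pd.merge(left, right, on=on, how=how, indicator=True)
    unmatched = (merged["_merge"] == "left_only").sum()
    print(f"left={len(left)} right={len(right)} "
          f"merged={len(merged)} unmatched={unmatched}")
    return merged.drop(columns="_merge")

# toy example: order 3 has no matching product rows
orders = pd.DataFrame({"order_id": [1, 2, 3]})
items = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [5, 6, 7]})
result = check_merge(orders, items, on="order_id")
```

Wrapping the checks this way means every merge in the pipeline gets the same diagnostics for free.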

From transactional table to analytical table

At this stage, we have created a transactional table that combines:

  • order behavior
  • basket contents

However, product_id is still only a numeric identifier. To make the table fully interpretable, we need one final merge.

We now attach the enriched product table.

The last step is to merge df_train_transactions with df_products_enriched.

flowchart LR
    A[train_transactions] -->|product_id| C[merge]
    B[products_enriched] -->|product_id| C
    C --> D[train_analytic]

df_train_analytic = pd.merge(
    df_train_transactions,
    df_products_enriched,
    on="product_id",
    how="left"
)

df_train_analytic.head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered product_name aisle_id department_id prices department aisle
0 1187899 1 train 11 4 8 14.00 196 1 1 Soda 77.00 7.00 9.00 beverages soft drinks
1 1187899 1 train 11 4 8 14.00 25133 2 1 Organic String Cheese 21.00 16.00 8.60 dairy eggs packaged cheese
2 1187899 1 train 11 4 8 14.00 38928 3 1 0% Greek Strained Yogurt 120.00 16.00 12.60 dairy eggs yogurt
3 1187899 1 train 11 4 8 14.00 26405 4 1 XL Pick-A-Size Paper Towel Rolls 54.00 17.00 1.00 household paper goods
4 1187899 1 train 11 4 8 14.00 39657 5 1 Milk Chocolate Almonds 45.00 19.00 6.80 snacks candy chocolate

Checking the final merge with indicator=True

df_train_analytic_check = pd.merge(
    df_train_transactions,
    df_products_enriched,
    on="product_id",
    how="left",
    indicator=True
)

df_train_analytic_check["_merge"].value_counts()
_merge
both          1384618
left_only          88
right_only          0
Name: count, dtype: int64

Investigating unmatched product rows

df_train_analytic_check[
    df_train_analytic_check["_merge"] == "left_only"
].head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered product_name aisle_id department_id prices department aisle _merge
10780 3009052 1629 train 37 0 13 3.00 6799 24 1 NaN NaN NaN NaN NaN NaN left_only
52822 195363 7955 train 12 6 14 8.00 6799 4 1 NaN NaN NaN NaN NaN NaN left_only
59749 831974 9000 train 11 3 13 20.00 6799 19 0 NaN NaN NaN NaN NaN NaN left_only
64159 206208 9640 train 16 6 11 12.00 6799 6 1 NaN NaN NaN NaN NaN NaN left_only
81585 472316 12168 train 10 4 12 0.00 6799 12 1 NaN NaN NaN NaN NaN NaN left_only

If such rows exist, then some product_id values in the transaction table do not have a match in the enriched product table.
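To see which identifiers are affected, we can list the distinct unmatched `product_id` values; in the sample rows shown above they all share `product_id` 6799. A toy sketch of the same selection:

```python
import pandas as pd

# toy stand-in for df_train_analytic_check
check = pd.DataFrame({
    "product_id": [196, 6799, 6799, 25133],
    "_merge": ["both", "left_only", "left_only", "both"],
})

# distinct product ids that found no match in the enriched product table
missing_ids = check.loc[check["_merge"] == "left_only", "product_id"].unique()
print(missing_ids.tolist())  # [6799]
```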

What do we gain from the final merge?

At this stage, the table becomes fully analysis-ready.

Each row now includes:

  • order-level behavior
  • transaction-level basket information
  • product name
  • department name
  • aisle name

This means we can finally move from numeric IDs to interpretable business categories.

Homework P4: Questions unlocked after the final merge

After building the full analytical table, we can answer much richer questions such as:

  • Which products are purchased most often?
  • Which departments dominate train orders?
  • Which aisles dominate train orders?
  • Which departments have the highest reorder rates?
  • Which aisles appear most often in large baskets?
  • At what times are certain departments purchased more frequently?

Example: most frequently purchased products

df_train_analytic["product_name"].value_counts().head(10)
product_name
Banana                    18726
Bag of Organic Bananas    15480
Organic Strawberries      10894
Organic Baby Spinach       9784
Large Lemon                8135
Organic Avocado            7409
Organic Hass Avocado       7293
Strawberries               6494
Limes                      6033
Organic Raspberries        5546
Name: count, dtype: int64

Example: most frequently purchased departments

df_train_analytic["department"].value_counts()
department
produce            409087
dairy eggs         217051
snacks             118862
beverages          113962
frozen             100426
pantry              81242
bakery              48394
canned goods        46799
deli                44291
dry goods pasta     38713
household           35986
meat seafood        30307
breakfast           29552
personal care       21599
babies              14941
international       11902
missing              8251
alcohol              5602
pets                 4497
other                1795
bulk                 1359
Name: count, dtype: int64

Example: reorder rate by department

df_train_analytic.groupby("department")["reordered"].mean().sort_values(ascending=False)
department
dairy eggs        0.67
produce           0.66
beverages         0.66
bakery            0.63
pets              0.63
deli              0.62
alcohol           0.61
meat seafood      0.59
snacks            0.58
bulk              0.58
breakfast         0.57
frozen            0.56
babies            0.54
dry goods pasta   0.49
canned goods      0.49
household         0.43
other             0.39
missing           0.38
international     0.38
pantry            0.36
personal care     0.34
Name: reordered, dtype: float64

Example: cross-tab of department and reordered flag

pd.crosstab(
    df_train_analytic["department"],
    df_train_analytic["reordered"]
)
reordered 0 1
department
alcohol 2203 3399
babies 6857 8084
bakery 17702 30692
beverages 38960 75002
breakfast 12654 16898
bulk 573 786
canned goods 24017 22782
dairy eggs 70549 146502
deli 16924 27367
dry goods pasta 19828 18885
frozen 44258 56168
household 20614 15372
international 7380 4522
meat seafood 12400 17907
missing 5103 3148
other 1098 697
pantry 51744 29498
personal care 14309 7290
pets 1663 2834
produce 137201 271886
snacks 49760 69102

Example: cross-tab of order day and reordered flag

pd.crosstab(
    df_train_analytic["order_dow"],
    df_train_analytic["reordered"],
    normalize="index"
)
reordered 0 1
order_dow
0 0.39 0.61
1 0.40 0.60
2 0.41 0.59
3 0.41 0.59
4 0.41 0.59
5 0.39 0.61
6 0.41 0.59

Visual summary of the full merge pipeline

flowchart LR
    A[products] -->|department_id| B[products + departments]
    B -->|aisle_id| C[products_enriched]
    D[departments_t] --> B
    E[aisles] --> C
    F[orders_train] -->|order_id| G[train_transactions]
    H[order_products_train] -->|order_id| G
    G -->|product_id| I[train_analytic]
    C -->|product_id| I

Homework | Solution

Step 1 | Downloading the Data

In order to build the final version of the dataframe and continue with the main analytical tasks, we also need to merge the customers and states dimensions.

data = ['customers.csv', 'states.csv']

URL = 'https://raw.githubusercontent.com/hovhannisyan91/aca/refs/heads/main/lab/python/data/raw/'

ldf = []
for i in data:
    df = pd.read_csv(URL + i)
    # since we want to work with both tables, we can load them in one pass,
    # first collecting each dataframe in a list
    ldf.append(df)
df_customers = ldf[0]
df_states = ldf[1]

# alternatively, read from the local copies in data/raw
df_customers = pd.read_csv('../data/raw/customers.csv')
df_states = pd.read_csv('../data/raw/states.csv')

Step 2 | Exploring the Dataframes

customers

df_customers.head()
user_id First Name Surname Gender STATE Age date_joined n_dependants fam_status income
0 26711 Deborah Esquivel Female Missouri 48 1/1/2017 3 married 165665
1 33890 Patricia Hart Female New Mexico 36 1/1/2017 0 single 59285
2 65803 Kenneth Farley Male Idaho 35 1/1/2017 2 married 99568
3 125935 Michelle Hicks Female Iowa 40 1/1/2017 0 single 42049
4 130797 Ann Gilmore Female Maryland 26 1/1/2017 1 married 40374

states

df_states.head()

Step 3 | Merging customers with states

Important

Pay attention to the state name column in df_customers.

Before merging, we first need to rename it from STATE to state.

df_customers.rename(columns={'STATE':'state'}, inplace=True)
df_customers_states = df_customers.merge(df_states, on='state')
df_customers_states.head()
user_id First Name Surname Gender state Age date_joined n_dependants fam_status income region division
0 26711 Deborah Esquivel Female Missouri 48 1/1/2017 3 married 165665 Midwest West North Central
1 33890 Patricia Hart Female New Mexico 36 1/1/2017 0 single 59285 West Mountain
2 65803 Kenneth Farley Male Idaho 35 1/1/2017 2 married 99568 West Mountain
3 125935 Michelle Hicks Female Iowa 40 1/1/2017 0 single 42049 Midwest West North Central
4 130797 Ann Gilmore Female Maryland 26 1/1/2017 1 married 40374 South South Atlantic

Step 4 | Merging with the main dataframe

df_train_analytics_final = df_train_analytic.merge(df_customers_states, on='user_id', how='left')

Now we can drop all the ID columns.

df_train_analytics_final.drop(columns=['user_id','eval_set','product_id', 'aisle_id', 'department_id'],inplace=True)
print(df_train_analytics_final.shape)
df_train_analytics_final.head()
(1384706, 22)
order_id order_number order_dow order_hour_of_day days_since_prior_order add_to_cart_order reordered product_name prices department ... Surname Gender state Age date_joined n_dependants fam_status income region division
0 1187899 11 4 8 14.00 1 1 Soda 9.00 beverages ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
1 1187899 11 4 8 14.00 2 1 Organic String Cheese 8.60 dairy eggs ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
2 1187899 11 4 8 14.00 3 1 0% Greek Strained Yogurt 12.60 dairy eggs ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
3 1187899 11 4 8 14.00 4 1 XL Pick-A-Size Paper Towel Rolls 1.00 household ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central
4 1187899 11 4 8 14.00 5 1 Milk Chocolate Almonds 6.80 snacks ... Nguyen Female Alabama 31 2/17/2019 3 married 40423 South East South Central

5 rows × 22 columns

Step 5 | Saving in the data/processed folder

The resulting dataframe is very large, so we will save it as a Parquet file; the Parquet format compresses the data much more effectively than CSV.

# NOTE: the virtual environment must be active,
# and a parquet engine installed first: pip install pyarrow fastparquet
df_train_analytics_final.to_parquet("../data/processed/instacart.parquet", index=False)

OR

df_train_analytics_final.to_csv("../data/processed/instacart.csv", index = False)
Tip

Try to save the same df_train_analytics_final dataframe both in csv and parquet format, in order to feel the difference!

Pandas Cheatsheet

Important

More information about pandas can be found here.